New AI Research from Apple and Equal AI Reveals Redundancies in Transformer Architecture: How Streamlining Feed Forward Networks Increases Efficiency and Accuracy


The Transformer architecture has become the standard approach for natural language processing (NLP) tasks, particularly machine translation (MT). It exhibits impressive scaling properties: adding more model parameters leads to better performance across a wide range of NLP tasks, an observation validated by several studies. But while the Transformer excels at scalability, there is a parallel push to make the model more efficient and deployable in the real world by addressing latency, memory usage, and disk-space constraints.

Researchers are actively investigating methods to address these problems, including pruning components, sharing parameters, and reducing dimensionality. A standard Transformer consists of several essential parts, two of the most important being attention and the feed forward network (FFN).

  1. Attention: The attention mechanism lets the model capture relationships and dependencies between words in a sentence, regardless of their position. It helps the model determine which parts of the input text are most relevant to the word it is currently processing, grounding each word in its surrounding context.
  1. Feed Forward Network (FFN): The FFN transforms each input token independently and non-linearly. By applying the same mathematical operations to every token's representation, it adds complexity and expressiveness to the model's understanding of each word.
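The two components above can be sketched in PyTorch. This is a minimal illustration of a standard encoder layer, not the paper's implementation; the dimension names and module structure are our own:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same weights are applied to each token independently."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                      # non-linearity adds expressiveness
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); no mixing across token positions here.
        return self.net(x)

class TransformerLayer(nn.Module):
    """One encoder layer: attention mixes tokens, the FFN transforms each one."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)           # token-to-token dependencies
        x = self.norm1(x + a)
        x = self.norm2(x + self.ffn(x))     # per-token transformation
        return x
```

The key contrast: attention is the only place where information flows between positions, while the FFN (which typically holds the majority of a layer's parameters) operates on each position in isolation.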

In recent research, a team of researchers focused on the role of the FFN in the Transformer architecture. They found that the FFN exhibits a high degree of redundancy when the model is large and uses many parameters, and that the parameter count can be reduced without significantly compromising accuracy. They achieved this by removing the FFN from the decoder layers and replacing the per-layer FFNs of the encoder with a single shared FFN.

  1. Decoder layers: In the standard Transformer, every encoder and decoder layer has its own FFN. The researchers removed the FFN from the decoder layers entirely.
  1. Encoder layers: Instead of an individual FFN per encoder layer, they used a single FFN shared by all encoder layers.
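A minimal sketch of these two modifications, assuming the standard attention/FFN building blocks (the class names and dimensions are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class SharedFFNEncoder(nn.Module):
    """Encoder stack in which every layer reuses ONE shared FFN module."""
    def __init__(self, n_layers: int, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        # A single FFN instance: its parameters are shared by all layers.
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for attn, n1, n2 in zip(self.attns, self.norms1, self.norms2):
            a, _ = attn(x, x, x)
            x = n1(x + a)
            x = n2(x + self.shared_ffn(x))  # same FFN weights at every layer
        return x

class NoFFNDecoderLayer(nn.Module):
    """Decoder layer with its FFN removed: only self- and cross-attention remain."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        a, _ = self.self_attn(x, x, x)
        x = self.norm1(x + a)
        c, _ = self.cross_attn(x, memory, memory)  # attend over encoder output
        return self.norm2(x + c)
```

Because the encoder layers all point to the same `shared_ffn` object, its weights are stored (and updated) once, however many layers reuse it.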

The researchers report the following advantages of this approach:

  1. Parameter reduction: By removing and sharing FFN components, they drastically reduced the number of parameters in the model.
  1. Modest accuracy drop: Even after removing a large fraction of its parameters, the model's accuracy decreased only modestly. This indicates a degree of functional redundancy among the encoder's many FFNs and the decoder's FFNs.
  1. Scaling back up: They then expanded the hidden dimension of the shared FFN to restore the architecture to roughly its original size while maintaining or even improving performance. Compared to the original large-scale Transformer, this significantly improved both accuracy and processing speed, i.e., latency.
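To see where the savings come from, here is a rough parameter count for the FFNs alone, using illustrative base-Transformer dimensions (the paper's exact configurations may differ):

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Two linear layers with biases:
    #   d_model -> d_ff:  d_model * d_ff + d_ff
    #   d_ff -> d_model:  d_ff * d_model + d_model
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

d_model, d_ff = 512, 2048          # illustrative "base" dimensions
n_enc = n_dec = 6

# Baseline: one FFN per encoder layer and per decoder layer.
baseline = (n_enc + n_dec) * ffn_params(d_model, d_ff)

# Modified: a single shared encoder FFN, no decoder FFNs.
shared = ffn_params(d_model, d_ff)

# "Scaling back up": widen the shared FFN until its parameter count
# roughly matches the baseline again.
d_ff_wide = d_ff * (n_enc + n_dec)  # 2048 * 12 = 24576
widened = ffn_params(d_model, d_ff_wide)

print(f"baseline FFN params: {baseline:,}")   # 25,196,544
print(f"single shared FFN:   {shared:,}")     # 2,099,712  (a ~12x reduction)
print(f"widened shared FFN:  {widened:,}")    # 25,190,912 (back to ~baseline)
```

The arithmetic illustrates the trade: sharing one FFN cuts the FFN parameter budget by roughly the number of layers, and widening that single FFN spends the recovered budget on one large, shared transformation instead of many small independent ones.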

In conclusion, this research shows that the feed forward networks in Transformer architectures, especially those in the decoder layers, can be streamlined and shared without significantly affecting model performance. This reduces the model's computational burden while improving its efficiency and applicability across a range of NLP applications.

Check out the paper. All credit for this research goes to the researchers on this project.


Tanya Malhotra is an undergraduate at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills, along with a keen interest in learning new skills, leading teams and managing work in an organized manner.

