🌟 OLMo Hybrid Model: Combining RNN and Transformer in One Solution

The Allen Institute for AI has introduced a new model, OLMo Hybrid 7B. It alternates Gated DeltaNet layers with standard attention layers in a 3-to-1 ratio. This design broadens the class of tasks the model can solve while processing fewer tokens, significantly reducing training-data requirements.
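As a rough illustration of the layout (a minimal sketch: the layer labels and the build_layer_pattern helper are hypothetical, not actual OLMo code), uniform 3-to-1 interleaving can be expressed as:

```python
def build_layer_pattern(n_layers: int, ratio: int = 3) -> list[str]:
    """Uniformly interleave recurrent and attention layers.

    With ratio=3, every 4th layer is full attention and the rest are
    Gated DeltaNet, e.g. for 8 layers:
    ['gdn', 'gdn', 'gdn', 'attn', 'gdn', 'gdn', 'gdn', 'attn']
    """
    return [
        "attn" if (i + 1) % (ratio + 1) == 0 else "gdn"
        for i in range(n_layers)
    ]

print(build_layer_pattern(8))
```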

Gated DeltaNet is a recurrent architecture whose state-transition matrices are allowed to take negative eigenvalues. This small change to the state-update mechanism lets Gated DeltaNet layers implement permutation-like dynamics, enabling state-tracking behavior that is out of reach for pure transformers.
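To make the update concrete, here is a minimal numpy sketch of a simplified gated delta-rule step; the function name and the scalar gates are illustrative, and the actual Gated DeltaNet layer uses per-head, learned parameterizations:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent state update of a (simplified) gated delta rule.

    S:     (d, d) memory matrix
    k, v:  (d,) unit-norm key and value vectors
    alpha: scalar decay gate in (0, 1]
    beta:  scalar write strength; allowing beta in (0, 2) instead of
           (0, 1) lets the factor (I - beta * k k^T) have a negative
           eigenvalue 1 - beta, i.e. a reflection along k
    """
    d = S.shape[0]
    transition = np.eye(d) - beta * np.outer(k, k)  # Householder-like factor
    return alpha * S @ transition + beta * np.outer(v, k)

# With beta = 2 the transition is an exact reflection: the kind of
# sign-flipping dynamics a beta restricted to (0, 1) cannot express.
d = 4
S = np.zeros((d, d))
k = np.eye(d)[0]  # unit key e_0
v = np.eye(d)[1]  # unit value e_1
S = gated_delta_step(S, k, v, alpha=1.0, beta=2.0)
print(np.linalg.eigvals(np.eye(d) - 2.0 * np.outer(k, k)))  # one eigenvalue is -1
```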

Research at AI2 has shown that hybrid models can be strictly more expressive than the sum of their parts. There is a specific class of tasks, so-called memory-based state tracking, that neither pure transformers nor pure RNNs can solve on their own, yet a hybrid with alternating layer types handles them in a single pass.
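For intuition, a toy instance of this kind of state-tracking problem is tracking the running composition of pairwise swaps, a word problem over the permutation group. The sketch below is purely illustrative and is not one of the paper's benchmark tasks:

```python
import random

def swap_tracking_example(n_items: int = 3, n_steps: int = 8, seed: int = 0):
    """Generate a toy state-tracking instance: track where each item
    ends up after a sequence of pairwise swaps. Solving this in one
    left-to-right pass requires maintaining the running permutation
    as internal state.
    """
    rng = random.Random(seed)
    state = list(range(n_items))  # identity permutation
    swaps = []
    for _ in range(n_steps):
        i, j = rng.sample(range(n_items), 2)
        state[i], state[j] = state[j], state[i]
        swaps.append((i, j))
    return swaps, state  # input sequence, target final permutation

swaps, final = swap_tracking_example()
print(swaps, "->", final)
```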

Experiments with models ranging from 60 million to one billion parameters showed that Gated DeltaNet (GDN) consistently outperforms Mamba2 in both pure and hybrid configurations. Uniform layer interleaving also proved more effective than concentrating the attention layers in the middle of the network, and the 3-to-1 ratio balances quality against computational cost at medium and large scales.

Regarding testing:

On the MMLU benchmark, OLMo Hybrid matches the accuracy of the 7-billion-parameter OLMo 3 model while using nearly half as many tokens (49% fewer); on Common Crawl data, the savings reach about 35%.

The hybrid's data-efficiency coefficient is approximately 83.7%, compared to about 94.9% for pure transformers. Notably, the savings grow with model size: roughly 1.3x at the billion-parameter scale and nearly 2x for models with 70 billion parameters.

After further training and adaptation for long-context processing, OLMo Hybrid outperforms OLMo 3 7B across all tested areas. On the RULER benchmark with contexts up to 64,000 tokens, for example, it scores 85.0 versus 70.9 for the baseline.

License: Apache 2.0.

Additional materials include an article, a set of models, and a technical report.
