As machine learning models grow larger and more powerful, researchers are increasingly looking for ways to curb their huge computational appetites and improve efficiency. Nowhere is this better demonstrated than with transformer architectures, whose superior capabilities for processing long text sequences have propelled them to the forefront of natural language processing (NLP) and sequence modeling, but whose quadratic computational complexity has hampered their application and accessibility.
In the new paper Hierarchical Transformers Are More Efficient Language Models, a team from the University of Warsaw, OpenAI and Google Research proposes Hourglass, a new hierarchical transformer language model that operates on shortened sequences and reaches a new state of the art in image generation on ImageNet32.
The team summarizes the main contributions of their study as follows:
- We show how hierarchy can improve the efficiency of transformers in a language modeling setup.
- Hourglass significantly outperforms the transformer baseline both in terms of perplexity achieved at a given linear computational cost and in empirical metrics such as running memory.
- Hourglass achieves state-of-the-art results among autoregressive models on the ImageNet32 generation task and competitive results on other image generation and language modeling tasks.
- Hourglass can be used with any kind of attention, which opens new avenues for future research on transformers capable of handling longer sequences and on improving the trade-off between efficiency and accuracy.
The high computational costs of transformers are mainly due to their self-attention mechanisms: each self-attention layer has complexity quadratic in the length of the context. Previous studies have proposed techniques such as sparse attention mechanisms, which modify the attention mechanism without changing the overall architecture of the transformer. However, most of these techniques still force the model to operate on a sequence of the same length as the input, which, as the paper explains, leads to both fundamental and practical shortcomings. Fundamentally, while the goal is for models to create high-level representations of words, entities, or even entire events, these occur at a very different granularity from the individual letters the model receives as input. On a practical level, even layers with linear complexity can be very slow and memory-intensive when processing very long sequences at an unsuitable granularity.
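To make the quadratic cost concrete, the sketch below counts the query-key pairs that full self-attention compares in one layer, before and after a sequence is shortened. The function names, the sequence length of 4096 and the shortening factor of 4 are illustrative assumptions, not values taken from the paper.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs compared by full self-attention in one layer (one head)."""
    return seq_len * seq_len


def shortened_length(seq_len: int, k: int) -> int:
    """Length after merging every k tokens into one (rounded up)."""
    return -(-seq_len // k)


# Hypothetical example values, chosen only for illustration.
full_pairs = attention_pair_count(4096)
short_pairs = attention_pair_count(shortened_length(4096, 4))
reduction = full_pairs // short_pairs
print(full_pairs, short_pairs, reduction)
```

Under these assumptions, shortening the sequence by a factor of k cuts the number of compared pairs by roughly k squared, which is why running the inner layers on a shorter sequence can pay off.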
To address these issues, the researchers modify the transformer architecture to shorten the internal sequence of activations as the layer stack deepens and to expand it again before generation. A shortening operation merges tokens into groups to reduce the overall sequence length, and an upsampling operation later restores the original length, combining the result with the sequences from the earlier layers.
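The shorten-then-upsample pattern can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: average pooling stands in for the shortening operation, repetition plus a residual connection stands in for upsampling, and the names `shorten` and `upsample` are hypothetical. A real autoregressive model would also need to ensure that merged groups do not leak information from future tokens, which this sketch ignores.

```python
from typing import List

Vector = List[float]


def shorten(tokens: List[Vector], k: int) -> List[Vector]:
    """Merge each group of k consecutive token vectors into one by averaging."""
    if len(tokens) % k != 0:
        raise ValueError("sequence length must be divisible by k in this sketch")
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), k):
        group = tokens[start:start + k]
        pooled.append([sum(vec[d] for vec in group) / k for d in range(dim)])
    return pooled


def upsample(short: List[Vector], k: int, residual: List[Vector]) -> List[Vector]:
    """Repeat each shortened vector k times, then add the pre-shortening activations."""
    expanded = [vec for vec in short for _ in range(k)]
    return [[e + r for e, r in zip(ev, rv)] for ev, rv in zip(expanded, residual)]


# Toy activations: 4 tokens, 2 dimensions each (made-up numbers).
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
short = shorten(tokens, k=2)                      # 2 vectors; middle layers would run here
restored = upsample(short, k=2, residual=tokens)  # back to 4 vectors
```

In a full model, the expensive middle layers would process `short` rather than `tokens`, and the residual term lets the final layers recover token-level detail that pooling discards.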
The team compared Hourglass with various baseline transformer models in terms of required running memory, computational cost, and perplexity on the popular enwik8, ImageNet, and CIFAR benchmarks for text and image generation tasks.
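For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a model assigns to the correct tokens, and lower is better. The sketch below uses made-up probabilities and hypothetical function names; it is not the paper's evaluation code.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def bits_per_token(token_probs: Sequence[float]) -> float:
    """The same quantity expressed in bits: mean negative log2-probability."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# A model that spreads probability evenly over 4 options at every step
# is as uncertain as a uniform 4-way guess.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(uniform_guess)
bits = bits_per_token(uniform_guess)
```

A perplexity of 4 corresponds to 2 bits per token, so the two numbers carry the same information on different scales.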
In the experiments, Hourglass surpassed transformer baselines in terms of the perplexity achieved at a given linear computational cost, improved the efficiency of language modeling on enwik8, and reached a new state of the art for transformer models on the ImageNet32 generation task. The results indicate that the proposed hierarchical transformer architecture can handle longer sequences and improve the trade-off between efficiency and accuracy. The researchers suggest that future work could focus on the shortening mechanism itself and on choosing the best hierarchy for particular tasks.
The paper Hierarchical Transformers Are More Efficient Language Models is on arXiv.
Author: Hecate Il | Editor: Michael Sarazen