FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Tri Dao, Daniel Y. Fu, Stefano Ermon, A. Rudra, Christopher Ré
- 27 May 2022
Computer Science
This work proposes FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM, and is optimal for a range of SRAM sizes.
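The tiling idea can be sketched in NumPy (a hypothetical re-creation for illustration, not the authors' fused CUDA kernel): process K/V in blocks while maintaining a running row-wise max and softmax denominator, so the full N×N score matrix is never materialized in slow memory.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Sketch of FlashAttention-style tiling with an online softmax.
    Q, K, V: (N, d) arrays. Only one (N, block) score tile exists at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)          # running row-wise max (numerical stability)
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale       # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)   # rescale previous partial results
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vj
        m = m_new
    return out / l[:, None]
```

The output matches dense softmax attention exactly; only the schedule of memory accesses changes.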
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Tri Dao
- 17 July 2023
Computer Science
This work tweaks the algorithm to reduce the number of non-matmul FLOPs, parallelizes the attention computation across different thread blocks (even for a single head) to increase occupancy, and distributes the work between warps to reduce communication through shared memory.
StarCoder: may the source be with you!
- Raymond Li, Loubna Ben Allal, H. de Vries
- 9 May 2023
Computer Science
Trans. Mach. Learn. Res.
This work performs the most comprehensive evaluation of Code LLMs to date and shows that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model.
HiPPO: Recurrent Memory with Optimal Polynomial Projections
- Albert Gu, Tri Dao, Stefano Ermon, A. Rudra, C. Ré
- 17 August 2020
Computer Science, Mathematics
This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoids priors on the timescale, and enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients.
Hyena Hierarchy: Towards Larger Convolutional Language Models
- Michael Poli, Stefano Massaroli, Christopher Ré
- 21 February 2023
Computer Science
This work proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating, and sets a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets.
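The core Hyena building blocks can be sketched as follows (a simplified illustration, not the paper's implementation: the filter here is an explicit array, whereas Hyena parametrizes it implicitly with a small network): long convolutions evaluated in O(N log N) via FFT, alternated with elementwise data-controlled gating.

```python
import numpy as np

def fft_long_conv(u, h):
    """Causal long convolution of signal u with filter h via FFT,
    using zero-padding to length 2N to avoid circular wrap-around."""
    N = u.shape[0]
    U = np.fft.rfft(u, n=2 * N)
    H = np.fft.rfft(h, n=2 * N)
    return np.fft.irfft(U * H, n=2 * N)[:N]

def hyena_op(v, gates, filters):
    """Hyena-style recurrence: alternate implicit long convolutions with
    elementwise gating by input-dependent projections (the 'gates')."""
    y = v
    for x, h in zip(gates, filters):
        y = x * fft_long_conv(y, h)
    return y
```

With an order-2 operator, `gates` and `filters` each hold two arrays; the whole operator stays subquadratic in sequence length.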
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
- Albert Gu, Isys Johnson, Christopher Ré
- 26 October 2021
Computer Science, Engineering
This work proposes the Linear State-Space Layer (LSSL), a simple sequence model inspired by control systems that generalizes RNN heuristics, temporal convolutions, and neural differential equations while addressing their shortcomings, and introduces a trainable subset of structured matrices that endows LSSLs with long-range memory.
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- Tri Dao, Daniel Y. Fu, Khaled Kamal Saab, A. Thomas, A. Rudra, Christopher Ré
- 28 December 2022
Computer Science
A new SSM layer, H3, is proposed that is explicitly designed for language modeling and achieves promising initial results: lower perplexity than Transformers, and better zero- and few-shot performance than Transformers on a majority of tasks in the SuperGLUE benchmark.
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- Zichang Liu, Jue Wang, Beidi Chen
- 26 October 2023
Computer Science
DejaVu is proposed, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference.
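The payoff of contextual sparsity can be sketched as follows (a hypothetical illustration; DejaVu's actual predictor is a learned low-cost network, and here the "prediction" is given as an index set): once the active neurons for a given input are known, only the corresponding rows and columns of the MLP weights need to be touched.

```python
import numpy as np

def sparse_mlp(x, W1, W2, predicted_idx):
    """Compute a ReLU MLP using only the neurons predicted to be active.
    W1: (ffn, d), W2: (d, ffn). Cost scales with len(predicted_idx), not ffn."""
    h = np.maximum(W1[predicted_idx] @ x, 0.0)   # activations of selected neurons
    return W2[:, predicted_idx] @ h              # their contribution to the output
```

If the prediction covers every neuron that actually fires, the result equals the dense computation exactly, since ReLU zeroes out the rest.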
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
- Tri Dao, Beidi Chen, Christopher Ré
- 1 April 2022
Computer Science, Mathematics
Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution, and Monarch matrices achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications.
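The Monarch structure M = P L P R (L, R block-diagonal, P a fixed "reshape-transpose" permutation) admits a fast matrix-vector product; the sketch below illustrates the structure for n = m² (a NumPy illustration of the factorization, not the paper's batched GPU implementation).

```python
import numpy as np

def monarch_matvec(L, R, x):
    """Multiply a Monarch matrix M = P L P R by x in O(n^{3/2}) time.
    L, R: (m, m, m) stacks of m x m diagonal blocks; x has n = m*m entries.
    P is the permutation that transposes the row-major (m, m) view of x."""
    m = L.shape[0]
    X = x.reshape(m, m)                      # x viewed as an m x m grid
    Y = np.einsum('ijk,ik->ij', R, X).T      # apply block-diagonal R, then P
    Z = np.einsum('ijk,ik->ij', L, Y).T      # apply block-diagonal L, then P
    return Z.reshape(-1)
```

Each einsum applies one m×m block per row of the grid, so the total cost is 2·m·m² = O(n^{3/2}) instead of O(n²) for a dense product.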
Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
- Tri Dao, Albert Gu, Matthew Eichhorn, A. Rudra, C. Ré
- 14 March 2019
Computer Science, Mathematics
This work introduces a parameterization of divide-and-conquer methods that can automatically learn an efficient algorithm for many important transforms, and can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations.
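A butterfly matrix applies log₂(n) stages of learnable 2×2 mixing at doubling strides, giving an O(n log n) matrix-vector product; the sketch below illustrates the parameterization (with fixed twiddles it recovers classical fast transforms such as the Walsh-Hadamard transform, though the paper learns the twiddles by gradient descent).

```python
import numpy as np

def butterfly_matvec(twiddles, x):
    """Apply a butterfly matrix to x in O(n log n), n a power of two.
    twiddles: list of log2(n) stages, each an (n//2, 2, 2) array of
    2x2 mixing matrices for the index pairs at that stage's stride."""
    n = x.size
    y = x.astype(float).copy()
    stride = 1
    for stage in twiddles:                   # strides 1, 2, 4, ...
        pair = 0
        for start in range(0, n, 2 * stride):
            for off in range(stride):
                i, j = start + off, start + off + stride
                a, b = y[i], y[j]
                T = stage[pair]
                y[i] = T[0, 0] * a + T[0, 1] * b
                y[j] = T[1, 0] * a + T[1, 1] * b
                pair += 1
        stride *= 2
    return y
```

Setting every 2×2 block to [[1, 1], [1, -1]] turns this into the fast Walsh-Hadamard transform, one of the divide-and-conquer transforms the parameterization contains.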
...