FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Tri Dao, Daniel Y. Fu, Stefano Ermon, A. Rudra, Christopher Ré
- 27 May 2022
Computer Science
This work proposes FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM, and is optimal for a range of SRAM sizes.
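The tiling idea can be sketched in NumPy (a hypothetical re-creation for illustration, not the authors' fused CUDA kernel): process K/V in blocks while maintaining a running row-wise max and softmax denominator, so the full N×N score matrix is never materialized in slow memory.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Sketch of FlashAttention-style tiling with an online softmax.
    Q, K, V: (N, d) arrays. Only one (N, block) score tile exists at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)          # running row-wise max (numerical stability)
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale       # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)   # rescale previous partial results
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vj
        m = m_new
    return out / l[:, None]
```

The output matches dense softmax attention exactly; only the schedule of memory accesses changes.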
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Tri Dao
- 17 July 2023
Computer Science
This work tweaks the algorithm to reduce the number of non-matmul FLOPs, parallelizes the attention computation across different thread blocks (even for a single head) to increase occupancy, and distributes the work between warps to reduce communication through shared memory.
StarCoder: may the source be with you!
- Raymond Li, Loubna Ben Allal, H. de Vries
- 9 May 2023
Computer Science
Trans. Mach. Learn. Res.
This work performs the most comprehensive evaluation of Code LLMs to date and shows that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model.
HiPPO: Recurrent Memory with Optimal Polynomial Projections
- Albert Gu, Tri Dao, Stefano Ermon, A. Rudra, C. Ré
- 17 August 2020
Computer Science, Mathematics
This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoids priors on the timescale, and enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients.
Hyena Hierarchy: Towards Larger Convolutional Language Models
- Michael Poli, Stefano Massaroli, Christopher Ré
- 21 February 2023
Computer Science
This work proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating, and sets a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets.
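The core Hyena building blocks can be sketched as follows (a simplified illustration, not the paper's implementation: the filter here is an explicit array, whereas Hyena parametrizes it implicitly with a small network): long convolutions evaluated in O(N log N) via FFT, alternated with elementwise data-controlled gating.

```python
import numpy as np

def fft_long_conv(u, h):
    """Causal long convolution of signal u with filter h via FFT,
    using zero-padding to length 2N to avoid circular wrap-around."""
    N = u.shape[0]
    U = np.fft.rfft(u, n=2 * N)
    H = np.fft.rfft(h, n=2 * N)
    return np.fft.irfft(U * H, n=2 * N)[:N]

def hyena_op(v, gates, filters):
    """Hyena-style recurrence: alternate implicit long convolutions with
    elementwise gating by input-dependent projections (the 'gates')."""
    y = v
    for x, h in zip(gates, filters):
        y = x * fft_long_conv(y, h)
    return y
```

With an order-2 operator, `gates` and `filters` each hold two arrays; the whole operator stays subquadratic in sequence length.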
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
- Albert Gu, Isys Johnson, Christopher Ré
- 26 October 2021
Computer Science, Engineering
This work proposes the Linear State-Space Layer (LSSL), a simple sequence model inspired by control systems that generalizes RNN heuristics, temporal convolutions, and neural differential equations while addressing their shortcomings, and introduces a trainable subset of structured matrices that endows LSSLs with long-range memory.
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- Tri Dao, Daniel Y. Fu, Khaled Kamal Saab, A. Thomas, A. Rudra, Christopher Ré
- 28 December 2022
Computer Science
A new SSM layer, H3, is proposed that is explicitly designed for language modeling and achieves promising initial results: lower perplexity than Transformers, and better zero- and few-shot performance than Transformers on a majority of tasks in the SuperGLUE benchmark.
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- Zichang Liu, Jue Wang, Beidi Chen
- 26 October 2023
Computer Science
DejaVu is proposed, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference.
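The payoff of contextual sparsity can be sketched as follows (a hypothetical illustration; DejaVu's actual predictor is a learned low-cost network, and here the "prediction" is given as an index set): once the active neurons for a given input are known, only the corresponding rows and columns of the MLP weights need to be touched.

```python
import numpy as np

def sparse_mlp(x, W1, W2, predicted_idx):
    """Compute a ReLU MLP using only the neurons predicted to be active.
    W1: (ffn, d), W2: (d, ffn). Cost scales with len(predicted_idx), not ffn."""
    h = np.maximum(W1[predicted_idx] @ x, 0.0)   # activations of selected neurons
    return W2[:, predicted_idx] @ h              # their contribution to the output
```

If the prediction covers every neuron that actually fires, the result equals the dense computation exactly, since ReLU zeroes out the rest.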
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
- Tri Dao, Beidi Chen, Christopher Ré
- 1 April 2022
Computer Science, Mathematics
Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution, and Monarch matrices achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications.
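The Monarch structure M = P L P R (L, R block-diagonal, P a fixed "reshape-transpose" permutation) admits a fast matrix-vector product; the sketch below illustrates the structure for n = m² (a NumPy illustration of the factorization, not the paper's batched GPU implementation).

```python
import numpy as np

def monarch_matvec(L, R, x):
    """Multiply a Monarch matrix M = P L P R by x in O(n^{3/2}) time.
    L, R: (m, m, m) stacks of m x m diagonal blocks; x has n = m*m entries.
    P is the permutation that transposes the row-major (m, m) view of x."""
    m = L.shape[0]
    X = x.reshape(m, m)                      # x viewed as an m x m grid
    Y = np.einsum('ijk,ik->ij', R, X).T      # apply block-diagonal R, then P
    Z = np.einsum('ijk,ik->ij', L, Y).T      # apply block-diagonal L, then P
    return Z.reshape(-1)
```

Each einsum applies one m×m block per row of the grid, so the total cost is 2·m·m² = O(n^{3/2}) instead of O(n²) for a dense product.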
Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
- Tri Dao, Albert Gu, Matthew Eichhorn, A. Rudra, C. Ré
- 14 March 2019
Computer Science, Mathematics
This work introduces a parameterization of divide-and-conquer methods that can automatically learn an efficient algorithm for many important transforms, and can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations.
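A butterfly matrix applies log₂(n) stages of learnable 2×2 mixing at doubling strides, giving an O(n log n) matrix-vector product; the sketch below illustrates the parameterization (with fixed twiddles it recovers classical fast transforms such as the Walsh-Hadamard transform, though the paper learns the twiddles by gradient descent).

```python
import numpy as np

def butterfly_matvec(twiddles, x):
    """Apply a butterfly matrix to x in O(n log n), n a power of two.
    twiddles: list of log2(n) stages, each an (n//2, 2, 2) array of
    2x2 mixing matrices for the index pairs at that stage's stride."""
    n = x.size
    y = x.astype(float).copy()
    stride = 1
    for stage in twiddles:                   # strides 1, 2, 4, ...
        pair = 0
        for start in range(0, n, 2 * stride):
            for off in range(stride):
                i, j = start + off, start + off + stride
                a, b = y[i], y[j]
                T = stage[pair]
                y[i] = T[0, 0] * a + T[0, 1] * b
                y[j] = T[1, 0] * a + T[1, 1] * b
                pair += 1
        stride *= 2
    return y
```

Setting every 2×2 block to [[1, 1], [1, -1]] turns this into the fast Walsh-Hadamard transform, one of the divide-and-conquer transforms the parameterization contains.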
...