‘self-attention’ directory

Gwern

Gwern

“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023

“Research Ideas”, Gwern 2017

“GPT-3 Creative Fiction”, Gwern 2020

Links

“It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization”, Behrouz et al 2025

“Dynamic Tanh: Transformers without Normalization”, Zhu et al 2025

“(How) Do Language Models Track State?”, Li et al 2025

“Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners”, Paliotta et al 2025

“Leveraging the True Depth of LLMs”, González et al 2025

“Language Models Use Trigonometry to Do Addition”, Kantamneni 2025

“Test-Time Regression: a Unifying Framework for Designing Sequence Models With Associative Memory”, Wang et al 2025

“How Has DeepSeek Improved the Transformer Architecture?”, Erdil 2025

“Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025

“MiniMax-01: Scaling Foundation Models With Lightning Attention”, MiniMax et al 2025

“Emergent Effects of Scaling on the Functional Hierarchies within Large Language Models”, Foop 2025

“ICLR: In-Context Learning of Representations”, Park et al 2024

“Hymba: A Hybrid-Head Architecture for Small Language Models”, Dong et al 2024

“Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models”, Ruis et al 2024

“Long Context RAG Performance of Large Language Models”, Leng et al 2024

“Ask, and It Shall Be Given: Turing Completeness of Prompting”, Qiu et al 2024

“The Belief State Transformer”, Hu et al 2024

“ALTA: Compiler-Based Analysis of Transformers”, Shaw et al 2024

“Tackling the Abstraction and Reasoning Corpus With Vision Transformers: the Importance of 2D Representation, Positions, and Objects”, Li et al 2024

“Differential Transformer”, Ye et al 2024

“Were RNNs All We Needed?”, Feng et al 2024

“nGPT: Normalized Transformer With Representation Learning on the Hypersphere”, Loshchilov et al 2024

“Masked Mixers for Language Generation and Retrieval”, Badger 2024

“The Mamba in the Llama: Distilling and Accelerating Hybrid Models”, Wang et al 2024

“When Can Transformers Count to n?”, Yehudai et al 2024

“What Matters in Transformers? Not All Attention Is Needed”, He et al 2024

“Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024

“An Empirical Study of Mamba-Based Language Models”, Waleffe et al 2024

“Attention As a Hypernetwork”, Schug et al 2024

“Scalable Matmul-Free Language Modeling”, Zhu et al 2024

“A Theoretical Understanding of Self-Correction through In-Context Alignment”, Wang et al 2024

“Attention As an RNN”, Feng et al 2024

“Your Transformer Is Secretly Linear”, Razzhigaev et al 2024

“Retrieval Head Mechanistically Explains Long-Context Factuality”, Wu et al 2024

“Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models”, Pfau et al 2024

“Towards Smaller, Faster Decoder-Only Transformers: Architectural Variants and Their Implications”, Suresh & P 2024

“RULER: What’s the Real Context Size of Your Long-Context Language Models?”, Hsieh et al 2024

“ReFT: Representation Finetuning for Language Models”, Wu et al 2024

“Do Language Models Plan Ahead for Future Tokens?”, Wu et al 2024

“Streamlining Redundant Layers to Compress Large Language Models”, Chen et al 2024

“Long-Form Factuality in Large Language Models”, Wei et al 2024

“Mechanistic Design and Scaling of Hybrid Architectures”, Poli et al 2024

“8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Levy 2024

“How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024

“RNNs Are Not Transformers (Yet): The Key Bottleneck on In-Context Retrieval”, Wen et al 2024

“A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention”, Cui et al 2024

“Rethinking Patch Dependence for Masked Autoencoders”, Fu et al 2024

“Attention versus Contrastive Learning of Tabular Data—A Data-Centric Benchmarking”, Rabbani et al 2024

“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”

“SwitchHead: Accelerating Transformers With Mixture-Of-Experts Attention”, Csordás et al 2023

“Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models”, Variengien & Winsor 2023

“Can a Transformer Represent a Kalman Filter?”, Goel & Bartlett 2023

“Efficient Transformer Knowledge Distillation: A Performance Review”, Brown et al 2023

“Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023

“In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023

“On Prefrontal Working Memory and Hippocampal Episodic Memory: Unifying Memories Stored in Weights and Activation Slots”, Whittington et al 2023

“LSS Transformer: Ultra-Long Sequence Distributed Transformer”, Wang et al 2023

“Simplifying Transformer Blocks”, He & Hofmann 2023

“GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling”, Katsch 2023

“Not All Layers Are Equally As Important: Every Layer Counts BERT”, Charpentier & Samuel 2023

“Implicit Chain-Of-Thought Reasoning via Knowledge Distillation”, Deng et al 2023

“Training Dynamics of Contextual N-Grams in Language Models”, Quirke et al 2023

“The Impact of Depth and Width on Transformer Language Model Generalization”, Petty et al 2023

“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023

“Characterizing Mechanisms for Factual Recall in Language Models”, Yu et al 2023

“Linear Representations of Sentiment in Large Language Models”, Tigges et al 2023

“Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages”, Angluin et al 2023

“How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?”, Wu et al 2023

“Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”, Amos et al 2023

“Vision Transformers Need Registers”, Darcet et al 2023

“Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, Zhou et al 2023

“Replacing Softmax With ReLU in Vision Transformers”, Wortsman et al 2023

“Gated Recurrent Neural Networks Discover Attention”, Zucchet et al 2023

“One Wide Feedforward Is All You Need”, Pires et al 2023

“Activation Addition: Steering Language Models Without Optimization”, Turner et al 2023

“Linearity of Relation Decoding in Transformer Language Models”, Hernandez et al 2023

“The Hydra Effect: Emergent Self-Repair in Language Model Computations”, McGrath et al 2023

“Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023

“FlashAttention-2: Faster Attention With Better Parallelism and Work Partitioning”, Dao 2023

“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023

“Lost in the Middle: How Language Models Use Long Contexts”, Liu et al 2023

“Trainable Transformer in Transformer”, Panigrahi et al 2023

“Transformers Learn to Implement Preconditioned Gradient Descent for In-Context Learning”, Ahn et al 2023

“White-Box Transformers via Sparse Rate Reduction”, Yu et al 2023

“Blockwise Parallel Transformer for Long Context Large Models”, Liu & Abbeel 2023

“TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, Hardt & Sun 2023

“Brainformers: Trading Simplicity for Efficiency”, Zhou et al 2023

“GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, Ainslie et al 2023

“Mimetic Initialization of Self-Attention Layers”, Trockman & Kolter 2023

“Toeplitz Neural Network for Sequence Modeling”, Qin et al 2023

“Finding Neurons in a Haystack: Case Studies With Sparse Probing”, Gurnee et al 2023

“How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model”, Hanna et al 2023

“Coinductive Guide to Inductive Transformer Heads”, Nemecek 2023

“Tighter Bounds on the Expressivity of Transformer Encoders”, Chiang et al 2023

“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023

“Skip-Attention: Improving Vision Transformers by Paying Less Attention”, Venkataramanan et al 2023

“Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Fu et al 2022

“Scalable Adaptive Computation for Iterative Generation”, Jabri et al 2022

“Pretraining Without Attention”, Wang et al 2022

“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022

“Transformers Learn In-Context by Gradient Descent”, Oswald et al 2022

“What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022

“Efficiently Scaling Transformer Inference”, Pope et al 2022

“Transformers Learn Shortcuts to Automata”, Liu et al 2022

“Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling”, Chang et al 2022

“Transformers Implement First-Order Logic With Majority Quantifiers”, Merrill & Sabharwal 2022

“The Lie Derivative for Measuring Learned Equivariance”, Gruver et al 2022

“Relaxed Attention for Transformer Models”, Lohrenz et al 2022

“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022

“Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments”, Dong et al 2022

“N-Grammer: Augmenting Transformers With Latent n-Grams”, Roy et al 2022

“Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits”, Merrill & Sabharwal 2022

“Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Irie et al 2022

“FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Dao et al 2022

“TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge et al 2022

“Transformer Language Models without Positional Encodings Still Learn Positional Information”, Haviv et al 2022

“Overcoming a Theoretical Limitation of Self-Attention”, Chiang & Cholak 2022

“It’s Raw! Audio Generation With State-Space Models”, Goel et al 2022

“General-Purpose, Long-Context Autoregressive Modeling With Perceiver AR”, Hawthorne et al 2022

“Transformer Memory As a Differentiable Search Index”, Tay et al 2022

“The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Irie et al 2022

“Attention Approximates Sparse Distributed Memory”, Bricken & Pehlevan 2021

“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021

“Long-Range Transformers for Dynamic Spatiotemporal Forecasting”, Grigsby et al 2021

“Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Press et al 2021

“Do Vision Transformers See Like Convolutional Neural Networks?”, Raghu et al 2021

“Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding”, Luo et al 2021

“RASP: Thinking Like Transformers”, Weiss et al 2021

“On the Distribution, Sparsity, and Inference-Time Quantization of Attention Values in Transformers”, Ji et al 2021

“SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training”, Somepalli et al 2021

“Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Wang et al 2021

“Less Is More: Pay Less Attention in Vision Transformers”, Pan et al 2021

“FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021

“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021

“RoFormer: Enhanced Transformer With Rotary Position Embedding”, Su et al 2021

“ALD: Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021

“Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially With Depth”, Dong et al 2021

“Do Transformer Modifications Transfer Across Implementations and Applications?”, Narang et al 2021

“Linear Transformers Are Secretly Fast Weight Programmers”, Schlag et al 2021

“Unlocking Pixels for Reinforcement Learning via Implicit Attention”, Choromanski et al 2021

“Transformer Feed-Forward Layers Are Key-Value Memories”, Geva et al 2020

“AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020

“Inductive Biases for Deep Learning of Higher-Level Cognition”, Goyal & Bengio 2020

“Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Tay et al 2020

“Current Limitations of Language Models: What You Need Is Retrieval”, Komatsuzaki 2020

“Efficient Transformers: A Survey”, Tay et al 2020

“HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu et al 2020

“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern 2020

“Pre-Training via Paraphrasing”, Lewis et al 2020

“Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski et al 2020

“GPT-3: Language Models Are Few-Shot Learners”, Brown et al 2020

“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al 2020

“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020

“PowerNorm: Rethinking Batch Normalization in Transformers”, Shen et al 2020

“On Layer Normalization in the Transformer Architecture”, Xiong et al 2020

“REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al 2020

“BERT’s Output Layer Recognizes All Hidden Layers? Some Intriguing Phenomena and a Simple Way to Boost BERT”, Kao et al 2020

“Rethinking Attention With Performers”, Choromanski & Colwell 2020

“Dynamic Convolution: Attention over Convolution Kernels”, Chen et al 2019

“Generalization through Memorization: Nearest Neighbor Language Models”, Khandelwal et al 2019

“Multiplicative Interactions and Where to Find Them”, Jayakumar et al 2019

“The Bottom-Up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives”, Voita et al 2019

“Large Memory Layers With Product Keys”, Lample et al 2019

“What Does BERT Look At? An Analysis of BERT’s Attention”, Clark et al 2019

“Are 16 Heads Really Better Than One?”, Michel et al 2019

“Pay Less Attention With Lightweight and Dynamic Convolutions”, Wu et al 2019

“On the Turing Completeness of Modern Neural Network Architectures”, Pérez et al 2019

“Music Transformer”, Huang et al 2018

“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018

“Attention Is All You Need”, Vaswani et al 2017

“A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017

“Get To The Point: Summarization With Pointer-Generator Networks”, See et al 2017

“RAM: Dynamic Computational Time for Visual Attention”, Li et al 2017

“Hybrid Computing Using a Neural Network With Dynamic External Memory”, Graves et al 2016

“Scaling Memory-Augmented Neural Networks With Sparse Reads and Writes”, Rae et al 2016

“Modeling Human Reading With Neural Attention”, Hahn & Keller 2016

“Iterative Alternating Neural Attention for Machine Reading”, Sordoni et al 2016

“Adaptive Computation Time for Recurrent Neural Networks”, Graves 2016

“Foveation-Based Mechanisms Alleviate Adversarial Examples”, Luo et al 2015

“Generating Images from Captions With Attention”, Mansimov et al 2015

“DRAW: A Recurrent Neural Network For Image Generation”, Gregor et al 2015

“Neural Turing Machines”, Graves et al 2014

“Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al 2014

“On Learning Where To Look”, Ranzato 2014

“Generating Sequences With Recurrent Neural Networks”, Graves 2013

“Efficient Transformers: A Survey § Table 1 ”

Efficient Transformers: A Survey § Table 1

“Attention and Augmented Recurrent Neural Networks ”

Attention and Augmented Recurrent Neural Networks

“Hierarchical Object Detection With Deep Reinforcement Learning ”

Hierarchical Object Detection with Deep Reinforcement Learning :

View HTML:

/doc/www/imatge-upc.github.io/f106bf397cea4b3c184a40c91893ee695f7646df.html

“The Transformer Family: Attention and Self-Attention • Multi-Head Self-Attention • Transformer • Adaptive Computation Time (ACT) • Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) • Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) • Make It Recurrent (Universal Transformer) • Stabilization for RL (GTrXL) ”

The Transformer Family: Attention and Self-Attention • Multi-Head Self-Attention • Transformer • Adaptive Computation Time (ACT) • Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) • Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) • Make it Recurrent (Universal Transformer) • Stabilization for RL (GTrXL) :

View HTML:

/doc/www/lilianweng.github.io/8428b81f97ec6ee2f1a32d095f038a405e743f80.html#openai

“100M Token Context Windows ”

100M Token Context Windows :

View HTML:

/doc/www/magic.dev/10d25fd80679c2f82ef9b54a1ee926499842f78b.html

“Learning to Combine Foveal Glimpses With a Third-Order Boltzmann Machine ”

Learning to combine foveal glimpses with a third-order Boltzmann machine :

View PDF:

/doc/www/pdfs.semanticscholar.org/972d1bb895d5d4fc9493ea05e660d61b9932b8d8.pdf

“Show, Attend and Tell: Neural Image Caption Generation With Visual Attention ”

Show, attend and tell: Neural image caption generation with visual attention :

View PDF:

/doc/www/proceedings.mlr.press/dfee6126675536fdd017d3051a161a6d79ccb35e.pdf

“Recurrent Models of Visual Attention ”

Recurrent models of visual attention :

View HTML:

/doc/www/proceedings.neurips.cc/4eb939caec6f17fbc9716d110a34f3c908b0cc75.html

“Can Active Memory Replace Attention? ”

Can Active Memory Replace Attention? :

View HTML:

/doc/www/proceedings.neurips.cc/ca4ef6808478dad2149c5617162e8c0d19640518.html

“Dzmitry Bahdanau ”

Dzmitry Bahdanau :

View HTML:

/doc/www/rizar.github.io/1e88f991d7ad84ac219ea8c8a61b6e3176cc9746.html

“Scaling Automatic Neuron Description ”

Scaling Automatic Neuron Description :

View HTML:

/doc/www/transluce.org/6c2fbae2fa9fde8568197325cfcd9709487c70d4.html

“Monitor: An AI-Driven Observability Interface ”

Monitor: An AI-Driven Observability Interface

“Interpreting GPT: the Logit Lens ”

interpreting GPT: the logit lens :

View External Link:

https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

“A Sober Look at Steering Vectors for LLMs ”

A Sober Look at Steering Vectors for LLMs :

View HTML:

/doc/www/www.greaterwrong.com/0e3720e4371330f663eb94d94bb64c80e1af3096.html

“One-Shot Steering Vectors Cause Emergent Misalignment, Too ”

One-shot steering vectors cause emergent misalignment, too

“A Survey of Long-Term Context in Transformers: Sparse Transformers • Adaptive Span Transformers • Transformer-XL • Compressive Transformers • Reformer • Routing Transformer • Sinkhorn Transformer • Linformer • Efficient Attention: Attention With Linear Complexities • Transformers Are RNNs • ETC • Longformer”

A Survey of Long-Term Context in Transformers: Sparse Transformers • Adaptive Span Transformers • Transformer-XL • Compressive Transformers • Reformer • Routing Transformer • Sinkhorn Transformer • Linformer • Efficient Attention: Attention with Linear Complexities • Transformers are RNNs • ETC • Longformer:

View HTML: /doc/www/www.pragmatic.ml/f15ba6e21a65128a4daf2146db0ba229b6b2e6b9.html

“FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision”

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision:

View HTML: /doc/www/www.together.ai/c17cf751cc778ec4481da07e013f94580bf3db97.html

Miscellaneous

Bibliography

https://arxiv.org/abs/2503.10622#facebook: “Dynamic Tanh: Transformers without Normalization”, Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

https://arxiv.org/abs/2502.20339: “Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners”, Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y. Li, Aviv Bick, J. Zico Kolter, Albert Gu, François Fleuret, Tri Dao

https://arxiv.org/abs/2501.08313#minimax: “MiniMax-01: Scaling Foundation Models With Lightning Attention”, MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu

https://arxiv.org/abs/2410.18077#deepmind: “ALTA: Compiler-Based Analysis of Transformers”, Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova

https://arxiv.org/abs/2410.06405: “Tackling the Abstraction and Reasoning Corpus With Vision Transformers: the Importance of 2D Representation, Positions, and Objects”, Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil

https://arxiv.org/abs/2410.01201: “Were RNNs All We Needed?”, Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadeghi

https://arxiv.org/abs/2408.15237: “The Mamba in the Llama: Distilling and Accelerating Hybrid Models”, Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

https://arxiv.org/abs/2406.15786: “What Matters in Transformers? Not All Attention Is Needed”, Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

https://arxiv.org/abs/2406.13121#google: “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu

https://arxiv.org/abs/2406.07887: “An Empirical Study of Mamba-Based Language Models”, Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

https://arxiv.org/abs/2404.15574: “Retrieval Head Mechanistically Explains Long-Context Factuality”, Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu

https://arxiv.org/abs/2404.15758: “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models”, Jacob Pfau, William Merrill, Samuel R. Bowman

https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models”, Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

https://arxiv.org/abs/2403.17844: “Mechanistic Design and Scaling of Hybrid Architectures”, Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/: “8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”, Steven Levy

https://arxiv.org/abs/2401.14391: “Rethinking Patch Dependence for Masked Autoencoders”, Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

https://arxiv.org/abs/2311.13657: “Efficient Transformer Knowledge Distillation: A Performance Review”, Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence

https://arxiv.org/abs/2311.02265: “Not All Layers Are Equally As Important: Every Layer Counts BERT”, Lucas Georges Gabriel Charpentier, David Samuel

https://arxiv.org/abs/2310.15154: “Linear Representations of Sentiment in Large Language Models”, Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

https://arxiv.org/abs/2310.02980: “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”, Ido Amos, Jonathan Berant, Ankit Gupta

https://arxiv.org/abs/2309.10713: “Interpret Vision Transformers As ConvNets With Dynamic Convolutions”, Chong Zhou, Chen Change Loy, Bo Dai

https://arxiv.org/abs/2309.08586: “Replacing Softmax With ReLU in Vision Transformers”, Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith

https://arxiv.org/abs/2308.10248: “Activation Addition: Steering Language Models Without Optimization”, Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

https://arxiv.org/abs/2305.18466: “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”, Moritz Hardt, Yu Sun

https://arxiv.org/abs/2306.00008#google: “Brainformers: Trading Simplicity for Efficiency”, Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc V. Le, Claire Cui, James Laundon, Jeff Dean

https://arxiv.org/abs/2305.09828: “Mimetic Initialization of Self-Attention Layers”, Asher Trockman, J. Zico Kolter

https://arxiv.org/abs/2301.02240: “Skip-Attention: Improving Vision Transformers by Paying Less Attention”, Shashanka Venkataramanan, Amir Ghodrati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian

https://arxiv.org/abs/2212.14052: “Hungry Hungry Hippos: Towards Language Modeling With State Space Models”, Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

https://arxiv.org/abs/2212.10544: “Pretraining Without Attention”, Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush

https://arxiv.org/abs/2212.07677#google: “Transformers Learn In-Context by Gradient Descent”, Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

https://arxiv.org/abs/2211.15661#google: “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou

https://arxiv.org/abs/2211.05102#google: “Efficiently Scaling Transformer Inference”, Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean

https://arxiv.org/abs/2210.05043: “Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling”, Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, Andrew McCallum

https://arxiv.org/abs/2208.01066: “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant

https://arxiv.org/abs/2206.01649#schmidhuber: “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”, Kazuki Irie, Francesco Faccio, Jürgen Schmidhuber

https://arxiv.org/abs/2205.14135: “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

https://arxiv.org/abs/2204.03638#facebook: “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh

https://arxiv.org/abs/2202.09729: “It’s Raw! Audio Generation With State-Space Models”, Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

https://arxiv.org/abs/2202.07765#deepmind: “General-Purpose, Long-Context Autoregressive Modeling With Perceiver AR”, Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

https://arxiv.org/abs/2108.12409#facebook: “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”, Ofir Press, Noah Smith, Mike Lewis

https://arxiv.org/abs/2108.08810#google: “Do Vision Transformers See Like Convolutional Neural Networks?”, Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

https://arxiv.org/abs/2106.06981: “RASP: Thinking Like Transformers”, Gail Weiss, Yoav Goldberg, Eran Yahav

https://arxiv.org/abs/2105.15075: “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”, Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

https://arxiv.org/abs/2105.14217: “Less Is More: Pay Less Attention in Vision Transformers”, Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon

https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Luke Melas-Kyriazi

https://openreview.net/forum?id=qVyeW-grC2k#google: “Long Range Arena (LRA): A Benchmark for Efficient Transformers”, Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler

https://arxiv.org/abs/2009.06732#google: “Efficient Transformers: A Survey”, Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

https://arxiv.org/abs/2008.07669: “HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Ré

abstract: “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern

https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

https://arxiv.org/abs/2003.07845: “PowerNorm: Rethinking Batch Normalization in Transformers”, Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

https://arxiv.org/abs/2001.09309: “BERT’s Output Layer Recognizes All Hidden Layers? Some Intriguing Phenomena and a Simple Way to Boost BERT”, Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee

https://arxiv.org/abs/1912.03458#microsoft: “Dynamic Convolution: Attention over Convolution Kernels”, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, Zicheng Liu
