- See Also
- Gwern
- Links
- “Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond”, Jeffares et al 2024
- “The Slingshot Helps With Learning”, Wu 2024
- “Emergent Properties With Repeated Examples”, Charton & Kempe 2024
- “Grokking Modular Polynomials”, Doshi et al 2024
- “Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks”, He et al 2024
- “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024
- “Deep Grokking: Would Deep Neural Networks Generalize Better?”, Fan et al 2024
- “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Wang et al 2024
- “Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition”, Huang et al 2024
- “A Tale of Tails: Model Collapse As a Change of Scaling Laws”, Dohmatob et al 2024
- “Critical Data Size of Language Models from a Grokking Perspective”, Zhu et al 2024
- “Grokking Group Multiplication With Cosets”, Stander et al 2023
- “Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking”, Lyu et al 2023
- “Outliers With Opposing Signals Have an Outsized Effect on Neural Network Optimization”, Rosenfeld & Risteski 2023
- “Grokking Beyond Neural Networks: An Empirical Exploration With Model Complexity”, Miller et al 2023
- “Grokking in Linear Estimators—A Solvable Model That Groks without Understanding”, Levi et al 2023
- “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023
- “Grokking As the Transition from Lazy to Rich Training Dynamics”, Kumar et al 2023
- “PassUntil: Predicting Emergent Abilities With Infinite Resolution Evaluation”, Hu et al 2023
- “Explaining Grokking through Circuit Efficiency”, Varma et al 2023
- “Latent State Models of Training Dynamics”, Hu et al 2023
- “The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks”, Zhong et al 2023
- “Predicting Grokking Long Before It Happens: A Look into the Loss Landscape of Models Which Grok”, Notsawo et al 2023
- “A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations”, Chughtai et al 2023
- “Progress Measures for Grokking via Mechanistic Interpretability”, Nanda et al 2023
- “Grokking Phase Transitions in Learning Local Rules With Gradient Descent”, Žunkovič & Ilievski 2022
- “Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
- “The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon”, Thilak et al 2022
- “Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022
- “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [Paper]”, Power et al 2022
- “Learning through Atypical "Phase Transitions" in Overparameterized Neural Networks”, Baldassi et al 2021
- “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
- “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021
- “The Large Learning Rate Phase of Deep Learning: the Catapult Mechanism”, Lewkowycz et al 2020
- “A Recipe for Training Neural Networks”, Karpathy 2019
- “Sea-Snell/grokking: Unofficial Re-Implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"”
- “openai/grok”
- “teddykoker/grokking: PyTorch Implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"”
- “Hypothesis: Gradient Descent Prefers General Circuits”
- “Grokking: Generalization beyond Overfitting on Small Algorithmic Datasets (Paper Explained)”
- Sort By Magic
- Miscellaneous
- Bibliography
See Also
Gwern
“Hardware Hedging Against Scaling Regime Shifts”, Gwern 2024
Links
“Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond”, Jeffares et al 2024
“The Slingshot Helps With Learning”, Wu 2024
“Emergent Properties With Repeated Examples”, Charton & Kempe 2024
“Grokking Modular Polynomials”, Doshi et al 2024
“Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks”, He et al 2024
“Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024
“Deep Grokking: Would Deep Neural Networks Generalize Better?”, Fan et al 2024
“Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Wang et al 2024
“Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition”, Huang et al 2024
“A Tale of Tails: Model Collapse As a Change of Scaling Laws”, Dohmatob et al 2024
“Critical Data Size of Language Models from a Grokking Perspective”, Zhu et al 2024
“Grokking Group Multiplication With Cosets”, Stander et al 2023
“Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking”, Lyu et al 2023
“Outliers With Opposing Signals Have an Outsized Effect on Neural Network Optimization”, Rosenfeld & Risteski 2023
“Grokking Beyond Neural Networks: An Empirical Exploration With Model Complexity”, Miller et al 2023
“Grokking in Linear Estimators—A Solvable Model That Groks without Understanding”, Levi et al 2023
“To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023
“Grokking As the Transition from Lazy to Rich Training Dynamics”, Kumar et al 2023
“PassUntil: Predicting Emergent Abilities With Infinite Resolution Evaluation”, Hu et al 2023
“Explaining Grokking through Circuit Efficiency”, Varma et al 2023
“Latent State Models of Training Dynamics”, Hu et al 2023
“The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks”, Zhong et al 2023
“Predicting Grokking Long Before It Happens: A Look into the Loss Landscape of Models Which Grok”, Notsawo et al 2023
“A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations”, Chughtai et al 2023
“Progress Measures for Grokking via Mechanistic Interpretability”, Nanda et al 2023
“Grokking Phase Transitions in Learning Local Rules With Gradient Descent”, Žunkovič & Ilievski 2022
“Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
“The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon”, Thilak et al 2022
“Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022
“Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [Paper]”, Power et al 2022
“Learning through Atypical "Phase Transitions" in Overparameterized Neural Networks”, Baldassi et al 2021
“Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
“Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021
“The Large Learning Rate Phase of Deep Learning: the Catapult Mechanism”, Lewkowycz et al 2020
“A Recipe for Training Neural Networks”, Karpathy 2019
“Sea-Snell/grokking: Unofficial Re-Implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"”
“openai/grok”
“teddykoker/grokking: PyTorch Implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"”
“Hypothesis: Gradient Descent Prefers General Circuits”
“Grokking: Generalization beyond Overfitting on Small Algorithmic Datasets (Paper Explained)”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
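The ordering described above amounts to a greedy nearest-neighbor walk through embedding space. A minimal sketch of that idea, assuming the annotations have already been embedded as rows of a matrix (the `sort_by_magic` name and the cosine-similarity metric are illustrative assumptions, not the site's actual implementation):

```python
import numpy as np

def sort_by_magic(embeddings: np.ndarray) -> list[int]:
    """Greedy nearest-neighbor ordering: start from the newest annotation
    (assumed to be row 0) and repeatedly hop to the most similar
    not-yet-visited annotation, yielding a 'progression of topics'."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order, unvisited = [0], list(range(1, len(normed)))
    while unvisited:
        sims = normed[unvisited] @ normed[order[-1]]  # similarity to last-picked item
        order.append(unvisited.pop(int(np.argmax(sims))))
    return order

# Example: order 5 random 16-dimensional "annotation embeddings".
rng = np.random.default_rng(0)
print(sort_by_magic(rng.normal(size=(5, 16))))
```

Clustering the resulting sequence into auto-labeled sections (the tag groups listed above) would be a separate step on top of this ordering.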
phase-transition emergence dynamics optimization learning-mechanisms adaptive-learning
grokking-theory learning-dynamics representation-learning circuit-efficiency generalization
grokking-ai
Miscellaneous
- /doc/ai/scaling/emergence/grokking/2024-fan-figure2-grokkingincreaseswithmlpdepth.jpg
- /doc/ai/scaling/emergence/grokking/2024-huang-figure1-phasediagramofregimesforgrokking.jpg
- /doc/ai/scaling/emergence/grokking/2024-huang-figure3-modelscalingincreasesgrokking.png
- /doc/ai/scaling/emergence/grokking/2024-wang-figure1-grokkingforimplicitreasoning.png
- /doc/ai/scaling/emergence/grokking/2024-zhu-figure11-yelptransformergrokkingshowingdegrokking.png
- /doc/ai/scaling/emergence/grokking/2024-zhu-figure3-yelpgrokkingresults.png
- /doc/ai/scaling/emergence/grokking/2022-liu-figure6-phasediagramsofgrokking.png
- /doc/ai/nn/fully-connected/2021-power-figure1-grokkinglearningcurves.jpg
- https://www.lesswrong.com/s/5omSW4wNKbEvYsyje/p/GpSzShaaf8po4rcmA
Bibliography
- https://www.lesswrong.com/posts/LncYobrn3vRr7qkZW/the-slingshot-helps-with-learning: “The Slingshot Helps With Learning”, Wu 2024
- https://arxiv.org/abs/2405.20233: “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024
- https://arxiv.org/abs/2405.19454: “Deep Grokking: Would Deep Neural Networks Generalize Better?”, Fan et al 2024
- https://arxiv.org/abs/2405.15071: “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Wang et al 2024
- https://arxiv.org/abs/2402.15175: “Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition”, Huang et al 2024
- https://arxiv.org/abs/2402.07043: “A Tale of Tails: Model Collapse As a Change of Scaling Laws”, Dohmatob et al 2024
- https://arxiv.org/abs/2401.10463: “Critical Data Size of Language Models from a Grokking Perspective”, Zhu et al 2024
- https://arxiv.org/abs/2310.13061: “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023
- https://arxiv.org/abs/2310.03262: “PassUntil: Predicting Emergent Abilities With Infinite Resolution Evaluation”, Hu et al 2023
- https://arxiv.org/abs/2306.13253: “Predicting Grokking Long Before It Happens: A Look into the Loss Landscape of Models Which Grok”, Notsawo et al 2023
- https://arxiv.org/abs/2301.05217: “Progress Measures for Grokking via Mechanistic Interpretability”, Nanda et al 2023
- https://arxiv.org/abs/2210.15435: “Grokking Phase Transitions in Learning Local Rules With Gradient Descent”, Žunkovič & Ilievski 2022
- https://arxiv.org/abs/2210.01117: “Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
- https://arxiv.org/abs/2205.10343: “Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022
- https://arxiv.org/abs/2110.00683: “Learning through Atypical "Phase Transitions" in Overparameterized Neural Networks”, Baldassi et al 2021
- https://arxiv.org/abs/2106.05237#google: “Knowledge Distillation: A Good Teacher Is Patient and Consistent”, Beyer et al 2021
- 2021-power.pdf#openai: “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021
- https://karpathy.github.io/2019/04/25/recipe/: “A Recipe for Training Neural Networks”, Karpathy 2019