‘MLP NN’ directory

Gwern

“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023

“Research Ideas”, Gwern 2017

“Modular Brain AUNNs for Uploads”, Gwern 2023

“Language-Conditioned Absolute Unit NNs”, Gwern 2022

Links

“NeuRaLaTeX: A Machine Learning Library Written in Pure LaTeX”, Gardner et al 2025

“NeuralSVG: An Implicit Representation for Text-To-Vector Generation”, Polaczek et al 2025

“Titans: Learning to Memorize at Test Time”, Behrouz et al 2024

“AUNN: Simple Implementation of Gwern’s AUNN Proposal”, Roland 2024

“Flexible Task Abstractions Emerge in Linear Networks With Fast and Bounded Units”, Sandbrink et al 2024

“The Slingshot Helps With Learning”, Wu 2024

“SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning”, Lee et al 2024

“Bilinear MLPs Enable Weight-Based Mechanistic Interpretability”, Pearce et al 2024

“nGPT: Normalized Transformer With Representation Learning on the Hypersphere”, Loshchilov et al 2024

“How Feature Learning Can Improve Neural Scaling Laws”, Bordelon et al 2024

“Magika: AI-Powered Content-Type Detection”, Fratantonio et al 2024

“On the Complexity of Neural Computation in Superposition”, Adler & Shavit 2024

“Masked Mixers for Language Generation and Retrieval”, Badger 2024

“GSoC 2024: Differentiable Logic for Interactive Systems and Generative Music”

“PEER: Mixture of A Million Experts”, He 2024

“What Matters in Transformers? Not All Attention Is Needed”, He et al 2024

“When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Chang et al 2024

“Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Zhao et al 2024

“MAR: Autoregressive Image Generation without Vector Quantization”, Li et al 2024

“Grokking Modular Polynomials”, Doshi et al 2024

“Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Lee et al 2024

“Lateralization MLP: A Simple Brain-Inspired Architecture for Diffusion”, Hu & Rostami 2024

“Bigger, Regularized, Optimistic: Scaling for Compute and Sample-Efficient Continuous Control”, Nauman et al 2024

“MLPs Learn In-Context”, Tong & Pehlevan 2024

“Verified Neural Compressed Sensing”, Bunel et al 2024

“An Exactly Solvable Model for Emergence and Scaling Laws in the Multitask Sparse Parity Problem”, Nam et al 2024

“Neural Redshift: Random Networks Are Not Random Functions”, Teney et al 2024

“Surfing the OCEAN: The Machine Learning Psycholexical Approach 2.0 to Detect Personality Traits in Texts”, Giannini et al 2024

“Neural Spline Fields for Burst Image Fusion and Layer Separation”, Chugunov et al 2023

“SwitchHead: Accelerating Transformers With Mixture-Of-Experts Attention”, Csordás et al 2023

“SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration”, Duckworth et al 2023

“Grokking Group Multiplication With Cosets”, Stander et al 2023

“Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023

“HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023

“Grokking Beyond Neural Networks: An Empirical Exploration With Model Complexity”, Miller et al 2023

“To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Doshi et al 2023

“Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023

“One Wide Feedforward Is All You Need”, Pires et al 2023

“Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023

“Self Expanding Neural Networks”, Mitchell et al 2023

“The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks”, Zhong et al 2023

“Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023

“Any Deep ReLU Network Is Shallow”, Villani & Schoots 2023

“Does the First Letter of One’s Name Affect Life Decisions? A Natural Language Processing Examination of Nominative Determinism”, Chatterjee et al 2023

“How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model”, Hanna et al 2023

“Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag”, Yan et al 2023

“HyperDiffusion: Generating Implicit Neural Fields With Weight-Space Diffusion”, Erkoç et al 2023

“The Quantization Model of Neural Scaling”, Michaud et al 2023

“TSMixer: An All-MLP Architecture for Time Series Forecasting”, Chen et al 2023

“Loss Landscapes Are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent”, Chiang et al 2023

“A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations”, Chughtai et al 2023

“Looped Transformers As Programmable Computers”, Giannou et al 2023

“Organic Reaction Mechanism Classification Using Machine Learning”, Burés & Larrosa 2023

“DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023

“Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning”, Levin et al 2022

“Magic3D: High-Resolution Text-To-3D Content Creation”, Lin et al 2022

“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022

“Compressing Multidimensional Weather and Climate Data into Neural Networks”, Huang & Hoefler 2022

“Deep Differentiable Logic Gate Networks”, Petersen et al 2022

“The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”, Li et al 2022

“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022

“Scaling Forward Gradient With Local Losses”, Ren et al 2022

“The Lie Derivative for Measuring Learned Equivariance”, Gruver et al 2022

“Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022

“DreamFusion: Text-To-3D Using 2D Diffusion”, Poole et al 2022

“`g.pt`: Learning to Learn With Generative Models of Neural Network Checkpoints”, Peebles et al 2022

“Random Initializations Performing above Chance and How to Find Them”, Benzing et al 2022

“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022

“Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022

“Revisiting Pretraining Objectives for Tabular Deep Learning”, Rubachev et al 2022

“RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022

“MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022

“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022

“Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022

“Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022

“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022

“Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”, Zhang & Wang 2022

“Efficient Language Modeling With Sparse All-MLP”, Yu et al 2022

“HyperMixer: An MLP-Based Low Cost Alternative to Transformers”, Mai et al 2022

“MLP-ASR: Sequence-Length Agnostic All-MLP Architectures for Speech Recognition”, Sakuma et al 2022

“Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”, Zheng et al 2022

“pNLP-Mixer: an Efficient All-MLP Architecture for Language”, Fusco et al 2022

“Data-Driven Emergence of Convolutional Structure in Neural Networks”, Ingrosso & Goldt 2022

“When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)”, Wang et al 2022

“ConvMixer: Patches Are All You Need?”, Trockman & Kolter 2022

“MAXIM: Multi-Axis MLP for Image Processing”, Tu et al 2022

“Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [Paper]”, Power et al 2022

“The GatedTabTransformer: An Enhanced Deep Learning Architecture for Tabular Modeling”, Cholakov & Kolev 2022

“MLP Architectures for Vision-And-Language Modeling: An Empirical Study”, Nie et al 2021

“Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021

“Residual Pathway Priors for Soft Equivariance Constraints”, Finzi et al 2021

“Zero-Shot Text-Guided Object Generation With Dream Fields”, Jain et al 2021

“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021

“PointMixer: MLP-Mixer for Point Cloud Understanding”, Choe et al 2021

“MetaFormer Is Actually What You Need for Vision”, Yu et al 2021

“Deep Learning without Shortcuts: Shaping the Kernel With Tailored Rectifiers”, Zhang et al 2021

“ZerO Initialization: Initializing Residual Networks With Only Zeros and Ones”, Zhao et al 2021

“Wide Neural Networks Forget Less Catastrophically”, Mirzadeh et al 2021

“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021

“Rapid Training of Deep Neural Networks without Skip Connections or Normalization Layers Using Deep Kernel Shaping”, Martens et al 2021

“Exploring the Limits of Large Scale Pre-Training”, Abnar et al 2021

“Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Tang et al 2021

“AFT: An Attention Free Transformer”, Zhai et al 2021

“ConvMLP: Hierarchical Convolutional MLPs for Vision”, Li et al 2021

“Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021

“A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”, Zhao et al 2021

“Hire-MLP: Vision MLP via Hierarchical Rearrangement”, Guo et al 2021

“RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality?”, Tatsunami & Taki 2021

“S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”, Yu et al 2021

“CycleMLP: A MLP-Like Architecture for Dense Prediction”, Chen et al 2021

“AS-MLP: An Axial Shifted MLP Architecture for Vision”, Lian et al 2021

“Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Hou et al 2021

“Real-Time Neural Radiance Caching for Path Tracing”, Müller et al 2021

“Towards Biologically Plausible Convolutional Networks”, Pogodin et al 2021

“Well-Tuned Simple Nets Excel on Tabular Datasets”, Kadra et al 2021

“MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, Tae et al 2021

“PairConnect: A Compute-Efficient MLP Alternative to Attention”, Xu et al 2021

“S²-MLP: Spatial-Shift MLP Architecture for Vision”, Yu et al 2021

“When Vision Transformers Outperform ResNets without Pre-Training or Strong Data Augmentations”, Chen et al 2021

“Container: Context Aggregation Network”, Gao et al 2021

“MixerGAN: An MLP-Based Architecture for Unpaired Image-To-Image Translation”, Cazenavette & Guevara 2021

“One4all User Representation for Recommender Systems in E-Commerce”, Shin et al 2021

“Pay Attention to MLPs”, Liu et al 2021

“FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021

“ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training”, Touvron et al 2021

“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021

“Multi-Scale Inference of Genetic Trait Architecture Using Biologically Annotated Neural Networks”, Demetci et al 2021

“RepMLP: Re-Parameterizing Convolutions into Fully-Connected Layers for Image Recognition”, Ding et al 2021

“MLP-Mixer: An All-MLP Architecture for Vision”, Tolstikhin et al 2021

“Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021

“Sifting out the Features by Pruning: Are Convolutional Networks the Winning Lottery Ticket of Fully Connected Ones?”, Pellegrini & Biroli 2021

“Fully-Connected Neural Nets”, Gwern 2021

“Revisiting Simple Neural Probabilistic Language Models”, Sun & Iyyer 2021

“KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021

“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021

“Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially With Depth”, Dong et al 2021

“Clusterability in Neural Networks”, Filan et al 2021

“Training Larger Networks for Deep Reinforcement Learning”, Ota et al 2021

“Explaining Neural Scaling Laws”, Bahri et al 2021

“Neural Geometric Level of Detail: Real-Time Rendering With Implicit 3D Shapes”, Takikawa et al 2021

“Is MLP-Mixer a CNN in Disguise? As Part of This Blog Post, We Look at the MLP Mixer Architecture in Detail and Also Understand Why It Is Not Considered Convolution Free.”

/doc/ai/nn/fully-connected/2021-arora.html

“Transformer Feed-Forward Layers Are Key-Value Memories”, Geva et al 2020

“AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020

“TabTransformer: Tabular Data Modeling Using Contextual Embeddings”, Huang et al 2020

“Scaling down Deep Learning”, Greydanus 2020

“Image Generators With Conditionally-Independent Pixel Synthesis”, Anokhin et al 2020

“D2RL: Deep Dense Architectures in Reinforcement Learning”, Sinha et al 2020

“Fourier Neural Operator for Parametric Partial Differential Equations”, Li et al 2020

“Towards Learning Convolutions from Scratch”, Neyshabur 2020

“Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains”, Tancik et al 2020

“SIREN: Implicit Neural Representations With Periodic Activation Functions”, Sitzmann et al 2020

“Linformer: Self-Attention With Linear Complexity”, Wang et al 2020

“A Map of Object Space in Primate Inferotemporal Cortex”, Bao et al 2020

“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020

“Deep Learning Training in Facebook Data Centers: Design of Scale-Up and Scale-Out Systems”, Naumov et al 2020

“NeRF: Representing Scenes As Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020

“Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020

“ReZero Is All You Need: Fast Convergence at Large Depth”, Bachlechner et al 2020

“Train-By-Reconnect: Decoupling Locations of Weights from Their Values (LaPerm)”, Qiu & Suda 2020

“Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”, Ota et al 2020

“Quasi-Equivalence of Width and Depth of Neural Networks”, Fan et al 2020

“Gesticulator: A Framework for Semantically-Aware Speech-Driven Gesture Generation”, Kucherenko et al 2020

“What’s Hidden in a Randomly Weighted Neural Network?”, Ramanujan et al 2019

“Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Morcos & Tian 2019

“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019

“3D Human Pose Estimation via Human Structure-Aware Fully Connected Network”, Zhang et al 2019d

“Finding the Needle in the Haystack With Convolutions: on the Benefits of Architectural Bias”, d’Ascoli et al 2019

“Generalization Guarantees for Neural Networks via Harnessing the Low-Rank Structure of the Jacobian”, Oymak et al 2019

“MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalizing Flows”, Henter et al 2019

“Fixup Initialization: Residual Learning Without Normalization”, Zhang et al 2019

“SwitchNet: a Neural Network Model for Forward and Inverse Scattering Problems”, Khoo & Ying 2018

“A Jamming Transition from Under-Parameterization to Over-Parameterization Affects Loss Landscape and Generalization”, Spigler et al 2018

“Neural Arithmetic Logic Units”, Trask et al 2018

“The Goldilocks Zone: Towards Better Understanding of Neural Network Loss Landscapes”, Fort & Scherlis 2018

“Scalable Training of Artificial Neural Networks With Adaptive Sparse Connectivity Inspired by Network Science”, Mocanu et al 2018

“Deep Learning Generalizes Because the Parameter-Function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018

“Bidirectional Learning for Robust Neural Networks”, Pontes-Filho & Liwicki 2018

“NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, Ciccone et al 2018

“Large Scale Distributed Neural Network Training through Online Distillation”, Anil et al 2018

“Meta-Learning Update Rules for Unsupervised Representation Learning”, Metz et al 2018

“Learning and Memorization”, Chatterjee 2018

“Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery”, Simm et al 2018

“Improving Palliative Care With Deep Learning”, An et al 2018

“Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks”, Sabatelli 2017 (page 3)

“Neural Collaborative Filtering”, He et al 2017

“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017

“The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?”, Balduzzi et al 2017

“Gender-From-Iris or Gender-From-Mascara?”, Kuehlkamp et al 2017

“Skip Connections Eliminate Singularities”, Orhan & Pitkow 2017

“Deep Information Propagation”, Schoenholz et al 2016

“Topology and Geometry of Half-Rectified Network Optimization”, Freeman & Bruna 2016

“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Keskar et al 2016

“Decoupled Neural Interfaces Using Synthetic Gradients”, Jaderberg et al 2016

“Learning to Optimize”, Li & Malik 2016

“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016

“Network Morphism”, Wei et al 2016

“Adding Gradient Noise Improves Learning for Very Deep Networks”, Neelakantan et al 2015

“How Far Can We Go without Convolution: Improving Fully-Connected Networks”, Lin et al 2015

“BinaryConnect: Training Deep Neural Networks With Binary Weights during Propagations”, Courbariaux et al 2015

“Tensorizing Neural Networks”, Novikov et al 2015

“A Neural Attention Model for Abstractive Sentence Summarization”, Rush et al 2015

“Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015

“In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning”, Neyshabur et al 2014

“The Loss Surfaces of Multilayer Networks”, Choromanska et al 2014

“On the Number of Linear Regions of Deep Neural Networks”, Montúfar et al 2014

“Do Deep Nets Really Need to Be Deep? ”, Ba & Caruana 2013

Do Deep Nets Really Need to be Deep?

“On the Number of Response Regions of Deep Feed Forward Networks With Piece-Wise Linear Activations ”, Pascanu et al 2013

On the number of response regions of deep feed forward networks with piece-wise linear activations

“Network In Network ”, Lin et al 2013

Network In Network

“Deep Big Multilayer Perceptrons for Digit Recognition ”, Cireşan et al 2012

Deep Big Multilayer Perceptrons for Digit Recognition

“Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition ”, Cireşan et al 2010

Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition

“Compositional Pattern Producing Networks: A Novel Abstraction of Development ”, Stanley 2007

Compositional pattern producing networks: A novel abstraction of development

“Extraction De Séquences Numériques Dans Des Documents Manuscrits Quelconques ”, Chatelain 2006

Extraction de séquences numériques dans des documents manuscrits quelconques

“Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis ”, Simard et al 2003

Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis

“NEAT: Evolving Neural Networks through Augmenting Topologies ”, Stanley & Miikkulainen 2002

NEAT: Evolving Neural Networks through Augmenting Topologies

“DARPA and the Quest for Machine Intelligence, 1983–1993 ”, Roland & Shiman 2002

DARPA and the Quest for Machine Intelligence, 1983–1993

“Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra ”, Goodacre et al 1996

Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra

“Statistical Mechanics of Generalization ”, Opper & Kinzel 1996

Statistical Mechanics of Generalization

“On the Ability of the Optimal Perceptron to Generalize ”, Opper et al 1990

On the ability of the optimal perceptron to generalize

“Learning To Tell Two Spirals Apart ”, Lang & Witbrock 1988

Learning To Tell Two Spirals Apart

“Learning Internal Representations by Error Propagation ”, Rumelhart et al 1986

Learning Internal Representations by Error Propagation

“Neural Networks and Physical Systems With Emergent Collective Computational Abilities ”, Hopfield 1982

Neural networks and physical systems with emergent collective computational abilities

Sort By Magic

Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

`efficient-mixing`

`initialization-methods`

Wikipedia (1)

Hebbian learning:

https://en.wikipedia.org/wiki/Hebbian_learning

Miscellaneous

Bibliography

https://arxiv.org/abs/2503.24187: “NeuRaLaTeX: A Machine Learning Library Written in Pure LaTeX ”, James A. D. Gardner, Will Rowan, William A. P. Smith

https://arxiv.org/abs/2501.03992: “NeuralSVG: An Implicit Representation for Text-To-Vector Generation ”, Sagi Polaczek, Yuval Alaluf, Elad Richardson, Yael Vinker, Daniel Cohen-Or

https://www.lesswrong.com/posts/LncYobrn3vRr7qkZW/the-slingshot-helps-with-learning: “The Slingshot Helps With Learning ”, Wilson Wu

https://arxiv.org/abs/2406.15786: “What Matters in Transformers? Not All Attention Is Needed ”, Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

https://arxiv.org/abs/2406.13131: “When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models ”, Ting-Yun Chang, Jesse Thomason, Robin Jia

https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models ”, Siyan Zhao, Tung Nguyen, Aditya Grover

https://arxiv.org/abs/2405.20233: “Grokfast: Accelerated Grokking by Amplifying Slow Gradients ”, Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee

https://arxiv.org/abs/2310.13061: “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets ”, Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov

https://arxiv.org/abs/2310.08708: “Polynomial Time Cryptanalytic Extraction of Neural Network Models ”, Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodrigez-Henriquez, Nitin Satpute

https://arxiv.org/abs/2306.13575: “Scaling MLPs: A Tale of Inductive Bias ”, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann

https://arxiv.org/abs/2303.13506: “The Quantization Model of Neural Scaling ”, Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark

https://arxiv.org/abs/2303.06053#google: “TSMixer: An All-MLP Architecture for Time Series Forecasting ”, Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, Tomas Pfister

2023-bures.pdf: “Organic Reaction Mechanism Classification Using Machine Learning ”, Jordi Burés, Igor Larrosa

https://www.nature.com/articles/s41467-022-35422-y: “Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning ”, Itai Levin, Mengjie Liu, Christopher A. Voigt, Connor W. Coley

https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers ”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah Smith, Roy Schwartz

https://arxiv.org/abs/2210.06313#google: “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers ”, Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

https://arxiv.org/abs/2210.03310#google: “Scaling Forward Gradient With Local Losses ”, Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton

https://arxiv.org/abs/2210.01117: “Omnigrok: Grokking Beyond Algorithmic Data ”, Ziming Liu, Eric J. Michaud, Max Tegmark

https://arxiv.org/abs/2209.12892: “g.pt: Learning to Learn With Generative Models of Neural Network Checkpoints ”, William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, Jitendra Malik

https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling? ”, Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler

https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt ”, Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

https://arxiv.org/abs/2206.05852: “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths ”, Ruslan Khalitov, Tong Yu, Lei Cheng, Zhirong Yang

https://arxiv.org/abs/2205.12399#google: “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT ”, James Lee-Thorp, Joshua Ainslie

https://arxiv.org/abs/2205.10343: “Towards Understanding Grokking: An Effective Theory of Representation Learning ”, Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams

https://arxiv.org/abs/2204.10670: “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention ”, Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang

https://arxiv.org/abs/2203.06850: “Efficient Language Modeling With Sparse All-MLP ”, Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

https://arxiv.org/abs/2203.03691: “HyperMixer: An MLP-Based Low Cost Alternative to Transformers ”, Florian Mai, Arnaud Pannatier, Fabio Fehr, Haolin Chen, Francois Marelli, Francois Fleuret, James Henderson

https://arxiv.org/abs/2202.06510#microsoft: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs ”, Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou

https://arxiv.org/abs/2201.10801: “When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT) ”, Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng

https://arxiv.org/abs/2201.09792: “ConvMixer: Patches Are All You Need? ”, Asher Trockman, J. Zico Kolter

https://arxiv.org/abs/2111.11418: “MetaFormer Is Actually What You Need for Vision ”, Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan

https://arxiv.org/abs/2110.11526#deepmind: “Wide Neural Networks Forget Less Catastrophically ”, Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar

https://arxiv.org/abs/2110.02095#google: “Exploring the Limits of Large Scale Pre-Training ”, Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi

https://arxiv.org/abs/2109.05422: “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? ”, Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng

https://arxiv.org/abs/2109.04454: “ConvMLP: Hierarchical Convolutional MLPs for Vision ”, Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi

https://arxiv.org/abs/2108.13002#microsoft: “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP ”, Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

https://arxiv.org/abs/2108.13341#huawei: “Hire-MLP: Vision MLP via Hierarchical Rearrangement ”, Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang

https://arxiv.org/abs/2108.04384: “RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality? ”, Yuki Tatsunami, Masato Taki

https://arxiv.org/abs/2108.01072#baidu: “S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision ”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li

https://arxiv.org/abs/2107.10224: “CycleMLP: A MLP-Like Architecture for Dense Prediction ”, Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo

https://arxiv.org/abs/2107.08391: “AS-MLP: An Axial Shifted MLP Architecture for Vision ”, Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao

https://arxiv.org/abs/2106.12368: “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition ”, Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng

https://arxiv.org/abs/2106.12372#nvidia: “Real-Time Neural Radiance Caching for Path Tracing ”, Thomas Müller, Fabrice Rousselle, Jan Novák, Alexander Keller

https://arxiv.org/abs/2106.07477#baidu: “S²-MLP: Spatial-Shift MLP Architecture for Vision ”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li

https://arxiv.org/abs/2106.01548: “When Vision Transformers Outperform ResNets without Pre-Training or Strong Data Augmentations ”, Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

https://arxiv.org/abs/2106.01401: “Container: Context Aggregation Network ”, Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

https://arxiv.org/abs/2105.08050#google: “Pay Attention to MLPs ”, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le

https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms ”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon

https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet ”, Luke Melas-Kyriazi

https://arxiv.org/abs/2105.01883: “RepMLP: Re-Parameterizing Convolutions into Fully-Connected Layers for Image Recognition ”, Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding

https://arxiv.org/abs/2105.01601#google: “MLP-Mixer: An All-MLP Architecture for Vision ”, Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy

2021-power.pdf#openai: “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets ”, Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra

abstract: “Fully-Connected Neural Nets ”, Gwern

https://arxiv.org/abs/2103.14030: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows ”, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

https://arxiv.org/abs/2011.13775: “Image Generators With Conditionally-Independent Pixel Synthesis ”, Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov

https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models ”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

https://arxiv.org/abs/2003.01629: “Can Increasing Input Dimensionality Improve Deep Reinforcement Learning? ”, Kei Ota, Tomoaki Oiki, Devesh K. Jha, Toshisada Mariyama, Daniel Nikovski

https://arxiv.org/abs/1911.13299: “What’s Hidden in a Randomly Weighted Neural Network? ”, Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari

https://arxiv.org/abs/1804.00222#google: “Meta-Learning Update Rules for Unsupervised Representation Learning ”, Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein

2017-sabatelli.pdf#page=3: “Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks ”, Matthia Sabatelli

https://arxiv.org/abs/1402.1869: “On the Number of Linear Regions of Deep Neural Networks ”, Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio
