Introduction
This is a filtered curation of papers I have found over the years that either inspire me, blow me away (revolutionize the way I think), or are essential reads for someone in the field. I have not personally read some of these papers in detail.
I have also included a section on learning resources at the bottom of the post, covering a wide variety of techniques, theory, and the underlying math.
There is an emphasis on vision- and language-based papers, with a bias toward vision. There is an additional bias toward modern techniques and deep learning. This list does not cover fundamental algorithms, theory, and techniques essential to the field of Machine Learning, such as generalization theory, probability, and (convex) optimization.
Papers
- Meta-Learning and Meta-analysis of techniques in the field
- Architecture
- AlexNet (the "Deep Learning breakthrough" paper)
- Non-local Neural Networks: the "CV attention paper"
- Attention Is All You Need: the "NLP attention paper"
- ResNet: scalable architecture via skip connections (just keep adding more layers)
- ViT: applying transformers to vision
- ConvNeXt: "the response to ViT"
- MViT
- TimeSFormer
- Pay Attention to MLPs
- SSL/WSL & Feature-Representation Learning
- MAE & modality extensions
- DiffMAE
- Omnivore and OmniMAE
- ImageBind
- MAWS: billion-parameter ViTs pre-trained on billions of images
- MAWS = MAE + WSP (weakly-supervised pre-training)
- Authors produce a CLIP-variant: "MAWS CLIP"
- Impressive performance on video action recognition despite being an image-based model; top-1 accuracy: 86% on K400, 74.4% on SSv2
- IMO: under-rated (only 58 stars, really?)
- DINO
- CutLER, VideoCutLER
- V-JEPA
- InternVideo2
- Cookbook of Self-Supervised Learning
- See also: "Vision and Language"
- Generative Models
- (Neural) Compression
- 3D
- Pre-read: SfM (Structure from Motion)
- NeRF
- Gaussian Splatting (a "real-time NeRF")
- Downstream Image Tasks (classification, object detection, tracking, segmentation, etc.)
- Segment Anything (SAM)
- XMem
- TrackAnything
- ViTPose: a higher-resolution model performs better
- Object Detection in 20 Years: A Survey
- Vision & Language
- CLIP
- Mind-blowing zero-shot classification capabilities
- The model that initially enabled DALL-E & Stable Diffusion.
- This pre-training method improves the robustness of learned features (w.r.t. classification accuracy on downstream tasks)
- Extensions: SigLIP
- MM1: Apple's extension to LLaVA with a good number of experiments/ablations
- LLaVA
- LLMs
- Audio
- Engineering
- Public Datasets
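As an aside on the CLIP entry above: the "mind-blowing" zero-shot classification boils down to cosine similarity between an image embedding and text embeddings of class prompts ("a photo of a {class}"), followed by a softmax. Here is a minimal numpy sketch of that scoring step; the embeddings below are random stand-ins (no actual CLIP encoders are loaded), and the dimension, class count, and temperature value are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image against class-prompt embeddings, CLIP-style.

    image_emb: (d,) image embedding
    text_embs: (num_classes, d) embeddings of prompts like "a photo of a {class}"
    Returns a probability distribution over the classes.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scale similarities by a temperature, then softmax over classes.
    logits = text_embs @ image_emb / temperature
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Stand-in embeddings; in practice these come from CLIP's image/text encoders.
rng = np.random.default_rng(0)
d, num_classes = 512, 3
image_emb = rng.normal(size=d)
text_embs = rng.normal(size=(num_classes, d))
probs = zero_shot_classify(image_emb, text_embs)
print(probs)  # a distribution over the 3 classes, summing to 1
```

Because the class set is just a list of prompts, you can swap in new labels at inference time without any retraining, which is what makes the zero-shot framing so flexible.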