Introduction
This is a filtered curation of papers I have found over the years that either inspire me, blow me away (revolutionize the way I think), or are essential reads for someone in the field. I have not personally read some of these papers in detail.
I have also included a section on learning resources at the bottom of the post, covering a wide variety of techniques, theory, and the underlying math.
There is an emphasis on vision- and language-based papers, with a bias toward vision. There is an additional bias toward modern techniques and deep learning. This list does not cover fundamental algorithms, theory, and techniques essential to the field of Machine Learning, such as generalization theory, probability, and (convex) optimization.
Papers
- Meta-Learning and Meta-analysis of techniques in the field
- Architecture
- AlexNet (the "Deep Learning breakthrough" paper)
- Non-local Neural Networks: the "CV attention paper"
- Attention Is All You Need: the "NLP attention paper"
- ResNet: scalable architecture via skip connections (just keep adding more layers)
- ViT: applying transformers to vision
- ConvNeXt: "the response to ViT"
- MViT
- TimeSFormer
- Pay Attention to MLPs
- SSL/WSL & Feature-Representation Learning
- MAE & modality extensions
- DiffMAE
- Omnivore and OmniMAE
- ImageBind
- MAWS: billion-parameter ViTs pre-trained on billions of images
- MAWS = MAE + WSP (weakly-supervised pre-training)
- Authors produce a CLIP-variant: "MAWS CLIP"
- Impressive performance on video action recognition despite being an image-based model; top-1 accuracy: 86% on K400, 74.4% on SSv2
- IMO: under-rated (only 58 stars, really?)
- DINO
- CutLER, VideoCutLER
- V-JEPA
- InternVideo2
- Cookbook of Self-Supervised Learning
- See also: "Vision and Language"
- Generative Models
- (Neural) Compression
- 3D
- Pre-read: SfM (Structure from Motion)
- NeRF
- Gaussian Splatting (a "real-time NeRF")
- Downstream Image Tasks (classification, object detection, tracking, segmentation, etc.)
- Segment Anything (SAM)
- XMem
- TrackAnything
- ViTPose: a higher-resolution model performs better
- Object Detection in 20 Years: A Survey
- Vision & Language
- CLIP
- Mind-blowing zero-shot classification capabilities
- The model that initially enabled DALL-E & Stable Diffusion.
- This pre-training method improves the robustness of learned features (w.r.t. classification accuracy on downstream tasks)
- Extensions: SigLIP
- MM1: Apple's extension to LLaVA with a good number of experiments/ablations
- LLaVA
- LLMs
- Audio
- Engineering
- Public Datasets
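As an aside on the CLIP entry above: the "mind-blowing" zero-shot classification boils down to cosine similarity between an image embedding and text embeddings of class prompts ("a photo of a {class}"), followed by a softmax. Here is a minimal numpy sketch of that scoring step; the embeddings below are random stand-ins (no actual CLIP encoders are loaded), and the dimension, class count, and temperature value are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image against class-prompt embeddings, CLIP-style.

    image_emb: (d,) image embedding
    text_embs: (num_classes, d) embeddings of prompts like "a photo of a {class}"
    Returns a probability distribution over the classes.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scale similarities by a temperature, then softmax over classes.
    logits = text_embs @ image_emb / temperature
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Stand-in embeddings; in practice these come from CLIP's image/text encoders.
rng = np.random.default_rng(0)
d, num_classes = 512, 3
image_emb = rng.normal(size=d)
text_embs = rng.normal(size=(num_classes, d))
probs = zero_shot_classify(image_emb, text_embs)
print(probs)  # a distribution over the 3 classes, summing to 1
```

Because the class set is just a list of prompts, you can swap in new labels at inference time without any retraining, which is what makes the zero-shot framing so flexible.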