A curated path through computer vision — courses and books to learn it, frameworks to build it,
datasets to train on, and the canonical papers per sub-field. Opinionated and kept tight: the
links worth your time, not every link that exists.
Courses & learning
| Resource | What | Link |
| CS231n (Stanford) | The classic CNNs-for-visual-recognition course. Notes alone are worth it. | site |
| First Principles of CV (Shree Nayar) | Beautiful from-scratch lecture series on imaging, optics, and classical CV. | site |
| fast.ai — Practical DL | Top-down, code-first. Fastest route to building working vision models. | course |
| Deep Learning for CV (Justin Johnson, UMich) | Modern CS231n successor with full video lectures. | site |
| PyImageSearch | Practical OpenCV + DL tutorials for real-world tasks. | site |
Books
| Resource | What | Link |
| Szeliski — Computer Vision: Algorithms and Applications | The reference text, free PDF. Classical + modern. | pdf |
| Hartley & Zisserman — Multiple View Geometry | The bible for geometry, calibration, SfM. | site |
| Goodfellow et al. — Deep Learning | Foundational DL theory, free online. | site |
| Prince — Understanding Deep Learning | Modern, visual, free PDF. Excellent for intuition. | pdf |
Frameworks & libraries
| Resource | What | Link |
| PyTorch | Default research + production DL framework. | site |
| OpenCV | Classical CV workhorse — I/O, transforms, features, calibration. | site |
| timm (HF) | Hundreds of pretrained image backbones, one API. Indispensable. | repo |
| torchvision | Datasets, transforms, and reference detection/segmentation models. | docs |
| Detectron2 (Meta) | Production-grade detection/segmentation framework. | repo |
| MMCV / MMDetection (OpenMMLab) | Huge modular toolbox with reimplementations of most major detectors and segmenters. | repo |
| Ultralytics YOLO | Dead-simple SOTA detection/segmentation/pose. Great for shipping fast. | repo |
| Kornia | Differentiable CV ops in PyTorch — augmentation, geometry, filters. | site |
| Albumentations | Fast, flexible image augmentation. | site |
Datasets & benchmarks
| Resource | What | Link |
| ImageNet | The classification benchmark that launched the deep era. | site |
| COCO | Detection, segmentation, keypoints, captions — the detection standard. | site |
| Open Images | ~9M images with labels, boxes, segmentation, relations. | site |
| Cityscapes / KITTI / nuScenes | Autonomous-driving segmentation + 3D perception benchmarks. | site |
| LAION | Billion-scale image-text pairs behind open CLIP reproductions and Stable Diffusion training. | site |
| Papers With Code — CV | Leaderboards + code for every task. Start here to find SOTA. | site |
Backbones & classification
| Paper | Why it matters | Link |
| AlexNet (2012) | Started the deep-learning vision revolution on ImageNet. | paper |
| ResNet (2015) | Residual connections — trains networks 100s of layers deep. | arXiv |
| EfficientNet (2019) | Compound scaling of depth/width/resolution. | arXiv |
| ViT (2020) | Transformers beat CNNs at scale — images as patch sequences. | arXiv |
| Swin Transformer (2021) | Hierarchical windowed attention — a general vision backbone. | arXiv |
| ConvNeXt (2022) | Modernized CNN matching transformers — CNNs aren't dead. | arXiv |
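The ViT row describes images as patch sequences. A minimal sketch of that first step, assuming a single-channel image stored as nested lists; `patchify` is a hypothetical helper name, and the learned linear projection and position embeddings that a real ViT applies to each patch are left out.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    # Walk the grid tile by tile, left to right, top to bottom.
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one tile row by row into a single vector ("token").
            tile = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            tokens.append(tile)
    return tokens


# Toy 4x4 image with pixel values 0..15 (illustrative data only).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])
```

With the paper's 16x16 patches, a 224x224 input becomes a sequence of 196 tokens, which is what the transformer encoder then attends over.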
Object detection
| Paper | Why it matters | Link |
| Faster R-CNN (2015) | Region Proposal Network — the two-stage detection standard. | arXiv |
| YOLO (2015) | Single-shot real-time detection. Spawned a whole family. | arXiv |
| SSD (2016) | Multi-scale single-shot detector. | arXiv |
| RetinaNet / Focal Loss (2017) | Focal loss tames the foreground-background imbalance that held back one-stage detectors. | arXiv |
| DETR (2020) | End-to-end detection with transformers — no NMS, no anchors. | arXiv |
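The focal loss from the RetinaNet row is short enough to write out. A minimal sketch for a single binary prediction, assuming the defaults the paper reports as working best (gamma = 2, alpha = 0.25); the function names are illustrative, not taken from any library.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross-entropy for one prediction (p = P(positive), y in {0, 1})."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # well-classified positive: loss is strongly down-weighted
hard = focal_loss(0.10, 1)  # badly misclassified positive: loss stays large
print(easy, hard)
```

The modulating factor shrinks the loss of easy examples, so the many trivial background anchors stop dominating the gradient. Setting gamma to 0 recovers alpha-weighted cross-entropy.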
Segmentation
| Paper | Why it matters | Link |
| U-Net (2015) | Encoder-decoder with skips — still the medical/dense default. | arXiv |
| Mask R-CNN (2017) | Instance segmentation by adding a mask head to Faster R-CNN. | arXiv |
| DeepLabv3+ (2018) | Atrous convolutions + ASPP for semantic segmentation. | arXiv |
| Segment Anything (SAM) (2023) | Promptable foundation model — segment anything zero-shot. | arXiv |
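Atrous (dilated) convolution, named in the DeepLabv3+ row, spaces the kernel taps apart so the same number of weights covers a wider receptive field. A minimal 1-D sketch under that definition; DeepLab applies the idea in 2-D, and `atrous_conv1d` is a hypothetical helper written for illustration.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D cross-correlation whose kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # receptive field of a single output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(k * signal[start + i * rate]
                       for i, k in enumerate(kernel)))
    return out


signal = list(range(10))   # toy ramp 0..9
kernel = [1, 0, -1]        # simple difference filter
dense = atrous_conv1d(signal, kernel, rate=1)    # receptive field of 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # receptive field of 5 samples
print(dense, dilated)
```

ASPP runs several such convolutions with different rates in parallel on the same feature map and concatenates the results, which gives multi-scale context without downsampling.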
Generative & diffusion
| Paper | Why it matters | Link |
| GAN (2014) | The adversarial framework that defined a generative era. | arXiv |
| StyleGAN (2018) | Style-based generator — photorealistic, controllable faces. | arXiv |
| DDPM (2020) | Denoising diffusion — the basis of modern image generation. | arXiv |
| Latent Diffusion (Stable Diffusion) (2021) | Diffusion in latent space — made it cheap and open. | arXiv |
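The "denoising diffusion" in the DDPM row starts from a fixed forward process that gradually turns data into Gaussian noise; the network is trained to undo it. A minimal sketch of that forward process, assuming the paper's linear schedule (1000 steps, beta from 1e-4 to 0.02) and a toy 3-value "image"; the function names are illustrative.

```python
import math
import random


def linear_betas(steps=1000, start=1e-4, end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one shot."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


abar = alpha_bars(linear_betas())
rng = random.Random(0)          # seeded only to make the sketch repeatable
x0 = [1.0, -1.0, 0.5]           # toy data, not a real image
x_early = q_sample(x0, 0, abar, rng)    # almost unchanged
x_late = q_sample(x0, 999, abar, rng)   # almost pure noise
print(abar[0], abar[-1])
```

Because `abar` falls from nearly 1 to nearly 0, early steps keep most of the signal and the last step is close to a standard Gaussian. Latent Diffusion runs the same process on autoencoder latents instead of pixels.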
Vision-language & foundation models
| Paper | Why it matters | Link |
| CLIP (2021) | Contrastive image-text pretraining — zero-shot everything. | arXiv |
| DINOv2 (2023) | Self-supervised features that transfer across tasks without fine-tuning. | arXiv |
| LLaVA (2023) | Open visual instruction tuning — multimodal chat. | arXiv |
3D & neural rendering
| Paper | Why it matters | Link |
| NeRF (2020) | Neural radiance fields — novel-view synthesis from images. | arXiv |
| 3D Gaussian Splatting (2023) | Real-time radiance fields — dethroned NeRF for speed. | arXiv |
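A radiance field, as in the NeRF row, becomes a pixel by compositing density and color samples along a camera ray. A minimal sketch of that volume-rendering sum for one ray, assuming the densities and colors have already been predicted; `composite` is a hypothetical helper, and ray sampling and the network itself are omitted.

```python
import math


def composite(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: volume density sigma_i at each sample
    colors:    RGB triple at each sample
    deltas:    distance between adjacent samples
    """
    transmittance = 1.0            # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha             # light surviving past it
    return pixel, transmittance


# Toy ray: empty space first, then a dense red surface (made-up values).
pixel, remaining = composite(
    densities=[0.0, 1000.0],
    colors=[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    deltas=[0.1, 0.1],
)
print(pixel, remaining)
```

Every operation here is differentiable, which is what lets the field be fitted from posed photographs alone.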
Annotation & data tooling
| Resource | What | Link |
| CVAT | Powerful open-source annotation for boxes/masks/keypoints. | repo |
| Label Studio | Multi-type labeling (image/text/audio) with ML-assist. | site |
| FiftyOne | Dataset curation + model error analysis. Underrated. | site |
| Roboflow | End-to-end dataset management + augmentation + deploy. | site |
Where to start
New to CV? Do fast.ai or CS231n, build with timm + torchvision, ship a detector with Ultralytics
YOLO or Detectron2, and reach for SAM/CLIP/DINOv2 when you need zero-shot or foundation features.
Read ResNet → ViT → DETR → SAM → CLIP in that order for the modern arc.
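The zero-shot use of CLIP mentioned above comes down to comparing one image embedding with the embeddings of a few text prompts. A minimal sketch of that scoring step, assuming the embeddings already exist; the 3-dimensional vectors below are made up for illustration, `zero_shot` is a hypothetical helper, and the temperature of 0.01 is an assumption corresponding to a logit scale of 100.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between an image and each label prompt."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs.values()]
    peak = max(logits)                           # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embs, exps)}


# Made-up embeddings; a real pipeline would get these from the two CLIP encoders.
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]
probs = zero_shot(image_emb, text_embs)
best = max(probs, key=probs.get)
print(best)
```

Changing the label set only means writing new prompts; no retraining is involved, which is why CLIP is a sensible first thing to try before collecting labels.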