Awesome Computer Vision

A curated path through computer vision: courses and books to learn it, frameworks to build it, datasets to train on, and the canonical papers per sub-field. Opinionated and kept tight: the links worth your time, not every link that exists.

Courses & learning

Resource	What	Link
CS231n (Stanford)	The classic CNNs-for-visual-recognition course. Notes alone are worth it.	site
First Principles of CV (Shree Nayar)	Beautiful from-scratch lecture series on imaging, optics, and classical CV.	site
fast.ai — Practical DL	Top-down, code-first. Fastest route to building working vision models.	course
Deep Learning for CV (Justin Johnson, UMich)	Modern CS231n successor with full video lectures.	site
PyImageSearch	Practical OpenCV + DL tutorials for real-world tasks.	site

Books

Resource	What	Link
Szeliski — Computer Vision: Algorithms and Applications	The reference text, free PDF. Classical + modern.	pdf
Hartley & Zisserman — Multiple View Geometry	The bible for geometry, calibration, SfM.	site
Goodfellow et al. — Deep Learning	Foundational DL theory, free online.	site
Prince — Understanding Deep Learning	Modern, visual, free PDF. Excellent for intuition.	pdf

Frameworks & libraries

Resource	What	Link
PyTorch	Default research + production DL framework.	site
OpenCV	Classical CV workhorse — I/O, transforms, features, calibration.	site
timm (HF)	Hundreds of pretrained image backbones, one API. Indispensable.	repo
torchvision	Datasets, transforms, and reference detection/segmentation models.	docs
Detectron2 (Meta)	Production-grade detection/segmentation framework.	repo
MMCV / MMDetection (OpenMMLab)	Huge modular toolbox with most major detectors and segmenters reimplemented.	repo
Ultralytics YOLO	Dead-simple SOTA detection/segmentation/pose. Great for shipping fast.	repo
Kornia	Differentiable CV ops in PyTorch — augmentation, geometry, filters.	site
Albumentations	Fast, flexible image augmentation.	site

Datasets & benchmarks

Resource	What	Link
ImageNet	The classification benchmark that launched the deep era.	site
COCO	Detection, segmentation, keypoints, captions — the detection standard.	site
Open Images	~9M images with labels, boxes, segmentation, relations.	site
Cityscapes / KITTI / nuScenes	Autonomous-driving segmentation + 3D perception benchmarks.	site
LAION	Billion-scale image-text pairs used to train OpenCLIP and Stable Diffusion.	site
Papers With Code — CV	Leaderboards + code for every task. Start here to find SOTA.	site

Backbones & classification

Paper	Why it matters	Link
AlexNet (2012)	Started the deep-learning vision revolution on ImageNet.	paper
ResNet (2015)	Residual connections — trains networks 100s of layers deep.	arXiv
EfficientNet (2019)	Compound scaling of depth/width/resolution.	arXiv
ViT (2020)	Transformers beat CNNs at scale — images as patch sequences.	arXiv
Swin Transformer (2021)	Hierarchical windowed attention — a general vision backbone.	arXiv
ConvNeXt (2022)	Modernized CNN matching transformers — CNNs aren't dead.	arXiv
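
The compound scaling in the EfficientNet row comes down to a little arithmetic. The sketch below assumes the base coefficients reported in the paper (1.2 for depth, 1.1 for width, 1.15 for resolution). The function names are illustrative and not from any library, and the released B1 to B7 models use rounded, hand-tuned values rather than these exact multipliers.

```python
# Base coefficients reported in the EfficientNet paper (grid-searched on the
# B0 baseline). Function names are illustrative, not a library API.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi: float) -> tuple[float, float, float]:
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi: float) -> float:
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Because ALPHA * BETA**2 * GAMMA**2 is about 1.92, each unit increase in phi roughly doubles the compute budget.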

Object detection

Paper	Why it matters	Link
Faster R-CNN (2015)	Region Proposal Network — the two-stage detection standard.	arXiv
YOLO (2015)	Single-shot real-time detection. Spawned a whole family.	arXiv
SSD (2016)	Multi-scale single-shot detector.	arXiv
RetinaNet / Focal Loss (2017)	Focal loss down-weights easy examples to handle foreground-background imbalance in one-stage detectors.	arXiv
DETR (2020)	End-to-end detection with transformers — no NMS, no anchors.	arXiv
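
The focal loss from the RetinaNet row is short enough to write out. This is a minimal single-prediction sketch in plain Python, assuming the binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) with the paper's commonly used defaults gamma = 2 and alpha = 0.25. The function name and the clamping constant are illustrative choices, not taken from any library.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one prediction.

    p is the predicted foreground probability, y is the label (1 or 0).
    """
    eps = 1e-12                                 # illustrative clamp to keep log() finite
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy background example contributes far less than a hard foreground one.
    print("easy background:", focal_loss(0.05, 0))
    print("hard foreground:", focal_loss(0.05, 1))
```

Setting gamma to 0 recovers alpha-weighted cross-entropy, which is a quick way to sanity-check an implementation.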

Segmentation

Paper	Why it matters	Link
U-Net (2015)	Encoder-decoder with skips — still the medical/dense default.	arXiv
Mask R-CNN (2017)	Instance segmentation by adding a mask head to Faster R-CNN.	arXiv
DeepLabv3+ (2018)	Atrous convolutions + ASPP for semantic segmentation.	arXiv
Segment Anything (SAM) (2023)	Promptable foundation model — segment anything zero-shot.	arXiv
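
For the atrous convolutions in the DeepLabv3+ row, the useful number is how far a dilated kernel reaches. The sketch below is a pure-Python illustration with hypothetical helper names: a k-tap kernel at rate r spans k + (k - 1)(r - 1) input positions while keeping the same number of weights. The rates 6, 12, 18 in the demo are the ASPP rates commonly cited for DeepLab; treat them as an assumption here.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Input span covered by a kernel with `rate - 1` gaps between taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, weights, rate):
    """1D atrous convolution without padding (cross-correlation, as in DL frameworks)."""
    span = effective_kernel_size(len(weights), rate)
    return [
        sum(w * signal[start + i * rate] for i, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


if __name__ == "__main__":
    for rate in (1, 6, 12, 18):
        print(f"3-tap kernel, rate {rate}: spans {effective_kernel_size(3, rate)} inputs")
    print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], rate=2))
```

The same idea extends to 2D by applying the rate along both spatial axes.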

Generative & diffusion

Paper	Why it matters	Link
GAN (2014)	The adversarial framework that defined a generative era.	arXiv
StyleGAN (2018)	Style-based generator — photorealistic, controllable faces.	arXiv
DDPM (2020)	Denoising diffusion — the basis of modern image generation.	arXiv
Latent Diffusion (Stable Diffusion) (2021)	Diffusion in latent space — made it cheap and open.	arXiv
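
The denoising diffusion in the DDPM row starts from a forward process that can be sampled in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise. The sketch below assumes the linear beta schedule from the paper (1e-4 to 0.02 over 1000 steps) and uses zero-based timesteps. Function names are illustrative, and a flat list of floats stands in for an image.

```python
import math
import random


def linear_beta_schedule(timesteps: int = 1000, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> list[float]:
    """Evenly spaced noise variances, one per timestep."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas: list[float]) -> list[float]:
    """abar_t = product of (1 - beta_s) for s up to and including t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0: list[float], t: int, alpha_bars: list[float],
             rng: random.Random) -> list[float]:
    """Draw x_t from q(x_t | x_0) in one step (t is a zero-based index)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule())
    rng = random.Random(0)
    for t in (0, 499, 999):
        print(t, round(alpha_bars[t], 6), q_sample([1.0, -1.0], t, alpha_bars, rng))
```

By the last step almost no signal remains, which is why sampling can start from pure Gaussian noise.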

Vision-language, foundation models & 3D

Paper	Why it matters	Link
CLIP (2021)	Contrastive image-text pretraining that enables zero-shot classification and retrieval.	arXiv
DINOv2 (2023)	Self-supervised features, learned without labels, that transfer without fine-tuning.	arXiv
LLaVA (2023)	Open visual instruction tuning — multimodal chat.	arXiv
NeRF (2020)	Neural radiance fields — novel-view synthesis from images.	arXiv
3D Gaussian Splatting (2023)	Real-time radiance fields — dethroned NeRF for speed.	arXiv
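
Zero-shot classification, as in the CLIP row, is cosine similarity between one image embedding and one text embedding per candidate label, followed by a softmax. The sketch below uses made-up 3-d vectors in place of real encoder outputs. The function names are hypothetical, and the logit scale of 100 is an assumption in line with values commonly seen in released checkpoints.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Label probabilities from scaled cosine similarity (logit_scale is assumed)."""
    image = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embs
    ]
    return softmax(logits)


if __name__ == "__main__":
    # Toy embeddings standing in for real image and text encoder outputs.
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    probs = zero_shot_probs([0.8, 0.3, 0.1], text_embs)
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.4f}")
```

With real models, the text embeddings come from encoding one prompt per class and the image embedding from the image encoder; only those two steps change.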

Tools & annotation

Resource	What	Link
CVAT	Powerful open-source annotation for boxes/masks/keypoints.	repo
Label Studio	Multi-type labeling (image/text/audio) with ML-assist.	site
FiftyOne	Dataset curation + model error analysis. Underrated.	site
Roboflow	End-to-end dataset management + augmentation + deploy.	site

Where to start

New to CV? Do fast.ai or CS231n, build with timm + torchvision, ship a detector with Ultralytics YOLO or Detectron2, and reach for SAM/CLIP/DINOv2 when you need zero-shot or foundation features. Read ResNet → ViT → DETR → SAM → CLIP in that order for the modern arc.
