CMU 11-868: Large Language Model Systems
Course Overview
- University: Carnegie Mellon University
- Prerequisites: Strongly recommended to have taken Deep Learning (11-785) or Advanced NLP (11-611 or 11-711)
- Programming Language: Python
- Course Difficulty: 🌟🌟🌟🌟
- Estimated Workload: 120 hours
This graduate-level course focuses on the full stack of large language model (LLM) systems, from algorithms to engineering. The curriculum covers, among other topics:
- GPU Programming and Automatic Differentiation: Master CUDA kernel programming, fundamentals of parallel programming, and deep learning framework design (a minimal autodiff sketch follows this list).
- Model Training and Distributed Systems: Learn efficient training algorithms, memory and communication optimizations such as ZeRO, IO-aware attention kernels such as FlashAttention, and distributed training approaches like DDP, GPipe, and Megatron-LM.
- Model Compression and Acceleration: Study quantization (GPTQ), sparsity via mixture-of-experts (MoE), compiler stacks (JAX, Triton), and inference serving systems (vLLM, CacheGen).
- Cutting-Edge Topics and Systems Practice: Includes retrieval-augmented generation (RAG), multimodal LLMs, RLHF systems, and end-to-end deployment, monitoring, and maintenance.
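To make the autodiff topic concrete, here is a minimal reverse-mode automatic differentiation sketch in plain Python. The `Value` class is hypothetical and far simpler than the miniTorch framework used in the course; it only shows the core idea of recording a compute graph and propagating gradients in reverse topological order:

```python
# Minimal scalar reverse-mode autodiff sketch (hypothetical `Value` class,
# not the course's miniTorch API). Each node stores its local backward rule.
class Value:
    def __init__(self, data, parents=(), backward_fn=lambda grad: ()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # upstream nodes in the compute graph
        self.backward_fn = backward_fn  # maps incoming grad -> parent grads

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     lambda g: (g, g))

    def backward(self):
        # Topologically order the graph, then propagate gradients in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, g in zip(v.parents, v.backward_fn(v.grad)):
                p.grad += g

x, y = Value(3.0), Value(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

Assignment 1 builds this idea out into a full tensor-level autograd framework.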
Compared with similar courses, this one stands out in three ways: tight integration with recent papers and open-source implementations (hands-on work extending CUDA support in the miniTorch framework), a project-driven assignment structure (five programming assignments plus a final project), and guest lectures from industry experts that give students real-world insight into LLM engineering challenges and solutions.
Self-Study Tips:
- Set up a CUDA-compatible environment in advance (NVIDIA GPU + CUDA Toolkit + PyTorch); a quick sanity check follows these tips.
- Review fundamentals of parallel computing and deep learning (autograd, tensor operations).
- Carefully read the assigned papers and slides before each lecture, and follow the assignments to extend the miniTorch framework from pure Python to real CUDA kernels.
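Before diving in, it is worth confirming that PyTorch actually sees your GPU. A minimal sanity check, assuming PyTorch is already installed:

```python
# Quick sanity check that PyTorch can see the GPU and run a CUDA op.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA (build) version:", torch.version.cuda)
    # Run one matmul on the GPU to confirm the runtime actually works.
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    torch.cuda.synchronize()
    print("Matmul OK, result norm:", (a @ b).norm().item())
```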
This course assumes a solid understanding of deep learning and is not suitable for complete beginners. See the FAQ for more on prerequisites.
The assignments are fairly challenging and include:
- Assignment 1: Implement an autograd framework + custom CUDA ops + basic neural networks
- Assignment 2: Build a GPT-2 model from scratch
- Assignment 3: Accelerate training with custom CUDA kernels for Softmax and LayerNorm (a Triton sketch of a fused softmax follows this list)
- Assignment 4: Implement distributed model training (difficult to set up on your own as a self-learner)
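For a taste of what Assignment 3 involves, here is a fused row-wise softmax kernel written in Triton (a Python GPU DSL the course also covers). This is an illustrative sketch, not the assignment's required CUDA solution; it uses the standard numerically stable max-subtraction formulation:

```python
# Row-wise softmax as a single fused GPU kernel, written in Triton.
# Illustrative sketch only; the assignment itself targets raw CUDA kernels.
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                       # one program per matrix row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                    # subtract row max for stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * row_stride + cols, y, mask=mask)

x = torch.randn(128, 500, device="cuda")
y = torch.empty_like(x)                          # same contiguous layout as x
BLOCK_SIZE = triton.next_power_of_2(x.shape[1])  # whole row fits in one block
softmax_kernel[(x.shape[0],)](y, x, x.stride(0), x.shape[1], BLOCK_SIZE=BLOCK_SIZE)
assert torch.allclose(y, torch.softmax(x, dim=1), atol=1e-6)
```

Fusing the max, exp, sum, and divide into one kernel avoids the extra global-memory round trips that a naive composition of framework ops would incur, which is the kind of speedup custom kernels aim for.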
Course Resources
- Course Website: https://llmsystem.github.io/llmsystem2025spring/
- Syllabus: https://llmsystem.github.io/llmsystem2025spring/docs/Syllabus/
- Assignments: https://llmsystem.github.io/llmsystem2025springhw/
- Course Texts: Selected research papers + selected chapters from Programming Massively Parallel Processors (4th Edition)