LLM inference optimization: Architecture, KV cache and Flash attention