LLM inference optimization: Architecture, KV cache and Flash attention