Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM