GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
