Speakers: Casper van Leeuwen (NCC Netherlands) & Gyula Ujlaki (NCC Hungary)

As datasets and models grow in size and complexity, mastering distributed training becomes vital. In this video, Casper van Leeuwen from NCC Netherlands breaks down the technical implementation of PyTorch Distributed Data Parallel (DDP), which replicates the model on every GPU and synchronises gradients across multiple nodes during training. Complementing this, Gyula Ujlaki from NCC Hungary presents the strategies behind Model Parallelism, demonstrating how to train massive architectures whose parameters exceed the memory of a single GPU.
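As a rough illustration of the DDP workflow Casper covers, the sketch below wraps a toy model in DistributedDataParallel and runs a few synchronised training steps. The model, tensor sizes, and hyperparameters are placeholders, not taken from the talk; the script assumes a torchrun launch, which sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun populates RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; each rank holds a full replica.
    model = nn.Linear(1024, 1024).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        # Dummy data; in practice a DistributedSampler shards the dataset.
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nnodes=2 --nproc_per_node=4 train.py, each process trains on its own data shard while backward passes keep the replicas in lockstep.

For the model-parallel strategy Gyula discusses, a minimal sketch (again with placeholder layer sizes) splits one module across two GPUs so that neither device has to hold the full set of parameters:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # Toy split: first half lives on cuda:0, second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations, not parameters, move between devices.
        return self.part2(x.to("cuda:1"))
```

Production-scale training combines such splits with pipeline scheduling or tensor parallelism so that devices are not left idle; this sketch shows only the basic placement idea.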

CASTIEL 2 & EUROCC2