DLI Training Series - Model Parallelism - Building and Deploying Large Neural Networks

Course/Event Essentials

Event/Course Start
Event/Course End
Event/Course Format
In person

Venue Information

Country: Germany

Training Content and Scope

Scientific Domain
Level of Instruction
Intermediate
Sector of the Target Audience
Research and Academia
Language of Instruction

Other Information

Organiser
Event/Course Description

Large language models (LLMs) and deep neural networks (DNNs), whether applied to natural language processing (e.g., GPT-3), computer vision (e.g., huge Vision Transformers), or speech AI (e.g., wav2vec 2.0), have certain properties that set them apart from their smaller counterparts. As LLMs and DNNs become larger and are trained on progressively larger datasets, they can adapt to new tasks with just a handful of training examples, accelerating the route toward general artificial intelligence. Training models that contain tens to hundreds of billions of parameters on vast datasets isn’t trivial and requires a unique combination of AI, high-performance computing (HPC), and systems knowledge. The goal of this course is to demonstrate how to train the largest of neural networks and deploy them to production.

The course is part of a training series co-organised by LRZ and the NVIDIA Deep Learning Institute (DLI). All instructors are NVIDIA-certified University Ambassadors.

Learning Objectives

By participating in this workshop, you’ll learn how to:

  • Scale training and deployment of LLMs and neural networks across multiple nodes.
  • Use techniques such as activation checkpointing, gradient accumulation, and various forms of model parallelism to overcome the challenges associated with large-model memory footprint.
  • Capture and understand training performance characteristics to optimize model architecture.
  • Deploy very large multi-GPU, multi-node models to production using NVIDIA Triton™ Inference Server.
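
As a rough illustration of one of the techniques named above, gradient accumulation, the following sketch is not taken from the course material. It uses a hypothetical one-parameter linear model in plain Python (the function names, data, and learning rate are all assumptions made for the example) to show the core idea: a large batch is processed as several small micro-batches, and the scaled micro-batch gradients are summed so that the result matches the gradient of the full batch while only one micro-batch needs to be held in memory at a time.

```python
# Minimal sketch of gradient accumulation for a 1-D linear model y = w * x
# with a mean-squared-error loss. All names and values are illustrative.

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Walk over the batch in micro-batches and sum the scaled gradients."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mb_x = xs[start:start + micro_batch_size]
        mb_y = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch so the
        # accumulated sum equals the full-batch mean gradient.
        total += mse_grad(w, mb_x, mb_y) * len(mb_x) / n
    return total


# Hypothetical toy data, roughly following y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 15.8]
w = 0.5

full = mse_grad(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)

# A single SGD step is taken only after all micro-batches are accumulated.
lr = 0.01
w_new = w - lr * accum
```

In a deep learning framework the same pattern usually appears as several backward passes followed by one optimizer step. Activation checkpointing and model parallelism address the memory footprint in different ways and are not shown here.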