This course brings together the two domains of Big Data and High Performance Computing (HPC) by showing how to run Hadoop jobs on the Vienna Scientific Cluster (VSC). High Performance Computing applications are usually highly optimized to make efficient use of the available processing power of compute clusters called supercomputers, especially at the level of floating point operations. Big Data applications operate at a higher level on large data sets, and their main focus is on features such as fault tolerance, processing of dirty and/or unstructured data, and fast development. The largest Big Data clusters are even larger than supercomputers and require programming paradigms with even better scaling behavior than is required in HPC. Tools that facilitate Big Data processing include the MapReduce framework and its more modern counterpart Spark, as well as SQL and NoSQL databases.
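To give a flavour of the MapReduce/Spark programming model mentioned above, here is a minimal PySpark word-count sketch. It is an illustration, not course material: the input path data.txt is a placeholder, and it assumes a working pyspark installation.

```python
# Minimal PySpark word-count sketch (illustrative only).
# Assumes pyspark is installed and "data.txt" (placeholder path) is readable.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.read.text("data.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs ("map" step)
               .reduceByKey(add))                    # sum the counts per word ("reduce" step)

for word, count in counts.collect():
    print(word, count)

spark.stop()
```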
The course gives a quick overview of the VSC environment, the module environment, and the schedulers involved. Scheduling becomes an issue when combining Big Data and HPC: we use the Slurm scheduler to gain access to compute nodes for a job, and within the job we spawn a Big Data scheduler (mostly YARN), which schedules and starts the user tasks.
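As a conceptual sketch of this two-level scheduling (not code from the course), the snippet below shows how, inside a running Slurm job, the allocated node list could be expanded and turned into a YARN worker list, with the first node acting as the ResourceManager host. SLURM_JOB_NODELIST and "scontrol show hostnames" are standard Slurm features; the output path hadoop_conf/workers is a placeholder, and real setups typically rely on site-specific helper scripts.

```python
# Conceptual sketch only: run inside a Slurm allocation.
# Expands the allocated node list and writes a YARN "workers" file;
# the directory "hadoop_conf" is a hypothetical placeholder.
import os
import subprocess

nodelist = os.environ["SLURM_JOB_NODELIST"]           # compressed form, e.g. "n[001-004]"
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],      # expand to one hostname per line
    capture_output=True, text=True, check=True,
).stdout.split()

resource_manager, workers = hosts[0], hosts[1:]       # first node hosts the ResourceManager
print("ResourceManager host:", resource_manager)

os.makedirs("hadoop_conf", exist_ok=True)
with open("hadoop_conf/workers", "w") as f:           # remaining nodes run the NodeManagers
    f.write("\n".join(workers) + "\n")
```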
This course provides lectures, demos, and hands-on labs. The hands-on labs will be done on our flagship system VSC-4; all participants will receive a temporary training user account for the course.
Type of methodology: Combination of lecture and hands-on
Participants receive a certificate of attendance: If requested
Paid training activity for participants: No, it's free of charge
Participants' prerequisite knowledge: C/C++ OR Fortran OR Python