Big Data analysis with Hadoop and RHadoop

Course/Event Essentials

Event/Course Start
Event/Course End
Event/Course Format
Online
Live (synchronous)

Venue Information

Country: Austria

Training Content and Scope

Scientific Domain
Level of Instruction
Beginner
Sector of the Target Audience
Research and Academia
Industry
Public Sector
Other (general public...)
HPC Profile of Target Audience
Application Users
Application Developers
Data Scientists
System Administrators
Language of Instruction

Other Information

Supporting Project(s)
EuroCC/CASTIEL
Event/Course Description

This training course covers the foundations of “Big Data” processing. It introduces the Hadoop distributed computing architecture and provides an introductory-level tutorial on Big Data analysis using Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach and Rmpi. Although held online, the course is hands-on: participants will work interactively with real data in the high-performance computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.

The training event consists of two 4-hour sessions on two consecutive days. The first day focuses on big data management and data analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and into the Hadoop distributed file system, and (ii) perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day focuses on big data management and analysis using R and RHadoop. We will start by working within RStudio, writing all scripts in R with several state-of-the-art libraries for parallel computation, such as parallel, doParallel, foreach and Rmpi, and libraries for working with Hadoop, such as rmr, rhdfs and rhbase. Finally, we will show how to run R scripts as parallel Slurm jobs.
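To give a flavour of the MapReduce pattern mentioned for the first day, the sketch below counts words in a few lines of text. It is only an illustration and not taken from the course material: the function names mapper, reducer and run_local are hypothetical, and the whole pipeline runs in a single local process, with a plain sort standing in for the shuffle-and-sort phase that Hadoop would perform between the two steps on a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts per word. Input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in one process (assumption:
    on a real cluster Hadoop would distribute and sort the pairs)."""
    pairs = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(pairs))


if __name__ == "__main__":
    sample = ["big data on Hadoop", "big data with R"]
    for word, count in run_local(sample).items():
        print(f"{word}\t{count}")
```

In a cluster setting, the map and reduce steps would typically live in separate scripts that read from standard input and write tab-separated key/value pairs to standard output; the single-process version here only shows the logic.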

Participants will need a local machine to connect to the supercomputers at the University of Ljubljana and to the Vienna Scientific Cluster. Before the course starts, they will receive training accounts on these supercomputers for running all examples.

Target audience: Everyone interested in big data management and analysis.

Prerequisites for the first day: basic Linux shell commands and Python
Prerequisites for the second day: basic Linux shell commands and R

This course is organized in cooperation with EuroCC Austria, VSC, EuroCC Slovakia, and EuroCC Slovenia.