Overview
This training course will focus on the foundations of “Big Data” analysis by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop and Rhadoop. Although online, the course will be hands-on, allowing participants to work interactively on real data in the High Performance Computing environment of the University of Ljubljana.
Description
The training event will consist of two four-hour sessions on two consecutive days. The first day will focus on big data management and data analysis with Hadoop. Participants will learn (i) how to move big data efficiently to a cluster and into the Hadoop distributed file system, and (ii) how to perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day will focus on big data management and analysis using Rhadoop. We will work entirely within RStudio and write all scripts in R, using several state-of-the-art libraries for parallel computation, such as parallel, doParallel and foreach, and libraries for working with Hadoop, such as rmr, rhdfs and rhbase.
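To give a flavour of the first day's material, below is a minimal sketch of the kind of Python MapReduce script typically used with Hadoop Streaming. The word-count task and all names here are illustrative assumptions, not taken from the course materials; with Hadoop Streaming the mapper and reducer would read from stdin and write tab-separated key/value pairs, while here the same logic runs in-process with the shuffle simulated by a sort.

```python
# Sketch of a Hadoop Streaming-style word count in Python (illustrative,
# not from the course materials). Hadoop sorts the mapper's key/value
# pairs by key before passing them to the reducer; here that shuffle
# step is simulated with sorted().

from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)


def reducer(pairs):
    """Sum the counts for each word; input must be sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))


if __name__ == "__main__":
    text = ["big data big analysis", "data analysis data"]
    shuffled = sorted(mapper(text))  # simulate Hadoop's shuffle phase
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```

In an actual Hadoop Streaming job, the mapper and reducer would live in separate scripts reading `sys.stdin`, submitted to the cluster via the Hadoop Streaming jar with `-mapper` and `-reducer` options.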
Target audience
Everyone interested in big data management and analysis
Prerequisite knowledge
For the first day: basic Linux shell commands and Python
For the second day: basic Linux shell commands and R
Workflow
The course will be held online via Zoom. Participants will need a local computer to connect to the HPC at the University of Ljubljana. Before the start of the course they will receive a student account on this supercomputer, and all examples will be done on this machine. They will retain the account for two more weeks to repeat the exercises and to transfer the data and examples to a local machine.
Skills to be gained
At the end of the course participants will be able to:
- Connect to a supercomputer using the NoMachine tool;
- Move big data to a supercomputer and store it in a distributed file system;
- Write Python scripts to perform basic data management and data analysis tasks with Hadoop;
- Write R scripts to perform basic data management and data analysis tasks with Rhadoop libraries such as rmr, rhdfs and rhbase.
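The step of moving data into the distributed file system is usually done with the standard `hdfs dfs` command-line client. The sketch below wraps that client from Python; the file and directory paths are hypothetical placeholders, not paths used in the course.

```python
# Sketch of copying a local file into HDFS by wrapping the standard
# `hdfs dfs -put` command-line client. All paths in the example are
# hypothetical placeholders.

import subprocess


def hdfs_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a local file
    into an HDFS directory (-f overwrites an existing target)."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def hdfs_put(local_path, hdfs_dir):
    """Run the upload; raises CalledProcessError if the copy fails."""
    subprocess.run(hdfs_put_command(local_path, hdfs_dir), check=True)


if __name__ == "__main__":
    # Show the command that would be run (hypothetical paths).
    print(" ".join(hdfs_put_command("measurements.csv", "/user/student/data")))
```

Splitting command construction from execution keeps the logic easy to inspect before running it against a real cluster.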