Big Data analysis with Hadoop and RHadoop

Course/Event Essentials

Event/Course Start
Event/Course End
Event/Course Format
Online
Live (synchronous)

Venue Information

Country: Slovenia

Training Content and Scope

Scientific Domain
Technical Domain
Level of Instruction
Intermediate
Sector of the Target Audience
Research and Academia
Industry
Public Sector
Other (general public...)
Language of Instruction

Other Information

Event/Course Description

Overview

This training course will focus on the foundations of “Big Data” analysis by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop and RHadoop. Although held online, the course will be hands-on, allowing participants to work interactively on real data in the High Performance Computing environment of the University of Ljubljana.

Description

The training event will consist of two 4-hour sessions on two consecutive days. The first day will focus on big data management and data analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and into the Hadoop distributed file system, and (ii) perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day will focus on big data management and analysis using RHadoop. We will work entirely within RStudio and write all scripts in R, using several state-of-the-art libraries for parallel computation, such as parallel, doParallel and foreach, as well as libraries for working with Hadoop, such as rmr, rhdfs and rhbase.
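
As an illustration of the first day's approach, the sketch below shows the classic word-count pattern with Hadoop Streaming, where two plain Python scripts serve as mapper and reducer. This is a minimal example for orientation, not the course's actual exercise; the file names, jar location and HDFS paths in the comments are hypothetical.

    # mapper.py -- reads raw text from stdin and emits "word<TAB>1" per word.
    # A typical Hadoop Streaming invocation (paths are hypothetical):
    #   hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
    #       -mapper mapper.py -reducer reducer.py \
    #       -input /user/student/input -output /user/student/wordcount
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop sorts the mapper output by key, so identical
    # words arrive consecutively; sum the counts for each word.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")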

Target audience

Everyone interested in big data management and analysis

Prerequisite knowledge

For the first day: basic Linux shell commands and Python
For the second day: basic Linux shell commands and R

Workflow

The course will be held online via Zoom. Participants will need a local computer to connect to the HPC system at the University of Ljubljana. Before the start of the course they will receive a student account on this supercomputer, and all the examples will be run on this machine. They will retain the account for two more weeks so that they can repeat the exercises and transfer the data and examples to a local machine.

Skills to be gained

At the end of the course participants will be able to:

  • Connect to a supercomputer using the NoMachine tool;
  • Move big data to a supercomputer and store them in a distributed file system;
  • Write Python scripts to perform basic data management and data analysis tasks with Hadoop;
  • Write R scripts to perform basic data management and data analysis tasks with RHadoop libraries such as rmr, rhdfs and rhbase (a minimal sketch follows this list).
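
As a taste of the second day's workflow, the sketch below performs the same word count from R with the RHadoop libraries. It assumes rmr (packaged as rmr2 in recent releases) and rhdfs are installed and configured on the cluster; the sample data are invented for illustration.

    # A minimal RHadoop word-count sketch (assumes rmr2 and rhdfs are
    # installed and configured for the cluster; sample data are made up).
    library(rmr2)   # MapReduce from R
    library(rhdfs)  # HDFS access from R
    hdfs.init()

    # Push a small character vector to HDFS as an rmr2 big-data object.
    words <- to.dfs(c("hadoop", "rhadoop", "hadoop", "hpc"))

    # Map emits (word, 1); reduce sums the counts for each word.
    counts <- mapreduce(
      input  = words,
      map    = function(k, v) keyval(v, 1),
      reduce = function(k, v) keyval(k, sum(v))
    )

    from.dfs(counts)  # pull the (word, count) pairs back into the R session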