[Workshop] Practical Data Wrangling

Overview

Data is the foundation of every data-driven project: all subsequent analysis, modeling, and decision-making depend on it. Depending on the task, raw data may be collected from a wide variety of sources such as databases, APIs, sensors, logs, documents, or images. Before it can be used effectively for analysis or machine learning, raw data must be cleaned, transformed, validated, and organized into a consistent and usable format. Because data comes in many forms, including numerical, categorical, time series, text, event/log, and image data, the tools and techniques used for data wrangling vary significantly with the data type and the requirements of the task.

In this workshop, we will cover practical data wrangling techniques for numerical, categorical, time series, text, event/log, and image data. Participants will learn how to detect and handle missing values, outliers, inconsistencies, duplicates, and formatting problems. We will demonstrate methods for transforming and encoding categorical variables, parsing and aggregating temporal data, processing unstructured text, analyzing event logs, and preparing image datasets for machine learning and analytics. Each session combines concepts, demonstrations, and hands-on exercises using realistic datasets to help participants develop practical skills that can be applied immediately in downstream modeling tasks.

Who is this workshop for?

This workshop is designed for:

data practitioners who regularly work with raw or semi-structured data and need to prepare it for analysis or modeling.
data analysts, data scientists, machine learning engineers, and software engineers who want to strengthen their practical data preprocessing skills.
graduate students and researchers working with real-world datasets who need a structured approach to data cleaning and transformation.

Prerequisites

To ensure a smooth learning experience, participants should have:

basic proficiency in Python programming (variables, loops, functions) and in libraries such as NumPy, Pandas, and Matplotlib/Seaborn.
basic familiarity with statistics (mean, median, variance) and introductory machine learning concepts, which will make it easier to follow the examples.
experience reading and writing simple code and working with datasets in a notebook environment.

Key Takeaways

In this workshop, participants will learn how to systematically clean and structure different types of real-world data, including tabular, time series, text, event/log, and image data.
By the end of this workshop, participants will:

gain practical experience in building reproducible data wrangling pipelines that improve data quality and usability for downstream data analysis and machine learning.
understand common pitfalls in messy datasets and how to address them effectively using standard techniques and tools.
be able to confidently transform raw datasets into well-structured inputs suitable for downstream modeling and analysis tasks.

Tentative Schedule (TBA)

Day 1

Introduction
Data Types and Data Storage Formats
Numerical Data Wrangling
Categorical Data Wrangling
Time Series Data Wrangling

Day 2

Text Data Wrangling
Event/Log and Image Data Wrangling
Wrangling Other Data Types
Summary and Key Takeaways

Regulations

Due to EuroCC3 regulations, we CANNOT ACCEPT generic or private email addresses. Please use your official university or company email address for registration.

This training is for users who live and work in the European Union or a country associated with Horizon 2020. You can read more about the countries associated with Horizon 2020 here.

Contact

For questions regarding this workshop or general questions about ENCCS training events, please contact training@enccs.se.
