Overview
Data forms the foundation of every data-driven project: all subsequent analysis, modeling, and decision-making depend on it. Depending on the task, raw data may be collected from a wide variety of sources such as databases, APIs, sensors, logs, documents, or images. Before it can be used effectively for analysis or machine learning, raw data must be cleaned, transformed, validated, and organized into a consistent and usable format. Because data comes in many forms, including numerical, categorical, time series, text, event/log, and image data, the tools and techniques used for data wrangling vary significantly with the data type and the requirements of the task.
In this workshop, we will cover practical data wrangling techniques for numerical, categorical, time series, text, event/log, and image data. Participants will learn how to detect and handle missing values, outliers, inconsistencies, duplicates, and formatting problems. We will demonstrate methods for transforming and encoding categorical variables, parsing and aggregating temporal data, processing unstructured text, analyzing event logs, and preparing image datasets for machine learning and analytics. Each session combines concepts, demonstrations, and hands-on exercises using realistic datasets to help participants develop practical skills that can be applied immediately in downstream modeling tasks.
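To give a flavor of the techniques listed above, here is a minimal sketch using pandas. It is not taken from the workshop material: the toy dataset, the column names, and the `wrangle` helper are all hypothetical, and the median fill and the 1.5 x IQR outlier rule are just common default choices assumed for illustration.

```python
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
raw = pd.DataFrame(
    {
        "timestamp": [
            "2024-01-01 00:00",
            "2024-01-01 00:30",
            "2024-01-01 00:30",  # exact duplicate of the previous row
            "2024-01-01 01:00",
            "2024-01-01 01:30",
            "2024-01-01 02:00",
        ],
        "sensor": ["a", "b", "b", "a", None, "b"],
        "value": [1.0, 2.0, 2.0, None, 3.0, 250.0],
    }
)


def wrangle(df):
    """Hypothetical helper: deduplicate, fill gaps, flag outliers, encode."""
    # Remove rows that are exact duplicates of an earlier row.
    out = df.drop_duplicates().copy()

    # Parse timestamp strings into proper datetime values.
    out["timestamp"] = pd.to_datetime(out["timestamp"])

    # Handle missing values: explicit label for categories, median for numbers.
    out["sensor"] = out["sensor"].fillna("unknown")
    out["value"] = out["value"].fillna(out["value"].median())

    # Flag (rather than drop) outliers with the 1.5 * IQR rule.
    q1, q3 = out["value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["is_outlier"] = ~out["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # One-hot encode the categorical column.
    return pd.get_dummies(out, columns=["sensor"], prefix="sensor")


clean = wrangle(raw)

# Aggregate the time series to hourly means.
hourly = clean.set_index("timestamp")["value"].resample("1h").mean()

print(clean)
print(hourly)
```

Flagging outliers in a separate column instead of deleting them keeps the decision reversible; whether an extreme value is an error or a genuine signal usually depends on the task.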
Who is this workshop for?
This workshop is designed for:
- data practitioners who regularly work with raw or semi-structured data and need to prepare it for analysis or modeling.
- data analysts, data scientists, machine learning engineers, and software engineers who want to strengthen their practical data preprocessing skills.
- graduate students and researchers working with real-world datasets who need a structured approach to data cleaning and transformation.
Prerequisites
To ensure a smooth learning experience, participants should:
- have basic proficiency in Python programming (variables, loops, functions) and in libraries such as NumPy, Pandas, and Matplotlib/Seaborn.
- have basic familiarity with statistics (mean, median, variance) and introductory machine learning concepts, which will make the examples easier to follow.
- be comfortable reading and writing simple code and working with datasets in a notebook environment.
Key Takeaways
In this workshop, participants will learn how to systematically clean and structure different types of real-world data, including tabular, time series, text, event/log, and image data.
By the end of this workshop, participants will:
- gain practical experience in building reproducible data wrangling pipelines that improve data quality and usability for downstream data analysis and machine learning.
- understand common pitfalls in messy datasets and how to address them effectively using standard techniques and tools.
- be able to confidently transform raw datasets into well-structured inputs suitable for downstream modeling and analysis tasks.
Tentative Schedule (TBA)
Day 1
- Introduction
- Data Types and Data Storage Formats
- Numerical Data Wrangling
- Categorical Data Wrangling
- Time Series Data Wrangling
Day 2
- Text Data Wrangling
- Event/Log and Image Data Wrangling
- Wrangling Other Data Types
- Summary and Key Takeaways
Regulations
Due to EuroCC3 regulations, we CANNOT ACCEPT generic or private email addresses. Please use your official university or company email address for registration.
This training is for participants who live and work in the European Union or a country associated with Horizon 2020. You can read more about the countries associated with Horizon 2020 HERE.
Contact
For questions regarding this workshop or general questions about ENCCS training events, please contact training@enccs.se.