Local AI tool for audio transcription

Screenshot of a monitor with black background showing GPU workload in tables

To keep sensitive interview recordings out of third‑party clouds, the University of Innsbruck now uses an AI-based speech‑to‑text tool developed within the EuroCC project that runs locally on the LEO5 HPC system and ensures full data sovereignty.

The Challenge

In political science, interviews are often the method of choice for collecting research data. However, manually transcribing recordings is both time-consuming and costly. As a result, many researchers now rely on AI-based transcription tools. “Automatically generated transcripts are everywhere these days – even if the quality is not always convincing,” says Professor Franz Eder, political scientist at the University of Innsbruck (UIBK). “That made me wonder: can we do better – and, above all, can we run it locally on our own servers so that the data stays in-house?”

To answer this question, Andreas Lindner joined the project. He specialises in deploying AI models efficiently on HPC systems. The project also received support from the HPC team of the Central IT Services (ZID) and the university’s research focus area Scientific Computing.

The Solution

The solution needed to be powerful enough to run modern AI models, operate entirely on local infrastructure, and give the university full control over processes and data. These core requirements were quickly met. However, as discussions with users progressed, expectations grew step by step: speakers had to be identifiable, timestamps inserted, and translations into other languages enabled. What began as a small assignment evolved into a sophisticated project combining advanced features, computational efficiency and digital sovereignty.

Today, the tool follows a clearly defined workflow spanning a Linux workstation and the local HPC system LEO5. After an audio file is transferred from a workstation at the Office for Open Science within the Faculty of Social and Political Sciences to the university’s supercomputer, the audio is analysed and segmented by speaker. The tool then automatically transcribes each segment. If required, the completed transcript can subsequently be translated into other languages.
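The segment-then-transcribe order described above can be sketched in a few lines of Python. This is an illustration only, not the code running on LEO5: the `Turn` record, the `merge_turns` and `build_transcript` helpers, the 0.5-second gap tolerance and the output format are all assumptions made for this example. The actual transcription step is passed in as a function, so a stub stands in for the Whisper model here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Turn:
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str   # diarisation label, e.g. "SPEAKER_00" (hypothetical format)


def merge_turns(turns: Iterable[Turn], max_gap: float = 0.5) -> List[Turn]:
    """Join consecutive turns of the same speaker if the pause is at most max_gap seconds."""
    merged: List[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start):
        last = merged[-1] if merged else None
        if last is not None and last.speaker == turn.speaker and turn.start - last.end <= max_gap:
            last.end = max(last.end, turn.end)
        else:
            # Copy the turn so the caller's objects are never modified.
            merged.append(Turn(turn.start, turn.end, turn.speaker))
    return merged


def hms(seconds: float) -> str:
    """Format a time offset as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_transcript(turns: Iterable[Turn], transcribe: Callable[[float, float], str]) -> str:
    """Transcribe each merged segment and prefix it with timestamps and speaker label."""
    lines = []
    for seg in merge_turns(turns):
        text = transcribe(seg.start, seg.end)  # placeholder for the speech-to-text model
        lines.append(f"[{hms(seg.start)} - {hms(seg.end)}] {seg.speaker}: {text}")
    return "\n".join(lines)


# Usage with a stub transcriber in place of a real model.
turns = [
    Turn(0.0, 4.2, "SPEAKER_00"),
    Turn(4.4, 9.0, "SPEAKER_00"),
    Turn(9.5, 75.0, "SPEAKER_01"),
]
print(build_transcript(turns, lambda start, end: "<text>"))
# [00:00:00 - 00:00:09] SPEAKER_00: <text>
# [00:00:09 - 00:01:15] SPEAKER_01: <text>
```

Passing the transcriber in as a function keeps the segmentation logic independent of any particular model, which is one plausible way to keep such a workflow testable without a GPU.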

The technology behind it

The AI solution deliberately builds on established, well-documented components, which Andreas Lindner specifically adapted for use on LEO5:

Transcription is performed using a Whisper model: https://huggingface.co/docs/transformers/en/model_doc/whisper
Speaker diarisation – the process of assigning speech segments and timestamps to individual speakers – is handled by models from pyannote: https://huggingface.co/pyannote
Translation of transcripts is carried out using a Tower model developed by Unbabel: https://huggingface.co/collections/Unbabel/tower
Data encryption and decryption are implemented using gocryptfs, ensuring secure handling of sensitive information: https://www.baeldung.com/linux/gocryptfs-encrypt-decrypt-dirs
To guarantee a clearly separated and reproducible runtime environment on LEO5, the speech-to-text workflow runs inside a software container.

Screenshot of the new AI transcription tool. In the left column: all processes running on the CPUs, including the system’s total CPU utilisation, which is very low here. Beneath the green bar on the far right: the task currently being computed — in this case, the transcription itself. In the right column: the utilisation of GPU 1 in terms of compute capacity (99% utilisation) and memory usage (23,293 MiB of 24,576 MiB; MiB = mebibyte) during the actual transcription process.

Left column: speaker diarisation in progress. Right column: utilisation of GPU 1 — memory usage (2,709 of 24,576 MiB used). The compute utilisation of GPU 1 is at 70%.
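To put the memory figures from the two screenshots into proportion, the share of GPU memory in use can be worked out directly. The short calculation below uses only the numbers given in the captions; the helper name `memory_share` is chosen for this example.

```python
GPU_MEMORY_MIB = 24_576  # total memory of GPU 1 as shown in the screenshots (24 GiB)


def memory_share(used_mib: int, total_mib: int = GPU_MEMORY_MIB) -> float:
    """Return the used memory as a percentage of the total."""
    return 100.0 * used_mib / total_mib


print(f"transcription: {memory_share(23_293):.1f} %")  # transcription: 94.8 %
print(f"diarisation:   {memory_share(2_709):.1f} %")   # diarisation:   11.0 %
```

In these two snapshots, the transcription step occupies almost all of the GPU's memory, while diarisation uses roughly a ninth of it.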

The Outcome

The solution is now technically operational and fully documented. It integrates into the existing research infrastructure, relieves researchers of the time-intensive transcription process, and safeguards data sovereignty. AI Factory Austria (AI:AT) uses the tool internally to transcribe video conferences. Incidentally, the interview with Andreas Lindner on which this article is based was itself transcribed using the new tool on LEO5.

An additional benefit arises from the fact that transcription tasks are not time-critical: they are automatically scheduled into computational gaps between other projects. This improves utilisation of the HPC systems, which consume power even when idle. Putting otherwise unused capacity to productive use increases overall energy efficiency.
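The idea of slotting non-urgent jobs into idle windows can be illustrated with a toy example. The sketch below is not the scheduler used on LEO5, whose configuration the article does not describe; it is a simple greedy first-fit strategy, and the function name `fill_gaps`, the job names and all numbers are invented for illustration.

```python
from typing import Dict, List, Tuple


def fill_gaps(gaps: List[Tuple[int, int]], jobs: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Place each job (name -> duration in minutes) into the first idle gap it fits in.

    gaps are (start, end) windows in minutes during which the hardware is idle.
    Jobs that fit nowhere are left out of the result and would have to wait.
    """
    free = [list(gap) for gap in sorted(gaps)]  # mutable copies, earliest gap first
    placed: Dict[str, Tuple[int, int]] = {}
    # Longest jobs first, so that large gaps are not fragmented by short jobs.
    for name, duration in sorted(jobs.items(), key=lambda item: -item[1]):
        for gap in free:
            if gap[1] - gap[0] >= duration:
                placed[name] = (gap[0], gap[0] + duration)
                gap[0] += duration  # shrink the gap by the time just allocated
                break
    return placed


# Two idle windows and three hypothetical transcription jobs.
schedule = fill_gaps(
    gaps=[(0, 30), (60, 150)],
    jobs={"interview_a": 80, "interview_b": 25, "interview_c": 20},
)
print(schedule)
# {'interview_a': (60, 140), 'interview_b': (0, 25)}
```

In this example, `interview_c` fits into neither remaining gap and simply waits for the next idle window, which is acceptable precisely because the work is not time-critical.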

With this new tool, the Faculty of Social and Political Sciences now has access to an AI-supported solution that simplifies research workflows, preserves data sovereignty, and demonstrates how modern technology can be deployed responsibly.

Dataset now publicly available

Since automated transcription is needed not only in Innsbruck but also at many other institutions, Andreas Lindner has made the corresponding GitLab repository publicly available. It may therefore also be useful for other operators of high-performance computing clusters who wish to implement a similar setup:

https://researchdata.uibk.ac.at/records/z877w-c6110
https://doi.org/10.24433/CO.0416787.v1

For more information on this case and on HPC in general, please contact info@eurocc-austria.at.

STT workflow manager: the user’s initial steps when running the script, for example selecting the language and the number of speakers (at the bottom).