NHR4CES is represented at the third NHR conference

The third NHR Conference focuses on AI in Social Sciences & Humanities, Life Sciences, and Data Management. The conference brings together high-performance computing users and providers of our NHR centers. At the NHR Conference, you have the opportunity to present your projects in a poster session or contributed talk and to exchange ideas with the consulting and operational teams of the NHR centers. It takes place in Göttingen (Aula am Waldweg, Waldweg 26) in collaboration with NHR@Göttingen from September 22 to 25, 2025!

Jan-Micha Bodensohn and Marcel Nellesen from our CSG Data Management and Florian Kummer from our SDL Fluids are presenting their research in the field of Data Management.

Jan-Micha Bodensohn

is a doctoral student supervised by Prof. Carsten Binnig at the Data and AI Systems Lab of the Technical University of Darmstadt and works as a researcher for the German Research Center for Artificial Intelligence (DFKI). His research centers on the automation of data engineering tasks with foundation models, and he has a strong background in natural language processing, machine learning, and databases.

He currently pursues the following lines of research: the evaluation of Large Language Models (LLMs) for data engineering tasks, research on (retrieval) systems to support foundation models, and the development of foundation models for tabular data.

His talk: LLMs for Data Engineering in the Wild

Preparing raw data for applications like empirical analysis and machine learning often entails high manual overheads. Therefore, the automation of such data engineering tasks has long drawn attention from researchers. Recent work shows that Large Language Models (LLMs) achieve state-of-the-art performance on various public benchmarks, providing a promising avenue towards automation without needing expensive, specialized solutions.

Existing research, however, primarily uses evaluation datasets based on tables from web sources such as Wikipedia, calling the applicability of LLMs for data engineering in real-world use cases like science or business into question. In this talk, we use the example of large enterprises to first highlight how real-world data can often differ from that in existing benchmarks. To understand how these differences affect LLMs, we apply recent LLMs to the task of column type annotation and show that data characteristics like expressiveness and sparsity can severely hinder performance. Moreover, the tasks in real-world data engineering scenarios are often more complex than their typical formulations in the scientific community. In the second part of the talk, we thus highlight challenges that arise when automating real-world data engineering scenarios with LLMs.
Finally, we discuss how LLMs are affected by missing domain knowledge as well as their high costs when applied at scale.
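
To make the notion of sparsity concrete, the following minimal sketch measures the fraction of empty cells in a column and serializes the column into a prompt for column type annotation. The function names, the prompt wording, the column name `FLD_017`, and the example values are illustrative assumptions, not the setup used in the talk, and no LLM is actually called.

```python
from typing import Optional, Sequence


def sparsity(values: Sequence[Optional[str]]) -> float:
    """Return the fraction of cells that are missing (None or blank)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or not v.strip())
    return missing / len(values)


def build_prompt(column_name: str,
                 values: Sequence[Optional[str]],
                 types: Sequence[str]) -> str:
    """Serialize a column into a hypothetical column type annotation prompt."""
    # Only non-empty cells carry information the model can use.
    cells = [v.strip() for v in values if v is not None and v.strip()]
    return (
        f"Column name: {column_name}\n"
        f"Values: {', '.join(cells) if cells else '(no values)'}\n"
        f"Choose one type from: {', '.join(types)}"
    )


# A web-style column: descriptive values, no gaps (made-up example data).
web_column = ["Berlin", "Paris", "Rome", "Madrid"]
# An enterprise-style column: cryptic name, mostly empty (made-up example data).
enterprise_column = ["0001", None, "", None]

print(sparsity(web_column))
print(sparsity(enterprise_column))
print(build_prompt("FLD_017", enterprise_column, ["city", "account group", "date"]))
```

In this toy example, the sparse column leaves the model with a single short code and an uninformative column name, which illustrates why sparse and inexpressive data gives the model little to work with.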

With our talk, we want to draw attention to the fact that data engineering "in the wild" is often more challenging than portrayed in existing LLM research. Furthermore, we point towards promising directions to overcome these challenges and adapt LLMs for real-world use.

Marcel Nellesen

holds a Bachelor's degree in Scientific Programming from the FH Aachen, Germany, and a Master's degree in Computer Science from RWTH Aachen University, Germany.

Since 2019, he has been working in the Research Process & Data Management department of the IT Center of RWTH Aachen University, Germany, as a research associate with a focus on research data management. He worked on the Collaborative Scientific Integration Environment (Coscine), a research data management platform developed at RWTH Aachen University. Currently, he is developing JARDS (Joint Application, Review, and Dispatch Service), a platform for the creation and scientific review of applications for computation time in NHR.
In 2021, Marcel joined the CSG Data Management in NHR4CES.

His talk: Efficient data transfers in HPC and RDM

In a world where data is growing rapidly in every scientific field and the number of new users on HPC clusters is increasing, the efficient transfer of data between the local infrastructure in the institutes, large-scale research data management systems, and HPC clusters presents a frequent challenge. Transfer speeds between the different systems often become a bottleneck, making efficient data transfer a commonly observed issue, especially for new users.

In our talk, we will introduce two components that we are utilizing to tackle this challenge from an RDM and an HPC perspective at RWTH Aachen University. Firstly, there is the File Transfer Service, originally developed at CERN, which is used for transferring data to and from our research data management platform Coscine. It provides a scalable infrastructure while supporting the scheduling of data transfers and is envisioned to ease the transfer of research data from local devices (e.g. microscopes) to the research data storage.

The second component is the HPC Connector, a dedicated storage system operating between our Object Storage and the HPC Cluster. It allows pre-staging of data close to the cluster and can be accessed at transfer speeds suitable for typical HPC jobs, while simultaneously supporting features such as automated synchronization of raw data and results between the HPC Cluster and our Object Storage.

Together, these two components facilitate a close connection between HPC resources and storage systems while ensuring optimal utilization of network capabilities.

Florian Kummer

is the leader of the research group “Highly accurate simulation methods” and a member of the SDL Fluids. His work covers numerical simulations of incompressible flows (with an emphasis on multiphase flows) and code development.

His talk: Efficient data transfers in HPC and RDM

Reproducibility is a cornerstone of scientific integrity, yet in computational engineering, it is often undermined by the evolving nature of software ecosystems and the transitory nature of academic research roles. In this talk, we present an approach that leverages Continuous Integration (CI) to promote reproducibility and align with the FAIR principles (Findable, Accessible, Interoperable, Reusable) in the context of computational engineering.

Our use case is rooted in a university setting, where PhD students develop novel numerical methods over several years, culminating in complex and resource-intensive simulations. Once these researchers move on, maintaining the capability to reproduce their results becomes a nontrivial challenge. The underlying codebase continues to evolve, dependencies shift, and contextual knowledge may be lost.

To address this, we have designed a system where Jupyter Notebooks serve as the core interface for simulation workflows. These notebooks are tightly integrated into the version-controlled codebase and treated as formal test cases. Using a purpose-built API, notebooks can submit jobs to HPC clusters and retrieve results into a database system. Each notebook thereby encapsulates not only the logic of the simulation but also the exact conditions and results, ensuring traceability.
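
The idea of treating a notebook as a formal test case can be sketched as follows. This is a minimal illustration using only the Python standard library, not the system or API described in the talk: the helper `run_notebook`, the example notebook, and its convergence check are hypothetical, and job submission to an HPC cluster and the results database are left out entirely.

```python
import json


def run_notebook(notebook_json: str) -> dict:
    """Execute all code cells of a notebook in one shared namespace.

    Any exception or failing assertion inside a cell propagates to the
    caller, so a CI job that calls this function fails with the notebook.
    """
    notebook = json.loads(notebook_json)
    namespace: dict = {}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores cell source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        exec(compile(source, "<notebook cell>", "exec"), namespace)
    return namespace


# A tiny stand-in for a simulation notebook (made-up numbers).
EXAMPLE_NOTEBOOK = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Convergence check"]},
        {"cell_type": "code", "source": [
            "errors = [0.1, 0.025, 0.00625]\n",
            "rates = [errors[i] / errors[i + 1] for i in range(2)]\n",
        ]},
        {"cell_type": "code", "source": [
            "assert all(r > 3.9 for r in rates), rates",
        ]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})


def test_notebook_runs() -> None:
    """Test case a CI pipeline could collect and run on every change."""
    namespace = run_notebook(EXAMPLE_NOTEBOOK)
    assert len(namespace["rates"]) == 2


test_notebook_runs()
```

Because the assertions live inside the notebook itself, the expected results travel with the simulation logic. The sketch only handles plain Python cells; notebooks that rely on kernel-specific features would need a real notebook execution engine.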

This integration of computational narratives, automated testing, and data provenance means that every change to the codebase is validated during CI workflows. Once stable, the code can be released and assigned a Digital Object Identifier (DOI), formalizing its alignment with FAIR standards and enabling persistent citation and reuse.

Keynote Speakers

Nataša Djurdjevac Conrad | Zuse Institute Berlin | AI in Social Sciences & Humanities

Dagmar Iber | ETH Zürich | Life Sciences

François Tessier | Inria Rennes | Data Management & Storage

More information: Have a look at the conference website!


August 19, 2025
