The third NHR Conference focuses on AI in Social Sciences & Humanities, Life Sciences, and Data Management. The conference brings together high-performance computing users and providers of our NHR centers. At the NHR Conference, you have the opportunity to present your projects in a poster session or contributed talk and to exchange ideas with the consulting and operational teams of the NHR centers. It takes place in Göttingen (Aula am Waldweg, Waldweg 26) in collaboration with NHR@Göttingen on September 22–25, 2025!
Jan-Micha Bodensohn and Marcel Nellesen from our CSG Data Management and Florian Kummer from our SDL Fluids are presenting their research in the field of Data Management.

Jan-Micha Bodensohn
is a doctoral student supervised by Prof. Carsten Binnig at the Data and AI Systems Lab of the Technical University of Darmstadt and works as a researcher for the German Research Center for Artificial Intelligence (DFKI). His research centers on the automation of data engineering tasks with foundation models, and he has a strong background in natural language processing, machine learning, and databases.
He currently pursues the following lines of research: evaluation of Large Language Models (LLMs) for data engineering tasks, research on (retrieval) systems to support foundation models, and development of foundation models for tabular data.
His talk: LLMs for Data Engineering in the Wild
Preparing raw data for applications like empirical analysis and machine learning often entails high manual overheads. Therefore, the automation of such data engineering tasks has long drawn attention from researchers. Recent work shows that Large Language Models (LLMs) achieve state-of-the-art performance on various public benchmarks, providing a promising avenue towards automation without needing expensive, specialized solutions.
Existing research, however, primarily uses evaluation datasets based on tables from web sources such as Wikipedia, calling the applicability of LLMs for data engineering in real-world use cases like science or business into question. In this talk, we use the example of large enterprises to first highlight how real-world data can often differ from that in existing benchmarks. To understand how these differences affect LLMs, we apply recent LLMs to the task of column type annotation and show that data characteristics like expressiveness and sparsity can severely hinder performance. Moreover, the tasks in real-world data engineering scenarios are often more complex than their typical formulations in the scientific community. In the second part of the talk, we thus highlight challenges that arise when automating real-world data engineering scenarios with LLMs.
Finally, we discuss how LLMs are affected by missing domain knowledge as well as their high costs when applied at scale.
With our talk, we want to draw attention to the fact that data engineering "in the wild" is often more challenging than portrayed in existing LLM research. Furthermore, we point towards promising directions to overcome these challenges and adapt LLMs for real-world use.

Marcel Nellesen
holds a Bachelor's degree in Scientific Programming from the FH Aachen, Germany, and a Master's degree in Computer Science from the RWTH Aachen University, Germany.
Since 2019, he has worked in the department for Research Process & Data Management of the IT Center of the RWTH Aachen University, Germany, as a scientific employee with a focus on research data management. He worked on the Collaborative Scientific Integration Environment (Coscine), a research data management platform developed at the RWTH Aachen University. Currently, he is developing JARDS (Joint Application, Review, and Dispatch Service), a platform for the creation and scientific review of applications for computation time in NHR.
In 2021, Marcel joined the CSG Data Management in NHR4CES.
His talk: Efficient data transfers in HPC and RDM
In a world where data is growing rapidly in every scientific field and the number of new users on HPC clusters is increasing, the efficient transfer of data between the local infrastructure in the institutes, large-scale research data management systems, and HPC clusters presents a frequent challenge. Transfer speeds between the different systems often become a bottleneck, making efficient data transfer a commonly observed issue, especially for new users.
In our talk, we will introduce two components that we are utilizing to tackle this challenge from an RDM and an HPC perspective at RWTH Aachen University. Firstly, there is the File Transfer Service, originally developed at CERN, which is used for transferring data to and from our research data management platform Coscine. It provides a scalable infrastructure while supporting the scheduling of data transfers and is envisioned to ease the transfer of research data from local devices (e.g. microscopes) to the research data storage.
The second component is the HPC Connector, a dedicated storage system operating between our Object Storage and the HPC Cluster. It allows pre-staging of data close to the cluster and can be accessed at transfer speeds suitable for typical HPC jobs, while simultaneously supporting features such as automated synchronization of raw data and results between the HPC Cluster and our Object Storage.
Together, these two components facilitate a close connection between HPC resources and storage systems while ensuring optimal utilization of network capabilities.

Florian Kummer
is the leader of the research group “Highly accurate simulation methods” and a member of the SDL Fluids. His work focuses on numerical simulations of incompressible flows (with an emphasis on multiphase flows) and code development.
His talk: Efficient data transfers in HPC and RDM
Keynote Speakers
Nataša Djurdjevac Conrad | Zuse Institute Berlin | AI in Social Sciences & Humanities
Dagmar Iber | ETH Zürich | Life Sciences
François Tessier | Inria Rennes | Data Management & Storage
More information: Have a look at the conference website!