CSG Data Management

A cooperation between TU Darmstadt and RWTH Aachen University

Cross-Sectional Group

The CSG Data Management develops automated tools to reduce the manual overhead of data engineering tasks such as metadata annotation, data cleaning, and data transformation. Furthermore, we aim to integrate research data management (RDM) and HPC more closely, so that the systems work together and unnecessary data copies are avoided.

The methods of achieving these goals are the following:

AI to automate Data Engineering Tasks (e.g., Automated Data Cleaning)
Lineage over complete data engineering workflows (i.e., track which data came from where)
System integration (e.g., integrate different storage systems)
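
To illustrate what lineage over a data engineering workflow means in practice, the following is a minimal sketch and not the tooling developed by the CSG. It assumes a hypothetical `LineageGraph` class and made-up file names: every derived dataset records its direct inputs and the step that produced it, so the question "which data came from where" can be answered by walking backwards through the recorded steps. The sketch also assumes the workflow is acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Hypothetical lineage store: maps each dataset to its direct inputs."""

    parents: dict[str, list[str]] = field(default_factory=dict)
    steps: dict[str, str] = field(default_factory=dict)

    def record(self, output: str, inputs: list[str], step: str) -> None:
        """Remember that `output` was produced from `inputs` by `step`."""
        self.parents[output] = list(inputs)
        self.steps[output] = step

    def sources(self, dataset: str) -> set[str]:
        """Return all original (non-derived) datasets that `dataset` depends on."""
        direct = self.parents.get(dataset)
        if not direct:
            # No recorded inputs: the dataset is itself an original source.
            return {dataset}
        found: set[str] = set()
        for parent in direct:
            found |= self.sources(parent)
        return found


# Illustrative workflow with made-up file names.
graph = LineageGraph()
graph.record("cleaned.csv", ["raw_a.csv", "raw_b.csv"], step="data cleaning")
graph.record("features.parquet", ["cleaned.csv"], step="data transformation")
print(sorted(graph.sources("features.parquet")))
```

Storing only the direct inputs per step keeps each record small, while the full provenance of any dataset can still be reconstructed on demand.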

We work on new tools and provide them as open-source code that can support researchers in their tasks.

We offer consulting on the use of these tools through workshops (e.g., hackathons). Additionally, we created a web platform to increase data literacy, where researchers can find links to other useful tools and to educational resources such as online courses: https://data-ai-literacy.ml/.

We also provide a bridge from NHR4CES to the NFDI4x initiatives by supporting users in the use of emerging solutions for automating research data management.

Competencies

Managing Metadata
Storage Space Applications
Support Data Transfers
Automation of Data Engineering Tasks
Simplifying access to Data Repositories

Activities

Coscine – Research Data Management Platform
Supporting Migrations of Storage Systems
Focus group: RDM in NHR
Cooperations inside and outside of NHR

Current research topics:

WikiDBs: A corpus of 100,000 real-world databases (Liane Vogel and Jan-Micha Bodensohn)
– Based on data from Wikidata, we created a large-scale corpus of relational databases from various domains to support the development of foundation models for tabular data
Foundation Models for Automating Data Engineering on Structured Data (Liane Vogel)
– The project aims to explore the usage of pre-trained deep neural networks on structured data from databases in order to reduce the manual overhead on data engineering tasks like data cleaning.
Table Retrieval From Data Lakes (Jan-Micha Bodensohn)
Linking Research Data Management and HPC (Marcel Nellesen)
– Provide easy ways to apply for storage space similar to HPC applications
– Data management, access workflows, and knowledge-graph-based metadata management
– Data Transfers between S3 solutions and HPC nodes
– Efficient metadata management with support for automated extraction of metadata
– Connection to many other federal and national RDM initiatives (NFDI, fdm.nrw…)
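
As an illustration of what automated extraction of metadata can look like at the file level, here is a minimal sketch using only the Python standard library. It is not the Coscine implementation and does not use any Coscine API. The function name `extract_metadata`, the field names, and the sample file are assumptions chosen for this example.

```python
from __future__ import annotations

import hashlib
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def extract_metadata(path: Path) -> dict[str, object]:
    """Collect basic technical metadata for one file (illustrative field names)."""
    stat = path.stat()
    # Reads the whole file into memory; acceptable for a sketch, not for very large files.
    sha256 = hashlib.sha256(path.read_bytes()).hexdigest()
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "mime_type": mime_type or "application/octet-stream",
        "sha256": sha256,
    }


# Usage on a temporary sample file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "measurements.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    meta = extract_metadata(sample)

print(meta)
```

A checksum and a modification time are cheap to compute automatically, and they let a later transfer or migration verify that the file arrived unchanged.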

Training offers 2025:

Introduction to git and FDM with GitLab
Community Workshop: Materials Science with Advanced Data Management and Data Science Techniques
RDM in NHR: Efficient Data Exchange and User-Friendly HPC
RDM in NHR: Getting started with RDM
Large Language Models for Data Wrangling

Support activities:

JARDS as a source of project-specific metadata and a common entry point for requesting resources (computing time and storage, together with NFDI4Ing)
Coscine as an integration platform that allows services such as the archive, research data storage (RDS.NRW), and GitLab, as well as external storage systems, to be linked with one another at project level and stored with metadata

Teaching activities:

“Data Literacy for All” (hub with links to existing good online courses, best practices, online exercises, …): https://data-ai-literacy.ml/
Hackathons with tools for data engineering
Research Data Management with GitLab

Video

Benjamin Hättasch: WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries

Gallery

WikiDBs: currently the largest open collection of real-world relational databases. Code and data are open source, with ~325 downloads since December 2024. Spotlight paper at NeurIPS 2024 (A* ML conference).

Enabling the research data management platform Coscine for HPC use cases. Coscine supports researchers throughout the entire research data life cycle and currently has more than 3300 users and more than 4750 projects.

Members

Prof. Dr. Carsten Binnig

TU Darmstadt

Prof. Dr. Christian Bischof

TU Darmstadt

Jan-Micha Bodensohn

TU Darmstadt

Prof. Dr. Matthias Müller

RWTH Aachen University

Marcel Nellesen

RWTH Aachen University

Liane Vogel

TU Darmstadt

Publications

2025

Towards Complex Table Question Answering Over Tabular Data Lakes. (Daniela Risis, Jan-Micha Bodensohn, Matthias Urban, Carsten Binnig), DE4DS@BTW’25

2024

Automating Enterprise Data Engineering with LLMs. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, Carsten Binnig), TRL@NeurIPS’24
WikiDBs: A Large-Scale Corpus Of Relational Databases From Wikidata. (Liane Vogel, Jan-Micha Bodensohn, Carsten Binnig), NeurIPS’24 Datasets&Benchmarks Track
CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), CIDR’24
Demonstrating CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), SIGMOD’24 Demo Track
Rethinking Table Retrieval from Data Lakes. (Jan-Micha Bodensohn, Carsten Binnig), aiDM’24@SIGMOD’24
LLMs for Data Engineering on Enterprise Data. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig), TaDA@VLDB’24

2023

WannaDB: Ad-hoc SQL Queries over Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Liane Vogel, Matthias Urban, Carsten Binnig), Datenbanksysteme für Business, Technologie und Web (BTW 2023)
Carrots and Sticks: Motivating with Storage for Good RDM (Ilona Lang, Marcel Nellesen, Lukas Bossert, Marius Politze)
RDM Platform Coscine – FAIR play integrated right from the start (Ilona Lang, Marcel Nellesen, Marius Politze)
OmniscientDB: A Large Language Model-Augmented DBMS That Knows What Other DBMSs Do Not Know (Matthias Urban, Duc Dat Nguyen and Carsten Binnig), AIDM@SIGMOD
WikiDBs: A corpus of relational databases from wikidata (Liane Vogel, Carsten Binnig), TADA@VLDB 2023

2022

Towards Foundation Models for Relational Databases. (Liane Vogel, Benjamin Hilprecht, Carsten Binnig), TRL@NeurIPS 2022
ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. (Tobias Ziegler, Carsten Binnig, Viktor Leis), SIGMOD 2022

2021

It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. (Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig), AIDB@VLDB 2020
ASET: Ad-hoc Structured Exploration of Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig), AIDB@VLDB 2021