A cooperation of TU Darmstadt
and RWTH Aachen University

Cross-Sectional Group

The CSG Data Management develops automated tools to reduce the manual overhead of data engineering tasks such as metadata annotation, data cleaning, and data transformation. Furthermore, we aim to integrate research data management (RDM) more closely with HPC, which improves system integration and avoids unnecessary data copies.

We pursue these goals with the following methods:

  • AI to automate data engineering tasks (e.g., automated data cleaning)
  • Lineage over complete data engineering workflows (i.e., track which data came from where)
  • System integration (e.g., integrate different storage systems)
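
As a rough illustration of what lineage tracking means in practice, the following minimal Python sketch records, for every derived dataset, which input and which step produced it. It is not taken from any of the group's tools: the names `Dataset`, `derive`, `drop_missing`, `to_kelvin`, and the field `temp_c` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A named list of records plus the lineage of how it was produced."""
    name: str
    rows: list
    lineage: list = field(default_factory=list)  # one entry per derivation step


def derive(source: Dataset, name: str, step: str, transform) -> Dataset:
    """Apply `transform` to the rows of `source` and record where the result came from."""
    entry = {"step": step, "input": source.name, "output": name}
    return Dataset(name, transform(source.rows), source.lineage + [entry])


def drop_missing(rows):
    """Cleaning step: drop records that contain a missing (None) value."""
    return [r for r in rows if None not in r.values()]


def to_kelvin(rows):
    """Transformation step: convert the hypothetical `temp_c` field to Kelvin."""
    return [{**r, "temp_k": r["temp_c"] + 273.15} for r in rows]


# Hypothetical example data: one complete record and one with a missing value.
raw = Dataset("raw_measurements", [{"id": 1, "temp_c": 20.0}, {"id": 2, "temp_c": None}])
cleaned = derive(raw, "cleaned", "drop_missing", drop_missing)
final = derive(cleaned, "with_kelvin", "to_kelvin", to_kelvin)

# Walk the recorded lineage to see which data came from where.
for entry in final.lineage:
    print(f'{entry["output"]} <- {entry["step"]}({entry["input"]})')
```

Each derived dataset carries the full chain of steps back to its raw source, so the origin of any result can be reconstructed without consulting the code that produced it.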

We work on new tools and provide them as open-source code that can support researchers in their tasks.

Consulting on the use of these tools will be provided through workshops (e.g., hackathons). Additionally, we will create a web platform to increase data literacy, where researchers can find links to other useful tools and to educational resources such as online courses.

Overall, we will also provide a bridge from NHR4CES to the NFDI4x initiatives by supporting users in the use of emerging solutions for automating research data management.

If you have questions for other groups or general questions, for example about access to the HPC infrastructure, have a look at our support website.

Current research topics:

  • Foundation Models for Automating Data Engineering on Structured Data (Liane Vogel)
    – The project aims to explore the use of pre-trained deep neural networks on structured data from databases in order to reduce the manual overhead of data engineering tasks like data cleaning.
  • Table Retrieval From Data Lakes (Jan-Micha Bodensohn)
  • Multi-modal Databases (Matthias Urban)
  • Linking Research Data Management and HPC (Marcel Nellesen)
    – We extend the integration platform Coscine with storage types that are used within HPC environments and add support for metadata storage for HPC processes.
  • Data Management, Access Workflows and Knowledge Graph based Metadata Management

Training offers 2024:

Support activities:

  • “Data Literacy for All” (a hub with links to good existing online courses, best practices, online exercises, and similar resources): https://data-ai-literacy.ml/
  • JARDS as a source of project-specific metadata and a common entry point for requesting resources (computing time and storage together with NFDI4Ing)
  • Coscine, an integration platform that allows services such as the archive, research data storage (RDS.NRW), and GitLab, as well as external storage systems, to be linked with one another at project level and stored with metadata

Teaching activities:

Project partners

Members

Prof. Dr. Carsten Binnig

TU Darmstadt

Prof. Dr. Christian Bischof

TU Darmstadt

Jan-Micha Bodensohn

TU Darmstadt

Prof. Dr. Matthias Müller

RWTH Aachen University

Marcel Nellesen

RWTH Aachen University

Matthias Urban

TU Darmstadt

Liane Vogel

TU Darmstadt

Publications

2024

  • CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), CIDR’24
  • Demonstrating CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), SIGMOD’24 Demo Track
  • Rethinking Table Retrieval from Data Lakes. (Jan-Micha Bodensohn, Carsten Binnig), aiDM’24@SIGMOD’24

2023

  • WannaDB: Ad-hoc SQL Queries over Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Liane Vogel, Matthias Urban, Carsten Binnig), Datenbanksysteme für Business, Technologie und Web (BTW 2023)
  • Carrots and Sticks: Motivating with Storage for Good RDM (Ilona Lang, Marcel Nellesen, Lukas Bossert, Marius Politze)
  • RDM Platform Coscine – FAIR play integrated right from the start (Ilona Lang, Marcel Nellesen, Marius Politze)
  • OmniscientDB: A Large Language Model-Augmented DBMS That Knows What Other DBMSs Do Not Know (Matthias Urban, Duc Dat Nguyen and Carsten Binnig), AIDM@SIGMOD
  • WikiDBs: A Corpus of Relational Databases from Wikidata (Liane Vogel, Carsten Binnig), TADA@VLDB 2023

2022

  • Towards Foundation Models for Relational Databases. (Liane Vogel, Benjamin Hilprecht, Carsten Binnig), TRL@NeurIPS 2022
  • ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. (Tobias Ziegler, Carsten Binnig, Viktor Leis), SIGMOD 2022

2021

  • It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. (Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig), AIDB@VLDB 2020
  • ASET: Ad-hoc Structured Exploration of Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig), AIDB@VLDB 2021