CSG Data Management
and RWTH Aachen University
The CSG Data Management develops automated tools to reduce the manual overhead of data engineering tasks such as metadata annotation, data cleaning, and data transformation. Furthermore, we aim to integrate research data management (RDM) more closely with HPC in order to improve system integration and avoid unnecessary data copies.
We pursue these goals with the following methods:
- AI to automate data engineering tasks (e.g., automated data cleaning)
- Lineage over complete data engineering workflows (i.e., track which data came from where)
- System integration (e.g., integrate different storage systems)
We work on new tools and provide them as open-source code that can support researchers in their tasks.
Consulting on the use of these tools will be offered through workshops (e.g., hackathons). Additionally, we will create a web platform to increase data literacy, where researchers can find links to other useful tools and to educational resources such as online courses.
Overall, we will also provide a bridge from NHR4CES to the NFDI4x initiatives by supporting users in the use of emerging solutions for automating research data management.
If you have questions for other groups or general questions like access to the HPC infrastructure, have a look at our support website.
Current research topics:
- Pre-trained Models for Automating Data Engineering on Structured Data (Liane Vogel)
– The project explores the use of pre-trained deep neural networks on structured data from databases in order to reduce the manual overhead of data engineering tasks such as data cleaning.
- Table extraction from Text (Benjamin Hättasch)
– In this project, we develop a system to interactively and automatically extract structured information from textual documents without domain-specific training. In addition to direct usage, this and other projects of the Data Management Lab will serve as a foundation for automated metadata annotation.
- Linking Research Data Management and HPC
– We extend the integration platform Coscine with storage types that are used within HPC environments and support metadata storage for HPC processes.
- Data Management, Access Workflows and Knowledge Graph based Metadata Management
- Resource-efficient Uncertainty Quantification of computational results (Moritz Schwarzmeier)
Training offers 2023:
- Will follow shortly!
- “Data Literacy for All” (a hub with links to good existing online courses, best practices, online exercises, etc.), taking NFDI4Ing activities into account
- JARDS as a source of project-specific metadata and a common entry point for requesting resources (computing time and storage, together with NFDI4Ing)
- Coscine, an integration platform that allows services such as the archive, research data storage (RDS.NRW), and GitLab, as well as external storage systems, to be linked with one another at project level and stored with metadata
- “Knowledge Base” with Best Practices for Research Data Management in Research Software Development (together with NFDI4Ing)
- Hackathons with tools for data engineering
- Research Data Management with GitLab
- Liane Vogel, Benjamin Hilprecht, Carsten Binnig: Towards Foundation Models for Relational Databases. Table Representation Learning Workshop @NeurIPS 2022
- Benjamin Hättasch, Jan-Micha Bodensohn, Liane Vogel, Matthias Urban, Carsten Binnig: WannaDB: Ad-hoc SQL Queries over Text Collections. Datenbanksysteme für Business, Technologie und Web (BTW 2023)
- Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig: It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. AIDB@VLDB 2020
- Benjamin Hättasch, Jan-Micha Rainer Bodensohn, Carsten Binnig: ASET: Ad-hoc Structured Exploration of Text Collections. AIDB@VLDB 2021