Project
Reinforcement Learning for LLMs in Process Mining
Business processes in companies leave rich digital traces (event logs). Process mining turns these traces into models that show how the work really flows. Today’s large language models (LLMs) can read and explain text well, but they are not specialized for process mining: out‑of‑the‑box they often produce incorrect or unusable models. Our project studies whether reinforcement learning (RL) can “teach” an LLM to solve core process‑mining tasks such as process discovery and conformance checking.
We treat the problem as training an AI assistant that proposes a process model and then receives feedback about how good that model is. The assistant gradually learns to prefer models that fit the data and the description. High‑performance computing (HPC) on CLAIX is essential, because training even small open models requires many parallel trials on graphics processors (GPUs) and repeated evaluation of candidate models. The project runs from July to October 2025 (extension requested to January 2026).
Project Details
Project term
August 7, 2025–October 31, 2025
Affiliations
RWTH Aachen University
Institute
Chair of Process and Data Science
Principal Investigator
Methods
We built a training pipeline that connects three parts: (i) a language model that generates short Python programs constructing a process model; (ii) an automatic checker that verifies whether the program is valid and whether the model’s behavior matches the textual description or a reference log; and (iii) a learning loop that rewards better candidates and penalizes worse ones.
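To give a flavor of part (ii), the following self‑contained Python sketch executes a candidate program in an isolated namespace and reports whether it yields a model object; the variable name `model` and the toy dictionary format are assumptions for illustration, not the project’s actual interface.

```python
# Minimal sketch (not the project's actual checker) of step (ii): executing a
# generated Python program and checking that it produces a model object.
# The variable name "model" and the dict-based model format are assumptions.

def validate_program(program_text: str):
    """Run the generated code in an isolated namespace.

    Returns the constructed model object on success, or None if the code
    raises an exception or does not define a variable called `model`.
    """
    namespace: dict = {}
    try:
        exec(program_text, namespace)  # execute the candidate program
    except Exception:
        return None                    # structurally invalid generation
    return namespace.get("model")      # None if the expected variable is missing


# Toy usage: a well-formed candidate and a broken one.
good = "model = {'activities': ['A', 'B'], 'order': [('A', 'B')]}"
bad = "model = {'activities': ['A', 'B'"  # syntax error
assert validate_program(good) is not None
assert validate_program(bad) is None
```

In the actual pipeline, the object returned by a valid program is a process model that the subsequent checks score against the textual description or the reference log.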
For rewards we combine two kinds of signals. Verifiable signals are computed by software: structural checks (e.g., correct use of choice and loop constructs) and behavioral checks based on footprints (whether activity A may directly follow activity B, and whether two activities can run in either order). Universal signals come from an evaluator model (“LLM as a judge”) that scores how well a candidate fulfills the prompt. A stable policy‑optimization method then nudges the language model toward candidates that score above the group average for the same prompt.
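To illustrate this last step, the sketch below blends a verifiable score with a judge score and normalizes each candidate’s reward against the group average for the same prompt, in the spirit of group‑relative policy optimization; the equal 0.5/0.5 weighting and the helper names are illustrative assumptions rather than the project’s exact configuration.

```python
# Sketch of group-relative reward normalization; the 0.5/0.5 weighting and
# the helper names are illustrative assumptions, not the project's settings.
from statistics import mean, pstdev

def combined_reward(verifiable_score: float, judge_score: float) -> float:
    """Blend a software-computed check (0..1) with an LLM-judge score (0..1)."""
    return 0.5 * verifiable_score + 0.5 * judge_score

def group_advantages(rewards: list[float]) -> list[float]:
    """Center each candidate's reward on the group mean for the same prompt.

    Candidates scoring above the group average get a positive advantage and
    are reinforced; those below get a negative advantage and are discouraged.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0          # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Example: four candidates generated for the same prompt.
rewards = [combined_reward(v, j)
           for v, j in [(1.0, 0.8), (0.0, 0.2), (1.0, 0.6), (0.5, 0.5)]]
print(group_advantages(rewards))  # positive only for above-average candidates
```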
As an intermediate output we target a compact, checkable notation (POWL: Partially Ordered Workflow Language), which lets us verify and visualize results and convert them into standard models when needed. The software stack uses Python, PyTorch, and Hugging Face for the model, and PM4Py for process‑mining checks. Jobs run on the cluster’s batch system. We curated a dataset of 1,312 textual process descriptions paired with reference models to support supervised warm‑up and reinforcement learning.
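To make the footprint‑based behavioral check concrete, here is a simplified, self‑contained example that derives directly‑follows and parallelism relations from traces and measures how much of the reference behavior a candidate reproduces; the project’s checks rely on PM4Py’s footprint machinery over POWL‑derived models rather than this toy code.

```python
# Simplified footprint comparison (toy version; the project uses PM4Py's
# footprint utilities on POWL-derived models instead of raw traces).

def footprints(traces):
    """Return the directly-follows relation and the parallelism relation."""
    follows = set()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            follows.add((a, b))
    # A and B are treated as parallel if each may directly follow the other.
    parallel = {(a, b) for (a, b) in follows if (b, a) in follows}
    return follows, parallel

def footprint_agreement(ref_traces, gen_traces) -> float:
    """Share of reference directly-follows pairs reproduced by the candidate."""
    ref_follows, _ = footprints(ref_traces)
    gen_follows, _ = footprints(gen_traces)
    if not ref_follows:
        return 1.0
    return len(ref_follows & gen_follows) / len(ref_follows)

# Example: the reference allows B and C in either order; the candidate does not.
reference = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]
candidate = [["A", "B", "C", "D"]]
print(footprint_agreement(reference, candidate))  # 0.5: concurrency of B and C is lost
```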
Results
(Period covered: July–October 2025.)
* A functioning end‑to‑end pipeline is in place (code generation → validation → scoring → learning), together with scripts for data preparation, logging, and evaluation.
* We trained several small open models (≈1–3 B parameters) using the above pipeline. On our 312‑description test split, the share of structurally invalid generations dropped from about 71% with the untrained baseline to well below 1% after reinforcement learning.
* Behavioral quality improved markedly. When comparing the generated models to references using footprints, the reinforced models achieved consistently higher agreement and produced more faithful parallelism and ordering.
* On an external benchmark for process modeling, our reinforced checkpoint reached average scores close to much larger proprietary systems while producing fewer unusable outputs.
* We prepared and submitted a preprint describing the method and integrated the core routines into our open PM4Py tooling.
* Human‑readable documentation and an initial tutorial were drafted to support reuse in the group and by students.
HPC usage: the project used roughly half of the requested GPU time by the end of October 2025, with peak usage in September during full‑scale runs (up to 32 GPUs per run for 10–20 hours). CPU time supported conformance checking and log processing. About 4 TB of storage is occupied by datasets, checkpoints, and logs. We experienced no blocking technical issues on CLAIX.
Discussion
The results show that reinforcement learning can specialize a language model for process‑mining tasks and turn mostly unusable generations into reliable models. The combination of verifiable checks and evaluator feedback proved practical: the judge signal quickly reduced format errors, while the behavioral checks anchored the model to the intended ordering and concurrency. HPC was critical: training requires thousands of batched trials, repeated scoring, and frequent checkpointing—workloads that are only practical on many GPUs in parallel.
Limitations remain. Reward quality determines progress; footprint‑based checks capture only part of the semantics, and evaluator models may be biased. The current system targets a specific intermediate notation; broader support (e.g., BPMN) and learning directly from real event logs are planned. We also aim to study efficiency techniques (parameter‑efficient fine‑tuning, quantization) to reduce compute needs.
Outlook (extension to January 2026): finalize experiments on process discovery from logs, broaden evaluation to conformance checking, complete the master’s thesis, and release cleaned code, datasets, and trained checkpoints. We expect to deliver a documented RL framework and specialized model that the group can reuse in future HPC‑supported studies.
Additional Project Information
DFG classification: 409-06 Information Systems, Process and Knowledge Management
Software: PyTorch
Cluster: CLAIX
Publications
Alessandro Berti, Xiaoting Wang, Humam Kourani, Wil M.P. van der Aalst
Specializing Large Language Models for Process Modeling via Reinforcement Learning with Verifiable and Universal Rewards
https://dx.doi.org/10.36227/techrxiv.175977593.34948838/v1, 2025