Project
Reinforcement Learning for LLMs in Process Mining
Process discovery is one of the central tasks in process mining. It takes an event log, which records sequences of activities executed in real processes, and produces a process model that represents the control-flow relations among those activities. Traditional discovery algorithms such as the Alpha Miner, Heuristics Miner, and Inductive Miner analyze the frequency of directly-follows relations and construct models from statistical patterns or recursive decomposition. These methods reach their limits when event logs contain infrequent behavior or when activity names carry semantic information that frequency counts do not capture. Large language models can process natural-language activity labels and recognize semantic relations between activities. Reinforcement learning provides a mechanism to align model outputs with task-specific objectives through numerical reward signals derived from conformance-checking metrics. Fine-tuning requires multiple training runs with different configurations and the generation of several candidate outputs per input during group-based optimization. These computations demand substantial memory and parallel processing capacity. High-performance computing resources enable distributed training across multiple graphics processing units and allow systematic exploration of framework variants within reasonable time. The project uses such resources to conduct large-scale experiments on the Qwen3 model and to evaluate numerous design choices in the reinforcement learning pipeline.
Project Details
Project term
August 1, 2025–February 2, 2026
Affiliations
RWTH Aachen University
Institute
Chair of Process and Data Science
Principal Investigator
Methods
The framework receives event logs formatted as sequences of activities separated by arrows. The language model generates textual representations of process trees that use operators for sequence, choice, parallel, and loop constructs. Reward models evaluate each generated tree by first converting it to a Petri net and then applying token-based replay to compute fitness and precision values. The content-based score combines these two values through their harmonic mean. Additional reward components measure the textual length of the tree, to promote structural simplicity, and the edit distance to a reference tree produced by the Inductive Miner. Training proceeds with the group relative policy optimization (GRPO) algorithm, which samples multiple outputs per prompt and normalizes advantages within each group. The training dataset consists of synthetic event logs divided into easy, medium, and hard categories according to the number of activities, the trace length, and the trace count. Experiments test variations in prompting strategy, curriculum learning scheme, reward model design, fine-tuning pipeline, reinforcement learning algorithm, and hyperparameter values such as group size and maximum completion length. The implementation builds upon the trl library and runs on two compute nodes with a total of eight graphics processing units. Training configurations keep the learning rate, batch size, and numerical precision consistent across variants while adjusting only the targeted component under investigation. This setup allows direct comparison of the impact of each design choice on model performance.
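The two core computations described above, the content-based score as the harmonic mean of fitness and precision, and the within-group advantage normalization of GRPO, can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual implementation: the function names `content_score` and `group_relative_advantages` are hypothetical, and in the real pipeline the fitness and precision values would come from token-based replay of the log on the Petri net (e.g., via a process mining library), not from hard-coded numbers.

```python
from statistics import mean, pstdev


def content_score(fitness: float, precision: float) -> float:
    """Harmonic mean of replay fitness and precision (both in [0, 1])."""
    if fitness + precision == 0.0:
        return 0.0
    return 2.0 * fitness * precision / (fitness + precision)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: each sampled completion's advantage is its
    reward minus the group mean, divided by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    eps = 1e-8  # guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative only: four candidate process trees sampled for one prompt,
# each with (fitness, precision) values as token-based replay might yield.
candidates = [(0.95, 0.70), (0.80, 0.80), (0.60, 0.40), (1.00, 0.20)]
rewards = [content_score(f, p) for f, p in candidates]
advantages = group_relative_advantages(rewards)
```

The harmonic mean penalizes imbalance: a tree with perfect fitness but very low precision scores worse than one with moderate values of both, which matches the project's goal of balancing the two metrics rather than maximizing either alone.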
Results
Evaluation uses synthetic datasets with 200 logs at each difficulty level and five subsets extracted from the Sepsis Cases event log. The base model without fine-tuning achieves content scores of 0.61 on easy logs, 0.33 on medium logs, and 0.09 on hard logs. Fine-tuned variants reach content scores between 0.52 and 0.89 on easy logs, between 0.40 and 0.97 on medium logs, and between 0.03 and 0.94 on hard logs. The configuration with one-shot prompting, a cyclic curriculum, the content-based reward, and group relative policy optimization produces the most consistent improvements across difficulty levels. On the real-world log, the best variants obtain content scores between 0.52 and 0.79, compared with 0.25 for the base model. These variants exhibit higher precision values than the Inductive Miner while maintaining competitive fitness. Training reward curves show steady increases within each difficulty stage and temporary drops when the curriculum advances to the next level. Process trees generated by variants that include a supervised fine-tuning stage show activity groupings that align with clinical sequences, such as laboratory tests followed by treatment and admission steps. The results indicate that the reinforcement learning approach enables the model to balance fitness and precision more effectively than the unmodified base model.
Discussion
The experimental outcomes confirm that reinforcement learning fine-tuning improves the capacity of the language model to generate process trees that align with the behavior recorded in event logs. The framework demonstrates adaptability across process complexities when training follows a cyclic curriculum schedule. Configurations that rely exclusively on the content-based reward achieve more stable convergence than those that combine multiple reward terms. The language model captures semantic relations present in the activity names of the hospital log and produces sequences that correspond to domain procedures more closely than frequency-driven methods. Limitations appear in the form of truncated outputs when process structures exceed the maximum completion length and in occasional syntactic parsing failures. The reliance on synthetic data restricts exposure to the variability found in actual processes. Future investigations can examine larger model variants and incorporate reward models that integrate alignment-based metrics or learned evaluators. Extensions can also integrate real-world logs during training and explore reasoning mechanisms to support more complex control-flow abstractions. Overall, the project establishes the feasibility of reinforcement-learning-based optimization for language-model-driven process discovery and identifies directions for further refinement of the framework.
Additional Project Information
DFG classification: 409-06 Information Systems, Process and Knowledge Management
Software: PyTorch
Cluster: CLAIX
Publications
Alessandro Berti, Xiaoting Wang, Humam Kourani, Wil M.P. van der Aalst,
Specializing Large Language Models for Process Modeling via Reinforcement Learning with Verifiable and Universal Reward,
https://dx.doi.org/10.36227/techrxiv.175977593.34948838/v1, September 2025
Thesis:
Xiaoting Wang,
Reinforcement Learning for LLMs in Process Mining,
Master's Thesis, 2026.