Source Data Preparation & Common Challenges
Overview
Clean source data is the single best predictor of a smooth process-mining project. When timestamps are ambiguous, case identifiers missing or activity names inconsistent, a mining engine has to guess the real sequence or causal logic of events—often with spectacularly wrong results. The guidelines below bundle field experience and research so you can design or repair your process data before they reach the Noreja tool.
Timestamp Formats
Why ISO 8601 (and a real DATETIME column) matters
ISO 8601 (YYYY-MM-DDThh:mm:ss[.fff]Z) is lexicographically sortable, time-zone aware and unambiguous across locales. Storing it in a proper DATETIME/TIMESTAMP datatype—not in VARCHAR—keeps the index small, enables range scans and prevents silent string-to-date conversion errors.
Typical non-ISO cases you will meet in relational databases
- Split date & time columns (
OrderDate,OrderTime) that should be concatenated. - Locale-specific text (
04/05/25 14:00– US vs EU interpretation) - 12-hour clock with AM/PM (
7-Jan-25 3:45 PM) that sorts incorrectly as a string. - Unix/Epoch integers (
1730870400) that need casting and time-zone offsets. - Excel serial numbers (45234.5833) arriving via CSV exports.
- Compressed numerics (
20250706)—easy to read, hard to timestamp-diff. - Text with month names (
10-APR-25 23.12.17:54) that break simple parsers. - Missing time-zone suffix (server local time only).
- Truncated precision (milliseconds chopped off).
- NULL placeholders that create artificial “gaps” and break throughput KPIs.
Best practice → normalise everything to UTC, keep the original raw string in a shadow column for audit purposes.
Technical vs Business timestamps
Two clocks, two objectives
Technical timestamps are written by the system at micro- or millisecond precision (e.g., message broker arrival). Business timestamps are coarse, often date-only, reflecting human actions or daily closing batches. Cutting the latter down to “whole days” while keeping the former at millisecond granularity will make concurrently created events appear out of order.
Misalignment pitfalls
- Identical rounded times – Different activities share
2025-07-01 00:00:00once milliseconds are dropped, so the process mining algorithm cannot decide which came first. - Duration inflation/deflation – Sub-second waits collapse to zero; overnight jobs inflate to 24 h when dates are stored without time parts.
- Cross-system merging – Logs from New York (
UTC-4) and Phoenix (UTC-7) recorded in local time seem three hours apart although the events were synchronous.
Mitigation → store the highest precision available, in UTC, then round during reporting, never in the warehouse.
Missing Primary Keys of Data Object
Causal process mining needs a stable identifier of data objects to separate individual cases (instances). When the source table lacks a primary key or the business object has no natural ID.
Repair strategies
- Synthesize: hash concatenation of invariant columns (
SHA2(CustomerID||OrderDate||Line#)). - Sequence windows: use database
ROW_NUMBER() OVERper business document/date to create a surrogate. - Probabilistic correlation or machine-learning linking when no deterministic key exists.
Activity column quality
An activity name should tell the miner “what happened now”. Common defects:
| Defect | Symptom | Impact |
|---|---|---|
| Typos / many aliases | Ship, Shipping, SHIPPED | Explodes variant count |
| NULL / blank values | “Unnamed activity” buckets | Breaks conformance checks |
| Mixed granularity | Invoice Posted vs Send Invoice Email | Causes spaghetti models with high amout of process vairants |
| Overloaded field | “Status” contains both lifecycle + org unit | Hard to define separators |
Quick reference checklist
- Store timestamps in UTC ISO 8601 with full precision; keep raw originals.
- Never truncate high-resolution logs; round only in the reporting layer.
- Guarantee a unique case ID (or object ID) —synthesize it if the source has none.
- Keep the activity column human-readable, finite and typo-free.
- Run a data-quality scan before every extraction load; fix issues early.
Following these practices turns extraction into the least exciting part of your process-mining journey—exactly where plumbing should be.