Source Data Preparation & Common Challenges

High-quality source data is the strongest predictor of a successful process mining initiative. Ambiguous timestamps, missing case identifiers, or inconsistent activity names force mining engines to infer causal sequences—often resulting in distorted models and misleading insights. Proper normalization, timestamp precision, stable object identifiers, and clean activity definitions are therefore not technical niceties but structural prerequisites for reliable causal process analysis. The following best practices consolidate research insights and field experience to ensure your data foundation is robust before entering Noreja.

Overview

Clean source data is the single best predictor of a smooth process-mining project. When timestamps are ambiguous, case identifiers are missing, or activity names are inconsistent, a mining engine has to guess the real sequence or causal logic of events—often with spectacularly wrong results. The guidelines below bundle field experience and research so you can design or repair your process data before it reaches Noreja.

Timestamp Formats

Why ISO 8601 (and a real DATETIME column) matters

ISO 8601 (YYYY-MM-DDThh:mm:ss[.fff]Z) is lexicographically sortable, time-zone aware and unambiguous across locales. Storing it in a proper DATETIME/TIMESTAMP datatype—not in VARCHAR—keeps the index small, enables range scans and prevents silent string-to-date conversion errors.
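The sortability claim can be checked directly: a minimal Python sketch (with made-up timestamps) showing that plain string order of ISO 8601 UTC values matches true chronological order.

```python
from datetime import datetime

# Hypothetical ISO 8601 UTC timestamps.
iso_strings = [
    "2025-07-06T14:30:00Z",
    "2025-07-06T09:05:12Z",
    "2024-12-31T23:59:59Z",
]

# Plain lexicographic sort of the raw strings...
lexicographic = sorted(iso_strings)

# ...matches a chronological sort of the parsed datetimes.
chronological = sorted(
    iso_strings,
    key=lambda s: datetime.fromisoformat(s.replace("Z", "+00:00")),
)

assert lexicographic == chronological
```

This property is exactly why ISO 8601 strings in log files stay sortable even before they are parsed; locale formats like `04/05/25` do not have it.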

Typical non-ISO cases you will meet in relational databases

  • Split date & time columns (OrderDate, OrderTime) that should be concatenated.
  • Locale-specific text (04/05/25 14:00 – US vs EU interpretation)
  • 12-hour clock with AM/PM (7-Jan-25 3:45 PM) that sorts incorrectly as a string.
  • Unix/Epoch integers (1730870400) that need casting and time-zone offsets.
  • Excel serial numbers (45234.5833) arriving via CSV exports.
  • Compressed numerics (20250706)—easy to read, hard to timestamp-diff.
  • Text with month names (10-APR-25 23:12:17) that break simple parsers.
  • Missing time-zone suffix (server local time only).
  • Truncated precision (milliseconds chopped off).
  • NULL placeholders that create artificial “gaps” and break throughput KPIs.

Best practice → normalise everything to UTC and keep the original raw string in a shadow column for audit purposes.
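To make the normalisation step concrete, here is a sketch covering three of the cases above: Unix epoch integers, Excel serial numbers, and US-locale text. The helper names are illustrative, the Excel conversion assumes the common 1900 date system, and the locale parser assumes US month-first order with UTC wall time.

```python
from datetime import datetime, timedelta, timezone

def epoch_to_utc(seconds: int) -> str:
    """Unix epoch seconds -> ISO 8601 UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

def excel_serial_to_utc(serial: float) -> str:
    """Excel serial day number -> ISO 8601 UTC (assumes the 1900 date system)."""
    # 1899-12-30 is the conventional base that corrects Excel's
    # fictitious 1900-02-29 for all modern dates.
    base = datetime(1899, 12, 30, tzinfo=timezone.utc)
    return (base + timedelta(days=serial)).isoformat()

def us_locale_to_utc(text: str) -> str:
    """Parse '04/05/25 14:00' assuming US month-first order and UTC wall time."""
    dt = datetime.strptime(text, "%m/%d/%y %H:%M").replace(tzinfo=timezone.utc)
    return dt.isoformat()
```

The key point is that every converter ends in the same place: a timezone-aware UTC value, with the raw input preserved elsewhere for audit.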

Technical vs Business timestamps

Two clocks, two objectives

Technical timestamps are written by the system at micro- or millisecond precision (e.g., message broker arrival). Business timestamps are coarse, often date-only, reflecting human actions or daily closing batches. Cutting the latter down to “whole days” while keeping the former at millisecond granularity will make concurrently created events appear out of order.

Misalignment pitfalls

  • Identical rounded times – Different activities share 2025-07-01 00:00:00 once milliseconds are dropped, so the process mining algorithm cannot decide which came first.
  • Duration inflation/deflation – Sub-second waits collapse to zero; overnight jobs inflate to 24 h when dates are stored without time parts.
  • Cross-system merging – Logs from New York (UTC-4) and Phoenix (UTC-7) recorded in local time seem three hours apart although the events were synchronous.

Mitigation → store the highest precision available, in UTC, then round during reporting, never in the warehouse.
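The cross-system pitfall above can be demonstrated in a few lines: two events logged at different wall-clock times in New York and Phoenix are actually simultaneous once both are converted to UTC (a minimal sketch with invented event times).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical log entries recorded in local wall-clock time.
ny  = datetime(2025, 7, 1, 10, 0, tzinfo=ZoneInfo("America/New_York"))
phx = datetime(2025, 7, 1, 7, 0, tzinfo=ZoneInfo("America/Phoenix"))

# In July, New York is UTC-4 (EDT) and Phoenix UTC-7 (no DST): the wall
# clocks read three hours apart, yet both events occurred at 14:00 UTC.
assert ny.astimezone(timezone.utc) == phx.astimezone(timezone.utc)
```

Converting to UTC at load time, before any merge, is what prevents the “three hours apart” artefact described above.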

Missing Primary Keys of Data Objects

Causal process mining needs a stable identifier for each data object to separate individual cases (instances). When the source table lacks a primary key or the business object has no natural ID, cases cannot be separated reliably and a surrogate identifier has to be constructed.

Repair strategies

  • Synthesize: hash concatenation of invariant columns (SHA2(CustomerID||OrderDate||Line#)).
  • Sequence windows: use database ROW_NUMBER() OVER per business document/date to create a surrogate.
  • Probabilistic correlation or machine-learning linking when no deterministic key exists.
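The first repair strategy can be sketched as follows. This is a minimal Python equivalent of the SQL hash idea above; the column values and the `||` separator are illustrative assumptions.

```python
import hashlib

def surrogate_case_id(*invariant_cols: str) -> str:
    """Build a stable surrogate case ID by hashing invariant columns.

    An explicit separator avoids collisions such as
    ("AB", "C") vs ("A", "BC") concatenating to the same string.
    """
    raw = "||".join(invariant_cols)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical invariant columns: customer ID, order date, line number.
case_id = surrogate_case_id("C-1042", "2025-07-06", "3")
```

Because the hash is deterministic, re-running the extraction yields the same surrogate key for the same business object, which is the property a case ID must have.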

Activity column quality

An activity name should tell the miner “what happened now”. Common defects:

| Defect | Symptom | Impact |
| --- | --- | --- |
| Typos / many aliases | Ship, Shipping, SHIPPED | Explodes variant count |
| NULL / blank values | “Unnamed activity” buckets | Breaks conformance checks |
| Mixed granularity | Invoice Posted vs Send Invoice Email | Causes spaghetti models with a high number of process variants |
| Overloaded field | “Status” contains both lifecycle + org unit | Hard to define separators |
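The first two defects are usually fixed with a simple cleansing pass before extraction. A minimal sketch, assuming a hand-maintained alias map (the labels below are invented):

```python
# Hypothetical alias map collapsing spelling variants to one canonical label.
ALIASES = {
    "ship": "Ship Goods",
    "shipping": "Ship Goods",
    "shipped": "Ship Goods",
}

def normalise_activity(name):
    """Map raw activity text to a canonical, human-readable label.

    NULL/blank values become an explicit marker instead of silently
    forming an 'Unnamed activity' bucket.
    """
    if name is None or not name.strip():
        return "UNKNOWN ACTIVITY"
    key = name.strip().lower()
    return ALIASES.get(key, name.strip())
```

Keeping the alias map small and reviewed also enforces the “finite vocabulary” goal: every new raw spelling either maps to an existing canonical label or triggers a deliberate decision to add one.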

Quick reference checklist

  • Store timestamps in UTC ISO 8601 with full precision; keep raw originals.
  • Never truncate high-resolution logs; round only in the reporting layer.
  • Guarantee a unique case ID (or object ID) —synthesize it if the source has none.
  • Keep the activity column human-readable, finite and typo-free.
  • Run a data-quality scan before every extraction load; fix issues early.
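The last checklist item can be as simple as a row-level scan over (case ID, activity, timestamp) triples. This sketch is illustrative and not part of Noreja's API; it counts the three defect classes discussed in this article.

```python
from datetime import datetime

def scan_event_rows(rows):
    """Minimal pre-load quality scan: counts missing case IDs,
    blank activities, and unparsable timestamps."""
    issues = {"missing_case_id": 0, "blank_activity": 0, "bad_timestamp": 0}
    for case_id, activity, ts in rows:
        if not case_id:
            issues["missing_case_id"] += 1
        if not (activity or "").strip():
            issues["blank_activity"] += 1
        try:
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            issues["bad_timestamp"] += 1
    return issues

# Hypothetical sample: one clean row, one broken row.
report = scan_event_rows([
    ("C1", "Ship Goods", "2025-07-06T14:00:00Z"),
    (None, "", "not-a-date"),
])
```

Running such a scan on every extraction, and failing the load when counts exceed a threshold, catches defects while they are still cheap to fix in the source system.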

Following these practices turns extraction into the least exciting part of your process-mining journey—exactly where plumbing should be.
