OFX & MT940 Parser Design: Deterministic Statement Parsing for Automated Reconciliation

Parser design sits at the very front of the reconciliation pipeline: it is the adapter stage that turns a raw, protocol-specific bank-statement payload into the normalised transaction contract every downstream matcher depends on. Within the Core Architecture & Bank Feed Ingestion layer, Open Financial Exchange (OFX) and SWIFT MT940 remain the two formats most institutional feeds still emit, and a defect here — a misread debit/credit flag, a dropped continuation line, a float where a Decimal belongs — propagates silently into false matches and phantom balances. This page specifies the finite-state-machine (FSM) architecture, boundary-validation rules, and audit-ready normalization that make OFX and MT940 parsing deterministic enough to survive a SOX walkthrough, and gives a production-grade Python implementation that emits a canonical record for the Transaction Matching Algorithms & Logic cascade to consume.

The engineering mandate is narrow and strict: given identical bytes, the parser must always emit byte-identical canonical output, and any payload it cannot fully validate must be quarantined rather than guessed at. Parsing is a control point, not a convenience.

Prerequisites: Pipeline State Before Parsing Runs

Parsing is invoked after a payload has been fetched and persisted but before any matching logic executes. Three upstream guarantees must already hold:

A retrieved, byte-stable payload. The fetch stage — operating in either a streaming or windowed cadence as described in Real-Time vs Batch Ingestion — must have written the raw bytes to durable storage and recorded a source_hash (SHA-256 over the raw payload) before the parser touches them. The parser reads the persisted bytes, never the live socket, so a network retry cannot corrupt mid-parse state.
Resolved, scope-limited credentials. Any decryption key, client certificate, or feed token the payload required must have been injected through Secure API Token Management. The parser itself holds no credentials and must never log them, even inside an exception handler.
A registered canonical schema version. The parser projects into the normalised transaction contract defined by the ingestion layer; the active parser_version and target schema revision must be pinned so output is reproducible. Currency normalization rules are deferred to the Multi-Currency Ledger Mapping stage — the parser captures transaction_currency verbatim and does not convert.

If any precondition is unmet, the correct behaviour is to refuse to parse and emit a PRECONDITION_FAILED audit event rather than produce a partial record.
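
The gate described above can be sketched as a pure check that runs before the FSM is ever constructed. This is a minimal illustration, not the ingestion layer's actual interface: `ParseRequest`, its field names, and `check_preconditions` are hypothetical, and the credential check is omitted because the parser never sees credentials.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseRequest:
    """Hypothetical hand-off record written by the fetch stage."""
    raw: bytes
    source_hash: str | None       # SHA-256 hex digest recorded at fetch time
    parser_version: str | None    # pinned parser revision
    schema_version: str | None    # pinned canonical schema revision


def check_preconditions(req: ParseRequest) -> list[str]:
    """Return every unmet precondition; an empty list means it is safe to parse."""
    problems: list[str] = []
    if not req.source_hash:
        problems.append("source_hash missing")
    elif hashlib.sha256(req.raw).hexdigest() != req.source_hash:
        problems.append("source_hash does not match persisted bytes")
    if not req.parser_version:
        problems.append("parser_version not pinned")
    if not req.schema_version:
        problems.append("schema_version not pinned")
    return problems
```

A caller would emit PRECONDITION_FAILED with the returned list as detail whenever it is non-empty, and skip the parse entirely.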

Protocol Divergence and the Two Parsing Models

OFX and MT940 are different data models and demand structurally different parsers; trying to share one tokenizer between them is the most common source of parser drift.

OFX is a hierarchical, SGML/XML-derived envelope. Transaction detail lives in nested <STMTTRN> blocks that inherit <BANKACCTFROM> and <BANKACCTTO> context from enclosing scopes, so the parser must track depth and scope inheritance. Because statement files can reach multiple megabytes, OFX is parsed with a streaming reader (lxml.etree.iterparse, clearing elements as it goes) rather than a DOM, to keep memory bounded. Only OFX 2.x is well-formed XML; legacy OFX 1.x is SGML with a plain-text header and optional closing tags, so it must be normalised to XML before a streaming XML reader can consume it.
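
A minimal sketch of the streaming approach, under two stated assumptions: the payload is already well-formed XML (OFX 2.x), and the standard library's `xml.etree.ElementTree.iterparse` stands in for lxml so the example has no third-party dependency. The function name and the returned dict keys are illustrative only.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal
from typing import BinaryIO, Iterator


def iter_ofx_transactions(stream: BinaryIO) -> Iterator[dict]:
    """Yield one dict per <STMTTRN>, tagged with the enclosing account id."""
    path: list[str] = []          # tags of the currently open elements
    account_id = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        path.pop()                # on "end", path now holds only the ancestors
        if elem.tag == "ACCTID" and path and path[-1] == "BANKACCTFROM":
            # Only the statement-level account sets the inherited context;
            # an <ACCTID> nested under <BANKACCTTO> must not overwrite it.
            account_id = (elem.text or "").strip()
        elif elem.tag == "STMTTRN":
            txn = {
                "account_id": account_id,
                "fitid": elem.findtext("FITID"),
                "posted": elem.findtext("DTPOSTED"),
                # A missing or malformed <TRNAMT> raises instead of defaulting.
                "amount": Decimal(elem.findtext("TRNAMT", default="").strip()),
            }
            elem.clear()          # drop the finished subtree's children and text
            yield txn
```

Tracking the open-element path is what keeps scope inheritance correct: the counterparty account inside a transaction is seen by the reader before the transaction closes, so a depth-blind parser would attribute the entry to the wrong account.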

MT940 is a flat, tag-delimited SWIFT format delivered as end-of-day batches. Each field is introduced by a tag (:20:, :25:, :60F:, :61:, :86:, :62F:), lines are capped at 65 characters, and :86: narratives continue across multiple lines. It is parsed by a line-oriented FSM that respects carriage-return boundaries and continuation rules. The hard part is :61:: value date, entry date, debit/credit mark, amount, transaction type, and bank reference are concatenated into a single unspaced string that must be decomposed positionally and deterministically.

Both adapters converge on the same three-phase pipeline before they emit a record:

Boundary validation. Verify the protocol header (OFXHEADER:100 for legacy OFX, the {1:...} SWIFT block or a leading :20: for MT940) and confirm the declared encoding (ISO-8859-1 for legacy MT940, UTF-8 for modern OFX). An unexpected encoding triggers immediate quarantine, never a silent fallback.
Stateful tokenization. Drive a transition table that maps each incoming token to a parser state. MT940 decomposes :61: and accumulates :86: continuations; OFX tracks element depth so <TRNAMT> and <DTPOSTED> resolve to the correct parent <STMTTRN>.
Schema projection. Map validated primitives onto the canonical contract, preserving a verbatim copy of the raw field fragment for forensic replay.
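
The boundary-validation phase can be illustrated with a small header sniffer. This is a sketch under assumptions: `sniff_protocol` and `decode_declared` are hypothetical helper names, only the headers named above are recognised, and anything else is rejected outright.

```python
def sniff_protocol(raw: bytes) -> str:
    """Classify a payload by its leading bytes, or raise if unrecognised."""
    head = raw.lstrip(b" \t\r\n")[:64]
    if head.startswith(b"OFXHEADER:100"):
        return "OFX"
    if head.startswith(b"{1:") or head.startswith(b":20:"):
        return "MT940"
    raise ValueError("unrecognised protocol header")


def decode_declared(raw: bytes, encoding: str) -> str:
    """Decode with the declared encoding only; never fall back to another one."""
    return raw.decode(encoding, errors="strict")
```

A strict decode is meaningful for UTF-8, where invalid byte sequences raise `UnicodeDecodeError`. ISO-8859-1 assigns a character to every byte value, so for MT940 a decode can never fail and any stricter character-set check has to be applied separately.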

Field lengths, mandatory-versus-optional tags, and balance semantics are fixed by specification; the SWIFT MT940 standard and the OFX specification are the authorities a conformant parser must validate against before projection.

Algorithm and Mechanism: Why a Deterministic FSM

A naive split/regex-only approach fails in production because of multi-line narratives, optional fields, vendor whitespace quirks, and century-ambiguous YYMMDD dates. A tag-aware FSM is the correct abstraction: a finite set of states, an explicit transition table keyed on the recognised tag, and a rule that any unrecognised transition is an error rather than a silently ignored line.
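
The idea of an explicit transition table can be made concrete with a deliberately small sketch. The state names and the subset of tags below are illustrative assumptions, not the full MT940 grammar; the point is only that a (state, tag) pair missing from the table raises instead of being skipped.

```python
# (current state, incoming tag) -> next state. Anything absent is an error.
TRANSITIONS: dict[tuple[str, str], str] = {
    ("START", "20"): "HEADER",
    ("HEADER", "25"): "HEADER",
    ("HEADER", "28C"): "HEADER",
    ("HEADER", "60F"): "OPENED",
    ("OPENED", "61"): "TRANSACTION",
    ("OPENED", "62F"): "CLOSED",          # a statement with no movements
    ("TRANSACTION", "61"): "TRANSACTION",
    ("TRANSACTION", "86"): "NARRATIVE",
    ("TRANSACTION", "62F"): "CLOSED",
    ("NARRATIVE", "61"): "TRANSACTION",
    ("NARRATIVE", "62F"): "CLOSED",
}


def step(state: str, tag: str) -> str:
    """Pure transition function: same inputs always give the same next state."""
    try:
        return TRANSITIONS[(state, tag)]
    except KeyError:
        raise ValueError(f"illegal transition {state} -> :{tag}:") from None


def run(tags: list[str]) -> str:
    """Fold a tag sequence through the table and return the final state."""
    state = "START"
    for tag in tags:
        state = step(state, tag)
    return state
```

Because `step` has no side effects and no hidden inputs, replaying the same tag sequence always ends in the same state, which is the property the replay guarantee depends on.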

Complexity is linear in input size — O(n) over characters for MT940 and O(n) over elements for streaming OFX — with O(1) working memory per transaction because completed records are flushed downstream and not retained. The determinism property is what matters for reconciliation: the transition function is pure, so re-running the parser over the same source_hash reproduces identical output, which is the precondition for safe replay from a dead-letter queue.

The financial-domain caveats are unforgiving. Monetary amounts must be parsed into Decimal with an explicit quantization (ROUND_HALF_EVEN), because IEEE-754 float introduces drift that corrupts an opening_balance + Σ entries == closing_balance check. Sign conventions differ by format (MT940 D/C/RC/RD marks versus OFX signed <TRNAMT>) and must be normalised to one signed convention at projection time. Dates must resolve a century for YYMMDD and normalise to UTC ISO 8601. None of these may be “best effort”: a value that does not parse cleanly is a quarantine event, not a default.
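
The float-versus-Decimal point and the sign normalisation are easy to demonstrate. The helper names below (`to_money`, `signed_from_mt940`) are hypothetical and exist only for this illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def to_money(text: str) -> Decimal:
    """Parse a decimal string and quantize to cents with banker's rounding."""
    return Decimal(text).quantize(CENT, rounding=ROUND_HALF_EVEN)


def signed_from_mt940(mark: str, amount: str) -> Decimal:
    """Apply the MT940 mark: C and RD increase the balance, D and RC decrease it."""
    magnitude = to_money(amount.replace(",", "."))
    return magnitude if mark in {"C", "RD"} else -magnitude


float_total = 0.1 + 0.2                                # binary floating point
decimal_total = to_money("0.10") + to_money("0.20")    # exact decimal arithmetic

print(float_total == 0.3)                  # False: the float sum has drifted
print(decimal_total == Decimal("0.30"))    # True
print(to_money("2.665"), to_money("2.675"))  # 2.66 2.68 (ties go to the even digit)
```

The same one-cent discrepancy that looks harmless here is exactly what makes a balance identity fail, or worse, pass by accident, once thousands of entries are summed.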

Production-Grade Python Implementation

The implementation below is a deterministic MT940 line FSM that validates boundaries, decomposes :61:, accumulates :86: continuations, projects to a pydantic canonical model, and emits a structured audit record carrying trace_id, source_hash, and a match_decision status for every statement processed. It uses Python 3.10+ type hints and Decimal throughout. An equivalent OFX adapter wraps lxml.etree.iterparse and feeds the same CanonicalTxn model, so both formats land on one contract.

python

from __future__ import annotations

import hashlib
import json
import logging
import re
import uuid
from datetime import date, datetime, timezone
from decimal import Decimal, ROUND_HALF_EVEN, InvalidOperation
from enum import Enum
from typing import Iterator

from pydantic import BaseModel, Field, field_validator

audit_log = logging.getLogger("ingestion.audit")

TAG_RE = re.compile(r"^:(?P<tag>\d{2}[A-Z]?):(?P<body>.*)$")
# :61: value(YYMMDD) [entry(MMDD)] mark(C|D|RC|RD) [funds code(1 letter)] amount(d,dd) type(4) ref...
TXN_RE = re.compile(
    r"^(?P<value>\d{6})(?P<entry>\d{4})?(?P<mark>R?[CD])(?P<funds>[A-Z])?"
    r"(?P<amount>\d+,\d{0,2})(?P<ttype>[A-Z0-9]{4})(?P<ref>.*)$"
)


class ParseDecision(str, Enum):
    PARSED = "PARSED"
    QUARANTINED = "QUARANTINED"


class StatementError(Exception):
    """Raised on any boundary or field violation; forces quarantine."""

    def __init__(self, code: str, detail: str) -> None:
        super().__init__(f"{code}: {detail}")
        self.code = code


class CanonicalTxn(BaseModel):
    """The normalised contract every adapter must emit."""

    value_date: date
    amount: Decimal                       # signed: credits positive, debits negative
    transaction_currency: str = Field(min_length=3, max_length=3)
    transaction_type: str
    bank_reference: str
    narrative: str
    source_protocol: str = "MT940"
    parser_version: str
    source_hash: str
    raw_fragment: str

    @field_validator("amount")
    @classmethod
    def _quantize(cls, v: Decimal) -> Decimal:
        return v.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


def _to_signed(mark: str, raw_amount: str) -> Decimal:
    try:
        magnitude = Decimal(raw_amount.replace(",", "."))
    except InvalidOperation as exc:
        raise StatementError("FIELD_PARSE_ERROR", f"bad amount {raw_amount!r}") from exc
    # 'C'/'RD' increase the balance; 'D'/'RC' decrease it.
    return magnitude if mark in {"C", "RD"} else -magnitude


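def _to_date(yymmdd: str) -> date:
    yy = int(yymmdd[:2])
    century = 2000 if yy <= 68 else 1900            # POSIX %y pivot: 00-68 -> 20xx, 69-99 -> 19xx
    try:
        return date(century + yy, int(yymmdd[2:4]), int(yymmdd[4:6]))
    except ValueError as exc:                       # e.g. month 13 or 31 February
        raise StatementError("FIELD_PARSE_ERROR", f"bad date {yymmdd!r}") from exc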


def parse_mt940(
    raw: bytes,
    *,
    parser_version: str,
    currency: str = "EUR",
) -> Iterator[CanonicalTxn]:
    """Deterministic MT940 -> CanonicalTxn projection with audit emission."""
    trace_id = str(uuid.uuid4())
    source_hash = hashlib.sha256(raw).hexdigest()

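    # ISO-8859-1 maps every byte to a code point, so decode() itself cannot fail.
    # Assumption: control bytes other than TAB/CR/LF indicate a wrong or binary payload;
    # relax this check for feeds that wrap messages in SOH/ETX framing.
    text = raw.decode("iso-8859-1")               # legacy MT940 encoding contract
    if re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", text):
        _emit_audit(trace_id, source_hash, ParseDecision.QUARANTINED, "ENCODING_VIOLATION")
        raise StatementError("ENCODING_VIOLATION", "control bytes in payload")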

    state = "HEADER"
    pending: dict | None = None
    count = 0

    for line in text.replace("\r\n", "\n").split("\n"):
        m = TAG_RE.match(line)
        if m is None:                              # continuation of a :86: narrative
            if state == "NARRATIVE" and pending is not None:
                pending["narrative"] += " " + line.strip()
                continue
            if line.strip() == "" or line.strip() == "-":
                continue
            _emit_audit(trace_id, source_hash, ParseDecision.QUARANTINED, "UNRECOGNISED_LINE")
            raise StatementError("UNRECOGNISED_LINE", line[:65])

        tag, body = m.group("tag"), m.group("body")

        if tag == "20":
            state = "STATEMENT"
        elif tag == "61":
            if pending is not None:
                yield _project(pending, currency, parser_version, source_hash)
                count += 1
            t = TXN_RE.match(body)
            if t is None:
                _emit_audit(trace_id, source_hash, ParseDecision.QUARANTINED, "MALFORMED_TXN")
                raise StatementError("MALFORMED_TXN", body[:65])
            try:
                pending = {
                    "value_date": _to_date(t.group("value")),
                    "amount": _to_signed(t.group("mark"), t.group("amount")),
                    "transaction_type": t.group("ttype"),
                    "bank_reference": t.group("ref").strip() or "NONREF",
                    "narrative": "",
                    "raw_fragment": line,
                }
            except StatementError as exc:          # bad amount or date: audit before quarantine
                _emit_audit(trace_id, source_hash, ParseDecision.QUARANTINED, exc.code)
                raise
            state = "TRANSACTION"
        elif tag == "86":
            if pending is not None:
                pending["narrative"] = body.strip()
            state = "NARRATIVE"
        elif tag.startswith("62"):                 # closing balance -> flush last txn
            if pending is not None:
                yield _project(pending, currency, parser_version, source_hash)
                count += 1
                pending = None
            state = "FOOTER"

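    if pending is not None:                        # input ended without :62: -> do not drop last txn
        yield _project(pending, currency, parser_version, source_hash)
        count += 1

    decision = ParseDecision.PARSED if count else ParseDecision.QUARANTINED
    _emit_audit(trace_id, source_hash, decision, f"{count} transactions")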


def _project(p: dict, currency: str, parser_version: str, source_hash: str) -> CanonicalTxn:
    return CanonicalTxn(
        value_date=p["value_date"],
        amount=p["amount"],
        transaction_currency=currency,
        transaction_type=p["transaction_type"],
        bank_reference=p["bank_reference"],
        narrative=p["narrative"],
        parser_version=parser_version,
        source_hash=source_hash,
        raw_fragment=p["raw_fragment"],
    )


def _emit_audit(trace_id: str, source_hash: str, decision: ParseDecision, detail: str) -> None:
    audit_log.info(json.dumps({
        "trace_id": trace_id,
        "source_hash": source_hash,
        "match_decision": decision.value,
        "detail": detail,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }))

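Every statement that is consumed to completion therefore produces exactly one terminal audit line keyed on source_hash, and every quarantine carries a named code. Because parse_mt940 is a generator, the terminal line is written only once the caller has exhausted it, so callers must always iterate to the end. This is the evidence trail an auditor needs to confirm that no transaction was dropped without a record.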

Configuration Rules and Threshold Calibration

Parser behaviour is governed by a small set of explicit parameters. Defaults below are starting points for institutional EUR/USD feeds; tune per bank only with a recorded change to parser_version.

| Parameter | Default | Range | Tuning guidance |
| --- | --- | --- | --- |
| `max_line_length` | `65` | `65`–`80` | SWIFT mandates 65; relax only for a documented vendor that overruns, and log every overrun. |
| `encoding` | `iso-8859-1` (MT940) / `utf-8` (OFX) | per format | Mismatch must quarantine, never fall back silently. |
| `century_pivot` | `68` | `0`–`99` | `YY <= pivot` maps to 20xx. Set below the oldest plausible statement year for the feed. |
| `amount_scale` | `2` | `0`–`4` | Decimal places for quantization; raise only for currencies whose minor unit has more than two decimal places (for example BHD or KWD). |
| `rounding_mode` | `ROUND_HALF_EVEN` | fixed | Banker’s rounding; do not change, because it is part of the audit contract. |
| `nonref_token` | `NONREF` | string | Sentinel for an absent `:61:` reference; downstream falls back to a composite idempotency key. |
| `balance_check` | `strict` | `strict`/`warn` | `strict` quarantines on `opening + Σ != closing`; `warn` only during onboarding. |

The two non-negotiable settings in production are rounding_mode and balance_check = strict; warn is an onboarding-only concession. Loosening either removes the arithmetic guarantee that makes parser output trustworthy.
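
One way to keep these parameters explicit is a frozen configuration object that validates itself against the ranges in the table. This is a sketch: `ParserConfig` is a hypothetical name, and the validation rules simply restate the table.

```python
from dataclasses import dataclass
from decimal import ROUND_HALF_EVEN


@dataclass(frozen=True)
class ParserConfig:
    """Per-feed parser settings; any change implies a new parser_version."""
    max_line_length: int = 65
    encoding: str = "iso-8859-1"
    century_pivot: int = 68
    amount_scale: int = 2
    rounding_mode: str = ROUND_HALF_EVEN
    nonref_token: str = "NONREF"
    balance_check: str = "strict"

    def __post_init__(self) -> None:
        if not 65 <= self.max_line_length <= 80:
            raise ValueError("max_line_length must be within 65-80")
        if not 0 <= self.century_pivot <= 99:
            raise ValueError("century_pivot must be within 0-99")
        if not 0 <= self.amount_scale <= 4:
            raise ValueError("amount_scale must be within 0-4")
        if self.rounding_mode != ROUND_HALF_EVEN:
            raise ValueError("rounding_mode is fixed at ROUND_HALF_EVEN")
        if self.balance_check not in {"strict", "warn"}:
            raise ValueError("balance_check must be 'strict' or 'warn'")
```

Freezing the dataclass means a worker cannot mutate thresholds mid-run, which keeps the configuration in force for a given parser_version unambiguous.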

Multi-Dimensional Validation

A clean tokenization is necessary but not sufficient; the parser must also satisfy cross-field invariants before a record is allowed to leave the adapter. The strongest is the statement-level balance identity: the :60F: opening balance plus the signed sum of all :61: entries must equal the :62F: closing balance, to the cent. This single check catches dropped lines, sign-flip errors, and truncation in one assertion, and it is the parser equivalent of the amount-tolerance and date-window constraints that the matching stage layers on later.
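
The balance identity can be sketched in a few lines. The regular expression below assumes the common balance layout of mark, YYMMDD date, three-letter currency, and comma-decimal amount; `parse_balance` and `balance_holds` are hypothetical helper names.

```python
import re
from decimal import Decimal

# mark (C|D), date YYMMDD, ISO currency, amount with a comma decimal separator
BALANCE_RE = re.compile(
    r"^(?P<mark>[CD])(?P<date>\d{6})(?P<ccy>[A-Z]{3})(?P<amount>\d+,\d{0,2})$"
)


def parse_balance(body: str) -> Decimal:
    """Return the signed balance from a :60F: or :62F: field body."""
    m = BALANCE_RE.match(body.strip())
    if m is None:
        raise ValueError(f"malformed balance field {body!r}")
    magnitude = Decimal(m.group("amount").replace(",", "."))
    return magnitude if m.group("mark") == "C" else -magnitude


def balance_holds(opening_body: str, entries: list[Decimal], closing_body: str) -> bool:
    """True when opening + signed sum of entries equals closing exactly."""
    opening = parse_balance(opening_body)
    closing = parse_balance(closing_body)
    return opening + sum(entries, Decimal("0")) == closing
```

For example, an opening balance of `C240101EUR1000,00` with entries of -250.00 and +75.50 must close at `C240102EUR825,50`; a closing field of 825,51 fails the check and, under `strict`, quarantines the statement with BALANCE_MISMATCH.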

Two further dimensions matter at projection time. First, temporal consistency — a value_date that falls outside a plausible window around the statement period signals a century-inference bug or a corrupt field and should quarantine rather than post. Second, reference integrity — when :61: yields the NONREF sentinel, the record cannot stand on its bank reference alone, so a deterministic composite key (hash of value_date, amount, narrative) is computed downstream to preserve idempotency. These constraints compose: a record passes only when tokenization, the balance identity, the temporal window, and reference integrity all hold, mirroring how the downstream cascade combines amount tolerance with a date window and string similarity before accepting a match.
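
Both checks are small enough to sketch. The canonical string format used for the key and the `slack_days` parameter are assumptions made for this illustration; the only requirement taken from the text is that the key is a deterministic hash of value_date, amount, and narrative.

```python
import hashlib
from datetime import date, timedelta
from decimal import Decimal


def composite_key(value_date: date, amount: Decimal, narrative: str) -> str:
    """Deterministic fallback key for records whose reference is NONREF."""
    canonical = "|".join([
        value_date.isoformat(),
        format(amount.quantize(Decimal("0.01")), "f"),   # 10.5 and 10.50 hash alike
        " ".join(narrative.split()),                      # collapse whitespace runs
    ])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def within_window(value_date: date, period_start: date, period_end: date,
                  slack_days: int = 10) -> bool:
    """True when value_date lies inside the statement period plus some slack."""
    slack = timedelta(days=slack_days)
    return period_start - slack <= value_date <= period_end + slack
```

Normalising the amount and whitespace before hashing matters: two parses of the same entry that differ only in trailing zeros or line wrapping would otherwise produce different keys and defeat deduplication.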

Async and High-Throughput Execution

End-of-day windows deliver many statement files at once, so the parser runs inside a bounded worker pool rather than a serial loop. The pattern is a producer that enumerates persisted payloads onto an asyncio.Queue, a fixed set of worker coroutines that each pull a payload and run the CPU-bound FSM inside loop.run_in_executor (or a ProcessPoolExecutor for genuine parallelism past the GIL), and a consumer that streams completed CanonicalTxn records to the matching stage.

Backpressure is the load-bearing detail: the queue is given a fixed maxsize so a burst of large OFX files cannot exhaust memory, and producers block until workers drain capacity. Because each parse is keyed on an immutable source_hash and the FSM is pure, work is trivially partitionable across processes and safe to retry; two workers that accidentally process the same payload produce identical output and identical idempotency keys, so the downstream deduplication index collapses them harmlessly. The number of payloads in flight at any moment is bounded by maxsize plus the worker count; both values belong in the configuration table above, tuned for the specific hardware profile. The lower-level state-tracking, delimiter resolution, and error-recovery patterns these workers rely on are worked end to end in How to parse MT940 files in Python.
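
The producer, bounded queue, and worker pattern can be sketched with the standard library alone. Here `parse_payload` is a stand-in for the real FSM (it only hashes the bytes), and `run_pool`, `workers`, and `maxsize` are illustrative names and defaults, not tuned values.

```python
import asyncio
import hashlib


def parse_payload(raw: bytes) -> str:
    """Stand-in for the CPU-bound FSM; returns the payload's SHA-256 digest."""
    return hashlib.sha256(raw).hexdigest()


async def worker(queue: asyncio.Queue, results: list) -> None:
    loop = asyncio.get_running_loop()
    while True:
        raw = await queue.get()
        try:
            # Run the blocking parse off the event loop (default thread pool).
            results.append(await loop.run_in_executor(None, parse_payload, raw))
        finally:
            queue.task_done()


async def run_pool(payloads: list, workers: int = 4, maxsize: int = 8) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    results: list = []
    tasks = [asyncio.create_task(worker(queue, results)) for _ in range(workers)]
    for raw in payloads:
        await queue.put(raw)      # suspends while the queue is full: backpressure
    await queue.join()            # wait until every queued payload is processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Results arrive in completion order, not submission order, so anything downstream should key on source_hash rather than position. Passing a `ProcessPoolExecutor` instead of `None` to `run_in_executor` is the usual way to get parallelism beyond the GIL.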

Failure Modes Specific to Parsing

Parser errors are enumerated, not ad hoc. Each maps to a named code so the dead-letter queue and the audit ledger speak the same language.

| Code | Root cause | Remediation |
| --- | --- | --- |
| `ENCODING_VIOLATION` | Declared encoding does not decode the payload bytes | Quarantine raw bytes to the DLQ; re-fetch or re-encode at source, then replay. |
| `UNRECOGNISED_LINE` | A non-tag line appears outside a `:86:` continuation | Inspect for vendor formatting; extend the transition table behind a new `parser_version`; replay. |
| `MALFORMED_TXN` | `:61:` fails positional decomposition (bad mark, amount, or date) | Capture the fragment; add a vendor-specific `:61:` variant; replay after fix. |
| `FIELD_PARSE_ERROR` | Amount or date primitive will not coerce to `Decimal`/`date` | Quarantine; never substitute a default; correct the source field and replay. |
| `BALANCE_MISMATCH` | `opening + Σ entries != closing` | Re-fetch the full statement; flag the account for manual review; do not post partial. |
| `PRECONDITION_FAILED` | Missing `source_hash`, credential, or pinned schema version | Hold the payload; resolve the upstream gap; re-enter the parse stage. |

Recovery is always replay-based. Because the FSM is deterministic and version-stamped, reprocessing a quarantined payload after the fix yields byte-identical output and cannot create duplicates — the same guarantee the ingestion layer relies on for its dead-letter remediation.
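
Keeping the codes in a closed enumeration is one way to stop ad hoc strings from leaking into the dead-letter queue. The sketch below is an assumption about how that could look; `ParseErrorCode` and `dead_letter_record` are hypothetical names, and the record layout is illustrative.

```python
import json
from enum import Enum


class ParseErrorCode(str, Enum):
    ENCODING_VIOLATION = "ENCODING_VIOLATION"
    UNRECOGNISED_LINE = "UNRECOGNISED_LINE"
    MALFORMED_TXN = "MALFORMED_TXN"
    FIELD_PARSE_ERROR = "FIELD_PARSE_ERROR"
    BALANCE_MISMATCH = "BALANCE_MISMATCH"
    PRECONDITION_FAILED = "PRECONDITION_FAILED"


def dead_letter_record(code: ParseErrorCode, source_hash: str,
                       parser_version: str, detail: str) -> str:
    """Serialise a quarantine event; sorted keys keep the output reproducible."""
    return json.dumps({
        "code": code.value,
        "source_hash": source_hash,
        "parser_version": parser_version,
        "detail": detail,
    }, sort_keys=True)
```

Constructing the enum from an unknown string raises `ValueError`, so a typo in a code name fails at the point of emission instead of producing an unroutable dead-letter entry.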

Compliance and Audit Trail Requirements

Every statement the parser touches must leave a complete, immutable evidence trail, because under SOX, IFRS 9, and GAAP the parsed ledger entry is only as trustworthy as the proof of how it was produced. Each terminal audit record carries trace_id, source_hash, and a match_decision of PARSED or QUARANTINED; the parser_version that produced the output and the verbatim raw_fragment are preserved on every CanonicalTxn. Together these let an auditor reconstruct exactly which bytes, which parser revision, and which transition rules produced a given ledger entry and, for a quarantined payload, exactly why it was held.

The non-negotiable controls are: append-only audit storage (no in-place edits to a parse record), a pinned and logged parser_version for every run so behaviour changes are attributable, and verbatim raw-fragment retention so the projection can be re-derived independently. Records that fail validation are never silently dropped; the QUARANTINED decision is itself an auditable event. These obligations carry forward into the Multi-Currency Ledger Mapping stage and into Exception Routing & Human-in-the-Loop Workflows, both of which assume the parser has already stamped provenance onto every record they receive.
