How to Parse MT940 Files in Python for Automated Reconciliation

This page solves one narrow, recurring scenario: you have received an end-of-day SWIFT MT940 statement file from a bank feed and you need to turn it into a deterministic, validated, audit-stamped list of transactions that the downstream matching cascade can consume without guessing. It is the concrete implementation companion to the OFX & MT940 Parser Design reference, sitting at the very front of the Core Architecture & Bank Feed Ingestion layer. The MT940 grammar looks rigid on paper, but real banks deviate in narrative formatting (:86:), date encoding (YYMMDD century inference), debit/credit prefixes on :61:, and currency placement. The goal here is a tag-aware finite state machine that emits byte-identical canonical output for identical input and quarantines anything it cannot fully validate, rather than coercing a float where a Decimal belongs.

Prerequisites

Before this parser runs, the following upstream pipeline state must already hold:

The raw MT940 payload has been fetched and persisted byte-for-byte (no normalization, no re-encoding) by the ingestion scheduler described in Real-Time vs Batch Ingestion.
SFTP/API credentials were resolved through Secure API Token Management and never appear in config files or logs.
A trace_id and a source_hash (SHA-256 of the raw bytes) have been minted for the file so every parsed record is traceable end to end.
Python 3.10+ is available; the code below depends only on standard-library modules: re, hashlib, json, logging, datetime, decimal, dataclasses, and typing.
A canonical currency authority is reachable so codes can be validated against the ISO 4217 registry before they reach Multi-Currency Ledger Mapping.
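The audit identifiers listed above can be illustrated with a minimal sketch. The function name mint_audit_context and the choice of a random UUID4 for trace_id are assumptions made for illustration; the ingestion scheduler may use a different identifier scheme. Only the SHA-256 over the raw bytes comes from the prerequisites themselves.

```python
import hashlib
import uuid


def mint_audit_context(raw_bytes: bytes) -> tuple[str, str]:
    """Return (trace_id, source_hash) for one persisted MT940 payload."""
    # Assumption: one random UUID4 per file; substitute your scheduler's own id.
    trace_id = uuid.uuid4().hex
    # Hash the bytes exactly as stored: decoding or newline fixes change the digest.
    source_hash = hashlib.sha256(raw_bytes).hexdigest()
    return trace_id, source_hash


trace_id, source_hash = mint_audit_context(b":20:STMT240601REF\r\n")
```

Because the digest covers the untouched bytes, the same file fetched twice yields the same source_hash, while a copy whose line endings were rewritten does not.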

Step-by-Step Implementation

Step 1 — Define the canonical record types

Model the output first so parsing has a strict target. Every monetary field is a Decimal; dates are datetime; the statement carries an audit_hash that downstream Exact Match & Hash Comparison uses for duplicate-statement detection.

python

import re
import json
import hashlib
import logging
from datetime import datetime
from typing import Optional
from decimal import Decimal, ROUND_HALF_EVEN
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("mt940_parser")


def audit_log(trace_id: str, source_hash: str, match_decision: str, **fields) -> None:
    """Emit one structured audit line per significant parse event."""
    logger.info(json.dumps({
        "trace_id": trace_id,
        "source_hash": source_hash,
        "match_decision": match_decision,
        **fields,
    }))


@dataclass
class MT940Transaction:
    value_date: datetime
    entry_date: Optional[datetime]
    debit_credit: str
    amount: Decimal
    transaction_code: str
    reference: str
    narrative: str
    raw_line: str = ""


@dataclass
class MT940Statement:
    transaction_ref: str
    account_id: str
    statement_number: str
    opening_balance: Decimal
    closing_balance: Decimal
    currency: str
    transactions: list[MT940Transaction] = field(default_factory=list)
    audit_hash: str = ""
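One way to get byte-identical output from records like these is a fixed serialization. The sketch below is an illustration under assumptions, not part of the parser: DemoTransaction is a trimmed stand-in for MT940Transaction, and to_canonical_json is a hypothetical helper name.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from decimal import Decimal


@dataclass
class DemoTransaction:  # stand-in for MT940Transaction, trimmed for brevity
    value_date: datetime
    amount: Decimal
    narrative: str


def _encode(value):
    """Render the types json cannot handle as stable strings."""
    if isinstance(value, Decimal):
        return format(value, "f")        # fixed-point text, never scientific notation
    if isinstance(value, datetime):
        return value.date().isoformat()  # calendar date only
    raise TypeError(f"Unserializable type: {type(value).__name__}")


def to_canonical_json(record) -> str:
    """Sorted keys and fixed separators make equal records serialize identically."""
    return json.dumps(asdict(record), default=_encode,
                      sort_keys=True, separators=(",", ":"))


tx = DemoTransaction(datetime(2024, 6, 1), Decimal("250.00"), "Incoming payment")
canonical = to_canonical_json(tx)
```

Serializing Decimal as text keeps the scale ("250.00") intact, which a float round trip would not guarantee.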

Step 2 — Write deterministic field resolvers

The two fields banks deviate on most are dates and amounts. Parse YYMMDD with an explicit century pivot (YY below 50 maps to 20xx, everything else to 19xx), and parse amounts with a D/C sign prefix and a comma decimal separator, returning a signed Decimal quantized with ROUND_HALF_EVEN. Compile every pattern once at module level so the hot loop reuses the same compiled objects instead of going through the re module's pattern cache on each call.

python

TAG_PATTERN = re.compile(r"^:(\d{2}[A-Z]?):(.*)$")
DATE_PATTERN = re.compile(r"^\d{6}$")
AMOUNT_PATTERN = re.compile(r"^([DC])(\d+,\d{2})$")
AMOUNT_TAIL = re.compile(r"(\d+,\d{2})")


def resolve_date(date_str: str) -> datetime:
    """Parse YYMMDD; YY < 50 -> 20xx, else 19xx (fixed pivot: the field carries no century)."""
    if not DATE_PATTERN.match(date_str):
        raise ValueError(f"Invalid date format: {date_str!r}")
    yy, mm, dd = int(date_str[:2]), int(date_str[2:4]), int(date_str[4:6])
    year = 2000 + yy if yy < 50 else 1900 + yy
    return datetime(year, mm, dd)


def resolve_amount(amount_str: str) -> Decimal:
    """'C1234,56' -> +1234.56, 'D1234,56' -> -1234.56."""
    match = AMOUNT_PATTERN.match(amount_str)
    if not match:
        raise ValueError(f"Invalid amount format: {amount_str!r}")
    sign, value = match.groups()
    amount = Decimal(value.replace(",", ".")).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_EVEN
    )
    return amount if sign == "C" else -amount


def parse_balance(content: str) -> tuple[str, Decimal]:
    """Parse a :60F:/:62F: balance line: <D|C><YYMMDD><ISO4217><amount>."""
    if len(content) < 10:
        raise ValueError(f"Balance line too short: {content!r}")
    dc, currency = content[0], content[7:10]
    return currency, resolve_amount(dc + content[10:])
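The pivot in resolve_date is a constant, while the configuration table further down treats it as tunable. A minimal sketch of a parameterized variant follows; the name resolve_date_with_pivot and its pivot argument are hypothetical, and it returns a date rather than a datetime purely to keep the example short.

```python
from datetime import date


def resolve_date_with_pivot(date_str: str, pivot: int = 50) -> date:
    """Parse YYMMDD; YY below `pivot` maps to 20xx, otherwise to 19xx."""
    if len(date_str) != 6 or not (date_str.isascii() and date_str.isdigit()):
        raise ValueError(f"Invalid date format: {date_str!r}")
    if not 0 <= pivot <= 99:
        raise ValueError(f"Pivot out of range: {pivot}")
    yy, mm, dd = int(date_str[:2]), int(date_str[2:4]), int(date_str[4:6])
    century = 2000 if yy < pivot else 1900
    # date() itself rejects impossible values such as month 13 with ValueError.
    return date(century + yy, mm, dd)


modern = resolve_date_with_pivot("240601")              # YY=24 is below 50
legacy = resolve_date_with_pivot("500101")              # YY=50 is not below 50
shifted = resolve_date_with_pivot("500101", pivot=60)   # same input, wider pivot
```

The same six digits resolve to different centuries depending on the pivot, which is why the value should be chosen from the oldest statement you expect to ingest.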

Step 3 — Build the tag-aware finite state machine

Naive splitlines() plus regex fails on multiline :86: narratives and unspaced :61: fields. The FSM walks tags sequentially, holds context across continuation lines, and records which of three phases it is in: HEADER, STATEMENT_LINES, or FOOTER. The critical correctness point is the :61: field: the SWIFT format packs the D/C mark, the amount, the transaction type code, and the reference into a single unspaced string after the date fields, so any content.split() approach breaks on the (common) banks that omit spaces.

python

class MT940Parser:
    """Production-grade, tag-aware MT940 state machine."""

    def __init__(self, trace_id: str, source_hash: str, strict_mode: bool = True):
        self.trace_id = trace_id
        self.source_hash = source_hash
        self.strict_mode = strict_mode
        self._state = "HEADER"
        self._stmt: Optional[MT940Statement] = None
        self._tx: Optional[MT940Transaction] = None
        self._pending: list[str] = []

    @staticmethod
    def _hash(stmt: MT940Statement) -> str:
        payload = (f"{stmt.transaction_ref}|{stmt.account_id}"
                   f"|{stmt.opening_balance}|{stmt.closing_balance}")
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

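    def _commit_narrative(self) -> None:
        if self._pending and self._tx:
            # Continuation lines join the first :86: line with single spaces.
            parts = [self._tx.narrative, *self._pending]
            self._tx.narrative = " ".join(p for p in parts if p)
        self._pending.clear()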

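    def parse(self, raw_content: str) -> list[MT940Statement]:
        statements: list[MT940Statement] = []
        # Reset all per-run state so a reused parser cannot leak a prior statement.
        self._state, self._stmt, self._tx = "HEADER", None, None
        self._pending.clear()
        for line in raw_content.splitlines():
            stripped = line.strip()
            # Skip blanks and the "-" / "-}" block terminators some feeds append.
            if not stripped or stripped == "-" or stripped.startswith("-}"):
                continue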
            tag_match = TAG_PATTERN.match(stripped)
            if tag_match:
                self._commit_narrative()
                self._process_tag(*tag_match.groups(), statements)
            else:
                self._pending.append(stripped)
        self._commit_narrative()
        if self._stmt:
            self._stmt.audit_hash = self._hash(self._stmt)
            statements.append(self._stmt)
        audit_log(self.trace_id, self.source_hash, "PARSE_OK",
                  statements=len(statements),
                  transactions=sum(len(s.transactions) for s in statements))
        return statements

    def _process_tag(self, tag: str, content: str,
                     statements: list[MT940Statement]) -> None:
        if tag == "20":
            if self._stmt:
                self._stmt.audit_hash = self._hash(self._stmt)
                statements.append(self._stmt)
            self._stmt = MT940Statement(content.strip(), "", "",
                                        Decimal("0"), Decimal("0"), "")
            self._tx = None
            self._state = "HEADER"
        elif tag == "25" and self._stmt:
            self._stmt.account_id = content.strip()
        elif tag == "28C" and self._stmt:
            self._stmt.statement_number = content.strip()
        elif tag in ("60F", "60M") and self._stmt:
            self._stmt.currency, self._stmt.opening_balance = parse_balance(content)
        elif tag == "61":
            self._process_61(content)
        elif tag in ("62F", "62M") and self._stmt:
            _, self._stmt.closing_balance = parse_balance(content)
            self._state = "FOOTER"
        elif tag == "86":
            if self._tx:
                self._tx.narrative += content.strip()
            else:
                self._pending.append(content.strip())

    def _process_61(self, content: str) -> None:
        # :61: = YYMMDD[MMDD]<[R]DC><amount><tx_code><ref>, e.g. "2301010101C1234,56NTRFNONREF"
        if len(content) < 8:
            if self.strict_mode:
                raise ValueError(f"Malformed :61: line: {content!r}")
            return
        val_date = resolve_date(content[:6])
        pos = 6
        entry_date: Optional[datetime] = None
        if content[pos].isdigit():
            try:
                mmdd = content[pos:pos + 4]
                entry_date = datetime(val_date.year, int(mmdd[:2]), int(mmdd[2:]))
                pos += 4
            except (ValueError, IndexError):
                pass  # entry date is optional
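        reversal = content[pos:pos + 1] == "R"   # RC / RD reverse an earlier entry
        if reversal:
            pos += 1
        dc_flag, pos = content[pos:pos + 1], pos + 1
        if content[pos:pos + 1].isalpha():
            pos += 1                     # optional funds code (third currency letter)
        amt = AMOUNT_TAIL.match(content[pos:])
        if not amt:
            if self.strict_mode:
                raise ValueError(f"Cannot parse amount in :61:: {content!r}")
            return
        amount = resolve_amount(dc_flag + amt.group(1))
        if reversal:
            amount = -amount             # RC books as a debit, RD as a credit
        # The 4-character type code (e.g. NTRF) runs straight into the reference,
        # so slice by position instead of splitting on whitespace.
        remainder = content[pos + len(amt.group(1)):].strip()
        self._tx = MT940Transaction(
            value_date=val_date, entry_date=entry_date,
            debit_credit=("R" if reversal else "") + dc_flag,
            amount=amount, transaction_code=remainder[:4],
            reference=remainder[4:].strip(),  # owner ref plus any //bank ref
            narrative="", raw_line=content)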
        if self._stmt:
            self._stmt.transactions.append(self._tx)
        self._state = "STATEMENT_LINES"
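The positional walk above can also be expressed as a single anchored pattern with named groups, which some teams find easier to audit. The sketch below is an alternative illustration, not the parser's own method: FIELD_61 and split_field_61 are hypothetical names, and the pattern follows the SWIFT field layout (value date, optional entry date, D/C mark, optional funds code, amount, type code, reference, optional // bank reference) while ignoring the optional supplementary-details line.

```python
import re
from typing import Optional

FIELD_61 = re.compile(
    r"^(?P<value_date>\d{6})"          # YYMMDD
    r"(?P<entry_date>\d{4})?"          # optional MMDD
    r"(?P<mark>R?[DC])"                # D, C, RD or RC
    r"(?P<funds_code>[A-Z])?"          # optional third currency letter
    r"(?P<amount>\d+,\d{0,2})"         # comma is mandatory, fraction digits are not
    r"(?P<tx_code>[A-Z][A-Z0-9]{3})"   # e.g. NTRF
    r"(?P<reference>.{0,16}?)"         # account-owner reference
    r"(?://(?P<bank_ref>.{1,16}))?$"   # optional servicing-institution reference
)


def split_field_61(content: str) -> dict[str, Optional[str]]:
    """Split one :61: value into named parts; absent optional parts are None."""
    match = FIELD_61.match(content)
    if match is None:
        raise ValueError(f"Malformed :61: line: {content!r}")
    return match.groupdict()


plain = split_field_61("2406010601C250,00NTRFNONREF//INV-042")
reversed_debit = split_field_61("240601RD75,5NCHGFEE-REVERSAL")
```

The groups stay as strings on purpose, so date and amount conversion can still go through resolve_date and resolve_amount in one place.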

Step 4 — Run the parser with end-to-end audit context

Recompute the source_hash from the raw bytes (it should equal the one minted at ingestion), isolate parsing from I/O, and emit a structured audit line on success and on quarantine. A statement that fails validation is routed to a dead-letter queue with full context rather than partially committed.

python

def parse_mt940_file(raw_bytes: bytes, trace_id: str) -> list[MT940Statement]:
    source_hash = hashlib.sha256(raw_bytes).hexdigest()
    parser = MT940Parser(trace_id=trace_id, source_hash=source_hash, strict_mode=True)
    try:
        return parser.parse(raw_bytes.decode("utf-8"))
    except ValueError as exc:
        audit_log(trace_id, source_hash, "QUARANTINE", error=str(exc))
        raise  # caller routes to DLQ; never partial-commit a bad statement
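What the caller does with the re-raised error is outside this page, but a minimal sketch may help. Everything here is an assumption for illustration: the dead-letter queue is modeled as a local directory, parse_or_quarantine is a hypothetical name, and the parse function is passed in so the example stays self-contained.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def parse_or_quarantine(raw_bytes: bytes, trace_id: str,
                        parse: Callable[[bytes, str], list],
                        dlq_dir: Path) -> list:
    """Return parsed statements, or park the untouched payload and return []."""
    try:
        return parse(raw_bytes, trace_id)
    except ValueError as exc:
        dlq_dir.mkdir(parents=True, exist_ok=True)
        # Keep the raw bytes verbatim so the file can be replayed after a fix.
        (dlq_dir / f"{trace_id}.mt940").write_bytes(raw_bytes)
        (dlq_dir / f"{trace_id}.json").write_text(
            json.dumps({"trace_id": trace_id, "error": str(exc)}, sort_keys=True),
            encoding="utf-8")
        return []


def _always_fails(raw_bytes: bytes, trace_id: str) -> list:
    raise ValueError("Malformed :61: line")  # stand-in for a failing parser


with tempfile.TemporaryDirectory() as tmp:
    dlq = Path(tmp) / "dlq"
    result = parse_or_quarantine(b":20:BAD", "trace-001", _always_fails, dlq)
    parked = sorted(p.name for p in dlq.iterdir())
```

Nothing is committed downstream on failure: the caller receives an empty list and the payload sits next to a sidecar file that records why it was parked.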

Configuration Boundary Table

Parameter	Default	Valid range	Notes
`strict_mode`	`True`	`True` / `False`	`False` skips malformed `:61:` lines instead of raising; only for sandboxed backfills.
Century pivot (`YY < 50`)	`50`	`0`–`99`	Boundary for `20xx` vs `19xx` inference; align with your oldest expected statement.
Amount quantize	`Decimal("0.01")`	`0.01`–`0.0001`	2 dp for ledger commit; widen only for intermediate FX (rounded at commit).
Rounding mode	`ROUND_HALF_EVEN`	banker’s rounding	Mandated for accounting parity; do not use `ROUND_HALF_UP`.
Encoding	`utf-8`	`utf-8` / `latin-1`	Some legacy SWIFT feeds emit `latin-1`; detect from feed metadata, never guess.
Max statement size	5 MB	1–25 MB	Reject larger payloads before parsing to bound memory.
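Only strict_mode is an actual constructor argument in the code on this page; the other rows are constants in the source. One possible way to make the boundaries explicit is a validated settings object, sketched below. ParserConfig, decode_payload, and the reading of "MB" as 1024 * 1024 bytes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal

_MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class ParserConfig:
    strict_mode: bool = True
    century_pivot: int = 50
    quantum: Decimal = Decimal("0.01")
    encoding: str = "utf-8"
    max_bytes: int = 5 * _MB

    def __post_init__(self) -> None:
        # Reject values outside the ranges listed in the table.
        if not 0 <= self.century_pivot <= 99:
            raise ValueError(f"century_pivot out of range: {self.century_pivot}")
        if not Decimal("0.0001") <= self.quantum <= Decimal("0.01"):
            raise ValueError(f"quantum out of range: {self.quantum}")
        if self.encoding not in ("utf-8", "latin-1"):
            raise ValueError(f"unsupported encoding: {self.encoding}")
        if not 1 * _MB <= self.max_bytes <= 25 * _MB:
            raise ValueError(f"max_bytes out of range: {self.max_bytes}")


def decode_payload(raw_bytes: bytes, config: ParserConfig) -> str:
    """Enforce the size bound, then decode with the configured encoding only."""
    if len(raw_bytes) > config.max_bytes:
        raise ValueError(f"Payload of {len(raw_bytes)} bytes exceeds the limit")
    # A wrong encoding raises UnicodeDecodeError, which is a ValueError subclass.
    return raw_bytes.decode(config.encoding)


default_config = ParserConfig()
legacy_text = decode_payload(b"caf\xe9", ParserConfig(encoding="latin-1"))
```

Validating at construction time means a bad setting fails before any statement is read, rather than partway through a file.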

Verification and Testing

Confirm correctness against a small, hand-built fixture whose balances are known. The closing balance must equal the opening balance plus the sum of signed transaction amounts; if it does not, the file is malformed or a :61: line was misread.

python

FIXTURE = (
    ":20:STMT240601REF\r\n"
    ":25:NL00BANK0123456789\r\n"
    ":28C:00123/001\r\n"
    ":60F:C240601EUR1000,00\r\n"
    ":61:2406010601C250,00NTRFNONREF//INV-0042\r\n"
    ":86:Incoming payment INV-0042\r\n"
    ":61:2406010601D75,50NTRFNONREF//FEE\r\n"
    ":86:Monthly service fee\r\n"
    ":62F:C240601EUR1174,50\r\n"
)

def test_balances_reconcile() -> None:
    [stmt] = parse_mt940_file(FIXTURE.encode("utf-8"), trace_id="test-001")
    movement = sum((t.amount for t in stmt.transactions), Decimal("0"))
    assert stmt.opening_balance + movement == stmt.closing_balance
    assert stmt.currency == "EUR"
    assert len(stmt.audit_hash) == 64
    assert stmt.transactions[0].narrative == "Incoming payment INV-0042"

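A green run proves three things at once: the signed :61: amounts are correct, each :86: narrative attaches to the transaction it follows, and the audit_hash is populated for downstream deduplication. The fixture contains only single-line narratives, so add a wrapped :86: case before relying on continuation-line handling.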
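The balance identity used in the test can also run as a guard on every production file. The sketch below assumes two hypothetical helpers, balance_gap and require_balanced; raising ValueError is a design choice so the failure follows the same quarantine path as other parse errors.

```python
from decimal import Decimal
from typing import Iterable


def balance_gap(opening: Decimal, closing: Decimal,
                amounts: Iterable[Decimal]) -> Decimal:
    """Return closing - (opening + signed amounts); zero means reconciled."""
    return closing - (opening + sum(amounts, Decimal("0")))


def require_balanced(opening: Decimal, closing: Decimal,
                     amounts: Iterable[Decimal]) -> None:
    """Raise ValueError when the statement does not add up."""
    gap = balance_gap(opening, closing, amounts)
    if gap != 0:
        raise ValueError(f"Balance drift of {gap}")


# Same figures as the fixture: 1000,00 opening, +250,00 and -75,50 movements.
gap = balance_gap(Decimal("1000.00"), Decimal("1174.50"),
                  [Decimal("250.00"), Decimal("-75.50")])
```

Reporting the gap itself, rather than a boolean, tells an operator at a glance whether a single transaction was dropped or a sign was flipped.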

Troubleshooting

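MT940_AMOUNT_UNPARSED — _process_61 raises on a :61: amount. Root cause: the bank emits a . decimal separator, or writes whole amounts with a trailing comma and no minor units (250, is valid under the SWIFT amount format, which requires the comma but not the digits after it). Fix: extend both AMOUNT_TAIL and AMOUNT_PATTERN to accept [.,] followed by zero to two fraction digits, and pad missing minor units before quantizing.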
MT940_DATE_PIVOT_WRONG — historical statements land in the wrong century. Root cause: the YY < 50 pivot mismatches your archive horizon. Fix: lower the pivot or derive the century from the file’s delivery timestamp instead of a constant.
MT940_NARRATIVE_DROPPED — :86: text is missing from a transaction. Root cause: a continuation line began with a :-like token and was misread as a tag, or _commit_narrative ran before the :86: arrived. Fix: confirm TAG_PATTERN anchors on ^:\d{2} and that narratives flush only on the next real tag.
MT940_BALANCE_DRIFT — test_balances_reconcile fails by a few cents. Root cause: a float crept into an FX or rounding step. Fix: keep every monetary value as Decimal end to end and round once, at ledger commit, with ROUND_HALF_EVEN.
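MT940_ENCODING_ERROR — decode("utf-8") raises UnicodeDecodeError. Root cause: a latin-1 SWIFT feed. Because UnicodeDecodeError is a subclass of ValueError, parse_mt940_file already quarantines the file. Fix: take the encoding from feed metadata and restrict it to the values in the Encoding row of the configuration table; never silently fall back, as that can corrupt account identifiers.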
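The fix suggested for MT940_AMOUNT_UNPARSED can be sketched as follows. This is an illustration under assumptions: TOLERANT_AMOUNT and resolve_amount_tolerant are hypothetical names, the pattern accepts either separator with zero to two fraction digits, and anything with more than two fraction digits is still rejected rather than rounded.

```python
import re
from decimal import Decimal

TOLERANT_AMOUNT = re.compile(r"^([DC])(\d+)(?:[.,](\d{0,2}))?$")


def resolve_amount_tolerant(amount_str: str) -> Decimal:
    """Accept 'C250,', 'C250', 'D75.5' and 'C1234,56'; return a signed Decimal."""
    match = TOLERANT_AMOUNT.match(amount_str)
    if not match:
        raise ValueError(f"Invalid amount format: {amount_str!r}")
    sign, whole, fraction = match.groups()
    # Pad missing minor units so every result carries exactly two decimals.
    value = Decimal(f"{whole}.{(fraction or '').ljust(2, '0')}")
    return value if sign == "C" else -value


whole_amount = resolve_amount_tolerant("C250,")
dot_amount = resolve_amount_tolerant("D75.5")
```

Padding in text before constructing the Decimal avoids any rounding step, so no value is altered on the way in.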

Related

Real-Time vs Batch Ingestion — choosing the cadence that delivers MT940 files to this parser.
Best Practices for Handling Bank API Rate Limits — fetching statements without tripping throttling.
Mapping ISO 20022 to Internal GL Formats — projecting parsed records onto the general ledger.
