2026-04-08·8 min read·sota.io team

EU AI Act Article 9: Why Formal Verification Is the Overlooked Compliance Tool for High-Risk AI (2026)

On August 2, 2026, the high-risk AI provisions of the EU AI Act — Chapter III, Section 2, Articles 8 through 15 — begin applying to new high-risk AI systems placed on the market or put into service. If you are building or deploying AI in healthcare diagnostics, biometric identification, critical infrastructure, education, employment, credit scoring, law enforcement, or border control, these requirements apply to you.

Article 9 is the foundational requirement: a continuous, documented, maintained risk management system that operates throughout the entire lifecycle of the system. Most developers interpret this as a testing and monitoring obligation. That reading is correct but incomplete.

Formal verification — the mathematical proof of software properties — is what Article 9 actually demands when read carefully. Here is why, and which tools make it tractable.

What Article 9 Actually Requires

Article 9(1) requires providers of high-risk AI systems to "establish, implement, document and maintain a risk management system." Article 9(2) specifies that this is "a continuous iterative process run throughout the entire lifecycle of the high-risk AI system."

The key operative paragraphs are 9(4) (identification and analysis of foreseeable risks), 9(6) (risk management measures), and 9(7) (testing).

The phrase "foreseeable risks" in Art. 9(4) is the one that changes the compliance calculus. Testing finds failures you have actually observed. Formal verification identifies failure modes across all reachable system states — including states you have never executed in a test suite.

The Compliance Gap: Why Testing Alone Is Not Enough

Consider a credit scoring AI. Testing with 10,000 loan applications gives you confidence that the system behaves correctly for those 10,000 cases. But Article 9 asks: are there foreseeable states — edge cases, unusual combinations of inputs, adversarial inputs — where the system produces outputs that could cause harm?

A test suite cannot answer that question. You cannot enumerate every possible combination of inputs to a scoring model. What you can do is formally specify the properties the system must satisfy — for example, "the risk score shall never exceed the configured maximum for verified low-income applicants with no adverse history" — and then prove that property holds for all possible inputs, not just tested ones.
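
As a concrete illustration, here is how that quoted property might be written as a machine-checked contract. This is a minimal Dafny sketch; the method, parameters, and clamping logic are hypothetical, not drawn from any real scoring system.

// Hypothetical sketch: the quoted property as a postcondition.
method RiskScore(raw: real, maxScore: real,
                 verifiedLowIncome: bool, adverseHistory: bool)
  returns (score: real)
  requires 0.0 <= raw && 0.0 < maxScore
  // Holds for ALL inputs satisfying the precondition, not just the
  // inputs a test suite happens to exercise.
  ensures verifiedLowIncome && !adverseHistory ==> score <= maxScore
{
  score := raw;
  if verifiedLowIncome && !adverseHistory && score > maxScore {
    score := maxScore;
  }
}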

This is what formal verification does. It is not faster testing. It is a different class of guarantee:

| Approach | What it proves | What it misses |
|---|---|---|
| Unit testing | Absence of failures in tested cases | All untested cases |
| Property-based testing | Absence of failures in randomly generated cases | Structured edge cases, worst-case inputs |
| Fuzz testing | Absence of failures under random mutations | Semantically valid edge cases |
| Formal verification | Absence of a class of failures in ALL reachable states | Properties you did not specify |

Formal verification does not replace testing. It provides a proof for properties that testing cannot establish — which is exactly the kind of analysis of "foreseeable risks" that Art. 9(4) requires you to document.

The Tools: Formal Verification for High-Risk AI Systems

These are production-grade tools used in aviation, automotive, medical devices, and nuclear systems — sectors with compliance requirements stricter than the EU AI Act.

TLA+ (Leslie Lamport, Microsoft Research / AWS)

TLA+ is a specification language for modelling concurrent and distributed systems. You write a state machine that describes all possible behaviours of your system. The TLC model checker exhaustively explores the state space, finding any behaviour that violates a temporal logic property you specify.

Amazon Web Services uses TLA+ to verify properties of S3, DynamoDB, EBS, and internal distributed protocols. According to the published experience report, model checking uncovered subtle design bugs in these systems, several of which would have caused serious data loss under conditions too rare for any plausible test suite to trigger.

For an AI system, TLA+ is most useful for the surrounding logic — the orchestration layer, the decision pipeline, the fallback logic. You can specify: "If the AI model returns a confidence score below 0.7, the system shall always route to human review." TLA+ proves that property holds across all possible orderings of concurrent events.

---- MODULE CreditScoring ----
\* Fragment: a complete spec would also define Init and Next actions.
\* Confidence is modelled as an integer percentage (70 = the 0.7
\* threshold above) because TLC checks finite, discrete state spaces.
VARIABLES score, confidence, route

TypeInvariant ==
  route \in {"automated", "human_review"}

\* Safety property: in every reachable state, low confidence implies
\* routing to human review.
SafetyProperty ==
  [] (confidence < 70 => route = "human_review")
====

Art. 9 relevance: Directly addresses 9(4) (foreseeable risks in all reachable states) and 9(6) (documents that risk management measures hold unconditionally).

SPARK Ada (AdaCore 🇫🇷, Paris)

SPARK is a formally verifiable subset of Ada. Every function carries contracts — preconditions, postconditions, loop invariants — that are checked both by the GNATprove verifier (which produces a formal proof) and at runtime. SPARK Ada has a long track record in avionics, air traffic management, rail signalling, and defence systems.

AdaCore is a French company, founded in Paris in 1994. SPARK Ada is EU-origin formal verification technology.

function Clamp_Score (Raw : Float; Max : Float) return Float
  with Pre  => Raw >= 0.0 and Max > 0.0,
       Post => Clamp_Score'Result >= 0.0
               and Clamp_Score'Result <= Max
is
begin
   return Float'Min (Raw, Max);
end Clamp_Score;

GNATprove can verify that Clamp_Score never returns a value exceeding Max for any input satisfying the precondition — without executing a single test case.

Art. 9 relevance: Postconditions are machine-checked proofs of output properties. This creates auditable documentation that Art. 9(6) risk management measures are implemented correctly.

Isabelle/HOL (TU Munich 🇩🇪 + Cambridge 🇬🇧)

Isabelle is an interactive theorem prover. You write mathematical proofs of system properties, which Isabelle verifies mechanically. It was used to prove the correctness of the seL4 microkernel (the first formally verified OS kernel) and to verify compilers, cryptographic protocols, and operating system components.

For AI compliance, Isabelle is most appropriate for high-stakes properties where you need a formal mathematical argument — not a model check over a finite state space, but a proof over all inputs, all executions, all time. If your AI system processes medical images and must satisfy a property like "for any DICOM image, the output diagnosis label is always drawn from the approved ICD-10 vocabulary," Isabelle can prove that property holds given the implementation.
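
Stated in contract form, the property looks like the sketch below. For brevity it is written Dafny-style with an invented three-code vocabulary and a placeholder classifier; in Isabelle/HOL the same statement would be proven as a theorem about the actual implementation, over all inputs.

// Illustrative sketch only: a tiny approved vocabulary and a placeholder
// classifier, showing the shape of the property to be proven.
const ApprovedCodes: set<string> := {"J18.9", "C34.1", "I21.9"}

method Diagnose(image: seq<int>) returns (label: string)
  // For any input image whatsoever, the output label is drawn from
  // the approved vocabulary.
  ensures label in ApprovedCodes
{
  label := "J18.9";  // placeholder: real logic must preserve the postcondition
}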

Art. 9 relevance: Produces machine-verifiable proofs. The proof artifact itself constitutes the documentation of foreseeable-risk analysis that Art. 9(3) requires to be maintained.

Frama-C (CEA/INRIA 🇫🇷, Saclay and Paris)

Frama-C is a C code analysis framework developed by CEA (Commissariat à l'énergie atomique) and INRIA, both French public research organizations. Its WP plugin (Weakest Precondition) verifies ACSL (ANSI/ISO C Specification Language) annotations against C implementations.

Airbus uses Frama-C for critical avionics software verification. The tool is actively maintained by EU public institutions and is completely EU-origin.

/*@ requires 0.0 <= x <= 1.0;
  @ requires 0.0 <= threshold;
  @ ensures  0.0 <= \result <= threshold;
  @ assigns  \nothing;
  @*/
float clip_confidence(float x, float threshold) {
  return x > threshold ? threshold : x;
}

Art. 9 relevance: C-native formal verification for safety-critical AI components (real-time inference engines, embedded preprocessing pipelines).

Dafny (K. Rustan M. Leino, Microsoft Research)

Dafny is a verification-aware programming language that compiles to C#, Go, Python, and Java. You write specifications inline with the code, and the Dafny verifier checks them statically at development time; once a method verifies, its contracts hold for every execution of the compiled output.

method ScoreApplicant(income: real, history: int) returns (score: real)
  requires 0.0 <= income
  requires 0 <= history <= 100
  ensures 0.0 <= score <= 1.0
{
  score := (income / 100000.0 + history as real / 100.0) / 2.0;
  if score > 1.0 { score := 1.0; }
}

Dafny is particularly accessible for teams not working in Ada or C — the syntax is close to Python/C#, and the verifier integrates with VS Code.

Art. 9 relevance: Inline specifications create a living document of risk properties that is updated alongside the code — precisely the "maintained" documentation Art. 9(3) requires.

Mapping Art. 9 Requirements to Formal Verification

| Article 9 Requirement | Formal Verification Tool | What It Proves |
|---|---|---|
| 9(4): Identify foreseeable risks | TLA+ (model checking) | All reachable system states, including failure modes you did not observe |
| 9(4): Analysis of known risks | CBMC / Frama-C | Absence of buffer overflows, null dereferences, integer overflows in C/C++ inference pipelines |
| 9(6): Risk management measures | SPARK Ada / Dafny | Safety contracts (preconditions/postconditions) hold for all inputs |
| 9(7): Testing before deployment | Isabelle/HOL | Mathematical proof of high-stakes properties — stronger than any test suite |
| 9(3): Document and maintain | All tools | Proof artifacts, specifications, and annotations are version-controlled documentation |

Practical Implementation: Where to Start

For most teams building high-risk AI under the EU AI Act, the most tractable path is not to formally verify the entire neural network (that remains an open research problem). It is to formally verify the safety envelope around the model (a combined sketch follows the list below):

  1. Input validation layer (SPARK Ada or Dafny): Prove that all inputs to the model satisfy constraints before inference. No out-of-range values, no null references, no malformed data enters the model.

  2. Output constraints (TLA+ or Dafny): Prove that the output processing pipeline always produces a result within the allowed range — regardless of what the model returns.

  3. Fallback logic (TLA+): Prove that the human-review routing logic is unconditionally correct — if confidence falls below threshold, human review is triggered without exception.

  4. Audit trail invariants (Isabelle/HOL for high-stakes): For systems subject to Art. 14 (human oversight), prove that the audit log always contains an entry for every decision above a specified risk threshold.
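
To make the envelope concrete, here is a minimal Dafny sketch of steps 1 to 3. Every name in it (ValidInput, Clamp, RouteOf, Decide, the 0.7 threshold) is an illustrative assumption, not part of any real system; the point is that each obligation becomes a machine-checked contract rather than a test case.

// Minimal sketch of a verified safety envelope (illustrative names).
datatype Route = Automated | HumanReview

// Step 1 -- input validation: every feature must already be normalised.
predicate ValidInput(features: seq<real>)
{
  forall i :: 0 <= i < |features| ==> 0.0 <= features[i] <= 1.0
}

// Step 2 -- output constraint: whatever the model returns is clamped
// into the allowed range before it leaves the pipeline.
function Clamp(raw: real): real
  ensures 0.0 <= Clamp(raw) <= 1.0
{
  if raw < 0.0 then 0.0 else if raw > 1.0 then 1.0 else raw
}

// Step 3 -- fallback logic: low confidence always routes to human review.
function RouteOf(confidence: real): Route
  ensures confidence < 0.7 ==> RouteOf(confidence) == HumanReview
{
  if confidence < 0.7 then HumanReview else Automated
}

// The envelope around an untrusted model output: the postconditions are
// proven by the Dafny verifier for all inputs, not sampled by tests.
method Decide(features: seq<real>, rawScore: real, confidence: real)
  returns (score: real, route: Route)
  requires ValidInput(features)
  ensures 0.0 <= score <= 1.0
  ensures confidence < 0.7 ==> route == HumanReview
{
  score := Clamp(rawScore);
  route := RouteOf(confidence);
}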

Running Formal Verification Pipelines on EU-Native Infrastructure

Art. 9 compliance requires that the verification artifacts themselves — proofs, model check results, specification documents — are stored and processed on infrastructure that satisfies your GDPR obligations. Formal verification toolchains run as standard software: CI/CD pipelines that invoke GNATprove, TLC, Frama-C, or the Dafny verifier and produce output artifacts.

If those pipelines run on US-headquartered infrastructure (AWS, Railway, Render, Fly.io), your compliance documentation is subject to CLOUD Act disclosure risk. This is a structural exposure: a US company operating your CI/CD infrastructure can be compelled to produce your verification artifacts under US law, without going through EU mutual legal assistance procedures.

Running formal verification pipelines on EU-native PaaS closes this exposure. The verification toolchain, the proof artifacts, and the source code they verify all remain under EU jurisdiction.

# Deploy a Dafny verification pipeline to EU infrastructure
sota deploy
# → Dafny 4.8.0 verification container running on EU servers
# → Proof artifacts stored in managed EU PostgreSQL
# → No CLOUD Act exposure for compliance documentation

sota.io is an EU-incorporated PaaS that runs exclusively on EU infrastructure. Deploy any verification toolchain — TLA+, Dafny, Frama-C, SPARK Ada — as a containerized pipeline, with managed PostgreSQL for artifact storage. GDPR-compliant by design, free tier available.

August 2026: The Compliance Window

The EU AI Act enforcement timeline:

  - February 2, 2025: prohibitions on unacceptable-risk AI practices apply
  - August 2, 2025: obligations for general-purpose AI models and the governance framework apply
  - August 2, 2026: requirements for high-risk AI systems listed in Annex III, including Article 9, apply
  - August 2, 2027: end of the extended transition for high-risk AI embedded in products covered by Annex I

If you are building or deploying a new high-risk AI system, August 2026 is not a distant horizon — it is four months away. Article 9 compliance cannot be retrofitted at deployment time. The risk management system must be established, documented, and operational throughout the development process.

The tools described here — TLA+, SPARK Ada, Isabelle/HOL, Frama-C, Dafny — are production-grade, well-documented, and actively maintained, several of them by EU public institutions and EU-origin research groups. They are not experimental. They are the same tools that verify aircraft navigation systems, nuclear power plant controllers, and medical device firmware.

The gap most teams have is not that the tools are too hard to use. It is that they have not yet recognised that Art. 9's "foreseeable risks" language demands a stronger form of analysis than testing alone can provide.

Sources: EU AI Act (Regulation (EU) 2024/1689), Official Journal 2024-07-12, Articles 9, 14, and Annex III. TLA+ documentation: lamport.azurewebsites.net. AdaCore SPARK Ada: docs.adacore.com/spark2014-docs. Frama-C: frama-c.com (CEA/INRIA). Isabelle: isabelle.in.tum.de (TU Munich). Dafny: dafny.org (Microsoft Research). AWS TLA+ use: dl.acm.org/doi/10.1145/2699417.