Soren Learning

Chapter 7

Modern Testing: AI, Chaos, and Beyond

Listen to this article

The fundamentals of testing — pyramid, TDD, CI — have been stable for fifteen years. What has changed, especially in the last five, is the frontier: new techniques that were academic curiosities a decade ago are now routine in well-run teams. This chapter is a tour of that frontier.

AI-Assisted Test Generation

Large language models have changed what is possible in test authoring. Three distinct uses are now common.

Test Scaffolding from Code

Given a function, an LLM can generate a reasonable set of tests covering the obvious cases: happy path, common edge cases, null/empty inputs, boundary values. Tools like GitHub Copilot, Cursor, and Claude Code all do this interactively; dedicated tools like CodiumAI (Qodo) and Diffblue Cover specialize in it.

This is genuinely useful for:

  • Bootstrapping a test suite on legacy code that has none
  • Covering boilerplate cases (input validation, type checks) quickly
  • Suggesting edge cases a human might forget (what about negative zero? what about Unicode whitespace?)

It is not a replacement for thinking about what to test. LLMs generate plausible tests; they don't know which invariants matter for your business. A test that asserts "this getter returns the value the setter set" is literally testing that assignment works in your language. LLMs produce a lot of these. Review every generated test — accept the ones that add value, delete the noise.

Self-Healing Tests

UI automation has always suffered from brittleness: a developer renames a CSS class and 40 Selenium tests break. "Self-healing" test tools (Testim, Mabl, Applitools, Functionize) use ML to identify UI elements by multiple signals — position, surrounding text, visual shape, accessibility attributes — so that a single-attribute change doesn't break the test. When a selector fails, the tool tries to find the "same" element by its other properties and auto-updates the test.

This works well for cosmetic and incidental changes (renaming a class, restructuring DOM without changing semantics). It works less well for intentional UX changes, and it can hide real bugs — a "self-healed" test that now clicks the wrong button is worse than one that fails loudly.

Use self-healing as a reducer of noise, not a replacement for test review.

LLM-Powered Exploratory Testing

Newer experiments use LLMs as agents: give them a running application, a goal ("sign up a new user, make a purchase, cancel it"), and turn them loose. The agent clicks, types, reads pages, decides what to do next, and reports what it found.

This is early-stage. Current agents are slow, expensive, non-deterministic, and easily confused by non-standard UIs. But the trajectory is real — by the end of the decade, LLM-driven exploratory testing will likely be a standard part of the stack.

Mutation Testing: Testing Your Tests

How do you know your tests are any good? You can measure coverage, but as Chapter 5 discussed, coverage only tells you what the tests touched, not what they would catch.

Mutation testing answers this directly. It takes your source code, introduces a small bug (a "mutation") — changes < to <=, deletes a line, flips a boolean — then runs the test suite. If the tests still pass, the mutation "survived," which means your tests would not have caught that bug. If they fail, the mutation is "killed," which means the tests are doing their job.

The output is a mutation score: the percentage of mutants killed by the suite.

# Original
def is_eligible(age):
    return age >= 18
 
# Mutant (> instead of >=)
def is_eligible(age):
    return age > 18
 
# If no test passes age=18, the mutant survives.
# The test suite has a gap — a real bug could slip through.

Popular tools: Stryker (JS/TS/C#), PIT (Java), mutmut and Cosmic Ray (Python), Pitest (Kotlin).

The Catch

Mutation testing is slow. Each mutant requires running the suite. For a suite with 1,000 tests and hundreds of mutants, you are looking at hours. Most teams:

  • Run mutation testing only on changed files in a PR (incremental mode)
  • Run it nightly on the full codebase
  • Scope it to business-critical modules rather than the whole suite

It is not something you run on every commit. But the information it provides — "here are the specific gaps in your testing" — is richer than any other measure.

Property-Based Testing

Traditional "example-based" tests check specific cases: "sort([3, 1, 2]) should equal [1, 2, 3]." Property-based testing checks properties that should hold for any input: "for any list, the sorted result should have the same length, should be in non-decreasing order, and should contain the same elements."

The framework then generates hundreds or thousands of random inputs (including adversarial ones — empty lists, single elements, long lists, lists with duplicates, lists with NaN) and checks the properties on each.

# Example-based — tests three cases
def test_sort():
    assert sort([3, 1, 2]) == [1, 2, 3]
    assert sort([]) == []
    assert sort([1]) == [1]
 
# Property-based (Hypothesis, Python)
from hypothesis import given, strategies as st
 
@given(st.lists(st.integers()))
def test_sort_properties(lst):
    result = sort(lst)
    assert len(result) == len(lst)
    assert result == sorted(result)
    assert sorted(result) == sorted(lst)

The property-based test catches things you never thought to check: lists with NaN, lists with -0.0, lists exactly 2^31 elements long, and so on. When a bug is found, the framework automatically shrinks the failing input — it finds the simplest failing case, so instead of a 10,000-element list you get a minimal reproduction: "here's a 3-element list that breaks it."

Tools: Hypothesis (Python), fast-check (JavaScript/TypeScript), QuickCheck (Haskell, the original), jqwik (Java), PropEr (Erlang), Proptest (Rust).

Property-based testing shines for:

  • Parsers and serializers (round-trip properties: parse(serialize(x)) == x)
  • Data structures with invariants
  • Algorithms with known mathematical properties (associativity, commutativity, idempotence)
  • Protocols with defined rules

It is less useful for code with no discernible properties, like UI glue or straight-line business workflows. Use it where the math applies.

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into production (or production-like) systems to discover weaknesses. Netflix popularized it with the Chaos Monkey — a service that randomly terminates production instances during business hours to ensure the rest of the system handles it gracefully.

The premise is counterintuitive but solid: you cannot know how your system handles failure until it fails, and the worst time to find out is during a real incident at 3 a.m. Better to fail on purpose, in controlled conditions, while the team is watching.

A mature chaos practice tests:

  • Infrastructure failures — kill an instance, drain a node, lose an AZ
  • Network failures — drop packets, add latency, partition services
  • Dependency failures — a downstream API returns 500s, a database goes read-only
  • Resource exhaustion — fill the disk, run out of file handles, max the CPU
  • Clock skew — time jumps forward, time jumps backward

The formal method is a game day: schedule a failure scenario, predict what will happen, inject the failure, observe what actually happens. Any gap between prediction and reality is a finding.

Tools: Netflix's Chaos Monkey and Simian Army, Gremlin (commercial), AWS Fault Injection Simulator, LitmusChaos (Kubernetes-native), Chaos Toolkit, ChaosMesh.

Starting Small

You do not need to run chaos in production on day one. A sensible path:

  1. Chaos in staging, with a small scoped experiment (kill one pod).
  2. Postmortem what happened. Fix gaps.
  3. Scale up the chaos complexity.
  4. Move to production, with a kill switch and a runbook.
  5. Automate chaos as part of routine operations.

Netflix has been doing this for over a decade. Most teams are years behind, and that is fine — start where you are.

Observability-Driven Testing

The traditional boundary between "testing" and "production" is dissolving. A number of techniques blur the line.

Synthetic Monitoring

Small scripted workflows — a user login, a checkout, an API call — that run continuously against production from outside. If they fail, you are alerted. Effectively, a smoke test that never stops.

Tools: Pingdom, Datadog Synthetics, Checkly, AWS CloudWatch Synthetics. Every production system should have at least a handful of these for critical journeys.

Feature Flags as a Testing Tool

Covered in Chapter 5, but worth repeating here: feature flags let you ship code to production and test it under real traffic conditions before enabling it for users. This is the highest-fidelity testing possible — real infrastructure, real database, real traffic, real integrations.

A/B Tests as Hypothesis Validation

A/B tests are usually framed as product tools, but they are also the ultimate form of validation testing: they tell you whether the thing you built actually achieves the outcome you wanted. No amount of unit testing can answer "does this new checkout flow increase conversions?" Only production traffic can.

SLOs and Error Budgets

Defining Service Level Objectives (99.9% availability, p95 latency under 300ms) and tracking them with real production data turns reliability into a measurable, testable property. If you exceed your error budget, you stop shipping features until you have paid back the gap. This is a test suite for your reliability, running in production, continuously.

Contract Testing Matures

Covered briefly in Chapter 4 — but the space has matured significantly. Tools like Pact now integrate with CI pipelines so that the provider's test suite fails if it breaks any consumer's expectations. Schema registries for event-driven systems (Confluent Schema Registry, AWS Glue Schema Registry) enforce compatibility rules on message formats, catching breaking changes at schema registration time rather than at message consumption time.

In distributed systems, contract testing is increasingly replacing most E2E testing. The realization: you don't need to run the full cluster to verify that service A and service B agree on an interface. You just need to verify that each service's behavior matches a shared contract.

The Rise of the Test Runner as Platform

A quiet but important shift: modern test runners are no longer just "execute the tests and report pass/fail." They are platforms.

  • Vitest (JS/TS) offers in-browser, in-node, and in-worker execution from the same config, with watch mode faster than anything previous.
  • pytest (Python) has an ecosystem of 1,000+ plugins, from parallelization to benchmark comparison to property-based testing.
  • Playwright (E2E) bundles trace viewers, video replay, codegen, and auto-wait into a single tool that has largely displaced Selenium.
  • Jest and Vitest treat tests as first-class build artifacts, with coverage, snapshots, and CI integration built in.

The implication for test authors: you are no longer writing "tests" in isolation; you are participating in a platform that expects you to leverage its features. Teams that adopt the platform's idioms — test isolation, proper fixtures, structured reporting — get significantly more out of it than teams that write bare-assert tests against any runner.

What to Learn Next

If you want to pick one thing from this chapter to try:

  • Junior engineer, no tests on a project yet: Learn TDD and write property-based tests for the trickiest logic you have.
  • Mid-level engineer, decent test suite: Try mutation testing on one module. You will find real gaps.
  • Senior engineer, large system in production: Run a game day. Pick one dependency, simulate its failure in staging, and see what breaks.
  • Team lead: Measure your feedback loop — time from git push to "merge is safe" — and find the biggest ten-minute win.

Closing Thought: Testing as a Craft

Across these seven chapters we have covered a lot: fundamentals, pyramids, TDD, automation, CI/CD, security, and modern practices. What unifies all of it is a single conviction — that the quality of software is determined not by how careful the engineer is, but by how quickly and honestly the engineer finds out when they are wrong.

Testing is the infrastructure of honesty. A good test suite tells you the truth: that the code works, that the last change did not break anything, that the system still does what it claimed to do an hour ago and a year ago. A bad test suite, or no test suite, leaves you with hope — and hope is not a strategy.

The goal is not perfection. It is a system where being wrong is cheap, so you can be wrong more often, discover it faster, and end up right more quickly than the teams who never let themselves be wrong at all.

Build the safety net. Then go fast.


This concludes the Software Testing series. Thanks for reading.