How Trunk Flaky Tests detects flaky tests

Flake Detection automatically identifies problematic tests in your test suite by monitoring test behavior over time. Instead of a single set of built-in detection rules, Trunk uses monitors: independent detectors that each watch for a specific pattern. When a monitor activates on a test, it runs the action you configured for that monitor, which either classifies the test as flaky or broken or applies labels to it.

How Monitors Work

Each monitor independently observes your test runs and keeps every test in one of two states: active (problematic behavior detected) or inactive (no problematic behavior). When a monitor transitions to active, it executes its configured action; when it resolves, it undoes that action (restoring health status, or removing the labels it applied). For monitors whose action is Classify test status (referred to below as health classification monitors), the test’s overall status is determined by combining all such monitors, with the most severe status winning:

Priority	Status	Condition
Highest	Broken	Any enabled broken-type monitor (failure rate or failure count) is active for this test
Middle	Flaky	Any enabled flaky-type monitor (failure rate, failure count, or pass-on-retry) is active
Lowest	Healthy	No active health classification monitor

If a test triggers both a broken monitor and a flaky monitor simultaneously, it shows as Broken. When the broken monitor resolves (e.g., you fix the regression and the failure rate drops), the test transitions to Flaky if a flaky monitor is still active, or to Healthy if no health classification monitors remain active. A test stays in its detected state until every health classification monitor that flagged it has independently resolved. Monitors configured to apply labels do not contribute to this status calculation — they only add or remove labels.
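The precedence rule described above can be sketched in a few lines of Python. This is an illustrative model only, not Trunk's implementation: the `Status` enum, the `MonitorState` dataclass, and the `overall_status` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Higher value = more severe; the names mirror the priority table.
    HEALTHY = 0
    FLAKY = 1
    BROKEN = 2


@dataclass
class MonitorState:
    # Hypothetical per-test view of one health classification monitor.
    name: str
    classifies_as: Status  # FLAKY or BROKEN
    active: bool


def overall_status(monitors: list[MonitorState]) -> Status:
    """Return the most severe status among active monitors, else HEALTHY."""
    active = [m.classifies_as for m in monitors if m.active]
    return max(active, default=Status.HEALTHY)


monitors = [
    MonitorState("failure-rate (broken)", Status.BROKEN, active=True),
    MonitorState("pass-on-retry", Status.FLAKY, active=True),
]
print(overall_status(monitors).name)  # BROKEN

monitors[0].active = False  # the broken monitor resolves
print(overall_status(monitors).name)  # FLAKY
```

Labeling monitors are deliberately absent from the list passed to `overall_status`, since they do not take part in the status calculation.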

Disabling or Deleting a Monitor

When you disable or delete a monitor, it is immediately set to resolved for every test case in the repo. For a health classification monitor, this triggers a status re-evaluation for all affected tests: if the disabled monitor was the only active health classification monitor for a test, that test transitions to healthy; if others are still active, the test remains in the most severe active state. For a labeling monitor, the labels it had applied are removed (subject to its Remove these labels when the monitor resolves setting). For example, if you have a broken failure rate monitor and a flaky pass-on-retry monitor, and you disable the broken monitor, any test that was only flagged by the broken monitor will become healthy. A test flagged by both will transition from broken to flaky (because pass-on-retry is still active).
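The example in the paragraph above can be replayed as a small simulation. It is a sketch under stated assumptions: the monitor names, test names, and the `disable` and `status_of` helpers are all hypothetical and exist only to show the re-evaluation logic.

```python
SEVERITY = {"healthy": 0, "flaky": 1, "broken": 2}

# Hypothetical monitors and the status each one classifies tests as.
monitor_status = {"broken-failure-rate": "broken", "pass-on-retry": "flaky"}

# Hypothetical map of test name -> monitors currently active for it.
active = {
    "test_login": {"broken-failure-rate"},
    "test_checkout": {"broken-failure-rate", "pass-on-retry"},
}


def status_of(test: str) -> str:
    """Most severe status among the monitors still active for the test."""
    statuses = [monitor_status[m] for m in active[test]]
    return max(statuses, key=SEVERITY.__getitem__, default="healthy")


def disable(monitor: str) -> None:
    """Resolve the monitor for every test so it no longer classifies."""
    for monitors in active.values():
        monitors.discard(monitor)
    monitor_status.pop(monitor, None)


disable("broken-failure-rate")
print(status_of("test_login"))     # healthy
print(status_of("test_checkout"))  # flaky
```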

Monitor Types

Trunk groups monitors into two categories based on what they do when they activate:

Health classification monitors determine a test’s overall health status (healthy, flaky, or broken). When one activates, the test’s status changes across the dashboard, CI annotations, and notifications.
Lifecycle and performance monitors apply labels to tests based on lifecycle events or performance characteristics. They do not affect health status. These monitors appear in a separate section of the monitors page.

Health Classification Monitors

Monitor	What it detects	Available actions	Default state
Pass-on-Retry	A test fails then passes on the same commit (retry after failure)	Classify (flaky) or apply labels	Enabled
Failure Rate	Failure rate exceeds a configured percentage over a time window	Classify (flaky or broken) or apply labels	Disabled
Failure Count	A test accumulates a configured number of failures in a rolling window	Classify (flaky or broken) or apply labels	Disabled
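To make the three detection patterns in the table concrete, here is a minimal sketch of each check. It is not Trunk's implementation. The `Run` type and function names are hypothetical, the list of runs is assumed to be in chronological order and already limited to the monitor's window, and the choice of a strict "greater than" for the rate and "at least" for the count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    commit: str
    passed: bool


def pass_on_retry(runs: list[Run]) -> bool:
    """True if some commit has a failure followed later by a pass."""
    failed_commits = set()
    for run in runs:  # assumed chronological
        if not run.passed:
            failed_commits.add(run.commit)
        elif run.commit in failed_commits:
            return True
    return False


def failure_rate_exceeds(runs: list[Run], threshold: float) -> bool:
    """True if the share of failing runs in the window is above threshold."""
    if not runs:
        return False
    failures = sum(not r.passed for r in runs)
    return failures / len(runs) > threshold


def failure_count_reached(runs: list[Run], limit: int) -> bool:
    """True if the window contains at least `limit` failures."""
    return sum(not r.passed for r in runs) >= limit


window = [Run("abc", False), Run("abc", True), Run("def", True), Run("123", False)]
print(pass_on_retry(window))              # True: "abc" failed, then passed
print(failure_rate_exceeds(window, 0.3))  # True: 2 of 4 runs failed
print(failure_count_reached(window, 3))   # False: only 2 failures
```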

Lifecycle and Performance Monitors

These monitors apply labels based on lifecycle events or performance characteristics. They do not classify tests as flaky or broken, and they do not contribute to the test’s overall health status.

Monitor	What it detects	Available actions	Default state
New Test	A test case seen for the first time, tracked for a configurable grace period	Apply labels	Disabled
Skipped Test	A test is consistently skipped across runs within a time window	Apply labels	Disabled
Slow Test	A test’s average duration exceeds a configured threshold	Apply labels	Disabled

You can run multiple monitors simultaneously. For example, you might use pass-on-retry to catch classic retry-based flakiness while also running failure rate monitors scoped to different branches. A common pattern is to pair a broken-type failure rate monitor (catching consistently failing tests) with a flaky-type failure rate monitor (catching intermittently failing tests). See Failure Rate Monitor: Recommended Configurations for details.

The failure count monitor complements failure rate monitors by reacting to individual failures rather than failure rates. Use it on branches where any failure is a meaningful signal, like main or merge queue branches.

If you need to manually flag a test that automated monitors haven’t caught, use Flag as Flaky from the test detail page.

Dry-Running with Labels

You can preview how a new health classification monitor would behave by deploying it as a labeling monitor first. Because Apply labels attaches labels without changing health status, you can let the monitor run on live test data, see which tests it activates on, refine the settings, and only flip it to Classify test status once you trust the configuration. The flow is typically:

Create the monitor with Apply labels and a dedicated label (e.g., would-be-flaky).
Let the monitor run for a few cycles and observe which tests pick up the label.
Refine the settings until the labeled set matches what you want classified.
Switch the monitor’s action to Classify test status.

The Preview Panel on each monitor config form shows a static snapshot at configuration time, but a label dry-run validates the monitor against live runs without committing to a status change.

Branch-Aware Detection

Tests often behave differently depending on where they run. Failures on main are usually unexpected and signal flakiness. Failures on PR branches may be expected during active development. Merge queue failures are suspicious because the code has already passed PR checks.

Rather than applying a single set of branch rules automatically, Trunk gives you control over how detection treats different branches through branch scoping on failure rate monitors. You can create separate monitors with different thresholds and windows for your stable branch, PR branches, and merge queue branches. See Failure Rate Monitor: Recommended Configurations for specific guidance.

Pass-on-retry detection is branch-agnostic. It flags any test that fails and passes on the same commit, regardless of which branch the test ran on.
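One way to picture branch scoping on failure rate monitors is as a set of monitors, each with its own branch patterns and threshold. The sketch below is purely illustrative: the dictionary layout is not a Trunk configuration format, and the branch patterns (including `merge-queue/*`), monitor names, and threshold values are all made-up assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical monitors; patterns and thresholds are illustrative only.
MONITORS = [
    {"name": "stable", "branches": ["main"], "threshold": 0.05},
    {"name": "merge-queue", "branches": ["merge-queue/*"], "threshold": 0.10},
    {
        "name": "pull-requests",
        "branches": ["*"],
        "exclude": ["main", "merge-queue/*"],
        "threshold": 0.50,
    },
]


def monitors_for(branch: str) -> list[str]:
    """Names of the monitors whose branch scope covers the given branch."""
    matched = []
    for monitor in MONITORS:
        included = any(fnmatchcase(branch, p) for p in monitor["branches"])
        excluded = any(fnmatchcase(branch, p) for p in monitor.get("exclude", []))
        if included and not excluded:
            matched.append(monitor["name"])
    return matched


print(monitors_for("main"))               # ['stable']
print(monitors_for("merge-queue/pr-12"))  # ['merge-queue']
print(monitors_for("feature/login"))      # ['pull-requests']
```

A stricter threshold on the stable branch and a looser one on PR branches reflects the reasoning above: failures on main are usually unexpected, while failures on PR branches may be expected during development.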

Muting Monitors

You can temporarily mute a monitor for a specific test case. A muted monitor continues to run and record detections, but it won’t contribute to the test’s flaky status until the mute expires. This is useful when you know a test is flaky but want to suppress the signal temporarily, for example while a fix is in progress or during a known infrastructure issue. Unlike Flag as Flaky, which is a persistent user override, muting preserves the detection history, and the monitor automatically counts toward the test’s status again once the mute period ends.

How Muting Works

You can mute a monitor from the test case view in the Trunk app. When muting, you choose a duration:

Duration
1 hour
4 hours
24 hours
7 days
30 days

While muted, the monitor is excluded from the test’s status calculation. If the muted monitor was the only active health classification monitor, the test transitions from flaky to healthy for the duration of the mute. When the mute expires, the monitor is automatically included in the next status evaluation. If it’s still active, the test will be flagged again. You can also unmute a monitor early from the test case view.
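The effect of a mute on the status calculation can be sketched as a filter over the active monitors. This is a simplified model with a single flaky-type monitor, not Trunk's implementation; the `contributing_monitors` helper and the data layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def contributing_monitors(active: set[str], mutes: dict[str, datetime],
                          now: datetime) -> set[str]:
    """Active monitors that are not muted, or whose mute has expired."""
    return {m for m in active if m not in mutes or mutes[m] <= now}


def status(active: set[str], mutes: dict[str, datetime], now: datetime) -> str:
    # Simplification: every monitor here classifies tests as flaky.
    return "flaky" if contributing_monitors(active, mutes, now) else "healthy"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
active = {"pass-on-retry"}                            # still detecting
mutes = {"pass-on-retry": now + timedelta(hours=24)}  # muted for 24 hours

print(status(active, mutes, now))                        # healthy
print(status(active, mutes, now + timedelta(hours=25)))  # flaky
```

The monitor stays in the `active` set throughout, which mirrors the point that a muted monitor keeps running and recording detections.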

You can only mute a monitor that has already detected flaky behavior for a test. If a monitor has never been active for a test, the mute option is disabled.

When to Mute vs. Other Options

Situation	Recommended action
Fix is in progress and you want to suppress noise temporarily	Mute the monitor for a few days
Test is flaky but no automated monitor has caught it	Use Flag as Flaky to mark it as flaky
You want to stop a monitor from evaluating a test permanently	Adjust the monitor’s branch scope or thresholds instead
You want to suppress all flaky signals for a test	Mute each active monitor individually, or address the root cause

Variants

If you run the same tests across different environments or architectures, you can use variants to separate these runs into distinct test cases. This lets monitors detect environment-specific flakes. For example, a test might be flaky on iOS but stable on Android. Using variants, monitors isolate flakes on the iOS variant instead of marking the test as flaky across all environments. See the Trunk Analytics CLI docs for details on how to upload with variants.
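Conceptually, a variant becomes part of the identity of a test case, so results from different environments never mix. The sketch below illustrates that idea with made-up data; the tuple layout and the `mixed_results` check are hypothetical stand-ins, not how Trunk stores uploads or how its monitors decide.

```python
from collections import defaultdict

# Hypothetical uploads: (test name, variant, passed)
uploads = [
    ("test_render", "ios", False),
    ("test_render", "ios", True),
    ("test_render", "android", True),
    ("test_render", "android", True),
]

# Group results by (test name, variant) so each variant is its own test case.
results = defaultdict(list)
for name, variant, passed in uploads:
    results[(name, variant)].append(passed)


def mixed_results(outcomes: list[bool]) -> bool:
    """Crude stand-in for a monitor: both passes and failures were seen."""
    return True in outcomes and False in outcomes


flaky_cases = sorted(key for key, outcomes in results.items() if mixed_results(outcomes))
print(flaky_cases)  # [('test_render', 'ios')]
```

Without the variant in the key, all four results would land on a single test case and the Android runs would share the flaky classification.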

Detection Time

Flake detection runs automatically as test uploads are processed. After a test with configured flake detection is uploaded, it takes at most 20 minutes for flakiness to be detected.
