Heartbeat Monitoring#

Heartbeat is the always-on monitoring system for managed agents. At a configurable interval, the agent runs its monitoring instructions, checks your systems, and reports findings.


How It Works#

  1. Timer fires at the configured interval
  2. Agent loads its Heartbeat workspace file — the natural language instructions for what to check
  3. Agent uses its tools (kubectl, AWS CLI, GitHub, shell) to perform checks
  4. Agent generates a report of findings
  5. Report goes through a three-stage delivery gate:
     • Empty check — if there's nothing to report, nothing is delivered
     • Duplicate suppression — an exact match of a report delivered in the last 24 hours is suppressed
     • Classification gate — a fast LLM evaluation compares the new report to the last delivered alert and decides DELIVER or SUPPRESS
  6. If delivered, the report is sent to web chat and Slack (if connected)

This design means your agent only alerts you when something meaningful has changed.


Configuration#

Heartbeat interval#

Default: 8 hours. Minimum: 1 minute.

Change through conversation:

Set your heartbeat to every 2 hours.

Or set a specific interval:

Check every 30 minutes.

Monitoring instructions#

Tell the agent what to check by updating its Heartbeat workspace file:

Monitor the production namespace. Report:
- CrashLoopBackOff pods
- Pods stuck in Pending for more than 5 minutes
- Any pod using more than 80% of its memory limit
- Certificate expiry within 14 days

The agent updates its Heartbeat file and starts checking these items on the next cycle.

Enable/disable#

Pause your heartbeat.
Resume your heartbeat.

Smart alert suppression#

The three-stage delivery gate prevents alert fatigue:

Stage 1: Empty check. If the heartbeat produced no output, nothing is delivered.

Stage 2: Duplicate suppression. The platform compares the new report's full text against the last delivered report. If they are an exact string match (character-for-character, not semantic), and the previous report was delivered within the last 24 hours, the new report is suppressed. This is a programmatic check, not an LLM evaluation.

Stage 3: Classification gate. A fast LLM evaluation (Claude Haiku) compares the new report to the last delivered alert and decides DELIVER or SUPPRESS. This catches semantically similar reports that differ in wording but describe the same situation.

Fail-open: if the classification LLM call fails for any reason, the report is delivered. Availability over precision — you get a duplicate rather than a missed alert.

Alert suppression is always active and is not configurable. Every heartbeat execution is recorded regardless of whether its report was delivered, so the full monitoring history — including suppressed reports — is available for review.


Slack-Triggered Heartbeats#

A Slack channel can trigger an immediate heartbeat when a message arrives. This is useful for incident channels — when someone posts in the channel, the agent runs its monitoring cycle with the Slack message as additional context.

An agent's heartbeat trigger channel is configured per-agent along with a throttle interval (default: 60 seconds) that prevents the agent from firing a heartbeat on every message in a fast-moving channel. When a message arrives in the trigger channel, if enough time has passed since the last trigger, an immediate heartbeat fires. The heartbeat invocation includes recent channel messages as context — the agent knows what triggered the check.
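The throttle behaves like a classic rate limiter: a message fires a heartbeat only if the throttle interval has elapsed since the last trigger. A minimal sketch (hypothetical class name, assuming the documented 60-second default):

```python
from datetime import datetime, timedelta

class TriggerThrottle:
    """Rate-limits Slack-triggered heartbeats in a fast-moving channel."""

    def __init__(self, interval: timedelta = timedelta(seconds=60)):
        self.interval = interval
        self.last_fired: datetime | None = None

    def should_fire(self, now: datetime) -> bool:
        # Fire only if enough time has passed since the last trigger.
        if self.last_fired is None or now - self.last_fired >= self.interval:
            self.last_fired = now
            return True
        return False
```

Under this model, a burst of messages in an incident channel produces one heartbeat per throttle window rather than one per message.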

Each agent can have one heartbeat trigger channel.


Error handling#

  • Auto-stop: after 5 consecutive heartbeat failures, the heartbeat is automatically stopped. You'll be notified. Restart it through conversation or the dashboard.
  • Budget awareness: if the organization's spending limit is exceeded, the heartbeat skips gracefully and does not count as a failure.
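The interaction between the two rules above is worth making explicit: a budget skip leaves the failure counter untouched, while a success resets it. A hedged sketch (hypothetical names, not the platform's code):

```python
MAX_CONSECUTIVE_FAILURES = 5  # auto-stop threshold from the docs above

class FailureTracker:
    """Tracks heartbeat outcomes; budget skips don't count as failures."""

    def __init__(self):
        self.consecutive_failures = 0
        self.stopped = False

    def record(self, outcome: str) -> None:
        if outcome == "success":
            self.consecutive_failures = 0       # any success resets the streak
        elif outcome == "failure":
            self.consecutive_failures += 1
            if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                self.stopped = True             # auto-stop; owner is notified
        # outcome == "budget_skip": graceful skip, counter unchanged
```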

Example#

An SRE agent with this heartbeat configuration:

# Heartbeat

Every cycle, check:
1. Pod health in production — CrashLoopBackOff, OOMKilled, Pending
2. Node utilization — alert above 85% CPU or memory
3. Recent deployments — summarize what changed
4. PagerDuty incidents — any open P1/P2 incidents

Running every 2 hours, the agent:

  • Runs kubectl to check pod status and node metrics
  • Reviews recent deployments via the GitHub integration
  • Checks PagerDuty via the policy-controlled proxy
  • Delivers a consolidated report — or stays quiet if nothing has changed