HTTP Monitor
The http-monitor plugin watches HTTP endpoints for availability and can automatically restart services when they become unhealthy. It contributes to all four OODA phases: an observer for health checks, an orienter for failure analysis, a decision strategy for restart logic, and an action for executing restarts.
Installation
Section titled “Installation”import { createAgent } from 'zupdev';import { httpMonitor } from 'zupdev/plugins/http-monitor';
const agent = await createAgent({ name: 'my-agent', plugins: [ httpMonitor({ endpoints: [ { id: 'api', name: 'API Server', url: 'https://api.example.com/health', }, { id: 'dashboard', name: 'Dashboard', url: 'https://dashboard.example.com/health', expectedStatus: 200, timeout: 10000, critical: true, }, ], checkInterval: 30000, maxHistorySize: 50, }), ],});Plugin options
Section titled “Plugin options”| Field | Type | Default | Description |
|---|---|---|---|
endpoints | EndpointConfig[] | — | Required. At least one endpoint must be configured. |
checkInterval | number | 30000 | How often to check endpoints, in milliseconds. This is set as the observer’s interval, so it is only enforced in continuous mode. |
maxHistorySize | number | 50 | Maximum number of health check results to keep per endpoint. Older results are trimmed. |
Endpoint configuration
Section titled “Endpoint configuration”Each endpoint describes one HTTP service to monitor:
| Field | Type | Default | Description |
|---|---|---|---|
id | string | — | Required. Unique identifier for this endpoint. |
name | string | — | Required. Human-readable name used in logs and assessments. |
url | string | — | Required. URL to check. |
method | string | 'GET' | HTTP method for the health check request. |
expectedStatus | number | 200 | The status code that indicates a healthy response. |
timeout | number | 5000 | Request timeout in milliseconds. |
headers | Record<string, string> | — | Custom headers sent with every health check request. Useful for auth tokens. |
restartStrategy | RestartStrategy | — | How to restart this service when it fails. If not set, restarts are not attempted. |
failureThreshold | number | 3 | Number of consecutive failures before the endpoint is considered unhealthy and eligible for restart. |
cooldownPeriod | number | 300000 | Minimum time between restarts in milliseconds (default: 5 minutes). Prevents restart loops. |
critical | boolean | false | If true, failures produce 'critical' severity observations (instead of 'error') and the decision strategy sets requiresApproval: true on restart decisions. |
Restart strategies
Section titled “Restart strategies”The restartStrategy field determines how the plugin restarts a service. Three strategy types are available.
Command strategy
Section titled “Command strategy”Run a shell command to restart the service:
{ id: 'api', name: 'API Server', url: 'https://api.example.com/health', restartStrategy: { type: 'command', command: 'systemctl restart api-server', cwd: '/opt/api', },}The command field can be a string (split on spaces) or an array of strings:
// String form -- split on spacesrestartStrategy: { type: 'command', command: 'docker compose restart api',}
// Array form -- no splitting, handles args with spacesrestartStrategy: { type: 'command', command: ['docker', 'compose', 'restart', 'api'], cwd: '/opt/services',}| Field | Type | Required | Description |
|---|---|---|---|
type | 'command' | Yes | Selects the command strategy. |
command | string | string[] | Yes | Command to execute. |
cwd | string | No | Working directory for the command. |
The command is executed with Bun.spawn. If the process exits with a non-zero exit code, the restart is considered failed and the stderr output is included in the error message.
HTTP strategy
Section titled “HTTP strategy”Call an HTTP endpoint to trigger a restart (for example, a deployment API or orchestrator webhook):
{ id: 'worker', name: 'Worker Service', url: 'https://worker.example.com/health', restartStrategy: { type: 'http', url: 'https://deploy.example.com/api/services/worker/restart', method: 'POST', headers: { 'Authorization': 'Bearer deploy-token', }, body: { force: true }, timeout: 30000, },}| Field | Type | Required | Description |
|---|---|---|---|
type | 'http' | Yes | Selects the HTTP strategy. |
url | string | Yes | URL to call for the restart. |
method | string | No | HTTP method. Defaults to 'POST'. |
body | unknown | No | Request body. Serialized as JSON. |
headers | Record<string, string> | No | Additional headers. |
timeout | number | No | Request timeout in ms. Defaults to 30000. |
The restart is considered failed if the response status is not in the 2xx range.
Function strategy
Section titled “Function strategy”Call a custom async function for full programmatic control:
{ id: 'api', name: 'API Server', url: 'https://api.example.com/health', restartStrategy: { type: 'function', handler: async () => { // Custom restart logic -- call Kubernetes API, Fly.io API, etc. await k8s.restartDeployment('api-server', 'production'); }, },}| Field | Type | Required | Description |
|---|---|---|---|
type | 'function' | Yes | Selects the function strategy. |
handler | () => Promise<void> | Yes | Async function to execute. Throw to signal failure. |
Failure detection and cooldown
Section titled “Failure detection and cooldown”The plugin tracks state for each endpoint independently:
-
Consecutive failure counter — Incremented on each failed health check. Reset to
0on success or after a successful restart. -
Failure threshold — When
consecutiveFailures >= failureThreshold, the endpoint is considered unhealthy and eligible for restart. Below the threshold, failures produce'warning'severity observations. At or above the threshold, they produce'error'(or'critical'if the endpoint is markedcritical). -
Cooldown period — After a restart, the endpoint enters a cooldown period (default 5 minutes). During cooldown, no further restarts are attempted even if the endpoint continues to fail. The decision strategy logs a warning and moves on to the next endpoint.
-
No restart strategy — If an endpoint has no
restartStrategyconfigured, the decision strategy skips it with a warning. Health checks and failure detection still work.
OODA phase contributions
Section titled “OODA phase contributions”Observe: http-health-check
Section titled “Observe: http-health-check”The observer performs HTTP health checks against all configured endpoints. For each endpoint, it produces one Observation with:
source:'http-monitor/health-check'type:'metric'severity:'info'(healthy),'warning'(failing below threshold),'error'(failing above threshold), or'critical'(critical endpoint above threshold)data: endpoint ID, name, URL, success status, status code, response time, error message, consecutive failures, last restart time
Orient: analyze-endpoint-failures
Section titled “Orient: analyze-endpoint-failures”The orienter filters observations from the health check observer and produces a SituationAssessment:
- Healthy:
findings: ['All monitored endpoints are healthy'], confidence1.0 - Unhealthy: Lists each failing endpoint with its failure count and error. Sets
contributingFactorto either'Cascading failure detected'(50% or more endpoints failing) or'Isolated failure'.
Decide: restart-unhealthy-endpoint
Section titled “Decide: restart-unhealthy-endpoint”The decision strategy is only applicable when assessments mention “unhealthy” or “failure” in their findings. It scans endpoints to find the first one that:
- Has
consecutiveFailures >= failureThreshold - Is past the cooldown period
- Has a
restartStrategyconfigured
If found, it returns a decision to execute http-monitor:restartService with confidence: 0.85 and risk based on whether the endpoint is critical (medium) or not (low). Critical endpoints set requiresApproval: true.
If no endpoints qualify, it returns a no-op.
Act: restartService
Section titled “Act: restartService”The action executes the endpoint’s configured restart strategy. On success, it resets the consecutive failure counter and records the restart time. The action has default autonomy settings of mode: 'auto' with minConfidence: 0.7.
REST API endpoints
Section titled “REST API endpoints”The plugin registers three API endpoints:
GET /http-monitor/endpoints
Section titled “GET /http-monitor/endpoints”List all monitored endpoints with their current state.
Response:
{ "endpoints": [ { "id": "api", "name": "API Server", "url": "https://api.example.com/health", "consecutiveFailures": 0, "lastRestartTime": null, "recentChecks": [ { "endpointId": "api", "url": "https://api.example.com/health", "success": true, "statusCode": 200, "responseTime": 45, "timestamp": "2025-06-15T10:30:00.000Z" } ] } ]}POST /http-monitor/endpoints/:endpointId/check
Section titled “POST /http-monitor/endpoints/:endpointId/check”Trigger an immediate health check for a specific endpoint, bypassing the observer interval.
Response:
{ "endpointId": "api", "url": "https://api.example.com/health", "success": true, "statusCode": 200, "responseTime": 52, "timestamp": "2025-06-15T10:31:00.000Z"}POST /http-monitor/endpoints/:endpointId/restart
Section titled “POST /http-monitor/endpoints/:endpointId/restart”Manually trigger a restart for a specific endpoint, bypassing the decision strategy and cooldown logic.
Response (success):
{ "action": "restart-service", "success": true, "output": "Successfully restarted service for API Server", "duration": 3200, "sideEffects": ["Service restart triggered for https://api.example.com/health"]}Response (failure):
{ "action": "restart-service", "success": false, "error": "Command failed with exit code 1: permission denied", "duration": 150}All three endpoints require authentication (Bearer token) by default.
Full example
Section titled “Full example”import { createAgent } from 'zupdev';import { httpMonitor } from 'zupdev/plugins/http-monitor';
const agent = await createAgent({ name: 'infra-monitor', mode: 'continuous', loopInterval: 15000, api: { port: 3000, auth: { apiKeys: [{ key: process.env.API_KEY!, name: 'admin' }], }, }, plugins: [ httpMonitor({ checkInterval: 30000, maxHistorySize: 100, endpoints: [ { id: 'api', name: 'API Gateway', url: 'https://api.example.com/health', timeout: 5000, failureThreshold: 3, cooldownPeriod: 300000, restartStrategy: { type: 'command', command: ['docker', 'compose', 'restart', 'api'], cwd: '/opt/services', }, }, { id: 'db', name: 'Database Proxy', url: 'https://db-proxy.example.com/health', critical: true, failureThreshold: 5, cooldownPeriod: 600000, restartStrategy: { type: 'http', url: 'https://orchestrator.internal/restart/db-proxy', method: 'POST', headers: { 'Authorization': 'Bearer internal-token' }, }, }, { id: 'cache', name: 'Cache Layer', url: 'http://cache.internal:6379/ping', expectedStatus: 200, // No restart strategy -- monitor only }, ], }), ],});
const server = agent.startApi({ port: 3000 });await agent.start();What this does:
- The API Gateway is checked every 30 seconds. After 3 consecutive failures, a Docker Compose restart is attempted. 5-minute cooldown between restarts.
- The Database Proxy is marked critical. After 5 failures, a restart is requested but requires human approval (because
critical: true). 10-minute cooldown. - The Cache Layer is monitored only — no restart strategy is configured, so failures are observed and reported but never acted on.