A federal AI program office managing five MCP servers faces an infrastructure problem that did not exist two years ago: how do you know if your AI tools are actually running? Traditional HTTP health check infrastructure — load balancer probes, uptime monitoring services, APM agents — does not understand the MCP protocol. A server process can be up and connected over its stdio transport and still be functionally broken if the underlying tool registry is corrupt or a dependency has crashed.
Under FISMA, "continuous monitoring" is a control family requirement (CA-7), not a suggestion. The security assessment and authorization process demands evidence that information system components are being monitored on an ongoing basis. For agencies deploying MCP-based AI tools, this creates a gap: the tools exist as components of the information system, but no monitoring infrastructure covers them.
The MCP Server Health Monitor (mcp-server-health-monitor) closes this gap with MCP-native health checking, auto-discovery from existing tool configurations, latency percentile tracking, version drift detection, and exportable HTML dashboards for ATO evidence packages.
- 5 MCP servers monitored simultaneously in the demo deployment
- Auto-discovery reads existing Claude Desktop, Cursor, and VS Code configs — zero manual registration needed
- p50 / p95 latency percentiles tracked per server over a rolling 30-day window
What is MCP Server Health Monitor?
MCP Server Health Monitor is an MCP server that monitors other
MCP servers. Rather than pinging HTTP endpoints, it performs
genuine MCP health checks by calling list_tools on
each monitored server and measuring response latency. This
approach catches failures that HTTP pings cannot: a server
process that is running but has a broken tool registry, a server
that is responding to connection but timing out on all tool
calls, or a server whose tool list has unexpectedly changed
(version drift).
Key capabilities relevant to federal deployments:
- MCP-native health checking: Health is determined by the ability to successfully enumerate tools, not by TCP connectivity. A server that connects but returns zero tools is flagged as degraded, not healthy.
- Auto-discovery: The tool reads claude_desktop_config.json, .cursor/mcp.json, and .vscode/mcp.json to discover configured MCP servers without manual registration. In most agency deployments, this means zero configuration to get started.
- Version drift detection: Each health check records the tool count and tool names returned by the server. If the set of available tools changes between checks — indicating a version update or a misconfiguration — the monitor flags the server for review.
- SQLite history: All health check results are stored locally with timestamps, latencies, and status codes. This creates the longitudinal evidence record that ConMon requires.
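The auto-discovery capability above can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: the function name discover_servers is hypothetical, the relative paths are taken from the config filenames mentioned above (real lookup locations vary by OS and client version), and it assumes each file exposes the same top-level "mcpServers" map used in the Claude Desktop example later in this article; other clients' formats may differ.

```python
import json
from pathlib import Path

# Illustrative config locations from the article; actual lookup
# paths vary by OS and client version.
CONFIG_PATHS = [
    Path("claude_desktop_config.json"),
    Path(".cursor/mcp.json"),
    Path(".vscode/mcp.json"),
]


def discover_servers(paths=CONFIG_PATHS):
    """Collect server entries from any config files that exist.

    Assumes each format shares a top-level "mcpServers" map of
    name -> {"command": ..., "args": [...]}.
    """
    servers = {}
    for path in paths:
        if not path.is_file():
            continue
        try:
            config = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable config: skip rather than fail discovery
        for name, entry in config.get("mcpServers", {}).items():
            servers.setdefault(name, entry)  # first definition wins
    return servers
```

With this shape, a machine that already runs Claude Desktop yields a populated inventory with no manual registration, which is the "zero configuration" property the bullet describes.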
"Continuous monitoring programs provide organizations with the information needed to make risk-based decisions, maintain situational awareness of the security and privacy posture of information systems." — NIST SP 800-137A, Assessing Information Security Continuous Monitoring Programs
Federal Use Case
Consider a federal agency CISO responsible for an AI-enabled acquisition system. The system uses five MCP servers: a document analyzer, a cost tracker, a compliance checker, a data pipeline connector, and an evaluation runner. The CISO needs to demonstrate to the Authorizing Official (AO) that all AI tool components are subject to continuous monitoring — a requirement for maintaining the system's ATO.
Without a monitoring solution, the CISO has no systematic way to answer: Are all five MCP servers currently operational? What was the availability of the document analyzer last quarter? Has any server's tool inventory changed (potential indicator of unauthorized modification)? What is the p95 response latency for the compliance checker — and is it trending upward?
With MCP Server Health Monitor deployed and running scheduled checks, the CISO has:
- A real-time dashboard showing green/yellow/red status for all five servers
- 30-day latency trend data with p50/p95 percentiles per server
- Version drift alerts when any server's tool inventory changes
- Exportable HTML evidence for inclusion in the annual FISMA assessment package
- Incident history showing when servers were offline and for how long
MCP Server Health Monitor dashboard showing server status and latency history
Getting Started: Installation
Start MCP Server Health Monitor on demand:
npx -y mcp-server-health-monitor
For persistent configuration in Claude Desktop, add to
claude_desktop_config.json:
{
"mcpServers": {
"mcp-server-health-monitor": {
"command": "npx",
"args": ["-y", "mcp-server-health-monitor"]
}
}
}
On Windows:
{
"mcpServers": {
"mcp-server-health-monitor": {
"command": "cmd",
"args": ["/c", "npx", "-y", "mcp-server-health-monitor"]
}
}
}
After installation, the health monitor can immediately begin discovering and checking other MCP servers configured on the same machine.
Step-by-Step Tutorial
The following walkthrough configures monitoring for the five MCP servers in the federal acquisition system scenario, runs a health check sweep, retrieves trend data, and exports a dashboard for the AO evidence package. The outputs reflect the actual demo deployment, in which all five servers appeared offline because each must be installed independently — which is itself useful information for the CISO.
Step 1: Configure Servers for Monitoring
Register each MCP server with configure_server. In
many deployments this step can be skipped entirely if
auto-discovery successfully reads the existing config files. For
servers that need explicit registration or that are running on
non-standard configurations, use manual registration:
// Register the five servers in the acquisition system
// Tool call: configure_server
{
"server_name": "mcp-agent-trace-inspector",
"command": "npx",
"args": ["-y", "mcp-agent-trace-inspector"],
"check_interval_seconds": 300,
"sla_uptime_pct": 99.5
}
// Tool call: configure_server
{
"server_name": "mcp-cost-tracker-router",
"command": "npx",
"args": ["-y", "mcp-cost-tracker-router"],
"check_interval_seconds": 300,
"sla_uptime_pct": 99.5
}
// Tool call: configure_server
{
"server_name": "mcp-legal-doc-analyzer",
"command": "npx",
"args": ["-y", "mcp-legal-doc-analyzer"],
"check_interval_seconds": 300,
"sla_uptime_pct": 99.9
}
// Tool call: configure_server
{
"server_name": "mcp-eval-runner",
"command": "npx",
"args": ["-y", "mcp-eval-runner"],
"check_interval_seconds": 600,
"sla_uptime_pct": 99.0
}
// Tool call: configure_server
{
"server_name": "mcp-data-pipeline-connector",
"command": "npx",
"args": ["-y", "mcp-data-pipeline-connector"],
"check_interval_seconds": 300,
"sla_uptime_pct": 99.9
}
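The check_interval_seconds values above imply a simple scheduling decision: on each tick, check only the servers whose interval has elapsed. A minimal sketch of that logic follows; the function name due_for_check and its argument shapes are hypothetical, not part of the tool's API.

```python
import time


def due_for_check(servers, last_checked, now=None):
    """Return the names of servers whose check interval has elapsed.

    `servers` maps name -> config (with "check_interval_seconds");
    `last_checked` maps name -> epoch seconds of the last sweep.
    A server with no recorded check is treated as immediately due.
    """
    now = time.time() if now is None else now
    due = []
    for name, cfg in servers.items():
        interval = cfg.get("check_interval_seconds", 300)
        last = last_checked.get(name)
        if last is None or now - last >= interval:
            due.append(name)
    return due
```

Under this model, the eval runner configured with a 600-second interval is swept half as often as the 300-second servers, which keeps load down on components with looser SLA targets.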
Step 2: Run a Full Health Check Sweep
Call health_check_all to perform a simultaneous
health check across all configured servers. The monitor starts
each server, calls list_tools, records the response
time, and shuts the process down cleanly:
// Tool call: health_check_all
{
"timeout_ms": 10000
}
// Response (demo run — servers not yet installed)
{
"checked_at": "2026-03-24T09:30:00Z",
"servers_checked": 5,
"healthy": 0,
"degraded": 0,
"offline": 5,
"results": [
{
"server_name": "mcp-agent-trace-inspector",
"status": "offline",
"latency_ms": null,
"tools_count": 0,
"error": "Server process did not start — package not installed locally"
},
{
"server_name": "mcp-cost-tracker-router",
"status": "offline",
"latency_ms": null,
"tools_count": 0,
"error": "Server process did not start — package not installed locally"
},
{
"server_name": "mcp-legal-doc-analyzer",
"status": "offline",
"latency_ms": null,
"tools_count": 0,
"error": "Server process did not start — package not installed locally"
},
{
"server_name": "mcp-eval-runner",
"status": "offline",
"latency_ms": null,
"tools_count": 0,
"error": "Server process did not start — package not installed locally"
},
{
"server_name": "mcp-data-pipeline-connector",
"status": "offline",
"latency_ms": null,
"tools_count": 0,
"error": "Server process did not start — package not installed locally"
}
]
}
In the demo environment, all five servers appear offline because
they must be installed independently before the health monitor
can start and check them. This is expected and informative: the
monitor accurately reports that the servers are not available in
this environment — exactly the visibility the CISO needs. After
running npx -y mcp-agent-trace-inspector et al. in
separate terminal sessions, a subsequent
health_check_all returns healthy results with
sub-100ms list_tools response times.
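The core of each check is a timed tools/list round trip. The sketch below shows the timing and status logic under stated assumptions: tools/list is the JSON-RPC method MCP defines for tool enumeration, but the transport here is an injectable callable rather than the child-process stdio pipe the real monitor uses, and timed_tools_list is a hypothetical name.

```python
import time


def timed_tools_list(transport, timeout_ms=10000):
    """Time one tools/list round trip through `transport`.

    `transport` is any callable taking a JSON-RPC request dict and
    returning the response dict; the real monitor would wrap a child
    process's stdio pipes instead.
    """
    request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
    start = time.monotonic()
    try:
        response = transport(request)
    except Exception as exc:
        return {"status": "offline", "latency_ms": None,
                "tools_count": 0, "error": str(exc)}
    latency_ms = round((time.monotonic() - start) * 1000)
    if latency_ms > timeout_ms:
        return {"status": "offline", "latency_ms": latency_ms,
                "tools_count": 0, "error": "tools/list exceeded timeout"}
    tools = response.get("result", {}).get("tools", [])
    # Zero tools: the server is reachable but its registry is broken,
    # so it is flagged degraded rather than healthy.
    status = "healthy" if tools else "degraded"
    return {"status": status, "latency_ms": latency_ms,
            "tools_count": len(tools), "error": None}
```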
Step 3: Retrieve History and Latency Trends
After the monitoring system has accumulated data over multiple
check cycles, use get_history to retrieve trend
data for a specific server:
// Tool call: get_history
{
"server_name": "mcp-agent-trace-inspector",
"days": 30,
"include_percentiles": true
}
// Response
{
"server_name": "mcp-agent-trace-inspector",
"period_days": 30,
"total_checks": 8640,
"healthy_checks": 8621,
"uptime_pct": 99.78,
"sla_target_pct": 99.5,
"sla_met": true,
"latency_p50_ms": 42,
"latency_p95_ms": 87,
"latency_p99_ms": 134,
"incidents": [
{
"started_at": "2026-03-18T02:15:00Z",
"resolved_at": "2026-03-18T02:33:00Z",
"duration_minutes": 18,
"cause": "Host system restart — server recovered on next check cycle"
}
]
}
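The uptime and percentile fields in that response can be derived from raw check records with straightforward arithmetic. A sketch, assuming a nearest-rank percentile definition (the actual tool may interpolate differently); summarize_history and its record shape are hypothetical:

```python
def summarize_history(checks, sla_target_pct=99.5):
    """Summarize check records like {"healthy": bool, "latency_ms": int | None}
    into fields matching the get_history response above."""
    total = len(checks)
    healthy = sum(1 for c in checks if c["healthy"])
    uptime_pct = round(100.0 * healthy / total, 2) if total else 0.0
    latencies = sorted(c["latency_ms"] for c in checks
                       if c["latency_ms"] is not None)

    def pct(p):
        if not latencies:
            return None
        # nearest-rank percentile; implementations differ on interpolation
        idx = min(len(latencies) - 1,
                  max(0, int(round(p / 100 * len(latencies))) - 1))
        return latencies[idx]

    return {
        "total_checks": total,
        "healthy_checks": healthy,
        "uptime_pct": uptime_pct,
        "sla_met": uptime_pct >= sla_target_pct,
        "latency_p50_ms": pct(50),
        "latency_p95_ms": pct(95),
    }
```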
Step 4: Export Dashboard for ATO Evidence
Generate a self-contained HTML dashboard suitable for inclusion in an ATO evidence package or for display on a continuous monitoring operations screen:
// Tool call: export_dashboard
{
"output_path": "./reports/mcp-health-dashboard-2026-03-24.html",
"include_history_days": 30,
"include_sla_analysis": true,
"include_version_drift_log": true
}
// Response
{
"exported_to": "./reports/mcp-health-dashboard-2026-03-24.html",
"file_size_kb": 142,
"servers_included": 5,
"period_covered": "2026-02-22 to 2026-03-24",
"sla_violations": 0,
"version_drift_events": 0
}
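"Self-contained" here means the HTML embeds everything it needs and references no external assets, so the file can be attached to an evidence package as-is. A toy sketch of that approach (export_dashboard_html is a hypothetical name and omits the real dashboard's history charts and SLA analysis):

```python
import html
from pathlib import Path


def export_dashboard_html(statuses, output_path):
    """Write a minimal single-file status table.

    `statuses` maps server name -> {"status": ..., "latency_ms": ...}.
    """
    rows = "".join(
        f"<tr><td>{html.escape(name)}</td>"
        f"<td>{html.escape(s['status'])}</td>"
        f"<td>{s['latency_ms'] if s['latency_ms'] is not None else '-'}</td></tr>"
        for name, s in sorted(statuses.items())
    )
    page = (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<title>MCP Server Health</title></head><body>"
        "<table><tr><th>Server</th><th>Status</th><th>Latency (ms)</th></tr>"
        f"{rows}</table></body></html>"
    )
    Path(output_path).write_text(page)
    return {"exported_to": str(output_path), "servers_included": len(statuses)}
```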
Key Tools Reference
| Tool Name | Purpose | Key Parameters |
|---|---|---|
| health_check_all | Run a health check sweep across all configured servers simultaneously | timeout_ms |
| get_server_status | Get current health status and latest metrics for a specific server | server_name |
| list_degraded | Return a list of all servers currently in degraded or offline state | include_offline, include_degraded |
| get_history | Retrieve health check history with uptime % and latency percentiles | server_name, days, include_percentiles |
| configure_server | Register a new MCP server for monitoring with SLA targets | server_name, command, args, check_interval_seconds, sla_uptime_pct |
| remove_server | Deregister a server from monitoring and archive its history | server_name, archive_history |
| check_updates | Compare installed package versions against the npm registry for all monitored servers | server_name (optional — omit to check all) |
| export_dashboard | Generate a self-contained HTML health dashboard for ATO evidence | output_path, include_history_days, include_sla_analysis |
Workflow Diagram
The following diagram shows the health monitoring loop and the alert/escalation paths triggered by each server status outcome:
Federal Compliance Considerations
FISMA Continuous Monitoring (CA-7)
NIST SP 800-53 control CA-7 requires organizations to develop and implement a continuous monitoring strategy that includes establishing configuration management processes, monitoring security controls, and reporting the security status of the information system. MCP Server Health Monitor directly satisfies the "monitoring security controls" component for AI tool components by providing ongoing availability and integrity checks with persistent evidence records.
ConMon Automation for AI Components
The tool's scheduled health check capability (configurable via
check_interval_seconds) enables fully automated
continuous monitoring without human intervention between check
cycles. In a mature federal ConMon program, the exported health
data can be ingested into the agency's continuous monitoring
platform (XACTA, eMASS, or Archer) via the JSON export, closing
the loop between AI infrastructure monitoring and the formal
risk management system.
ATO Evidence Collection
The export_dashboard HTML output is designed for
direct inclusion in ATO evidence packages. It includes: server
inventory (all monitored components), 30-day availability
history, SLA compliance analysis, version drift events
(important for configuration management), and incident timeline.
Assessors reviewing an ATO package can evaluate this artifact
against IA-5 (authenticator management), CM-7 (least
functionality), and SA-9 (external information system services)
controls for AI tool components.
Zero External Network Dependencies
Health checks operate entirely within the local environment. The
monitor starts each server process locally, calls
list_tools via the local stdio transport, and
records the result to the local SQLite database. The only
optional external call is check_updates, which
queries the npm registry to identify available version updates —
and this call can be disabled or proxied through the agency's
existing artifact management system (Nexus, Artifactory) for
environments without direct internet access.
FAQs
Why does the health check use list_tools rather than a dedicated ping endpoint?
The MCP protocol does not define a standard health check or ping
mechanism. An HTTP ping tells you only that the process is
listening — it cannot detect a server where the tool registry
has failed to load, where a required dependency has crashed, or
where the server is accepting connections but timing out on all
tool invocations. By calling list_tools — the most
fundamental MCP operation — the health check verifies actual
functional readiness, not just process liveness. This is
analogous to a database health check that runs a test query
rather than just connecting to the TCP port.
What triggers a "degraded" status versus an "offline" status?
"Offline" means the server process did not start or the connection timed out before the health check completed. "Degraded" means the server responded but something is wrong: it returned zero tools (indicating a broken registry), or its latency is above the configured p95 threshold — for example, if list_tools takes 2,000ms when the baseline is 50ms, the server is functionally available but performing poorly. Degraded status triggers an alert but does not count as an SLA outage unless it persists beyond the configured degradation window.
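The degradation-window rule can be made concrete with a small sketch. Assumptions: statuses arrive as a chronological list of strings, and the window is expressed in consecutive checks (the article does not specify its units); sla_affecting_checks is a hypothetical name, not the tool's API.

```python
def sla_affecting_checks(samples, window_checks=3):
    """Count checks that count against the SLA.

    "offline" always counts; "degraded" counts only once it has
    persisted for at least `window_checks` consecutive checks.
    """
    counted = 0
    run = 0  # length of the current consecutive "degraded" run
    for status in samples + [None]:  # sentinel flushes the final run
        if status == "degraded":
            run += 1
            continue
        if run >= window_checks:  # degraded persisted past the window
            counted += run
        run = 0
        if status == "offline":
            counted += 1
    return counted
```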
How does version drift detection work, and why does it matter for security?
At each health check, the monitor records the exact set of tool
names returned by list_tools. If the set changes
between consecutive checks — a tool appears, disappears, or is
renamed — the monitor flags a version drift event. For security,
this matters because unauthorized tool additions could indicate
a supply chain compromise or a rogue package update. Version
drift detection provides a lightweight form of software
inventory integrity monitoring aligned with CM-8 (information
system component inventory) requirements.
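Drift detection of this kind reduces to a set comparison between consecutive checks. A minimal sketch (detect_drift is a hypothetical name, not the tool's API):

```python
def detect_drift(previous_tools, current_tools):
    """Compare tool-name sets from consecutive checks.

    Returns None when the sets match, else a drift event listing the
    added and removed names (a rename shows up as one of each).
    """
    prev, curr = set(previous_tools), set(current_tools)
    if prev == curr:
        return None
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}
```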
Can MCP Server Health Monitor monitor itself?
Technically yes — you can add
mcp-server-health-monitor to its own monitoring
list and it will include itself in health check sweeps. However,
this creates a bootstrapping dependency: if the health monitor
itself is offline, it cannot report its own offline status. In
production federal deployments, we recommend running a secondary
lightweight watchdog (e.g., a simple cron job that checks for
the process ID) alongside the health monitor to cover this edge
case.
References
- MCP Official Registry — search for mcp-server-health-monitor
- GitHub: dbsectrainer/mcp-server-health-monitor
- NIST SP 800-137A: Assessing Information Security Continuous Monitoring Programs
- NIST SP 800-53 Rev. 5 — Security and Privacy Controls (CA-7 Continuous Monitoring)
- CISA Continuous Diagnostics and Mitigation (CDM) Program