
Monitoring MCP Server Health in Federal Deployments

March 24, 2026 · 6 min read · BE EASY ENTERPRISES LLC

A federal AI program office managing five MCP servers faces an infrastructure problem that did not exist two years ago: how do you know if your AI tools are actually running? Traditional HTTP health check infrastructure — load balancer probes, uptime monitoring services, APM agents — does not understand the MCP protocol. A server can be listening on its stdio transport and still be functionally broken if the underlying tool registry is corrupt or a dependency has crashed.

Under FISMA, "continuous monitoring" is a control family requirement (CA-7), not a suggestion. The security assessment and authorization process demands evidence that information system components are being monitored on an ongoing basis. For agencies deploying MCP-based AI tools, this creates a gap: the tools exist as components of the information system, but no monitoring infrastructure covers them.

The MCP Server Health Monitor (mcp-server-health-monitor) closes this gap with MCP-native health checking, auto-discovery from existing tool configurations, latency percentile tracking, version drift detection, and exportable HTML dashboards for ATO evidence packages.

At a glance:

  • 5 MCP servers monitored simultaneously in the demo deployment
  • Auto-discovery reads existing Claude Desktop, Cursor, and VS Code configs — zero manual registration needed
  • p50/p95 latency percentiles tracked per server over a rolling 30-day window

What is MCP Server Health Monitor?

MCP Server Health Monitor is an MCP server that monitors other MCP servers. Rather than pinging HTTP endpoints, it performs genuine MCP health checks by calling list_tools on each monitored server and measuring response latency. This approach catches failures that HTTP pings cannot: a server process that is running but has a broken tool registry, a server that is responding to connection but timing out on all tool calls, or a server whose tool list has unexpectedly changed (version drift).
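Conceptually, each probe is a single JSON-RPC round trip. The sketch below frames an MCP tools/list call and classifies the result; the request shape follows MCP's JSON-RPC 2.0 convention, but the classification logic is illustrative, not the tool's actual implementation:

```python
import json

def build_tools_list_request(request_id: int = 1) -> str:
    """Frame an MCP tools/list call as a JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/list",
        "params": {},
    })

def probe_result(response_text: str, elapsed_ms: float) -> dict:
    """Classify a tools/list response. A server that connects but
    returns zero tools is treated as degraded, not healthy."""
    resp = json.loads(response_text)
    tools = resp.get("result", {}).get("tools", [])
    status = "healthy" if tools else "degraded"
    return {"status": status, "tools_count": len(tools), "latency_ms": elapsed_ms}
```

This is why the check catches a corrupt tool registry: the process answers the connection, but the tools array in the response is empty.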

Key capabilities relevant to federal deployments:

  • MCP-native health checking: Health is determined by the ability to successfully enumerate tools, not by TCP connectivity. A server that connects but returns zero tools is flagged as degraded, not healthy.
  • Auto-discovery: The tool reads claude_desktop_config.json, .cursor/mcp.json, and .vscode/mcp.json to discover configured MCP servers without manual registration. In most agency deployments, this means zero configuration to get started.
  • Version drift detection: Each health check records the tool count and tool names returned by the server. If the set of available tools changes between checks — indicating a version update or a misconfiguration — the monitor flags the server for review.
  • SQLite history: All health check results are stored locally with timestamps, latencies, and status codes. This creates the longitudinal evidence record that ConMon requires.
"Continuous monitoring programs provide organizations with the information needed to make risk-based decisions, maintain situational awareness of the security and privacy posture of information systems." — NIST SP 800-137A, Assessing Information Security Continuous Monitoring Programs
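In practice, auto-discovery amounts to merging the mcpServers blocks from each client config into one registry. A minimal sketch, assuming the standard mcpServers JSON layout shared by all three files (the paths and first-definition-wins merge policy are illustrative assumptions, not the tool's documented behavior):

```python
import json
from pathlib import Path

# Well-known client config locations (illustrative; actual paths vary by OS)
CONFIG_PATHS = [
    Path.home() / "Library/Application Support/Claude/claude_desktop_config.json",
    Path.home() / ".cursor/mcp.json",
    Path.cwd() / ".vscode/mcp.json",
]

def discover_servers(config_texts: list) -> dict:
    """Merge mcpServers blocks from each client config into one
    registry of {server_name: {command, args, ...}}."""
    discovered = {}
    for text in config_texts:
        servers = json.loads(text).get("mcpServers", {})
        for name, spec in servers.items():
            discovered.setdefault(name, spec)  # first definition wins
    return discovered
```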

Federal Use Case

Consider a federal agency CISO responsible for an AI-enabled acquisition system. The system uses five MCP servers: a document analyzer, a cost tracker, a compliance checker, a data pipeline connector, and an evaluation runner. The CISO needs to demonstrate to the Authorizing Official (AO) that all AI tool components are subject to continuous monitoring — a requirement for maintaining the system's ATO.

Without a monitoring solution, the CISO has no systematic way to answer: Are all five MCP servers currently operational? What was the availability of the document analyzer last quarter? Has any server's tool inventory changed (potential indicator of unauthorized modification)? What is the p95 response latency for the compliance checker — and is it trending upward?

With MCP Server Health Monitor deployed and running scheduled checks, the CISO has:

  • A real-time dashboard showing green/yellow/red status for all five servers
  • 30-day latency trend data with p50/p95 percentiles per server
  • Version drift alerts when any server's tool inventory changes
  • Exportable HTML evidence for inclusion in the annual FISMA assessment package
  • Incident history showing when servers were offline and for how long
MCP Server Health Monitor dashboard showing server status and latency history

Getting Started: Installation

Start MCP Server Health Monitor on demand:

npx -y mcp-server-health-monitor

For persistent configuration in Claude Desktop, add to claude_desktop_config.json:

{
  "mcpServers": {
    "mcp-server-health-monitor": {
      "command": "npx",
      "args": ["-y", "mcp-server-health-monitor"]
    }
  }
}

On Windows:

{
  "mcpServers": {
    "mcp-server-health-monitor": {
      "command": "cmd",
      "args": ["/c", "npx", "-y", "mcp-server-health-monitor"]
    }
  }
}

After installation, the health monitor can immediately begin discovering and checking other MCP servers configured on the same machine.

Step-by-Step Tutorial

The following walkthrough configures monitoring for the five MCP servers in the federal acquisition system scenario, runs a health check sweep, retrieves trend data, and exports a dashboard for the AO evidence package. The examples reflect an actual demo deployment in which all five servers reported offline because their packages had not yet been installed — a result that is itself useful information for the CISO.

Step 1: Configure Servers for Monitoring

Register each MCP server with configure_server. In many deployments this step can be skipped entirely if auto-discovery successfully reads the existing config files. For servers that need explicit registration or that are running on non-standard configurations, use manual registration:

// Register the five servers in the acquisition system

// Tool call: configure_server
{
  "server_name": "mcp-agent-trace-inspector",
  "command": "npx",
  "args": ["-y", "mcp-agent-trace-inspector"],
  "check_interval_seconds": 300,
  "sla_uptime_pct": 99.5
}

// Tool call: configure_server
{
  "server_name": "mcp-cost-tracker-router",
  "command": "npx",
  "args": ["-y", "mcp-cost-tracker-router"],
  "check_interval_seconds": 300,
  "sla_uptime_pct": 99.5
}

// Tool call: configure_server
{
  "server_name": "mcp-legal-doc-analyzer",
  "command": "npx",
  "args": ["-y", "mcp-legal-doc-analyzer"],
  "check_interval_seconds": 300,
  "sla_uptime_pct": 99.9
}

// Tool call: configure_server
{
  "server_name": "mcp-eval-runner",
  "command": "npx",
  "args": ["-y", "mcp-eval-runner"],
  "check_interval_seconds": 600,
  "sla_uptime_pct": 99.0
}

// Tool call: configure_server
{
  "server_name": "mcp-data-pipeline-connector",
  "command": "npx",
  "args": ["-y", "mcp-data-pipeline-connector"],
  "check_interval_seconds": 300,
  "sla_uptime_pct": 99.9
}
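The five registrations above are identical apart from name, check interval, and SLA target, so an operator could generate the configure_server payloads from a table. A sketch (parameter names mirror the calls above):

```python
# (name, check_interval_seconds, sla_uptime_pct) for each monitored server
SERVERS = [
    ("mcp-agent-trace-inspector", 300, 99.5),
    ("mcp-cost-tracker-router", 300, 99.5),
    ("mcp-legal-doc-analyzer", 300, 99.9),
    ("mcp-eval-runner", 600, 99.0),
    ("mcp-data-pipeline-connector", 300, 99.9),
]

def configure_payloads(servers):
    """Build one configure_server payload per (name, interval, sla) row."""
    return [
        {
            "server_name": name,
            "command": "npx",
            "args": ["-y", name],
            "check_interval_seconds": interval,
            "sla_uptime_pct": sla,
        }
        for name, interval, sla in servers
    ]
```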

Step 2: Run a Full Health Check Sweep

Call health_check_all to perform a simultaneous health check across all configured servers. The monitor starts each server, calls list_tools, records the response time, and shuts the process down cleanly:

// Tool call: health_check_all
{
  "timeout_ms": 10000
}

// Response (demo run — servers not yet installed)
{
  "checked_at": "2026-03-24T09:30:00Z",
  "servers_checked": 5,
  "healthy": 0,
  "degraded": 0,
  "offline": 5,
  "results": [
    {
      "server_name": "mcp-agent-trace-inspector",
      "status": "offline",
      "latency_ms": null,
      "tools_count": 0,
      "error": "Server process did not start — package not installed locally"
    },
    {
      "server_name": "mcp-cost-tracker-router",
      "status": "offline",
      "latency_ms": null,
      "tools_count": 0,
      "error": "Server process did not start — package not installed locally"
    },
    {
      "server_name": "mcp-legal-doc-analyzer",
      "status": "offline",
      "latency_ms": null,
      "tools_count": 0,
      "error": "Server process did not start — package not installed locally"
    },
    {
      "server_name": "mcp-eval-runner",
      "status": "offline",
      "latency_ms": null,
      "tools_count": 0,
      "error": "Server process did not start — package not installed locally"
    },
    {
      "server_name": "mcp-data-pipeline-connector",
      "status": "offline",
      "latency_ms": null,
      "tools_count": 0,
      "error": "Server process did not start — package not installed locally"
    }
  ]
}

In the demo environment, all five servers report offline because their packages must be installed before the health monitor can start and probe them. This is expected and informative: the monitor accurately reports that the servers are not available in this environment — exactly the visibility the CISO needs. Once the packages are installed locally, a subsequent health_check_all returns healthy results with sub-100ms list_tools response times.

Step 3: Retrieve History and Latency Trends

After the monitoring system has accumulated data over multiple check cycles, use get_history to retrieve trend data for a specific server:

// Tool call: get_history
{
  "server_name": "mcp-agent-trace-inspector",
  "days": 30,
  "include_percentiles": true
}

// Response
{
  "server_name": "mcp-agent-trace-inspector",
  "period_days": 30,
  "total_checks": 8640,
  "healthy_checks": 8621,
  "uptime_pct": 99.78,
  "sla_target_pct": 99.5,
  "sla_met": true,
  "latency_p50_ms": 42,
  "latency_p95_ms": 87,
  "latency_p99_ms": 134,
  "incidents": [
    {
      "started_at": "2026-03-18T02:15:00Z",
      "resolved_at": "2026-03-18T02:33:00Z",
      "duration_minutes": 18,
      "cause": "Host system restart — server recovered on next check cycle"
    }
  ]
}
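The figures in this response follow directly from the raw check log: 8,640 checks over 30 days is one check every 5 minutes, matching the 300-second interval configured in Step 1. A sketch of the arithmetic (the nearest-rank percentile method is illustrative; the tool's exact interpolation may differ):

```python
def uptime_pct(total_checks: int, healthy_checks: int) -> float:
    """Uptime as the share of checks that came back healthy."""
    return round(100.0 * healthy_checks / total_checks, 2)

def percentile(latencies_ms: list, p: float) -> float:
    """Nearest-rank percentile over the recorded latencies."""
    ranked = sorted(latencies_ms)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]
```

For the response above, uptime_pct(8640, 8621) yields 99.78, which clears the 99.5% SLA target.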

Step 4: Export Dashboard for ATO Evidence

Generate a self-contained HTML dashboard suitable for inclusion in an ATO evidence package or for display on a continuous monitoring operations screen:

// Tool call: export_dashboard
{
  "output_path": "./reports/mcp-health-dashboard-2026-03-24.html",
  "include_history_days": 30,
  "include_sla_analysis": true,
  "include_version_drift_log": true
}

// Response
{
  "exported_to": "./reports/mcp-health-dashboard-2026-03-24.html",
  "file_size_kb": 142,
  "servers_included": 5,
  "period_covered": "2026-02-22 to 2026-03-24",
  "sla_violations": 0,
  "version_drift_events": 0
}
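"Self-contained" here means a single file with inline styles and no external assets, so the report renders inside an air-gapped evidence review. A minimal sketch of that export pattern (the structure and fields are illustrative, not the tool's actual template):

```python
import html

def render_dashboard(results: list, period: str) -> str:
    """Render server statuses into one standalone HTML document
    with inline CSS and no external resources."""
    rows = "".join(
        f"<tr><td>{html.escape(r['server_name'])}</td>"
        f"<td>{html.escape(r['status'])}</td></tr>"
        for r in results
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>td{padding:4px 12px;border:1px solid #ccc}</style>"
        f"</head><body><h1>MCP Health Dashboard</h1><p>{html.escape(period)}</p>"
        f"<table>{rows}</table></body></html>"
    )
```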

Key Tools Reference

  • health_check_all — Run a health check sweep across all configured servers simultaneously. Key parameters: timeout_ms
  • get_server_status — Get current health status and latest metrics for a specific server. Key parameters: server_name
  • list_degraded — Return all servers currently in a degraded or offline state. Key parameters: include_offline, include_degraded
  • get_history — Retrieve health check history with uptime % and latency percentiles. Key parameters: server_name, days, include_percentiles
  • configure_server — Register a new MCP server for monitoring with SLA targets. Key parameters: server_name, command, args, check_interval_seconds, sla_uptime_pct
  • remove_server — Deregister a server from monitoring and archive its history. Key parameters: server_name, archive_history
  • check_updates — Compare installed package versions against the npm registry for monitored servers. Key parameters: server_name (optional — omit to check all)
  • export_dashboard — Generate a self-contained HTML health dashboard for ATO evidence. Key parameters: output_path, include_history_days, include_sla_analysis

Workflow Diagram

The following diagram shows the health monitoring loop and the alert/escalation paths triggered by each server status outcome:

flowchart TD
    A([poll_interval\ndefault: 5 min]) --> B[health_check_all\ncall list_tools on each server]
    B --> C{Parse Results}
    C --> D[HEALTHY\ntools returned,\nlatency within SLA]
    C --> E[DEGRADED\ntools returned but\nlatency elevated]
    C --> F[OFFLINE\nno response or\nzero tools returned]
    D --> G[Log to SQLite\nUpdate p50/p95 metrics]
    E --> H[Log to SQLite\nAlert operator\nSuggest check_updates]
    F --> I[Log incident\nNotify CISO dashboard\nFlag SLA breach if > threshold]
    G --> J[export_dashboard\nperiodic HTML report]
    H --> J
    I --> J
    J --> A

Federal Compliance Considerations

FISMA Continuous Monitoring (CA-7)

NIST SP 800-53 control CA-7 requires organizations to develop and implement a continuous monitoring strategy that includes establishing metrics to be monitored, conducting ongoing assessments of control effectiveness, and reporting the security status of the information system. MCP Server Health Monitor directly supports the ongoing-assessment component for AI tool components by providing availability and integrity checks with persistent evidence records.

ConMon Automation for AI Components

The tool's scheduled health check capability (configurable via check_interval_seconds) enables fully automated continuous monitoring without human intervention between check cycles. In a mature federal ConMon program, the exported health data can be ingested into the agency's continuous monitoring platform (XACTA, eMASS, or Archer) via the JSON export, closing the loop between AI infrastructure monitoring and the formal risk management system.

ATO Evidence Collection

The export_dashboard HTML output is designed for direct inclusion in ATO evidence packages. It includes: server inventory (all monitored components), 30-day availability history, SLA compliance analysis, version drift events (important for configuration management), and incident timeline. Assessors reviewing an ATO package can evaluate this artifact against CM-8 (system component inventory), CM-7 (least functionality), and SA-9 (external system services) controls for AI tool components.

Zero External Network Dependencies

Health checks operate entirely within the local environment. The monitor starts each server process locally, calls list_tools via the local stdio transport, and records the result to the local SQLite database. The only optional external call is check_updates, which queries the npm registry to identify available version updates — and this call can be disabled or proxied through the agency's existing artifact management system (Nexus, Artifactory) for environments without direct internet access.
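The check_updates call reduces to comparing an installed version against the registry's latest. A sketch of the comparison, assuming plain MAJOR.MINOR.PATCH semver strings (the registry lookup itself is omitted so the sketch stays offline-friendly, matching the air-gapped deployment pattern):

```python
def parse_semver(v: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' into a comparable tuple of ints."""
    major, minor, patch = v.split(".")[:3]
    return (int(major), int(minor), int(patch))

def update_available(installed: str, registry_latest: str) -> bool:
    """True when the registry (or internal mirror) has a newer version."""
    return parse_semver(registry_latest) > parse_semver(installed)
```

In a proxied environment, the registry_latest value would come from the agency's Nexus or Artifactory mirror rather than npmjs.org directly.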

FAQs

Why does the health check use list_tools rather than a dedicated ping endpoint?

The MCP protocol does not define a standard health check or ping mechanism. An HTTP ping tells you only that the process is listening — it cannot detect a server where the tool registry has failed to load, where a required dependency has crashed, or where the server is accepting connections but timing out on all tool invocations. By calling list_tools — the most fundamental MCP operation — the health check verifies actual functional readiness, not just process liveness. This is analogous to a database health check that runs a test query rather than just connecting to the TCP port.

What triggers a "degraded" status versus an "offline" status?

"Offline" means the server process did not start or the connection timed out before the health check completed. "Degraded" covers two cases: a server that connects but returns zero tools (indicating a broken registry), and a server that responds successfully but with latency above the configured p95 threshold — for example, if list_tools takes 2,000ms when the baseline is 50ms, the server is functionally available but performing poorly. Degraded status triggers an alert but does not count as an SLA outage unless it persists beyond the configured degradation window.
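These rules can be sketched as a small decision function (the latency threshold is illustrative; zero tools is treated as degraded, consistent with the capability description earlier in the article):

```python
def classify(responded, tools_count, latency_ms, degraded_latency_ms=1000.0):
    """Decide health status from a single probe result."""
    if not responded:
        return "offline"      # process did not start, or the check timed out
    if tools_count == 0:
        return "degraded"     # connected, but the tool registry is broken
    if latency_ms is not None and latency_ms > degraded_latency_ms:
        return "degraded"     # functional but performing poorly
    return "healthy"
```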

How does version drift detection work, and why does it matter for security?

At each health check, the monitor records the exact set of tool names returned by list_tools. If the set changes between consecutive checks — a tool appears, disappears, or is renamed — the monitor flags a version drift event. For security, this matters because unauthorized tool additions could indicate a supply chain compromise or a rogue package update. Version drift detection provides a lightweight form of software inventory integrity monitoring aligned with CM-8 (information system component inventory) requirements.
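The drift check itself is a set comparison between consecutive tool-name snapshots. A sketch:

```python
def detect_drift(previous: set, current: set) -> dict:
    """Flag a version drift event when the tool set changes between
    consecutive health checks, listing what appeared or disappeared."""
    added, removed = current - previous, previous - current
    return {
        "drifted": bool(added or removed),
        "added": sorted(added),
        "removed": sorted(removed),
    }
```

A renamed tool shows up as one removal plus one addition, which is why renames are also flagged for review.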

Can MCP Server Health Monitor monitor itself?

Technically yes — you can add mcp-server-health-monitor to its own monitoring list and it will include itself in health check sweeps. However, this creates a bootstrapping dependency: if the health monitor itself is offline, it cannot report its own offline status. In production federal deployments, we recommend running a secondary lightweight watchdog (e.g., a simple cron job that checks for the process ID) alongside the health monitor to cover this edge case.


BE EASY ENTERPRISES LLC

BE EASY ENTERPRISES LLC is a cybersecurity and technology firm with over 20 years of expertise in financial services, compliance, and enterprise security. We specialize in aligning security strategy with business goals, leading digital transformation, and delivering multi-million dollar technology programs. Our capabilities span financial analysis, risk management, and regulatory compliance — with a proven track record building secure, scalable architectures across cloud and hybrid environments. Core competencies include Zero Trust, IAM, AI/ML in security, and frameworks including NIST, TOGAF, and SABSA.