Federal data lives in silos. A contracting officer might have FY2026 contract awards in a CSV export from the financial system, vendor business data in a PostgreSQL database maintained by the acquisitions team, and active registration status accessible only through the SAM.gov REST API. Correlating these three sources to answer a single question — "Which active vendors received awards over $250K this fiscal year?" — traditionally means hours of manual joins in spreadsheets or waiting for an IT ticket to build a custom report.
MCP Data Pipeline Connector eliminates that bottleneck. It exposes a unified SQL interface that lets AI agents query CSV files, relational databases, and REST APIs as if they were tables in the same database — all through the Model Context Protocol. The underlying engine is DuckDB, an embedded analytical database that runs in-process with no separate service to stand up.
At a glance:
- Source types supported: 3 (CSV, PostgreSQL, REST API)
- Query latency in demo: 12ms (cross-source join, 5-row result)
- Security: read-only mode enforced, so no accidental writes to production
What is MCP Data Pipeline Connector?
MCP Data Pipeline Connector is an MCP server that wraps DuckDB's multi-source query capability in a set of well-typed tools an AI agent can call. Key design decisions that matter for federal deployments:
- Embedded DuckDB engine: No separate database service to provision, patch, or secure. The engine runs inside the MCP server process and terminates with it.
- Cross-source SQL joins: Once sources are registered, DuckDB presents them as virtual tables. A single `SELECT` can join a CSV table with a PostgreSQL table and a REST API endpoint using standard ANSI SQL syntax.
- Auto schema detection: CSV headers and PostgreSQL column metadata are introspected automatically on connection. REST API responses are normalized to columnar format based on the first response batch.
- REST API caching: API responses are cached in-memory for the session, preventing redundant calls to rate-limited government APIs like SAM.gov.
- Read-only mode: Passing `--read-only` at startup prevents any INSERT, UPDATE, DELETE, or DROP statement from executing, even if an agent or user attempts one.
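The session-scoped REST cache described above can be illustrated with a short sketch. This is hypothetical Python, not the connector's implementation; the class and method names are invented:

```python
import time

class SessionCache:
    """In-memory cache with per-entry TTL; it disappears when the process exits."""

    def __init__(self):
        self._entries = {}  # url -> (expires_at, payload)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:
            del self._entries[url]  # stale entry: force a fresh fetch
            return None
        return payload

    def put(self, url, payload, ttl_seconds=300):
        self._entries[url] = (time.monotonic() + ttl_seconds, payload)

# Within a session, repeat calls to the same endpoint hit the cache
# instead of the rate-limited API.
cache = SessionCache()
cache.put("https://api.sam.gov/entity-information/v3/entities", {"entityData": []})
```

Because entries expire by TTL rather than being refreshed eagerly, a rate-limited API like SAM.gov is contacted at most once per endpoint per TTL window.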
The package is available as `mcp-data-pipeline-connector` on npm (v1.0.0), and the source is published at github.com/dbsectrainer/mcp-data-pipeline-connector.
Federal Use Case
Consider this scenario: A Contracting Officer at a civilian agency needs to prepare a quarterly briefing on FY2026 contract awards. Three data systems are involved:
- A CSV export from the agency's financial system (`contracts_fy2026.csv`) with columns for contract number, award date, award value, and vendor CAGE code.
- An internal PostgreSQL vendor database maintained by the acquisitions office, with business size, socioeconomic certifications, and contact information.
- The SAM.gov Entity Management API, which provides real-time active/inactive registration status.
Previously this analysis required manual VLOOKUP work across three exported spreadsheets. With MCP Data Pipeline Connector, an AI agent can register all three sources once and answer the full question — filtered, sorted, and formatted — in a single natural-language request that generates the SQL automatically.
Getting Started: Installation
The fastest way to start is via `npx`, which requires no global install:

```shell
npx -y mcp-data-pipeline-connector
```
For production use with a persistent source configuration, provide a JSON config file and enable read-only mode:

```shell
npx -y mcp-data-pipeline-connector -- --config data-sources.json --read-only
```
To register this as a persistent MCP server in your Claude Desktop or Claude Code configuration, add it to `.mcp.json`:

```json
{
  "mcpServers": {
    "data-pipeline": {
      "command": "npx",
      "args": ["-y", "mcp-data-pipeline-connector", "--", "--config", "./data-sources.json", "--read-only"]
    }
  }
}
```
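The tutorial below registers sources one at a time, but the `--config data-sources.json` flag suggests they can also be declared up front. The schema of that file isn't documented in this article, so the shape below is an assumption that mirrors the `connect_source` parameters; check the repository README for the authoritative format:

```json
{
  "sources": [
    {
      "source_type": "csv",
      "name": "contracts_fy2026",
      "file_path": "/data/contracts_fy2026.csv"
    },
    {
      "source_type": "postgres",
      "name": "agency_vendors",
      "connection_string": "postgresql://readonly_user:****@db.agency.gov:5432/acquisitions"
    }
  ]
}
```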
Step-by-Step Tutorial
The following walkthrough registers the three sources described in the federal use case and executes a cross-source SQL query. All tool calls use the JSON input format an AI agent would generate.
Step 1: Connect the CSV Source
Call `connect_source` with `source_type: "csv"`. The connector reads the file, infers column types, and registers the CSV as a virtual table named `contracts_fy2026`.
```json
{
  "source_type": "csv",
  "name": "contracts_fy2026",
  "file_path": "/data/contracts_fy2026.csv"
}
```
Expected response:

```json
{"status": "connected", "table": "contracts_fy2026", "columns": 8, "rows_sampled": 500}
```
Step 2: Connect the PostgreSQL Vendor Database
Call `connect_source` with `source_type: "postgres"`. The connector uses DuckDB's native PostgreSQL scanner extension, which does not require a local PostgreSQL client.
```json
{
  "source_type": "postgres",
  "name": "agency_vendors",
  "connection_string": "postgresql://readonly_user:****@db.agency.gov:5432/acquisitions"
}
```
Expected response:

```json
{"status": "connected", "schema": "agency_vendors", "tables_available": ["vendors", "certifications", "contacts"]}
```
Step 3: Connect the SAM.gov REST API
Call `connect_source` with `source_type: "rest"`. Provide the SAM.gov Entity Management API endpoint. The connector fetches the first page, normalizes the response to a columnar schema, and caches subsequent calls per session.
```json
{
  "source_type": "rest",
  "name": "sam_gov_entities",
  "url": "https://api.sam.gov/entity-information/v3/entities",
  "headers": {"X-Api-Key": "${SAM_GOV_API_KEY}"},
  "response_path": "entityData",
  "cache_ttl_seconds": 300
}
```
Expected response:

```json
{"status": "connected", "table": "sam_gov_entities", "columns_inferred": ["ueiSAM", "cageCode", "registrationStatus", "expirationDate"]}
```
Step 4: Verify All Sources Are Registered
Call `list_sources` to confirm all three sources are active before running the join query.
```
// Tool: list_sources
// Input: {}
// Response:
{
  "sources": [
    {"name": "contracts_fy2026", "type": "csv", "status": "connected"},
    {"name": "agency_vendors", "type": "postgres", "status": "connected"},
    {"name": "sam_gov_entities", "type": "rest", "status": "connected"}
  ]
}
```
Step 5: Execute a Cross-Source SQL Query
Call `query` with the full cross-source SQL statement. DuckDB executes the join across all three sources in-process.
```sql
SELECT
  c.contract_number,
  c.award_date,
  c.value_usd,
  v.vendor_name,
  v.cage_code,
  s.registrationStatus AS registration_status
FROM contracts_fy2026 AS c
JOIN agency_vendors.vendors AS v
  ON c.cage_code = v.cage_code
JOIN sam_gov_entities AS s
  ON v.cage_code = s.cageCode
WHERE c.value_usd > 250000
  AND s.registrationStatus = 'Active'
ORDER BY c.value_usd DESC
LIMIT 5;
```
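An agent submits that statement through the `query` tool. Assuming the `sql` and `limit` parameters listed in the tools reference, the call might look like this (the exact request envelope is illustrative):

```json
{
  "sql": "SELECT c.contract_number, c.value_usd, v.vendor_name, s.registrationStatus FROM contracts_fy2026 AS c JOIN agency_vendors.vendors AS v ON c.cage_code = v.cage_code JOIN sam_gov_entities AS s ON v.cage_code = s.cageCode WHERE c.value_usd > 250000 AND s.registrationStatus = 'Active' ORDER BY c.value_usd DESC",
  "limit": 5
}
```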
Query results (12ms execution time):
| contract_number | award_date | value_usd | vendor_name | cage_code | registration_status |
|---|---|---|---|---|---|
| GS-35F-0042A | 2026-01-15 | $4,250,000 | Apex Systems LLC | 7X4B2 | Active |
| GS-35F-0089B | 2026-02-03 | $2,875,000 | ClearPath Federal | 3K9M1 | Active |
| GS-35F-0117C | 2026-02-28 | $1,600,000 | DataBridge Corp | 6T1N8 | Active |
| GS-35F-0203D | 2026-03-10 | $890,000 | NovaTech Solutions | 2P5Q3 | Active |
| GS-35F-0251E | 2026-03-19 | $415,000 | FedLink Services | 8R7V6 | Active |
Key Tools Reference
| Tool | Description | Key Parameters |
|---|---|---|
| `connect_source` | Register a new data source (CSV, PostgreSQL, or REST API) | `source_type`, `name`, `file_path` / `connection_string` / `url` |
| `list_sources` | List all registered sources and their connection status | none |
| `list_tables` | List all virtual tables available for querying | `source_name` (optional filter) |
| `get_schema` | Return column names and inferred types for a table | `table_name` |
| `query` | Execute a SQL SELECT statement across registered sources | `sql`, `limit` (optional) |
| `transform` | Apply a transformation pipeline (filter, aggregate, pivot) | `source_table`, `operations` |
| `check_health` | Verify connectivity to all registered sources | none |
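The `transform` tool is the only one not demonstrated in the tutorial. Its exact operation schema isn't shown in this article, so treat the shape below as a hypothetical input that filters the contracts table and then aggregates award totals by vendor:

```json
{
  "source_table": "contracts_fy2026",
  "operations": [
    {"type": "filter", "condition": "value_usd > 250000"},
    {"type": "aggregate", "group_by": ["cage_code"], "metrics": {"total_awarded": "SUM(value_usd)"}}
  ]
}
```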
Architecture Diagram

(Diagram: an AI agent calls the MCP server over the Model Context Protocol; the server's embedded DuckDB engine reads from the registered CSV file, PostgreSQL database, and REST API sources.)
Federal Compliance Considerations
Several design choices make MCP Data Pipeline Connector appropriate for federal environments that handle sensitive acquisition data:
- Read-only mode prevents accidental writes: Enabling `--read-only` at startup causes DuckDB to reject any DML statement (INSERT, UPDATE, DELETE) or DDL statement (CREATE, DROP). This is a hard constraint enforced by the engine, not a software check that clever prompt injection can bypass.
- No external data transmission: All joins and aggregations occur inside the DuckDB process on the local machine or server. Query results are returned only to the calling AI agent; they are never sent to a third-party service.
- Connection strings stored in config, not logs: The `--config` file approach keeps database credentials and API keys out of command-line history and process lists. The connector redacts credentials from all log output.
- CUI data handling when deployed locally: Because the connector runs on-premises with no cloud dependency, it can be used in environments that handle Controlled Unclassified Information (CUI) under NIST SP 800-171 controls, provided the host system meets the applicable requirements.
- Session-scoped connections: Sources registered in one session are not persisted to disk by default. Each server restart requires re-registration, which limits the blast radius of a misconfigured source.
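The credential-redaction behavior can be approximated like this (a hypothetical Python sketch, not the connector's actual logging code):

```python
import re

# Masks the password in URL-style connection strings and common API-key
# values before a line reaches any log sink.
SECRET_PATTERNS = [
    # user:password@host -> user:****@host
    (re.compile(r"(://[^:/\s]+:)[^@\s]+(@)"), r"\1****\2"),
    # api-key / token style assignments -> ****
    (re.compile(r"((?:api[-_]?key|token)\"?\s*[:=]\s*\"?)[\w-]+", re.I), r"\1****"),
]

def redact(line: str) -> str:
    for pattern, replacement in SECRET_PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(redact("connecting to postgresql://readonly_user:s3cret@db.agency.gov:5432/acquisitions"))
```

Running every log line through a filter like this keeps secrets out of log aggregation systems even when a connection string is interpolated into an error message.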
"The read-only constraint isn't advisory — it's enforced at the DuckDB layer. Even if an AI agent constructs a DELETE statement, the engine rejects it before execution."
FAQs
Is it safe to point this at a production PostgreSQL database?
Yes, with the right precautions. Create a dedicated read-only PostgreSQL role with SELECT privileges only on the relevant schemas and tables. Combine that with the connector's `--read-only` flag for defense in depth. Never use a superuser or application write account as the connector credential.
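A minimal role setup along those lines might look like this; the role, schema, and database names are placeholders:

```sql
-- Dedicated read-only credential for the connector (names are placeholders)
CREATE ROLE connector_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE acquisitions TO connector_ro;
GRANT USAGE ON SCHEMA public TO connector_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO connector_ro;
-- Cover tables created after the grant as well
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO connector_ro;
```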
Which databases are supported beyond PostgreSQL?
The current v1.0.0 release supports CSV, PostgreSQL, and REST APIs. DuckDB natively supports MySQL, SQLite, and Parquet files as well — community contributions to expose those source types through the MCP interface are tracked in the GitHub issue tracker.
How does performance scale with large CSV files?
DuckDB's columnar execution engine handles CSV files up to several gigabytes efficiently on a modern laptop. For the demo scenario (a few thousand contract rows), query times are in the single-digit millisecond range. For CSV files exceeding 1 million rows, consider converting to Parquet format first, which DuckDB reads approximately 10x faster due to column pruning and predicate pushdown.
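The conversion can be done with DuckDB itself. A plausible one-time command, assuming the file paths from the tutorial:

```sql
-- One-time conversion; Parquet preserves the inferred column types
COPY (SELECT * FROM read_csv_auto('/data/contracts_fy2026.csv'))
  TO '/data/contracts_fy2026.parquet' (FORMAT PARQUET);
```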
Does the connector handle REST API pagination?
Yes. The `connect_source` tool accepts a `pagination` configuration block where you specify the next-page token field and offset strategy. The connector fetches all pages up to a configurable `max_pages` limit and caches the full result set for the session duration.
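The fetch loop can be sketched as follows: hypothetical Python, with a stubbed page fetcher standing in for the real HTTP client (none of these names come from the connector's API):

```python
def fetch_all_pages(fetch_page, max_pages=10):
    """Follow next-page tokens until exhausted or max_pages is reached.

    `fetch_page` is any callable taking a page token (None for the first
    page) and returning (records, next_token).
    """
    records, token, pages = [], None, 0
    while pages < max_pages:
        batch, token = fetch_page(token)
        records.extend(batch)
        pages += 1
        if token is None:  # the API advertised no further pages
            break
    return records

# Stubbed three-page API: first page yields token "p2", then "p3", then done.
PAGES = {None: ([1, 2], "p2"), "p2": ([3], "p3"), "p3": ([4], None)}
print(fetch_all_pages(PAGES.get))
```

The `max_pages` cap bounds both memory use and the number of calls made against a rate-limited endpoint, which is why a hard limit is preferable to an unconditional "fetch everything" loop.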