Federal data lives in silos. A contracting officer might have FY2026 contract awards in a CSV export from the financial system, vendor business data in a PostgreSQL database maintained by the acquisitions team, and active registration status accessible only through the SAM.gov REST API. Correlating these three sources to answer a single question — "Which active vendors received awards over $250K this fiscal year?" — traditionally means hours of manual joins in spreadsheets or waiting for an IT ticket to build a custom report.
MCP Data Pipeline Connector eliminates that bottleneck. It exposes a unified SQL interface that lets AI agents query CSV files, relational databases, and REST APIs as if they were tables in the same database — all through the Model Context Protocol. The underlying engine is DuckDB, an embedded analytical database that runs in-process with no separate service to stand up.
At a glance:
- Source types supported: 3 (CSV, PostgreSQL, REST API)
- Query latency in demo: 12ms (cross-source join, 5-row result)
- Security: read-only mode enforced, so no accidental writes to production
What is MCP Data Pipeline Connector?
MCP Data Pipeline Connector is an MCP server that wraps DuckDB's multi-source query capability in a set of well-typed tools an AI agent can call. Key design decisions that matter for federal deployments:
- Embedded DuckDB engine: No separate database service to provision, patch, or secure. The engine runs inside the MCP server process and terminates with it.
- Cross-source SQL joins: Once sources are registered, DuckDB presents them as virtual tables. A single `SELECT` can join a CSV table with a PostgreSQL table and a REST API endpoint using standard ANSI SQL syntax.
- Auto schema detection: CSV headers and PostgreSQL column metadata are introspected automatically on connection. REST API responses are normalized to columnar format based on the first response batch.
- REST API caching: API responses are cached in-memory for the session, preventing redundant calls to rate-limited government APIs like SAM.gov.
- Read-only mode: Passing `--read-only` at startup prevents any INSERT, UPDATE, DELETE, or DROP statement from executing, even if an agent or user attempts one.
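The session-scoped REST cache described above can be illustrated with a short sketch. This is hypothetical Python, not the connector's implementation; the class and method names are invented:

```python
import time

class SessionCache:
    """In-memory cache with per-entry TTL; it disappears when the process exits."""

    def __init__(self):
        self._entries = {}  # url -> (expires_at, payload)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:
            del self._entries[url]  # stale entry: force a fresh fetch
            return None
        return payload

    def put(self, url, payload, ttl_seconds=300):
        self._entries[url] = (time.monotonic() + ttl_seconds, payload)

# Within a session, repeat calls to the same endpoint hit the cache
# instead of the rate-limited API.
cache = SessionCache()
cache.put("https://api.sam.gov/entity-information/v3/entities", {"entityData": []})
```

Because entries expire by TTL rather than being refreshed eagerly, a rate-limited API like SAM.gov is contacted at most once per endpoint per TTL window.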
The package is available as `mcp-data-pipeline-connector` on npm (v1.0.0), and the source is published at github.com/dbsectrainer/mcp-data-pipeline-connector.
Federal Use Case
Consider this scenario: A Contracting Officer at a civilian agency needs to prepare a quarterly briefing on FY2026 contract awards. Three data systems are involved:
- A CSV export from the agency's financial system (`contracts_fy2026.csv`) with columns for contract number, award date, award value, and vendor CAGE code.
- An internal PostgreSQL vendor database maintained by the acquisitions office, with business size, socioeconomic certifications, and contact information.
- The SAM.gov Entity Management API, which provides real-time active/inactive registration status.
Previously this analysis required manual VLOOKUP work across three exported spreadsheets. With MCP Data Pipeline Connector, an AI agent can register all three sources once and answer the full question — filtered, sorted, and formatted — in a single natural-language request that generates the SQL automatically.
Getting Started: Installation
The fastest way to start is via `npx`, which requires no global install:

```shell
npx -y mcp-data-pipeline-connector
```
For production use with a persistent source configuration, provide a JSON config file and enable read-only mode:

```shell
npx -y mcp-data-pipeline-connector -- --config data-sources.json --read-only
```
To register this as a persistent MCP server in your Claude Desktop or Claude Code configuration, add it to `.mcp.json`:

```json
{
  "mcpServers": {
    "data-pipeline": {
      "command": "npx",
      "args": ["-y", "mcp-data-pipeline-connector", "--", "--config", "./data-sources.json", "--read-only"]
    }
  }
}
```
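The tutorial below registers sources one at a time, but the `--config data-sources.json` flag suggests they can also be declared up front. The schema of that file isn't documented in this article, so the shape below is an assumption that mirrors the `connect_source` parameters; check the repository README for the authoritative format:

```json
{
  "sources": [
    {
      "source_type": "csv",
      "name": "contracts_fy2026",
      "file_path": "/data/contracts_fy2026.csv"
    },
    {
      "source_type": "postgres",
      "name": "agency_vendors",
      "connection_string": "postgresql://readonly_user:****@db.agency.gov:5432/acquisitions"
    }
  ]
}
```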
Step-by-Step Tutorial
The following walkthrough registers the three sources described in the federal use case and executes a cross-source SQL query. All tool calls use the JSON input format an AI agent would generate.
Step 1: Connect the CSV Source
Call `connect_source` with `source_type: "csv"`. The connector reads the file, infers column types, and registers the CSV as a virtual table named `contracts_fy2026`.
```json
{
  "source_type": "csv",
  "name": "contracts_fy2026",
  "file_path": "/data/contracts_fy2026.csv"
}
```
Expected response:

```json
{"status": "connected", "table": "contracts_fy2026", "columns": 8, "rows_sampled": 500}
```
Step 2: Connect the PostgreSQL Vendor Database
Call `connect_source` with `source_type: "postgres"`. The connector uses DuckDB's native PostgreSQL scanner extension, which does not require a local PostgreSQL client.
```json
{
  "source_type": "postgres",
  "name": "agency_vendors",
  "connection_string": "postgresql://readonly_user:****@db.agency.gov:5432/acquisitions"
}
```
Expected response:

```json
{"status": "connected", "schema": "agency_vendors", "tables_available": ["vendors", "certifications", "contacts"]}
```
Step 3: Connect the SAM.gov REST API
Call `connect_source` with `source_type: "rest"`. Provide the SAM.gov Entity Management API endpoint. The connector fetches the first page, normalizes the response to a columnar schema, and caches subsequent calls per session.
```json
{
  "source_type": "rest",
  "name": "sam_gov_entities",
  "url": "https://api.sam.gov/entity-information/v3/entities",
  "headers": {"X-Api-Key": "${SAM_GOV_API_KEY}"},
  "response_path": "entityData",
  "cache_ttl_seconds": 300
}
```
Expected response:

```json
{"status": "connected", "table": "sam_gov_entities", "columns_inferred": ["ueiSAM", "cageCode", "registrationStatus", "expirationDate"]}
```
Step 4: Verify All Sources Are Registered
Call `list_sources` to confirm all three sources are active before running the join query.
```
// Tool: list_sources
// Input: {}
// Response:
{
  "sources": [
    {"name": "contracts_fy2026", "type": "csv", "status": "connected"},
    {"name": "agency_vendors", "type": "postgres", "status": "connected"},
    {"name": "sam_gov_entities", "type": "rest", "status": "connected"}
  ]
}
```
Step 5: Execute a Cross-Source SQL Query
Call `query` with the full cross-source SQL statement. DuckDB executes the join across all three sources in-process.
```sql
SELECT
  c.contract_number,
  c.award_date,
  c.value_usd,
  v.vendor_name,
  v.cage_code,
  s.registrationStatus AS registration_status
FROM contracts_fy2026 AS c
JOIN agency_vendors.vendors AS v
  ON c.cage_code = v.cage_code
JOIN sam_gov_entities AS s
  ON v.cage_code = s.cageCode
WHERE c.value_usd > 250000
  AND s.registrationStatus = 'Active'
ORDER BY c.value_usd DESC
LIMIT 5;
```
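An agent submits that statement through the `query` tool. Assuming the `sql` and `limit` parameters listed in the tools reference, the call might look like this (the exact request envelope is illustrative):

```json
{
  "sql": "SELECT c.contract_number, c.value_usd, v.vendor_name, s.registrationStatus FROM contracts_fy2026 AS c JOIN agency_vendors.vendors AS v ON c.cage_code = v.cage_code JOIN sam_gov_entities AS s ON v.cage_code = s.cageCode WHERE c.value_usd > 250000 AND s.registrationStatus = 'Active' ORDER BY c.value_usd DESC",
  "limit": 5
}
```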
Query results (12ms execution time):
| contract_number | award_date | value_usd | vendor_name | cage_code | registration_status |
|---|---|---|---|---|---|
| GS-35F-0042A | 2026-01-15 | $4,250,000 | Apex Systems LLC | 7X4B2 | Active |
| GS-35F-0089B | 2026-02-03 | $2,875,000 | ClearPath Federal | 3K9M1 | Active |
| GS-35F-0117C | 2026-02-28 | $1,600,000 | DataBridge Corp | 6T1N8 | Active |
| GS-35F-0203D | 2026-03-10 | $890,000 | NovaTech Solutions | 2P5Q3 | Active |
| GS-35F-0251E | 2026-03-19 | $415,000 | FedLink Services | 8R7V6 | Active |
Key Tools Reference
| Tool | Description | Key Parameters |
|---|---|---|
| `connect_source` | Register a new data source (CSV, PostgreSQL, or REST API) | `source_type`, `name`, `file_path` / `connection_string` / `url` |
| `list_sources` | List all registered sources and their connection status | none |
| `list_tables` | List all virtual tables available for querying | `source_name` (optional filter) |
| `get_schema` | Return column names and inferred types for a table | `table_name` |
| `query` | Execute a SQL SELECT statement across registered sources | `sql`, `limit` (optional) |
| `transform` | Apply a transformation pipeline (filter, aggregate, pivot) | `source_table`, `operations` |
| `check_health` | Verify connectivity to all registered sources | none |
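The `transform` tool is the only one not demonstrated in the tutorial. Its exact operation schema isn't shown in this article, so treat the shape below as a hypothetical input that filters the contracts table and then aggregates award totals by vendor:

```json
{
  "source_table": "contracts_fy2026",
  "operations": [
    {"type": "filter", "condition": "value_usd > 250000"},
    {"type": "aggregate", "group_by": ["cage_code"], "metrics": {"total_awarded": "SUM(value_usd)"}}
  ]
}
```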
Architecture Diagram

(Diagram: an AI agent calls the MCP server over the Model Context Protocol; the server's embedded DuckDB engine reads from the registered CSV file, PostgreSQL database, and REST API sources.)
Federal Compliance Considerations
Several design choices make MCP Data Pipeline Connector appropriate for federal environments that handle sensitive acquisition data:
- Read-only mode prevents accidental writes: Enabling `--read-only` at startup causes DuckDB to reject any DML statement (INSERT, UPDATE, DELETE) or DDL statement (CREATE, DROP). This is a hard constraint enforced by the engine, not a software check that clever prompt injection can bypass.
- No external data transmission: All joins and aggregations occur inside the DuckDB process on the local machine or server. Query results are returned only to the calling AI agent; they are never sent to a third-party service.
- Connection strings stored in config, not logs: The `--config` file approach keeps database credentials and API keys out of command-line history and process lists. The connector redacts credentials from all log output.
- CUI data handling when deployed locally: Because the connector runs on-premises with no cloud dependency, it can be used in environments that handle Controlled Unclassified Information (CUI) under NIST SP 800-171 controls, provided the host system meets the applicable requirements.
- Session-scoped connections: Sources registered in one session are not persisted to disk by default. Each server restart requires re-registration, which limits the blast radius of a misconfigured source.
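The credential-redaction behavior can be approximated like this (a hypothetical Python sketch, not the connector's actual logging code):

```python
import re

# Masks the password in URL-style connection strings and common API-key
# values before a line reaches any log sink.
SECRET_PATTERNS = [
    # user:password@host -> user:****@host
    (re.compile(r"(://[^:/\s]+:)[^@\s]+(@)"), r"\1****\2"),
    # api-key / token style assignments -> ****
    (re.compile(r"((?:api[-_]?key|token)\"?\s*[:=]\s*\"?)[\w-]+", re.I), r"\1****"),
]

def redact(line: str) -> str:
    for pattern, replacement in SECRET_PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(redact("connecting to postgresql://readonly_user:s3cret@db.agency.gov:5432/acquisitions"))
```

Running every log line through a filter like this keeps secrets out of log aggregation systems even when a connection string is interpolated into an error message.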
"The read-only constraint isn't advisory — it's enforced at the DuckDB layer. Even if an AI agent constructs a DELETE statement, the engine rejects it before execution."
FAQs
Is it safe to point this at a production PostgreSQL database?
Yes, with the right precautions. Create a dedicated read-only PostgreSQL role with SELECT privileges only on the relevant schemas and tables. Combine that with the connector's `--read-only` flag for defense in depth. Never use a superuser or application write account as the connector credential.
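A minimal role setup along those lines might look like this; the role, schema, and database names are placeholders:

```sql
-- Dedicated read-only credential for the connector (names are placeholders)
CREATE ROLE connector_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE acquisitions TO connector_ro;
GRANT USAGE ON SCHEMA public TO connector_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO connector_ro;
-- Cover tables created after the grant as well
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO connector_ro;
```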
Which databases are supported beyond PostgreSQL?
The current v1.0.0 release supports CSV, PostgreSQL, and REST APIs. DuckDB natively supports MySQL, SQLite, and Parquet files as well — community contributions to expose those source types through the MCP interface are tracked in the GitHub issue tracker.
How does performance scale with large CSV files?
DuckDB's columnar execution engine handles CSV files up to several gigabytes efficiently on a modern laptop. For the demo scenario (a few thousand contract rows), query times are in the single-digit millisecond range. For CSV files exceeding 1 million rows, consider converting to Parquet format first, which DuckDB reads approximately 10x faster due to column pruning and predicate pushdown.
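The conversion can be done with DuckDB itself. A plausible one-time command, assuming the file paths from the tutorial:

```sql
-- One-time conversion; Parquet preserves the inferred column types
COPY (SELECT * FROM read_csv_auto('/data/contracts_fy2026.csv'))
  TO '/data/contracts_fy2026.parquet' (FORMAT PARQUET);
```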
Does the connector handle REST API pagination?
Yes. The `connect_source` tool accepts a `pagination` configuration block where you specify the next-page token field and offset strategy. The connector fetches all pages up to a configurable `max_pages` limit and caches the full result set for the session duration.
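The fetch loop can be sketched as follows: hypothetical Python, with a stubbed page fetcher standing in for the real HTTP client (none of these names come from the connector's API):

```python
def fetch_all_pages(fetch_page, max_pages=10):
    """Follow next-page tokens until exhausted or max_pages is reached.

    `fetch_page` is any callable taking a page token (None for the first
    page) and returning (records, next_token).
    """
    records, token, pages = [], None, 0
    while pages < max_pages:
        batch, token = fetch_page(token)
        records.extend(batch)
        pages += 1
        if token is None:  # the API advertised no further pages
            break
    return records

# Stubbed three-page API: first page yields token "p2", then "p3", then done.
PAGES = {None: ([1, 2], "p2"), "p2": ([3], "p3"), "p3": ([4], None)}
print(fetch_all_pages(PAGES.get))
```

The `max_pages` cap bounds both memory use and the number of calls made against a rate-limited endpoint, which is why a hard limit is preferable to an unconditional "fetch everything" loop.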