
DuckDB – NHS Quickstart

🦆 In-process SQL · Parquet/CSV · Zero server · Fast prototypes & validation
Why DuckDB in the NHS

DuckDB gives you fast, in-process SQL over files (CSV/Parquet) with no server to run. It's ideal for quick audits, reproducible examples, schema checks before data lands in SQL Server, and lightweight dashboards.

Great for: BI Analyst · Data Scientist · Clinician-Researcher.


⚙️ 10-minute install

python -m venv .venv && . .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install duckdb pandas pyarrow

pyarrow speeds up Parquet/Arrow IO.
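
To confirm the install worked, a quick smoke test (the file name smoke_test.py is illustrative):

smoke_test.py
import duckdb

# Should print [(42,)]: DuckDB runs entirely in-process, no server needed.
print(duckdb.sql("SELECT 42 AS answer").fetchall())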


🚀 “Hello NHS” – three ways

hello_duck.py
import duckdb, pandas as pd

df = pd.DataFrame({
    "practice_id": ["A", "B", "C"],
    "appointments": [120, 140, 128],
})

# Register dataframe then run SQL
duckdb.register("t", df)
out = duckdb.sql("SELECT practice_id, appointments FROM t ORDER BY appointments DESC").df()
print(out)
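
The snippet above covers way one (register, then query). As a hedged sketch of two further common patterns, not necessarily the two this page originally intended: querying the DataFrame directly via DuckDB's replacement scan, and persisting to a .duckdb file (the names hello_duck_more.py, hello_nhs.duckdb, and practices are illustrative).

hello_duck_more.py
import duckdb
import pandas as pd

df = pd.DataFrame({
    "practice_id": ["A", "B", "C"],
    "appointments": [120, 140, 128],
})

# Way 2: query the DataFrame directly. DuckDB's replacement scan resolves
# the local variable name "df" without an explicit register() call.
print(duckdb.sql("SELECT * FROM df ORDER BY appointments DESC").df())

# Way 3: persist to a database file so tables survive between sessions.
con = duckdb.connect("hello_nhs.duckdb")
con.execute("CREATE OR REPLACE TABLE practices AS SELECT * FROM df")
print(con.sql("SELECT COUNT(*) AS n FROM practices").fetchone())
con.close()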

🔄 From SQL Server → Parquet → DuckDB (safe prototyping)

Extract a de-identified slice once, then iterate locally at speed.

extract_sqlserver_to_parquet.py
import os, urllib.parse, pandas as pd
from sqlalchemy import create_engine  # also requires: pip install sqlalchemy pyodbc

params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={os.getenv('SQLSERVER_SERVER')};"
    f"DATABASE={os.getenv('SQLSERVER_DATABASE')};"
    "Trusted_Connection=Yes;Encrypt=Yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

sql = """
SELECT practice_id,
CAST(CONVERT(char(7), start_time, 126) AS varchar(7)) AS month,
COUNT(*) AS appointments
FROM dbo.appointments
WHERE start_time >= DATEADD(MONTH,-3,SYSUTCDATETIME())
GROUP BY practice_id, CONVERT(char(7), start_time, 126);
"""
df = pd.read_sql(sql, engine)
os.makedirs("data", exist_ok=True)
df.to_parquet("data/kpi_3m.parquet", index=False)
print("Wrote data/kpi_3m.parquet", df.shape)

Then analyse with DuckDB:

analyse_in_duckdb.py
import duckdb
q = duckdb.sql("""
    SELECT month, practice_id, appointments,
           -- NULLIF avoids a divide-by-zero if a month has no appointments
           appointments / NULLIF(SUM(appointments) OVER (PARTITION BY month), 0) AS share
    FROM read_parquet('data/kpi_3m.parquet')
    ORDER BY month, practice_id
""").df()
print(q.head())

✅ Lightweight data checks

Use DuckDB SQL to catch problems early.

checks.sql
-- 1) Rowcount by month (look for gaps)
SELECT month, COUNT(*) AS n
FROM read_parquet('data/kpi_3m.parquet') GROUP BY month ORDER BY month;

-- 2) Nulls where not expected
SELECT COUNT(*) AS null_practice
FROM read_parquet('data/kpi_3m.parquet') WHERE practice_id IS NULL;

-- 3) Simple range checks
SELECT COUNT(*) AS neg_appointments
FROM read_parquet('data/kpi_3m.parquet') WHERE appointments < 0;

You can save each check as a separate .sql file and run it via the DuckDB CLI in CI.
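
Alternatively, a small Python runner can execute the checks and fail the build when any of them finds problem rows; a minimal sketch assuming the checks above (the file name run_checks.py is illustrative):

run_checks.py
import sys
import duckdb

PARQUET = "data/kpi_3m.parquet"

CHECKS = {
    "null_practice": f"SELECT COUNT(*) FROM read_parquet('{PARQUET}') WHERE practice_id IS NULL",
    "neg_appointments": f"SELECT COUNT(*) FROM read_parquet('{PARQUET}') WHERE appointments < 0",
}

failures = []
for name, query in CHECKS.items():
    (count,) = duckdb.sql(query).fetchone()
    if count:
        failures.append(f"{name}: {count} offending rows")

if failures:
    print("Data checks failed:\n  " + "\n  ".join(failures))
    sys.exit(1)  # non-zero exit fails the CI job
print("All data checks passed.")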


🧳 Save results for dashboards

publish_for_dash.py
import duckdb, os
os.makedirs("out", exist_ok=True)
duckdb.sql("""
COPY (
SELECT month, practice_id, appointments
FROM read_parquet('data/kpi_3m.parquet')
ORDER BY month, practice_id
) TO 'out/kpi_clean.parquet' (FORMAT PARQUET);
""" )

Point Dash/Evidence.dev at out/kpi_clean.parquet for fast load times.
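
As an illustration, a minimal Dash page over the published file might look like this (assumes pip install dash plotly; the file name app.py and the chart choice are illustrative, not a prescribed layout):

app.py
import duckdb
import plotly.express as px
from dash import Dash, dcc, html

# Load the published extract into a DataFrame for plotting.
kpi = duckdb.sql("SELECT * FROM read_parquet('out/kpi_clean.parquet')").df()
fig = px.line(kpi, x="month", y="appointments", color="practice_id")

app = Dash(__name__)
app.layout = html.Div([html.H2("Appointments by practice"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run(debug=True)  # recent Dash versions; older ones use app.run_server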


🛠 Tips & patterns

  • Prefer Parquet for Trust exports (columnar + compressed).
  • Use read_parquet('folder/*.parquet') for partitioned data (see the sketch after this list).
  • Keep a small synthetic sample in the repo for demos/tests.
  • Version your analysis SQL files; run them in CI for quick validation.
  • For SQL-heavy workflows, consider dbt-duckdb locally, then port final models to SQL Server views.
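
To illustrate the partitioned-data tip, a minimal sketch (the data/exports/*.parquet path is illustrative):

read_partitioned.py
import duckdb

# Glob across a folder of Parquet files and query it as a single table.
monthly = duckdb.sql("""
    SELECT month, SUM(appointments) AS total_appointments
    FROM read_parquet('data/exports/*.parquet')
    GROUP BY month
    ORDER BY month
""").df()
print(monthly)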

🛑 IG & safety checklist

  • Use de-identified/synthetic extracts for local work.
  • Keep secrets out of notebooks/scripts; use env vars or secret stores.
  • Avoid exporting free-text; apply small-number suppression before sharing (see the sketch after this list).
  • Document data sources, owners, and update cadence in the repo README.
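
As a hedged illustration of small-number suppression (the threshold of 5 and the output file name are illustrative; follow your organisation's IG policy):

suppress_small_numbers.py
import duckdb

# Replace counts below the disclosure threshold with NULL before sharing.
duckdb.sql("""
    COPY (
        SELECT month, practice_id,
               CASE WHEN appointments < 5 THEN NULL ELSE appointments END AS appointments
        FROM read_parquet('out/kpi_clean.parquet')
    ) TO 'out/kpi_suppressed.parquet' (FORMAT PARQUET);
""")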

📏 Measuring impact

  • Velocity: time from data drop → first chart/insight.
  • Quality: validation checks passing in CI.
  • Reusability: number of analyses that run from a clean clone.
  • Cost: zero server spend for prototypes; minimal storage via Parquet.

🔗 See also

SQL Server · Evidence.dev · Dash · Python · GitHub

What's next?

You've completed the Learn – DuckDB stage. Keep momentum: