Data Scientist
The 10‑Year Health Plan pushes Hospital → Community, Analogue → Digital, and Sickness → Prevention. Data scientists turn these shifts into practice with predictive models, risk stratification, and actionable dashboards/APIs. This path emphasises clean data → trustworthy models → simple deployment.
👤 Role snapshot
You work with messy, linked NHS data (clinical, operational, public) to answer complex questions and forecast outcomes. Typical flow: ingest → feature → model → evaluate → explain → publish.
🎯 Outcomes to target (aligned to the Plan)
- Model quality: AUROC/PR‑AUC, Brier score, calibration slope/intercept
- Fairness: subgroup performance parity (sex, age band, deprivation quintile)
- Velocity: notebook → decision in days, not weeks
- Reproducibility: one‑command rerun; environment pinned; data contract documented
- Adoption: decisions/actions taken; clinical feedback loops
⚙️ 90‑minute quickstart
Goal: train a tiny baseline model on a de‑identified extract, visualise performance, and publish either a Dash explorer or a FastAPI endpoint.
1) Load data from SQL Server (≈15 min)
# load_from_sql.py
# pip install pandas sqlalchemy pyodbc python-dotenv pyarrow
import os, urllib.parse, pandas as pd
from sqlalchemy import create_engine, text
from dotenv import load_dotenv

load_dotenv()  # read SQLSERVER_* settings from .env

# Build a trusted-connection ODBC string from environment variables (no secrets in code)
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={os.getenv('SQLSERVER_SERVER')};"
    f"DATABASE={os.getenv('SQLSERVER_DATABASE')};"
    "Trusted_Connection=Yes;Encrypt=Yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

# Small, recent, de-identified sample for prototyping
sql = text("""
    SELECT TOP 5000 patient_age, sex, imd_quintile, prev_dna_count, lead_time_days,
           outcome_readmit_30d AS y
    FROM dbo.ds_readmit_sample
    WHERE encounter_start >= DATEADD(DAY, -365, SYSUTCDATETIME())
""")

df = pd.read_sql(sql, engine)
os.makedirs("work", exist_ok=True)
df.to_parquet("work/readmit_sample.parquet", index=False)
print(df.head())
Example .env (keep this file out of version control):
SQLSERVER_SERVER=YOURSERVER
SQLSERVER_DATABASE=NHS_Analytics
Tip: For ad‑hoc prototyping, copy a strict, de‑identified slice to Parquet and work locally with DuckDB.
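A minimal sketch of that local workflow, assuming the Parquet extract from step 1 (the aggregation below is just an illustration):

# pip install duckdb
import duckdb
# Query the de-identified Parquet extract in place; no database server needed locally.
readmit_by_imd = duckdb.sql("""
    SELECT imd_quintile, COUNT(*) AS n, AVG(y) AS readmit_rate
    FROM 'work/readmit_sample.parquet'
    GROUP BY imd_quintile
    ORDER BY imd_quintile
""").df()
print(readmit_by_imd)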
2) Baseline model & evaluation (≈30 min)
# train_baseline.py
# pip install pandas scikit-learn plotly joblib pyarrow
import joblib, pandas as pd, plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

df = pd.read_parquet("work/readmit_sample.parquet")
y = df["y"].astype(int)
X = df.drop(columns=["y"])

# Pass numeric columns through; one-hot encode everything else
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in X.columns if c not in num_cols]
pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([("prep", pre), ("lr", LogisticRegression(max_iter=200))])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf.fit(Xtr, ytr)
probs = clf.predict_proba(Xte)[:, 1]

roc = roc_auc_score(yte, probs)
prec, rec, _ = precision_recall_curve(yte, probs)
prauc = auc(rec, prec)
print({"roc_auc": round(roc, 3), "pr_auc": round(prauc, 3)})

# Persist the fitted pipeline and the validation-set predictions for later steps
joblib.dump(clf, "work/model.joblib")
pd.DataFrame({"y": yte.to_numpy(), "p": probs}).to_parquet("work/preds.parquet", index=False)

# Quick look at the score distribution with Plotly
fig = px.histogram(pd.DataFrame({"p": probs}), x="p", nbins=25,
                   title=f"Score distribution (AUROC={roc:.3f})")
fig.write_html("work/score_hist.html", include_plotlyjs="cdn")
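The baseline reports AUROC and PR‑AUC; the outcomes above also list Brier score and calibration slope/intercept. A minimal sketch of those checks on the saved validation predictions, using one common approach (refit the outcome on the logit of the predicted probability); this is an illustration, not a validated recipe:

# pip install pandas numpy scikit-learn pyarrow
import numpy as np, pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
preds = pd.read_parquet("work/preds.parquet")
p = preds["p"].clip(1e-6, 1 - 1e-6)                      # avoid infinite logits
logit = np.log(p / (1 - p)).to_numpy().reshape(-1, 1)
brier = brier_score_loss(preds["y"], preds["p"])          # mean squared error of the probabilities
# Calibration slope/intercept: a well-calibrated model has slope ~1 and intercept ~0.
# penalty=None needs scikit-learn >= 1.2; use penalty="none" on older releases.
cal = LogisticRegression(penalty=None).fit(logit, preds["y"])
print({"brier": round(brier, 3),
       "cal_slope": round(float(cal.coef_[0][0]), 3),
       "cal_intercept": round(float(cal.intercept_[0]), 3)})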
3) Publish (pick one) (≈30 min)
Option A: Dash explorer (app.py)
# app.py
# pip install dash plotly pandas pyarrow
import dash
from dash import html, dcc, Input, Output
import pandas as pd, plotly.express as px

df = pd.read_parquet("work/preds.parquet")
app = dash.Dash(__name__)
app.layout = html.Div([
    html.H3("Readmission risk — validation set"),
    dcc.Slider(0.1, 0.9, value=0.5, step=0.05, id="thr"),
    dcc.Graph(id="curve"),
])

@app.callback(Output("curve", "figure"), Input("thr", "value"))
def update(th):
    t = float(th)
    df2 = df.copy()
    df2["pred"] = (df2["p"] >= t).astype(int)
    # Long-format confusion counts for a grouped bar chart
    cm = pd.crosstab(df2["y"], df2["pred"], rownames=["actual"], colnames=["pred"])
    cm = cm.reset_index().melt("actual", var_name="pred", value_name="n")
    cm[["actual", "pred"]] = cm[["actual", "pred"]].astype(str)
    return px.bar(cm, x="pred", y="n", color="actual", barmode="group",
                  title=f"Confusion matrix @ threshold={t:0.2f}")

if __name__ == "__main__":
    app.run(debug=True)  # app.run_server(debug=True) on older Dash releases
Option B: FastAPI endpoint (serve.py)

# serve.py
# pip install fastapi uvicorn joblib pandas scikit-learn pyarrow
from fastapi import FastAPI
import joblib, pandas as pd

app = FastAPI()
model = joblib.load("work/model.joblib")  # pipeline saved by train_baseline.py

@app.post("/predict")
def predict(payload: dict):
    # Keys must match the training feature columns (patient_age, sex, imd_quintile, ...)
    X = pd.DataFrame([payload])
    prob = float(model.predict_proba(X)[:, 1][0])
    return {"readmit_30d": prob}

# Run with: uvicorn serve:app --reload --port 8000
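A quick smoke test of the endpoint from Python; the keys mirror the columns in the quickstart extract, so adjust them to whatever your model was actually trained on:

# pip install requests
import requests
payload = {"patient_age": 74, "sex": "F", "imd_quintile": 2,
           "prev_dna_count": 1, "lead_time_days": 12}
r = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
print(r.json())  # {"readmit_30d": <predicted probability>}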
▶️ Run
# 1) Load data
python load_from_sql.py
# 2) Train
python train_baseline.py
# 3a) Explore in Dash
python app.py
# 3b) OR serve as API
uvicorn serve:app --reload --port 8000
🗓️ Week‑one build (repeatable, safe)
Day 1 — Problem & cohort
- Define outcome, horizon, inclusion/exclusion; create a data dictionary.
Day 2 — Baseline & metrics
- Ship a logistic/linear baseline; track AUROC/PR‑AUC or RMSE; create calibration plot.
Day 3 — Reproducibility
- Pin the environment (requirements.txt); set deterministic seeds; record a dataset hash; save the model and metrics.
Day 4 — Explainability & fairness
- Add a SHAP/coefficient summary; compute subgroup metrics (age, sex, IMD; a minimal sketch follows this plan). Document risks/mitigations.
Day 5 — Publish & monitor
- Choose Dash or FastAPI; add basic logs; plan for drift checks and retraining cadence.
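A minimal sketch of the Day 4 subgroup check. It assumes you persist the validation features alongside the outcome and score (the quickstart's preds.parquet keeps only y and p); the file name below is hypothetical:

# pip install pandas scikit-learn pyarrow
import pandas as pd
from sklearn.metrics import roc_auc_score
# Hypothetical frame, e.g. built in train_baseline.py with: Xte.assign(y=yte, p=probs)
val = pd.read_parquet("work/val_with_features.parquet")
def subgroup_auroc(frame, col):
    rows = []
    for level, grp in frame.groupby(col):
        if grp["y"].nunique() == 2:                     # AUROC needs both classes present
            rows.append({col: level, "n": len(grp),
                         "auroc": round(roc_auc_score(grp["y"], grp["p"]), 3)})
    return pd.DataFrame(rows)
for col in ["sex", "imd_quintile"]:
    print(subgroup_auroc(val, col))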
🛡️ IG & safety checklist
- Use de‑identified/synthetic data in dev; keep PHI out of notebooks.
- Store secrets outside code (env vars/secret store).
- Apply small‑number suppression and aggregation when sharing results (a sketch follows this checklist).
- Record approvals/ethics IDs; document intended use and limits.
- Avoid automation bias: surface uncertainty and calibration to users.
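For the suppression point above, a minimal sketch; the threshold of 5 is illustrative, so follow your organisation's disclosure-control policy:

import pandas as pd
def suppress_small_numbers(counts, count_col="n", threshold=5):
    # Blank out cells below the threshold before sharing aggregate outputs;
    # suppressed cells become NaN (swap in a marker such as "<5" if preferred).
    out = counts.copy()
    out[count_col] = out[count_col].mask(out[count_col] < threshold)
    return out
summary = pd.DataFrame({"imd_quintile": [1, 2, 3, 4, 5], "n": [3, 27, 54, 41, 12]})
print(suppress_small_numbers(summary))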
📏 Measuring impact
- Quality: AUROC/PR‑AUC, calibration slope, decision‑curve net benefit (sketched after this list).
- Fairness: subgroup parity (Δ metrics), qualitative review with clinicians.
- Velocity: time from question → model → decision.
- Adoption: teams using it; actions logged; post‑deployment performance.
- Maintainability: retraining cadence; time to fix when data schema changes.
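Decision-curve net benefit, mentioned above, weighs true positives against false positives at a chosen threshold pt (net benefit = TP/n - FP/n × pt/(1 - pt)). A minimal sketch on the saved validation predictions:

import pandas as pd
preds = pd.read_parquet("work/preds.parquet")
def net_benefit(frame, threshold):
    # Standard decision-curve formula at a single threshold probability
    pred_pos = frame["p"] >= threshold
    tp = ((frame["y"] == 1) & pred_pos).sum()
    fp = ((frame["y"] == 0) & pred_pos).sum()
    n = len(frame)
    return tp / n - fp / n * (threshold / (1 - threshold))
for t in (0.1, 0.2, 0.3):
    print({"threshold": t, "net_benefit": round(net_benefit(preds, t), 4)})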
📚 References & next
See also: Python · SQL Server · DuckDB · Dash · GitHub · Secrets & .env
What’s next?
You’ve completed the Persona — Data Scientist stage. Keep momentum: