
Data Scientist

📈 Python · Pandas · scikit-learn · SQL Server · Dash/FastAPI · Reproducible ML
🎯 Why this matters now

The 10‑Year Health Plan pushes Hospital → Community, Analogue → Digital, and Sickness → Prevention. Data scientists turn these shifts into practice with predictive models, risk stratification, and actionable dashboards/APIs. This path emphasises clean data → trustworthy models → simple deployment.


👤 Role snapshot

You work with messy, linked NHS data (clinical, operational, public) to answer complex questions and forecast outcomes. Typical flow: ingest → feature → model → evaluate → explain → publish.

See also: Python · SQL Server · DuckDB · Dash · VS Code · Git · GitHub · Secrets & .env

🎯 Outcomes to target (aligned to the Plan)

Accuracy & calibration · Fairness · Time‑to‑insight · Reproducibility · Adoption
  • Model quality: AUROC/PR‑AUC, Brier score, calibration slope/intercept
  • Fairness: subgroup performance parity (sex, age band, deprivation quintile)
  • Velocity: notebook → decision in days, not weeks
  • Reproducibility: one‑command rerun; environment pinned; data contract documented
  • Adoption: decisions/actions taken; clinical feedback loops

⚙️ 90‑minute quickstart

Goal: train a tiny baseline model on a de‑identified extract, visualise performance, and publish either a Dash explorer or a FastAPI endpoint.

1) Load data from SQL Server (≈15 min)

load_from_sql.py
# pip install pandas sqlalchemy pyodbc python-dotenv pyarrow
import os, urllib.parse, pandas as pd
from sqlalchemy import create_engine, text
from dotenv import load_dotenv; load_dotenv()

# Build the ODBC connection string from environment variables (see .env below)
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={os.getenv('SQLSERVER_SERVER')};"
    f"DATABASE={os.getenv('SQLSERVER_DATABASE')};"
    "Trusted_Connection=Yes;Encrypt=Yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

# Pull a small, recent, de-identified sample with the outcome labelled y
sql = text("""
SELECT TOP 5000 patient_age, sex, imd_quintile, prev_dna_count, lead_time_days,
       outcome_readmit_30d AS y
FROM dbo.ds_readmit_sample
WHERE encounter_start >= DATEADD(DAY, -365, SYSUTCDATETIME())
""")

df = pd.read_sql(sql, engine)
os.makedirs("work", exist_ok=True)  # ensure the working folder exists before writing
df.to_parquet("work/readmit_sample.parquet", index=False)
print(df.head())
.env (local only — do not commit)
SQLSERVER_SERVER=YOURSERVER
SQLSERVER_DATABASE=NHS_Analytics

Tip: For ad‑hoc prototyping, copy a strict, de‑identified slice to Parquet and work locally with DuckDB.
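If you take that route, here is a minimal sketch of the local DuckDB workflow; the aggregate query is illustrative only.

explore_duckdb.py
# pip install duckdb pandas
import duckdb

# Query the local Parquet extract directly; no database server needed
summary = duckdb.sql("""
    SELECT imd_quintile, COUNT(*) AS n, AVG(y) AS readmit_rate
    FROM 'work/readmit_sample.parquet'
    GROUP BY imd_quintile
    ORDER BY imd_quintile
""").df()
print(summary)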

2) Baseline model & evaluation (≈30 min)

train_baseline.py
# pip install pandas scikit-learn plotly joblib pyarrow
import pandas as pd, plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, auc
import joblib

df = pd.read_parquet("work/readmit_sample.parquet")

y = df["y"].astype(int)
X = df.drop(columns=["y"])

# Numeric columns pass through; categoricals are one-hot encoded
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in X.columns if c not in num_cols]

pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

clf = Pipeline([("prep", pre), ("lr", LogisticRegression(max_iter=200))])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf.fit(Xtr, ytr)
probs = clf.predict_proba(Xte)[:, 1]

roc = roc_auc_score(yte, probs)
prec, rec, _ = precision_recall_curve(yte, probs)
prauc = auc(rec, prec)
print({"roc_auc": round(roc, 3), "pr_auc": round(prauc, 3)})

# Persist the fitted pipeline and the validation-set predictions
joblib.dump(clf, "work/model.joblib")
pd.DataFrame({"y": yte.values, "p": probs}).to_parquet("work/preds.parquet", index=False)

# Quick ROC curve and score distribution with Plotly
fpr, tpr, _ = roc_curve(yte, probs)
fig = px.area(x=fpr, y=tpr, labels={"x": "False positive rate", "y": "True positive rate"},
              title=f"ROC curve (AUROC={roc:.3f})")
fig.write_html("work/roc_curve.html", include_plotlyjs="cdn")

fig = px.histogram(pd.DataFrame({"p": probs}), x="p", nbins=25,
                   title=f"Score distribution (AUROC={roc:.3f})")
fig.write_html("work/score_hist.html", include_plotlyjs="cdn")
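Calibration is one of the outcomes above but isn't covered by the baseline script. A minimal sketch using scikit-learn's calibration_curve on the saved predictions; the file name and bin settings are illustrative.

calibration_check.py
# pip install pandas scikit-learn plotly pyarrow
import pandas as pd, plotly.express as px
from sklearn.calibration import calibration_curve

preds = pd.read_parquet("work/preds.parquet")

# Observed event rate vs mean predicted risk in 10 quantile bins
frac_pos, mean_pred = calibration_curve(preds["y"], preds["p"], n_bins=10, strategy="quantile")
fig = px.line(x=mean_pred, y=frac_pos, markers=True,
              labels={"x": "Mean predicted risk", "y": "Observed event rate"},
              title="Calibration curve (perfect calibration is the diagonal)")
fig.add_shape(type="line", x0=0, y0=0, x1=1, y1=1, line=dict(dash="dash"))
fig.write_html("work/calibration.html", include_plotlyjs="cdn")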

3) Publish (pick one) (≈30 min)

app.py
# pip install dash plotly pandas pyarrow
import dash
from dash import html, dcc, Input, Output
import plotly.express as px, pandas as pd

df = pd.read_parquet("work/preds.parquet")
app = dash.Dash(__name__)

app.layout = html.Div([
    html.H3("Readmission risk — validation set"),
    dcc.Slider(0.1, 0.9, value=0.5, step=0.05, id="thr"),
    dcc.Graph(id="curve"),
])

@app.callback(Output("curve", "figure"), Input("thr", "value"))
def update(th):
    t = float(th)
    df2 = df.copy()
    df2["pred"] = (df2["p"] >= t).astype(int)
    # Confusion-matrix counts at the chosen threshold, reshaped for a grouped bar chart
    cm = pd.crosstab(df2["y"], df2["pred"], rownames=["actual"], colnames=["pred"])
    cm = cm.reset_index().melt("actual", var_name="pred", value_name="n")
    cm["actual"] = cm["actual"].astype(str)  # discrete colours rather than a continuous scale
    return px.bar(cm, x="pred", y="n", color="actual", barmode="group",
                  title=f"Confusion matrix @ threshold={t:0.2f}")

if __name__ == "__main__":
    app.run(debug=True)  # app.run replaces the deprecated run_server in recent Dash releases
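The Run commands below reference the FastAPI option (uvicorn serve:app) without showing it. A minimal sketch follows; the module name matches the Run command, but the endpoint path and request fields are illustrative assumptions based on the training columns.

serve.py
# pip install fastapi uvicorn joblib pandas pydantic
import joblib, pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Readmission risk API")
model = joblib.load("work/model.joblib")  # fitted pipeline from train_baseline.py

class Encounter(BaseModel):
    # Field names mirror the training columns; adjust to your extract
    patient_age: int
    sex: str
    imd_quintile: int
    prev_dna_count: int
    lead_time_days: int

@app.post("/predict")
def predict(enc: Encounter):
    X = pd.DataFrame([enc.model_dump()])  # model_dump assumes pydantic v2
    prob = float(model.predict_proba(X)[:, 1][0])
    return {"readmit_30d_risk": round(prob, 4)}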

▶️ Run

# 1) Load data
python load_from_sql.py

# 2) Train
python train_baseline.py

# 3a) Explore in Dash
python app.py

# 3b) OR serve as API
uvicorn serve:app --reload --port 8000

🗓️ Week‑one build (repeatable, safe)

Day 1 — Problem & cohort

  • Define outcome, horizon, inclusion/exclusion; create a data dictionary.

Day 2 — Baseline & metrics

  • Ship a logistic/linear baseline; track AUROC/PR‑AUC or RMSE; create calibration plot.

Day 3 — Reproducibility

  • Pin env (requirements.txt); deterministic seeds; record dataset hash; save model & metrics.
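A minimal sketch of recording the run alongside the pinned environment; file names and fields are illustrative.

record_run.py
import hashlib, json, subprocess, sys
from datetime import datetime, timezone

# Hash the exact dataset the model was trained on
with open("work/readmit_sample.parquet", "rb") as f:
    data_hash = hashlib.sha256(f.read()).hexdigest()

run_record = {
    "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    "dataset_sha256": data_hash,
    "random_state": 42,  # same seed used in train_baseline.py
    "python": sys.version.split()[0],
    "pip_freeze": subprocess.run([sys.executable, "-m", "pip", "freeze"],
                                 capture_output=True, text=True).stdout.splitlines(),
}

with open("work/run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)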

Day 4 — Explainability & fairness

  • Add SHAP/coefficients summary; subgroup metrics (age, sex, IMD). Document risks/mitigations.
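A minimal sketch of subgroup metrics, assuming you also save the validation features next to the predictions (e.g. Xte.assign(y=yte.values, p=probs) written to a Parquet file); the file name and subgroup columns are illustrative.

subgroup_metrics.py
# pip install pandas scikit-learn pyarrow
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed file: validation features plus y (outcome) and p (predicted risk)
df = pd.read_parquet("work/preds_with_features.parquet")

rows = []
for col in ["sex", "imd_quintile"]:
    for level, grp in df.groupby(col):
        if grp["y"].nunique() == 2:  # AUROC needs both classes present in the subgroup
            rows.append({"subgroup": f"{col}={level}",
                         "n": len(grp),
                         "auroc": round(roc_auc_score(grp["y"], grp["p"]), 3),
                         "event_rate": round(grp["y"].mean(), 3)})

print(pd.DataFrame(rows))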

Day 5 — Publish & monitor

  • Choose Dash or FastAPI; add basic logs; plan for drift checks and retraining cadence.

🛡️ IG & safety checklist

  • Use de‑identified/synthetic data in dev; keep PHI out of notebooks.
  • Store secrets outside code (env vars/secret store).
  • Apply small‑number suppression and aggregation when sharing results (see the sketch after this list).
  • Record approvals/ethics IDs; document intended use and limits.
  • Avoid automation bias: surface uncertainty and calibration to users.
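A minimal sketch of small‑number suppression before sharing aggregate counts; the threshold of 5 is illustrative and your organisation's disclosure‑control rules take precedence.

suppress_small_numbers.py
# pip install pandas pyarrow
import pandas as pd

THRESHOLD = 5  # illustrative; follow local disclosure-control guidance

counts = (pd.read_parquet("work/readmit_sample.parquet")
            .groupby(["imd_quintile", "sex"]).size().reset_index(name="n"))

# Replace small cells with a suppression marker before the table leaves the safe environment
counts["n_published"] = counts["n"].apply(lambda n: n if n >= THRESHOLD else "<5")

print(counts[["imd_quintile", "sex", "n_published"]])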

📏 Measuring impact

  • Quality: AUROC/PR‑AUC, calibration slope, decision‑curve net benefit.
  • Fairness: subgroup parity (Δ metrics), qualitative review with clinicians.
  • Velocity: time from question → model → decision.
  • Adoption: teams using it; actions logged; post‑deployment performance.
  • Maintainability: retraining cadence; time to fix when data schema changes.

📚 References & next

See also: Python · SQL Server · DuckDB · Dash · GitHub · Secrets & .env

What’s next?

You’ve completed the Persona — Data Scientist stage. Keep momentum: