Data Science Tutorial
Clear, practical notes with real examples. Covers concepts, workflow, tools, and portfolio guidance for beginners and working professionals.
1) Introduction
Data Science combines programming, statistics, and domain understanding to answer business questions and build predictive/automated systems. Focus is on measurable impact, not just complex models.
Data vs AI vs ML
- AI: Broad goal—systems that perform tasks needing human intelligence.
- ML: Subset of AI—learning patterns from data.
- Data Science: End-to-end process—collect → clean → analyze → model → deploy → monitor.
2) Data Science Process
Think CRISP-DM style. Keep a clear success metric from day one.
- Business Understanding → Problem statement and KPI (e.g., increase upsell rate 5%).
- Data Understanding → Source list, schema, sample checks, data dictionary.
- Data Preparation → Cleaning, joining, feature creation, train/valid/test split.
- Modeling → Baseline (simple) → better models → hyper-parameters.
- Evaluation → Metric vs KPI; leakage and bias checks.
- Deployment → Real-time API or batch jobs; access control and logging.
- Monitoring → Data drift, performance drop, retraining triggers.
3) Roles in Data (Who does what?)
- Data Analyst – SQL, dashboards (Power BI/Tableau), ad-hoc insights.
- Data Scientist – Modeling, experimentation, feature work, prototypes.
- ML Engineer – Serving models reliably (APIs, CI/CD, infra).
- Data Engineer – Pipelines, warehouses/lakes, quality, orchestration.
Career path: many freshers start as a Data Analyst, then move into Data Science after building projects.
4) Essential Tools (Why they matter)
- Python or R for analysis and ML. Python has wider ecosystem.
- Jupyter / VS Code for notebooks and scripts.
- Git & GitHub for version control and collaboration.
- SQL DB (PostgreSQL/MySQL) for storage and querying.
- BI (Power BI/Tableau) for executive dashboards.
- FastAPI/Flask for serving models; Streamlit/Gradio for quick apps.
- Docker basics for consistent deployments.
5) Key Python Libraries (When to use)
- Pandas – tabular data handling; groupby, merge, pivot.
- NumPy – fast arrays; used under the hood by many libs.
- Matplotlib/Seaborn/Plotly – static vs interactive charts.
- scikit-learn – ML algorithms + pipelines + metrics.
- XGBoost/LightGBM/CatBoost – gradient boosting for tabular data.
- Statsmodels – regression, time-series (ARIMA), statistical tests.
- SciPy – optimization, distances, scientific utils.
6) Statistics Basics
Core Ideas
- Central tendency (mean/median) and spread (variance/STD, IQR).
- Distributions: normal, binomial, Poisson (know shapes/uses).
- Sampling & CLT: as n grows, the sample mean settles near the population mean, and its sampling distribution approaches normal.
- Hypothesis testing: t-test/χ²/ANOVA (decide if difference is real).
- Confidence intervals: plausible range of the true parameter.
- Correlation ≠ causation; watch confounders.
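The hypothesis-testing and confidence-interval ideas above can be sketched in a few lines. This is an illustrative example on synthetic data (the control/variant samples are made up, not from the tutorial):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two synthetic samples, e.g. a control group vs. a variant group
control = rng.normal(loc=100, scale=15, size=200)
variant = rng.normal(loc=105, scale=15, size=200)

# Two-sample t-test: is the difference in means likely real?
t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# 95% confidence interval: plausible range for the variant's true mean
mean = variant.mean()
sem = stats.sem(variant)
lo, hi = stats.t.interval(0.95, df=len(variant) - 1, loc=mean, scale=sem)
print(f"variant mean {mean:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

A small p-value suggests the mean difference is unlikely under chance alone; the CI gives the range, not a single point.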
7) Python Basics (what to master)
- Data types, lists/dicts/sets, slicing, functions, classes.
- File I/O, virtual environments (conda/venv).
- Notebook hygiene: titles, sections, results first.
# Quick EDA skeleton
import pandas as pd
df = pd.read_csv("data.csv")
print(df.shape, df.isna().mean().sort_values(ascending=False).head())
df.describe(include="all")
8) SQL Essentials (with examples)
- Filtering and sorting: SELECT ... FROM ... WHERE ... ORDER BY ...
- Aggregates: SUM, COUNT, AVG with GROUP BY, HAVING.
- JOINs: inner/left/right; use keys and null checks.
- Window functions: rankings and moving metrics.
-- Second highest salary (robust)
SELECT name, salary
FROM (
SELECT name, salary, DENSE_RANK() OVER(ORDER BY salary DESC) rnk
FROM employees
) t
WHERE rnk = 2;
9) Data Cleaning (prevent garbage-in)
- Missing values: drop, impute (median/most-freq), or model-based.
- Outliers: visualize (boxplot); cap/winsorize if needed.
- Types: dates to datetime; categories to string/categorical.
- Duplicates: row-level or key-based removal.
- Text normalization: lowercase/trim/remove extra spaces.
- Split before heavy transforms to avoid leakage.
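The cleaning steps above can be sketched in pandas on a tiny illustrative frame (the column names and values are made up for the example):

```python
import numpy as np
import pandas as pd

# Illustrative raw data with the usual problems
df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", None],
    "city": [" Delhi", "delhi ", "Mumbai", "Mumbai"],
    "amount": [250.0, np.nan, 1200.0, 90.0],
})

df["order_date"] = pd.to_datetime(df["order_date"])        # types: dates to datetime
df["city"] = df["city"].str.strip().str.lower()            # text normalization
df["amount"] = df["amount"].fillna(df["amount"].median())  # impute with median
df = df.drop_duplicates()                                  # row-level duplicate removal
# cap the upper tail (a simple winsorize) instead of dropping outliers
df["amount"] = df["amount"].clip(upper=df["amount"].quantile(0.99))
```

Each step is reversible to inspect: run `df.isna().sum()` before and after to confirm what changed.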
10) Exploratory Data Analysis
- Start with shape, nulls, basic stats.
- Target distribution and class balance.
- Segment analysis: cohort/city/channel.
- Hypothesis checks with simple charts and tests.
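Segment analysis usually reduces to a groupby. A minimal sketch, with an assumed transactions table (channel/converted/revenue columns are invented for the example):

```python
import pandas as pd

# Illustrative transactions table
df = pd.DataFrame({
    "channel": ["web", "web", "store", "store", "app"],
    "converted": [1, 0, 1, 1, 0],
    "revenue": [120, 0, 80, 200, 0],
})

# Conversion rate and revenue per channel, best segments first
summary = (df.groupby("channel")
             .agg(n=("converted", "size"),
                  conv_rate=("converted", "mean"),
                  revenue=("revenue", "sum"))
             .sort_values("conv_rate", ascending=False))
print(summary)
```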
11) Visualization (story first)
- Pick the right chart (bar/line/box/scatter/heatmap).
- Use clear titles, labels, units; avoid clutter.
- Tell a story: problem → insight → action.
- Dashboards: KPIs + filters + drilldowns; keep it to one page.
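A chart that follows these rules needs only a few matplotlib calls. A minimal sketch with made-up KPI values:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Illustrative values, e.g. from a segment analysis
channels = ["Web", "Store", "App"]
conv_rate = [0.50, 1.00, 0.00]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(channels, conv_rate)
ax.set_title("Conversion rate by channel")  # clear title
ax.set_ylabel("Conversion rate")            # labeled axis
ax.set_ylim(0, 1)                           # honest, fixed scale
fig.tight_layout()
fig.savefig("conv_by_channel.png")
```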
12) Supervised ML (workflow)
- Split data (train/valid/test) with proper stratification.
- Pipeline: scaling/encoding → model → thresholding.
- Try linear/logistic, trees, ensembles (RF/XGBoost).
- Tune with cross-validation; guard against leakage.
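The workflow above can be sketched with scikit-learn on a synthetic dataset. Putting the scaler inside the Pipeline means it is refit on each CV fold, which is the leakage guard mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified split keeps the class balance equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Pipeline: scaling -> model; tuned/validated with cross-validation
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f}")

pipe.fit(X_train, y_train)
print("held-out test accuracy:", pipe.score(X_test, y_test))
```

Start with this baseline before trying trees or boosting; report the held-out test score only once, at the end.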
13) Unsupervised ML
- K-Means for customer segments (choose K via elbow/silhouette).
- DBSCAN for irregular clusters/outliers.
- PCA to reduce dimensions and visualize.
- Anomaly detection: isolation forest/one-class SVM.
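Choosing K via the silhouette score can be sketched on synthetic "customer" data (make_blobs stands in for real features here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic customer features with a few natural segments
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Try several K; keep the one with the highest silhouette (closer to 1 = tighter clusters)
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"best K = {best_k}, silhouette = {best_score:.2f}")
```

The elbow method (plotting inertia vs. K) is a quicker visual alternative; silhouette gives a single comparable number.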
14) Feature Engineering
- Numeric: scaling (standard/min-max), binning for monotonicity.
- Categorical: one-hot; high-cardinality → target/woe encoding (careful with CV).
- Date/Time: day/week/month, lags, rolling stats.
- Text: bag-of-words/TF-IDF for classical models.
- Interaction terms where logic supports it.
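Several of these transforms fit in one scikit-learn ColumnTransformer. A minimal sketch with invented columns (city/amount/order_date are assumptions for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative frame
df = pd.DataFrame({
    "city": ["delhi", "mumbai", "delhi", "pune"],
    "amount": [250.0, 1200.0, 90.0, 430.0],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-12", "2024-02-03", "2024-02-20"]),
})

# Date/time features: extract parts before encoding
df["month"] = df["order_date"].dt.month
df["dayofweek"] = df["order_date"].dt.dayofweek

# One-hot for categoricals, scaling for numerics, in one step
ct = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", StandardScaler(), ["amount", "month", "dayofweek"]),
])
X = ct.fit_transform(df)
print(X.shape)  # 3 city dummies + 3 scaled numeric columns
```

Fitting the transformer only on training data (inside a Pipeline) is what keeps encodings leakage-free under cross-validation.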
15) Model Evaluation (choose right metric)
- Regression: MAE, RMSE, R² (MAE/RMSE report error in the target's units; R² is scale-free).
- Classification: Accuracy (only when classes are balanced), F1, ROC-AUC, PR-AUC.
- Confusion matrix → business cost matrix (false-pos vs false-neg).
- Calibration and threshold tuning for decisions.
- Cross-validation and hold-out test for honest estimates.
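Threshold tuning and the confusion matrix can be sketched on tiny illustrative labels and scores (the numbers are invented):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Illustrative true labels and model scores
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.05, 0.7, 0.5])

print("ROC-AUC:", roc_auc_score(y_true, y_prob))  # threshold-free ranking metric

# Threshold tuning: pick the cut-off that maximizes F1
thresholds = np.arange(1, 10) / 10
best_t = max(thresholds,
             key=lambda t: f1_score(y_true, (y_prob >= t).astype(int)))
y_pred = (y_prob >= best_t).astype(int)
print("best threshold:", best_t)
print(confusion_matrix(y_true, y_pred))  # rows = actual, cols = predicted
```

In practice, replace F1 with a business cost function that weights false positives and false negatives differently.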
16) Deployment (simple first)
- Batch: daily/weekly scoring → CSV/DB → BI dashboard.
- Real-time API: FastAPI/Flask behind Nginx; auth + logging.
- Package env with Docker; use env vars for secrets.
- Track model & data versions; store inference logs.
# Minimal FastAPI stub
from fastapi import FastAPI
import joblib, pandas as pd

app = FastAPI()
model = joblib.load("model.joblib")

@app.post("/predict")
def predict(payload: dict):
    X = pd.DataFrame([payload])
    y = model.predict_proba(X)[:, 1][0]
    return {"score": float(y)}
17) MLOps Basics
- Experiment tracking: MLflow/W&B (params, metrics, artifacts).
- Model registry: stage → production with approvals.
- Data versioning: DVC or lakehouse tables.
- CI/CD: tests + build image + deploy; rollback plan.
- Monitoring: performance + data drift; alerting & retraining.
18) Project Ideas (industry-style)
- Sales Forecast – store/item time series; add holidays & promos.
- Customer Churn – telecom/SaaS usage → retention actions.
- Credit Risk – classify default risk; set score cut-offs.
- Warranty Claims – anomaly detection for fraud/defects.
- NPS Text Analysis – topic + sentiment; route issues.
19) Portfolio Guide (what recruiters check)
- 3–5 end-to-end projects with clean READMEs (problem → approach → result → next steps).
- Live demos (Streamlit) + small video walkthroughs (2–3 min).
- Reproducible env (requirements.txt/conda.yaml) + clear folder structure.
- Business impact notes: “reduced churn 3% on sample data.”
20) Interview Prep (fresher-friendly)
- Explain bias-variance trade-off with an example.
- SQL joins, groupings, and one window query.
- Walkthrough one project end-to-end (decisions + mistakes).
- ML: when to use logistic vs tree vs boosting.
- Case: estimate impact and choose metric/threshold.
21) FAQ
How long to be job-ready? 4–6 months of focused learning + 3–5 projects + strong resume.
Math needed? Basic stats/probability; learn more as you go.
Python vs R? Python for broader ecosystem; R shines for statistics and some viz.
Laptop spec? 8GB RAM min (16GB better), SSD, recent CPU; cloud GPUs only for deep learning.
22) Resources & Next Steps
- Daily: 60–90 mins—Python/SQL practice + one EDA notebook.
- Weekly: one mini-project or dashboard; write a short post.
- Monthly: one deployable app (Streamlit) + code review with peers.
- Checklist: resume with quantified impacts, GitHub pinned repos, LinkedIn posts.