Loading, exploring, cleaning, transforming
The pipeline of every analytical project.
Prof. Xuhu Wan
ISOM, HKUST Business School · Wan Academy · 2026 Edition
Load → Explore → Clean → Transform → Analyse → Visualise
Every analytical task follows the same six stages. This chapter covers the first four; Chapters 3 and 4 cover the Analyse and Visualise stages in depth.
Note
A DataFrame is conceptually a dictionary of Series objects sharing a common index. Think of it as Excel with programming.
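A minimal sketch of the "dictionary of Series" idea. The column names and values are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same index labels (hypothetical data).
price = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
volume = pd.Series([100, 200, 300], index=["a", "b", "c"])

# Build a DataFrame from a dict of Series: the keys become column names.
df = pd.DataFrame({"price": price, "volume": volume})

# Selecting a column by key returns a Series that carries the shared index.
col = df["price"]
print(type(col).__name__)
print(list(df.columns))
print(col.index.equals(df.index))
```

Column access by key, as in `df["price"]`, mirrors a dictionary lookup; the shared index is what keeps rows aligned across columns.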
Important
Always run these three first. They are the analyst’s smoke test that the file loaded correctly before any analysis runs.
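The three calls themselves are not listed on this slide. One plausible trio, taken from the Explore row of the summary table, is sketched here as an assumption. The CSV text is a hypothetical stand-in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = (
    "date,ticker,close\n"
    "2026-01-02,AAA,10.5\n"
    "2026-01-05,AAA,10.8\n"
    "2026-01-06,AAA,\n"
)
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Assumed smoke test: look at the rows, the size, and the column types.
print(df.head())    # do the first rows look as expected?
print(df.shape)     # (rows, columns): did everything load?
print(df.dtypes)    # were dates and numbers parsed as such?
```

A column that shows up as `object` when it should be numeric is a typical sign that the file needs cleaning before analysis.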
| Selector | Indexed by | End-point |
|---|---|---|
| df.iloc[1:3] | Integer position | Excluded (Python slice) |
| df.loc['b':'d'] | Index label | Included |
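The end-point difference between the two selectors can be checked on a small frame. The data below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]},
                  index=["a", "b", "c", "d", "e"])

# Positional slice: positions 1 and 2; position 3 is excluded.
by_position = df.iloc[1:3]

# Label slice: labels b, c and d; the end label is included.
by_label = df.loc["b":"d"]

print(list(by_position.index))  # ['b', 'c']
print(list(by_label.index))     # ['b', 'c', 'd']
```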
Warning
Never shuffle a time series before splitting. Shuffling leaks future information into training — the classic look-ahead bias that produces backtests that look brilliant but fail in production.
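A minimal sketch of a chronological split, assuming an 80/20 cut (the ratio and the data are illustrative, not from the deck). Rows are sorted by date and cut at a position, so every training row precedes every test row.

```python
import pandas as pd

# Hypothetical daily series of 10 observations.
dates = pd.date_range("2026-01-01", periods=10, freq="D")
df = pd.DataFrame({"ret": range(10)}, index=dates)

# Sort by time, then cut by position: no shuffling anywhere.
df = df.sort_index()
split = int(len(df) * 0.8)   # assumed 80/20 ratio
train = df.iloc[:split]
test = df.iloc[split:]

print(train.index.max(), test.index.min())
```

If a library splitter is used instead, the same idea applies: check that shuffling is switched off before trusting the result.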
| Stage | Tools |
|---|---|
| Load | pd.read_csv() |
| Explore | .head() · .shape · .dtypes · .describe() |
| Select | .loc[] · .iloc[] · boolean masks |
| Modify | new columns · .apply() · method chaining |
| Missing | .isna() · .dropna() · .fillna() |
| Time-series | .resample() · .rolling() · .shift() |
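The Missing and Time-series rows can be sketched together on one small series. The values are hypothetical, and `.ffill()` (the forward-fill relative of `.fillna()`) is my choice here, not something the deck prescribes.

```python
import pandas as pd

idx = pd.date_range("2026-01-01", periods=6, freq="D")
s = pd.Series([1.0, None, 3.0, 4.0, None, 6.0], index=idx)

# Missing values: count them, drop them, or fill them.
n_missing = int(s.isna().sum())   # 2
dropped = s.dropna()              # 4 rows remain
filled = s.ffill()                # carry the last known value forward

# Time-series tools, applied to the filled series.
prev = filled.shift(1)                   # previous day's value; first is NaN
roll = filled.rolling(window=3).mean()   # trailing 3-day mean; first two are NaN
two_day = filled.resample("2D").sum()    # totals over 2-day buckets

print(n_missing, len(dropped))
print(two_day.tolist())  # [2.0, 7.0, 10.0]
```

Forward filling uses only past values, which keeps it consistent with the look-ahead warning above; filling with a full-sample mean would not.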
Full treatment in the book — Chapter 2. This deck covers the most-used selectors; the book walks through cleaning, transformation, and groupby in depth.
Next: Chapter 3 — Linear Regression.
Prof. Xuhu Wan · HKUST ISOM · Introduction to Business Analytics