Chapter 2 — DataFrames

Loading, exploring, cleaning, transforming

Prof. Xuhu Wan

Chapter 2 · Introduction to Business Analytics

DataFrames

The pipeline of every analytical project.

ISOM, HKUST Business School · Wan Academy · 2026 Edition

The Six-Stage Pipeline

Load → Explore → Clean → Transform → Analyse → Visualise

Every analytical task follows the same six stages. This chapter covers the first four; Chapters 3 and 4 cover the Analyse and Visualise stages in depth.

Note

A DataFrame is conceptually a dictionary of Series objects sharing a common index: each column is a Series, and every column uses the same row labels. Think of it as a spreadsheet you control with code.
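
The dictionary-of-Series idea can be sketched as follows. The column names and values are invented for illustration and do not come from the course data.

```python
import pandas as pd

# Hypothetical columns, invented for illustration.
index = ["a", "b", "c"]
price = pd.Series([10.0, 12.5, 11.0], index=index)
volume = pd.Series([100, 150, 120], index=index)

# Building a DataFrame from a dict of Series aligns them on the shared index.
df = pd.DataFrame({"price": price, "volume": volume})

# Selecting one column returns a Series that carries the same index.
col = df["price"]
print(type(col).__name__)
print(df.index.equals(col.index))
```

Each key of the dictionary becomes a column name, and the row labels are shared by every column.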

Load and First Look

Important

Always run these three first. They are the analyst’s smoke test that the file loaded correctly before any analysis runs.
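
The three checks are not reproduced in this text. A minimal sketch, assuming they are .head(), .shape and .dtypes from the Explore row of the chapter summary, and using a small in-memory CSV (hypothetical data) in place of a real file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file path.
csv_text = """date,ticker,close
2024-01-02,AAA,10.0
2024-01-03,AAA,10.5
2024-01-04,AAA,10.2
"""

# pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.head())    # first rows: do the values look sensible?
print(df.shape)     # (rows, columns): did everything load?
print(df.dtypes)    # column types: were dates and numbers parsed as intended?
```

With a real dataset, replace the StringIO object with the file path.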

.iloc vs .loc — The Most Common Confusion

Selector           Indexed by          End-point
df.iloc[1:3]       Integer position    Excluded (Python slice)
df.loc['b':'d']    Index label         Included
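
The two slices from the table can be compared on a small frame. The frame itself is a made-up example with labels 'a' to 'e'.

```python
import pandas as pd

# Hypothetical frame: five rows labelled 'a' to 'e'.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=["a", "b", "c", "d", "e"])

by_position = df.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = df.loc["b":"d"]    # labels 'b', 'c' and 'd'; the end label is included

print(list(by_position.index))   # ['b', 'c']
print(list(by_label.index))      # ['b', 'c', 'd']
```

The .iloc slice returns two rows and the .loc slice returns three, even though both start at the same row.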

Train/Test Split: Time-Series Rule

Warning

Never shuffle a time series before splitting. Shuffling leaks future information into training — the classic look-ahead bias that produces backtests that look brilliant but fail in production.
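
One way to respect this rule is to split by position after sorting by date. The sketch below uses invented data and an assumed 80/20 split; the fraction is illustrative, not a course recommendation.

```python
import pandas as pd

# Hypothetical daily series, invented for illustration.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"y": range(10)}, index=dates)

# Sort chronologically first, then cut at a single point in time. No shuffling.
df = df.sort_index()
train_fraction = 0.8                      # assumed split ratio
split = int(len(df) * train_fraction)

train = df.iloc[:split]   # earliest observations
test = df.iloc[split:]    # most recent observations

print(train.index.max(), test.index.min())
```

Every training date precedes every test date, so the model never sees information from the period it is evaluated on.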

Chapter Summary

Stage          Tools
Load           pd.read_csv()
Explore        .head() · .shape · .dtypes · .describe()
Select         .loc[] · .iloc[] · boolean masks
Modify         new columns · .apply() · method chaining
Missing        .isna() · .dropna() · .fillna()
Time-series    .resample() · .rolling() · .shift()
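
The Missing and Time-series rows can be illustrated on one small series. This is a sketch on invented daily sales data; filling gaps with 0.0 assumes a missing day means no sales, which will not hold for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with two missing days.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([5.0, np.nan, 7.0, 8.0, np.nan, 10.0], index=dates, name="sales")

# Missing values: count them, then either drop or fill.
n_missing = int(sales.isna().sum())
dropped = sales.dropna()        # keeps only the observed days
filled = sales.fillna(0.0)      # assumption: a missing day means zero sales

# Time-series tools on the filled series.
prev_day = filled.shift(1)                        # yesterday's value on today's row
rolling_mean = filled.rolling(window=3).mean()    # 3-day moving average
two_day_total = filled.resample("2D").sum()       # totals over 2-day bins

print(n_missing, len(dropped))
print(two_day_total)
```

Note that .shift(1) and .rolling(3) leave NaN in the first rows, where not enough history exists yet.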

Full treatment in the book — Chapter 2. This deck covers the most-used selectors; the book walks through cleaning, transformation, and groupby in depth.

Next: Chapter 3 — Linear Regression.