Chapter 2 — DataFrames

Loading, exploring, cleaning, transforming

Prof. Xuhu Wan

Chapter 2 · Introduction to Business Analytics

DataFrames

The pipeline of every analytical project.

ISOM, HKUST Business School · Wan Academy · 2026 Edition

The Six-Stage Pipeline

Load → Explore → Clean → Transform → Analyse → Visualise

Every analytical task follows the same six stages. This chapter covers the first four; Chapters 3 and 4 cover analyse and visualise in depth.

Note

A DataFrame is conceptually a dictionary of Series objects sharing a common index. Think of it as Excel with programming.
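The "dictionary of Series" picture can be checked directly. The sketch below is illustrative only: the column names and values are invented, not taken from the chapter's data.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values only).
price = pd.Series([10.0, 12.5, 8.0], index=["a", "b", "c"], name="price")
qty = pd.Series([3, 1, 4], index=["a", "b", "c"], name="qty")

# Building a DataFrame from a dict of Series aligns them on the common index.
df = pd.DataFrame({"price": price, "qty": qty})

# Selecting one column hands back a Series that carries the same index.
col = df["price"]
print(type(col).__name__)
print(col.index.equals(df.index))
```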

Load and First Look

Important

Always run these three first. They are the analyst’s smoke test that the file loaded correctly before any analysis runs.
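The three commands themselves are not reproduced in this text. A minimal sketch, assuming they are .head(), .shape, and .dtypes (the Explore tools listed in the chapter summary), might look like this. The CSV content and column names are hypothetical, and an in-memory string stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """order_id,category,basket_value
1,Beauty,42.5
2,Fashion,18.0
3,Beauty,30.0
"""

# pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows: do the values look like what you expected?
print(df.shape)     # (rows, columns): does the row count match the source?
print(df.dtypes)    # column types: did numbers load as numbers, not text?
```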

.iloc vs .loc — The Most Common Confusion

Selector	Indexed by	End-point
`df.iloc[1:3]`	Integer position	Excluded (Python slice)
`df.loc['b':'d']`	Index label	Included
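The two rows of the table can be run side by side. This is a small sketch on an invented five-row frame with the labels a to e.

```python
import pandas as pd

# Five rows labelled 'a'..'e' (illustrative values only).
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=list("abcde"))

# Position-based: positions 1 and 2 only; the end position 3 is excluded.
by_position = df.iloc[1:3]

# Label-based: labels 'b', 'c' and 'd'; the end label is included.
by_label = df.loc["b":"d"]

print(list(by_position.index))
print(list(by_label.index))
```

The same-looking slice returns two rows in one case and three in the other, which is why the row count is worth checking after every selection.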

Train/Test Split: Time-Series Rule

Warning

Never shuffle a time series before splitting. Shuffling leaks future information into training — the classic look-ahead bias that produces capacity-planning forecasts that look brilliant but fail in production.
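One way to follow the rule is to sort by date and cut at a position, so every training row precedes every test row. The sketch below assumes an 80/20 split and an invented daily demand series; neither comes from the chapter.

```python
import pandas as pd

# Ten days of hypothetical demand data.
dates = pd.date_range("2026-01-01", periods=10, freq="D")
ts = pd.DataFrame({"demand": range(10)}, index=dates)

# Sort chronologically, then cut by position. No shuffling.
ts = ts.sort_index()
cut = int(len(ts) * 0.8)   # 80/20 is an assumed ratio
train = ts.iloc[:cut]
test = ts.iloc[cut:]

# Guard against look-ahead: the latest training date must precede the earliest test date.
assert train.index.max() < test.index.min()
print(len(train), len(test))
```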

Cross-Sectional Split: Shuffle First

For data without time structure — a snapshot of ten university dorms — the rule reverses: shuffle, then split, otherwise an unintended sort (alphabetical, by size) makes train and test systematically different.
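A minimal sketch of shuffle-then-split using DataFrame.sample. The dorm names, bed counts, 80/20 ratio, and seed are all assumptions made for illustration.

```python
import pandas as pd

# Ten hypothetical dorms, deliberately stored in alphabetical order.
dorms = pd.DataFrame({
    "dorm": [f"Hall {c}" for c in "ABCDEFGHIJ"],
    "beds": [120, 80, 200, 150, 95, 60, 175, 110, 140, 90],
})

# sample(frac=1) returns every row in random order; the seed makes it repeatable.
shuffled = dorms.sample(frac=1, random_state=42).reset_index(drop=True)

cut = int(len(shuffled) * 0.8)   # 80/20 is an assumed ratio
train = shuffled.iloc[:cut]
test = shuffled.iloc[cut:]

print(len(train), len(test))
```

Fixing random_state keeps the split reproducible, so a colleague rerunning the notebook gets the same train and test rows.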

Worked Example: Shopee Sale-Day Baskets

A 12-row sale-day snapshot — small enough to read, rich enough to drive a decision.

Working with an AI Copilot

An AI copilot will happily call df.drop_duplicates() or fillna(0) and never tell you how many rows it just erased. Silent cleaning is how bad decisions get shipped.

Three rules for every cleaning step:

Demand a before/after row count for any dropna, drop_duplicates, merge, or filter.
Never let the AI fillna(0) or fillna(mean) without explicit approval — imputation is a modelling choice, not a janitorial one.
The AI cannot distinguish a real zero from a missing value coded as zero. That is domain knowledge — and it belongs to you, not the model.
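The first rule is easy to enforce with a small helper. The function report_rows below is hypothetical (it is not a pandas function), and the data are invented to contain one duplicate, one missing value, and one literal zero.

```python
import numpy as np
import pandas as pd

def report_rows(before, after, step):
    """Hypothetical helper: print and return how many rows a step removed."""
    dropped = len(before) - len(after)
    print(f"{step}: {len(before)} -> {len(after)} rows ({dropped} dropped)")
    return dropped

# Invented data: one exact duplicate row, one NaN, one recorded zero.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "basket_value": [40.0, 25.0, 25.0, np.nan, 0.0],
})

deduped = raw.drop_duplicates()
n_dup = report_rows(raw, deduped, "drop_duplicates")

cleaned = deduped.dropna(subset=["basket_value"])
n_na = report_rows(deduped, cleaned, "dropna")

# Count missing values and zeros separately before any imputation decision.
n_missing = int(deduped["basket_value"].isna().sum())
n_zero = int((deduped["basket_value"] == 0).sum())
print(n_missing, n_zero)
```

Counting NaN and zero separately does not settle whether a zero is real; it only puts the question in front of the analyst, who has the domain knowledge to answer it.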

Mistakes Library: Reinhart and Rogoff (2010)

Warning

Growth in a Time of Debt (Harvard, 2010) claimed that countries with public debt above 90% of GDP saw average growth turn negative. The paper was cited by the IMF, the European Commission, and the US Treasury to justify austerity programmes across Europe.

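In 2013, UMass graduate student Thomas Herndon asked for the spreadsheet. The Excel AVERAGE(...) formula covered rows 30 to 44 instead of rows 30 to 49, so five countries were silently excluded. Herndon and his co-authors also found selective data exclusions and an unconventional weighting scheme. Correcting all three moved average growth at the high-debt threshold from −0.1% to +2.2%; the spreadsheet range error alone accounted for roughly 0.3 percentage points.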

Lesson: sanity-check the row count of every groupby, every iloc, every merge. A five-row range error sat unnoticed for three years in a paper that shaped macro policy debate across a continent.
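A row-count check can be built into the calculation itself. The sketch below uses placeholder data (20 invented rows, not the paper's figures) and a hypothetical helper, checked_mean, that refuses to average a slice with the wrong number of rows.

```python
import pandas as pd

# Placeholder data: 20 hypothetical countries with made-up growth values.
df = pd.DataFrame({
    "country": [f"C{i:02d}" for i in range(20)],
    "growth": [float(i) for i in range(20)],
})
expected_n = 20

def checked_mean(frame, column, expected_rows):
    """Hypothetical helper: average a column only if the row count is as expected."""
    if len(frame) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(frame)}")
    return frame[column].mean()

full = df.iloc[:expected_n]
truncated = df.iloc[:15]   # mimics a range that stops five rows short

print(checked_mean(full, "growth", expected_n))

try:
    checked_mean(truncated, "growth", expected_n)
except ValueError as err:
    print("caught:", err)
```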

Decision Memo — Where to Spend the 12.12 Coupon Budget?

The output of an analysis is not a chart — it is a one-page memo your director can act on.

To: Marketing Director, Shopee SEA
From: <Your name>, intern analyst
Subject: Re-allocate 60% of 12.12 coupon budget from Fashion → Beauty
Date: 2026-05-15

Recommendation: Shift 60% of the next sale's coupon spend from Fashion to Beauty.

Evidence:

Among baskets that used a coupon, Beauty had the highest mean basket value.

Coupon-driven baskets in Beauty are 2.3× larger than non-coupon ones.

Returning-customer share is highest in Beauty.

Caveats:

Sample is one sale day only.

Causation vs selection: heavy spenders may already use coupons.

Next step: Re-run with three sale days; A/B test the budget shift.

Chapter Summary

Stage	Tools
Load	`pd.read_csv()`
Explore	`.head()` · `.shape` · `.dtypes` · `.describe()`
Select	`.loc[]` · `.iloc[]` · boolean masks
Modify	new columns · `.apply()` · method chaining
Missing	`.isna()` · `.dropna()` · `.fillna()`
Time-series	`.resample()` · `.rolling()` · `.shift()`
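The time-series tools in the last row are not demonstrated elsewhere in this deck. A brief sketch on an invented 14-day series, assuming weekly totals, a 3-day rolling mean, and a one-day lag are the quantities of interest:

```python
import pandas as pd

# Fourteen days of hypothetical daily sales, starting on a Monday.
dates = pd.date_range("2026-01-05", periods=14, freq="D")
sales = pd.Series(range(1, 15), index=dates, name="sales")

# resample: aggregate daily values into weekly totals (weeks ending Sunday).
weekly = sales.resample("W").sum()

# rolling: 3-day moving average; the first two entries are NaN by construction.
smooth = sales.rolling(3).mean()

# shift: yesterday's value aligned to today's date, a common lag feature.
lagged = sales.shift(1)

print(weekly)
print(smooth.head())
print(lagged.head())
```

Because shift(1) only looks backwards, a lag built this way does not leak future values into the current row.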

Full treatment in the book — Chapter 2. This deck covers the most-used selectors; the book walks through cleaning, transformation, and groupby in depth.

Next: Chapter 3 — Linear Regression.