Chapter 3 — Linear Regression

Correlation, CAPM, Fama-French, variable selection

Prof. Xuhu Wan

Chapter 3 · Introduction to Business Analytics

Linear Regression

Correlation, CAPM, Fama-French factors, train/test, variable selection.

ISOM, HKUST Business School · Wan Academy · 2026 Edition

Correlation — r vs r²

r ≈ 0.74 between NVDA and SPY daily returns.

Do not say “74 % of NVDA’s movement is explained by SPY.” That’s wrong.

Do say “r² = 0.74² ≈ 0.55, so about 55 % of the variance in NVDA’s daily returns is linearly explained by SPY.” The remaining 45 % is not explained by SPY: it is idiosyncratic or driven by other factors.

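Mixing up r and r² is one of the most common mistakes in introductory regression: r measures the direction and strength of the linear association, while r² measures the share of variance explained.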
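
The distinction can be checked numerically. The sketch below uses simulated returns rather than real NVDA or SPY data; the means, volatilities, and the slope of 2.2 are illustrative assumptions chosen so that the population correlation is roughly 0.74. It shows that squaring r gives the same number as the R² of a one-variable least-squares fit.

```python
import numpy as np

# Simulated daily returns; every parameter here is an illustrative assumption.
rng = np.random.default_rng(0)
n = 1000
spy = rng.normal(0.0005, 0.01, n)              # stand-in for market returns
nvda = 2.2 * spy + rng.normal(0.0, 0.02, n)    # stand-in for stock returns

# Pearson correlation: direction and strength of the linear association.
r = np.corrcoef(spy, nvda)[0, 1]

# R² from a one-variable least-squares fit: share of variance explained.
slope, intercept = np.polyfit(spy, nvda, 1)
residuals = nvda - (intercept + slope * spy)
r_squared = 1.0 - residuals.var() / nvda.var()

print(f"r = {r:.3f}, r squared = {r**2:.3f}, R² from fit = {r_squared:.3f}")
```

The last two printed values agree up to floating-point error, which is why r has to be squared before it is read as "variance explained".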

CAPM — α and β

The Capital Asset Pricing Model:

\[r_{\text{stock}} - r_f = \alpha + \beta\,(r_m - r_f) + \varepsilon\]

β (beta): sensitivity to the market
- β > 1 = aggressive (amplifies market moves)
- β < 1 = defensive (dampens market moves)
α (Jensen’s alpha): abnormal return after adjusting for market risk
- α > 0 = outperforms the benchmark
- α ≈ 0 = what CAPM predicts for an efficiently priced asset
r_f: risk-free rate; r_m: market return; ε: error term

We subtract r_f from both the stock return and the market return because CAPM is stated in excess returns, i.e. returns earned above the risk-free rate.
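
In a one-factor regression the least-squares slope equals the covariance of the two excess-return series divided by the variance of the market excess return, and the intercept follows from the sample means. The sketch below checks this on simulated data; the constant daily risk-free rate, the true β of 1.5, and the volatilities are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
rf = 0.0001                                    # assumed constant daily risk-free rate
r_m = rng.normal(0.0005, 0.01, n)              # simulated market returns
r_stock = rf + 1.5 * (r_m - rf) + rng.normal(0.0, 0.015, n)

# CAPM works with excess returns on both sides of the equation.
mkt_excess = r_m - rf
stock_excess = r_stock - rf

# Closed-form estimates: beta = Cov / Var, alpha from the sample means.
beta = np.cov(mkt_excess, stock_excess, ddof=1)[0, 1] / np.var(mkt_excess, ddof=1)
alpha = stock_excess.mean() - beta * mkt_excess.mean()

# The same numbers from a least-squares line fit.
ols_slope, ols_intercept = np.polyfit(mkt_excess, stock_excess, 1)

print(f"beta  = {beta:.4f} (closed form) vs {ols_slope:.4f} (least squares)")
print(f"alpha = {alpha:.6f} (closed form) vs {ols_intercept:.6f} (least squares)")
```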

Fit CAPM with statsmodels

Important

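sm.add_constant(X) is required: sm.OLS does not add an intercept on its own, so without it statsmodels fits a regression through the origin and no α is estimated. This is a common bug for analysts moving from R or Stata, where the intercept is included by default.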

Reading the Output

The model.summary() table:

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| `const` (α) | 0.0043 | 0.001 | 4.32 | 0.000 | 0.0024 | 0.0063 |
| `Mkt_excess` (β) | 2.221 | 0.103 | 21.65 | 0.000 | 2.020 | 2.422 |

R² = 0.542

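- β = 2.22 → NVDA’s excess return moves ≈ 2.2 percentage points, on average, per 1-point move in the market’s excess return
- 95 % CI for β = [2.02, 2.42] excludes 1 → NVDA is significantly more aggressive than the market
- p-values near zero → both α and β are statistically different from zero
- R² = 0.54 → the market explains 54 % of the variance in NVDA’s daily excess returns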

Variable Selection — AIC vs BIC

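\[\text{AIC} = -2\ln L + 2k \qquad \text{BIC} = -2\ln L + k\ln n\]

Here L is the maximised likelihood, k the number of estimated parameters, and n the number of observations. For both criteria, lower is better.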

Note

AIC penalty +2k does not depend on sample size → tends to keep more variables, optimised for forecasting.

BIC penalty +k ln n grows with sample size and exceeds AIC’s once n ≥ 8 (ln n > 2) → keeps fewer variables, optimised for identifying the true model.

No criterion is simultaneously efficient and consistent — a fundamental statistical impossibility. Use AIC if you care about prediction; BIC if you care about which factors are real.
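
The difference in penalties can be seen by scoring a small model against one padded with irrelevant regressors. The sketch below computes both criteria from the Gaussian log-likelihood using NumPy only; the helper `gaussian_aic_bic` is hypothetical, the data are simulated, and k here counts regression coefficients including the intercept. Counting the error variance as an extra parameter would add the same constant to every model and leave the ranking unchanged.

```python
import numpy as np

def gaussian_aic_bic(y, X):
    """Return (AIC, BIC) for an OLS fit of y on X; X must already contain the intercept column."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ coef) ** 2)
    # Maximised Gaussian log-likelihood with the variance at its MLE, ssr / n.
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return -2 * log_lik + 2 * k, -2 * log_lik + k * np.log(n)

rng = np.random.default_rng(3)
n = 500
real_factor = rng.normal(size=n)
noise_factors = rng.normal(size=(n, 3))        # unrelated to y by construction
y = 1.0 + 2.0 * real_factor + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), real_factor])
X_big = np.column_stack([X_small, noise_factors])

aic_small, bic_small = gaussian_aic_bic(y, X_small)
aic_big, bic_big = gaussian_aic_bic(y, X_big)

# Cost of adding the three noise factors under each criterion.
extra_aic = aic_big - aic_small
extra_bic = bic_big - bic_small
print(f"AIC change: {extra_aic:+.2f}   BIC change: {extra_bic:+.2f}")
```

Both changes share the same improvement in fit, so BIC's change is larger than AIC's by exactly 3 × (ln n − 2): the extra charge BIC applies per added variable. In statsmodels, a fitted result exposes the two scores as `.aic` and `.bic`.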

Chapter Summary

| Concept | Tool |
|---|---|
| Correlation | `df.corr()` |
| Regression | `sm.OLS(y, sm.add_constant(X)).fit()` |
| Reading output | `.summary()` |
| CI for β | `.conf_int()` |
| Prediction | `.predict()` / `.get_prediction()` |
| Variable selection | AIC / BIC / Adj R² / Mallows’s Cp |

The book’s Chapter 3 gives the full treatment of CAPM, the Fama-French 5-factor model, residual diagnostics, and the pharmacy multiple-regression case.

Next: Chapter 4 — Clustering.