Chapter 3 — Linear Regression

Correlation, CAPM, Fama-French, variable selection

Prof. Xuhu Wan

Chapter 3 · Introduction to Business Analytics

Linear Regression

Correlation, CAPM, Fama-French factors, train/test, variable selection.

ISOM, HKUST Business School · Wan Academy · 2026 Edition

Correlation — r vs r²

r ≈ 0.74 between NVDA and SPY daily returns.

Do not say “74 % of NVDA’s movement is explained by SPY.” That’s wrong: r measures the strength of the linear relationship, not a share of variance.

Do say “r² = 0.55, so 55 % of NVDA’s variance is linearly explained by SPY.” The remaining 45 % is idiosyncratic.

The relationship between r and r² is the most-confused fact in introductory regression.
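To make the r versus r² distinction concrete, here is a minimal sketch on synthetic returns. The data are simulated for illustration only (they are not real NVDA or SPY returns), and the variable names are assumptions. It shows that the squared correlation equals the R² of a one-variable least-squares fit, and that r² is smaller than |r|.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns (illustrative only, not real NVDA/SPY data).
n = 1000
market = rng.normal(0.0, 0.01, n)
stock = 2.0 * market + rng.normal(0.0, 0.02, n)

# Pearson correlation between the two return series.
r = np.corrcoef(stock, market)[0, 1]

# R² from a simple least-squares line of stock on market.
slope, intercept = np.polyfit(market, stock, 1)
residuals = stock - (intercept + slope * market)
r_squared = 1.0 - residuals.var() / stock.var()

print(f"r            = {r:.3f}")
print(f"r squared    = {r**2:.3f}")
print(f"R² from fit  = {r_squared:.3f}")
```

The last two printed values agree, which is why the share of variance explained should be quoted from r², never from r.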

CAPM — α and β

The Capital Asset Pricing Model:

\[r_{\text{stock}} - r_f = \alpha + \beta\,(r_m - r_f) + \varepsilon\]

  • β (beta) — sensitivity to the market
    • β > 1 = aggressive (amplifies)
    • β < 1 = defensive (dampens)
  • α (Jensen’s alpha) — abnormal return after adjusting for market risk
    • α > 0 = outperforms benchmark
    • α ≈ 0 = efficient-market prediction
  • r_f — risk-free rate

We subtract r_f from both sides because CAPM models excess returns.
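The excess-return construction can be sketched as follows. This is an illustration on simulated data: the risk-free rate, the return series and all variable names are hypothetical. For a single regressor, the least-squares slope equals Cov(stock, market) / Var(market), a standard identity that is used here in place of a full regression.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical inputs: a constant daily risk-free rate and simulated returns.
rf = np.full(n, 0.04 / 252)
market_ret = rng.normal(0.0005, 0.01, n)
stock_ret = rf + 0.0002 + 1.5 * (market_ret - rf) + rng.normal(0.0, 0.015, n)

# CAPM is stated in excess returns, so subtract r_f from both series.
stock_excess = stock_ret - rf
market_excess = market_ret - rf

# Closed-form least-squares estimates for one regressor.
beta = np.cov(stock_excess, market_excess)[0, 1] / np.var(market_excess, ddof=1)
alpha = stock_excess.mean() - beta * market_excess.mean()

label = "aggressive" if beta > 1 else "defensive"
print(f"beta = {beta:.3f} ({label}), alpha = {alpha:.5f}")
```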

Fit CAPM with statsmodels

Important

sm.add_constant(X) is required: without it, sm.OLS fits a model with no intercept, which forces α = 0. This is a common bug for analysts moving from R or Stata, where the intercept is included by default.

Reading the Output

The model.summary() table:

                  coef     std err   t       P>|t|   [0.025   0.975]
const (α)         0.0043   0.001     4.32    0.000   0.0024   0.0063
Mkt_excess (β)    2.221    0.103     21.65   0.000   2.020    2.422

R² = 0.542

  • β = 2.22 → NVDA’s excess return moves ≈ 2.2 % on average per 1 % move in the market’s excess return
  • 95 % CI for β = [2.02, 2.42] doesn’t contain 1 → β is significantly greater than 1 (aggressive)
  • p-values near zero → both α and β are statistically different from zero
  • R² = 0.54 → the market explains 54 % of the variance of NVDA’s daily excess return

Variable Selection — AIC vs BIC

\[\text{AIC} = -2\ln L + 2k \qquad \text{BIC} = -2\ln L + k\ln n\]

Note

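Here L is the maximised likelihood, k the number of estimated parameters, and n the sample size.

AIC penalty +2k does not depend on n → keeps more variables, optimised for forecasting.

BIC penalty +k ln n grows with sample size and exceeds the AIC penalty once n ≥ 8 (ln n > 2) → keeps fewer variables, optimised for identifying the true model.

No criterion is both efficient (best for prediction) and consistent (selects the true model as n grows); this is a known impossibility result. Use AIC if you care about prediction; BIC if you care about which factors are real.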
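The two formulas can be checked with a short numerical sketch. It assumes Gaussian errors and counts k as the number of regression coefficients including the intercept (one common convention; counting the error variance as an extra parameter shifts every model by the same amount). The factors are simulated and hypothetical, and gaussian_ic is an illustrative helper, not a library function.

```python
import numpy as np

def gaussian_ic(y, X):
    """Return (AIC, BIC) for a least-squares fit of y on X, with k = columns of X."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ coef) ** 2)
    # Maximised Gaussian log-likelihood, using sigma² = SSR / n.
    log_l = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return -2 * log_l + 2 * k, -2 * log_l + k * np.log(n)

rng = np.random.default_rng(1)
n = 500
factors = rng.normal(size=(n, 3))                    # three hypothetical candidate factors
y = 0.5 + 1.5 * factors[:, 0] + rng.normal(size=n)   # only the first factor matters

const = np.ones((n, 1))
results = {}
for p in range(4):
    # Model with an intercept plus the first p factors.
    X = np.hstack([const, factors[:, :p]])
    results[p] = gaussian_ic(y, X)
    print(f"{p} factors: AIC = {results[p][0]:.1f}, BIC = {results[p][1]:.1f}")
```

The model with the smallest value is the one each criterion selects. Because ln 500 ≈ 6.2 is larger than 2, BIC charges more for every extra factor than AIC does. In statsmodels, fitted OLS results expose the same quantities as .aic and .bic.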

Chapter Summary

Concept              Tool
Correlation          df.corr()
Regression           sm.OLS(y, sm.add_constant(X)).fit()
Reading output       .summary()
CI for β             .conf_int()
Prediction           .predict() / .get_prediction()
Variable selection   AIC / BIC / Adj R² / Mallows’ Cp

The book’s Chapter 3 gives the full treatment of CAPM, the Fama-French 5-factor model, residual diagnostics, and the pharmacy multiple-regression case.

Next: Chapter 4 — Clustering.