Multiple Regression Foundations and Applications (CFA Level 2): Covers Purpose and Definition and Key Assumptions with key formulas and practical examples. Includes exam-style practice questions with explanations.
The essential idea of multiple regression is that we’re trying to explain or predict a dependent variable Y using more than one independent variable (X₁, X₂, …, Xₖ). In finance, Y could be a company’s return, a firm’s profitability ratio, or GDP growth. Each X represents a factor that you suspect might influence Y: perhaps a market index, interest rate, or commodity price. Formally, we might write it as:
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon $$
where:
- \(Y\) is the dependent variable we want to explain or predict;
- \(X_1, X_2, \dots, X_k\) are the independent (explanatory) variables;
- \(\beta_0\) is the intercept, and \(\beta_1, \dots, \beta_k\) are the slope coefficients on each predictor;
- \(\epsilon\) is the error term, capturing everything the model leaves out.
By incorporating multiple predictors, the model can detect how each factor influences Y while holding the other factors constant. Which is pretty neat—no more saying “well, maybe this was driven by GDP growth, or maybe it was just the stock market.” Multiple regression helps tease out which variables truly matter.
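To make this concrete, here is a minimal sketch that simulates data from such a model and recovers the coefficients by ordinary least squares. All variable names, coefficient values, and noise levels are illustrative assumptions, not real market data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250

# Three hypothetical standardized factors (e.g., market, rates, commodities)
X = rng.normal(size=(n, 3))
true_betas = np.array([0.5, 1.2, -0.4, 0.3])   # assumed intercept + three slopes
y = true_betas[0] + X @ true_betas[1:] + rng.normal(scale=0.5, size=n)

# OLS: minimize ||y - Xb|| with an intercept column prepended to the design matrix
X_design = np.column_stack([np.ones(n), X])
betas, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

resid = y - X_design @ betas
r_squared = 1 - resid.var() / y.var()
print(betas.round(3), round(r_squared, 3))
```

With 250 observations and modest noise, the estimated coefficients land close to the values used to generate the data, which is exactly the "holding other factors constant" disentangling described above.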
Lurking behind multiple regression is a set of assumptions that must be met (or at least approximately met) to trust our results. These assumptions matter because if they’re violated, our inferences—like p-values and confidence intervals—can be unreliable.
Linearity:
We assume there’s a linear relationship between the dependent variable and each of the independent variables. It’s “linear” in the coefficients: Y changes by \(\beta_1\) units with a one-unit change in \(X_1\), holding everything else constant. In practice, we might check linearity by scrutinizing scatterplots or residual plots.
Independent and Identically Distributed Errors:
We want the errors, \(\epsilon\), to be independent from one data point to the next. For time-series data, this can be problematic if autocorrelation creeps in (e.g., stock returns on Monday are related to stock returns on Tuesday). We also assume errors come from the same distribution, meaning the data generation process shouldn’t change over time.
Homoskedastic Errors:
Say that five times fast. Homoskedasticity means the error variance is constant across all levels of the independent variables. If the variance gets larger or smaller for different values of \(X_1\), we say the errors are heteroskedastic. Heteroskedastic errors can lead to inaccurate estimates of standard errors and p-values.
No Perfect Multicollinearity:
Multicollinearity occurs if some (or all) independent variables are strongly correlated with each other. This can make it impossible (or very difficult) to estimate the contribution of each variable uniquely. For instance, if you have two variables that both measure essentially the same thing—like “return on the TSX Composite Index” and “return on a broad Canadian market portfolio”—they may be redundant.
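To see multicollinearity in numbers, here is a small sketch that computes the Variance Inflation Factor (VIF) for each predictor from scratch on simulated data; a VIF well above 10 is a commonly cited red flag, though the data and threshold here are purely illustrative:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column).
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the others."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, _, _, _ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # near-duplicate of x1 (redundant factor)
x3 = rng.normal(size=500)                   # genuinely independent factor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

The two near-duplicate columns produce very large VIFs, while the independent column stays near 1, which is the numerical signature of the "two variables measuring essentially the same thing" problem.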
Errors Normally Distributed:
For hypothesis testing and constructing confidence intervals, we usually assume residuals are normally distributed. In large samples, the Central Limit Theorem helps (errors may approximate normality), but it’s still good to check with a histogram or Q-Q plot of the residuals.
Each slope coefficient \(\beta_j\) represents how much Y changes with a one-unit change in \(X_j\), holding other variables constant. In a finance context, for example, \(\beta_1\) might tell you how much a stock’s return moves for a one-percentage-point change in GDP growth, all else equal.
From a statistical standpoint, we check if the coefficient is significantly different from zero by looking at its p-value. A small p-value (like <0.05 or <0.01) suggests that the relationship is statistically significant. That means we’re fairly confident there’s a real relationship there. However, be mindful that a significant result doesn’t necessarily imply a large economic impact; practical significance can differ from statistical significance.
The coefficient of determination, \( R^2 \), often emerges in conversations about how “good” a model is. In plain English, it measures the proportion of the variation in Y that’s explained by the model. If \( R^2 = 0.80 \), for instance, we’re saying 80% of the variation in Y is accounted for by our variables.
But there’s a catch: adding more variables will never decrease \( R^2 \). In fact, you might see \( R^2 \) keep inching up if you throw in enough random variables. That’s where the adjusted \( R^2 \) swoops in, imposing a penalty for adding more predictors. If a new variable isn’t actually helping, the adjusted \( R^2 \) won’t rise (and might even fall).
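For reference, the adjusted \( R^2 \) applies that penalty explicitly:

$$ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$

where \(n\) is the number of observations and \(k\) is the number of independent variables. Each added predictor raises \(k\) and shrinks the denominator, so a new variable must improve \( R^2 \) by enough to offset the penalty, or the adjusted \( R^2 \) falls.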
In the CFA world, multiple regression is a go-to method in a variety of scenarios:
Equity Analysis:
Use multiple regression to see how individual factor exposures (like market beta, size, value, or momentum) affect stock returns.
Risk Management:
Model Value at Risk (VaR) by regressing returns on a set of macroeconomic indicators or market factors. You might be able to figure out how changes in interest rates, inflation, or volatile commodity prices shape the worst-case outcomes for your portfolio.
Economy-Wide Forecasts:
Think of predicting GDP growth or inflation using variables like consumer confidence, unemployment, interest rates, or oil prices. Multiple regression offers a structured way to harness multiple data sources at once.
Corporate Profitability:
Analysts might model a firm’s return on equity using operational variables such as sales growth, leverage ratios, or cost of capital, helping to see which operational levers truly drive profitability.
Wherever there are multiple intertwined factors, multiple regression has a seat at the table. In fact, you’ll see it again in many advanced topics—like in time-series analysis (Section 1.3) for forecasting stock prices or GDP.
To bring this closer to home, imagine you’re an analyst in Toronto, wanting to forecast the stock returns of a major Canadian bank—let’s pick the Royal Bank of Canada (RBC). RBC’s returns can be influenced by a bunch of factors. Let’s define:
- GDP Growth: the growth rate of Canadian GDP;
- Oil Price: the change in crude oil prices (a key driver in the Canadian economy);
- TSX Return: the return on the S&P/TSX Composite Index.
The multiple regression model could look like:
$$ \text{Return}_{\text{RBC}} = \beta_0 + \beta_1(\text{GDP Growth}) + \beta_2(\text{Oil Price}) + \beta_3(\text{TSX Return}) + \epsilon $$
By estimating this model on historical data, you might find that RBC returns are highly sensitive to overall market swings (\(\beta_3\)), somewhat sensitive to commodity prices (\(\beta_2\)), and moderately linked with GDP growth (\(\beta_1\)). If RBC’s stock is especially correlated with the broad market, that means your portfolio risk might be more “macro-driven” than a smaller, regionally focused Canadian bank stock might be.
Of course, you would want to check the assumptions—are the errors autocorrelated because we’re dealing with time-series data? Are RBC’s returns heavily correlated with the TSX index, giving us multicollinearity issues? Tools like the Durbin–Watson test (for autocorrelation) or Variance Inflation Factor (for multicollinearity) can come to the rescue.
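As a rough illustration of the autocorrelation check, the Durbin–Watson statistic can be computed directly from residuals. The two error series below are simulated (the AR coefficient of 0.8 is an arbitrary assumption), just to show how the statistic separates independent from serially correlated errors:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; values well below 2 suggest positive serial correlation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
white = rng.normal(size=1000)        # independent errors

ar1 = np.empty(1000)                 # positively autocorrelated (AR(1)) errors
ar1[0] = white[0]
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(round(dw_white, 2), round(dw_ar1, 2))
```

The independent series scores close to 2, while the autocorrelated series scores far below it—the pattern you would worry about when regressing time-series returns.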
Data Quality:
You can’t get results you trust if your data looks like Swiss cheese or is measured inconsistently. Make sure your sources—like Statistics Canada, the US Federal Reserve (FRED), or other vendors—are robust and consistent.
Check for Stationarity and Seasonality:
Time-series data often has trends or seasonal cycles, as you might expect in macro variables like GDP or inflation. If your data isn’t stationary, you might consider differencing or other transformations.
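A quick sketch of why differencing helps, using a simulated random walk (a classic nonstationary series); all numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# A random walk: each level is the cumulative sum of independent shocks,
# so the level wanders (nonstationary) even though the shocks are well-behaved
steps = rng.normal(size=1000)
level = np.cumsum(steps)

# First-differencing recovers the stationary shock series
diffed = np.diff(level)
print(round(level.std(), 2), round(diffed.std(), 2))
```

The level series has a large, growing dispersion, while its first difference has roughly unit standard deviation—differencing has stripped out the trend-like wandering.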
Diagnose Violations:
If you find heteroskedasticity (via White’s test or the Breusch–Pagan test), consider using robust standard errors. If you see autocorrelation, Newey–West standard errors or the inclusion of lagged variables might help.
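As one concrete example of robust standard errors, White’s heteroskedasticity-robust (HC0) covariance can be computed directly with the sandwich formula. This is a from-scratch sketch on simulated heteroskedastic data, not a production implementation:

```python
import numpy as np

def ols_with_hc0(X, y):
    """OLS coefficients plus White's HC0 robust standard errors.
    X must already include an intercept column."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = X.T @ (X * e[:, None] ** 2)   # X' diag(e^2) X
    cov = XtX_inv @ meat @ XtX_inv       # sandwich estimator
    return b, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
# Heteroskedastic errors: noise grows with |x|, violating constant variance
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
b, se = ols_with_hc0(X, y)
print(b.round(3), se.round(3))
```

The coefficient estimates are the usual OLS ones; only the standard errors change, which is why robust errors fix inference (p-values, confidence intervals) rather than the fitted slopes themselves.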
Interpretation Over Blind Projection:
It’s easy to get a big table of coefficients and R-squared values without pausing to ask, “Does this make economic sense?” Always weigh your regression results against financial theory or reasoned understanding of how markets operate.
Software Tools:
Excel, R, Python, EViews—pick your weapon of choice. Typically, you’ll run a regression command, check the output (coefficients, standard errors, p-values, R-squared), and run diagnostic tests.
I remember the first time I tried to regress 10 variables against monthly stock returns. My R-squared shot up, and I was so excited. But then I realized half my variables were basically duplicates—like the TSX index and a TSX financial index. My model was riddled with multicollinearity, and the t-stats were all over the place. Perhaps I was so enamored with “bigger is better” that I fell into that classic trap. Moral of the story? More variables are not always better—quality trumps quantity.
Below is a simple flowchart that outlines the typical steps you’d go through when applying multiple regression in a financial context:
```mermaid
flowchart LR
    A["Collect Data <br/> from Market"] --> B["Specify Model <br/> & Variables"];
    B --> C["Estimate Regression <br/> with Software"];
    C --> D["Check Assumptions <br/> & Diagnostics"];
    D --> E["Interpret Results <br/> & Make Decisions"];
```
Dependent Variable:
The variable we’re trying to explain (e.g., RBC’s returns).
Independent Variables (Predictors):
The factors suspected to influence the dependent variable (e.g., GDP growth, interest rates).
Multicollinearity:
A high degree of correlation among independent variables, making it tough to interpret individual coefficients.
Homoskedasticity:
Constant variance of error terms across observations.
Heteroskedasticity:
When the error variance changes across data points.
Serial Correlation (Autocorrelation):
Errors are correlated over time, common in time-series data.
Residuals (Errors):
The difference between actual and predicted values of the dependent variable.
P-Value:
The probability of observing the test statistic under the null hypothesis; helps determine statistical significance.
Stay curious, second-guess your results, and always read the question carefully. In my opinion, it’s totally fine to get that tingle of excitement when your regression model yields a crisp interpretation—just remember that regression can mislead if you don’t keep an eye on the fundamentals.
Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.