MATH60230 - Lecture 9
Machine Learning for Empirical Finance
Vincent Grégoire
HEC Montréal
Saad Ali Khan
HEC Montréal
Plan for Today
Introduction to machine learning
ML vs. econometrics: different goals, different vocabulary
Regularized linear models: Lasso, Ridge, Elastic Net
Overfitting and model validation
Other models and preprocessing
Clustering
What is Machine Learning?
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
ML = algorithms that learn patterns from data to make predictions or decisions
In finance: forecasting returns, classifying firms, extracting patterns from complex data
Machine Learning Approaches
Supervised learning: The computer is presented with example inputs and their desired outputs (think OLS).
Example: predict stock returns from firm characteristics
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
Example: cluster stocks into groups based on return patterns
Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal, receiving feedback analogous to rewards, which it tries to maximize.
Example: trade execution algorithms
Prediction vs. Inference
Econometrics
Estimate causal effects, test hypotheses
Inference on \beta: is \beta significant? What is its magnitude?
Machine Learning
Minimize prediction error, maximize out-of-sample performance
“We don’t care about \beta; we care about \hat{y}”
ML is good for forecasting or extracting complex (non-linear) patterns from data. ML is not good for statistical inference.
Vocabulary: ML vs. Econometrics
Econometrics | Machine Learning
Dependent variable (y) | Target / label
Independent variables (X) | Features
Estimation | Training
Coefficients (\beta) | Weights / parameters
Sample | Dataset
Out-of-sample | Test set
Model specification | Model selection / architecture
Regression | Can mean classification too!
“Regression” in ML
In econometrics: regression = modeling a relationship (usually linear)
In ML: regression = predicting a continuous value (vs. classification = predicting a category)
Logistic regression is a classification algorithm in ML parlance!
Types of ML Tasks
Common ML tasks can be divided into four groups:
Classification: Identify which category an observation belongs to.
e.g., will a firm default?
Regression: Predict a continuous value.
e.g., what will next month’s return be?
Clustering: Automatic grouping of observations.
e.g., group stocks by behavior
Dimensionality reduction: Reduce the number of explanatory variables.
e.g., summarize 100 firm characteristics into a few factors
Supervised vs. Unsupervised
Supervised: we have labels (y); learn a mapping X \rightarrow y
Classification and regression
Unsupervised: no labels; find structure in X
Clustering and dimensionality reduction
Semi-supervised: small amount of labeled data + large amount of unlabeled data
Less common, but useful when labeling is expensive
scikit-learn
scikit-learn is the standard Python library for machine learning (not deep learning).
Consistent API across all models
Key modules:
sklearn.linear_model — linear and regularized models
sklearn.ensemble — random forests, gradient boosting
sklearn.svm — support vector machines
sklearn.preprocessing — feature scaling, encoding
sklearn.model_selection — cross-validation, grid search
sklearn.metrics — evaluation metrics
The Estimator API
All scikit-learn models follow the same pattern:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # 1. Create model with hyperparameters
model.fit(X_train, y_train)  # 2. Train the model
y_pred = model.predict(X_test)  # 3. Make predictions
score = model.score(X_test, y_test)  # 4. Evaluate performance
This consistent interface makes it easy to swap models in and out.
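As a minimal end-to-end sketch of the four steps, the example below runs the same pattern on synthetic data. The data-generating process, the coefficient values, and the 80/20 split are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y depends linearly on 5 features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
beta = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
y = X @ beta + rng.normal(scale=0.5, size=500)

# Hold out 20% of the observations for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = Ridge(alpha=1.0)             # 1. Create model with hyperparameters
model.fit(X_train, y_train)          # 2. Train the model
y_pred = model.predict(X_test)       # 3. Make predictions
score = model.score(X_test, y_test)  # 4. Evaluate (R^2 for regressors)
print(f"Out-of-sample R^2: {score:.3f}")
```

Replacing Ridge with another regressor such as Lasso leaves the remaining lines unchanged.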
scikit-learn Ecosystem
Deep learning libraries
What is Overfitting?
Model learns noise in the training data instead of the true pattern
Performs well on training data but poorly on new data
Analogy: memorizing exam answers vs. understanding the material
Overfitting with OLS
Estimation of multiple different models in the same statistical analysis on the same data and picking the one that optimizes some objective function:
R^2
Log-likelihood function
Information criteria
Profits generated (backtest)
Models that suffer from overfitting tend to perform badly out of sample.
Overfitting - A Simple Model
Let X_{t},Y_{t} be our variables with Y_{t}=X_{t}\beta+\varepsilon_{t}
E\left(\varepsilon_{t}X_{t}\right)=0 and \varepsilon_{t}\sim N(0,\sigma^{2}) i.i.d.
\left\{X_{1,t},Y_{1,t}\right\} are the data for our main sample
\left\{X_{2,t},Y_{2,t}\right\} are the data for a second sample
Problem
We want to use X to predict Y.
Suppose we want to minimize E\left[ \left( Y_{t}-X_{t}\theta\right)^{2}\right]
Ideally, we should use \theta=\beta. But \beta is unknown.
If we only had \left\{X_{1,t},Y_{1,t}\right\}, we could minimize the sample analog \sum_{t=1}^{T}\left( Y_{1,t}-X_{1,t}\theta\right)^{2} \Rightarrow\widehat{\theta}=\widehat{\theta}_{OLS}
Overfitting with OLS
Fact
Q_{1}=\sum_{t=1}^{T}\left( \left( Y_{1,t}-X_{1,t}\beta\right)^{2}-\left( Y_{1,t}-X_{1,t}\widehat{\theta}\right)^{2}\right)
E\left( Q_{1}\right) =k\cdot\sigma^{2}, where k is the dimension of X
In sample, \widehat{\theta} fits better than the true \beta, so OLS tends to underestimate prediction error (i.e., it “overfits”)
Fact
E\left[ Q_{2}\right] =E\left[ \sum_{t=1}^{T}\left( \left( Y_{2,t}-X_{2,t}\beta\right)^{2}-\left( Y_{2,t}-X_{2,t}\widehat{\theta}\right)^{2}\right) \right] =-k\cdot\sigma^{2}
The over-fitting in-sample hurts our performance out-of-sample.
The more we over-fit, the more it hurts our forecast.
To avoid overfitting, prefer models whose in-sample and out-of-sample performance are similar, and use techniques such as cross-validation.
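The two facts can be checked with a small Monte Carlo experiment. The sketch below makes simplifying assumptions that are not in the slides: both samples share the same fixed design matrix X and differ only in their error draws, and the values of T, k, and \sigma are arbitrary. Under these assumptions the average of Q_1 should be close to k\sigma^2 and the average of Q_2 close to -k\sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, sigma = 100, 5, 1.0   # sample size, dimension of X, error std (assumed)
n_sims = 2000
beta = np.ones(k)
X = rng.normal(size=(T, k))  # fixed design shared by both samples (assumption)

q1, q2 = [], []
for _ in range(n_sims):
    # Main sample and second sample: same X, independent errors
    y1 = X @ beta + rng.normal(scale=sigma, size=T)
    y2 = X @ beta + rng.normal(scale=sigma, size=T)
    # OLS estimate from the main sample only
    theta_hat, *_ = np.linalg.lstsq(X, y1, rcond=None)
    # Squared-error gap between the true beta and the OLS estimate
    q1.append(np.sum((y1 - X @ beta) ** 2) - np.sum((y1 - X @ theta_hat) ** 2))
    q2.append(np.sum((y2 - X @ beta) ** 2) - np.sum((y2 - X @ theta_hat) ** 2))

mean_q1, mean_q2 = np.mean(q1), np.mean(q2)
print(f"mean Q1 = {mean_q1:.2f}  (theory:  {k * sigma**2:.2f})")
print(f"mean Q2 = {mean_q2:.2f}  (theory: {-k * sigma**2:.2f})")
```

Q_1 is nonnegative in every draw because OLS minimizes the in-sample sum of squares, while Q_2 is negative on average because the estimation error in \widehat{\theta} only adds noise out of sample.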
From OLS to Regularization
OLS minimizes \sum(y_i - X_i\beta)^2
Problem: with many features, OLS overfits; it fits noise in the training data
Solution: add a penalty term to constrain \beta
\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \text{penalty}(\beta) \right)
Ridge Regression (L2 Penalty)
\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 \right)
Shrinks coefficients toward zero but never exactly to zero
\lambda_2 controls the strength of regularization:
When \lambda_2 = 0 \rightarrow OLS
When \lambda_2 \rightarrow \infty \rightarrow all coefficients \rightarrow 0
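For the Ridge objective above, setting the gradient to zero gives the closed form \hat{\beta} = (X'X + \lambda_2 I)^{-1} X'y. The sketch below compares this formula with scikit-learn's Ridge on synthetic data. The data, the value of \lambda_2, and the choice to omit the intercept (so that the two problems match exactly) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)
lam = 10.0  # lambda_2 (assumed value)

# Closed-form Ridge solution: (X'X + lambda I)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Same problem in scikit-learn (no intercept, so the objectives coincide)
ridge = Ridge(alpha=lam, fit_intercept=False, solver="cholesky").fit(X, y)

# OLS for comparison (lambda = 0)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form:", np.round(beta_closed, 4))
print("sklearn    :", np.round(ridge.coef_, 4))
print("norms (ridge, OLS):", np.linalg.norm(beta_closed), np.linalg.norm(beta_ols))
```

The Ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage described above.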
Lasso Regression (L1 Penalty)
\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 \right)
Can shrink coefficients exactly to zero \rightarrow automatic feature selection
Useful when you suspect many features are irrelevant
Produces sparse solutions
Elastic Net
Combines L1 and L2 penalties:
\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right)
Best of both worlds: feature selection (L1) + grouped shrinkage (L2)
When features are correlated, Lasso tends to pick one and ignore the others; Elastic Net keeps groups together
Regularization in scikit-learn
from sklearn.linear_model import Ridge, Lasso, ElasticNet
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
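scikit-learn uses alpha instead of \lambda. The scaling can differ from the formulas above: Ridge minimizes \|y - X\beta\|^2 + \alpha\|\beta\|^2, while Lasso minimizes \frac{1}{2n}\|y - X\beta\|^2 + \alpha\|\beta\|_1, so alpha and \lambda are not always numerically equal.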
l1_ratio in ElasticNet: 0 = pure Ridge, 1 = pure Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
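To illustrate the difference between the L2 and L1 penalties, the sketch below fits Ridge and Lasso on synthetic data in which only 3 of 20 features matter. The data-generating process and the alpha values are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))

# Only the first 3 features are relevant (assumption)
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(f"Ridge: {n_zero_ridge} of {p} coefficients exactly zero")
print(f"Lasso: {n_zero_lasso} of {p} coefficients exactly zero")
```

Ridge shrinks every coefficient but leaves all of them nonzero, whereas Lasso sets most of the irrelevant ones exactly to zero.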
Bias-Variance Tradeoff
Underfitting
High bias, low variance
Model is too simple
Misses the true pattern
e.g., linear model for non-linear data
Overfitting
Low bias, high variance
Model is too complex
Fits noise in the data
e.g., very deep decision tree
Goal: find the sweet spot that minimizes total error.
Regularization Controls Overfitting
\lambda is the dial between underfitting and overfitting:
Small \lambda \rightarrow complex model \rightarrow potential overfitting
Large \lambda \rightarrow simple model \rightarrow potential underfitting
This is why choosing \lambda (hyperparameter tuning) matters!
Overfitting in Practice
More features than observations \rightarrow almost guaranteed overfitting
Common in finance: many potential predictors, limited history
Key defenses:
Regularization (Ridge, Lasso, Elastic Net)
Cross-validation
Feature selection and dimensionality reduction
Hyperparameters
Parameters: learned from data during training
e.g., \beta in regression, node splits in a tree
Hyperparameters: set before training
e.g., \lambda in Ridge, depth of a decision tree, number of trees in a forest
Tuning hyperparameters: finding the best values
Cannot learn them from the training data directly (would overfit!)
Train / Validation / Test Sets
Split the data into three groups:
Training set: fit the model
Validation set: tune hyperparameters, compare models
Test set: final evaluation (touch only once!)
Typical splits: 60/20/20 or 70/15/15
Why three sets? If you use the test set to tune hyperparameters, you are effectively “training on the test data.”
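One way to obtain a 60/20/20 split with scikit-learn is to call train_test_split twice, as sketched below. The synthetic arrays are placeholders, and random splitting of this kind assumes cross-sectional (i.i.d.) data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 observations, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Step 1: set aside 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Step 2: split the remaining 80% into 60% training and 20% validation
# (0.25 of the remaining 80% equals 20% of the full sample)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```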
Cross-Validation
K-fold cross-validation: split training data into K folds
Train on K-1 folds
Validate on the remaining fold
Rotate which fold is the validation set
Average performance across all K folds
Advantages: uses all data for both training and validation
Common choices: K = 5 or K = 10
Cross-Validation Diagram (5-fold)
Fold 1 | Valid. | Train | Train | Train | Train
Fold 2 | Train | Valid. | Train | Train | Train
Fold 3 | Train | Train | Valid. | Train | Train
Fold 4 | Train | Train | Train | Valid. | Train
Fold 5 | Train | Train | Train | Train | Valid.
Each row = one iteration. Average the scores across all 5 iterations.
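The five-fold scheme can be sketched with KFold and cross_val_score. The synthetic data, the Ridge model, and the shuffling of observations are assumptions for illustration; shuffling is only appropriate for cross-sectional data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.normal(size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration holds out one fold for validation and trains on the other four
for i, (train_idx, val_idx) in enumerate(cv.split(X), start=1):
    print(f"Fold {i}: train on {len(train_idx)} obs, validate on {len(val_idx)} obs")

# One R^2 score per fold, then the average across folds
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
mean_score = scores.mean()
print("R^2 per fold:", np.round(scores, 3), "mean:", round(mean_score, 3))
```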
Hyperparameter Tuning
Grid search: try all combinations over a predefined grid
GridSearchCV in scikit-learn combines grid search + cross-validation:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)
best_alpha = search.best_params_["alpha"]
Regression Metrics
MSE (Mean Squared Error): \frac{1}{n}\sum(y_i - \hat{y}_i)^2
RMSE (Root MSE): \sqrt{\text{MSE}} — same units as y
MAE (Mean Absolute Error): \frac{1}{n}\sum|y_i - \hat{y}_i|
R^2 (coefficient of determination): 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}
Note: R^2 can be negative out-of-sample! This means the model does worse than simply predicting the mean.
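The formulas above can be checked by hand on a tiny example and compared with sklearn.metrics. The four observations and the two sets of predictions are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])   # reasonable predictions
y_bad = np.array([4.0, 3.0, 2.0, 1.0])    # predictions with the wrong sign of slope

# Metrics computed directly from the formulas
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r2:.3f}")

# The same quantities from scikit-learn
print(mean_squared_error(y_true, y_pred),
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))

# R^2 below zero: worse than always predicting the mean of y_true
r2_bad = r2_score(y_true, y_bad)
print(f"R^2 for the bad predictions: {r2_bad:.1f}")
```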
Classification Metrics
Accuracy: fraction of correct predictions
Precision: of all predicted positives, how many are truly positive?
Recall: of all actual positives, how many did we find?
F1 score: harmonic mean of precision and recall: F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
Why accuracy can be misleading: imbalanced classes
If 99% of firms don’t default, a model that always predicts “no default” achieves 99% accuracy!
In finance: consider the cost of false positives vs. false negatives
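The imbalanced-class problem is easy to reproduce. In the sketch below, 1% of a hypothetical sample of 1000 firms default and the "model" always predicts no default; the numbers are made up to mirror the 99% example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical sample: 1000 firms, 10 of which default (label 1)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# Naive classifier: always predict "no default" (label 0)
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}, F1={f1:.2f}")
```

Accuracy looks excellent, yet recall and F1 are zero because the classifier never identifies a single default.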
sklearn.metrics
from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    accuracy_score,
    classification_report,
)
# Regression (y_test and y_pred are continuous values)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Classification (y_test and y_pred are class labels)
acc = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
Decision Trees
Each node = a binary decision on a feature (e.g., x_1 > 0.04?)
Easy to interpret — you can visualize the tree
Prone to overfitting (can memorize training data)
Key hyperparameters: max depth, min samples per leaf
scikit-learn: DecisionTreeRegressor, DecisionTreeClassifier
Random Forests
Ensemble of decision trees
Each tree trained on a random subset of data and features
Reduces overfitting through averaging — this is diversification!
If errors are not perfectly correlated and models are not biased, averaging helps
scikit-learn: RandomForestRegressor, RandomForestClassifier
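The overfitting behavior of a single unconstrained tree, and the effect of averaging many trees, can be seen by comparing training and test R^2. The non-linear data-generating process and the hyperparameter values below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A tree with no depth limit can memorize the training data
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# A forest averages many trees grown on bootstrap samples
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

for name, m in [("tree", tree), ("forest", forest)]:
    print(f"{name}: train R^2 = {m.score(X_train, y_train):.3f}, "
          f"test R^2 = {m.score(X_test, y_test):.3f}")
```

The single tree fits the training data almost perfectly but loses accuracy on the test set; the forest typically narrows that gap.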
Support Vector Machines
Finds the optimal boundary (hyperplane) that separates classes
Can handle non-linear boundaries via the kernel trick
Works for both classification (SVC) and regression (SVR)
Can be slow on large datasets
Model Comparison Summary
Model | Interpretability | Flexibility | Speed | Overfitting risk
Linear models | High | Low | Fast | Low (with reg.)
Decision trees | High | High | Fast | High
Random forests | Medium | High | Medium | Low
SVM | Low | High | Slow | Medium
No single model is best for all problems
Start simple (linear models), add complexity only if needed
Why Preprocessing Matters
Many ML algorithms are sensitive to feature scales (SVM, Ridge, Lasso)
Missing values and categorical variables need handling
Remember, we only care about predicting y, not about inference, so it is common to normalize X (and sometimes y) to improve numerical performance.
Whatever transformation is done should be fit on training data and the same transformation applied to test data.
Feature Scaling
StandardScaler: zero mean, unit variance (z-score)
x' = \frac{x - \mu}{\sigma}
Use for most algorithms (Ridge, Lasso, SVM)
MinMaxScaler: scale to [0, 1]
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
Use when you need bounded values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
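The sketch below checks what fit_transform and transform do: the mean and standard deviation are estimated on the training set only, and the same values are then applied to the test set. The two synthetic features with very different scales are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (assumption)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(200, 2))
X_test = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(50, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mu and sigma, then scale
X_test_scaled = scaler.transform(X_test)        # reuse training mu and sigma

# Manual z-score using training statistics (StandardScaler uses ddof=0)
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
manual_test = (X_test - mu) / sd

print("train mean after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
print("train std after scaling: ", np.round(X_train_scaled.std(axis=0), 6))
print("manual == sklearn on test:", np.allclose(manual_test, X_test_scaled))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected because its statistics were never used.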
scikit-learn Pipelines
Combine preprocessing + model in one object:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
Prevents data leakage: scaler fits only on training data
Works seamlessly with GridSearchCV
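A minimal sketch of a pipeline inside GridSearchCV follows. Hyperparameters of a pipeline step are addressed with the step name, two underscores, and the parameter name, so the Ridge penalty in a step named "model" becomes model__alpha. The synthetic data and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) * np.array([1, 10, 100, 1, 10, 100])
y = X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=300)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# "<step name>__<parameter>" targets a hyperparameter inside the pipeline
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = {"model__alpha": alphas}

# Within each CV split, the scaler is re-fit on the training folds only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
print("best alpha:", best_alpha, "mean CV R^2:", round(search.best_score_, 3))
```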
The Curse of Dimensionality
More features \rightarrow data becomes sparse in high dimensions
Models need exponentially more data to fill the space
Many features may be correlated or irrelevant
Principal Component Analysis (PCA)
Most common dimensionality reduction technique
Finds directions of maximum variance in the data
Reduces features while preserving as much information as possible
Finance application: reducing many correlated firm characteristics into a few principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
PCA will be covered in more detail in the following course.
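As a rough illustration, the sketch below builds 30 correlated features from 3 latent factors and shows how much variance the leading principal components capture. The factor structure, the noise level, and the decision to standardize before PCA are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))    # 3 latent factors (assumption)
loadings = rng.normal(size=(3, 30))    # how each feature loads on the factors
X = factors @ loadings + 0.1 * rng.normal(size=(500, 30))

# PCA is sensitive to scale, so standardize first
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_std)

# Cumulative share of total variance explained by the first components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("shape after PCA:", X_reduced.shape)
print("cumulative explained variance:", np.round(cum_var[:5], 3))
```

With this factor structure, the first three components account for most of the variance, which suggests n_components could be set much lower than 10.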
Unsupervised Learning: K-Means Clustering
Algorithm: assign points to K clusters, minimize within-cluster distance
Initialize K centroids
Assign each point to nearest centroid
Update centroids as cluster means
Repeat until convergence
Must choose K (hyperparameter)
Finance uses: grouping similar stocks/firms, market segmentation, regime detection
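The four steps of the algorithm can be sketched in NumPy as follows. This is a bare-bones illustration, not scikit-learn's implementation: the function name kmeans_lloyd, the random initialization from data points, and the handling of empty clusters are all choices made here for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Bare-bones K-means (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update centroids as cluster means (keep old one if cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Three synthetic groups of points in two dimensions (assumption)
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
labels, centroids = kmeans_lloyd(X, k=3)
print(np.round(centroids, 2))
```

Because the result depends on the initial centroids, practical implementations usually rerun the algorithm from several starting points and keep the best solution.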
K-Means in scikit-learn
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
Choosing K: elbow method. Plot inertia vs. K and look for the “elbow”
Limitations: assumes spherical clusters, sensitive to initialization
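The elbow method can be sketched by fitting K-means for several values of K and recording the inertia (within-cluster sum of squared distances). The data come from make_blobs with 3 centers, and the range of K and n_init=10 are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

inertias = {}
for k in range(1, 7):
    # n_init=10 reruns the algorithm from 10 initializations and keeps the best
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    print(f"K={k}: inertia={km.inertia_:.1f}")
```

Inertia keeps falling as K grows, so the value to look for is the K after which the decrease becomes small. Plotting inertias against K (for example with matplotlib) makes the elbow easier to see.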
Challenges of Time-Series Forecasting
Standard ML assumes i.i.d. data — time series violates this
Look-ahead bias: accidentally using future information
Cannot randomly split: must use rolling/expanding windows
Important that the test set is always out-of-sample
Autocorrelation, regime changes, non-stationarity
These methods will be covered in the next course
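As a preview, scikit-learn's TimeSeriesSplit produces expanding-window splits in which the training observations always precede the test observations. The 24-period series and the number of splits below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 time-ordered observations (assumption), indexed t = 0, ..., 23
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits, start=1):
    # The training window expands; the test window always lies in the future
    print(f"Split {i}: train t=0..{train_idx[-1]}, "
          f"test t={test_idx[0]}..{test_idx[-1]}")
```

Unlike shuffled K-fold, no split ever trains on observations that come after the test window, which guards against look-ahead bias.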
More Book Recommendations
All available via the library.