MATH60230 - Lecture 9

Machine Learning for Empirical Finance

Vincent Grégoire

vincent.gregoire@hec.ca

HEC Montréal

Saad Ali Khan

saad-ali.khan@hec.ca

HEC Montréal

Plan for Today

Introduction to machine learning
ML vs. econometrics: different goals, different vocabulary
Regularized linear models: Lasso, Ridge, Elastic Net
Overfitting and model validation
Other models and preprocessing
Clustering

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

ML = algorithms that learn patterns from data to make predictions or decisions
In finance: forecasting returns, classifying firms, extracting patterns from complex data

Machine Learning Approaches

Supervised learning: The computer is presented with example inputs and their desired outputs (think OLS).
- Example: predict stock returns from firm characteristics

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
- Example: cluster stocks into groups based on return patterns

Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal, receiving feedback analogous to rewards, which it tries to maximize.
- Example: trade execution algorithms

Prediction vs. Inference

Econometrics

Estimate causal effects, test hypotheses
Inference on \beta: is \beta significant? What is its magnitude?

Machine Learning

Minimize prediction error, maximize out-of-sample performance
“We don’t care about \beta — we care about \hat{y}”

ML is good for forecasting or extracting complex (non-linear) patterns from data. ML is not good for statistical inference.

Vocabulary: ML vs. Econometrics

Econometrics	Machine Learning
Dependent variable (y)	Target / label
Independent variables (X)	Features
Estimation	Training
Coefficients (\beta)	Weights / parameters
Sample	Dataset
Out-of-sample	Test set
Model specification	Model selection / architecture
Regression	Can mean classification too!

“Regression” in ML

In econometrics: regression = modeling a relationship (usually linear)

In ML: regression = predicting a continuous value (vs. classification = predicting a category)

Logistic regression is a classification algorithm in ML parlance!
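To make the terminology concrete, here is a minimal sketch showing that scikit-learn's LogisticRegression returns class labels from predict, with the underlying probabilities available separately. The synthetic data from make_classification is an illustrative assumption, not part of the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

labels = clf.predict(X[:5])        # predicted categories (0 or 1)
probs = clf.predict_proba(X[:5])   # class probabilities, one column per class
print(labels)
print(probs.round(3))
```

The continuous output (the probability) is what the "regression" in the name refers to; the task it is used for is classification.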

Types of ML Tasks

Common ML tasks can be divided into four groups:

Classification: Identify which category an observation belongs to.
- e.g., will a firm default?

Regression: Predict a continuous value.
- e.g., what will next month’s return be?

Clustering: Automatic grouping of observations.
- e.g., group stocks by behavior

Dimensionality reduction: Reduce the number of explanatory variables.
- e.g., summarize 100 firm characteristics into a few factors

Supervised vs. Unsupervised

Supervised: we have labels (y); learn a mapping X \rightarrow y
- Classification and regression

Unsupervised: no labels; find structure in X
- Clustering and dimensionality reduction

Semi-supervised: small amount of labeled data + large amount of unlabeled data
- Less common, but useful when labeling is expensive

scikit-learn

scikit-learn is the standard Python library for machine learning (not deep learning).

Consistent API across all models
Key modules:
- sklearn.linear_model — linear and regularized models
- sklearn.ensemble — random forests, gradient boosting
- sklearn.svm — support vector machines
- sklearn.preprocessing — feature scaling, encoding
- sklearn.model_selection — cross-validation, grid search
- sklearn.metrics — evaluation metrics

The Estimator API

All scikit-learn models follow the same pattern:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)       # 1. Create model with hyperparameters
model.fit(X_train, y_train)    # 2. Train the model
y_pred = model.predict(X_test) # 3. Make predictions
score = model.score(X_test, y_test) # 4. Evaluate performance

This consistent interface makes it easy to swap models in and out.
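As a rough illustration of swapping models, the sketch below runs the same fit / score calls over three different estimators. The synthetic data, the coefficient vector, and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative assumption): linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # R^2 for regressors
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

Only the constructor line differs between models; the training and evaluation code is unchanged.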

scikit-learn Ecosystem

skfolio

Deep learning libraries

PyTorch (Meta, formerly Facebook)
Keras (Google; built on TensorFlow, with JAX and PyTorch backends also supported since Keras 3)

What is Overfitting?

Model learns noise in the training data instead of the true pattern

Performs well on training data but poorly on new data

Analogy: memorizing exam answers vs. understanding the material

Overfitting with OLS

Overfitting arises when we estimate many different models on the same data and pick the one that optimizes some objective function:

R^2
Log-likelihood function
Information criteria
Profits generated (backtest)

Models that suffer from overfitting tend to perform badly out of sample.

Overfitting - A Simple Model

Let X_{t},Y_{t} be our variables with Y_{t}=X_{t}\beta+\varepsilon_{t}

E\left(\varepsilon_{t}X_{t}\right)=0 and \varepsilon_{t}\sim N(0,\sigma^{2}) i.i.d.
\left\{X_{1,t},Y_{1,t}\right\} are the data for our main sample
\left\{X_{2,t},Y_{2,t}\right\} are the data for a second sample

Problem

We want to use X to predict Y.

Suppose we want to minimize E\left[ \left( Y_{t}-X_{t}\theta\right)^{2}\right]

Ideally, we should use \theta=\beta. But \beta is unknown.

If we only had \left\{ X_{1,t},Y_{1,t}\right\} , we could minimize the sample analogue \frac{1}{T}\sum_{t=1}^{T}\left( Y_{1,t}-X_{1,t}\theta\right)^{2} \Rightarrow\widehat{\theta}=\widehat{\theta}_{OLS}

Overfitting with OLS

Fact

Q_{1}=\sum_{t=1}^{T}\left( \left( Y_{1,t}-X_{1,t}\beta\right)^{2}-\left( Y_{1,t}-X_{1,t}\widehat{\theta}\right)^{2}\right)

E\left( Q_{1}\right) =k\cdot\sigma^{2} where k= dimension of X

OLS tends to underestimate prediction error (i.e. “overfit”)

Fact

E\left[ Q_{2}\right] =E\left[ \sum_{t=1}^{T}\left( \left( Y_{2,t}-X_{2,t}\beta\right)^{2}-\left( Y_{2,t}-X_{2,t}\widehat{\theta}\right)^{2}\right) \right] =-k\cdot\sigma^{2}

The over-fitting in-sample hurts our performance out-of-sample.
The more we over-fit, the more it hurts our forecast.
To guard against overfitting, prefer models whose in-sample and out-of-sample performance are similar, and use techniques such as cross-validation to check this.
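The two facts can be checked by simulation. The sketch below is a minimal Monte Carlo under stated assumptions: both samples share the same design matrix X (so the out-of-sample result holds exactly rather than approximately), and the values of T, k, sigma, and beta are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, sigma = 200, 5, 1.0
beta = np.ones(k)
n_sims = 2000

# Assumption: sample 1 and sample 2 share the same regressors X,
# and differ only in their noise draws.
X = rng.normal(size=(T, k))

q1 = np.empty(n_sims)
q2 = np.empty(n_sims)
for s in range(n_sims):
    y1 = X @ beta + sigma * rng.normal(size=T)   # main sample
    y2 = X @ beta + sigma * rng.normal(size=T)   # second sample
    theta_hat, *_ = np.linalg.lstsq(X, y1, rcond=None)  # OLS on sample 1
    # Squared error using the true beta minus squared error using theta_hat
    q1[s] = np.sum((y1 - X @ beta) ** 2 - (y1 - X @ theta_hat) ** 2)
    q2[s] = np.sum((y2 - X @ beta) ** 2 - (y2 - X @ theta_hat) ** 2)

mean_q1 = q1.mean()
mean_q2 = q2.mean()
print(f"mean Q1 = {mean_q1:.2f}  (theory:  k*sigma^2 = {k * sigma**2:.2f})")
print(f"mean Q2 = {mean_q2:.2f}  (theory: -k*sigma^2 = {-k * sigma**2:.2f})")
```

In-sample, the estimated model always looks at least as good as the truth (Q_1 is never negative); out-of-sample the sign flips on average.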

From OLS to Regularization

OLS minimizes \sum(y_i - X_i\beta)^2

Problem: with many features, OLS overfits — it fits noise in the training data

Solution: add a penalty term to constrain \beta

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \text{penalty}(\beta) \right)

Ridge Regression (L2 Penalty)

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 \right)

Shrinks coefficients toward zero but never exactly to zero

\lambda_2 controls the strength of regularization:
- When \lambda_2 = 0 \rightarrow OLS
- When \lambda_2 \rightarrow \infty \rightarrow all coefficients \rightarrow 0
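For the objective above without an intercept, the Ridge solution has the closed form (X'X + \lambda_2 I)^{-1} X'y. The sketch below compares that formula with scikit-learn's Ridge on synthetic data and shows the coefficient norm shrinking as the penalty grows. The data and the penalty values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=100)

lam = 10.0
# Closed form for the no-intercept objective: (X'X + lambda I)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's alpha plays the role of lambda_2; intercept disabled to match
beta_sklearn = Ridge(alpha=lam, fit_intercept=False, solver="cholesky").fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))

# The coefficient norm shrinks toward zero as the penalty increases
penalties = [0.01, 1.0, 10.0, 100.0, 1000.0]
norms = [
    np.linalg.norm(Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_)
    for a in penalties
]
for a, nrm in zip(penalties, norms):
    print(f"alpha = {a:>7}: ||beta|| = {nrm:.4f}")
```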

Lasso Regression (L1 Penalty)

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

Can shrink coefficients exactly to zero \rightarrow automatic feature selection

Useful when you suspect many features are irrelevant

Produces sparse solutions
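A small sketch of the sparsity property, on synthetic data where only 3 of 20 features matter. The data-generating process and the penalty values are assumptions; the exact number of zeroed coefficients depends on them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]   # only the first 3 features are relevant
y = X @ true_beta + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients set exactly to zero:", n_zero_lasso)
print("Ridge coefficients set exactly to zero:", n_zero_ridge)
```

Lasso drops many of the irrelevant features outright, while Ridge keeps all of them with small but non-zero weights.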

Elastic Net

Combines L1 and L2 penalties:

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

Best of both worlds: feature selection (L1) + grouped shrinkage (L2)

When features are correlated, Lasso tends to pick one and ignore the others; Elastic Net keeps groups together

Regularization in scikit-learn

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
enet  = ElasticNet(alpha=1.0, l1_ratio=0.5)

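scikit-learn uses alpha instead of \lambda (for Lasso and ElasticNet it also divides the squared-error term by 2n, so alpha is not numerically identical to the \lambda in the formulas above)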
l1_ratio in ElasticNet: 0 = pure Ridge, 1 = pure Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
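The meaning of l1_ratio can be checked directly: with l1_ratio=1.0, ElasticNet should reproduce Lasso. The sketch below uses synthetic data, which is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=150)

lasso = Lasso(alpha=0.1).fit(X, y)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)    # pure L1 penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # half L1, half L2

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print("ElasticNet(l1_ratio=1) matches Lasso:", same_as_lasso)
print("Non-zero coefficients, l1_ratio=0.5:", int(np.sum(enet_mix.coef_ != 0)))
```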

Bias-Variance Tradeoff

Underfitting

High bias, low variance
Model is too simple
Misses the true pattern
e.g., linear model for non-linear data

Overfitting

Low bias, high variance
Model is too complex
Fits noise in the data
e.g., very deep decision tree

Goal: find the sweet spot that minimizes total error.

Regularization Controls Overfitting

\lambda is the dial between underfitting and overfitting:

Small \lambda \rightarrow complex model \rightarrow potential overfitting
Large \lambda \rightarrow simple model \rightarrow potential underfitting

This is why choosing \lambda (hyperparameter tuning) matters!
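The dial can be seen by sweeping the penalty and recording training and test error. This is a sketch on synthetic data with as many features as training observations; the dimensions and the grid of penalties are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 120, 60
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                       # only 5 features carry signal
y = X @ beta + rng.normal(size=n)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_mse, test_mse = [], []
for a in alphas:
    model = Ridge(alpha=a).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    print(f"alpha = {a:>7}: train MSE = {train_mse[-1]:.3f}, test MSE = {test_mse[-1]:.3f}")
```

Training error can only rise as the penalty increases; test error typically falls first and then rises again, and the minimum marks the region worth tuning around.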

Overfitting in Practice

More features than observations \rightarrow almost guaranteed overfitting

Common in finance: many potential predictors, limited history

Key defenses:
- Regularization (Ridge, Lasso, Elastic Net)
- Cross-validation
- Feature selection and dimensionality reduction

Hyperparameters

Parameters: learned from data during training
- e.g., \beta in regression, node splits in a tree

Hyperparameters: set before training
- e.g., \lambda in Ridge, depth of a decision tree, number of trees in a forest

Tuning hyperparameters: finding the best values
- Cannot learn them from the training data directly (would overfit!)

Train / Validation / Test Sets

Split the data into three groups:

Training set: fit the model
Validation set: tune hyperparameters, compare models
Test set: final evaluation (touch only once!)

Typical splits: 60/20/20 or 70/15/15

Why three sets? If you use the test set to tune hyperparameters, you are effectively “training on the test data.”
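One way to obtain a 60/20/20 split is to call train_test_split twice. This is a sketch on placeholder data; it splits at random, which assumes the observations are i.i.d.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # placeholder features
y = rng.normal(size=1000)        # placeholder target

# Step 1: hold out 20% of the data as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Step 2: split the remaining 80% into train and validation.
# 25% of the remaining 80% equals 20% of the full sample.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```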

Cross-Validation

K-fold cross-validation: split training data into K folds
1. Train on K-1 folds
2. Validate on the remaining fold
3. Rotate which fold is the validation set
4. Average performance across all K folds

Advantages: uses all data for both training and validation

Common choices: K = 5 or K = 10

Cross-Validation Diagram (5-fold)

Iteration	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5
Iteration 1	Valid.	Train	Train	Train	Train
Iteration 2	Train	Valid.	Train	Train	Train
Iteration 3	Train	Train	Valid.	Train	Train
Iteration 4	Train	Train	Train	Valid.	Train
Iteration 5	Train	Train	Train	Train	Valid.

Each row = one iteration. Average the scores across all 5 iterations.
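The rotation can be written out by hand with KFold and compared with the cross_val_score helper. The sketch below uses synthetic data and a Ridge model as assumptions; the score is R^2, the default for regressors.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.normal(size=200)

model = Ridge(alpha=1.0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on K-1 folds, validate on the held-out fold, rotate
manual_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# The same procedure in one call
auto_scores = cross_val_score(model, X, y, cv=kf)

print("manual mean R^2:", np.mean(manual_scores))
print("helper mean R^2:", auto_scores.mean())
```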

Hyperparameter Tuning

Grid search: try all combinations over a predefined grid

GridSearchCV in scikit-learn combines grid search + cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]

Regression Metrics

MSE (Mean Squared Error): \frac{1}{n}\sum(y_i - \hat{y}_i)^2

RMSE (Root MSE): \sqrt{\text{MSE}} — same units as y

MAE (Mean Absolute Error): \frac{1}{n}\sum|y_i - \hat{y}_i|

R^2 (coefficient of determination): 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Note: R^2 can be negative out-of-sample! This means the model does worse than simply predicting the mean.
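A tiny worked example, with made-up numbers, computes each metric from its formula and checks it against sklearn.metrics. The predictions are deliberately bad so that R^2 comes out negative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # deliberately bad: reversed order

# Metrics computed directly from the formulas
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)

# Same values from scikit-learn
print(mean_squared_error(y_true, y_pred),
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

Here the squared errors sum to 20 while the squared deviations from the mean sum to 5, so R^2 = 1 - 20/5 = -3: predicting the mean would have been better.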

Classification Metrics

Accuracy: fraction of correct predictions
Precision: of all predicted positives, how many are truly positive?
Recall: of all actual positives, how many did we find?
F1 score: harmonic mean of precision and recall: F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Why accuracy can be misleading: imbalanced classes
- If 99% of firms don’t default, a model that always predicts “no default” achieves 99% accuracy!

In finance: consider the cost of false positives vs. false negatives
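The imbalanced-class problem can be reproduced with made-up labels: 990 non-defaults, 10 defaults, and a "model" that always predicts no default. The zero_division=0 argument is passed so that undefined ratios are reported as 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = default (1% of firms), 0 = no default
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # always predict "no default"

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy = {acc:.2f}, precision = {prec:.2f}, recall = {rec:.2f}, F1 = {f1:.2f}")
```

Accuracy looks excellent, yet recall is zero: the classifier finds none of the defaults, which is the outcome that matters.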

sklearn.metrics

from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    accuracy_score,
    classification_report,
)

# Regression
mse = mean_squared_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

# Classification
acc = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

Decision Trees

Each node = a binary decision on a feature (e.g., x_1 > 0.04?)

Easy to interpret — you can visualize the tree

Prone to overfitting (can memorize training data)

Key hyperparameters: max depth, min samples per leaf

scikit-learn: DecisionTreeRegressor, DecisionTreeClassifier

Random Forests

Ensemble of decision trees

Each tree is trained on a bootstrap sample of the data, with a random subset of features considered at each split

Reduces overfitting through averaging — this is diversification!
- If errors are not perfectly correlated and models are not biased, averaging helps

scikit-learn: RandomForestRegressor, RandomForestClassifier
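The sketch below contrasts a single unconstrained decision tree with a random forest on a synthetic non-linear regression problem. The make_friedman1 dataset and the hyperparameter values are assumptions for illustration; exact scores will vary with the data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression data (illustrative assumption)
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A fully grown tree (no max_depth) can memorize the training data
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Averaging many randomized trees reduces the variance of the prediction
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
forest_train, forest_test = forest.score(X_train, y_train), forest.score(X_test, y_test)

print(f"tree:   train R^2 = {tree_train:.3f}, test R^2 = {tree_test:.3f}")
print(f"forest: train R^2 = {forest_train:.3f}, test R^2 = {forest_test:.3f}")
```

The gap between the tree's training and test scores is the overfitting; the forest narrows it.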

Support Vector Machines

Finds the optimal boundary (hyperplane) that separates classes

Can handle non-linear boundaries via the kernel trick

Works for both classification (SVC) and regression (SVR)

Can be slow on large datasets

Model Comparison Summary

Model	Interpretability	Non-linearity	Speed	Overfitting risk
Linear models	High	Low	Fast	Low (with reg.)
Decision trees	High	High	Fast	High
Random forests	Medium	High	Medium	Low
SVM	Low	High	Slow	Medium

No single model is best for all problems
Start simple (linear models), add complexity only if needed

Why Preprocessing Matters

Many ML algorithms are sensitive to feature scales (SVM, Ridge, Lasso)

Missing values and categorical variables need handling

Remember, we only care about predicting y, not about inference, so it is common to normalize X (and sometimes y) to improve numerical performance.

Any transformation should be fit on the training data only, and the same fitted transformation then applied to the test data.

Feature Scaling

StandardScaler: zero mean, unit variance (z-score)
- x' = \frac{x - \mu}{\sigma}
- Use for most algorithms (Ridge, Lasso, SVM)

MinMaxScaler: scale to [0, 1]
- x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
- Use when you need bounded values

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled  = scaler.transform(X_test)        # transform only!
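To see what "transform only" means, the sketch below checks that StandardScaler applies the training mean and standard deviation to the test data. The random data is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(40, 2))

scaler = StandardScaler().fit(X_train)   # statistics come from training data only

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)              # population std (ddof=0), as StandardScaler uses
manual_test = (X_test - mu) / sigma      # test data scaled with TRAINING statistics

matches = np.allclose(scaler.transform(X_test), manual_test)
print("scaler.transform(X_test) uses training mean/std:", matches)
print("scaled test means (not exactly 0):", scaler.transform(X_test).mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean or unit variance, and that is expected.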

scikit-learn Pipelines

Combine preprocessing + model in one object:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

Prevents data leakage: scaler fits only on training data
Works seamlessly with GridSearchCV
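A sketch of a pipeline inside GridSearchCV: the scaler is refit on the training folds of each split, and the Ridge penalty is addressed with the step-name prefix model__alpha. The synthetic data and the grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Features on very different scales (illustrative assumption)
X = rng.normal(size=(200, 6)) * np.array([1, 10, 100, 1, 10, 100])
y = X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=200)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
print("best alpha:", best_alpha)
print("best CV R^2:", round(search.best_score_, 3))
```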

The Curse of Dimensionality

More features \rightarrow data becomes sparse in high dimensions

Models need exponentially more data to fill the space

Many features may be correlated or irrelevant

Principal Component Analysis (PCA)

Most common dimensionality reduction technique

Finds directions of maximum variance in the data

Reduces features while preserving as much information as possible

Finance application: reducing many correlated firm characteristics into a few principal components

from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

PCA will be covered in more detail in the following course.
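As a self-contained illustration, the sketch below builds 30 correlated features from 3 hidden factors and checks how much variance 3 principal components recover. The factor structure, the noise level, and the choice to standardize first are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_obs, n_factors, n_features = 500, 3, 30

# 30 observed characteristics driven by 3 latent factors plus small noise
factors = rng.normal(size=(n_obs, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
X = factors @ loadings + 0.1 * rng.normal(size=(n_obs, n_features))

X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

explained = pca.explained_variance_ratio_.sum()
print("reduced shape:", X_reduced.shape)
print(f"variance explained by 3 components: {explained:.3f}")
```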

Unsupervised Learning: K-Means Clustering

Algorithm: assign points to K clusters, minimize within-cluster distance
1. Initialize K centroids
2. Assign each point to nearest centroid
3. Update centroids as cluster means
4. Repeat until convergence

Must choose K (hyperparameter)

Finance uses: grouping similar stocks/firms, market segmentation, regime detection

K-Means in scikit-learn

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

Choosing K: elbow method — plot inertia vs. K, look for the “elbow”

Limitations: assumes spherical clusters, sensitive to initialization
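The elbow method can be sketched as a loop over K that records the inertia. The make_blobs data with 3 true clusters is an assumption, and n_init=10 is set explicitly so that several random initializations are tried.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (illustrative assumption)
X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Inertia = within-cluster sum of squared distances to the centroid
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    print(f"K = {k}: inertia = {km.inertia_:.1f}")

# Final fit at the K suggested by the elbow
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_
```

Inertia drops sharply up to the true number of clusters and only slowly afterwards; the bend is the "elbow".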

Challenges of Time-Series Forecasting

Standard ML assumes i.i.d. data — time series violates this

Look-ahead bias: accidentally using future information

Cannot randomly split: must use rolling/expanding windows
- Important that the test set is always out-of-sample

Autocorrelation, regime changes, non-stationarity

These methods will be covered in the next course
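As a brief preview, scikit-learn's TimeSeriesSplit produces expanding-window splits in which the training indices always precede the test indices. The 24 observations and 4 splits below are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 time-ordered observations

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    # Training window grows over time; the test block is always in the future
    print(f"split {i}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```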

Book Recommendations

Hands-On Machine Learning with Scikit-Learn and PyTorch by Aurélien Géron — available via the library.

For ML in finance, see Gu, Kelly, and Xiu (2020).

More Book Recommendations

Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann

Artificial Intelligence in Finance by Yves Hilpisch

Reinforcement Learning for Finance by Yves Hilpisch

All available via the library.

References

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. 2020. “Empirical Asset Pricing via Machine Learning.” The Review of Financial Studies 33 (5): 2223–73.