MATH60230 - Lecture 9

Machine Learning for Empirical Finance

Vincent Grégoire

HEC Montréal

Saad Ali Khan

HEC Montréal

Plan for Today

  • Introduction to machine learning
  • ML vs. econometrics: different goals, different vocabulary
  • Regularized linear models: Lasso, Ridge, Elastic Net
  • Overfitting and model validation
  • Other models and preprocessing
  • Clustering

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

  • ML = algorithms that learn patterns from data to make predictions or decisions
  • In finance: forecasting returns, classifying firms, extracting patterns from complex data

Machine Learning Approaches

  • Supervised learning: The computer is presented with example inputs and their desired outputs (think OLS).
    • Example: predict stock returns from firm characteristics
  • Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
    • Example: cluster stocks into groups based on return patterns
  • Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal, receiving feedback analogous to rewards, which it tries to maximize.
    • Example: trade execution algorithms

Prediction vs. Inference

Econometrics

  • Estimate causal effects, test hypotheses
  • Inference on \beta: is \beta significant? What is its magnitude?

Machine Learning

  • Minimize prediction error, maximize out-of-sample performance
  • “We don’t care about \beta; we care about \hat{y}.”

ML is good for forecasting or extracting complex (non-linear) patterns from data. ML is not good for statistical inference.

Vocabulary: ML vs. Econometrics

Econometrics               | Machine Learning
Dependent variable (y)     | Target / label
Independent variables (X)  | Features
Estimation                 | Training
Coefficients (\beta)       | Weights / parameters
Sample                     | Dataset
Out-of-sample              | Test set
Model specification        | Model selection / architecture
Regression                 | Can mean classification too!

“Regression” in ML

  • In econometrics: regression = modeling a relationship (usually linear)
  • In ML: regression = predicting a continuous value (vs. classification = predicting a category)
  • Logistic regression is a classification algorithm in ML parlance!

Types of ML Tasks

Common ML tasks can be divided into four groups:

  • Classification: Identify which category an observation belongs to.
    • e.g., will a firm default?
  • Regression: Predict a continuous value.
    • e.g., what will next month’s return be?
  • Clustering: Automatic grouping of observations.
    • e.g., group stocks by behavior
  • Dimensionality reduction: Reduce the number of explanatory variables.
    • e.g., summarize 100 firm characteristics into a few factors

Supervised vs. Unsupervised

  • Supervised: we have labels (y); learn a mapping X \rightarrow y
    • Classification and regression
  • Unsupervised: no labels; find structure in X
    • Clustering and dimensionality reduction
  • Semi-supervised: small amount of labeled data + large amount of unlabeled data
    • Less common, but useful when labeling is expensive

scikit-learn

scikit-learn is the standard Python library for machine learning (not deep learning).

  • Consistent API across all models
  • Key modules:
    • sklearn.linear_model — linear and regularized models
    • sklearn.ensemble — random forests, gradient boosting
    • sklearn.svm — support vector machines
    • sklearn.preprocessing — feature scaling, encoding
    • sklearn.model_selection — cross-validation, grid search
    • sklearn.metrics — evaluation metrics

The Estimator API

All scikit-learn models follow the same pattern:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)       # 1. Create model with hyperparameters
model.fit(X_train, y_train)    # 2. Train the model
y_pred = model.predict(X_test) # 3. Make predictions
score = model.score(X_test, y_test) # 4. Evaluate performance

This consistent interface makes it easy to swap models in and out.
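
As a minimal sketch of what swapping models looks like, the loop below runs the same fit/score calls on several estimators. The synthetic data, the coefficient values and the particular set of models are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 5 features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same call for every estimator
    scores[name] = model.score(X_test, y_test)   # R^2 on held-out data for regressors
    print(f"{name:>13}: test R^2 = {scores[name]:.3f}")
```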

scikit-learn Ecosystem

Deep learning libraries

What is Overfitting?

  • Model learns noise in the training data instead of the true pattern
  • Performs well on training data but poorly on new data
  • Analogy: memorizing exam answers vs. understanding the material

Overfitting with OLS

Overfitting typically arises from estimating many different models on the same data and picking the one that optimizes some objective function:

  • R^2
  • Log-likelihood function
  • Information criteria
  • Profits generated (backtest)

Models that suffer from overfitting tend to perform badly out of sample.

Overfitting - A Simple Model

Let X_{t},Y_{t} be our variables with Y_{t}=X_{t}\beta+\varepsilon_{t}

  • E\left(\varepsilon_{t}X_{t}\right)=0 and \varepsilon_{t}\sim N(0,\sigma^{2}) i.i.d.
  • \left\{X_{1,t},Y_{1,t}\right\} are the data for our main sample
  • \left\{X_{2,t},Y_{2,t}\right\} are the data for a second sample

Problem

We want to use X to predict Y.

Suppose we want to minimize E\left[ \left( Y_{t}-X_{t}\theta\right)^{2}\right]

Ideally, we should use \theta=\beta. But \beta is unknown.

If we only had \left\{ X_{1,t},Y_{1,t}\right\}, we could minimize the sample analogue \sum_{t=1}^{T}\left( Y_{1,t}-X_{1,t}\theta\right)^{2} \Rightarrow\widehat{\theta}=\widehat{\theta}_{OLS}

Overfitting with OLS

Fact

Q_{1}=\sum_{t=1}^{T}\left( \left( Y_{1,t}-X_{1,t}\beta\right)^{2}-\left( Y_{1,t}-X_{1,t}\widehat{\theta}\right)^{2}\right)

E\left( Q_{1}\right) =k\cdot\sigma^{2} where k= dimension of X

  • In-sample, \widehat{\theta} fits better than the true \beta by k\cdot\sigma^{2} on average, so OLS tends to underestimate prediction error (i.e. “overfit”)

Fact

E\left[ Q_{2}\right] =E\left[ \sum_{t=1}^{T}\left( \left( Y_{2,t}-X_{2,t}\beta\right)^{2}-\left( Y_{2,t}-X_{2,t}\widehat{\theta}\right)^{2}\right) \right] =-k\cdot\sigma^{2}

  • The over-fitting in-sample hurts our performance out-of-sample.
  • The more we over-fit, the more it hurts our forecast.
  • To avoid overfitting, we should favor models whose in-sample and out-of-sample performance are similar, and rely on techniques such as cross-validation.
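
The two facts above can be checked by simulation. The sketch below is only an illustration under stated assumptions: a fixed design (both samples share the same regressors X and differ only in their error draws), and arbitrary choices for T, k, \sigma and \beta. It estimates \widehat{\theta} by OLS on the main sample and averages Q_1 and Q_2 over many replications.

```python
import numpy as np

rng = np.random.default_rng(42)
T, k, sigma = 200, 5, 1.0      # illustrative values, not from the lecture
n_sims = 4000

# Assumption: fixed design, so both samples use the same X
# and differ only in their independent error draws.
X = rng.normal(size=(T, k))
beta = np.ones(k)

q1 = np.empty(n_sims)
q2 = np.empty(n_sims)
for s in range(n_sims):
    y1 = X @ beta + rng.normal(scale=sigma, size=T)   # main sample
    y2 = X @ beta + rng.normal(scale=sigma, size=T)   # second sample
    theta_hat, *_ = np.linalg.lstsq(X, y1, rcond=None)  # OLS on the main sample only

    # Squared errors under the true beta minus squared errors under theta_hat
    q1[s] = np.sum((y1 - X @ beta) ** 2) - np.sum((y1 - X @ theta_hat) ** 2)
    q2[s] = np.sum((y2 - X @ beta) ** 2) - np.sum((y2 - X @ theta_hat) ** 2)

print(f"mean Q1 = {q1.mean():.2f}  (theory: {k * sigma**2:.2f})")
print(f"mean Q2 = {q2.mean():.2f}  (theory: {-k * sigma**2:.2f})")
```

Q_1 is never negative because OLS minimizes the in-sample sum of squares by construction; the out-of-sample counterpart Q_2 is negative on average, which is the cost of that in-sample gain.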

From OLS to Regularization

  • OLS minimizes \sum(y_i - X_i\beta)^2
  • Problem: with many features, OLS overfits — it fits noise in the training data
  • Solution: add a penalty term to constrain \beta

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \text{penalty}(\beta) \right)

Ridge Regression (L2 Penalty)

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 \right)

  • Shrinks coefficients toward zero but never exactly to zero
  • \lambda_2 controls the strength of regularization:
    • When \lambda_2 = 0 \rightarrow OLS
    • When \lambda_2 \rightarrow \infty \rightarrow all coefficients \rightarrow 0
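
Without an intercept, the Ridge problem above has the closed-form solution (X'X + \lambda_2 I)^{-1}X'y. The sketch below, on made-up data, compares that formula with scikit-learn's Ridge; fit_intercept=False is set so that both solve the same problem.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=100)

lam = 10.0
# Closed form (X'X + lambda I)^{-1} X'y, solved without forming an explicit inverse
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# fit_intercept=False so scikit-learn solves the same no-intercept problem
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form :", np.round(beta_closed, 4))
print("scikit-learn:", np.round(beta_sklearn, 4))
print("OLS         :", np.round(beta_ols, 4))
```

The Ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage described above.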

Lasso Regression (L1 Penalty)

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

  • Can shrink coefficients exactly to zero \rightarrow automatic feature selection
  • Useful when you suspect many features are irrelevant
  • Produces sparse solutions
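
A small illustration of sparsity, assuming synthetic data in which only the first three of twenty features matter. Note that scikit-learn's Lasso minimizes (1/(2n))\|y - X\beta\|^2 + alpha\|\beta\|_1, so alpha is not numerically equal to \lambda_1 in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]        # assumption: only three relevant features
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients set exactly to zero:", n_zero_lasso, "of", p)
print("Ridge coefficients set exactly to zero:", n_zero_ridge, "of", p)
print("Lasso coefficients on the relevant features:", np.round(lasso.coef_[:3], 2))
```

The surviving Lasso coefficients are themselves shrunk toward zero relative to the true values, which is the price paid for the selection.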

Elastic Net

Combines L1 and L2 penalties:

\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

  • Best of both worlds: feature selection (L1) + grouped shrinkage (L2)
  • When features are correlated, Lasso tends to pick one and ignore the others; Elastic Net keeps groups together
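
The grouping behavior can be seen with two nearly identical features. This is a sketch on simulated data (the near-duplicate construction and the alpha values are assumptions); how Lasso splits the weight depends on the noise draw, whereas the L2 part of Elastic Net pushes the two coefficients toward each other.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)     # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)  # both features matter equally

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients      :", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```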

Regularization in scikit-learn

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
enet  = ElasticNet(alpha=1.0, l1_ratio=0.5)

  • scikit-learn uses alpha instead of \lambda
  • l1_ratio in ElasticNet: 0 = pure Ridge, 1 = pure Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Bias-Variance Tradeoff

Underfitting

  • High bias, low variance
  • Model is too simple
  • Misses the true pattern
  • e.g., linear model for non-linear data

Overfitting

  • Low bias, high variance
  • Model is too complex
  • Fits noise in the data
  • e.g., very deep decision tree

Goal: find the sweet spot that minimizes total error.

Regularization Controls Overfitting

  • \lambda is the dial between underfitting and overfitting:
  • Small \lambda \rightarrow complex model \rightarrow potential overfitting
  • Large \lambda \rightarrow simple model \rightarrow potential underfitting
  • This is why choosing \lambda (hyperparameter tuning) matters!
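
To see the dial in action, the sketch below sweeps alpha for Ridge on simulated data with many features relative to observations (all values are illustrative assumptions) and records training and test error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 120, 60                          # many features relative to observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                          # assumption: only 5 features carry signal
y = X @ beta + rng.normal(scale=2.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_mse, test_mse = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    print(f"alpha={alpha:>7}: train MSE={train_mse[-1]:.3f}  test MSE={test_mse[-1]:.3f}")
```

Training error can only rise as alpha grows, so it cannot be used to pick alpha; the test (or validation) error is what reveals the trade-off.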

Overfitting in Practice

  • More features than observations \rightarrow almost guaranteed overfitting
  • Common in finance: many potential predictors, limited history
  • Key defenses:
    • Regularization (Ridge, Lasso, Elastic Net)
    • Cross-validation
    • Feature selection and dimensionality reduction

Hyperparameters

  • Parameters: learned from data during training
    • e.g., \beta in regression, node splits in a tree
  • Hyperparameters: set before training
    • e.g., \lambda in Ridge, depth of a decision tree, number of trees in a forest
  • Tuning hyperparameters: finding the best values
    • Cannot learn them from the training data directly (would overfit!)

Train / Validation / Test Sets

Split the data into three groups:

  1. Training set: fit the model
  2. Validation set: tune hyperparameters, compare models
  3. Test set: final evaluation (touch only once!)

  • Typical splits: 60/20/20 or 70/15/15
  • Why three sets? If you use the test set to tune hyperparameters, you are effectively “training on the test data.”
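
One possible way to obtain a 60/20/20 split is to call train_test_split twice, as sketched below on placeholder data. The random split shown here assumes cross-sectional, i.i.d.-like observations; it is not appropriate for time series.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 observations, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# Step 1: hold out 20% as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Step 2: 25% of the remaining 80% equals 20% of the full sample
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```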

Cross-Validation

  • K-fold cross-validation: split training data into K folds
    1. Train on K-1 folds
    2. Validate on the remaining fold
    3. Rotate which fold is the validation set
    4. Average performance across all K folds
  • Advantages: uses all data for both training and validation
  • Common choices: K = 5 or K = 10

Cross-Validation Diagram (5-fold)

Fold   | Split 1 | Split 2 | Split 3 | Split 4 | Split 5
Fold 1 | Valid.  | Train   | Train   | Train   | Train
Fold 2 | Train   | Valid.  | Train   | Train   | Train
Fold 3 | Train   | Train   | Valid.  | Train   | Train
Fold 4 | Train   | Train   | Train   | Valid.  | Train
Fold 5 | Train   | Train   | Train   | Train   | Valid.

Each row = one iteration. Average the scores across all 5 iterations.
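
The rotation in the diagram can be written out by hand with KFold. The sketch below uses simulated data and a Ridge model as stand-ins, and checks that the manual loop gives the same scores as cross_val_score when both use the same splitter.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Simulated data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 0.0]) + rng.normal(size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])    # train on K-1 folds
    manual_scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the held-out fold

# cross_val_score runs the same loop when given the same splitter
auto_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=kf)

print("manual:", np.round(manual_scores, 3))
print("auto  :", np.round(auto_scores, 3))
print("mean CV R^2:", round(float(np.mean(manual_scores)), 3))
```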

Hyperparameter Tuning

  • Grid search: try all combinations over a predefined grid
  • GridSearchCV in scikit-learn combines grid search + cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]

Regression Metrics

  • MSE (Mean Squared Error): \frac{1}{n}\sum(y_i - \hat{y}_i)^2
  • RMSE (Root MSE): \sqrt{\text{MSE}} — same units as y
  • MAE (Mean Absolute Error): \frac{1}{n}\sum|y_i - \hat{y}_i|
  • R^2 (coefficient of determination): 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}
  • Note: R^2 can be negative out-of-sample! This means the model does worse than simply predicting the mean.
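
A tiny worked example, with made-up numbers, that computes each metric from its formula and compares it with sklearn.metrics. The predictions are deliberately bad so that R^2 comes out negative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # deliberately bad: the ordering is reversed

mse = np.mean((y_true - y_pred) ** 2)     # (9 + 1 + 1 + 9) / 4 = 5
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))    # (3 + 1 + 1 + 3) / 4 = 2
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # 1 - 20/5 = -3

print(f"MSE={mse:.2f}  RMSE={rmse:.3f}  MAE={mae:.2f}  R^2={r2:.2f}")
print("sklearn agrees:",
      np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```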

Classification Metrics

  • Accuracy: fraction of correct predictions
  • Precision: of all predicted positives, how many are truly positive?
  • Recall: of all actual positives, how many did we find?
  • F1 score: harmonic mean of precision and recall: F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  • Why accuracy can be misleading: imbalanced classes
    • If 99% of firms don’t default, a model that always predicts “no default” achieves 99% accuracy while catching no defaults!
  • In finance: consider the cost of false positives vs. false negatives
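
The imbalanced-class problem can be reproduced with a hypothetical sample of 1,000 firms, 10 of which default, and a "model" that always predicts no default. The zero_division=0 argument only controls what is returned when a metric is undefined (here precision, since nothing is predicted positive).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = default, 0 = no default (1% default rate)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1
y_pred = np.zeros(1000, dtype=int)        # always predict "no default"

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}  precision={prec:.2f}  recall={rec:.2f}  F1={f1:.2f}")
```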

sklearn.metrics

from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    accuracy_score,
    classification_report,
)

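# Regression: y_test and y_pred hold continuous values
mse = mean_squared_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

# Classification: y_test_cls and y_pred_cls (hypothetical names) hold class
# labels from a separate classification model, not the regression output above
acc = accuracy_score(y_test_cls, y_pred_cls)
print(classification_report(y_test_cls, y_pred_cls))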

Decision Trees

  • Each node = a binary decision on a feature (e.g., x_1 > 0.04?)
  • Easy to interpret — you can visualize the tree
  • Prone to overfitting (can memorize training data)
  • Key hyperparameters: max depth, min samples per leaf
  • scikit-learn: DecisionTreeRegressor, DecisionTreeClassifier
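
A quick illustration of a tree memorizing its training data, using a simulated noisy sine curve (an assumption for illustration). The unrestricted tree fits the training set perfectly; the restricted tree uses the two hyperparameters named above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf is pure
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Restricted tree: limited depth and a minimum leaf size
shallow = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = shallow.score(X_train, y_train), shallow.score(X_test, y_test)

print(f"deep tree   : train R^2={deep_train:.3f}  test R^2={deep_test:.3f}")
print(f"shallow tree: train R^2={shallow_train:.3f}  test R^2={shallow_test:.3f}")
```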

Random Forests

  • Ensemble of decision trees
  • Each tree trained on a random subset of data and features
  • Reduces overfitting through averaging — this is diversification!
    • If errors are not perfectly correlated and models are not biased, averaging helps
  • scikit-learn: RandomForestRegressor, RandomForestClassifier
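
The diversification argument can be made concrete with a stylized simulation. Assume, purely for illustration, N unbiased models whose errors have unit variance and a common pairwise correlation \rho; the variance of the averaged error is then \rho + (1 - \rho)/N, which falls with N but is floored at \rho.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_models, rho = 20000, 50, 0.3   # illustrative values

# Assumption: equicorrelated, zero-mean errors with unit variance,
# built from one common shock plus model-specific shocks
common = rng.normal(size=(n_draws, 1))
idio = rng.normal(size=(n_draws, n_models))
errors = np.sqrt(rho) * common + np.sqrt(1 - rho) * idio

var_single = errors[:, 0].var()             # error variance of one model
var_average = errors.mean(axis=1).var()     # error variance of the averaged prediction
var_theory = rho + (1 - rho) / n_models

print(f"single model: {var_single:.3f}")
print(f"average of {n_models}: {var_average:.3f}  (theory: {var_theory:.3f})")
```

Training each tree on a random subset of data and features is a way of lowering the correlation between the trees' errors, which is what makes the averaging effective.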

Support Vector Machines

  • Finds the optimal boundary (hyperplane) that separates classes
  • Can handle non-linear boundaries via the kernel trick
  • Works for both classification (SVC) and regression (SVR)
  • Can be slow on large datasets
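
As an illustration of the kernel trick, the sketch below fits a linear and an RBF-kernel SVC to scikit-learn's make_circles toy data, where one class sits inside the other and no straight line separates them. The dataset and settings are chosen for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original two features
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy   : {rbf_acc:.2f}")
```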

Model Comparison Summary

Model          | Interpretability | Non-linearity | Speed  | Overfitting risk
Linear models  | High             | Low           | Fast   | Low (with reg.)
Decision trees | High             | High          | Fast   | High
Random forests | Medium           | High          | Medium | Low
SVM            | Low              | High          | Slow   | Medium

  • No single model is best for all problems
  • Start simple (linear models), add complexity only if needed

Why Preprocessing Matters

  • Many ML algorithms are sensitive to feature scales (SVM, Ridge, Lasso)
  • Missing values and categorical variables need handling
  • Remember, we only care about predicting y, not about inference, so it is common to normalize X (and sometimes y) to improve numerical performance.
  • Whatever transformation is done should be fit on training data and the same transformation applied to test data.

Feature Scaling

  • StandardScaler: zero mean, unit variance (z-score)
    • x' = \frac{x - \mu}{\sigma}
    • Use for most algorithms (Ridge, Lasso, SVM)
  • MinMaxScaler: scale to [0, 1]
    • x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
    • Use when you need bounded values

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled  = scaler.transform(X_test)        # transform only!

scikit-learn Pipelines

Combine preprocessing + model in one object:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

  • Prevents data leakage: scaler fits only on training data
  • Works seamlessly with GridSearchCV
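
A sketch of a pipeline inside GridSearchCV, on simulated features with very different scales (an assumption for illustration). Hyperparameters of a pipeline step are addressed as step name, two underscores, parameter name; within each cross-validation split the scaler is re-fit on the training folds only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated features with very different scales (assumption)
rng = np.random.default_rng(0)
scales = np.array([1.0, 100.0, 0.01, 10.0])
X = rng.normal(size=(300, 4)) * scales
y = X @ (np.array([1.0, -1.0, 0.5, 0.0]) / scales) + rng.normal(size=300)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# "model__alpha" targets the alpha parameter of the step named "model"
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best alpha:", search.best_params_["model__alpha"])
print("best CV MSE:", round(-search.best_score_, 3))
```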

The Curse of Dimensionality

  • More features \rightarrow data becomes sparse in high dimensions
  • Models need exponentially more data to fill the space
  • Many features may be correlated or irrelevant
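
One way to see the sparsity is to look at distances. In the sketch below (uniform random points, sizes chosen for illustration), the nearest and farthest neighbors of a query point become almost equally far away as the dimension grows, so "closeness" carries less and less information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))     # points in the unit hypercube
    query = rng.uniform(size=d)
    dist = np.linalg.norm(X - query, axis=1)
    ratios[d] = dist.min() / dist.max()     # near 0: clear neighbors; near 1: all equally far
    print(f"d={d:>4}: nearest / farthest distance = {ratios[d]:.3f}")
```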

Principal Component Analysis (PCA)

  • Most common dimensionality reduction technique
  • Finds directions of maximum variance in the data
  • Reduces features while preserving as much information as possible
  • Finance application: reducing many correlated firm characteristics into a few principal components

from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

PCA will be covered in more detail in the following course.
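
As a small preview, the sketch below simulates 20 correlated "characteristics" driven by 3 latent factors (a made-up data-generating process) and inspects explained_variance_ratio_, which reports the share of total variance captured by each component. Features are standardized first so that no single feature dominates because of its scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: 20 features generated from 3 latent factors plus small noise
rng = np.random.default_rng(0)
n, n_factors, n_features = 1000, 3, 20
factors = rng.normal(size=(n, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
X = factors @ loadings + 0.1 * rng.normal(size=(n, n_features))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for i, share in enumerate(cumulative, start=1):
    print(f"first {i} component(s): {share:.1%} of variance")
```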

Unsupervised Learning: K-Means Clustering

  • Algorithm: assign points to K clusters, minimize within-cluster distance
    1. Initialize K centroids
    2. Assign each point to nearest centroid
    3. Update centroids as cluster means
    4. Repeat until convergence
  • Must choose K (hyperparameter)
  • Finance uses: grouping similar stocks/firms, market segmentation, regime detection
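
The four steps can be sketched directly in NumPy. This is a bare-bones version for illustration, not scikit-learn's implementation: the function name kmeans_lloyd, the random-data-point initialization and the handling of empty clusters are all choices made here.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Plain K-means loop; returns labels, centroids, and inertia after each iteration."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids as k distinct data points (one simple choice among many)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    inertia_path = []
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid (squared Euclidean distance)
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # 3. Update each centroid to its cluster mean (keep it unchanged if the cluster is empty)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        inertia_path.append(((X - new_centroids[labels]) ** 2).sum())
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, inertia_path

# Simulated data (assumption): three groups of 100 points in two dimensions
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

labels, centroids, inertia_path = kmeans_lloyd(X, k=3)
print("iterations:", len(inertia_path))
print("inertia path:", np.round(inertia_path, 1))
```

The within-cluster sum of squares never increases from one iteration to the next, but the loop can stop at a local optimum that depends on the starting centroids.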

K-Means in scikit-learn

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

  • Choosing K: elbow method — plot inertia vs. K, look for the “elbow”
  • Limitations: assumes spherical clusters, sensitive to initialization
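
A sketch of the elbow method on simulated data with three well-separated groups (an assumption for illustration). Inertia is read from the fitted model's inertia_ attribute; here it is printed rather than plotted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulated data (assumption): three well-separated groups in two dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_           # within-cluster sum of squared distances
    print(f"K={k}: inertia={inertias[k]:.1f}")
```

Inertia drops sharply up to K = 3 and only marginally afterwards; that kink is the "elbow". Setting n_init to several restarts is one way to reduce the sensitivity to initialization.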

Challenges of Time-Series Forecasting

  • Standard ML assumes i.i.d. data; time-series data typically violate this
  • Look-ahead bias: accidentally using future information
  • Cannot randomly split: must use rolling/expanding windows
    • Important that the test set is always out-of-sample
  • Autocorrelation, regime changes, non-stationarity
  • These methods will be covered in the next course
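
As a brief preview, scikit-learn's TimeSeriesSplit produces expanding-window splits in which every test block comes strictly after its training block. The sketch below uses 120 placeholder observations assumed to be in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data (assumption): 120 observations in chronological order
n_obs = 120
X = np.arange(n_obs).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"split {i}: train = 0..{train_idx.max()}, "
          f"test = {test_idx.min()}..{test_idx.max()}")
```

Passing max_train_size to TimeSeriesSplit caps the training window, which gives rolling rather than expanding windows.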

Book Recommendations

Hands-On Machine Learning with Scikit-Learn and PyTorch by Aurélien Géron — available via the library.

For ML in finance, see Gu, Kelly, and Xiu (2020).

More Book Recommendations

All available via the library.

References

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. 2020. “Empirical Asset Pricing via Machine Learning.” The Review of Financial Studies 33 (5): 2223–73.