MATH60230 - Lecture 4

Vincent Grégoire

HEC Montréal

Saad Ali Khan

HEC Montréal

Plan for Today

  • Vectorized operations
  • Merges and joins
  • Jupyter Notebooks
  • Assignment 4

Vectorized operations

  • We can think of a vector as a single column of a pandas Dataframe.
from pandas import DataFrame
df = DataFrame({
    'vecA':[1,2,3], 
    'vecB': [4,5,6],
},
index=["row1","row2",'row3'])

Vectors

  • Pandas DataFrame supports both element-by-element and by-row/column operations.
  • E.g. element wise addition and substraction of vectors, means by column…etc
  • These operations are optimized for speed.
    • Avoid using for loops whenever you can use vectorized operations.

Example (vectors)

df['vecA'] + df['vecB']
row1    5
row2    7
row3    9
dtype: int64
2*df['vecA']
row1    2
row2    4
row3    6
Name: vecA, dtype: int64
df['vecA'].mean()
np.float64(2.0)

Example (Matrices)

  • We can think of a matrix as two columns of a pandas Dataframe.
print(df)
      vecA  vecB
row1     1     4
row2     2     5
row3     3     6
  • Computing means by columns:
df.mean(axis=0)
vecA    2.0
vecB    5.0
dtype: float64
  • Computing means by rows:
df.mean(axis=1)
row1    2.5
row2    3.5
row3    4.5
dtype: float64
  • Thinking of vector A and vector B stacked as a matrix, we can also transpose the matrix:
df.T
row1 row2 row3
vecA 1 2 3
vecB 4 5 6

Merges and Joins

  • Combining datasets is fundamental to empirical finance.
    • Stock returns in CRSP, accounting data in Compustat, analyst forecasts in I/B/E/S.
  • Two concepts to understand:
    1. Cardinality: How keys relate between datasets (one-to-one, one-to-many, many-to-many).
    2. Merge operation: Which rows to keep (inner, left, right, outer).
  • Getting this wrong can silently corrupt your results.

Join Types (Cardinality)

One-to-One

Each key appears at most once in both datasets.

Example: Monthly portfolio returns merged with portfolio characteristics.

One-to-Many

Keys unique in one dataset but repeated in the other. Rows broadcast.

Example: Firm characteristics merged with daily returns.

Many-to-Many

Keys repeat in both datasets. Creates all combinations.

Usually wrong!


Use validate parameter in pd.merge() to enforce expected cardinality.

Merge Operations

Which rows to keep when keys don’t match perfectly?

  • Inner join: Keep only rows where keys match in both datasets.
  • Left join: Keep all rows from the left dataset; add data from right where available (NaN otherwise).
  • Right join: Keep all rows from the right dataset; add data from left where available.
  • Outer join: Keep all rows from both datasets.

Functions: pd.merge() or df.join() in pandas, df.join() in Polars.

Use how parameter: 'inner', 'left', 'right', 'outer' (Polars uses 'full' for outer).

Asof Joins

  • Financial data rarely aligns perfectly in time.
    • Daily prices, quarterly earnings, irregular analyst forecasts.
  • Asof join: Find the most recent value as of each timestamp.
    • No exact match required.

Key points:

  • Data must be sorted by time before joining.
  • Use direction='backward' to avoid look-ahead bias.
  • Essential for point-in-time data (accounting, forecasts).

Functions: pd.merge_asof() in pandas, .join_asof() in Polars.

Merge Best Practices

  1. Validate cardinality with validate parameter.
  2. Check row counts before and after merging.
  3. Use indicator=True to see match rates.
  4. Normalize strings (strip whitespace, standardize case).
  5. Sort before asof joins — unsorted data gives wrong results.
  6. Watch for type mismatches (int vs float, datetime formats).

Low match rates usually indicate problems — investigate them!

Jupyter Notebooks

  • Jupyter notebook are the other popular format for Python code.
    • Made up of cells that can contain Python code or markdown.
    • Look more like a report, useful for showing data.
  • When do I use Python scripts (.py) vs Jupyter Notebooks (.ipynb)?
    • .py: For almost everything.
      • To define longer functions, classes, etc.
      • To create modules.
      • For code that does processing for which I do not need to see the output.
    • Notebooks:
      • For statistics and analysis, to see the results in figures and tables, document that analysis.

Jupyter Notebooks - Issues

Jupyter Notebooks, while useful, have many issues:

  • Harder to version control.
    • They embed the output in the file, making it harder to diff.
  • Harder to test and debug.
  • They can be slow to run.
  • The output is not always reproducible because of the order of execution.

There is a new notebook format called marimo that tries to solve these issues.

Assignment 4

Pandas continued!

Next Lecture

  • Figures and tables
  • Matplotlib and seaborn
  • How to read a paper