graph LR
subgraph "Dataset A"
A1[ticker: AAPL<br/>ret: 0.02]
A2[ticker: MSFT<br/>ret: -0.01]
end
subgraph "Dataset B"
B1[ticker: AAPL<br/>mktcap: 2500B]
B2[ticker: MSFT<br/>mktcap: 2200B]
end
A1 -.->|match| B1
A2 -.->|match| B2
style A1 fill:#e1f5ff
style A2 fill:#e1f5ff
style B1 fill:#fff4e1
style B2 fill:#fff4e1
16 Joins and Merges
Combining datasets is fundamental to empirical finance. Stock returns live in one database (CRSP), accounting data in another (Compustat), analyst forecasts in a third (I/B/E/S). Real research requires merging these sources, and doing it wrong can silently corrupt your results.
This chapter develops the theory and practice of joins. We start with the conceptual framework—what happens when you match rows between datasets—then move to specific merge operations. Special attention goes to asof joins, essential for time-series data with mismatched timestamps. We close with diagnostic techniques to catch the merge errors that plague real-world research.
16.1 Join Types: Cardinality Matters
Before learning merge syntax, understand what happens to row counts. The relationship between keys in your datasets determines whether you end up with more rows, fewer rows, or the same number. Getting this wrong is catastrophic—you might duplicate observations, lose data, or create spurious patterns.
16.1.1 One-to-One Joins
Each key appears at most once in both datasets. This is the simplest case: rows either match or they don’t.
Example: merging monthly portfolio returns with portfolio characteristics. Each portfolio-month appears once in each dataset.
import pandas as pd
import polars as pl
from datetime import date
# Monthly portfolio returns
returns = pd.DataFrame({
'portfolio': ['growth', 'value', 'momentum'],
'month': ['2024-01', '2024-01', '2024-01'],
'return': [0.025, -0.010, 0.035]
})
# Portfolio characteristics (computed separately)
characteristics = pd.DataFrame({
'portfolio': ['growth', 'value', 'momentum'],
'month': ['2024-01', '2024-01', '2024-01'],
'avg_size': [5000, 3000, 4000],
'avg_bm': [0.5, 2.5, 1.2]
})
# One-to-one merge
merged = returns.merge(
characteristics,
on=['portfolio', 'month'],
validate='one_to_one'
)
merged- 1
- Columns that must match exactly in both DataFrames.
- 2
-
Raises
MergeErrorif duplicate keys exist in either DataFrame.
| portfolio | month | return | avg_size | avg_bm | |
|---|---|---|---|---|---|
| 0 | growth | 2024-01 | 0.025 | 5000 | 0.5 |
| 1 | value | 2024-01 | -0.010 | 3000 | 2.5 |
| 2 | momentum | 2024-01 | 0.035 | 4000 | 1.2 |
Use validate='one_to_one' in pandas or check row counts before and after. One-to-one merges should never change the number of rows (except for unmatched keys with inner joins). If row count changes unexpectedly, you have duplicate keys somewhere.
16.1.2 One-to-Many Joins
Keys are unique in one dataset but repeated in the other. This expands rows from the unique side.
graph LR
subgraph "Firms (One)"
F1[permno: 10107<br/>sector: Tech]
end
subgraph "Returns (Many)"
R1[permno: 10107<br/>date: 2024-01-02<br/>ret: 0.01]
R2[permno: 10107<br/>date: 2024-01-03<br/>ret: -0.02]
R3[permno: 10107<br/>date: 2024-01-04<br/>ret: 0.03]
end
F1 -.->|broadcast| R1
F1 -.->|broadcast| R2
F1 -.->|broadcast| R3
style F1 fill:#e1f5ff
style R1 fill:#fff4e1
style R2 fill:#fff4e1
style R3 fill:#fff4e1
Example: merging firm characteristics (one row per firm) with daily returns (many rows per firm).
# Firm characteristics (one row per firm)
firms = pd.DataFrame({
'permno': [10107, 10107, 14593],
'year': [2023, 2024, 2024],
'sector': ['Tech', 'Tech', 'Finance'],
'headquarters': ['CA', 'CA', 'NY']
})
# Daily returns (many rows per firm-year)
returns = pd.DataFrame({
'permno': [10107, 10107, 10107, 14593, 14593],
'year': [2024, 2024, 2024, 2024, 2024],
'date': ['2024-01-02', '2024-01-03', '2024-01-04',
'2024-01-02', '2024-01-03'],
'ret': [0.01, -0.02, 0.03, 0.005, -0.001]
})
# One-to-many merge: firm characteristics broadcast to each return
merged = returns.merge(
firms,
on=['permno', 'year'],
validate='many_to_one' # Ensures firms has unique keys
)
merged| permno | year | date | ret | sector | headquarters | |
|---|---|---|---|---|---|---|
| 0 | 10107 | 2024 | 2024-01-02 | 0.010 | Tech | CA |
| 1 | 10107 | 2024 | 2024-01-03 | -0.020 | Tech | CA |
| 2 | 10107 | 2024 | 2024-01-04 | 0.030 | Tech | CA |
| 3 | 14593 | 2024 | 2024-01-02 | 0.005 | Finance | NY |
| 4 | 14593 | 2024 | 2024-01-03 | -0.001 | Finance | NY |
Notice how sector and headquarters repeat for each date. This is broadcasting: the single firm row expands to match multiple return rows.
If you think you have a one-to-many join but the “one” side has duplicates, you’ll get a many-to-many join instead. This multiplies rows in ways that corrupt analyses. Always validate.
16.1.3 Many-to-Many Joins
Keys repeat in both datasets. Every combination of matching keys produces a row. This explodes row counts and is rarely what you want.
graph LR
subgraph "Dataset A (Many)"
A1[sector: Tech<br/>year: 2024<br/>firm: AAPL]
A2[sector: Tech<br/>year: 2024<br/>firm: MSFT]
end
subgraph "Dataset B (Many)"
B1[sector: Tech<br/>year: 2024<br/>analyst: Goldman]
B2[sector: Tech<br/>year: 2024<br/>analyst: Morgan]
end
A1 -.->|match| B1
A1 -.->|match| B2
A2 -.->|match| B1
A2 -.->|match| B2
style A1 fill:#e1f5ff
style A2 fill:#e1f5ff
style B1 fill:#fff4e1
style B2 fill:#fff4e1
Example: joining firm-level data to analyst coverage by sector-year creates all combinations.
# Firms in each sector-year
firms = pd.DataFrame({
'sector': ['Tech', 'Tech', 'Finance'],
'year': [2024, 2024, 2024],
'ticker': ['AAPL', 'MSFT', 'JPM'],
'return': [0.25, 0.30, 0.15]
})
# Analyst reports by sector-year (multiple analysts per sector)
analysts = pd.DataFrame({
'sector': ['Tech', 'Tech', 'Finance'],
'year': [2024, 2024, 2024],
'analyst': ['Goldman', 'Morgan', 'Citi'],
'recommendation': ['Buy', 'Hold', 'Buy']
})
# Many-to-many merge: DANGER!
merged = firms.merge(analysts, on=['sector', 'year'])
merged| sector | year | ticker | return | analyst | recommendation | |
|---|---|---|---|---|---|---|
| 0 | Tech | 2024 | AAPL | 0.25 | Goldman | Buy |
| 1 | Tech | 2024 | AAPL | 0.25 | Morgan | Hold |
| 2 | Tech | 2024 | MSFT | 0.30 | Goldman | Buy |
| 3 | Tech | 2024 | MSFT | 0.30 | Morgan | Hold |
| 4 | Finance | 2024 | JPM | 0.15 | Citi | Buy |
Notice how the row count exploded:
print(f"Original firms: {len(firms)}, analysts: {len(analysts)}, merged: {len(merged)}")Original firms: 3, analysts: 3, merged: 5
In finance, many-to-many joins typically indicate a modeling error. You probably need to aggregate one dataset first (e.g., average analyst recommendations by sector-year) or use a more specific join key. Pandas allows many-to-many by default; consider this a bug, not a feature.
16.2 Merge Operations: Which Rows Survive?
Join type (one-to-one, etc.) describes cardinality. Merge operation (inner, outer, left, right) describes which rows to keep when keys don’t match perfectly.
16.2.1 Inner Join: Intersection Only
Keep only rows where keys match in both datasets. This is conservative—you lose any observation that doesn’t have a partner—but ensures every row has complete data.
graph TD
subgraph "Dataset A"
A1[AAPL]
A2[MSFT]
A3[TSLA]
end
subgraph "Dataset B"
B1[AAPL]
B2[GOOGL]
B3[TSLA]
end
subgraph "Result"
R1[AAPL]
R2[TSLA]
end
A1 --> R1
B1 --> R1
A3 --> R2
B3 --> R2
style A2 fill:#ffcccc
style B2 fill:#ffcccc
style R1 fill:#ccffcc
style R2 fill:#ccffcc
Example: merging returns with accounting data. Only firm-years with both survive.
# Stock returns
returns = pd.DataFrame({
'ticker': ['AAPL', 'MSFT', 'TSLA'],
'year': [2023, 2023, 2023],
'return': [0.45, 0.35, 0.10]
})
# Accounting data (missing MSFT)
accounting = pd.DataFrame({
'ticker': ['AAPL', 'TSLA', 'GOOGL'],
'year': [2023, 2023, 2023],
'roa': [0.20, 0.05, 0.15],
'leverage': [0.30, 0.10, 0.05]
})
# Inner join: only AAPL and TSLA have both
inner = returns.merge(accounting, on=['ticker', 'year'], how='inner')
inner| ticker | year | return | roa | leverage | |
|---|---|---|---|---|---|
| 0 | AAPL | 2023 | 0.45 | 0.20 | 0.3 |
| 1 | TSLA | 2023 | 0.10 | 0.05 | 0.1 |
Use inner joins when you need data from both sources for your analysis. For example, computing correlations between returns and accounting ratios requires both variables. The sample restriction is a feature, not a bug—you’re explicitly limiting to observations where analysis is possible.
16.2.2 Left Join: Keep All from Left Dataset
Keep every row from the left dataset, adding data from the right where available. Unmatched right rows disappear. Missing values (NaN) fill columns from the right dataset where no match exists.
graph TD
subgraph "Left Dataset"
L1[AAPL]
L2[MSFT]
L3[TSLA]
end
subgraph "Right Dataset"
R1[AAPL]
R2[GOOGL]
R3[TSLA]
end
subgraph "Result"
RES1[AAPL + match]
RES2[MSFT + NaN]
RES3[TSLA + match]
end
L1 --> RES1
L2 --> RES2
L3 --> RES3
R1 --> RES1
R3 --> RES3
style R2 fill:#ffcccc
style RES1 fill:#ccffcc
style RES2 fill:#ffffcc
style RES3 fill:#ccffcc
Example: keeping all returns, adding accounting data where available.
# Left join: keep all returns
left = returns.merge(accounting, on=['ticker', 'year'], how='left')
left| ticker | year | return | roa | leverage | |
|---|---|---|---|---|---|
| 0 | AAPL | 2023 | 0.45 | 0.20 | 0.3 |
| 1 | MSFT | 2023 | 0.35 | NaN | NaN |
| 2 | TSLA | 2023 | 0.10 | 0.05 | 0.1 |
Use left joins when the left dataset defines your sample. For example, if you’ve carefully constructed a set of stocks for analysis, use those as the left dataset and merge in additional characteristics. This ensures you keep your full sample even if some characteristics are missing.
16.2.3 Right Join: Keep All from Right Dataset
The mirror of left join. Keep every row from the right dataset, add data from left where available. In practice, just swap your datasets and use a left join—right joins exist for symmetry but rarely clarify code.
# Right join: keep all accounting observations
right = returns.merge(accounting, on=['ticker', 'year'], how='right')
right| ticker | year | return | roa | leverage | |
|---|---|---|---|---|---|
| 0 | AAPL | 2023 | 0.45 | 0.20 | 0.30 |
| 1 | TSLA | 2023 | 0.10 | 0.05 | 0.10 |
| 2 | GOOGL | 2023 | NaN | 0.15 | 0.05 |
Write accounting.merge(returns, how='left') instead of returns.merge(accounting, how='right'). Reading left-to-right matches execution order and makes the priority explicit: you’re starting with accounting data and adding returns.
16.2.4 Outer Join: Keep Everything
Keep all rows from both datasets. This maximizes sample size but can create large numbers of missing values.
graph TD
subgraph "Dataset A"
A1[AAPL]
A2[MSFT]
A3[TSLA]
end
subgraph "Dataset B"
B1[AAPL]
B2[GOOGL]
B3[TSLA]
end
subgraph "Result"
R1[AAPL + match]
R2[MSFT + NaN]
R3[TSLA + match]
R4[GOOGL + NaN]
end
A1 --> R1
A2 --> R2
A3 --> R3
B1 --> R1
B2 --> R4
B3 --> R3
style R1 fill:#ccffcc
style R2 fill:#ffffcc
style R3 fill:#ccffcc
style R4 fill:#ffffcc
Example: keeping all tickers from either dataset.
# Outer join: keep everything
outer = returns.merge(accounting, on=['ticker', 'year'], how='outer')
outer| ticker | year | return | roa | leverage | |
|---|---|---|---|---|---|
| 0 | AAPL | 2023 | 0.45 | 0.20 | 0.30 |
| 1 | GOOGL | 2023 | NaN | 0.15 | 0.05 |
| 2 | MSFT | 2023 | 0.35 | NaN | NaN |
| 3 | TSLA | 2023 | 0.10 | 0.05 | 0.10 |
Outer joins are dangerous because they succeed even when keys match poorly. You might merge stock returns with bond yields on date, get mostly NaN, but never realize the datasets use different date formats. Always inspect outer join results carefully and report match rates.
16.2.5 Polars Merge Syntax
Polars uses similar concepts but different method names. The key difference: Polars is more explicit about join types and discourages many-to-many joins by default.
# Same data in Polars
returns_pl = pl.DataFrame({
'ticker': ['AAPL', 'MSFT', 'TSLA'],
'year': [2023, 2023, 2023],
'return': [0.45, 0.35, 0.10]
})
accounting_pl = pl.DataFrame({
'ticker': ['AAPL', 'TSLA', 'GOOGL'],
'year': [2023, 2023, 2023],
'roa': [0.20, 0.05, 0.15],
'leverage': [0.30, 0.10, 0.05]
})
# Inner join
inner_pl = returns_pl.join(
accounting_pl,
on=['ticker', 'year'],
how='inner'
)
inner_pl| ticker | year | return | roa | leverage |
|---|---|---|---|---|
| str | i64 | f64 | f64 | f64 |
| "AAPL" | 2023 | 0.45 | 0.2 | 0.3 |
| "TSLA" | 2023 | 0.1 | 0.05 | 0.1 |
# Left join
left_pl = returns_pl.join(
accounting_pl,
on=['ticker', 'year'],
how='left'
)
left_pl| ticker | year | return | roa | leverage |
|---|---|---|---|---|
| str | i64 | f64 | f64 | f64 |
| "AAPL" | 2023 | 0.45 | 0.2 | 0.3 |
| "MSFT" | 2023 | 0.35 | null | null |
| "TSLA" | 2023 | 0.1 | 0.05 | 0.1 |
# Outer join (called 'full' in Polars)
outer_pl = returns_pl.join(
accounting_pl,
on=['ticker', 'year'],
how='full'
)
outer_pl| ticker | year | return | ticker_right | year_right | roa | leverage |
|---|---|---|---|---|---|---|
| str | i64 | f64 | str | i64 | f64 | f64 |
| "AAPL" | 2023 | 0.45 | "AAPL" | 2023 | 0.2 | 0.3 |
| "TSLA" | 2023 | 0.1 | "TSLA" | 2023 | 0.05 | 0.1 |
| null | null | null | "GOOGL" | 2023 | 0.15 | 0.05 |
| "MSFT" | 2023 | 0.35 | null | null | null | null |
Polars also provides a validate parameter similar to pandas:
# Validate one-to-one
inner_pl = returns_pl.join(
accounting_pl,
on=['ticker', 'year'],
how='inner',
validate='1:1' # Raises error if violated
)16.3 Asof Joins for Time-Series Data
Financial data rarely aligns perfectly in time. You have daily stock prices, quarterly earnings announcements, irregular analyst forecast updates. Matching these requires asof joins: find the most recent value as of each timestamp, without requiring exact matches.
16.3.1 The Asof Join Problem
Suppose you want to merge earnings announcement dates with stock returns. Earnings come out on irregular dates (whenever the company reports). You need to tag each daily return with the most recent earnings announcement as of that date. A standard merge fails because announcement dates don’t match return dates. An asof join solves this: for each return date, find the most recent announcement date on or before that date.
16.3.2 Asof Join Mechanics
Asof joins require sorted data and directional matching. The syntax varies between pandas and Polars, but the logic is identical:
- Sort both datasets by time
- For each row in the left dataset, find the row in the right dataset with the closest timestamp that doesn’t exceed the left timestamp (backward-looking)
- Optionally match exactly on other columns (e.g., ticker)
Example: merging daily returns with quarterly earnings.
import numpy as np
# Daily returns (many observations)
returns = pd.DataFrame({
'ticker': ['AAPL'] * 10,
'date': pd.date_range('2024-01-01', periods=10, freq='D'),
'return': np.random.randn(10) * 0.02
})
# Earnings announcements (sparse, irregular)
earnings = pd.DataFrame({
'ticker': ['AAPL', 'AAPL', 'AAPL'],
'announce_date': pd.to_datetime(['2023-12-28', '2024-01-03', '2024-01-08']),
'eps': [1.25, 1.30, 1.35],
'surprise': [0.02, -0.01, 0.03]
})
# CRITICAL: Sort by time
returns = returns.sort_values('date')
earnings = earnings.sort_values('announce_date')
# Asof join: match each return to most recent earnings
merged = pd.merge_asof(
returns,
earnings,
left_on='date',
right_on='announce_date',
by='ticker',
direction='backward'
)
merged[['date', 'return', 'announce_date', 'eps', 'surprise']]- 1
- Both DataFrames must be sorted by their time columns before asof join.
- 2
- Time columns to match on (can have different names).
- 3
- Additional columns that must match exactly.
- 4
- Use most recent prior value (avoids look-ahead bias).
| date | return | announce_date | eps | surprise | |
|---|---|---|---|---|---|
| 0 | 2024-01-01 | -0.032470 | 2023-12-28 | 1.25 | 0.02 |
| 1 | 2024-01-02 | 0.023386 | 2023-12-28 | 1.25 | 0.02 |
| 2 | 2024-01-03 | -0.040501 | 2024-01-03 | 1.30 | -0.01 |
| 3 | 2024-01-04 | -0.017916 | 2024-01-03 | 1.30 | -0.01 |
| 4 | 2024-01-05 | -0.004434 | 2024-01-03 | 1.30 | -0.01 |
| 5 | 2024-01-06 | -0.020436 | 2024-01-03 | 1.30 | -0.01 |
| 6 | 2024-01-07 | -0.024500 | 2024-01-03 | 1.30 | -0.01 |
| 7 | 2024-01-08 | 0.046251 | 2024-01-08 | 1.35 | 0.03 |
| 8 | 2024-01-09 | -0.013361 | 2024-01-08 | 1.35 | 0.03 |
| 9 | 2024-01-10 | 0.024632 | 2024-01-08 | 1.35 | 0.03 |
Notice how the earnings data “fills forward”:
- Jan 1-2 use the Dec 28 announcement
- Jan 3-7 use the Jan 3 announcement
- Jan 8+ use the Jan 8 announcement
This is exactly what you want for event studies: tag each return with the most recent earnings information available at that time.
Both datasets must be sorted by the time column. Pandas doesn’t always check this, and unsorted data produces wrong results silently. Always sort explicitly before asof joins.
16.3.3 Asof Join Directions
The direction parameter controls which timestamp to match:
direction='backward'(default): Use most recent prior value (as of semantics)direction='forward': Use next future value (rare in finance)direction='nearest': Use closest value in either direction (dangerous for event studies—can introduce look-ahead bias)
# Forward-looking (CAREFUL: creates look-ahead bias)
forward = pd.merge_asof(
returns,
earnings,
left_on='date',
right_on='announce_date',
by='ticker',
direction='forward' # Use next announcement
)
# Nearest (AVOID for event studies)
nearest = pd.merge_asof(
returns,
earnings,
left_on='date',
right_on='announce_date',
by='ticker',
direction='nearest' # Use closest announcement
)Using direction='forward' or direction='nearest' can introduce look-ahead bias: your analysis uses information that didn’t exist at the time. For example, tagging a January return with a February earnings announcement. This inflates backtested strategy returns and invalidates empirical tests. Stick with direction='backward' unless you have a specific reason to do otherwise.
16.3.4 Asof Joins in Polars
Polars provides asof joins through the join_asof method. The syntax is cleaner and performance is often better for large datasets.
# Same data in Polars
returns_pl = pl.DataFrame({
'ticker': ['AAPL'] * 10,
'date': pl.date_range(
date(2024, 1, 1),
date(2024, 1, 10),
interval='1d',
eager=True
),
'return': np.random.randn(10) * 0.02
})
earnings_pl = pl.DataFrame({
'ticker': ['AAPL', 'AAPL', 'AAPL'],
'announce_date': [
date(2023, 12, 28),
date(2024, 1, 3),
date(2024, 1, 8)
],
'eps': [1.25, 1.30, 1.35],
'surprise': [0.02, -0.01, 0.03]
})
# Asof join in Polars
merged_pl = returns_pl.sort('date').join_asof(
earnings_pl.sort('announce_date'),
left_on='date',
right_on='announce_date',
by='ticker',
strategy='backward' # Equivalent to pandas direction='backward'
)
merged_pl.select(['date', 'return', 'announce_date', 'eps', 'surprise'])/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_65713/4247637574.py:25: UserWarning: Sortedness of columns cannot be checked when 'by' groups provided
merged_pl = returns_pl.sort('date').join_asof(
| date | return | announce_date | eps | surprise |
|---|---|---|---|---|
| date | f64 | date | f64 | f64 |
| 2024-01-01 | 0.01453 | 2023-12-28 | 1.25 | 0.02 |
| 2024-01-02 | 0.001239 | 2023-12-28 | 1.25 | 0.02 |
| 2024-01-03 | 0.016048 | 2024-01-03 | 1.3 | -0.01 |
| 2024-01-04 | -0.02854 | 2024-01-03 | 1.3 | -0.01 |
| 2024-01-05 | 0.01889 | 2024-01-03 | 1.3 | -0.01 |
| 2024-01-06 | -0.005555 | 2024-01-03 | 1.3 | -0.01 |
| 2024-01-07 | 0.045201 | 2024-01-03 | 1.3 | -0.01 |
| 2024-01-08 | -0.029402 | 2024-01-08 | 1.35 | 0.03 |
| 2024-01-09 | -0.021957 | 2024-01-08 | 1.35 | 0.03 |
| 2024-01-10 | 0.026204 | 2024-01-08 | 1.35 | 0.03 |
Polars also supports tolerance (maximum time difference) and handles missing values more explicitly:
# Only match if announcement is within 90 days
merged_pl = returns_pl.sort('date').join_asof(
earnings_pl.sort('announce_date'),
left_on='date',
right_on='announce_date',
by='ticker',
strategy='backward',
tolerance='90d' # NaN if no announcement within 90 days
)16.3.5 Common Asof Join Patterns in Finance
Pattern 1: Point-in-time accounting data
Merge daily returns with quarterly accounting variables. Each return should use the most recent financial statement available at that date.
# Daily returns
returns = pd.DataFrame({
'permno': [10107] * 100,
'date': pd.date_range('2024-01-01', periods=100, freq='D'),
'ret': np.random.randn(100) * 0.02
})
# Quarterly accounting (fiscal quarter end dates)
accounting = pd.DataFrame({
'permno': [10107, 10107, 10107, 10107],
'datadate': pd.to_datetime(['2023-09-30', '2023-12-31',
'2024-03-31', '2024-06-30']),
'book_equity': [1000, 1100, 1150, 1200],
'total_assets': [5000, 5200, 5300, 5400]
})
# Asof join: each return uses most recent accounting data
merged = pd.merge_asof(
returns.sort_values('date'),
accounting.sort_values('datadate'),
left_on='date',
right_on='datadate',
by='permno',
direction='backward'
)Pattern 2: Analyst forecast updates
Analyst forecasts arrive irregularly. Tag each return with the current consensus forecast.
# Daily returns
returns = pd.DataFrame({
'ticker': ['AAPL'] * 30,
'date': pd.date_range('2024-01-01', periods=30, freq='D'),
'ret': np.random.randn(30) * 0.02
})
# Analyst forecast updates (irregular)
forecasts = pd.DataFrame({
'ticker': ['AAPL'] * 5,
'forecast_date': pd.to_datetime([
'2023-12-15', '2024-01-05', '2024-01-12',
'2024-01-20', '2024-01-28'
]),
'consensus_eps': [5.25, 5.30, 5.28, 5.35, 5.40],
'num_analysts': [25, 26, 27, 28, 29]
})
# Asof join: current forecast as of each date
merged = pd.merge_asof(
returns.sort_values('date'),
forecasts.sort_values('forecast_date'),
left_on='date',
right_on='forecast_date',
by='ticker',
direction='backward'
)Pattern 3: Bid-ask spreads and trades
Match each trade to the prevailing bid-ask spread (within milliseconds).
# Trades (exact timestamps)
trades = pd.DataFrame({
'symbol': ['AAPL'] * 5,
'trade_time': pd.to_datetime([
'2024-01-02 09:30:00.123',
'2024-01-02 09:30:00.456',
'2024-01-02 09:30:01.789',
'2024-01-02 09:30:02.012',
'2024-01-02 09:30:03.345'
]),
'trade_price': [150.25, 150.26, 150.24, 150.25, 150.27],
'trade_size': [100, 200, 150, 300, 250]
})
# Quote updates (irregular, microsecond timestamps)
quotes = pd.DataFrame({
'symbol': ['AAPL'] * 4,
'quote_time': pd.to_datetime([
'2024-01-02 09:30:00.100',
'2024-01-02 09:30:00.500',
'2024-01-02 09:30:02.000',
'2024-01-02 09:30:03.000'
]),
'bid': [150.24, 150.25, 150.23, 150.26],
'ask': [150.26, 150.27, 150.25, 150.28]
})
# Asof join: prevailing quote at each trade
merged = pd.merge_asof(
trades.sort_values('trade_time'),
quotes.sort_values('quote_time'),
left_on='trade_time',
right_on='quote_time',
by='symbol',
direction='backward',
tolerance=pd.Timedelta('1s'), # Only match within 1 second
allow_exact_matches=False # Use quote *before* the trade, not at same instant
)
# Calculate effective spread
merged['spread'] = merged['ask'] - merged['bid']
merged['effective_spread'] = abs(merged['trade_price'] -
(merged['bid'] + merged['ask']) / 2)16.4 Diagnosing and Correcting Merge Errors
Merges fail silently. You run the code, get a DataFrame back, and assume it worked. But maybe 80% of rows didn’t match. Maybe you have duplicate keys creating a many-to-many join. Maybe your date formats don’t align. The code runs, but your results are garbage.
Run the following checks every time until they become automatic.
16.4.1 Essential checks before and after merging
Before merging:
- Predict row counts: Inner join should give ≤ min(left, right) rows; left join should preserve left row count for one-to-one or one-to-many; outer join gives ≥ max(left, right) rows
- Check for duplicate keys: Use
df.duplicated(key_cols).sum()orvalidateparameter to enforce cardinality - Verify data types match: Use
df[key_cols].dtypesto ensure join keys have compatible types - For string keys: Normalize with
.str.strip().str.upper()to handle whitespace and case differences - For dates: Ensure both use
datetime64, same time zone, and same granularity
After merging:
- Check row count changes: Unexpected changes indicate duplicate keys or wrong join type
- Check match rates: Use
indicator=Trueto see how many rows matched (usemerged['_merge'].value_counts()) - Inspect missing values: NaN in right-side columns after left join indicates failed matches
16.4.2 Using the indicator parameter
The indicator=True parameter adds a _merge column showing match status:
# Re-create sample data for merge diagnostics
returns = pd.DataFrame({
'ticker': ['AAPL', 'MSFT', 'TSLA'],
'year': [2023, 2023, 2023],
'return': [0.45, 0.35, 0.10]
})
accounting = pd.DataFrame({
'ticker': ['AAPL', 'TSLA', 'GOOGL'],
'year': [2023, 2023, 2023],
'roa': [0.20, 0.05, 0.15],
'leverage': [0.30, 0.10, 0.05]
})
# Left join with indicator
merged = returns.merge(
accounting,
on=['ticker', 'year'],
how='left',
indicator=True
)
# Check match rates
merged['_merge'].value_counts()- 1
-
Adds a
_mergecolumn showing whether each row matched:left_only,right_only, orboth.
_merge
both 2
left_only 1
right_only 0
Name: count, dtype: int64
print(f"Match rate: {(merged['_merge'] == 'both').mean():.1%}")Match rate: 66.7%
A match rate below 90% usually indicates problems: incorrect join keys, different identifier schemes (CUSIP vs PERMNO), date format mismatches, inconsistent capitalization, or datasets filtered to different subsets. Don’t just accept low match rates—understand why they occurred.
16.4.3 Common merge error patterns
| Error | Symptom | Cause | Solution |
|---|---|---|---|
| Cartesian explosion | Row count explodes (10,000x+ more rows) | Duplicate keys in both datasets | Aggregate one dataset or use more specific join keys |
| Zero matches | All rows are left_only |
Keys don’t overlap at all | Check identifier mapping (CUSIP vs PERMNO), date formats |
| Type mismatch | Low or zero matches despite overlapping keys | Different data types (int vs float, str vs object) | Convert both to same type before merge |
| Float precision | Values that should match don’t | 150.25 vs 150.25000001 |
Round floats or use integer keys |
| Time zone mismatch | Asof join produces all NaN | One dataset in UTC, other in local time | Convert both to same time zone |
| Case/whitespace | Low matches on string keys | “AAPL” vs “aapl” vs ” AAPL” | Use .str.strip().str.upper() |
| Unsorted asof join | Wrong matches or NaN | Data not sorted by time column | Always sort_values() before asof joins |
16.5 Summary and Best Practices
Merging is where most data errors occur in empirical finance. Follow these practices to catch problems early:
Understand cardinality before merging. Is this one-to-one, one-to-many, or many-to-many? Use
validateto enforce expectations.Choose the right merge type. Inner for complete cases only, left to preserve your sample, outer rarely (and carefully).
Use asof joins for time-series data. Don’t try to match timestamps exactly—use asof joins with
direction='backward'to avoid look-ahead bias.Validate every merge. Check row counts, match rates, duplicates, and data types until these checks become automatic.
Sort before asof joins. Unsorted data produces silent errors.
Document match rates. Report how many observations matched in your analysis. Low match rates indicate problems.
Normalize strings. Strip whitespace, standardize case, remove special characters before merging on text.
Use indicator columns. Add
indicator=Trueto understand what matched and what didn’t.Prefer left joins for clarity. Write
base.merge(extra, how='left')instead ofextra.merge(base, how='right').Consider Polars for large datasets. Polars’ lazy evaluation and better error messages can save time on big merges.
The datasets in empirical finance are messy. Companies change tickers, databases use different identifiers, timestamps don’t align. Merging correctly requires skepticism, validation, and careful attention to detail. The techniques in this chapter won’t eliminate merge errors, but they’ll help you catch them before they corrupt your research.