16 Joins and Merges

Combining datasets is fundamental to empirical finance. Stock returns live in one database (CRSP), accounting data in another (Compustat), analyst forecasts in a third (I/B/E/S). Real research requires merging these sources, and doing it wrong can silently corrupt your results.

This chapter develops the theory and practice of joins. We start with the conceptual framework—what happens when you match rows between datasets—then move to specific merge operations. Special attention goes to asof joins, essential for time-series data with mismatched timestamps. We close with diagnostic techniques to catch the merge errors that plague real-world research.

16.1 Join Types: Cardinality Matters

Before learning merge syntax, understand what happens to row counts. The relationship between keys in your datasets determines whether you end up with more rows, fewer rows, or the same number. Getting this wrong is catastrophic—you might duplicate observations, lose data, or create spurious patterns.

16.1.1 One-to-One Joins

Each key appears at most once in both datasets. This is the simplest case: rows either match or they don’t.

graph LR
    subgraph "Dataset A"
        A1[ticker: AAPL<br/>ret: 0.02]
        A2[ticker: MSFT<br/>ret: -0.01]
    end

    subgraph "Dataset B"
        B1[ticker: AAPL<br/>mktcap: 2500B]
        B2[ticker: MSFT<br/>mktcap: 2200B]
    end

    A1 -.->|match| B1
    A2 -.->|match| B2

    style A1 fill:#e1f5ff
    style A2 fill:#e1f5ff
    style B1 fill:#fff4e1
    style B2 fill:#fff4e1

One-to-one join: each key matches at most once

Example: merging monthly portfolio returns with portfolio characteristics. Each portfolio-month appears once in each dataset.

import pandas as pd
import polars as pl
from datetime import date

# Monthly portfolio returns
returns = pd.DataFrame({
    'portfolio': ['growth', 'value', 'momentum'],
    'month': ['2024-01', '2024-01', '2024-01'],
    'return': [0.025, -0.010, 0.035]
})

# Portfolio characteristics (computed separately)
characteristics = pd.DataFrame({
    'portfolio': ['growth', 'value', 'momentum'],
    'month': ['2024-01', '2024-01', '2024-01'],
    'avg_size': [5000, 3000, 4000],
    'avg_bm': [0.5, 2.5, 1.2]
})

# One-to-one merge
merged = returns.merge(
    characteristics,
    on=['portfolio', 'month'],
    validate='one_to_one'
)
merged

1: Columns that must match exactly in both DataFrames.
2: Raises MergeError if duplicate keys exist in either DataFrame.

	portfolio	month	return	avg_size	avg_bm
0	growth	2024-01	0.025	5000	0.5
1	value	2024-01	-0.010	3000	2.5
2	momentum	2024-01	0.035	4000	1.2

Always Validate Cardinality

Use validate='one_to_one' in pandas or check row counts before and after. One-to-one merges should never change the number of rows (except for unmatched keys with inner joins). If row count changes unexpectedly, you have duplicate keys somewhere.

16.1.2 One-to-Many Joins

Keys are unique in one dataset but repeated in the other. This expands rows from the unique side.

graph LR
    subgraph "Firms (One)"
        F1[permno: 10107<br/>sector: Tech]
    end

    subgraph "Returns (Many)"
        R1[permno: 10107<br/>date: 2024-01-02<br/>ret: 0.01]
        R2[permno: 10107<br/>date: 2024-01-03<br/>ret: -0.02]
        R3[permno: 10107<br/>date: 2024-01-04<br/>ret: 0.03]
    end

    F1 -.->|broadcast| R1
    F1 -.->|broadcast| R2
    F1 -.->|broadcast| R3

    style F1 fill:#e1f5ff
    style R1 fill:#fff4e1
    style R2 fill:#fff4e1
    style R3 fill:#fff4e1

One-to-many join: firm-level data matched to daily returns

Example: merging firm characteristics (one row per firm) with daily returns (many rows per firm).

# Firm characteristics (one row per firm)
firms = pd.DataFrame({
    'permno': [10107, 10107, 14593],
    'year': [2023, 2024, 2024],
    'sector': ['Tech', 'Tech', 'Finance'],
    'headquarters': ['CA', 'CA', 'NY']
})

# Daily returns (many rows per firm-year)
returns = pd.DataFrame({
    'permno': [10107, 10107, 10107, 14593, 14593],
    'year': [2024, 2024, 2024, 2024, 2024],
    'date': ['2024-01-02', '2024-01-03', '2024-01-04',
             '2024-01-02', '2024-01-03'],
    'ret': [0.01, -0.02, 0.03, 0.005, -0.001]
})

# One-to-many merge: firm characteristics broadcast to each return
merged = returns.merge(
    firms,
    on=['permno', 'year'],
    validate='many_to_one'  # Ensures firms has unique keys
)
merged

	permno	year	date	ret	sector	headquarters
0	10107	2024	2024-01-02	0.010	Tech	CA
1	10107	2024	2024-01-03	-0.020	Tech	CA
2	10107	2024	2024-01-04	0.030	Tech	CA
3	14593	2024	2024-01-02	0.005	Finance	NY
4	14593	2024	2024-01-03	-0.001	Finance	NY

Notice how sector and headquarters repeat for each date. This is broadcasting: the single firm row expands to match multiple return rows.

Watch for Unintended Duplicates

If you think you have a one-to-many join but the “one” side has duplicates, you’ll get a many-to-many join instead. This multiplies rows in ways that corrupt analyses. Always validate.

16.1.3 Many-to-Many Joins

Keys repeat in both datasets. Every combination of matching keys produces a row. This explodes row counts and is rarely what you want.

graph LR
    subgraph "Dataset A (Many)"
        A1[sector: Tech<br/>year: 2024<br/>firm: AAPL]
        A2[sector: Tech<br/>year: 2024<br/>firm: MSFT]
    end

    subgraph "Dataset B (Many)"
        B1[sector: Tech<br/>year: 2024<br/>analyst: Goldman]
        B2[sector: Tech<br/>year: 2024<br/>analyst: Morgan]
    end

    A1 -.->|match| B1
    A1 -.->|match| B2
    A2 -.->|match| B1
    A2 -.->|match| B2

    style A1 fill:#e1f5ff
    style A2 fill:#e1f5ff
    style B1 fill:#fff4e1
    style B2 fill:#fff4e1

Many-to-many join: combinatorial explosion

Example: joining firm-level data to analyst coverage by sector-year creates all combinations.

# Firms in each sector-year
firms = pd.DataFrame({
    'sector': ['Tech', 'Tech', 'Finance'],
    'year': [2024, 2024, 2024],
    'ticker': ['AAPL', 'MSFT', 'JPM'],
    'return': [0.25, 0.30, 0.15]
})

# Analyst reports by sector-year (multiple analysts per sector)
analysts = pd.DataFrame({
    'sector': ['Tech', 'Tech', 'Finance'],
    'year': [2024, 2024, 2024],
    'analyst': ['Goldman', 'Morgan', 'Citi'],
    'recommendation': ['Buy', 'Hold', 'Buy']
})

# Many-to-many merge: DANGER!
merged = firms.merge(analysts, on=['sector', 'year'])
merged

	sector	year	ticker	return	analyst	recommendation
0	Tech	2024	AAPL	0.25	Goldman	Buy
1	Tech	2024	AAPL	0.25	Morgan	Hold
2	Tech	2024	MSFT	0.30	Goldman	Buy
3	Tech	2024	MSFT	0.30	Morgan	Hold
4	Finance	2024	JPM	0.15	Citi	Buy

Notice how the row count exploded:

print(f"Original firms: {len(firms)}, analysts: {len(analysts)}, merged: {len(merged)}")

Original firms: 3, analysts: 3, merged: 5

Many-to-Many is Usually Wrong

In finance, many-to-many joins typically indicate a modeling error. You probably need to aggregate one dataset first (e.g., average analyst recommendations by sector-year) or use a more specific join key. Pandas allows many-to-many by default; consider this a bug, not a feature.

16.2 Merge Operations: Which Rows Survive?

Join type (one-to-one, etc.) describes cardinality. Merge operation (inner, outer, left, right) describes which rows to keep when keys don’t match perfectly.

16.2.1 Inner Join: Intersection Only

Keep only rows where keys match in both datasets. This is conservative—you lose any observation that doesn’t have a partner—but ensures every row has complete data.

graph TD
    subgraph "Dataset A"
        A1[AAPL]
        A2[MSFT]
        A3[TSLA]
    end

    subgraph "Dataset B"
        B1[AAPL]
        B2[GOOGL]
        B3[TSLA]
    end

    subgraph "Result"
        R1[AAPL]
        R2[TSLA]
    end

    A1 --> R1
    B1 --> R1
    A3 --> R2
    B3 --> R2

    style A2 fill:#ffcccc
    style B2 fill:#ffcccc
    style R1 fill:#ccffcc
    style R2 fill:#ccffcc

Inner join: only matching keys survive

Example: merging returns with accounting data. Only firm-years with both survive.

# Stock returns
returns = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

# Accounting data (missing MSFT)
accounting = pd.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Inner join: only AAPL and TSLA have both
inner = returns.merge(accounting, on=['ticker', 'year'], how='inner')
inner

	ticker	year	return	roa	leverage
0	AAPL	2023	0.45	0.20	0.3
1	TSLA	2023	0.10	0.05	0.1

When to Use Inner Joins

Use inner joins when you need data from both sources for your analysis. For example, computing correlations between returns and accounting ratios requires both variables. The sample restriction is a feature, not a bug—you’re explicitly limiting to observations where analysis is possible.

16.2.2 Left Join: Keep All from Left Dataset

Keep every row from the left dataset, adding data from the right where available. Unmatched right rows disappear. Missing values (NaN) fill columns from the right dataset where no match exists.

graph TD
    subgraph "Left Dataset"
        L1[AAPL]
        L2[MSFT]
        L3[TSLA]
    end

    subgraph "Right Dataset"
        R1[AAPL]
        R2[GOOGL]
        R3[TSLA]
    end

    subgraph "Result"
        RES1[AAPL + match]
        RES2[MSFT + NaN]
        RES3[TSLA + match]
    end

    L1 --> RES1
    L2 --> RES2
    L3 --> RES3
    R1 --> RES1
    R3 --> RES3

    style R2 fill:#ffcccc
    style RES1 fill:#ccffcc
    style RES2 fill:#ffffcc
    style RES3 fill:#ccffcc

Left join: all left-side rows survive, right-side data added where available

Example: keeping all returns, adding accounting data where available.

# Left join: keep all returns
left = returns.merge(accounting, on=['ticker', 'year'], how='left')
left

	ticker	year	return	roa	leverage
0	AAPL	2023	0.45	0.20	0.3
1	MSFT	2023	0.35	NaN	NaN
2	TSLA	2023	0.10	0.05	0.1

Left Joins Preserve Sample

Use left joins when the left dataset defines your sample. For example, if you’ve carefully constructed a set of stocks for analysis, use those as the left dataset and merge in additional characteristics. This ensures you keep your full sample even if some characteristics are missing.

16.2.3 Right Join: Keep All from Right Dataset

The mirror of left join. Keep every row from the right dataset, add data from left where available. In practice, just swap your datasets and use a left join—right joins exist for symmetry but rarely clarify code.

# Right join: keep all accounting observations
right = returns.merge(accounting, on=['ticker', 'year'], how='right')
right

	ticker	year	return	roa	leverage
0	AAPL	2023	0.45	0.20	0.30
1	TSLA	2023	0.10	0.05	0.10
2	GOOGL	2023	NaN	0.15	0.05

Prefer Left Joins for Clarity

Write accounting.merge(returns, how='left') instead of returns.merge(accounting, how='right'). Reading left-to-right matches execution order and makes the priority explicit: you’re starting with accounting data and adding returns.

16.2.4 Outer Join: Keep Everything

Keep all rows from both datasets. This maximizes sample size but can create large numbers of missing values.

graph TD
    subgraph "Dataset A"
        A1[AAPL]
        A2[MSFT]
        A3[TSLA]
    end

    subgraph "Dataset B"
        B1[AAPL]
        B2[GOOGL]
        B3[TSLA]
    end

    subgraph "Result"
        R1[AAPL + match]
        R2[MSFT + NaN]
        R3[TSLA + match]
        R4[GOOGL + NaN]
    end

    A1 --> R1
    A2 --> R2
    A3 --> R3
    B1 --> R1
    B2 --> R4
    B3 --> R3

    style R1 fill:#ccffcc
    style R2 fill:#ffffcc
    style R3 fill:#ccffcc
    style R4 fill:#ffffcc

Outer join: all rows from both datasets survive

Example: keeping all tickers from either dataset.

# Outer join: keep everything
outer = returns.merge(accounting, on=['ticker', 'year'], how='outer')
outer

	ticker	year	return	roa	leverage
0	AAPL	2023	0.45	0.20	0.30
1	GOOGL	2023	NaN	0.15	0.05
2	MSFT	2023	0.35	NaN	NaN
3	TSLA	2023	0.10	0.05	0.10

Outer Joins Hide Problems

Outer joins are dangerous because they succeed even when keys match poorly. You might merge stock returns with bond yields on date, get mostly NaN, but never realize the datasets use different date formats. Always inspect outer join results carefully and report match rates.

16.2.5 Polars Merge Syntax

Polars uses similar concepts but different method names. The key difference: Polars is more explicit about join types and discourages many-to-many joins by default.

# Same data in Polars
returns_pl = pl.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

accounting_pl = pl.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Inner join
inner_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='inner'
)
inner_pl

shape: (2, 5)

ticker	year	return	roa	leverage
str	i64	f64	f64	f64
"AAPL"	2023	0.45	0.2	0.3
"TSLA"	2023	0.1	0.05	0.1

# Left join
left_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='left'
)
left_pl

shape: (3, 5)

ticker	year	return	roa	leverage
str	i64	f64	f64	f64
"AAPL"	2023	0.45	0.2	0.3
"MSFT"	2023	0.35	null	null
"TSLA"	2023	0.1	0.05	0.1

# Outer join (called 'full' in Polars)
outer_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='full'
)
outer_pl

shape: (4, 7)

ticker	year	return	ticker_right	year_right	roa	leverage
str	i64	f64	str	i64	f64	f64
"AAPL"	2023	0.45	"AAPL"	2023	0.2	0.3
"TSLA"	2023	0.1	"TSLA"	2023	0.05	0.1
null	null	null	"GOOGL"	2023	0.15	0.05
"MSFT"	2023	0.35	null	null	null	null

Polars also provides a validate parameter similar to pandas:

# Validate one-to-one
inner_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='inner',
    validate='1:1'  # Raises error if violated
)

16.3 Asof Joins for Time-Series Data

Financial data rarely aligns perfectly in time. You have daily stock prices, quarterly earnings announcements, irregular analyst forecast updates. Matching these requires asof joins: find the most recent value as of each timestamp, without requiring exact matches.

16.3.1 The Asof Join Problem

Suppose you want to merge earnings announcement dates with stock returns. Earnings come out on irregular dates (whenever the company reports). You need to tag each daily return with the most recent earnings announcement as of that date. A standard merge fails because announcement dates don’t match return dates. An asof join solves this: for each return date, find the most recent announcement date on or before that date.

16.3.2 Asof Join Mechanics

Asof joins require sorted data and directional matching. The syntax varies between pandas and Polars, but the logic is identical:

Sort both datasets by time
For each row in the left dataset, find the row in the right dataset with the closest timestamp that doesn’t exceed the left timestamp (backward-looking)
Optionally match exactly on other columns (e.g., ticker)

Example: merging daily returns with quarterly earnings.

import numpy as np

# Daily returns (many observations)
returns = pd.DataFrame({
    'ticker': ['AAPL'] * 10,
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'return': np.random.randn(10) * 0.02
})

# Earnings announcements (sparse, irregular)
earnings = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'announce_date': pd.to_datetime(['2023-12-28', '2024-01-03', '2024-01-08']),
    'eps': [1.25, 1.30, 1.35],
    'surprise': [0.02, -0.01, 0.03]
})

# CRITICAL: Sort by time
returns = returns.sort_values('date')
earnings = earnings.sort_values('announce_date')

# Asof join: match each return to most recent earnings
merged = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='backward'
)

merged[['date', 'return', 'announce_date', 'eps', 'surprise']]

1: Both DataFrames must be sorted by their time columns before asof join.
2: Time columns to match on (can have different names).
3: Additional columns that must match exactly.
4: Use most recent prior value (avoids look-ahead bias).

	date	return	announce_date	eps	surprise
0	2024-01-01	-0.032470	2023-12-28	1.25	0.02
1	2024-01-02	0.023386	2023-12-28	1.25	0.02
2	2024-01-03	-0.040501	2024-01-03	1.30	-0.01
3	2024-01-04	-0.017916	2024-01-03	1.30	-0.01
4	2024-01-05	-0.004434	2024-01-03	1.30	-0.01
5	2024-01-06	-0.020436	2024-01-03	1.30	-0.01
6	2024-01-07	-0.024500	2024-01-03	1.30	-0.01
7	2024-01-08	0.046251	2024-01-08	1.35	0.03
8	2024-01-09	-0.013361	2024-01-08	1.35	0.03
9	2024-01-10	0.024632	2024-01-08	1.35	0.03

Notice how the earnings data “fills forward”:

Jan 1-2 use the Dec 28 announcement
Jan 3-7 use the Jan 3 announcement
Jan 8+ use the Jan 8 announcement

This is exactly what you want for event studies: tag each return with the most recent earnings information available at that time.

Asof Joins Require Sorted Data

Both datasets must be sorted by the time column. Pandas doesn’t always check this, and unsorted data produces wrong results silently. Always sort explicitly before asof joins.

16.3.3 Asof Join Directions

The direction parameter controls which timestamp to match:

direction='backward' (default): Use most recent prior value (as of semantics)
direction='forward': Use next future value (rare in finance)
direction='nearest': Use closest value in either direction (dangerous for event studies—can introduce look-ahead bias)

# Forward-looking (CAREFUL: creates look-ahead bias)
forward = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='forward'  # Use next announcement
)

# Nearest (AVOID for event studies)
nearest = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='nearest'  # Use closest announcement
)

Beware Look-Ahead Bias

Using direction='forward' or direction='nearest' can introduce look-ahead bias: your analysis uses information that didn’t exist at the time. For example, tagging a January return with a February earnings announcement. This inflates backtested strategy returns and invalidates empirical tests. Stick with direction='backward' unless you have a specific reason to do otherwise.

16.3.4 Asof Joins in Polars

Polars provides asof joins through the join_asof method. The syntax is cleaner and performance is often better for large datasets.

# Same data in Polars
returns_pl = pl.DataFrame({
    'ticker': ['AAPL'] * 10,
    'date': pl.date_range(
        date(2024, 1, 1),
        date(2024, 1, 10),
        interval='1d',
        eager=True
    ),
    'return': np.random.randn(10) * 0.02
})

earnings_pl = pl.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'announce_date': [
        date(2023, 12, 28),
        date(2024, 1, 3),
        date(2024, 1, 8)
    ],
    'eps': [1.25, 1.30, 1.35],
    'surprise': [0.02, -0.01, 0.03]
})

# Asof join in Polars
merged_pl = returns_pl.sort('date').join_asof(
    earnings_pl.sort('announce_date'),
    left_on='date',
    right_on='announce_date',
    by='ticker',
    strategy='backward'  # Equivalent to pandas direction='backward'
)

merged_pl.select(['date', 'return', 'announce_date', 'eps', 'surprise'])

/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_65713/4247637574.py:25: UserWarning: Sortedness of columns cannot be checked when 'by' groups provided
  merged_pl = returns_pl.sort('date').join_asof(

shape: (10, 5)

date	return	announce_date	eps	surprise
date	f64	date	f64	f64
2024-01-01	0.01453	2023-12-28	1.25	0.02
2024-01-02	0.001239	2023-12-28	1.25	0.02
2024-01-03	0.016048	2024-01-03	1.3	-0.01
2024-01-04	-0.02854	2024-01-03	1.3	-0.01
2024-01-05	0.01889	2024-01-03	1.3	-0.01
2024-01-06	-0.005555	2024-01-03	1.3	-0.01
2024-01-07	0.045201	2024-01-03	1.3	-0.01
2024-01-08	-0.029402	2024-01-08	1.35	0.03
2024-01-09	-0.021957	2024-01-08	1.35	0.03
2024-01-10	0.026204	2024-01-08	1.35	0.03

Polars also supports tolerance (maximum time difference) and handles missing values more explicitly:

# Only match if announcement is within 90 days
merged_pl = returns_pl.sort('date').join_asof(
    earnings_pl.sort('announce_date'),
    left_on='date',
    right_on='announce_date',
    by='ticker',
    strategy='backward',
    tolerance='90d'  # NaN if no announcement within 90 days
)

16.3.5 Common Asof Join Patterns in Finance

Pattern 1: Point-in-time accounting data

Merge daily returns with quarterly accounting variables. Each return should use the most recent financial statement available at that date.

# Daily returns
returns = pd.DataFrame({
    'permno': [10107] * 100,
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'ret': np.random.randn(100) * 0.02
})

# Quarterly accounting (fiscal quarter end dates)
accounting = pd.DataFrame({
    'permno': [10107, 10107, 10107, 10107],
    'datadate': pd.to_datetime(['2023-09-30', '2023-12-31',
                                '2024-03-31', '2024-06-30']),
    'book_equity': [1000, 1100, 1150, 1200],
    'total_assets': [5000, 5200, 5300, 5400]
})

# Asof join: each return uses most recent accounting data
merged = pd.merge_asof(
    returns.sort_values('date'),
    accounting.sort_values('datadate'),
    left_on='date',
    right_on='datadate',
    by='permno',
    direction='backward'
)

Pattern 2: Analyst forecast updates

Analyst forecasts arrive irregularly. Tag each return with the current consensus forecast.

# Daily returns
returns = pd.DataFrame({
    'ticker': ['AAPL'] * 30,
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'ret': np.random.randn(30) * 0.02
})

# Analyst forecast updates (irregular)
forecasts = pd.DataFrame({
    'ticker': ['AAPL'] * 5,
    'forecast_date': pd.to_datetime([
        '2023-12-15', '2024-01-05', '2024-01-12',
        '2024-01-20', '2024-01-28'
    ]),
    'consensus_eps': [5.25, 5.30, 5.28, 5.35, 5.40],
    'num_analysts': [25, 26, 27, 28, 29]
})

# Asof join: current forecast as of each date
merged = pd.merge_asof(
    returns.sort_values('date'),
    forecasts.sort_values('forecast_date'),
    left_on='date',
    right_on='forecast_date',
    by='ticker',
    direction='backward'
)

Pattern 3: Bid-ask spreads and trades

Match each trade to the prevailing bid-ask spread (within milliseconds).

# Trades (exact timestamps)
trades = pd.DataFrame({
    'symbol': ['AAPL'] * 5,
    'trade_time': pd.to_datetime([
        '2024-01-02 09:30:00.123',
        '2024-01-02 09:30:00.456',
        '2024-01-02 09:30:01.789',
        '2024-01-02 09:30:02.012',
        '2024-01-02 09:30:03.345'
    ]),
    'trade_price': [150.25, 150.26, 150.24, 150.25, 150.27],
    'trade_size': [100, 200, 150, 300, 250]
})

# Quote updates (irregular, microsecond timestamps)
quotes = pd.DataFrame({
    'symbol': ['AAPL'] * 4,
    'quote_time': pd.to_datetime([
        '2024-01-02 09:30:00.100',
        '2024-01-02 09:30:00.500',
        '2024-01-02 09:30:02.000',
        '2024-01-02 09:30:03.000'
    ]),
    'bid': [150.24, 150.25, 150.23, 150.26],
    'ask': [150.26, 150.27, 150.25, 150.28]
})

# Asof join: prevailing quote at each trade
merged = pd.merge_asof(
    trades.sort_values('trade_time'),
    quotes.sort_values('quote_time'),
    left_on='trade_time',
    right_on='quote_time',
    by='symbol',
    direction='backward',
    tolerance=pd.Timedelta('1s'),  # Only match within 1 second
    allow_exact_matches=False  # Use quote *before* the trade, not at same instant
)

# Calculate effective spread
merged['spread'] = merged['ask'] - merged['bid']
merged['effective_spread'] = abs(merged['trade_price'] -
                                  (merged['bid'] + merged['ask']) / 2)

16.4 Diagnosing and Correcting Merge Errors

Merges fail silently. You run the code, get a DataFrame back, and assume it worked. But maybe 80% of rows didn’t match. Maybe you have duplicate keys creating a many-to-many join. Maybe your date formats don’t align. The code runs, but your results are garbage.

Run the following checks every time until they become automatic.

16.4.1 Essential checks before and after merging

Before merging:

Predict row counts: Inner join should give ≤ min(left, right) rows; left join should preserve left row count for one-to-one or one-to-many; outer join gives ≥ max(left, right) rows
Check for duplicate keys: Use df.duplicated(key_cols).sum() or validate parameter to enforce cardinality
Verify data types match: Use df[key_cols].dtypes to ensure join keys have compatible types
For string keys: Normalize with .str.strip().str.upper() to handle whitespace and case differences
For dates: Ensure both use datetime64, same time zone, and same granularity

After merging:

Check row count changes: Unexpected changes indicate duplicate keys or wrong join type
Check match rates: Use indicator=True to see how many rows matched (use merged['_merge'].value_counts())
Inspect missing values: NaN in right-side columns after left join indicates failed matches

16.4.2 Using the indicator parameter

The indicator=True parameter adds a _merge column showing match status:

# Re-create sample data for merge diagnostics
returns = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

accounting = pd.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Left join with indicator
merged = returns.merge(
    accounting,
    on=['ticker', 'year'],
    how='left',
    indicator=True
)

# Check match rates
merged['_merge'].value_counts()

1: Adds a _merge column showing whether each row matched: left_only, right_only, or both.

_merge
both          2
left_only     1
right_only    0
Name: count, dtype: int64

print(f"Match rate: {(merged['_merge'] == 'both').mean():.1%}")

Match rate: 66.7%

Investigate low match rates

A match rate below 90% usually indicates problems: incorrect join keys, different identifier schemes (CUSIP vs PERMNO), date format mismatches, inconsistent capitalization, or datasets filtered to different subsets. Don’t just accept low match rates—understand why they occurred.

16.4.3 Common merge error patterns

Error	Symptom	Cause	Solution
Cartesian explosion	Row count explodes (10,000x+ more rows)	Duplicate keys in both datasets	Aggregate one dataset or use more specific join keys
Zero matches	All rows are `left_only`	Keys don’t overlap at all	Check identifier mapping (CUSIP vs PERMNO), date formats
Type mismatch	Low or zero matches despite overlapping keys	Different data types (int vs float, str vs object)	Convert both to same type before merge
Float precision	Values that should match don’t	`150.25` vs `150.25000001`	Round floats or use integer keys
Time zone mismatch	Asof join produces all NaN	One dataset in UTC, other in local time	Convert both to same time zone
Case/whitespace	Low matches on string keys	“AAPL” vs “aapl” vs ” AAPL”	Use `.str.strip().str.upper()`
Unsorted asof join	Wrong matches or NaN	Data not sorted by time column	Always `sort_values()` before asof joins

16.5 Summary and Best Practices

Merging is where most data errors occur in empirical finance. Follow these practices to catch problems early:

Understand cardinality before merging. Is this one-to-one, one-to-many, or many-to-many? Use validate to enforce expectations.
Choose the right merge type. Inner for complete cases only, left to preserve your sample, outer rarely (and carefully).
Use asof joins for time-series data. Don’t try to match timestamps exactly—use asof joins with direction='backward' to avoid look-ahead bias.
Validate every merge. Check row counts, match rates, duplicates, and data types until these checks become automatic.
Sort before asof joins. Unsorted data produces silent errors.
Document match rates. Report how many observations matched in your analysis. Low match rates indicate problems.
Normalize strings. Strip whitespace, standardize case, remove special characters before merging on text.
Use indicator columns. Add indicator=True to understand what matched and what didn’t.
Prefer left joins for clarity. Write base.merge(extra, how='left') instead of extra.merge(base, how='right').
Consider Polars for large datasets. Polars’ lazy evaluation and better error messages can save time on big merges.

The datasets in empirical finance are messy. Companies change tickers, databases use different identifiers, timestamps don’t align. Merging correctly requires skepticism, validation, and careful attention to detail. The techniques in this chapter won’t eliminate merge errors, but they’ll help you catch them before they corrupt your research.