16  Joins and Merges

Combining datasets is fundamental to empirical finance. Stock returns live in one database (CRSP), accounting data in another (Compustat), analyst forecasts in a third (I/B/E/S). Real research requires merging these sources, and doing it wrong can silently corrupt your results.

This chapter develops the theory and practice of joins. We start with the conceptual framework—what happens when you match rows between datasets—then move to specific merge operations. Special attention goes to asof joins, essential for time-series data with mismatched timestamps. We close with diagnostic techniques to catch the merge errors that plague real-world research.

16.1 Join Types: Cardinality Matters

Before learning merge syntax, understand what happens to row counts. The relationship between keys in your datasets determines whether you end up with more rows, fewer rows, or the same number. Getting this wrong is catastrophic—you might duplicate observations, lose data, or create spurious patterns.

16.1.1 One-to-One Joins

Each key appears at most once in both datasets. This is the simplest case: rows either match or they don’t.

graph LR
    subgraph "Dataset A"
        A1[ticker: AAPL<br/>ret: 0.02]
        A2[ticker: MSFT<br/>ret: -0.01]
    end

    subgraph "Dataset B"
        B1[ticker: AAPL<br/>mktcap: 2500B]
        B2[ticker: MSFT<br/>mktcap: 2200B]
    end

    A1 -.->|match| B1
    A2 -.->|match| B2

    style A1 fill:#e1f5ff
    style A2 fill:#e1f5ff
    style B1 fill:#fff4e1
    style B2 fill:#fff4e1

One-to-one join: each key matches at most once

Example: merging monthly portfolio returns with portfolio characteristics. Each portfolio-month appears once in each dataset.

import pandas as pd
import polars as pl
from datetime import date

# Monthly portfolio returns
returns = pd.DataFrame({
    'portfolio': ['growth', 'value', 'momentum'],
    'month': ['2024-01', '2024-01', '2024-01'],
    'return': [0.025, -0.010, 0.035]
})

# Portfolio characteristics (computed separately)
characteristics = pd.DataFrame({
    'portfolio': ['growth', 'value', 'momentum'],
    'month': ['2024-01', '2024-01', '2024-01'],
    'avg_size': [5000, 3000, 4000],
    'avg_bm': [0.5, 2.5, 1.2]
})

# One-to-one merge
merged = returns.merge(
    characteristics,
    on=['portfolio', 'month'],
    validate='one_to_one'
)
merged
1
Columns that must match exactly in both DataFrames.
2
Raises MergeError if duplicate keys exist in either DataFrame.
portfolio month return avg_size avg_bm
0 growth 2024-01 0.025 5000 0.5
1 value 2024-01 -0.010 3000 2.5
2 momentum 2024-01 0.035 4000 1.2
ImportantAlways Validate Cardinality

Use validate='one_to_one' in pandas or check row counts before and after. One-to-one merges should never change the number of rows (except for unmatched keys with inner joins). If row count changes unexpectedly, you have duplicate keys somewhere.

16.1.2 One-to-Many Joins

Keys are unique in one dataset but repeated in the other. This expands rows from the unique side.

graph LR
    subgraph "Firms (One)"
        F1[permno: 10107<br/>sector: Tech]
    end

    subgraph "Returns (Many)"
        R1[permno: 10107<br/>date: 2024-01-02<br/>ret: 0.01]
        R2[permno: 10107<br/>date: 2024-01-03<br/>ret: -0.02]
        R3[permno: 10107<br/>date: 2024-01-04<br/>ret: 0.03]
    end

    F1 -.->|broadcast| R1
    F1 -.->|broadcast| R2
    F1 -.->|broadcast| R3

    style F1 fill:#e1f5ff
    style R1 fill:#fff4e1
    style R2 fill:#fff4e1
    style R3 fill:#fff4e1

One-to-many join: firm-level data matched to daily returns

Example: merging firm characteristics (one row per firm) with daily returns (many rows per firm).

# Firm characteristics (one row per firm)
firms = pd.DataFrame({
    'permno': [10107, 10107, 14593],
    'year': [2023, 2024, 2024],
    'sector': ['Tech', 'Tech', 'Finance'],
    'headquarters': ['CA', 'CA', 'NY']
})

# Daily returns (many rows per firm-year)
returns = pd.DataFrame({
    'permno': [10107, 10107, 10107, 14593, 14593],
    'year': [2024, 2024, 2024, 2024, 2024],
    'date': ['2024-01-02', '2024-01-03', '2024-01-04',
             '2024-01-02', '2024-01-03'],
    'ret': [0.01, -0.02, 0.03, 0.005, -0.001]
})

# One-to-many merge: firm characteristics broadcast to each return
merged = returns.merge(
    firms,
    on=['permno', 'year'],
    validate='many_to_one'  # Ensures firms has unique keys
)
merged
permno year date ret sector headquarters
0 10107 2024 2024-01-02 0.010 Tech CA
1 10107 2024 2024-01-03 -0.020 Tech CA
2 10107 2024 2024-01-04 0.030 Tech CA
3 14593 2024 2024-01-02 0.005 Finance NY
4 14593 2024 2024-01-03 -0.001 Finance NY

Notice how sector and headquarters repeat for each date. This is broadcasting: the single firm row expands to match multiple return rows.

WarningWatch for Unintended Duplicates

If you think you have a one-to-many join but the “one” side has duplicates, you’ll get a many-to-many join instead. This multiplies rows in ways that corrupt analyses. Always validate.

16.1.3 Many-to-Many Joins

Keys repeat in both datasets. Every combination of matching keys produces a row. This explodes row counts and is rarely what you want.

graph LR
    subgraph "Dataset A (Many)"
        A1[sector: Tech<br/>year: 2024<br/>firm: AAPL]
        A2[sector: Tech<br/>year: 2024<br/>firm: MSFT]
    end

    subgraph "Dataset B (Many)"
        B1[sector: Tech<br/>year: 2024<br/>analyst: Goldman]
        B2[sector: Tech<br/>year: 2024<br/>analyst: Morgan]
    end

    A1 -.->|match| B1
    A1 -.->|match| B2
    A2 -.->|match| B1
    A2 -.->|match| B2

    style A1 fill:#e1f5ff
    style A2 fill:#e1f5ff
    style B1 fill:#fff4e1
    style B2 fill:#fff4e1

Many-to-many join: combinatorial explosion

Example: joining firm-level data to analyst coverage by sector-year creates all combinations.

# Firms in each sector-year
firms = pd.DataFrame({
    'sector': ['Tech', 'Tech', 'Finance'],
    'year': [2024, 2024, 2024],
    'ticker': ['AAPL', 'MSFT', 'JPM'],
    'return': [0.25, 0.30, 0.15]
})

# Analyst reports by sector-year (multiple analysts per sector)
analysts = pd.DataFrame({
    'sector': ['Tech', 'Tech', 'Finance'],
    'year': [2024, 2024, 2024],
    'analyst': ['Goldman', 'Morgan', 'Citi'],
    'recommendation': ['Buy', 'Hold', 'Buy']
})

# Many-to-many merge: DANGER!
merged = firms.merge(analysts, on=['sector', 'year'])
merged
sector year ticker return analyst recommendation
0 Tech 2024 AAPL 0.25 Goldman Buy
1 Tech 2024 AAPL 0.25 Morgan Hold
2 Tech 2024 MSFT 0.30 Goldman Buy
3 Tech 2024 MSFT 0.30 Morgan Hold
4 Finance 2024 JPM 0.15 Citi Buy

Notice how the row count exploded:

print(f"Original firms: {len(firms)}, analysts: {len(analysts)}, merged: {len(merged)}")
Original firms: 3, analysts: 3, merged: 5
CautionMany-to-Many is Usually Wrong

In finance, many-to-many joins typically indicate a modeling error. You probably need to aggregate one dataset first (e.g., average analyst recommendations by sector-year) or use a more specific join key. Pandas allows many-to-many by default; consider this a bug, not a feature.

16.2 Merge Operations: Which Rows Survive?

Join type (one-to-one, etc.) describes cardinality. Merge operation (inner, outer, left, right) describes which rows to keep when keys don’t match perfectly.

16.2.1 Inner Join: Intersection Only

Keep only rows where keys match in both datasets. This is conservative—you lose any observation that doesn’t have a partner—but ensures every row has complete data.

graph TD
    subgraph "Dataset A"
        A1[AAPL]
        A2[MSFT]
        A3[TSLA]
    end

    subgraph "Dataset B"
        B1[AAPL]
        B2[GOOGL]
        B3[TSLA]
    end

    subgraph "Result"
        R1[AAPL]
        R2[TSLA]
    end

    A1 --> R1
    B1 --> R1
    A3 --> R2
    B3 --> R2

    style A2 fill:#ffcccc
    style B2 fill:#ffcccc
    style R1 fill:#ccffcc
    style R2 fill:#ccffcc

Inner join: only matching keys survive

Example: merging returns with accounting data. Only firm-years with both survive.

# Stock returns
returns = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

# Accounting data (missing MSFT)
accounting = pd.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Inner join: only AAPL and TSLA have both
inner = returns.merge(accounting, on=['ticker', 'year'], how='inner')
inner
ticker year return roa leverage
0 AAPL 2023 0.45 0.20 0.3
1 TSLA 2023 0.10 0.05 0.1
TipWhen to Use Inner Joins

Use inner joins when you need data from both sources for your analysis. For example, computing correlations between returns and accounting ratios requires both variables. The sample restriction is a feature, not a bug—you’re explicitly limiting to observations where analysis is possible.

16.2.2 Left Join: Keep All from Left Dataset

Keep every row from the left dataset, adding data from the right where available. Unmatched right rows disappear. Missing values (NaN) fill columns from the right dataset where no match exists.

graph TD
    subgraph "Left Dataset"
        L1[AAPL]
        L2[MSFT]
        L3[TSLA]
    end

    subgraph "Right Dataset"
        R1[AAPL]
        R2[GOOGL]
        R3[TSLA]
    end

    subgraph "Result"
        RES1[AAPL + match]
        RES2[MSFT + NaN]
        RES3[TSLA + match]
    end

    L1 --> RES1
    L2 --> RES2
    L3 --> RES3
    R1 --> RES1
    R3 --> RES3

    style R2 fill:#ffcccc
    style RES1 fill:#ccffcc
    style RES2 fill:#ffffcc
    style RES3 fill:#ccffcc

Left join: all left-side rows survive, right-side data added where available

Example: keeping all returns, adding accounting data where available.

# Left join: keep all returns
left = returns.merge(accounting, on=['ticker', 'year'], how='left')
left
ticker year return roa leverage
0 AAPL 2023 0.45 0.20 0.3
1 MSFT 2023 0.35 NaN NaN
2 TSLA 2023 0.10 0.05 0.1
NoteLeft Joins Preserve Sample

Use left joins when the left dataset defines your sample. For example, if you’ve carefully constructed a set of stocks for analysis, use those as the left dataset and merge in additional characteristics. This ensures you keep your full sample even if some characteristics are missing.

16.2.3 Right Join: Keep All from Right Dataset

The mirror of left join. Keep every row from the right dataset, add data from left where available. In practice, just swap your datasets and use a left join—right joins exist for symmetry but rarely clarify code.

# Right join: keep all accounting observations
right = returns.merge(accounting, on=['ticker', 'year'], how='right')
right
ticker year return roa leverage
0 AAPL 2023 0.45 0.20 0.30
1 TSLA 2023 0.10 0.05 0.10
2 GOOGL 2023 NaN 0.15 0.05
TipPrefer Left Joins for Clarity

Write accounting.merge(returns, how='left') instead of returns.merge(accounting, how='right'). Reading left-to-right matches execution order and makes the priority explicit: you’re starting with accounting data and adding returns.

16.2.4 Outer Join: Keep Everything

Keep all rows from both datasets. This maximizes sample size but can create large numbers of missing values.

graph TD
    subgraph "Dataset A"
        A1[AAPL]
        A2[MSFT]
        A3[TSLA]
    end

    subgraph "Dataset B"
        B1[AAPL]
        B2[GOOGL]
        B3[TSLA]
    end

    subgraph "Result"
        R1[AAPL + match]
        R2[MSFT + NaN]
        R3[TSLA + match]
        R4[GOOGL + NaN]
    end

    A1 --> R1
    A2 --> R2
    A3 --> R3
    B1 --> R1
    B2 --> R4
    B3 --> R3

    style R1 fill:#ccffcc
    style R2 fill:#ffffcc
    style R3 fill:#ccffcc
    style R4 fill:#ffffcc

Outer join: all rows from both datasets survive

Example: keeping all tickers from either dataset.

# Outer join: keep everything
outer = returns.merge(accounting, on=['ticker', 'year'], how='outer')
outer
ticker year return roa leverage
0 AAPL 2023 0.45 0.20 0.30
1 GOOGL 2023 NaN 0.15 0.05
2 MSFT 2023 0.35 NaN NaN
3 TSLA 2023 0.10 0.05 0.10
WarningOuter Joins Hide Problems

Outer joins are dangerous because they succeed even when keys match poorly. You might merge stock returns with bond yields on date, get mostly NaN, but never realize the datasets use different date formats. Always inspect outer join results carefully and report match rates.

16.2.5 Polars Merge Syntax

Polars uses similar concepts but different method names. The key difference: Polars is more explicit about join types and discourages many-to-many joins by default.

# Same data in Polars
returns_pl = pl.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

accounting_pl = pl.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Inner join
inner_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='inner'
)
inner_pl
shape: (2, 5)
ticker year return roa leverage
str i64 f64 f64 f64
"AAPL" 2023 0.45 0.2 0.3
"TSLA" 2023 0.1 0.05 0.1
# Left join
left_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='left'
)
left_pl
shape: (3, 5)
ticker year return roa leverage
str i64 f64 f64 f64
"AAPL" 2023 0.45 0.2 0.3
"MSFT" 2023 0.35 null null
"TSLA" 2023 0.1 0.05 0.1
# Outer join (called 'full' in Polars)
outer_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='full'
)
outer_pl
shape: (4, 7)
ticker year return ticker_right year_right roa leverage
str i64 f64 str i64 f64 f64
"AAPL" 2023 0.45 "AAPL" 2023 0.2 0.3
"TSLA" 2023 0.1 "TSLA" 2023 0.05 0.1
null null null "GOOGL" 2023 0.15 0.05
"MSFT" 2023 0.35 null null null null

Polars also provides a validate parameter similar to pandas:

# Validate one-to-one
inner_pl = returns_pl.join(
    accounting_pl,
    on=['ticker', 'year'],
    how='inner',
    validate='1:1'  # Raises error if violated
)

16.3 Asof Joins for Time-Series Data

Financial data rarely aligns perfectly in time. You have daily stock prices, quarterly earnings announcements, irregular analyst forecast updates. Matching these requires asof joins: find the most recent value as of each timestamp, without requiring exact matches.

16.3.1 The Asof Join Problem

Suppose you want to merge earnings announcement dates with stock returns. Earnings come out on irregular dates (whenever the company reports). You need to tag each daily return with the most recent earnings announcement as of that date. A standard merge fails because announcement dates don’t match return dates. An asof join solves this: for each return date, find the most recent announcement date on or before that date.

16.3.2 Asof Join Mechanics

Asof joins require sorted data and directional matching. The syntax varies between pandas and Polars, but the logic is identical:

  1. Sort both datasets by time
  2. For each row in the left dataset, find the row in the right dataset with the closest timestamp that doesn’t exceed the left timestamp (backward-looking)
  3. Optionally match exactly on other columns (e.g., ticker)

Example: merging daily returns with quarterly earnings.

import numpy as np

# Daily returns (many observations)
returns = pd.DataFrame({
    'ticker': ['AAPL'] * 10,
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'return': np.random.randn(10) * 0.02
})

# Earnings announcements (sparse, irregular)
earnings = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'announce_date': pd.to_datetime(['2023-12-28', '2024-01-03', '2024-01-08']),
    'eps': [1.25, 1.30, 1.35],
    'surprise': [0.02, -0.01, 0.03]
})

# CRITICAL: Sort by time
returns = returns.sort_values('date')
earnings = earnings.sort_values('announce_date')

# Asof join: match each return to most recent earnings
merged = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='backward'
)

merged[['date', 'return', 'announce_date', 'eps', 'surprise']]
1
Both DataFrames must be sorted by their time columns before asof join.
2
Time columns to match on (can have different names).
3
Additional columns that must match exactly.
4
Use most recent prior value (avoids look-ahead bias).
date return announce_date eps surprise
0 2024-01-01 -0.032470 2023-12-28 1.25 0.02
1 2024-01-02 0.023386 2023-12-28 1.25 0.02
2 2024-01-03 -0.040501 2024-01-03 1.30 -0.01
3 2024-01-04 -0.017916 2024-01-03 1.30 -0.01
4 2024-01-05 -0.004434 2024-01-03 1.30 -0.01
5 2024-01-06 -0.020436 2024-01-03 1.30 -0.01
6 2024-01-07 -0.024500 2024-01-03 1.30 -0.01
7 2024-01-08 0.046251 2024-01-08 1.35 0.03
8 2024-01-09 -0.013361 2024-01-08 1.35 0.03
9 2024-01-10 0.024632 2024-01-08 1.35 0.03

Notice how the earnings data “fills forward”:

  • Jan 1-2 use the Dec 28 announcement
  • Jan 3-7 use the Jan 3 announcement
  • Jan 8+ use the Jan 8 announcement

This is exactly what you want for event studies: tag each return with the most recent earnings information available at that time.

ImportantAsof Joins Require Sorted Data

Both datasets must be sorted by the time column. Pandas doesn’t always check this, and unsorted data produces wrong results silently. Always sort explicitly before asof joins.

16.3.3 Asof Join Directions

The direction parameter controls which timestamp to match:

  • direction='backward' (default): Use most recent prior value (as of semantics)
  • direction='forward': Use next future value (rare in finance)
  • direction='nearest': Use closest value in either direction (dangerous for event studies—can introduce look-ahead bias)
# Forward-looking (CAREFUL: creates look-ahead bias)
forward = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='forward'  # Use next announcement
)

# Nearest (AVOID for event studies)
nearest = pd.merge_asof(
    returns,
    earnings,
    left_on='date',
    right_on='announce_date',
    by='ticker',
    direction='nearest'  # Use closest announcement
)
WarningBeware Look-Ahead Bias

Using direction='forward' or direction='nearest' can introduce look-ahead bias: your analysis uses information that didn’t exist at the time. For example, tagging a January return with a February earnings announcement. This inflates backtested strategy returns and invalidates empirical tests. Stick with direction='backward' unless you have a specific reason to do otherwise.

16.3.4 Asof Joins in Polars

Polars provides asof joins through the join_asof method. The syntax is cleaner and performance is often better for large datasets.

# Same data in Polars
returns_pl = pl.DataFrame({
    'ticker': ['AAPL'] * 10,
    'date': pl.date_range(
        date(2024, 1, 1),
        date(2024, 1, 10),
        interval='1d',
        eager=True
    ),
    'return': np.random.randn(10) * 0.02
})

earnings_pl = pl.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'announce_date': [
        date(2023, 12, 28),
        date(2024, 1, 3),
        date(2024, 1, 8)
    ],
    'eps': [1.25, 1.30, 1.35],
    'surprise': [0.02, -0.01, 0.03]
})

# Asof join in Polars
merged_pl = returns_pl.sort('date').join_asof(
    earnings_pl.sort('announce_date'),
    left_on='date',
    right_on='announce_date',
    by='ticker',
    strategy='backward'  # Equivalent to pandas direction='backward'
)

merged_pl.select(['date', 'return', 'announce_date', 'eps', 'surprise'])
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_65713/4247637574.py:25: UserWarning: Sortedness of columns cannot be checked when 'by' groups provided
  merged_pl = returns_pl.sort('date').join_asof(
shape: (10, 5)
date return announce_date eps surprise
date f64 date f64 f64
2024-01-01 0.01453 2023-12-28 1.25 0.02
2024-01-02 0.001239 2023-12-28 1.25 0.02
2024-01-03 0.016048 2024-01-03 1.3 -0.01
2024-01-04 -0.02854 2024-01-03 1.3 -0.01
2024-01-05 0.01889 2024-01-03 1.3 -0.01
2024-01-06 -0.005555 2024-01-03 1.3 -0.01
2024-01-07 0.045201 2024-01-03 1.3 -0.01
2024-01-08 -0.029402 2024-01-08 1.35 0.03
2024-01-09 -0.021957 2024-01-08 1.35 0.03
2024-01-10 0.026204 2024-01-08 1.35 0.03

Polars also supports tolerance (maximum time difference) and handles missing values more explicitly:

# Only match if announcement is within 90 days
merged_pl = returns_pl.sort('date').join_asof(
    earnings_pl.sort('announce_date'),
    left_on='date',
    right_on='announce_date',
    by='ticker',
    strategy='backward',
    tolerance='90d'  # NaN if no announcement within 90 days
)

16.3.5 Common Asof Join Patterns in Finance

Pattern 1: Point-in-time accounting data

Merge daily returns with quarterly accounting variables. Each return should use the most recent financial statement available at that date.

# Daily returns
returns = pd.DataFrame({
    'permno': [10107] * 100,
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'ret': np.random.randn(100) * 0.02
})

# Quarterly accounting (fiscal quarter end dates)
accounting = pd.DataFrame({
    'permno': [10107, 10107, 10107, 10107],
    'datadate': pd.to_datetime(['2023-09-30', '2023-12-31',
                                '2024-03-31', '2024-06-30']),
    'book_equity': [1000, 1100, 1150, 1200],
    'total_assets': [5000, 5200, 5300, 5400]
})

# Asof join: each return uses most recent accounting data
merged = pd.merge_asof(
    returns.sort_values('date'),
    accounting.sort_values('datadate'),
    left_on='date',
    right_on='datadate',
    by='permno',
    direction='backward'
)

Pattern 2: Analyst forecast updates

Analyst forecasts arrive irregularly. Tag each return with the current consensus forecast.

# Daily returns
returns = pd.DataFrame({
    'ticker': ['AAPL'] * 30,
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'ret': np.random.randn(30) * 0.02
})

# Analyst forecast updates (irregular)
forecasts = pd.DataFrame({
    'ticker': ['AAPL'] * 5,
    'forecast_date': pd.to_datetime([
        '2023-12-15', '2024-01-05', '2024-01-12',
        '2024-01-20', '2024-01-28'
    ]),
    'consensus_eps': [5.25, 5.30, 5.28, 5.35, 5.40],
    'num_analysts': [25, 26, 27, 28, 29]
})

# Asof join: current forecast as of each date
merged = pd.merge_asof(
    returns.sort_values('date'),
    forecasts.sort_values('forecast_date'),
    left_on='date',
    right_on='forecast_date',
    by='ticker',
    direction='backward'
)

Pattern 3: Bid-ask spreads and trades

Match each trade to the prevailing bid-ask spread (within milliseconds).

# Trades (exact timestamps)
trades = pd.DataFrame({
    'symbol': ['AAPL'] * 5,
    'trade_time': pd.to_datetime([
        '2024-01-02 09:30:00.123',
        '2024-01-02 09:30:00.456',
        '2024-01-02 09:30:01.789',
        '2024-01-02 09:30:02.012',
        '2024-01-02 09:30:03.345'
    ]),
    'trade_price': [150.25, 150.26, 150.24, 150.25, 150.27],
    'trade_size': [100, 200, 150, 300, 250]
})

# Quote updates (irregular, microsecond timestamps)
quotes = pd.DataFrame({
    'symbol': ['AAPL'] * 4,
    'quote_time': pd.to_datetime([
        '2024-01-02 09:30:00.100',
        '2024-01-02 09:30:00.500',
        '2024-01-02 09:30:02.000',
        '2024-01-02 09:30:03.000'
    ]),
    'bid': [150.24, 150.25, 150.23, 150.26],
    'ask': [150.26, 150.27, 150.25, 150.28]
})

# Asof join: prevailing quote at each trade
merged = pd.merge_asof(
    trades.sort_values('trade_time'),
    quotes.sort_values('quote_time'),
    left_on='trade_time',
    right_on='quote_time',
    by='symbol',
    direction='backward',
    tolerance=pd.Timedelta('1s'),  # Only match within 1 second
    allow_exact_matches=False  # Use quote *before* the trade, not at same instant
)

# Calculate effective spread
merged['spread'] = merged['ask'] - merged['bid']
merged['effective_spread'] = abs(merged['trade_price'] -
                                  (merged['bid'] + merged['ask']) / 2)

16.4 Diagnosing and Correcting Merge Errors

Merges fail silently. You run the code, get a DataFrame back, and assume it worked. But maybe 80% of rows didn’t match. Maybe you have duplicate keys creating a many-to-many join. Maybe your date formats don’t align. The code runs, but your results are garbage.

Run the following checks every time until they become automatic.

16.4.1 Essential checks before and after merging

Before merging:

  • Predict row counts: Inner join should give ≤ min(left, right) rows; left join should preserve left row count for one-to-one or one-to-many; outer join gives ≥ max(left, right) rows
  • Check for duplicate keys: Use df.duplicated(key_cols).sum() or validate parameter to enforce cardinality
  • Verify data types match: Use df[key_cols].dtypes to ensure join keys have compatible types
  • For string keys: Normalize with .str.strip().str.upper() to handle whitespace and case differences
  • For dates: Ensure both use datetime64, same time zone, and same granularity

After merging:

  • Check row count changes: Unexpected changes indicate duplicate keys or wrong join type
  • Check match rates: Use indicator=True to see how many rows matched (use merged['_merge'].value_counts())
  • Inspect missing values: NaN in right-side columns after left join indicates failed matches

16.4.2 Using the indicator parameter

The indicator=True parameter adds a _merge column showing match status:

# Re-create sample data for merge diagnostics
returns = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'TSLA'],
    'year': [2023, 2023, 2023],
    'return': [0.45, 0.35, 0.10]
})

accounting = pd.DataFrame({
    'ticker': ['AAPL', 'TSLA', 'GOOGL'],
    'year': [2023, 2023, 2023],
    'roa': [0.20, 0.05, 0.15],
    'leverage': [0.30, 0.10, 0.05]
})

# Left join with indicator
merged = returns.merge(
    accounting,
    on=['ticker', 'year'],
    how='left',
    indicator=True
)

# Check match rates
merged['_merge'].value_counts()
1
Adds a _merge column showing whether each row matched: left_only, right_only, or both.
_merge
both          2
left_only     1
right_only    0
Name: count, dtype: int64
print(f"Match rate: {(merged['_merge'] == 'both').mean():.1%}")
Match rate: 66.7%
ImportantInvestigate low match rates

A match rate below 90% usually indicates problems: incorrect join keys, different identifier schemes (CUSIP vs PERMNO), date format mismatches, inconsistent capitalization, or datasets filtered to different subsets. Don’t just accept low match rates—understand why they occurred.

16.4.3 Common merge error patterns

Error Symptom Cause Solution
Cartesian explosion Row count explodes (10,000x+ more rows) Duplicate keys in both datasets Aggregate one dataset or use more specific join keys
Zero matches All rows are left_only Keys don’t overlap at all Check identifier mapping (CUSIP vs PERMNO), date formats
Type mismatch Low or zero matches despite overlapping keys Different data types (int vs float, str vs object) Convert both to same type before merge
Float precision Values that should match don’t 150.25 vs 150.25000001 Round floats or use integer keys
Time zone mismatch Asof join produces all NaN One dataset in UTC, other in local time Convert both to same time zone
Case/whitespace Low matches on string keys “AAPL” vs “aapl” vs ” AAPL” Use .str.strip().str.upper()
Unsorted asof join Wrong matches or NaN Data not sorted by time column Always sort_values() before asof joins

16.5 Summary and Best Practices

Merging is where most data errors occur in empirical finance. Follow these practices to catch problems early:

  1. Understand cardinality before merging. Is this one-to-one, one-to-many, or many-to-many? Use validate to enforce expectations.

  2. Choose the right merge type. Inner for complete cases only, left to preserve your sample, outer rarely (and carefully).

  3. Use asof joins for time-series data. Don’t try to match timestamps exactly—use asof joins with direction='backward' to avoid look-ahead bias.

  4. Validate every merge. Check row counts, match rates, duplicates, and data types until these checks become automatic.

  5. Sort before asof joins. Unsorted data produces silent errors.

  6. Document match rates. Report how many observations matched in your analysis. Low match rates indicate problems.

  7. Normalize strings. Strip whitespace, standardize case, remove special characters before merging on text.

  8. Use indicator columns. Add indicator=True to understand what matched and what didn’t.

  9. Prefer left joins for clarity. Write base.merge(extra, how='left') instead of extra.merge(base, how='right').

  10. Consider Polars for large datasets. Polars’ lazy evaluation and better error messages can save time on big merges.

The datasets in empirical finance are messy. Companies change tickers, databases use different identifiers, timestamps don’t align. Merging correctly requires skepticism, validation, and careful attention to detail. The techniques in this chapter won’t eliminate merge errors, but they’ll help you catch them before they corrupt your research.