14 Data Structuring and Aggregation

Working with financial data requires more than knowing how to load datasets and run calculations. The structure of your data—how observations are identified, indexed, and organized—fundamentally shapes what analyses are possible and what mistakes are easy to make. A portfolio returns dataset indexed by date allows straightforward time-series operations but makes cross-sectional comparisons awkward. The same data with a multi-level index on date and ticker enables both temporal and cross-sectional analysis but requires more careful handling of group operations.

This chapter covers three essential skills for working with structured financial data. First, we examine how keys, indices, and identifiers work—the foundation for correctly organizing observations. Second, we explore grouping and aggregation operations that let you compute statistics within subsets of data (returns by sector, volatility by year, etc.). Third, we discuss common pitfalls specific to financial data, particularly look-ahead bias and survivorship bias, which can invalidate empirical results if not handled carefully.

14.1 Keys, Indices, and Identifiers

Financial datasets rarely consist of independent observations. Stock returns are measured repeatedly over time for many different securities. Portfolio weights change across rebalancing dates. Trades occur at specific timestamps for specific instruments. To work with such data correctly, you need to understand how observations are identified and how that identification is represented in your data structures.

14.1.1 What Makes an Observation Unique?

Consider a dataset of daily stock returns. What information do you need to uniquely identify a single return observation? At minimum, you need to know which stock (ticker symbol or identifier) and which date. Neither alone suffices: Apple’s ticker AAPL appears thousands of times in a long dataset (once per day), and any given date like 2024-01-15 appears once for every stock in your universe. The combination of ticker and date uniquely identifies each return.

This combination is called a composite key or multi-level identifier. Other examples in finance include:

Trade data: exchange + timestamp + order ID
Options prices: underlying + expiration + strike + call/put
Cross-country macro data: country + date + variable
Portfolio holdings: portfolio ID + date + security ID

The key affects everything downstream: what joins are valid, what grouping operations make sense, what reshaping is possible, and critically, what the index of a time series means.

14.1.2 Representing Keys in DataFrames

There are two main approaches to representing composite keys in DataFrames: as regular columns in the data or as a special index structure. Both pandas and Polars support the regular column approach, while only pandas supports the index approach.

Approach 1: Keys as Regular Columns

The simplest representation keeps all identifiers as ordinary columns:

import pandas as pd
import polars as pl

# Pandas version
df_pandas = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-15', '2024-01-16']),
    'return': [0.012, -0.005, 0.008, 0.003]
})

# Polars version
df_polars = pl.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pl.date_range(pl.date(2024, 1, 15), pl.date(2024, 1, 16), "1d").repeat_by(2),
    'return': [0.012, -0.005, 0.008, 0.003]
})

This representation is explicit and flexible. Every piece of identifying information is visible in the data. Operations like filtering, grouping, and joining reference columns by name, making code clear and intentions obvious.

Approach 2: Pandas MultiIndex

Pandas offers an alternative where identifying columns become a special index structure:

df_indexed = df_pandas.set_index(['ticker', 'date'])

The result looks like:

                     return
ticker date
AAPL   2024-01-15    0.012
       2024-01-16   -0.005
MSFT   2024-01-15    0.008
       2024-01-16    0.003

Now ticker and date are no longer regular columns but levels of a hierarchical index. This structure enables convenient operations like:

# Select all observations for AAPL
df_indexed.loc['AAPL']

# Select specific ticker-date combination
df_indexed.loc[('AAPL', '2024-01-15')]

# Unstack to wide format
df_indexed.unstack('ticker')

The MultiIndex approach can make certain operations more concise, particularly when working with panel data where you frequently filter by identifier values or reshape between long and wide formats. However, it also introduces complexity. Index levels behave differently from columns: they don’t appear in df.columns, they require different syntax to modify, and they can complicate operations when you need to treat identifiers as data (e.g., creating a new column based on ticker characteristics).

14.1.3 Practical Guidance on Index Choice

Our general recommendation is to use regular columns for key variables, unless you have an explicit need for a pandas index. Regular columns make code more explicit and portable across libraries. However, a pandas index can simplify your work in specific situations: certain time-series operations (like resampling), creating plots or tables where the index provides axis labels, or when using third-party libraries that expect an index.

14.1.4 Temporal Identifiers and DatetimeIndex

Time is special in financial data. Unlike categorical identifiers like ticker symbols, dates and timestamps have inherent ordering and spacing. A pandas DatetimeIndex (single-level index with datetime values) provides specialized functionality for time-series work:

# Daily returns with date index
returns = pd.Series(
    [0.012, -0.005, 0.008],
    index=pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']),
    name='AAPL'
)

# Convenient date-based selection
returns['2024-01-16']  # Single date
returns['2024-01-15':'2024-01-16']  # Date range

# Resampling and frequency conversion
monthly_returns = (1 + returns).resample('ME').prod() - 1

# Time-aware operations
returns.shift(1)
returns.rolling(window=5).mean()

1: Compound daily returns to monthly by converting to gross returns (1 + r), multiplying within each month, then converting back to net returns.
2: Lag the series by one period—useful for avoiding look-ahead bias.
3: Compute a 5-day rolling mean.

This is one area where pandas indexing genuinely simplifies common operations. The alternative using regular columns requires more verbose syntax:

# Polars equivalent for date filtering
df_polars.filter(
    (pl.col('date') >= pl.date(2024, 1, 15)) &
    (pl.col('date') <= pl.date(2024, 1, 16))
)

# Polars rolling mean with explicit ordering
df_polars.sort('date').select([
    'date',
    pl.col('return').rolling_mean(window_size=5).alias('ma_5')
])

The pandas DatetimeIndex approach is particularly valuable when working with a single time series (one security or portfolio) where time is the primary organizing principle. For panel data with many securities, a regular date column often proves more flexible.

Polars encourages explicit column-based operations and provides excellent performance for time-based filtering and aggregation without special index structures. Its approach scales better to large datasets and makes parallelization transparent.

14.1.5 Identifier Best Practices

Several practical considerations matter when choosing and working with identifiers:

Uniqueness and Validation

Always verify that your chosen keys actually uniquely identify observations:

# Pandas: check for duplicates in composite key
duplicates = df_pandas.duplicated(subset=['ticker', 'date'], keep=False)
if duplicates.any():
    print(f"Found {duplicates.sum()} duplicate observations")
    print(df_pandas[duplicates])

# Polars: check for duplicates
duplicate_count = (
    df_polars
    .group_by(['ticker', 'date'])
    .agg(pl.len().alias('count'))
    .filter(pl.col('count') > 1)
)

Duplicate keys often indicate data quality problems: double-counting trades, duplicate downloads, or errors in joins. Finding them early prevents subtle errors in aggregation and analysis.

Identifier Stability

Ticker symbols change. Companies get acquired or reorganize. CUSIP identifiers remain stable but aren’t always available. When working with long time-series, consider using permanent identifiers like PERMNO (from CRSP) or assigning your own stable internal IDs that you map to tickers.

# Maintaining a ticker-to-permno mapping
ticker_map = pd.DataFrame({
    'permno': [14593, 10107],
    'ticker': ['AAPL', 'MSFT'],
    'start_date': pd.to_datetime(['1980-12-12', '1986-03-13']),
    'end_date': pd.to_datetime(['2024-12-31', '2024-12-31'])
})

# Join with date validation to handle ticker changes
def get_permno(df, ticker_map):
    return df.merge(
        ticker_map,
        on='ticker',
        how='left'
    ).query('date >= start_date and date <= end_date')

Missing Identifiers

Financial data often has gaps. A stock might not trade on certain days. An option chain might lack certain strikes. Design your data structures to make missing data explicit rather than creating ambiguous rows:

# Bad: using dummy values
df_bad = pd.DataFrame({
    'ticker': ['AAPL', 'NONE', 'MSFT'],  # 'NONE' for missing
    'return': [0.01, 0.0, 0.02]
})

# Good: using None/null
df_good = pd.DataFrame({
    'ticker': ['AAPL', None, 'MSFT'],
    'return': [0.01, None, 0.02]
})

# Polars explicitly handles null
df_polars = pl.DataFrame({
    'ticker': ['AAPL', None, 'MSFT'],
    'return': [0.01, None, 0.02]
})

This distinction matters because real zeros and missing values have different meanings in finance. A zero return is information; a missing return means we don’t know what happened.

14.2 Grouping and Aggregation

Much of financial data analysis involves computing statistics within subsets: average returns by sector, volatility by year, portfolio weights by strategy. The split-apply-combine pattern—dividing data into groups, computing something for each group, and combining results—is fundamental to empirical work.

14.2.1 The GroupBy Operation

The core operation splits your dataset into groups based on one or more columns, applies a function to each group, and combines the results. Both pandas and Polars implement this pattern, though with different syntax and performance characteristics.

Basic GroupBy Example

Consider computing average returns by sector:

# Sample data: daily returns for multiple stocks
data = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'JPM', 'JPM'],
    'date': pd.to_datetime(['2024-01-15', '2024-01-16'] * 3),
    'sector': ['Technology', 'Technology', 'Technology', 'Technology', 'Financials', 'Financials'],
    'return': [0.012, -0.005, 0.008, 0.003, -0.002, 0.015]
})

# Pandas groupby
sector_avg_pandas = data.groupby('sector')['return'].mean()

# Polars groupby
data_polars = pl.DataFrame(data)
sector_avg_polars = (
    data_polars
    .group_by('sector')
    .agg(pl.col('return').mean())
)

Both produce the same result: average returns for Technology and Financials. The key difference is syntax style. Pandas uses method chaining with implicit column selection (['return']), while Polars uses explicit expressions (pl.col('return').mean()).

14.2.2 Multiple Aggregations

Real analysis rarely computes just one statistic. You typically want mean, standard deviation, count, and other metrics simultaneously:

# Pandas: multiple aggregations
sector_stats_pandas = data.groupby('sector')['return'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('count', 'count')
])

# Alternative pandas syntax using dictionary
sector_stats_pandas_alt = data.groupby('sector').agg({
    'return': ['mean', 'std', 'count']
})

# Polars: explicit expression for each statistic
sector_stats_polars = (
    data_polars
    .group_by('sector')
    .agg([
        pl.col('return').mean().alias('mean'),
        pl.col('return').std().alias('std'),
        pl.col('return').count().alias('count')
    ])
)

The Polars approach requires more typing but makes each calculation explicit. This verbosity pays off in complex aggregations where you’re computing different statistics on different columns or using custom expressions.

14.2.3 Grouping by Multiple Keys

Financial data often requires grouping by multiple dimensions. For example, computing monthly returns by stock requires grouping by both ticker and month:

# Extended data with more dates
np.random.seed(42)
dates = pd.date_range('2024-01-01', '2024-03-31', freq='D')
tickers = ['AAPL', 'MSFT', 'JPM']

data_panel = pd.DataFrame({
    'ticker': np.repeat(tickers, len(dates)),
    'date': np.tile(dates, len(tickers)),
    'return': np.random.normal(0.001, 0.02, len(dates) * len(tickers))
})

# Add month identifier
data_panel['month'] = data_panel['date'].dt.to_period('M')

# Pandas: group by ticker and month
monthly_returns_pandas = (
    data_panel
    .groupby(['ticker', 'month'])['return']
    .apply(lambda x: (1 + x).prod() - 1)  # Compound returns
)

# Polars: explicit grouping
data_panel_polars = pl.DataFrame(data_panel)
monthly_returns_polars = (
    data_panel_polars
    .group_by(['ticker', 'month'])
    .agg(
        ((1 + pl.col('return')).product() - 1).alias('monthly_return')
    )
)

Notice the calculation itself: daily returns compound multiplicatively, not additively. This is a common pattern in financial aggregation where the mathematical operation matters. Simple averaging would be incorrect for returns.

14.2.4 Custom Aggregation Functions

Sometimes built-in aggregations aren’t sufficient. You might need to compute Sharpe ratios, apply winsorization, or implement custom risk metrics:

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Compute annualized Sharpe ratio from daily returns."""
    excess_returns = returns - risk_free_rate / 252
    return np.sqrt(252) * excess_returns.mean() / excess_returns.std()

# Pandas: apply custom function
sector_sharpe_pandas = (
    data_panel
    .groupby(['ticker'])['return']
    .apply(sharpe_ratio)
)

1: Convert annual risk-free rate to daily by dividing by 252 trading days.
2: Annualize the Sharpe ratio by multiplying by √252 (volatility scales with √T).

# Polars: requires more explicit approach
# Option 1: Convert to pandas for complex custom functions
sector_sharpe_polars_via_pandas = (
    data_panel_polars
    .to_pandas()
    .groupby('ticker')['return']
    .apply(sharpe_ratio)
)

# Option 2: Use Polars expressions (more efficient)
def sharpe_ratio_polars(returns_col, risk_free_rate=0.0):
    """Sharpe ratio as Polars expression."""
    excess = returns_col - risk_free_rate / 252
    return (pl.lit(252).sqrt() * excess.mean() / excess.std()).alias('sharpe')

sector_sharpe_polars = (
    data_panel_polars
    .group_by('ticker')
    .agg(sharpe_ratio_polars(pl.col('return')))
)

Pandas’s apply method accepts arbitrary Python functions, offering maximum flexibility at the cost of performance (it applies the function to each group sequentially in Python). Polars requires expressing custom operations using its expression language, which enables query optimization and parallel execution but requires more thought about how to structure the calculation.

For complex custom functions, pandas is often more convenient. For operations that can be expressed using Polars’s built-in functions (which cover most standard statistical and mathematical operations), Polars provides better performance.

14.2.5 Time-Based Aggregation and Resampling

Financial data frequently needs aggregation at different time frequencies: daily data to monthly, tick data to minute bars, etc. Pandas provides convenient resampling functionality:

# Single time series of daily returns
aapl_returns = (
    data_panel[data_panel['ticker'] == 'AAPL']
    .set_index('date')['return']
)

# Resample to monthly, compounding returns
monthly_aapl = (1 + aapl_returns).resample('ME').prod() - 1

# Resample to weekly, computing volatility
weekly_vol = aapl_returns.resample('W').std() * np.sqrt(5)

# Multiple statistics at monthly frequency
monthly_stats = aapl_returns.resample('ME').agg({
    'return': [
        ('compound_return', lambda x: (1 + x).prod() - 1),
        ('volatility', lambda x: x.std() * np.sqrt(21)),
        ('days', 'count')
    ]
})

For panel data (multiple securities), combine groupby with resampling:

# Panel data: group by ticker, then resample
monthly_panel = (
    data_panel
    .set_index('date')
    .groupby('ticker')
    .resample('ME')['return']
    .apply(lambda x: (1 + x).prod() - 1)
    .reset_index()
)

Polars handles time-based aggregation through explicit grouping by time periods:

# Polars: group by ticker and month
monthly_panel_polars = (
    data_panel_polars
    .group_by(['ticker', pl.col('date').dt.month().alias('month')])
    .agg(
        ((1 + pl.col('return')).product() - 1).alias('monthly_return')
    )
)

# More sophisticated: group by ticker and dynamic time windows
monthly_panel_polars_alt = (
    data_panel_polars
    .sort(['ticker', 'date'])
    .group_by_dynamic(
        'date',
        every='1mo',
        by='ticker'
    )
    .agg([
        ((1 + pl.col('return')).product() - 1).alias('monthly_return'),
        pl.col('return').std().alias('volatility'),
        pl.col('return').count().alias('days')
    ])
)

The group_by_dynamic method in Polars provides powerful time-based grouping with clear syntax for the window specification.

14.2.6 Performance Considerations

For small to medium datasets (thousands to low millions of rows), both pandas and Polars perform well. Differences become significant with larger datasets or complex operations:

Pandas uses single-threaded execution for most operations. Custom apply functions run sequentially in Python, which can be slow for large groups or expensive functions.
Polars uses parallel execution by default and query optimization. Operations expressed using Polars expressions (rather than Python functions) run much faster, often 10-100x faster than pandas on large datasets.

For production pipelines or research involving millions of rows, Polars’s performance advantages justify the learning curve. For interactive analysis and moderate-sized datasets, pandas’s ecosystem maturity and flexibility often win.

14.2.7 Grouped Transformations

Aggregation reduces groups to single values, but sometimes you want to transform data within groups while preserving the original shape. Common examples include:

Demeaning returns within sectors (excess returns over sector average)
Computing lagged observations in panel data
Normalizing values within groups
Computing ranks within categories
Forward-filling missing data within securities

Pandas Transform

# Demean returns within each sector
data['excess_return'] = (
    data.groupby('sector')['return']
    .transform(lambda x: x - x.mean())
)

# Standardize returns within each ticker
data_panel['standardized_return'] = (
    data_panel.groupby('ticker')['return']
    .transform(lambda x: (x - x.mean()) / x.std())
)

# Rank returns within each date (cross-sectional ranks)
data_panel['return_rank'] = (
    data_panel.groupby('date')['return']
    .rank(pct=True)
)

Transform operations return a Series with the same index as the original data, allowing direct assignment to new columns.

Polars Window Functions

Polars uses window functions (similar to SQL) for grouped transformations:

# Demean returns within each sector
data_polars_with_excess = data_polars.with_columns(
    (pl.col('return') - pl.col('return').mean().over('sector')).alias('excess_return')
)

# Standardize returns within each ticker
data_panel_polars_standardized = data_panel_polars.with_columns(
    (
        (pl.col('return') - pl.col('return').mean().over('ticker')) /
        pl.col('return').std().over('ticker')
    ).alias('standardized_return')
)

# Rank returns within each date
data_panel_polars_ranked = data_panel_polars.with_columns(
    pl.col('return').rank().over('date').alias('return_rank')
)

1: The .over('sector') computes the mean within each sector group, similar to SQL window functions.
2: Z-score normalization: subtract the group mean and divide by the group standard deviation.
3: Cross-sectional ranking: rank returns among all stocks on each date.

The .over() syntax specifies the grouping for the window function. This approach is more explicit than pandas transform and integrates naturally with Polars’s expression system.

14.3 Common Pitfalls in Financial Data

Financial data has specific characteristics that create traps for the unwary. Two problems—look-ahead bias and survivorship bias—are particularly insidious because they can produce plausible-looking results that are completely invalid. Understanding these issues and designing your data structures to avoid them is essential for credible empirical work.

14.3.1 Look-Ahead Bias

Look-ahead bias occurs when your analysis uses information that would not have been available at the time a decision would have been made. This is surprisingly easy to do accidentally, and it typically makes strategies appear more profitable than they actually would have been.

Example 1: Using Future Data for Current Decisions

Consider computing a moving average of returns to generate trading signals:

# WRONG: Look-ahead bias
returns = pd.Series(
    [0.01, -0.02, 0.03, -0.01, 0.02],
    index=pd.date_range('2024-01-01', periods=5, freq='D')
)

# This looks innocent but is wrong!
returns['ma_5'] = returns.rolling(window=5, center=True).mean()

The center=True parameter makes each moving average value include both past and future returns. A trading signal based on the 2024-01-03 moving average would use returns through 2024-01-05, which wouldn’t be known on 2024-01-03. The correct approach:

# CORRECT: Only use past data
returns_with_ma = pd.DataFrame({
    'return': returns,
    'ma_5': returns.rolling(window=5).mean()
})

# Generate signal based on past information
returns_with_ma['signal'] = (
    returns_with_ma['return'].shift(1) > returns_with_ma['ma_5'].shift(1)
).astype(int)

1: Default rolling window uses only current and past observations (no future data).
2: The .shift(1) ensures today’s signal uses only yesterday’s values—information available at market open.

Example 2: Point-in-Time Data Issues

Financial datasets are often revised. A company’s book value reported in 2024 financial statements might differ from what was reported in earlier vintages due to restatements. Academic databases like Compustat provide “point-in-time” datasets that preserve what was known at each historical date.

# Problematic: Using latest restated values
fundamentals = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
    'report_date': pd.to_datetime(['2023-09-30', '2023-12-31', '2024-03-31']),
    'book_value': [150e9, 155e9, 160e9]  # Current restated values
})

# Computing book-to-market ratio using restated book values
# would create look-ahead bias if values were adjusted retroactively

# Better: Use point-in-time database or verify no retroactive changes
# Query data as it existed at each decision point

The safest approach is to use datasets explicitly designed to avoid this issue (CRSP/Compustat point-in-time, FactSet’s point-in-time fundamentals) or carefully verify that your data source doesn’t include retroactive revisions.

Example 3: Survivorship Bias in Filtering

Filtering data before creating historical samples can inadvertently create look-ahead bias:

# WRONG: Filtering on current characteristics
# Get list of large-cap stocks as of 2024
large_caps_2024 = get_stocks_by_market_cap(min_cap=10e9, date='2024-01-01')

# Use this list to analyze returns from 2020-2024
historical_returns = get_returns(
    tickers=large_caps_2024,
    start_date='2020-01-01',
    end_date='2024-01-01'
)
# This is wrong! Uses 2024 information to select 2020 sample

# CORRECT: Filter at each point in time
def get_large_cap_returns(start_date, end_date, min_cap):
    """Get returns for stocks that were large-cap at each date."""
    all_dates = pd.date_range(start_date, end_date, freq='D')
    results = []

    for date in all_dates:
        # Get stocks that qualified on this date
        eligible = get_stocks_by_market_cap(min_cap=min_cap, date=date)
        # Get returns for these stocks on this date
        returns = get_returns(tickers=eligible, start_date=date, end_date=date)
        results.append(returns)

    return pd.concat(results)

The correct approach checks eligibility at each date, allowing the composition to change over time as stocks grow or shrink.

Detecting Look-Ahead Bias

Ask these questions about any empirical analysis:

Could I have computed this value in real-time? If your calculation requires future data, it’s look-ahead bias.
Does the analysis use the latest version of the data? If you’re using restated fundamentals or revised economic data, verify those revisions were available when decisions would have been made.
Are sample filters time-varying? Any filter based on characteristics (market cap, industry, etc.) should be applied using values from that point in time, not current values.
Are there suspicious shifts? If lagged variables are used to predict returns, ensure lags are implemented correctly with .shift() operations.

14.3.2 Survivorship Bias

Survivorship bias occurs when your dataset includes only entities that survived until the sample endpoint, excluding those that disappeared (delisted stocks, defunct funds, closed portfolios). This creates artificially optimistic results because you’re analyzing a selected sample of winners.

Example 1: Stock Return Analysis

Consider analyzing stock returns from 2020-2024 using a current stock list:

# WRONG: Current stock universe
current_stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA']
returns = get_returns(tickers=current_stocks, start='2020-01-01', end='2024-01-01')

# Average return calculation
mean_return = returns.mean()
# This is biased upward! Only includes stocks that survived and remained liquid

These are all large, successful companies. Hundreds of stocks that existed in 2020 but delisted due to bankruptcy, acquisition, or poor performance are missing. The average return will be substantially higher than the true market average.

The solution requires a database that includes delisted securities:

# CORRECT: Include delisted stocks
# Using a complete database (e.g., CRSP with delisting codes)
all_stocks = get_stocks(start='2020-01-01', include_delisted=True)
complete_returns = get_returns(
    tickers=all_stocks,
    start='2020-01-01',
    end='2024-01-01'
)

# Handle delisting returns
# CRSP provides delisting returns accounting for final liquidation values
complete_returns['total_return'] = (
    complete_returns['return'].fillna(0) +
    complete_returns['delisting_return'].fillna(0)
)

mean_return_unbiased = complete_returns['total_return'].mean()

Academic databases like CRSP explicitly address survivorship bias by maintaining historical records of all listed stocks, including delisting codes and final returns.

Example 2: Fund Performance Analysis

Mutual fund and hedge fund databases are notorious for survivorship bias. Failed or poorly performing funds often stop reporting, creating databases that over-represent successful funds:

# Analyzing fund performance with survivorship bias
funds_database = get_fund_returns(database='current_funds')  # Only active funds

# Average alpha appears positive
average_alpha = compute_alpha(funds_database)
# Misleading! Unsuccessful funds that closed are missing

# Better approach: Use survivorship-bias-free database
funds_complete = get_fund_returns(database='complete_with_defunct')

# Explicitly track fund status
active_funds = funds_complete[funds_complete['status'] == 'active']
defunct_funds = funds_complete[funds_complete['status'] == 'defunct']

# Compare performance
print(f"Active funds average return: {active_funds['return'].mean():.2%}")
print(f"Defunct funds average return: {defunct_funds['return'].mean():.2%}")
print(f"All funds average return: {funds_complete['return'].mean():.2%}")

Studies have found that survivorship bias can inflate reported mutual fund returns, a massive distortion in performance measurement.

Detecting Survivorship Bias

Guard against survivorship bias by:

Using complete databases: CRSP, Compustat, or commercial providers that maintain historical records of delisted securities.
Checking delisting treatment: Verify how your data source handles stocks that stopped trading. Are delisting returns included?
Comparing universe sizes: If analyzing 2020-2024 data in 2024, check how many securities existed in 2020 versus 2024. A large decline might indicate missing delistings.
Examining data provider methodology: Read documentation about survivorship bias treatment. Reputable providers are explicit about whether datasets include defunct entities.

14.4 Summary

Effective financial data analysis requires understanding how data structure shapes what questions you can answer and what mistakes you might make. Keys and identifiers determine how observations are uniquely identified; choosing whether to use regular columns or special index structures affects code clarity and operation convenience. Grouping and aggregation operations are central to computing statistics within subsets, whether sector averages, monthly volatility, or portfolio characteristics. Both pandas and Polars provide powerful capabilities, with pandas offering flexibility and ecosystem maturity while Polars delivers better performance through query optimization and parallel execution.

Look-ahead bias and survivorship bias represent the most serious threats to validity in financial empirical work. Look-ahead bias uses information that wouldn’t have been available at decision time, making strategies appear more profitable than possible. Survivorship bias analyzes only entities that survived to the sample endpoint, excluding failures and creating upward-biased performance measures. Both require careful attention to data structure, filtering logic, and temporal alignment.

Key practices for robust financial data analysis include:

Explicitly validate that composite keys uniquely identify observations
Choose index structures based on clarity and operation convenience, not capability
Use time-aware operations (resampling, shifting) carefully to maintain temporal integrity
Express grouped calculations explicitly, making aggregation logic clear
Use complete databases that include delisted securities and defunct entities
Document assumptions about data availability and point-in-time values

These skills form the foundation for more advanced topics: joining datasets, reshaping between long and wide formats, and handling complex temporal relationships. The ability to structure data correctly and aggregate it appropriately without introducing bias separates valid empirical research from plausible-looking but incorrect analyses.