13  Data Cleaning

Data cleaning is one of the most critical and time-consuming steps in any empirical analysis. In finance, working with real-world data means confronting issues like duplicate records, missing observations, data entry errors, and outliers. These problems are not merely technical annoyances—they can lead to incorrect conclusions, flawed trading strategies, or misleading research findings if not handled properly.

The goal of data cleaning is not to make data “perfect” but to make it suitable for analysis. This means understanding the nature of data quality issues, their potential causes, and the trade-offs involved in different cleaning approaches. In this chapter, we will cover the essential techniques for detecting and handling common data quality problems using Python’s main data manipulation libraries: pandas and Polars.

Before we begin, let’s understand an important principle: data cleaning decisions should be documented and reproducible. Every choice you make—whether to drop duplicates, impute missing values, or remove outliers—affects your results. Your code should clearly show what cleaning steps were taken and why, so that others (including your future self) can understand and validate your approach.

TipData Wrangler in VS Code

Microsoft provides a VS Code extension called “Data Wrangler” that provides a nice UI for filtering, cleaning, and transforming data in pandas DataFrames. When done, you can export the Python code to replicate your steps.

13.1 Detecting and Handling Duplicates

Duplicate records are a common problem in financial datasets. They can arise from multiple sources: data feed errors that transmit the same trade twice, mistakes in data aggregation where the same record appears in multiple files, or simple human error in manual data entry. Duplicates are particularly problematic because they can distort statistical analyses, inflate trading volumes, and create artificial patterns in the data.

13.1.1 Identifying duplicates

The first step in handling duplicates is identifying them. A duplicate can be defined in different ways depending on your data structure and business context.

NoteTypes of duplicates
  1. Complete duplicates: All columns have identical values
  2. Partial duplicates: Only certain key columns are identical (e.g., same date and ticker, but different price)
  3. Near duplicates: Values are very similar but not exactly identical (often due to floating-point precision)

The appropriate definition depends on your data and use case.

Let’s start with a simple example using stock price data:

import pandas as pd
import polars as pl
from datetime import datetime, timedelta

# Create sample stock price data with duplicates
dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = {
    'date': list(dates) + [dates[2]],  # Duplicate date
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'price': [150.0, 152.5, 155.0, 153.0, 157.0, 155.0],
    'volume': [1000000, 1100000, 1200000, 950000, 1300000, 1200000]
}

df_pd = pd.DataFrame(data)
df_pd
1
Adding a duplicate of the third date to create a duplicate record.
date ticker price volume
0 2024-01-01 AAPL 150.0 1000000
1 2024-01-02 AAPL 152.5 1100000
2 2024-01-03 AAPL 155.0 1200000
3 2024-01-04 AAPL 153.0 950000
4 2024-01-05 AAPL 157.0 1300000
5 2024-01-03 AAPL 155.0 1200000

Now let’s detect duplicates. The duplicated() method returns a boolean Series indicating which rows are duplicates:

# Check for complete duplicates
df_pd.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

We can also check for duplicates based on specific columns:

# Check for duplicates based on specific columns (date and ticker)
df_pd.duplicated(subset=['date', 'ticker'])
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

To see which rows are duplicates, we can filter the DataFrame:

# See which rows are duplicates (keep=False marks all duplicates)
df_pd[df_pd.duplicated(subset=['date', 'ticker'], keep=False)]
date ticker price volume
2 2024-01-03 AAPL 155.0 1200000
5 2024-01-03 AAPL 155.0 1200000

The keep parameter controls which duplicate to mark:

  • keep='first' (default): Mark all duplicates as True except the first occurrence
  • keep='last': Mark all duplicates as True except the last occurrence
  • keep=False: Mark all duplicates as True, including the first occurrence

Now let’s see the same operations in Polars:

df_pl = pl.DataFrame(data)
df_pl
shape: (6, 4)
date ticker price volume
datetime[μs] str f64 i64
2024-01-01 00:00:00 "AAPL" 150.0 1000000
2024-01-02 00:00:00 "AAPL" 152.5 1100000
2024-01-03 00:00:00 "AAPL" 155.0 1200000
2024-01-04 00:00:00 "AAPL" 153.0 950000
2024-01-05 00:00:00 "AAPL" 157.0 1300000
2024-01-03 00:00:00 "AAPL" 155.0 1200000

To find duplicates in Polars, we use is_duplicated():

# Check for duplicates based on date and ticker
df_pl.filter(pl.struct(['date', 'ticker']).is_duplicated())
shape: (2, 4)
date ticker price volume
datetime[μs] str f64 i64
2024-01-03 00:00:00 "AAPL" 155.0 1200000
2024-01-03 00:00:00 "AAPL" 155.0 1200000

13.1.2 Removing duplicates

Once we’ve identified duplicates, we need to decide how to handle them. The most common approaches are:

  1. Drop all duplicates: Keep only the first (or last) occurrence
  2. Aggregate duplicates: Combine duplicate rows using aggregation (e.g., average prices)
  3. Manual investigation: For small numbers of duplicates, inspect each case individually

Let’s see how to implement these approaches. In pandas, we use drop_duplicates():

# pandas: Drop duplicates, keeping first occurrence
df_pd_clean = df_pd.drop_duplicates(subset=['date', 'ticker'], keep='first')
df_pd_clean
date ticker price volume
0 2024-01-01 AAPL 150.0 1000000
1 2024-01-02 AAPL 152.5 1100000
2 2024-01-03 AAPL 155.0 1200000
3 2024-01-04 AAPL 153.0 950000
4 2024-01-05 AAPL 157.0 1300000

In Polars, we use the unique() method:

# Polars: Drop duplicates
df_pl_clean = df_pl.unique(subset=['date', 'ticker'], keep='first')
df_pl_clean
shape: (5, 4)
date ticker price volume
datetime[μs] str f64 i64
2024-01-03 00:00:00 "AAPL" 155.0 1200000
2024-01-04 00:00:00 "AAPL" 153.0 950000
2024-01-01 00:00:00 "AAPL" 150.0 1000000
2024-01-02 00:00:00 "AAPL" 152.5 1100000
2024-01-05 00:00:00 "AAPL" 157.0 1300000

For the aggregation approach, suppose we want to take the average price and total volume when duplicates exist:

# pandas: Aggregate duplicates
df_pd_agg = df_pd.groupby(['date', 'ticker'], as_index=False).agg({
    'price': 'mean',
    'volume': 'sum'
})
df_pd_agg
date ticker price volume
0 2024-01-01 AAPL 150.0 1000000
1 2024-01-02 AAPL 152.5 1100000
2 2024-01-03 AAPL 155.0 2400000
3 2024-01-04 AAPL 153.0 950000
4 2024-01-05 AAPL 157.0 1300000
# Polars: Aggregate duplicates
df_pl_agg = df_pl.group_by(['date', 'ticker']).agg([
    pl.col('price').mean(),
    pl.col('volume').sum()
])
df_pl_agg
shape: (5, 4)
date ticker price volume
datetime[μs] str f64 i64
2024-01-01 00:00:00 "AAPL" 150.0 1000000
2024-01-03 00:00:00 "AAPL" 155.0 2400000
2024-01-05 00:00:00 "AAPL" 157.0 1300000
2024-01-02 00:00:00 "AAPL" 152.5 1100000
2024-01-04 00:00:00 "AAPL" 153.0 950000
WarningCommon pitfall: implicit duplicates

Be careful about duplicates that arise from data structure decisions. For example, if you merge two datasets and accidentally create a many-to-many join, you may generate duplicates without realizing it. Always check the row count before and after joins to ensure you haven’t inadvertently created duplicates. We discuss joins in detail in Chapter 16.

13.1.3 Best practices for duplicate handling

  1. Always investigate before removing: Look at a sample of duplicates to understand why they exist
  2. Document your decisions: Add comments explaining which columns define a duplicate and why
  3. Keep track of removals: Log how many duplicates you removed and their characteristics
  4. Consider the source: Different data sources may have different duplicate patterns
# Example: Comprehensive duplicate handling with logging
def handle_duplicates(df, subset_cols, keep='first', verbose=True):
    """
    Remove duplicates with logging.

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe
    subset_cols : list
        Columns that define a duplicate
    keep : str
        Which duplicate to keep ('first', 'last', or False)
    verbose : bool
        Whether to print diagnostic information

    Returns
    -------
    pd.DataFrame
        Cleaned dataframe
    """
    initial_rows = len(df)
    duplicated_mask = df.duplicated(subset=subset_cols, keep=keep)
    n_duplicates = duplicated_mask.sum()

    if verbose:
        print(f"Initial rows: {initial_rows}")
        print(f"Duplicates found: {n_duplicates}")
        if n_duplicates > 0:
            print(f"Duplicate rows (sample):")
            print(df[duplicated_mask].head())

    df_clean = df[~duplicated_mask].copy()

    if verbose:
        print(f"Final rows: {len(df_clean)}")
        print(f"Rows removed: {initial_rows - len(df_clean)}")

    return df_clean

# Use the function
df_cleaned = handle_duplicates(df_pd, subset_cols=['date', 'ticker'])
Initial rows: 6
Duplicates found: 1
Duplicate rows (sample):
        date ticker  price   volume
5 2024-01-03   AAPL  155.0  1200000
Final rows: 5
Rows removed: 1

13.2 Missing Data: Detection and Imputation

Missing data is ubiquitous in financial datasets. Stock markets close on holidays, companies report earnings quarterly but not daily, economic indicators are released with delays, and data collection systems occasionally fail. How you handle missing data can significantly impact your analysis, and there is rarely a single “correct” approach.

13.2.1 Understanding missing data mechanisms

Before deciding how to handle missing data, it’s important to understand why data might be missing. Statisticians classify missing data into three categories:

  1. Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any observed or unobserved data. For example, if a data feed randomly drops 1% of all observations due to network issues, the missing data is MCAR.

  2. Missing at Random (MAR): The probability that a value is missing depends on observed data but not on the missing value itself. For example, small-cap stocks might have more missing volume data than large-cap stocks, but conditional on market cap, the missingness is random.

  3. Missing Not at Random (MNAR): The probability that a value is missing depends on the missing value itself. For example, companies might be less likely to report earnings when performance is poor, or traders might fail to report large losses.

ImportantWhy this matters

The type of missingness affects which imputation methods are valid. MCAR is the easiest to handle (you can often just drop missing observations), while MNAR is the most problematic because missing data itself carries information. Unfortunately, in real financial data, you often cannot definitively determine which mechanism is operating, so you need to think carefully about the context.

13.2.2 Detecting missing data

Python uses different representations for missing data depending on the data type:

  • NaN (Not a Number): Used for missing floating-point values
  • None: Python’s built-in null value
  • NaT (Not a Time): Used for missing datetime values
  • pd.NA: pandas’ new missing indicator for all dtypes
NoteHistorical context: missing values in pandas

Initially, pandas used NumPy as its backend and could not represent missing values explicitly. Instead, it used NaN for floats and None (a Python object) for strings. This approach had important limitations:

  1. For integers, pandas would first convert all values in the column to float and use NaN to represent missing values. This meant you couldn’t have a truly integer column with missing values.

  2. NaN has a specific meaning in floating-point arithmetic: it represents a value that is “not a number,” such as the result of an undefined operation (e.g., log(-1)). Reusing NaN to mean “missing” conflates two distinct concepts and complicates handling of true NaN results.

Polars and newer backends for pandas (such as PyArrow) explicitly support missing data indicators for all column types, avoiding these limitations.

Let’s create a dataset with missing values:

import numpy as np

data_missing = {
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'ticker': ['AAPL'] * 10,
    'price': [150.0, 152.5, np.nan, 153.0, 157.0, np.nan, 159.0, 158.5, np.nan, 161.0],
    'volume': [1000000, np.nan, 1200000, 950000, np.nan, 1100000, 1300000, np.nan, 1250000, 1400000],
    'dividend': [0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
}

df_missing = pd.DataFrame(data_missing)
df_missing
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 NaN 0.00
2 2024-01-03 AAPL NaN 1200000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 NaN 0.00
5 2024-01-06 AAPL NaN 1100000.0 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 NaN 0.00
8 2024-01-09 AAPL NaN 1250000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

Now let’s detect missing values. The isna() method returns a boolean DataFrame indicating which values are missing:

df_missing.isna()
date ticker price volume dividend
0 False False False False False
1 False False False True False
2 False False True False False
3 False False False False False
4 False False False True False
5 False False True False False
6 False False False False False
7 False False False True False
8 False False True False False
9 False False False False False

We can count missing values per column:

df_missing.isna().sum()
date        0
ticker      0
price       3
volume      3
dividend    0
dtype: int64

Or calculate the percentage of missing values:

df_missing.isna().mean() * 100
date         0.0
ticker       0.0
price       30.0
volume      30.0
dividend     0.0
dtype: float64

To see rows with any missing values:

df_missing[df_missing.isna().any(axis=1)]
date ticker price volume dividend
1 2024-01-02 AAPL 152.5 NaN 0.0
2 2024-01-03 AAPL NaN 1200000.0 0.0
4 2024-01-05 AAPL 157.0 NaN 0.0
5 2024-01-06 AAPL NaN 1100000.0 0.0
7 2024-01-08 AAPL 158.5 NaN 0.0
8 2024-01-09 AAPL NaN 1250000.0 0.0

In Polars, we use null_count() to count missing values:

# Polars version - convert from pandas DataFrame
# (Polars uses None/null for missing values, not np.nan)
df_missing_pl = pl.from_pandas(df_missing)

df_missing_pl.null_count()
shape: (1, 5)
date ticker price volume dividend
u32 u32 u32 u32 u32
0 0 3 3 0

We can filter rows with missing values in a specific column:

df_missing_pl.filter(pl.col('price').is_null())
shape: (3, 5)
date ticker price volume dividend
datetime[ns] str f64 f64 f64
2024-01-03 00:00:00 "AAPL" null 1.2e6 0.0
2024-01-06 00:00:00 "AAPL" null 1.1e6 0.0
2024-01-09 00:00:00 "AAPL" null 1.25e6 0.0

13.2.3 Handling missing data: deletion

The simplest approach to missing data is to delete it. There are two main strategies:

  1. Listwise deletion (complete case analysis): Rows with any missing values are completely dropped from the sample, so they are not used in any calculation or analysis. This approach is simple and ensures consistency across analyses, but can significantly reduce sample size if missingness is widespread.

  2. Pairwise deletion: For each calculation or analysis (e.g., for each regression), use all observations for which we have the required variables. This approach maximizes the use of available data but can lead to different sample sizes for different analyses, making comparisons difficult.

Listwise deletion removes any row with missing values:

df_complete = df_missing.dropna()
df_complete
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
6 2024-01-07 AAPL 159.0 1300000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00
print(f"Rows removed: {len(df_missing) - len(df_complete)}")
Rows removed: 6

We can also drop rows only if specific columns are missing:

df_price_complete = df_missing.dropna(subset=['price'])
df_price_complete
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 NaN 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 NaN 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 NaN 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

Or drop columns with any missing values:

df_no_missing_cols = df_missing.dropna(axis=1)
df_no_missing_cols
date ticker dividend
0 2024-01-01 AAPL 0.00
1 2024-01-02 AAPL 0.00
2 2024-01-03 AAPL 0.00
3 2024-01-04 AAPL 0.25
4 2024-01-05 AAPL 0.00
5 2024-01-06 AAPL 0.00
6 2024-01-07 AAPL 0.00
7 2024-01-08 AAPL 0.00
8 2024-01-09 AAPL 0.00
9 2024-01-10 AAPL 0.00

In Polars, we use drop_nulls():

df_complete_pl = df_missing_pl.drop_nulls()
df_complete_pl
shape: (4, 5)
date ticker price volume dividend
datetime[ns] str f64 f64 f64
2024-01-01 00:00:00 "AAPL" 150.0 1e6 0.0
2024-01-04 00:00:00 "AAPL" 153.0 950000.0 0.25
2024-01-07 00:00:00 "AAPL" 159.0 1.3e6 0.0
2024-01-10 00:00:00 "AAPL" 161.0 1.4e6 0.0
# Drop rows with null in specific column
df_price_complete_pl = df_missing_pl.drop_nulls(subset=['price'])
df_price_complete_pl
shape: (7, 5)
date ticker price volume dividend
datetime[ns] str f64 f64 f64
2024-01-01 00:00:00 "AAPL" 150.0 1e6 0.0
2024-01-02 00:00:00 "AAPL" 152.5 null 0.0
2024-01-04 00:00:00 "AAPL" 153.0 950000.0 0.25
2024-01-05 00:00:00 "AAPL" 157.0 null 0.0
2024-01-07 00:00:00 "AAPL" 159.0 1.3e6 0.0
2024-01-08 00:00:00 "AAPL" 158.5 null 0.0
2024-01-10 00:00:00 "AAPL" 161.0 1.4e6 0.0
WarningWhen deletion is problematic

Deletion can introduce bias if:

  • You lose a large percentage of your data
  • Missingness is not random (MAR or MNAR)
  • Missing data occurs in key variables

In time series analysis, deleting observations can break the temporal structure of your data. Use deletion cautiously and always check how much data you’re losing.

13.2.4 Handling missing data: imputation

Imputation means filling in missing values with plausible estimates. There are many imputation strategies, each with different assumptions and use cases.

13.2.4.1 Simple imputation methods

  1. Forward fill: Use the last observed value
  2. Backward fill: Use the next observed value
  3. Mean/median imputation: Replace with the column mean or median
  4. Zero/constant imputation: Replace with zero or another constant

In empirical finance, we usually want to avoid introducing look-ahead bias or imputing information that was not available at the time of observation. For this reason, forward fill and zero/constant imputation are the most commonly used imputation methods in financial research.

Forward fill uses the last observed value:

df_ffill = df_missing.copy()
df_ffill[['price', 'volume']] = df_ffill[['price', 'volume']].ffill()
df_ffill
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 1000000.0 0.00
2 2024-01-03 AAPL 152.5 1200000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 950000.0 0.00
5 2024-01-06 AAPL 157.0 1100000.0 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 1300000.0 0.00
8 2024-01-09 AAPL 158.5 1250000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

Backward fill uses the next observed value:

df_bfill = df_missing.copy()
df_bfill[['price', 'volume']] = df_bfill[['price', 'volume']].bfill()
df_bfill
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 1200000.0 0.00
2 2024-01-03 AAPL 153.0 1200000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 1100000.0 0.00
5 2024-01-06 AAPL 159.0 1100000.0 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 1250000.0 0.00
8 2024-01-09 AAPL 161.0 1250000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

Mean imputation replaces missing values with the column mean:

df_mean = df_missing.copy()
df_mean['price'] = df_mean['price'].fillna(df_mean['price'].mean())
df_mean['volume'] = df_mean['volume'].fillna(df_mean['volume'].mean())
df_mean
date ticker price volume dividend
0 2024-01-01 AAPL 150.000000 1.000000e+06 0.00
1 2024-01-02 AAPL 152.500000 1.171429e+06 0.00
2 2024-01-03 AAPL 155.857143 1.200000e+06 0.00
3 2024-01-04 AAPL 153.000000 9.500000e+05 0.25
4 2024-01-05 AAPL 157.000000 1.171429e+06 0.00
5 2024-01-06 AAPL 155.857143 1.100000e+06 0.00
6 2024-01-07 AAPL 159.000000 1.300000e+06 0.00
7 2024-01-08 AAPL 158.500000 1.171429e+06 0.00
8 2024-01-09 AAPL 155.857143 1.250000e+06 0.00
9 2024-01-10 AAPL 161.000000 1.400000e+06 0.00

Median imputation is more robust to outliers:

df_median = df_missing.copy()
df_median['price'] = df_median['price'].fillna(df_median['price'].median())
df_median['volume'] = df_median['volume'].fillna(df_median['volume'].median())
df_median
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 1200000.0 0.00
2 2024-01-03 AAPL 157.0 1200000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 1200000.0 0.00
5 2024-01-06 AAPL 157.0 1100000.0 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 1200000.0 0.00
8 2024-01-09 AAPL 157.0 1250000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

Constant value imputation:

df_zero = df_missing.copy()
df_zero['dividend'] = df_zero['dividend'].fillna(0)
df_zero
date ticker price volume dividend
0 2024-01-01 AAPL 150.0 1000000.0 0.00
1 2024-01-02 AAPL 152.5 NaN 0.00
2 2024-01-03 AAPL NaN 1200000.0 0.00
3 2024-01-04 AAPL 153.0 950000.0 0.25
4 2024-01-05 AAPL 157.0 NaN 0.00
5 2024-01-06 AAPL NaN 1100000.0 0.00
6 2024-01-07 AAPL 159.0 1300000.0 0.00
7 2024-01-08 AAPL 158.5 NaN 0.00
8 2024-01-09 AAPL NaN 1250000.0 0.00
9 2024-01-10 AAPL 161.0 1400000.0 0.00

In Polars, we use forward_fill() and fill_null():

df_ffill_pl = df_missing_pl.select([
    pl.col('date'),
    pl.col('ticker'),
    pl.col('price').forward_fill(),
    pl.col('volume').forward_fill(),
    pl.col('dividend')
])
df_ffill_pl
shape: (10, 5)
date ticker price volume dividend
datetime[ns] str f64 f64 f64
2024-01-01 00:00:00 "AAPL" 150.0 1e6 0.0
2024-01-02 00:00:00 "AAPL" 152.5 1e6 0.0
2024-01-03 00:00:00 "AAPL" 152.5 1.2e6 0.0
2024-01-04 00:00:00 "AAPL" 153.0 950000.0 0.25
2024-01-05 00:00:00 "AAPL" 157.0 950000.0 0.0
2024-01-06 00:00:00 "AAPL" 157.0 1.1e6 0.0
2024-01-07 00:00:00 "AAPL" 159.0 1.3e6 0.0
2024-01-08 00:00:00 "AAPL" 158.5 1.3e6 0.0
2024-01-09 00:00:00 "AAPL" 158.5 1.25e6 0.0
2024-01-10 00:00:00 "AAPL" 161.0 1.4e6 0.0
# Polars mean imputation
mean_price = df_missing_pl['price'].mean()
mean_volume = df_missing_pl['volume'].mean()

df_mean_pl = df_missing_pl.with_columns([
    pl.col('price').fill_null(mean_price),
    pl.col('volume').fill_null(mean_volume)
])
df_mean_pl
shape: (10, 5)
date ticker price volume dividend
datetime[ns] str f64 f64 f64
2024-01-01 00:00:00 "AAPL" 150.0 1e6 0.0
2024-01-02 00:00:00 "AAPL" 152.5 1.1714e6 0.0
2024-01-03 00:00:00 "AAPL" 155.857143 1.2e6 0.0
2024-01-04 00:00:00 "AAPL" 153.0 950000.0 0.25
2024-01-05 00:00:00 "AAPL" 157.0 1.1714e6 0.0
2024-01-06 00:00:00 "AAPL" 155.857143 1.1e6 0.0
2024-01-07 00:00:00 "AAPL" 159.0 1.3e6 0.0
2024-01-08 00:00:00 "AAPL" 158.5 1.1714e6 0.0
2024-01-09 00:00:00 "AAPL" 155.857143 1.25e6 0.0
2024-01-10 00:00:00 "AAPL" 161.0 1.4e6 0.0
NoteChoosing an imputation method for financial data

The appropriate method depends on your data and context:

  • Forward fill: Good for prices and slowly-changing variables (assumes last observation is still valid)
  • Mean/median: Reasonable for cross-sectional data, but can distort distributions and relationships
  • Zero: Appropriate when zero is a meaningful value (e.g., dividends, corporate actions)
  • Interpolation: Useful when you expect smooth changes over time

Never use imputation blindly—think about what each method assumes about your data.

13.2.4.2 Linear interpolation

For time series data, linear interpolation can provide more realistic estimates than simple forward/backward filling:

df_interp = df_missing.copy()
df_interp['price'] = df_interp['price'].interpolate(method='linear')
df_interp['volume'] = df_interp['volume'].interpolate(method='linear')
df_interp
date ticker price volume dividend
0 2024-01-01 AAPL 150.00 1000000.0 0.00
1 2024-01-02 AAPL 152.50 1100000.0 0.00
2 2024-01-03 AAPL 152.75 1200000.0 0.00
3 2024-01-04 AAPL 153.00 950000.0 0.25
4 2024-01-05 AAPL 157.00 1025000.0 0.00
5 2024-01-06 AAPL 158.00 1100000.0 0.00
6 2024-01-07 AAPL 159.00 1300000.0 0.00
7 2024-01-08 AAPL 158.50 1275000.0 0.00
8 2024-01-09 AAPL 159.75 1250000.0 0.00
9 2024-01-10 AAPL 161.00 1400000.0 0.00

Time-based interpolation accounts for irregular spacing and requires a DatetimeIndex:

df_interp_time = df_missing.copy().set_index('date')
df_interp_time['price'] = df_interp_time['price'].interpolate(method='time')
df_interp_time = df_interp_time.reset_index()
df_interp_time
date ticker price volume dividend
0 2024-01-01 AAPL 150.00 1000000.0 0.00
1 2024-01-02 AAPL 152.50 NaN 0.00
2 2024-01-03 AAPL 152.75 1200000.0 0.00
3 2024-01-04 AAPL 153.00 950000.0 0.25
4 2024-01-05 AAPL 157.00 NaN 0.00
5 2024-01-06 AAPL 158.00 1100000.0 0.00
6 2024-01-07 AAPL 159.00 1300000.0 0.00
7 2024-01-08 AAPL 158.50 NaN 0.00
8 2024-01-09 AAPL 159.75 1250000.0 0.00
9 2024-01-10 AAPL 161.00 1400000.0 0.00

13.2.4.3 Group-based imputation

Often, you want to impute missing values differently for different groups. For example, you might want to impute missing returns for each industry separately:

# Create data with multiple tickers
data_grouped = {
    'date': pd.date_range('2024-01-01', periods=6, freq='D').tolist() * 2,
    'ticker': ['AAPL'] * 6 + ['MSFT'] * 6,
    'price': [150.0, np.nan, 155.0, 153.0, np.nan, 161.0,
              250.0, 252.0, np.nan, 258.0, 260.0, np.nan],
    'sector': ['Tech'] * 12
}

df_grouped = pd.DataFrame(data_grouped)
df_grouped
date ticker price sector
0 2024-01-01 AAPL 150.0 Tech
1 2024-01-02 AAPL NaN Tech
2 2024-01-03 AAPL 155.0 Tech
3 2024-01-04 AAPL 153.0 Tech
4 2024-01-05 AAPL NaN Tech
5 2024-01-06 AAPL 161.0 Tech
6 2024-01-01 MSFT 250.0 Tech
7 2024-01-02 MSFT 252.0 Tech
8 2024-01-03 MSFT NaN Tech
9 2024-01-04 MSFT 258.0 Tech
10 2024-01-05 MSFT 260.0 Tech
11 2024-01-06 MSFT NaN Tech

Forward fill within each ticker group:

df_grouped_ffill = df_grouped.copy()
df_grouped_ffill['price'] = df_grouped_ffill.groupby('ticker')['price'].ffill()
df_grouped_ffill
date ticker price sector
0 2024-01-01 AAPL 150.0 Tech
1 2024-01-02 AAPL 150.0 Tech
2 2024-01-03 AAPL 155.0 Tech
3 2024-01-04 AAPL 153.0 Tech
4 2024-01-05 AAPL 153.0 Tech
5 2024-01-06 AAPL 161.0 Tech
6 2024-01-01 MSFT 250.0 Tech
7 2024-01-02 MSFT 252.0 Tech
8 2024-01-03 MSFT 252.0 Tech
9 2024-01-04 MSFT 258.0 Tech
10 2024-01-05 MSFT 260.0 Tech
11 2024-01-06 MSFT 260.0 Tech

Mean imputation within each sector:

df_grouped_mean = df_grouped.copy()
df_grouped_mean['price'] = df_grouped_mean.groupby('sector')['price'].transform(
    lambda x: x.fillna(x.mean())
)
df_grouped_mean
date ticker price sector
0 2024-01-01 AAPL 150.000 Tech
1 2024-01-02 AAPL 204.875 Tech
2 2024-01-03 AAPL 155.000 Tech
3 2024-01-04 AAPL 153.000 Tech
4 2024-01-05 AAPL 204.875 Tech
5 2024-01-06 AAPL 161.000 Tech
6 2024-01-01 MSFT 250.000 Tech
7 2024-01-02 MSFT 252.000 Tech
8 2024-01-03 MSFT 204.875 Tech
9 2024-01-04 MSFT 258.000 Tech
10 2024-01-05 MSFT 260.000 Tech
11 2024-01-06 MSFT 204.875 Tech

In Polars, we use the over() method for group-based operations:

df_grouped_pl = pl.from_pandas(df_grouped)

# Forward fill within groups
df_grouped_ffill_pl = df_grouped_pl.with_columns([
    pl.col('price').forward_fill().over('ticker')
])
df_grouped_ffill_pl
shape: (12, 4)
date ticker price sector
datetime[ns] str f64 str
2024-01-01 00:00:00 "AAPL" 150.0 "Tech"
2024-01-02 00:00:00 "AAPL" 150.0 "Tech"
2024-01-03 00:00:00 "AAPL" 155.0 "Tech"
2024-01-04 00:00:00 "AAPL" 153.0 "Tech"
2024-01-05 00:00:00 "AAPL" 153.0 "Tech"
2024-01-02 00:00:00 "MSFT" 252.0 "Tech"
2024-01-03 00:00:00 "MSFT" 252.0 "Tech"
2024-01-04 00:00:00 "MSFT" 258.0 "Tech"
2024-01-05 00:00:00 "MSFT" 260.0 "Tech"
2024-01-06 00:00:00 "MSFT" 260.0 "Tech"
# Mean imputation within groups
df_grouped_mean_pl = df_grouped_pl.with_columns([
    pl.col('price').fill_null(pl.col('price').mean()).over('sector')
])
df_grouped_mean_pl
shape: (12, 4)
date ticker price sector
datetime[ns] str f64 str
2024-01-01 00:00:00 "AAPL" 150.0 "Tech"
2024-01-02 00:00:00 "AAPL" 204.875 "Tech"
2024-01-03 00:00:00 "AAPL" 155.0 "Tech"
2024-01-04 00:00:00 "AAPL" 153.0 "Tech"
2024-01-05 00:00:00 "AAPL" 204.875 "Tech"
2024-01-02 00:00:00 "MSFT" 252.0 "Tech"
2024-01-03 00:00:00 "MSFT" 204.875 "Tech"
2024-01-04 00:00:00 "MSFT" 258.0 "Tech"
2024-01-05 00:00:00 "MSFT" 260.0 "Tech"
2024-01-06 00:00:00 "MSFT" 204.875 "Tech"

13.3 Data Validation

Data validation is the process of checking whether your data meets quality standards and assumptions before analysis. In finance, invalid data can lead to catastrophic errors—imagine executing a trading strategy based on incorrect prices or publishing research with flawed data.

Validation encompasses several activities:

  • Checking data types and formats
  • Verifying data ranges and constraints
  • Identifying outliers and anomalies
  • Ensuring logical consistency
  • Cross-checking against external sources
TipAutomated validation with pandera

For systematic data validation, consider using pandera, a Python framework for automatic validation of DataFrames. Pandera lets you define schemas that specify expected data types, value ranges, and other constraints, then automatically validates your data against these schemas. It supports both pandas and Polars.

13.3.1 Validating data types and formats

First, ensure that each column has the correct data type:

# Create sample data with type issues
data_types = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', 'invalid'],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'price': ['150.0', '152.5', '155.0', '153.0'],  # Stored as strings
    'volume': [1000000, 1100000, 1200000, -950000]  # Negative volume
}

df_types = pd.DataFrame(data_types)
df_types
date ticker price volume
0 2024-01-01 AAPL 150.0 1000000
1 2024-01-02 AAPL 152.5 1100000
2 2024-01-03 AAPL 155.0 1200000
3 invalid AAPL 153.0 -950000
df_types.dtypes
date      object
ticker    object
price     object
volume     int64
dtype: object

Now let’s validate and fix the types. We use errors='coerce' to convert invalid values to NaN instead of raising an error:

# Convert price to numeric, coercing errors to NaN
df_types['price'] = pd.to_numeric(df_types['price'], errors='coerce')

# Convert date to datetime
df_types['date'] = pd.to_datetime(df_types['date'], errors='coerce')

df_types
date ticker price volume
0 2024-01-01 AAPL 150.0 1000000
1 2024-01-02 AAPL 152.5 1100000
2 2024-01-03 AAPL 155.0 1200000
3 NaT AAPL 153.0 -950000
df_types.dtypes
date      datetime64[ns]
ticker            object
price            float64
volume             int64
dtype: object

We can identify rows with conversion errors by checking for NaN values:

invalid_dates = df_types['date'].isna()
print(f"Rows with invalid dates: {invalid_dates.sum()}")
Rows with invalid dates: 1
df_types[invalid_dates]
date ticker price volume
3 NaT AAPL 153.0 -950000
WarningCommon type issues in financial data
  • Dates: Mixed formats (MM/DD/YYYY vs DD/MM/YYYY), Excel date serials, timezone issues
  • Numbers: Thousands separators, currency symbols, percentage signs stored as text
  • Missing values: Coded as -999, ‘NA’, empty strings, or other sentinel values
  • Categorical variables: Inconsistent capitalization or spelling

Always check types early in your pipeline and convert them explicitly.

13.3.2 Range and constraint validation

Financial data often has natural constraints. Prices must be positive, probabilities must be between 0 and 1, and returns are typically bounded (though extreme events can violate typical ranges).

# Create data with constraint violations
data_constraints = {
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'ticker': ['AAPL'] * 5,
    'price': [150.0, -152.5, 155.0, 1000000.0, 157.0],  # Negative and extreme prices
    'volume': [1000000, 1100000, 1200000, -950000, 1300000],  # Negative volume
    'return': [0.01, 0.02, -0.01, 5.0, -0.03]  # Extreme return
}

df_constraints = pd.DataFrame(data_constraints)

# Define validation rules
def validate_price(price):
    """Check if price is valid."""
    return (price > 0) & (price < 100000)

def validate_volume(volume):
    """Check if volume is valid."""
    return volume >= 0

def validate_return(ret):
    """Check if return is reasonable."""
    return (ret > -1) & (ret < 1)

# Apply validation
df_constraints['price_valid'] = validate_price(df_constraints['price'])
df_constraints['volume_valid'] = validate_volume(df_constraints['volume'])
df_constraints['return_valid'] = validate_return(df_constraints['return'])

df_constraints
date ticker price volume return price_valid volume_valid return_valid
0 2024-01-01 AAPL 150.0 1000000 0.01 True True True
1 2024-01-02 AAPL -152.5 1100000 0.02 False True True
2 2024-01-03 AAPL 155.0 1200000 -0.01 True True True
3 2024-01-04 AAPL 1000000.0 -950000 5.00 False False False
4 2024-01-05 AAPL 157.0 1300000 -0.03 True True True

We can identify invalid rows by combining the validation columns:

invalid_mask = ~(df_constraints['price_valid'] &
                 df_constraints['volume_valid'] &
                 df_constraints['return_valid'])
print(f"Invalid rows: {invalid_mask.sum()}")
Invalid rows: 2
df_constraints[invalid_mask]
date ticker price volume return price_valid volume_valid return_valid
1 2024-01-02 AAPL -152.5 1100000 0.02 False True True
3 2024-01-04 AAPL 1000000.0 -950000 5.00 False False False

In Polars, we can use expressions for validation:

df_constraints_pl = pl.DataFrame(data_constraints)

# Add validation columns
df_validated_pl = df_constraints_pl.with_columns([
    ((pl.col('price') > 0) & (pl.col('price') < 100000)).alias('price_valid'),
    (pl.col('volume') >= 0).alias('volume_valid'),
    ((pl.col('return') > -1) & (pl.col('return') < 1)).alias('return_valid')
])

df_validated_pl
shape: (5, 8)
date ticker price volume return price_valid volume_valid return_valid
datetime[ns] str f64 i64 f64 bool bool bool
2024-01-01 00:00:00 "AAPL" 150.0 1000000 0.01 true true true
2024-01-02 00:00:00 "AAPL" -152.5 1100000 0.02 false true true
2024-01-03 00:00:00 "AAPL" 155.0 1200000 -0.01 true true true
2024-01-04 00:00:00 "AAPL" 1e6 -950000 5.0 false false false
2024-01-05 00:00:00 "AAPL" 157.0 1300000 -0.03 true true true
# Filter to invalid rows
invalid_pl = df_validated_pl.filter(
    ~pl.col('price_valid') | ~pl.col('volume_valid') | ~pl.col('return_valid')
)
invalid_pl
shape: (2, 8)
date ticker price volume return price_valid volume_valid return_valid
datetime[ns] str f64 i64 f64 bool bool bool
2024-01-02 00:00:00 "AAPL" -152.5 1100000 0.02 false true true
2024-01-04 00:00:00 "AAPL" 1e6 -950000 5.0 false false false

13.3.3 Detecting outliers

Outliers are observations that deviate significantly from other observations. In finance, outliers could indicate:

  • Genuine extreme events (market crashes, flash crashes)
  • Data errors (wrong decimal place, data feed glitches)
  • Special situations (stock splits, spinoffs)

Outlier detection requires domain knowledge. A 50% daily return is an error for a typical stock but normal for a penny stock or during a takeover announcement.

13.3.3.1 Statistical methods for outlier detection

Common approaches include:

  1. Z-score method: Flag observations more than N standard deviations from the mean
  2. IQR method: Flag observations outside 1.5 times the interquartile range
  3. Modified Z-score: Use median absolute deviation instead of standard deviation (more robust)
  4. Domain-specific rules: Use knowledge about reasonable ranges

Choosing the right method for a particular application is beyond the scope of this book and requires careful consideration of the data characteristics and analysis goals.

13.3.3.2 Handling outliers

Once you’ve identified outliers, you have several options:

  1. Keep them: If they’re genuine extreme events
  2. Remove them: If they’re clearly errors
  3. Cap/Winsorize: Replace extreme values with a threshold
  4. Transform the data: Use log transformation or other methods to reduce the impact
  5. Investigate: Manually check each outlier
WarningBe cautious with outlier removal

In finance, extreme events are often the most important observations. The 2008 financial crisis, COVID-19 market crash, and other tail events are “outliers” but carry critical information. Before removing outliers:

  1. Investigate each one individually if possible
  2. Check if they coincide with known events (earnings announcements, news, market crises)
  3. Consider the impact on your conclusions if you keep vs. remove them
  4. Document your decision and reasoning

When in doubt, perform your analysis both with and without outliers to assess sensitivity.