13 Data Cleaning

Data cleaning is one of the most critical and time-consuming steps in any empirical analysis. In finance, working with real-world data means confronting issues like duplicate records, missing observations, data entry errors, and outliers. These problems are not merely technical annoyances—they can lead to incorrect conclusions, flawed trading strategies, or misleading research findings if not handled properly.

The goal of data cleaning is not to make data “perfect” but to make it suitable for analysis. This means understanding the nature of data quality issues, their potential causes, and the trade-offs involved in different cleaning approaches. In this chapter, we will cover the essential techniques for detecting and handling common data quality problems using Python’s main data manipulation libraries: pandas and Polars.

Before we begin, let’s understand an important principle: data cleaning decisions should be documented and reproducible. Every choice you make—whether to drop duplicates, impute missing values, or remove outliers—affects your results. Your code should clearly show what cleaning steps were taken and why, so that others (including your future self) can understand and validate your approach.

Data Wrangler in VS Code

Microsoft provides a VS Code extension called “Data Wrangler” that provides a nice UI for filtering, cleaning, and transforming data in pandas DataFrames. When done, you can export the Python code to replicate your steps.

13.1 Detecting and Handling Duplicates

Duplicate records are a common problem in financial datasets. They can arise from multiple sources: data feed errors that transmit the same trade twice, mistakes in data aggregation where the same record appears in multiple files, or simple human error in manual data entry. Duplicates are particularly problematic because they can distort statistical analyses, inflate trading volumes, and create artificial patterns in the data.

13.1.1 Identifying duplicates

The first step in handling duplicates is identifying them. A duplicate can be defined in different ways depending on your data structure and business context.

Types of duplicates

Complete duplicates: All columns have identical values
Partial duplicates: Only certain key columns are identical (e.g., same date and ticker, but different price)
Near duplicates: Values are very similar but not exactly identical (often due to floating-point precision)

The appropriate definition depends on your data and use case.

Let’s start with a simple example using stock price data:

import pandas as pd
import polars as pl
from datetime import datetime, timedelta

# Create sample stock price data with duplicates
dates = pd.date_range('2024-01-01', periods=5, freq='D')
data = {
    'date': list(dates) + [dates[2]],  # Duplicate date
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'price': [150.0, 152.5, 155.0, 153.0, 157.0, 155.0],
    'volume': [1000000, 1100000, 1200000, 950000, 1300000, 1200000]
}

df_pd = pd.DataFrame(data)
df_pd

1: Adding a duplicate of the third date to create a duplicate record.

	date	ticker	price	volume
0	2024-01-01	AAPL	150.0	1000000
1	2024-01-02	AAPL	152.5	1100000
2	2024-01-03	AAPL	155.0	1200000
3	2024-01-04	AAPL	153.0	950000
4	2024-01-05	AAPL	157.0	1300000
5	2024-01-03	AAPL	155.0	1200000

Now let’s detect duplicates. The duplicated() method returns a boolean Series indicating which rows are duplicates:

# Check for complete duplicates
df_pd.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

We can also check for duplicates based on specific columns:

# Check for duplicates based on specific columns (date and ticker)
df_pd.duplicated(subset=['date', 'ticker'])

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

To see which rows are duplicates, we can filter the DataFrame:

# See which rows are duplicates (keep=False marks all duplicates)
df_pd[df_pd.duplicated(subset=['date', 'ticker'], keep=False)]

	date	ticker	price	volume
2	2024-01-03	AAPL	155.0	1200000
5	2024-01-03	AAPL	155.0	1200000

The keep parameter controls which duplicate to mark:

keep='first' (default): Mark all duplicates as True except the first occurrence
keep='last': Mark all duplicates as True except the last occurrence
keep=False: Mark all duplicates as True, including the first occurrence

Now let’s see the same operations in Polars:

df_pl = pl.DataFrame(data)
df_pl

shape: (6, 4)

date	ticker	price	volume
datetime[μs]	str	f64	i64
2024-01-01 00:00:00	"AAPL"	150.0	1000000
2024-01-02 00:00:00	"AAPL"	152.5	1100000
2024-01-03 00:00:00	"AAPL"	155.0	1200000
2024-01-04 00:00:00	"AAPL"	153.0	950000
2024-01-05 00:00:00	"AAPL"	157.0	1300000
2024-01-03 00:00:00	"AAPL"	155.0	1200000

To find duplicates in Polars, we use is_duplicated():

# Check for duplicates based on date and ticker
df_pl.filter(pl.struct(['date', 'ticker']).is_duplicated())

shape: (2, 4)

date	ticker	price	volume
datetime[μs]	str	f64	i64
2024-01-03 00:00:00	"AAPL"	155.0	1200000
2024-01-03 00:00:00	"AAPL"	155.0	1200000

13.1.2 Removing duplicates

Once we’ve identified duplicates, we need to decide how to handle them. The most common approaches are:

Drop all duplicates: Keep only the first (or last) occurrence
Aggregate duplicates: Combine duplicate rows using aggregation (e.g., average prices)
Manual investigation: For small numbers of duplicates, inspect each case individually

Let’s see how to implement these approaches. In pandas, we use drop_duplicates():

# pandas: Drop duplicates, keeping first occurrence
df_pd_clean = df_pd.drop_duplicates(subset=['date', 'ticker'], keep='first')
df_pd_clean

	date	ticker	price	volume
0	2024-01-01	AAPL	150.0	1000000
1	2024-01-02	AAPL	152.5	1100000
2	2024-01-03	AAPL	155.0	1200000
3	2024-01-04	AAPL	153.0	950000
4	2024-01-05	AAPL	157.0	1300000

In Polars, we use the unique() method:

# Polars: Drop duplicates
df_pl_clean = df_pl.unique(subset=['date', 'ticker'], keep='first')
df_pl_clean

shape: (5, 4)

date	ticker	price	volume
datetime[μs]	str	f64	i64
2024-01-03 00:00:00	"AAPL"	155.0	1200000
2024-01-04 00:00:00	"AAPL"	153.0	950000
2024-01-01 00:00:00	"AAPL"	150.0	1000000
2024-01-02 00:00:00	"AAPL"	152.5	1100000
2024-01-05 00:00:00	"AAPL"	157.0	1300000

For the aggregation approach, suppose we want to take the average price and total volume when duplicates exist:

# pandas: Aggregate duplicates
df_pd_agg = df_pd.groupby(['date', 'ticker'], as_index=False).agg({
    'price': 'mean',
    'volume': 'sum'
})
df_pd_agg

	date	ticker	price	volume
0	2024-01-01	AAPL	150.0	1000000
1	2024-01-02	AAPL	152.5	1100000
2	2024-01-03	AAPL	155.0	2400000
3	2024-01-04	AAPL	153.0	950000
4	2024-01-05	AAPL	157.0	1300000

# Polars: Aggregate duplicates
df_pl_agg = df_pl.group_by(['date', 'ticker']).agg([
    pl.col('price').mean(),
    pl.col('volume').sum()
])
df_pl_agg

shape: (5, 4)

date	ticker	price	volume
datetime[μs]	str	f64	i64
2024-01-01 00:00:00	"AAPL"	150.0	1000000
2024-01-03 00:00:00	"AAPL"	155.0	2400000
2024-01-05 00:00:00	"AAPL"	157.0	1300000
2024-01-02 00:00:00	"AAPL"	152.5	1100000
2024-01-04 00:00:00	"AAPL"	153.0	950000

Common pitfall: implicit duplicates

Be careful about duplicates that arise from data structure decisions. For example, if you merge two datasets and accidentally create a many-to-many join, you may generate duplicates without realizing it. Always check the row count before and after joins to ensure you haven’t inadvertently created duplicates. We discuss joins in detail in Chapter 16.

13.1.3 Best practices for duplicate handling

Always investigate before removing: Look at a sample of duplicates to understand why they exist
Document your decisions: Add comments explaining which columns define a duplicate and why
Keep track of removals: Log how many duplicates you removed and their characteristics
Consider the source: Different data sources may have different duplicate patterns

# Example: Comprehensive duplicate handling with logging
def handle_duplicates(df, subset_cols, keep='first', verbose=True):
    """
    Remove duplicates with logging.

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe
    subset_cols : list
        Columns that define a duplicate
    keep : str
        Which duplicate to keep ('first', 'last', or False)
    verbose : bool
        Whether to print diagnostic information

    Returns
    -------
    pd.DataFrame
        Cleaned dataframe
    """
    initial_rows = len(df)
    duplicated_mask = df.duplicated(subset=subset_cols, keep=keep)
    n_duplicates = duplicated_mask.sum()

    if verbose:
        print(f"Initial rows: {initial_rows}")
        print(f"Duplicates found: {n_duplicates}")
        if n_duplicates > 0:
            print(f"Duplicate rows (sample):")
            print(df[duplicated_mask].head())

    df_clean = df[~duplicated_mask].copy()

    if verbose:
        print(f"Final rows: {len(df_clean)}")
        print(f"Rows removed: {initial_rows - len(df_clean)}")

    return df_clean

# Use the function
df_cleaned = handle_duplicates(df_pd, subset_cols=['date', 'ticker'])

Initial rows: 6
Duplicates found: 1
Duplicate rows (sample):
        date ticker  price   volume
5 2024-01-03   AAPL  155.0  1200000
Final rows: 5
Rows removed: 1

13.2 Missing Data: Detection and Imputation

Missing data is ubiquitous in financial datasets. Stock markets close on holidays, companies report earnings quarterly but not daily, economic indicators are released with delays, and data collection systems occasionally fail. How you handle missing data can significantly impact your analysis, and there is rarely a single “correct” approach.

13.2.1 Understanding missing data mechanisms

Before deciding how to handle missing data, it’s important to understand why data might be missing. Statisticians classify missing data into three categories:

Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any observed or unobserved data. For example, if a data feed randomly drops 1% of all observations due to network issues, the missing data is MCAR.
Missing at Random (MAR): The probability that a value is missing depends on observed data but not on the missing value itself. For example, small-cap stocks might have more missing volume data than large-cap stocks, but conditional on market cap, the missingness is random.
Missing Not at Random (MNAR): The probability that a value is missing depends on the missing value itself. For example, companies might be less likely to report earnings when performance is poor, or traders might fail to report large losses.

Why this matters

The type of missingness affects which imputation methods are valid. MCAR is the easiest to handle (you can often just drop missing observations), while MNAR is the most problematic because missing data itself carries information. Unfortunately, in real financial data, you often cannot definitively determine which mechanism is operating, so you need to think carefully about the context.

13.2.2 Detecting missing data

Python uses different representations for missing data depending on the data type:

NaN (Not a Number): Used for missing floating-point values
None: Python’s built-in null value
NaT (Not a Time): Used for missing datetime values
pd.NA: pandas’ new missing indicator for all dtypes

Historical context: missing values in pandas

Initially, pandas used NumPy as its backend and could not represent missing values explicitly. Instead, it used NaN for floats and None (a Python object) for strings. This approach had important limitations:

For integers, pandas would first convert all values in the column to float and use NaN to represent missing values. This meant you couldn’t have a truly integer column with missing values.
NaN has a specific meaning in floating-point arithmetic: it represents a value that is “not a number,” such as the result of an undefined operation (e.g., log(-1)). Reusing NaN to mean “missing” conflates two distinct concepts and complicates handling of true NaN results.

Polars and newer backends for pandas (such as PyArrow) explicitly support missing data indicators for all column types, avoiding these limitations.

Let’s create a dataset with missing values:

import numpy as np

data_missing = {
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'ticker': ['AAPL'] * 10,
    'price': [150.0, 152.5, np.nan, 153.0, 157.0, np.nan, 159.0, 158.5, np.nan, 161.0],
    'volume': [1000000, np.nan, 1200000, 950000, np.nan, 1100000, 1300000, np.nan, 1250000, 1400000],
    'dividend': [0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
}

df_missing = pd.DataFrame(data_missing)
df_missing

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	NaN	0.00
2	2024-01-03	AAPL	NaN	1200000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	NaN	0.00
5	2024-01-06	AAPL	NaN	1100000.0	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	NaN	0.00
8	2024-01-09	AAPL	NaN	1250000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

Now let’s detect missing values. The isna() method returns a boolean DataFrame indicating which values are missing:

df_missing.isna()

	date	ticker	price	volume	dividend
0	False	False	False	False	False
1	False	False	False	True	False
2	False	False	True	False	False
3	False	False	False	False	False
4	False	False	False	True	False
5	False	False	True	False	False
6	False	False	False	False	False
7	False	False	False	True	False
8	False	False	True	False	False
9	False	False	False	False	False

We can count missing values per column:

df_missing.isna().sum()

date        0
ticker      0
price       3
volume      3
dividend    0
dtype: int64

Or calculate the percentage of missing values:

df_missing.isna().mean() * 100

date         0.0
ticker       0.0
price       30.0
volume      30.0
dividend     0.0
dtype: float64

To see rows with any missing values:

df_missing[df_missing.isna().any(axis=1)]

	date	ticker	price	volume
1	2024-01-02	AAPL	152.5	NaN
2	2024-01-03	AAPL	NaN	1200000.0
4	2024-01-05	AAPL	157.0	NaN
5	2024-01-06	AAPL	NaN	1100000.0
7	2024-01-08	AAPL	158.5	NaN
8	2024-01-09	AAPL	NaN	1250000.0

In Polars, we use null_count() to count missing values:

# Polars version - convert from pandas DataFrame
# (Polars uses None/null for missing values, not np.nan)
df_missing_pl = pl.from_pandas(df_missing)

df_missing_pl.null_count()

shape: (1, 5)

date	ticker	price	volume	dividend
u32	u32	u32	u32	u32
0	0	3	3	0

We can filter rows with missing values in a specific column:

df_missing_pl.filter(pl.col('price').is_null())

shape: (3, 5)

date	ticker	price	volume	dividend
datetime[ns]	str	f64	f64	f64
2024-01-03 00:00:00	"AAPL"	null	1.2e6	0.0
2024-01-06 00:00:00	"AAPL"	null	1.1e6	0.0
2024-01-09 00:00:00	"AAPL"	null	1.25e6	0.0

13.2.3 Handling missing data: deletion

The simplest approach to missing data is to delete it. There are two main strategies:

Listwise deletion (complete case analysis): Rows with any missing values are completely dropped from the sample, so they are not used in any calculation or analysis. This approach is simple and ensures consistency across analyses, but can significantly reduce sample size if missingness is widespread.
Pairwise deletion: For each calculation or analysis (e.g., for each regression), use all observations for which we have the required variables. This approach maximizes the use of available data but can lead to different sample sizes for different analyses, making comparisons difficult.

Listwise deletion removes any row with missing values:

df_complete = df_missing.dropna()
df_complete

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
6	2024-01-07	AAPL	159.0	1300000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

print(f"Rows removed: {len(df_missing) - len(df_complete)}")

Rows removed: 6

We can also drop rows only if specific columns are missing:

df_price_complete = df_missing.dropna(subset=['price'])
df_price_complete

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	NaN	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	NaN	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	NaN	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

Or drop columns with any missing values:

df_no_missing_cols = df_missing.dropna(axis=1)
df_no_missing_cols

	date	ticker	dividend
0	2024-01-01	AAPL	0.00
1	2024-01-02	AAPL	0.00
2	2024-01-03	AAPL	0.00
3	2024-01-04	AAPL	0.25
4	2024-01-05	AAPL	0.00
5	2024-01-06	AAPL	0.00
6	2024-01-07	AAPL	0.00
7	2024-01-08	AAPL	0.00
8	2024-01-09	AAPL	0.00
9	2024-01-10	AAPL	0.00

In Polars, we use drop_nulls():

df_complete_pl = df_missing_pl.drop_nulls()
df_complete_pl

shape: (4, 5)

date	ticker	price	volume	dividend
datetime[ns]	str	f64	f64	f64
2024-01-01 00:00:00	"AAPL"	150.0	1e6	0.0
2024-01-04 00:00:00	"AAPL"	153.0	950000.0	0.25
2024-01-07 00:00:00	"AAPL"	159.0	1.3e6	0.0
2024-01-10 00:00:00	"AAPL"	161.0	1.4e6	0.0

# Drop rows with null in specific column
df_price_complete_pl = df_missing_pl.drop_nulls(subset=['price'])
df_price_complete_pl

shape: (7, 5)

date	ticker	price	volume	dividend
datetime[ns]	str	f64	f64	f64
2024-01-01 00:00:00	"AAPL"	150.0	1e6	0.0
2024-01-02 00:00:00	"AAPL"	152.5	null	0.0
2024-01-04 00:00:00	"AAPL"	153.0	950000.0	0.25
2024-01-05 00:00:00	"AAPL"	157.0	null	0.0
2024-01-07 00:00:00	"AAPL"	159.0	1.3e6	0.0
2024-01-08 00:00:00	"AAPL"	158.5	null	0.0
2024-01-10 00:00:00	"AAPL"	161.0	1.4e6	0.0

When deletion is problematic

Deletion can introduce bias if:

You lose a large percentage of your data
Missingness is not random (MAR or MNAR)
Missing data occurs in key variables

In time series analysis, deleting observations can break the temporal structure of your data. Use deletion cautiously and always check how much data you’re losing.

13.2.4 Handling missing data: imputation

Imputation means filling in missing values with plausible estimates. There are many imputation strategies, each with different assumptions and use cases.

13.2.4.1 Simple imputation methods

Forward fill: Use the last observed value
Backward fill: Use the next observed value
Mean/median imputation: Replace with the column mean or median
Zero/constant imputation: Replace with zero or another constant

In empirical finance, we usually want to avoid introducing look-ahead bias or imputing information that was not available at the time of observation. For this reason, forward fill and zero/constant imputation are the most commonly used imputation methods in financial research.

Forward fill uses the last observed value:

df_ffill = df_missing.copy()
df_ffill[['price', 'volume']] = df_ffill[['price', 'volume']].ffill()
df_ffill

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	1000000.0	0.00
2	2024-01-03	AAPL	152.5	1200000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	950000.0	0.00
5	2024-01-06	AAPL	157.0	1100000.0	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	1300000.0	0.00
8	2024-01-09	AAPL	158.5	1250000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

Backward fill uses the next observed value:

df_bfill = df_missing.copy()
df_bfill[['price', 'volume']] = df_bfill[['price', 'volume']].bfill()
df_bfill

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	1200000.0	0.00
2	2024-01-03	AAPL	153.0	1200000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	1100000.0	0.00
5	2024-01-06	AAPL	159.0	1100000.0	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	1250000.0	0.00
8	2024-01-09	AAPL	161.0	1250000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

Mean imputation replaces missing values with the column mean:

df_mean = df_missing.copy()
df_mean['price'] = df_mean['price'].fillna(df_mean['price'].mean())
df_mean['volume'] = df_mean['volume'].fillna(df_mean['volume'].mean())
df_mean

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.000000	1.000000e+06	0.00
1	2024-01-02	AAPL	152.500000	1.171429e+06	0.00
2	2024-01-03	AAPL	155.857143	1.200000e+06	0.00
3	2024-01-04	AAPL	153.000000	9.500000e+05	0.25
4	2024-01-05	AAPL	157.000000	1.171429e+06	0.00
5	2024-01-06	AAPL	155.857143	1.100000e+06	0.00
6	2024-01-07	AAPL	159.000000	1.300000e+06	0.00
7	2024-01-08	AAPL	158.500000	1.171429e+06	0.00
8	2024-01-09	AAPL	155.857143	1.250000e+06	0.00
9	2024-01-10	AAPL	161.000000	1.400000e+06	0.00

Median imputation is more robust to outliers:

df_median = df_missing.copy()
df_median['price'] = df_median['price'].fillna(df_median['price'].median())
df_median['volume'] = df_median['volume'].fillna(df_median['volume'].median())
df_median

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	1200000.0	0.00
2	2024-01-03	AAPL	157.0	1200000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	1200000.0	0.00
5	2024-01-06	AAPL	157.0	1100000.0	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	1200000.0	0.00
8	2024-01-09	AAPL	157.0	1250000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

Constant value imputation:

df_zero = df_missing.copy()
df_zero['dividend'] = df_zero['dividend'].fillna(0)
df_zero

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.0	1000000.0	0.00
1	2024-01-02	AAPL	152.5	NaN	0.00
2	2024-01-03	AAPL	NaN	1200000.0	0.00
3	2024-01-04	AAPL	153.0	950000.0	0.25
4	2024-01-05	AAPL	157.0	NaN	0.00
5	2024-01-06	AAPL	NaN	1100000.0	0.00
6	2024-01-07	AAPL	159.0	1300000.0	0.00
7	2024-01-08	AAPL	158.5	NaN	0.00
8	2024-01-09	AAPL	NaN	1250000.0	0.00
9	2024-01-10	AAPL	161.0	1400000.0	0.00

In Polars, we use forward_fill() and fill_null():

df_ffill_pl = df_missing_pl.select([
    pl.col('date'),
    pl.col('ticker'),
    pl.col('price').forward_fill(),
    pl.col('volume').forward_fill(),
    pl.col('dividend')
])
df_ffill_pl

shape: (10, 5)

date	ticker	price	volume	dividend
datetime[ns]	str	f64	f64	f64
2024-01-01 00:00:00	"AAPL"	150.0	1e6	0.0
2024-01-02 00:00:00	"AAPL"	152.5	1e6	0.0
2024-01-03 00:00:00	"AAPL"	152.5	1.2e6	0.0
2024-01-04 00:00:00	"AAPL"	153.0	950000.0	0.25
2024-01-05 00:00:00	"AAPL"	157.0	950000.0	0.0
2024-01-06 00:00:00	"AAPL"	157.0	1.1e6	0.0
2024-01-07 00:00:00	"AAPL"	159.0	1.3e6	0.0
2024-01-08 00:00:00	"AAPL"	158.5	1.3e6	0.0
2024-01-09 00:00:00	"AAPL"	158.5	1.25e6	0.0
2024-01-10 00:00:00	"AAPL"	161.0	1.4e6	0.0

# Polars mean imputation
mean_price = df_missing_pl['price'].mean()
mean_volume = df_missing_pl['volume'].mean()

df_mean_pl = df_missing_pl.with_columns([
    pl.col('price').fill_null(mean_price),
    pl.col('volume').fill_null(mean_volume)
])
df_mean_pl

shape: (10, 5)

date	ticker	price	volume	dividend
datetime[ns]	str	f64	f64	f64
2024-01-01 00:00:00	"AAPL"	150.0	1e6	0.0
2024-01-02 00:00:00	"AAPL"	152.5	1.1714e6	0.0
2024-01-03 00:00:00	"AAPL"	155.857143	1.2e6	0.0
2024-01-04 00:00:00	"AAPL"	153.0	950000.0	0.25
2024-01-05 00:00:00	"AAPL"	157.0	1.1714e6	0.0
2024-01-06 00:00:00	"AAPL"	155.857143	1.1e6	0.0
2024-01-07 00:00:00	"AAPL"	159.0	1.3e6	0.0
2024-01-08 00:00:00	"AAPL"	158.5	1.1714e6	0.0
2024-01-09 00:00:00	"AAPL"	155.857143	1.25e6	0.0
2024-01-10 00:00:00	"AAPL"	161.0	1.4e6	0.0

Choosing an imputation method for financial data

The appropriate method depends on your data and context:

Forward fill: Good for prices and slowly-changing variables (assumes last observation is still valid)
Mean/median: Reasonable for cross-sectional data, but can distort distributions and relationships
Zero: Appropriate when zero is a meaningful value (e.g., dividends, corporate actions)
Interpolation: Useful when you expect smooth changes over time

Never use imputation blindly—think about what each method assumes about your data.

13.2.4.2 Linear interpolation

For time series data, linear interpolation can provide more realistic estimates than simple forward/backward filling:

df_interp = df_missing.copy()
df_interp['price'] = df_interp['price'].interpolate(method='linear')
df_interp['volume'] = df_interp['volume'].interpolate(method='linear')
df_interp

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.00	1000000.0	0.00
1	2024-01-02	AAPL	152.50	1100000.0	0.00
2	2024-01-03	AAPL	152.75	1200000.0	0.00
3	2024-01-04	AAPL	153.00	950000.0	0.25
4	2024-01-05	AAPL	157.00	1025000.0	0.00
5	2024-01-06	AAPL	158.00	1100000.0	0.00
6	2024-01-07	AAPL	159.00	1300000.0	0.00
7	2024-01-08	AAPL	158.50	1275000.0	0.00
8	2024-01-09	AAPL	159.75	1250000.0	0.00
9	2024-01-10	AAPL	161.00	1400000.0	0.00

Time-based interpolation accounts for irregular spacing and requires a DatetimeIndex:

df_interp_time = df_missing.copy().set_index('date')
df_interp_time['price'] = df_interp_time['price'].interpolate(method='time')
df_interp_time = df_interp_time.reset_index()
df_interp_time

	date	ticker	price	volume	dividend
0	2024-01-01	AAPL	150.00	1000000.0	0.00
1	2024-01-02	AAPL	152.50	NaN	0.00
2	2024-01-03	AAPL	152.75	1200000.0	0.00
3	2024-01-04	AAPL	153.00	950000.0	0.25
4	2024-01-05	AAPL	157.00	NaN	0.00
5	2024-01-06	AAPL	158.00	1100000.0	0.00
6	2024-01-07	AAPL	159.00	1300000.0	0.00
7	2024-01-08	AAPL	158.50	NaN	0.00
8	2024-01-09	AAPL	159.75	1250000.0	0.00
9	2024-01-10	AAPL	161.00	1400000.0	0.00

13.2.4.3 Group-based imputation

Often, you want to impute missing values differently for different groups. For example, you might want to impute missing returns for each industry separately:

# Create data with multiple tickers
data_grouped = {
    'date': pd.date_range('2024-01-01', periods=6, freq='D').tolist() * 2,
    'ticker': ['AAPL'] * 6 + ['MSFT'] * 6,
    'price': [150.0, np.nan, 155.0, 153.0, np.nan, 161.0,
              250.0, 252.0, np.nan, 258.0, 260.0, np.nan],
    'sector': ['Tech'] * 12
}

df_grouped = pd.DataFrame(data_grouped)
df_grouped

	date	ticker	price	sector
0	2024-01-01	AAPL	150.0	Tech
1	2024-01-02	AAPL	NaN	Tech
2	2024-01-03	AAPL	155.0	Tech
3	2024-01-04	AAPL	153.0	Tech
4	2024-01-05	AAPL	NaN	Tech
5	2024-01-06	AAPL	161.0	Tech
6	2024-01-01	MSFT	250.0	Tech
7	2024-01-02	MSFT	252.0	Tech
8	2024-01-03	MSFT	NaN	Tech
9	2024-01-04	MSFT	258.0	Tech
10	2024-01-05	MSFT	260.0	Tech
11	2024-01-06	MSFT	NaN	Tech

Forward fill within each ticker group:

df_grouped_ffill = df_grouped.copy()
df_grouped_ffill['price'] = df_grouped_ffill.groupby('ticker')['price'].ffill()
df_grouped_ffill

	date	ticker	price	sector
0	2024-01-01	AAPL	150.0	Tech
1	2024-01-02	AAPL	150.0	Tech
2	2024-01-03	AAPL	155.0	Tech
3	2024-01-04	AAPL	153.0	Tech
4	2024-01-05	AAPL	153.0	Tech
5	2024-01-06	AAPL	161.0	Tech
6	2024-01-01	MSFT	250.0	Tech
7	2024-01-02	MSFT	252.0	Tech
8	2024-01-03	MSFT	252.0	Tech
9	2024-01-04	MSFT	258.0	Tech
10	2024-01-05	MSFT	260.0	Tech
11	2024-01-06	MSFT	260.0	Tech

Mean imputation within each sector:

df_grouped_mean = df_grouped.copy()
df_grouped_mean['price'] = df_grouped_mean.groupby('sector')['price'].transform(
    lambda x: x.fillna(x.mean())
)
df_grouped_mean

	date	ticker	price	sector
0	2024-01-01	AAPL	150.000	Tech
1	2024-01-02	AAPL	204.875	Tech
2	2024-01-03	AAPL	155.000	Tech
3	2024-01-04	AAPL	153.000	Tech
4	2024-01-05	AAPL	204.875	Tech
5	2024-01-06	AAPL	161.000	Tech
6	2024-01-01	MSFT	250.000	Tech
7	2024-01-02	MSFT	252.000	Tech
8	2024-01-03	MSFT	204.875	Tech
9	2024-01-04	MSFT	258.000	Tech
10	2024-01-05	MSFT	260.000	Tech
11	2024-01-06	MSFT	204.875	Tech

In Polars, we use the over() method for group-based operations:

df_grouped_pl = pl.from_pandas(df_grouped)

# Forward fill within groups
df_grouped_ffill_pl = df_grouped_pl.with_columns([
    pl.col('price').forward_fill().over('ticker')
])
df_grouped_ffill_pl

shape: (12, 4)

date	ticker	price	sector
datetime[ns]	str	f64	str
2024-01-01 00:00:00	"AAPL"	150.0	"Tech"
2024-01-02 00:00:00	"AAPL"	150.0	"Tech"
2024-01-03 00:00:00	"AAPL"	155.0	"Tech"
2024-01-04 00:00:00	"AAPL"	153.0	"Tech"
2024-01-05 00:00:00	"AAPL"	153.0	"Tech"
…	…	…	…
2024-01-02 00:00:00	"MSFT"	252.0	"Tech"
2024-01-03 00:00:00	"MSFT"	252.0	"Tech"
2024-01-04 00:00:00	"MSFT"	258.0	"Tech"
2024-01-05 00:00:00	"MSFT"	260.0	"Tech"
2024-01-06 00:00:00	"MSFT"	260.0	"Tech"

# Mean imputation within groups
df_grouped_mean_pl = df_grouped_pl.with_columns([
    pl.col('price').fill_null(pl.col('price').mean()).over('sector')
])
df_grouped_mean_pl

shape: (12, 4)

date	ticker	price	sector
datetime[ns]	str	f64	str
2024-01-01 00:00:00	"AAPL"	150.0	"Tech"
2024-01-02 00:00:00	"AAPL"	204.875	"Tech"
2024-01-03 00:00:00	"AAPL"	155.0	"Tech"
2024-01-04 00:00:00	"AAPL"	153.0	"Tech"
2024-01-05 00:00:00	"AAPL"	204.875	"Tech"
…	…	…	…
2024-01-02 00:00:00	"MSFT"	252.0	"Tech"
2024-01-03 00:00:00	"MSFT"	204.875	"Tech"
2024-01-04 00:00:00	"MSFT"	258.0	"Tech"
2024-01-05 00:00:00	"MSFT"	260.0	"Tech"
2024-01-06 00:00:00	"MSFT"	204.875	"Tech"

13.3 Data Validation

Data validation is the process of checking whether your data meets quality standards and assumptions before analysis. In finance, invalid data can lead to catastrophic errors—imagine executing a trading strategy based on incorrect prices or publishing research with flawed data.

Validation encompasses several activities:

Checking data types and formats
Verifying data ranges and constraints
Identifying outliers and anomalies
Ensuring logical consistency
Cross-checking against external sources

Automated validation with pandera

For systematic data validation, consider using pandera, a Python framework for automatic validation of DataFrames. Pandera lets you define schemas that specify expected data types, value ranges, and other constraints, then automatically validates your data against these schemas. It supports both pandas and Polars.

13.3.1 Validating data types and formats

First, ensure that each column has the correct data type:

# Create sample data with type issues
data_types = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', 'invalid'],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'price': ['150.0', '152.5', '155.0', '153.0'],  # Stored as strings
    'volume': [1000000, 1100000, 1200000, -950000]  # Negative volume
}

df_types = pd.DataFrame(data_types)
df_types

	date	ticker	price	volume
0	2024-01-01	AAPL	150.0	1000000
1	2024-01-02	AAPL	152.5	1100000
2	2024-01-03	AAPL	155.0	1200000
3	invalid	AAPL	153.0	-950000

df_types.dtypes

date      object
ticker    object
price     object
volume     int64
dtype: object

Now let’s validate and fix the types. We use errors='coerce' to convert invalid values to NaN instead of raising an error:

# Convert price to numeric, coercing errors to NaN
df_types['price'] = pd.to_numeric(df_types['price'], errors='coerce')

# Convert date to datetime
df_types['date'] = pd.to_datetime(df_types['date'], errors='coerce')

df_types

	date	ticker	price	volume
0	2024-01-01	AAPL	150.0	1000000
1	2024-01-02	AAPL	152.5	1100000
2	2024-01-03	AAPL	155.0	1200000
3	NaT	AAPL	153.0	-950000

df_types.dtypes

date      datetime64[ns]
ticker            object
price            float64
volume             int64
dtype: object

We can identify rows with conversion errors by checking for NaN values:

invalid_dates = df_types['date'].isna()
print(f"Rows with invalid dates: {invalid_dates.sum()}")

Rows with invalid dates: 1

df_types[invalid_dates]

	date	ticker	price	volume
3	NaT	AAPL	153.0	-950000

Common type issues in financial data

Dates: Mixed formats (MM/DD/YYYY vs DD/MM/YYYY), Excel date serials, timezone issues
Numbers: Thousands separators, currency symbols, percentage signs stored as text
Missing values: Coded as -999, ‘NA’, empty strings, or other sentinel values
Categorical variables: Inconsistent capitalization or spelling

Always check types early in your pipeline and convert them explicitly.

13.3.2 Range and constraint validation

Financial data often has natural constraints. Prices must be positive, probabilities must be between 0 and 1, and returns are typically bounded (though extreme events can violate typical ranges).

# Create data with constraint violations
data_constraints = {
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'ticker': ['AAPL'] * 5,
    'price': [150.0, -152.5, 155.0, 1000000.0, 157.0],  # Negative and extreme prices
    'volume': [1000000, 1100000, 1200000, -950000, 1300000],  # Negative volume
    'return': [0.01, 0.02, -0.01, 5.0, -0.03]  # Extreme return
}

df_constraints = pd.DataFrame(data_constraints)

# Define validation rules
def validate_price(price):
    """Check if price is valid."""
    return (price > 0) & (price < 100000)

def validate_volume(volume):
    """Check if volume is valid."""
    return volume >= 0

def validate_return(ret):
    """Check if return is reasonable."""
    return (ret > -1) & (ret < 1)

# Apply validation
df_constraints['price_valid'] = validate_price(df_constraints['price'])
df_constraints['volume_valid'] = validate_volume(df_constraints['volume'])
df_constraints['return_valid'] = validate_return(df_constraints['return'])

df_constraints

	date	ticker	price	volume	return	price_valid	volume_valid	return_valid
0	2024-01-01	AAPL	150.0	1000000	0.01	True	True	True
1	2024-01-02	AAPL	-152.5	1100000	0.02	False	True	True
2	2024-01-03	AAPL	155.0	1200000	-0.01	True	True	True
3	2024-01-04	AAPL	1000000.0	-950000	5.00	False	False	False
4	2024-01-05	AAPL	157.0	1300000	-0.03	True	True	True

We can identify invalid rows by combining the validation columns:

invalid_mask = ~(df_constraints['price_valid'] &
                 df_constraints['volume_valid'] &
                 df_constraints['return_valid'])
print(f"Invalid rows: {invalid_mask.sum()}")

Invalid rows: 2

df_constraints[invalid_mask]

	date	ticker	price	volume	return	price_valid	volume_valid	return_valid
1	2024-01-02	AAPL	-152.5	1100000	0.02	False	True	True
3	2024-01-04	AAPL	1000000.0	-950000	5.00	False	False	False

In Polars, we can use expressions for validation:

df_constraints_pl = pl.DataFrame(data_constraints)

# Add validation columns
df_validated_pl = df_constraints_pl.with_columns([
    ((pl.col('price') > 0) & (pl.col('price') < 100000)).alias('price_valid'),
    (pl.col('volume') >= 0).alias('volume_valid'),
    ((pl.col('return') > -1) & (pl.col('return') < 1)).alias('return_valid')
])

df_validated_pl

shape: (5, 8)

date	ticker	price	volume	return	price_valid	volume_valid	return_valid
datetime[ns]	str	f64	i64	f64	bool	bool	bool
2024-01-01 00:00:00	"AAPL"	150.0	1000000	0.01	true	true	true
2024-01-02 00:00:00	"AAPL"	-152.5	1100000	0.02	false	true	true
2024-01-03 00:00:00	"AAPL"	155.0	1200000	-0.01	true	true	true
2024-01-04 00:00:00	"AAPL"	1e6	-950000	5.0	false	false	false
2024-01-05 00:00:00	"AAPL"	157.0	1300000	-0.03	true	true	true

# Filter to invalid rows
invalid_pl = df_validated_pl.filter(
    ~pl.col('price_valid') | ~pl.col('volume_valid') | ~pl.col('return_valid')
)
invalid_pl

shape: (2, 8)

date	ticker	price	volume	return	price_valid	volume_valid	return_valid
datetime[ns]	str	f64	i64	f64	bool	bool	bool
2024-01-02 00:00:00	"AAPL"	-152.5	1100000	0.02	false	true	true
2024-01-04 00:00:00	"AAPL"	1e6	-950000	5.0	false	false	false

13.3.3 Detecting outliers

Outliers are observations that deviate significantly from other observations. In finance, outliers could indicate:

Genuine extreme events (market crashes, flash crashes)
Data errors (wrong decimal place, data feed glitches)
Special situations (stock splits, spinoffs)

Outlier detection requires domain knowledge. A 50% daily return is an error for a typical stock but normal for a penny stock or during a takeover announcement.

13.3.3.1 Statistical methods for outlier detection

Common approaches include:

Z-score method: Flag observations more than N standard deviations from the mean
IQR method: Flag observations outside 1.5 times the interquartile range
Modified Z-score: Use median absolute deviation instead of standard deviation (more robust)
Domain-specific rules: Use knowledge about reasonable ranges

Choosing the right method for a particular application is beyond the scope of this book and requires careful consideration of the data characteristics and analysis goals.

13.3.3.2 Handling outliers

Once you’ve identified outliers, you have several options:

Keep them: If they’re genuine extreme events
Remove them: If they’re clearly errors
Cap/Winsorize: Replace extreme values with a threshold
Transform the data: Use log transformation or other methods to reduce the impact
Investigate: Manually check each outlier

Be cautious with outlier removal

In finance, extreme events are often the most important observations. The 2008 financial crisis, COVID-19 market crash, and other tail events are “outliers” but carry critical information. Before removing outliers:

Investigate each one individually if possible
Check if they coincide with known events (earnings announcements, news, market crises)
Consider the impact on your conclusions if you keep vs. remove them
Document your decision and reasoning

When in doubt, perform your analysis both with and without outliers to assess sensitivity.