Data cleaning is one of the most critical and time-consuming steps in any empirical analysis. In finance, working with real-world data means confronting issues like duplicate records, missing observations, data entry errors, and outliers. These problems are not merely technical annoyances—they can lead to incorrect conclusions, flawed trading strategies, or misleading research findings if not handled properly.
The goal of data cleaning is not to make data “perfect” but to make it suitable for analysis. This means understanding the nature of data quality issues, their potential causes, and the trade-offs involved in different cleaning approaches. In this chapter, we will cover the essential techniques for detecting and handling common data quality problems using Python’s main data manipulation libraries: pandas and Polars.
Before we begin, let’s understand an important principle: data cleaning decisions should be documented and reproducible. Every choice you make—whether to drop duplicates, impute missing values, or remove outliers—affects your results. Your code should clearly show what cleaning steps were taken and why, so that others (including your future self) can understand and validate your approach.
TipData Wrangler in VS Code
Microsoft provides a VS Code extension called “Data Wrangler” that provides a nice UI for filtering, cleaning, and transforming data in pandas DataFrames. When done, you can export the Python code to replicate your steps.
13.1 Detecting and Handling Duplicates
Duplicate records are a common problem in financial datasets. They can arise from multiple sources: data feed errors that transmit the same trade twice, mistakes in data aggregation where the same record appears in multiple files, or simple human error in manual data entry. Duplicates are particularly problematic because they can distort statistical analyses, inflate trading volumes, and create artificial patterns in the data.
13.1.1 Identifying duplicates
The first step in handling duplicates is identifying them. A duplicate can be defined in different ways depending on your data structure and business context.
NoteTypes of duplicates
Complete duplicates: All columns have identical values
Partial duplicates: Only certain key columns are identical (e.g., same date and ticker, but different price)
Near duplicates: Values are very similar but not exactly identical (often due to floating-point precision)
The appropriate definition depends on your data and use case.
Let’s start with a simple example using stock price data:
Be careful about duplicates that arise from data structure decisions. For example, if you merge two datasets and accidentally create a many-to-many join, you may generate duplicates without realizing it. Always check the row count before and after joins to ensure you haven’t inadvertently created duplicates. We discuss joins in detail in Chapter 16.
13.1.3 Best practices for duplicate handling
Always investigate before removing: Look at a sample of duplicates to understand why they exist
Document your decisions: Add comments explaining which columns define a duplicate and why
Keep track of removals: Log how many duplicates you removed and their characteristics
Consider the source: Different data sources may have different duplicate patterns
# Example: Comprehensive duplicate handling with loggingdef handle_duplicates(df, subset_cols, keep='first', verbose=True):""" Remove duplicates with logging. Parameters ---------- df : pd.DataFrame Input dataframe subset_cols : list Columns that define a duplicate keep : str Which duplicate to keep ('first', 'last', or False) verbose : bool Whether to print diagnostic information Returns ------- pd.DataFrame Cleaned dataframe """ initial_rows =len(df) duplicated_mask = df.duplicated(subset=subset_cols, keep=keep) n_duplicates = duplicated_mask.sum()if verbose:print(f"Initial rows: {initial_rows}")print(f"Duplicates found: {n_duplicates}")if n_duplicates >0:print(f"Duplicate rows (sample):")print(df[duplicated_mask].head()) df_clean = df[~duplicated_mask].copy()if verbose:print(f"Final rows: {len(df_clean)}")print(f"Rows removed: {initial_rows -len(df_clean)}")return df_clean# Use the functiondf_cleaned = handle_duplicates(df_pd, subset_cols=['date', 'ticker'])
Missing data is ubiquitous in financial datasets. Stock markets close on holidays, companies report earnings quarterly but not daily, economic indicators are released with delays, and data collection systems occasionally fail. How you handle missing data can significantly impact your analysis, and there is rarely a single “correct” approach.
13.2.1 Understanding missing data mechanisms
Before deciding how to handle missing data, it’s important to understand why data might be missing. Statisticians classify missing data into three categories:
Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any observed or unobserved data. For example, if a data feed randomly drops 1% of all observations due to network issues, the missing data is MCAR.
Missing at Random (MAR): The probability that a value is missing depends on observed data but not on the missing value itself. For example, small-cap stocks might have more missing volume data than large-cap stocks, but conditional on market cap, the missingness is random.
Missing Not at Random (MNAR): The probability that a value is missing depends on the missing value itself. For example, companies might be less likely to report earnings when performance is poor, or traders might fail to report large losses.
ImportantWhy this matters
The type of missingness affects which imputation methods are valid. MCAR is the easiest to handle (you can often just drop missing observations), while MNAR is the most problematic because missing data itself carries information. Unfortunately, in real financial data, you often cannot definitively determine which mechanism is operating, so you need to think carefully about the context.
13.2.2 Detecting missing data
Python uses different representations for missing data depending on the data type:
NaN (Not a Number): Used for missing floating-point values
None: Python’s built-in null value
NaT (Not a Time): Used for missing datetime values
pd.NA: pandas’ new missing indicator for all dtypes
NoteHistorical context: missing values in pandas
Initially, pandas used NumPy as its backend and could not represent missing values explicitly. Instead, it used NaN for floats and None (a Python object) for strings. This approach had important limitations:
For integers, pandas would first convert all values in the column to float and use NaN to represent missing values. This meant you couldn’t have a truly integer column with missing values.
NaN has a specific meaning in floating-point arithmetic: it represents a value that is “not a number,” such as the result of an undefined operation (e.g., log(-1)). Reusing NaN to mean “missing” conflates two distinct concepts and complicates handling of true NaN results.
Polars and newer backends for pandas (such as PyArrow) explicitly support missing data indicators for all column types, avoiding these limitations.
In Polars, we use null_count() to count missing values:
# Polars version - convert from pandas DataFrame# (Polars uses None/null for missing values, not np.nan)df_missing_pl = pl.from_pandas(df_missing)df_missing_pl.null_count()
shape: (1, 5)
date
ticker
price
volume
dividend
u32
u32
u32
u32
u32
0
0
3
3
0
We can filter rows with missing values in a specific column:
df_missing_pl.filter(pl.col('price').is_null())
shape: (3, 5)
date
ticker
price
volume
dividend
datetime[ns]
str
f64
f64
f64
2024-01-03 00:00:00
"AAPL"
null
1.2e6
0.0
2024-01-06 00:00:00
"AAPL"
null
1.1e6
0.0
2024-01-09 00:00:00
"AAPL"
null
1.25e6
0.0
13.2.3 Handling missing data: deletion
The simplest approach to missing data is to delete it. There are two main strategies:
Listwise deletion (complete case analysis): Rows with any missing values are completely dropped from the sample, so they are not used in any calculation or analysis. This approach is simple and ensures consistency across analyses, but can significantly reduce sample size if missingness is widespread.
Pairwise deletion: For each calculation or analysis (e.g., for each regression), use all observations for which we have the required variables. This approach maximizes the use of available data but can lead to different sample sizes for different analyses, making comparisons difficult.
Listwise deletion removes any row with missing values:
# Drop rows with null in specific columndf_price_complete_pl = df_missing_pl.drop_nulls(subset=['price'])df_price_complete_pl
shape: (7, 5)
date
ticker
price
volume
dividend
datetime[ns]
str
f64
f64
f64
2024-01-01 00:00:00
"AAPL"
150.0
1e6
0.0
2024-01-02 00:00:00
"AAPL"
152.5
null
0.0
2024-01-04 00:00:00
"AAPL"
153.0
950000.0
0.25
2024-01-05 00:00:00
"AAPL"
157.0
null
0.0
2024-01-07 00:00:00
"AAPL"
159.0
1.3e6
0.0
2024-01-08 00:00:00
"AAPL"
158.5
null
0.0
2024-01-10 00:00:00
"AAPL"
161.0
1.4e6
0.0
WarningWhen deletion is problematic
Deletion can introduce bias if:
You lose a large percentage of your data
Missingness is not random (MAR or MNAR)
Missing data occurs in key variables
In time series analysis, deleting observations can break the temporal structure of your data. Use deletion cautiously and always check how much data you’re losing.
13.2.4 Handling missing data: imputation
Imputation means filling in missing values with plausible estimates. There are many imputation strategies, each with different assumptions and use cases.
13.2.4.1 Simple imputation methods
Forward fill: Use the last observed value
Backward fill: Use the next observed value
Mean/median imputation: Replace with the column mean or median
Zero/constant imputation: Replace with zero or another constant
In empirical finance, we usually want to avoid introducing look-ahead bias or imputing information that was not available at the time of observation. For this reason, forward fill and zero/constant imputation are the most commonly used imputation methods in financial research.
Often, you want to impute missing values differently for different groups. For example, you might want to impute missing returns for each industry separately:
In Polars, we use the over() method for group-based operations:
df_grouped_pl = pl.from_pandas(df_grouped)# Forward fill within groupsdf_grouped_ffill_pl = df_grouped_pl.with_columns([ pl.col('price').forward_fill().over('ticker')])df_grouped_ffill_pl
shape: (12, 4)
date
ticker
price
sector
datetime[ns]
str
f64
str
2024-01-01 00:00:00
"AAPL"
150.0
"Tech"
2024-01-02 00:00:00
"AAPL"
150.0
"Tech"
2024-01-03 00:00:00
"AAPL"
155.0
"Tech"
2024-01-04 00:00:00
"AAPL"
153.0
"Tech"
2024-01-05 00:00:00
"AAPL"
153.0
"Tech"
…
…
…
…
2024-01-02 00:00:00
"MSFT"
252.0
"Tech"
2024-01-03 00:00:00
"MSFT"
252.0
"Tech"
2024-01-04 00:00:00
"MSFT"
258.0
"Tech"
2024-01-05 00:00:00
"MSFT"
260.0
"Tech"
2024-01-06 00:00:00
"MSFT"
260.0
"Tech"
# Mean imputation within groupsdf_grouped_mean_pl = df_grouped_pl.with_columns([ pl.col('price').fill_null(pl.col('price').mean()).over('sector')])df_grouped_mean_pl
shape: (12, 4)
date
ticker
price
sector
datetime[ns]
str
f64
str
2024-01-01 00:00:00
"AAPL"
150.0
"Tech"
2024-01-02 00:00:00
"AAPL"
204.875
"Tech"
2024-01-03 00:00:00
"AAPL"
155.0
"Tech"
2024-01-04 00:00:00
"AAPL"
153.0
"Tech"
2024-01-05 00:00:00
"AAPL"
204.875
"Tech"
…
…
…
…
2024-01-02 00:00:00
"MSFT"
252.0
"Tech"
2024-01-03 00:00:00
"MSFT"
204.875
"Tech"
2024-01-04 00:00:00
"MSFT"
258.0
"Tech"
2024-01-05 00:00:00
"MSFT"
260.0
"Tech"
2024-01-06 00:00:00
"MSFT"
204.875
"Tech"
13.3 Data Validation
Data validation is the process of checking whether your data meets quality standards and assumptions before analysis. In finance, invalid data can lead to catastrophic errors—imagine executing a trading strategy based on incorrect prices or publishing research with flawed data.
Validation encompasses several activities:
Checking data types and formats
Verifying data ranges and constraints
Identifying outliers and anomalies
Ensuring logical consistency
Cross-checking against external sources
TipAutomated validation with pandera
For systematic data validation, consider using pandera, a Python framework for automatic validation of DataFrames. Pandera lets you define schemas that specify expected data types, value ranges, and other constraints, then automatically validates your data against these schemas. It supports both pandas and Polars.
13.3.1 Validating data types and formats
First, ensure that each column has the correct data type:
# Create sample data with type issuesdata_types = {'date': ['2024-01-01', '2024-01-02', '2024-01-03', 'invalid'],'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],'price': ['150.0', '152.5', '155.0', '153.0'], # Stored as strings'volume': [1000000, 1100000, 1200000, -950000] # Negative volume}df_types = pd.DataFrame(data_types)df_types
date
ticker
price
volume
0
2024-01-01
AAPL
150.0
1000000
1
2024-01-02
AAPL
152.5
1100000
2
2024-01-03
AAPL
155.0
1200000
3
invalid
AAPL
153.0
-950000
df_types.dtypes
date object
ticker object
price object
volume int64
dtype: object
Now let’s validate and fix the types. We use errors='coerce' to convert invalid values to NaN instead of raising an error:
# Convert price to numeric, coercing errors to NaNdf_types['price'] = pd.to_numeric(df_types['price'], errors='coerce')# Convert date to datetimedf_types['date'] = pd.to_datetime(df_types['date'], errors='coerce')df_types
date
ticker
price
volume
0
2024-01-01
AAPL
150.0
1000000
1
2024-01-02
AAPL
152.5
1100000
2
2024-01-03
AAPL
155.0
1200000
3
NaT
AAPL
153.0
-950000
df_types.dtypes
date datetime64[ns]
ticker object
price float64
volume int64
dtype: object
We can identify rows with conversion errors by checking for NaN values:
invalid_dates = df_types['date'].isna()print(f"Rows with invalid dates: {invalid_dates.sum()}")
Rows with invalid dates: 1
df_types[invalid_dates]
date
ticker
price
volume
3
NaT
AAPL
153.0
-950000
WarningCommon type issues in financial data
Dates: Mixed formats (MM/DD/YYYY vs DD/MM/YYYY), Excel date serials, timezone issues
Numbers: Thousands separators, currency symbols, percentage signs stored as text
Missing values: Coded as -999, ‘NA’, empty strings, or other sentinel values
Categorical variables: Inconsistent capitalization or spelling
Always check types early in your pipeline and convert them explicitly.
13.3.2 Range and constraint validation
Financial data often has natural constraints. Prices must be positive, probabilities must be between 0 and 1, and returns are typically bounded (though extreme events can violate typical ranges).
# Create data with constraint violationsdata_constraints = {'date': pd.date_range('2024-01-01', periods=5, freq='D'),'ticker': ['AAPL'] *5,'price': [150.0, -152.5, 155.0, 1000000.0, 157.0], # Negative and extreme prices'volume': [1000000, 1100000, 1200000, -950000, 1300000], # Negative volume'return': [0.01, 0.02, -0.01, 5.0, -0.03] # Extreme return}df_constraints = pd.DataFrame(data_constraints)# Define validation rulesdef validate_price(price):"""Check if price is valid."""return (price >0) & (price <100000)def validate_volume(volume):"""Check if volume is valid."""return volume >=0def validate_return(ret):"""Check if return is reasonable."""return (ret >-1) & (ret <1)# Apply validationdf_constraints['price_valid'] = validate_price(df_constraints['price'])df_constraints['volume_valid'] = validate_volume(df_constraints['volume'])df_constraints['return_valid'] = validate_return(df_constraints['return'])df_constraints
date
ticker
price
volume
return
price_valid
volume_valid
return_valid
0
2024-01-01
AAPL
150.0
1000000
0.01
True
True
True
1
2024-01-02
AAPL
-152.5
1100000
0.02
False
True
True
2
2024-01-03
AAPL
155.0
1200000
-0.01
True
True
True
3
2024-01-04
AAPL
1000000.0
-950000
5.00
False
False
False
4
2024-01-05
AAPL
157.0
1300000
-0.03
True
True
True
We can identify invalid rows by combining the validation columns:
Data errors (wrong decimal place, data feed glitches)
Special situations (stock splits, spinoffs)
Outlier detection requires domain knowledge. A 50% daily return is an error for a typical stock but normal for a penny stock or during a takeover announcement.
13.3.3.1 Statistical methods for outlier detection
Common approaches include:
Z-score method: Flag observations more than N standard deviations from the mean
IQR method: Flag observations outside 1.5 times the interquartile range
Modified Z-score: Use median absolute deviation instead of standard deviation (more robust)
Domain-specific rules: Use knowledge about reasonable ranges
Choosing the right method for a particular application is beyond the scope of this book and requires careful consideration of the data characteristics and analysis goals.
13.3.3.2 Handling outliers
Once you’ve identified outliers, you have several options:
Keep them: If they’re genuine extreme events
Remove them: If they’re clearly errors
Cap/Winsorize: Replace extreme values with a threshold
Transform the data: Use log transformation or other methods to reduce the impact
Investigate: Manually check each outlier
WarningBe cautious with outlier removal
In finance, extreme events are often the most important observations. The 2008 financial crisis, COVID-19 market crash, and other tail events are “outliers” but carry critical information. Before removing outliers:
Investigate each one individually if possible
Check if they coincide with known events (earnings announcements, news, market crises)
Consider the impact on your conclusions if you keep vs. remove them
Document your decision and reasoning
When in doubt, perform your analysis both with and without outliers to assess sensitivity.