In empirical finance, we work mostly with structured data: stock prices organized by date and ticker, financial statements arranged by company and quarter, portfolio returns indexed by time and strategy. The DataFrame abstraction is the fundamental tool for handling this kind of tabular data in Python.
This chapter introduces three major DataFrame libraries in the Python ecosystem: pandas (the established standard), Polars (a modern, high-performance alternative), and DuckDB (which brings SQL to DataFrames). Understanding when and how to use each tool will significantly impact your productivity and the performance of your empirical work.
11.2 The DataFrame Abstraction
11.2.1 What is a DataFrame?
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet in code, a database table with an API, or a generalization of a matrix that allows mixed types.
But a DataFrame is more than just a container. It’s an abstraction that encapsulates:
Data storage: Efficient memory layout for columnar data
Indexing: Labels for rows and columns enabling semantic access
Operations: A rich API for transformations, aggregations, and joins
Type system: Handling heterogeneous data types within a single structure
11.2.2 Why DataFrames Matter in Finance
In empirical finance research, DataFrames solve several critical problems:
Data Alignment: When you merge stock returns with company characteristics, the DataFrame automatically handles date and identifier alignment. No manual loops checking if dates match.
Missing Data: Financial datasets are riddled with missing values. DataFrames provide principled methods for handling missing values that respect the underlying data structure.
Performance: Modern DataFrame libraries use columnar storage and vectorized operations, making them orders of magnitude faster than row-by-row processing.
Expressiveness: Complex operations like “calculate rolling volatility by stock, then merge with quarterly earnings” can be expressed in a few lines of readable code.
NoteThe Columnar Advantage
DataFrames store data column-by-column rather than row-by-row. This means:
Operations on a single column (like calculating returns) are blazingly fast
Memory access patterns are optimized for modern CPUs
Compression works better when similar data is stored together
Aggregations can skip irrelevant columns entirely
This is why DataFrames can process millions of rows efficiently while naive Python loops struggle.
11.2.3 Core Concepts
All DataFrame libraries share several fundamental concepts:
Columns and Rows: The basic 2D structure. Columns typically represent variables (price, volume, return), while rows represent observations (a specific date, company, or transaction).
Index: A special column (or set of columns) that labels rows. In finance, this is often a date or a combination of date and identifier.
Schema: The data type of each column. Strongly-typed schemas enable optimization and catch errors early.
Operations: Transformations that produce new DataFrames. These include filtering, selecting, aggregating, and joining.
NoteBeyond two dimensions
If you need more than two dimensions for your data, the xarray library in Python can be helpful. It provides labeled, multi-dimensional arrays that extend the DataFrame concept to higher dimensions. While less widespread in finance than pandas or Polars, xarray is useful for working with panel data or other multi-dimensional structures.
11.3 pandas Overview
pandas is the original DataFrame library for Python and remains the most widely used. Created by Wes McKinney in 2008 for financial data analysis, it has become the de facto standard for data manipulation in Python. While pandas is no longer my go-to library for data processing (I now mostly use Polars for that), pandas remains the standard and is unavoidable because many additional libraries for plotting and statistics rely on pandas DataFrames. If you only learn one DataFrame library, it should be this one.
11.3.1 Basic Operations
Let’s start with creating and manipulating a simple DataFrame:
The standard convention is to import pandas as pd, allowing you to reference it with this short alias throughout your code.
2
Create a DataFrame from a dictionary where keys become column names.
3
pd.date_range() generates a sequence of dates with the specified frequency ('D' for daily).
4
In Jupyter notebooks and Quarto, simply placing a DataFrame at the end of a cell displays it with nice formatting.
ticker
date
price
volume
0
AAPL
2024-01-01
150.5
1000000
1
AAPL
2024-01-02
152.3
1050000
2
MSFT
2024-01-03
380.2
800000
3
MSFT
2024-01-04
378.9
820000
4
GOOG
2024-01-05
140.1
950000
5
GOOG
2024-01-06
142.5
980000
Key characteristics of this DataFrame:
Index: Automatically created as integers (0-5)
Columns: Four columns with mixed types (object, datetime64, float64, int64)
Shape: 6 rows × 4 columns
11.3.2 Indexing and Selection
pandas provides multiple ways to select data, which can be confusing at first but powerful once mastered. pandas is somewhat unique in this regard: Polars and DuckDB don’t rely on indexing the same way pandas does—they basically treat all columns the same. In pandas, the index column (or columns, if there’s a multi-index) has a special status that affects how the DataFrame behaves.
11.3.2.1 Column Selection
# Select a single column (returns a Series)prices = df['price']prices
# Select multiple columns (returns a DataFrame)subset = df[['ticker', 'price']]subset
ticker
price
0
AAPL
150.5
1
AAPL
152.3
2
MSFT
380.2
3
MSFT
378.9
4
GOOG
140.1
5
GOOG
142.5
Notice that the double brackets in the second example aren’t a different operator—we’re still using the same square bracket indexing, but instead of passing a single column name, we’re passing a list of column names. When you pass a list, pandas returns a DataFrame; when you pass a single column name, it returns a Series.
11.3.2.2 Row Selection with .loc and .iloc
pandas distinguishes between label-based indexing (.loc) and position-based indexing (.iloc):
# Select rows 0-2 by positiondf.iloc[0:3]
ticker
date
price
volume
0
AAPL
2024-01-01
150.5
1000000
1
AAPL
2024-01-02
152.3
1050000
2
MSFT
2024-01-03
380.2
800000
# If we set an index, we can use .loc with labelsdf_indexed = df.set_index('date')df_indexed.loc['2024-01-01':'2024-01-03']
ticker
price
volume
date
2024-01-01
AAPL
150.5
1000000
2024-01-02
AAPL
152.3
1050000
2024-01-03
MSFT
380.2
800000
Tip.loc vs .iloc vs []
Use .loc[row_labels, column_labels] for label-based indexing
Use .iloc[row_positions, column_positions] for integer position-based indexing
Use [] for simple column selection or boolean masking
The distinction prevents ambiguity: df[0] is an error (is it column 0 or row 0?), but df.iloc[0] and df['column_0'] are clear.
11.3.2.3 Boolean Indexing
Filtering data is a core operation in empirical finance:
# Select rows where price > 150high_price = df[df['price'] >150]high_price
ticker
date
price
volume
0
AAPL
2024-01-01
150.5
1000000
1
AAPL
2024-01-02
152.3
1050000
2
MSFT
2024-01-03
380.2
800000
3
MSFT
2024-01-04
378.9
820000
The expression inside the square brackets (df['price'] > 150) is itself a Series of Boolean values. Let’s look at what it produces:
# The condition produces a Boolean Seriesdf['price'] >150
# Using query method for more readable syntaxresult = df.query('ticker == "AAPL" and price > 150')result
ticker
date
price
volume
0
AAPL
2024-01-01
150.5
1000000
1
AAPL
2024-01-02
152.3
1050000
11.3.3 Creating New Columns
A common operation is creating new columns from existing ones. When you perform arithmetic operations between columns, pandas aligns them by index and applies the operation element by element:
# Create a new column by multiplying two existing columnsdf['dollar_volume'] = df['price'] * df['volume']df
1
Element-wise multiplication: each row’s price is multiplied by that row’s volume.
ticker
date
price
volume
dollar_volume
0
AAPL
2024-01-01
150.5
1000000
150500000.0
1
AAPL
2024-01-02
152.3
1050000
159915000.0
2
MSFT
2024-01-03
380.2
800000
304160000.0
3
MSFT
2024-01-04
378.9
820000
310698000.0
4
GOOG
2024-01-05
140.1
950000
133095000.0
5
GOOG
2024-01-06
142.5
980000
139650000.0
When you multiply or add a scalar, it applies to all elements in the column:
# Multiply a column by a scalardf['price_cents'] = df['price'] *100df
1
The scalar 100 is applied to every element in the price column.
ticker
date
price
volume
dollar_volume
price_cents
0
AAPL
2024-01-01
150.5
1000000
150500000.0
15050.0
1
AAPL
2024-01-02
152.3
1050000
159915000.0
15230.0
2
MSFT
2024-01-03
380.2
800000
304160000.0
38020.0
3
MSFT
2024-01-04
378.9
820000
310698000.0
37890.0
4
GOOG
2024-01-05
140.1
950000
133095000.0
14010.0
5
GOOG
2024-01-06
142.5
980000
139650000.0
14250.0
You can also combine columns with different operations:
When you perform operations between two columns (or two DataFrames), pandas automatically aligns them by their index. This is powerful but can also be a source of subtle bugs if your indices don’t match as expected. We’ll cover sorting, grouping, and aggregation operations in a later chapter.
11.3.4 Strengths and Limitations
Strengths:
Mature ecosystem with extensive documentation
Rich functionality covering nearly every data manipulation need
Excellent integration with other libraries (scikit-learn, statsmodels, matplotlib)
Flexible indexing system enabling time-series analysis
Wide adoption means abundant examples, Stack Overflow answers, and AI coding assistance
Limitations:
Performance can degrade with multi-gigabyte datasets, depending on available RAM
Memory usage is higher than necessary due to row-based implementation details
API inconsistencies accumulated over 15+ years of development
Single-threaded execution for most operations
String operations are particularly slow
11.4 Polars Overview
Polars is a modern DataFrame library written in Rust, designed from the ground up for performance. It addresses many of pandas’ limitations while maintaining a familiar API for common operations.
Polars differs from pandas in several important ways:
Lazy evaluation allows you to write your data processing code without immediately executing it. Polars builds a query plan and defers execution until you explicitly collect the results. This enables the query optimizer to analyze the entire pipeline and find efficiencies—such as eliminating unnecessary computations, reordering operations, or pushing filters earlier in the pipeline.
Parallel execution means Polars can automatically use multiple CPU cores for operations that support it. Unlike pandas, which is largely single-threaded, Polars can process different parts of your data concurrently, leading to significant speedups on modern multi-core machines.
Apache Arrow memory representation is an efficient columnar format that Polars uses internally. Arrow improves interoperability with other libraries that support the format (pandas includes Arrow as an optional backend), and generally makes operations more efficient. Importantly, Arrow explicitly supports missing values as a distinct type rather than relying on NaN, which is important for accurate calculations and proper handling of missing data. Arrow also enables efficient reading and writing to the optimized Parquet file format.
The expression API provides a composable syntax for data transformations that is not identical to pandas. Users coming from the R tidyverse may find it feels more familiar than pandas’ approach. Expressions can be combined and optimized together, leading to more efficient execution.
.lazy() converts an eager DataFrame to a lazy frame.
2
.collect() executes the optimized query plan and returns the result.
shape: (2, 3)
ticker
avg_price
total_volume
str
f64
i64
"AAPL"
151.4
2050000
"GOOG"
141.3
1930000
ImportantWhen to Use Lazy Evaluation
Lazy evaluation shines when:
You have a complex chain of operations
Your data is large enough that optimization matters
You’re reading from files (Polars can push down filters and projections)
For small datasets and simple operations, the overhead of optimization may not be worth it. Use eager mode (regular DataFrames) for interactive exploration.
11.4.4 Grouping and Aggregation
# Group by with multiple aggregationssummary_pl = df_pl.group_by('ticker').agg([ pl.col('price').mean().alias('avg_price'), pl.col('price').std().alias('std_price'), pl.col('volume').sum().alias('total_volume')])summary_pl
Index-less design may require adaptation for time-series workflows
11.5 DuckDB and SQL for DataFrames
DuckDB is an in-process SQL database optimized for analytical queries. While not strictly a DataFrame library, it provides a powerful alternative interface for DataFrame operations through SQL.
While this book focuses mostly on Python, SQL is also a very important language for financial data analytics because it is the language most databases use. Many third-party databases can be accessed through SQL from Python and return DataFrames, but DuckDB presents a nice alternative that lets you write SQL directly within your Python environment. DuckDB doesn’t need to run inside Python—it can also operate as a standalone tool—but using it within Python makes it easy to take partial results and transfer them to pandas or Polars for further processing. Like Polars, DuckDB uses Arrow for its memory representation, enabling efficient data exchange between tools.
11.5.1 Why SQL for DataFrames?
SQL has several advantages:
Declarative: Describe what you want, not how to get it
Optimized: Query optimizers can find efficient execution plans
Familiar: Many finance professionals already know SQL
Standard: SQL skills transfer across tools and platforms
11.5.2 Basic Usage
import duckdbresult = duckdb.sql(""" SELECT ticker, AVG(price) as avg_price, SUM(volume) as total_volume FROM df GROUP BY ticker ORDER BY avg_price DESC""").df()result
1
.df() converts the DuckDB result to a pandas DataFrame.
ticker
avg_price
total_volume
0
MSFT
379.55
1620000.0
1
AAPL
151.40
2050000.0
2
GOOG
141.30
1930000.0
Notice that we didn’t need to import the DataFrame into DuckDB explicitly. DuckDB automatically detects pandas DataFrames in the scope and makes them queryable.
11.5.3 Advanced SQL Operations
# Window functions for returnsreturns_sql = duckdb.sql(""" SELECT ticker, date, price, price / LAG(price) OVER (PARTITION BY ticker ORDER BY date) - 1 as return FROM df ORDER BY ticker, date""").df()returns_sql
ticker
date
price
return
0
AAPL
2024-01-01
150.5
NaN
1
AAPL
2024-01-02
152.3
0.011960
2
GOOG
2024-01-05
140.1
NaN
3
GOOG
2024-01-06
142.5
0.017131
4
MSFT
2024-01-03
380.2
NaN
5
MSFT
2024-01-04
378.9
-0.003419
# Complex filtering and aggregationanalysis = duckdb.sql(""" WITH high_volume AS ( SELECT * FROM df WHERE volume > 900000 ) SELECT ticker, COUNT(*) as n_high_volume_days, AVG(price) as avg_price_on_high_volume FROM high_volume GROUP BY ticker""").df()analysis
ticker
n_high_volume_days
avg_price_on_high_volume
0
GOOG
2
141.3
1
AAPL
2
151.4
11.5.4 Integration with Multiple Sources
DuckDB can query multiple DataFrame types simultaneously:
# Query both pandas and Polars DataFrames in one querycombined = duckdb.sql(""" SELECT p.ticker, p.avg_price as pandas_avg, pl.avg_price as polars_avg FROM (SELECT ticker, AVG(price) as avg_price FROM df GROUP BY ticker) p JOIN (SELECT ticker, AVG(price) as avg_price FROM df_pl GROUP BY ticker) pl ON p.ticker = pl.ticker""").df()combined
ticker
pandas_avg
polars_avg
0
GOOG
141.30
141.30
1
AAPL
151.40
151.40
2
MSFT
379.55
379.55
11.5.5 DuckDB Relations API
For a more programmatic interface, DuckDB provides a relational API:
rel = duckdb.sql("SELECT * FROM df")# Chain operationsresult = ( rel .filter("volume > 900000") .aggregate("ticker, AVG(price) as avg_price") .order("avg_price DESC"))result.df()
1
Create a relation from a DataFrame that can be queried programmatically.
ticker
avg_price
0
AAPL
151.4
1
GOOG
141.3
11.5.6 Strengths and Limitations
Strengths:
Extremely fast for analytical queries
Optimized query plans can outperform hand-written DataFrame code
SQL is a standard, transferable skill
Seamless integration with pandas and Polars
Can query files directly without loading into memory
Excellent for joins and complex aggregations
Limitations:
SQL syntax can be verbose for simple operations
Less suitable for row-wise operations or complex custom logic
Debugging SQL queries can be harder than Python code
Some operations are more natural in a DataFrame API
Embedded database means limited concurrency
11.6 Choosing Between pandas, Polars, and DuckDB
The right tool depends on your specific use case. Here’s a decision framework:
11.6.1 When to Use pandas
Choose pandas when:
You’re working with small to medium datasets (<1GB in memory)
You need integration with scikit-learn, statsmodels, or other pandas-centric libraries
You’re learning and want maximum community support
Your code needs to be maintained by others who know pandas
You need pandas’ advanced time-series functionality (business day calendars, time zone handling)
Example use case: Exploratory analysis of a few thousand stocks with daily data for factor model estimation.
11.6.2 When to Use Polars
Choose Polars when:
You’re working with large datasets (>1GB)
Performance is critical (e.g., production pipelines, real-time systems)
You want cleaner, more maintainable code through the expression API
You need parallel processing without explicit threading code
String manipulation performance matters
Example use case: Processing tick-by-tick trade data for thousands of stocks to compute intraday liquidity measures.
11.6.3 When to Use DuckDB
Choose DuckDB when:
You’re comfortable with SQL and prefer declarative queries
You need to join multiple large datasets
Your data lives in files (CSV, Parquet) and you want to query without loading
You need maximum analytical query performance
You’re aggregating or filtering large datasets
Example use case: Joining quarterly earnings data with daily stock returns across thousands of companies and years.
11.6.4 Hybrid Approaches
You don’t need to choose just one. Modern workflows often combine tools:
# Read large file with DuckDBduckdb_result = duckdb.sql(""" SELECT ticker, date, price, volume FROM 'large_dataset.parquet' WHERE date >= '2020-01-01' AND ticker IN ('AAPL', 'MSFT', 'GOOG')""")# Convert to Polars for further processingdf_polars = pl.from_arrow(duckdb_result.arrow())# Do complex feature engineering in Polarsfeatures = df_polars.with_columns([ pl.col('price').pct_change().over('ticker').alias('return'), pl.col('volume').rolling_mean(20).over('ticker').alias('avg_volume_20d')])# Convert to pandas for sklearndf_pandas = features.to_pandas()# Use in machine learning pipeline# from sklearn.ensemble import RandomForestRegressor# model = RandomForestRegressor()# ...
TipPractical Recommendation
For MSc-level empirical finance work, I recommend:
Learn pandas first: It’s the standard, and you’ll encounter it everywhere
Add Polars for performance: When pandas becomes slow, rewrite critical sections in Polars
Use DuckDB for data wrangling: ETL pipelines and complex joins are often cleaner in SQL
Profile your code: Don’t optimize prematurely—measure where time is actually spent
The best practitioners know all three and choose the right tool for each task.
11.7 Summary
DataFrames are the fundamental abstraction for tabular data in empirical finance. This chapter introduced three powerful tools:
pandas: The established standard with broad ecosystem support. Best for general-purpose data analysis, learning, and integration with statistical libraries.
Polars: A modern, high-performance alternative with lazy evaluation and automatic parallelization. Best for large datasets and performance-critical applications.
DuckDB: SQL interface for analytical queries on DataFrames and files. Best for complex joins, aggregations, and data pipeline work.
The key insight is that these tools are complementary. Understanding when to use each—and how to combine them—will make you significantly more productive in empirical finance research.
In the next chapters, we’ll dive deeper into specific operations: data cleaning, merging and joining, time-series operations, and efficient computation with large datasets.