11 Introduction to DataFrames

11.1 Overview

In empirical finance, we work mostly with structured data: stock prices organized by date and ticker, financial statements arranged by company and quarter, portfolio returns indexed by time and strategy. The DataFrame abstraction is the fundamental tool for handling this kind of tabular data in Python.

This chapter introduces three major DataFrame libraries in the Python ecosystem: pandas (the established standard), Polars (a modern, high-performance alternative), and DuckDB (which brings SQL to DataFrames). Understanding when and how to use each tool will significantly impact your productivity and the performance of your empirical work.

11.2 The DataFrame Abstraction

11.2.1 What is a DataFrame?

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet in code, a database table with an API, or a generalization of a matrix that allows mixed types.

But a DataFrame is more than just a container. It’s an abstraction that encapsulates:

Data storage: Efficient memory layout for columnar data
Indexing: Labels for rows and columns enabling semantic access
Operations: A rich API for transformations, aggregations, and joins
Type system: Handling heterogeneous data types within a single structure

11.2.2 Why DataFrames Matter in Finance

In empirical finance research, DataFrames solve several critical problems:

Data Alignment: When you merge stock returns with company characteristics, the DataFrame automatically handles date and identifier alignment. No manual loops checking if dates match.

Missing Data: Financial datasets are riddled with missing values. DataFrames provide principled methods for handling missing values that respect the underlying data structure.

Performance: Modern DataFrame libraries use columnar storage and vectorized operations, making them orders of magnitude faster than row-by-row processing.

Expressiveness: Complex operations like “calculate rolling volatility by stock, then merge with quarterly earnings” can be expressed in a few lines of readable code.

The Columnar Advantage

DataFrames store data column-by-column rather than row-by-row. This means:

Operations on a single column (like calculating returns) are blazingly fast
Memory access patterns are optimized for modern CPUs
Compression works better when similar data is stored together
Aggregations can skip irrelevant columns entirely

This is why DataFrames can process millions of rows efficiently while naive Python loops struggle.

11.2.3 Core Concepts

All DataFrame libraries share several fundamental concepts:

Columns and Rows: The basic 2D structure. Columns typically represent variables (price, volume, return), while rows represent observations (a specific date, company, or transaction).

Index: A special column (or set of columns) that labels rows. In finance, this is often a date or a combination of date and identifier.

Schema: The data type of each column. Strongly-typed schemas enable optimization and catch errors early.

Operations: Transformations that produce new DataFrames. These include filtering, selecting, aggregating, and joining.

Beyond two dimensions

If you need more than two dimensions for your data, the xarray library in Python can be helpful. It provides labeled, multi-dimensional arrays that extend the DataFrame concept to higher dimensions. While less widespread in finance than pandas or Polars, xarray is useful for working with panel data or other multi-dimensional structures.

11.3 pandas Overview

pandas is the original DataFrame library for Python and remains the most widely used. Created by Wes McKinney in 2008 for financial data analysis, it has become the de facto standard for data manipulation in Python. While pandas is no longer my go-to library for data processing (I now mostly use Polars for that), pandas remains the standard and is unavoidable because many additional libraries for plotting and statistics rely on pandas DataFrames. If you only learn one DataFrame library, it should be this one.

11.3.1 Basic Operations

Let’s start with creating and manipulating a simple DataFrame:

import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'GOOG', 'GOOG'],
    'date': pd.date_range('2024-01-01', periods=6, freq='D')[:6],
    'price': [150.5, 152.3, 380.2, 378.9, 140.1, 142.5],
    'volume': [1000000, 1050000, 800000, 820000, 950000, 980000]
})

df

1: The standard convention is to import pandas as pd, allowing you to reference it with this short alias throughout your code.
2: Create a DataFrame from a dictionary where keys become column names.
3: pd.date_range() generates a sequence of dates with the specified frequency ('D' for daily).
4: In Jupyter notebooks and Quarto, simply placing a DataFrame at the end of a cell displays it with nice formatting.

	ticker	date	price	volume
0	AAPL	2024-01-01	150.5	1000000
1	AAPL	2024-01-02	152.3	1050000
2	MSFT	2024-01-03	380.2	800000
3	MSFT	2024-01-04	378.9	820000
4	GOOG	2024-01-05	140.1	950000
5	GOOG	2024-01-06	142.5	980000

Key characteristics of this DataFrame:

Index: Automatically created as integers (0-5)
Columns: Four columns with mixed types (object, datetime64, float64, int64)
Shape: 6 rows × 4 columns

11.3.2 Indexing and Selection

pandas provides multiple ways to select data, which can be confusing at first but powerful once mastered. pandas is somewhat unique in this regard: Polars and DuckDB don’t rely on indexing the same way pandas does—they basically treat all columns the same. In pandas, the index column (or columns, if there’s a multi-index) has a special status that affects how the DataFrame behaves.

11.3.2.1 Column Selection

# Select a single column (returns a Series)
prices = df['price']
prices

0    150.5
1    152.3
2    380.2
3    378.9
4    140.1
5    142.5
Name: price, dtype: float64

# Select multiple columns (returns a DataFrame)
subset = df[['ticker', 'price']]
subset

	ticker	price
0	AAPL	150.5
1	AAPL	152.3
2	MSFT	380.2
3	MSFT	378.9
4	GOOG	140.1
5	GOOG	142.5

Notice that the double brackets in the second example aren’t a different operator—we’re still using the same square bracket indexing, but instead of passing a single column name, we’re passing a list of column names. When you pass a list, pandas returns a DataFrame; when you pass a single column name, it returns a Series.

11.3.2.2 Row Selection with .loc and .iloc

pandas distinguishes between label-based indexing (.loc) and position-based indexing (.iloc):

# Select rows 0-2 by position
df.iloc[0:3]

	ticker	date	price	volume
0	AAPL	2024-01-01	150.5	1000000
1	AAPL	2024-01-02	152.3	1050000
2	MSFT	2024-01-03	380.2	800000

# If we set an index, we can use .loc with labels
df_indexed = df.set_index('date')
df_indexed.loc['2024-01-01':'2024-01-03']

	ticker	price	volume
date
2024-01-01	AAPL	150.5	1000000
2024-01-02	AAPL	152.3	1050000
2024-01-03	MSFT	380.2	800000

.loc vs .iloc vs []

Use .loc[row_labels, column_labels] for label-based indexing
Use .iloc[row_positions, column_positions] for integer position-based indexing
Use [] for simple column selection or boolean masking

The distinction prevents ambiguity: df[0] is an error (is it column 0 or row 0?), but df.iloc[0] and df['column_0'] are clear.

11.3.2.3 Boolean Indexing

Filtering data is a core operation in empirical finance:

# Select rows where price > 150
high_price = df[df['price'] > 150]
high_price

	ticker	date	price	volume
0	AAPL	2024-01-01	150.5	1000000
1	AAPL	2024-01-02	152.3	1050000
2	MSFT	2024-01-03	380.2	800000
3	MSFT	2024-01-04	378.9	820000

The expression inside the square brackets (df['price'] > 150) is itself a Series of Boolean values. Let’s look at what it produces:

# The condition produces a Boolean Series
df['price'] > 150

0     True
1     True
2     True
3     True
4    False
5    False
Name: price, dtype: bool

The rows that get selected are those where the Boolean value is True.

# Multiple conditions: AAPL stocks with price > 150
aapl_high = df[(df['ticker'] == 'AAPL') & (df['price'] > 150)]
aapl_high

	ticker	date	price	volume
0	AAPL	2024-01-01	150.5	1000000
1	AAPL	2024-01-02	152.3	1050000

# Using query method for more readable syntax
result = df.query('ticker == "AAPL" and price > 150')
result

	ticker	date	price	volume
0	AAPL	2024-01-01	150.5	1000000
1	AAPL	2024-01-02	152.3	1050000

11.3.3 Creating New Columns

A common operation is creating new columns from existing ones. When you perform arithmetic operations between columns, pandas aligns them by index and applies the operation element by element:

# Create a new column by multiplying two existing columns
df['dollar_volume'] = df['price'] * df['volume']
df

1: Element-wise multiplication: each row’s price is multiplied by that row’s volume.

	ticker	date	price	volume	dollar_volume
0	AAPL	2024-01-01	150.5	1000000	150500000.0
1	AAPL	2024-01-02	152.3	1050000	159915000.0
2	MSFT	2024-01-03	380.2	800000	304160000.0
3	MSFT	2024-01-04	378.9	820000	310698000.0
4	GOOG	2024-01-05	140.1	950000	133095000.0
5	GOOG	2024-01-06	142.5	980000	139650000.0

When you multiply or add a scalar, it applies to all elements in the column:

# Multiply a column by a scalar
df['price_cents'] = df['price'] * 100
df

1: The scalar 100 is applied to every element in the price column.

	ticker	date	price	volume	dollar_volume	price_cents
0	AAPL	2024-01-01	150.5	1000000	150500000.0	15050.0
1	AAPL	2024-01-02	152.3	1050000	159915000.0	15230.0
2	MSFT	2024-01-03	380.2	800000	304160000.0	38020.0
3	MSFT	2024-01-04	378.9	820000	310698000.0	37890.0
4	GOOG	2024-01-05	140.1	950000	133095000.0	14010.0
5	GOOG	2024-01-06	142.5	980000	139650000.0	14250.0

You can also combine columns with different operations:

# Combining columns with arithmetic
df['avg_price_per_share'] = df['dollar_volume'] / df['volume']
df[['ticker', 'price', 'avg_price_per_share']]

	ticker	price	avg_price_per_share
0	AAPL	150.5	150.5
1	AAPL	152.3	152.3
2	MSFT	380.2	380.2
3	MSFT	378.9	378.9
4	GOOG	140.1	140.1
5	GOOG	142.5	142.5

Index alignment in operations

When you perform operations between two columns (or two DataFrames), pandas automatically aligns them by their index. This is powerful but can also be a source of subtle bugs if your indices don’t match as expected. We’ll cover sorting, grouping, and aggregation operations in a later chapter.

11.3.4 Strengths and Limitations

Strengths:

Mature ecosystem with extensive documentation
Rich functionality covering nearly every data manipulation need
Excellent integration with other libraries (scikit-learn, statsmodels, matplotlib)
Flexible indexing system enabling time-series analysis
Wide adoption means abundant examples, Stack Overflow answers, and AI coding assistance

Limitations:

Performance can degrade with multi-gigabyte datasets, depending on available RAM
Memory usage is higher than necessary due to row-based implementation details
API inconsistencies accumulated over 15+ years of development
Single-threaded execution for most operations
String operations are particularly slow

11.4 Polars Overview

Polars is a modern DataFrame library written in Rust, designed from the ground up for performance. It addresses many of pandas’ limitations while maintaining a familiar API for common operations.

Polars differs from pandas in several important ways:

Lazy evaluation allows you to write your data processing code without immediately executing it. Polars builds a query plan and defers execution until you explicitly collect the results. This enables the query optimizer to analyze the entire pipeline and find efficiencies—such as eliminating unnecessary computations, reordering operations, or pushing filters earlier in the pipeline.

Parallel execution means Polars can automatically use multiple CPU cores for operations that support it. Unlike pandas, which is largely single-threaded, Polars can process different parts of your data concurrently, leading to significant speedups on modern multi-core machines.

Apache Arrow memory representation is an efficient columnar format that Polars uses internally. Arrow improves interoperability with other libraries that support the format (pandas includes Arrow as an optional backend), and generally makes operations more efficient. Importantly, Arrow explicitly supports missing values as a distinct type rather than relying on NaN, which is important for accurate calculations and proper handling of missing data. Arrow also enables efficient reading and writing to the optimized Parquet file format.

The expression API provides a composable syntax for data transformations that is not identical to pandas. Users coming from the R tidyverse may find it feels more familiar than pandas’ approach. Expressions can be combined and optimized together, leading to more efficient execution.

11.4.1 Basic Operations

import polars as pl

df_pl = pl.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'GOOG', 'GOOG'],
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'price': [150.5, 152.3, 380.2, 378.9, 140.1, 142.5],
    'volume': [1000000, 1050000, 800000, 820000, 950000, 980000]
})

df_pl

1: The standard convention is to import Polars as pl.

shape: (6, 4)

ticker	date	price	volume
str	datetime[ns]	f64	i64
"AAPL"	2024-01-01 00:00:00	150.5	1000000
"AAPL"	2024-01-02 00:00:00	152.3	1050000
"MSFT"	2024-01-03 00:00:00	380.2	800000
"MSFT"	2024-01-04 00:00:00	378.9	820000
"GOOG"	2024-01-05 00:00:00	140.1	950000
"GOOG"	2024-01-06 00:00:00	142.5	980000

11.4.2 The Expression API

Polars introduces an expression-based API that is more composable than pandas:

# Select and transform columns
result = df_pl.select([
    pl.col('ticker'),
    pl.col('price'),
    (pl.col('price') * pl.col('volume')).alias('dollar_volume')
])
result

1: pl.col() references a column by name.
2: .alias() names the resulting column.

shape: (6, 3)

ticker	price	dollar_volume
str	f64	f64
"AAPL"	150.5	1.505e8
"AAPL"	152.3	1.59915e8
"MSFT"	380.2	3.0416e8
"MSFT"	378.9	3.10698e8
"GOOG"	140.1	1.33095e8
"GOOG"	142.5	1.3965e8

# Filtering with expressions
high_price_pl = df_pl.filter(pl.col('price') > 150)
high_price_pl

shape: (4, 4)

ticker	date	price	volume
str	datetime[ns]	f64	i64
"AAPL"	2024-01-01 00:00:00	150.5	1000000
"AAPL"	2024-01-02 00:00:00	152.3	1050000
"MSFT"	2024-01-03 00:00:00	380.2	800000
"MSFT"	2024-01-04 00:00:00	378.9	820000

# Chaining operations
result = (
    df_pl
    .filter(pl.col('ticker') == 'AAPL')
    .with_columns((pl.col('price') * 1.1).alias('price_plus_10pct'))
    .select(['ticker', 'date', 'price', 'price_plus_10pct'])
)
result

1: .with_columns() adds or replaces columns in the DataFrame.

shape: (2, 4)

ticker	date	price	price_plus_10pct
str	datetime[ns]	f64	f64
"AAPL"	2024-01-01 00:00:00	150.5	165.55
"AAPL"	2024-01-02 00:00:00	152.3	167.53

11.4.3 Lazy Evaluation

Polars’ lazy API builds a query plan and optimizes it before execution:

lazy_df = df_pl.lazy()

# Build a query (nothing executes yet)
lazy_query = (
    lazy_df
    .filter(pl.col('volume') > 850000)
    .group_by('ticker')
    .agg([
        pl.col('price').mean().alias('avg_price'),
        pl.col('volume').sum().alias('total_volume')
    ])
    .sort('avg_price', descending=True)
)

result = lazy_query.collect()
result

1: .lazy() converts an eager DataFrame to a lazy frame.
2: .collect() executes the optimized query plan and returns the result.

shape: (2, 3)

ticker	avg_price	total_volume
str	f64	i64
"AAPL"	151.4	2050000
"GOOG"	141.3	1930000

When to Use Lazy Evaluation

Lazy evaluation shines when:

You have a complex chain of operations
Your data is large enough that optimization matters
You’re reading from files (Polars can push down filters and projections)

For small datasets and simple operations, the overhead of optimization may not be worth it. Use eager mode (regular DataFrames) for interactive exploration.

11.4.4 Grouping and Aggregation

# Group by with multiple aggregations
summary_pl = df_pl.group_by('ticker').agg([
    pl.col('price').mean().alias('avg_price'),
    pl.col('price').std().alias('std_price'),
    pl.col('volume').sum().alias('total_volume')
])
summary_pl

shape: (3, 4)

ticker	avg_price	std_price	total_volume
str	f64	f64	i64
"MSFT"	379.55	0.919239	1620000
"GOOG"	141.3	1.697056	1930000
"AAPL"	151.4	1.272792	2050000

# Window functions for returns calculation
df_pl_sorted = df_pl.sort(['ticker', 'date'])
df_with_returns = df_pl_sorted.with_columns(
    pl.col('price').pct_change().over('ticker').alias('return')
)
df_with_returns

shape: (6, 5)

ticker	date	price	volume	return
str	datetime[ns]	f64	i64	f64
"AAPL"	2024-01-01 00:00:00	150.5	1000000	null
"AAPL"	2024-01-02 00:00:00	152.3	1050000	0.01196
"GOOG"	2024-01-05 00:00:00	140.1	950000	null
"GOOG"	2024-01-06 00:00:00	142.5	980000	0.017131
"MSFT"	2024-01-03 00:00:00	380.2	800000	null
"MSFT"	2024-01-04 00:00:00	378.9	820000	-0.003419

11.4.5 Strengths and Limitations

Strengths:

Significantly faster than pandas, especially for large datasets
Lower memory usage thanks to Arrow format
Automatic parallelization
Lazy evaluation enables query optimization
More consistent API design
Excellent handling of string and categorical data

Limitations:

Smaller ecosystem (fewer third-party integrations)
Less mature documentation and community resources
Some advanced pandas features not yet implemented
Different mental model requires learning curve
Index-less design may require adaptation for time-series workflows

11.5 DuckDB and SQL for DataFrames

DuckDB is an in-process SQL database optimized for analytical queries. While not strictly a DataFrame library, it provides a powerful alternative interface for DataFrame operations through SQL.

While this book focuses mostly on Python, SQL is also a very important language for financial data analytics because it is the language most databases use. Many third-party databases can be accessed through SQL from Python and return DataFrames, but DuckDB presents a nice alternative that lets you write SQL directly within your Python environment. DuckDB doesn’t need to run inside Python—it can also operate as a standalone tool—but using it within Python makes it easy to take partial results and transfer them to pandas or Polars for further processing. Like Polars, DuckDB uses Arrow for its memory representation, enabling efficient data exchange between tools.

11.5.1 Why SQL for DataFrames?

SQL has several advantages:

Declarative: Describe what you want, not how to get it
Optimized: Query optimizers can find efficient execution plans
Familiar: Many finance professionals already know SQL
Standard: SQL skills transfer across tools and platforms

11.5.2 Basic Usage

import duckdb

result = duckdb.sql("""
    SELECT ticker,
           AVG(price) as avg_price,
           SUM(volume) as total_volume
    FROM df
    GROUP BY ticker
    ORDER BY avg_price DESC
""").df()

result

1: .df() converts the DuckDB result to a pandas DataFrame.

	ticker	avg_price	total_volume
0	MSFT	379.55	1620000.0
1	AAPL	151.40	2050000.0
2	GOOG	141.30	1930000.0

Notice that we didn’t need to import the DataFrame into DuckDB explicitly. DuckDB automatically detects pandas DataFrames in the scope and makes them queryable.

11.5.3 Advanced SQL Operations

# Window functions for returns
returns_sql = duckdb.sql("""
    SELECT ticker,
           date,
           price,
           price / LAG(price) OVER (PARTITION BY ticker ORDER BY date) - 1 as return
    FROM df
    ORDER BY ticker, date
""").df()

returns_sql

	ticker	date	price	return
0	AAPL	2024-01-01	150.5	NaN
1	AAPL	2024-01-02	152.3	0.011960
2	GOOG	2024-01-05	140.1	NaN
3	GOOG	2024-01-06	142.5	0.017131
4	MSFT	2024-01-03	380.2	NaN
5	MSFT	2024-01-04	378.9	-0.003419

# Complex filtering and aggregation
analysis = duckdb.sql("""
    WITH high_volume AS (
        SELECT *
        FROM df
        WHERE volume > 900000
    )
    SELECT
        ticker,
        COUNT(*) as n_high_volume_days,
        AVG(price) as avg_price_on_high_volume
    FROM high_volume
    GROUP BY ticker
""").df()

analysis

	ticker	n_high_volume_days	avg_price_on_high_volume
0	GOOG	2	141.3
1	AAPL	2	151.4

11.5.4 Integration with Multiple Sources

DuckDB can query multiple DataFrame types simultaneously:

# Query both pandas and Polars DataFrames in one query
combined = duckdb.sql("""
    SELECT
        p.ticker,
        p.avg_price as pandas_avg,
        pl.avg_price as polars_avg
    FROM (SELECT ticker, AVG(price) as avg_price FROM df GROUP BY ticker) p
    JOIN (SELECT ticker, AVG(price) as avg_price FROM df_pl GROUP BY ticker) pl
        ON p.ticker = pl.ticker
""").df()

combined

	ticker	pandas_avg	polars_avg
0	GOOG	141.30	141.30
1	AAPL	151.40	151.40
2	MSFT	379.55	379.55

11.5.5 DuckDB Relations API

For a more programmatic interface, DuckDB provides a relational API:

rel = duckdb.sql("SELECT * FROM df")

# Chain operations
result = (
    rel
    .filter("volume > 900000")
    .aggregate("ticker, AVG(price) as avg_price")
    .order("avg_price DESC")
)

result.df()

1: Create a relation from a DataFrame that can be queried programmatically.

	ticker	avg_price
0	AAPL	151.4
1	GOOG	141.3

11.5.6 Strengths and Limitations

Strengths:

Extremely fast for analytical queries
Optimized query plans can outperform hand-written DataFrame code
SQL is a standard, transferable skill
Seamless integration with pandas and Polars
Can query files directly without loading into memory
Excellent for joins and complex aggregations

Limitations:

SQL syntax can be verbose for simple operations
Less suitable for row-wise operations or complex custom logic
Debugging SQL queries can be harder than Python code
Some operations are more natural in a DataFrame API
Embedded database means limited concurrency

11.6 Choosing Between pandas, Polars, and DuckDB

The right tool depends on your specific use case. Here’s a decision framework:

11.6.1 When to Use pandas

Choose pandas when:

You’re working with small to medium datasets (<1GB in memory)
You need integration with scikit-learn, statsmodels, or other pandas-centric libraries
You’re learning and want maximum community support
Your code needs to be maintained by others who know pandas
You need pandas’ advanced time-series functionality (business day calendars, time zone handling)

Example use case: Exploratory analysis of a few thousand stocks with daily data for factor model estimation.

11.6.2 When to Use Polars

Choose Polars when:

You’re working with large datasets (>1GB)
Performance is critical (e.g., production pipelines, real-time systems)
You want cleaner, more maintainable code through the expression API
You need parallel processing without explicit threading code
String manipulation performance matters

Example use case: Processing tick-by-tick trade data for thousands of stocks to compute intraday liquidity measures.

11.6.3 When to Use DuckDB

Choose DuckDB when:

You’re comfortable with SQL and prefer declarative queries
You need to join multiple large datasets
Your data lives in files (CSV, Parquet) and you want to query without loading
You need maximum analytical query performance
You’re aggregating or filtering large datasets

Example use case: Joining quarterly earnings data with daily stock returns across thousands of companies and years.

11.6.4 Hybrid Approaches

You don’t need to choose just one. Modern workflows often combine tools:

# Read large file with DuckDB
duckdb_result = duckdb.sql("""
    SELECT ticker, date, price, volume
    FROM 'large_dataset.parquet'
    WHERE date >= '2020-01-01'
    AND ticker IN ('AAPL', 'MSFT', 'GOOG')
""")

# Convert to Polars for further processing
df_polars = pl.from_arrow(duckdb_result.arrow())

# Do complex feature engineering in Polars
features = df_polars.with_columns([
    pl.col('price').pct_change().over('ticker').alias('return'),
    pl.col('volume').rolling_mean(20).over('ticker').alias('avg_volume_20d')
])

# Convert to pandas for sklearn
df_pandas = features.to_pandas()

# Use in machine learning pipeline
# from sklearn.ensemble import RandomForestRegressor
# model = RandomForestRegressor()
# ...

Practical Recommendation

For MSc-level empirical finance work, I recommend:

Learn pandas first: It’s the standard, and you’ll encounter it everywhere
Add Polars for performance: When pandas becomes slow, rewrite critical sections in Polars
Use DuckDB for data wrangling: ETL pipelines and complex joins are often cleaner in SQL
Profile your code: Don’t optimize prematurely—measure where time is actually spent

The best practitioners know all three and choose the right tool for each task.

11.7 Summary

DataFrames are the fundamental abstraction for tabular data in empirical finance. This chapter introduced three powerful tools:

pandas: The established standard with broad ecosystem support. Best for general-purpose data analysis, learning, and integration with statistical libraries.

Polars: A modern, high-performance alternative with lazy evaluation and automatic parallelization. Best for large datasets and performance-critical applications.

DuckDB: SQL interface for analytical queries on DataFrames and files. Best for complex joins, aggregations, and data pipeline work.

The key insight is that these tools are complementary. Understanding when to use each—and how to combine them—will make you significantly more productive in empirical finance research.

In the next chapters, we’ll dive deeper into specific operations: data cleaning, merging and joining, time-series operations, and efficient computation with large datasets.

11.8 Further Reading

pandas documentation: Comprehensive guide to pandas
Polars User Guide: Official Polars documentation
DuckDB documentation: Complete DuckDB reference
McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O’Reilly. (by pandas creator)
VanderPlas, J. (2016). Python Data Science Handbook. O’Reilly.