11  Introduction to DataFrames

11.1 Overview

In empirical finance, we work mostly with structured data: stock prices organized by date and ticker, financial statements arranged by company and quarter, portfolio returns indexed by time and strategy. The DataFrame abstraction is the fundamental tool for handling this kind of tabular data in Python.

This chapter introduces three major DataFrame libraries in the Python ecosystem: pandas (the established standard), Polars (a modern, high-performance alternative), and DuckDB (which brings SQL to DataFrames). Understanding when and how to use each tool will significantly impact your productivity and the performance of your empirical work.

11.2 The DataFrame Abstraction

11.2.1 What is a DataFrame?

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet in code, a database table with an API, or a generalization of a matrix that allows mixed types.

But a DataFrame is more than just a container. It’s an abstraction that encapsulates:

  1. Data storage: Efficient memory layout for columnar data
  2. Indexing: Labels for rows and columns enabling semantic access
  3. Operations: A rich API for transformations, aggregations, and joins
  4. Type system: Handling heterogeneous data types within a single structure

11.2.2 Why DataFrames Matter in Finance

In empirical finance research, DataFrames solve several critical problems:

Data Alignment: When you merge stock returns with company characteristics, the DataFrame automatically handles date and identifier alignment. No manual loops checking if dates match.

Missing Data: Financial datasets are riddled with missing values. DataFrames provide principled methods for handling missing values that respect the underlying data structure.

Performance: Modern DataFrame libraries use columnar storage and vectorized operations, making them orders of magnitude faster than row-by-row processing.

Expressiveness: Complex operations like “calculate rolling volatility by stock, then merge with quarterly earnings” can be expressed in a few lines of readable code.

NoteThe Columnar Advantage

DataFrames store data column-by-column rather than row-by-row. This means:

  • Operations on a single column (like calculating returns) are blazingly fast
  • Memory access patterns are optimized for modern CPUs
  • Compression works better when similar data is stored together
  • Aggregations can skip irrelevant columns entirely

This is why DataFrames can process millions of rows efficiently while naive Python loops struggle.

11.2.3 Core Concepts

All DataFrame libraries share several fundamental concepts:

Columns and Rows: The basic 2D structure. Columns typically represent variables (price, volume, return), while rows represent observations (a specific date, company, or transaction).

Index: A special column (or set of columns) that labels rows. In finance, this is often a date or a combination of date and identifier.

Schema: The data type of each column. Strongly-typed schemas enable optimization and catch errors early.

Operations: Transformations that produce new DataFrames. These include filtering, selecting, aggregating, and joining.

NoteBeyond two dimensions

If you need more than two dimensions for your data, the xarray library in Python can be helpful. It provides labeled, multi-dimensional arrays that extend the DataFrame concept to higher dimensions. While less widespread in finance than pandas or Polars, xarray is useful for working with panel data or other multi-dimensional structures.

11.3 pandas Overview

pandas is the original DataFrame library for Python and remains the most widely used. Created by Wes McKinney in 2008 for financial data analysis, it has become the de facto standard for data manipulation in Python. While pandas is no longer my go-to library for data processing (I now mostly use Polars for that), pandas remains the standard and is unavoidable because many additional libraries for plotting and statistics rely on pandas DataFrames. If you only learn one DataFrame library, it should be this one.

11.3.1 Basic Operations

Let’s start with creating and manipulating a simple DataFrame:

import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'GOOG', 'GOOG'],
    'date': pd.date_range('2024-01-01', periods=6, freq='D')[:6],
    'price': [150.5, 152.3, 380.2, 378.9, 140.1, 142.5],
    'volume': [1000000, 1050000, 800000, 820000, 950000, 980000]
})

df
1
The standard convention is to import pandas as pd, allowing you to reference it with this short alias throughout your code.
2
Create a DataFrame from a dictionary where keys become column names.
3
pd.date_range() generates a sequence of dates with the specified frequency ('D' for daily).
4
In Jupyter notebooks and Quarto, simply placing a DataFrame at the end of a cell displays it with nice formatting.
ticker date price volume
0 AAPL 2024-01-01 150.5 1000000
1 AAPL 2024-01-02 152.3 1050000
2 MSFT 2024-01-03 380.2 800000
3 MSFT 2024-01-04 378.9 820000
4 GOOG 2024-01-05 140.1 950000
5 GOOG 2024-01-06 142.5 980000

Key characteristics of this DataFrame:

  • Index: Automatically created as integers (0-5)
  • Columns: Four columns with mixed types (object, datetime64, float64, int64)
  • Shape: 6 rows × 4 columns

11.3.2 Indexing and Selection

pandas provides multiple ways to select data, which can be confusing at first but powerful once mastered. pandas is somewhat unique in this regard: Polars and DuckDB don’t rely on indexing the same way pandas does—they basically treat all columns the same. In pandas, the index column (or columns, if there’s a multi-index) has a special status that affects how the DataFrame behaves.

11.3.2.1 Column Selection

# Select a single column (returns a Series)
prices = df['price']
prices
0    150.5
1    152.3
2    380.2
3    378.9
4    140.1
5    142.5
Name: price, dtype: float64
# Select multiple columns (returns a DataFrame)
subset = df[['ticker', 'price']]
subset
ticker price
0 AAPL 150.5
1 AAPL 152.3
2 MSFT 380.2
3 MSFT 378.9
4 GOOG 140.1
5 GOOG 142.5

Notice that the double brackets in the second example aren’t a different operator—we’re still using the same square bracket indexing, but instead of passing a single column name, we’re passing a list of column names. When you pass a list, pandas returns a DataFrame; when you pass a single column name, it returns a Series.

11.3.2.2 Row Selection with .loc and .iloc

pandas distinguishes between label-based indexing (.loc) and position-based indexing (.iloc):

# Select rows 0-2 by position
df.iloc[0:3]
ticker date price volume
0 AAPL 2024-01-01 150.5 1000000
1 AAPL 2024-01-02 152.3 1050000
2 MSFT 2024-01-03 380.2 800000
# If we set an index, we can use .loc with labels
df_indexed = df.set_index('date')
df_indexed.loc['2024-01-01':'2024-01-03']
ticker price volume
date
2024-01-01 AAPL 150.5 1000000
2024-01-02 AAPL 152.3 1050000
2024-01-03 MSFT 380.2 800000
Tip.loc vs .iloc vs []
  • Use .loc[row_labels, column_labels] for label-based indexing
  • Use .iloc[row_positions, column_positions] for integer position-based indexing
  • Use [] for simple column selection or boolean masking

The distinction prevents ambiguity: df[0] is an error (is it column 0 or row 0?), but df.iloc[0] and df['column_0'] are clear.

11.3.2.3 Boolean Indexing

Filtering data is a core operation in empirical finance:

# Select rows where price > 150
high_price = df[df['price'] > 150]
high_price
ticker date price volume
0 AAPL 2024-01-01 150.5 1000000
1 AAPL 2024-01-02 152.3 1050000
2 MSFT 2024-01-03 380.2 800000
3 MSFT 2024-01-04 378.9 820000

The expression inside the square brackets (df['price'] > 150) is itself a Series of Boolean values. Let’s look at what it produces:

# The condition produces a Boolean Series
df['price'] > 150
0     True
1     True
2     True
3     True
4    False
5    False
Name: price, dtype: bool

The rows that get selected are those where the Boolean value is True.

# Multiple conditions: AAPL stocks with price > 150
aapl_high = df[(df['ticker'] == 'AAPL') & (df['price'] > 150)]
aapl_high
ticker date price volume
0 AAPL 2024-01-01 150.5 1000000
1 AAPL 2024-01-02 152.3 1050000
# Using query method for more readable syntax
result = df.query('ticker == "AAPL" and price > 150')
result
ticker date price volume
0 AAPL 2024-01-01 150.5 1000000
1 AAPL 2024-01-02 152.3 1050000

11.3.3 Creating New Columns

A common operation is creating new columns from existing ones. When you perform arithmetic operations between columns, pandas aligns them by index and applies the operation element by element:

# Create a new column by multiplying two existing columns
df['dollar_volume'] = df['price'] * df['volume']
df
1
Element-wise multiplication: each row’s price is multiplied by that row’s volume.
ticker date price volume dollar_volume
0 AAPL 2024-01-01 150.5 1000000 150500000.0
1 AAPL 2024-01-02 152.3 1050000 159915000.0
2 MSFT 2024-01-03 380.2 800000 304160000.0
3 MSFT 2024-01-04 378.9 820000 310698000.0
4 GOOG 2024-01-05 140.1 950000 133095000.0
5 GOOG 2024-01-06 142.5 980000 139650000.0

When you multiply or add a scalar, it applies to all elements in the column:

# Multiply a column by a scalar
df['price_cents'] = df['price'] * 100
df
1
The scalar 100 is applied to every element in the price column.
ticker date price volume dollar_volume price_cents
0 AAPL 2024-01-01 150.5 1000000 150500000.0 15050.0
1 AAPL 2024-01-02 152.3 1050000 159915000.0 15230.0
2 MSFT 2024-01-03 380.2 800000 304160000.0 38020.0
3 MSFT 2024-01-04 378.9 820000 310698000.0 37890.0
4 GOOG 2024-01-05 140.1 950000 133095000.0 14010.0
5 GOOG 2024-01-06 142.5 980000 139650000.0 14250.0

You can also combine columns with different operations:

# Combining columns with arithmetic
df['avg_price_per_share'] = df['dollar_volume'] / df['volume']
df[['ticker', 'price', 'avg_price_per_share']]
ticker price avg_price_per_share
0 AAPL 150.5 150.5
1 AAPL 152.3 152.3
2 MSFT 380.2 380.2
3 MSFT 378.9 378.9
4 GOOG 140.1 140.1
5 GOOG 142.5 142.5
NoteIndex alignment in operations

When you perform operations between two columns (or two DataFrames), pandas automatically aligns them by their index. This is powerful but can also be a source of subtle bugs if your indices don’t match as expected. We’ll cover sorting, grouping, and aggregation operations in a later chapter.

11.3.4 Strengths and Limitations

Strengths:

  • Mature ecosystem with extensive documentation
  • Rich functionality covering nearly every data manipulation need
  • Excellent integration with other libraries (scikit-learn, statsmodels, matplotlib)
  • Flexible indexing system enabling time-series analysis
  • Wide adoption means abundant examples, Stack Overflow answers, and AI coding assistance

Limitations:

  • Performance can degrade with multi-gigabyte datasets, depending on available RAM
  • Memory usage is higher than necessary due to row-based implementation details
  • API inconsistencies accumulated over 15+ years of development
  • Single-threaded execution for most operations
  • String operations are particularly slow

11.4 Polars Overview

Polars is a modern DataFrame library written in Rust, designed from the ground up for performance. It addresses many of pandas’ limitations while maintaining a familiar API for common operations.

Polars differs from pandas in several important ways:

Lazy evaluation allows you to write your data processing code without immediately executing it. Polars builds a query plan and defers execution until you explicitly collect the results. This enables the query optimizer to analyze the entire pipeline and find efficiencies—such as eliminating unnecessary computations, reordering operations, or pushing filters earlier in the pipeline.

Parallel execution means Polars can automatically use multiple CPU cores for operations that support it. Unlike pandas, which is largely single-threaded, Polars can process different parts of your data concurrently, leading to significant speedups on modern multi-core machines.

Apache Arrow memory representation is an efficient columnar format that Polars uses internally. Arrow improves interoperability with other libraries that support the format (pandas includes Arrow as an optional backend), and generally makes operations more efficient. Importantly, Arrow explicitly supports missing values as a distinct type rather than relying on NaN, which is important for accurate calculations and proper handling of missing data. Arrow also enables efficient reading and writing to the optimized Parquet file format.

The expression API provides a composable syntax for data transformations that is not identical to pandas. Users coming from the R tidyverse may find it feels more familiar than pandas’ approach. Expressions can be combined and optimized together, leading to more efficient execution.

11.4.1 Basic Operations

import polars as pl

df_pl = pl.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'GOOG', 'GOOG'],
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'price': [150.5, 152.3, 380.2, 378.9, 140.1, 142.5],
    'volume': [1000000, 1050000, 800000, 820000, 950000, 980000]
})

df_pl
1
The standard convention is to import Polars as pl.
shape: (6, 4)
ticker date price volume
str datetime[ns] f64 i64
"AAPL" 2024-01-01 00:00:00 150.5 1000000
"AAPL" 2024-01-02 00:00:00 152.3 1050000
"MSFT" 2024-01-03 00:00:00 380.2 800000
"MSFT" 2024-01-04 00:00:00 378.9 820000
"GOOG" 2024-01-05 00:00:00 140.1 950000
"GOOG" 2024-01-06 00:00:00 142.5 980000

11.4.2 The Expression API

Polars introduces an expression-based API that is more composable than pandas:

# Select and transform columns
result = df_pl.select([
    pl.col('ticker'),
    pl.col('price'),
    (pl.col('price') * pl.col('volume')).alias('dollar_volume')
])
result
1
pl.col() references a column by name.
2
.alias() names the resulting column.
shape: (6, 3)
ticker price dollar_volume
str f64 f64
"AAPL" 150.5 1.505e8
"AAPL" 152.3 1.59915e8
"MSFT" 380.2 3.0416e8
"MSFT" 378.9 3.10698e8
"GOOG" 140.1 1.33095e8
"GOOG" 142.5 1.3965e8
# Filtering with expressions
high_price_pl = df_pl.filter(pl.col('price') > 150)
high_price_pl
shape: (4, 4)
ticker date price volume
str datetime[ns] f64 i64
"AAPL" 2024-01-01 00:00:00 150.5 1000000
"AAPL" 2024-01-02 00:00:00 152.3 1050000
"MSFT" 2024-01-03 00:00:00 380.2 800000
"MSFT" 2024-01-04 00:00:00 378.9 820000
# Chaining operations
result = (
    df_pl
    .filter(pl.col('ticker') == 'AAPL')
    .with_columns((pl.col('price') * 1.1).alias('price_plus_10pct'))
    .select(['ticker', 'date', 'price', 'price_plus_10pct'])
)
result
1
.with_columns() adds or replaces columns in the DataFrame.
shape: (2, 4)
ticker date price price_plus_10pct
str datetime[ns] f64 f64
"AAPL" 2024-01-01 00:00:00 150.5 165.55
"AAPL" 2024-01-02 00:00:00 152.3 167.53

11.4.3 Lazy Evaluation

Polars’ lazy API builds a query plan and optimizes it before execution:

lazy_df = df_pl.lazy()

# Build a query (nothing executes yet)
lazy_query = (
    lazy_df
    .filter(pl.col('volume') > 850000)
    .group_by('ticker')
    .agg([
        pl.col('price').mean().alias('avg_price'),
        pl.col('volume').sum().alias('total_volume')
    ])
    .sort('avg_price', descending=True)
)

result = lazy_query.collect()
result
1
.lazy() converts an eager DataFrame to a lazy frame.
2
.collect() executes the optimized query plan and returns the result.
shape: (2, 3)
ticker avg_price total_volume
str f64 i64
"AAPL" 151.4 2050000
"GOOG" 141.3 1930000
ImportantWhen to Use Lazy Evaluation

Lazy evaluation shines when:

  • You have a complex chain of operations
  • Your data is large enough that optimization matters
  • You’re reading from files (Polars can push down filters and projections)

For small datasets and simple operations, the overhead of optimization may not be worth it. Use eager mode (regular DataFrames) for interactive exploration.

11.4.4 Grouping and Aggregation

# Group by with multiple aggregations
summary_pl = df_pl.group_by('ticker').agg([
    pl.col('price').mean().alias('avg_price'),
    pl.col('price').std().alias('std_price'),
    pl.col('volume').sum().alias('total_volume')
])
summary_pl
shape: (3, 4)
ticker avg_price std_price total_volume
str f64 f64 i64
"MSFT" 379.55 0.919239 1620000
"GOOG" 141.3 1.697056 1930000
"AAPL" 151.4 1.272792 2050000
# Window functions for returns calculation
df_pl_sorted = df_pl.sort(['ticker', 'date'])
df_with_returns = df_pl_sorted.with_columns(
    pl.col('price').pct_change().over('ticker').alias('return')
)
df_with_returns
shape: (6, 5)
ticker date price volume return
str datetime[ns] f64 i64 f64
"AAPL" 2024-01-01 00:00:00 150.5 1000000 null
"AAPL" 2024-01-02 00:00:00 152.3 1050000 0.01196
"GOOG" 2024-01-05 00:00:00 140.1 950000 null
"GOOG" 2024-01-06 00:00:00 142.5 980000 0.017131
"MSFT" 2024-01-03 00:00:00 380.2 800000 null
"MSFT" 2024-01-04 00:00:00 378.9 820000 -0.003419

11.4.5 Strengths and Limitations

Strengths:

  • Significantly faster than pandas, especially for large datasets
  • Lower memory usage thanks to Arrow format
  • Automatic parallelization
  • Lazy evaluation enables query optimization
  • More consistent API design
  • Excellent handling of string and categorical data

Limitations:

  • Smaller ecosystem (fewer third-party integrations)
  • Less mature documentation and community resources
  • Some advanced pandas features not yet implemented
  • Different mental model requires learning curve
  • Index-less design may require adaptation for time-series workflows

11.5 DuckDB and SQL for DataFrames

DuckDB is an in-process SQL database optimized for analytical queries. While not strictly a DataFrame library, it provides a powerful alternative interface for DataFrame operations through SQL.

While this book focuses mostly on Python, SQL is also a very important language for financial data analytics because it is the language most databases use. Many third-party databases can be accessed through SQL from Python and return DataFrames, but DuckDB presents a nice alternative that lets you write SQL directly within your Python environment. DuckDB doesn’t need to run inside Python—it can also operate as a standalone tool—but using it within Python makes it easy to take partial results and transfer them to pandas or Polars for further processing. Like Polars, DuckDB uses Arrow for its memory representation, enabling efficient data exchange between tools.

11.5.1 Why SQL for DataFrames?

SQL has several advantages:

  1. Declarative: Describe what you want, not how to get it
  2. Optimized: Query optimizers can find efficient execution plans
  3. Familiar: Many finance professionals already know SQL
  4. Standard: SQL skills transfer across tools and platforms

11.5.2 Basic Usage

import duckdb

result = duckdb.sql("""
    SELECT ticker,
           AVG(price) as avg_price,
           SUM(volume) as total_volume
    FROM df
    GROUP BY ticker
    ORDER BY avg_price DESC
""").df()

result
1
.df() converts the DuckDB result to a pandas DataFrame.
ticker avg_price total_volume
0 MSFT 379.55 1620000.0
1 AAPL 151.40 2050000.0
2 GOOG 141.30 1930000.0

Notice that we didn’t need to import the DataFrame into DuckDB explicitly. DuckDB automatically detects pandas DataFrames in the scope and makes them queryable.

11.5.3 Advanced SQL Operations

# Window functions for returns
returns_sql = duckdb.sql("""
    SELECT ticker,
           date,
           price,
           price / LAG(price) OVER (PARTITION BY ticker ORDER BY date) - 1 as return
    FROM df
    ORDER BY ticker, date
""").df()

returns_sql
ticker date price return
0 AAPL 2024-01-01 150.5 NaN
1 AAPL 2024-01-02 152.3 0.011960
2 GOOG 2024-01-05 140.1 NaN
3 GOOG 2024-01-06 142.5 0.017131
4 MSFT 2024-01-03 380.2 NaN
5 MSFT 2024-01-04 378.9 -0.003419
# Complex filtering and aggregation
analysis = duckdb.sql("""
    WITH high_volume AS (
        SELECT *
        FROM df
        WHERE volume > 900000
    )
    SELECT
        ticker,
        COUNT(*) as n_high_volume_days,
        AVG(price) as avg_price_on_high_volume
    FROM high_volume
    GROUP BY ticker
""").df()

analysis
ticker n_high_volume_days avg_price_on_high_volume
0 GOOG 2 141.3
1 AAPL 2 151.4

11.5.4 Integration with Multiple Sources

DuckDB can query multiple DataFrame types simultaneously:

# Query both pandas and Polars DataFrames in one query
combined = duckdb.sql("""
    SELECT
        p.ticker,
        p.avg_price as pandas_avg,
        pl.avg_price as polars_avg
    FROM (SELECT ticker, AVG(price) as avg_price FROM df GROUP BY ticker) p
    JOIN (SELECT ticker, AVG(price) as avg_price FROM df_pl GROUP BY ticker) pl
        ON p.ticker = pl.ticker
""").df()

combined
ticker pandas_avg polars_avg
0 GOOG 141.30 141.30
1 AAPL 151.40 151.40
2 MSFT 379.55 379.55

11.5.5 DuckDB Relations API

For a more programmatic interface, DuckDB provides a relational API:

rel = duckdb.sql("SELECT * FROM df")

# Chain operations
result = (
    rel
    .filter("volume > 900000")
    .aggregate("ticker, AVG(price) as avg_price")
    .order("avg_price DESC")
)

result.df()
1
Create a relation from a DataFrame that can be queried programmatically.
ticker avg_price
0 AAPL 151.4
1 GOOG 141.3

11.5.6 Strengths and Limitations

Strengths:

  • Extremely fast for analytical queries
  • Optimized query plans can outperform hand-written DataFrame code
  • SQL is a standard, transferable skill
  • Seamless integration with pandas and Polars
  • Can query files directly without loading into memory
  • Excellent for joins and complex aggregations

Limitations:

  • SQL syntax can be verbose for simple operations
  • Less suitable for row-wise operations or complex custom logic
  • Debugging SQL queries can be harder than Python code
  • Some operations are more natural in a DataFrame API
  • Embedded database means limited concurrency

11.6 Choosing Between pandas, Polars, and DuckDB

The right tool depends on your specific use case. Here’s a decision framework:

11.6.1 When to Use pandas

Choose pandas when:

  • You’re working with small to medium datasets (<1GB in memory)
  • You need integration with scikit-learn, statsmodels, or other pandas-centric libraries
  • You’re learning and want maximum community support
  • Your code needs to be maintained by others who know pandas
  • You need pandas’ advanced time-series functionality (business day calendars, time zone handling)

Example use case: Exploratory analysis of a few thousand stocks with daily data for factor model estimation.

11.6.2 When to Use Polars

Choose Polars when:

  • You’re working with large datasets (>1GB)
  • Performance is critical (e.g., production pipelines, real-time systems)
  • You want cleaner, more maintainable code through the expression API
  • You need parallel processing without explicit threading code
  • String manipulation performance matters

Example use case: Processing tick-by-tick trade data for thousands of stocks to compute intraday liquidity measures.

11.6.3 When to Use DuckDB

Choose DuckDB when:

  • You’re comfortable with SQL and prefer declarative queries
  • You need to join multiple large datasets
  • Your data lives in files (CSV, Parquet) and you want to query without loading
  • You need maximum analytical query performance
  • You’re aggregating or filtering large datasets

Example use case: Joining quarterly earnings data with daily stock returns across thousands of companies and years.

11.6.4 Hybrid Approaches

You don’t need to choose just one. Modern workflows often combine tools:

# Read large file with DuckDB
duckdb_result = duckdb.sql("""
    SELECT ticker, date, price, volume
    FROM 'large_dataset.parquet'
    WHERE date >= '2020-01-01'
    AND ticker IN ('AAPL', 'MSFT', 'GOOG')
""")

# Convert to Polars for further processing
df_polars = pl.from_arrow(duckdb_result.arrow())

# Do complex feature engineering in Polars
features = df_polars.with_columns([
    pl.col('price').pct_change().over('ticker').alias('return'),
    pl.col('volume').rolling_mean(20).over('ticker').alias('avg_volume_20d')
])

# Convert to pandas for sklearn
df_pandas = features.to_pandas()

# Use in machine learning pipeline
# from sklearn.ensemble import RandomForestRegressor
# model = RandomForestRegressor()
# ...
TipPractical Recommendation

For MSc-level empirical finance work, I recommend:

  1. Learn pandas first: It’s the standard, and you’ll encounter it everywhere
  2. Add Polars for performance: When pandas becomes slow, rewrite critical sections in Polars
  3. Use DuckDB for data wrangling: ETL pipelines and complex joins are often cleaner in SQL
  4. Profile your code: Don’t optimize prematurely—measure where time is actually spent

The best practitioners know all three and choose the right tool for each task.

11.7 Summary

DataFrames are the fundamental abstraction for tabular data in empirical finance. This chapter introduced three powerful tools:

pandas: The established standard with broad ecosystem support. Best for general-purpose data analysis, learning, and integration with statistical libraries.

Polars: A modern, high-performance alternative with lazy evaluation and automatic parallelization. Best for large datasets and performance-critical applications.

DuckDB: SQL interface for analytical queries on DataFrames and files. Best for complex joins, aggregations, and data pipeline work.

The key insight is that these tools are complementary. Understanding when to use each—and how to combine them—will make you significantly more productive in empirical finance research.

In the next chapters, we’ll dive deeper into specific operations: data cleaning, merging and joining, time-series operations, and efficient computation with large datasets.

11.8 Further Reading

  • pandas documentation: Comprehensive guide to pandas
  • Polars User Guide: Official Polars documentation
  • DuckDB documentation: Complete DuckDB reference
  • McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O’Reilly. (by pandas creator)
  • VanderPlas, J. (2016). Python Data Science Handbook. O’Reilly.