15  Reshaping Data

Financial data rarely arrives in the exact format you need for analysis. Stock returns might come as one row per stock-date combination, but you need a matrix with dates as rows and stocks as columns. Or you might receive a wide table of quarterly earnings across columns that needs to be converted into a tidy long format for regression analysis. These transformations—collectively known as “reshaping”—are fundamental operations in empirical finance.

This chapter covers the essential reshaping operations you’ll use repeatedly: converting between long and wide formats, pivoting and melting data, and stacking and unstacking hierarchical indices. We’ll work through practical examples using both pandas and Polars, showing how each library handles these transformations.

15.1 Long vs. Wide Formats

The distinction between long and wide formats is fundamental to data analysis. Understanding when to use each format, and how to convert between them, will save you countless hours of frustration.

15.1.1 What Are Long and Wide Formats?

Wide format stores each entity as a row and each time period (or category) as a separate column. This is how many people naturally think about panel data—it looks like a spreadsheet:

date        AAPL    MSFT    GOOGL
2024-01-01  0.012   0.008   0.015
2024-01-02  -0.005  0.003   0.002
2024-01-03  0.018   0.012   0.010

Long format (also called “tidy” or “narrow” format) stores each observation as a separate row, with columns indicating the entity, time period, and value:

date        ticker  return
2024-01-01  AAPL    0.012
2024-01-01  MSFT    0.008
2024-01-01  GOOGL   0.015
2024-01-02  AAPL    -0.005
2024-01-02  MSFT    0.003
...

The same information appears in both formats, but the organization is completely different.

15.1.2 When to Use Each Format

Use wide format when:

  • Performing matrix operations (correlation matrices, portfolio optimization)
  • Calculating cross-sectional statistics (comparing values across entities at a point in time)
  • Creating visualizations that show multiple series over time
  • Working with time series models that expect matrices

Use long format when:

  • Running regressions or statistical models (most modeling libraries expect long data)
  • Creating grouped or faceted visualizations
  • Filtering or aggregating by category
  • Following “tidy data” principles for data analysis

In empirical finance, you’ll often start with data in one format and need to convert it for your specific analysis. For example, stock return data from CRSP arrives in long format (one row per stock-date), but calculating a covariance matrix requires wide format (dates as rows, stocks as columns).

15.1.3 A Simple Example

Let’s create a small dataset of daily stock returns in long format:

import pandas as pd
import polars as pl
from datetime import datetime, timedelta

# Create sample return data in long format
dates = pd.date_range('2024-01-01', periods=5, freq='D')
tickers = ['AAPL', 'MSFT', 'GOOGL']

# pandas version
data_long_pd = pd.DataFrame([
    {'date': date, 'ticker': ticker, 'return': round(0.01 * (hash(str(date) + ticker) % 100 - 50) / 100, 4)}
    for date in dates
    for ticker in tickers
])

data_long_pd.head(10)
date ticker return
0 2024-01-01 AAPL 0.0047
1 2024-01-01 MSFT 0.0023
2 2024-01-01 GOOGL -0.0038
3 2024-01-02 AAPL 0.0047
4 2024-01-02 MSFT 0.0035
5 2024-01-02 GOOGL -0.0009
6 2024-01-03 AAPL -0.0027
7 2024-01-03 MSFT -0.0035
8 2024-01-03 GOOGL -0.0026
9 2024-01-04 AAPL -0.0017
# Polars version
data_long_pl = pl.DataFrame([
    {'date': date, 'ticker': ticker, 'return': round(0.01 * (hash(str(date) + ticker) % 100 - 50) / 100, 4)}
    for date in dates
    for ticker in tickers
])

data_long_pl.head(10)
shape: (10, 3)
date ticker return
datetime[μs] str f64
2024-01-01 00:00:00 "AAPL" 0.0047
2024-01-01 00:00:00 "MSFT" 0.0023
2024-01-01 00:00:00 "GOOGL" -0.0038
2024-01-02 00:00:00 "AAPL" 0.0047
2024-01-02 00:00:00 "MSFT" 0.0035
2024-01-02 00:00:00 "GOOGL" -0.0009
2024-01-03 00:00:00 "AAPL" -0.0027
2024-01-03 00:00:00 "MSFT" -0.0035
2024-01-03 00:00:00 "GOOGL" -0.0026
2024-01-04 00:00:00 "AAPL" -0.0017

This long format is ideal for most statistical operations. Each row is an observation: a specific stock on a specific date. You can easily filter to specific stocks, calculate summary statistics by ticker, or run panel regressions.

15.2 Pivot, Melt, Stack, Unstack

Python’s data libraries provide several operations for reshaping data. While they might seem redundant at first, each serves a specific purpose and understanding all of them makes you more fluent in data manipulation.

15.2.1 Pivot: Long to Wide

The pivot operation transforms long format data into wide format. You specify which column becomes the index (rows), which becomes the columns, and which contains the values to fill the resulting matrix.

pandas syntax:

# Convert long to wide: dates as rows, tickers as columns
data_wide_pd = data_long_pd.pivot(
    index='date',
    columns='ticker',
    values='return'
)

data_wide_pd
1
Column that becomes the row index
2
Column whose unique values become new columns
3
Column containing the values to fill the matrix
ticker AAPL GOOGL MSFT
date
2024-01-01 0.0047 -0.0038 0.0023
2024-01-02 0.0047 -0.0009 0.0035
2024-01-03 -0.0027 -0.0026 -0.0035
2024-01-04 -0.0017 0.0007 -0.0033
2024-01-05 -0.0013 0.0006 -0.0012

Notice that pivot creates a DataFrame where:

  • Each unique date becomes a row
  • Each unique ticker becomes a column
  • Return values fill the cells

Polars syntax:

# Polars uses pivot differently - need to specify aggregation
data_wide_pl = data_long_pl.pivot(
    index='date',
    columns='ticker',
    values='return'
)

data_wide_pl
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_60679/105412148.py:2: DeprecationWarning: the argument `columns` for `DataFrame.pivot` is deprecated. It was renamed to `on` in version 1.0.0.
  data_wide_pl = data_long_pl.pivot(
shape: (5, 4)
date AAPL MSFT GOOGL
datetime[μs] f64 f64 f64
2024-01-01 00:00:00 0.0047 0.0023 -0.0038
2024-01-02 00:00:00 0.0047 0.0035 -0.0009
2024-01-03 00:00:00 -0.0027 -0.0035 -0.0026
2024-01-04 00:00:00 -0.0017 -0.0033 0.0007
2024-01-05 00:00:00 -0.0013 -0.0012 0.0006

Polars’ pivot is similar but requires you to think about aggregation from the start. If multiple rows could map to the same cell (same date and ticker), you must specify how to combine them (sum, mean, first, etc.).

15.2.2 Handling Duplicate Entries

One common issue with pivoting is duplicate entries. What happens if your data has multiple return observations for the same stock on the same date?

# Create data with duplicates
data_with_dups_pd = pd.DataFrame([
    {'date': '2024-01-01', 'ticker': 'AAPL', 'return': 0.01},
    {'date': '2024-01-01', 'ticker': 'AAPL', 'return': 0.02},  # duplicate!
    {'date': '2024-01-01', 'ticker': 'MSFT', 'return': 0.03},
])

# pandas pivot will fail with duplicates
try:
    data_with_dups_pd.pivot(index='date', columns='ticker', values='return')
except ValueError as e:
    print(f"Error: {e}")
Error: Index contains duplicate entries, cannot reshape

When pandas encounters duplicates, it raises an error. You have two options:

  1. Use pivot_table with an aggregation function
  2. Clean your data first to remove duplicates
# Option 1: Use pivot_table with aggregation
wide_with_agg_pd = data_with_dups_pd.pivot_table(
    index='date',
    columns='ticker',
    values='return',
    aggfunc='mean'  # or 'sum', 'first', 'last', etc.
)

wide_with_agg_pd
ticker AAPL MSFT
date
2024-01-01 0.015 0.03

Polars requires you to specify aggregation from the start, so duplicates are handled automatically:

data_with_dups_pl = pl.DataFrame([
    {'date': '2024-01-01', 'ticker': 'AAPL', 'return': 0.01},
    {'date': '2024-01-01', 'ticker': 'AAPL', 'return': 0.02},
    {'date': '2024-01-01', 'ticker': 'MSFT', 'return': 0.03},
])

# Polars pivot with aggregation
wide_with_agg_pl = data_with_dups_pl.pivot(
    index='date',
    columns='ticker',
    values='return',
    aggregate_function='mean'
)

wide_with_agg_pl
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_60679/556176386.py:8: DeprecationWarning: the argument `columns` for `DataFrame.pivot` is deprecated. It was renamed to `on` in version 1.0.0.
  wide_with_agg_pl = data_with_dups_pl.pivot(
shape: (1, 3)
date AAPL MSFT
str f64 f64
"2024-01-01" 0.015 0.03

15.2.3 Melt/Unpivot: Wide to Long

The melt operation (called unpivot in Polars) is the inverse of pivot—it transforms wide format data back into long format. You specify which columns to keep as identifiers and which to “melt” into a single column.

pandas syntax:

# Reset index to make date a regular column
data_wide_pd_reset = data_wide_pd.reset_index()

# Melt back to long format
data_melted_pd = data_wide_pd_reset.melt(
    id_vars=['date'],
    value_vars=['AAPL', 'MSFT', 'GOOGL'],
    var_name='ticker',
    value_name='return'
)

data_melted_pd.head(10)
1
Columns to keep as identifiers
2
Columns to melt (optional—melts all others if not specified)
3
Name for the new column containing original column names
4
Name for the new column containing values
date ticker return
0 2024-01-01 AAPL 0.0047
1 2024-01-02 AAPL 0.0047
2 2024-01-03 AAPL -0.0027
3 2024-01-04 AAPL -0.0017
4 2024-01-05 AAPL -0.0013
5 2024-01-01 MSFT 0.0023
6 2024-01-02 MSFT 0.0035
7 2024-01-03 MSFT -0.0035
8 2024-01-04 MSFT -0.0033
9 2024-01-05 MSFT -0.0012

Polars syntax:

# Polars uses unpivot (melt is deprecated)
data_melted_pl = data_wide_pl.unpivot(
    index=['date'],
    on=['AAPL', 'MSFT', 'GOOGL'],
    variable_name='ticker',
    value_name='return'
)

data_melted_pl.head(10)
1
Columns to keep as identifiers (called index in Polars)
2
Columns to unpivot (called on in Polars)
shape: (10, 3)
date ticker return
datetime[μs] str f64
2024-01-01 00:00:00 "AAPL" 0.0047
2024-01-02 00:00:00 "AAPL" 0.0047
2024-01-03 00:00:00 "AAPL" -0.0027
2024-01-04 00:00:00 "AAPL" -0.0017
2024-01-05 00:00:00 "AAPL" -0.0013
2024-01-01 00:00:00 "MSFT" 0.0023
2024-01-02 00:00:00 "MSFT" 0.0035
2024-01-03 00:00:00 "MSFT" -0.0035
2024-01-04 00:00:00 "MSFT" -0.0033
2024-01-05 00:00:00 "MSFT" -0.0012

15.2.4 Stack and Unstack: Index-Based Reshaping

While pivot and melt work with columns, stack and unstack work with index levels. They’re particularly useful when you have MultiIndex DataFrames (hierarchical indices with multiple levels).

Stack: Moves a column level into the row index (wide to long)

Unstack: Moves a row index level into the columns (long to wide)

# Create a MultiIndex DataFrame
# Set both date and ticker as index
data_multi_pd = data_long_pd.set_index(['date', 'ticker'])

data_multi_pd.head()
return
date ticker
2024-01-01 AAPL 0.0047
MSFT 0.0023
GOOGL -0.0038
2024-01-02 AAPL 0.0047
MSFT 0.0035
# Unstack: move ticker from index to columns (long to wide)
data_unstacked_pd = data_multi_pd.unstack(level='ticker')

data_unstacked_pd
return
ticker AAPL GOOGL MSFT
date
2024-01-01 0.0047 -0.0038 0.0023
2024-01-02 0.0047 -0.0009 0.0035
2024-01-03 -0.0027 -0.0026 -0.0035
2024-01-04 -0.0017 0.0007 -0.0033
2024-01-05 -0.0013 0.0006 -0.0012
# Stack: move ticker from columns back to index (wide to long)
data_stacked_pd = data_unstacked_pd.stack(level='ticker')

data_stacked_pd.head()
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_60679/4172318071.py:2: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  data_stacked_pd = data_unstacked_pd.stack(level='ticker')
return
date ticker
2024-01-01 AAPL 0.0047
GOOGL -0.0038
MSFT 0.0023
2024-01-02 AAPL 0.0047
GOOGL -0.0009

The key difference between pivot/melt and stack/unstack:

  • pivot/melt: Work with regular columns, create new structure from scratch
  • stack/unstack: Work with existing MultiIndex structure, move levels between index and columns

Polars doesn’t have direct equivalents to stack/unstack because it doesn’t use MultiIndex. Instead, you would use combinations of pivot, melt, group_by, and join to achieve the same results.

15.3 Summary

Reshaping data is a fundamental skill in empirical finance. The key operations are:

  • Long vs. Wide: Choose the format appropriate for your analysis
    • Long for modeling and tidy operations
    • Wide for matrix operations and cross-sectional analysis
  • Pivot/Melt: Convert between long and wide formats
    • pivot: long → wide (create matrix from observations)
    • melt/unpivot: wide → long (create observations from matrix)
    • Handle duplicates carefully with aggregation
  • Stack/Unstack: Work with MultiIndex structures
    • unstack: move index level to columns
    • stack: move column level to index
    • Primarily useful in pandas (Polars doesn’t use MultiIndex)

The specific syntax differs between pandas and Polars, but the concepts are universal. pandas uses pivot, pivot_table, melt, stack, and unstack. Polars uses pivot and unpivot, relying more on grouped operations for reshaping tasks.

Practice these operations until they become second nature. You’ll use them constantly in empirical finance, and fluency with reshaping will make your analyses both faster and more reliable.