12 Data Input and File Formats

The first step in any empirical analysis is getting your data into a form where you can work with it. In the world of finance, data comes in many shapes and sizes: CSV files exported from Bloomberg, Excel spreadsheets from analysts, proprietary formats from data vendors, and increasingly, large datasets stored in efficient binary formats. Understanding how to work with different file formats efficiently is crucial for productive data analysis.

The choice of file format matters more than you might think. A format that works well for a small dataset of a few hundred stocks might become painfully slow when you’re dealing with millions of trades. Similarly, while Excel is ubiquitous in finance, it has limitations when working with large datasets or when you need to ensure reproducibility. In this chapter, we’ll explore the most common file formats you’ll encounter in empirical finance and learn how to work with them using both pandas and polars, two powerful Python libraries for data manipulation.

Pandas has long been the standard for data analysis in Python, offering a rich API and extensive functionality. Polars is a newer library built in Rust that offers significantly faster performance, especially for larger datasets, while maintaining a similar conceptual model. Throughout this chapter, we’ll show examples using both libraries so you can choose the right tool for your specific needs.

12.1 CSV Files

CSV (Comma-Separated Values) is perhaps the most universal data format. It’s simple, human-readable, and supported by virtually every data tool and programming language. In finance, you’ll frequently encounter CSV files: historical price data from exchanges, bulk downloads from data vendors, exports from Bloomberg terminals, and outputs from other analysis tools.

The beauty of CSV lies in its simplicity. Each line represents a row of data, with values separated by commas (or sometimes other delimiters like tabs or semicolons). The first line typically contains column names. Here’s what a simple CSV file of stock prices might look like:

date,ticker,price,volume
2024-01-02,AAPL,185.64,52000000
2024-01-02,MSFT,374.58,28000000
2024-01-03,AAPL,184.25,48000000
2024-01-03,MSFT,371.32,25000000

However, this simplicity comes with some drawbacks. CSV files store everything as text, so data types must be inferred when reading. They don’t compress particularly well compared to binary formats. And for very large files, parsing text can be slow compared to reading binary data directly.

12.1.1 Reading CSV Files with Pandas

pandas supports many file formats

Pandas supports reading and writing a wide variety of file formats beyond CSV, including Excel spreadsheets, SAS data sets, Stata data sets, and many others. However, not all functionality is implemented in pandas itself—it relies on third-party libraries for some formats.

For CSV files specifically, pandas includes multiple parsing engines. The Python engine is the most flexible but also the slowest. You can also use the C engine (the default) or the PyArrow engine for better performance. Whenever you need to handle a nonstandard CSV file or work with a different file format, consult the pandas I/O documentation for guidance on which engine or approach to use.

Reading a CSV file with pandas is straightforward using the read_csv() function:

import pandas as pd

# Basic CSV reading
df = pd.read_csv("stock_prices.csv")

# View the first few rows
print(df.head())

The read_csv() function is remarkably flexible and can handle many variations and edge cases you’ll encounter with real-world data:

# Specify date columns to parse
df = pd.read_csv(
    "stock_prices.csv",
    parse_dates=["date"],
    index_col="date"
)

# Handle different delimiters (e.g., semicolon-separated)
df = pd.read_csv("data.csv", sep=";")

# Skip rows or specify column names
df = pd.read_csv(
    "data.csv",
    skiprows=2,
    names=["date", "price", "vol"],
    header=None
)

# Handle missing values
df = pd.read_csv(
    "data.csv",
    na_values=[".", "NA", "n/a"]
)

# Read only specific columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["date", "ticker", "price"]
)

1: Convert date column to datetime during parsing
2: Use the date column as the DataFrame index
3: Use semicolon as delimiter instead of comma
4: Skip the first 2 rows of the file
5: Provide custom column names
6: Indicate the file has no header row
7: Treat additional strings as missing values (NaN)
8: Read only the specified columns, ignoring others

Memory considerations with large CSV files

When reading very large CSV files, pandas loads the entire file into memory. If your CSV file is larger than your available RAM, you have several options:

Read the file in chunks using the chunksize parameter
Use usecols to read only the columns you need
Use polars with lazy evaluation (discussed below)
Convert the file to a more efficient format like Parquet

For truly large files, you can process them in chunks:

# Process CSV in chunks
chunk_size = 100000
results = []

for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    # Process each chunk
    filtered = chunk[chunk["volume"] > 1000000]
    results.append(filtered)

# Combine all results
df = pd.concat(results, ignore_index=True)

1: Read the file in chunks of 100,000 rows at a time
2: Filter each chunk to keep only rows with high volume
3: Append filtered results to a list
4: Combine all filtered chunks into a single DataFrame

Pandas also supports reading compressed CSV files directly. If you have a CSV file compressed as .gz, .bz2, .zip, or .xz, pandas will automatically decompress it:

# Read a gzip-compressed CSV file
df = pd.read_csv("stock_prices.csv.gz")

# Read from a ZIP archive containing a single CSV file
df = pd.read_csv("data.zip")

12.1.2 Writing CSV Files with Pandas

Writing data to CSV is just as easy using the to_csv() method:

# Basic CSV writing
df.to_csv("output.csv")

# Customize the output
df.to_csv(
    "output.csv",
    index=False,           # Don't write row indices
    float_format="%.4f",   # Format floats to 4 decimal places
    encoding="utf-8"       # Specify encoding
)

# Write to a compressed file
df.to_csv("output.csv.gz", compression="gzip")

12.1.3 Reading CSV Files with Polars

Polars offers two ways to read CSV files: eager reading with read_csv() and lazy reading with scan_csv(). Eager reading loads the entire file into memory immediately, while lazy reading creates a query plan that can be optimized before execution.

import polars as pl

# Eager reading - similar to pandas
df = pl.read_csv("stock_prices.csv")

# Specify data types for better performance
df = pl.read_csv(
    "stock_prices.csv",
    dtypes={"ticker": pl.Categorical, "volume": pl.Int64},
    try_parse_dates=True
)

# Lazy reading - more efficient for large files
lazy_df = pl.scan_csv("large_file.csv")

# Build up a query (not executed yet)
result = (
    lazy_df
    .filter(pl.col("volume") > 1000000)
    .select(["date", "ticker", "price"])
    .group_by("ticker")
    .agg(pl.col("price").mean().alias("avg_price"))
)

# Execute the optimized query
final_df = result.collect()

1: Explicitly specify data types for columns
2: Automatically parse date-like columns
3: Create a lazy frame that defers reading until needed
4: Filter rows (not executed yet)
5: Select only the columns we need (not executed yet)
6: Calculate average price by ticker (not executed yet)
7: Execute the optimized query plan and return the result

The lazy approach is particularly powerful because polars can optimize the entire query before executing it. For example, if you’re filtering rows and then selecting specific columns, polars can read only those columns for only the rows that pass the filter, dramatically reducing I/O.

Lazy evaluation for large datasets

When working with CSV files that are large but still fit in memory after filtering, using scan_csv() followed by filtering operations can be much faster than read_csv(). Polars will optimize the query to avoid reading unnecessary data.

This is especially valuable when your CSV file has many columns but you only need a few, or when you’ll be filtering most rows out early in your analysis.

12.1.4 Writing CSV Files with Polars

Writing CSV files with polars is similarly straightforward:

# Write a DataFrame to CSV
df.write_csv("output.csv")

# You can also write to a file-like object
with open("output.csv", "w") as f:
    df.write_csv(f)

12.2 Parquet and Apache Arrow

While CSV files are universal and human-readable, they’re not the most efficient format for large datasets. This is where Parquet comes in. Parquet is a columnar storage format originally developed for use in the Hadoop ecosystem but now widely used across the data science world. It offers several key advantages over CSV:

Compression: Parquet files are typically 5-10x smaller than equivalent CSV files
Speed: Reading Parquet is much faster because data is stored in a binary format optimized for quick access
Type preservation: Data types are stored in the file, so there’s no need for type inference
Complex data types: Parquet handles complex types like datetimes with timezone information natively, encoding them exactly as-is with no loss of information or conversion errors
Column selection: You can read only specific columns without parsing the entire file
Predicate pushdown: Filters can be applied during reading, avoiding loading unnecessary data

The columnar storage format is key to Parquet’s efficiency. Rather than storing data row by row (like CSV), Parquet stores each column’s data together. This means if you want to read just a few columns, Parquet only needs to read those column chunks, not the entire file. It also means values in the same column can be compressed very efficiently since similar values are stored together.

Parquet is built on Apache Arrow, a cross-language development platform for in-memory data. Arrow defines a standardized columnar memory format that is highly efficient for analytical operations. Both pandas and polars can work with Arrow, and polars is built on Arrow from the ground up.

When to use Parquet

Parquet is an excellent choice for:

Large datasets that you’ll read multiple times
Datasets where you often need only a subset of columns
Sharing data between different tools and languages
Long-term data storage with efficient compression

Stick with CSV for:

Small datasets where file size doesn’t matter
Data that needs to be human-readable
Sharing with tools that don’t support Parquet
Quick exports for inspection in Excel or text editors

When I receive a dataset larger than a few megabytes, my first step is usually to run a script that reads it and converts it to Parquet. All subsequent processing then benefits from Parquet’s faster read times and smaller file size.

12.2.1 Reading and Writing Parquet with Pandas

Pandas makes working with Parquet files very simple:

import pandas as pd

# Read a Parquet file
df = pd.read_parquet("stock_prices.parquet")

# Read only specific columns
df = pd.read_parquet(
    "large_file.parquet",
    columns=["date", "ticker", "price"]
)

# Read using PyArrow as both the engine and in-memory backend
df = pd.read_parquet(
    "stock_prices.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow"
)

# Write to Parquet
df.to_parquet("output.parquet")

# Write without the index
df.to_parquet("output.parquet", index=False)

Using dtype_backend="pyarrow" keeps the data in pandas using the PyArrow memory format, which is the same format that polars uses internally. This can provide significant performance improvements for certain operations, though not all pandas features are fully supported with this backend yet.

12.2.2 Reading and Writing Parquet with Polars

Polars has excellent support for Parquet and, like with CSV, offers both eager and lazy reading:

import polars as pl

# Eager reading
df = pl.read_parquet("stock_prices.parquet")

# Read only specific columns
df = pl.read_parquet(
    "large_file.parquet",
    columns=["date", "ticker", "price"]
)

# Lazy reading with scan_parquet
lazy_df = pl.scan_parquet("large_file.parquet")

# Build an optimized query
result = (
    lazy_df
    .filter(pl.col("date") >= "2024-01-01")
    .select(["ticker", "price", "volume"])
    .group_by("ticker")
    .agg([
        pl.col("price").mean().alias("avg_price"),
        pl.col("volume").sum().alias("total_volume")
    ])
    .collect()
)

# Write to Parquet
df.write_parquet("output.parquet")

1: Only read the specified columns from the Parquet file
2: Create a lazy frame that defers execution
3: Filters are pushed down to the Parquet reader when possible
4: Execute the query and return the result

The combination of Parquet’s columnar format and polars’ query optimization can be extremely powerful. When you use scan_parquet(), polars can push filters down to the Parquet reader, meaning only relevant data is loaded into memory.

12.2.3 Practical Example: CSV to Parquet Conversion

A common workflow is to receive data as CSV (perhaps from a vendor or Bloomberg) and immediately convert it to Parquet for more efficient analysis:

import pandas as pd

# Read CSV with appropriate data types
df = pd.read_csv(
    "stock_data.csv",
    parse_dates=["date"],
    dtype={
        "ticker": "category",  # Use categorical for string columns with few unique values
        "volume": "int64"
    }
)

# Save as Parquet
df.to_parquet("stock_data.parquet", compression="snappy")

# Future reads will be much faster
df_fast = pd.read_parquet("stock_data.parquet")

For very large CSV files, you might do this conversion with polars:

import polars as pl

# Lazy read CSV, optimize, and write to Parquet
(
    pl.scan_csv("large_stock_data.csv")
    .select([
        pl.col("date").str.to_date(),
        pl.col("ticker").cast(pl.Categorical),
        pl.col("price").cast(pl.Float64),
        pl.col("volume").cast(pl.Int64)
    ])
    .sink_parquet("stock_data.parquet")
)

1: Create a lazy frame from the CSV file
2: Parse the date string column to a proper date type
3: Convert ticker to categorical for memory efficiency
4: Stream results directly to Parquet without loading into memory

The sink_parquet() method is particularly efficient because it streams the data to disk without loading everything into memory first.

12.3 Excel Files

Despite the advantages of formats like CSV and Parquet, Excel remains ubiquitous in finance. Analysts send Excel files, regulatory filings include Excel attachments, and many financial models live in Excel workbooks. As a result, being able to read from and write to Excel is an essential skill for empirical finance work.

Excel files (.xlsx) are actually compressed archives containing XML files that describe the workbook structure, data, and formatting. This makes them more complex than simple text formats but also allows them to store multiple sheets, formulas, formatting, and metadata in a single file.

12.3.1 Reading Excel Files with Pandas

Pandas relies on third-party libraries for Excel support. These are optional dependencies that are not installed by default when you install pandas. If a required library is missing, pandas will notify you when you try to use the related function and indicate which package to install. You can then install the missing dependency with uv add.

Pandas provides robust Excel support through the read_excel() function:

import pandas as pd

# Read the first sheet
df = pd.read_excel("financial_data.xlsx")

# Read a specific sheet by name
df = pd.read_excel("financial_data.xlsx", sheet_name="Returns")

# Read a specific sheet by index (0-indexed)
df = pd.read_excel("financial_data.xlsx", sheet_name=1)

# Read multiple sheets at once
sheets_dict = pd.read_excel(
    "financial_data.xlsx",
    sheet_name=["Returns", "Fundamentals"]
)
# Returns a dictionary: {"Returns": df1, "Fundamentals": df2}

# Read all sheets
all_sheets = pd.read_excel("financial_data.xlsx", sheet_name=None)

1: Specify a sheet by its name
2: Specify a sheet by its position (0-indexed)
3: Read multiple sheets by passing a list of names
4: Use None to read all sheets as a dictionary

Excel files often have headers on multiple rows, merged cells, or data that doesn’t start in cell A1. Pandas provides parameters to handle these cases:

# Skip rows at the beginning
df = pd.read_excel(
    "report.xlsx",
    skiprows=3,  # Skip first 3 rows
    sheet_name="Data"
)

# Specify header row
df = pd.read_excel(
    "report.xlsx",
    header=2  # Use row 2 (0-indexed) as column names
)

# Specify which columns to read
df = pd.read_excel(
    "data.xlsx",
    usecols="A:D"  # Read only columns A through D
)

# Or specify by column names
df = pd.read_excel(
    "data.xlsx",
    usecols=["Date", "Price", "Volume"]
)

# Handle dates
df = pd.read_excel(
    "data.xlsx",
    parse_dates=["date_column"]
)

Excel file size and performance

Reading Excel files is significantly slower than reading CSV or Parquet files, especially for large datasets. If you’re repeatedly reading the same Excel file, consider converting it to Parquet for better performance.

Additionally, Excel has a row limit of 1,048,576 rows. If you’re working with larger datasets, you’ll need to use a different format.

12.3.2 Writing Excel Files with Pandas

Writing to Excel is similarly straightforward:

# Write to a single sheet
df.to_excel("output.xlsx", sheet_name="Data", index=False)

# Write multiple DataFrames to different sheets
with pd.ExcelWriter("output.xlsx") as writer:
    df_returns.to_excel(writer, sheet_name="Returns", index=False)
    df_fundamentals.to_excel(writer, sheet_name="Fundamentals", index=False)
    df_summary.to_excel(writer, sheet_name="Summary", index=False)

# Specify float formatting
df.to_excel(
    "output.xlsx",
    sheet_name="Data",
    index=False,
    float_format="%.4f"
)

You can also use the ExcelWriter context manager to add formatting:

# Create an Excel file with formatting
with pd.ExcelWriter(
    "formatted_output.xlsx",
    engine="xlsxwriter"
) as writer:
    df.to_excel(writer, sheet_name="Data", index=False)

    # Get the workbook and worksheet objects
    workbook = writer.book
    worksheet = writer.sheets["Data"]

    # Add a format
    format1 = workbook.add_format({"num_format": "$#,##0.00"})

    # Apply formatting to a column (e.g., column B)
    worksheet.set_column("B:B", 12, format1)

12.3.3 Reading Excel Files with Polars

Polars also supports reading Excel files, though with a focus on data extraction rather than preserving formatting:

import polars as pl

# Read the first sheet
df = pl.read_excel("financial_data.xlsx")

# Read a specific sheet
df = pl.read_excel(
    "financial_data.xlsx",
    sheet_name="Returns"
)

# Read with specific sheet index
df = pl.read_excel("financial_data.xlsx", sheet_id=2)

Polars uses the fastexcel engine by default, which is optimized for speed. For more complex Excel files, you might need to install additional engines like openpyxl or xlsx2csv.

12.3.4 Writing Excel Files with Polars

Polars can write DataFrames to Excel files, though it requires the xlsxwriter library:

import polars as pl

# Write to Excel (requires xlsxwriter)
df.write_excel("output.xlsx")

# Write to a specific worksheet
df.write_excel("output.xlsx", worksheet="MyData")

Excel as a data exchange format

While Excel is convenient for sharing data with non-technical stakeholders, it’s not ideal for:

Large datasets (performance and row limits)
Preserving exact data types (dates can be problematic)
Version control (binary format is hard to diff)
Reproducible research (formulas and manual edits aren’t tracked)

Consider using Excel for final outputs and presentation, but use CSV or Parquet for data storage and analysis pipelines.

12.4 Choosing the Right Format

We’ve covered three common file formats: CSV, Parquet, and Excel. Each has its place in a data analysis workflow. Here’s a decision guide:

Use CSV when:

Working with small to medium datasets (under 1 GB)
You need human-readable data
Sharing data with tools that don’t support other formats
You value simplicity and universality over performance
Version controlling data (text files work well with git)

Use Parquet when:

Working with large datasets (multiple GB or more)
You’ll read the data multiple times
You often need only a subset of columns
Performance and storage efficiency matter
Sharing data between different tools and languages
Building data pipelines and ETL processes

Use Excel when:

Sharing data with business users
You need multiple related tables in one file (multiple sheets)
The recipient expects Excel format
Working with small datasets where performance doesn’t matter
You need to include formatting, formulas, or charts

In practice, many empirical finance workflows involve all three: receiving data in Excel or CSV, converting to Parquet for efficient analysis, and exporting results back to Excel for presentation.

12.5 Best Practices for Data I/O

As you work with different file formats, keep these best practices in mind:

1. Specify data types explicitly

When reading data, especially CSV files, explicitly specifying data types can improve both performance and correctness:

# Pandas
df = pd.read_csv(
    "data.csv",
    dtype={
        "ticker": "category",
        "volume": "int64",
        "price": "float64"
    },
    parse_dates=["date"]
)

# Polars
df = pl.read_csv(
    "data.csv",
    dtypes={
        "ticker": pl.Categorical,
        "volume": pl.Int64,
        "price": pl.Float64
    },
    try_parse_dates=True
)

2. Use compression for large files

Both CSV and Parquet support compression. For CSV files you’ll store long-term, use gzip compression:

# Pandas - automatic compression based on filename
df.to_csv("data.csv.gz")

# Explicit compression
df.to_parquet("data.parquet", compression="snappy")

3. Consider storage vs. memory tradeoffs

Parquet files with strong compression (like gzip or zstd) are smaller on disk but take longer to read. For files you’ll read frequently, snappy compression offers a good balance. For archival storage, use stronger compression.

4. Validate data after reading

After reading data, especially from external sources, validate that it matches your expectations:

import pandas as pd
import numpy as np

# Sample data with some issues to detect
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "ticker": ["AAPL", "MSFT", "AAPL", "AAPL"],
    "price": [185.64, 374.58, None, 184.25],
    "volume": [52000000, 28000000, 48000000, 48000000]
})

# Check for missing values
df.isnull().sum()

date      0
ticker    0
price     1
volume    0
dtype: int64

# Check data types
df.dtypes

date      datetime64[ns]
ticker            object
price            float64
volume             int64
dtype: object

# Basic statistics
df.describe()

	date	price	volume
count	4	3.000000	4.000000e+00
mean	2024-01-02 12:00:00	248.156667	4.400000e+07
min	2024-01-02 00:00:00	184.250000	2.800000e+07
25%	2024-01-02 00:00:00	184.945000	4.300000e+07
50%	2024-01-02 12:00:00	185.640000	4.800000e+07
75%	2024-01-03 00:00:00	280.110000	4.900000e+07
max	2024-01-03 00:00:00	374.580000	5.200000e+07
std	NaN	109.488024	1.083205e+07

# Check for duplicates
df.duplicated().sum()

np.int64(0)

5. Document your data sources

Keep a record of where your data came from, when it was downloaded, and any transformations applied. This is crucial for reproducibility:

# Add metadata when saving
metadata = {
    "source": "Bloomberg Terminal",
    "download_date": "2024-01-15",
    "description": "Daily stock prices for S&P 500 constituents"
}

# Parquet supports metadata
df.to_parquet("data.parquet", metadata=metadata)

Working with data files is the foundation of empirical finance research. By understanding the strengths and limitations of different file formats, and by using the right tools for each format, you can build efficient, reproducible data analysis pipelines. In the next chapter, we’ll explore data cleaning techniques to prepare your data for analysis.