6  Logging and Configuration

When you run a research pipeline—downloading data, cleaning it, estimating models, and generating output—things will go wrong. Files will be missing, APIs will fail, and edge cases will surface. Without proper logging, you’re left guessing what happened and when. This chapter covers Python’s built-in logging module and introduces Hydra, a powerful framework for managing complex research configurations.

Good logging practices are essential for reproducible research. When you run an analysis months later or share code with collaborators, logs provide a record of what the code did, what warnings occurred, and where things failed. Combined with proper configuration management, you can recreate any run exactly as it happened.

6.1 Why Logging Matters

Many researchers start by sprinkling print() statements throughout their code:

print("Starting data download...")
print(f"Downloaded {len(df)} rows")
print("WARNING: Missing values detected")
print("Error: API rate limit exceeded")

This approach has several problems:

  1. No severity levels: You can’t distinguish informational messages from warnings or errors
  2. No timestamps: You don’t know when events occurred
  3. No control: You can’t easily turn messages on or off or redirect them to files
  4. No context: You don’t know which module or function produced the message
  5. Cluttered output: Everything goes to the same place, making it hard to find important messages

A proper logging system offers:

  • Severity levels: DEBUG, INFO, WARNING, ERROR, CRITICAL—so you can filter by importance
  • Timestamps: Know exactly when each event occurred
  • Source information: See which module and function generated each message
  • Flexible output: Send logs to console, files, or external services
  • Configuration: Control logging behavior without changing code

6.2 Python’s logging Module

Python’s standard library includes a powerful logging module. While it has a learning curve, understanding its core concepts pays off in any serious project.

6.2.1 Basic Usage

The simplest way to use logging:

import logging

# Configure basic logging
logging.basicConfig(level=logging.INFO)

# Create a logger for this module
logger = logging.getLogger(__name__)

# Log messages at different levels
logger.debug("Detailed information for debugging")
logger.info("General information about program execution")
logger.warning("Something unexpected happened, but program continues")
logger.error("A serious problem occurred")
logger.critical("Program may not be able to continue")
INFO:__main__:General information about program execution
WARNING:__main__:Something unexpected happened, but program continues
ERROR:__main__:A serious problem occurred
CRITICAL:__main__:Program may not be able to continue

6.2.2 Understanding Log Levels

Log levels form a hierarchy. When you set a level, you see messages at that level and above:

Level      Numeric Value   When to Use
DEBUG      10              Detailed diagnostic information
INFO       20              Confirmation that things work as expected
WARNING    30              Something unexpected but not necessarily wrong
ERROR      40              A serious problem; some functionality failed
CRITICAL   50              A very serious error; program may crash

import logging

# Only show WARNING and above
logging.basicConfig(level=logging.WARNING, force=True)
logger = logging.getLogger("level_demo")

logger.debug("This won't appear")
logger.info("This won't appear either")
logger.warning("This will appear")
logger.error("This will definitely appear")
WARNING:level_demo:This will appear
ERROR:level_demo:This will definitely appear

The force=True parameter is needed here because basicConfig() was already called earlier in this chapter. By default, basicConfig() does nothing if the root logger already has handlers, so later calls are silently ignored. Passing force=True removes and closes any existing handlers and then reconfigures logging with the new settings.
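
The effect of force=True can be checked directly by inspecting the root logger's level after each call. This is a small sketch; the variable names are only for illustration.

```python
import logging

root = logging.getLogger()

# First call: installs a handler on the root logger and sets WARNING.
logging.basicConfig(level=logging.WARNING, force=True)
handlers_before = list(root.handlers)

# Second call without force: the root logger already has a handler,
# so this call changes nothing.
logging.basicConfig(level=logging.DEBUG)
level_after_plain_call = root.level

# Third call with force=True: existing handlers are removed and the
# new level takes effect.
logging.basicConfig(level=logging.DEBUG, force=True)
level_after_forced_call = root.level

print(logging.getLevelName(level_after_plain_call))   # WARNING
print(logging.getLevelName(level_after_forced_call))  # DEBUG
```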

Using log levels provides two key advantages:

  1. Route messages to different outputs: You can direct messages of different levels to different destinations—for example, send INFO messages to a file while only showing WARNING and above on the console.
  2. Control verbosity at runtime: You can leave all log messages in your code but choose at runtime which levels to display. This means you can include detailed DEBUG messages during development that won’t clutter your output in production unless you need them.
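
The first advantage can be sketched by giving each handler its own level. In the example below, the function name build_logger and the logger name routing_demo are illustrative choices, not part of the logging module: the file receives INFO and above, while the console only shows WARNING and above.

```python
import logging
import sys


def build_logger(log_path, name="routing_demo"):
    """Send INFO and above to a file, but only WARNING and above to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers decide what to keep
    logger.propagate = False        # avoid duplicate output via the root logger
    logger.handlers.clear()         # make repeated calls safe

    file_handler = logging.FileHandler(log_path, mode="w")
    file_handler.setLevel(logging.INFO)

    console_handler = logging.StreamHandler(sys.stderr)
    console_handler.setLevel(logging.WARNING)

    formatter = logging.Formatter("%(levelname)s:%(name)s:%(message)s")
    for handler in (file_handler, console_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    import tempfile
    from pathlib import Path

    log_file = Path(tempfile.mkdtemp()) / "routing_demo.log"
    demo = build_logger(str(log_file))
    demo.debug("Dropped by both handlers")
    demo.info("Written to the file only")
    demo.warning("Written to the file and shown on the console")
    print(log_file.read_text(), end="")
```

Setting the logger itself to DEBUG and filtering at the handlers keeps the routing decision in one place: changing what the console shows is a one-line change to console_handler.setLevel().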

6.2.3 Configuring Log Format

The default format is minimal. For research workflows, you typically want more information:

import logging

# Configure with a custom format
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    force=True
)

logger = logging.getLogger("format_demo")
logger.info("Now you can see when this happened")
2026-01-03 10:48:27 - format_demo - INFO - Now you can see when this happened

Common format fields:

  • %(asctime)s: Human-readable timestamp
  • %(name)s: Logger name (usually module name)
  • %(levelname)s: DEBUG, INFO, WARNING, etc.
  • %(message)s: The actual log message
  • %(filename)s: Source file name
  • %(lineno)d: Line number in source file
  • %(funcName)s: Function name
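
As a quick illustration of the source-location fields, the sketch below writes to an in-memory stream so it does not depend on any earlier basicConfig() call. The logger name fields_demo and the function clean_prices are made up for the example.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s %(name)s %(funcName)s:%(lineno)d - %(message)s")
)

logger = logging.getLogger("fields_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo independent of the root configuration
logger.handlers.clear()
logger.addHandler(handler)


def clean_prices():
    logger.info("Dropped 3 rows with missing prices")


clean_prices()
print(stream.getvalue(), end="")
# Prints a line of the form:
# INFO fields_demo clean_prices:<line number> - Dropped 3 rows with missing prices
```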

6.2.4 Logging to Files

For long-running research pipelines, you want logs saved to files:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('research_pipeline.log'),
        logging.StreamHandler()  # Also print to console
    ],
    force=True  # Replace handlers left over from earlier basicConfig() calls
)

logger = logging.getLogger(__name__)
logger.info("This goes to both the file and console")

6.2.5 Logging Exceptions

When catching exceptions, use logger.exception() to automatically include the traceback:

import logging

logging.basicConfig(level=logging.INFO, force=True)
logger = logging.getLogger("exception_demo")

def risky_calculation(x):
    return 1 / x

try:
    result = risky_calculation(0)
except ZeroDivisionError:
    logger.exception("Calculation failed")
    # The traceback is automatically included
ERROR:exception_demo:Calculation failed
Traceback (most recent call last):
  File "/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_77347/1524267329.py", line 10, in <module>
    result = risky_calculation(0)
  File "/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_77347/1524267329.py", line 7, in risky_calculation
    return 1 / x
           ~~^~~
ZeroDivisionError: division by zero

Including the full traceback is a tradeoff: it provides valuable debugging information, but the multi-line output can break the structure of log files, making them harder to parse or query programmatically. For production systems where logs are processed automatically, you might prefer logging just the exception message and using logger.error() instead.
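
A single-line alternative might look like the following sketch. The helper safe_ratio and the logger name single_line_demo are illustrative; the point is only that logger.error() records the exception type and message without the traceback.

```python
import io
import logging


def safe_ratio(numerator, denominator, logger):
    """Return numerator / denominator, or None if the division fails."""
    try:
        return numerator / denominator
    except ZeroDivisionError as exc:
        # One line only: exception type and message, no traceback.
        logger.error("Calculation failed: %s: %s", type(exc).__name__, exc)
        return None


# In-memory handler so the example does not depend on global configuration.
stream = io.StringIO()
demo_logger = logging.getLogger("single_line_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.handlers.clear()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
demo_logger.addHandler(handler)

result = safe_ratio(1, 0, demo_logger)
print(stream.getvalue(), end="")
# ERROR:single_line_demo:Calculation failed: ZeroDivisionError: division by zero
```

Each failure now occupies exactly one line, which keeps line-oriented tools working, at the cost of losing the call stack.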

6.2.6 Module-Level Loggers

The recommended pattern is to create a logger at the top of each module:

# In portfolio_analysis.py
import logging

logger = logging.getLogger(__name__)

def calculate_returns(prices):
    """Compute simple returns from a DataFrame of prices."""
    logger.info(f"Calculating returns for {len(prices)} observations")

    # Check the inputs: once dropna() has run, the returns cannot contain NaN
    if prices.isna().any().any():
        logger.warning("NaN values detected in prices")

    returns = prices.pct_change().dropna()
    logger.debug(f"Returns shape: {returns.shape}")
    return returns

The __name__ variable holds the module’s fully qualified name (e.g., myproject.portfolio_analysis), or "__main__" when the file is run directly as a script, which helps you trace where messages came from.
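
Dotted logger names also form a hierarchy, so a module logger inherits settings from its package logger. The sketch below uses the example name myproject.portfolio_analysis from the text; creating the loggers by name here stands in for what getLogger(__name__) would do inside the real modules.

```python
import logging

package_logger = logging.getLogger("myproject")
module_logger = logging.getLogger("myproject.portfolio_analysis")

# Setting a level on the package applies to every module below it
# that has not set a level of its own.
package_logger.setLevel(logging.DEBUG)

print(module_logger.parent.name)                                # myproject
print(logging.getLevelName(module_logger.getEffectiveLevel()))  # DEBUG
```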

6.3 Logging Best Practices

Here are key practices to follow when implementing logging in your research projects.

Use appropriate levels. Choose log levels thoughtfully:

# DEBUG: Detailed diagnostic info, usually only for debugging
logger.debug(f"Processing row {i}: values = {row}")

# INFO: Key milestones and confirmations
logger.info(f"Loaded {len(df)} rows from {filename}")

# WARNING: Unexpected but handled situations
logger.warning(f"Missing data for {ticker}, using interpolation")

# ERROR: Something failed, but program can continue
logger.error(f"Failed to download {ticker}: {e}")

# CRITICAL: Serious failure, program may need to stop
logger.critical("Database connection lost, cannot continue")

Include context in messages. Log messages should be self-explanatory:

# Bad: Not enough context
logger.info("Processing file")
logger.warning("Missing values")

# Good: Clear context
logger.info(f"Processing file: {filepath}")
logger.warning(f"Missing values in column '{col}': {count} rows affected")

Don’t log sensitive information. Be careful not to log passwords, API keys, or sensitive data:

# Bad: Logs the API key
logger.info(f"Connecting with API key: {api_key}")

# Good: Masks sensitive information
logger.info(f"Connecting with API key: {api_key[:4]}...")
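
As an extra safety net, a logging filter can redact known secrets before a record is written. This is a minimal sketch, not a complete solution: the class name RedactSecrets and the key value are made up, and it only rewrites the message text, not tracebacks.

```python
import io
import logging


class RedactSecrets(logging.Filter):
    """Replace known secret strings with '***' before a record is emitted."""

    def __init__(self, secrets):
        super().__init__()
        self.secrets = [s for s in secrets if s]

    def filter(self, record):
        message = record.getMessage()  # message with arguments merged in
        for secret in self.secrets:
            message = message.replace(secret, "***")
        record.msg = message
        record.args = None             # arguments are already merged
        return True                    # keep the record, now redacted


api_key = "sk-hypothetical-1234"       # made-up value for the demo

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactSecrets([api_key]))

logger = logging.getLogger("redact_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()
logger.addHandler(handler)

logger.info("Connecting with API key: %s", api_key)
print(stream.getvalue(), end="")
# Connecting with API key: ***
```

Attaching the filter to the handler rather than to one logger means it applies to every record that handler receives, whichever module produced it.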

Use structured logging for complex data. For data that might be parsed later, consider structured formats:

import json
import logging

logger = logging.getLogger(__name__)

# Log structured data
metrics = {
    'ticker': 'AAPL',
    'sharpe_ratio': 1.45,
    'max_drawdown': -0.15,
    'n_observations': 252
}
logger.info(f"Performance metrics: {json.dumps(metrics)}")
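
If entire log lines need to be machine-readable, one option is a formatter that emits one JSON object per line. The sketch below is an assumption-laden example: JsonLineFormatter is a made-up class, and passing structured fields through extra={"data": ...} is a convention chosen for this example, not something the logging module prescribes.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as a single JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Convention for this sketch: structured fields arrive via extra={"data": {...}}
        data = getattr(record, "data", None)
        if data is not None:
            payload["data"] = data
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLineFormatter())

logger = logging.getLogger("json_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()
logger.addHandler(handler)

metrics = {"ticker": "AAPL", "sharpe_ratio": 1.45, "max_drawdown": -0.15}
logger.info("Performance metrics", extra={"data": metrics})

# Each line can be parsed back without any string handling.
parsed = json.loads(stream.getvalue())
print(parsed["data"]["sharpe_ratio"])  # 1.45
```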

Configure logging once at entry point. Configure logging at your application’s entry point, not in library modules. Include a timestamp in the log filename so that each run generates a new file:

# In main.py or run_analysis.py
import logging
from datetime import datetime
from my_research.analysis import run_pipeline

def setup_logging():
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f'analysis_{timestamp}.log'),
            logging.StreamHandler()
        ]
    )

if __name__ == "__main__":
    setup_logging()
    run_pipeline()

Library modules should only create loggers, not configure them:

# In my_research/analysis.py
import logging

logger = logging.getLogger(__name__)  # Just create the logger

def run_pipeline():
    logger.info("Starting pipeline")
    # ...

6.4 Configuration Management with Hydra

As research projects grow, managing configuration becomes a challenge. You might have:

  • Different data sources (local files, APIs, databases)
  • Multiple model specifications to compare
  • Various output formats and destinations
  • Development vs. production settings

Hardcoding these in Python leads to messy code and makes it hard to reproduce specific runs. YAML configuration files help, but you end up writing boilerplate code to load and validate them.

Hydra is a framework developed by Facebook Research (Yadan 2019) that elegantly solves these problems. It provides:

  • Hierarchical configuration: Compose configs from multiple sources
  • Command-line overrides: Change any parameter without editing files
  • Automatic working directories: Each run gets its own output directory
  • Multi-run support: Sweep over parameter combinations

6.4.1 Installing Hydra

uv add hydra-core

6.4.2 Basic Hydra Application

Here’s a minimal Hydra application:

# my_analysis.py
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(f"Processing data from: {cfg.data.source}")
    print(f"Output directory: {cfg.output.dir}")
    print(f"Model: {cfg.model.name}")

if __name__ == "__main__":
    main()

With a configuration file:

# conf/config.yaml
data:
  source: "data/returns.parquet"
  start_date: "2020-01-01"
  end_date: "2023-12-31"

model:
  name: "ols"
  robust_se: true

output:
  dir: "results"
  format: "parquet"

Run it:

python my_analysis.py

Override parameters from command line:

python my_analysis.py data.start_date=2022-01-01 model.name=fama_macbeth
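
To build intuition for what a dotted override does, here is a simplified stand-in written in plain Python. It is not Hydra's implementation: apply_overrides is a hypothetical helper, it works on ordinary dictionaries, and it keeps every value as a string, whereas Hydra parses types and supports a richer override syntax.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]      # KeyError if the path does not exist
        if leaf not in node:
            raise KeyError(f"Unknown key: {dotted_key}")
        node[leaf] = value        # kept as a string in this sketch
    return result


base_config = {
    "data": {"source": "data/returns.parquet", "start_date": "2020-01-01"},
    "model": {"name": "ols", "robust_se": True},
}

updated = apply_overrides(
    base_config,
    ["data.start_date=2022-01-01", "model.name=fama_macbeth"],
)
print(updated["data"]["start_date"])  # 2022-01-01
print(updated["model"]["name"])       # fama_macbeth
```

The original dictionary is left untouched, which mirrors the idea that the files in conf/ stay as they are while each run gets its own resolved configuration.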

6.4.3 Configuration Composition

Hydra’s power comes from composing configurations. Organize your configs into groups:

conf/
├── config.yaml          # Main config with defaults
├── data/
│   ├── crsp.yaml       # CRSP data settings
│   ├── compustat.yaml  # Compustat settings
│   └── local.yaml      # Local file settings
├── model/
│   ├── ols.yaml
│   ├── fama_macbeth.yaml
│   └── panel.yaml
└── output/
    ├── paper.yaml      # Publication-ready output
    └── debug.yaml      # Quick debug output

The main config selects defaults:

# conf/config.yaml
defaults:
  - data: crsp
  - model: ols
  - output: paper

experiment_name: "baseline"

Each group config defines its settings:

# conf/data/crsp.yaml
source: "wrds"
database: "crsp"
table: "msf"
start_date: "1990-01-01"
end_date: "2023-12-31"

# conf/model/fama_macbeth.yaml
name: "fama_macbeth"
robust_se: true
lags: 5

Switch configurations easily:

# Use Compustat data with Fama-MacBeth model
python my_analysis.py data=compustat model=fama_macbeth

# Quick debug run
python my_analysis.py output=debug data.end_date=2020-01-31

6.4.4 Automatic Output Directories

Hydra automatically creates a unique output directory for each run:

outputs/
└── 2024-01-15/
    └── 14-30-22/
        ├── .hydra/
        │   ├── config.yaml      # Full resolved config
        │   ├── hydra.yaml       # Hydra settings
        │   └── overrides.yaml   # Command-line overrides
        ├── my_analysis.log      # Automatic logging
        └── results/             # Your output files

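This makes every run reproducible: you can see exactly what configuration was used. One caveat: with version_base=None on Hydra 1.2 or later, Hydra no longer changes the working directory by default, so a relative path such as results/ resolves against the directory you launched the script from. To write into the run directory as shown above, pass hydra.job.chdir=True or build paths from HydraConfig.get().runtime.output_dir.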

6.4.5 Using Hydra for Research Pipelines

Here’s a more complete example for an empirical finance pipeline:

# run_analysis.py
import logging
from pathlib import Path

import hydra
from omegaconf import DictConfig, OmegaConf
import pandas as pd

logger = logging.getLogger(__name__)


def load_data(cfg: DictConfig) -> pd.DataFrame:
    """Load data according to configuration."""
    logger.info(f"Loading data from {cfg.data.source}")

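    if cfg.data.source == "local":
        df = pd.read_parquet(cfg.data.path)
    elif cfg.data.source == "wrds":
        # WRDS loading logic goes here; fail loudly until it is implemented
        raise NotImplementedError("WRDS loading is not implemented in this example")
    else:
        raise ValueError(f"Unknown data source: {cfg.data.source}")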

    # Apply date filters
    df = df[(df['date'] >= cfg.data.start_date) &
            (df['date'] <= cfg.data.end_date)]

    logger.info(f"Loaded {len(df)} observations")
    return df


def run_model(df: pd.DataFrame, cfg: DictConfig) -> dict:
    """Run the specified model."""
    logger.info(f"Running {cfg.model.name} model")

    # Model logic here
    results = {"coefficients": {}, "stats": {}}

    return results


def save_results(results: dict, cfg: DictConfig) -> None:
    """Save results according to configuration."""
    output_dir = Path(cfg.output.dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save based on format
    if cfg.output.format == "parquet":
        # Save as parquet
        pass
    elif cfg.output.format == "latex":
        # Generate LaTeX tables
        pass

    logger.info(f"Results saved to {output_dir}")


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Log the full configuration
    logger.info("Configuration:\n" + OmegaConf.to_yaml(cfg))

    # Run pipeline
    df = load_data(cfg)
    results = run_model(df, cfg)
    save_results(results, cfg)

    logger.info("Pipeline completed successfully")


if __name__ == "__main__":
    main()

6.4.6 Multi-Run for Parameter Sweeps

Hydra can automatically run your code with multiple parameter combinations:

# Run with multiple date ranges
python run_analysis.py -m data.start_date=2010-01-01,2015-01-01,2020-01-01

# Sweep over models
python run_analysis.py -m model=ols,fama_macbeth,panel

Each combination gets its own output directory with full configuration tracking.

This feature is particularly useful for testing the sensitivity of your analysis to empirical choices. For example, you can run your analysis with multiple winsorization levels, different sample periods, or alternative variable definitions to ensure your results are robust to these choices.
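
With several comma-separated overrides, Hydra's default sweeper launches one job per combination of values. The plain-Python sketch below only illustrates that expansion; expand_sweep is a hypothetical helper, the key sample.winsorize is a made-up parameter, and the order in which Hydra actually schedules the jobs may differ.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand 'key=a,b' style overrides into one override list per run."""
    keys = []
    choices = []
    for item in overrides:
        key, values = item.split("=", 1)
        keys.append(key)
        choices.append(values.split(","))
    # Cartesian product: every value of one key with every value of the others
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


runs = expand_sweep(["model=ols,panel", "sample.winsorize=0.01,0.05"])
for run in runs:
    print(" ".join(run))
# model=ols sample.winsorize=0.01
# model=ols sample.winsorize=0.05
# model=panel sample.winsorize=0.01
# model=panel sample.winsorize=0.05
```

The number of runs is the product of the number of values per key, so sweeps over three or four choices at once grow quickly.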

6.4.7 Hydra with Logging

Hydra automatically configures Python’s logging module. Your log messages go to both the console and a file in the output directory:

import logging
import hydra
from omegaconf import DictConfig

logger = logging.getLogger(__name__)

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    logger.info("Starting analysis")  # Automatically logged to file
    logger.debug("Debug info")  # Hidden at the default INFO level; enable with hydra.verbose=true

    # Your code here

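You can customize logging by adding your own config to the hydra/job_logging group (a file such as conf/hydra/job_logging/custom.yaml, selected in the defaults list with "override hydra/job_logging: custom") or by overriding hydra.job_logging values in your main config.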

6.5 Summary

Proper logging and configuration management are essential for reproducible research:

  • Use Python’s logging module instead of print statements for production code
  • Choose appropriate log levels to distinguish routine information from warnings and errors
  • Include context in log messages so they’re meaningful when read later
  • Configure logging at the entry point, not in library modules
  • Use Hydra for configuration management in complex research pipelines
  • Leverage Hydra’s automatic output directories to make every run reproducible

These practices might seem like overhead for small scripts, but they pay dividends as projects grow. When you need to debug a failed run from last month or share code with collaborators, good logging and configuration management make the difference between hours of frustration and quickly finding the answer.