When you run a research pipeline—downloading data, cleaning it, estimating models, and generating output—things will go wrong. Files will be missing, APIs will fail, and edge cases will surface. Without proper logging, you’re left guessing what happened and when. This chapter covers Python’s built-in logging module and introduces Hydra, a powerful framework for managing complex research configurations.
Good logging practices are essential for reproducible research. When you run an analysis months later or share code with collaborators, logs provide a record of what the code did, what warnings occurred, and where things failed. Combined with proper configuration management, you can recreate any run exactly as it happened.
6.1 Why Logging Matters
Many researchers start by sprinkling print() statements throughout their code:
    print("Starting data download...")
    print(f"Downloaded {len(df)} rows")
    print("WARNING: Missing values detected")
    print("Error: API rate limit exceeded")
This approach has several problems:
No severity levels: You can’t distinguish informational messages from warnings or errors
No timestamps: You don’t know when events occurred
No control: You can’t easily turn messages on or off or redirect them to files
No context: You don’t know which module or function produced the message
Cluttered output: Everything goes to the same place, making it hard to find important messages
A proper logging system offers:
Severity levels: DEBUG, INFO, WARNING, ERROR, CRITICAL—so you can filter by importance
Timestamps: Know exactly when each event occurred
Source information: See which module and function generated each message
Flexible output: Send logs to console, files, or external services
Configuration: Control logging behavior without changing code
6.2 Python’s logging Module
Python’s standard library includes a powerful logging module. While it has a learning curve, understanding its core concepts pays off in any serious project.
6.2.1 Basic Usage
The simplest way to use logging:
    import logging

    # Configure basic logging
    logging.basicConfig(level=logging.INFO)

    # Create a logger for this module
    logger = logging.getLogger(__name__)

    # Log messages at different levels
    logger.debug("Detailed information for debugging")
    logger.info("General information about program execution")
    logger.warning("Something unexpected happened, but program continues")
    logger.error("A serious problem occurred")
    logger.critical("Program may not be able to continue")
INFO:__main__:General information about program execution
WARNING:__main__:Something unexpected happened, but program continues
ERROR:__main__:A serious problem occurred
CRITICAL:__main__:Program may not be able to continue
6.2.2 Understanding Log Levels
Log levels form a hierarchy. When you set a level, you see messages at that level and above:
| Level    | Numeric Value | When to Use                                     |
|----------|---------------|-------------------------------------------------|
| DEBUG    | 10            | Detailed diagnostic information                 |
| INFO     | 20            | Confirmation that things work as expected       |
| WARNING  | 30            | Something unexpected but not necessarily wrong  |
| ERROR    | 40            | A serious problem; some functionality failed    |
| CRITICAL | 50            | A very serious error; program may crash         |
    import logging

    # Only show WARNING and above
    logging.basicConfig(level=logging.WARNING, force=True)
    logger = logging.getLogger("level_demo")

    logger.debug("This won't appear")
    logger.info("This won't appear either")
    logger.warning("This will appear")
    logger.error("This will definitely appear")
The force=True parameter is needed here because we already called basicConfig() earlier in this chapter. By default, basicConfig() only configures logging once—subsequent calls are ignored. Using force=True removes any existing handlers and reconfigures logging with the new settings.
WARNING:level_demo:This will appear
ERROR:level_demo:This will definitely appear
Using log levels provides two key advantages:
Route messages to different outputs: You can direct messages of different levels to different destinations—for example, send INFO messages to a file while only showing WARNING and above on the console.
Control verbosity at runtime: You can leave all log messages in your code but choose at runtime which levels to display. This means you can include detailed DEBUG messages during development that won’t clutter your output in production unless you need them.
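The first advantage can be sketched by attaching two handlers with different levels to one logger. The logger name, the messages, and the temporary log location below are illustrative assumptions, not part of any particular pipeline:

```python
import logging
import tempfile
from pathlib import Path

# Hypothetical log location; a temporary directory keeps the sketch self-contained
log_path = Path(tempfile.mkdtemp()) / "pipeline.log"

logger = logging.getLogger("routing_demo")
logger.setLevel(logging.DEBUG)  # let everything through; the handlers do the filtering
logger.propagate = False        # keep the demo independent of any root configuration

file_handler = logging.FileHandler(log_path)
file_handler.setLevel(logging.INFO)        # the file records INFO and above

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)  # the console shows WARNING and above

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
for handler in (file_handler, console_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.debug("Dropped by both handlers")
logger.info("Written to the file only")
logger.warning("Written to the file and shown on the console")
```

Setting the level on each handler rather than on the logger is what lets the same message stream be filtered differently per destination.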
6.2.3 Configuring Log Format
The default format is minimal. For research workflows, you typically want more information:
    import logging

    # Configure with a custom format
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S',
        force=True
    )

    logger = logging.getLogger("format_demo")
    logger.info("Now you can see when this happened")
2026-01-03 10:48:27 - format_demo - INFO - Now you can see when this happened
Common format fields:
%(asctime)s: Human-readable timestamp
%(name)s: Logger name (usually module name)
%(levelname)s: DEBUG, INFO, WARNING, etc.
%(message)s: The actual log message
%(filename)s: Source file name
%(lineno)d: Line number in source file
%(funcName)s: Function name
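The source-location fields in the list above are useful when you need to find the exact line that produced a message. A minimal sketch, using a hypothetical function name and capturing the output in memory so it does not depend on the console:

```python
import io
import logging

stream = io.StringIO()  # capture output in memory for the demonstration
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(filename)s:%(lineno)d in %(funcName)s() - %(message)s"
))

logger = logging.getLogger("fields_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

def clean_prices():
    # Hypothetical pipeline step; only the log call matters here
    logger.info("Dropped rows with missing prices")

clean_prices()
print(stream.getvalue(), end="")
```

The printed line contains the file name and line number of the logger.info() call, followed by "in clean_prices()".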
6.2.4 Logging to Files
For long-running research pipelines, you want logs saved to files:
    import logging

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler('research_pipeline.log'),
            logging.StreamHandler()  # Also print to console
        ]
    )

    logger = logging.getLogger(__name__)
    logger.info("This goes to both the file and console")
6.2.5 Logging Exceptions
When catching exceptions, use logger.exception() to automatically include the traceback:
    import logging

    logging.basicConfig(level=logging.INFO, force=True)
    logger = logging.getLogger("exception_demo")

    def risky_calculation(x):
        return 1 / x

    try:
        result = risky_calculation(0)
    except ZeroDivisionError:
        logger.exception("Calculation failed")
        # The traceback is automatically included
ERROR:exception_demo:Calculation failed
Traceback (most recent call last):
File "/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_77347/1524267329.py", line 10, in <module>
result = risky_calculation(0)
File "/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_77347/1524267329.py", line 7, in risky_calculation
return 1 / x
~~^~~
ZeroDivisionError: division by zero
Including the full traceback is a tradeoff: it provides valuable debugging information, but the multi-line output can break the structure of log files, making them harder to parse or query programmatically. For production systems where logs are processed automatically, you might prefer logging just the exception message and using logger.error() instead.
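The single-line alternative might look like the following sketch, which logs only the exception type and message. The logger name is an assumption, and the output is captured in memory for the demonstration:

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("compact_exception_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

def risky_calculation(x):
    return 1 / x

try:
    risky_calculation(0)
except ZeroDivisionError as exc:
    # One line per failure: exception type and message, no traceback
    logger.error("Calculation failed: %s: %s", type(exc).__name__, exc)

print(stream.getvalue(), end="")
# ERROR:compact_exception_demo:Calculation failed: ZeroDivisionError: division by zero
```

Each failure stays on one line, which keeps the log file easy to parse, at the cost of losing the call stack.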
6.2.6 Module-Level Loggers
The recommended pattern is to create a logger at the top of each module:
    # In portfolio_analysis.py
    import logging

    logger = logging.getLogger(__name__)

    def calculate_returns(prices):
        logger.info(f"Calculating returns for {len(prices)} observations")
        # The first row of pct_change() is NaN by construction, so skip it
        returns = prices.pct_change().iloc[1:]

        # Check before dropping: after dropna() this test could never be true
        if returns.isna().any().any():
            logger.warning("NaN values detected in returns")
        returns = returns.dropna()

        logger.debug(f"Returns shape: {returns.shape}")
        return returns
The __name__ variable becomes the module’s fully qualified name (e.g., myproject.portfolio_analysis), which helps you trace where messages came from.
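Dotted logger names also form a hierarchy: a logger without its own level or handlers inherits them from its parent. The sketch below uses explicit names in place of __name__, with myproject as a hypothetical package, to show that configuring only the package-level logger is enough for its modules:

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

# Configure only the package-level logger ("myproject" is a hypothetical package name)
package_logger = logging.getLogger("myproject")
package_logger.setLevel(logging.INFO)
package_logger.propagate = False
package_logger.addHandler(handler)

# Equivalent to logging.getLogger(__name__) inside myproject/portfolio_analysis.py
module_logger = logging.getLogger("myproject.portfolio_analysis")
module_logger.info("Calculating returns")
module_logger.debug("Filtered out: the effective level is inherited from 'myproject'")

print(stream.getvalue(), end="")
# myproject.portfolio_analysis - INFO - Calculating returns
```

The message is handled by the parent's handler but still carries the module's own name.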
6.3 Logging Best Practices
Here are key practices to follow when implementing logging in your research projects.
Use appropriate levels. Choose log levels thoughtfully:
    # DEBUG: Detailed diagnostic info, usually only for debugging
    logger.debug(f"Processing row {i}: values = {row}")

    # INFO: Key milestones and confirmations
    logger.info(f"Loaded {len(df)} rows from {filename}")

    # WARNING: Unexpected but handled situations
    logger.warning(f"Missing data for {ticker}, using interpolation")

    # ERROR: Something failed, but program can continue
    logger.error(f"Failed to download {ticker}: {e}")

    # CRITICAL: Serious failure, program may need to stop
    logger.critical("Database connection lost, cannot continue")
Include context in messages. Log messages should be self-explanatory:
    # Bad: Not enough context
    logger.info("Processing file")
    logger.warning("Missing values")

    # Good: Clear context
    logger.info(f"Processing file: {filepath}")
    logger.warning(f"Missing values in column '{col}': {count} rows affected")
Don’t log sensitive information. Be careful not to log passwords, API keys, or sensitive data:
    # Bad: Logs the API key
    logger.info(f"Connecting with API key: {api_key}")

    # Good: Masks sensitive information
    logger.info(f"Connecting with API key: {api_key[:4]}...")
Use structured logging for complex data. For data that might be parsed later, consider structured formats:
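One possible approach, sketched here with only the standard library, is a custom formatter that writes each record as a single JSON object. The JsonFormatter class, the data key passed through extra=, and the field names are all illustrative assumptions rather than an established convention:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Hypothetical helper: render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra={"data": {...}} become an attribute on the record
        payload.update(getattr(record, "data", {}))
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("Loaded data", extra={"data": {"rows": 1250, "source": "local"}})

# Each line can later be parsed back into a dictionary
parsed = json.loads(stream.getvalue())
print(parsed["rows"], parsed["source"])  # 1250 local
```

Because every line is valid JSON, the log can be loaded into a DataFrame or queried with standard tools instead of being parsed with regular expressions.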
Configure logging once at entry point. Configure logging at your application’s entry point, not in library modules. Include a timestamp in the log filename so that each run generates a new file:
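A minimal sketch of such an entry-point setup follows. The setup_logging helper, the pipeline_ filename prefix, and the temporary log directory are assumptions made for illustration; in a real project the directory would typically be a fixed logs/ folder:

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path

def setup_logging(log_dir: Path, level: int = logging.INFO) -> Path:
    """Hypothetical helper: configure the root logger once, with a per-run log file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_dir / f"pipeline_{timestamp}.log"
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
        force=True,  # replace any earlier configuration
    )
    return log_file

# Called once, at the start of the main script
log_file = setup_logging(Path(tempfile.mkdtemp()) / "logs")
logging.getLogger(__name__).info("Pipeline started")
print(log_file.name)
```

Because the timestamp is part of the filename, earlier logs are not overwritten as long as runs start at least one second apart.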
Library modules should only create loggers, not configure them:
    # In my_research/analysis.py
    import logging

    logger = logging.getLogger(__name__)  # Just create the logger

    def run_pipeline():
        logger.info("Starting pipeline")
        # ...
6.4 Configuration Management with Hydra
As research projects grow, managing configuration becomes a challenge. You might have:
Different data sources (local files, APIs, databases)
Multiple model specifications to compare
Various output formats and destinations
Development vs. production settings
Hardcoding these in Python leads to messy code and makes it hard to reproduce specific runs. YAML configuration files help, but you end up writing boilerplate code to load and validate them.
Hydra is a framework developed by Facebook Research (Yadan 2019) that elegantly solves these problems. It provides:
Hierarchical configuration: Compose configs from multiple sources
Command-line overrides: Change any parameter without editing files
Automatic working directories: Each run gets its own output directory
Multi-run support: Sweep over parameter combinations
Because each run's output directory also records the configuration that produced it, every run is reproducible: you can see exactly what configuration was used.
6.4.5 Using Hydra for Research Pipelines
Here’s a more complete example for an empirical finance pipeline:
    # run_analysis.py
    import logging
    from pathlib import Path

    import hydra
    from omegaconf import DictConfig, OmegaConf
    import pandas as pd

    logger = logging.getLogger(__name__)


    def load_data(cfg: DictConfig) -> pd.DataFrame:
        """Load data according to configuration."""
        logger.info(f"Loading data from {cfg.data.source}")

        if cfg.data.source == "local":
            df = pd.read_parquet(cfg.data.path)
        elif cfg.data.source == "wrds":
            # WRDS loading logic goes here
            raise NotImplementedError("WRDS loading is not implemented in this example")
        else:
            raise ValueError(f"Unknown data source: {cfg.data.source}")

        # Apply date filters
        df = df[(df['date'] >= cfg.data.start_date) & (df['date'] <= cfg.data.end_date)]

        logger.info(f"Loaded {len(df)} observations")
        return df


    def run_model(df: pd.DataFrame, cfg: DictConfig) -> dict:
        """Run the specified model."""
        logger.info(f"Running {cfg.model.name} model")

        # Model logic here
        results = {"coefficients": {}, "stats": {}}
        return results


    def save_results(results: dict, cfg: DictConfig) -> None:
        """Save results according to configuration."""
        output_dir = Path(cfg.output.dir)
        output_dir.mkdir(parents=True, exist_ok=True)

        # Save based on format
        if cfg.output.format == "parquet":
            # Save as parquet
            pass
        elif cfg.output.format == "latex":
            # Generate LaTeX tables
            pass

        logger.info(f"Results saved to {output_dir}")


    @hydra.main(version_base=None, config_path="conf", config_name="config")
    def main(cfg: DictConfig) -> None:
        # Log the full configuration
        logger.info("Configuration:\n" + OmegaConf.to_yaml(cfg))

        # Run pipeline
        df = load_data(cfg)
        results = run_model(df, cfg)
        save_results(results, cfg)

        logger.info("Pipeline completed successfully")


    if __name__ == "__main__":
        main()
6.4.6 Multi-Run for Parameter Sweeps
Hydra can automatically run your code with multiple parameter combinations:
    # Run with multiple date ranges
    python run_analysis.py -m data.start_date=2010-01-01,2015-01-01,2020-01-01

    # Sweep over models
    python run_analysis.py -m model=ols,fama_macbeth,panel
Each combination gets its own output directory with full configuration tracking.
This feature is particularly useful for testing the sensitivity of your analysis to empirical choices. For example, you can run your analysis with multiple winsorization levels, different sample periods, or alternative variable definitions to ensure your results are robust to these choices.
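When several swept parameters are passed in one command, Hydra's default sweeper launches one job per combination of values. The sketch below does not call Hydra. It only enumerates, in plain Python, the grid that a hypothetical command combining the two sweeps above (start dates and models) would cover, so you can see how quickly the number of runs grows:

```python
from itertools import product

# Hypothetical sweep, mirroring comma-separated overrides passed with -m
sweep = {
    "data.start_date": ["2010-01-01", "2015-01-01", "2020-01-01"],
    "model": ["ols", "fama_macbeth", "panel"],
}

# One job per element of the Cartesian product of the value lists
jobs = [dict(zip(sweep, values)) for values in product(*sweep.values())]

for job_number, overrides in enumerate(jobs):
    print(job_number, " ".join(f"{key}={value}" for key, value in overrides.items()))

print(f"{len(jobs)} runs in total")  # 9 runs in total
```

Three start dates and three models already give nine runs, so it is worth checking the size of the grid before launching a sweep over many empirical choices.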
6.4.7 Hydra with Logging
Hydra automatically configures Python’s logging module. Your log messages go to both the console and a file in the output directory:
    import logging

    import hydra
    from omegaconf import DictConfig

    logger = logging.getLogger(__name__)


    @hydra.main(version_base=None, config_path="conf", config_name="config")
    def main(cfg: DictConfig) -> None:
        logger.info("Starting analysis")  # Automatically logged to file
        logger.debug("Debug info")  # Captured only when debug logging is enabled, e.g. hydra.verbose=true
        # Your code here


    if __name__ == "__main__":
        main()
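You can customize logging by adding your own config to the hydra/job_logging config group (for example, conf/hydra/job_logging/custom.yaml, selected in the defaults list) or by overriding hydra.job_logging settings in your main config.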
6.5 Summary
Proper logging and configuration management are essential for reproducible research:
Use Python’s logging module instead of print statements for production code
Choose appropriate log levels to distinguish routine information from warnings and errors
Include context in log messages so they’re meaningful when read later
Configure logging at the entry point, not in library modules
Use Hydra for configuration management in complex research pipelines
Leverage Hydra’s automatic output directories to make every run reproducible
These practices might seem like overhead for small scripts, but they pay dividends as projects grow. When you need to debug a failed run from last month or share code with collaborators, good logging and configuration management make the difference between hours of frustration and quickly finding the answer.