
Library website:
Many databases are available through the WRDS portal
News
Publicly available data:
The first step should always be to identify the research question and hypotheses, then design the tests based on available data.
Data availability is often a binding constraint on research. We will often use data from standard sources (e.g. WRDS), but it can also come from other sources (proprietary data, the web, etc.).
Before doing any analysis, we have to get to know the data:
After loading the data, the first thing to do is look at it:
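A minimal sketch of that first look, assuming the data is already in a pandas DataFrame. The small `df` built here is made-up illustrative data; in practice it would come from a file or database query.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "ticker": ["AAPL", "MSFT", "AAPL"],
        "return": [0.012, 0.008, -0.005],
    }
)

print(df.head())        # first rows
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # column types ("date" is still text at this point)
print(df.describe())    # summary statistics for the numeric columns
print(df.isna().sum())  # number of missing values per column
```

Checking the shape, types, and missing values early tends to reveal problems (dates stored as text, unexpected gaps) before they contaminate later steps.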
The next step is usually data cleaning, or data wrangling.
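As an illustration only, a cleaning pass might look like the sketch below. The data and the specific choices (dropping exact duplicates, dropping rows with a missing return) are assumptions made for the example; the right treatment depends on the research design.

```python
import pandas as pd

# Hypothetical raw data with one duplicated row and one missing return.
raw = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
        "ticker": ["AAPL", "AAPL", "MSFT", "AAPL"],
        "return": [0.012, 0.012, 0.008, None],
    }
)

clean = (
    raw.assign(date=pd.to_datetime(raw["date"]))  # parse text dates
    .drop_duplicates()                             # remove exact duplicate rows
    .dropna(subset=["return"])                     # drop rows with no return
    .sort_values(["ticker", "date"])
    .reset_index(drop=True)
)
print(clean)
```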
Once the dataset is ready, we can proceed with the analysis and report the results using tables and figures.


See my video on Data Wrangler for a demo.
Wide format
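One identifier runs down the rows and the other across the columns. In the example below, each date is a row and each ticker is a column.

| date | AAPL | MSFT | GOOGL |
|---|---|---|---|
| 2024-01-01 | 0.012 | 0.008 | 0.015 |
| 2024-01-02 | -0.005 | 0.003 | 0.002 |

Use for: Matrix operations, correlations, cross-sectional stats.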
Long format (tidy)
Each observation is a separate row.

| date | ticker | return |
|---|---|---|
| 2024-01-01 | AAPL | 0.012 |
| 2024-01-01 | MSFT | 0.008 |
| 2024-01-02 | AAPL | -0.005 |

Use for: Regressions, grouped operations, filtering.
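To make the two "Use for" lines concrete, here is a small sketch using the example values above. The variable names (`wide`, `long`, `corr`, `avg`, `aapl`) are chosen for illustration.

```python
import pandas as pd

# Wide example: one row per date, one column per ticker.
wide = pd.DataFrame(
    {"AAPL": [0.012, -0.005], "MSFT": [0.008, 0.003], "GOOGL": [0.015, 0.002]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
)
wide.index.name = "date"

# Wide format: a full correlation matrix in one call.
# (With only two dates the correlations are trivially +1 or -1.)
corr = wide.corr()

# Long example: one row per (date, ticker) observation.
long = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
        "ticker": ["AAPL", "MSFT", "AAPL"],
        "return": [0.012, 0.008, -0.005],
    }
)

# Long format: grouped operations and filtering are one-liners.
avg = long.groupby("ticker")["return"].mean()
aapl = long[long["ticker"] == "AAPL"]

print(corr)
print(avg)
```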
| Operation | Direction | pandas function |
|---|---|---|
| Pivot | Long → Wide | df.pivot() or df.pivot_table() |
| Melt | Wide → Long | df.melt() |
| Unstack | Index → Columns | df.unstack() |
| Stack | Columns → Index | df.stack() |
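Use pivot_table() instead of pivot() when you have duplicate keys (it requires an aggregation function). Some other DataFrame libraries (e.g. Polars) name the pair df.pivot() and df.unpivot(); pandas has no df.unpivot(), and the inverse of df.pivot() is df.melt().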
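The reshaping operations in the table can be sketched on the example data as follows. The duplicated row is added artificially to show why pivot_table() is needed, and the choice of the mean as the aggregation is an assumption for illustration.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02"],
        "AAPL": [0.012, -0.005],
        "MSFT": [0.008, 0.003],
        "GOOGL": [0.015, 0.002],
    }
)

# Melt: wide -> long (one row per date/ticker pair).
long = wide.melt(id_vars="date", var_name="ticker", value_name="return")

# Pivot: long -> wide (requires unique date/ticker pairs).
wide_again = long.pivot(index="date", columns="ticker", values="return")

# Unstack: move the "ticker" index level into the columns.
via_unstack = long.set_index(["date", "ticker"])["return"].unstack("ticker")

# With a duplicated key, pivot() raises and pivot_table() aggregates.
dup = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="date", columns="ticker", values="return")
except ValueError as err:
    print("pivot failed:", err)

wide_mean = dup.pivot_table(
    index="date", columns="ticker", values="return", aggfunc="mean"
)
print(wide_mean)
```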

Consider a function to estimate the derivative of another function:
Suppose we have a function that computes the square of a number:
You can now pass square as an argument to deriv:
Note: $\frac{\partial x^2}{\partial x} = 2x$
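The three steps above can be sketched as follows. The original definition of deriv is not shown here, so this version is an assumption: it uses a central finite difference, and the step-size parameter `h` is a hypothetical name introduced for the example.

```python
def deriv(f, x, h=1e-5):
    """Estimate f'(x) with a central finite difference (assumed method)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    """Return the square of x."""
    return x ** 2


# Functions are ordinary objects, so square can be passed like any argument.
print(deriv(square, 3.0))  # close to 6, since the derivative of x^2 is 2x
```

Note that deriv never needs to know which function it receives; any callable of one numeric argument works.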
For a very simple function, a full def can be overkill. A common case is the apply() function in pandas, which applies a function to every element of a Series (a column).
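A minimal sketch of this point, assuming the short alternative to def is Python's lambda expression (an assumption, since the original example is not shown). The Series `returns` and the helper `to_percent` are hypothetical.

```python
import pandas as pd

returns = pd.Series([0.012, 0.008, -0.005], name="return")


# Version 1: a named function defined with def.
def to_percent(r):
    return r * 100


pct_def = returns.apply(to_percent)

# Version 2: the same logic inline, as an anonymous function.
pct_lambda = returns.apply(lambda r: r * 100)

print(pct_lambda)
```

For plain arithmetic like this, the vectorized form `returns * 100` is usually simpler and faster; apply() is most useful when the per-element logic has no vectorized equivalent.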
MATH60230