MATH60230: Lecture 10

Text Analysis and Natural Language Processing

Vincent Grégoire

vincent.gregoire@hec.ca

HEC Montréal

Saad Ali Khan

saad-ali.khan@hec.ca

HEC Montréal

Plan for Today

Introduction to text analysis in finance
Text preprocessing fundamentals
Text representation: BoW, TF-IDF, dictionaries
Word and document embeddings
Transformers and LLMs
Practical considerations for research

What is Text Analysis?

Text analysis (text mining, NLP) is the process of extracting structured information from unstructured text

Turning words into numbers that can be used for:
- Statistical analysis
- Machine learning
- Information retrieval

Natural Language Processing (NLP) provides the tools and techniques

Why Text Analysis in Finance?

Vast amounts of valuable information exist in text form:
- SEC filings (10-K, 10-Q, 8-K)
- Earnings call transcripts
- News articles and press releases
- Analyst reports
- Social media (Twitter/X, Reddit, StockTwits)
- Central bank communications

Traditional structured datasets (CRSP, Compustat) are well-studied
Text offers new sources of data to answer new research questions

NLP Applications in Finance

Application	Description	Example
Sentiment Analysis	Measure tone/mood of text	Is this earnings call positive or negative?
Information Extraction	Extract specific facts	What is the reported revenue?
Named Entity Recognition	Identify entities	Which companies are mentioned?
Document Classification	Categorize documents	What topics does this 10-K discuss?
Event Detection	Identify events	Is this news about an M&A?
Summarization	Condense text	Summarize this 100-page filing

Sentiment Analysis in Finance

Quantify the tone of financial communications

Applications:
- Predicting stock returns from news sentiment
- Measuring management confidence in earnings calls
- Detecting changes in Fed communication tone
- Aggregating social media sentiment

Key insight: sentiment in financial text differs from general sentiment
- “Liability” is negative in general English, neutral in finance

Named Entity Recognition (NER)

Identify and classify named entities in text:
- Organizations: Apple Inc., Federal Reserve
- People: Warren Buffett, Jerome Powell
- Locations: New York, Silicon Valley
- Financial terms: revenue, EBITDA, market cap
- Dates and amounts: Q3 2024, $5.2 billion

Finance-specific NER models exist (e.g., FinBERT-NER)

Use case: building knowledge graphs of company relationships

The Text Analysis Pipeline

flowchart LR
    A[Raw Text] --> B[Preprocessing]
    B --> C[Tokenization]
    C --> D[Normalization]
    D --> E[Representation]
    E --> F[Analysis/ML]

    style A fill:#e1f5fe
    style F fill:#c8e6c9

Preprocessing: Clean the raw text
Tokenization: Split into words/subwords
Normalization: Standardize tokens
Representation: Convert to numerical features
Analysis: Apply statistical/ML methods
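
The five stages can be sketched end to end with only the standard library. This is a minimal illustration, not a production pipeline: the regular expressions, the tiny stop list, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini stop list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "in", "was", "is"}

def preprocess(raw: str) -> str:
    """Preprocessing: strip HTML-like tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()

def tokenize(text: str) -> list[str]:
    """Tokenization: keep runs of letters/digits (punctuation is dropped)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(tokens: list[str]) -> list[str]:
    """Normalization: lowercase and drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def represent(tokens: list[str]) -> Counter:
    """Representation: bag-of-words counts."""
    return Counter(tokens)

raw = "<p>The revenue of the firm was strong.  Revenue growth was strong.</p>"
features = represent(normalize(tokenize(preprocess(raw))))
print(features)
```

The resulting counts are what the Analysis stage would consume, for example as one row of a document-term matrix.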

Tokenization

Tokenization = splitting text into individual units (tokens)

text = "Apple's Q3 revenue was $81.8 billion."

# Simple whitespace tokenization
tokens = text.split()
# ["Apple's", 'Q3', 'revenue', 'was', '$81.8', 'billion.']

# Better: use a tokenizer
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# ['Apple', "'s", 'Q3', 'revenue', 'was', '$', '81.8', 'billion', '.']

Challenges: contractions, hyphenated words, numbers, punctuation
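
To make these challenges concrete, here is a small regex tokenizer sketch. The pattern is an assumption chosen for this example and is not NLTK's algorithm: it keeps hyphenated terms such as "10-K" whole, keeps dollar amounts and percentages together, and splits off clitics and other punctuation.

```python
import re

# Order matters: the most specific alternatives are tried first.
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?:-\w+)+            # hyphenated compounds such as 10-K
    | \$?\d+(?:\.\d+)?%?    # numbers, optionally with $ prefix or % suffix
    | \w+                   # plain words
    | '\w+                  # clitics such as 's
    | [^\w\s]               # any other single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text: str) -> list[str]:
    """Return all tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Apple's Q3 revenue was $81.8 billion."))
print(regex_tokenize("The 10-K shows 5% growth"))
```

Whether "$81.8" should be one token or two depends on the downstream task; the point is that the choice is explicit in the pattern.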

Stop Words

Stop words = common words that carry little meaning

Examples: “the”, “is”, “at”, “which”, “on”, “a”, “an”

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['the', 'company', 'reported', 'strong', 'revenue', 'growth']
filtered = [w for w in tokens if w.lower() not in stop_words]
# ['company', 'reported', 'strong', 'revenue', 'growth']

Removing stop words reduces noise and dimensionality
But: sometimes context words matter (e.g., NLTK's list includes “not”, so “not profitable” becomes “profitable”)
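
One possible workaround is to exempt negation words from removal. The sketch below uses a hypothetical mini stop list and a hypothetical negation list; with NLTK, the same idea would apply to the set returned by stopwords.words('english').

```python
# Hypothetical mini lists for illustration; real stop lists are much longer
STOP_WORDS = {"the", "is", "a", "an", "not", "no", "was"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=True):
    """Drop stop words, optionally keeping negation words."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_WORDS and not (keep_negations and low in NEGATIONS):
            continue
        kept.append(tok)
    return kept

tokens = ["the", "company", "is", "not", "profitable"]
print(remove_stop_words(tokens, keep_negations=False))  # negation is lost
print(remove_stop_words(tokens, keep_negations=True))   # negation survives
```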

Stemming and Lemmatization

Reduce words to their root form to group related words:

Stemming

Crude rules-based truncation
Fast but imprecise
“running” → “run”
“studies” → “studi”

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem("profitable")
# 'profit'

Lemmatization

Uses vocabulary and morphology
Slower but more accurate
“running” → “run”
“studies” → “study”

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("profitable", pos='a')
# 'profitable'

N-grams

N-grams = contiguous sequences of n tokens

Unigrams (n=1): individual words
Bigrams (n=2): pairs of consecutive words
Trigrams (n=3): triplets of consecutive words

from nltk import ngrams

text = ["interest", "rate", "hike", "expected"]
list(ngrams(text, 2))
# [('interest', 'rate'), ('rate', 'hike'), ('hike', 'expected')]

Captures phrases: “interest rate” vs. “interest” + “rate”
Trade-off: more context vs. data sparsity
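
For intuition, n-gram extraction is just a sliding window. The function below is a plain-Python sketch (the name make_ngrams is made up for this example) that also shows how the number of windows shrinks as n grows.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["interest", "rate", "hike", "expected"]
for n in (1, 2, 3):
    grams = make_ngrams(tokens, n)
    print(n, len(grams), grams)
```

In a large corpus the number of distinct n-grams grows quickly with n, so each one is observed less often. That is the sparsity side of the trade-off.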

Bag of Words (BoW)

The simplest text representation: count word occurrences

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "revenue increased significantly",
    "profit margins decreased",
    "revenue and profit both increased"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# ['and', 'both', 'decreased', 'increased', 'margins',
#  'profit', 'revenue', 'significantly']

Result: document-term matrix (documents × words)
Ignores word order (“bag” of words)

TF-IDF

Term Frequency-Inverse Document Frequency

TF (Term Frequency): how often a word appears in a document
IDF (Inverse Document Frequency): how rare a word is across all documents

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)

Words that appear frequently in one document but rarely overall get high scores
Common words across all documents get low scores
Better than raw counts for identifying distinctive terms
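
The formula can be checked by hand on a toy corpus. This sketch assumes TF is the raw count and uses the natural logarithm. Note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "revenue increased significantly",
    "profit margins decreased",
    "revenue and profit both increased",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# DF(t): number of documents that contain term t at least once
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), with TF as a raw count.

    Assumes the term occurs somewhere in the corpus (DF > 0).
    """
    tf = doc_tokens.count(term)
    return tf * math.log(N / df[term])

for term in ["revenue", "significantly"]:
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

"significantly" appears in only one document, so it scores higher in the first document than "revenue", which appears in two.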

TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The company reported strong revenue growth in Q3",
    "Revenue declined due to market conditions",
    "Strong earnings beat analyst expectations"
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# X is now a sparse matrix of TF-IDF features
print(X.shape)  # (3 documents, n features)

TfidfVectorizer combines tokenization, stop word removal, and TF-IDF
Can specify ngram_range=(1, 2) to include bigrams

Dictionary-Based Methods

Use predefined word lists to measure specific concepts:

Create/obtain a dictionary of words for each category
Count occurrences of dictionary words in text
Compute scores (e.g., % positive words, net sentiment)

Simple, interpretable, and transparent
No training required
But: context-independent (same word = same meaning)
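
The three steps above can be sketched in a few lines. The word lists here are hypothetical toy lists invented for the example, not an established dictionary.

```python
import re

# Hypothetical toy word lists, for illustration only
DICTIONARY = {
    "positive": {"gain", "improve", "strong"},
    "negative": {"loss", "decline", "adverse"},
}

def category_shares(text):
    """Share of tokens that fall in each dictionary category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    return {
        cat: (sum(tok in words for tok in tokens) / total if total else 0.0)
        for cat, words in DICTIONARY.items()
    }

shares = category_shares("Strong demand offset the loss; margins improve.")
net_sentiment = shares["positive"] - shares["negative"]
print(shares, net_sentiment)
```

Because matching is context-independent, a phrase like "no loss" would still count toward the negative share.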

Loughran-McDonald Dictionary

The standard sentiment dictionary for financial text (Loughran and McDonald 2011)

Developed specifically for 10-K filings
Categories: Negative, Positive, Uncertainty, Litigious, Constraining, Strong Modal, Weak Modal

Category	Example Words
Negative	loss, decline, adverse, deficit, litigation
Positive	achieve, benefit, gain, improve, profitable
Uncertainty	approximate, contingent, possible, risk
Litigious	attorney, court, lawsuit, legal, tribunal

Available at: sraf.nd.edu/loughranmcdonald-master-dictionary

Using Loughran-McDonald

import pandas as pd

# Load the dictionary
lm_dict = pd.read_csv('LoughranMcDonald_MasterDictionary.csv')

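# Get negative and positive words
# (a category column is > 0 when the word belongs to that category)
negative_words = set(
    lm_dict[lm_dict['Negative'] > 0]['Word'].str.lower()
)
positive_words = set(
    lm_dict[lm_dict['Positive'] > 0]['Word'].str.lower()
)

# Count negative words in text
def count_negative(text):
    words = text.lower().split()
    return sum(1 for w in words if w in negative_words)

# Compute net sentiment score: (positive - negative) / total words
def sentiment_score(text):
    words = text.lower().split()
    if not words:
        return 0.0  # avoid division by zero on empty text
    n_neg = sum(1 for w in words if w in negative_words)
    n_pos = sum(1 for w in words if w in positive_words)
    return (n_pos - n_neg) / len(words)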

Limitations of BoW/TF-IDF

No semantic meaning: “good” and “excellent” are unrelated vectors

High dimensionality: vocabulary can be 10,000+ words

Sparse representations: most entries are zero

No context: “bank” (financial) vs. “bank” (river) are the same

Word order lost: “dog bites man” = “man bites dog”

Solution: word embeddings — dense, semantic representations
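
Two of these limitations are easy to verify numerically. The sketch below builds bag-of-words count vectors with the standard library and compares them with cosine similarity; the helper names are made up for this example.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Word order is lost: both sentences map to the same vector
print(bow("dog bites man") == bow("man bites dog"))

# No semantic meaning: synonyms share no dimension, so similarity is zero
print(cosine(bow("good"), bow("excellent")))
print(cosine(bow("good results"), bow("excellent results")))
```

The last similarity is positive only because of the shared word "results", not because "good" and "excellent" are related.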

Word Embeddings

Map words to dense vectors where similar words are close together

%%{init: {'theme': 'base', 'themeVariables': {'clusterBkg': '#f5f5f5', 'clusterBorder': '#cccccc'}}}%%
flowchart LR
    subgraph "Sparse (BoW)"
        A["king: [0,0,1,0,0,...]<br/>queen: [0,1,0,0,0,...]<br/>man: [1,0,0,0,0,...]"]
    end
    subgraph "Dense (Embedding)"
        B["king: [0.2, 0.8, -0.1, ...]<br/>queen: [0.3, 0.7, -0.2, ...]<br/>man: [0.1, 0.6, 0.4, ...]"]
    end
    A --> B

    style A fill:#ffcdd2
    style B fill:#c8e6c9

Typically 100-300 dimensions (vs. 10,000+ for BoW)
Learned from large text corpora
Capture semantic relationships

Word2Vec

The foundational word embedding method (Mikolov et al. 2013)

Key idea: “You shall know a word by the company it keeps”

Train a neural network to predict:
- CBOW: predict word from context
- Skip-gram: predict context from word
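
To see what the network is actually trained on, the sketch below builds the training examples for both objectives from one sentence. It is an illustration only: real Word2Vec implementations add details such as subsampling and negative sampling that are not shown, and the function names are made up.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: the reverse prediction task."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["fed", "raises", "interest", "rates"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1)[1])
```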

Famous example: \text{king} - \text{man} + \text{woman} \approx \text{queen}

Pre-trained models available (Google News, GloVe, FastText)

Using Pre-trained Embeddings

import gensim.downloader as api

# Load pre-trained Word2Vec embeddings
model = api.load('word2vec-google-news-300')

# Find similar words
model.most_similar('profit')
# [('profits', 0.82), ('earnings', 0.71), ('revenue', 0.65), ...]

# Word arithmetic
model.most_similar(positive=['ceo', 'woman'], negative=['man'])
# [('chairwoman', 0.71), ('executive', 0.69), ...]

# Get embedding vector
vector = model['stock']  # 300-dimensional vector

GloVe, FastText are alternatives with similar usage

Document Embeddings

How to represent an entire document as a vector?

Simple approach: average word embeddings

import numpy as np

def document_embedding(text, model):
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(model.vector_size)

Works reasonably well for short texts
Loses word order and importance weighting
Better methods: Doc2Vec, Sentence-BERT, LLM embeddings

The Context Problem

Traditional embeddings give one vector per word:

“The bank raised interest rates”
“I sat by the river bank”

Same vector for “bank” in both sentences!

Solution: contextual embeddings from Transformers

Different vector for the same word depending on context
Revolutionized NLP starting with BERT (2018)

The Transformer Architecture

The foundational architecture for modern NLP (Vaswani et al. 2017)

Self-attention: each word attends to all other words
Captures long-range dependencies
Parallelizable (unlike RNNs)

Attention Mechanism (Intuition)

For each word, compute attention weights to all other words:

“The company reported that its revenue increased”

“its” attends strongly to “company” (learns the reference)
Each word’s representation is a weighted sum of all words

Allows model to understand:
- Coreference (“its” refers to “company”)
- Long-distance dependencies
- Word sense disambiguation
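
The weighted-sum idea can be sketched with scaled dot-product attention for a single query. The 2-dimensional vectors below are made up so that "its" points in a similar direction to "company"; in a real Transformer the queries, keys, and values are learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Made-up toy vectors, for illustration only
words = ["company", "reported", "its"]
vectors = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]]

weights, output = attend(vectors[2], vectors, vectors)
for word, w in zip(words, weights):
    print(f"{word}: {w:.2f}")
```

With these toy vectors the largest weight for "its" falls on "company", and the output is the corresponding weighted sum of all three vectors.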

BERT: Bidirectional Encoder

Bidirectional Encoder Representations from Transformers (Devlin et al. 2019)

Pre-trained on massive text corpus (Wikipedia + Books)
Two pre-training objectives:
- Masked Language Model: predict masked words
- Next Sentence Prediction: are two sentences consecutive?

Fine-tune on downstream tasks:
- Sentiment classification
- Named entity recognition
- Question answering

See Hugging Face Transformers

Finance-Specific Transformers

General BERT may not understand financial language well

FinBERT variants:

ProsusAI/finbert (Araci 2019): fine-tuned for financial sentiment
yiyanghkust/finbert-tone: fine-tuned on analyst reports

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="ProsusAI/finbert")

classifier("Revenue growth exceeded expectations")
# [{'label': 'positive', 'score': 0.94}]

Using Transformers for Embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The company reported strong earnings",
    "Profits exceeded analyst expectations",
    "The weather was sunny today"
]

embeddings = model.encode(sentences)
# embeddings.shape: (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embeddings)
# sentences 0 and 1 will have high similarity

Use for semantic search, clustering, classification

Large Language Models (LLMs)

LLMs = very large Transformer models trained on vast text corpora

GPT-4, Claude, Llama, Gemini, Mistral…
Billions of parameters
Trained on internet-scale data

Key capability: can perform tasks from instructions (prompts)

No task-specific training needed
“Zero-shot” and “few-shot” learning
Can follow complex instructions

LLM Capabilities for Finance

Task	Example Prompt
Sentiment	“Is this earnings call positive, negative, or neutral?”
Classification	“What risk factors are discussed in this 10-K section?”
Extraction	“Extract the reported revenue and net income from this text”
Summarization	“Summarize this 10-K in 3 bullet points”
Q&A	“Based on this filing, what are the main business risks?”
NER	“List all company names mentioned in this article”

No training required — just write good prompts
Can handle nuance and context better than rule-based methods

Using the OpenAI API

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a financial analyst."},
        {"role": "user", "content": """
            Classify the sentiment of this text as
            positive, negative, or neutral:

            "Despite challenging market conditions,
            the company achieved record revenue growth."
        """}
    ]
)

print(response.choices[0].message.content)
# "positive"

Prompt Engineering

The art of writing effective prompts:

Be specific: clearly define the task and expected output format
Provide context: explain the domain (finance, accounting, etc.)
Give examples: few-shot learning improves accuracy
Define constraints: “respond only with: positive, negative, neutral”

prompt = """
You are a financial sentiment analyst. Classify the following
text as exactly one of: POSITIVE, NEGATIVE, NEUTRAL.

Example: "Revenue increased 15%" -> POSITIVE
Example: "We face significant headwinds" -> NEGATIVE

Text: "{text}"
Classification:
"""

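In practice a template like this is filled in per document, and the reply is mapped back to one of the allowed labels. The helpers below are a minimal sketch with made-up names (build_prompt, parse_label). No API call is made; the replies passed to parse_label are invented.

```python
PROMPT_TEMPLATE = """You are a financial sentiment analyst. Classify the following
text as exactly one of: POSITIVE, NEGATIVE, NEUTRAL.

Example: "Revenue increased 15%" -> POSITIVE
Example: "We face significant headwinds" -> NEGATIVE

Text: "{text}"
Classification:"""

VALID_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}

def build_prompt(text):
    """Insert one document into the template."""
    return PROMPT_TEMPLATE.format(text=text.strip())

def parse_label(raw_response):
    """Map a raw model reply to a valid label, or None if it does not match."""
    cleaned = raw_response.strip().strip(".").upper()
    return cleaned if cleaned in VALID_LABELS else None

prompt = build_prompt("Margins compressed due to input costs.")
print(prompt)
print(parse_label(" Negative.\n"))
print(parse_label("It depends on the context"))
```

Returning None for anything outside the allowed labels makes off-format replies easy to count and re-run.
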
Local LLMs

Run LLMs locally for privacy and cost control:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{
        'role': 'user',
        'content': 'Summarize this earnings report in one sentence: ...'
    }]
)

print(response['message']['content'])

Ollama: command-line tool for local LLM deployment
LM Studio: GUI application for running local LLMs
Models: Llama 3, Mistral, Phi, Gemma, Qwen, etc.
Trade-off: smaller models = less capable but free and private

Research Challenge: Look-Ahead Bias

Look-ahead bias = using information not available at the time

In NLP research, this can occur through:

Model training data: GPT-4 was trained on data up to a cutoff date
- Using it to “predict” events before that date is cheating

Model release dates: GPT-4 did not exist in 2020, so no one could have acted on its output at that time
- Document which model version you used

Text availability: When was the document actually public?
- 10-K filing date vs. fiscal year end
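
Two of these checks are simple date comparisons. The sketch below uses hypothetical filing records and an assumed training cutoff; the actual cutoff must come from the model's documentation.

```python
from datetime import date

# Hypothetical records: fiscal year end vs. the date the 10-K became public
filings = [
    {"firm": "A", "fiscal_year_end": date(2022, 12, 31), "filing_date": date(2023, 2, 24)},
    {"firm": "B", "fiscal_year_end": date(2022, 12, 31), "filing_date": date(2023, 3, 15)},
]

def available_filings(filings, as_of):
    """Keep only filings that were public on or before `as_of`."""
    return [f for f in filings if f["filing_date"] <= as_of]

def model_is_clean(training_cutoff, sample_start):
    """True if the whole test sample lies after the model's training cutoff."""
    return sample_start > training_cutoff

print([f["firm"] for f in available_filings(filings, date(2023, 3, 1))])

# Assumed cutoff, for illustration only
print(model_is_clean(training_cutoff=date(2023, 4, 30), sample_start=date(2020, 1, 1)))
```

Filtering on the fiscal year end instead of the filing date would wrongly treat both filings as known on 2022-12-31.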

Research Challenge: Replicability

LLM-based research faces unique replicability challenges:

Model versioning: APIs update silently
- Document exact model name: gpt-4-0613 not just gpt-4

API deprecation: models get retired
- Consider local models for long-term reproducibility

Stochastic outputs: same prompt → different answers
- Set temperature=0 for near-deterministic outputs (hosted APIs may still vary slightly between runs)
- Save all raw outputs
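
One way to act on these points is to write a log record for every call. The record layout below is an assumption made for this example, not a standard; the field names are made up.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_log_record(model, prompt, raw_output, params):
    """Bundle everything needed to audit or re-run one LLM call."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model": model,            # exact snapshot name, not an alias
        "params": params,          # e.g. temperature
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "raw_output": raw_output,  # stored before any parsing
    }

record = make_log_record(
    model="gpt-4-0613",
    prompt="Classify the sentiment: ...",
    raw_output="positive",
    params={"temperature": 0},
)
line = json.dumps(record)  # one JSON object per line (JSONL) appends cleanly
print(line)
```

The prompt hash makes it cheap to check later whether two runs used exactly the same prompt text.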

Temperature and Sampling

Temperature controls randomness in LLM outputs:

temperature=0: greedy decoding, always pick the most likely token (near-deterministic in practice)
temperature=1: sample according to model probabilities
temperature>1: more random/creative

response = client.chat.completions.create(
    model="gpt-5-mini",
    temperature=0,  # Reproducible outputs; some reasoning models accept only the default value, check the model docs
    messages=[...]
)

For research: use temperature=0 for reproducibility, but note that most models perform best at higher temperatures (typically 0.7-1.0)
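
Mechanically, temperature divides the model's logits before the softmax. The sketch below applies this to made-up logits for three candidate tokens; temperature=0 is treated as the limiting case (argmax) because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate next tokens
logits = [2.0, 1.0, 0.1]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature falls, almost all probability mass moves to the top token; as it rises, the distribution flattens.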

Structured Output

LLMs output text, but we often need structured data:

response = client.chat.completions.create(
    model="gpt-5-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": """Extract financial data as JSON:
        {"revenue": number, "net_income": number, "sentiment": string}

        Text: Revenue was $5.2B, net income $800M.
        Management expressed cautious optimism."""
    }]
)
# {"revenue": 5200000000, "net_income": 800000000,
#  "sentiment": "cautiously positive"}

Use JSON mode for reliable parsing
Define clear schemas in prompts
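
Even with JSON mode, it is worth checking the reply against the schema before using it. The validator below is a minimal sketch; the function name and the type table are assumptions for this example, and the reply string is hard-coded rather than fetched from an API.

```python
import json

# Expected keys and types, mirroring the schema described in the prompt
EXPECTED_TYPES = {"revenue": (int, float), "net_income": (int, float), "sentiment": str}

def parse_extraction(raw):
    """Parse a JSON reply and check keys and types; raise ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            raise ValueError(f"wrong type for {key}")
    return data

raw_reply = '{"revenue": 5200000000, "net_income": 800000000, "sentiment": "cautiously positive"}'
print(parse_extraction(raw_reply)["revenue"])
```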

Structured Output with Pydantic

from pydantic import BaseModel
from openai import OpenAI

class FinancialExtraction(BaseModel):
    revenue: float | None
    net_income: float | None
    sentiment: str
    confidence: float

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Extract financial information."},
        {"role": "user", "content": "Revenue grew 15% to $5.2B..."}
    ],
    response_format=FinancialExtraction,
)

result = completion.choices[0].message.parsed
print(result.revenue)  # 5200000000.0

LLM Workflow for Research

flowchart LR
    A[Documents] --> B[Prompt Template]
    B --> C[LLM API]
    C --> D[Parse Response]
    D --> E[Validate]
    E --> F[Store Results]
    F --> G[Analysis]

    E -->|Invalid| B

    style A fill:#e1f5fe
    style G fill:#c8e6c9

Batch processing: process documents in parallel
Error handling: API failures, rate limits, invalid responses
Validation: check outputs match expected format
Logging: save prompts, responses, and metadata
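
The validate-and-retry loop in the diagram can be sketched without any API access. Here fake_llm is a hypothetical stand-in that returns an invalid reply on the first attempt; a real version would call the API and add backoff for rate limits.

```python
import json

def fake_llm(prompt, attempt):
    """Hypothetical stand-in for an API call: fails once, then returns JSON."""
    if attempt == 0:
        return "Sure! The sentiment is positive."  # invalid: not JSON
    return '{"sentiment": "positive"}'

def validate(raw):
    """Return the parsed dict if it matches the expected format, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("sentiment") in {"positive", "negative", "neutral"}:
        return data
    return None

def process(document, llm=fake_llm, max_attempts=3):
    """Prompt -> call -> parse -> validate, retrying on invalid output."""
    prompt = f"Return JSON with key 'sentiment' for: {document}"
    log = []
    for attempt in range(max_attempts):
        raw = llm(prompt, attempt)
        log.append({"attempt": attempt, "raw": raw})  # keep every raw response
        result = validate(raw)
        if result is not None:
            return result, log
    return None, log

result, log = process("Revenue grew 15%.")
print(result, len(log))
```

Capping the number of attempts keeps a persistently failing document from looping forever and from running up API costs.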

Cost Considerations

LLM APIs charge per token (≈ 0.75 words):

Model	Input	Output
GPT-5.2 (OpenAI)	$1.75/1M tokens	$14/1M tokens
GPT-5 mini (OpenAI)	$0.25/1M tokens	$2/1M tokens
Claude Sonnet 4.5 (Anthropic)	$3/1M tokens	$15/1M tokens
Claude Haiku 4.5 (Anthropic)	$1/1M tokens	$5/1M tokens

10,000 documents × 1,000 tokens each = 10M input tokens
With GPT-5 mini: ~$2.50 for input alone; ~$22.50 if the model also generates 10M output tokens
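
Because input and output tokens are priced differently, a small calculator helps when budgeting. The prices below are copied from the table above and will change over time; the dictionary keys and the function name are made up for this sketch.

```python
# USD per 1M tokens, copied from the table above; check current pricing before use
PRICES = {
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
}

def estimate_cost(model, n_docs, input_tokens_per_doc, output_tokens_per_doc):
    """Estimated USD cost of processing n_docs documents."""
    price = PRICES[model]
    input_cost = n_docs * input_tokens_per_doc / 1e6 * price["input"]
    output_cost = n_docs * output_tokens_per_doc / 1e6 * price["output"]
    return input_cost + output_cost

# Long input, short label as output
print(estimate_cost("gpt-5-mini", 10_000, 1_000, 10))
# Output as long as the input
print(estimate_cost("gpt-5-mini", 10_000, 1_000, 1_000))
```

For classification tasks the output is usually a few tokens, so input length drives the bill; for summarization or extraction the output side can dominate.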

Start with smaller/cheaper models for development
Consider local models for large-scale processing

Python Libraries Summary

Library	Use Case
NLTK	Classic NLP: tokenization, stemming, corpora
spaCy	Industrial NLP: fast, production-ready pipelines
scikit-learn	BoW, TF-IDF, text classification
Gensim	Word2Vec, Doc2Vec, topic modeling
Transformers	BERT, FinBERT, state-of-the-art models
Sentence-Transformers	Document embeddings, semantic search
OpenAI	GPT models via API
Ollama	Local LLM deployment

Best Practices Summary

Start simple: dictionaries and TF-IDF are interpretable baselines

Use domain-specific resources: Loughran-McDonald, FinBERT

Document everything: model versions, prompts, parameters

Set temperature=0: for reproducible research

Validate outputs: LLMs can hallucinate or misformat

Consider costs: batch processing, model selection

Avoid look-ahead bias: respect model training cutoffs

References

Essential reading:

Gentzkow, Kelly, and Taddy (2019) - comprehensive survey of text analysis in economics
Jurafsky and Martin (2025) - free online textbook: web.stanford.edu/~jurafsky/slp3

Books (available via HEC library):

Applied Text Analysis with Python - O’Reilly
Natural Language Processing with Transformers - O’Reilly

Videos:

Neural Networks - 3Blue1Brown series on neural networks

References

Araci, Dogu. 2019. “FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models.” arXiv Preprint arXiv:1908.10063.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–86.

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.

Jurafsky, Dan, and James H. Martin. 2025. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. https://web.stanford.edu/~jurafsky/slp3/.

Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.