MATH60230: Lecture 10

Text Analysis and Natural Language Processing

Vincent Grégoire

HEC Montréal

Saad Ali Khan

HEC Montréal

Plan for Today

  • Introduction to text analysis in finance
  • Text preprocessing fundamentals
  • Text representation: BoW, TF-IDF, dictionaries
  • Word and document embeddings
  • Transformers and LLMs
  • Practical considerations for research

What is Text Analysis?

  • Text analysis (text mining, NLP) is the process of extracting structured information from unstructured text
  • Turning words into numbers that can be used for:
    • Statistical analysis
    • Machine learning
    • Information retrieval
  • Natural Language Processing (NLP) provides the tools and techniques

Why Text Analysis in Finance?

  • Vast amounts of valuable information exist in text form:
    • SEC filings (10-K, 10-Q, 8-K)
    • Earnings call transcripts
    • News articles and press releases
    • Analyst reports
    • Social media (Twitter/X, Reddit, StockTwits)
    • Central bank communications
  • Traditional structured datasets (CRSP, Compustat) are well-studied
  • Text offers new data sources for answering new research questions

NLP Applications in Finance

Application | Description | Example
Sentiment Analysis | Measure tone/mood of text | Is this earnings call positive or negative?
Information Extraction | Extract specific facts | What is the reported revenue?
Named Entity Recognition | Identify entities | Which companies are mentioned?
Document Classification | Categorize documents | What topics does this 10-K discuss?
Event Detection | Identify events | Is this news about an M&A?
Summarization | Condense text | Summarize this 100-page filing

Sentiment Analysis in Finance

  • Quantify the tone of financial communications
  • Applications:
    • Predicting stock returns from news sentiment
    • Measuring management confidence in earnings calls
    • Detecting changes in Fed communication tone
    • Aggregating social media sentiment
  • Key insight: sentiment in financial text differs from general sentiment
    • “Liability” is negative in general English, neutral in finance

Named Entity Recognition (NER)

  • Identify and classify named entities in text:
    • Organizations: Apple Inc., Federal Reserve
    • People: Warren Buffett, Jerome Powell
    • Locations: New York, Silicon Valley
    • Financial terms: revenue, EBITDA, market cap
    • Dates and amounts: Q3 2024, $5.2 billion
  • Finance-specific NER models exist (e.g., FinBERT-NER)
  • Use case: building knowledge graphs of company relationships

The Text Analysis Pipeline

flowchart LR
    A[Raw Text] --> B[Preprocessing]
    B --> C[Tokenization]
    C --> D[Normalization]
    D --> E[Representation]
    E --> F[Analysis/ML]

    style A fill:#e1f5fe
    style F fill:#c8e6c9

  1. Preprocessing: Clean the raw text
  2. Tokenization: Split into words/subwords
  3. Normalization: Standardize tokens
  4. Representation: Convert to numerical features
  5. Analysis: Apply statistical/ML methods
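
The five steps can be sketched end to end with only the standard library. This is a minimal illustration rather than a production pipeline: the regular expressions, the tiny stop list, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (illustration only)
STOP_WORDS = {"the", "a", "an", "was", "is", "in", "of"}

def preprocess(raw: str) -> str:
    """Step 1: strip HTML-like tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()

def tokenize(text: str) -> list:
    """Step 2: split into word tokens, keeping internal '.' or apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*", text)

def normalize(tokens: list) -> list:
    """Step 3: lowercase and drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def represent(tokens: list) -> Counter:
    """Step 4: bag-of-words counts."""
    return Counter(tokens)

raw = "<p>The revenue was  $81.8 billion. Revenue growth was strong.</p>"
features = represent(normalize(tokenize(preprocess(raw))))

# Step 5: any analysis on the numeric features, here just the top terms
print(features.most_common(3))
```

Each function maps to one box in the diagram, so individual steps can be swapped (for example, replacing `tokenize` with a library tokenizer) without touching the rest.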

Tokenization

Tokenization = splitting text into individual units (tokens)

text = "Apple's Q3 revenue was $81.8 billion."

# Simple whitespace tokenization
tokens = text.split()
# ["Apple's", 'Q3', 'revenue', 'was', '$81.8', 'billion.']

# Better: use a tokenizer
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# ['Apple', "'s", 'Q3', 'revenue', 'was', '$', '81.8', 'billion', '.']
  • Challenges: contractions, hyphenated words, numbers, punctuation

Stop Words

Stop words = common words that carry little meaning

  • Examples: “the”, “is”, “at”, “which”, “on”, “a”, “an”
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['the', 'company', 'reported', 'strong', 'revenue', 'growth']
filtered = [w for w in tokens if w.lower() not in stop_words]
# ['company', 'reported', 'strong', 'revenue', 'growth']
  • Removing stop words reduces noise and dimensionality
  • But: stop lists often include negations such as “not”, so removing them can flip the meaning (e.g., “not profitable”)
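
A minimal sketch of the negation problem and one common workaround, keeping negation words. The stop list below is a hand-written assumption standing in for a library list, so that the example runs without downloads.

```python
# Hypothetical mini stop list; like NLTK's English list, it contains "not"
stop_words = {"the", "is", "was", "a", "not", "no"}
negations = {"not", "no", "never"}

tokens = ["the", "segment", "was", "not", "profitable"]

# Plain removal silently drops the negation
plain = [t for t in tokens if t not in stop_words]

# Workaround: remove stop words except negation words
keep_neg = [t for t in tokens if t not in (stop_words - negations)]

print(plain)     # ['segment', 'profitable']
print(keep_neg)  # ['segment', 'not', 'profitable']
```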

Stemming and Lemmatization

Reduce words to their root form to group related words:

Stemming

  • Crude rule-based truncation
  • Fast but imprecise
  • “running” → “run”
  • “studies” → “studi”
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem("profitable")
# 'profit'

Lemmatization

  • Uses vocabulary and morphology
  • Slower but more accurate
  • “running” → “run”
  • “studies” → “study”
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("profitable", pos='a')
# 'profitable'

N-grams

N-grams = contiguous sequences of n tokens

  • Unigrams (n=1): individual words
  • Bigrams (n=2): pairs of consecutive words
  • Trigrams (n=3): triplets of consecutive words
from nltk import ngrams

text = ["interest", "rate", "hike", "expected"]
list(ngrams(text, 2))
# [('interest', 'rate'), ('rate', 'hike'), ('hike', 'expected')]
  • Captures phrases: “interest rate” vs. “interest” + “rate”
  • Trade-off: more context vs. data sparsity

Bag of Words (BoW)

The simplest text representation: count word occurrences

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "revenue increased significantly",
    "profit margins decreased",
    "revenue and profit both increased"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# ['and', 'both', 'decreased', 'increased', 'margins',
#  'profit', 'revenue', 'significantly']
  • Result: document-term matrix (documents × words)
  • Ignores word order (“bag” of words)

TF-IDF

Term Frequency-Inverse Document Frequency

  • TF (Term Frequency): how often a word appears in a document
  • IDF (Inverse Document Frequency): how rare a word is across all documents

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)

  • Words that appear frequently in one document but rarely overall get high scores
  • Common words across all documents get low scores
  • Better than raw counts for identifying distinctive terms
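
The formula can be checked by hand on a three-document toy corpus. This sketch uses raw counts for TF and the natural log, exactly as written above; the helper name `tf_idf` is an assumption. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "revenue increased significantly".split(),
    "profit margins decreased".split(),
    "revenue and profit both increased".split(),
]
N = len(docs)

# DF(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term: str, doc: list) -> float:
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), with raw counts as TF."""
    tf = doc.count(term)
    return tf * math.log(N / df[term])

for term in ["significantly", "revenue"]:
    print(term, round(tf_idf(term, docs[0]), 3))
# significantly 1.099
# revenue 0.405
```

"significantly" appears in only one document and scores higher than "revenue", which appears in two.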

TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The company reported strong revenue growth in Q3",
    "Revenue declined due to market conditions",
    "Strong earnings beat analyst expectations"
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# X is now a sparse matrix of TF-IDF features
print(X.shape)  # (3 documents, n features)
  • TfidfVectorizer combines tokenization, stop word removal, and TF-IDF
  • Can specify ngram_range=(1, 2) to include bigrams

Dictionary-Based Methods

Use predefined word lists to measure specific concepts:

  1. Create/obtain a dictionary of words for each category
  2. Count occurrences of dictionary words in text
  3. Compute scores (e.g., % positive words, net sentiment)
  • Simple, interpretable, and transparent
  • No training required
  • But: context-independent (same word = same meaning)

Loughran-McDonald Dictionary

The standard sentiment dictionary for financial text (Loughran and McDonald 2011)

  • Developed specifically for 10-K filings
  • Categories: Negative, Positive, Uncertainty, Litigious, Constraining, Strong Modal, Weak Modal
Category | Example Words
Negative | loss, decline, adverse, deficit, litigation
Positive | achieve, benefit, gain, improve, profitable
Uncertainty | approximate, contingent, possible, risk
Litigious | attorney, court, lawsuit, legal, tribunal

Available at: sraf.nd.edu/loughranmcdonald-master-dictionary

Using Loughran-McDonald

import pandas as pd

# Load the dictionary
lm_dict = pd.read_csv('LoughranMcDonald_MasterDictionary.csv')

# Build lowercase word sets for each sentiment category
negative_words = set(
    lm_dict[lm_dict['Negative'] > 0]['Word'].str.lower()
)
positive_words = set(
    lm_dict[lm_dict['Positive'] > 0]['Word'].str.lower()
)

# Count negative words in text
def count_negative(text):
    words = text.lower().split()
    return sum(1 for w in words if w in negative_words)

# Compute net sentiment: (positive - negative) / total words
def sentiment_score(text):
    words = text.lower().split()
    if not words:
        return 0.0  # avoid division by zero on empty text
    n_neg = sum(1 for w in words if w in negative_words)
    n_pos = sum(1 for w in words if w in positive_words)
    return (n_pos - n_neg) / len(words)

Limitations of BoW/TF-IDF

  • No semantic meaning: “good” and “excellent” are unrelated vectors
  • High dimensionality: vocabulary can be 10,000+ words
  • Sparse representations: most entries are zero
  • No context: “bank” (financial) vs. “bank” (river) are the same
  • Word order lost: “dog bites man” = “man bites dog”
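
Two of these limitations are easy to verify directly. The sketch below builds bag-of-words vectors by hand over a small, made-up vocabulary and compares them; the helper names are assumptions.

```python
import math
from collections import Counter

def bow(text: str, vocab: list) -> list:
    """Count vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["bites", "dog", "excellent", "good", "man", "results"]

# Word order lost: both sentences map to the same vector
v1 = bow("dog bites man", vocab)
v2 = bow("man bites dog", vocab)
print(v1 == v2)  # True

# No semantic meaning: near-synonyms share no dimension
print(cosine(bow("good", vocab), bow("excellent", vocab)))  # 0.0
```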

Solution: word embeddings — dense, semantic representations

Word Embeddings

Map words to dense vectors where similar words are close together

%%{init: {'theme': 'base', 'themeVariables': {'clusterBkg': '#f5f5f5', 'clusterBorder': '#cccccc'}}}%%
flowchart LR
    subgraph "Sparse (BoW)"
        A["king: [0,0,1,0,0,...]<br/>queen: [0,1,0,0,0,...]<br/>man: [1,0,0,0,0,...]"]
    end
    subgraph "Dense (Embedding)"
        B["king: [0.2, 0.8, -0.1, ...]<br/>queen: [0.3, 0.7, -0.2, ...]<br/>man: [0.1, 0.6, 0.4, ...]"]
    end
    A --> B

    style A fill:#ffcdd2
    style B fill:#c8e6c9

  • Typically 100-300 dimensions (vs. 10,000+ for BoW)
  • Learned from large text corpora
  • Capture semantic relationships

Word2Vec

The foundational word embedding method (Mikolov et al. 2013)

Key idea: “You shall know a word by the company it keeps”

  • Train a neural network to predict:
    • CBOW: predict word from context
    • Skip-gram: predict context from word
  • Famous example: \text{king} - \text{man} + \text{woman} \approx \text{queen}
  • Pre-trained vectors available (Word2Vec trained on Google News; GloVe and FastText are related alternatives)
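
The analogy is just vector arithmetic followed by a nearest-neighbour search under cosine similarity. A toy sketch with hand-picked 3-dimensional vectors (purely hypothetical values, not trained embeddings) makes the mechanics visible:

```python
import math

# Hypothetical toy vectors; dimensions loosely mean (royalty, gender, finance)
vectors = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "stock": [0.0,  0.0, 1.0],
}

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive: list, negative: list) -> str:
    """Add the `positive` vectors, subtract the `negative` ones, return the nearest word."""
    target = [0.0, 0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["king", "woman"], negative=["man"]))  # queen
```

With real embeddings the match is approximate rather than exact, and results depend heavily on the training corpus.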

Using Pre-trained Embeddings

import gensim.downloader as api

# Load pre-trained Word2Vec embeddings
model = api.load('word2vec-google-news-300')

# Find similar words
model.most_similar('profit')
# [('profits', 0.82), ('earnings', 0.71), ('revenue', 0.65), ...]

# Word arithmetic
model.most_similar(positive=['ceo', 'woman'], negative=['man'])
# [('chairwoman', 0.71), ('executive', 0.69), ...]

# Get embedding vector
vector = model['stock']  # 300-dimensional vector
  • GloVe, FastText are alternatives with similar usage

Document Embeddings

How to represent an entire document as a vector?

Simple approach: average word embeddings

import numpy as np

def document_embedding(text, model):
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(model.vector_size)
  • Works reasonably well for short texts
  • Loses word order and importance weighting
  • Better methods: Doc2Vec, Sentence-BERT, LLM embeddings

The Context Problem

Traditional embeddings give one vector per word:

“The bank raised interest rates”
“I sat by the river bank”

Same vector for “bank” in both sentences!

Solution: contextual embeddings from Transformers

  • Different vector for the same word depending on context
  • Revolutionized NLP starting with BERT (2018)

The Transformer Architecture

The foundational architecture for modern NLP (Vaswani et al. 2017)

  • Self-attention: each word attends to all other words
  • Captures long-range dependencies
  • Parallelizable (unlike RNNs)

Attention Mechanism (Intuition)

For each word, compute attention weights to all other words:

“The company reported that its revenue increased”

  • “its” attends strongly to “company” (learns the reference)
  • Each word’s representation is a weighted sum of all words
  • Allows model to understand:
    • Coreference (“its” refers to “company”)
    • Long-distance dependencies
    • Word sense disambiguation
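
The "weighted sum of all words" can be written out as scaled dot-product self-attention in a few lines of NumPy. This is a single-head sketch under simplifying assumptions: the embeddings and projection matrices are random stand-ins, so the weights are not meaningful. A trained model would learn projections under which “its” puts high weight on “company”.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)          # one row of attention weights per token
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 7, 8, 4       # 7 tokens, as in the example sentence
X = rng.normal(size=(n_tokens, d_model))   # random stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)         # (7, 4)
print(weights.sum(axis=1))  # each row sums to 1
```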

BERT: Bidirectional Encoder

Bidirectional Encoder Representations from Transformers (Devlin et al. 2019)

  • Pre-trained on massive text corpus (Wikipedia + Books)
  • Two pre-training objectives:
    • Masked Language Model: predict masked words
    • Next Sentence Prediction: are two sentences consecutive?
  • Fine-tune on downstream tasks:
    • Sentiment classification
    • Named entity recognition
    • Question answering

Finance-Specific Transformers

General BERT may not understand financial language well

FinBERT variants (Araci 2019):

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="ProsusAI/finbert")

classifier("Revenue growth exceeded expectations")
# [{'label': 'positive', 'score': 0.94}]

Using Transformers for Embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The company reported strong earnings",
    "Profits exceeded analyst expectations",
    "The weather was sunny today"
]

embeddings = model.encode(sentences)
# embeddings.shape: (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embeddings)
# sentences 0 and 1 will have high similarity
  • Use for semantic search, clustering, classification

Large Language Models (LLMs)

LLMs = very large Transformer models trained on vast text corpora

  • GPT-4, Claude, Llama, Gemini, Mistral…
  • Billions of parameters
  • Trained on internet-scale data

Key capability: can perform tasks from instructions (prompts)

  • No task-specific training needed
  • “Zero-shot” and “few-shot” learning
  • Can follow complex instructions

LLM Capabilities for Finance

Task | Example Prompt
Sentiment | “Is this earnings call positive, negative, or neutral?”
Classification | “What risk factors are discussed in this 10-K section?”
Extraction | “Extract the reported revenue and net income from this text”
Summarization | “Summarize this 10-K in 3 bullet points”
Q&A | “Based on this filing, what are the main business risks?”
NER | “List all company names mentioned in this article”
  • No training required — just write good prompts
  • Can handle nuance and context better than rule-based methods

Using the OpenAI API

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a financial analyst."},
        {"role": "user", "content": """
            Classify the sentiment of this text as
            positive, negative, or neutral:

            "Despite challenging market conditions,
            the company achieved record revenue growth."
        """}
    ]
)

print(response.choices[0].message.content)
# "positive"

Prompt Engineering

The art of writing effective prompts:

  1. Be specific: clearly define the task and expected output format
  2. Provide context: explain the domain (finance, accounting, etc.)
  3. Give examples: few-shot learning improves accuracy
  4. Define constraints: “respond only with: positive, negative, neutral”
prompt = """
You are a financial sentiment analyst. Classify the following
text as exactly one of: POSITIVE, NEGATIVE, NEUTRAL.

Example: "Revenue increased 15%" -> POSITIVE
Example: "We face significant headwinds" -> NEGATIVE

Text: "{text}"
Classification:
"""

Local LLMs

Run LLMs locally for privacy and cost control:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{
        'role': 'user',
        'content': 'Summarize this earnings report in one sentence: ...'
    }]
)

print(response['message']['content'])
  • Ollama: command-line tool for local LLM deployment
  • LM Studio: GUI application for running local LLMs
  • Models: Llama 3, Mistral, Phi, Gemma, Qwen, etc.
  • Trade-off: smaller models = less capable but free and private

Research Challenge: Look-Ahead Bias

Look-ahead bias = using information not available at the time

In NLP research, this can occur through:

  1. Model training data: GPT-4 was trained on data up to a cutoff date
    • Using it to “predict” events before that date is cheating: the model may already have seen the outcome
  2. Model release dates: a strategy that relies on GPT-4 could not have been implemented in 2020, before the model existed
    • Document which model version you used
  3. Text availability: When was the document actually public?
    • 10-K filing date vs. fiscal year end
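
Two of these checks reduce to simple date comparisons. In the sketch below, the cutoff date and the filings are made-up values for illustration; the real training cutoff should come from the model provider's documentation.

```python
from datetime import date

# Hypothetical training cutoff (assumption, not a documented value)
MODEL_CUTOFF = date(2023, 12, 31)

# Hypothetical filings: fiscal year end vs. the date the 10-K became public
filings = [
    {"firm": "A", "fiscal_year_end": date(2023, 9, 30), "filing_date": date(2023, 11, 15)},
    {"firm": "B", "fiscal_year_end": date(2023, 12, 31), "filing_date": date(2024, 2, 20)},
    {"firm": "C", "fiscal_year_end": date(2024, 6, 30), "filing_date": date(2024, 8, 28)},
]

# Keep only documents published after the cutoff, so the model cannot
# have seen them (or what happened next) during training
out_of_sample = [f for f in filings if f["filing_date"] > MODEL_CUTOFF]

def known_as_of(filing: dict, as_of: date) -> bool:
    """A filing is usable on `as_of` only if it was already public then."""
    return filing["filing_date"] <= as_of

print([f["firm"] for f in out_of_sample])          # ['B', 'C']
print(known_as_of(filings[1], date(2024, 1, 15)))  # False
```

Firm B's fiscal year ended in December 2023, but its text was not available to investors until the February 2024 filing date, which is the date to use when aligning with returns.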

Research Challenge: Replicability

LLM-based research faces unique replicability challenges:

  • Model versioning: APIs update silently
    • Document exact model name: gpt-4-0613 not just gpt-4
  • API deprecation: models get retired
    • Consider local models for long-term reproducibility
  • Stochastic outputs: same prompt → different answers
    • Set temperature=0 for near-deterministic outputs (hosted APIs can still vary slightly across runs)
    • Save all raw outputs
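
One way to follow these recommendations is to append every call to a JSON Lines log. The sketch below is an assumption about what such a log might contain, not a standard format; the field names, the file name, and the placeholder output are illustrative.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

def make_record(model: str, prompt: str, params: dict, raw_output: str) -> dict:
    """Bundle everything needed to audit or re-run a single LLM call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,                      # exact version string
        "params": params,                    # temperature, etc.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "raw_output": raw_output,            # unparsed response text
    }

def append_jsonl(path: str, record: dict) -> None:
    """Append one record per line so the log survives interrupted runs."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Written to the temp directory here; use a project folder in practice
log_path = os.path.join(tempfile.gettempdir(), "llm_log.jsonl")

record = make_record(
    model="gpt-4-0613",
    prompt="Classify the sentiment: ...",
    params={"temperature": 0},
    raw_output="positive",  # placeholder standing in for a real API response
)
append_jsonl(log_path, record)
```

Storing the raw output means later analysis can be redone without calling the API again, even after a model is retired.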

Temperature and Sampling

Temperature controls randomness in LLM outputs:

  • temperature=0: deterministic, always pick most likely token
  • temperature=1: sample according to model probabilities
  • temperature>1: more random/creative
response = client.chat.completions.create(
    model="gpt-5-mini",
    temperature=0,  # Reproducible outputs
    messages=[...]
)

For research: use temperature=0 for reproducibility, but note that most models perform best at higher temperatures (typically 0.7-1.0)
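
Mechanically, temperature divides the model's scores (logits) before the softmax that turns them into token probabilities. A small sketch with three made-up logits shows the effect. Since temperature=0 would divide by zero, it is treated as the limiting case of always taking the top token, and a small positive value is used here to approximate it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                         # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 nearly all probability sits on the first token; at 2.0 the distribution is visibly flatter, so sampling picks the other tokens more often.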

Structured Output

LLMs output text, but we often need structured data:

response = client.chat.completions.create(
    model="gpt-5-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": """Extract financial data as JSON:
        {"revenue": number, "net_income": number, "sentiment": string}

        Text: Revenue was $5.2B, net income $800M.
        Management expressed cautious optimism."""
    }]
)
# {"revenue": 5200000000, "net_income": 800000000,
#  "sentiment": "cautiously positive"}
  • Use JSON mode for reliable parsing
  • Define clear schemas in prompts

Structured Output with Pydantic

from pydantic import BaseModel
from openai import OpenAI

class FinancialExtraction(BaseModel):
    revenue: float | None
    net_income: float | None
    sentiment: str
    confidence: float

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Extract financial information."},
        {"role": "user", "content": "Revenue grew 15% to $5.2B..."}
    ],
    response_format=FinancialExtraction,
)

result = completion.choices[0].message.parsed
print(result.revenue)  # 5200000000.0

LLM Workflow for Research

flowchart LR
    A[Documents] --> B[Prompt Template]
    B --> C[LLM API]
    C --> D[Parse Response]
    D --> E[Validate]
    E --> F[Store Results]
    F --> G[Analysis]

    E -->|Invalid| B

    style A fill:#e1f5fe
    style G fill:#c8e6c9

  • Batch processing: process documents in parallel
  • Error handling: API failures, rate limits, invalid responses
  • Validation: check outputs match expected format
  • Logging: save prompts, responses, and metadata
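
The parse, validate, and retry part of the diagram can be sketched without any network access. Here `call_llm` is a hypothetical stand-in that returns a canned string; in real use it would wrap an API or local model call. The required keys are also an assumption for the example.

```python
import json
import time

REQUIRED_KEYS = {"revenue", "net_income", "sentiment"}

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned JSON."""
    return '{"revenue": 5200000000, "net_income": 800000000, "sentiment": "positive"}'

def parse_and_validate(raw: str):
    """Return the parsed dict, or None if it is not JSON or misses a key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data

def process(doc: str, llm=call_llm, max_retries: int = 3, wait_seconds: float = 0.0):
    """Call the model, retrying on errors or invalid output; log every attempt."""
    prompt = f"Extract financial data as JSON from: {doc}"
    log = []
    for attempt in range(1, max_retries + 1):
        try:
            raw = llm(prompt)
        except Exception as exc:  # e.g. rate limit or network failure
            log.append({"attempt": attempt, "error": repr(exc)})
            time.sleep(wait_seconds * attempt)  # simple linear backoff
            continue
        log.append({"attempt": attempt, "raw": raw})
        parsed = parse_and_validate(raw)
        if parsed is not None:
            return parsed, log
    return None, log

result, log = process("Revenue was $5.2B, net income $800M.")
print(result["revenue"], len(log))
```

Returning the attempt log alongside the result keeps failed and invalid responses available for later inspection.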

Cost Considerations

LLM APIs charge per token (≈ 0.75 words):

Model | Input | Output
GPT-5.2 (OpenAI) | $1.75/1M tokens | $14/1M tokens
GPT-5 mini (OpenAI) | $0.25/1M tokens | $2/1M tokens
Claude Sonnet 4.5 (Anthropic) | $3/1M tokens | $15/1M tokens
Claude Haiku 4.5 (Anthropic) | $1/1M tokens | $5/1M tokens
  • 10,000 documents × 1,000 tokens each = 10M input tokens
  • With GPT-5 mini: ~$2.50 for input, or ~$22.50 in total if the responses also add up to 10M output tokens
  • Start with smaller/cheaper models for development
  • Consider local models for large-scale processing
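
The arithmetic behind such estimates is easy to script. The sketch below copies two rows of the price table (prices change, so treat them as a snapshot); the dictionary keys and the function name are assumptions for illustration.

```python
# Prices in USD per 1M tokens, copied from the table above (subject to change)
PRICES = {
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
}

def estimate_cost(model: str, n_docs: int,
                  input_tokens_per_doc: int, output_tokens_per_doc: int) -> float:
    """Total cost = input tokens * input price + output tokens * output price."""
    p = PRICES[model]
    input_cost = n_docs * input_tokens_per_doc / 1e6 * p["input"]
    output_cost = n_docs * output_tokens_per_doc / 1e6 * p["output"]
    return input_cost + output_cost

# Long responses: 1,000 output tokens per document
print(round(estimate_cost("gpt-5-mini", 10_000, 1_000, 1_000), 2))  # 22.5

# Short labels: about 10 output tokens per document
print(round(estimate_cost("gpt-5-mini", 10_000, 1_000, 10), 2))     # 2.7
```

Because output tokens cost several times more than input tokens, asking for a one-word label instead of a long explanation changes the bill substantially.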

Python Libraries Summary

Library | Use Case
NLTK | Classic NLP: tokenization, stemming, corpora
spaCy | Industrial NLP: fast, production-ready pipelines
scikit-learn | BoW, TF-IDF, text classification
Gensim | Word2Vec, Doc2Vec, topic modeling
Transformers | BERT, FinBERT, state-of-the-art models
Sentence-Transformers | Document embeddings, semantic search
OpenAI | GPT models via API
Ollama | Local LLM deployment

Best Practices Summary

  1. Start simple: dictionaries and TF-IDF are interpretable baselines
  2. Use domain-specific resources: Loughran-McDonald, FinBERT
  3. Document everything: model versions, prompts, parameters
  4. Set temperature=0: for reproducible research
  5. Validate outputs: LLMs can hallucinate or misformat
  6. Consider costs: batch processing, model selection
  7. Avoid look-ahead bias: respect model training cutoffs

References

Araci, Dogu. 2019. “FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models.” arXiv Preprint arXiv:1908.10063.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–86.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.
Jurafsky, Dan, and James H. Martin. 2025. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. https://web.stanford.edu/~jurafsky/slp3/.
Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.