Text Analysis and Natural Language Processing
| Application | Description | Example |
|---|---|---|
| Sentiment Analysis | Measure tone/mood of text | Is this earnings call positive or negative? |
| Information Extraction | Extract specific facts | What is the reported revenue? |
| Named Entity Recognition | Identify entities | Which companies are mentioned? |
| Document Classification | Categorize documents | What topics does this 10-K discuss? |
| Event Detection | Identify events | Is this news about an M&A? |
| Summarization | Condense text | Summarize this 100-page filing |
flowchart LR
A[Raw Text] --> B[Preprocessing]
B --> C[Tokenization]
C --> D[Normalization]
D --> E[Representation]
E --> F[Analysis/ML]
style A fill:#e1f5fe
style F fill:#c8e6c9
Tokenization = splitting text into individual units (tokens)
text = "Apple's Q3 revenue was $81.8 billion."
# Simple whitespace tokenization
tokens = text.split()
# ["Apple's", 'Q3', 'revenue', 'was', '$81.8', 'billion.']
# Better: use a tokenizer (word_tokenize needs NLTK's Punkt data downloaded once)
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# ['Apple', "'s", 'Q3', 'revenue', 'was', '$', '81.8', 'billion', '.']
Stop words = common words that carry little meaning
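Stop-word removal can be sketched in a few lines. The stop list below is a small hand-picked set chosen purely for illustration (an assumption, not a standard list); in practice a library-provided list would typically be used instead.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The revenue of the company was higher in Q3".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['revenue', 'company', 'higher', 'Q3']
```

Matching on the lower-cased token keeps the original casing of the words that survive, which matters for tokens such as "Q3".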
Reduce words to their root form to group related words:
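To make the idea concrete, here is a deliberately crude suffix-stripping stemmer. It is a toy written for this illustration, not the Porter algorithm or any library's implementation; the suffix list and the minimum stem length of 3 are assumptions.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "e", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["increase", "increased", "increases", "increasing"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['increas', 'increas', 'increas', 'increas']
```

All four surface forms collapse to the same stem, so they would be counted as one feature. Note that the stem need not be a dictionary word; lemmatization is the alternative when real words are required.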
N-grams = contiguous sequences of n tokens
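A minimal sketch of n-gram extraction from a token list, using only the standard library. The helper name `ngrams` is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "net income increased sharply".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('net', 'income'), ('income', 'increased'), ('increased', 'sharply')]
```

Bigrams such as ('net', 'income') preserve phrases whose meaning is lost when the words are counted separately.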
The simplest text representation: count word occurrences
from sklearn.feature_extraction.text import CountVectorizer
docs = [
"revenue increased significantly",
"profit margins decreased",
"revenue and profit both increased"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# ['and', 'both', 'decreased', 'increased', 'margins',
# 'profit', 'revenue', 'significantly']
Term Frequency-Inverse Document Frequency
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)
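The formula can be checked by hand on a tiny corpus. The sketch below assumes raw counts for TF, whitespace tokenization, and the natural logarithm. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "revenue increased significantly",
    "profit margins decreased",
    "revenue and profit both increased",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# DF(t): number of documents that contain term t at least once
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), with TF as a raw count."""
    tf = doc_tokens.count(term)
    return tf * math.log(N / df[term])

scores = {t: round(tf_idf(t, tokenized[0]), 3) for t in tokenized[0]}
print(scores)
# {'revenue': 0.405, 'increased': 0.405, 'significantly': 1.099}
```

"significantly" appears in only one document, so it receives a higher weight than "revenue" and "increased", which each appear in two of the three.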
from sklearn.feature_extraction.text import TfidfVectorizer
docs = [
"The company reported strong revenue growth in Q3",
"Revenue declined due to market conditions",
"Strong earnings beat analyst expectations"
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
# X is now a sparse matrix of TF-IDF features
print(X.shape) # (3 documents, n features)
TfidfVectorizer combines tokenization, stop word removal, and TF-IDF weighting
Set ngram_range=(1, 2) to include bigrams
Use predefined word lists to measure specific concepts:
The standard sentiment dictionary for financial text (Loughran and McDonald 2011)
| Category | Example Words |
|---|---|
| Negative | loss, decline, adverse, deficit, litigation |
| Positive | achieve, benefit, gain, improve, profitable |
| Uncertainty | approximate, contingent, possible, risk |
| Litigious | attorney, court, lawsuit, legal, tribunal |
Available at: sraf.nd.edu/loughranmcdonald-master-dictionary
import pandas as pd
# Load the dictionary
lm_dict = pd.read_csv('LoughranMcDonald_MasterDictionary.csv')
# Get negative and positive words (lower-cased to match the lower-cased text)
negative_words = set(
    lm_dict[lm_dict['Negative'] > 0]['Word'].str.lower()
)
positive_words = set(
    lm_dict[lm_dict['Positive'] > 0]['Word'].str.lower()
)
# Count negative words in text
def count_negative(text):
    words = text.lower().split()
    return sum(1 for w in words if w in negative_words)
# Compute sentiment score: (positive - negative) / total words
def sentiment_score(text):
    words = text.lower().split()
    if not words:
        return 0.0  # avoid division by zero on empty text
    n_neg = sum(1 for w in words if w in negative_words)
    n_pos = sum(1 for w in words if w in positive_words)
    return (n_pos - n_neg) / len(words)
Solution: word embeddings, which are dense, semantic representations
Map words to dense vectors where similar words are close together
%%{init: {'theme': 'base', 'themeVariables': {'clusterBkg': '#f5f5f5', 'clusterBorder': '#cccccc'}}}%%
flowchart LR
subgraph "Sparse (BoW)"
A["king: [0,0,1,0,0,...]<br/>queen: [0,1,0,0,0,...]<br/>man: [1,0,0,0,0,...]"]
end
subgraph "Dense (Embedding)"
B["king: [0.2, 0.8, -0.1, ...]<br/>queen: [0.3, 0.7, -0.2, ...]<br/>man: [0.1, 0.6, 0.4, ...]"]
end
A --> B
style A fill:#ffcdd2
style B fill:#c8e6c9
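"Close together" is usually measured with cosine similarity. The sketch below reuses only the components shown in the diagram (the vectors there are truncated, so these are toy values, not real embeddings).

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: the leading components from the diagram above
one_hot = {"king": [0, 0, 1, 0, 0], "queen": [0, 1, 0, 0, 0], "man": [1, 0, 0, 0, 0]}
dense = {"king": [0.2, 0.8, -0.1], "queen": [0.3, 0.7, -0.2], "man": [0.1, 0.6, 0.4]}

print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0: one-hot vectors are orthogonal
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["man"]))  # True
```

With one-hot vectors every pair of distinct words has similarity zero, so the representation carries no notion of relatedness. Dense vectors can place "king" nearer to "queen" than to "man".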
The foundational word embedding method (Mikolov et al. 2013)
Key idea: “You shall know a word by the company it keeps” (Firth 1957)
import gensim.downloader as api
# Load pre-trained Word2Vec embeddings
model = api.load('word2vec-google-news-300')
# Find similar words
model.most_similar('profit')
# [('profits', 0.82), ('earnings', 0.71), ('revenue', 0.65), ...]
# Word arithmetic
model.most_similar(positive=['ceo', 'woman'], negative=['man'])
# [('chairwoman', 0.71), ('executive', 0.69), ...]
# Get embedding vector
vector = model['stock'] # 300-dimensional vector
How to represent an entire document as a vector?
Simple approach: average word embeddings
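A minimal sketch of the averaging approach. The three-dimensional vectors are made-up toy values standing in for real embeddings such as the 300-dimensional Word2Vec vectors, and the helper name `doc_vector` is hypothetical.

```python
# Made-up 3-dimensional vectors for illustration only
toy_vectors = {
    "profit":  [0.9, 0.1, 0.0],
    "revenue": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

def doc_vector(tokens, vectors):
    """Average the vectors of all tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token has an embedding
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

vec = doc_vector(["profit", "and", "revenue", "grew"], toy_vectors)
print([round(x, 2) for x in vec])  # [0.85, 0.15, 0.05]
```

Out-of-vocabulary tokens ("and", "grew") are skipped. Averaging also discards word order, so "profit beat revenue" and "revenue beat profit" map to the same vector.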
Traditional embeddings give one vector per word:
“The bank raised interest rates”
“I sat by the river bank”
Same vector for “bank” in both sentences!
Solution: contextual embeddings from Transformers
The foundational architecture for modern NLP (Vaswani et al. 2017)
For each word, compute attention weights to all other words:
“The company reported that its revenue increased”
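The attention weights for one word can be sketched as a softmax over scaled dot products. This is a simplification under stated assumptions: the two-dimensional vectors are made up, and the same vector serves as both query and key, whereas a real Transformer learns separate query, key, and value projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """softmax(q . k / sqrt(d)) over all keys, for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Made-up vectors for three words of the example sentence
tokens = ["company", "its", "revenue"]
vectors = {"company": [1.0, 0.0], "its": [0.9, 0.1], "revenue": [0.0, 1.0]}

weights = attention_weights(vectors["its"], [vectors[t] for t in tokens])
for token, w in zip(tokens, weights):
    print(f"{token}: {w:.2f}")
```

With these toy values, "its" puts its largest weight on "company", the word it refers to. The weights sum to one and are then used to form a weighted average of the value vectors.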
Bidirectional Encoder Representations from Transformers (Devlin et al. 2019)
General BERT may not understand financial language well
FinBERT variants (Araci 2019):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The company reported strong earnings",
"Profits exceeded analyst expectations",
"The weather was sunny today"
]
embeddings = model.encode(sentences)
# embeddings.shape: (3, 384)
# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embeddings)
# sentences 0 and 1 will have high similarity
LLMs = very large Transformer models trained on vast text corpora
Key capability: can perform tasks from instructions (prompts)
| Task | Example Prompt |
|---|---|
| Sentiment | “Is this earnings call positive, negative, or neutral?” |
| Classification | “What risk factors are discussed in this 10-K section?” |
| Extraction | “Extract the reported revenue and net income from this text” |
| Summarization | “Summarize this 10-K in 3 bullet points” |
| Q&A | “Based on this filing, what are the main business risks?” |
| NER | “List all company names mentioned in this article” |
from openai import OpenAI
client = OpenAI() # Uses OPENAI_API_KEY env variable
response = client.chat.completions.create(
model="gpt-5-mini",
messages=[
{"role": "system", "content": "You are a financial analyst."},
{"role": "user", "content": """
Classify the sentiment of this text as
positive, negative, or neutral:
"Despite challenging market conditions,
the company achieved record revenue growth."
"""}
]
)
print(response.choices[0].message.content)
# "positive"
The art of writing effective prompts:
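One way to keep prompts consistent across many documents is to build them from a template. The sketch below is only an illustration: the function name, the wording, and the choice to list the allowed labels and constrain the answer to one word are assumptions, not a prescribed recipe.

```python
def build_sentiment_prompt(text, labels=("positive", "negative", "neutral")):
    """Assemble a classification prompt with explicit labels and output format."""
    label_list = ", ".join(labels)
    return (
        "You are a financial analyst.\n"
        f"Classify the sentiment of the text below as one of: {label_list}.\n"
        "Answer with a single word and nothing else.\n\n"
        f'Text: "{text}"'
    )

prompt = build_sentiment_prompt(
    "Despite challenging market conditions, "
    "the company achieved record revenue growth."
)
print(prompt)
```

Spelling out the allowed labels and the expected answer format makes the responses easier to parse and compare across documents.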
Run LLMs locally for privacy and cost control:
Look-ahead bias = using information not available at the time
In NLP research, this can occur through:
LLM-based research faces unique replicability challenges:
Pin a specific model version: gpt-4-0613, not just gpt-4
Set temperature=0 for (near-)deterministic outputs
Temperature controls randomness in LLM outputs:
temperature=0: deterministic, always picks the most likely token
temperature=1: samples according to the model's probabilities
temperature>1: more random/creative
For research: use temperature=0 for reproducibility, but note that many models are tuned for, and may perform better at, higher temperatures (typically 0.7-1.0)
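The effect of temperature can be seen by dividing the model's token scores (logits) by the temperature before the softmax. The logits below are made-up values for three candidate tokens, and treating temperature=0 as an argmax is the usual limiting convention assumed here.

```python
import math

def token_probabilities(logits, temperature):
    """Softmax of logits / temperature; temperature=0 is treated as argmax."""
    if temperature == 0:
        # Limit case: all probability mass on the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

As the temperature rises, probability mass moves from the top token toward the alternatives, which is why higher temperatures produce more varied output.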
LLMs output text, but we often need structured data:
response = client.chat.completions.create(
model="gpt-5-mini",
response_format={"type": "json_object"},
messages=[{
"role": "user",
"content": """Extract financial data as JSON:
{"revenue": number, "net_income": number, "sentiment": string}
Text: Revenue was $5.2B, net income $800M.
Management expressed cautious optimism."""
}]
)
# {"revenue": 5200000000, "net_income": 800000000,
# "sentiment": "cautiously positive"}
from pydantic import BaseModel
from openai import OpenAI
class FinancialExtraction(BaseModel):
    revenue: float | None
    net_income: float | None
    sentiment: str
    confidence: float
client = OpenAI()
completion = client.beta.chat.completions.parse(
model="gpt-5-mini",
messages=[
{"role": "system", "content": "Extract financial information."},
{"role": "user", "content": "Revenue grew 15% to $5.2B..."}
],
response_format=FinancialExtraction,
)
result = completion.choices[0].message.parsed
print(result.revenue) # 5200000000.0
flowchart LR
A[Documents] --> B[Prompt Template]
B --> C[LLM API]
C --> D[Parse Response]
D --> E[Validate]
E --> F[Store Results]
F --> G[Analysis]
E -->|Invalid| B
style A fill:#e1f5fe
style G fill:#c8e6c9
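The parse, validate, and retry loop in the diagram can be sketched without any API dependency by passing the model call in as a function. Everything named here (`process_document`, `REQUIRED_KEYS`, the retry wording, the stand-in `fake_llm_factory`) is hypothetical; a real pipeline would swap in an actual API call and a fuller schema check.

```python
import json

REQUIRED_KEYS = {"revenue", "sentiment"}  # hypothetical minimal schema

def validate(record):
    """A record is valid if it is a dict containing all required keys."""
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

def process_document(doc, call_llm, max_attempts=3):
    """Prompt -> LLM -> parse -> validate, re-prompting on invalid output."""
    prompt = f"Extract revenue and sentiment as JSON.\nText: {doc}"
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            record = None
        if validate(record):
            return record
        # Invalid: go back to the prompt step with a stricter instruction
        prompt += "\nReturn only a JSON object with keys revenue and sentiment."
    return None  # give up after max_attempts

def fake_llm_factory():
    """Stand-in for an API call: fails once, then returns valid JSON."""
    replies = iter(["not json", '{"revenue": 5.2e9, "sentiment": "positive"}'])
    return lambda prompt: next(replies)

results = [process_document("Revenue was $5.2B.", fake_llm_factory())]
print(results)
```

Returning `None` after the last attempt, rather than raising, lets a batch job record the failure and continue with the remaining documents.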
LLM APIs charge per token (1 token ≈ 0.75 words):
| Model | Input | Output |
|---|---|---|
| GPT-5.2 (OpenAI) | $1.75/1M tokens | $14/1M tokens |
| GPT-5 mini (OpenAI) | $0.25/1M tokens | $2/1M tokens |
| Claude Sonnet 4.5 (Anthropic) | $3/1M tokens | $15/1M tokens |
| Claude Haiku 4.5 (Anthropic) | $1/1M tokens | $5/1M tokens |
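A back-of-the-envelope cost estimate follows directly from the table. The sketch uses the GPT-5 mini prices listed above; the corpus size and token counts are hypothetical (a 15,000-word filing is roughly 20,000 tokens at 0.75 words per token).

```python
# USD per 1M tokens, taken from the table above
PRICES_PER_MILLION = {"gpt-5-mini": {"input": 0.25, "output": 2.00}}

def estimate_cost(n_docs, input_tokens_per_doc, output_tokens_per_doc,
                  model="gpt-5-mini"):
    """Total API cost in USD for processing n_docs documents."""
    price = PRICES_PER_MILLION[model]
    input_cost = n_docs * input_tokens_per_doc * price["input"] / 1_000_000
    output_cost = n_docs * output_tokens_per_doc * price["output"] / 1_000_000
    return input_cost + output_cost

# Hypothetical job: 10,000 filings, ~20,000 input and ~200 output tokens each
total = estimate_cost(10_000, 20_000, 200)
print(f"${total:,.2f}")  # $54.00
```

In this scenario almost all of the cost comes from input tokens, so trimming each document to the relevant sections matters more than shortening the requested output.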
| Library | Use Case |
|---|---|
| NLTK | Classic NLP: tokenization, stemming, corpora |
| spaCy | Industrial NLP: fast, production-ready pipelines |
| scikit-learn | BoW, TF-IDF, text classification |
| Gensim | Word2Vec, Doc2Vec, topic modeling |
| Transformers | BERT, FinBERT, state-of-the-art models |
| Sentence-Transformers | Document embeddings, semantic search |
| OpenAI | GPT models via API |
| Ollama | Local LLM deployment |
Essential reading:
Books (available via HEC library):
Videos:
MATH60230