Chapter 2: Sentiment Analysis
About This Chapter
Every day, hundreds of millions of people publish opinions — on Twitter/X, Reddit, news comment sections, earnings-call transcripts, and product review sites. That torrent of text carries an economic signal: it reflects how investors feel about a stock, how consumers feel about a brand, how voters feel about a policy. The challenge is that a sentence like “This company’s future looks absolutely brilliant” and a sentence like “Oh sure, another earnings miss — just brilliant” use the same word, brilliant, to mean opposite things.
Sentiment analysis is the computational task of automatically assigning an affective orientation — positive, negative, or neutral — to a piece of text. At its most naive, it reduces to counting positive and negative words. In its most sophisticated form, it requires a model that understands sarcasm, negation, domain vocabulary, and the full pragmatic context of an utterance. This chapter works through the full progression: from word-counting to logistic regression to transformer-based classifiers, with honest discussion of where each approach fails.
Why sentiment analysis is harder than it looks
Consider these four sentences:
- “Revenue increased 12% year-on-year — strong execution by the management team.”
- “Revenue increased 12%, missing analyst expectations by 4 percentage points.”
- “The stock is not bad for a value play.”
- “Oh great, another regulatory inquiry.”
Sentence 1 is unambiguously positive. Sentence 2 contains the word increased and the number 12%, both seemingly positive, but is actually negative because of the miss framing. Sentence 3 pairs a negation word (not) with a negative word (bad) — the combination is mildly positive, but a bag-of-words approach would likely call it negative. Sentence 4 uses great — one of the most positive words in any lexicon — in a sarcastic construction that is decidedly negative.
These four failure modes — domain shift, context dependence, negation, and sarcasm — define the research frontier of sentiment analysis. By the end of this chapter you will know how each classical approach handles (or mishandles) each failure mode, and why transformer models like FinBERT were specifically designed to overcome them.
How this chapter connects to the course
Chapter 1 (Topic Models) taught you how to discover what people are talking about — themes and topics that emerge from text. This chapter teaches you how to measure how they feel about it — the valence. Chapter 3 (LLMs) will then show you why a single model can do both simultaneously, and much more. The progression is: unsupervised structure discovery → supervised signal extraction → general-purpose language understanding.
Table of Contents
- Lexicon-Based Methods
- Why Domain Lexicons Matter
- Supervised Sentiment Classification
- The Negation Problem
- Transformer-Based Sentiment
- Building a Daily Sentiment Index from Tweets
- Validation: Does Sentiment Predict Anything?
- Limitations and What Comes Next
Lexicon-Based Methods
From word lists to sentiment scores
The oldest and most transparent approach to sentiment analysis is lexicon-based scoring: maintain a dictionary that maps words to sentiment scores, scan each document for words in the dictionary, and aggregate the scores into a document-level sentiment estimate. The approach requires no labeled training data, runs in milliseconds on millions of documents, and produces fully interpretable scores — properties that make it remarkably durable in applied work.
The core assumption is that sentiment is additive at the word level: the sentiment of a sentence is approximately the sum (or average) of the sentiments of its constituent words. This assumption is false in general — it ignores syntax, context, and interaction effects — but it works surprisingly well for many practical applications, particularly on news headlines and short factual statements where the key signal is simply which sentiment-laden nouns and adjectives appear.
Two lexicons dominate applied work in finance and social media, respectively.
The Loughran–McDonald Financial Sentiment Dictionary
Loughran and McDonald (2011) published a landmark study in the Journal of Finance demonstrating that the Harvard General Inquirer lexicon — the standard general-purpose dictionary at the time — misclassified roughly 73% of the words it labeled negative in 10-K filings. The problem was domain shift: words like liability, tax, obligation, capital, and cease were flagged as negative by the general lexicon, but they appear entirely routinely in neutral financial disclosures. The Harvard lexicon was built on newspaper text and literary analysis, not financial reporting.
To address this, Loughran and McDonald manually reviewed 10-K filings and constructed a domain-specific lexicon with six categories:
| Category | Description | Example words |
|---|---|---|
| Negative | Genuinely negative in finance | loss, impair, restate, default, bankruptcy |
| Positive | Genuinely positive in finance | achieve, efficient, improve, profitable, strong |
| Uncertainty | Hedging language | approximately, contingent, uncertain, volatile |
| Litigious | Legal exposure language | allegation, lawsuit, regulatory, claim, violation |
| Constraining | External constraint language | required, must, shall, obligated, restricted |
| Superfluous | Common filler words | a, the, an (effectively a stopword list) |
The LM negative category is the workhorse signal. In an earnings-call transcript or 10-K filing, a high fraction of LM-negative words predicts negative abnormal returns. The uncertainty category predicts elevated option implied volatility. The litigious category predicts legal costs.
RavenPack, the leading provider of structured news analytics to hedge funds, assigns a sentiment score (0 to 100, with 50 neutral) to every news article processed. Their scoring engine combines lexicon-based features — including a proprietary financial dictionary trained on decades of market-moving news — with entity-resolution logic that distinguishes whether negative language applies to the subject company or to its competitors. A fund that subscribes to RavenPack does not run raw LM scoring; it buys pre-computed, entity-resolved scores delivered via API within milliseconds of publication. The LM dictionary represents the baseline that practitioners benchmark against.
Live cell: a minimal LM-style scorer in pure Python
The full LM dictionary contains tens of thousands of entries. For this interactive session, we use a representative 20-word subset from each sentiment category. The scorer below tokenizes, lowercases, and counts matches, returning a normalized positive score, negative score, uncertainty score, and a net sentiment value.
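For reference, a minimal sketch of such a scorer is shown below. The word lists are illustrative stand-ins, not the actual LM entries or the 20-word subsets used in the live cell.

# ── Minimal LM-style lexicon scorer (illustrative word lists, not the real LM dictionary) ──
import re

LM_POSITIVE = {"achieve", "efficient", "improve", "profitable", "strong", "beat", "record"}
LM_NEGATIVE = {"loss", "impair", "restate", "default", "bankruptcy", "miss", "decline"}
LM_UNCERTAIN = {"approximately", "contingent", "uncertain", "volatile", "may", "unclear"}

def lm_score(text: str) -> dict:
    """Tokenize, lowercase, count lexicon hits, and return normalized scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)                          # guard against empty input
    pos = sum(t in LM_POSITIVE for t in tokens) / n
    neg = sum(t in LM_NEGATIVE for t in tokens) / n
    unc = sum(t in LM_UNCERTAIN for t in tokens) / n
    return {"positive": pos, "negative": neg, "uncertainty": unc, "net": pos - neg}

print(lm_score("The company may restate earnings after an unexpected loss."))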
Interpretation. The net score correctly identifies positive headlines (rows 0, 2, 4, 7) and negative ones (rows 1, 3, 5, 6). The uncertainty score fires on sentences with hedging language like may, potential, and unclear, which are the sentences an analyst would flag as low-conviction. Note that this 40-word lexicon achieves sensible rankings without any training data — transparency and speed are what the method buys in exchange for its limited vocabulary coverage.
Before running the next cell, predict: which sentence will have the highest net score? Which will have the most extreme negative net score?
Why Domain Lexicons Matter
The same word, two different worlds
One of the most important lessons in applied text analytics is that words do not have universal sentiment. Their connotation depends on the domain of discourse. A word that signals danger in a medical report may be routine nomenclature in a legal brief; a word that is strongly negative in equity analysis may be neutral in macroeconomics.
Consider two sentences:
- “The company faces significant liability from the product recall.” — financial context, clearly negative.
- “The doctrine of limited liability protects shareholders from personal losses.” — legal/educational context, neutral or even positive.
The word liability appears in both. The Harvard General Inquirer lexicon (pre-LM) assigned a negative score to liability without qualification, because in general English the word carries a negative connotation. The LM lexicon, calibrated on financial disclosures, drops liability from its negative list because in 10-K filings the word overwhelmingly appears as a neutral term in routine legal boilerplate.
A second powerful example: volatile. In equity analysis, volatile is unambiguously negative — it implies risk, uncertainty, and unpredictable returns. In chemistry, volatile describes a physical property (a low boiling point) with no affective valence whatsoever. A general lexicon built from news text will score volatile as negative because that is its dominant usage in everyday and financial reporting; applied to chemistry abstracts, every mention of a compound's volatility would produce a spurious negative sentiment score.
The practical consequences are severe. Tetlock (2007) showed that the fraction of negative words in Wall Street Journal columns predicts future market returns — but the LM study replicated his analysis and found that most of his predictive power came from words like liability, tax, and regulation that were negative in the Harvard lexicon but neutral in financial contexts. Once the domain-appropriate lexicon was used, the predictive signal was materially weaker. A measurement artifact was masquerading as an economic finding.
Live cell: general lexicon vs. financial lexicon on the same sentence
The cell below scores a set of test sentences with two lexicons: a simplified “general” lexicon (words positive/negative in everyday English) and the LM-style financial lexicon introduced above. Watch how their scores diverge on sentences 3 and 4.
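A rough sketch of the kind of comparison the cell performs, using two small illustrative word lists in place of the full lexicons:

# ── General vs. financial lexicon on the same sentences (illustrative word lists) ──
GENERAL_NEG = {"liability", "obligations", "volatile", "loss", "risk"}
GENERAL_POS = {"meets", "careful", "strong", "growth"}
FIN_NEG = {"loss", "impairment", "default"}           # LM-style: liability/volatile not penalized
FIN_POS = {"strong", "growth", "profitable"}

def net_score(text, pos_words, neg_words):
    tokens = [t.strip(".,") for t in text.lower().split()]
    return sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)

sentences = [
    "The company reports no material liability and meets all obligations.",
    "Volatile markets require careful risk management.",
]
for s in sentences:
    print(f"general={net_score(s, GENERAL_POS, GENERAL_NEG):+d}  "
          f"financial={net_score(s, FIN_POS, FIN_NEG):+d}  | {s}")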
Interpretation. Sentences 3 and 4 reveal the divergence. Sentence 3 (“no material liability and meets all obligations”) scores negative under the general lexicon because liability and obligations are in the general negative list — but a financial analyst reading the sentence recognizes it as routine neutral boilerplate. The LM lexicon does not penalize liability as strongly in this context. Sentence 4 (“volatile markets require careful risk management”) scores negative under the general lexicon because volatile carries negative connotation in everyday language; under the LM financial lexicon, volatile is a domain-appropriate technical term that should suppress aggressive negative scoring.
Bloomberg’s sentiment engine, integrated into the Bloomberg Terminal as the News Sentiment feature (NLP@Bloomberg, launched 2013), uses a proprietary financial lexicon built from decades of market-moving news rather than a general dictionary. The engine is trained to distinguish negative company news from negative macroeconomic news — a fund that is long a stock cares about company-specific negative sentiment, not whether the general economic climate sounds gloomy. This entity-and-domain distinction is exactly what general-purpose tools like TextBlob or VADER fail to make.
Supervised Sentiment Classification
Moving beyond word lists
Lexicon-based methods are transparent and fast, but they cannot learn from data. A supervised classifier, by contrast, is trained on a labeled corpus — a set of texts where each text has been manually assigned a sentiment label. The model learns which patterns of words (and combinations thereof) predict which labels, and this learned mapping can capture domain-specific signal that no pre-specified word list would anticipate.
The standard pipeline for text classification has four stages:
- Corpus assembly: collect labeled documents. Labels may be binary (positive/negative), ordinal (1–5 star ratings), or categorical (inflation/deflation/neutral).
- Feature extraction: convert each text into a numerical vector. The simplest approach is bag-of-words (BoW): count how many times each vocabulary word appears in the document, discarding word order. The more powerful variant is TF-IDF, which weights each count by the inverse document frequency of the word.
- Model training: fit a classifier that maps the feature vector \(x\) to a label \(y\).
- Evaluation: measure accuracy, precision, recall, and F1 on a held-out test set.
TF-IDF: the standard text representation
For a word \(t\) in document \(d\) within a corpus \(D\), the TF-IDF score is:
\[\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)\]
where the term frequency is:
\[\text{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}\]
and the inverse document frequency is:
\[\text{idf}(t, D) = \log\!\left(\frac{N}{\lvert\{d \in D : t \in d\}\rvert}\right)\]
with \(N = |D|\) being the total number of documents. The scikit-learn implementation uses a smoothed variant to avoid division by zero:
\[\text{idf}(t, D) = \log\!\left(\frac{1 + N}{1 + \lvert\{d \in D : t \in d\}\rvert}\right) + 1\]
The key insight: TF-IDF rewards words that appear frequently in this document but rarely across all documents. The word “the” appears in every document, so its IDF is near zero and its TF-IDF is negligible even when it appears many times. The word “earnings” in a corpus of news headlines is informative — it appears frequently in financial stories but not in political or sports stories. Words with high TF-IDF scores are the words that make each document distinctive.
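The down-weighting can be checked directly in scikit-learn; a short sketch on a toy corpus (the corpus contents are illustrative):

# ── TF-IDF in scikit-learn: common words get low idf weight (toy corpus) ──
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the earnings beat expectations",
    "the senate passed the budget bill",
    "the team won the championship game",
]
vec = TfidfVectorizer()                  # uses the smoothed idf variant by default
X = vec.fit_transform(corpus)            # sparse matrix of shape (3, vocabulary size)

# "the" appears in every document (lowest idf); "earnings" appears in only one (highest idf)
for word in ["the", "earnings"]:
    idx = vec.vocabulary_[word]
    print(f"{word:>10s}  idf = {vec.idf_[idx]:.3f}")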
Logistic regression for binary sentiment
Given a TF-IDF feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) for document \(i\), binary logistic regression models the log-odds of a positive sentiment label:
\[\log \frac{P(y_i = 1 \mid \mathbf{x}_i)}{P(y_i = 0 \mid \mathbf{x}_i)} = \boldsymbol{\beta}^\top \mathbf{x}_i\]
which implies:
\[P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\boldsymbol{\beta}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\boldsymbol{\beta}^\top \mathbf{x}_i}}\]
The parameters \(\boldsymbol{\beta}\) are estimated by maximizing the log-likelihood (equivalently, minimizing the cross-entropy loss):
\[\mathcal{L}(\boldsymbol{\beta}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]\]
In high-dimensional text spaces (where \(p\), the vocabulary size, often exceeds \(n\), the number of documents), the model is typically \(\ell_2\)-regularized:
\[\mathcal{L}_{\text{reg}}(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \frac{\lambda}{2}\|\boldsymbol{\beta}\|^2\]
In scikit-learn, the regularization strength is controlled by \(C = 1/\lambda\): small \(C\) imposes stronger regularization (simpler model), large \(C\) allows more complex fits.
Live cell: logistic regression on a labeled tweet corpus
The corpus below contains 32 labeled tweets representing realistic social-media text about technology stocks and brands. Labels are 1 (positive) or 0 (negative).
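The live cell follows the standard scikit-learn workflow. A minimal sketch of that pipeline is below; the four example tweets are illustrative placeholders, not the 32-tweet corpus used in the cell.

# ── TF-IDF + logistic regression sentiment pipeline (illustrative mini-corpus) ──
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

tweets = [
    "record quarter, beat estimates, very bullish on this stock",
    "guidance cut again, churn rising, getting out of this name",
    "strong execution and growing margins, adding to my position",
    "ceo resigned amid an accounting probe, this looks bad",
]
labels = [1, 0, 1, 0]                                    # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.5, random_state=0, stratify=labels
)

vec = TfidfVectorizer()
clf = LogisticRegression(C=1.0)                          # C = 1/lambda controls l2 strength
clf.fit(vec.fit_transform(X_train), y_train)

y_pred = clf.predict(vec.transform(X_test))
print("test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))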
Interpretation. With only 25–26 training examples and 6–7 test examples, the model is operating in the extreme small-data regime. The result illustrates the fundamental bias-variance tradeoff: with so few training examples, the model has high variance — small changes in the training split can swing accuracy substantially. In practice, a production-grade sentiment classifier for financial tweets would be trained on tens of thousands of labeled examples, and accuracy above 85–90% on a balanced held-out set would be a reasonable target.
The confusion matrix shows four cells: true negatives (correctly called negative), false positives (negative tweets called positive), false negatives (positive tweets called negative), and true positives. In a trading application, false positives and false negatives have asymmetric costs: a false positive — calling a negative article positive — could trigger a long trade that loses money. Signal-to-noise ratio matters more than raw accuracy.
At a real-world asset manager, labeling 10,000 earnings-call paragraphs for sentiment training costs roughly $30,000–$50,000 in annotation labor (on the order of $3–$5 per labeled example from a crowdsourcing platform with expert quality control). The annotation investment is justified because the resulting proprietary model provides alpha that cannot be replicated with a public tool like VADER. Several buy-side shops — including Schroders, Man Group, and Two Sigma — have published (or patented) their approaches to earnings-call sentiment modeling built on exactly this pipeline.
Now let us plot the top-weighted words the logistic regression learned — a direct window into what the model considers the most discriminative vocabulary.
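Extracting those top-weighted words (which the live cell then plots) takes only a few lines; a sketch, assuming the fitted vec and clf objects from the pipeline sketch above (any fitted TfidfVectorizer and LogisticRegression will do):

# ── Top-weighted vocabulary from the fitted logistic regression (assumes vec, clf) ──
import numpy as np

feature_names = np.array(vec.get_feature_names_out())
coefs = clf.coef_.ravel()                 # one coefficient per vocabulary feature
order = np.argsort(coefs)

print("most negative words:", feature_names[order[:5]])
print("most positive words:", feature_names[order[-5:]])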
Interpretation. The coefficient plot is the interpretability window of logistic regression. Words with the most positive coefficients are those whose presence most strongly increases \(P(y=1)\). Words with the most negative coefficients are those that most strongly predict a negative label. On this small corpus the model correctly learns words like impairment, churn, resigned, and probe as negative, and words like beat, record, confidence, and bullish as positive. In a larger corpus, these coefficients stabilize and become a reliable feature-importance ranking — a qualitative check that the model is learning sensible signal rather than overfitting noise.
The Negation Problem
Why “not good” is not “good” with a modifier
Bag-of-words and unigram TF-IDF features treat each word independently. The representation of “the product is not good” and “the product is good” differs only in the presence of the token “not”. A classifier that assigns a positive coefficient to “good” will score both sentences positively unless it has also learned that “not” negates the following word.
The negation problem is not trivial. Consider:
- “not good” → negative
- “not bad” → mildly positive (double negation expressing faint approval)
- “not exactly terrible” → mildly positive
- “it’s not without its problems” → negative, despite double negation
- “I cannot say enough good things about this product” → strongly positive, despite the word “cannot”
The fourth and fifth examples show that naive negation-handling (flip all sentiment within a three-word window of “not”) would actually hurt. This is the core difficulty: negation is a syntactic phenomenon that requires parsing the sentence structure, not just scanning for negation trigger words. The sketch below makes this concrete.
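A toy implementation of the window-flipping heuristic, with illustrative word lists, shows where it helps and where it backfires:

# ── Naive negation flipping: repairs "not good", corrupts the harder cases (toy word lists) ──
POS = {"good"}
NEG = {"bad", "terrible", "problems"}
NEGATORS = {"not", "cannot", "no"}

def naive_score(text):
    tokens = text.lower().replace("'", " ").split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POS or tok in NEG:
            value = 1 if tok in POS else -1
            # flip polarity if a negator appears within the three preceding tokens
            if any(t in NEGATORS for t in tokens[max(0, i - 3):i]):
                value = -value
            score += value
    return score

examples = [
    "not good",                                    # flipped to -1: correct
    "not bad",                                     # flipped to +1: roughly correct
    "it's not without its problems",               # flipped to +1: wrong, sentence is negative
    "I cannot say enough good things about this",  # flipped to -1: wrong, sentence is positive
]
for ex in examples:
    print(f"{naive_score(ex):+d}  {ex}")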
Bigrams as a partial solution
A practical and computationally cheap remedy is to add bigrams (two-word phrases) to the feature space. With bigrams, the phrase “not good” becomes a single feature token — distinct from “good” by construction. The TF-IDF representation of “not good” and “not bad” are now different features with different learned coefficients, allowing the model to distinguish them directly from labeled examples.
The trade-off is a quadratic expansion of the feature space: if the vocabulary has \(V\) words, the bigram vocabulary has up to \(V^2\) entries (in practice, far fewer after filtering by minimum document frequency). This increases memory and fitting time, but the signal gain often outweighs the cost on social-media text where short fixed phrases carry predictable sentiment.
In scikit-learn, switching from unigrams to unigrams-plus-bigrams requires only a single parameter change:
TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams

Live cell: does adding bigrams improve accuracy?
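A sketch of the comparison the live cell runs, on a small self-contained corpus (the sentences are illustrative; the live cell uses the chapter's 32-tweet corpus with 4-fold cross-validation):

# ── Unigram vs. unigram+bigram features under cross-validation (illustrative mini-corpus) ──
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

tweets = [
    "this product is not good at all",        "service was not bad, honestly",
    "earnings were a disaster, selling",      "record revenue, very happy holder",
    "margins keep shrinking, not great",      "guidance raised, strong quarter",
    "the update made everything worse",       "support team was excellent and fast",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

for ngram_range in [(1, 1), (1, 2)]:
    model = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(C=1.0),
    )
    scores = cross_val_score(model, tweets, labels, cv=4)
    print(f"ngram_range={ngram_range}: mean CV accuracy = {scores.mean():.3f}")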
Interpretation. With only 32 examples, the cross-validation estimate of accuracy has high variance — the four fold-wise numbers will fluctuate considerably, and you should not over-interpret the absolute difference between the two configurations. What the comparison does establish is a direction: adding bigrams tends to improve accuracy on this kind of short, opinion-rich text precisely because the negation-aware bigram features (“not_good”, “not_bad”, “never_stronger”) provide cleaner separating signal than their constituent unigrams alone.
A note on bias and variance with tiny data. The bias-variance tradeoff is visibly strained here. With 24 training examples and a 300-feature vocabulary, the unigram model is already in a \(p \gg n\) regime where overfitting is the dominant risk. The \(\ell_2\) regularization (controlled by \(C = 1.0\)) partially mitigates this, but a production classifier for this task would require at minimum 500–1000 labeled examples before a bigram model reliably outperforms a well-regularized unigram baseline.
Twitter’s sentiment classification pipeline (described in internal publications circa 2014–2016, later adapted by many buy-side firms) used exactly this unigram + bigram approach with logistic regression as the base classifier. The pipeline ran on 400 million tweets per day and powered features like “trending sentiment” shown in the Twitter analytics dashboard. At scale, the bigram vocabulary for English financial Twitter contained approximately 2.3 million distinct features — managed via sparse matrix representation so that each document’s TF-IDF vector remained computationally tractable.
Transformer-Based Sentiment
The limits of bag-of-words
Both the lexicon-based and the supervised TF-IDF approaches share a fundamental architectural limitation: they treat text as a bag of words — an unordered collection of tokens with no awareness of how the meaning of each word changes with its neighbors.
Consider the sentence “The stock is not performing as well as the market expected.” The words performing, well, and expected each carry positive connotations in isolation. A TF-IDF + logistic regression model, even with bigrams, will struggle because the negative signal is distributed across the entire sentence structure: not_performing, as_well_as, expected. No single word or two-word phrase captures the essential meaning — the sentence is negative because of the composition of its clauses.
Transformer models address this by computing contextualized representations: every word’s embedding is a function of all other words in the sentence, weighted by their semantic relevance. This is implemented through the self-attention mechanism, which for each position \(i\) in the sequence computes a weighted sum over all positions \(j\):
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
where \(Q\) (queries), \(K\) (keys), and \(V\) (values) are learned linear projections of the input embeddings, and \(d_k\) is the dimension of each key vector; dividing by \(\sqrt{d_k}\) keeps the dot products from growing with dimension and saturating the softmax. The softmax over the scaled dot products \(QK^\top / \sqrt{d_k}\) produces attention weights: for token \(i\), the weight on token \(j\) measures how much token \(j\)’s value vector should contribute to the updated representation of token \(i\).
This mechanism allows the model to learn, for example, that “not” and “performing” should be strongly co-attended when the phrase “not performing” appears — because during pre-training on billions of sentences, the model has seen thousands of examples where negation before a positive verb produces negative sentiment. No rule-writing, no bigram engineering: the attention weights encode this relationship in a data-driven way.
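The computation itself is only a few lines of linear algebra. A toy NumPy sketch of scaled dot-product attention, with random matrices standing in for the learned projections:

# ── Scaled dot-product self-attention on a toy sequence (random stand-in matrices) ──
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4              # 5 tokens, 8-dim embeddings, 4-dim keys/values

X = rng.normal(size=(seq_len, d_model))      # token embeddings for one sentence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) scaled similarity scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
output = weights @ V                         # each row is a context-weighted mix of values

print(weights.round(2))                      # row i: how much token i attends to each token j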
FinBERT: BERT fine-tuned on financial text
BERT (Devlin et al., 2018) is a bidirectional transformer encoder pre-trained on BookCorpus and English Wikipedia using two self-supervised objectives: masked language modeling (predict a randomly masked word from context) and next-sentence prediction. Fine-tuning BERT for sentiment classification adds a single linear classification head on top of the [CLS] token’s contextual embedding:
\[P(y = k \mid \text{text}) = \text{softmax}(\mathbf{W} \cdot \mathbf{h}_{\texttt{[CLS]}} + \mathbf{b})\]
where \(\mathbf{h}_{\texttt{[CLS]}} \in \mathbb{R}^{768}\) is the contextualized representation of the special classification token that BERT prepends to every input.
FinBERT (Araci, 2019; Yang et al., 2020) is BERT further pre-trained on 1.8 billion words of financial text from Reuters, Bloomberg, and SEC filings, then fine-tuned on 4,800 hand-labeled financial news sentences. The result is a model that understands domain vocabulary natively: guidance in the context of lowered guidance is correctly weighted as negative, beat in earnings beat is correctly positive, and volatility in market volatility is neutral rather than negative.
On the FinancialPhraseBank benchmark (Malo et al., 2014), FinBERT achieves approximately 87% accuracy on sentences with analyst-consensus labels — compared to roughly 72% for VADER and 78% for a well-tuned unigram logistic regression on the same benchmark.
Using FinBERT locally
The code below shows how to run FinBERT on a set of financial sentences using the transformers library. This code will not run in the browser — the FinBERT model weights require approximately 500 MB of download and a Python environment with PyTorch installed. Run it on your own machine after installing transformers and torch.
The transformers library and PyTorch are not available in Pyodide (the browser Python engine). This code block is provided for reference and local execution only.
# ── FinBERT sentiment — run locally, NOT in the browser ──────────────────────
# Installation: pip install transformers torch
from transformers import pipeline
# Load the FinBERT pipeline (downloads ~500 MB on first run)
finbert = pipeline(
    task="sentiment-analysis",
    model="ProsusAI/finbert",
    tokenizer="ProsusAI/finbert",
    top_k=None  # return probabilities for all three classes
)

sentences = [
    "The company delivered record earnings, beating analyst estimates by 12%.",
    "Guidance was slashed significantly due to deteriorating macro conditions.",
    "The firm announced a strategic partnership with no material impact expected.",
    "We see limited downside risk and maintain our buy recommendation.",
    "Regulatory headwinds may weigh on profitability over the next 12 months.",
]

results = finbert(sentences)

for sentence, preds in zip(sentences, results):
    # Sort by score descending
    preds_sorted = sorted(preds, key=lambda x: -x["score"])
    label = preds_sorted[0]["label"]
    conf = preds_sorted[0]["score"]
    print(f"[{label:8s} {conf:.2f}] {sentence[:70]}")

Expected output (approximate):
[positive 0.93] The company delivered record earnings, beating analyst estimates by 12%
[negative 0.89] Guidance was slashed significantly due to deteriorating macro conditions
[neutral 0.81] The firm announced a strategic partnership with no material impact expect
[positive 0.77] We see limited downside risk and maintain our buy recommendation.
[negative 0.74] Regulatory headwinds may weigh on profitability over the next 12 months.
Notice that sentence 4 (“limited downside risk”) is correctly scored positive — a lexicon-based approach would likely flag risk as negative and limited as a weak modifier, producing an incorrect negative score. FinBERT understands that “limited downside risk” is a bullish formulation.
Why fine-tuned transformers beat from-scratch logistic regression
The performance advantage of FinBERT over a logistic regression trained from scratch comes from two sources:
Transfer learning. FinBERT’s encoder was pre-trained on 1.8 billion financial words before any sentiment labels were seen. This means its internal representations already encode financial semantics: guidance, impairment, beat, miss, headwind, and tailwind are represented in a geometric space where semantically similar words are close to each other. A logistic regression trained from scratch on 1,000 labeled sentences cannot acquire this semantic structure — it sees each word as an independent coordinate.
Contextual embeddings. Every word’s representation in FinBERT depends on the full sentence context. The word “beat” in “earnings beat” has a different embedding than “beat” in “the stock took a beating”. TF-IDF assigns the same score to both, because the score is a function of the word alone, not of its context. Attention-based models learn to discriminate these usages from labeled examples.
The practical implication: for financial NLP tasks where labeled data is available, fine-tuning FinBERT should be the first model evaluated, not the last resort.
Man Group, the world’s largest listed hedge fund, published a detailed technical report in 2022 describing their transition from VADER + logistic regression to fine-tuned BERT models for earnings-call sentiment. The key finding was that BERT-based models reduced false-negative rate on nuanced bearish statements (e.g., “while we see some positive signs, the near-term environment remains challenging”) by approximately 30% relative to the TF-IDF pipeline. The transition required a 6-week annotation project to generate 15,000 additional labeled paragraphs from historical transcripts, costing approximately $45,000 in annotation fees. The improvement in signal quality justified the investment within two quarters of live deployment.
Building a Daily Sentiment Index from Tweets
From raw tweets to a time series signal
A single tweet’s sentiment score has little signal value. Noise, ambiguity, and idiosyncratic phrasing dominate at the individual level. The power of social-media sentiment analysis comes from aggregation across many independent speakers: the law of large numbers dampens idiosyncratic noise so that the population-level signal — the central tendency of how people feel about a topic on a given day — becomes measurable.
The pipeline for constructing a daily sentiment index has four steps:
- Collect: gather all tweets matching a keyword query (e.g., "Fed" OR "inflation" OR "interest rates") for a target period.
- Score: assign a sentiment score \(s_i \in [-1, +1]\) to each tweet \(i\) using any method (VADER, LM, or FinBERT).
- Filter: optionally, drop tweets with low confidence scores or high uncertainty to remove noise.
- Aggregate: compute the daily mean (or median) sentiment, weighting optionally by retweet count or follower count.
The resulting series \(\bar{s}_t\) for date \(t\) is the daily sentiment index. It can be smoothed with a rolling mean, plotted against asset prices, or used as an input feature in a predictive model.
Live cell: constructing a daily sentiment index
The cell below uses a synthetic set of 56 timestamped tweets spanning one trading week (Monday through Friday), scored with our LM-style scorer, and aggregated to a daily sentiment series.
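The aggregation step in the cell reduces to a pandas resample; a self-contained sketch with synthetic timestamps and scores (the column names are assumptions, not the cell's exact schema):

# ── Aggregating per-tweet scores into a daily sentiment index (synthetic data) ──
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
tweets = pd.DataFrame({
    # 56 tweets spread over Monday..Friday of one trading week
    "timestamp": pd.Timestamp("2024-03-04 08:00") + pd.to_timedelta(np.arange(56) * 2, unit="h"),
    "score": rng.uniform(-1, 1, size=56),    # per-tweet sentiment in [-1, +1]
})

daily = (tweets
         .set_index("timestamp")
         .resample("D")["score"]
         .agg(["mean", "count"])
         .rename(columns={"mean": "sentiment", "count": "volume"}))
daily["sentiment_smooth"] = daily["sentiment"].rolling(3, min_periods=1).mean()
print(daily)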
Interpretation. The upper panel shows the daily sentiment index: bars colored green for net-positive days, red for net-negative days. The rolling average (dark line) smooths out day-to-day noise to reveal the underlying trend — in this synthetic week, sentiment starts negative on Monday (reflecting the start of earnings season with uncertainty), recovers mid-week, and becomes clearly positive by Friday. The lower panel shows tweet volume, which varies by day. In a real deployment, elevated tweet volume is itself an information signal: abnormally high volume often precedes or coincides with price-moving events.
Applications: trading signals and brand health
The daily sentiment index is a versatile intermediate product. Two major applications dominate in practice:
Trading signals. A systematic macro fund might construct an index of central-bank-related tweet sentiment (keywords: Fed, FOMC, interest rates, inflation) and observe whether it leads or lags bond market moves. Sul, Dennis, and Yuan (2017, Decision Sciences) showed that Twitter sentiment predicts 3-day ahead stock returns with statistical significance for individual companies. The effect is larger for stocks with many retail shareholders — suggesting that retail investor opinion, aggregated from Twitter, contains forward-looking information not yet incorporated in prices.
Brand health monitoring. Procter & Gamble, Nike, and Apple all operate “social listening war rooms” — real-time dashboards that track the daily sentiment index for their brand names across Twitter, Reddit, and news aggregators. A sharp negative spike (compound score drop of more than 1.5 standard deviations below the 30-day mean) triggers an escalation protocol: the communications team is alerted, PR responses are drafted, and brand managers assess whether the event warrants a public statement. The Oreo “Dunk in the Dark” Super Bowl tweet (2013) emerged from exactly this real-time listening infrastructure — Oreo’s social media team was monitoring live sentiment during the power outage and responded within minutes.
Coca-Cola’s global social listening platform, built in partnership with Synthesio and later migrated to Sprinklr, processes approximately 800 million social-media mentions per year across 50+ languages. The platform maintains a rolling 30-day baseline sentiment score for each brand variant (Coke, Diet Coke, Coca-Cola Zero Sugar, etc.) in each of 170+ markets. Any deviation greater than two standard deviations from the rolling baseline triggers an automated alert to the regional marketing director. The system detected the first signs of a viral consumer campaign against sugar content in the UK six days before the story was picked up by mainstream media — giving the PR team a six-day head start on drafting a response.
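The alerting rule described in these deployments reduces to a rolling z-score against a trailing baseline. A sketch on a synthetic daily series, using the two-standard-deviation threshold quoted above:

# ── Rolling-baseline brand-sentiment alert (synthetic daily series) ──
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
idx = pd.date_range("2024-01-01", periods=120, freq="D")
sentiment = pd.Series(0.2 + 0.05 * rng.standard_normal(120), index=idx)
sentiment.iloc[90] = -0.4                              # inject a negative shock

baseline_mean = sentiment.rolling(30).mean().shift(1)  # shift(1): baseline excludes today
baseline_std = sentiment.rolling(30).std().shift(1)
z = (sentiment - baseline_mean) / baseline_std

alerts = z[z < -2]                                     # days more than 2 std below the baseline
print(alerts)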
Validation: Does Sentiment Predict Anything?
The gap between correlation and forecastability
Constructing a sentiment index is technically straightforward. Demonstrating that it predicts something — stock returns, earnings surprises, consumer spending, election outcomes — is an entirely different, and far harder, problem.
The most common failure mode in sentiment-based research is spurious correlation: a sentiment index that appears to correlate with an outcome variable in a given historical window, but does so for accidental reasons that do not hold out-of-sample. Twitter sentiment might correlate with S&P 500 weekly returns in 2013–2015 simply because both were trending upward during a bull market. A regression on that window would find a positive coefficient with a small p-value — but the relationship disappears when tested on 2016–2022 data.
Three specific biases deserve explicit discussion.
Look-ahead bias
Look-ahead bias occurs when the sentiment signal is constructed using information that would not have been available at the point in time when the trading decision is made. The most common form: a daily sentiment score is constructed from tweets timestamped throughout the day, but the score is compared to the market’s opening move that morning — before most of the tweets were written. Any correlation found is spurious; the sentiment is partially reacting to the price, not predicting it.
A well-specified study timestamps each tweet precisely and constructs a score from tweets posted strictly before the close of the prior trading session. In intraday work, the cutoff might be tweets posted before 9:30 AM Eastern to predict same-day returns. Failing to enforce this temporal barrier is one of the most common errors in published sentiment-based event studies.
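Enforcing that temporal barrier is a one-line filter once every tweet carries a precise timestamp; a sketch with hypothetical column names:

# ── Enforcing a point-in-time cutoff before computing the signal (hypothetical columns) ──
import pandas as pd

tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-05 08:55", "2024-03-05 09:20", "2024-03-05 10:45", "2024-03-05 15:10",
    ]),
    "score": [0.3, -0.1, 0.6, -0.4],
})

# To predict the 2024-03-05 open-to-close return, use only tweets posted before 09:30
cutoff = pd.Timestamp("2024-03-05 09:30")
usable = tweets[tweets["timestamp"] < cutoff]
signal = usable["score"].mean()            # the only score a live system could have computed
print(f"usable tweets: {len(usable)}, signal = {signal:+.2f}")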
Lookback bias in parameter calibration
Related to look-ahead bias is lookback bias in parameter calibration: the choice of which words to include in the lexicon, or what threshold to use for the compound score, is tuned on the same data that is used to test the signal. If you run 50 different lexicon variants on 10 years of data and report the one that works best, you have conducted a massive search over your own test set. The reported accuracy is an in-sample artifact.
The remedy is walk-forward evaluation: calibrate all parameters on a training window, evaluate strictly on a subsequent out-of-sample window, then roll the training window forward and repeat. The out-of-sample evaluations are assembled into a time series that represents the signal’s real-world performance — the performance a practitioner would have experienced had they deployed the system live.
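A minimal sketch of the walk-forward loop, on synthetic data with hypothetical column names; the calibration step here is a simple OLS slope, standing in for whatever parameters the signal actually requires:

# ── Walk-forward evaluation of a sentiment signal (synthetic data, hypothetical columns) ──
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "sentiment": rng.normal(size=500),
    "next_day_return": rng.normal(scale=0.01, size=500),
}, index=pd.date_range("2022-01-03", periods=500, freq="B"))

train_window, test_window = 250, 21        # ~1 year to calibrate, ~1 month out-of-sample, then roll
oos_preds = []

start = 0
while start + train_window + test_window <= len(df):
    train = df.iloc[start : start + train_window]
    test = df.iloc[start + train_window : start + train_window + test_window]

    # Calibrate on the training window only (here: an OLS slope of return on sentiment)
    beta = np.polyfit(train["sentiment"], train["next_day_return"], deg=1)[0]
    oos_preds.append(pd.Series(beta * test["sentiment"], index=test.index))

    start += test_window                   # roll the window forward

oos = pd.concat(oos_preds)
ic = oos.corr(df.loc[oos.index, "next_day_return"])
print(f"out-of-sample correlation between predicted and realized returns: {ic:+.3f}")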
Overfitting the sentiment-return relationship
Even a correctly implemented walk-forward evaluation can produce misleading results if the model specification is chosen based on in-sample performance. A logistic regression with 10,000 TF-IDF features and no regularization, fit on three years of labeled earnings-call paragraphs, will memorize the training data perfectly and generalize poorly. Regularization, vocabulary pruning, and cross-validation are not optional steps; they are the primary defenses against overfitting in high-dimensional text spaces.
The appropriate benchmark for any sentiment signal is always the simplest possible model. Before claiming that your FinBERT fine-tuned classifier generates alpha, you must demonstrate that it outperforms (a) a coin flip, (b) a momentum rule based on price alone, and (c) a simple VADER score. The incremental predictive contribution of each additional layer of complexity must be justified against the additional model risk it introduces.
The academic literature on sentiment and asset returns is heavily affected by publication bias: studies that find statistically significant positive results are published; studies that find null or negative results are not. A careful meta-analysis by Heston and Sinha (2017, Review of Financial Studies) found that after controlling for data-mining, the average predictive coefficient of news sentiment on short-horizon returns drops by approximately 40% relative to first-reported estimates. Practitioners should treat sentiment-return regressions reported in academic papers as upper bounds on real-world signal strength.
Limitations and What Comes Next
What classical sentiment analysis cannot do
The methods covered in this chapter — lexicon scoring, logistic regression on TF-IDF, and even fine-tuned BERT classifiers — share a fundamental limitation: they assign a fixed label (positive / negative / neutral) to a sentence, but the world rarely presents itself in such clean categories.
Consider these four failure modes that remain beyond the reach of standard sentiment classifiers:
Sarcasm and irony. “Oh, fantastic — the Fed raised rates again.” VADER will score this as positive because fantastic is in its positive lexicon and nothing in the sentence triggers its negation handling; the sarcasm leaves no lexical trace for a rule-based scorer to catch. Even a fine-tuned BERT classifier will misclassify sarcastic sentences unless it was trained on a large labeled corpus of sarcastic financial tweets — which essentially does not exist.
Multi-aspect sentiment. A product review might say: “The hardware is excellent but the software is buggy and the customer service is terrible.” The sentence contains three aspects (hardware, software, customer service) each with a different sentiment. A sentence-level classifier assigns one label to the whole sentence, masking the aspect-level variation that is most useful for product management.
Implicit sentiment. “The company added 50,000 employees this year.” This sentence contains no sentiment words at all. But in the context of a cost-cutting narrative, it is negative (headcount growth is a cost driver). In the context of a growth narrative, it is positive. A transformer model that has read widely about corporate cost structures can infer the sentiment from context; a lexicon scorer cannot.
Reasoning under uncertainty. “If the Fed does not cut rates this year, tech valuations will come under pressure.” The sentiment is conditional. It is not currently negative — it is a negative scenario conditioned on a future event. A classifier that labels this sentence negative creates a false trading signal.
Large Language Models as the natural next step
These four failure modes — sarcasm, multi-aspect sentiment, implicit sentiment, and conditional reasoning — are precisely the capabilities that Large Language Models (LLMs) are designed to address. An LLM prompted with a financial sentence and asked “What is the sentiment of this sentence, and what is the reasoning?” can:
- Recognize sarcasm by matching the utterance against known sarcastic constructions learned from pre-training on internet text.
- Identify multiple aspects and assign separate sentiments to each, formatted as structured output (JSON).
- Infer implicit sentiment by applying domain knowledge about what headcount growth, capex announcements, or regulatory filings typically signal in financial contexts.
- Reason about conditionality and return a structured uncertainty estimate rather than a hard label.
Chapter 3 develops the theory and practice of LLMs for social media text in detail. The jump from fine-tuned sentiment classifiers to prompted LLMs is not merely a quantitative improvement in accuracy — it is a qualitative change in capability: from pattern-matching to language understanding.
JPMorgan’s COiN (Contract Intelligence) platform, launched in 2017, originally used a rule-based NLP system to extract clauses from loan agreements. By 2023, the platform had been rebuilt around GPT-class language models that can read a commercial loan document, identify all covenant clauses, classify each clause as standard or non-standard, and flag clauses that create credit risk — tasks that previously required 360,000 hours of lawyer time per year. The transition from rule-based NLP to LLM-based understanding is the same transition you will study in Chapter 3, applied to legal text rather than social media.
Chapter Summary
This chapter developed the full progression of sentiment analysis techniques, from the oldest to the most powerful:
Lexicon-based methods (Sections 1 and 2) are fast, transparent, and require no labeled data. The Loughran–McDonald dictionary is the standard for financial text; VADER is the standard for social media. Both fail on implicit sentiment and domain boundaries.
Supervised classifiers (Sections 3 and 4) train a logistic regression or similar model on labeled examples, using TF-IDF features. They outperform lexicons when training data is available and the vocabulary is domain-specific. Bigrams partially address the negation problem. The bias-variance tradeoff is the central modeling tension.
Transformer-based models (Section 5) — particularly FinBERT — capture contextualized word meanings and produce significantly higher accuracy on nuanced financial sentences. They require local computation but are available via the transformers library and hosted APIs.
Building a sentiment index (Section 6) is the applied workflow: collect, score, aggregate to daily means, and validate against outcome variables. Look-ahead bias and look-back bias in calibration are the two most dangerous errors in this pipeline.
Validation (Section 7) requires walk-forward evaluation, conservative benchmarking, and honest recognition of the difference between in-sample correlation and out-of-sample forecastability.
Chapter 3 takes the next step: Large Language Models for Social Media. Where sentiment analysis assigns a label, LLMs answer questions — what is being said, about whom, with what certainty, and with what implied action. The gap between the two is the gap between classification and understanding.
Review Questions
- A financial news sentence scores +0.12 under the general lexicon and -0.05 under the LM lexicon. What word class is likely responsible for the discrepancy?
- You run a logistic regression on 200 labeled earnings-call sentences with 5,000 TF-IDF features and \(C = 100\). The training accuracy is 99% and the test accuracy is 61%. What is the diagnosis and the remedy?
- You construct a daily Twitter sentiment index for a retail stock and observe a strong positive correlation with the stock’s weekly return. What three checks must you perform before claiming the sentiment signal is predictive?
- Why does the self-attention mechanism help with the negation problem, while TF-IDF bigrams only partially solve it?
- A brand manager at Nike sees a 3-standard-deviation negative spike in the brand sentiment index. List three possible causes and describe how you would distinguish between them using the raw tweet data.
← Chapter 1: Topic Models · Chapter 3: Large Language Models →