
Chapter 1: Topic Models

Chapter Introduction

Every day, corporate earnings calls generate hundreds of pages of transcript text. Twitter produces roughly 500 million posts. Reuters and Bloomberg push thousands of news headlines before markets open. A portfolio manager who wants to understand what language the market is using — which themes are rising, which companies are being discussed in the same breath as “guidance cut” or “margin pressure” — cannot read all of it. Neither can a compliance team monitoring for reputational risk, nor a social media analyst trying to understand what a brand’s audience actually cares about.

The question that motivates this chapter is deceptively simple: given a large collection of text documents, what are they about?

Manual coding is the traditional answer. A team of research analysts reads every document, assigns categories from a codebook, and produces a labeled dataset. This works, but it is slow, expensive, and does not scale. Supervised classification is a faster alternative — train a model on labeled examples, apply it to new text — but it requires those labeled examples to exist, and it can only assign categories that were anticipated at labeling time. Neither approach discovers topics that no one expected to find.

Topic models take a different path. They make no assumption about what the topics are. Instead, they look at patterns of word co-occurrence across documents and infer latent themes algorithmically. A topic model trained on S&P 500 earnings call transcripts from 2010 to 2024 might produce one topic characterized by the words revenue, growth, guidance, quarter, beat — readily interpreted as a performance and forward guidance topic — and another characterized by costs, margins, supply, inflation, headcount — clearly a cost-structure topic. These are not labels imposed from outside; they emerge from the text itself. And crucially, each document is not assigned to a single topic: it is described as a mixture — 60% performance guidance, 30% cost structure, 10% macro environment — which is a far richer representation of a typical earnings call than any single category label.

This chapter builds the full toolkit, from the most basic representation of text as numbers up to fitting and interpreting Latent Dirichlet Allocation (LDA), the workhorse topic model in academic finance, economics, and social-media research.

Why topic models appear throughout finance and marketing research

The application of topic models to financial text has generated an active literature since Boudoukh et al. (2013) showed that the thematic content of news articles predicts stock returns beyond simple positive/negative sentiment. Hoberg and Phillips (2016) used LDA on 10-K product descriptions to construct a continuous measure of product market competition — a measure that has since accumulated over a thousand citations. The US Federal Reserve uses text analysis of FOMC meeting minutes to track shifts in the policy debate. Goldman Sachs and BlackRock both maintain internal NLP pipelines that classify news and filings by topic and link those topic signals to price movements and factor loadings.

On the social media side, topic models have been used to track public health crises (flu trends on Twitter, Lamb et al. 2013), political opinion formation (Blei & Lafferty 2007), customer complaint clustering for brand monitoring, and trending financial hashtag detection — the last of which appears directly in the course notebook this chapter draws on.

What you will build in this chapter

By the end, you will be able to:

  1. Represent a corpus of text documents as a numerical matrix using bag-of-words and TF-IDF.
  2. Understand the generative story behind Latent Dirichlet Allocation and the role of Dirichlet priors.
  3. Fit an LDA model using scikit-learn on a live in-browser corpus and read the output.
  4. Select the number of topics using perplexity and coherence, and understand why neither metric alone is sufficient.
  5. Apply LDA to earnings-call text and interpret topics as financial features.

All code runs live in your browser. No installations, no API keys, no files to download.


Table of Contents

  1. Text as Numbers: Bag-of-Words
  2. TF-IDF: What Is Unique to This Document?
  3. From Counts to Topics: NMF as a Warm-Up
  4. Latent Dirichlet Allocation
  5. Fitting LDA with scikit-learn
  6. Choosing K: Perplexity and Coherence
  7. Financial Application: Earnings-Call Topics
  8. Limitations and What Comes Next

Text as Numbers: Bag-of-Words

Why machines cannot read

A machine learning algorithm cannot process the sentence “Apple reported strong iPhone sales.” It can only process numbers. The first challenge of any text-analytics pipeline is therefore a representational one: how do you convert a document into a vector of numbers in a way that preserves as much of the document’s meaning as possible?

The simplest approach, and the one that underlies nearly all classical topic models, is the bag-of-words (BoW) model. The name is apt: imagine pouring the words of a document into a bag and shaking it. Order is lost — “Apple is great” and “great is Apple” produce the same bag — but the counts of each word survive. A document becomes a vector of word frequencies indexed by the vocabulary of the entire corpus.

Despite its apparent crudeness, the bag-of-words representation has powered remarkably effective models for two decades. For topic discovery, word order matters less than one might expect: if a document uses the words revenue, guidance, and quarter frequently together, it is likely discussing financial performance regardless of how those words are arranged syntactically.

Pre-processing: the steps before counting

Raw text is noisy. Before counting words, four pre-processing steps are standard in almost every NLP pipeline:

Lowercasing. “Apple” and “apple” are the same word for our purposes. Converting all text to lowercase collapses these variants.

Tokenization. Splitting a string into individual tokens (words or sub-words). A simple approach splits on whitespace: the sentence “Q3 revenue was $4.37B” becomes the token list ["Q3", "revenue", "was", "$4.37B"]. Splitting on punctuation as well would break “$4.37B” into pieces, which is one reason more sophisticated tokenizers handle contractions, hyphens, and domain-specific terms explicitly.

Stopword removal. Common words like “the,” “is,” “of,” and “and” appear in every document and carry no discriminative information for topic discovery. A stopword list contains these high-frequency, low-information words; removing them typically eliminates 30–50% of all tokens and sharpens the signal.

Stemming or lemmatization. “running,” “runs,” and “ran” are morphological variants of the same root. Stemming applies heuristic suffix-stripping rules (Porter stemmer) to collapse them to a common form; lemmatization uses a vocabulary lookup to find the canonical dictionary form. For financial text, stemming is common; for social media text, lemmatization often performs better because informal language has high morphological diversity.

After these four steps, each document is a list of normalized tokens ready to be counted.

Building a term-document matrix

A corpus of \(D\) documents, each pre-processed into tokens, produces a vocabulary \(V\) — the set of all unique tokens across all documents. The term-document matrix \(\mathbf{X}\) has \(D\) rows and \(|V|\) columns, where \(X_{d,t}\) is the count of term \(t\) in document \(d\).

\[X_{d,t} = \text{count of term } t \text{ in document } d\]

This matrix is almost always sparse — most documents use only a small fraction of the total vocabulary. A corpus of 10,000 earnings call paragraphs against a vocabulary of 15,000 words might have 99% of its matrix cells equal to zero. Sparse matrix formats (scipy CSR) store only the non-zero entries, making large corpora tractable in memory.

The cell below builds a tiny corpus, runs lowercasing, tokenization, and stopword removal inside CountVectorizer (stemming is skipped, as is typical in scikit-learn pipelines), and prints the resulting term-document matrix. Read it carefully: each row is a document, each column is a word from the vocabulary, and each cell is a count.
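
The sketch below shows one way such a cell might look. The eight sentences are illustrative stand-ins (not the notebook’s original corpus), chosen so that the co-occurrence pattern described next — for example, “revenue” in Doc1 and Doc5 — holds.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus: eight short earnings-call-style sentences
corpus = [
    "Revenue growth was strong this quarter and revenue beat guidance.",        # Doc1
    "Margins compressed this quarter as input costs rose across the business.", # Doc2
    "Management raised full year guidance after a solid quarter.",              # Doc3
    "Inflation and rising interest rates weighed on consumer demand.",          # Doc4
    "The company reported record revenue for the quarter.",                     # Doc5
    "Supply chain pressure hurt margins despite steady demand.",                # Doc6
    "Analysts asked about headcount and the cost structure going forward.",     # Doc7
    "The Fed signalled more rate hikes as inflation stays elevated.",           # Doc8
]

# Lowercasing, tokenization, and stopword removal happen inside CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)          # sparse D x |V| term-document matrix

dtm = pd.DataFrame(
    X.toarray(),
    index=[f"Doc{i}" for i in range(1, len(corpus) + 1)],
    columns=vectorizer.get_feature_names_out(),
)
print(dtm)
```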

Reading the output. Each row is one document; each column is a term from the vocabulary (stopwords like “the,” “in,” “and” have been removed). A cell value of 2 means that term appeared twice in that document; a 0 means it did not appear at all. Notice how “revenue” appears in Doc1 and Doc5, and “margins” appears in Doc2 and Doc6 — this co-occurrence pattern is exactly the signal that topic models exploit to group documents.

In practice

Bloomberg Terminal’s natural-language search and Goldman Sachs’s internal NLP scoring system both start from a bag-of-words representation — though augmented with financial entity normalization (ticker disambiguation, date normalization, number parsing). The core bag-of-words matrix, often called the “document-term matrix” or DTM in finance research, is the standard input to LDA in academic studies. When researchers at the Fed analyze FOMC minutes or when Hoberg and Phillips analyze 10-K filings, the first artifact produced is a term-document matrix of exactly this form.

Exercise

Modify the corpus list above by adding two sentences of your own about a company or social-media topic you know well. Re-run the cell and observe how the vocabulary and matrix change. Which new words appear? Which existing counts change?

Before running the cell below, predict: if you add the word “revenue” to Doc4, will the count in column “revenue” for row Doc4 change from 0 to 1? What happens to the total vocabulary size?

TF-IDF: What Is Unique to This Document?

The problem with raw counts

Raw word counts have a well-known defect: they reward common words. In a corpus of earnings call transcripts, the word “company” appears in virtually every document and accumulates high counts everywhere. A topic model fed raw counts will have “company” appear as a top word in almost every topic — not because it is informative, but because it is ubiquitous. The word tells us nothing about what distinguishes one document from another.

The solution is to downweight words that are common across many documents while preserving the signal from words that are rare in the corpus but prominent in a specific document. This is the intuition behind Term Frequency–Inverse Document Frequency, or TF-IDF.

The formal definition

Let \(\text{tf}_{t,d}\) be the count of term \(t\) in document \(d\) (or its normalized version), and let \(\text{df}_t\) be the number of documents in the corpus that contain term \(t\) at least once. With \(N\) total documents, the TF-IDF weight is:

\[\text{tfidf}_{t,d} = \text{tf}_{t,d} \cdot \log\frac{N}{\text{df}_t}\]

The second factor, \(\log(N / \text{df}_t)\), is the inverse document frequency (IDF). When \(\text{df}_t = N\) (the term appears in every document), the IDF equals \(\log(1) = 0\), and the weight collapses to zero regardless of how often the term appears in \(d\). When \(\text{df}_t = 1\) (the term appears in only one document), the IDF equals \(\log(N)\), its maximum — rewarding the term for being highly discriminative. The logarithm dampens the effect of very rare terms that might otherwise dominate the matrix.

In practice, sklearn uses a slightly smoothed variant:

\[\text{tfidf}_{t,d} = \text{tf}_{t,d} \cdot \left(1 + \log\frac{N+1}{\text{df}_t + 1}\right)\]

The smoothing acts as if one extra document contained every term, which prevents division by zero, and the 1 added to the logarithm keeps terms that appear in every document from being zeroed out entirely.

After computing TF-IDF weights, each document row is typically L2-normalized — divided by its Euclidean norm — so that documents of different lengths are comparable. Two documents with the same mix of distinctive words but different total lengths will produce similar unit vectors.

TF-IDF in practice
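
The sketch below is a minimal comparison of raw counts and TF-IDF weights. It assumes the illustrative `corpus` list from the bag-of-words cell above is still defined; the three terms compared are arbitrary choices.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumes the illustrative `corpus` list from the bag-of-words cell above is still defined
count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")   # smoothed IDF and L2 row normalization by default

counts = count_vec.fit_transform(corpus)
tfidf = tfidf_vec.fit_transform(corpus)

doc_names = [f"Doc{i}" for i in range(1, len(corpus) + 1)]
count_df = pd.DataFrame(counts.toarray(), index=doc_names, columns=count_vec.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf.toarray(), index=doc_names, columns=tfidf_vec.get_feature_names_out())

terms = ["quarter", "revenue", "inflation"]          # illustrative terms to compare
print("Raw counts:")
print(count_df[terms])
print("TF-IDF weights:")
print(tfidf_df[terms].round(2))
```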

Reading the output. Compare this to the raw-count matrix from the previous section. Words that appear in many documents — such as “quarter” — have lower TF-IDF scores than in the raw count representation. Words that appear in just one or two documents — such as “inflation” (Docs 4 and 8), or “margins” (Docs 2 and 6) — have higher TF-IDF scores relative to their raw counts. This makes the matrix a better representation of what is distinctive about each document.

In practice

BlackRock’s Aladdin platform and JPMorgan’s NLP research group both use TF-IDF as a baseline feature for document comparison and retrieval before applying more sophisticated embeddings. In earnings-call research, academics frequently use TF-IDF-weighted vectors as inputs to their topic models because the downweighting of common boilerplate language (“thank you for joining the call today”) sharpens the topical signal. The Loughran-McDonald finance-specific stopword list (which removes words like “company,” “year,” “quarter” that are highly common in 10-Ks) can be thought of as a domain-specific extension of the IDF principle: remove the terms that are so common in financial text that they add no discriminative value.

Exercise

In the comparison cell above, add the word “margins” to the list of terms to compare. Does it appear in more or fewer documents than “revenue”? How does that affect the TF-IDF score relative to the raw count? Does the pattern match the formula \(\log(N / \text{df}_t)\)?

From Counts to Topics: NMF as a Warm-Up

The matrix factorization view of topic modeling

Before introducing LDA — which requires a probabilistic argument — it helps to see the topic-modeling problem through a linear-algebra lens. The term-document matrix \(\mathbf{X}\) (shape \(D \times V\)) encodes the full information in the corpus. Topic modeling asks: can we find a low-rank approximation

\[\mathbf{X} \approx \mathbf{W} \mathbf{H}\]

where \(\mathbf{W}\) (shape \(D \times K\)) encodes how much each document expresses each of \(K\) topics, and \(\mathbf{H}\) (shape \(K \times V\)) encodes what each topic looks like in terms of vocabulary? If \(K \ll V\), we have compressed the corpus from a high-dimensional vocabulary space into a \(K\)-dimensional topic space.

Non-negative Matrix Factorization (NMF) is the simplest algorithm that does exactly this, with the constraint that all entries of \(\mathbf{W}\) and \(\mathbf{H}\) must be non-negative. The non-negativity constraint is crucial: it forces the factorization to represent documents as additive mixtures of topics rather than subtractive combinations, which makes the resulting topics interpretable as latent themes.

NMF minimizes the reconstruction error: \[\min_{\mathbf{W}, \mathbf{H} \geq 0} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2\]

where \(\|\cdot\|_F\) denotes the Frobenius norm. The algorithm alternates between updating \(\mathbf{W}\) (with \(\mathbf{H}\) fixed) and updating \(\mathbf{H}\) (with \(\mathbf{W}\) fixed) until convergence.

NMF is a useful warm-up: the code is short, the output is immediately interpretable, and it makes clear what “topics as columns” and “document loadings” mean. LDA produces the same conceptual output — topics as distributions over words, documents as mixtures of topics — but with a much richer probabilistic story.
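
A minimal sketch of such a warm-up, reusing the `tfidf` matrix and `tfidf_vec` vectorizer from the previous section’s cell (an assumption of notebook-style continuity) with K = 2 chosen purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Assumes `tfidf` and `tfidf_vec` from the TF-IDF cell above are still defined
K = 2                                                # illustrative number of topics
nmf = NMF(n_components=K, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(tfidf)                         # document-topic loadings, shape D x K
H = nmf.components_                                  # topic-word weights,      shape K x |V|

vocab = tfidf_vec.get_feature_names_out()
for k in range(K):
    top_words = vocab[np.argsort(H[k])[::-1][:5]]    # five highest-weight words per topic
    print(f"Topic {k}: {', '.join(top_words)}")

doc_names = [f"Doc{i}" for i in range(1, W.shape[0] + 1)]
print(pd.DataFrame(W.round(2), index=doc_names, columns=[f"Topic {k}" for k in range(K)]))
```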

Reading the output. Each row of the W matrix tells you how much each document expresses each topic. Doc1 (about revenue and growth) should load heavily on the topic whose top words include “revenue” and “growth.” Doc4 and Doc8 (about inflation and interest rates) should load on the other topic. This is the core output of any topic model: a decomposition of the corpus into \(K\) themes and a description of each document as a mixture of those themes.

The limitation of NMF is that it gives no probabilistic interpretation and no principled way to choose \(K\). LDA addresses both.

Latent Dirichlet Allocation

The generative story

Latent Dirichlet Allocation (Blei, Ng & Jordan, 2003, Journal of Machine Learning Research) is a probabilistic generative model. Instead of minimizing a reconstruction error, it posits a specific process by which the corpus was generated — a generative story — and then asks: given the words we observe, what are the most probable topic assignments?

The generative story for LDA proceeds as follows. There are \(K\) topics (a fixed hyperparameter). Each topic \(k\) is a probability distribution over the vocabulary \(V\): \(\boldsymbol{\phi}_k \in \Delta^{|V|-1}\), where \(\Delta^{|V|-1}\) denotes the \((|V|-1)\)-dimensional probability simplex. Each document \(d\) is a probability distribution over the \(K\) topics: \(\boldsymbol{\theta}_d \in \Delta^{K-1}\).

To generate document \(d\) with \(N_d\) words, LDA imagines the following sequence:

  1. Draw a topic proportion vector \(\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})\). This says: each document has its own mixture of topics, drawn from a Dirichlet distribution with concentration parameter \(\boldsymbol{\alpha}\).
  2. For each of the \(N_d\) word positions \(n = 1, \ldots, N_d\):
    1. Draw a topic assignment \(z_{d,n} \sim \text{Categorical}(\boldsymbol{\theta}_d)\).
    2. Draw a word \(w_{d,n} \sim \text{Categorical}(\boldsymbol{\phi}_{z_{d,n}})\).

The topic-word distributions \(\boldsymbol{\phi}_k\) are themselves drawn from a Dirichlet prior: \[\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta}), \quad k = 1, \ldots, K\]

In compact notation:

\[p(\mathbf{W}, \mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{k=1}^K p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \prod_{d=1}^D p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \prod_{n=1}^{N_d} p(z_{d,n} \mid \boldsymbol{\theta}_d)\, p(w_{d,n} \mid \boldsymbol{\phi}_{z_{d,n}})\]

The observed variables are the words \(\mathbf{W}\). The latent variables — what we want to infer — are the topic assignments \(\mathbf{Z}\), the document-topic proportions \(\boldsymbol{\Theta}\), and the topic-word distributions \(\boldsymbol{\Phi}\).
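
To make the generative story concrete, the toy simulation below samples a single short document. The six-word vocabulary, K = 2, α = 0.5, and β = 0.1 are illustrative assumptions, not values used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["revenue", "growth", "guidance", "inflation", "rates", "fed"]
V, K, N_d = len(vocab), 2, 10
alpha, beta = 0.5, 0.1

# Draw each topic's word distribution phi_k ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * V, size=K)           # shape K x V

# Step 1: draw the document's topic proportions theta_d ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * K)                # shape K

# Step 2: for each word position, draw a topic z, then a word w given z
words = []
for _ in range(N_d):
    z = rng.choice(K, p=theta)                    # topic assignment for this position
    w = rng.choice(V, p=phi[z])                   # word drawn from that topic's distribution
    words.append(vocab[w])

print("theta_d:", np.round(theta, 2))
print("document:", " ".join(words))
```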

The role of the Dirichlet prior

The Dirichlet distribution is the natural prior for probability vectors. A symmetric \(\text{Dirichlet}(\alpha)\) with \(\alpha < 1\) produces sparse distributions — most probability mass concentrated on a few components. For topic models, this translates into two desirable properties:

  • Sparse document-topic proportions (\(\alpha < 1\)): each document is primarily about a small number of topics, not spread uniformly over all \(K\). This matches intuition: a Reuters news article about the Federal Reserve is mostly about monetary policy, not equally about all 20 topics in the corpus.
  • Sparse topic-word distributions (\(\beta < 1\)): each topic is characterized by a small core vocabulary, not spread over the entire dictionary. The “earnings beat” topic is mostly about “revenue,” “beat,” “guidance,” and “expectations,” not all 10,000 words.

A symmetric \(\text{Dirichlet}(\alpha)\) with \(\alpha > 1\) produces smooth, near-uniform distributions, and \(\alpha = 1\) is flat over the simplex. The choice of \(\alpha\) and \(\beta\) is a modeling decision, but common defaults (\(\alpha = 50/K\), \(\beta = 0.01\)) produce interpretable topics in practice.
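
A quick way to see the effect of the concentration parameter is to draw samples from a symmetric Dirichlet at a few illustrative values of α:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
for alpha in (0.1, 1.0, 10.0):
    sample = rng.dirichlet([alpha] * K)           # one draw from a symmetric Dirichlet over K topics
    print(f"alpha = {alpha:>4}: {np.round(sample, 2)}")
# alpha = 0.1  -> most mass on one or two topics (sparse)
# alpha = 10.0 -> near-uniform across all five topics (smooth)
```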

Inference: finding the latent variables

The posterior distribution \(p(\mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi} \mid \mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\beta})\) is intractable to compute in closed form because the normalizing constant requires summing over exponentially many topic-assignment configurations. Two approximate inference methods dominate practice:

Collapsed Gibbs sampling (Griffiths & Steyvers, 2004) analytically integrates out \(\boldsymbol{\Theta}\) and \(\boldsymbol{\Phi}\) and samples the topic assignments \(z_{d,n}\) one at a time from their conditional distribution:

\[p(z_{d,n} = k \mid \mathbf{z}_{-dn}, \mathbf{w}) \propto (n_{d,k}^{-dn} + \alpha) \cdot \frac{n_{k,w_{d,n}}^{-dn} + \beta}{n_k^{-dn} + V\beta}\]

where \(n_{d,k}^{-dn}\) is the count of tokens in document \(d\) assigned to topic \(k\), excluding the current token; \(n_{k,w}^{-dn}\) is the count of word \(w\) assigned to topic \(k\) across the corpus; and the \(-dn\) superscript denotes exclusion of the current assignment. This update has a natural interpretation: a word is more likely to be assigned to topic \(k\) if (a) the document already uses topic \(k\) heavily and (b) the word is common in topic \(k\) across the corpus.
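
The update rule is compact enough to implement directly. The sketch below is a toy collapsed Gibbs sampler on a hand-coded corpus of four tiny documents; the token ids, K, α, β, and iteration count are all illustrative assumptions. scikit-learn does not use this algorithm, but seeing the count bookkeeping makes the formula concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus as lists of token ids over a vocabulary of V = 8 words
docs = [[0, 1, 2, 1], [0, 0, 3, 2], [4, 5, 6, 5], [4, 6, 6, 7]]
V, K, alpha, beta, n_iter = 8, 2, 0.1, 0.01, 200

# Count tables: n_dk (doc-topic), n_kw (topic-word), n_k (tokens per topic)
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initial topic assignment for every token
z = [rng.integers(K, size=len(doc)) for doc in docs]
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1

for _ in range(n_iter):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            # Remove the current assignment from the counts (the "-dn" superscript)
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Conditional distribution over topics from the update rule above
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # Add the new assignment back into the counts
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1

# Point estimates of theta (doc-topic) and phi (topic-word) from the final counts
theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
print(np.round(theta, 2))
```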

Variational inference (Blei et al., 2003) approximates the true posterior with a simpler distribution \(q(\mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})\) that factorizes across documents and topics, and optimizes the Evidence Lower Bound (ELBO):

\[\mathcal{L} = \mathbb{E}_q[\log p(\mathbf{W}, \mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})] - \mathbb{E}_q[\log q(\mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})]\]

scikit-learn’s LatentDirichletAllocation uses variational inference, which is typically faster than Gibbs sampling and scales better to large corpora. The Gensim library (used in the original course notebook) defaults to online variational inference with mini-batches. Both approaches produce the same conceptual output: topic-word distributions \(\boldsymbol{\Phi}\) and document-topic proportions \(\boldsymbol{\Theta}\).

In practice

The original LDA paper (Blei, Ng & Jordan, 2003) has been cited over 35,000 times — one of the most influential papers in machine learning. Its application to financial text has been equally influential: Hoberg and Phillips (2016, Journal of Finance) used LDA on 10-K product descriptions to build the Text-based Network Industries and Strategic Positioning (TNIC) dataset, now a standard resource in corporate finance research. Researchers at the Federal Reserve have applied LDA to FOMC minutes to track how the policy debate shifts over time. Goldman Sachs Asset Management uses topic modeling on analyst reports to identify thematic shifts in consensus views before price moves.

A plate diagram in words

A plate diagram is the standard way to visualize a probabilistic graphical model. Nodes represent variables (shaded for observed, unshaded for latent), arrows represent conditional dependencies, and rectangles (“plates”) indicate repetition. For LDA:

  • The outer plate repeats over \(K\) topics: each topic \(k\) draws its word distribution \(\boldsymbol{\phi}_k\) from \(\text{Dirichlet}(\beta)\).
  • The document plate repeats over \(D\) documents: each document \(d\) draws its topic proportion \(\boldsymbol{\theta}_d\) from \(\text{Dirichlet}(\alpha)\).
  • The inner word plate repeats over the \(N_d\) words in document \(d\): each word position draws a topic \(z_{d,n}\) from \(\boldsymbol{\theta}_d\), then draws the observed word \(w_{d,n}\) (shaded, because it is observed) from \(\boldsymbol{\phi}_{z_{d,n}}\).

The key insight the plate diagram makes visible: \(w_{d,n}\) is the only observed variable. Everything else — the topic assignments \(z_{d,n}\), the document-topic proportions \(\boldsymbol{\theta}_d\), and the topic-word distributions \(\boldsymbol{\phi}_k\) — must be inferred from the words.

Fitting LDA with scikit-learn

From theory to code

The sklearn.decomposition.LatentDirichletAllocation class implements online variational Bayes. Its key hyperparameters are:

  • n_components: the number of topics \(K\). This is the primary modeling decision and is discussed in the next section.
  • doc_topic_prior: the symmetric Dirichlet concentration for document-topic distributions (\(\alpha\)). Default is 1/n_components.
  • topic_word_prior: the symmetric Dirichlet concentration for topic-word distributions (\(\beta\)). Default is 1/n_components.
  • max_iter: number of variational EM iterations. For small corpora, 20–50 iterations are usually sufficient; larger corpora may need more.
  • random_state: seed for reproducibility. Always set this when publishing results.

The .fit_transform(X) method returns the document-topic matrix \(\mathbf{W}\) (shape \(D \times K\)), where each row sums to approximately 1 (the proportion of each topic in that document). The .components_ attribute contains the topic-word matrix \(\mathbf{H}\) (shape \(K \times V\)); these are not normalized probabilities but unnormalized counts that must be divided by their row sums to get \(\boldsymbol{\phi}_k\).

The interpretation step

A critical point that textbooks sometimes gloss over: the numbers coming out of LDA are not self-interpreting. The algorithm produces topics as probability distributions over vocabulary words; it does not label those topics. The analyst must read the top words for each topic and assign a human-readable label. This is fundamentally a qualitative judgment. Two analysts looking at the same top-10 word list might disagree about what the topic is called, and both might be right — the label is a shorthand, not a ground truth.

In practice, topic interpretation involves:

  1. Printing the top 10–15 words per topic by probability.
  2. Reading 5–10 representative documents with high loading on that topic.
  3. Assigning a tentative label and checking it against examples.
  4. Iterating — adjusting \(K\) or pre-processing choices if topics are incoherent.
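
The sketch below shows what such a fitting cell might look like. The 16 short headlines are illustrative stand-ins for the notebook’s corpus (four per theme, so that K = 4 has a chance of recovering them); on a corpus this small the exact topics depend on the random seed.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus16 = [
    # corporate earnings
    "Apple beats on iPhone revenue and raises full year guidance",
    "Microsoft cloud revenue growth tops guidance for the quarter",
    "Apple guidance points to stronger services revenue next quarter",
    "Microsoft raises guidance as revenue growth accelerates",
    # macro / monetary policy
    "Fed raises rates again as inflation stays above target",
    "Inflation data pushes the Fed toward further rate hikes",
    "Markets expect the Fed to hold rates as inflation cools",
    "Rate hikes slow inflation but weigh on growth expectations",
    # energy / commodities
    "Oil prices jump after OPEC announces crude output cuts",
    "Crude rallies as OPEC extends oil production cuts",
    "OPEC supply cuts keep oil and crude prices elevated",
    "Falling crude inventories lift oil prices despite OPEC talks",
    # credit / recession risk
    "Bank credit losses rise as GDP growth slows sharply",
    "Weak GDP print fuels recession fears and bank credit concerns",
    "Credit conditions tighten as banks brace for a GDP contraction",
    "Recession risk grows as GDP slows and bank lending contracts",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(corpus16)

K = 4
lda = LatentDirichletAllocation(n_components=K, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(X)                  # D x K matrix; each row sums to ~1

# Normalize components_ (pseudo-counts) into topic-word probabilities phi_k
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
vocab = vec.get_feature_names_out()
for k in range(K):
    top_words = vocab[np.argsort(phi[k])[::-1][:8]]
    print(f"Topic {k}: {', '.join(top_words)}")

print(pd.DataFrame(doc_topic.round(2),
                   index=[f"Doc{i}" for i in range(1, len(corpus16) + 1)],
                   columns=[f"Topic {k}" for k in range(K)]))
```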

Reading the output. With \(K=4\) on this 16-document corpus, you should see topics roughly corresponding to: corporate earnings (Apple, Microsoft, guidance), macro/monetary policy (Fed, rates, inflation), energy/commodities (oil, crude, OPEC), and credit/recession risk (bank, credit, GDP). Notice that topics emerge from co-occurrence patterns — the algorithm never saw the category labels “earnings,” “macro,” or “energy.” It inferred them from which words appear together across documents.

Also read the document-topic proportions. Most short documents load heavily on one or two topics, which is consistent with a sparse Dirichlet prior.
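
A minimal heatmap of the document-topic matrix, assuming `doc_topic` from the fitting sketch above is still in memory:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 6))
im = ax.imshow(doc_topic, cmap="Greys", aspect="auto")   # darker cell = higher topic weight
ax.set_xticks(range(doc_topic.shape[1]))
ax.set_xticklabels([f"Topic {k}" for k in range(doc_topic.shape[1])])
ax.set_yticks(range(doc_topic.shape[0]))
ax.set_yticklabels([f"Doc{i}" for i in range(1, doc_topic.shape[0] + 1)])
fig.colorbar(im, ax=ax, label="topic weight")
plt.tight_layout()
plt.show()
```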

Reading the heatmap. Dark cells indicate high topic weight for that document-topic pair. A document that is almost entirely about one topic will have one very dark cell in its row and three near-white cells. A document that mixes two themes will show two moderately dark cells. This matrix is directly usable as a feature matrix in downstream models — each row is a 4-dimensional feature vector for the corresponding document.

In practice

In academic finance research using earnings-call text (e.g., Hassan et al. 2019 on firm-level political risk; Buehlmaier & Whited 2018 on financial constraints), the standard workflow is: (1) run LDA on the full corpus to get document-topic proportions; (2) assign human-readable labels to topics; (3) use the topic proportion vectors as explanatory variables in a panel regression with stock returns, investment, or CEO compensation as the dependent variable. The topic proportion vector effectively converts an unstructured document into a structured feature that can enter a standard econometric model.

Choosing K: Perplexity and Coherence

The model selection problem

The number of topics \(K\) is the most consequential modeling choice in LDA. Too small, and the topics are coarse amalgamations that lump unrelated themes together. Too large, and topics fragment into near-duplicates and noise. Neither outcome is useful for analysis.

Two quantitative metrics are widely used to guide this choice.

Perplexity

Perplexity is a standard measure from information theory applied to language models. For a held-out test set, it measures how well the trained model predicts those unseen words. Formally, for a test corpus \(\mathcal{D}_{\text{test}}\) with total word count \(M\):

\[\text{Perplexity}(\mathcal{D}_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{D_\text{test}} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D_\text{test}} N_d}\right)\]

Lower perplexity means the model assigns higher probability to the held-out words — a better fit. As \(K\) increases, perplexity on the test set typically decreases, sometimes steeply at first and then flattening. A “kink” or “elbow” in the perplexity-vs-K curve is often taken as a signal of the optimal \(K\).

The well-known limitation: perplexity does not track interpretability. Models with the lowest perplexity do not always produce the most interpretable topics, a finding established in the user study discussed below. Adding more topics almost always reduces perplexity (the model fits better) but may produce topics that no human can label sensibly. Perplexity is a necessary but not sufficient criterion.

scikit-learn’s LatentDirichletAllocation.perplexity() computes this metric on a passed document-term matrix. The related score() method returns an approximate log-likelihood (higher = better); perplexity is the exponential of the negative per-word log-likelihood, so the two move in opposite directions. The cell below sweeps \(K\) and plots held-out perplexity.
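
A sketch of that sweep, reusing the document-term matrix `X` from the fitting cell above (an assumption) with a simple train/test split; on a corpus this small the curve will be noisy.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Hold out a quarter of the documents for perplexity evaluation
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

ks = range(2, 9)
perplexities = []
for k in ks:
    lda_k = LatentDirichletAllocation(n_components=k, max_iter=50, random_state=0)
    lda_k.fit(X_train)
    perplexities.append(lda_k.perplexity(X_test))   # perplexity on the held-out documents

plt.plot(list(ks), perplexities, marker="o")
plt.xlabel("Number of topics K")
plt.ylabel("Held-out perplexity (lower is better)")
plt.show()
```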

Reading the curve. On a small corpus like this, perplexity may decrease monotonically or show a shallow elbow. The key observation is that the steepest drop — if there is one — occurs at a value of \(K\) where adding another topic still captures a genuinely new theme. Once additional topics start producing near-duplicates of existing ones, further perplexity improvement is marginal and topics become less interpretable.

Coherence

Topic coherence measures how semantically consistent the top words of a topic are. The most widely used coherence score, \(C_V\) (Röder et al. 2015), computes pairwise co-occurrence statistics for the top words of each topic using an external reference corpus (typically Wikipedia). Intuitively, a coherent topic has top words that tend to appear together in real text — revenue, beat, guidance, quarter is a coherent set; revenue, ocean, bicycle, algorithm is not.

Coherence scores are not available in scikit-learn because they require an external reference corpus; they are implemented in gensim. However, since gensim is not available in the Pyodide environment, we implement a simplified in-corpus coherence proxy: for each topic, count how often pairs of top words co-occur in the same document, normalized by document frequency. This is the \(C_{UMass}\) coherence (Mimno et al. 2011), which uses the training corpus itself as the reference.
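
A sketch of such a proxy, assuming `X` (document-term matrix) and `phi` (normalized topic-word matrix) from the fitting cell above are available. The score follows the UMass form, averaging log co-occurrence ratios over pairs of each topic’s top words.

```python
from itertools import combinations

import numpy as np

def umass_coherence(phi_k, X, top_n=10):
    """Average log co-occurrence score over pairs of a topic's top words (UMass-style)."""
    top_ids = np.argsort(phi_k)[::-1][:top_n]        # top words, most probable first
    B = (X > 0).toarray().astype(int)                # binary document-word presence matrix
    score, pairs = 0.0, 0
    for i, j in combinations(range(top_n), 2):       # i < j: word i is ranked above word j
        w_hi, w_lo = top_ids[i], top_ids[j]
        d_hi = B[:, w_hi].sum()                      # documents containing the higher-ranked word
        d_both = (B[:, w_hi] & B[:, w_lo]).sum()     # documents containing both words
        score += np.log((d_both + 1) / d_hi)
        pairs += 1
    return score / pairs

for k in range(phi.shape[0]):
    print(f"Topic {k}: UMass coherence = {umass_coherence(phi[k], X):.2f}")
```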

Neither perplexity nor coherence is the final word

Chang et al. (2009, “Reading Tea Leaves: How Humans Interpret Topic Models”) ran a landmark user study showing that LDA models with better held-out likelihood (lower perplexity) actually produce less interpretable topics according to human judges. This finding has been replicated multiple times. The standard recommendation in applied NLP is to use perplexity and coherence as a shortlist criterion — ruling out obviously too-small or too-large values of \(K\) — and then make the final choice by reading the top words and representative documents for each candidate \(K\). On an earnings-call corpus, you might find that \(K=3\) produces vague topics, \(K=5\) produces four clearly labeled topics plus one noise topic, \(K=8\) splits the same themes into near-duplicate topics fragmented by company name, and \(K=5\) is the winner. That judgment cannot be made by a number alone.

In practice

At Bloomberg’s NLP research group and at academic venues like the Journal of Finance, the choice of \(K\) in published studies is often stated as: “we swept \(K\) from 5 to 50, computed perplexity and UMass coherence, and selected \(K=15\) because coherence improved only marginally beyond that point and the topics were clearly interpretable upon manual inspection.” The manual-inspection step is non-negotiable. For a corpus of S&P 500 earnings calls, a \(K\) between 10 and 30 is typical in the published literature (Buehlmaier & Whited, 2018, use \(K=15\); Hassan et al., 2019, use \(K=10\) for political-risk topic training).

Financial Application: Earnings-Call Topics

Why earnings calls matter

Every quarter, publicly traded companies hold earnings calls: the CEO and CFO present results, analysts ask questions, and management responds. These transcripts are a rich, structured source of forward-looking language from corporate insiders. They are approximately 5,000–8,000 words each, they cover the same recurring themes (revenue, costs, competition, macro outlook), and they are available for virtually every S&P 500 company going back to the mid-2000s through providers such as Refinitiv StreetEvents and S&P Global’s Capital IQ.

For financial analysts, earnings calls serve three functions: (1) confirming or contradicting guidance given in the previous quarter; (2) providing qualitative color on numbers that headline figures obscure; (3) signaling management’s interpretation of the competitive and macro environment. Topic models extract the thematic structure of these calls in a way that is scalable — processing 10,000 transcripts takes the same code as processing 10 — and produces features that can be used in quantitative models.

Building a small earnings-call corpus

The cell below uses a set of earnings-call-style snippets modeled on the sample_ccall.csv dataset (which contains real transcripts from Apple Inc. quarterly earnings calls, 2006–2008, sourced from S&P Global). The snippets are short enough to run comfortably in the browser but representative of the language patterns in real transcripts.
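
The sketch below illustrates the shape of such a cell. The eight snippets and their quarter labels are constructed for this illustration (deliberately shifting toward rate-and-inflation language from 2022Q1 onward) and are not drawn from sample_ccall.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

snippets = pd.DataFrame({
    "quarter": ["2021Q1", "2021Q2", "2021Q3", "2021Q4",
                "2022Q1", "2022Q2", "2022Q3", "2022Q4"],
    "text": [
        "Revenue growth was strong and we are raising full year guidance",
        "Product demand drove record revenue and healthy margins this quarter",
        "Services revenue accelerated and guidance remains unchanged",
        "Holiday demand lifted revenue although supply constraints persisted",
        "Rising rates and inflation are weighing on consumer demand",
        "Inflation pressured margins and higher rates slowed new orders",
        "We see continued inflation in input costs and rising interest rates",
        "Rate hikes and persistent inflation remain the key macro headwinds",
    ],
})

vec = CountVectorizer(stop_words="english")
X_calls = vec.fit_transform(snippets["text"])

lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
theta = lda.fit_transform(X_calls)               # one topic-weight row per snippet

weights = pd.DataFrame(theta, columns=["Topic 0", "Topic 1"], index=snippets["quarter"])
ax = weights.plot(marker="o", figsize=(7, 3.5))  # topic-weight time series by quarter
ax.set_xlabel("quarter")
ax.set_ylabel("topic weight")
plt.show()
```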

Reading the time-series chart. Each line represents how strongly one topic featured in the earnings-call paragraphs for that quarter. If the “macro/rates” topic (whichever topic number the algorithm assigns to those concepts) rises from Q1-22 onward — which matches the Federal Reserve’s rate-hiking cycle that began in March 2022 — that is exactly the kind of thematic shift that LDA is designed to detect. In a real production system with thousands of transcripts, you would:

  1. Fit LDA once on the full historical corpus.
  2. For each new quarter’s transcripts, use lda.transform() (not fit_transform()) to project new documents into the existing topic space.
  3. Track the resulting topic-weight time series and use it as an input to return-prediction models or investor-sentiment indices.

From topics to alpha signals

The practical value of earnings-call topic models in finance comes from linking topic weights to subsequent stock returns. The general approach:

  1. For each firm-quarter observation, extract the \(K\)-dimensional topic-proportion vector \(\boldsymbol{\theta}_{firm,t}\).
  2. Run a panel regression: \[r_{firm,t+1} = \alpha + \beta_1 \theta_{firm,t,1} + \cdots + \beta_K \theta_{firm,t,K} + \gamma' \mathbf{X}_{firm,t} + \varepsilon_{firm,t}\] where \(r_{firm,t+1}\) is the abnormal return in the quarter following the call and \(\mathbf{X}_{firm,t}\) are standard controls (size, book-to-market, momentum). A minimal regression sketch follows this list.
  3. A significantly positive \(\beta_k\) indicates that high loading on topic \(k\) predicts positive subsequent returns — suggesting the market underreacts to the thematic information in that topic.
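
The sketch below shows the mechanics of step 2 with statsmodels on fully synthetic placeholder data: the firm-quarter topic proportions, the control, and the return column are all invented for illustration. A real study would merge the LDA output with actual return and control data and cluster standard errors by firm. One topic proportion is omitted from the formula because the \(K\) proportions sum to one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400                                               # synthetic firm-quarter observations
theta = rng.dirichlet([0.5] * 4, size=n)              # placeholder K = 4 topic proportions

panel = pd.DataFrame(theta, columns=[f"theta{k}" for k in range(4)])
panel["size"] = rng.normal(size=n)                    # placeholder control variable
panel["ret_next"] = rng.normal(scale=0.05, size=n)    # placeholder next-quarter abnormal return

# theta0 is omitted because the four proportions sum to one (perfect collinearity otherwise)
model = smf.ols("ret_next ~ theta1 + theta2 + theta3 + size", data=panel).fit(
    cov_type="HC1"                                    # heteroskedasticity-robust standard errors
)
print(model.summary().tables[1])
```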

Hassan et al. (2019, Quarterly Journal of Economics) construct a firm-level political-risk measure from earnings calls using exactly this approach, and show it predicts investment and employment at the firm level. Buehlmaier and Whited (2018, Review of Financial Studies) use LDA on 10-Ks to measure financial constraints. Cohen, Malloy & Nguyen (2020, Journal of Finance) show that unexpected changes in the language of 10-Ks — detected via document similarity rather than LDA per se, but using the same bag-of-words foundation — strongly predict future returns and earnings surprises.

In practice

Refinitiv StreetEvents (the data provider behind the sample_ccall.csv dataset used in the companion notebook) covers over 80,000 earnings-call transcripts for global equities. Subscribing hedge funds and asset managers use the Refinitiv API to pull transcripts within minutes of the call ending and run their NLP pipelines — including topic models — before the market opens the next day. The edge is speed and scale: human analysts can read perhaps 10 transcripts per morning; an NLP pipeline can process 500. The course dataset (sample_ccall.csv) contains 16,956 transcript segments from Apple Inc. calls beginning in Q3 2006, providing a multi-year view of how Apple’s management language evolved around the original iPhone launch and the early App Store era.

Limitations and What Comes Next

The fundamental constraint of the bag-of-words assumption

Every technique in this chapter — bag-of-words, TF-IDF, NMF, and LDA — rests on the same foundational assumption: word order does not matter. A document is a multiset of words, not a sequence of words. This is both the model’s greatest strength and its most significant limitation.

The strength: ignoring order makes the mathematical problem tractable. A count vector is a simple, dense summary of a document. The resulting models are interpretable, computationally efficient, and scale to millions of documents.

The limitation: natural language is fundamentally sequential. Meaning depends on order.

Consider the two sentences:

  • “Revenue growth is not expected to accelerate.”
  • “Revenue growth is expected to accelerate.”

To a bag-of-words model, these sentences are nearly identical — they share the words revenue, growth, expected, accelerate. Only the word not differs, and stopword removal will delete it entirely. A topic model will assign both sentences almost identical topic proportions. A human reader would assign them opposite meanings.
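
A quick check makes the point concrete: with the default English stopword list, “not” is removed and the two sentences collapse to identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = [
    "Revenue growth is not expected to accelerate.",
    "Revenue growth is expected to accelerate.",
]
vec = CountVectorizer(stop_words="english")       # "not", "is", and "to" are all stopwords
X_pair = vec.fit_transform(pair).toarray()
print(vec.get_feature_names_out())
print(X_pair)                                      # the two rows are identical
```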

This example illustrates the negation problem — one of the most well-documented failure modes of bag-of-words-based sentiment and topic analysis. It is not a fringe case; negation is common in financial and legal text, where precise qualification of statements matters (“we are not raising guidance,” “management does not expect margin improvement”).

Other limitations

Polysemy and context. The word “bank” can mean a financial institution, a river bank, or the act of turning an aircraft. A bag-of-words model cannot distinguish these. In a financial corpus, this matters less — “bank” almost always means the financial institution — but in social media text, where puns, slang, and emoji-laden language are common, polysemy is a severe problem.

Topic drift over time. A static LDA model trained on a corpus from 2010–2015 may not capture topics that emerged post-2020 (remote work, AI adoption, supply chain disruption). Dynamic topic models (Blei & Lafferty, 2006) extend LDA to allow topics to evolve over time and are used in the companion notebook Practice_TrendingInFinancialMarket.ipynb with S&P 500 price-movement data.

The topic-labeling bottleneck. As noted above, topics must be labeled by a human analyst. With \(K=20\) topics and a need to re-run the model as the corpus evolves, this creates a recurring maintenance burden. In production systems at firms like Two Sigma and AQR, this is managed by a combination of automated coherence monitoring and periodic manual review sessions.

Short documents. LDA works best when documents are long enough to contain reliable co-occurrence statistics. Tweets are typically 15–30 words — far below the length at which LDA provides stable estimates. For short social-media text, alternative approaches (biterm topic model, BTM; neural topic models) handle the sparsity problem better.

What comes next: sentiment and language models

This chapter has treated text as an unordered collection of words and extracted what topics a document discusses. The next two chapters ask deeper questions:

Chapter 2 (Sentiment Analysis) asks: what is the attitude expressed in the text — positive, negative, or neutral? It introduces lexicon-based methods (Loughran-McDonald for financial text, VADER for social media), supervised classifiers, and transformer-based models (FinBERT, RoBERTa) that read text sequentially and therefore handle negation, sarcasm, and context-dependence that bag-of-words cannot.

Chapter 3 (Large Language Models) asks: can we represent text in a way that captures full semantic and syntactic meaning? It introduces word and document embeddings, the transformer architecture, and how large pretrained models (BERT, GPT) can be used for zero-shot classification, semantic search, and generation — the tools behind Bloomberg’s BloombergGPT and the new generation of financial AI applications.

The progression is: bag-of-words (what words appear?) → topic model (what themes are present?) → sentiment (what attitude is expressed?) → LLM (what does it fully mean?). Each step adds representational richness at the cost of computational complexity and interpretability. The right level of the hierarchy depends on your task, corpus size, and tolerance for model opacity.

In practice

The research firm Man Institute (Man Group’s research arm) published a 2022 study comparing LDA-based topic signals, FinBERT-based sentiment signals, and GPT-3 zero-shot classification on the same earnings-call corpus. For long-horizon return prediction (3–12 months), LDA topic weights performed comparably to the more expensive transformer-based methods. For short-horizon prediction (1 day around the call), transformer sentiment dominated. The conclusion: use topic models for structural, slow-moving signals; use transformer sentiment for event-driven, fast-moving signals. For most academic finance research where the dependent variable is quarterly returns, LDA is still the right tool — interpretable, reproducible, and not dependent on API calls to commercial LLM providers.


Chapter Summary

This chapter covered the full pipeline from raw text to interpretable topic models:

Step | Tool | Key idea
Pre-processing | CountVectorizer | Lowercase, tokenize, remove stopwords
Representation | Bag-of-words matrix | Word counts per document
Discrimination | TF-IDF weighting | Downweight words common to all documents
Warm-up decomposition | NMF | Non-negative factorization \(\mathbf{X} \approx \mathbf{WH}\)
Topic model | LDA | Generative model with Dirichlet priors
Fitting | sklearn variational EM | fit_transform() returns doc-topic matrix
Model selection | Perplexity + coherence | Plus human inspection — neither alone is sufficient
Application | Earnings-call topics | Topic-weight time series as financial features

The central takeaway. Topic models are not a black box that produces labels — they are a tool for discovering latent structure that the analyst then interprets. The algorithm finds the patterns; the domain expert names them and decides which ones are actionable. This human-in-the-loop design is not a weakness; it is what makes topic models trustworthy in settings where the cost of a misclassified document — a false signal in a portfolio, a missed compliance flag — is high.

Proceed to Chapter 2: Sentiment Analysis to learn how to go beyond topic discovery and measure the direction of opinion and emotion in the same corpora.


Prof. Xuhu Wan · HKUST ISOM 5640 · Introduction to Text Analytics for News and Social Media