
Chapter 3: Large Language Models for Social Media

About This Chapter

Text is the native language of social media. Every tweet, review, caption, and comment is a sequence of words, and for decades the central challenge of social media analytics was simply: how do we turn words into numbers? Bag-of-words gave us sparse counts. Topic models gave us latent themes. Sentiment lexicons gave us polarity scores. Each approach moved the needle, yet each also left a residue of problems unsolved — words whose meaning depends entirely on context, irony that flips sentiment, topics that shift across time, questions posed in natural language that no SQL query can answer.

Large language models — LLMs — are the answer the field has converged on. A language model is, at its core, a probability distribution over sequences of tokens: it assigns a probability to every possible continuation of a text prefix. What makes large language models different is scale: billions of parameters, trained on hundreds of billions of tokens of human text. At that scale, something qualitative shifts. The model does not merely memorise collocations; it learns, in some implicit sense, facts about the world, grammatical structure across languages, and how concepts relate to one another. The result is a universal text tool — a single pretrained system that, with little or no task-specific training, can classify sentiment, extract named entities, summarise a thread, answer questions about a document, and translate between languages.

The timeline is worth internalising. In 2018, Google published BERT (Bidirectional Encoder Representations from Transformers), the first widely accessible model that used the transformer architecture to produce contextualised word embeddings — embeddings where the representation of “bank” in “river bank” differs from its representation in “bank account.” In the same year, OpenAI published GPT (Generative Pre-trained Transformer), a unidirectional language model for text generation. In 2020, GPT-3 showed that scaling alone — 175 billion parameters — produced a model capable of solving new tasks from a handful of examples, a behaviour called few-shot learning that nobody had predicted. In 2022, the release of ChatGPT demonstrated that instruction fine-tuning — training the model to follow human directions — dramatically improved usability for non-experts. By 2024, models such as GPT-4, Claude 3, and Gemini Ultra were multimodal, handling images alongside text, and capable of extended reasoning across contexts of hundreds of thousands of tokens.

Why does this matter for marketers, analysts, and researchers? Three reasons. First, accessibility: you no longer need labelled training data to build a useful classifier. A well-constructed prompt to an instruction-tuned LLM can replace a custom model trained on ten thousand annotated examples. Second, generality: the same API call that classifies tweet sentiment can also extract product names, generate ad copy variants, or synthesise competitive intelligence from ten earnings calls. Third, scale: the cost of running an LLM inference on a million social-media posts has dropped dramatically — what once required a research lab now fits inside a cloud function. The practical implication is that text analysis, once a specialist skill requiring NLP expertise, has become as accessible as a spreadsheet formula.

This chapter builds understanding from the ground up: tokenization, embeddings, attention, and the transformer architecture — the four mechanical pillars that make LLMs work. We then turn to practice: how to call a pretrained model, how to design prompts for marketing tasks, and how to evaluate outputs. Throughout, the focus is on application. The goal is not to build a language model; the goal is to use one confidently and critically.

Reference

The primary companion for this chapter is Hands-On Large Language Models by Jay Alammar and Maarten Grootendorst (O’Reilly, 2024). Chapter numbers in that book roughly correspond to the sections here. Code examples in python blocks throughout this chapter require transformers and torch; these libraries cannot run in the browser. All such blocks are marked clearly — copy them to Google Colab or a local Python environment to execute them.


Table of Contents

  1. Tokenization
  2. Embeddings and Word Arithmetic
  3. Attention from Scratch
  4. Multi-Head Attention and the Transformer (Conceptual)
  5. Pretraining, Fine-Tuning, and Prompting
  6. Using a Pretrained LLM in Practice
  7. Zero-Shot Classification of Tweets
  8. Embeddings and Semantic Search
  9. Prompting Patterns
  10. Evaluating LLM Outputs
  11. Cost, Latency, and Ethics
  12. Where the Field is Going

Tokenization

Why characters are not enough

Before any language model can process a sentence, the sentence must be converted into a sequence of integers — a list of token IDs that the model’s embedding layer can look up. The most naive choice is to tokenize by character: each letter becomes its own integer. Character tokenization keeps the vocabulary tiny (26 letters plus punctuation and digits), but it creates extremely long sequences — a single tweet of 140 characters becomes a sequence of 140 tokens — and it forces the model to learn spelling from scratch before it can learn meaning.

The other extreme is whitespace tokenization: split on spaces, so “HKUST students love this course” becomes five tokens. This is intuitive, but it creates an enormous vocabulary. Every morphological variant of a word — “run”, “runs”, “ran”, “running”, “runner” — is a separate entry, even though they share most of their meaning. Rare words and names, which appear only a handful of times in training data, receive poor embeddings because there is not enough data to learn from.

Modern LLMs use subword tokenization, which strikes a balance between the two extremes. The dominant algorithm is Byte-Pair Encoding (BPE), introduced by Sennrich et al. (2016) for machine translation and now the standard in GPT, LLaMA, and most other large models.

Byte-Pair Encoding: the idea

BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of symbols into a new symbol. The procedure is:

  1. Represent every word in the training corpus as a sequence of characters, with a special end-of-word marker.
  2. Count every adjacent pair of symbols.
  3. Merge the most frequent pair into a single new symbol and update all occurrences.
  4. Repeat for a fixed number of merges (the target vocabulary size minus the size of the initial character vocabulary).

The result is a vocabulary of subword units that captures common morphemes, prefixes, and suffixes while guaranteeing that any new word can always be decomposed — at worst, into individual characters.

Toy BPE in pure numpy

The live cell below implements the BPE merge loop on a miniature corpus of four words. Read through the output carefully: you will see the merge table being built step by step, and you will see how “unhappiness” decomposes.
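
A minimal pure-Python sketch of that loop — the corpus words, their frequencies, and the merge budget here are illustrative choices, not the exact values of the live cell:

import re
from collections import Counter

# Toy corpus: word -> frequency; each word becomes space-separated characters plus an end-of-word marker
corpus = {"unhappiness": 4, "happiness": 5, "unkind": 3, "kindness": 4}
vocab = {" ".join(word) + " </w>": freq for word, freq in corpus.items()}

def pair_counts(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair (as whole symbols) with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

merges = []
for step in range(12):                       # 12 merges: a small, arbitrary budget
    counts = pair_counts(vocab)
    if not counts:
        break
    best = counts.most_common(1)[0][0]       # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1:2d}: {best}")

print("\nSegmentation after all merges:")
for word in vocab:
    print(" ", word)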

Interpretation. Notice that “unhappiness” does not tokenize as a single unit — instead the merge table discovers its morphological parts. The exact segmentation depends on how many merges are applied and the corpus statistics, but subword structure emerges naturally: prefixes like “un” and suffixes like “ness” become stable tokens because they are highly frequent. A word that never appeared in training — a brand name, a neologism, a typo — can still be tokenized into known subwords, guaranteeing that no input is completely alien to the model.

In practice

The GPT-4 tokenizer uses roughly 100,000 BPE merges, producing a vocabulary of ~100k tokens. The average English word tokenizes into 1.3 tokens. A single tweet of 280 characters is typically 60-80 tokens. At $0.03 per 1,000 input tokens (GPT-4 pricing circa 2024), classifying one million tweets costs roughly $1,800 — down from $18,000 at GPT-3 pricing two years earlier. Token counting directly governs API cost. Always estimate token count before running a large batch; the tiktoken library (OpenAI, pip-installable) counts tokens locally without an API call.
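
A quick sketch of that estimate with tiktoken (run it locally or in Colab after pip install tiktoken; the price per 1,000 tokens is the illustrative GPT-4 figure quoted above, and these example tweets are shorter than the 60–80-token average):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")     # the BPE vocabulary used by GPT-4

tweets = [
    "HKUST students love this course!",
    "Fed signals two more rate hikes this year; markets sell off.",
]
token_counts = [len(enc.encode(t)) for t in tweets]
print("tokens per tweet:", token_counts)

# Back-of-the-envelope batch cost at an assumed $0.03 per 1,000 input tokens
avg_tokens = sum(token_counts) / len(token_counts)
batch_cost = 1_000_000 * avg_tokens / 1_000 * 0.03
print(f"~{avg_tokens:.0f} tokens/tweet  ->  ~${batch_cost:,.0f} for one million tweets")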

Whitespace vs BPE: a direct comparison

Before running, predict: how many tokens does whitespace tokenization produce for the sentence “I’m unhappily unkind”? How many after BPE with the merge table above?
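
One way to check your prediction — note that this sketch substitutes tiktoken's GPT-2 vocabulary for the toy merge table above, so the subword count will differ from what the four-word corpus produces:

import tiktoken

sentence = "I'm unhappily unkind"

whitespace_tokens = sentence.split()
bpe = tiktoken.get_encoding("gpt2")            # GPT-2's BPE merges as a stand-in
bpe_ids = bpe.encode(sentence)

print("whitespace:", len(whitespace_tokens), whitespace_tokens)
print("BPE       :", len(bpe_ids), [bpe.decode([i]) for i in bpe_ids])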


Embeddings and Word Arithmetic

Words as points in space

A vocabulary of 100,000 tokens needs to be represented numerically before any neural network can process it. The naive approach is one-hot encoding: each token gets a vector of length 100,000 with a single 1 and 99,999 zeros. This is exact but catastrophically sparse — two words that mean almost the same thing have orthogonal vectors, and the representation carries zero semantic information.

Word embeddings solve this by mapping each token to a dense vector in \(\mathbb{R}^d\) where \(d\) is small (typically 128, 256, or 768). The embedding vectors are learned parameters: they are initialised randomly and then adjusted during training so that words appearing in similar contexts end up nearby in the embedding space. The result is a geometry where semantic similarity corresponds to vector proximity, and where directions in the space encode semantic relationships.

The landmark demonstration was the famous Word2vec analogy published by Mikolov et al. (2013). Trained on Google News, the model’s embeddings satisfied — to a surprisingly tight approximation — the arithmetic:

\[\vec{v}(\text{king}) - \vec{v}(\text{man}) + \vec{v}(\text{woman}) \approx \vec{v}(\text{queen})\]

This is not a mathematical theorem; it is an empirical observation about what the model learns. It suggests that the embedding space encodes a “royalty” direction and a “gender” direction as approximately orthogonal axes, so that adding the gender offset to “king” produces a vector close to “queen”. Similar arithmetic works for capitals (“Paris” − “France” + “Germany” ≈ “Berlin”), verb tenses, and comparative adjectives.

In modern LLMs, embeddings are contextual: the representation of a token depends on its surrounding tokens, not just on the token itself. BERT and GPT do not have a single embedding per word; they produce a different embedding for every occurrence of the word depending on context. But the intuition from static Word2vec embeddings — that directions in the space encode relationships — carries forward.

Hand-crafted embeddings: the arithmetic works

The cell below constructs 4-dimensional embeddings by hand for eight words. The dimensions are deliberately interpretable: (1) royalty, (2) gender (female = +1, male = −1), (3) country, (4) activity. We then verify the king/queen arithmetic and compute a cosine-similarity matrix.
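
A numpy sketch of that construction — the specific values are illustrative, chosen only so that each semantic group separates and the analogy holds:

import numpy as np
import matplotlib.pyplot as plt

# Dimensions: [royalty, gender (female = +1, male = -1), country, activity]
words = ["king", "queen", "man", "woman", "paris", "london", "france", "england"]
E = np.array([
    [3.0, -1.0, 0.0, 1.0],   # king
    [3.0,  1.0, 0.0, 1.0],   # queen
    [0.0, -1.0, 0.0, 2.0],   # man
    [0.0,  1.0, 0.0, 2.0],   # woman
    [0.0,  0.0, 2.0, 1.0],   # paris
    [0.0,  0.0, 2.0, 0.8],   # london
    [0.0,  0.0, 3.0, 0.2],   # france
    [0.0,  0.0, 3.0, 0.1],   # england
])
vec = dict(zip(words, E))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The analogy: king - man + woman should land closest to queen
analogy = vec["king"] - vec["man"] + vec["woman"]
for w in words:
    print(f"cos(king - man + woman, {w:8s}) = {cosine(analogy, vec[w]):+.3f}")

# Cosine-similarity matrix, rendered as a heatmap
normed = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = normed @ normed.T
plt.imshow(sim, cmap="viridis")
plt.xticks(range(len(words)), words, rotation=45)
plt.yticks(range(len(words)), words)
plt.colorbar(label="cosine similarity")
plt.title("Hand-crafted 4-dimensional embeddings")
plt.tight_layout()
plt.show()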

Interpretation. The heatmap makes three facts visible at a glance. First, the diagonal is all 1.00 — a vector has cosine similarity 1 with itself. Second, {king, queen} form a cluster with high mutual similarity, separate from {man, woman} and from {paris, london, france, england} — each semantic group occupies its own region of the space. Third, the analogy arithmetic works because the gender dimension adds cleanly: “king − man” removes the gender component, and “+ woman” adds it back with the opposite sign, landing the resulting vector close to “queen”.

In production embeddings (Word2vec, GloVe, fastText), these same patterns emerge from training on billions of words — no human writes the dimensions. The model discovers that gender, royalty, and geography are useful axes for organising its representation simply because these distinctions predict which words appear together.

In practice

For brand monitoring and competitive intelligence, embedding similarity is a powerful discovery tool. Embed every customer review, cluster by cosine distance, and the clusters reveal coherent themes — delivery complaints, packaging praise, taste comparisons — without any labelled training data. Spotify uses embeddings of listening sequences to power its recommendation engine; LinkedIn uses embeddings of job titles and skills to suggest connections; Amazon uses item embeddings for “customers also bought” carousels. The mathematical machinery is the same as the king/queen analogy — differences in direction encode semantically meaningful relationships.


Attention from Scratch

The context problem

Embeddings from Word2vec and GloVe are static: the embedding for “bank” is the same regardless of whether the surrounding words are “river” and “fishing” or “loan” and “interest rate”. For many applications this is fine, but for anything requiring disambiguation — and most interesting NLP tasks do — static embeddings leak meaning.

The transformer architecture solves this with self-attention: a mechanism that allows each token in a sequence to gather information from every other token and produce a contextualised representation that depends on the full local context. The key insight of Vaswani et al. (2017) — the “Attention Is All You Need” paper — was that this mechanism alone, applied in layers without any recurrence, is sufficient to build a state-of-the-art sequence model. Recurrent networks (RNNs, LSTMs) process tokens one at a time and struggle to connect long-range dependencies because information must propagate through many sequential steps. Attention connects every pair of tokens directly, in a single computation.

The formula

Scaled dot-product attention takes three matrices as input: Q (queries), K (keys), and V (values), each of shape \((n \times d)\) where \(n\) is the sequence length and \(d\) is the embedding dimension. The output is:

\[\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V\]

The product \(QK^\top\) is an \((n \times n)\) matrix of raw attention scores: entry \((i, j)\) measures how much token \(i\) should attend to token \(j\). Dividing by \(\sqrt{d}\) prevents the dot products from growing large in magnitude (and thus pushing softmax into saturating regions) as the embedding dimension increases. The softmax converts each row into a probability distribution over the \(n\) positions. Multiplying by \(V\) produces a weighted average of the value vectors — each output token’s representation is a blend of all input tokens’ values, weighted by how strongly its query attends to each key.

In the encoder-only architecture (BERT), queries, keys, and values are all linear projections of the same input — this is “self”-attention. In the decoder (GPT), future positions are masked out so that the model can only attend to past tokens, preserving the autoregressive property.

Ten-line numpy implementation
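
A sketch of the computation — the attention maths itself is the ten lines in the middle; the tokens, embeddings, and projection matrices are random placeholders rather than learned weights:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

tokens = ["the", "bank", "raised", "interest", "rates"]
n, d = len(tokens), 8                        # sequence length, embedding dimension
X = rng.normal(size=(n, d))                  # token embeddings

# Random projections stand in for the learned W_Q, W_K, W_V
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d)                # (n, n) raw attention scores
weights = softmax(scores, axis=-1)           # each row is a probability distribution
output = weights @ V                         # contextualised token representations

print("attention weights (rows sum to 1):")
print(weights.round(2))

plt.imshow(weights, cmap="viridis")
plt.xticks(range(n), tokens, rotation=45)
plt.yticks(range(n), tokens)
plt.colorbar(label="attention weight")
plt.title("Scaled dot-product attention (random projections)")
plt.tight_layout()
plt.show()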

Interpretation. Each row of the heatmap is a probability distribution — the attention pattern for one query token. A bright cell \((i, j)\) means token \(i\) attends strongly to token \(j\): its output representation will contain a large contribution from token \(j\)’s value vector. With random projection matrices, the pattern is noisy. After training, the patterns become interpretable: a pronoun “it” will attend strongly to its antecedent; an adjective will attend to the noun it modifies; in a financial headline, “fell” will attend to the stock name and the percentage figure.

The self-attention mechanism is the key innovation that makes transformers universal: it does not care about distance in the sequence. Token 1 and token 200 are equally close for the purpose of the attention computation. This is why LLMs handle long-range dependencies — between a pronoun and its referent six sentences earlier, or between a conclusion and the premise that supports it — far better than any RNN.

Common misconception: attention is not interpretable by default

Beginners often assume that the attention heatmap reveals what the model “thinks about.” This is partially true but mostly misleading. Jain & Wallace (2019) and other papers have shown that attention weights do not reliably indicate which input tokens were causally important for the model’s prediction — gradient-based attribution methods are better for that. Attention heatmaps are a useful debugging and illustration tool, but treat them as a window into mechanism, not as ground truth about reasoning.


Multi-Head Attention and the Transformer (Conceptual)

Why one head is not enough

A single attention head produces one set of attention weights — one way of routing information between tokens. But a sentence encodes multiple types of relationships simultaneously: syntactic (subject-verb agreement), semantic (co-reference), positional (nearby tokens tend to be related), and task-specific (in a sentiment task, the negation word matters most). A single head cannot specialise in all of them at once without compromising.

Multi-head attention runs \(h\) independent attention heads in parallel, each with its own learnable projection matrices \(W^Q_i\), \(W^K_i\), \(W^V_i\). The outputs are concatenated and projected back to the model dimension:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O\]

where \(\text{head}_i = \text{Attn}(QW^Q_i, KW^K_i, VW^V_i)\).

In practice, each head operates on a \(d/h\)-dimensional subspace of the embedding. BERT-base uses \(h = 12\) heads over a \(d = 768\)-dimensional space, so each head operates in 64 dimensions. The different heads learn to specialise: in trained BERT, some heads track syntactic dependencies, others track coreference, and others pick up on positional proximity (see Clark et al., 2019, “What Does BERT Look At?”).
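
A numpy sketch of the head-splitting arithmetic — dimensions are kept tiny for readability, and the projections are random rather than learned:

import numpy as np

rng = np.random.default_rng(1)

n, d, h = 5, 8, 2                            # sequence length, model dim, number of heads
d_head = d // h                              # each head works in a d/h-dimensional subspace
X = rng.normal(size=(n, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

heads = []
for _ in range(h):
    # Each head has its own projections W_Q_i, W_K_i, W_V_i (random stand-ins here)
    W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))     # (n, d_head)

W_o = rng.normal(size=(d, d))                # the output projection W^O
multihead = np.concatenate(heads, axis=-1) @ W_o           # back to (n, d)
print("multi-head output shape:", multihead.shape)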

The full transformer block

A transformer layer wraps multi-head attention in a standard engineering pattern:

  1. Layer normalisation before attention: stabilises the distribution of activations, making deep networks trainable.
  2. Residual connection: the input is added back to the attention output (\(x + \text{MultiHead}(x)\)), letting gradients flow directly to early layers.
  3. Position-wise feed-forward network (FFN): a two-layer MLP applied independently to each token position, typically expanding to \(4d\) intermediate dimensions. This is where the model stores factual associations — Geva et al. (2021) showed that FFN weights act like key-value memories.
  4. Second layer norm and residual: same pattern wraps the FFN.

These four components — layer norm, multi-head attention, residual, FFN — constitute one transformer block. A BERT-base model stacks 12 such blocks; GPT-3 stacks 96. Depth is what allows the model to build increasingly abstract representations: lower layers handle syntax; higher layers handle semantics and reasoning.
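
A compact numpy sketch of one pre-LN block, wiring the four components together with random weights (the learnable scale and bias of layer normalisation are omitted for brevity):

import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8
X = rng.normal(size=(n, d))                  # output of the previous layer

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def ffn(x):
    # Position-wise feed-forward: expand to 4d, non-linearity, project back to d
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    return np.maximum(x @ W1, 0) @ W2        # ReLU

# Pre-LN block: norm -> attention -> residual, then norm -> FFN -> residual
h = X + self_attention(layer_norm(X))
out = h + ffn(layer_norm(h))
print("block output shape:", out.shape)      # (n, d), same shape as the input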

Positional encoding

Self-attention is permutation-equivariant by construction — if you shuffle the input tokens, the attention outputs shuffle in the same way, and the model cannot tell that the order changed. Language, however, is not order-invariant: “dog bites man” and “man bites dog” have opposite meanings. Transformers inject positional information by adding a positional encoding to each token embedding. The original paper used fixed sinusoidal functions; modern models (RoPE, ALiBi) learn or compute positional biases that generalise better to long sequences.
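
A sketch of the original sinusoidal scheme — the sequence length and dimension below are arbitrary:

import numpy as np

def sinusoidal_positions(n_positions, d):
    """Fixed sinusoidal encodings from the original transformer paper."""
    pos = np.arange(n_positions)[:, None]            # (n, 1)
    i = np.arange(d // 2)[None, :]                   # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positions(n_positions=50, d=16)
# Added to the token embeddings before the first block:
#   X = token_embeddings + pe[:sequence_length]
print(pe.shape)
print(pe[:3, :4].round(3))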

In practice

For social media analytics, the most practically important architectural fact is the context window: the maximum number of tokens a model can attend to in a single forward pass. GPT-3 had a 2,048-token window; GPT-4 Turbo extended this to 128,000 tokens; Gemini 1.5 Pro supports one million tokens. A context window of 128,000 tokens fits roughly 100 typical news articles. For customer service classification, a short tweet fits easily. For analysing an entire year of earnings calls or a product’s complete review corpus, context length becomes a real engineering constraint, and Retrieval-Augmented Generation (Section 8) is the standard solution.

For a step-by-step visual walkthrough of the transformer architecture, including animated diagrams of multi-head attention, refer to Jay Alammar’s blog post “The Illustrated Transformer” (2018) and Chapters 3–5 of Hands-On Large Language Models.


Pretraining, Fine-Tuning, and Prompting

Three distinct paradigms govern how a practitioner uses a large language model. Understanding the differences matters because they have very different cost, data, and latency profiles.

Pretraining

Pretraining is what creates the “large” in large language models. A model with billions of parameters is trained from random initialisation on hundreds of billions of tokens of text — web pages (Common Crawl), books (Books3, Gutenberg), scientific papers (arXiv), code (GitHub), and curated high-quality sources (Wikipedia, news). The training objective for GPT-style models is next-token prediction (causal language modelling): given the preceding tokens, predict the next one. For BERT-style models it is masked language modelling: randomly mask 15% of tokens and predict the masked ones from the bidirectional context.

Pretraining requires enormous compute — GPT-3 cost roughly $5 million in cloud GPU time in 2020 — and is done once by the model developers. Practitioners never pretrain from scratch; they start from a pretrained checkpoint.

Fine-tuning

Fine-tuning adapts a pretrained model to a specific task by continuing training on a smaller, labelled dataset. A classic example: take BERT’s pretrained weights, add a linear classification head on top of the [CLS] token, and fine-tune on 10,000 labelled tweets to produce a tweet sentiment classifier. Fine-tuning is cheaper than pretraining by many orders of magnitude, but it still requires labelled data and GPU access, and it produces a static model that cannot easily be repurposed for a different task.

Parameter-efficient fine-tuning (PEFT) methods such as LoRA (Hu et al., 2021) freeze most of the model weights and train only a small adapter layer, reducing the cost further. PEFT is now the standard approach for adapting open-source models (LLaMA, Mistral, Falcon) to domain-specific tasks.
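
A sketch of the LoRA workflow with Hugging Face’s peft library — run it in Colab or locally, not in the browser; the base model, rank, and target-module names are illustrative choices:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Pretrained encoder plus a fresh classification head (three sentiment classes)
base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

# Freeze the base weights and train only low-rank adapters on the attention projections
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],     # DistilBERT's query/value projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of the base model

# Training then proceeds exactly like ordinary fine-tuning (e.g. with transformers.Trainer).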

Prompting

The paradigm shift introduced by GPT-3 and formalised by instruction-tuned models (InstructGPT, ChatGPT) is that pretraining alone creates models capable of solving new tasks from natural language instructions, without any weight updates. Prompting exploits this by framing the task in the input text. Zero-shot prompting: “Classify the sentiment of this tweet as positive, negative, or neutral: [tweet].” Few-shot prompting: provide three labelled examples in the prompt before the query. Chain-of-thought prompting: instruct the model to reason step by step before answering.

Prompting requires no labelled data and no GPUs at inference time beyond the model server. The trade-off is that prompt-based performance is less reliable than fine-tuned performance on narrow tasks, and that careful prompt engineering is a non-trivial skill.

The right paradigm depends on your situation:

Situation → Recommended approach
No labelled data, quick prototype → Zero-shot prompting
10–100 examples, API access → Few-shot prompting
1,000+ labelled examples, own GPU → Fine-tuning or LoRA
Production classifier, latency-critical → Fine-tune a smaller model
Research benchmark, highest accuracy → Full fine-tuning + ensemble

Using a Pretrained LLM in Practice

The transformers library from Hugging Face provides a unified Python API for hundreds of pretrained models. The simplest entry point is the pipeline function, which wraps model loading, tokenization, and inference behind a single callable.

This block does not run in your browser

The code below requires transformers and torch, which cannot be installed in the browser’s Python environment. Copy it to a Google Colab notebook or a local Python environment to run it. A free Colab T4 GPU is sufficient for all examples in this chapter.

from transformers import pipeline

# Sentiment analysis using DistilBERT fine-tuned on SST-2
classifier = pipeline("sentiment-analysis")

results = classifier([
    "HKUST students love this course!",
    "The stock market dropped sharply after the Fed announcement.",
    "I'm not sure how I feel about this product.",
])

for r in results:
    print(r)
# Expected output (DistilBERT/SST-2):
# {'label': 'POSITIVE', 'score': 0.9998}
# {'label': 'NEGATIVE', 'score': 0.9876}
# {'label': 'NEGATIVE', 'score': 0.5621}  ← neutral text maps to negative (SST-2 has no neutral)

The pipeline object handles everything: it downloads the model weights on first call (cached locally after that), tokenizes the input, runs inference, and decodes the output label. The default "sentiment-analysis" pipeline uses distilbert-base-uncased-finetuned-sst-2-english — a compressed version of BERT fine-tuned on the Stanford Sentiment Treebank. For financial and social-media text, ProsusAI/finbert or cardiffnlp/twitter-roberta-base-sentiment-latest are better choices (see Chapter 2 of this book for FinBERT comparisons).

In practice

A customer support team at a telecom company uses a fine-tuned BERT pipeline to classify incoming tickets into 14 urgency/topic categories in real time. The pipeline runs on a single AWS g4dn.xlarge instance (1 × T4 GPU) and classifies 2,000 tickets per minute at $0.52/hour — far cheaper than human triage and with consistent labelling. The same architecture, fine-tuned on a different label set, handles content moderation for a gaming platform: flagging hate speech, spam, and account-sharing violations before they reach the moderation queue. The key engineering decision is always the same: fine-tune a domain-specific model for production; use a general pipeline for prototyping.


Zero-Shot Classification of Tweets

The idea

Zero-shot classification asks an NLP model to assign one of a set of candidate labels to a document, without any training examples of those specific labels. The underlying mechanism in modern zero-shot classifiers is an NLI (Natural Language Inference) model: the model is trained to judge whether a hypothesis follows from a premise. At inference time, the premise is the text to classify, and the hypothesis is a template like “This text is about {label}.” The model’s entailment score becomes the classification probability.

This block does not run in your browser

Install transformers and torch, then run this in Colab or locally.

from transformers import pipeline

zsc = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli")

tweet = "Fed signals two more rate hikes this year; markets sell off."
labels = ["bullish", "bearish", "neutral", "monetary policy", "equities"]

result = zsc(tweet, candidate_labels=labels, multi_label=False)
print(result)
# {'sequence': 'Fed signals two more rate hikes...',
#  'labels':   ['bearish', 'monetary policy', 'equities', 'neutral', 'bullish'],
#  'scores':   [0.62, 0.21, 0.09, 0.05, 0.03]}

The model correctly identifies the tweet as bearish without ever seeing a labelled example of a bearish financial tweet. This is possible because the BART model, pretrained on large text corpora and then fine-tuned on hundreds of thousands of NLI premise–hypothesis pairs (MultiNLI), has learned what “bearish” means and how it relates to descriptions of market declines.

How impressive — and how fragile — this can be

Zero-shot classification is genuinely impressive for clean, domain-aligned text. Its fragility comes from three sources. First, label surface form matters: the model responds to the literal words in the label, so “bearish” works better than “negative market sentiment” which works better than “downward” for financial text — even if they mean the same thing in context. Second, implicit knowledge gaps: a tweet in Cantonese or dense financial jargon (“widening IG spreads signal risk-off”) may fall outside the model’s effective operating range. Third, calibration: the scores are not well-calibrated probabilities — a score of 0.62 does not reliably mean the model is 62% confident. Treat zero-shot scores as ordinal rankings, not as probabilities, unless you have calibrated them on held-out data.

In production, zero-shot is best used as a first-pass filter or labelling assistant. Classify one thousand tweets zero-shot, manually review a random sample of 100, measure precision and recall against your true labels, and decide whether to fine-tune a dedicated model. The economics are compelling: zero-shot classification via the API costs a fraction of the human labelling cost needed to train a supervised model, and the quality is often sufficient for exploratory analysis.

In practice

A brand monitoring team at a consumer goods company uses zero-shot classification to triage 50,000 brand mentions per day across Twitter/X, Instagram, and Reddit. Candidate labels are: {product complaint, delivery issue, brand praise, competitor mention, general conversation}. Zero-shot precision on the first pass is around 72%; a human moderator spot-checks 200 samples per week. After three months of spot-check data accumulation, the team fine-tunes a DistilBERT model that achieves 91% precision — the zero-shot phase generated the training data at near-zero cost.


Embeddings and Semantic Search

From keyword search to meaning search

Traditional text retrieval is keyword-based: a search for “smartphone battery life” returns documents containing those exact words, misses documents that say “phone charge duration”, and has no sense of concept proximity. Embedding-based retrieval replaces keyword matching with vector similarity: each document and each query are mapped to a dense vector, and retrieval returns the documents whose vectors are closest to the query vector in cosine distance.

The key enabling technology is the sentence embedding — a single vector that represents the meaning of a full sentence or paragraph, not just a single word. Libraries like sentence-transformers (Reimers & Gurevych, 2019) provide pretrained models specifically optimised for producing high-quality sentence embeddings. A sentence embedding is computed by passing the sentence through a transformer and pooling the token representations (typically the mean of all token vectors, or the [CLS] token for BERT-style models).

Conceptual code: sentence embeddings with sentence-transformers

This block does not run in your browser

Run in Colab or locally after pip install sentence-transformers.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # 90MB, fast, good quality

corpus = [
    "Apple reports record iPhone sales in Q4.",
    "Galaxy S24 launch drives Samsung revenue higher.",
    "Federal Reserve holds interest rates steady.",
    "Inflation data surprises to the upside.",
    "Social media advertising spend rises 18% YoY.",
    "TikTok's ad revenue surpasses Twitter for the first time.",
    "Customer churn in telecoms sector accelerates.",
    "Subscriber growth at Netflix exceeds analyst expectations.",
    "OpenAI valuation reaches $80 billion in new funding round.",
    "Meta announces layoffs in hardware division.",
]

embeddings = model.encode(corpus, normalize_embeddings=True)  # shape (10, 384)

query = "mobile phone revenue"
q_emb = model.encode([query], normalize_embeddings=True)

similarities = embeddings @ q_emb.T  # cosine sim (normalised vectors)
ranked = np.argsort(similarities.ravel())[::-1]
print(f"Top matches for query: '{query}'")
for rank, idx in enumerate(ranked[:3]):
    print(f"  {rank+1}. [{similarities[idx, 0]:.3f}]  {corpus[idx]}")

The model returns “Apple reports record iPhone sales” and “Galaxy S24 launch drives Samsung revenue” as the top matches for “mobile phone revenue” — documents that share no keywords with the query but are semantically aligned.

Live demo: retrieval with random embeddings

The sentence-transformers library cannot run in the browser, but the retrieval mechanism — nearest neighbour in cosine distance — is pure numpy. The cell below uses random embeddings to demonstrate the retrieval pipeline. The matching is random because the embeddings carry no semantic content, but the structure of the operation is identical to the production version above.
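
A minimal version of that demo — the documents are a subset of the corpus above, and because the embeddings are random the ranking carries no meaning:

import numpy as np

rng = np.random.default_rng(3)

corpus = [
    "Apple reports record iPhone sales in Q4.",
    "Federal Reserve holds interest rates steady.",
    "Social media advertising spend rises 18% YoY.",
    "Subscriber growth at Netflix exceeds analyst expectations.",
    "Meta announces layoffs in hardware division.",
]

# Random 384-dimensional vectors stand in for real sentence embeddings
doc_emb = rng.normal(size=(len(corpus), 384))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)    # normalise rows

query_emb = rng.normal(size=384)
query_emb /= np.linalg.norm(query_emb)

similarities = doc_emb @ query_emb                            # cosine similarity
ranked = np.argsort(similarities)[::-1]

for rank, idx in enumerate(ranked, start=1):
    print(f"{rank}. [{similarities[idx]:+.3f}] {corpus[idx]}")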

Interpretation. With random embeddings, the ranking is meaningless — the most “similar” document is whichever one happened to be random-closest in 384 dimensions. The point of this cell is to show that the retrieval mechanism is nothing more than a dot product and an argsort. Swap in real sentence embeddings from sentence-transformers, and the same code produces semantically meaningful rankings. The complexity is entirely in how the embeddings are computed, not in how they are retrieved.

Retrieval-Augmented Generation (RAG)

Semantic search is the retrieval component of a larger pattern called Retrieval-Augmented Generation (RAG), introduced by Lewis et al. (2020) at Meta. The architecture is:

  1. Index a document corpus by computing embeddings and storing them in a vector database (Pinecone, Chroma, FAISS, pgvector).
  2. Retrieve: given a user query, embed the query and retrieve the \(k\) most similar documents.
  3. Generate: pass the query together with the retrieved documents as context to a generative LLM, which synthesises an answer grounded in the retrieved text.

RAG addresses two of the biggest limitations of pure LLM generation: hallucination (the model generates plausible-sounding but false facts) and stale knowledge (the model’s knowledge is frozen at its training cutoff). By grounding the generation in retrieved documents, RAG can answer questions about events after the training cutoff and cite specific passages as evidence.

For social media analytics, RAG enables applications such as: a brand manager asking natural-language questions about customer feedback (“What complaints appeared most frequently in December?”) answered by retrieving and synthesising from a corpus of vectorised reviews; or a research analyst querying an earnings call corpus (“When did management first mention supply chain disruptions?”) answered with verbatim quotes rather than model-generated paraphrases.
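
A minimal sketch of the index-and-retrieve half of that pipeline (run in Colab or locally; the reviews and the question are invented examples, and the assembled prompt would be sent to whichever generative model you use for step 3):

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Index: embed the review corpus (a vector database would store these at scale)
reviews = [
    "Delivery took three weeks and nobody answered my emails.",
    "Packaging was beautiful, arrived a day early.",
    "The December batch tasted noticeably more bitter than usual.",
]
doc_emb = embedder.encode(reviews, normalize_embeddings=True)

# 2. Retrieve: embed the question and take the k most similar documents
question = "What complaints appeared most frequently in December?"
q_emb = embedder.encode([question], normalize_embeddings=True)
top_k = np.argsort((doc_emb @ q_emb.T).ravel())[::-1][:2]
context = "\n".join(reviews[i] for i in top_k)

# 3. Generate: ground the answer in the retrieved text
prompt = (
    "Answer the question using only the customer reviews below.\n\n"
    f"Reviews:\n{context}\n\nQuestion: {question}"
)
print(prompt)   # pass this string to a chat-completion model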


Prompting Patterns

Instruction-tuned LLMs respond to natural language descriptions of tasks. The quality of the response depends heavily on how the task is framed. Four patterns cover the vast majority of practical use cases.

Zero-shot prompting gives the model a task description with no examples:

Classify the sentiment of the following tweet as positive, negative, or neutral. Reply with one word only.
Tweet: “Just got my order and the packaging was completely crushed.”

Few-shot prompting includes worked examples before the query. This dramatically improves performance on tasks that require a specific output format or domain knowledge:

Classify tweets about a consumer electronics brand as: complaint, praise, question, or neutral.
Tweet: “My phone screen cracked on day 3.” → complaint
Tweet: “Battery lasts two full days, really impressed.” → praise
Tweet: “Does this model support wireless charging?” → question
Tweet: “Just saw the new ad.” → neutral
Tweet: “The customer service was absolutely useless.” →

Chain-of-thought prompting (Wei et al., 2022) adds the instruction “Let’s think step by step” or provides examples that show intermediate reasoning. This dramatically improves accuracy on multi-step tasks — financial analysis, causal inference, arithmetic. For social media analytics, chain-of-thought is useful when the correct sentiment label depends on implicit knowledge (“A tweet saying ‘Thanks for nothing, @airline’ is sarcastic and therefore negative — the word ‘thanks’ is a false positive for a naive lexicon”).

Role / system prompting provides a persistent identity in the system message:

You are a senior equity analyst specialising in consumer technology. Your task is to read news headlines and assess their likely impact on the share price of Apple Inc. on a scale from −3 (very negative) to +3 (very positive), followed by a one-sentence justification.

Role prompting is especially effective for calibrating tone, output format, and domain focus. A system prompt that says “You are a brand safety officer reviewing social media content” produces different label distributions than one that says “You are a marketing manager looking for brand mentions.”
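
These patterns map directly onto the message structure of chat-style APIs. A sketch using the OpenAI Python client — the model name is an illustrative choice, and any chat-completion endpoint accepts the same role structure:

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

messages = [
    # Role / system prompt: persistent persona and output contract
    {"role": "system", "content": (
        "You are a brand analyst. Classify tweets about a consumer electronics "
        "brand as: complaint, praise, question, or neutral. Reply with one word."
    )},
    # Few-shot examples, supplied as prior conversation turns
    {"role": "user", "content": "My phone screen cracked on day 3."},
    {"role": "assistant", "content": "complaint"},
    {"role": "user", "content": "Does this model support wireless charging?"},
    {"role": "assistant", "content": "question"},
    # The actual query
    {"role": "user", "content": "The customer service was absolutely useless."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",        # illustrative; any chat model works
    messages=messages,
    temperature=0,              # keep labels as deterministic as possible
)
print(response.choices[0].message.content)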

In practice

Coca-Cola’s social listening team (which runs one of the largest brand monitoring operations in the world) uses a combination of few-shot and role prompting to classify a continuous stream of social mentions into 12 brand sentiment categories — a more granular taxonomy than the standard positive/negative/neutral trichotomy. The system prompt defines the brand persona and the classification rubric; the few-shot examples cover edge cases (irony, mixed sentiment, competitor comparisons). The team iterates on the prompt weekly based on classification errors identified in spot-checks. This is “prompt engineering as ongoing operational practice” — not a one-time setup.


Evaluating LLM Outputs

Deploying an LLM for a production task — sentiment classification, content moderation, financial distillation — without systematic evaluation is a serious operational risk. LLMs can be wrong, inconsistent, or confidently incorrect (“hallucination”), and these failures are not uniformly distributed: they cluster around specific input types, ambiguous phrasings, and domain boundaries.

Classification metrics

For structured classification tasks (sentiment, zero-shot labelling), standard metrics apply: accuracy, precision, recall, and F1 by class. When classes are imbalanced — as they typically are in brand monitoring, where “neutral” mentions dominate — macro-averaged F1 (average F1 across all classes, unweighted by class frequency) is more informative than accuracy. Always report a confusion matrix on a held-out test set; the off-diagonal cells reveal which label pairs the model conflates.
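
A sketch of the metric computation with scikit-learn, using ten invented gold/predicted label pairs:

from sklearn.metrics import classification_report, confusion_matrix, f1_score

labels = ["negative", "neutral", "positive"]

# Gold labels from a held-out, human-annotated test set (toy values)
y_true = ["neutral", "neutral", "negative", "positive", "neutral",
          "negative", "neutral", "positive", "neutral", "negative"]
# Labels produced by the LLM for the same tweets
y_pred = ["neutral", "negative", "negative", "positive", "neutral",
          "neutral", "neutral", "neutral", "neutral", "negative"]

print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
print("macro F1:", f1_score(y_true, y_pred, labels=labels, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=labels))   # rows = true, columns = predicted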

Calibration

Calibration measures whether the model’s confidence scores match its actual accuracy. A perfectly calibrated model that says it is 80% confident should be correct 80% of the time. LLMs are famously poorly calibrated — their logit-derived probabilities are not reliable confidence estimates, especially after RLHF fine-tuning. If you plan to threshold on confidence scores (e.g., “only route tickets to automation if confidence > 0.9”), use Platt scaling or isotonic regression to recalibrate scores on a validation set first.
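
A sketch of that recalibration step with isotonic regression in scikit-learn — the scores and correctness flags are invented validation data:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Validation set: the model's raw confidence scores and whether each prediction was correct
raw_scores  = np.array([0.55, 0.62, 0.70, 0.78, 0.85, 0.90, 0.94, 0.97])
was_correct = np.array([0,    0,    1,    0,    1,    1,    1,    1])

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, was_correct)

# Map new raw scores onto calibrated probabilities before applying a threshold
new_scores = np.array([0.60, 0.80, 0.95])
print(iso.predict(new_scores))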

Generation metrics

For summarisation and generation tasks, automatic metrics include BLEU (precision of n-gram overlap between output and reference) and ROUGE (recall-oriented variant). Both are crude proxies — a fluent, accurate summary with different word choices than the reference will score poorly, while a summary that copies chunks of the source without understanding will score well. Treat BLEU and ROUGE as sanity checks, not as ground truth.

Human evaluation remains the gold standard for generation quality. At minimum, evaluate on a random sample of 100–200 outputs for fluency, factual accuracy, and task-relevance. For safety-critical applications (financial advice, legal summarisation, medical triage), human review is not optional.

LLM-as-judge

A recent pattern uses a stronger LLM (e.g., GPT-4) to evaluate the outputs of a weaker or domain-specific LLM. “LLM-as-judge” is convenient and scalable, but has known biases: the judge model favours outputs that match its own stylistic preferences, penalises concision, and shows positional bias (preferring the first option when presented with a pair). Use it as a component of a broader evaluation framework, not as a standalone metric.

Hallucination is a first-class production risk

A financial news distillation system tested by a Tier-1 bank in 2023 produced factually incorrect earnings figures in approximately 3% of summaries — a rate that would have been operationally unacceptable for analyst briefings. The solution was a two-stage pipeline: LLM summarisation followed by a retrieval-based fact-checking step that verified every numerical claim against the source document. Hallucination rates dropped below 0.2% after the retrieval check. This pattern — generate then verify — is now standard practice in high-stakes LLM deployments.


Cost, Latency, and Ethics

Cost: tokens multiply fast

The commercial LLM API economy is priced per token. As of early 2025, GPT-4 Turbo costs approximately $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens; GPT-4o Mini (a smaller, cheaper model) costs roughly 15× less. Classifying one tweet requires roughly 80 input tokens (prompt + tweet) and 5 output tokens (the label). At GPT-4 Turbo pricing, one million tweet classifications therefore cost approximately $950 (80 million input tokens plus 5 million output tokens). At GPT-4o Mini pricing, the same batch costs roughly $60. For a brand monitoring operation tracking five million mentions per month, the cost differential between model choices compounds to more than $50,000 per year. Cost discipline begins at the architecture stage: use the smallest model that achieves acceptable accuracy, cache repeated prompts, and batch inference wherever possible.

Latency: real-time is harder than it looks

Text generation through an API has latency driven primarily by the number of output tokens — each token is generated sequentially, and each generation step costs roughly 10–50 ms depending on model size and hardware. A classification task that outputs one token has latency around 200–400 ms (dominated by network round-trip and tokenization overhead). A summarisation task that outputs 200 tokens takes 5–15 seconds. For real-time applications — live chat moderation, in-stream ad targeting, customer service triage — latency requirements may rule out the largest models entirely. The standard engineering response is a tiered architecture: a fast, small model handles the real-time path; a slower, more accurate model handles asynchronous review of edge cases.

Privacy and PII

Social media text frequently contains personally identifiable information (PII): names, locations, contact details, and health or financial disclosures. Sending user-generated content to a third-party API (OpenAI, Anthropic, Cohere) transfers that data to a third-party server. Depending on jurisdiction, this may conflict with GDPR (EU), PDPO (Hong Kong), PIPL (China), or CCPA (California). Before deploying an LLM pipeline on user data, conduct a data protection impact assessment, pseudonymise or redact PII at the pre-processing stage, and review the API provider’s data retention and training policies. The alternative — running an open-source model (LLaMA, Mistral) on-premise — avoids the data transfer issue at the cost of infrastructure management.

Bias and fairness

LLMs inherit the biases present in their training data. For social media analytics, the operationally relevant biases include: demographic bias (sentiment models may score the same statement differently depending on the author’s apparent identity, inferred from name or dialect); recency bias (knowledge is stale past the training cutoff); and majority-language bias (performance degrades for low-resource languages and dialects). For brand monitoring in multilingual markets — Cantonese, Tagalog, Bahasa Indonesia — validate model performance separately on each language, and consider using models specifically trained for that language family rather than defaulting to English-dominant models.


Closing: Where the Field is Going (as of 2026)

The LLM landscape in 2026 looks qualitatively different from 2022 in four respects. First, agents: LLMs can now use tools — web search, Python execution, database queries, API calls — and plan multi-step workflows autonomously. OpenAI’s Operator, Anthropic’s Claude Agents, and Google’s Project Mariner all demonstrate the same pattern: the LLM as orchestrator, delegating subtasks to specialised tools. For marketing analysts, this means a natural-language interface to data pipelines is no longer science fiction. Second, long context: the effective context window has grown from 2,048 tokens in 2020 to one million tokens in 2025 (Gemini 1.5 Pro). Analysing an entire year of a brand’s social corpus in a single prompt — without any chunking or retrieval — is becoming feasible. Third, multimodality: GPT-4V, Gemini Ultra, and Claude 3 handle images, audio, and video alongside text. Social media is inherently multimodal — memes, short-form video, and image captions carry sentiment that text alone cannot capture. Multimodal models are the next frontier for brand monitoring and content moderation. Fourth, cost-performance curves continue to shift downward: the same classification quality that cost $1,000 in 2022 costs $10 in 2026, driven by architectural improvements (mixture of experts, speculative decoding) and fierce competition among providers.

For depth on the mechanics covered in this chapter — transformer internals, pretraining procedures, RLHF, efficient fine-tuning — the reference is Hands-On Large Language Models by Alammar and Grootendorst. Every chapter in that book comes with runnable Colab notebooks; it is designed for the exact audience of this course.

This is the close of the Social Media Analysis book. The three chapters have traced a progression: from topics (what are people talking about, in aggregate?) through sentiment (how do they feel about it, and can we measure that from text alone?) to language models (how do we deploy the full power of modern NLP on social text at scale?). Text is now fully structured and numerically tractable. The next step, for the analyst who wants the complete picture, is to combine these textual signals with the structural analysis of who is talking to whom — the province of Network Analysis, where the social graph shapes the reach, speed, and ultimate influence of every message the text models process.


Prof. Xuhu Wan  ·  HKUST ISOM 5640  ·  Social Media Analysis  ·  2026 Edition

 
