Lexical Repetition Detection in Generated Prose¶
Craft guidance for identifying genuinely problematic repetition in LLM-generated fiction—distinguishing formulaic patterns from natural language frequency, across English and other languages.
The Problem: Signal vs. Noise in Bigram Counting¶
When LLMs generate prose across dozens of passages, certain word pairs appear repeatedly. Most of these are natural English: "in the", "of the", "she said". A few signal genuine quality problems: "dark corridor", "heavy sigh", "narrowed eyes" appearing in passage after passage.
Raw bigram counting produces overwhelming noise. In a 60-passage test with an LLM, a naive cross-passage bigram check flagged ~3,700 bigrams—almost all natural English patterns. The LLM review phase independently found only 18 genuine issues. The challenge is filtering thousands of natural-frequency bigrams to surface the handful that indicate formulaic writing.
This article covers the established NLP approaches to this filtering problem, what they mean for fiction quality pipelines, and how they apply across languages.
Content Words vs. Function Words¶
The Linguistic Foundation¶
The most fundamental distinction in filtering is between open-class (content) and closed-class (function) words.
Content words carry lexical meaning: nouns, main verbs, adjectives, most adverbs. New content words enter the language regularly ("doomscroll", "selfie"). They form the majority of vocabulary but a minority of running text.
Function words serve grammatical purposes: articles (the, a), prepositions (in, on, at), pronouns (he, she, it), conjunctions (and, but, or), auxiliary verbs (is, was, have). The set is essentially fixed—new function words are almost never coined. They are few in type (~300-400 in English) but constitute 30-40% of running text.
For repetition detection, the distinction matters because function-word bigrams ("of the", "in a", "is not") repeat by grammatical necessity, not by authorial choice. Content-word bigrams ("amber glow", "dark corridor", "whispered urgently") repeat because the author—or the LLM—is reusing the same imagery.
Filtering Heuristics: From Crude to Precise¶
Four approaches exist, each trading complexity for accuracy:
1. Character-length thresholds. Discard bigrams where one or both words fall below a character count (typically 3-4). Zero dependencies, fast—but unreliable. English has many meaningful short content words (war, god, ice, art, sea, sky, sun, old, red) and many long function words (through, about, between, without). A length-4 threshold drops "war cry" while keeping "would have". For Dutch and German, the situation is worse: function words like "geweest" (been, 7 letters) pass any reasonable length filter, while bigrams containing "kat" (cat, 3 letters) would be dropped.
2. Stopword lists. Maintain a curated set of high-frequency function words. Well-established NLP practice with decades of refinement. NLTK provides ~179 English stopwords, spaCy provides ~326. The key design decision is whether to drop bigrams where either word is a stopword (aggressive—drops "the darkness") or where both are stopwords (conservative—only drops "of the"). For repetition detection in fiction, the aggressive approach is usually correct: "the darkness" repeating across passages is not a quality signal, because "the" contributes nothing to the pattern. What matters is whether "darkness" itself is overused, which is a unigram question.
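A minimal sketch of the two policies, assuming a tiny illustrative stopword set (a real pipeline would load spaCy's or NLTK's list for the target language):

```python
# Illustrative stopword set; far smaller than a real curated list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and", "not"}

def keep_bigram(w1, w2, aggressive=True):
    """Return True if the bigram survives stopword filtering."""
    if aggressive:
        # Aggressive: drop the bigram if EITHER word is a stopword.
        return w1 not in STOPWORDS and w2 not in STOPWORDS
    # Conservative: drop only if BOTH words are stopwords.
    return not (w1 in STOPWORDS and w2 in STOPWORDS)
```

Under the aggressive policy `keep_bigram("the", "darkness")` is False; the conservative policy keeps it and drops only pairs like ("of", "the").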
3. Part-of-speech (POS) filtering. Tag text with a POS tagger, then keep only bigrams matching content-word patterns: Adjective+Noun ("dark corridor"), Noun+Noun ("whiskey glass"), Verb+Noun ("clenched fists"), Adverb+Verb ("slowly turned"). The noun-phrase patterns trace back to Justeson & Katz (1995). Manning & Schütze's Foundations of Statistical NLP reports that POS filtering on a New York Times corpus reduced the top bigrams from all function-word pairs to almost exclusively meaningful collocations. Most accurate—but requires a POS tagger dependency (spaCy, ~200MB+ with model).
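The pattern match itself is simple once text is tagged. A sketch operating on pre-tagged (word, tag) pairs, with the tags hard-coded here so the example has no tagger dependency (in practice they would come from spaCy or a similar tagger):

```python
# Content-word bigram patterns, using Universal POS tag names.
CONTENT_PATTERNS = {
    ("ADJ", "NOUN"),   # "dark corridor"
    ("NOUN", "NOUN"),  # "whiskey glass"
    ("VERB", "NOUN"),  # "clenched fists"
    ("ADV", "VERB"),   # "slowly turned"
}

def content_bigrams(tagged):
    """Yield word bigrams whose POS pattern marks a content collocation."""
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in CONTENT_PATTERNS:
            yield (w1, w2)

tagged = [("she", "PRON"), ("slowly", "ADV"), ("turned", "VERB"),
          ("toward", "ADP"), ("the", "DET"), ("dark", "ADJ"),
          ("corridor", "NOUN")]
list(content_bigrams(tagged))  # [("slowly", "turned"), ("dark", "corridor")]
```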
4. Statistical measures (PMI, TF-IDF, log-likelihood). Score bigrams by how statistically surprising their co-occurrence is. Pointwise Mutual Information (Church & Hanks, 1990) measures whether two words appear together more than chance predicts. Log-likelihood ratio (Dunning, 1993) is more robust for rare events. These answer "is this a meaningful collocation?" rather than "is this a function-word bigram?"—a different question. With only ~60 passages, the statistics are noisy and the approach is overkill for the filtering problem.
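For reference, PMI as defined by Church & Hanks (1990) is straightforward to compute from unigram and bigram counts; a minimal sketch over a token list:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent bigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log2((c / n_bi)
                          / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```

Positive scores mean a pair co-occurs more often than chance predicts; with rare words the estimate is unstable, which is exactly the bias toward very rare pairs noted above.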
Recommendation for Fiction Quality Pipelines¶
A curated stopword set of 50-80 words per language provides the best tradeoff: high accuracy, zero heavyweight dependencies, easy to maintain. POS tagging is the gold standard for precision but introduces a heavy dependency that rarely justifies itself for this specific use case. Statistical measures solve a different problem.
Collocation Theory: What Makes a Bigram Meaningful?¶
Not all content-word bigrams are equally interesting for repetition detection. Collocation theory distinguishes three categories:
Compositional collocations: "heavy rain"—meaning is the sum of parts, but the combination is preferred over alternatives like "strong rain." These are the primary target for repetition detection in fiction: repeated collocations signal formulaic imagery.
Non-compositional idioms: "kick the bucket"—meaning differs from parts. Repetition of idioms is less concerning because they are fixed expressions.
Arbitrary co-occurrences: "door opened"—words that happen to appear together without special association. High frequency but low significance for quality assessment.
Statistical Measures¶
The academic literature offers several measures for scoring collocations:
| Method | What it measures | Strength | Weakness |
|---|---|---|---|
| Raw frequency | How often bigram appears | Simple baseline | Dominated by function words |
| t-test | Co-occurrence exceeds chance | Medium-frequency pairs | Assumes normal distribution |
| Chi-squared | Independence of co-occurrence | Large corpora | Unreliable for rare events |
| PMI | Strength of association | Rare but strong pairs | Biased toward very rare pairs |
| Log-likelihood ratio | Evidence supports association | All frequency ranges | Complex to interpret |
Manning & Schütze recommend the log-likelihood ratio (Dunning, 1993) as the most reliable single measure because it handles both rare and common events, unlike the t-test (wrong distributional assumption) or chi-squared (unreliable for sparse data).
What This Means for Fiction¶
For detecting formulaic LLM writing, the most useful approach is simpler than full collocation extraction: cross-document frequency of content-word bigrams. Count how many distinct passages contain each content-word bigram at least once. A bigram appearing in 15% or more of passages, after stopword filtering, is a strong signal of formulaic generation—the LLM is reaching for the same imagery repeatedly.
The statistical machinery of PMI and log-likelihood is designed for large reference corpora. With ~60 passages of ~200 words each, the corpus is too small for robust statistical inference. Simple frequency counting with good filtering outperforms complex statistics at this scale.
LLM-Specific Repetition Patterns¶
The Antislop Research¶
Paech (2025) provides the most comprehensive analysis of LLM-specific repetitive patterns. Key findings:
- Some phrases appear over 1,000 times more frequently in LLM output than in human writing
- The human baseline combines `wordfreq` (for individual words) with a curated corpus of Reddit creative writing plus Project Gutenberg texts (for n-grams)
- Over 8,000 identified "slop" patterns, including overused words ("tapestry", "kaleidoscope", "symphony"), character names ("Elara", "Kael"), and phrases ("a testament to", "couldn't help but", "sent shivers down")
- Open-source phrase lists available under MIT license
The Antislop findings suggest that for LLM-generated fiction, a curated blocklist of known overused phrases may catch more genuine issues than statistical bigram analysis. The repetition patterns are not random—they reflect systematic biases in training data and RLHF alignment.
Syntactic Templates¶
Shaib et al. (EMNLP 2024) studied repetition at the syntactic level using POS n-gram sequences:
- LLM-generated text shows a 97% template rate (fraction of outputs containing at least one repeated POS template) versus 46% for human text
- 76% of templates in LLM output trace back to pretraining data
- Templates are "not overwritten during fine-tuning or RLHF"
This means the repetition problem is structural, not just lexical. An LLM doesn't just reuse the same words—it reuses the same sentence structures. Detecting only lexical bigrams misses this deeper pattern. However, syntactic template detection requires POS tagging and is more suited to model evaluation than real-time quality gating.
Lexical Diversity Measures¶
TTR and Its Limitations¶
Type-Token Ratio (TTR)—the ratio of unique words to total words—is the simplest lexical diversity measure. It is also text-length-dependent: longer passages always produce lower TTR because vocabulary exhausts faster than text grows. This makes cross-passage comparison unreliable when passages differ in length.
McCarthy & Jarvis (2010) validated this empirically: TTR was the only measure in their study found to vary systematically as a function of text length, regardless of actual lexical diversity.
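The length dependence is easy to demonstrate:

```python
def ttr(tokens):
    """Type-token ratio: unique words / total words."""
    return len(set(tokens)) / len(tokens)

short = "the cat sat on the mat".split()
ttr(short)       # 0.833: 5 types over 6 tokens
ttr(short * 10)  # 0.083: same vocabulary, ten times the length
```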
Better Alternatives¶
MTLD (Measure of Textual Lexical Diversity): Computes the average length of sequential word strings that maintain a given TTR threshold. The resulting score is independent of text length—validated as the most robust measure across sample sizes. Available in the lexical-diversity Python package.
HD-D (Hypergeometric Distribution D): Uses a probability model to compute the contribution of each word type to overall diversity. Also length-independent. Available in the same package.
vocd-D: Uses random sampling to estimate diversity. Robust but computationally expensive compared to MTLD.
For a quality pipeline comparing passages of varying lengths, MTLD is the recommended replacement for TTR.
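For illustration, a compact sketch of MTLD using the standard 0.72 TTR threshold: a factor completes each time the running TTR falls to the threshold, the final partial factor is counted proportionally, and the published measure averages forward and reverse passes. Production code would use the lexical-diversity package rather than this sketch:

```python
def mtld_pass(tokens, threshold=0.72):
    """One directional pass: mean length of runs keeping TTR above threshold."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:
            factors += 1.0            # a full factor completed
            types, count = set(), 0
    if count:                         # partial factor at the end
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))

def mtld(tokens, threshold=0.72):
    """Average of forward and reverse passes, per McCarthy & Jarvis (2010)."""
    return (mtld_pass(tokens, threshold)
            + mtld_pass(tokens[::-1], threshold)) / 2
```

Higher scores mean more diverse text: a passage that repeats one word scores near 2, while a passage with no repetition at all never completes a factor and scores its own length.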
Multilingual Considerations¶
The Universal Challenge¶
Function-word filtering is inherently language-specific. No character-length threshold or statistical heuristic works universally:
- Dutch "geweest" (been) is 7 characters and a function word
- English "go" is 2 characters and a content word
- German compounds create extremely long function-word forms
- Agglutinative languages (Turkish, Finnish, Hungarian) attach grammatical markers to stems
- Chinese, Japanese, and Thai have different tokenization requirements entirely
Each language needs its own filtering approach—but the conceptual framework (content vs. function, open vs. closed class) is universal.
Dutch-Specific Challenges¶
Dutch presents several challenges beyond simple stopword filtering:
Compound words (samenstellingen). Dutch aggressively compounds without spaces: "boekenkast" (bookcase), "kerkdeur" (church door), "scheepvaartmaatschappij" (shipping company). Standard bigram detection operates on space-separated tokens and misses repetition within compounds entirely. Compound splitting tools exist but add complexity.
Separable verbs (scheidbare werkwoorden). Dutch compound verbs split during conjugation, placing the prefix at the clause end: "opbellen" (to call) becomes "Ik bel mijn moeder op" (I call my mother). The prefix "op" separates from the stem "bel" and appears as a standalone token—which looks like a preposition to a naive filter. Common separable prefixes (aan, af, bij, in, mee, om, op, over, uit, voor) overlap heavily with prepositions.
Verb-final word order. Dutch subordinate clauses place the verb at the end: "...dat hij morgen naar Amsterdam gaat" (...that he tomorrow to Amsterdam goes). This creates different bigram patterns than English word order and affects position-based heuristics.
Article system. Dutch has two definite articles ("de" for common gender, "het" for neuter) where English has one ("the"). The choice carries grammatical information and is not arbitrary—but both are function words for filtering purposes.
Word length distribution. Dutch averages 5.1 letters per word versus English at 4.6, partly due to compounds. A character-length threshold that works for English (< 4 chars) would miss even more Dutch function words. The standard Dutch stopword sets (spaCy: ~300 words, NLTK: ~100 words) are the reliable approach.
German¶
Similar compound and separable-verb challenges as Dutch, plus a four-case system that inflects articles ("der/den/dem/des") and adjectives. German compounds can be even longer than Dutch. Standard stopword lists (spaCy German, ~300 words) handle the common cases.
French¶
Elision and contractions create tokenization challenges: "l'homme" (the man) splits differently than "le chat" (the cat). Contracted forms like "au" (= "à le") and "du" (= "de le") need special handling. Clitics in "je t'aime" attach pronouns to verbs. French stopword lists (spaCy: ~300+ words) cover the common forms.
Spanish¶
Subject-pronoun dropping means many sentences omit pronouns entirely, reducing function-word frequency in text. Contractions ("del" = "de el", "al" = "a el") and enclitic pronouns ("dámelo" = give-it-to-me) need tokenization awareness.
Practical Multilingual Strategy¶
For a fiction quality pipeline that must support multiple languages:
Tier 1 — Supported languages. Use language-specific stopword sets (spaCy provides sets for 25+ languages). Fast, accurate, no model dependency beyond the word lists themselves.
Tier 2 — Unsupported languages. Use the stopwords-iso collection (covers 50+ languages) or the advertools package (37 languages). Less curated than spaCy but broad coverage.
Tier 3 — Unknown or mixed languages. For languages without stopword lists, a strong multilingual LLM can identify function words in context. This is expensive per call but requires zero language-specific infrastructure. From a practical perspective, non-English fiction already requires a strong model for generation—using the same model for quality analysis adds marginal cost.
The key insight is that the conceptual framework is language-universal even though the word lists are not. Every language has function words that dominate bigram frequency counts. The question is always: "Which word pairs represent authorial choice (or LLM formulaic tendency) versus grammatical necessity?"
Practical Approaches for Quality Pipelines¶
Approach 1: Stopword-Filtered Cross-Passage Frequency¶
The simplest effective approach for detecting formulaic LLM repetition across passages:
- Tokenize and lowercase each passage
- Remove tokens present in the language's stopword set
- Extract bigrams from remaining content words
- Count how many distinct passages contain each bigram (document frequency, not raw count)
- Flag bigrams appearing in more than a threshold percentage of passages (e.g., >15%)
This surfaces patterns like "amber glow" appearing in 20 of 60 passages while ignoring "in the" entirely. The threshold should scale with passage count—a fixed count (e.g., "more than 3 passages") creates false positives in large projects.
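The five steps reduce to a few lines. A sketch with an illustrative stopword set, tokenizer, and threshold:

```python
import re
from collections import Counter

# Illustrative stopword set; use a full language-specific list in practice.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and", "she", "he"}

def passage_bigrams(text, stopwords=STOPWORDS):
    """Set of content-word bigrams in one passage (each counted once)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {(w1, w2) for w1, w2 in zip(tokens, tokens[1:])
            if w1 not in stopwords and w2 not in stopwords}

def flag_repeated(passages, threshold=0.15):
    """Flag bigrams whose document frequency exceeds the threshold."""
    df = Counter()
    for passage in passages:
        df.update(passage_bigrams(passage))  # sets: one count per passage
    cutoff = threshold * len(passages)
    return {bigram: n for bigram, n in df.items() if n > cutoff}
```

Because each passage contributes a set rather than a multiset, a bigram repeated ten times inside one passage still counts as one document hit.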
Approach 2: Known-Pattern Blocklist¶
Maintain a curated list of known LLM-overused phrases drawn from research like Antislop. Check generated prose against the list and flag passages containing multiple blocklisted phrases. This catches LLM-specific patterns ("couldn't help but", "a testament to") that generic bigram counting misses because they may not repeat across passages—they are individually overused relative to human baselines.
This approach complements cross-passage frequency: the blocklist catches per-passage clichés while cross-passage counting catches project-wide formulaic repetition.
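A sketch with a three-phrase illustrative blocklist and a hypothetical two-hit flagging rule:

```python
# Illustrative blocklist; real lists (e.g. the MIT-licensed Antislop
# phrase lists) contain thousands of entries.
BLOCKLIST = ("a testament to", "couldn't help but", "sent shivers down")

def blocklisted_phrases(passage, blocklist=BLOCKLIST):
    """Return every blocklisted phrase found in the passage."""
    text = passage.lower()
    return [phrase for phrase in blocklist if phrase in text]

def is_flagged(passage, min_hits=2):
    """Flag only when multiple blocklisted phrases co-occur in a passage."""
    return len(blocklisted_phrases(passage)) >= min_hits
```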
Approach 3: Proactive Prevention Over Post-Hoc Detection¶
The most effective strategy may be preventing repetition during generation rather than detecting it after. Providing the LLM with a blocklist of already-used imagery at generation time—"these phrases have already appeared in earlier passages, do not reuse them"—addresses the root cause rather than the symptom.
This requires maintaining a running inventory of used imagery as passages are generated. The inventory should track content-word bigrams and trigrams, filtered by the same stopword approach described above.
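One possible shape for such an inventory (class and method names are illustrative, and the whitespace tokenizer is deliberately crude):

```python
class ImageryInventory:
    """Running inventory of content-word bigrams already used, for
    injection into the next generation prompt as a do-not-reuse list."""

    def __init__(self, stopwords):
        self.stopwords = stopwords
        self.seen = set()

    def record(self, passage):
        """Add the passage's content-word bigrams to the inventory."""
        tokens = [t for t in passage.lower().split() if t.isalpha()]
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1 not in self.stopwords and w2 not in self.stopwords:
                self.seen.add(f"{w1} {w2}")

    def prompt_fragment(self):
        """Instruction text to prepend to the next generation request."""
        if not self.seen:
            return ""
        return ("These phrases have already appeared in earlier passages; "
                "do not reuse them: " + ", ".join(sorted(self.seen)))
```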
What to Measure: Document Frequency, Not Raw Frequency¶
A critical distinction: document frequency (how many passages contain a bigram) is more meaningful than raw frequency (how many times a bigram appears total) for cross-passage repetition detection.
A bigram appearing 10 times in one passage may be intentional repetition for emphasis. The same bigram appearing once each in 10 different passages signals the LLM falling into a pattern. Document frequency captures the cross-passage pattern; raw frequency conflates intentional local repetition with unintentional global repetition.
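A toy contrast makes the distinction concrete:

```python
def raw_and_doc_freq(passages, phrase):
    """(total occurrences, number of passages containing the phrase)."""
    raw = sum(p.lower().count(phrase) for p in passages)
    doc = sum(phrase in p.lower() for p in passages)
    return raw, doc

local = ["amber glow " * 10] + ["calm sea"] * 9   # emphasis in one passage
spread = ["the amber glow faded"] * 10            # pattern across passages
raw_and_doc_freq(local, "amber glow")   # (10, 1): same raw count...
raw_and_doc_freq(spread, "amber glow")  # (10, 10): ...very different signal
```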
References¶
Foundational NLP¶
- Church, K. & Hanks, P. (1990). "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics, 16(1).
- Dunning, T. (1993). "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics, 19(1).
- Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing, Chapter 5: Collocations.
- Justeson, J. & Katz, S. (1995). "Technical terminology: linguistic properties and algorithm for identification in text." Natural Language Engineering, 1(1).
LLM-Specific Repetition¶
- Paech, S. (2025). "Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models." arXiv:2510.15061.
- Shaib, C. et al. (2024). "Detection and Measurement of Syntactic Templates in Generated Text." EMNLP 2024.
Lexical Diversity¶
- McCarthy, P. & Jarvis, S. (2010). "MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment." Behavior Research Methods, 42(2).
Stopword Resources¶
- spaCy language-specific stopword sets: github.com/explosion/spaCy (25+ languages)
- stopwords-iso collection: github.com/stopwords-iso (50+ languages)
- NLTK stopwords corpus: 16 languages
- Snowball stemmer stopword lists: snowballstem.org
See Also¶
- Prose Patterns — Sentence-level craft and repetition awareness
- Character Voice — Avoiding repetitive voice patterns across characters
- Voice Register Consistency — Maintaining distinct registers without formulaic fallback
- Quality Standards IF — Overall quality assessment for interactive fiction