The language modeling approach to information retrieval (2008)

2025-10-08

Before there were LLMs, there were (small) language models. In the 1990s and 2000s, information retrieval researchers used them in all sorts of settings. I’ve been revisiting some of this literature recently because I’m trying to understand the similarities and differences in how (small) language models were used for search in document databases versus protein databases.

Here’s a bit from Manning, Raghavan, and Schütze (2008) that contains a phrase, the language modeling approach to IR, that I rather like:

“The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query…”

I’m primarily interested in the task of, given a document, finding similar documents—so you might want to substitute “document” where you see “query” above.
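To make the idea concrete, here is a minimal sketch of query-likelihood ranking with a unigram language model per document. It is an illustration under stated assumptions, not the book's implementation: the function names (`tokenize`, `query_log_likelihood`, `rank`), the whitespace tokenizer, and the mixing weight `lam=0.5` are all hypothetical choices, and the smoothing is simple linear interpolation with the collection model (often called Jelinek-Mercer smoothing). Passing a whole document as the query gives the document-to-document variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def query_log_likelihood(query_tokens, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Log P(query | document model) under a smoothed unigram model.

    Each term probability is an interpolation of the document's
    maximum-likelihood estimate and the collection-wide estimate.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term appears nowhere in the collection: the model cannot generate it.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return document indices sorted from most to least likely to generate the query."""
    tokenized = [tokenize(d) for d in docs]
    coll_counts = Counter(t for d in tokenized for t in d)
    coll_len = sum(coll_counts.values())
    q = tokenize(query)
    scores = [query_log_likelihood(q, d, coll_counts, coll_len, lam) for d in tokenized]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    docs = [
        "protein folding and structure prediction",
        "language models for document retrieval",
        "cooking pasta with tomato sauce",
    ]
    # Query-to-document ranking.
    print(rank("language models retrieval", docs))
    # Document-to-document: use a document as the "query".
    print(rank(docs[0], docs))
```

Without the collection term, any query word missing from a document would drive that document's probability to zero, which is why some form of smoothing is usually part of this approach.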

Note that this book was published in 2008—well before the LLM era.

Another highly cited paper in this area is: Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2), 179-214.

I’m not particularly interested in modeling the query, though, so perhaps the “topic detection & tracking” literature is what I should be looking at.

I’ll keep looking.