Information retrieval models: foundations and relationships

Where to find it

Information & Library Science Library

Call Number
ZA3075 .R645 2013
Status
Available

Summary

Information Retrieval (IR) models are a core component of IR research and IR systems. The past decade brought a consolidation of the family of IR models, which by 2000 consisted of relatively isolated views on TF-IDF (Term-Frequency times Inverse-Document-Frequency) as the weighting scheme in the vector-space model (VSM), the probabilistic relevance framework (PRF), the binary independence retrieval (BIR) model, BM25 (Best-Match Version 25, the main instantiation of the PRF/BIR), and language modelling (LM). Also, the early 2000s saw the arrival of divergence from randomness (DFR).

Regarding intuition and simplicity, though LM is clear from a probabilistic point of view, several people have stated: "It is easy to understand TF-IDF and BM25. For LM, however, we understand the math, but we do not fully understand why it works."

This book takes a horizontal approach, gathering the foundations of TF-IDF, PRF, BIR, Poisson, BM25, LM, probabilistic inference networks (PIN's), and divergence-based models. The aim is to create a consolidated and balanced view of the main models.

A particular focus of this book is on the "relationships between models." This includes an overview of the main frameworks (PRF, logical IR, VSM, generalized VSM) and a pairing of TF-IDF with other models. It becomes evident that TF-IDF and LM measure the same quantity, namely the dependence (overlap) between document and query. The Poisson probability helps to establish probabilistic, non-heuristic roots for TF-IDF, and the Poisson parameter, average term frequency, is a binding link between several retrieval models and model parameters.
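The claim that TF-IDF and LM both measure the overlap between document and query can be illustrated with a toy ranking experiment. The sketch below uses the standard textbook formulations (tf(t,d) · log(N/df(t)) for TF-IDF, and a Jelinek-Mercer smoothed query likelihood for LM); the toy collection and the mixing weight lambda = 0.8 are illustrative assumptions, not data or parameters from the book.

```python
import math
from collections import Counter

# Toy collection (illustrative only).
docs = {
    "d1": "poisson model of term frequency".split(),
    "d2": "language model for retrieval".split(),
    "d3": "vector space model retrieval".split(),
}
query = "retrieval model".split()

N = len(docs)
df = Counter()                                          # document frequency
for words in docs.values():
    df.update(set(words))
coll = Counter(w for ws in docs.values() for w in ws)   # collection term counts
coll_len = sum(coll.values())

def tfidf(doc):
    """Textbook TF-IDF: sum over query terms of tf(t,d) * log(N / df(t))."""
    tf = Counter(doc)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df[t])

def lm_jm(doc, lam=0.8):
    """Jelinek-Mercer smoothed query log-likelihood (LM)."""
    tf, dl = Counter(doc), len(doc)
    return sum(
        math.log(lam * tf[t] / dl + (1 - lam) * coll[t] / coll_len)
        for t in query
    )

for name, doc in docs.items():
    print(name, round(tfidf(doc), 3), round(lm_jm(doc), 3))
```

On this toy data both scores induce the same preference order, ranking the documents that share terms with the query above the one that does not, which is the sense in which both models measure document-query overlap.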

Contents

  • List of Figures p. xvii
  • Preface p. xix
  • Acknowledgments p. xxi
  • 1 Introduction p. 1
  • 1.1 Structure and Contribution of this Book p. 1
  • 1.2 Background: A Timeline of IR Models p. 1
  • 1.3 Notation p. 3
  • 1.3.1 The Notation Issue "Term Frequency" p. 6
  • 1.3.2 Notation: Zhai's Book and this Book p. 7
  • 2 Foundations of IR Models p. 9
  • 2.1 TF-IDF p. 9
  • 2.1.1 TF Variants p. 10
  • 2.1.2 TFlog: Logarithmic TF p. 12
  • 2.1.3 TFfrac: Fractional (Ratio-based) TF p. 13
  • 2.1.4 IDF Variants p. 14
  • 2.1.5 Term Weight and RSV p. 16
  • 2.1.6 Other TF Variants: Lifted TF and Pivoted TF p. 16
  • 2.1.7 Semi-subsumed Event Occurrences: A Semantics of the BM25-TF p. 17
  • 2.1.8 Probabilistic IDF: The Probability of Being Informative p. 19
  • 2.1.9 Summary p. 23
  • 2.2 PRF: The Probability of Relevance Framework p. 23
  • 2.2.1 Feature Independence Assumption p. 25
  • 2.2.2 Non-Query Term Assumption p. 26
  • 2.2.3 Term Frequency Split p. 26
  • 2.2.4 Probability Ranking Principle (PRP) p. 26
  • 2.2.5 Summary p. 29
  • 2.3 BIR: Binary Independence Retrieval p. 29
  • 2.3.1 Term Weight and RSV p. 30
  • 2.3.2 Missing Relevance Information p. 31
  • 2.3.3 Variants of the BIR Term Weight p. 32
  • 2.3.4 Smooth Variants of the BIR Term Weight p. 33
  • 2.3.5 RSJ Term Weight p. 33
  • 2.3.6 On Theoretical Arguments for 0.5 in the RSJ Term Weight p. 33
  • 2.3.7 Summary p. 35
  • 2.4 Poisson and 2-Poisson p. 35
  • 2.4.1 Poisson Probability p. 36
  • 2.4.2 Poisson Analogy: Sunny Days and Term Occurrences p. 36
  • 2.4.3 Poisson Example: Toy Data p. 37
  • 2.4.4 Poisson Example: TREC-2 p. 38
  • 2.4.5 Binomial Probability p. 39
  • 2.4.6 Relationship between Poisson and Binomial Probability p. 40
  • 2.4.7 Poisson PRF p. 40
  • 2.4.8 Term Weight and RSV p. 42
  • 2.4.9 2-Poisson p. 43
  • 2.4.10 Summary p. 44
  • 2.5 BM25 p. 45
  • 2.5.1 BM25-TF p. 45
  • 2.5.2 BM25-TF and Pivoted TF p. 45
  • 2.5.3 BM25: Literature and Wikipedia End 2012 p. 46
  • 2.5.4 Term Weight and RSV p. 47
  • 2.5.5 Summary p. 48
  • 2.6 LM: Language Modeling p. 49
  • 2.6.1 Probability Mixtures p. 49
  • 2.6.2 Term Weight and RSV: LM1 p. 51
  • 2.6.3 Term Weight and RSV: LM (normalized) p. 52
  • 2.6.4 Term Weight and RSV: JM-LM p. 54
  • 2.6.5 Term Weight and RSV: Dirich-LM p. 54
  • 2.6.6 Term Weight and RSV: LM2 p. 56
  • 2.6.7 Summary p. 57
  • 2.7 PIN's: Probabilistic Inference Networks p. 58
  • 2.7.1 The Turtle/Croft Link Matrix p. 61
  • 2.7.2 Term Weight and RSV p. 62
  • 2.7.3 Summary p. 63
  • 2.8 Divergence-based Models and DFR p. 63
  • 2.8.1 DFR: Divergence from Randomness p. 63
  • 2.8.2 DFR: Sampling over Documents and Locations p. 65
  • 2.8.3 DFR: Binomial Transformation Step p. 66
  • 2.8.4 DFR and KL-Divergence p. 67
  • 2.8.5 Poisson as a Model of Randomness: P(k_t > 0) p. 68
  • 2.8.6 Poisson as a Model of Randomness: P(k_t = tf_d) p. 68
  • 2.8.7 DFR: Elite Documents p. 69
  • 2.8.8 DFR: Example p. 69
  • 2.8.9 Term Weights and RSV's p. 70
  • 2.8.10 KL-Divergence Retrieval Model p. 73
  • 2.8.11 Summary p. 73
  • 2.9 Relevance-based Models p. 73
  • 2.9.1 Rocchio's Relevance Feedback Model p. 74
  • 2.9.2 The PRF p. 74
  • 2.9.3 Lavrenko's Relevance-based Language Models p. 75
  • 2.10 Precision and Recall p. 76
  • 2.10.1 Precision and Recall: Conditional Probabilities p. 76
  • 2.10.2 Averages: Total Probabilities p. 76
  • 2.11 Summary p. 77
  • 3 Relationships Between IR Models p. 79
  • 3.1 PRF: The Probability of Relevance Framework p. 80
  • 3.1.1 Estimation of Term Probabilities p. 81
  • 3.2 P(d→q): The Probability that d Implies q p. 82
  • 3.3 The Vector-Space "Model" (VSM) p. 83
  • 3.3.1 VSM and Probabilities p. 85
  • 3.4 The Generalised Vector-Space Model (GVSM) p. 85
  • 3.4.1 GVSM and Probabilities p. 86
  • 3.5 A General Matrix Framework p. 88
  • 3.5.1 Term-Document Matrix p. 88
  • 3.5.2 On the Notation Issue "Term Frequency" p. 90
  • 3.5.3 Document-Document Matrix p. 91
  • 3.5.4 Co-Occurrence Matrices p. 91
  • 3.6 A Parallel Derivation of Probabilistic Retrieval Models p. 92
  • 3.7 The Poisson Bridge: P_D(t|u) · avgtf(t, u) = P_L(t|u) · avgdl(u) p. 93
  • 3.8 Query Term Probability Assumptions p. 94
  • 3.8.1 Query Term Mixture Assumption p. 94
  • 3.8.2 Query Term Hastiness Assumption p. 95
  • 3.8.3 Query Term BIR Assumption p. 96
  • 3.9 TF-IDF p. 96
  • 3.9.1 TF-IDF and BIR p. 96
  • 3.9.2 TF-IDF and Poisson p. 98
  • 3.9.3 TF-IDF and BM25 p. 100
  • 3.9.4 TF-IDF and LM p. 101
  • 3.9.5 TF-IDF and LM: Side-by-Side p. 102
  • 3.9.6 TF-IDF and PIN's p. 104
  • 3.9.7 TF-IDF and Divergence p. 105
  • 3.9.8 TF-IDF and DFR: Risk times Gain p. 105
  • 3.9.9 TF-IDF and DFR: Gaps between Term Occurrences p. 106
  • 3.10 More Relationships: BM25 and LM, LM and PIN's p. 108
  • 3.11 Information Theory p. 108
  • 3.11.1 Entropy p. 109
  • 3.11.2 Joint Entropy p. 110
  • 3.11.3 Conditional Entropy p. 110
  • 3.11.4 Mutual Information (MI) p. 110
  • 3.11.5 Cross Entropy p. 110
  • 3.11.6 KL-Divergence p. 111
  • 3.11.7 Query Clarity: Divergence(query || collection) p. 111
  • 3.11.8 LM = Clarity(query) - Divergence(query || doc) p. 112
  • 3.11.9 TF-IDF = Clarity(doc) - Divergence(doc || query) p. 112
  • 3.12 Summary p. 113
  • 4 Summary & Research Outlook p. 117
  • 4.1 Summary p. 117
  • 4.2 Research Outlook p. 119
  • 4.2.1 Retrieval Models p. 119
  • 4.2.2 Evaluation Models p. 120
  • 4.2.3 A Unified Framework for Retrieval AND Evaluation p. 121
  • 4.2.4 Model Combinations and "New" Models p. 122
  • 4.2.5 Dependence-aware Models p. 123
  • 4.2.6 "Query-Log" and other More-Evidence Models p. 124
  • 4.2.7 Phase-2 Models: Retrieval Result Condensation Models p. 124
  • 4.2.8 A Theoretical Framework to Predict Ranking Quality p. 124
  • 4.2.9 MIR: Math for IR p. 125
  • 4.2.10 AIR: Abstraction for IR p. 125
  • Bibliography p. 127
  • Author's Biography p. 135
  • Index p. 137