Information retrieval models: foundations and relationships

Where to find it

Information & Library Science Library

Call Number
ZA3075 .R645 2013
Status
Available

Summary

Information Retrieval (IR) models are a core component of IR research and IR systems. The past decade brought a consolidation of the family of IR models, which by 2000 consisted of relatively isolated views on TF-IDF (Term-Frequency times Inverse-Document-Frequency) as the weighting scheme in the vector-space model (VSM), the probabilistic relevance framework (PRF), the binary independence retrieval (BIR) model, BM25 (Best-Match Version 25, the main instantiation of the PRF/BIR), and language modelling (LM). Also, the early 2000s saw the arrival of divergence from randomness (DFR).

Regarding intuition and simplicity, though LM is clear from a probabilistic point of view, several people have stated: "It is easy to understand TF-IDF and BM25. For LM, however, we understand the math, but we do not fully understand why it works."

This book takes a horizontal approach, gathering the foundations of TF-IDF, PRF, BIR, Poisson, BM25, LM, probabilistic inference networks (PIN's), and divergence-based models. The aim is to create a consolidated and balanced view of the main models.

A particular focus of this book is on the "relationships between models." This includes an overview of the main frameworks (PRF, logical IR, VSM, generalized VSM) and a pairing of TF-IDF with other models. It becomes evident that TF-IDF and LM measure the same quantity, namely the dependence (overlap) between document and query. The Poisson probability helps to establish probabilistic, non-heuristic roots for TF-IDF, and the Poisson parameter, average term frequency, is a binding link between several retrieval models and model parameters.
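The claim that TF-IDF and LM both measure the overlap between document and query can be illustrated with a toy ranking experiment. The sketch below uses the standard textbook formulations (tf(t,d) · log(N/df(t)) for TF-IDF, and a Jelinek-Mercer smoothed query likelihood for LM); the toy collection and the mixing weight lambda = 0.8 are illustrative assumptions, not data or parameters from the book.

```python
import math
from collections import Counter

# Toy collection (illustrative only).
docs = {
    "d1": "poisson model of term frequency".split(),
    "d2": "language model for retrieval".split(),
    "d3": "vector space model retrieval".split(),
}
query = "retrieval model".split()

N = len(docs)
df = Counter()                                          # document frequency
for words in docs.values():
    df.update(set(words))
coll = Counter(w for ws in docs.values() for w in ws)   # collection term counts
coll_len = sum(coll.values())

def tfidf(doc):
    """Textbook TF-IDF: sum over query terms of tf(t,d) * log(N / df(t))."""
    tf = Counter(doc)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df[t])

def lm_jm(doc, lam=0.8):
    """Jelinek-Mercer smoothed query log-likelihood (LM)."""
    tf, dl = Counter(doc), len(doc)
    return sum(
        math.log(lam * tf[t] / dl + (1 - lam) * coll[t] / coll_len)
        for t in query
    )

for name, doc in docs.items():
    print(name, round(tfidf(doc), 3), round(lm_jm(doc), 3))
```

On this toy data both scores induce the same preference order, ranking the documents that share terms with the query above the one that does not, which is the sense in which both models measure document-query overlap.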

Contents

  • List of Figures p. xvii
  • Preface p. xix
  • Acknowledgments p. xxi
  • 1 Introduction p. 1
  • 1.1 Structure and Contribution of this Book p. 1
  • 1.2 Background: A Timeline of IR Models p. 1
  • 1.3 Notation p. 3
  • 1.3.1 The Notation Issue "Term Frequency" p. 6
  • 1.3.2 Notation: Zhai's Book and this Book p. 7
  • 2 Foundations of IR Models p. 9
  • 2.1 TF-IDF p. 9
  • 2.1.1 TF Variants p. 10
  • 2.1.2 TFlog: Logarithmic TF p. 12
  • 2.1.3 TFfrac: Fractional (Ratio-based) TF p. 13
  • 2.1.4 IDF Variants p. 14
  • 2.1.5 Term Weight and RSV p. 16
  • 2.1.6 Other TF Variants: Lifted TF and Pivoted TF p. 16
  • 2.1.7 Semi-subsumed Event Occurrences: A Semantics of the BM25-TF p. 17
  • 2.1.8 Probabilistic IDF: The Probability of Being Informative p. 19
  • 2.1.9 Summary p. 23
  • 2.2 PRF: The Probability of Relevance Framework p. 23
  • 2.2.1 Feature Independence Assumption p. 25
  • 2.2.2 Non-Query Term Assumption p. 26
  • 2.2.3 Term Frequency Split p. 26
  • 2.2.4 Probability Ranking Principle (PRP) p. 26
  • 2.2.5 Summary p. 29
  • 2.3 BIR: Binary Independence Retrieval p. 29
  • 2.3.1 Term Weight and RSV p. 30
  • 2.3.2 Missing Relevance Information p. 31
  • 2.3.3 Variants of the BIR Term Weight p. 32
  • 2.3.4 Smooth Variants of the BIR Term Weight p. 33
  • 2.3.5 RSJ Term Weight p. 33
  • 2.3.6 On Theoretical Arguments for 0.5 in the RSJ Term Weight p. 33
  • 2.3.7 Summary p. 35
  • 2.4 Poisson and 2-Poisson p. 35
  • 2.4.1 Poisson Probability p. 36
  • 2.4.2 Poisson Analogy: Sunny Days and Term Occurrences p. 36
  • 2.4.3 Poisson Example: Toy Data p. 37
  • 2.4.4 Poisson Example: TREC-2 p. 38
  • 2.4.5 Binomial Probability p. 39
  • 2.4.6 Relationship between Poisson and Binomial Probability p. 40
  • 2.4.7 Poisson PRF p. 40
  • 2.4.8 Term Weight and RSV p. 42
  • 2.4.9 2-Poisson p. 43
  • 2.4.10 Summary p. 44
  • 2.5 BM25 p. 45
  • 2.5.1 BM25-TF p. 45
  • 2.5.2 BM25-TF and Pivoted TF p. 45
  • 2.5.3 BM25: Literature and Wikipedia End 2012 p. 46
  • 2.5.4 Term Weight and RSV p. 47
  • 2.5.5 Summary p. 48
  • 2.6 LM: Language Modeling p. 49
  • 2.6.1 Probability Mixtures p. 49
  • 2.6.2 Term Weight and RSV: LM1 p. 51
  • 2.6.3 Term Weight and RSV: LM (normalized) p. 52
  • 2.6.4 Term Weight and RSV: JM-LM p. 54
  • 2.6.5 Term Weight and RSV: Dirich-LM p. 54
  • 2.6.6 Term Weight and RSV: LM2 p. 56
  • 2.6.7 Summary p. 57
  • 2.7 PIN's: Probabilistic Inference Networks p. 58
  • 2.7.1 The Turtle/Croft Link Matrix p. 61
  • 2.7.2 Term Weight and RSV p. 62
  • 2.7.3 Summary p. 63
  • 2.8 Divergence-based Models and DFR p. 63
  • 2.8.1 DFR: Divergence from Randomness p. 63
  • 2.8.2 DFR: Sampling over Documents and Locations p. 65
  • 2.8.3 DFR: Binomial Transformation Step p. 66
  • 2.8.4 DFR and KL-Divergence p. 67
  • 2.8.5 Poisson as a Model of Randomness: P(k_t > 0) p. 68
  • 2.8.6 Poisson as a Model of Randomness: P(k_t = tf_d) p. 68
  • 2.8.7 DFR: Elite Documents p. 69
  • 2.8.8 DFR: Example p. 69
  • 2.8.9 Term Weights and RSV's p. 70
  • 2.8.10 KL-Divergence Retrieval Model p. 73
  • 2.8.11 Summary p. 73
  • 2.9 Relevance-based Models p. 73
  • 2.9.1 Rocchio's Relevance Feedback Model p. 74
  • 2.9.2 The PRF p. 74
  • 2.9.3 Lavrenko's Relevance-based Language Models p. 75
  • 2.10 Precision and Recall p. 76
  • 2.10.1 Precision and Recall: Conditional Probabilities p. 76
  • 2.10.2 Averages: Total Probabilities p. 76
  • 2.11 Summary p. 77
  • 3 Relationships Between IR Models p. 79
  • 3.1 PRF: The Probability of Relevance Framework p. 80
  • 3.1.1 Estimation of Term Probabilities p. 81
  • 3.2 P(d→q): The Probability that d Implies q p. 82
  • 3.3 The Vector-Space "Model" (VSM) p. 83
  • 3.3.1 VSM and Probabilities p. 85
  • 3.4 The Generalised Vector-Space Model (GVSM) p. 85
  • 3.4.1 GVSM and Probabilities p. 86
  • 3.5 A General Matrix Framework p. 88
  • 3.5.1 Term-Document Matrix p. 88
  • 3.5.2 On the Notation Issue "Term Frequency" p. 90
  • 3.5.3 Document-Document Matrix p. 91
  • 3.5.4 Co-Occurrence Matrices p. 91
  • 3.6 A Parallel Derivation of Probabilistic Retrieval Models p. 92
  • 3.7 The Poisson Bridge: P_D(t|u) · avgtf(t, u) = P_L(t|u) · avgdl(u) p. 93
  • 3.8 Query Term Probability Assumptions p. 94
  • 3.8.1 Query Term Mixture Assumption p. 94
  • 3.8.2 Query Term Hastiness Assumption p. 95
  • 3.8.3 Query Term BIR Assumption p. 96
  • 3.9 TF-IDF p. 96
  • 3.9.1 TF-IDF and BIR p. 96
  • 3.9.2 TF-IDF and Poisson p. 98
  • 3.9.3 TF-IDF and BM25 p. 100
  • 3.9.4 TF-IDF and LM p. 101
  • 3.9.5 TF-IDF and LM: Side-by-Side p. 102
  • 3.9.6 TF-IDF and PIN's p. 104
  • 3.9.7 TF-IDF and Divergence p. 105
  • 3.9.8 TF-IDF and DFR: Risk times Gain p. 105
  • 3.9.9 TF-IDF and DFR: Gaps between Term Occurrences p. 106
  • 3.10 More Relationships: BM25 and LM, LM and PIN's p. 108
  • 3.11 Information Theory p. 108
  • 3.11.1 Entropy p. 109
  • 3.11.2 Joint Entropy p. 110
  • 3.11.3 Conditional Entropy p. 110
  • 3.11.4 Mutual Information (MI) p. 110
  • 3.11.5 Cross Entropy p. 110
  • 3.11.6 KL-Divergence p. 111
  • 3.11.7 Query Clarity: Divergence(query || collection) p. 111
  • 3.11.8 LM = Clarity(query) - Divergence(query || doc) p. 112
  • 3.11.9 TF-IDF = Clarity(doc) - Divergence(doc || query) p. 112
  • 3.12 Summary p. 113
  • 4 Summary & Research Outlook p. 117
  • 4.1 Summary p. 117
  • 4.2 Research Outlook p. 119
  • 4.2.1 Retrieval Models p. 119
  • 4.2.2 Evaluation Models p. 120
  • 4.2.3 A Unified Framework for Retrieval AND Evaluation p. 121
  • 4.2.4 Model Combinations and "New" Models p. 122
  • 4.2.5 Dependence-aware Models p. 123
  • 4.2.6 "Query-Log" and other More-Evidence Models p. 124
  • 4.2.7 Phase-2 Models: Retrieval Result Condensation Models p. 124
  • 4.2.8 A Theoretical Framework to Predict Ranking Quality p. 124
  • 4.2.9 MIR: Math for IR p. 125
  • 4.2.10 AIR: Abstraction for IR p. 125
  • Bibliography p. 127
  • Author's Biography p. 135
  • Index p. 137