Introduction to information retrieval

cover image

Where to find it

Davis Library (8th floor)

Call Number
QA76.9.T48 M26 2008
Status
Checked Out (Due 5/17/2024)

Information & Library Science Library

Call Number
QA76.9.T48 M26 2008
Status
Available
Call Number
QA76.9.T48 M26 2008 c. 3
Status
Available

Summary

Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.

Contents

  • Table of Notation p. xi
  • Preface p. xv
  • 1 Boolean retrieval p. 1
  • 1.1 An example information retrieval problem p. 3
  • 1.2 A first take at building an inverted index p. 6
  • 1.3 Processing Boolean queries p. 9
  • 1.4 The extended Boolean model versus ranked retrieval p. 13
  • 1.5 References and further reading p. 16
  • 2 The term vocabulary and postings lists p. 18
  • 2.1 Document delineation and character sequence decoding p. 18
  • 2.2 Determining the vocabulary of terms p. 21
  • 2.3 Faster postings list intersection via skip pointers p. 33
  • 2.4 Positional postings and phrase queries p. 36
  • 2.5 References and further reading p. 43
  • 3 Dictionaries and tolerant retrieval p. 45
  • 3.1 Search structures for dictionaries p. 45
  • 3.2 Wildcard queries p. 48
  • 3.3 Spelling correction p. 52
  • 3.4 Phonetic correction p. 58
  • 3.5 References and further reading p. 59
  • 4 Index construction p. 61
  • 4.1 Hardware basics p. 62
  • 4.2 Blocked sort-based indexing p. 63
  • 4.3 Single-pass in-memory indexing p. 66
  • 4.4 Distributed indexing p. 68
  • 4.5 Dynamic indexing p. 71
  • 4.6 Other types of indexes p. 73
  • 4.7 References and further reading p. 76
  • 5 Index compression p. 78
  • 5.1 Statistical properties of terms in information retrieval p. 79
  • 5.2 Dictionary compression p. 82
  • 5.3 Postings file compression p. 87
  • 5.4 References and further reading p. 97
  • 6 Scoring, term weighting, and the vector space model p. 100
  • 6.1 Parametric and zone indexes p. 101
  • 6.2 Term frequency and weighting p. 107
  • 6.3 The vector space model for scoring p. 110
  • 6.4 Variant tf-idf functions p. 116
  • 6.5 References and further reading p. 122
  • 7 Computing scores in a complete search system p. 124
  • 7.1 Efficient scoring and ranking p. 124
  • 7.2 Components of an information retrieval system p. 132
  • 7.3 Vector space scoring and query operator interaction p. 136
  • 7.4 References and further reading p. 137
  • 8 Evaluation in information retrieval p. 139
  • 8.1 Information retrieval system evaluation p. 140
  • 8.2 Standard test collections p. 141
  • 8.3 Evaluation of unranked retrieval sets p. 142
  • 8.4 Evaluation of ranked retrieval results p. 145
  • 8.5 Assessing relevance p. 151
  • 8.6 A broader perspective: System quality and user utility p. 154
  • 8.7 Results snippets p. 157
  • 8.8 References and further reading p. 159
  • 9 Relevance feedback and query expansion p. 162
  • 9.1 Relevance feedback and pseudo relevance feedback p. 163
  • 9.2 Global methods for query reformulation p. 173
  • 9.3 References and further reading p. 177
  • 10 XML retrieval p. 178
  • 10.1 Basic XML concepts p. 180
  • 10.2 Challenges in XML retrieval p. 183
  • 10.3 A vector space model for XML retrieval p. 188
  • 10.4 Evaluation of XML retrieval p. 192
  • 10.5 Text-centric versus data-centric XML retrieval p. 196
  • 10.6 References and further reading p. 198
  • 11 Probabilistic information retrieval p. 201
  • 11.1 Review of basic probability theory p. 202
  • 11.2 The probability ranking principle p. 203
  • 11.3 The binary independence model p. 204
  • 11.4 An appraisal and some extensions p. 212
  • 11.5 References and further reading p. 216
  • 12 Language models for information retrieval p. 218
  • 12.1 Language models p. 218
  • 12.2 The query likelihood model p. 223
  • 12.3 Language modeling versus other approaches in information retrieval p. 229
  • 12.4 Extended language modeling approaches p. 230
  • 12.5 References and further reading p. 232
  • 13 Text classification and Naive Bayes p. 234
  • 13.1 The text classification problem p. 237
  • 13.2 Naive Bayes text classification p. 238
  • 13.3 The Bernoulli model p. 243
  • 13.4 Properties of Naive Bayes p. 245
  • 13.5 Feature selection p. 251
  • 13.6 Evaluation of text classification p. 258
  • 13.7 References and further reading p. 264
  • 14 Vector space classification p. 266
  • 14.1 Document representations and measures of relatedness in vector spaces p. 267
  • 14.2 Rocchio classification p. 269
  • 14.3 k nearest neighbor p. 273
  • 14.4 Linear versus nonlinear classifiers p. 277
  • 14.5 Classification with more than two classes p. 281
  • 14.6 The bias-variance tradeoff p. 284
  • 14.7 References and further reading p. 291
  • 15 Support vector machines and machine learning on documents p. 293
  • 15.1 Support vector machines: The linearly separable case p. 294
  • 15.2 Extensions to the support vector machine model p. 300
  • 15.3 Issues in the classification of text documents p. 307
  • 15.4 Machine-learning methods in ad hoc information retrieval p. 314
  • 15.5 References and further reading p. 318
  • 16 Flat clustering p. 321
  • 16.1 Clustering in information retrieval p. 322
  • 16.2 Problem statement p. 326
  • 16.3 Evaluation of clustering p. 327
  • 16.4 K-means p. 331
  • 16.5 Model-based clustering p. 338
  • 16.6 References and further reading p. 343
  • 17 Hierarchical clustering p. 346
  • 17.1 Hierarchical agglomerative clustering p. 347
  • 17.2 Single-link and complete-link clustering p. 350
  • 17.3 Group-average agglomerative clustering p. 356
  • 17.4 Centroid clustering p. 358
  • 17.5 Optimality of hierarchical agglomerative clustering p. 360
  • 17.6 Divisive clustering p. 362
  • 17.7 Cluster labeling p. 363
  • 17.8 Implementation notes p. 365
  • 17.9 References and further reading p. 367
  • 18 Matrix decompositions and latent semantic indexing p. 369
  • 18.1 Linear algebra review p. 369
  • 18.2 Term-document matrices and singular value decompositions p. 373
  • 18.3 Low-rank approximations p. 376
  • 18.4 Latent semantic indexing p. 378
  • 18.5 References and further reading p. 383
  • 19 Web search basics p. 385
  • 19.1 Background and history p. 385
  • 19.2 Web characteristics p. 387
  • 19.3 Advertising as the economic model p. 392
  • 19.4 The search user experience p. 395
  • 19.5 Index size and estimation p. 396
  • 19.6 Near-duplicates and shingling p. 400
  • 19.7 References and further reading p. 404
  • 20 Web crawling and indexes p. 405
  • 20.1 Overview p. 405
  • 20.2 Crawling p. 406
  • 20.3 Distributing indexes p. 415
  • 20.4 Connectivity servers p. 416
  • 20.5 References and further reading p. 419
  • 21 Link analysis p. 421
  • 21.1 The Web as a graph p. 422
  • 21.2 PageRank p. 424
  • 21.3 Hubs and authorities p. 433
  • 21.4 References and further reading p. 439
  • Bibliography p. 441
  • Index p. 469

Other details