Dr. Shariq Bashir of National University of Computer and Emerging Sciences, Islamabad, has published “Estimating retrievability ranks of documents using document features,” Neurocomputing 123(10), 216-232 (2014).
Here is the abstract:
Retrievability is a measure of access that quantifies how easily documents can be found using a retrieval system. Such a measure is of particular interest within the recall oriented retrieval domains such as patent or legal retrieval. This is because if a retrieval system for these retrieval domains makes some documents hard to find then professional searchers would have a difficult time when retrieving these documents. One main limitation of retrievability analysis is that it depends upon the processing of exhaustive number of queries. This requires large processing time and resources. In order to handle this problem, in this paper we use document features based approach in order to estimate the retrievability ranks of documents. In experiments, the strong correlation between features and retrievability scores on different collections confirms that it is possible to estimate the retrievability ranks of documents without processing queries. One major advantage of this approach is that it requires fewer resources, and can be computed more quickly as compared to query based approach. While, on the other hand, one major disadvantage of this approach is that it can only estimate the retrievability ranks of documents, but cannot calculate how much there is retrievability inequality (retrieval bias) between the documents of collection.
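To make the contrast in the abstract concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the author's implementation. It assumes the commonly used cumulative definition of retrievability (a document earns one point for every query that returns it at or above a rank cutoff), and it uses document length as a purely hypothetical stand-in for a "document feature", since the abstract does not name the features used. The toy rankings, document IDs, and function names are all invented for illustration.

```python
from collections import Counter

def retrievability(ranked_results, cutoff):
    """Query-based score: count the queries that return each document
    within the top `cutoff` ranks (assumed cumulative definition)."""
    scores = Counter()
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += 1
    return scores

def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data (invented): ranked result lists for three queries.
ranked_results = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3", "d4"],
    "q3": ["d2", "d1", "d4"],
}
# Hypothetical document feature: length in terms.
doc_length = {"d1": 900, "d2": 700, "d3": 400, "d4": 150}

scores = retrievability(ranked_results, cutoff=2)
docs = sorted(doc_length)
rho = spearman([scores[d] for d in docs], [doc_length[d] for d in docs])
print(dict(scores), round(rho, 3))
```

The first function is the expensive part in practice, because it needs results for a very large query set. A high rank correlation between its output and a cheap per-document feature is the kind of evidence the abstract describes for estimating retrievability ranks without running queries.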
The author’s models are tested on four datasets, including two U.S. patent datasets:
USPTO Patent Collections: These collections are downloaded from the freely available US patent and trademark office website. We collect all patents that are listed under the United States Patent Classification (USPC) classes 433 (Dentistry), and 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing). These collections consist of 64,986 documents, with 36,998 documents in USPC Class 422 and 27,988 documents in USPC Class 433. […]