Scientometrics. 2016;107:839-856.
doi: 10.1007/s11192-016-1863-z. Epub 2016 Feb 9.

Estimating search engine index size variability: a 9-year longitudinal study

Antal van den Bosch et al. Scientometrics. 2016.

Abstract

One of the determining factors of the quality of Web search engines is the size of their index. In addition to its influence on search result quality, the size of the indexed Web can also tell us something about which parts of the WWW are directly accessible to the everyday user. We propose a novel method of estimating the size of a Web search engine's index by extrapolating from document frequencies of words observed in a large static corpus of Web pages. In addition, we provide a unique longitudinal perspective on the size of Google and Bing's indices over a nine-year period, from March 2006 until January 2015. We find that index size estimates of these two search engines tend to vary dramatically over time, with Google generally possessing a larger index than Bing. This result raises doubts about the reliability of previous one-off estimates of the size of the indexed Web. We find that much, if not all, of this variability can be explained by changes in the indexing and ranking infrastructure of Google and Bing. This casts further doubt on whether Web search engines can be used reliably for cross-sectional webometric studies.

Keywords: Longitudinal study; Search engine index; Webometrics.
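
The extrapolation described in the abstract can be illustrated with a minimal sketch. It assumes that a word occurs in the same fraction of documents in the search engine's index as in the static corpus, so that the hit count reported for the word can be scaled up by the inverse of that fraction. The function names, the averaging over several words, and all numbers below are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def estimate_index_size(hit_count, corpus_doc_freq, corpus_size):
    """Extrapolate an index size from one word.

    Assumption: the word's document-frequency ratio in the static corpus
    (corpus_doc_freq / corpus_size) also holds in the search engine index.
    """
    if corpus_doc_freq <= 0 or corpus_size <= 0:
        raise ValueError("corpus counts must be positive")
    return hit_count * corpus_size / corpus_doc_freq


def average_estimate(hit_counts, corpus_doc_freqs, corpus_size):
    """Average the per-word estimates over all words present in both inputs."""
    words = hit_counts.keys() & corpus_doc_freqs.keys()
    if not words:
        raise ValueError("no words in common")
    return mean(
        estimate_index_size(hit_counts[w], corpus_doc_freqs[w], corpus_size)
        for w in words
    )


# Hypothetical toy numbers: a 1,000-document corpus and two words.
toy_corpus_size = 1_000
toy_doc_freqs = {"alpha": 500, "beta": 100}            # documents containing each word
toy_hit_counts = {"alpha": 4_000_000, "beta": 1_000_000}  # hits reported by an engine

toy_estimate = average_estimate(toy_hit_counts, toy_doc_freqs, toy_corpus_size)
print(toy_estimate)
```

Averaging over several words is one simple way to dampen the error of any single word's estimate; the per-word estimates can differ widely, which is why the individual values are worth inspecting as well.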


Figures

Fig. 1
Labeled scatter plot of per-word DMOZ frequencies of occurrence and estimates of the Wikipedia test corpus. The x axis is logarithmic. The solid horizontal line represents the actual number of documents in the Wikipedia test corpus (2,112,923); the dashed horizontal line is the averaged estimate of 2,189,790. The dotted slanted line represents the log-linear regression function y = (204,224 × ln(x)) − 141,623
Fig. 2
Estimated size of the Google and Bing indices from March 2006 to January 2015. The lines connect the unweighted running daily averages of 31 days. The colored, numbered markers at the top represent reported changes in Google and Bing’s infrastructure. The colors of the markers correspond to the color of the search engine curve they relate to; for example, red markers signal changes in Google’s infrastructure (the red curve). Events that line up with a spike are marked with an open circle; other events are marked with a times sign (×)
Fig. 3
Estimated size of the Google index from March 2006 to January 2015 for three pivot words ("the", "basketball", and "illini") and the average estimate over all 28 words (black line). The lines connect the unweighted running daily averages of 31 days
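
The smoothing mentioned in the captions of Fig. 2 and Fig. 3 can be sketched as an unweighted running average. The captions do not say whether the 31-day window is trailing or centered; the sketch below assumes a trailing window that is shorter at the start of the series, and the function name and sample values are hypothetical.

```python
def running_average(values, window=31):
    """Unweighted trailing mean over at most `window` most recent values.

    Assumption: the window is trailing; early positions average over
    however many values are available so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            # Drop the value that just left the window.
            total -= values[i - window]
        averages.append(total / min(i + 1, window))
    return averages


# Hypothetical daily estimates, smoothed with a 2-day window for brevity.
smoothed = running_average([1, 2, 3, 4], window=2)
print(smoothed)
```

With daily index size estimates as input, calling `running_average(daily_estimates)` with the default window of 31 would produce one smoothed value per day under the stated assumption.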

