Deep web
The deep web (also called the invisible web or hidden web) is the name given to pages on the World Wide Web that are not part of the surface web indexed by common search engines. It consists of pages that are not linked to by other pages, such as dynamic Web pages: searchable databases that generate pages only in response to a query, with the underlying information stored in tables managed by database systems such as Access, Oracle, or other SQL databases. The deep web also includes sites that require registration or otherwise limit access to their pages, preventing search engines from browsing them and creating cached copies.
Non-textual files such as multimedia (image) files, Usenet archives, and documents in non-HTML file formats such as PDF and DOC once formed part of the deep web, but are now more easily accessible to search engines, especially Google.
The deep web should not be confused with the dark web or dark internet, terms which refer to machines or network segments not connected to the Internet. While deep web content is accessible to people online but not visible to conventional search engines, dark internet content is not accessible online to either people or search engines.
Surface web
To better understand the invisible web, consider how conventional search engines construct their databases, and thereby define the surface web. Programs called spiders or web crawlers start by reading pages from an initial list of websites. Each page they read is indexed and added to the search engine's database, and any hyperlinks to new pages are added to the list of pages to be indexed. Eventually, all reachable pages have been indexed, or the search engine runs out of time or disk space; these reachable pages are the surface web. Pages that have no chain of links from a page on the spider's initial list are invisible to that spider and are not part of the surface web it defines.
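The crawling process described above can be sketched as a breadth-first traversal of a link graph. The graph and URLs below are hypothetical, chosen only to show that a page with no incoming chain of links is never indexed:

```python
from collections import deque

# A hypothetical link graph: page URL -> list of hyperlinked URLs.
LINK_GRAPH = {
    "seed.example/home": ["seed.example/about", "seed.example/news"],
    "seed.example/about": ["seed.example/contact"],
    "seed.example/news": [],
    "seed.example/contact": [],
    # This page exists but nothing links to it, so a crawler never finds it.
    "seed.example/orphan": [],
}

def crawl(seeds):
    """Index every page reachable by following hyperlinks from the seeds."""
    indexed = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in indexed or url not in LINK_GRAPH:
            continue
        indexed.add(url)               # add the page to the index
        queue.extend(LINK_GRAPH[url])  # schedule its outgoing links
    return indexed

surface = crawl(["seed.example/home"])
# "seed.example/orphan" is absent from the result: it is outside the
# surface web that this crawler defines.
```

The set returned by `crawl` is exactly the surface web relative to the seed list; changing the seeds changes which pages are "invisible".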
In opposition to the 'surface web' stands the 'deep web', the great majority of which consists of searchable databases. To understand why these databases are invisible to spiders (and hence to their search engines), consider the following:
- Imagine someone has collected a large amount of information (books, texts, articles, images, etc.) and published it online as a website reachable only via a search field. Like most databases, this one works as follows:
- the user types the desired keywords into a search field
- the search facility looks inside the database and retrieves the relevant content
- a page of results is presented, with links to every important item related to the user's query
Once a conventional search engine's web crawler reaches this site, it will capture the text of the main page and of the pages it can find hyperlinks to (usually "about us", "contact us", "privacy policy", and so on). But the great majority of the information, the books, texts, articles, or images reachable only by querying the search field, cannot be reached by the web crawler: the robot cannot predict which words it should type into the search field, so the data is invisible to the search engine.
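The distinction can be made concrete with a toy model of such a site. Everything here is hypothetical: a handful of statically linked pages, plus a database whose records are returned only in answer to a query the crawler cannot guess:

```python
# A hypothetical site backed by a searchable database. Static pages are
# reachable through links; database records only through a keyword query.
DATABASE = {
    "whales": "Article about whales",
    "volcanoes": "Article about volcanoes",
}

STATIC_PAGES = {"/": ["/about", "/contact"], "/about": [], "/contact": []}

def search(keyword):
    """The site's search field: returns a record only for an exact query."""
    return DATABASE.get(keyword)

def crawl_site():
    """A link-following crawler sees only the statically linked pages; it
    has no way to enumerate the keywords that unlock the database."""
    seen, frontier = set(), ["/"]
    while frontier:
        page = frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        frontier.extend(STATIC_PAGES.get(page, []))
    return seen

# crawl_site() yields only the three static pages; the two database
# articles are never reached and remain part of the deep web.
```

A human who types "whales" into the search field gets the article; the crawler, which only follows links, never does.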
Accessing the deep web
As noted above, search engines use web crawlers that follow hyperlinks. Such crawlers typically do not submit queries to databases, because the number of possible queries against a single database is potentially unbounded. It has been noted that this can be partially overcome by publishing links to query results, which also increases Google-style PageRank scores for members of the deep web.
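The workaround of publishing links to query results can be sketched as follows. The site, URL scheme, and query list are hypothetical; the idea is simply that a precomputed page of ordinary hyperlinks gives a link-following crawler a path into otherwise query-only content:

```python
from urllib.parse import urlencode

# Hypothetical: a site owner precomputes a static "browse" page with one
# plain hyperlink per known query, so crawlers can reach the results.
KNOWN_QUERIES = ["whales", "volcanoes"]

def result_url(query):
    """Build the URL the search form would submit to, as a static link."""
    return "https://example.org/search?" + urlencode({"q": query})

def browse_page():
    """An ordinary HTML page of links that any crawler can follow."""
    links = "\n".join(
        f'<a href="{result_url(q)}">{q}</a>' for q in KNOWN_QUERIES
    )
    return f"<html><body>\n{links}\n</body></html>"
```

Because the result pages are now targets of ordinary hyperlinks, a crawler indexes them like any other surface web page, and inbound links to them can accumulate PageRank-style scores.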
In 2005, Yahoo! made a small part of the deep web searchable by releasing Yahoo! Subscriptions, a search engine that searches a few subscription-only web sites.
Some search tools are being designed to retrieve information from the deep web. Their crawlers are set to identify searchable databases and to interact with them in some fashion, aiming to provide access to deep web content.
References
- Gary Price & Chris Sherman. The Invisible Web : Uncovering Information Sources Search Engines Can't See. CyberAge Books, July 2001. ISBN 091096551X
- Joe Barker. Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity. UC Berkeley - Teaching Library Internet Workshops, January 2004. Last seen online July 2005 at http://www.lib.berkeley.edu
- Michael K. Bergman. The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing. August, 2001. Volume 7, Issue 1. http://www.press.umich.edu/jep/07-01/bergman.html
- Alex Wright, In Search of the Deep Web, Salon.com, March 2004, http://www.salon.com/tech/feature/2004/03/09/deep_web/index.html