Web crawler in the context of Web archiving


⭐ Core Definition: Web crawler

A web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

Web search engines and some other websites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
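
As a concrete illustration of that fetch-and-follow loop, here is a minimal sketch of a breadth-first crawler in Python using only the standard library. The seed URL, page limit, and the idea of storing raw HTML in a dictionary are illustrative assumptions standing in for a real crawler's politeness rules, robots.txt handling, and indexing pipeline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, keep its HTML, enqueue its out-links."""
    frontier = deque([seed])
    seen = {seed}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        pages[url] = html  # a real search engine would index the text here
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

if __name__ == "__main__":
    for url in crawl("https://example.com"):
        print(url)
```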

Web crawler in the context of Web archiving

Web archiving is the process of collecting, preserving, and providing access to material from the World Wide Web. The aim is to ensure that information is preserved in an archival format for research and the public. Platformizing archives, that is, digitizing historical records via interfaces patterned on social media platforms, can reshape collective memory by privileging content that aligns with social-media logic such as popularity, connectivity, and programmability.

Web archivists typically employ automated web crawlers to capture the massive amount of information on the Web. A widely known web archive service is the Wayback Machine, run by the Internet Archive.
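
A toy illustration of what "capturing" means in practice: the sketch below fetches a URL and stores the body together with a timestamped metadata record. The directory layout and field names are invented for this example; production archival crawlers such as Heritrix write standardized WARC records instead.

```python
import json
import time
from pathlib import Path
from urllib.request import urlopen

def capture(url, archive_dir="archive"):
    """Fetch a URL and store the body plus capture metadata, keyed by timestamp."""
    timestamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    response = urlopen(url, timeout=10)
    body = response.read()
    out = Path(archive_dir)
    out.mkdir(exist_ok=True)
    # Hypothetical layout: one HTML file and one JSON sidecar per capture.
    (out / f"{timestamp}.html").write_bytes(body)
    metadata = {
        "url": url,
        "timestamp": timestamp,
        "status": response.status,
        "content_type": response.headers.get("Content-Type"),
    }
    (out / f"{timestamp}.json").write_text(json.dumps(metadata, indent=2))
    return metadata

if __name__ == "__main__":
    print(capture("https://example.com"))
```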

In this Dossier

Web crawler in the context of Internet Archive

The Internet Archive is an American non-profit library founded in 1996 by Brewster Kahle that runs a digital library website, archive.org. It provides free access to collections of digitized media, including websites, software applications, music, audiovisual material, and print materials. The Archive also advocates a free and open Internet. Its stated mission is to provide "universal access to all knowledge".

The Internet Archive allows the public to upload and download digital material to its data cluster, but the bulk of its data is collected automatically by its web crawlers, which work to preserve as much of the public web as possible. Its web archive, the Wayback Machine, contains more than 1 trillion web captures. The Archive also oversees numerous book digitization projects, collectively one of the world's largest book digitization efforts.
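
The Wayback Machine exposes a public availability endpoint that returns the capture closest to a requested date. The sketch below queries it with the standard library; the endpoint and response fields follow the publicly documented API, but treat the exact response shape as an assumption and check the current documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def closest_snapshot(url, timestamp=None):
    """Ask the Wayback Machine for the capture closest to an optional
    YYYYMMDD timestamp; returns a dict or None if nothing is archived."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    query = "https://archive.org/wayback/available?" + urlencode(params)
    with urlopen(query, timeout=10) as response:
        data = json.load(response)
    # The nearest capture is nested under archived_snapshots.closest.
    return data.get("archived_snapshots", {}).get("closest")

if __name__ == "__main__":
    print(closest_snapshot("example.com", "20100101"))
```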

View the full Wikipedia page for Internet Archive

Web crawler in the context of Web server

A web server is computer software and underlying hardware that accepts requests via HTTP (the network protocol created to distribute web content) or its secure variant HTTPS. A user agent, commonly a web browser or web crawler, initiates communication by making a request for a web page or other resource using HTTP, and the server responds with the content of that resource or an error message. A web server can also accept and store resources sent from the user agent if configured to do so.
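
The request/response exchange described above can be reproduced with a few lines of Python's built-in http.server module. This is a minimal sketch for local experimentation, not a production web server; the port and response body are arbitrary choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Answers every GET for "/" with a small HTML page, and 404 otherwise."""
    def do_GET(self):
        if self.path == "/":
            body = b"<html><body><h1>Hello, crawler</h1></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "resource not found")

if __name__ == "__main__":
    # Serve on localhost:8000; a browser or crawler can now issue HTTP requests.
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```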

The hardware used to run a web server can vary according to the volume of requests that it needs to handle. At the low end of the range are embedded systems, such as a router that runs a small web server as its configuration interface. A high-traffic Internet website might handle requests with hundreds of servers that run on racks of high-speed computers.

View the full Wikipedia page for Web server

Web crawler in the context of Web search engine

A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user enters a query in a web browser or a mobile app, and the search results are typically presented as a list of hyperlinks accompanied by textual summaries and images. Users also have the option of limiting a search to specific types of results, such as images, videos, or news.

For a search provider, its engine is part of a distributed computing system that can encompass many data centers throughout the world. The speed and accuracy of an engine's response to a query are based on a complex system of indexing that is continuously updated by automated web crawlers. This can include data mining the files and databases stored on web servers, although some content is not accessible to crawlers.
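
At the heart of that indexing system is an inverted index, which maps terms to the pages that contain them. The sketch below builds one from a dictionary of crawled pages and answers simple AND queries; the sample pages and the tokenization rule are illustrative assumptions, and real engines add ranking, stemming, and continuous incremental updates.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each term to the set of URLs whose text contains it.

    `pages` is a dict of url -> plain text, e.g. produced by a crawler
    after stripping HTML markup."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].add(url)
    return index

def search(index, query):
    """Return URLs containing every query term (simple AND semantics)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

pages = {
    "https://example.com/a": "Web crawlers feed the search index",
    "https://example.com/b": "Search engines rank indexed pages",
}
index = build_index(pages)
print(search(index, "search index"))  # {'https://example.com/a'}
```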

View the full Wikipedia page for Web search engine

Web crawler in the context of Video search

A video search engine is a web-based search engine which crawls the web for video content. Some video search engines parse externally hosted content while others allow content to be uploaded and hosted on their own servers. Some engines also allow users to search by video format type and by length of the clip. The video search results are usually accompanied by a thumbnail view of the video.

Video search engines are computer programs designed to find videos stored on digital devices, either on Internet servers or in local storage on the same computer. These searches can be made through audiovisual indexing, which extracts information from audiovisual material and records it as metadata to be tracked by search engines.

View the full Wikipedia page for Video search

Web crawler in the context of Vertical search

A vertical search engine is distinct from a general web search engine in that it focuses on a specific segment of online content; vertical search engines are also called specialty or topical search engines. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, scholarly literature, job search, and travel. Examples of vertical search engines include the Library of Congress, Mocavo, Nuroa, Trulia, and Yelp.

In contrast to general web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only web pages relevant to a pre-defined topic or set of topics. Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine.

View the full Wikipedia page for Vertical search

Web crawler in the context of Unsupervised learning

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

Conceptually, unsupervised learning divides into the aspects of data, training, algorithm, and downstream applications. Typically, the dataset is harvested cheaply "in the wild", such as a massive text corpus obtained by web crawling, with only minor filtering (such as Common Crawl). This compares favorably to supervised learning, where the dataset (such as ImageNet1000) is typically constructed manually, which is much more expensive.
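
A rough sketch of that "minor filtering" step: the function below keeps crawled documents that pass cheap length bounds and an exact-duplicate check. The thresholds are arbitrary placeholders; real corpus pipelines (including those built on Common Crawl) use more elaborate language identification, quality scoring, and near-duplicate detection.

```python
def filter_corpus(documents, min_words=50, max_words=100_000):
    """Keep documents that pass cheap heuristics: length bounds plus
    exact deduplication, in the spirit of light web-corpus filtering."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        words = doc.split()
        if not (min_words <= len(words) <= max_words):
            continue
        digest = hash(doc)  # exact-duplicate check; real pipelines use near-dedup
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

corpus = filter_corpus(["short text",
                        "a " * 60 + "longer document",
                        "a " * 60 + "longer document"])
print(len(corpus))  # 1: the short text is dropped and the duplicate removed
```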

View the full Wikipedia page for Unsupervised learning

Web crawler in the context of Yahoo Search

Yahoo Search is a search engine owned and operated by Yahoo, using Microsoft Bing to generate results.

When first released in 1994, Yahoo Search sent queries to a searchable index of pages supplemented with its directory of websites. Web crawling and data housing were handled by Google from 2000 until 2004, when Yahoo created its own crawler.

View the full Wikipedia page for Yahoo Search

Web crawler in the context of Web scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page); web crawling is therefore a main component of web scraping, used to fetch pages for later processing. Once a page has been fetched, extraction can take place: the content of the page may be parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page in order to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping).
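
A small contact-scraping sketch along those lines: fetch one page, pull out strings that look like e-mail addresses, and write them to a CSV file. The target URL is hypothetical and the regular expression is deliberately loose; the example is meant to illustrate the fetch-then-extract split, not to be a robust address parser.

```python
import csv
import re
from urllib.request import urlopen

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrape_emails(url):
    """Fetch one page and extract anything that looks like an e-mail address."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    return sorted(set(EMAIL_RE.findall(html)))

if __name__ == "__main__":
    # Writing results to CSV mirrors the "copy into a spreadsheet" step above.
    emails = scrape_emails("https://example.com/contact")
    with open("contacts.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email"])
        writer.writerows([address] for address in emails)
```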

View the full Wikipedia page for Web scraping

Web crawler in the context of Focused crawler

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer. Chakrabarti et al. coined the term 'focused crawler' and used a text classifier to prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning to focus crawlers. Diligenti et al. traced the context graph leading up to relevant pages, and their text content, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. It has also been shown that spatial information is important for classifying Web documents.
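
The sketch below shows the frontier-prioritization idea in miniature: links are scored by how much their anchor text overlaps a hand-picked topic vocabulary and are popped from a priority queue best-first. The topic terms and stub "web" are arbitrary, `fetch` and `extract_links` are caller-supplied placeholders, and the scoring function is a crude stand-in for the trained classifiers described above.

```python
import heapq
from collections import namedtuple

Link = namedtuple("Link", "url anchor_text")

TOPIC_TERMS = {"solar", "photovoltaic", "renewable", "energy"}  # illustrative topic

def relevance(anchor_text):
    """Crude relevance predictor: fraction of anchor-text words that are topic terms."""
    words = anchor_text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0

def focused_crawl(seed_links, fetch, extract_links, max_pages=100):
    """Prioritize the frontier by predicted relevance (max-heap via negated score)."""
    frontier = [(-relevance(link.anchor_text), link.url) for link in seed_links]
    heapq.heapify(frontier)
    visited = set()
    collected = []
    while frontier and len(collected) < max_pages:
        score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)  # caller-supplied download function
        collected.append((url, -score))
        for link in extract_links(page):
            if link.url not in visited:
                heapq.heappush(frontier, (-relevance(link.anchor_text), link.url))
    return collected

if __name__ == "__main__":
    # Stub fetch/extract functions so the sketch runs without a network.
    web = {
        "seed": [Link("a", "solar energy basics"), Link("b", "celebrity gossip")],
        "a": [Link("c", "photovoltaic panels")],
        "b": [], "c": [],
    }
    print(focused_crawl([Link("seed", "renewable energy")],
                        fetch=lambda url: url,
                        extract_links=lambda page: web.get(page, []),
                        max_pages=3))
```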

View the full Wikipedia page for Focused crawler