Search for documents on a given topic on the Internet

Here we consider the task of collecting information about Internet resources related to a given topic. This problem arises in a variety of applied settings, for example when building thematic directories such as Yahoo! or dmoz.

A closely related task is the automatic collection of information about existing Internet resources when building indexes for general-purpose search engines such as AltaVista, Google, or Yandex.

To solve problems related to collecting information on existing Internet resources, so-called network robots (crawlers) are used: programs that, starting from some web page, recursively traverse Internet resources, extracting links to new resources from the documents they retrieve.
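
As an illustration, here is a minimal sketch of such a recursive traversal in Python, using only standard-library modules. The breadth-first order, the page limit, and the restriction to HTTP(S) links are simplifying assumptions for the example, not features of any particular robot.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href attributes of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100):
    """Breadth-first traversal starting from seed_url."""
    frontier = deque([seed_url])          # URLs known but not yet visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                      # unreachable resource, skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)             # resolve relative links
            if absolute.startswith(("http://", "https://")):
                frontier.append(absolute)
    return visited
```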

The key issue in building intelligent robots is the crawl strategy, that is, the criterion by which the robot chooses the next resource to visit from among the resources it already knows about but has not yet visited. Since visiting all Internet pages is impossible because of the enormous volume and rapid change of the information available on the Internet, the crawl strategy also determines which resources the robot will manage to visit (in finite time). It is natural to visit the most "useful" resources first; the "usefulness" of a resource is determined by the task for which the robot is built.

For example, a robot that collects resource information for a search engine is interested in finding as many different resources as possible. Such robots often use URL depth as an estimate of a resource's "usefulness", i.e. the number of intermediate directories in the URL between the Internet host name and the name of the resource itself. The greater the depth, the lower the importance of the corresponding resource. This approach makes it possible to quickly visit the start pages, and the pages close to them, on a large number of Internet sites.
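
A small sketch of this depth heuristic, assuming Python's standard urlparse; the exact counting convention (directories between the host and the final resource name) is an illustrative choice.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Number of directories between the host and the document itself."""
    path = urlparse(url).path
    # drop empty segments and the final file/resource name
    segments = [s for s in path.split("/") if s]
    return max(len(segments) - 1, 0)


assert url_depth("http://example.com/index.html") == 0
assert url_depth("http://example.com/docs/guides/intro.html") == 2
```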

Intuitively it seems obvious that a page referenced by many different Internet pages is more important than one with few incoming links, and that a link from a reputable source such as Yahoo! should be rated higher than a link from someone's personal page. These considerations are used, for example, in the Google network robot's algorithm, which makes it possible to maximize the number of highly cited resources found.
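
The following is a hedged sketch of that citation idea, a simplified PageRank-style iteration in which a page's score grows with the number and the rank of the pages linking to it; it is not the actual algorithm used by Google's robot.

```python
def link_rank(graph, damping=0.85, iterations=20):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                # a page cited by many highly ranked pages accumulates a high rank
                new_rank[target] = new_rank.get(target, 0.0) + share
        rank = new_rank
    return rank
```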

In the context of searching for documents on a given topic, the main goal of the corresponding network robot is to discover the maximum number of thematically relevant resources. Thus, the estimate of a resource's expected "utility" is an estimate of its expected proximity to the desired topic. To compute this estimate, the robot uses information about the thematic relevance of the pages it has already discovered.

We dealt with this task as applied to the problem of building thematic collections for the OASIS project. We therefore assumed that the final decision on the thematic relevance of a discovered resource is made by the client of our robot, i.e. the collection. However, to reduce the load on the client, the robot need not recommend every resource it discovers; instead it performs a preliminary, "rough" screening of obviously irrelevant documents.

Network Robot Architecture

Since the main subject of our research is the use of topic information to select a specialized crawl strategy and methods for screening out "garbage", we limit ourselves to a brief description of the basic architecture of the network robot, highlighting only the subsystems affected by this work.

A document from the Internet first enters the document collection subsystem (Harvester), which passes it to the document analysis subsystem (Document Analyzer), where the description (profile) of the document is built. Next, the document evaluation subsystem (Document Evaluator) calculates a "rough" estimate of the document's proximity to the client's topic. If this estimate exceeds a certain recommendation threshold, the document is recommended to the robot's client.
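
A minimal sketch of this pipeline follows. The class names mirror the subsystem names in the text, but their internals (a term-frequency profile, a simple overlap score, and a client object with a recommend method) are illustrative assumptions rather than the actual implementation.

```python
import re
from collections import Counter
from urllib.request import urlopen


class Harvester:
    """Document collection subsystem: downloads the raw page."""
    def fetch(self, url):
        return urlopen(url, timeout=10).read().decode("utf-8", "replace")


class DocumentAnalyzer:
    """Document analysis subsystem: builds the document profile."""
    def build_profile(self, raw_document):
        words = re.findall(r"\w+", raw_document.lower())
        return Counter(words)


class DocumentEvaluator:
    """Computes a rough estimate of proximity to the client's topic."""
    def __init__(self, topic_terms, threshold=0.1):
        self.topic_terms = set(topic_terms)
        self.threshold = threshold                     # recommendation threshold

    def score(self, profile):
        total = sum(profile.values()) or 1
        matched = sum(count for term, count in profile.items()
                      if term in self.topic_terms)
        return matched / total                         # share of on-topic terms


def process(url, harvester, analyzer, evaluator, client):
    profile = analyzer.build_profile(harvester.fetch(url))
    if evaluator.score(profile) > evaluator.threshold:
        client.recommend(url, profile)                 # final decision is the client's
```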

Note that the final decision on a document's relevance is made by the client; the network robot only performs a rough screening of obviously unsuitable documents. The client can asynchronously inform the robot of its own, more "accurate" estimates of a document's proximity to the topic of interest. The network robot uses this information to automatically refine the estimates it computes.

The order in which WWW documents are crawled is determined by the order of links in the queue of URLs to visit (URL Database), as well as by the need to follow the "ethics of network robots". New URLs enter the queue as a result of analyzing already visited documents.
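
A sketch of respecting this "ethics of network robots" before a URL is taken from the queue, using Python's standard robotparser module; the user-agent string and the politeness delay are illustrative assumptions.

```python
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}


def allowed_to_fetch(url, user_agent="oasis-robot"):
    """Check the site's robots.txt before visiting the URL."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if root not in _robots_cache:
        parser = RobotFileParser(urljoin(root, "/robots.txt"))
        try:
            parser.read()
            _robots_cache[root] = parser
        except OSError:
            _robots_cache[root] = None       # robots.txt unreachable: skip the host
    parser = _robots_cache[root]
    return parser is not None and parser.can_fetch(user_agent, url)


def polite_fetch(url, fetch, delay=1.0):
    """Fetch only allowed URLs and pause between requests."""
    if not allowed_to_fetch(url):
        return None
    time.sleep(delay)
    return fetch(url)
```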

Thematic proximity assessment

For each visited document, the network robot calculates a "rough" estimate of the document's proximity to the topic specified by the client. This estimate is then used for two purposes:

  • Refining the crawl strategy used.
  • Filtering out "garbage", that is, reducing the number of irrelevant documents recommended to the client.

The method used to calculate "rough" estimates is based on computing distances within the vector model of documents, which is widely used in various information retrieval tasks.
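
A minimal sketch of such a vector-model estimate: the document and the topic are represented as term-weight vectors and compared with the cosine measure. The weighting scheme here (raw term frequencies) is a simplifying assumption; real systems commonly use TF-IDF weights.

```python
import math
from collections import Counter


def cosine_similarity(doc_profile, topic_profile):
    """Cosine of the angle between two term-weight vectors given as dicts."""
    common = set(doc_profile) & set(topic_profile)
    dot = sum(doc_profile[t] * topic_profile[t] for t in common)
    norm_doc = math.sqrt(sum(w * w for w in doc_profile.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_profile.values()))
    if norm_doc == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_doc * norm_topic)


topic = Counter({"information": 3, "retrieval": 3, "search": 2})
document = Counter("search engines index documents for information retrieval".split())
rough_estimate = cosine_similarity(document, topic)   # value in [0, 1]
```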

Refining the thematic filter

During operation, the robot can automatically refine the filter it uses in order to improve the quality of the "rough" thematic estimates, taking into account additional information supplied by the client. This information consists of more "accurate" (from the client's point of view) estimates of the thematic proximity of the recommended documents.

Note that automatic modification of the filter may involve not only changing the weights of terms already used in the filter, but also adding new terms.
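
A hedged sketch of such refinement from client feedback. The update rule below is a Rocchio-style adjustment, chosen here only for illustration (the text does not name the exact method): terms of documents the client rated relevant are reinforced, terms of irrelevant ones are attenuated, and new terms from relevant documents may enter the filter.

```python
def refine_filter(topic_profile, doc_profile, client_score,
                  threshold=0.5, alpha=0.1):
    """Shift the filter's term weights toward (or away from) the document."""
    relevant = client_score >= threshold         # the client's "accurate" estimate
    sign = alpha if relevant else -alpha
    updated = dict(topic_profile)
    for term, weight in doc_profile.items():
        new_weight = updated.get(term, 0.0) + sign * weight
        if new_weight > 0:
            updated[term] = new_weight           # may add a new term to the filter
        else:
            updated.pop(term, None)              # drop terms that lose all weight
    return updated
```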

Internet crawl strategy

Most network robots cannot visit all the resources available on the Internet because of the limited hardware and network resources at the robot's disposal, so which resources are actually visited is determined by the crawl strategy used. Naturally, the robot should use a strategy that maximizes the overall "utility" of all the resources it visits.

Since in our case "utility", i.e. the thematic relevance of a resource, is ultimately determined by the robot's client, the main task of the crawl strategy is to choose an order of visiting the resources known to the robot such that the maximum number of documents relevant to the client's topic is discovered in the shortest time.
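
A sketch of one such best-first strategy, assuming the frontier of known but unvisited URLs is kept in a priority queue ordered by the expected utility (for example, the rough thematic estimate of the page on which each URL was found). heapq is a min-heap, so the score is negated.

```python
import heapq


class BestFirstFrontier:
    """Priority queue of known but not yet visited URLs."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, expected_utility):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-expected_utility, url))

    def next_url(self):
        """Return the URL with the highest expected utility, or None."""
        if self._heap:
            return heapq.heappop(self._heap)[1]
        return None
```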