  Automatic Generation of Thematically Focused Information Portals from Web Data

Sizov, S. (2005). Automatic Generation of Thematically Focused Information Portals from Web Data. PhD Thesis, Universität des Saarlandes, Saarbrücken. doi:10.22028/D291-23767.

Files

dissertation.pdf (Any fulltext), 58MB
 
File Permalink:
-
Name:
dissertation.pdf
Description:
-
OA-Status:
Visibility:
Restricted (Max Planck Institute for Informatics, MSIN)
MIME-Type / Checksum:
application/pdf
Technical Metadata:
Copyright Date:
-
Copyright Info:
-
License:
-

Locators

Description:
-
OA-Status:
Green
Locator:
http://scidok.sulb.uni-saarland.de/doku/lic_ohne_pod.php?la=de (Copyright transfer agreement)
Description:
-
OA-Status:
Not specified

Creators

 Creators:
Sizov, Sergej (1, 2), Author
Weikum, Gerhard (1), Advisor
Henrich, Andreas (3), Referee
Affiliations:
(1) Databases and Information Systems, MPI for Informatics, Max Planck Society, ou_24018
(2) International Max Planck Research School, MPI for Informatics, Max Planck Society, Campus E1 4, 66123 Saarbrücken, DE, ou_1116551
(3) External Organizations, ou_persistent22

Content

Free keywords: -
 Abstract: Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for the automatic generation of thematically focused portals from Web data. The key component of the proposed Web retrieval framework is a thematically focused Web crawler that targets only a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the Web sources most likely to be relevant for further downloads.

We show that the human effort needed to prepare a focused crawl can be minimized by automatically extending the training dataset with additional training samples, coined archetypes. This thesis introduces the combination of classification results and link-based authority ranking methods for selecting archetypes, together with periodic re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss the results of comprehensive use-case studies and evaluations with the prototype system BINGO!.

Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinement of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e., by leaving out uncertain documents rather than assigning them with low confidence to inappropriate topics or clusters. We also introduce a methodology of collaborative crawl postprocessing for multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network.

An important aspect of a thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles. Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This is achieved by weighting the edges in the link graph of the crawl and by adding virtual links between highly relevant documents of a topic.

The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.

Details

Language(s): eng - English
 Dates: 2006-04-14, 2005-12-19, 2005, 2005
 Publication Status: Issued
 Pages: -
 Publishing info: Saarbrücken : Universität des Saarlandes
 Table of Contents: -
 Rev. Type: -
 Identifiers: eDoc: 278867
Other: Local-ID: C1256DBF005F876D-E3553B8891977BBDC125710F004DC715-Sizov2005Diss
DOI: 10.22028/D291-23767
URN: urn:nbn:de:bsz:291-scidok-4899
Other: hdl:20.500.11880/23823
 Degree: PhD
