List of Papers on Link Analysis

  1. Finding Authorities and Hubs From Link Structures on the World Wide Web PDF
    Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas

    Abstract: Recently, there have been a number of algorithms proposed for analyzing hypertext link structure so as to determine the best "authorities" for a given topic or query. While such analysis is usually combined with content analysis, there is a sense in which some algorithms are deemed to be "more balanced" and others "more focused". We undertake a comparative study of hypertext link analysis algorithms. Guided by some experimental queries, we propose some formal criteria for evaluating and comparing link analysis algorithms.

  2. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank PS
    Matthew Richardson, Pedro Domingos

    Abstract: The PageRank algorithm, used in the Google search engine, greatly improves the results of Web search by taking into account the link structure of the Web. PageRank assigns to a page a score proportional to the number of times a random surfer would visit that page, if it surfed indefinitely from page to page, following all outlinks from a page with equal probability. We propose to improve PageRank by using a more intelligent surfer, one that is guided by a probabilistic model of the relevance of a page to a query. Efficient execution of our algorithm at query time is made possible by precomputing at crawl time (and thus once for all queries) the necessary terms. Experiments on two large subsets of the Web indicate that our algorithm significantly outperforms PageRank in the (human-rated) quality of the pages returned, while remaining efficient enough to be used in today's large search engines.
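
    As a reference point for this entry, a minimal sketch of the query-directed transition step: each outlink is followed in proportion to the target page's estimated relevance to the query rather than with equal probability. The `relevance` function is a placeholder assumption; the paper's actual relevance model and its crawl-time precomputation are not reproduced here.

```python
def query_weighted_transitions(outlinks, relevance, query):
    """Transition probabilities for a query-directed ('intelligent') surfer:
    each outlink is followed in proportion to the target page's estimated
    relevance to the query rather than with equal probability.
    `relevance(page, query)` is a placeholder scoring function."""
    transitions = {}
    for page, targets in outlinks.items():
        weights = {t: max(relevance(t, query), 0.0) for t in targets}
        total = sum(weights.values())
        if total == 0:
            # No relevant targets (or a dangling page): fall back to uniform clicks.
            transitions[page] = (
                {t: 1.0 / len(targets) for t in targets} if targets else {}
            )
        else:
            transitions[page] = {t: w / total for t, w in weights.items()}
    return transitions
```

    The resulting transition probabilities can then drive an ordinary PageRank-style power iteration (see the sketch under entry 23).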

  3. Stable Algorithms for Link Analysis PS
    Andrew Y. Ng, Alice X. Zheng, Michael I. Jordan

    The Kleinberg HITS and the Google PageRank algorithms are eigenvector methods for identifying "authoritative" or "influential" articles, given hyperlink or citation information. That such algorithms should give reliable or consistent answers is surely a desideratum, and in [10], we analyzed when they can be expected to give stable rankings under small perturbations to the linkage patterns. In this paper, we extend the analysis and show how it gives insight into ways of designing stable link analysis methods. This in turn motivates two new algorithms, whose performance we study empirically using citation data and web hyperlink data.

  4. Link Analysis, Eigenvectors and Stability PS
    Andrew Y. Ng, Alice X. Zheng, Michael I. Jordan

    Abstract: The HITS and the PageRank algorithms are eigenvector methods for identifying "authoritative" or "influential" articles, given hyperlink or citation information. That such algorithms should give consistent answers is surely a desideratum, and in this paper, we address the question of when they can be expected to give stable rankings under small perturbations to the hyperlink patterns. Using tools from matrix perturbation theory and Markov chain theory, we provide conditions under which these methods are stable, and give specific examples of instability when these conditions are violated. We also briefly describe a modification to HITS that improves its stability.
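
    The stability question can also be probed empirically: remove a few links, re-rank, and see how much the top of the ranking moves. The sketch below is an illustration of such an experiment, not the authors' perturbation analysis; `rank_fn` stands for any link-analysis ranking function (HITS, PageRank, or a variant).

```python
import random

def ranking_overlap(edges, rank_fn, k=10, removed=3, trials=20, seed=0):
    """Empirical stability probe: drop a few random links, re-rank, and
    measure how much of the original top-k survives (1.0 = identical).
    `rank_fn` maps a list of (source, target) links to pages ordered by
    decreasing score, e.g. a HITS or PageRank implementation."""
    rng = random.Random(seed)
    baseline = set(rank_fn(edges)[:k])
    overlaps = []
    for _ in range(trials):
        perturbed = list(edges)
        rng.shuffle(perturbed)
        del perturbed[:removed]          # remove `removed` random links
        overlaps.append(len(baseline & set(rank_fn(perturbed)[:k])) / k)
    return sum(overlaps) / len(overlaps)
```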

  5. The Shape of the Web and Its Implications for Searching the Web PS
    Kemal Efe, Vijay Raghavan, C. Henry Chu, Adrienne L. Broadwater, Levent Bolelli, Seyda Ertekin

    Abstract: With the rapid growth of the number of web pages, designing a search engine that can retrieve high quality information in response to a user query is a challenging task. Automated search engines that rely on keyword matching usually return too many low quality matches and they take a long time to run. It is argued in the literature that link-following search methods can substantially increase the search quality, provided that these methods use an accurate assumption about useful patterns in the hyperlink topology of the web. Recent work in the field has focused on detecting identifiable patterns in the web graph and exploiting this information to improve the performance of search algorithms. We survey relevant work in this area and comment on the implications of these patterns for other areas such as advertisement and marketing.

  6. Recognizing Nepotistic Links on the Web PS
    Brian D. Davison

    Abstract: The use of link analysis and page popularity in search engines has grown recently to improve query result rankings. Since the number of such links contributes to the value of the document in such calculations, we wish to recognize and eliminate nepotistic links --- links between pages that are present for reasons other than merit. This paper explores some of the issues surrounding the question of what links to keep, and we report high accuracy in initial experiments to show the potential for using a machine learning tool to automatically recognize such links.

  7. Average-Clicks: A New Measure of Distance on the World Wide Web Link
    Yutaka Matsuo, Yukio Ohsawa, and Mitsuru Ishizuka

    The pages and hyperlinks of the World Wide Web may be viewed as nodes and edges in a directed graph. In this paper, we propose a new definition of the distance between two pages, called average-clicks. It is based on the probability of clicking a link during random surfing. We compare the average-clicks measure to the classical measure of clicks between two pages, and show that average-clicks fits users' intuitions of distance better.
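
    A rough sketch of the average-clicks idea as described in the abstract: one click from a page with d outlinks costs log(d)/log(alpha), the negative log-probability (base alpha) of that click under uniform random surfing, and the distance between two pages is the cheapest path under this cost. The base alpha and the shortest-path formulation are illustrative assumptions, not necessarily the paper's exact definition.

```python
import heapq
from math import log

def average_clicks(outlinks, source, alpha=7.0):
    """Cheapest 'average-clicks' distance from `source` to every reachable page.
    Following one of d outlinks costs log(d)/log(alpha), i.e. the negative
    log-probability (base alpha) of that click under uniform random surfing,
    so small distances correspond to likely random-surfing paths."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, float("inf")):
            continue
        targets = outlinks.get(page, [])
        if not targets:
            continue
        step = log(len(targets)) / log(alpha)   # cost of one click from `page`
        for t in targets:
            if d + step < dist.get(t, float("inf")):
                dist[t] = d + step
                heapq.heappush(heap, (d + step, t))
    return dist
```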

  8. Discovering Seeds of New Interest Spread from Premature Pages Cited by Multiple Communities Link
    Naohiro Matsumura, Yukio Ohsawa and Mitsuru Ishizuka

    The World Wide Web is a great source of new topics significant for the birth and creation of trends. In this paper, we propose a method for discovering topics which stimulate communities of people into earnest communication about the topics' meaning, and grow into a trend of popular interest. The pages obtained are web pages which absorb the attention of people from multiple interest communities. Experiments with a small group of people show that topics in such pages can trigger the growth of people's interests beyond the bounds of existing communities.

  9. An Evolutionary Approach to Automatic Web Page Categorization and Updating Link
    Vincenzo Loia and Paolo Luongo

    Abstract. Catalogues play an important role in most of the current Web search engines. The catalogues, which organize documents into hierarchical collections, are maintained manually, increasing difficulty and costs due to the incessant growth of the WWW. This problem has stimulated many researchers to work on automatic categorization of Web documents. In reality, most of these approaches work well either on special types of documents or on restricted sets of documents. This paper presents an evolutionary approach for automatically constructing the catalogue as well as performing the classification of a Web document. This functionality relies on a genetic-based fuzzy clustering methodology that applies clustering to the context of the document, as opposed to content-based clustering, which works on the complete document information.

  10. Automatic Web-Page Classification by Using Machine Learning Methods Link
    Makoto Tsukada, Takashi Washio, and Hiroshi Motoda

    Abstract. This paper describes automatic Web-page classification by using machine learning methods. Recently, the importance of portal site services, including search engine functions on the World Wide Web, has been increasing. In particular, portal sites such as Yahoo!, which hierarchically classify Web pages into many categories, are becoming popular. However, the classification of Web pages into categories relies on manpower, which costs much time and care. To alleviate this problem, we propose techniques to generate attributes by using co-occurrence analysis and to classify Web pages automatically based on machine learning. We apply these techniques to Web pages on Yahoo! JAPAN and construct decision trees, which determine the appropriate category for each Web page. The performance of this proposed method is evaluated in terms of error rate, recall, and precision. The experimental evaluation demonstrates that this method provides acceptable accuracy for the classification of Web pages into top-level categories on Yahoo! JAPAN.


  11. A Theory and Approach to Improving Relevance Ranking in Web Retrieval Link
    Z.W. Wang and R.B. Maguire

    Many search engines employ relevance ranking of web pages, for which they use a simple form of similarity-based matching that has been studied in the vector space model. The paper discusses the problems with true similarity-based matching and presents a new model and approach to deal with this problem. The new method has a sound theoretical basis and offers guidelines for designing new types of Web search systems.

  12. Conceptual Information Extraction with Link-Based Search Link
    Kyung-Joong Kim and Sung-Bae Cho

    Abstract. Link-based search provides a new vehicle for finding relevant web documents on the WWW. Recently, there has been quite a bit of optimism that the use of link information can improve search quality, as in Google. Usually, a text-based search engine returns web sites that simply have the highest frequency of the user's query terms, so the results might differ from the user's expectations. A hypertextual search engine, however, finds the most authoritative sites. The proposed search engine consists of crawling, storing of link structure, ranking, and personalization processes. A user profile encodes different relevances among concepts for each user. For conceptual information extraction from the link-based search engine, a fuzzy concept network is adopted. The fuzzy concept network can be personalized using the profile information and used to conduct fuzzy information retrieval for each user. By combining personal fuzzy information retrieval and link-based search, the proposed search agent provides high-quality information on the WWW for the user's query. To show the effectiveness of the proposed search engine, a subjective test with five persons is conducted and the results are summarized. The results show the usefulness of the proposed system and the possibility of personalized conceptual information extraction.

  13. Mining Crawled Data and Visualizing Discovered Knowledge Link
    V. Dubois, M. Quafafou, and B. Habegger

    Abstract. This paper presents a challenging project which aims to extend the current features of search and browsing engines. Different methods are integrated to meet the following requirements: (1) Integration of incremental and focused dynamic crawling with meta-search; (2) Free the user from sifting through the long list of documents returned by the search engines; (3) Extract comprehensive patterns and useful knowledge from the documents; (4) Visual-based support to browse dynamic document collections. Finally, a new paradigm is proposed combining the mining and the visualization methods used for search and exploration.

  14. Authoritative Sources in a Hyperlinked Environment Link
    Jon M. Kleinberg

    Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
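
    The mutually reinforcing hub/authority relationship described here has a compact iterative form. The sketch below computes authority and hub scores from a list of hyperlinks; the graph representation and normalization details are choices made for illustration.

```python
def hits(edges, iterations=50):
    """HITS-style iteration over a list of (source, target) hyperlinks:
    a page's authority score is the sum of the hub scores of the pages
    linking to it, and its hub score is the sum of the authority scores
    of the pages it links to, with scores renormalized each round."""
    pages = {p for edge in edges for p in edge}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = new_auth, new_hub
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```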

  15. Mining the Web's Link Structure PDF
    Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg

    The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: Taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text-document collections. This level of complexity makes an "off-the-shelf" database-management and information-retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones, those of most value to the user?

  16. The Web as a graph Link
    Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew S. Tomkins, Eli Upfal

    Abstract: The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons---mathematical, sociological, and commercial---for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models.

  17. The Web as a graph: measurements, models, and methods Link
    Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew S. Tomkins

    The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph is a fascinating object of study: it has several hundred million nodes today, over a billion links, and appears to grow exponentially with time. There are many reasons --- mathematical, sociological, and commercial --- for studying the evolution of this graph. In this paper we begin by describing two algorithms that operate on the Web graph, addressing problems from Web search and automatic community discovery. We then report a number of measurements and properties of this graph that manifested themselves as we ran these algorithms on the Web. Finally, we observe that traditional random graph models do not explain these observations, and we propose a new family of random graph models. These models point to a rich new sub-field of the study of random graphs, and raise questions about the analysis of graph algorithms on the Web.

  18. Link Analysis in Web Information Retrieval. PS
    Monika R. Henzinger

    Abstract: The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.

  19. The Connectivity Server: fast access to linkage information on the Web Link
    Krishna Bharat, Andrei Broder, Monika R. Henzinger, Puneet Kumar, and Suresh Venkatasubramanian

    We have built a server that provides linkage information for all pages indexed by the AltaVista search engine. In its basic operation, the server accepts a query consisting of a set L of one or more URLs and returns a list of all pages that point to pages in L (predecessors) and a list of all pages that are pointed to from pages in L (successors). More generally, the server can produce the entire neighbourhood (in the graph theory sense) of L up to a given distance and can include information about all links that exist among pages in the neighbourhood. Although some of this information can be retrieved directly from AltaVista or other search engines, these engines are not optimized for this purpose and the process of constructing the neighbourhood of a given set of pages is slow and laborious. In contrast, our prototype server needs less than 0.1 ms per result URL. So far we have built two applications that use the Connectivity Server: a direct interface that permits fast navigation of the Web via the predecessor/successor relation, and a visualization tool for the neighbourhood of a given set of pages. We envisage numerous other applications such as ranking, visualization, and classification.

  20. Hypersearching the Web. Link
    Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, Jon M. Kleinberg, and David Gibson

    With the volume of on-line information in cyberspace growing at a breakneck pace, more effective search tools are desperately needed. A new technique, developed by members of the Clever Project, analyzes how Web pages are linked together.

  21. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction PS
    Soumen Chakrabarti

    Topic distillation is the process of finding authoritative Web pages and comprehensive "hubs" which reciprocally endorse each other and are relevant to a given query. Hyperlink-based topic distillation has been traditionally applied to a macroscopic Web model where documents are nodes in a directed graph and hyperlinks are edges. Macroscopic models miss valuable clues such as banners, navigation panels, and template-based inclusions, which are embedded in HTML pages using markup tags. Consequently, results of macroscopic distillation algorithms have been deteriorating in quality as Web pages are becoming more complex. We propose a uniform fine-grained model for the Web in which pages are represented by their tag trees (also called their Document Object Models or DOMs) and these DOM trees are interconnected by ordinary hyperlinks. Surprisingly, macroscopic distillation algorithms do not work in the fine-grained scenario. We present a new algorithm suitable for the fine-grained model. It can dis-aggregate hubs into coherent regions by segmenting their DOM trees. Mutual endorsement between hubs and authorities involves these regions, rather than single nodes representing complete hubs. Anecdotes and measurements using a 28-query, 366,000-document benchmark suite, used in earlier topic distillation research, reveal two benefits from the new algorithm: distillation quality improves, and a by-product of distillation is the ability to extract relevant snippets from hubs which are only partially relevant to the query. Keywords: Topic distillation, Document Object Model, segmentation, Minimum Description Length principle.

  22. The Anatomy of a Large-Scale Hypertextual Web Search Engine Link
    Sergey Brin, Lawrence Page

    In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

  23. The PageRank Citation Ranking: Bringing Order to the Web Link
    Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd

    Abstract: The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
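
    The random-surfer model described in this abstract leads to a simple power-iteration computation. The sketch below is illustrative; the damping factor, dangling-page handling, and convergence test are conventional choices rather than details taken from the paper.

```python
def pagerank(out_links, damping=0.85, tol=1e-8, max_iter=100):
    """Power iteration for a PageRank-style score: with probability `damping`
    the random surfer follows a random outlink, otherwise it jumps to a page
    chosen uniformly at random; dangling pages spread their score uniformly."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank
```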

  24. Efficient Computation of PageRank Link
    Taher Haveliwala

    This paper discusses efficient techniques for computing PageRank, a ranking metric for hypertext documents. We show that PageRank can be computed for very large subgraphs of the web (up to hundreds of millions of nodes) on machines with limited main memory. Running-time measurements on various memory configurations are presented for PageRank computation over the 24-million-page Stanford WebBase archive. We discuss several methods for analyzing the convergence of PageRank based on the induced ordering of the pages. We present convergence results helpful for determining the number of iterations necessary to achieve a useful PageRank assignment, both in the absence and presence of search queries.
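
    One way to monitor convergence "based on the induced ordering of the pages", as the abstract puts it, is to measure how consistently two successive score vectors order the top pages. The pairwise-agreement measure below is an illustration, not necessarily one of the measures used in the paper.

```python
from itertools import combinations

def ordering_agreement(scores_a, scores_b, k=100):
    """Fraction of page pairs, among the top-k pages of `scores_a`, that the
    two score assignments order the same way (assumes both assignments cover
    the same pages). Values near 1.0 suggest the induced ranking has
    effectively converged even if the numeric scores are still changing."""
    top = sorted(scores_a, key=scores_a.get, reverse=True)[:k]
    pairs = list(combinations(top, 2))
    if not pairs:
        return 1.0
    agree = sum(
        1 for p, q in pairs
        if (scores_a[p] - scores_a[q]) * (scores_b[p] - scores_b[q]) > 0
    )
    return agree / len(pairs)
```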

  25. Trawling the web for emerging cyber-communities Link
    Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins

    Abstract: The web harbors a large number of communities -- groups of content-creators sharing a common interest -- each of which manifests itself as a set of interlinked web pages. Newsgroups and commercial web directories together contain on the order of 20,000 such communities; our particular interest here is in emerging communities -- those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms, the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment. Keywords: web mining, communities, trawling, link analysis

  26. Silk from a Sow's Ear: Extracting Usable Structures from the Web Link
    Peter Pirolli, James Pitkow, Ramana Rao

    In its current implementation, the World-Wide Web lacks much of the explicit structure and strong typing found in many closed hypertext systems. While this property has directly fueled the explosive acceptance of the Web, it further complicates the already difficult problem of identifying usable structures and aggregates in large hypertext collections. These reduced structures, or localities, form the basis for simplifying visualizations of and navigation through complex hypertext systems. Much of the previous research into identifying aggregates utilizes graph-theoretic algorithms based upon structural topology, i.e., the linkages between items. Other research has focused on content analysis to form document collections. This paper presents our exploration into techniques that harness both the topology and textual similarity between items as well as integrate new analyses based upon actual usage of Xerox's WWW space. Linear equations and spreading activation models are employed to arrange Web pages based upon functional categories, node types, and relevancy. Keywords: Information Visualization, World Wide Web, Hypertext.

  27. Inferring Web Communities from Link Topology Link
    David Gibson, Jon Kleinberg, et al.

    The World Wide Web grows through a decentralized, almost anarchic process, and this has resulted in a large hyperlinked corpus without the kind of logical organization that can be built into more traditionally-created hypermedia. To extract meaningful structure under such circumstances, we develop a notion of hyperlinked communities on the WWW through an analysis of the link topology. By invoking a simple, mathematically clean method for defining and exposing the structure of these communities, we are able to derive a number of themes: the communities can be viewed as containing a core of central, "authoritative" pages linked together by "hub pages"; and they exhibit a natural type of hierarchical topic generalization that can be inferred directly from the pattern of linkage. Our investigation shows that although the process by which users of the Web create pages and links is very difficult to understand at a "local" level, it results in a much greater degree of orderly high-level structure than has typically been assumed.

  28. Efficient Identification of Web Communities Link
    Gary William Flake, Steve Lawrence, C. Lee Giles

    We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
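
    A minimal sketch of the max-flow / min-cut framing described above, using networkx: known members attach to a virtual source, every other page attaches to a virtual sink, and the source side of the minimum cut is reported as the community. The unit capacities and bidirectional link edges are simplifications for illustration, not the authors' exact construction.

```python
import networkx as nx

def community_from_seeds(edges, seeds):
    """Approximate community identification in a max-flow / min-cut framework:
    seed members are attached to a virtual source, every other page to a
    virtual sink, and the source side of the minimum cut is returned."""
    g = nx.DiGraph()
    for u, v in edges:                        # treat hyperlinks as bidirectional
        g.add_edge(u, v, capacity=1)
        g.add_edge(v, u, capacity=1)
    source, sink = "_SOURCE_", "_SINK_"
    for s in seeds:
        g.add_edge(source, s, capacity=float("inf"))
    for node in list(g.nodes):
        if node not in seeds and node not in (source, sink):
            g.add_edge(node, sink, capacity=1)
    _, (source_side, _) = nx.minimum_cut(g, source, sink)
    return source_side - {source}
```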

  29. Searching the Web Link
    Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan

    We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. We draw for this presentation from the literature, and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts, and the results of several performance analyses we conducted to compare different designs.

  30. Finding Related Pages in the World Wide Web Link
    Jeffrey Dean, Monika R. Henzinger

    When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related web pages. A related web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers. We describe two algorithms to identify related web pages. These algorithms use only the connectivity information in the web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's "What's Related" service [12]. Our study showed that the precision at 10 for our two algorithms is 73% better and 51% better, respectively, than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.
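
    One connectivity-only route to "related pages" is co-citation: pages that are frequently linked to by the same pages as the input URL are likely related. The sketch below captures only that basic idea, under an assumed `in_links` map from each page to the set of pages linking to it; it is not the paper's exact algorithm.

```python
from collections import Counter

def related_by_cocitation(in_links, url, max_results=10):
    """Rank candidate pages by co-citation with `url`: the number of distinct
    pages that link to both the candidate and `url`. Only connectivity
    information (who links to whom) is used, never page content."""
    # Invert in_links (page -> pages linking to it) into citer -> cited pages.
    forward = {}
    for page, sources in in_links.items():
        for s in sources:
            forward.setdefault(s, set()).add(page)
    counts = Counter()
    for citer in in_links.get(url, ()):
        for candidate in forward.get(citer, ()):
            if candidate != url:
                counts[candidate] += 1
    return [page for page, _ in counts.most_common(max_results)]
```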

  31. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering Link
    Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Nemprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford

    HyPursuit is a new hierarchical network search engine that clusters hypertext documents to structure a given information space for browsing and search activities. Our content-link clustering algorithm is based on the semantic information embedded in hyperlink structures and document contents. HyPursuit admits multiple, coexisting cluster hierarchies based on different principles for grouping documents, such as the Library of Congress catalog scheme and automatically created hypertext clusters. HyPursuit's abstraction functions summarize cluster contents to support scalable query processing. The abstraction functions satisfy system resource limitations with controlled information loss. The result of query processing operations on a cluster summary approximates the result of performing the operations on the entire information space. We constructed a prototype system comprising 100 leaf World Wide Web sites and a hierarchy of 42 servers that route queries to the leaf sites. Experience with our system suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies. We are also encouraged by preliminary results on clustering based on both document contents and hyperlink structures.

  32. Enhanced Hypertext Categorization Using Hyperlinks (1998) Link
    Soumen Chakrabarti, Byron Dom, and Piotr Indyk

    Abstract: A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.

  33. Hypertext Categorization using Hyperlink Patterns and Meta Data Link
    Rayid Ghani, Sean Slattery, Yiming Yang

    Hypertext poses new text classification research challenges as hyperlinks, content of linked documents, and meta data about related web sites all provide richer sources of information for hypertext classification that are not available in traditional text classification. We investigate the use of such information for representing web sites, and the effectiveness of different classifiers (Naive Bayes, Nearest Neighbor, and Foil) in exploiting those representations. We find that using words in web pages alone often yields suboptimal performance of classifiers, compared to exploiting additional sources of information beyond document content. On the other hand, we also observe that linked pages can be more harmful than helpful when the linked neighborhoods are highly "noisy" and that links have to be used in a careful manner. More importantly, our investigation suggests that meta data which is often available, or can be acquired using information extraction techniques, can be extremely useful for improving classification accuracy. Finally, the relative performance of the different classifiers being tested gives us insights into the strengths and limitations of our algorithms for hypertext classification.

  34. Probabilistic Models of Text and Link Structure for Hypertext Classification Link
    Lise Getoor, Eran Segal, Ben Taskar, Daphne Koller

    Most text classification methods treat each document as an independent instance. However, in many text domains, documents are linked and the topics of linked documents are correlated. For example, web pages of related topics are often connected by hyperlinks and scientific papers from related fields are commonly linked by citations. We propose a unified probabilistic model for both the textual content and the link structure of a document collection. Our model is based on the recently introduced framework of Probabilistic Relational Models (PRMs), which allows us to capture correlations between linked documents. We show how to learn these models from data and use them efficiently. Since exact methods for classification in these large models are intractable, we utilize belief propagation, an approximate inference algorithm. Belief propagation automatically induces a very natural behavior, where our knowledge about one document helps us classify related ones, which in turn help us classify others. We present preliminary empirical results on a dataset of university web pages.

  35. Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents Link
    Brian Amento, Loren Terveen, and Will Hill

    For many topics, the World Wide Web contains hundreds or thousands of relevant documents of widely varying quality. Users face a daunting challenge in identifying a small subset of documents worthy of their attention. Link analysis algorithms have received much interest recently, in large part for their potential to identify high quality items. We report here on an experimental evaluation of this potential. We evaluated a number of link and content-based algorithms using a dataset of web documents rated for quality by human topic experts. Link-based metrics did a good job of picking out high-quality items. Precision at 5 is about 0.75, and precision at 10 is about 0.55; this is in a dataset where 0.32 of all documents were of high quality. Surprisingly, a simple content-based metric performed nearly as well; ranking documents by the total number of pages on their containing site

  36. Learning to Probabilistically Identify Authoritative Documents Link
    David Cohn, Huan Chang

    We describe a model of document citation that learns to identify hubs and authorities in a set of linked documents, such as pages retrieved from the world wide web, or papers retrieved from a research paper archive. Unlike the popular HITS algorithm, which relies on dubious statistical assumptions, our model provides probabilistic estimates that have clear semantics. We also find that in general, the identified authoritative documents correspond better to human intuition.

  37. Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text (1998) Link
    S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan

    We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. The goal of ARC is to compile resource lists similar to those provided by Yahoo! or Infoseek. The fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. We describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine.

  38. Web Search Via Hub Synthesis Link
    Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McSherry

    We present a model for web search that captures in a unified manner three critical components of the problem: how the link structure of the web is generated, how the content of a web document is generated, and how a human searcher generates a query. The key to this unification lies in capturing the correlations between these components in terms of proximity in a shared latent semantic space. Given such a combined model, the correct answer to a search query is well defined, and thus it becomes possible to evaluate web search algorithms rigorously. We present a new web search algorithm, based on spectral techniques, and prove that it is guaranteed to produce an approximately correct answer in our model. The algorithm assumes no knowledge of the model, and is well-defined regardless of the model's accuracy.

  39. Friends and Neighbors on the Web Link
    Lada Adamic, Eytan Adar

    Abstract: The Internet has become a rich and large repository of information about individuals. From the links and text on a user's homepage to the mailing lists the user subscribes to, these are all reflections of the social interactions a user has in the real world. We devise techniques to mine this information in order to predict relationships between individuals. Further, we show that some pieces of information are better indicators of social connections than others. The high quality information we discover provides a glimpse into the social life of two communities and has potential applications in automatically inferring real-world connections and discovering and labeling communities.

  40. What is this Page Known for? Computing Web Page Reputations Link
    Davood Rafiei, Alberto O. Mendelzon

    Abstract: The textual content of the Web enriched with the hyperlink structure surrounding it can be a useful source of information for querying and searching. This paper presents a search process where the input is the URL of a page, and the output is a ranked set of topics on which the page has a reputation. For example, if the input is www.gamelan.com, then a possible output is "Java." We propose several algorithmic formulations of the notion of reputation using simple random walk models of Web browsing behaviour. We give preliminary test results on the effectiveness of these algorithms.

  41. Mining Web Logs to Improve Website Organization Link
    Ramakrishnan Srikant, Yinghui Yang

    Many websites have a hierarchical organization of content. This organization may be quite different from the organization expected by visitors to the website. In particular, it is often unclear where a specific document is located. In this paper, we propose an algorithm to automatically find pages in a website whose location is different from where visitors expect to find them. The key insight is that visitors will backtrack if they do not find the information where they expect it: the point from where they backtrack is the expected location for the page. We present an algorithm for discovering such expected locations that can handle page caching by the browser. Expected locations with a significant number of hits are then presented to the website administrator. We also present algorithms for selecting expected locations (for adding navigation links) to optimize the benefit to the website or the visitor. We ran our algorithm on the Wharton business school website and found that even on this small website, there were many pages with expected locations different from their actual location.
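
    A heavily simplified sketch of the backtrack idea in this abstract: when a visitor goes A -> B -> A and later reaches a target page, B is recorded as an expected location for that target. The paper's algorithm additionally accounts for browser caching and longer detours; the session format and crediting rule here are assumptions.

```python
from collections import Counter

def expected_locations(sessions):
    """Count (target_page, expected_location) pairs from click sessions.
    Each session is an ordered list of page views ending at the page the
    visitor finally found. A pattern A -> B -> A suggests the visitor
    expected the target under B, so B is credited as an expected location
    for the session's final page."""
    counts = Counter()
    for session in sessions:
        target = session[-1]
        for i in range(len(session) - 2):
            a, b, a2 = session[i:i + 3]
            if a == a2 and b not in (a, target):
                counts[(target, b)] += 1
    return counts

# Hypothetical example: a visitor tries /products, backtracks, then finds the
# page under /support, so /products is credited as an expected location for it.
# sessions = [["/", "/products", "/", "/support", "/support/faq"]]
# expected_locations(sessions) -> Counter({("/support/faq", "/products"): 1})
```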

  42. PicASHOW: Pictorial Authority Search by Hyperlinks on the Web Link
    Ronny Lempel, Aya Soffer

    We describe PicASHOW, a fully automated WWW image retrieval system that is based on several link-structure analyzing algorithms. Our basic premise is that a page p displays (or links to) an image when the author of p considers the image to be of value to the viewers of the page. We thus extend some well known link-based WWW page retrieval schemes to the context of image retrieval.
    PicASHOW's analysis of the link structure enables it to retrieve relevant images even when they are stored in files with meaningless names. The same analysis also allows it to identify image containers and image hubs. We define these as Web pages that are rich in relevant images, or from which many images are readily accessible. PicASHOW requires no image analysis whatsoever and no creation of taxonomies for pre-classification of the Web's images. It can be implemented by standard WWW search engines with reasonable overhead, in terms of both computations and storage, and with no change to user query formats. It can thus be used to easily add image retrieving capabilities to standard search engines. Our results demonstrate that PicASHOW, while relying almost exclusively on link analysis, compares well with dedicated WWW image retrieval systems. We conclude that link analysis, a bona-fide effective technique for Web page search, can improve the performance of Web image retrieval, as well as extend its definition to include the retrieval of image hubs and containers.

  43. Geospatial Mapping and Navigation of the Web (2001) Link
    Kevin S. McCurley

    Abstract: Web pages may be organized, indexed, searched, and navigated along several different feature dimensions. We investigate different approaches to discovering geographic context for web pages, and describe a navigational tool for browsing web resources by geographic proximity.

  44. Retrieving and Organizing Web Pages by "Information Unit" Link
    Wen-Syan Li, K. Selçuk Candan, Quoc Vu, Divyakant Agrawal

    Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.

  45. Conceptual Linking: Ontology-based Open Hypermedia Link
    Leslie Carr, Sean Bechhofer, Carole Goble, Wendy Hall

    This paper describes the attempts of the COHSE project to define and deploy a Conceptual Open Hypermedia Service, consisting of components that are integrated to form a conceptual hypermedia system. The aim is to enable documents to be linked via metadata describing their contents, and hence to improve the consistency and breadth of linking of WWW documents at retrieval time (as readers browse the documents) and at authoring time (as authors create the documents).

  46. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity Link
    David Cohn and Thomas Hofmann

    We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.

  47. Creating Customized Authority Lists Link
    Huan Chang, David Cohn and Andrew McCallum

    The proliferation of hypertext and the popularity of Kleinberg's HITS algorithm have brought about an increased interest in link analysis. While HITS and its older relatives from the Bibliometrics literature provide a method for finding authoritative sources on a particular topic, they do not allow individual users to inject their own opinions about what sources are authoritative. This paper presents a learning technique which incorporates user feedback to adjust its model of document authority so that it better corresponds to the user's internal model. We demonstrate how this technique can be used to generate user-specific authority lists. We present experimental results based on a database of about one million references collected as part of the Cora on-line index of the computer science literature.

  48. Learning to Understand the Web Link
    William Cohen

    Abstract In a traditional information retrieval system, it is assumed that queries can be posed about any topic. In reality, a large fraction of web queries are posed about a relatively small number of topics, like products, entertainment, current events, and so on. One way of exploiting this sort of regularity in web search is to build, from the information found on the web, comprehensive databases about specific topics. An appropriate interface to such a database can support complex structured queries which are impossible to answer with traditional topic-independent query methods. Here we discuss three case studies for this "data-centric" approach to web search. A common theme in this discussion is the need for very robust methods for finding relevant information, extracting data from pages, and integrating information taken from multiple sources, and the importance of statistical learning methods as a tool for creating such robust methods.

  49. Context in Web Search Link
    Steve Lawrence

    Abstract Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. Greater use of context in web search may help increase competition and diversity on the web.

  50. Searching for Needles in a World of Haystacks Link
    Jamie Callan

    Abstract Current Web search engines are based on a single database model of information retrieval. This paper argues that next generation Web search will be based on a multi-database model of information retrieval. The strengths, weaknesses, open research problems, and prospects of several multi-database retrieval models are discussed briefly.

  51. Next Generation Web Search: Setting Our Sites Link
    Marti Hearst

    Abstract The current state of web search is most successful at directing users to appropriate web sites. Once at the site, the user has a choice of following hyperlinks or using site search, but the latter is notoriously problematic. One solution is to develop specialized search interfaces that explicitly support the types of tasks users perform using the information specific to the site. A new way to support task-based site search is to dynamically present appropriate metadata that organizes the search results and suggests what to look at next, as a personalized intermixing of search and hypertext.

  52. Web Dynamics Link

    The global usage and continuing exponential growth of the World-Wide-Web poses a host of challenges to the research community. In particular, there is an urgent need to understand and manage the dynamics of the Web, in order to develop new techniques which will make the Web tractable. We provide an overview of recent statistics relating to the size of the Web graph and its growth. We then briefly review some of the key areas relating to Web dynamics with reference to the recent literature. Finally, we summarise the talks given in a recent workshop devoted to Web dynamics, which was held at the beginning of January 2001 at the University of London. Keywords: Web dynamics, Web graph, information retrieval, collaborative filtering, Web navigation, Website design, data-intensive Web applications, workflow management, e-commerce, mobile computation.

  53. Improved Algorithms for Topic Distillation in a Hyperlinked Environment PDF
    Krishna Bharat and Monika R. Henzinger

    This paper addresses the problem of topic distillation on the World Wide Web: given a typical user query, find quality documents related to the query topic. Connectivity analysis has been shown to be useful in identifying high quality pages within a topic-specific graph of hyperlinked documents. The essence of our approach is to augment a previous connectivity-analysis-based algorithm with content analysis. We identify three problems with the existing approach and devise algorithms to tackle them. The results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis.