Web Information Paper Review (2)

1. Jon Kleinberg:Authoritative sources in a hyperlinked environment, JACM 1999

Links: 链接分析算法之:HITS算法

In this paper, the author states a new mechanism of searching algorithm which later was named as HIST Algorithm. HIST can mainly solve the problem that when user query some broad topics, the return result is too much and most of them is not relevant enough. In comparison with the textual based searching mechanism, HIST is considered as a textual and link analysis based searching algorithm.

The paper first introduced that there are three kind of query form: Specific queries, Broad-topic queries and Similar-page queries. The issue in Specific queries is that the searching is too specific which will lead to a small number of searching result. This can add difficulty to the search engine. Broad-topic queries has the issue that there are too many result which lead to a low efficiency and quality of the search. And also the relevance sometimes is not match with the popularity. So if the search engine is only based on the textual search, might give user a lot of useless results. The paper propose two new definition of the web pages: Hub and Authority. Authority web pages means pages contain the keywords users are searching and also has high relevance and high reputation. For example, when searching “Honda”, Honda’s official web page should be the top authority. A Hub web page are the pages contain links point to multiple relevant authorities. Such as Google can be consider as an Hub page. A good hub is a page that point to many good authority and a good authority is page that is pointed by many good hub pages. The HIST algorithm is based on such theory.

The first step of the HIST is to get a potential based web page set to apply the HIST and this set is much smaller than the original base web set. HIST extract such subgraph by using such method: (1) First use the traditional textual search to get a certain number of search result as the root collection. The root collection has less web pages and most of them are strong authorities. (2) Then base on this root set, HIST add both the web pages point to the element on the set and the web pages pointed by the elements in the set. (3) Finally filter out the intrinsic web pages and leave the transverse pages. Use this method, we can create our smaller searching base collection which is better for HIST to do large computation of the link based analysis.

Second step is to recalculate the Authority weight and the hub weight. The Authority weight is the sum of all the hub weight of the page point to the Authority page. The Hub weight is the sum of all the authority of the page the hub page point to. So if a hub point to many authority hubs, it should has a higher value of the weight. Also same for the authority weight. Then normalize the hub and authority weight. Do this to every pages in the subgraph and finally, the overall weight should be converge to a certain value. Then the pages in subgraph has been assigned with the evaluation value. Eventually, return user with the pages in the rank of the authority weight. This method actually can drag the whole subgraph to a two point directed graph. The left is the hub and the right is the authority. HIST can classify the hub and authority. This algorithm also has been proved in the paper. HIST also can apply to the third kind of search: similar web page search. It can return better result in comparison with textual search.

Also, HIST is a classifier algorithm. It can also be used in any other realm which can be simplified as directed graph. The author also introduced several area such as social network and publication citations.

This paper let me learn the link based search algorithm. This is more subjective in comparison with the objective textual search. This paper also let me have a better understanding of the web page: every thing in the web can be con information. The search can based on actual text and even the relationship between each other.


2. Scott Deerwester et al.:Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6, 1990

Links:  Latent semantic analysis note奇异值分解

This paper presents a new mechanism called Latent Semantic Analysis to study the relationship between documents and apply it to the query of the web information. Unlike the traditional analysis of the text information, LSA study the potential relation between indexing words across all the documents. This method can return more reliable search result when user query some text with less overlap between the original indexing words and also reduce the noise.

The paper first states the drawbacks of the traditional retrieval method. The major issue here is that the synonymy and polysemy are exist. Sometime the superficial meaning understand by the computer is different from what actually human understand. So the latent semantic relationship will be more reliable and important when help computer understand the human’s text. The current search engine haven’t indexed all the words in the reality world and it is bad at handling synonymy and polysemy in a big collection of documents. The paper show an example to prove that the current search mechanism has a lot of mis-matching. The aim of the LSA is to understand what latent keywords of what user is searching about and the exclude the word match but meaning not match cases.

The LSA use the two mode factor analysis as the representation model for both document and the query. Two mode factor analysis meets the requirement of: (1) The representational power is controllable (2) Have an explicit representation of terms and document (3) computable for large database. The later latent analysis is based on the Singular-value decomposition (SVD). SVD is a method to decompose an non-square matrix into three sub-matrixes. By using this, we can get the eigenvalues of the MxN matrix in the diagonal sub-matrix in the middle. Eigenvalues represent the feature of the original matrix and LSA is based on the change and analysis the eigenvalue of the original representational matrix.

The procedure of the LSA is: (1) construct the term-document matrix based on the document. The term should be the frequently used terms and the collect the frequency of each term appear in each documents. In this way, we get our document matrix. We can notice that there are some synonymy in our term collection. Such as “Computer” is some kind of related to the term “system” and “human” is related to “user”. We can also notice that these pair of synonymies follow the assumption of LSA that: If a pair of terms both show up or not show up simultaneously in a collection of documents, these two term are considered latent related. “Computer” and “System” follow this pattern in 6 of 9 document in this example. Also we can determine a quantization of the similarity between two terms by dot product their row vectors. (2) Then we should apply SVD to the big matrix we get and remove the k smallest eigenvalues in the middle sub-matrix and reduce the size of the other two. By reducing the number of eigenvalues, we reduce the noise which is the interference in the base matrix. Then we reconstruct the new term-document matrix by multiply the new sub-matrix back. In the new based matrix, the latent relationship like “Human” and “user” has been enhanced. In the column vector of C2, we can see that although C2 doesn’t contain “Human” but it still has weight value of 0.40 of containing the “Human” since this document contain “user”. By now we get the latent semantic matrix and we can apply user’s query on this matrix. (3) Then we get the query from the user and convert the query string into the representation form with the indexing terms as the term-document matrix. We convert it by multiply the statistic of the frequency of terms in the query with the first two sub-matrixes we get from the SVD of the original term-document matrix. (4) Finally, we compute the converted query vector’s angle with the each column vector (each document) in the new term-document matrix and return the N smallest angle document as searching results.

From the paper I learn the latent semantic analysis which can tell the potential relationship between documents. This can help the search engine has a better and user-friendly searching results. This help me has a deeper understanding of the text analysis.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,658评论 6 496
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,482评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,213评论 0 350
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,395评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,487评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,523评论 1 293
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,525评论 3 414
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,300评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,753评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,048评论 2 330
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,223评论 1 343
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,905评论 5 338
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,541评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,168评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,417评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,094评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,088评论 2 352

推荐阅读更多精彩内容