COMP9313_WEEK5

WEEK5 课程摘要(MapReduce & Pregel):1)Graph Data Processing in MapReduce;2)Introduction to Pregel(略)

1)关键词:Graph; shortest path; Adjacency Matrix; Adjacency List; Dijkstra’s algorithm; BFS(Breadth First Search)

1)Graph Data Processing in MapReduce
图经常涉及到的问题:
(1) Finding shortest paths
(1.1)Routing Internet traffic and UPS trucks
(2) Finding minimum spanning trees
(2.1) Telco laying down fiber
(3) Finding Max Flow
(3.1) Airline scheduling
(4) Identify “special” nodes and communities
(4.1) Breaking up terrorist cells, spread of avian flu
(5) Bipartite matching
5.1Monster.com, Match.com
(6)PageRank

对图的分析包括:
(1) General Graph
(1.1) Count the number of nodes whose degree is equal to 5
(1.2) Find the diameter of the graphs
(2) Web Graph
(2.1) Rank each webpage in the web graph or each user in the twitter graph using PageRank, or other centrality measure
(3)Transportation Network
(3.1) Return the shortest or cheapest flight/road from one city to another
(4) Social Network
(4.1) Detect a group of users who have similar interests
(5) Financial Network
(5.1)Find the path connecting two suspicious transactions

通常情况下,图在计算机中的表示方法有两种:
1)Adjacency Matrix


Adjacency Matrix

2)Adjacency List


Adjacency List

关于最短路径问题,这里介绍两种方式:1)Dijkstra’s Algorithm(针对单机而言); 2)BFS(可通过MapReduce来实现,从而处理含有大量节点的图问题)
这里主要介绍第二种方法:BFS
首先,我们假设一个节点到另一个节点的距离都是单位距离1(简单化处理)。因此,节点n1(假设节点n1距起始点最短距离为d)到节点n+1的距离为d+1,假设n+1有3个临近点n1, n2, n3(对应的最短距离为d1, d2, d3),现在要求n+1距离起始点最短距离,那么求出 distance_min = min(d1, d2, d3),然后distance_min + 1就为n+1的最短距离。


image.png

由上图可得,这里由n层节点,每个节点都需要进行一次Map 和Reduce操作,而每层进行一轮MapReduce,所以对于多层的图,需要进行多轮MapReduce操作。除首尾操作,重点的MapReduce都是前节点的Reducer的输出作为该节点的Mapper的输入。

image.png

上图作为用BFS算法对最短路径探索的MapReduce的伪代码,具体代码可参照ppt 31-32页,执行情况可参照ppt 36-40页。
当Iteration达到最后一个节点时,Iteration结束。

Dijkstra’s algorithm和BFS在MapReduce执行的对比:
(1) Dijkstra’s algorithm is more efficient
(1.1) At any step it only pursues edges from the minimum-cost path inside the frontier
(2) MapReduce explores all paths in parallel
(2.1) Lots of “waste”
(2.2) Useful work is only done at the “frontier”

图算法一般包括:
(1)Performing computations at each node: based on node features, edge features, and local link structure
(2)Propagating computations: “traversing” the graph
常用的手段:
(1)Represent graphs as adjacency lists
(2)Perform local computations in mapper
(3)Pass along partial results via outlinks, keyed by destination node
(4)Perform aggregation in reducer on inlinks to a node
(5)Iterate until convergence: controlled by external “driver”
(6)Don’t forget to pass the graph structure between iterations(在这里,传递图结构步骤可以优化,即每一轮MapReduce中通过Reducer在HDFS中获取即可)

我们可否优化在MapReduce中的最短路径搜索呢?
Schimmy Design Pattern:即将(key, distance)和(key,(distance, list(node relation)))之间分开,只传递(key, distance),然后Reducer通过在HDFS中获取图结构,就可以达到提升效率的目的。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容

  • 没有爱的人生是没有牵挂的人生,是平淡乏味的人生,也是可有可无的人生,正如经营着一家岛上书店的老板A.J.费克里。 ...
    风0420阅读 192评论 0 1
  • 关于写字,我曾经读过这样一段话:“文字是我们的宗教,让我们继续倒行逆施。不要求两三年升半职,要求两三年出一本冷僻的...
    carbon小白阅读 896评论 4 4
  • 冰冷的火焰 把世界烧成冬天 天地空旷 却无处安放思念 咆哮和嘶吼 是最苍白的语言 放弃抵抗 点燃灵魂作香烟 冬天终...
    今夜思想阅读 273评论 0 0