COMP9313_WEEK5

WEEK5 课程摘要（MapReduce & Pregel）：1）Graph Data Processing in MapReduce；2）Introduction to Pregel（略）

1)关键词：Graph; shortest path; Adjacency Matrix; Adjacency List; Dijkstra’s algorithm; BFS(Breadth First Search)

1）Graph Data Processing in MapReduce
图经常涉及到的问题：
(1) Finding shortest paths
(1.1)Routing Internet traffic and UPS trucks
(2) Finding minimum spanning trees
(2.1) Telco laying down fiber
(3) Finding Max Flow
(3.1) Airline scheduling
(4) Identify “special” nodes and communities
(4.1) Breaking up terrorist cells, spread of avian flu
(5) Bipartite matching
5.1Monster.com, Match.com
(6)PageRank

对图的分析包括：
(1) General Graph
(1.1) Count the number of nodes whose degree is equal to 5
(1.2) Find the diameter of the graphs
(2) Web Graph
(2.1) Rank each webpage in the web graph or each user in the twitter graph using PageRank, or other centrality measure
(3)Transportation Network
(3.1) Return the shortest or cheapest flight/road from one city to another
(4) Social Network
(4.1) Detect a group of users who have similar interests
(5) Financial Network
(5.1)Find the path connecting two suspicious transactions

通常情况下，图在计算机中的表示方法有两种：
1)Adjacency Matrix

Adjacency Matrix

2)Adjacency List

Adjacency List

关于最短路径问题，这里介绍两种方式：1）Dijkstra’s Algorithm（针对单机而言）； 2）BFS（可通过MapReduce来实现，从而处理含有大量节点的图问题）
这里主要介绍第二种方法：BFS
首先，我们假设一个节点到另一个节点的距离都是单位距离1（简单化处理）。因此，节点n1（假设节点n1距起始点最短距离为d）到节点n+1的距离为d+1，假设n+1有3个临近点n1, n2, n3（对应的最短距离为d1, d2, d3），现在要求n+1距离起始点最短距离，那么求出 distance_min = min(d1, d2, d3)，然后distance_min + 1就为n+1的最短距离。

image.png

由上图可得，这里由n层节点，每个节点都需要进行一次Map 和Reduce操作，而每层进行一轮MapReduce，所以对于多层的图，需要进行多轮MapReduce操作。除首尾操作，重点的MapReduce都是前节点的Reducer的输出作为该节点的Mapper的输入。

image.png

上图作为用BFS算法对最短路径探索的MapReduce的伪代码，具体代码可参照ppt 31-32页，执行情况可参照ppt 36-40页。
当Iteration达到最后一个节点时，Iteration结束。

Dijkstra’s algorithm和BFS在MapReduce执行的对比：
(1) Dijkstra’s algorithm is more efficient
(1.1) At any step it only pursues edges from the minimum-cost path inside the frontier
(2) MapReduce explores all paths in parallel
(2.1) Lots of “waste”
(2.2) Useful work is only done at the “frontier”

图算法一般包括：
(1)Performing computations at each node: based on node features, edge features, and local link structure
(2)Propagating computations: “traversing” the graph
常用的手段：
(1)Represent graphs as adjacency lists
(2)Perform local computations in mapper
(3)Pass along partial results via outlinks, keyed by destination node
(4)Perform aggregation in reducer on inlinks to a node
(5)Iterate until convergence: controlled by external “driver”
(6)Don’t forget to pass the graph structure between iterations（在这里，传递图结构步骤可以优化，即每一轮MapReduce中通过Reducer在HDFS中获取即可）

我们可否优化在MapReduce中的最短路径搜索呢？
Schimmy Design Pattern：即将(key, distance)和(key,(distance, list(node relation)))之间分开，只传递(key, distance)，然后Reducer通过在HDFS中获取图结构，就可以达到提升效率的目的。

COMP9313_WEEK5

推荐阅读更多精彩内容