Graph Connected Component Analysis with Bulk-Synchronous-Parallel Processing

In an undirected graph, a connected component is a subgraph in which any two vertices are connected to each other by paths. Usual algorithms build a graph where each vertex is labeled with its component number. They use a search from an unclassified vertex to build the connected component containing this vertex. The process repeats until all vertices are labeled with a component. Vertices that are not connected form a component with only one vertex.

In an edge-centric BSP implementation, like GraphX, finding all the connected components is done in parallel using Pregel iterations:

// working copy of the input graph, set label of vertices to their vertex Id
val cGraph : Graph[ Int, ED ] = graph.mapVertices(
        (vid, attr) => vid )

cGraph.pregel[ Long ](initialMsg = Long.MaxValue) (
    // vertex program
    (vid : VertexId, attr : Long, msg : Long) => Math.min(msg, attr),
    // sendMsg: send to vertex the min of component label along the edges
    (edgeT : EdgeTriplet[Long, ED]) =>
        if (edgeT.srcAttr < edgeT.dstAttr)
            Iterator( (edgeT.dstId, edgeT.srcAttr) )
        else if (edgeT.dstAttr < edgeT.srcAttr)
            Iterator( (edgeT.srcId, edgeT.dstAttr) )
        else
            Iterator.empty,
    // mergeMsg
    Math.min(_, _) )

This algorithm will build a graph where each vertex is labeled with the lowest vertex Id of the connected component that contains this vertex. The vertex program is a function called each time a vertex receives a message, in this case a Long number which is the component Id of one of its neighbor. The sendMsg function processes an edge and propagate the minimum of the vertex attribute of one vertex to the other vertex, and finally messages are merged with the min() binary function.

Leave a Reply

Your email address will not be published. Required fields are marked *