# Java代写 | CS 435 Introduction to Big Data Programming Assignment 2

CS 435 Introduction to Big Data
Programming Assignment 2

1. Introduction
Small-world networks are a class of networks that are “highly clustered, like regular lattices, yet have small
characteristic path lengths, like random graphs.
1”. This property of networks is closely related to the efficient
information transfer between and within regional clusters. Social networks are intuitive examples of the
Small-world networks. Cliques or clusters of friends are often interconnected while each member in a cluster is
only several friend relationships away from anyone else.
In this assignment, you should profile the Google+ ego networks using the “friend” network (gfn) that you have
created in the assignment 1. For students who could not extract the friend network correctly, we provide the
pre-processed friend network as a part of assignment 2. To analyze the Small-world networks property, you
should calculate the following measurements using Hadoop MapReduce:
(1) Requirement 1: Counting the numbers of edges and vertices of the Google+ friend networks
(2) Requirement 2: The average geodesic path length
(3) Requirement 3: The global cluster coefficient

Small-world networks tend to contain cliques, and near-cliques. This can be quantified with the clustering
coefficient. Also, most pairs of nodes will be connected by at least one geodesic path (shortest path). This
follows from the defining property that the mean-shortest path length be small. Several other properties are
often associated with small-world networks. Often there is a super hub in the network with a high number of
connections. These hubs serve as the common connections mediating the short pathlengths between other
edges.
In this assignment, we will focus on the two properties, the average characteristic path length (L-gfn) and the
cluster coefficient (C-gfn). You must measure the average geodesic path and the cluster coefficient of the
3.1. The Average Geodesic Path Length (Requirement 1 and 2)
Small-world networks have short pathlengths as is commonly observed in random networks. A network (or graph)
consists of nodes (i.e., vertices) connected by edges. A path in a network is a sequence of alternating nodes and
edges that starts with a node and ends with a node. Nodes or edges can appear multiple times in the same
path, and the number of edges in a path is the length of the path. If a graph is connected, then any node can be
reached via a finite-length path starting from any other node. The shortest path between a pair of nodes is called
a geodesic path and there can be more than one such path. In a friend graph, the distance from a person to her
friends is 1, the distance from her to a friend of her friend who is not also her friend is 2, and so on.
In Figure 1, we are given a network encompassing the set of nodes, V = {v1
, v2
, v3
, v4
, v5
, v6
, v7
}. The geodesic path
length between v7 and v4
is 1 and the path is (v4
, v7
). Similarly, the geodesic path length between v2 and v4
is 3
and the path is (v2
, v1
, v7
, v4
). In this case, there is another geodesic path: (v2
, v1
, v5
, v4
).
Small values of shortest path length ensure that information or resources easily
spreads throughout the network. This property makes distributed information
processing possible on technological networks and supports the six degrees of
separation often reported in social networks.
You must calculate all of the geodesic paths between pairs of vertices in the
friend network and return the average geodesic distance using MapReduce.
To calculate the average length, you should use the following formula:
L =
2
v(v−1) ∑
if n, i≠j
di,j
, where v is the number of nodes in the Google+ friend network and d-i,j is the pathlength between the node i
and j. If there is a pair of nodes without existing path, please ignore those path lengths. This computation may
involve multiple MapReduce jobs. Consider only unique pairs of nodes. Since this graph is non-directed graph,
the pair (i, j) and (j, i) are considered as the same pair. Also, if there are multiple shortest paths between the
nodes, consider only one shortest path for measuring the average geodesic pathlengths.
To calculate the total shortest path lengths of this graph ( ∑ ) , you can count unique path with each path
if n, i≠j
di,j
length and add them up. For example,

if n, i≠j
di,j = ∑
for di,j=1
di,j + ∑
for di,j=2
di,j + ∑
for di,j=3
di,j + … + ∑
for d =max_length i,j
di,j
Your software must include all of the paths with the pathlength of 1 to 10 (max_length). Please note that these
paths must include the “shortest” paths between given pair of nodes. If there are three paths, a-b, a-c-b, and
a-f-d-s-g-b, your calculation must consider only a-b. Also, do not count the duplicate path. The path a-b and b-a
should be considered as the same path.
Your submission must include the measurements as depicted in the following table.
Category ∑
for d =1 i,j
di,j ∑
for d =2 i,j
di,j ∑
for d =3 i,j
di,j … ∑
for di,j=10
di,j
Total shortest
pathlength
Total number
of vertices
The Average Geodesic Path Length
Measurement
Table 1. The submission requirement 1
3.2. Cluster Coefficient: Large degree of Transitivity (Requirement 3)
The overall clustering in a network can be determined by averaging the clustering across all individual nodes.
High clustering supports specialization as local collections of strongly interconnected nodes readily share
information or resources. Conceptually, clustering is quite straightforward to comprehend. In a real-world
analogy, clustering represents the probability that one’s friends are also friends of each other.
A cluster coefficient is a measure of the degree to which nodes in a graph tend to cluster together. A triplet is
three nodes that are connected by either two (open triplet) or three (closed triplet) undirected edges. Therefore,
a triangle (e.g. v1
, v5
, v6
in the Figure 1) involves three different triplets, ((v1
, v5
),(v5
,v6
)), ((v1
, v5
),(v1
,v6
)) and ((v1
,
v6
),(v5
,v6
)). In this assignment, you should use the following measure proposed by Luce and Perry
4
. A
global clustering coefficient is defined as the following:
C =
3×number of triangles
number of connected triples
, where a triangle consists of 3 nodes that are completely connected to each other (i.e., a 3-clique) and a
connected triple consists of three nodes { i, j, k} such that node i is connected to node j and node j is connected
to node k. The factor of 3 arises because each triangle gets counted 3 times in a connected triple. The clustering
coefficient C indicates how many triples are in fact triangles. A complete graph, in which every pair of nodes is
connected by an edge, with N ≥ 3 nodes yields the maximum possible value of C = 1, as all triples are also
triangles. The minimum value of the clustering coefficient is C = 0.
You must calculate the global cluster coefficient of the Google+ friend networks using MapReduce. Your
submission must contain the total number of connected triples and number of triangles. This computation may
involve multiple MapReduce jobs. E-mail: [email protected]  微信:itcsdx 