Challenge,how to implement an algorithm for six degree of separation?
UserA-UserB-UserC-UserD-UserF
Users connected by '-' know each other.
And I need an algorithm for these 2 tasks:
- Calculate the path from UserX to UserY
- For UserX,calculate all users that is no more than 3 steps away.
Is there an efficient solution?
EDIT
My purpose is not to prove it right or wrong,but to calculate the result real time when necessary.
Plus,I think the most expressive way is code,even pseudo ones.
EDIT AGAIN
I've decided that this kind of job must be done inside database,so it must be a sql solution!
Solution 1:
Represent this list of users by a graph
- Each user is a node
- There is an edge between any two users who know each other
- Calculate the path from UserX to UserY
- For UserX,calculate all users that is no more than 3 steps away.
These questions are so closely related that the same algorithm actually solves both of them. You can call it "Dijkstra's algorithm with all edges having weight 1," or "breadth-first search."
Essentially, starting at the first node, visit all its relatives; then mark them all as visited, record the shortest path to each (the shortest path to them + the edge you just traversed), and repeat for each of them. Stop after you've reached your destination for Problem #1, stop after the shortest path is > 3 for Problem #2.
This will run in O(n) time. No, there is no faster way.
The fastest O(n) algorithm for six-degrees of separation would probably be finding the sets of all users 1-step away from UserX and UserY, and finding the intersection of those two sets. If none, then add the users 2-steps from UserX and intersect; then add users 2-steps from UserY and intersect; etc. up to 3.
If each person has an average of 100 friends, this could require looking at up to about 2,020,200 users, as opposed to the 1,010 billion for Dijkstra's algorithm. In practice, these numbers would be much smaller, since often two of your friends are also friends with each other.
This is the only method of solving six-degrees of separation (of those mentioned so far) that will work in practice.
Solution 2:
Graph algorithms can help you here. Learning about them is fun too!
- Graph connectivity for the connectivity.
- Dijkstra (A*) for paths between users
- Simple DFS for finding all users N nodes away from your user
Dijkstra or similar should be used if you want to find the shortest path between two users. There may be several paths between them, and Dijkstra takes care of noting when it found a shorted path than another that was found before.
To find the shortest paths between all users you'll have to use something like Floyd-Warshall. It is a nice example of dynamic programming and is quite simple to implement. The pseudocode from Wikipedia is:
1 /* Assume a function edgeCost(i,j) which returns the cost of the edge from i to j
2 (infinity if there is none).
3 Also assume that n is the number of vertices and edgeCost(i,i) = 0
4 */
5
6 int path[][];
7 /* A 2-dimensional matrix. At each step in the algorithm, path[i][j] is the shortest path
8 from i to j using intermediate vertices (1..k−1). Each path[i][j] is initialized to
9 edgeCost(i,j) or infinity if there is no edge between i and j.
10 */
11
12 procedure FloydWarshall ()
13 for k := 1 to n
14 for i := 1 to n
15 for j := 1 to n
16 path[i][j] = min ( path[i][j], path[i][k]+path[k][j] );
Solution 3:
I have a suggestion which is quite different from those you already have. If you have to stick with an SQL database and you don't know any java this suggestion won't be of much use.
You problem is specifically a graph problem, so I would suggest that while using an SQL database to store the graph will work, a different approach would be to use a solution that is intended specifically for graph problems.
The Neo4j project provides a disk-based graph database, together will lots of graph algorithms to work with it. Quote:
Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.
An appropriate example of using Neo4j on their wiki demonstrates a degrees-of-separation web application using IMDB data. The example illustrates shortest-path calculations between any actor and Kevin Bacon.
I like this example since it talks a lot about modeling the domain your graph will represent. Modeling your domain ensures you have thought about things like:
- Directed vs Undirected
- Edge types/relationships
- Attributes such as edge weights
As has been mentioned in other posts, there are a number of algorithms for calculating shortest-paths, such as Dijkstra, Floyd Warshall or BFS. All these have been implemented in Neo4j and some examples are provided here.