Hadoop Cluster, Wikipedia, and Six Degrees of Separation (2014)

Hadoop Cluster, WIkipedia, and Six Degrees of Separation

For our second Cloud Computing programming assignment in Spring 2014, ECE graduate student Mijail Gomez and I successfully created a system using Apache's Hadoop running on AWS EC2 Instances, Python, and PostgreSQL to find the shortest distance between two Wikipedia articles within a 6GB Wikipedia data dump. Thanks to Hadoop's inherent parallelism, searching breadth-first throughout a Wikipedia data dump to find the connection between two articles does not take ages to compute. Here's an image of the output: