Andres Guzman-Ballen

Hadoop Cluster, Wikipedia, and Six Degrees of Separation

For our second Cloud Computing programming assignment in Spring 2014, ECE graduate student Mijail Gomez and I built a system using Apache Hadoop running on AWS EC2 instances, Python, and PostgreSQL to find the shortest path between two Wikipedia articles within a 6 GB Wikipedia data dump. Because Hadoop spreads the work across the cluster in parallel, a breadth-first search through the whole dump for the connection between two articles does not take ages to compute. Here's an image of the output:
[Image: output of the shortest-path search]
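
To make the breadth-first idea concrete, here is a minimal single-machine sketch in Python. It is not our Hadoop job: it assumes the link graph already fits in memory as a dictionary, and the function name `shortest_link_path` and the article titles in `toy_links` are hypothetical, made up purely for illustration. On the cluster, the same level-by-level expansion would be spread across many machines rather than run in one loop.

```python
from collections import deque


def shortest_link_path(links, start, goal):
    """Return the shortest chain of articles from start to goal, or None.

    links maps an article title to the list of titles it links to.
    This is a plain in-memory breadth-first search (illustrative only).
    """
    if start == goal:
        return [start]

    parent = {start: None}   # also serves as the "visited" set
    queue = deque([start])

    while queue:
        article = queue.popleft()
        for neighbor in links.get(article, ()):
            if neighbor in parent:
                continue      # already reached by a path at least as short
            parent[neighbor] = article
            if neighbor == goal:
                # Walk the parent pointers back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)

    return None               # goal is not reachable from start


# Hypothetical toy link graph standing in for the Wikipedia dump.
toy_links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

print(shortest_link_path(toy_links, "A", "F"))
```

Because breadth-first search visits every article one hop away before any article two hops away, the first time it reaches the goal it has found a path with the fewest possible hops, which is the "degrees of separation" count.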