Realtime Topic Analysis of Twitter Streams
For our Spring 2014 Cloud Computing final project, ECE graduate student Mijail Gomez and undergraduate student Shivani Singh built a system that discovers topics in real-time tweet streams from New York, Los Angeles, and Chicago.
We ran it one evening for fun, and one of our members recalled that a TV show called “Love & Hip Hop” would be airing around that time. This explained why “love”, “hip”, and “hop” appeared under Chicago’s third topic around 10:00pm in the video below, and why the phrase “Love & Hip Hop” appeared under Chicago trends on Twitter’s official webpage in the screenshot below. Even though the limitations of our algorithm, time, and resources restricted us to a dictionary of just over 15,000 words, we were delighted at how well our system applies machine learning to the tweets it receives.
Using clusters we built on AWS, our system uses Apache Storm to gather and filter the stream of tweets, then looks up each word of a tweet in a preprocessed dictionary to produce a string of word indices that Apache Spark will accept. Spark is responsible for the data processing, applying topic modeling, specifically Latent Dirichlet Allocation (LDA), with Variational Bayes to estimate the posterior probabilities. Although Variational Bayes only approximates the exact solution, it is simpler to program and can also be parallelized.
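The Storm-side preprocessing step can be sketched as follows. This is a minimal illustration, not the project's actual code: the dictionary, function name, and sample tweet are all assumptions, but the idea matches the pipeline described above, i.e. each in-dictionary word is replaced by its index so Spark receives a compact numeric string.

```python
# Sketch of the preprocessing step: map each word of a tweet to its
# index in a fixed dictionary and emit a string for Spark to consume.
# Words not in the dictionary are simply dropped.

def tweet_to_indices(tweet, dictionary):
    """Return a space-separated string of dictionary indices for the
    in-dictionary words of `tweet`."""
    index = {word: i for i, word in enumerate(dictionary)}
    tokens = tweet.lower().split()
    return " ".join(str(index[t]) for t in tokens if t in index)

# Toy dictionary for illustration; the real one has ~15,000 words.
dictionary = ["love", "hip", "hop", "show", "chicago"]
print(tweet_to_indices("Love hip hop show tonight", dictionary))
# prints "0 1 2 3" ("tonight" is outside the dictionary and is filtered out)
```

Doing this lookup inside Storm keeps the heavyweight LDA computation in Spark working on small integer sequences rather than raw text.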
For the website we used Flask, a lightweight web application framework written in Python. On execution, the Python script launches three threads, one to read and store the data for each of the three cities, as output by Spark’s master node. Each thread listens on a port, and every time it receives new data it appends it to a global list that holds the updated data coming in from Spark. Spark outputs five popular topics for each city and the top 300 words describing each topic. We used the D3.js library for data visualization on the website. The website updates in real time, and each city’s five topics are represented as five nodes; the words in each topic are displayed next to the graph. An edge connects two nodes from different cities when there is high correlation between the words of those nodes. The longer the learning algorithm runs on Spark, the more correlations we see between cities’ topics.
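The per-city collector threads can be sketched as below. All names here are assumptions, and a queue stands in for the network socket so the sketch stays self-contained; the real script listens on ports for Spark's output, but the structure, one thread per city appending to a shared global list, is the same.

```python
# Minimal sketch of the collector threads: one thread per city consumes
# topic updates and appends them to a global list the web layer reads.
# A queue.Queue replaces the real socket for the sake of a runnable example.

import threading
import queue

latest_topics = []            # global list the website renders from
lock = threading.Lock()       # guard concurrent appends

def city_listener(city, inbox):
    """Consume topic updates for one city and store them globally."""
    while True:
        data = inbox.get()
        if data is None:      # sentinel value: shut the thread down
            break
        with lock:
            latest_topics.append((city, data))

inboxes = {city: queue.Queue() for city in ("new_york", "los_angeles", "chicago")}
threads = [threading.Thread(target=city_listener, args=(c, q))
           for c, q in inboxes.items()]
for t in threads:
    t.start()

inboxes["chicago"].put(["love", "hip", "hop"])   # a topic update arrives
for q in inboxes.values():
    q.put(None)                                   # stop all listeners
for t in threads:
    t.join()
```

Flask route handlers can then serve `latest_topics` as JSON for D3.js to poll.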
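The edge rule between two cities' topics can be illustrated with a simple word-overlap measure. Jaccard similarity and the 0.5 threshold are assumptions for this sketch; the project may measure correlation differently.

```python
# Hypothetical sketch of the edge rule: connect two topic nodes from
# different cities when their top-word lists overlap heavily.
# Jaccard similarity and the 0.5 threshold are illustrative choices.

def jaccard(words_a, words_b):
    """Fraction of words shared between two topics (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def has_edge(topic_a, topic_b, threshold=0.5):
    """True when the two topics are similar enough to draw an edge."""
    return jaccard(topic_a, topic_b) >= threshold

nyc_topic = ["love", "hip", "hop", "show"]
chi_topic = ["love", "hip", "hop", "night"]
print(has_edge(nyc_topic, chi_topic))  # 3 shared of 5 total -> 0.6 -> True
```

As LDA refines its topics over time, more topic pairs cross the similarity threshold, which matches the observation above that edges accumulate the longer the algorithm runs.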