So this week at my workplace I took baby steps toward becoming a Data Scientist! I just peeked into distributed computing, data warehousing and data crunching (add more terms like these to sound cool). The Airflow and Hadoop ecosystems made me go “oh wow!”. I built a very simple Twitter sentiment analysis big data solution using Cloudera Hadoop, Airflow and Python. Let’s see how I did it!
Firstly, install Cloudera Manager on your cluster. You can install it automatically or manually by following this link – http://www.cloudera.com/documentation/manager/5-1-x/Cloudera-Manager-Installation-Guide/cm5ig_install_cm_cdh.html . Since my goal was to demonstrate a proof of concept, I used the automated installation process on Ubuntu 16.04 Xenial. Remember, any error is resolvable if you refer to the /var/log/ directory. 😛
Then install Flume on your cluster; you can do that through Cloudera Manager’s UI, which runs on port 7180 by default.
So the idea was to follow this architecture –
*sorry for slightly tilted image… poor drawing skills you know!
So we first configure Flume with a custom Twitter source that streams tweets about a topic into HDFS in the desired JSON format. You can read more about it here – https://github.com/cloudera/cdh-twitter-example
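Here is roughly what that Flume agent config looks like – a sketch modeled on the cdh-twitter-example repo. The `TwitterSource` class ships with that project, and all the keys, hostnames and paths below are placeholders you would swap for your own:

```properties
# Flume agent: Twitter source -> memory channel -> HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
# The topic(s) you want sentiment for
TwitterAgent.sources.Twitter.keywords = hadoop, airflow

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```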
Now we are getting streaming data into our HDFS, but being a data scientist I want to know people’s sentiment about the topic. Is there a way to query HDFS?
Yes! There is: Hive!
Using Hive you can run SQL queries on your HDFS data. “But wait, I want sentiment analysis every hour, not just once. It will help me market my campaign in a better way!” Is that what you are thinking? Then my friend, “we are on the same page!” 😛
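As a sketch of the Hive side (following the cdh-twitter-example README – the SerDe class and HDFS location are assumptions you would adapt to your setup), an external table over the raw JSON makes the tweet texts one SELECT away:

```sql
-- External table over the JSON files Flume lands in HDFS.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

-- The texts we want to run sentiment analysis on:
SELECT text FROM tweets;
```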
So I used pyhive to write Python code that fetches tweet texts from HDFS, then used the textblob library to do natural language processing and get the sentiment polarity – i.e. “positive”, “negative” or “neutral” – for each tweet. Finally I used the matplotlib library to generate a pie chart showing those three categories as percentages.
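To give a feel for that step without a running Hive server, here is a toy stand-in: the word lists below are made-up assumptions for illustration (TextBlob uses a trained lexicon and returns a continuous polarity score instead), but the classify-then-aggregate shape is the same:

```python
from collections import Counter

# Toy word lists standing in for TextBlob's lexicon (assumptions, not TextBlob's data).
POSITIVE = {"love", "great", "awesome", "good", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "sad", "awful"}

def polarity(text):
    """Crude polarity score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(text):
    score = polarity(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def sentiment_breakdown(tweets):
    """Percentage of positive / negative / neutral tweets, for the pie chart."""
    counts = Counter(label(t) for t in tweets)
    total = len(tweets) or 1
    return {k: round(100.0 * counts[k] / total, 1)
            for k in ("positive", "negative", "neutral")}

tweets = ["I love this campaign", "this release is terrible", "just installed flume"]
print(sentiment_breakdown(tweets))
# → {'positive': 33.3, 'negative': 33.3, 'neutral': 33.3}
```

The dict this returns is exactly what you would feed to matplotlib’s `pie()` as values and labels.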
I know you are still wondering how to get an hourly analysis, right? Here comes Airflow.
Airflow is a workflow scheduler, and the best part is that it is programmatic. You don’t have to deal with all those config files stored in various places like in Oozie. Airflow is a general-purpose workflow scheduler, unlike Oozie, which is only for the Hadoop ecosystem. You can see my bias towards Airflow; well, that is because it is in Python!! *Love*
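A minimal DAG sketch for the hourly run looks something like this – the dag_id, owner and script paths are my placeholders, and the imports follow the Airflow 1.x API of the time:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "me",
    "start_date": datetime(2017, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Run the whole pipeline once every hour.
dag = DAG("twitter_sentiment", default_args=default_args,
          schedule_interval="@hourly")

# Hypothetical scripts: one queries Hive via pyhive, the other draws the pie chart.
fetch = BashOperator(task_id="fetch_tweet_sentiment",
                     bash_command="python /path/to/analyze_tweets.py",
                     dag=dag)
plot = BashOperator(task_id="plot_pie_chart",
                    bash_command="python /path/to/make_pie_chart.py",
                    dag=dag)

fetch >> plot  # plot runs only after the analysis task succeeds
```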
Finally, if you get all the parts right, the sum turns out to be beautiful! You will get an hourly pie chart like this.
Cool, isn’t it?
P.S. – I am not providing the full code here. I think coding all the small parts is simple and kind of trivial if you are moderately versed in Python. This post conveys the general idea of how to carry out the whole process.
I hope it was worth reading! Keep Learning… Happy Coding!