This is not just another year for India: 2019 will shape the country’s future, because a new government has to be elected. The leader who emerges can make India the world’s fastest-growing economy or dash all its dreams of progress. India is one of the largest democracies in the world. With leadership that harnesses its capabilities, it also has the potential to become the most powerful country.
In this 2019 national election (the Lok Sabha election), three major parties are participating. Two of them are old players in this game. The Indian National Congress (INC, or simply “Congress”) dates back to December 1885, long before India’s independence in August 1947. The BJP (Bharatiya Janata Party) was established in April 1980. The third, the AAP (Aam Aadmi Party), was established only in November 2012, but it has garnered a lot of praise in a few parts of India and become a strong competitor to the two long-established parties. Every party has chosen its leader, and the election is shaping up to be a very interesting contest. In this project I explore that contest by analyzing Twitter sentiment around the election.
I chose this project because this is a very sensitive time in India, which makes it the ideal time frame for the analysis. I started around 20th May, and the election result was due on 23rd May, so I could validate my findings right away against a real-world outcome. In working on this project, I applied web scraping, data analysis, and, most importantly, the basics of NLP (Natural Language Processing).
I divided the project into the steps listed below:
- Scrape all the tweets for the period of 01-Jan-2019 to 23-May-2019.
- Clean and filter the tweets, and apply NLP techniques to prepare the data for analysis.
- Apply Python data analysis techniques to find meaningful insights.
- Look for indications on Twitter of who would win the election.
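As a rough illustration of the cleaning and filtering step in the list above, here is a minimal sketch. The regular expressions and the helper names `clean_tweet` and `is_in_period` are assumptions made for this example; the post does not list its exact cleaning rules. Punctuation and capitalization are left alone on purpose, since VADER uses both as sentiment cues.

```python
import re

# Hypothetical patterns; the post does not spell out its exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


def is_in_period(date_iso: str, start: str = "2019-01-01", end: str = "2019-05-23") -> bool:
    """Filter by date; ISO dates (YYYY-MM-DD) compare correctly as plain strings."""
    return start <= date_iso <= end


cleaned = clean_tweet("RT @user: Vote for #BJP https://t.co/abc  now!")
```

Here `cleaned` keeps the words and punctuation of the tweet but drops the link, the mention, and the `#` symbol.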
Constituency-Wise Data Analysis
A graph showing the sentiment distribution of each party’s tweets, computed with ‘polarity_scores’
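VADER’s `polarity_scores` returns a dictionary with `neg`, `neu`, `pos`, and `compound` values for each text. The sketch below shows one way a per-party sentiment distribution could be built from precomputed `compound` scores. It does not call VADER itself; the ±0.05 cut-offs are the thresholds commonly recommended in the VADER documentation, and the function names and sample rows are made up for illustration rather than taken from the post.

```python
from collections import Counter, defaultdict


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Bucket a VADER compound score into positive / negative / neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_distribution(rows):
    """rows: iterable of (party, compound) pairs -> {party: Counter of labels}."""
    dist = defaultdict(Counter)
    for party, compound in rows:
        dist[party][label_from_compound(compound)] += 1
    return dist


# Made-up illustration data, not results from the post.
sample = [("bjp", 0.62), ("bjp", -0.30), ("inc", 0.0), ("aap", 0.05), ("aap", -0.05)]
dist = sentiment_distribution(sample)
```

Each party’s `Counter` can then be plotted as a bar chart or normalized into percentages.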
Which Party Is Winning the Election Battle on Twitter?
A graph showing the distribution of tweet counts for each party
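import pandas as pd
import seaborn as sns
# Assumes each *_date frame has two columns (date, count): keep one date column and the three counts.
temp = pd.concat([bjp_date, inc_date, aap_date], axis=1, sort=True)
data = temp.iloc[:, [0, 1, 3, 5]].set_index('date')
ax = sns.lineplot(data=data)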
Below is the word count without tokenization applied
Below is the word count with tokenization applied
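To make the difference between the two word counts concrete, here is a minimal sketch that uses only the standard library. The tokenizer (lowercasing plus a simple regular expression) and the tiny stopword list are assumptions for this example; the post most likely used a fuller NLP toolkit.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not the list used in the post.
STOPWORDS = {"the", "is", "a", "for", "to", "in", "of", "and"}
TOKEN_RE = re.compile(r"[a-z0-9']+")


def raw_word_counts(tweets):
    """Count whitespace-separated strings exactly as they appear."""
    return Counter(word for tweet in tweets for word in tweet.split())


def tokenized_word_counts(tweets):
    """Lowercase, split on non-alphanumerics, and drop stopwords before counting."""
    counts = Counter()
    for tweet in tweets:
        tokens = TOKEN_RE.findall(tweet.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


tweets = ["Vote for Modi!", "modi is the leader", "Vote, vote, VOTE"]
raw = raw_word_counts(tweets)
tok = tokenized_word_counts(tweets)
```

Without tokenization, "Vote", "Vote,", "vote," and "VOTE" are counted as four different words. After tokenization they collapse into a single "vote" entry, and filler words such as "the" disappear.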
Summary details of the DataFrame after applying sentiment analysis
We have around 753,725 tweets, and each tweet has on average 11 likes, 1 reply, and 3 retweets. Looking at the compound score, we can see that tweets are on average slightly positive, with a mean sentiment of 0.05.
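import matplotlib.pyplot as plt
import seaborn as sns
# Distribution of compound scores (distplot is deprecated since seaborn 0.11; histplot with kde=True replaces it).
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
sns.distplot(sentimentDF['compound'], bins=15, ax=ax)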
Finally, let’s create a graph to see how tweet sentiment changes over the period covered by the data. This helps us analyze the sentiment of tweets over time.
To plot the graph below, I excluded all tweets with a compound sentiment score of 0. This makes the variation in sentiment over time easier to see.
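The filtering and aggregation described above can be sketched as follows. This is a standard-library illustration under assumptions: the `(date, compound)` row layout, the function name `daily_mean_sentiment`, and the sample values are all hypothetical, and the post itself presumably did this with pandas.

```python
from collections import defaultdict
from statistics import mean


def daily_mean_sentiment(rows):
    """rows: iterable of (date_iso, compound) pairs.

    Returns a date-sorted list of (date, mean compound), ignoring scores of exactly 0.
    """
    by_day = defaultdict(list)
    for date_iso, compound in rows:
        if compound != 0:  # tweets with a compound score of exactly 0 are excluded
            by_day[date_iso].append(compound)
    return [(day, mean(scores)) for day, scores in sorted(by_day.items())]


# Made-up illustration data, not results from the post.
rows = [
    ("2019-05-20", 0.5),
    ("2019-05-20", 0.0),
    ("2019-05-20", -0.25),
    ("2019-05-19", 0.0),
    ("2019-05-21", 0.75),
]
series = daily_mean_sentiment(rows)
```

A day whose tweets all score 0 drops out of the series entirely, which is worth keeping in mind when reading gaps in the resulting line plot.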
Future Work / Improvement
1. Sentiment analysis of Hindi words is not done properly, because the VADER package doesn’t work for the Hindi language. One way to address this is to translate each Hindi tweet into English and then analyze it.
2. I took only 10,000 tweets per day. To analyze the data more thoroughly, I could collect the complete set of tweets for each day.
3. There are a lot of loose ends when it comes to analyzing tweets written in Hinglish, particularly in applying tokenization and lemmatization.
4. There are a lot of fake/scam tweets related to #aap, which need to be cleaned out.
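As a possible first step toward item 1, tweets written in Devanagari script could be flagged before sentiment scoring so they can be routed to translation. The sketch below only does the flagging, using the Devanagari Unicode block (U+0900 to U+097F). The function names and the 0.5 ratio are assumptions for illustration, and the translation step itself is not shown.

```python
def script_counts(text: str):
    """Return (devanagari_chars, ascii_letters) found in the text."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return devanagari, latin


def needs_translation(text: str, min_ratio: float = 0.5) -> bool:
    """Flag a tweet when at least min_ratio of its letters are Devanagari."""
    devanagari, latin = script_counts(text)
    total = devanagari + latin
    return total > 0 and devanagari / total >= min_ratio


flags = [needs_translation(t) for t in ["मोदी जी", "Modi ji zindabad", "#BJP मोदी सरकार"]]
```

Note that Hinglish written in Latin script (the second sample) is not flagged by this check, so item 3 in the list would still need separate handling.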