The Great Indian Demonetization

In a surprise move on 8 November 2016, Indian Prime Minister Narendra Modi declared the two largest currency notes (Rs. 1000 and Rs. 500) invalid. The action primarily targets the counterfeit currency in circulation in India and the massive amounts of black money that corrupt people hold. It’s too early to say whether the move is good or bad, but that doesn’t stop people from giving their opinions.

There are two primary hashtags in use on Twitter – #demonetization and #demonetisation. Note the difference in spelling: the first is American English and the second is British English. I thought it might be interesting to compare the tweets that used one hashtag with those that used the other. Some people may have used both, but chances are they didn’t, because the hashtags are long and take up too much space.

My hypothesis was that tweets using the British spelling would be more negative because they are more likely to originate from India. Like people in most countries, Indians are quite vocal on Twitter. Besides, Twitter is massively gamed by all the political parties. All the opposition parties in India currently oppose the move, so I expected them to be funding negative publicity on Twitter, and they are likely to use the hashtag with the British spelling.

On the other hand, many non-resident Indians are likely to be unaffected by the move. The apparent benefits to the economy are large, while the cost to individuals who don’t live in India is small. So the people using the American spelling are likely to be more positive about demonetization.

However, many foreign publications such as CNN, the NY Times, and the Washington Post may also tweet about this with the American hashtag. In my experience, these publications have a severe bias against right-wing parties and Modi in general. This will likely make the comparison more difficult.

I downloaded 5,000 tweets for each hashtag. After cleaning the tweets and running a simple text analysis to classify the sentiment of each as positive, negative, or neutral, this is what I found.
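The post doesn’t show the classification code, but the word-list approach it describes can be sketched in R. The mini-lexicons and sample tweets below are hypothetical, purely for illustration; a real analysis would use a full opinion lexicon (e.g., Hu & Liu’s).

```r
# Hypothetical mini-lexicons; real analyses use much larger word lists
positive_words <- c("good", "great", "support", "bold", "benefit")
negative_words <- c("bad", "chaos", "queue", "corrupt", "failure")

# Classify one tweet by counting lexicon hits
classify_tweet <- function(tweet) {
  words <- unlist(strsplit(tolower(tweet), "[^a-z']+"))
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

# Hypothetical tweets, one per sentiment class
tweets <- c("A bold move, great for the economy",
            "Endless queues, total chaos at banks",
            "Banks reopen on Monday")
table(sapply(tweets, classify_tweet))
```

Tallying these labels per hashtag gives the counts behind the two figures.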


Figure 1: #demonetization

So we have almost twice as many tweets with positive sentiment as with negative for demonetization. Note that these tweets used the American English spelling. Now let’s take a look at the British spelling.


Figure 2: #demonetisation

Well, I get the same pattern, with about double the number of positive tweets compared to negative! So I can’t reject the null hypothesis of no difference. In other words, this will always remain a blog post 😉

Just for fun, I also plotted the wordcloud for both.


Figure 3: Wordcloud for #demonetization


Figure 4: Wordcloud for #demonetisation

Customer Satisfaction of American Airline Companies

Flying on US domestic airlines is a nightmare. The customer service is pathetic, the staff unfriendly, the airlines charge for every little thing…the list goes on and on.

The University of Michigan carries out surveys of American customers and publishes the average scores annually as the American Customer Satisfaction Index (ACSI). You can check the scores for several industries on their website. For airlines, the chart looks like this:


American Customer Satisfaction Index for American Airline Companies

The airlines appear in descending order of the 2015 ACSI scores, which range from 81 for JetBlue to 54 for Spirit.

ACSI is published for a given brand only once a year, but companies are interested in knowing about customer satisfaction round the clock. So I decided to use Twitter sentiment as a measure of customer satisfaction. This is a rough exercise to see whether we get results with face validity. My students will recall that, for airlines, Twitter is one of the key social networks for addressing customer complaints, so Twitter is likely to capture customer satisfaction in real time. The validity question is therefore really about ACSI, not about Twitter sentiment. A commonly discussed issue with Twitter is that it’s not representative of the general population; still, we must keep in mind that ACSI may not be a good representation of American flyers’ sentiment either.

I decided to focus on the 9 airlines for which the ACSI scores for 2015 are available – JetBlue, Southwest, Alaska, Delta, American Air, Allegiant, United, Frontier, and Spirit. The graph looks as follows:


ACSI Scores for Nine American Airlines

The average score for these 9 airlines is 68.11. As the maximum possible ACSI score is 100, 68 is not a great score. However, I am amazed at how far the expectations of American flyers have fallen. I am sure that if the survey respondents were from Asia, the average would be below 50. But that’s a story for another post, where I will compare sentiment about the best airlines, including Singapore Airlines, Emirates, Qatar, etc.

Next, I went to Twitter and downloaded tweets directed at these airlines. My condition was simply that the airline’s Twitter handle should appear in the tweet. For example, a tweet mentioning @JetBlue would be treated as targeted at JetBlue and included in the analysis. I carried out this data collection on 2 April 2016 from Singapore. I then categorized the tweets as positive, negative, or neutral. To compare with ACSI, I created a metric similar to the Net Promoter Score (NPS), given by the following formula:

\displaystyle \text{Net Sentiment Score} = \frac{\text{Total Positive Tweets} - \text{Total Negative Tweets}}{\text{Total Tweets}}
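In R, the score is a one-liner on the sentiment counts. The counts below are hypothetical, not the actual tallies from the analysis:

```r
# Net Sentiment Score = (positive - negative) / total tweets
net_sentiment <- function(pos, neg, neu) {
  (pos - neg) / (pos + neg + neu)
}

# Hypothetical counts for one airline
net_sentiment(pos = 420, neg = 230, neu = 350)  # 0.19
```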


Here is the graph when I plotted net sentiment scores of all the 9 airlines:


Net Sentiment Scores for 9 American Airlines

The score is bounded between -1 and 1. If all the tweets are negative then the score will be -1 and if all the tweets are positive then the score will be 1.

The average score is 0.19, which is around 60% of the way up the scale ((0.19 + 1)/2 ≈ 0.60). Similar to the ACSI graph, 4 airlines–JetBlue, Alaska, Southwest, and Delta–are above the mean while the remaining 5 are below it. Interestingly, these are the same 4 airlines that have above-average ACSI scores, although the ordering differs slightly. To better compare the two graphs, I decided to plot them in the same space, which requires a common scale; for convenience I used Z-scores.¹


ACSI and Twitter Net Sentiment Score Correlation

I find that the correlation is high, at 0.77, and statistically significant, with a p-value of 0.016. However, note that we have only 9 observations, so the standard error is likely to be large. Indeed, the 95% confidence interval for the correlation coefficient is quite wide, [0.21, 0.95], but the lower bound is still comfortably above 0.
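The z-scoring and correlation test can be reproduced with base R’s scale() and cor.test(). The two vectors below are illustrative stand-ins, not the actual ACSI and sentiment values:

```r
# Illustrative stand-ins for the 9 airlines (not the real values)
acsi <- c(81, 78, 75, 71, 66, 65, 60, 58, 54)
nss  <- c(0.45, 0.38, 0.40, 0.25, 0.10, 0.12, 0.02, -0.05, -0.10)

# Z-scores put both measures on the same scale (mean 0, sd 1)
acsi_z <- scale(acsi)
nss_z  <- scale(nss)

# Pearson correlation with p-value and 95% confidence interval
cor.test(acsi, nss)
```

Note that the Pearson correlation is unchanged by z-scoring; the z-scores are needed only for plotting the two series in one space.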

I think ACSI does a fair job of capturing the satisfaction of American air travellers; it corresponds to Twitter sentiment quite well. It’s worth noting that I am comparing survey results collected over a one-to-two-month period in 2015 with tweets sent on or slightly before 2 April 2016. It would be worth studying how Twitter sentiment fluctuates over time. That is my next assignment once I am done with the sentiment analysis of the top-ranked airlines.

In case you are interested in the individual airlines’ sentiment charts, you can view them here:

¹ A Z-score has a mean of 0 and a standard deviation of 1.

10 Richest Indians

I was trying out a cool new R package, forbesListR, which lets you download lists from the Forbes website. The package still needs a lot of work, but I could download data on India’s 100 richest people from 2012 to 2015. As I played around with the data, I decided to plot it using the following ggplot2 code:



library(ggplot2)
library(ggthemes)   # provides theme_fivethirtyeight()

# "a" holds the edited list of the 10 richest Indians (see below)
ggplot(a, aes(x = reorder(name, rank), y = net_worth.millions, fill = year2)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  # label each bar with its dollar value, rotated to fit
  geom_text(aes(label = scales::comma(net_worth.millions), angle = 90),
            position = position_dodge(width = 0.9),
            vjust = 0.2, hjust = 1.2, color = "white") +
  theme_fivethirtyeight() +
  scale_y_continuous(labels = scales::comma) +
  xlab("Name") + ylab("Net Worth in Million $") +
  guides(fill = guide_legend(title = NULL)) +
  ggtitle("Net Worth in Million $")

In my code, “a” was the edited list of 10 richest Indians.

Although the Hinduja and Godrej families were in the top 10, I deleted them from the data and decided to focus on individuals. You can download the data from here.

India top 10

Mukesh Ambani is still the richest Indian, although Dilip Shanghvi is a close second. I found Gautam Adani’s rise in 2014 and 2015 quite interesting. Similarly, Cyrus Poonawalla increased his net worth substantially in those two years. On the other hand, Lakshmi Mittal lost more than $4.5 billion due to falling steel prices.

Finally, Shiv Nadar of HCL increased his worth from $5.6 billion in 2012 to a staggering $12.9 billion in 2015. I think he has been a clear winner.

Instagram Filters and Laziness

I always believed that filters are what made Instagram such a big hit; otherwise it was just another photo-sharing app. When I started using Instagram, I would try many filters before settling on one and sharing my picture. Over time, out of laziness, I started sharing pictures without any filter — what Instagram calls the “Normal” filter. Recently I wondered whether my behavior was peculiar or whether many other people also use the Normal filter when sharing pictures on Instagram. So I did a quick and dirty analysis using pictures collected from the Orchard Road area in Singapore. Why Orchard? Well, let’s just say that Orchard is one of the most frequented tourist spots in Singapore, which makes my analysis more representative.

I analyzed pictures collected in January and February of 2014, 2015, and 2016, which makes the comparison easier and more uniform. As the filters offered by Instagram changed over the two-year period, I am not plotting all the filter counts for the 3 years on the same graph; instead I show 3 separate graphs, one for each year. Also, I show bars only for filters used more than 500 times in the two-month period. This cutoff is arbitrary, but it keeps the bar graphs readable. All the filters with <= 500 pictures were combined and labeled “Other”, which also shows up in the graphs.
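The thresholding step above can be sketched in R. The filter names and counts below are hypothetical, purely for illustration:

```r
# Hypothetical filter counts for a two-month window
counts <- c(Normal = 6200, Clarendon = 900, Gingham = 760,
            Juno = 540, Lark = 510, Hudson = 300, Sierra = 120)

# Combine filters used 500 times or fewer into "Other"
keep <- counts > 500
tidy <- c(counts[keep], Other = sum(counts[!keep]))

# Percentage share per bar, for the graphs
round(100 * tidy / sum(tidy), 1)
```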

Without further ado, here are the three graphs:


Well, I am not an outlier! It turns out that since at least 2014, people have been using the Normal filter (basically no filter) more than any other. The percentages of pictures with the Normal filter were 59% in 2014, 73% in 2015, and 65% in 2016.

It’s tough to say why people selected the filters they did. My hypothesis is that people are lazy and so go with the default. The “Normal” filter is the default, so it’s picked most often. Instagram has been changing the filters, so I don’t know how they were arranged in 2014 or 2015, but for 2016 the ordering was as follows:

Screenshots: Instagram’s 2016 filter ordering

Clarendon and Gingham are the next two filters Instagram shows after Normal, and they are also the next two most commonly used filters in the Orchard area! After that, the filter ranking in my graph loses its correlation with Instagram’s default ordering. Perhaps this indicates that non-lazy people actually hunt down the filter that gives them the best-looking result. Still, Juno and Lark, which are 5th and 6th on my 2016 graph, appear in 7th and 5th position in the Instagram ordering, and Hudson, Sierra, and X-Pro II, which are at the end of my bar graph, also appear towards the end of Instagram’s ordering. It seems there is some support for the “lazy” hypothesis!

If you have more to add, please share your insights in the comments.

Marketing Analytics – Summary of Session 1

We started the second trimester today at ESSEC’s Singapore campus. I am teaching Marketing Analytics (Engineering) to two sections of 50 students each. In my first lecture I introduced the fundamental problem marketers face – how to justify their decisions to those who control the budget. Gone are the days when people could simply use experience, gut feel, intuition, etc. as valid criteria for selecting marketing strategies. Now nobody wants to bet even $1 on speculative marketing managers; data-driven marketing is the new norm. My course is an introduction to this new reality. In any marketing course, ‘brand positioning’, ‘segmentation’, ‘targeting’, ‘media planning’, etc. are common terminologies, and professors and students know what these concepts mean. Yet, given a real-life business situation, how many students would actually be able to come up with a strategic solution? Very few indeed.

Our Course Text – Principles of Marketing Engineering

Over the next five weeks, we will take a two-step approach. First, we will clarify a marketing concept, e.g., positioning. We will then work out what information needs to be collected to plot a perceptual map showing brand positioning in a two- or three-dimensional space. Next, we will use SPSS to do the data analysis with statistical techniques such as factor analysis. Finally, based on the perceptual maps, students will recommend actions. There will be hard numbers involved: for example, when students suggest launching a new brand to exploit a potential gap in the market, they will need to justify it by projecting the changes in market shares, and they will have to account for cannibalization of any existing brands from the company launching the new brand. This will be a complex but fun exercise!

The other topics include segmentation decisions using probability models, salesforce allocation, and conjoint analysis. As we started working with SPSS today, I used a dataset of accounting information on several US firms for 2010 and 2011. The students’ first task is to build a sales response model and test it against the data: to what extent do sales respond to advertising? The response model will not be very complicated, yet we may end up using a logit-type curve (the ADBUDG model) – who knows?
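For reference, the ADBUDG response curve mentioned above has a simple closed form: sales rise from a floor toward a saturation level as effort increases, with an S-shape controlled by two parameters. The R sketch below uses hypothetical parameter values chosen only to illustrate the shape:

```r
# ADBUDG: sales response to marketing effort x.
# mn = response at zero effort, mx = saturation level,
# c controls the S-shape, d sets where the curve is halfway up
adbudg <- function(x, mn, mx, c, d) {
  mn + (mx - mn) * x^c / (d + x^c)
}

# Hypothetical parameters: sales rise from 10 toward 50 units
round(adbudg(x = c(0, 1, 5, 20), mn = 10, mx = 50, c = 2, d = 25), 1)
```

Fitting such a curve to the advertising and sales columns of the dataset would give a rough sales response model of the kind the assignment asks for.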

I believe that modeling the data is not the most important thing; it’s just a small component of decision making. The critical parts are reading the analysis, interpreting it, and then recommending a decision path. I don’t like blind data mining of millions of data points to come up with patterns that everyone already believes are true. Unfortunately, this is exactly what’s happening in the analytics area. Data mining coupled with intelligent experiments is the way to go. (More on this later.) Bringing intuition to this party is like inviting Michael Lohan to speak at a conference on responsible parenting!