Automatic sentiment analysis of up to 16,000 social web texts per second with up to human level accuracy for English - other languages available or easily added.
SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:
-1 (not negative) to -5 (extremely negative)
1 (not positive) to 5 (extremely positive)
Why does it use two scores? Because research from psychology has revealed that we process positive and negative sentiment in parallel - hence mixed emotions.
SentiStrength can also report binary (positive/negative), trinary (positive/negative/neutral) and single scale (-4 to +4) results. SentiStrength was originally developed for English and optimised for general short social web texts but can be configured for other languages and contexts by changing its input files - some variants are demonstrated below. Finnish, German, Dutch, Spanish, Russian, Portuguese, French, Arabic, Polish, Persian, Swedish, Greek, Welsh, Italian, Turkish.
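As a rough illustration of how the two strengths relate to the other result types, the sketch below collapses a (positive, negative) pair into a single-scale and a trinary value. The rule used here (summing the two strengths, then taking the sign) is an assumption chosen because it reproduces the -4 to +4 range stated above; it is not taken from the SentiStrength documentation, and the function names are hypothetical.

```python
def to_scale(positive: int, negative: int) -> int:
    """Collapse a dual score into one value in -4..+4.

    Assumption: the single scale is the sum of the two strengths.
    """
    if not 1 <= positive <= 5:
        raise ValueError("positive must be in 1..5")
    if not -5 <= negative <= -1:
        raise ValueError("negative must be in -5..-1")
    return positive + negative


def to_trinary(positive: int, negative: int) -> int:
    """Return 1 (positive), 0 (neutral) or -1 (negative).

    Assumption: the trinary class is the sign of the single-scale value.
    """
    scale = to_scale(positive, negative)
    return (scale > 0) - (scale < 0)


# A text scored 4 positive and -2 negative (mixed, mostly positive).
print(to_scale(4, -2), to_trinary(4, -2))
```

Keeping the two strengths separate preserves mixed emotions (for example 4 and -4), which any collapsed value necessarily loses.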
SentiStrength is free for academic research and is certified safe by Softpedia. Please contact the author for the commercial Java version or a commercial license for the online version. The free version runs under Windows only and is provided without liability or guarantees for any uses. Downloading SentiStrength and/or the configuration files signifies acceptance of these conditions. This version does not contain the keyword or domain classification facilities.
- Register for the free SentiStrength download - make sure to save the program AND the data files (this does not do the binary/trinary/scale classifications).
- If SentiStrength gives an error message when starting, please try downloading and installing the "Microsoft .NET Framework Redistributable Package" and then try running SentiStrength again.
- Remember to use Register New Location in the File menu to point SentiStrength to the location of the data files as soon as it loads, unless they are saved in the default location C:\SentStrength_Data\.
A commercial licence for SentiStrength is available for £1000 - please contact m.thelwall -at- wlv.ac.uk. The Java version of SentiStrength is normally used commercially.
SentiStrength is used by computing, language technology and market research companies in the US, Europe and Australia. Some use the default English version and others have translated it into different languages or adapted it to integrate with their existing language technology systems. Commercial users range from small start-ups to one of the world's top 10 largest corporations.
The Java version of SentiStrength is similar to the Windows version in core functions but has additional capabilities - see the SentiStrength Java manual and the Mac users' starting instructions (which will probably also help on Linux). It can conduct binary (positive/negative), trinary (positive/neutral/negative) and single-scale (-4 very negative to +4 very positive) classifications in addition to the standard type, and can conduct keyword-oriented and domain-oriented classifications. It also has a special mode for binary and trinary classification on longer texts. It allows wildcards in the idiom list file. To use the Java version for research only (free), email from your academic institution's email address. It can process about 16,000 tweets per second.
For Python users, here is some sample Python code from Alec Larsen, University of the Witwatersrand.
SentiStrength is a sentiment analysis (opinion mining) program. It is described and evaluated in the following peer-reviewed academic articles:
- Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558.
- Thelwall, M., Buckley, K., & Paltoglou, G. (2012). Sentiment strength detection for the social Web. Journal of the American Society for Information Science and Technology, 63(1), 163-173.
- Thelwall, M., & Buckley, K. (2013). Topic-based sentiment analysis for the Social Web: The role of mood and issue-related words. Journal of the American Society for Information Science and Technology, 64(8), 1608–1617.
- Thelwall, M., Buckley, K., Paltoglou, G., Skowron, M., Garcia, D., Gobron, S., Ahn, J., Kappas, A., Küster, D., & Holyst, J.A. (2013). Damping Sentiment Analysis in Online Communication: Discussions, monologs and dialogs. In: A. Gelbukh (Ed.): CICLing 2013, Part II, LNCS 7817, pp. 1-12. Springer, Heidelberg.
- [Turkish version] G. Vural, B. B. Cambazoglu, P. Senkul, and O. Tokgoz (2013) A framework for sentiment analysis in Turkish: Application to polarity detection of movie reviews in Turkish, Computer and Information Sciences III, pp 437-445.
- Thelwall, M. (in press). Heart and soul: Sentiment strength detection in the social web with SentiStrength (summary book chapter).
It has been applied in the following research projects, amongst others.
- Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.
- Kucuktunc, O., Cambazoglu, B.B., Weber, I., & Ferhatosmanoglu, H. (2012). A large-scale sentiment analysis for Yahoo! Answers, Proceedings of the 5th ACM International Conference on Web Search and Data Mining. [Used in Yahoo!]
- Weber, I, Ukkonen, A., & Gionis, A. (2012). Answers, not links: extracting tips from yahoo! answers to address how-to web queries, Proceedings of the fifth ACM international conference on Web search and data mining (WSDM '12). [Used in Yahoo!]
- Pfitzner, R., Garas, A., & Schweitzer, F. (2012). Emotional divergence influences information spreading in Twitter, ICWSM-12.
- Garas, A., Garcia, D., Skowron, M., & Schweitzer, F. (2012). Emotional persistence in online chatting communities. Scientific Reports, 2, article 402.
- Mihai Grigore and Christoph Rosenkranz (2011). Increasing the willingness to collaborate online: an analysis of sentiment-driven interactions in peer content production. ICIS 2011 Proceedings. Paper 20.
- G. Vural, B. B. Cambazoglu, and P. Senkul (2012). Sentiment-focused web crawling, Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp 2020-2024.
- Giorgos Giannopoulos, Ingmar Weber, Alejandro Jaimes, Timos Sellis (2012). Diversifying User Comments on News Articles, Web Information Systems Engineering (WISE 2012). Lecture Notes in Computer Science 2012, pp 100-113.
- From Greg Merritt: New Cities Foundation (2012), Connected Commuting: Research and Analysis on the New Cities Foundation Task Force in San Jose (SentiStrength is mentioned on page 16). See also Crowdsourcing your Commute (New York Times).
- Zheludev, I., Smith, R., & Aste, T. (2014). When can social media lead financial markets? Scientific Reports, 4. doi:10.1038/srep04213.
Press coverage and initiatives
- Reading the Riots: Investigating England's Summer of Disorder Guardian online.
- SentiStrength classified London Olympics tweets with the results put up in lights on the EDF Energy London Eye. YouTube videos: Barge 1, Barge 2, Inside Barge, LondonEye lightshow.
- Time Magazine: Want to Light Up the London Eye? Just Tweet That the Olympics Are 'Totes Amazeballs', July 27, 2012.
- UK Daily Telegraph article, p. 27, 19 July 2012, "Happy Olympic tweeters to light up London Eye" in "the world's first social media driven light show".
- BBC News Article: 20 July 2012, London Eye Olympic Twitter positivity lightshow launched.
- UK Daily Mirror article, 20 July 2012, The mood of the nation: Tweets to power spectacular London 2012 light show.
- Sydney Morning Herald: London Eye to become giant Twitter mood ring during Olympics, July 25, 2012.
- Voice of Russia: Twitter embraces Olympics, colors tweets in emotions and sends messages from space, July 24, 2012.
- SportPrimeur, 20 July 2012, London Eye twitter-uitlaatklep tijdens Spelen.
- SentiStrength classified tweets for the 2014 Super Bowl, with the results transformed into a lightshow on the Empire State Building. See video.
- NBC New York, Jan 28, 2014: Fans to Pick Colors in Empire State Building Super Bowl Light Shows.
- CBS Local, Jan 27, 2014: Start Tweeting! Super Bowl Debates To Light Up Empire State Building.
- Forbes Jan 30, 2014: Verizon's Super Bowl Scheme Is To Save $4 Million And Light Up The Sky.
- LA Times Jan 30, 2014: Empire State Building lights up for Super Bowl, Chinese New Year.
- Belfast Telegraph, Jan 29, 2014: Empire State building lights up in Super Bowl team colors.
- Fox CT, Jan 29, 2014: #WhosGonnaWin? Tweets Pick Colors Of Empire State Building.
Classifying texts with SentiStrength
To get SentiStrength to classify one or more texts, put the texts into a plain text file with one text per line. Select Analyse All Texts in File from the Sentiment Strength Analysis menu and select the text file. The output will be a copy of the file with positive and negative classifications added at the end of each line, preceded by tabs. Individual texts can also be classified by selecting Analyse One Text from the Sentiment Strength Analysis menu.
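The input and output files described above can be prepared and read with a few lines of Python. This is a minimal sketch, not part of SentiStrength: the helper names are hypothetical, and it assumes that the two values appended to each output line are the positive score followed by the negative score, each preceded by one tab.

```python
def to_input_lines(texts):
    """Put each text on a single line (one text per line).

    Line breaks and tabs inside a text are collapsed to single spaces
    so that each text occupies exactly one line of the input file.
    """
    return [" ".join(text.split()) for text in texts]


def parse_classified_line(line):
    """Split an output line into (text, positive, negative).

    Assumption: the last two tab-separated fields are the positive and
    negative classifications, in that order.
    """
    text, positive, negative = line.rstrip("\r\n").rsplit("\t", 2)
    return text, int(positive), int(negative)


lines = to_input_lines(["I love this\nso much", "not   great"])
print(lines)
print(parse_classified_line("I love this so much\t4\t-1\n"))
```

Splitting from the right means a tab that survives inside the text itself does not disturb the two score fields.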
Optimising SentiStrength term weights
The positive and negative term weights can be found in the EmotionLookupTable.txt file in the SentStrength_Data folder. These can be manually adjusted by editing the file. Alternatively, they can be automatically fine-tuned with a classified text collection. To fine-tune the EmotionLookupTable.txt values used by SentiStrength, first create a collection of texts that have been classified by humans with positive (1-5) and negative (1-5) sentiment strengths. Put these into a plain text file in which each line has the format: negative, tab, positive, tab, text. The set should contain at least 500 texts. Select Optimise the emotion dictionary weights from the Sentiment Strength Analysis menu and SentiStrength will create a new term strength list that is optimised for the sentiment in the new texts. To use the new strengths, save a copy of the original strength list and then replace it with the new list.
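The training file format described above (negative, tab, positive, tab, text) could be produced as follows. This is a sketch under stated assumptions: the function names are hypothetical, and both human codes are written as magnitudes from 1 to 5, as the description says, rather than as signed values.

```python
MIN_TEXTS = 500  # minimum collection size recommended in the text above


def format_training_line(negative, positive, text):
    """Return one 'negative<TAB>positive<TAB>text' line.

    Assumption: both strengths are written as magnitudes in 1..5.
    """
    for name, value in (("negative", negative), ("positive", positive)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} strength must be in 1..5, got {value}")
    clean = " ".join(text.split())  # keep each text on one line, no tabs
    return f"{negative}\t{positive}\t{clean}"


def build_training_file(rows):
    """Return (file_contents, is_large_enough) for (neg, pos, text) rows."""
    lines = [format_training_line(*row) for row in rows]
    return "\n".join(lines) + "\n", len(lines) >= MIN_TEXTS


contents, large_enough = build_training_file([(1, 4, "so happy today"),
                                              (3, 1, "this is awful")])
print(contents, large_enough)
```

The size check only reports whether the collection reaches 500 texts; it does not stop a smaller file from being written.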
Assessing the accuracy of SentiStrength
To assess the accuracy of SentiStrength on a set of texts, a sample must first be classified and formatted as above. The human classifications can then be compared with the SentiStrength classifications on the same sample.
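One way to make this comparison concrete is to compute the proportion of exact matches and the Pearson correlation between the human and SentiStrength scores, separately for the positive and the negative scale. The sketch below is an illustration only: the choice of these two measures and the function names are assumptions, not a description of SentiStrength's own reports.

```python
from math import sqrt


def exact_match_rate(human, predicted):
    """Proportion of texts where both scores are identical."""
    if len(human) != len(predicted) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("score lists must be non-empty and equal in length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


human_positive = [1, 3, 4, 2, 5]      # made-up human codes for illustration
program_positive = [1, 3, 3, 2, 4]    # made-up program output for illustration
print(exact_match_rate(human_positive, program_positive))
print(pearson(human_positive, program_positive))
```

Correlation rewards near misses (4 versus 5), whereas the exact match rate does not, so the two figures can differ considerably on a five-point scale.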
Alternatively, if one data set is available to optimise the word strength list and the same set is to be used for validation then the 10-fold cross-validation procedure can be used. This uses 90% of the data to train the term weights and the remaining 10% to assess the accuracy of the adjusted weights. This is repeated 10 times with a different 10% left out and the total results are reported. To run a 10-fold cross-validation, create the classified text as above and select Run a 10-fold cross-validation to assess the above algorithm from the Sentiment Strength Analysis menu.
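The 10-fold procedure described above can be sketched generically as follows. This is not SentiStrength's implementation: `train_fn` and `count_correct_fn` are hypothetical stand-ins for the weight optimisation and accuracy check, and shuffling the texts before splitting is an assumption.

```python
import random


def ten_fold_splits(n_items, folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Every item is held out exactly once across the folds.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    for k in range(folds):
        test = indices[k::folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(items, train_fn, count_correct_fn, folds=10, seed=0):
    """Train on 90%, test on the remaining 10%, and pool the results."""
    correct = 0
    for train_idx, test_idx in ten_fold_splits(len(items), folds, seed):
        model = train_fn([items[i] for i in train_idx])
        correct += count_correct_fn(model, [items[i] for i in test_idx])
    return correct / len(items)


# Toy stand-ins: the "model" is simply the most common label seen in training.
labels = [1] * 80 + [2] * 20
accuracy = cross_validate(
    labels,
    train_fn=lambda train: max(set(train), key=train.count),
    count_correct_fn=lambda model, test: sum(x == model for x in test),
)
print(accuracy)
```

Because each text is tested only by weights trained without it, the pooled figure avoids the overfitting that comes from evaluating on the training data.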
- Six sets of at least 1000 human coded texts, each coded by three independent coders.
- Coding manual for sentiment in texts (pdf version) (for Tweets, but easily changed for other texts)
- SentiStrength based 6-hour sentiment analysis course.
The various files with SentiStrength contain information used in the algorithm and may be customised.
- EmotionLookupTable.txt is just a list of emotion-bearing words, each one with the word, then a tab, then an integer from 1 to 5 or -1 to -5. This can be edited and extended. Note that strengths of +1 and -1 have no effect on the program. There are some in the list, just to indicate that the words have been considered but not used. Each word can end with a wildcard * but this can only go at the end.
- EmoticonLookupTable.txt is as above but for a list of emoticons.
- EnglishWordList.txt is just a list of English words - it is used for the part of the algorithm that tries to correct words with non-standard spellings.
- NegatingWordList.txt lists words that reverse the polarity of subsequent words, e.g., not happy is negative.
- BoosterWordList.txt lists words that increase sentiment intensity, e.g., very happy is more positive than happy.
- SlangLookupTable.txt – replaces common slang with equivalent words or expressions
- IdiomLookupTable.txt – overrides the sentiment strength of the individual words in the phrase
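To make the term list format above concrete (word, tab, integer, with an optional trailing * wildcard), here is a minimal sketch of loading such a list and looking up a word. It is an illustration rather than SentiStrength's own matching code: the function names are hypothetical, and the precedence rules (exact entries before wildcard entries, longer wildcard stems before shorter ones) are assumptions.

```python
def load_term_strengths(lines):
    """Parse 'term<TAB>strength' lines into exact and wildcard entries."""
    exact, prefixes = {}, []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) < 2 or not parts[0]:
            continue  # skip lines that do not have a term and a strength
        try:
            strength = int(parts[1])
        except ValueError:
            continue  # skip lines whose strength is not an integer
        term = parts[0].lower()
        if term.endswith("*"):
            prefixes.append((term[:-1], strength))  # wildcard at the end only
        else:
            exact[term] = strength
    prefixes.sort(key=lambda entry: len(entry[0]), reverse=True)
    return exact, prefixes


def lookup(word, exact, prefixes):
    """Return the strength for a word, or None if it is not listed."""
    word = word.lower()
    if word in exact:
        return exact[word]
    for stem, strength in prefixes:
        if word.startswith(stem):
            return strength
    return None


# Made-up entries for illustration only, not taken from the real file.
exact, prefixes = load_term_strengths(["happy\t3", "hate*\t-4", "ok\t1"])
print(lookup("Happy", exact, prefixes), lookup("hated", exact, prefixes))
```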
SentiStrength can be adjusted for other languages by translating the term list EmotionLookupTable.txt and adding any other sentiment-bearing words that have been omitted. Note that the sentiment scores for terms should be in the range 2 to 5 (positive) or -2 to -5 (negative). A score of +1 or -1 means neutral and neutral terms are ignored. A training corpus in the new language is recommended to help adjust the term weight strengths (see Optimising SentiStrength term weights). You will need the Java version or Windows version 2.2 of SentiStrength to cope with accented characters or characters not found in English as well as some additional linguistic features.
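A translated term list can be checked against the score ranges given above before it is used. The sketch below is a hypothetical helper, not a SentiStrength feature: it sorts each entry into ok (2 to 5 or -2 to -5), neutral (+1 or -1, which are ignored) or invalid (anything else, including malformed lines).

```python
def check_strength(strength):
    """Classify a term strength as 'ok', 'neutral' or 'invalid'."""
    if 2 <= abs(strength) <= 5:
        return "ok"
    if abs(strength) == 1:
        return "neutral"
    return "invalid"


def audit_table(lines):
    """Group the terms of a 'term<TAB>strength' list by check result."""
    report = {"ok": [], "neutral": [], "invalid": []}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        term = parts[0]
        try:
            strength = int(parts[1])
        except (IndexError, ValueError):
            report["invalid"].append(term)  # missing or non-integer strength
            continue
        report[check_strength(strength)].append(term)
    return report


# Made-up French entries for illustration only.
print(audit_table(["heureux\t3", "table\t1", "triste\t-7", "cassé"]))
```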
The following files will also need to be translated or replaced with a local equivalent (see the extra instructions):
- EmoticonLookupTable.txt - check the strengths are appropriate and add any common new national variations
- SlangLookupTable.txt – replace with a list of common slang in the new language
- EnglishWordList.txt – replace with a word list of correct spellings in the new language (many such lists are on the web, but this step is optional)
- NegatingWordList.txt – translate/replace with a list of negating words in the new language
- IdiomLookupTable.txt – replace with a list of common idioms in the new language
- BoosterWordList.txt – translate/replace with a list of booster words in the new language – words that emphasise the strength of emotion in any subsequent words
- QuestionWords.txt – translate/replace with a list of words in the new language that reliably indicate that a question is being asked
You will also need to register a list of non-English common multiple letters (e.g., ii is common in some languages but not English). For the Java version please see the manual for this option. For the Windows version, please check the options menu for this customisation. Spell-checking can also be completely disabled in both versions, if needed.
Negating words occurring after sentiment words (e.g., "I am happy not" is OK in German but not English) can be customised in the Java version of SentiStrength but not the Windows version, sorry. The Java version may need the utf8 option to read the input files, if in UTF8 rather than ASCII format (note that utf8 does not always work on ANSI text files so it should not be used as the default).
Would you like to help? If you are a linguist with knowledge of any of these languages then you could help by:
- Checking the dictionaries for accuracy and missing sentiment words
- Reporting any badly classified texts.
Please email m dot thelwall at wlv.ac.uk if you would like to help. This makes a good student project.
Classifiers with some testing (6)
Thank you to Eismont Polina, Efanova Iuliia, Konovalova Svetlana, Losev Viktor and Velichko Alena of Saint Petersburg State University of Aerospace Instrumentation, Department of Applied Linguistics for help with the first Russian version. (+ve correl. 0.28-0.47, -ve correl. 0.31-0.46 on tweets - the second number is overfitted due to testing on the evaluation data set, so the real correlation is probably about 0.35 for both). 3000 human-classified Russian tweets.
Thank you to Юлия Павлова, Olessia Koltsova and Sergei Koltsov for the second Russian version. It was developed by the Laboratory for Internet Studies, National Research University Higher School of Economics (NRU HSE), and supported by the Russian Humanitarian Research Foundation and NRU HSE.
There is also a Turkish sentiment strength classifier that is a variant of SentiStrength created by Gural VURAL, METU Computer Eng. Dept. This is available on the same basis as the Java version.
Completely untested classifiers (10) [just for fun: please email m dot thelwall at wlv.ac.uk if you would like to help improve them - this makes a good student project for linguists or computer scientists, together with testing the results, and making a small corpus of sentiment-classified texts! Here are the language files for 9 of these languages - please improve them and send back if you like. Please also send us a few positive and negative words from any languages not listed here and we will make a new version for your language!]
Basic classifiers (10) that recognise only a few sentiment words. Please email m dot thelwall at wlv.ac.uk if you would like to help improve them or to send a list of at least 10 common sentiment words for any language. We can't get Hindi, Punjabi and Bengali to work at the moment, sorry. Also, the Chinese simplified and traditional and the Japanese are artificial versions that add spaces between words or phrases into the language.
SentiStrength can be adjusted for other domains (e.g., Twitter, product reviews) by adding new relevant words and sentiment strengths to the term list EmotionLookupTable.txt and adjusting any relevant existing term strengths. The other files can also be adjusted, as for language customisation. For example, the file EmotionLookupTableGeneral.txt in the download zipfile contains a slightly adjusted set of term weights to cope with more impersonal communication than MySpace. In this alternative file, the word "love" has a higher strength because it is less likely to be used in formulaic message endings, such as "love from" or "love u" or "love x".
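Domain adjustments of this kind amount to changing some strengths in the term list and appending new terms. The sketch below shows one way to apply such changes to the lines of a term list; the function name is hypothetical, and the example terms and strengths are invented for illustration rather than taken from EmotionLookupTable.txt or EmotionLookupTableGeneral.txt.

```python
def apply_domain_overrides(lines, overrides):
    """Return term list lines with overridden strengths and new terms added.

    Existing terms keep their position; terms not yet in the list are
    appended at the end in alphabetical order.
    """
    remaining = dict(overrides)
    updated = []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) >= 2 and parts[0] in remaining:
            parts[1] = str(remaining.pop(parts[0]))  # adjust existing term
        updated.append("\t".join(parts))
    for term, strength in sorted(remaining.items()):
        updated.append(f"{term}\t{strength}")  # add new domain term
    return updated


# Hypothetical values: raise "love" and add a product-review term.
general = ["love\t2", "hate\t-4"]
print(apply_domain_overrides(general, {"love": 3, "refund": -2}))
```

As recommended for the automatic optimisation, it is safest to keep a copy of the original list before replacing it with an adjusted one.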
The data mining menu and ARFF menu items are not part of the main SentiStrength functionality nor documented. Please ignore them unless they make sense to you.
For further issues, please see the Frequently Asked Questions.
SentiStrength was produced as part of the CyberEmotions project, supported by EU FP7.