Twitminer - Machine Learning Algorithm
TwitMiner is a contest designed to extract/mine 'useful' information from a collection of tweets hosted by Twitter.com. This is a challenging and interesting problem as the data in a single tweet is limited by number of characters. The objective of this contest is to design a good prediction algorithm from a collection of tweets. The algorithm had to be designed in teams of 2.
We were given a set of tweets (around 5000) and we had to classify them into two categories , namely Sports and Politics. The approach we followed is as follows:
One of the major problems faced in working with twitter data, is the wide variety of language , slang ,and hashtags used in them. Using a predefined dictionary of words would not have given an optimum result. Users tend to use abbreviated slang words and hash tags to reduce the tweet length below 144 characters.
Thus the algorithm that we chose , is independent of any language or type of dataset or what we are trying to classify the tweets into.
We used the frequency of various words that occur in the tweet, to classify them. This gave rise to a problem where words such as prepositions, pronouns and conjunctions that occurred in high frequencies in both sports & politics, created undesired results. To prevent this problem we implemented an algorithm where the word is considered only if it occurs atleast 9 times more in one category than the other. We also implemented a limit on the word length , setting it to 3. Thus only words with length greater than or equal to 3 are considered (No such limit was applied to #hashtags).
To reduce the length of tweet and to create hashtags, users on twitter combine separate words into a single word. To identify such combined words while searching in the dictionary, we instead search for it in the substring of the word. One of the prime examples that occurred in the training dataset was #sunrisersipl which is a combination of sunrisers and ipl. This approach also solved the problem where normal words such as cricketer and cricket could not be related.
Although the above approach solved a few problems, it also created new ones. Taking the example of #sunrisersipl , the string has the substring ‘un’. This made the algorithm classify the tweet as a political tweet. To solve this, we added a new constraint, where in the length of the word we are comparing (‘un’) and the length of the word we are comparing it with (‘sunrisersipl’) have to be in the ratio of (65/100) or more.
The value of 65/100 for length comparison, and 9:1 ratio for frequency weight-age and the minimum word length of 3 were
calculated after numerous tests and calibrations. All this gave us an accuracy of 83%.
To further improve the accuracy we modified the algorithm, such that , as it classifies the validation tweets (or the test dataset tweet), it also learns from the newly classified tweets by assuming that the classification is correct.
This increased the accuracy to 91%.
To further improve the accuracy, we added multiple passes to the validation dataset. This increased the accuracy to 93%.
The problem with the algorithm now was that it was learning from the noisy tweets as well. To prevent this we added a noisy filter. Using this filter, the learning was now only based on non noisy tweets. To improve this even further, we decreased the noisy filter value on every pass. Using which the accuracy increased to 95%.
To reach the final 96% we added minor changes. We checked with what special characters to be considered and what not to be considered. We added the concept of acronyms such as UN , UNO etc (words which are all uppercase and smaller in length).
Currently our algorithm is unable to classify tweets where the 65% world length threshold is not reached, eg #sunriseatolympic. (olympic is < 65% length of the overall word , and hence ignored). Another problem that our algorithm faces is when new words are introduced. eg. A tweet contains “sports ministry” . The frequency of sports is very less since it does not occur in the learning dataset, thus giving an improper result. To make the algorithm completely independent of any external dictionary, we decided not to hardcode any dictionary words.
During the final days , we tried creating a cluster of words that occur together, and were able to increase the accuracy to 96.32% using the unfinished cluster algorithm. But being in its nascent stage, the increase was not enough for us to implement it in the final algorithm.