Malay Keshav

Netaji Subhas Institute of Technology

Working with Twitter API

February 4, 2014, by malay.keshav, category All, Python, Twitter

Recently, in our Machine Learning practical examination, our group was asked to extract tweets related to the recent election and predict which party had the highest probability of winning. The first step was to extract the data from Twitter.

I was familiar with the various Twitter API libraries available for Python, but I wanted to write my own small Python script for this.
The first thing you need to do is get the CONSUMER KEY and ACCESS TOKEN pairs (each key or token comes with its own secret).

Once I had the keys, I had to get familiar with the header format Twitter expects on API calls: an OAuth 1.0 Authorization header signed with HMAC-SHA1. Responses to the requests come back as JSON.
I used the urllib and urllib2 libraries for Python.
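The signing step is the part that is easiest to get wrong, so here is a minimal sketch of it on its own. It is written for Python 3 (where `quote` lives in `urllib.parse`), and the helper names and the placeholder keys and secrets are made up for illustration. It only builds the signature base string and the HMAC-SHA1 signature; it does not send a request.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0 percent-encoding: everything except unreserved characters is escaped.
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    # Encode every key and value, sort the pairs, and join them as k=v&k=v...
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(k + "=" + v for k, v in pairs)
    # Base string: METHOD & encoded URL & encoded parameter string
    return "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])


def sign(base_string, consumer_secret, token_secret):
    # Signing key: the two encoded secrets joined with "&"
    key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values, for illustration only (not real credentials)
params = {
    "q": "delhi elections",
    "count": "100",
    "oauth_consumer_key": "example-consumer-key",
    "oauth_nonce": "example-nonce",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1391500000",
    "oauth_token": "example-access-token",
    "oauth_version": "1.0",
}
base = signature_base_string("GET", "https://api.twitter.com/1.1/search/tweets.json", params)
signature = sign(base, "example-consumer-secret", "example-token-secret")
```

Note that the query parameters (`q`, `count`) are signed together with the `oauth_*` values, and that the parameter string is encoded a second time when it is placed into the base string.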
The Twitter API limits the number of results to 100 per request, so I had to repeat the request in a loop.
Many tweets were retweets and/or duplicates of each other, which had to be filtered out.
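The paging and filtering logic can be tried out without touching the network. The sketch below is Python 3 and assumes a hypothetical `fetch_page(max_id)` callable that returns a list of status dicts, newest first; `fake_fetch` and the tiny `TIMELINE` are invented test data, not real API output.

```python
def collect_tweets(fetch_page, max_requests=150):
    """Page backwards through results, keeping each distinct tweet once."""
    seen_ids = set()
    seen_texts = set()
    collected = []
    max_id = None
    for _ in range(max_requests):
        statuses = fetch_page(max_id)
        if not statuses:
            break  # nothing older is left
        for status in statuses:
            # A retweet is replaced by the original tweet it wraps
            tweet = status.get("retweeted_status", status)
            if tweet["id"] in seen_ids or tweet["text"] in seen_texts:
                continue
            seen_ids.add(tweet["id"])
            seen_texts.add(tweet["text"])
            collected.append(tweet)
        # max_id is inclusive, so step just below the oldest status on this page
        max_id = min(s["id"] for s in statuses) - 1
    return collected


# Invented data: two retweets of tweet 1 and two tweets with identical text
TIMELINE = [
    {"id": 5, "text": "RT vote", "retweeted_status": {"id": 1, "text": "vote"}},
    {"id": 4, "text": "results soon"},
    {"id": 3, "text": "results soon"},
    {"id": 2, "text": "RT vote", "retweeted_status": {"id": 1, "text": "vote"}},
    {"id": 1, "text": "vote"},
]


def fake_fetch(max_id, page_size=2):
    # Stand-in for the search request: statuses with id <= max_id, newest first
    eligible = [s for s in TIMELINE if max_id is None or s["id"] <= max_id]
    return eligible[:page_size]


tweets = collect_tweets(fake_fetch)
```

The next `max_id` is taken from the ids of the statuses on the page rather than from the unwrapped originals, since an original tweet can be much older than its retweet and would make the loop skip everything in between.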
The final script for this was:

 

import urllib
import urllib2
from hashlib import sha1, md5
import time
import json
import binascii
import hmac
import re

CONSUMER_KEY = "your key"
CONSUMER_SECRET_KEY = "your secret key"
ACCESS_TOKEN = "your access token"
ACCESS_TOKEN_SECRET = "your secret access token"
HTTP_METHOD  = "GET"
OAUTH_VERSION = "1.0"
BASE_URL = "https://api.twitter.com/1.1/search/tweets.json"

key_dict = dict()
def init():
	global key_dict
	key_dict = dict()
	key_dict['oauth_consumer_key'] = urllib.quote(CONSUMER_KEY, '')
	key_dict['oauth_nonce'] = urllib.quote(md5(str(time.time())).hexdigest(), '')
	key_dict['oauth_signature_method'] = urllib.quote("HMAC-SHA1", '')
	key_dict['oauth_timestamp'] = urllib.quote(str(int(time.time())), '')
	key_dict['oauth_token'] = urllib.quote(ACCESS_TOKEN, '')
	key_dict['oauth_version'] = urllib.quote(OAUTH_VERSION, '')

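def getSignature(values):
	# Request parameters are signed together with the oauth_* values
	for value in values:
		key_dict[value] = urllib.quote(values[value], '')
	# Parameter string: sorted key=value pairs joined with "&"
	finKey = ""
	for key in sorted(key_dict.keys()):
		finKey += key + "="+key_dict[key]+"&"
	finKey =  finKey[:-1]
	# Signature base string: METHOD & encoded URL & encoded parameter string
	finKey = HTTP_METHOD + "&" + urllib.quote(BASE_URL, '') + "&" + urllib.quote(finKey, '')
	# Signing key: consumer secret and token secret joined with "&"
	key = urllib.quote(CONSUMER_SECRET_KEY, '')+"&"+urllib.quote(ACCESS_TOKEN_SECRET, '')
	hashed = hmac.new(key, finKey, sha1)
	# b2a_base64 appends a newline, which [:-1] strips
	finKey = binascii.b2a_base64(hashed.digest())[:-1]
	key_dict['oauth_signature'] = urllib.quote(finKey, '')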

def getHeaderString():
	ret = "OAuth "
	key_list =['oauth_consumer_key', 'oauth_nonce', 'oauth_signature', 'oauth_signature_method', 'oauth_timestamp', 'oauth_token', 'oauth_version']
	for key in key_list:
		ret = ret+key+"=\""+key_dict[key]+"\", "
	ret = ret[:-2]
	return ret

def makeRequest(values):
	url = BASE_URL
	getSignature(values)
	headers = { 'Authorization' : getHeaderString()}
	data = urllib.urlencode(values)
	req = urllib2.Request(url+"?"+data, headers= headers)
	response = urllib2.urlopen(req)
	the_page = response.read()

	# print the_page
	return json.loads(the_page)

indexed_tweets = set()
tweet_hash = set()
extracted_tweets = dict()
init()
values = {'q':'delhi elections', 'count':'100'}
for x in xrange(1,150):
	data = makeRequest(values)
	print "#",x
	if not data['statuses']:
		# No more results, stop paging
		break
	for tweet in data['statuses']:
		# Page on the id of the status itself, not of the retweeted original
		lastId = tweet['id']
		if("retweeted_status" in tweet):
			tweet = tweet['retweeted_status']
		if tweet['text'] in tweet_hash:
			continue
		if tweet['id'] not in indexed_tweets :
			indexed_tweets.add(tweet['id'])
			extracted_tweets[tweet['id']]=[tweet['text'], tweet['retweet_count'], tweet['favorite_count'], tweet['user']['screen_name']]
			tweet_hash.add(tweet['text'])
	# max_id is inclusive, so ask only for tweets older than the last one seen
	values = {'q':'delhi elections', 'count':'100','max_id':str(lastId - 1)}
	init()
fout = open("Tweets.csv",'w')
fout.write("ID, User, Retweets, Favorites, Tweet\n")
for ids in extracted_tweets:
	# Keep only runs of letters, digits, #, @, \ and /; "+" avoids the empty matches that "*" produces
	text = re.findall(r'[a-zA-Z0-9#@\\/]+',extracted_tweets[ids][0])
	text = ' '.join(text)
	text = ','.join([str(ids),"@"+extracted_tweets[ids][3],str(extracted_tweets[ids][1]),str(extracted_tweets[ids][2]),text.encode('utf-8')])
	fout.write(text+"\n")
fout.close()
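The script above keeps the CSV valid by stripping every character outside a small whitelist from the tweet text, which also removes commas and punctuation. An alternative, shown here only as a sketch and not as what the script does, is to let Python's `csv` module quote the fields. It is written for Python 3, and the `write_tweets` helper and the sample row are invented for illustration; the row layout mirrors `extracted_tweets` (text, retweets, favorites, screen name).

```python
import csv
import io


def write_tweets(rows, handle):
    # csv.writer quotes any field that contains a comma, quote, or newline
    writer = csv.writer(handle)
    writer.writerow(["ID", "User", "Retweets", "Favorites", "Tweet"])
    for tweet_id, (text, retweets, favorites, user) in rows.items():
        writer.writerow([tweet_id, "@" + user, retweets, favorites, text])


# Invented sample row in the same shape as extracted_tweets
rows = {
    101: ["Delhi votes today, turnout is high", 3, 1, "some_user"],
}

buffer = io.StringIO()  # stands in for open("Tweets.csv", "w", newline="")
write_tweets(rows, buffer)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Reading the output back with `csv.reader` returns the tweet text with its comma intact, so no characters need to be thrown away just to protect the file format.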

So, what do you think?