This blog post is not about twitter API programming, though thats what it does deal with. It focuses on the intermediate level infrastructure (ie. higher than the HTTP/REST APIs exposed by other sites but lower than the class libraries that surround those) necessary to work with HTTP/REST based APIs being offered by various web sites.
I have been working on scouring and analysing twitter data for which I have been having to work with continuously accessing twitter on a sustained basis for a number of days. I started refactoring my code recently, and this blog post details the results of that exercise. More specifically in the context of invoking HTTP APIs it deals with the following aspects (many of them which are specifically introduced due to twitter.
Ability to make HTTP Calls and deal with HTTP error conditions
Ability to deal with Connection and other failures and allow for auto retries
Ability to make HTTP Calls in bulk in batches
Ability to spawn HTTP Calls across multiple threads
Ability to respect and deal with API Rate Limitations
I have specifically focused on creating a pipelined design which might be an interesting design aspect for many in this situation, and have primarily relied on python based generators for the same though similar functionality could be built using Iterators in other languages as well.
Accessing sites using HTTP Basic Authentication
Twitter is accessed using Python urllib2. Since many API’s often require using basic authentication, we shall set up the HTTP Opener to be able to work with such sites.
To be able to initialise the opener we define a method get_opener() which accepts the username, password, realm and host to be able to setup the opener for the site.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |
Making a Basic HTTP call to fetch JSON data
Twitter supports data both in XML and JSON formats. I tend to prefer json since it is much more compact on the network and is more efficient to parse and construct as well. As a result I have function which extracts the data from a URL as presented below. It takes the opener and the URL and returns a constructed data object. Note the class Data which has no predefined attributes but instead dynamically adds attributes to itself based on JSON attributes. Moreover it checks whether the return parameter is a single object or a sequence and acts correspondingly_The logic actually should work with nested dictionaries (dictionaries within dictionaries) as well, which is currently a pending exercise_
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | |
Simple Client
I think we have enough infrastructure methods at this stage to initiate a simple HTTP invocation to give us an idea of how these could be used. In the code below twitter_user_id and twitter_password will need to be changed to your user id and password. This shows you how a simple twitter call could be made.
1 2 3 4 | |
Automatic Retries
We do know that network communications are not perfect and occasionally we get errors which are not due to the data but due to the network or technical infrastucture. Errors that are unlikely to be repeated if the same function is retried again. Hence we work out a common retry function which will retry certain function invocation based on various retry policy parameters. Note that retry strategies cannot be used when the functions are not idempotent, especially if it is not possible to ascertain if a invocation has been received and processed by the server. However most retrieval calls are idempotent (except for the side effect of twiter api rate limits still getting impacted), so we should be able to use the retry logic in most cases. The retry method is as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | |
There are still some missing gaps we need to fill in in this case. The check error function is domain specific (in this case twitter specific) which checks for an attribute called error in the data object as follows.
1 2 3 4 5 6 7 8 | |
We also know that http error code 404 means a not found, and retrying is unlikely to help that. However code 520 is often received when twitter is over capacity and 101 is received in case of network communication failures, good candidates for retrying the http calls. Finally for other error codes I preferred to have a default policy as retry though your choice could be different. Thus the final invocation along with the retry logic looks like follows
1
| |
Bulk Processing Twitter IDs
It is often useful to do processing in bulk, especially when one does have bulk data to process. In such a situation the starting point is the list of data elements to be processed. We shall be attempting to retrieve twitter data in bulk for a large number of user ids. So the starting point is a generator which can return a list of user ids that we need to fetch. For sake of simplicity, I have written a generator which will generate twitter ids as a sequence of numbers, though you could replace it with a more suitable generator as appropriate
1 2 3 4 5 6 7 8 | |
Batching
In case of bulk processing, it is often useful to break down the data in a set of batches. To be able to do that we need a generator which can generate a set of batches from an underlying generator of ids (defined above). Usually the batch size could’ve been fixed, but in case of twitter, API rate limits apply and hence sometimes the batch sizes need to be computed dynamically depending upon the remaining available API calls. Thus we first define a function which decides on the batch size based on the preferred batch size and remaining hits as per twitter. It also uses a keep_unused parameter to ensure that a certain number of API calls are left unused (ostensibly for usage by other programs)
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
Now we also need to create a generator which will generate batches using the underlying generator. However I did run into a peculiar problem. When the underlying generator as completed (StopIteration), the batch generator also needs to complete processing. This required setting up of a booleanfunction attribute, but since function attributes are not mutable, I had to define a BooleanWrapper class which internally contains a boolean attribute and has a false() method which sets the attribute value to false. The new class along with the Batch generator are shown below.The sizer and generator arguments are functions / generators that we have already defined earlier.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | |
Threading
Now that we have batching support, we might like to be able to multithread the calls. In my situation, I was whitelisted by Twitter so had an API rate limit of 20,000 calls per hour. However I prefer to get the API calls done as soon as possible (say in the first 20 minutes of an hour), and use the remainder of the time to do additional processing. Being able to process 1000 API calls per minute requires support of either multiprocessing or multi threading. Since I was interested in only having one executable (from a manageability perspective) I decided to use the multi threading approach. Note that python has the Global Interpreter Lock (GIL) issue, but thats unlikely to play a role since threads easily yield the GIL to each other when they reach blocking IO points and most HTTP calls are indeed blocking IO.
Having decided on threading, I decided to spawn a set of threads per batch, process the batch, shutdown the threads and continue processing. There are more efficient ways of maintaining a continuously running thread pool, but given the limitations of the rate limit APIs, even the implementation I chose is far more than adequate as far as efficiency goes.
Anyway for any threading operations we either need a Thread subclass with a run method or a function which will be executed in a thread. Given the simplicity of this situation I chose to use a function to retrieve a particular user’s information as follows. This function shall be invoked simultaneously by different threads for different user ids.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
Note that I added the print statements just so that you can follow the program when it runs. These are not a part of the final code. It is quite important here to be able to handle all exceptions, and not leave any exceptions unhandled. Unhandled exceptions result in the main program getting terminated, something that is probably not such a great thing to have for a continuously running program. However I haven’t really included any error handling code here apart from a simple print statement.
If you noticed this function still doesn’t contain our actual twitter userid fetch logic. Thats in another function which is passed as a parameter to the above function
1 2 3 4 5 6 | |
Having written a function that can be executed in a thread which will retrieve user data, we now need a function which will take a batch, spawn a bunch of threads, feed the data to the threads through a queue, and allow the threaded function to process each element in the queue concurrently. This function is as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | |
Putting it all together Finally here’s the main body which actually assembles the data pipeline and triggers the full processing. Since the typical API rate limit is 100, I have set the keep_unused to 90 (so 10 API calls are likely to be used). I also set the preferred batch size as 6 and thread pool size as 3. This should result in two batches being fetched, the first of size 6 and the next of size 4 (assuming your API rate limit is at 100) through a threadpool of 3.
1 2 3 4 5 6 7 | |
In summary
I would’ve actually preferred to blog at length about the various design characteristics, but since this post has become really long, I shall probably come back to it in the coming few days. However I would encourage you to note the brevity (about 175 lines of non commented source code) in which we have been able to achieve a lot, and a clear separation of the infrastructural and twitter specific code. Moreover its an example of really putting together a lot of diverse functionalities and capabliities by using building blocks which can be plugged together to work with each other (thanks in no small part to the function objects and generator support of python). The full listing is in the file twitter-http-rest-api.py.