When starting a project, you often have to choose between several open source libraries and rolling a custom solution.

This can be particularly challenging when consulting, since you should deliver the most value to your client. As with other design decisions, a solid choice gets the client more bang for the buck and allows for more adaptability in the future.

The choice comes down to one of the following:

  1. Implementing the library or service yourself (potentially re-inventing the wheel),
  2. Relying on an open source library (that may or may not be stable and can lack documentation), or
  3. Contributing to an open source library, adding the features you’re missing (which requires getting up to speed with the library’s code base)

It is important to review the following items for each project:

  1. Does it meet your needs?
  2. Open bugs, and the maintainers’ responses.
    • Are maintainers open to contributions?
    • Are pull requests merged in a reasonable amount of time?
    • Are there any show-stopper bugs that have been reported?
  3. What is the state of the unit tests for the project?
  4. How much documentation is available for the project and is it using language standard tools for the documentation?

Below I go into how I chose a library for a project I worked on recently.

The Project

I recently started on a project using the Twitter Streaming API. I wanted to do the project in Python, and the Twitter part made up only a small portion of the project's overall goals. Sample data was needed, and Tweets provided a nice dataset to compute over.

This is important - the main focus of the project was the systems software designed to process data, not the data itself. Making this observation up front helps drive the decision making process.

Challenges

The first task is identifying the challenges associated with implementing the service at hand. I began by reviewing the specifications required for the application. In my case, these were simply the connection details for the Twitter Streaming API.

In particular, the main challenge associated with this service is implementing the various back-off methods used when the connection fails. The connection should be restarted at different intervals depending on the failure type. You will also want to honor any DNS changes by restarting the connection when necessary.
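To make the idea of failure-specific back-off concrete, here is a minimal sketch. The failure categories, starting delays, growth strategies, and caps are illustrative assumptions chosen for the example; confirm the real values against Twitter's current documentation before relying on them. The names `BACKOFF_POLICIES` and `backoff_delay` are hypothetical.

```python
# Assumed policy table: (first delay in seconds, growth strategy, cap in seconds).
# The numbers are illustrative only; check the API documentation for real values.
BACKOFF_POLICIES = {
    "network_error": (0.25, "linear", 16.0),
    "http_error": (5.0, "exponential", 320.0),
    "rate_limited": (60.0, "exponential", 960.0),
}


def backoff_delay(failure_type: str, attempt: int) -> float:
    """Return the wait in seconds before reconnect number `attempt` (1-based)."""
    base, strategy, cap = BACKOFF_POLICIES[failure_type]
    if strategy == "linear":
        delay = base * attempt  # grows by a fixed step each attempt
    else:
        delay = base * 2 ** (attempt - 1)  # doubles each attempt
    return min(delay, cap)  # never wait longer than the cap


if __name__ == "__main__":
    for kind in BACKOFF_POLICIES:
        print(kind, [backoff_delay(kind, n) for n in range(1, 6)])
```

Keeping the policy in a table rather than in branching code makes it easy to compare against whatever a candidate library implements.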

Having a list of challenges gives us a gauge to use when estimating the time it would take to implement and in measuring the efficacy of existing libraries.

Identify Strengths and Client Preferences

A quick look at the available libraries for the Twitter API shows an abundance of open source implementations across nearly every language. A quick Google search reveals even more.

We have to start by narrowing down the choices to a few select languages. In my case, I’m working with a group that is fluent in both Python and Java. For this project, I would prefer Python, as the project will not require high throughput and Python code is easy to test.

The language can be a hard choice, but consultants should remain flexible, and future maintainability must be considered. Unit tests are required, especially when using a dynamic language, and this should be factored into development time estimates.

Evaluating the strengths and weaknesses of the team who will have to maintain the code in the future may help limit the options right away.

Looking at Libraries

Since I’ve chosen to go with Python, let’s look at a few popular libraries that implement wrappers around the Twitter API.

So let’s go through the points identified above and see how the projects I’ve found, Python-Twitter, Python Twitter Tools, and Tweepy, hold up.

1. Does it Meet Your Needs?

This seems like an obvious question but, as we’ll see below, it isn’t always easy to answer. In my opinion, if you have a difficult time finding the answer, then the feature you’re looking for isn’t a top priority for the developer, and you might consider looking elsewhere.

Let’s see what we’ve got.

Python Twitter

From looking at the open issues for the Python Twitter project, you can clearly see that streaming API support isn’t included. There are a few outstanding pull requests, but these haven’t been touched in years and are not yet merged into the project.

Python Twitter Streaming API Support

Python Twitter Streaming API Pull Requests

Python Twitter Tools

Looking at the README.md (images below) of the Python Twitter Tools project shows that it supports the Streaming API, and that connection breaks can be handled through a special dictionary yielded by the iterator (meaning you’ll have to implement the back-off yourself). Since back-off was identified as the main challenge we need a library to solve, this method is only a step above implementing it ourselves, so we should see if there is anything better out there first.

Python Twitter Tools Connection Breaks
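To show roughly what implementing the back-off yourself looks like with an iterator that signals a dropped connection through a special dictionary, here is a minimal sketch. It does not call the real library: `consume` and `scripted_connect` are hypothetical helpers, the `"hangup"` key is an assumed marker, and the one-second starting delay with doubling is illustrative rather than Twitter's policy.

```python
import time
from typing import Callable, Iterable, Iterator, List


def consume(connect: Callable[[], Iterable[dict]],
            max_reconnects: int = 3,
            sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield tweets, reconnecting with a doubling delay after each hangup marker."""
    delay = 1.0  # seconds; illustrative starting value
    reconnects = 0
    while True:
        for message in connect():
            if message.get("hangup"):  # assumed marker for a dropped connection
                break
            delay = 1.0  # healthy traffic resets the back-off
            yield message
        else:
            return  # stream ended cleanly, without a hangup marker
        if reconnects >= max_reconnects:
            return  # give up after too many reconnects
        reconnects += 1
        sleep(delay)
        delay *= 2


def scripted_connect(streams: List[List[dict]]) -> Callable[[], Iterable[dict]]:
    """Return a fake `connect` that replays one canned stream per call."""
    remaining = iter(streams)
    return lambda: next(remaining)
```

Passing `sleep` as a parameter lets a test record the waits instead of actually pausing, which is one way to verify the back-off behavior of code like this.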

An alternative is to set retry=True to have the library reconnect for you. Unfortunately, the README doesn’t explicitly say that the retry method adheres to the somewhat esoteric back-off policies specified by Twitter:

Python Twitter Tools Request Retries

Tweepy

Tweepy clearly supports the Streaming API, since it is mentioned in the documentation. What is unclear is how it handles the challenges we identified, namely varying back-off intervals.

Looking at the docs shows that we may have to dive into the source to find out:

Tweepy Streaming Doc Notes

After a quick look at streaming.py, it appears that it does have code for back-off intervals. However, the lack of clear documentation on how it handles the protocol leaves me wary.

2. Open Bugs and Maintainers’ Responses.

So we’ve found some good libraries which will either give us a good starting point for implementing the Streaming API or, as in the case of Tweepy, likely do all we need. But before we choose, we need to look at how the projects are maintained.

At this point, you should probably be leaning toward Tweepy (though the other libraries are probably great for other use cases, e.g., the REST API). Let’s look at our other metrics for Tweepy.

Looking at bugs and maintainers’ responses can be challenging. Ideally, you’d like to understand how the author approaches problems and see an eagerness to merge bug fixes. It is important to remember that authors often have full-time jobs and may not be able to dedicate the hours that full-time maintenance requires. Forking is also an option if the library meets your needs but hasn’t gotten any attention lately.
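One way to put a number on "merged in a reasonable amount of time" is to summarize a sample of pull requests. The sketch below assumes you have already collected opened and merged dates by hand or from the project's issue tracker; the dates shown are made-up sample data, and `merge_stats` is a hypothetical helper.

```python
from datetime import datetime
from statistics import median

# Hypothetical (opened, merged) timestamps; merged is None for unmerged PRs.
pull_requests = [
    (datetime(2014, 1, 2), datetime(2014, 1, 5)),
    (datetime(2014, 1, 10), datetime(2014, 1, 11)),
    (datetime(2014, 2, 1), None),
    (datetime(2014, 2, 3), datetime(2014, 2, 13)),
]


def merge_stats(prs):
    """Return the share of PRs that were merged and the median days to merge."""
    if not prs:
        return {"merged_ratio": None, "median_days_to_merge": None}
    days = [(merged - opened).days for opened, merged in prs if merged is not None]
    return {
        "merged_ratio": len(days) / len(prs),
        "median_days_to_merge": median(days) if days else None,
    }


if __name__ == "__main__":
    print(merge_stats(pull_requests))
```

The median is used rather than the mean so that a single pull request left open for years does not dominate the picture.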

Tweepy

Let’s check out an SSL bug filed and see how it was handled.

The author first replied showing how the bug reporter could fix his issue on Google App Engine - this kind of support is honestly above and beyond for a project that appears to be a side project. Furthermore, the response in the image below shows how the author feels about stability:

Tweepy Bug Response

The author has clearly thought about long term stability for the project and the fact that many people are using his code base.

Taking a look at the commit log shows a constant stream of fixes:

Tweepy Commits

As far as project management goes, Tweepy clearly passes.

3. Unit Tests

Tweepy uses a continuous integration service that measures code coverage on each commit (a huge +1 for any open source project), so the code coverage is displayed right on the GitHub page.

Tweepy Coverage

Code coverage of 69% is pretty good. It would be nice to review the tests covering the Streaming API, but the maintainers have clearly thought about testing, even going to the trouble of displaying coverage right on the project page.

4. Documentation

A project can be coded elegantly, provide all the features you need, and have great code coverage, but if the documentation is bad the project can be completely unusable.

For Python, the standard is to use Sphinx to write documentation and it is common practice to publish the docs to Read the Docs.

Tweepy is doing exactly that:

Tweepy Docs

The only mark against the Tweepy docs is the mention that we should refer to the source code for full documentation on the Streaming API. We identified this earlier as being an important challenge.

The Final Choice

The main drawback to Tweepy is the lack of documentation for the Streaming API. I don’t want my daemon to be unpredictably cut off from accessing the API for not adhering to the back-off standard.

Having decided on Python, I would use Tweepy. The documentation issue is minor, and fixing it could be a way to contribute back to the open source community after reading over streaming.py.

Summary

How to choose open source libraries:

  1. Filter libraries by your specific challenges
  2. Look over pull requests and issues to check for maintainability and general project direction
  3. Ensure that the project has thorough unit tests
  4. Make sure the project has usable documentation

After reviewing each of the above, you should weigh the costs of any areas that are lacking vs. rolling your own solution. Often the costs of getting up to speed with an open source project outweigh the benefits if it doesn’t quite meet your needs, but that has to be judged on a case-by-case basis.
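If it helps to make the weighing explicit, a simple weighted score over the four criteria can serve as a starting point for discussion. Everything in this sketch is an assumption made for illustration: the weights, the 0 to 5 scores, and the candidate names `library_a` and `library_b` are hypothetical and do not describe any real project.

```python
# Assumed weights for the four criteria; they sum to 1.0.
WEIGHTS = {"meets_needs": 0.4, "maintenance": 0.3, "tests": 0.15, "docs": 0.15}

# Hypothetical candidates scored from 0 (poor) to 5 (excellent) per criterion.
candidates = {
    "library_a": {"meets_needs": 4, "maintenance": 5, "tests": 3, "docs": 2},
    "library_b": {"meets_needs": 2, "maintenance": 1, "tests": 2, "docs": 4},
}


def weighted_score(scores):
    """Combine per-criterion scores into a single weighted number."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Candidate names ordered from highest to lowest weighted score.
ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]),
                reverse=True)

if __name__ == "__main__":
    for name in ranked:
        print(name, round(weighted_score(candidates[name]), 2))
```

The number is only as good as the scores that go into it, so treat it as a way to surface disagreements about priorities rather than as a verdict.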

In this case, it is clear that for Python, Tweepy is a stable and mature library that seems easy to build upon in possible edge cases with the Streaming API.
