Speeding up a scikit-learn workflow using a high-performance Go proxy

../../_images/sunshine.jpg

Up until now I’ve been using vcrpy to cache my requests during the data mining phase of my scikit-learn work, but I was recently introduced to an ultra-high-performance Go caching proxy, and wanted to see if I could use it for further speed-ups. I was so impressed that I wrote a Python wrapper for it.
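For comparison, the vcrpy approach I’d been using looks roughly like this (a minimal sketch; the cassette path is just an example):

import vcr
import requests

# replays from the cassette when it exists, records the real responses otherwise
@vcr.use_cassette("fixtures/readthedocs.yaml")
def get_projects():
    r = requests.get("http://readthedocs.org/api/v1/project/?limit=50&format=json")
    return r.json()["objects"]

Swapping in hoverpy starts with installing it: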

pip install hoverpy --user --upgrade

Offlining readthedocs:

from hoverpy import capture
import requests
import time

@capture("readthedocs.db", recordMode="once")
def getLinks(limit):
    start = time.time()
    sites = requests.get(
        "http://readthedocs.org/api/v1/project/?limit=%d&offset=0&format=json" % int(limit))
    objects = sites.json()['objects']

    for link in ["http://readthedocs.org" + x['resource_uri'] for x in objects]:
        response = requests.get(link)
        print("url: %s, status code: %s" % (link, response.status_code))

    print("Time taken: %f" % (time.time() - start))

getLinks(50)

Output:

[...]
Time taken: 9.418862

Upon second invocation:

[...]
Time taken: 0.093463

That’s much better: 100.78x faster than hitting the real endpoint.

../../_images/http_diff.png

Not surprising, really. My issue with caching proxies, however, is that it’s the HTTPS handshaking that takes time, not fetching the data, and one of my many annoyances with vcrpy is that it won’t let me remap HTTPS requests to HTTP.

../../_images/infinity.jpg

Therefore I was very pleased to see remapping work perfectly in hoverpy (code provided below the next graph), with hoverpy wiping the floor with vcrpy at over 13x faster:

../../_images/https_get.png
import time
import hoverpy
import requests
import os

prot = "http" if os.path.isfile("hn.db") else "https"
hnApi = "%s://hacker-news.firebaseio.com/v0" % prot

with hoverpy.HoverPy(recordMode='once', dbpath='hn.db') as hp:
    print("started hoverpy in %s mode" % hp.mode())
    start = time.time()
    r = requests.get("%s/topstories.json" % (hnApi))
    for item in r.json():
        article = requests.get("%s/item/%i.json" % (hnApi, item)).json()
        print(article["title"])
    print("got articles in %f seconds" % (time.time() - start))

Once again, on the second run, Hoverfly steps in with a very significant speedup. I’m impressed with its performance.

Data mining HN

../../_images/hn.png

Before we start, please note that you can find the final script in the repository linked at the end of this post. You’ll also need the data; download instructions are given a little further down.

What I also really like about Hoverfly is how quickly it starts and loads its BoltDB database, and the fact that it’s configuration-free. Here’s a function you can use to offline titles from various HN sections:

def getHNData(verbose=False, limit=100, sub="showstories"):
    from hackernews import HackerNews
    from hackernews import settings
    import hoverpy, time, os
    dbpath = "data/hn.%s.db" % sub
    with hoverpy.HoverPy(recordMode="once", dbpath=dbpath) as hp:
        if hp.mode() != "capture":
            # when replaying, point the HN client at plain http so we skip the TLS handshakes
            settings.supported_api_versions["v0"] = "http://hacker-news.firebaseio.com/v0/"
        hn = HackerNews()
        titles = []
        print("GETTING HACKERNEWS %s DATA" % sub)
        subs = {"showstories": hn.show_stories,
                "askstories": hn.ask_stories,
                "jobstories": hn.job_stories,
                "topstories": hn.top_stories}
        start = time.time()
        for story_id in subs[sub](limit=limit):
            story = hn.get_item(story_id)
            if verbose:
                print(story.title.lower())
            titles.append(story.title.lower())
        print(
            "got %i hackernews titles in %f seconds" %
            (len(titles), time.time() - start))
        return titles
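Calling it is straightforward; for example (the argument values are just for illustration, and the data/ directory needs to exist first):

ask_titles = getHNData(verbose=True, limit=50, sub="askstories")
print("%i ask hn titles cached in data/hn.askstories.db" % len(ask_titles))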



Data mining Reddit

../../_images/reddit.png

While we’re at it, let’s put a function here for offlining subreddits. This one also includes comments:

def getRedditData(verbose=False, comments=True, limit=100, sub="all"):
    import hoverpy, praw, time
    dbpath = ("data/reddit.%s.db" % sub)
    with hoverpy.HoverPy(recordMode='once', dbpath=dbpath, httpsToHttp=True) as hp:
        titles = []
        print "GETTING REDDIT r/%s DATA" % sub
        r = praw.Reddit(user_agent="Karma breakdown 1.0 by /u/_Daimon_", http_proxy=hp.httpProxy(), https_proxy=hp.httpProxy(), validate_certs="off")
        if not hp.mode() == "capture":
            r.config.api_request_delay = 0
        subreddit = r.get_subreddit(sub)
        for submission in subreddit.get_hot(limit=limit):
            text = submission.title.lower()
            if comments:
                flat_comments = praw.helpers.flatten_tree(submission.comments)
                for comment in flat_comments:
                    if hasattr(comment, 'body'):
                        text += comment.body + " "
            if verbose:
                print(text)
            titles.append(text)
        return titles

Organising our datamines

Rather than sitting around hitting these endpoints, you may as well download the datasets and save yourself the time.

wget https://github.com/shyal/hoverpy-scikitlearn/raw/master/data.tar
tar xvf data.tar

And the code:

subs = [('hn', 'showstories'),
        ('hn', 'askstories'),
        ('hn', 'jobstories'),
        ('reddit', 'republican'),
        ('reddit', 'democrat'),
        ('reddit', 'linux'),
        ('reddit', 'python'),
        ('reddit', 'music'),
        ('reddit', 'movies'),
        ('reddit', 'literature'),
        ('reddit', 'books')]

def doMining():
    titles = []
    target = []
    getter = {'hn': getHNData, 'reddit': getRedditData}
    for i, (source, sub) in enumerate(subs):
        subTitles = getter[source](sub=sub)
        titles += subTitles
        target += [i] * len(subTitles)
    return (titles, target)

Calling doMining() caches everything, which takes a while the first time. If you’ve downloaded and extracted data.tar, though, it shouldn’t take more than a few seconds. That’s all our data mining done. I think this is a good time to remind ourselves that a big part of machine learning is, in fact, data sanitisation and mining.

GETTING HACKERNEWS showstories DATA
got 54 hackernews titles in 0.099983 seconds
GETTING HACKERNEWS askstories DATA
got 92 hackernews titles in 0.160661 seconds
GETTING HACKERNEWS jobstories DATA
got 12 hackernews titles in 0.024908 seconds
GETTING REDDIT r/republican DATA
GETTING REDDIT r/democrat DATA
GETTING REDDIT r/linux DATA
GETTING REDDIT r/python DATA
GETTING REDDIT r/music DATA
GETTING REDDIT r/movies DATA
GETTING REDDIT r/literature DATA
GETTING REDDIT r/books DATA

real    0m9.425s

Building an HN or Reddit classifier

../../_images/hat.jpg

OK, time to play. Let’s build a naive Bayes text classifier. You’ll be able to type in some text, and it’ll tell you which HN section or subreddit it thinks the text could have originated from.

For this part, you’ll need scikit-learn.

pip install numpy

pip install scikit-learn

Test sentences:

sentences = ["powershell and openssl compatability testing",
    "compiling source code on ubuntu",
    "wifi drivers keep crashing",
    "cron jobs",
    "training day was a great movie with a legendary director",
    "michael bay should remake lord of the rings, set in the future",
    "hilary clinton may win voters' hearts",
    "donald trump may donimate the presidency",
    "reading dead wood gives me far more pleasure than using kindles",
    "hiring a back end engineer",
    "guitar is louder than the piano although electronic is best",
    "drum solo and singer from the rolling stones",
    "hiring a back end engineer",
    "javascript loader",
    "dostoevsky's existentialis"]

Running the classifier:

def main():
    titles, target = doMining()
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    # build our count vectoriser
    #
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(titles)
    # build tfidf transformer
    #
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    # classifier
    #
    clf = MultinomialNB().fit(X_train_tfidf, target)
    print "*"*30+"\nTEST CLASSIFIER\n"+"*"*30
    # predict function
    #
    def predict(sentences):
        X_new_counts = count_vect.transform(sentences)
        X_new_tfidf = tfidf_transformer.transform(X_new_counts)
        predicted = clf.predict(X_new_tfidf)
        for doc, category in zip(sentences, predicted):
            print('%r => %s' % (doc, subs[category]))
    #
    predict(sentences)
    #
    while True:
        predict([raw_input("Enter title: ").strip()])
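To run the whole thing end to end, the script (assuming all the functions above live in a single file) just needs an entry point:

if __name__ == "__main__":
    main()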

In case you are not familiar with tokenizing, tf-idf, classification and so on, I’ve provided a link at the end of this tutorial that’ll demystify the block above.
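If you’d like a quick feel for what the vectoriser and transformer are doing before following that link, here’s a tiny self-contained sketch on a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["compiling the linux kernel",
          "the rolling stones on tour",
          "linux wifi drivers keep crashing"]

# CountVectorizer tokenises the corpus and builds a vocabulary of raw term counts
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus)
print(sorted(count_vect.vocabulary_))   # the learned vocabulary
print(counts.toarray())                 # one row of term counts per document

# TfidfTransformer rescales those counts, downweighting terms that appear in many
# documents (like "the" and "linux" here) relative to rarer, more telling ones
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))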


Wrapping things up

You can find hoverpy’s and Hoverfly’s extensive documentation here and here. This excellent and lightweight proxy was developed by the very smart people at SpectoLabs, so I strongly suggest you show them some love (I could not, however, find a donations link).

Repository for this post, with code: https://github.com/shyal/hoverpy-scikitlearn

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Edit:

As jackschultz rightly pointed out, this article doesn’t compute the precision of the classifier; you can check his article for that. I strongly recommend that readers interested in splitting the dataset into training and testing sets, and computing the classifier’s precision, try doing so themselves. You’ll find all the information required in the scikit-learn tutorial linked above.
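If you want to try that, here’s a minimal sketch along those lines, reusing doMining() from above (the Pipeline, the 80/20 split and the random_state are my own choices, not part of the original script):

from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

titles, target = doMining()
X_train, X_test, y_train, y_test = train_test_split(
    titles, target, test_size=0.2, random_state=42)

# the same vectorise -> tf-idf -> naive Bayes steps as main(), wrapped in a Pipeline
text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", MultinomialNB())])
text_clf.fit(X_train, y_train)

# per-class precision, recall and f1 on the held-out 20%
predicted = text_clf.predict(X_test)
print(metrics.classification_report(y_test, predicted))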

Finally, it’s also worth noting that my solution only collects the HN titles, not the HN comments, while it does include the Reddit comments. This leads to a strong bias towards Reddit and away from HN, despite using tf-idf. So another good exercise for the reader is to include the HN comments as well.
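Here’s a rough sketch of what that could look like against the raw HN Firebase API; the helper below is hypothetical (it bypasses the haxor wrapper used earlier and only walks top-level replies):

import requests

HN_API = "http://hacker-news.firebaseio.com/v0"

def getHNComments(story_id, max_kids=20):
    # each HN item lists its direct replies under "kids"; comment text lives in "text"
    story = requests.get("%s/item/%i.json" % (HN_API, story_id)).json()
    text = ""
    for kid in (story.get("kids") or [])[:max_kids]:
        comment = requests.get("%s/item/%i.json" % (HN_API, kid)).json()
        if comment and comment.get("text"):
            text += comment["text"] + " "
    return text

Note that the "text" field is raw HTML, so you’d also want to strip the tags before feeding it to the vectoriser.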

Taking it one step further

The premise of Hoverfly is, in fact, testing, CI and CD. The intent is to commit your requests database and test against it. This makes your code completely hermetic to external dependencies.
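For example, a (hypothetical) test could be pointed at the readthedocs.db captured at the start of this post and committed alongside the code; it would then run entirely offline in CI:

import requests
from hoverpy import capture

# replays the committed database, so CI never hits the real endpoint
@capture("readthedocs.db", recordMode="once")
def test_readthedocs_api():
    r = requests.get(
        "http://readthedocs.org/api/v1/project/?limit=50&offset=0&format=json")
    assert r.status_code == 200
    assert "objects" in r.json()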

https://travis-ci.org/shyal/hoverpy-scikitlearn.svg?branch=master
../../_images/unit-test.jpg

I find this extremely useful in the context of machine learning and scikit-learn.