December 25, 2016

Analyzing programming language statistics of 100,000 GitHub repositories

The first step is to gather data about 100,000 repositories using the GitHub API. I used Scrapy for this.

A high level overview of how I did this:

  1. Start from the id of my scrape_github repo

  2. Save only the id, name and languages_url for each repo. The languages_url is the API endpoint that returns the programming language statistics of the current repo.

  3. Extract the link to the next page from the Link header and follow it, repeating the above steps.

Each API call returns a list of 100 repositories, so retrieving data about 100,000 repositories requires 1,000 API calls.
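As a rough sketch of step 3 and of the arithmetic above, the snippet below pulls the rel="next" URL out of a Link header and computes the number of calls. The function name next_page_url and the sample header are made up for illustration; this is not the Scrapy spider used in the post.

```python
import math
import re

# Matches one entry of a Link header: <url>; rel="name"
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    for url, rel in LINK_ENTRY.findall(link_header):
        if rel == "next":
            return url
    return None


# Illustrative header value, not captured from a real response.
sample_header = (
    '<https://api.github.com/repositories?since=364>; rel="next", '
    '<https://api.github.com/repositories{?since}>; rel="first"'
)
print(next_page_url(sample_header))

# 100 repositories per call, 100,000 repositories wanted.
calls_needed = math.ceil(100_000 / 100)
print(calls_needed)  # 1000
```

When the header has no rel="next" entry the function returns None, which is a natural signal to stop paginating.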

All the output is saved to a file called all_repos.jsonl, which came to around 13 MB.

The next step is to follow the languages_url API endpoint for each repository and save the data.
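A minimal sketch of how the saved records and the language data might be combined, assuming each line of all_repos.jsonl is a JSON object with the id, name and languages_url fields listed above. GitHub's languages endpoint returns a JSON object mapping each language to a byte count; the helper names and the sample values here are hypothetical.

```python
import json
from collections import Counter


def parse_repo_line(line):
    """Parse one line of the .jsonl file into (id, name, languages_url)."""
    record = json.loads(line)
    return record["id"], record["name"], record["languages_url"]


def combine_language_stats(per_repo_stats):
    """Sum the per-repository byte counts into one total per language."""
    totals = Counter()
    for stats in per_repo_stats:
        totals.update(stats)  # adds the byte counts language by language
    return totals


# Made-up sample data standing in for real API responses.
sample_line = '{"id": 1, "name": "demo", "languages_url": "https://api.github.com/repos/octocat/demo/languages"}'
sample_stats = [{"Python": 100, "C": 50}, {"Python": 25}]

print(parse_repo_line(sample_line))
print(combine_language_stats(sample_stats).most_common())
```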


December 8, 2016

Scraping my website using requests and BeautifulSoup

Ok, I didn’t use Scrapy because I have yet to go through its documentation. I will explore Scrapy in an upcoming blog post.

Before getting to write code to scrape my website, I will cover the basics of the following modules:

  • webbrowser

  • requests

  • BeautifulSoup

The webbrowser module is a built-in module in Python. There is not a lot to explore in this module, except the open(url) function. All it does is open the default browser at the specified URL.


import webbrowser

urls = ["", ""]

for link in urls:
    # open each URL in the default browser
    webbrowser.open(link)


December 4, 2016

Decorators in Python 3

A Python decorator is a piece of Python syntax that lets us conveniently alter functions and methods. Put more simply, a decorator takes in a function, adds some functionality and returns the enhanced function.


def my_decorator(func):
    def inner():
        print("Decoration before function call")
        func()  # call the wrapped function between the two messages
        print("Decoration after function call")

    return inner

@my_decorator
def simple_print():
    print("Hello from simple_print")

simple_print()


Decoration before function call
Hello from simple_print
Decoration after function call
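To make "takes in a function and returns it" more concrete, here is a small sketch showing that the @ syntax is shorthand for passing the function to the decorator by hand. The names announce, add and multiply are made up for this example, and functools.wraps is an extra that keeps the wrapped function's name intact.

```python
import functools


def announce(func):
    @functools.wraps(func)  # preserve func's name and docstring
    def inner(*args, **kwargs):
        print("Decoration before function call")
        result = func(*args, **kwargs)
        print("Decoration after function call")
        return result

    return inner


@announce
def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


# Equivalent to writing @announce above the definition of multiply.
multiply = announce(multiply)

print(add(2, 3))
print(multiply(2, 3))
```

Accepting *args and **kwargs lets the same decorator wrap functions with any signature.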


December 3, 2016

Performance measurement in Python 3

Performance measurement is the process of collecting and understanding information regarding the performance of some code.

In this post I will cover the basics of the following modules in Python:

  • timeit

  • cProfile

  • pstats

  • memory_profiler

  • line_profiler
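As a small taste of the first module in the list, here is a minimal sketch that times a function with timeit from the standard library. The function build_squares and the repetition count are arbitrary choices for illustration, and the measured time will differ from machine to machine.

```python
import timeit


def build_squares():
    """Some work to measure: build a list of 1,000 squares."""
    return [x**2 for x in range(1000)]


runs = 200

# timeit.timeit calls the function `runs` times and returns the total seconds.
elapsed = timeit.timeit(build_squares, number=runs)
per_call = elapsed / runs

print(f"total: {elapsed:.6f} s, per call: {per_call:.9f} s")
```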


December 2, 2016

List comprehensions, iterators, generators and generator expressions in Python 3

A list comprehension is a concise way to create lists that would normally require for loops to build.


list1 = [x**2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

List comprehension to create a list of tuples:

list2 = [(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
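For comparison, this sketch builds the same list of tuples with explicit for loops, which is the longer form a list comprehension replaces. The variable name list2_loops is made up for the example.

```python
# The comprehension from above.
list2 = [(x, y) for x in [1, 2, 3] for y in [3, 1, 4] if x != y]

# The same result with nested for loops, in the same left-to-right order.
list2_loops = []
for x in [1, 2, 3]:
    for y in [3, 1, 4]:
        if x != y:
            list2_loops.append((x, y))

print(list2_loops == list2)  # True
```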



© Plogging Dev