concurrent.futures in Python 3

The concurrent.futures module provides a common high-level interface for asynchronously executing callables using pools of threads or processes.

concurrent.futures.Executor is an abstract class for executing function calls asynchronously. Its important methods are submit(fn, *args, **kwargs), which schedules fn to be called with the given arguments and returns a Future, and map(fn, *iterables), which calls fn asynchronously once for each item taken from the iterables. Executor should not be used directly; use one of its subclasses, ThreadPoolExecutor or ProcessPoolExecutor.
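As a minimal sketch of these two methods, assuming a made-up square function as the task and ThreadPoolExecutor as the concrete subclass:

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    # Hypothetical task, used only for illustration
    return n * n


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules a single call and returns a Future right away
    future = executor.submit(square, 7)
    single_result = future.result()  # blocks until the call has finished

    # map() schedules one call per item and yields results in input order
    mapped_results = list(executor.map(square, [1, 2, 3, 4]))
```

Here single_result holds the value returned by square(7), and mapped_results holds one value per input item.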

Let’s jump into an example. The purpose of the following program is to find the sum of all prime numbers up to a given number. Two functions demonstrate how to use a pool of threads and a pool of processes: sum_primes_thread(nums) uses threads and sum_primes_process(nums) uses processes. Notice that the only difference between the two functions is that one uses ThreadPoolExecutor while the other uses ProcessPoolExecutor.
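The full program is in the linked post. The sketch below is only a guess at its shape: it assumes nums is a list of upper limits, and the helpers is_prime and sum_primes are made-up names for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def is_prime(n):
    # Trial division; good enough for a demonstration
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def sum_primes(limit):
    # Sum of all primes less than or equal to limit (hypothetical helper)
    return sum(n for n in range(2, limit + 1) if is_prime(n))


def sum_primes_thread(nums):
    # One task per limit, run on a pool of threads
    with ThreadPoolExecutor() as executor:
        return list(executor.map(sum_primes, nums))


def sum_primes_process(nums):
    # Same code, but the tasks run on a pool of worker processes
    with ProcessPoolExecutor() as executor:
        return list(executor.map(sum_primes, nums))
```

When running this as a script, call sum_primes_process from under an if __name__ == "__main__": guard, since worker processes may re-import the main module on some platforms.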

[Read More]

Multiprocessing and multithreading in Python 3

To begin with, let us clear up some terminology:

  • Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. An example is multitasking on a single-core machine.

  • Parallelism is when two or more tasks are executed simultaneously.

  • A thread is a sequence of instructions within a process. It can be thought of as a lightweight process. Threads share the same memory space.

  • A process is an instance of a program running on a computer, and it can contain one or more threads. Each process has its own independent memory space.

The threading module is used for working with threads in Python.
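To illustrate the point that threads share the same memory space, here is a small sketch using the threading module. The record function and the results list are made up for this example.

```python
import threading

results = []             # a single list object, visible to every thread
lock = threading.Lock()  # guards the shared list


def record(worker_id):
    # Every thread appends to the same list because memory is shared
    with lock:
        results.append(worker_id)


threads = [threading.Thread(target=record, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all threads to finish
```

After the joins, results contains all four worker ids, in whatever order the threads happened to run.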

The CPython implementation has a Global Interpreter Lock (GIL), which allows only one thread to execute Python code in the interpreter at a time. This means that threads cannot be used for parallel execution of Python code. While parallel CPU computation is not possible, parallel IO operations are possible using threads, because a thread releases the GIL while it waits on a blocking IO operation. To learn more about the GIL refer here.
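A rough way to see this effect, assuming time.sleep as a stand-in for a blocking IO call (it also releases the GIL while waiting); fake_io and run_in_threads are made-up names:

```python
import threading
import time


def fake_io(delay):
    # Stand-in for a blocking IO call; the GIL is released while sleeping
    time.sleep(delay)


def run_in_threads(count, delay):
    # Start `count` threads that each wait `delay` seconds, return wall time
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_in_threads(count=4, delay=0.2)
```

Because the waits overlap, elapsed should be close to a single delay rather than the 0.8 seconds that four sequential calls would take.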

[Read More]

Analyzing programming language statistics of 100,000 GitHub repositories

The first step is to gather data about 100,000 repositories using the GitHub API. I used Scrapy for this.

A high-level overview of how I did this:

  1. Start from the id of my scrape_github repo

  2. Save only the id, name and languages_url for each repo. The languages_url is the API endpoint which contains the programming language statistics of the current repo.

  3. Extract the link to the next page from the Link header and follow it, repeating the above steps.

Each API call returns a list of 100 repositories, so retrieving data about 100,000 repositories requires 1,000 API calls.

All the output is saved to a file called all_repos.jsonl, which came to around 13MB.

The next step is to follow the languages_url API endpoint for each repository and save the data.
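The per-page logic of steps 2 and 3 can be sketched without Scrapy or any network access. This is not the spider from the post: the function names are made up, and the Link header layout is assumed to be the usual `<url>; rel="next"` form.

```python
import json
import re


def slim_repo(repo):
    # Step 2: keep only the three fields of interest
    return {key: repo[key] for key in ("id", "name", "languages_url")}


def next_page_url(link_header):
    # Step 3: find the URL tagged rel="next" in a Link header, if any
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header):
        url, rel = match.groups()
        if rel == "next":
            return url
    return None


def to_jsonl(repos):
    # One JSON object per line, the layout of a .jsonl file
    return "".join(json.dumps(slim_repo(repo)) + "\n" for repo in repos)
```

A crawler would call slim_repo on every item of a response, append the to_jsonl output to the file, and request next_page_url(...) until it returns None.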

[Read More]

Scraping my website using requests and BeautifulSoup

OK, I didn’t use Scrapy because I have yet to go through its documentation. I will explore Scrapy in an upcoming blog post.

Before getting to write code to scrape my website, I will cover the basics of the following modules:

  • webbrowser

  • requests

  • BeautifulSoup

The webbrowser module is a built-in module in Python. There is not a lot to explore in this module, except the open(url) function. All it does is open the default browser to a specified URL.


import webbrowser

urls = ["", ""]

for link in urls:
    # Open each URL in the default browser
    webbrowser.open(link)

[Read More]

Decorators in Python 3

A decorator is a piece of Python syntax that lets us conveniently alter functions and methods. In simpler words, a decorator takes in a function, adds some functionality, and returns the result.


def my_decorator(func):
    def inner():
        print("Decoration before function call")
        func()  # call the wrapped function
        print("Decoration after function call")

    return inner

@my_decorator
def simple_print():
    print("Hello from simple_print")

simple_print()


Decoration before function call
Hello from simple_print
Decoration after function call
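The @ syntax is shorthand for passing a function to the decorator and rebinding the name to the result. A small sketch of the manual form, using a made-up shout decorator and greet function; functools.wraps is an optional extra here that preserves the wrapped function's name:

```python
import functools


def shout(func):
    # Hypothetical decorator: upper-cases the wrapped function's return value
    @functools.wraps(func)
    def inner(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return inner


def greet(name):
    return f"Hello, {name}"


# Manual application; writing @shout above greet would do the same thing
greet_loud = shout(greet)
```

Calling greet_loud("World") runs greet through inner, so the caller receives the upper-cased greeting.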

[Read More]