The first step is to gather data about 100,000 repositories using the Github api. I used scrapy for this.

A high level overview of how I did this:

  1. Start from the id of my scrape_github repo: https://api.github.com/repositories?since=76761293&access_token=MY_TOKEN

  2. Save only the id, name and languages_url for each repo. The languages_url is the api endpoint which contains the programming language statistics of the current repo.

  3. Extract the link to the next page from the Link header and follow it, repeating the above steps (a rough spider sketch follows this list).
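
Here is a minimal sketch of what such a spider might look like. This is a simplified illustration rather than the exact code from the repo; the spider name is arbitrary and MY_TOKEN is just a placeholder for a personal access token.

```python
import json

import scrapy


class RepoSpider(scrapy.Spider):
    name = "repos"
    # Start from the id of the scrape_github repo; MY_TOKEN is a placeholder.
    start_urls = [
        "https://api.github.com/repositories?since=76761293&access_token=MY_TOKEN"
    ]

    def parse(self, response):
        # Each response is a JSON array of up to 100 repositories.
        for repo in json.loads(response.text):
            # Save only the id, name and languages_url for each repo.
            yield {
                "id": repo["id"],
                "name": repo["name"],
                "languages_url": repo["languages_url"],
            }

        # The Link header points to the next page, e.g.
        # <https://api.github.com/repositories?since=...>; rel="next"
        link_header = response.headers.get("Link", b"").decode()
        for part in link_header.split(","):
            if 'rel="next"' in part:
                next_url = part.split(";")[0].strip().strip("<>")
                yield scrapy.Request(next_url, callback=self.parse)
```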

Each api call returns a list of 100 repositories, so to retrieve data about 100,000 repositories, 1000 api calls are required.

All the output is saved to a file called all_repos.jsonl which came to around 13MB.

The next step is to follow the languages_url api endpoint for each repository and save the data.

A high level overview of how I did this:

  1. Read a line from all_repos.jsonl

  2. Retrieve data from the languages_url endpoint

  3. If an exception occurred, output an empty json object to lang_data.jsonl

  4. Otherwise save the response to lang_data.jsonl

  5. Check headers to see if api limit has been reached

  6. If api limit is reached, sleep until the api limit is reset

  7. Otherwise go to step 1 and repeat until all lines have been read (a rough sketch of this loop follows).
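
A simplified sketch of that loop, assuming the requests library and the same MY_TOKEN placeholder; the actual script is in the repo linked at the end and may differ in the details.

```python
import json

import requests

with open("all_repos.jsonl") as repos, open("lang_data.jsonl", "w") as out:
    for line in repos:
        repo = json.loads(line)
        try:
            # Step 2: retrieve the language statistics for this repository.
            response = requests.get(
                repo["languages_url"], params={"access_token": "MY_TOKEN"}
            )
            response.raise_for_status()
            # Step 4: save the response (empty repos return an empty object).
            out.write(json.dumps(response.json()) + "\n")
        except requests.exceptions.RequestException:
            # Step 3: blocked (403) or missing (404) repos get an empty JSON object.
            out.write("{}\n")
        # Steps 5 and 6, the rate limit check, are sketched further below.
```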

There were a few HTTPError exceptions (HTTP 403 and 404 status codes) since Github had blocked some repositories for violating their Terms of Service: around 3 such exceptions in the first 5000 repositories. There were also a lot of empty repositories.

The api limit for Github is 5000 calls per hour. The headers include X-RateLimit-Remaining, which specifies how many api calls are remaining in the current hour. The X-RateLimit-Reset header specifies when the rate limit will be reset, represented as seconds since the Unix epoch. These headers are used to check whether the api limit has been reached and, if so, how long to sleep for.
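
For illustration, that check could be wrapped in a small helper like this (sleep_if_rate_limited is a hypothetical name, not taken from the actual script):

```python
import time

import requests


def sleep_if_rate_limited(response: requests.Response) -> None:
    """Sleep until the rate limit resets if no api calls remain this hour."""
    if int(response.headers.get("X-RateLimit-Remaining", 1)) > 0:
        return
    reset_at = int(response.headers["X-RateLimit-Reset"])  # seconds since the Unix epoch
    time.sleep(max(reset_at - time.time(), 0) + 1)  # wake up just after the reset
```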

The total number of api calls made in this step is 100,000, which took a little over 20 hours to complete (at 5000 calls per hour, 20 hours is the minimum possible). I ran this on a VPS.

Interestingly, it takes only around 5 minutes to make the 5000 api calls on the VPS, so the script sleeps for the remaining 55+ minutes of every hour. I took a screenshot of the bandwidth usage of the VPS the script was running on, and it was nice to see a spike every hour (the script calling the Github api) and then zero usage (the script sleeping) until the next spike. Here is the screenshot.

Bandwidth usage over 24 hours

Once all the relevant data was retrieved, the next step was to plot some graphs. Note that a single repository can include code using multiple programming languages.

I was interested in the following data:

Size of code vs programming language:

Size of code vs programming language

Repos appeared in vs programming language:

Repos appeared in vs programming language

Megabytes/repo vs programming language:

Megabytes/repo vs programming language
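
The aggregation behind plots like these can be sketched roughly as follows, assuming matplotlib; the actual plotting code is in the linked repo and may differ.

```python
import json
from collections import Counter

import matplotlib.pyplot as plt

bytes_per_lang = Counter()   # total bytes of code per language
repos_per_lang = Counter()   # number of repos each language appears in

with open("lang_data.jsonl") as f:
    for line in f:
        for lang, num_bytes in json.loads(line).items():
            bytes_per_lang[lang] += num_bytes
            repos_per_lang[lang] += 1

top = [lang for lang, _ in bytes_per_lang.most_common(20)]

# Size of code vs programming language, in megabytes.
plt.bar(top, [bytes_per_lang[lang] / 1e6 for lang in top])
plt.xticks(rotation=90)
plt.ylabel("Megabytes")
plt.tight_layout()
plt.show()

# The other two plots use repos_per_lang[lang] and
# bytes_per_lang[lang] / repos_per_lang[lang] / 1e6 on the y-axis.
```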

Take this data with a pinch of salt, as it only represents the repositories created over approximately a 2 day period. Initially I planned to consider all repos created in 2016, but the sheer scale of Github made me rethink my plans. Extrapolating from the roughly 50,000 repos created per day over these 2 days to the entire year, the number comes to around 18 million repos created in 2016. Besides, the point of the project was to learn a little about scrapy.

Here is the code.

The installation instructions are in the readme file.

The repo also includes the data I gathered from the Github api: