August 29, 2017

Building a Disqus alternative Part 1 : Research

Update : I launched Hosted Comments!

I’ll start with a little backstory: I started this blog around 9 months ago and managed to build up traffic to a few hundred hits every day. It might not seem like much, but it was and still is a big deal to me. Readers used to leave comments with suggestions for improvements, questions, or just to say that they enjoyed reading a particular post. Comments were powered by Disqus and all was well.

One day I received an email notification from Disqus informing me that someone had left a comment on my blog. A pretty routine notification, so I opened the post, scrolled down to the comments section and noticed…six shady ads with images to accompany them. Without any warning, Disqus had enabled ads on my site. Until then I had never really bothered with what Disqus was doing in the background, but the ads incident made me curious. I inspected the requests that Disqus was making, and it turned out that displaying a comments section with 5 comments required 100+ HTTP requests, tracking data sent to 10+ external advertisers, and 2MB of data transfer! That was my breaking point, so I promptly removed Disqus from my blog and deleted my account as well.


May 22, 2017

A model for a privacy oriented ad network that profiles content, not users

A few days ago I came across a talk by Maciej Ceglowski titled Notes from an Emergency. If you have not watched the talk, stop reading this and go watch it. In the talk, he suggests that the government should regulate ads to target content instead of users. There was a discussion about the talk on Hacker News where Maciej said that he feels it’s a regulatory argument and not a business argument, in response to a comment that implied ads targeting content will be worse than ads that target users. That got me thinking: are ads that target content worse than ads that target users, looking at it purely from a business perspective? I don’t think so, and in fact I feel that ads targeting content will be better from a business perspective. In this post, I propose a model for a privacy oriented ad network that targets content instead of users. This is not a novel proposal and it’s likely that people have thought about this over the years.


January 12, 2017

concurrent.futures in Python 3

The concurrent.futures module provides a common high level interface for asynchronously executing callables using pools of threads or processes.

The concurrent.futures.Executor is an abstract class for executing function calls asynchronously. The important methods are submit(fn, *args, **kwargs), which schedules the given function to be called with the given arguments and returns a Future, and map(fn, *iterables), which calls the function asynchronously once per item, taking the arguments for each call from the corresponding items of the iterables. Executor should not be used directly, but through its subclasses ThreadPoolExecutor and ProcessPoolExecutor.
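As a small illustration of the two methods, here is a minimal sketch using ThreadPoolExecutor. The power function, the demo_submit_and_map helper and the worker count are made up for this example and are not part of the module.

```python
from concurrent.futures import ThreadPoolExecutor


def power(base, exponent):
    """Hypothetical worker function used only for this illustration."""
    return base ** exponent


def demo_submit_and_map():
    # The with-block shuts the pool down once all pending work is finished.
    with ThreadPoolExecutor(max_workers=2) as executor:
        # submit() schedules one call and returns a Future immediately.
        future = executor.submit(power, 2, 10)
        single = future.result()  # blocks until the call has completed

        # map() makes one call per position across the iterables:
        # power(1, 2), power(2, 2), power(3, 2). Results come back in order.
        mapped = list(executor.map(power, [1, 2, 3], [2, 2, 2]))
    return single, mapped
```

Calling demo_submit_and_map() returns the single result from submit() together with the ordered list of results from map().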

Let’s jump into an example. The purpose of the following program is to find the sum of all prime numbers until the given number. There are two functions to demonstrate how to use a pool of threads and how to use a pool of processes. sum_primes_thread(nums) uses threads and sum_primes_process(nums) uses processes. Notice that the only difference between the two functions is that one uses ThreadPoolExecutor while the other uses ProcessPoolExecutor.
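A minimal sketch of such a program might look like the following. Only the names sum_primes_thread and sum_primes_process come from the description above. The helpers is_prime and sum_primes, the worker count, and treating the given number as inclusive are assumptions made for illustration, not the original code.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def is_prime(n):
    """Trial division; good enough for a small demonstration."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def sum_primes(n):
    """Sum of all primes up to and including n (inclusive is an assumption)."""
    return sum(i for i in range(2, n + 1) if is_prime(i))


def sum_primes_thread(nums):
    # One sum_primes call per number, run on a pool of threads.
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(sum_primes, nums))


def sum_primes_process(nums):
    # Identical apart from the executor class: the work runs in separate processes.
    with ProcessPoolExecutor(max_workers=4) as executor:
        return list(executor.map(sum_primes, nums))
```

For example, sum_primes_thread([10, 20, 100]) returns [17, 77, 1060]. When running this as a script, it is safest to call sum_primes_process from under an `if __name__ == "__main__":` guard, since on some platforms the worker processes re-import the main module.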


January 9, 2017

Multiprocessing and multithreading in Python 3

To begin with, let us clear up some terminology:

  • Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. E.g. multitasking on a single-core machine.

  • Parallelism is when two or more tasks are executed simultaneously.

  • A thread is a sequence of instructions within a process. It can be thought of as a lightweight process. Threads share the same memory space.

  • A process is an instance of a program running on a computer, and it can contain one or more threads. Each process has its own independent memory space.

The threading module is used for working with threads in Python.

The CPython implementation has a Global Interpreter Lock (GIL) which allows only one thread to execute Python bytecode at a time. This means that threads cannot be used for parallel execution of Python code. While parallel CPU computation is not possible, parallel IO operations are possible using threads. This is because a thread releases the GIL while it waits on a blocking IO operation. To learn more about the GIL, refer here.
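To make the IO case concrete, here is a minimal sketch using the threading module. time.sleep stands in for a blocking IO call (it also releases the GIL while waiting). The names fake_download and run_io_tasks, the task count and the delay are all made up for this illustration.

```python
import threading
import time


def fake_download(task_id, delay, results):
    # time.sleep stands in for blocking IO; the GIL is released while waiting.
    time.sleep(delay)
    results[task_id] = f"task {task_id} done"


def run_io_tasks(num_tasks=4, delay=0.2):
    results = {}
    threads = [
        threading.Thread(target=fake_download, args=(i, delay, results))
        for i in range(num_tasks)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Because the waits overlap, the elapsed time should be close to a single delay rather than num_tasks times the delay, which is what running the same tasks one after another would cost.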


December 25, 2016

Analyzing programming language statistics of 100,000 Github repositories

The first step is to gather data about 100,000 repositories using the GitHub API. I used Scrapy for this.

A high level overview of how I did this:

  1. Start from the id of my scrape_github repo: https://api.github.com/repositories?since=76761293&access_token=MY_TOKEN

  2. Save only the id, name and languages_url for each repo. The languages_url is the API endpoint which contains the programming language statistics of the current repo.

  3. Extract the link to the next page from the Link header and follow it, repeating the above steps.
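The third step can be sketched with the standard library alone. This is not the Scrapy spider I used, just a standalone illustration of pulling the rel="next" URL out of a Link header value. The function name next_page_url and the sample header (including its since value) are made up for this example.

```python
import re

# Matches entries of the form: <URL>; rel="NAME"
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    for url, rel in _LINK_PATTERN.findall(link_header or ""):
        if rel == "next":
            return url
    return None


# Illustrative header value; the since id here is a placeholder.
SAMPLE_LINK_HEADER = (
    '<https://api.github.com/repositories?since=76761400>; rel="next", '
    '<https://api.github.com/repositories{?since}>; rel="first"'
)
```

Calling next_page_url(SAMPLE_LINK_HEADER) gives back the first URL, and a response without a next link yields None, which is the signal to stop paging.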

Each API call returns a list of 100 repositories, so retrieving data about 100,000 repositories requires 1,000 API calls.

All the output is saved to a file called all_repos.jsonl which came to around 13MB.

The next step is to follow the languages_url API endpoint for each repository and save the data.
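As a rough sketch of how that step could start, the snippet below reads JSON Lines records and yields the languages_url to request for each repository. It assumes each line is a JSON object with the id, name and languages_url keys saved earlier. The function name iter_language_urls and the sample records are placeholders, not real data from all_repos.jsonl.

```python
import json


def iter_language_urls(lines):
    """Yield (id, name, languages_url) for each non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        repo = json.loads(line)
        yield repo["id"], repo["name"], repo["languages_url"]


# Placeholder records in the assumed shape; not real scraped data.
SAMPLE_LINES = [
    '{"id": 1, "name": "example-a", "languages_url": "https://api.github.com/repos/someone/example-a/languages"}',
    "",
    '{"id": 2, "name": "example-b", "languages_url": "https://api.github.com/repos/someone/example-b/languages"}',
]
```

An open file object can be passed in place of SAMPLE_LINES, since iterating over a file yields one line at a time.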


     


© Plogging Dev - Powered by Hugo Theme by Kiss