May 22, 2017

A model for a privacy-oriented ad network that profiles content, not users

A few days ago I came across a talk by Maciej Ceglowski titled Notes from an Emergency. If you have not watched the talk, stop reading this and go watch it. In the talk, he suggests that the government should regulate ads to target content instead of users. In a discussion about the talk on Hacker News, a comment implied that ads targeting content would perform worse than ads targeting users, and Maciej responded that he sees it as a regulatory argument, not a business argument. That got me thinking: are ads that target content really worse than ads that target users, purely from a business perspective? I don’t think so; in fact, I feel that ads targeting content will be better from a business perspective. In this post, I propose a model for a privacy-oriented ad network that targets content instead of users. This is not a novel proposal, and it’s likely that people have thought about it over the years.

Read more

January 12, 2017

concurrent.futures in Python 3

The concurrent.futures module provides a common high-level interface for asynchronously executing callables using pools of threads or processes.

The concurrent.futures.Executor is a class that executes function calls asynchronously. The important methods are submit(function, *args), which schedules the given function to be called with the given arguments, and map(function, *iterables), which calls the function asynchronously once for each item of the iterables. Executor should not be used directly; it is used through its subclasses ThreadPoolExecutor and ProcessPoolExecutor.
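
A minimal sketch (not from the post) showing both methods with a ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules a single call and returns a Future
    future = executor.submit(square, 7)
    print(future.result())  # 49

    # map() calls the function once per item and yields results in order
    print(list(executor.map(square, [1, 2, 3, 4])))  # [1, 4, 9, 16]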

Let’s jump into an example. The purpose of the following program is to find the sum of all prime numbers up to a given number. There are two functions to demonstrate how to use a pool of threads and how to use a pool of processes: sum_primes_thread(nums) uses threads and sum_primes_process(nums) uses processes. Notice that the only difference between the two functions is that one uses ThreadPoolExecutor while the other uses ProcessPoolExecutor.
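
The full listing is behind the "Read more" link; a rough sketch of how such a program might be structured (only the two pool functions are named in the post, so the is_prime and sum_primes helpers and the numbers passed in are assumptions):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def sum_primes(n):
    # Sum of all primes up to and including n
    return sum(i for i in range(2, n + 1) if is_prime(i))

def sum_primes_thread(nums):
    with ThreadPoolExecutor(max_workers=4) as executor:
        for n, total in zip(nums, executor.map(sum_primes, nums)):
            print(n, total)

def sum_primes_process(nums):
    with ProcessPoolExecutor(max_workers=4) as executor:
        for n, total in zip(nums, executor.map(sum_primes, nums)):
            print(n, total)

if __name__ == '__main__':
    sum_primes_process([100000, 200000, 300000])

The process version can use multiple cores for this CPU-bound work, while the thread version is limited by the GIL.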

Read more

January 9, 2017

Multiprocessing and multithreading in Python 3

To begin with, let us clear up some terminology:

  • Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant; for example, multitasking on a single-core machine.

  • Parallelism is when two or more tasks are executed simultaneously.

  • A thread is a sequence of instructions within a process. It can be thought of as a lightweight process. Threads share the same memory space.

  • A process is an instance of a program running on a computer and can contain one or more threads. A process has its own independent memory space (the short sketch after this list illustrates the difference).
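
A minimal sketch (not from the post) of that last distinction, using the threading and multiprocessing modules:

import threading
import multiprocessing

counter = 0

def increment():
    global counter
    counter += 1

if __name__ == '__main__':
    # A thread shares the parent's memory, so its change is visible here.
    t = threading.Thread(target=increment)
    t.start()
    t.join()
    print(counter)  # 1

    # A process gets its own memory space, so the parent's counter is unchanged.
    p = multiprocessing.Process(target=increment)
    p.start()
    p.join()
    print(counter)  # still 1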

The threading module is used for working with threads in Python.

The CPython implementation has a Global Interpreter Lock (GIL) which allows only one thread to execute Python code in the interpreter at a time. This means that threads cannot be used for parallel execution of Python code. While parallel CPU computation is not possible, parallel IO operations are possible using threads, because a thread releases the GIL while performing IO. To learn more about the GIL, refer here.
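
As a quick illustration (not from the post), several downloads running in threads can overlap in time because each thread releases the GIL while it waits on the network; the URLs are placeholders:

import threading
import urllib.request

def fetch(url):
    # The GIL is released while this thread blocks on network IO
    with urllib.request.urlopen(url) as response:
        print(url, len(response.read()))

urls = ["https://www.python.org/", "https://docs.python.org/3/"]

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()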

Read more

December 25, 2016

Analyzing programming language statistics of 100,000 Github repositories

The first step is to gather data about 100,000 repositories using the Github api. I used scrapy for this.

A high-level overview of how I did this:

  1. Start from the id of my scrape_github repo https://api.github.com/repositories?since=76761293&access_token=MY_TOKEN

  2. Save only the id, name and languages_url for each repo. The languages_url is the api endpoint which contains the programming language statistics of the current repo.

  3. Extract the link to the next page from the Link header and follow it, repeating the above steps.

Each api call returns a list of 100 repositories, so to retrieve data about 100,000 repositories, 1000 api calls are required.
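
The spider itself is behind the "Read more" link; a rough sketch of the idea in scrapy (the class name and Link-header parsing are illustrative, and the token stays a placeholder) might look like:

import json
import scrapy

class ReposSpider(scrapy.Spider):
    name = 'github_repos'
    start_urls = ['https://api.github.com/repositories?since=76761293&access_token=MY_TOKEN']

    def parse(self, response):
        # Keep only the id, name and languages_url of each repo on this page
        for repo in json.loads(response.text):
            yield {
                'id': repo['id'],
                'name': repo['name'],
                'languages_url': repo['languages_url'],
            }

        # The Link header holds the URL of the next page; follow it
        link_header = response.headers.get('Link', b'').decode('utf-8')
        for part in link_header.split(','):
            if 'rel="next"' in part:
                next_url = part.split(';')[0].strip().strip('<>')
                yield scrapy.Request(next_url, callback=self.parse)

scrapy's feed exports can then write each scraped item as one line of JSON.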

All the output is saved to a file called all_repos.jsonl, which came to around 13 MB.

The next step is to follow the languages_url api endpoint for each repository and save the data.
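
A minimal sketch of that step (not the post's code; the output filename and token are placeholders), using requests:

import json
import requests

with open('all_repos.jsonl') as repos, open('repo_languages.jsonl', 'w') as out:
    for line in repos:
        repo = json.loads(line)
        # languages_url returns a mapping of language name to bytes of code
        response = requests.get(repo['languages_url'],
                                params={'access_token': 'MY_TOKEN'})
        out.write(json.dumps({'name': repo['name'],
                              'languages': response.json()}) + '\n')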

Read more

December 8, 2016

Scraping my website using requests and BeautifulSoup

Ok, I didn’t use Scrapy because I am yet to go through its documentation. I will explore Scrapy in an upcoming blog post.

Before writing the code to scrape my website, I will cover the basics of the following modules:

  • webbrowser

  • requests

  • BeautifulSoup

The webbrowser module is a built-in Python module. There is not a lot to explore in this module apart from the open(url) function. All it does is open the default browser to the specified URL.

Example:

import webbrowser

urls = ["https://automatetheboringstuff.com/", "https://automatetheboringstuff.com/chapter11/"]

for link in urls:
    webbrowser.open(link)
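
The requests and BeautifulSoup parts are covered in the full post; the basic pattern is fetching a page and parsing the HTML, roughly like this (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Print the text and target of every link on the page
for link in soup.find_all("a"):
    print(link.get_text(), link.get("href"))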

Read more



© Plogging Dev - Powered by Hugo, Theme by Kiss