Ok, I didn’t use Scrapy because I am yet to go through its documentation. I will explore Scrapy in an upcoming blog post.

Before getting to write code to scrape my website, I will cover the basics of the following modules:

  • webbrowser

  • requests

  • BeautifulSoup

The webbrowser module is a built-in module in Python. There is not a lot to explore in this module, except the open(url) function. All it does is open the default browser to a specified URL.

Example:

import webbrowser

urls = ["https://automatetheboringstuff.com/", "https://automatetheboringstuff.com/chapter11/"]

for link in urls:
   webbrowser.open(link)

The above code opens the links in the default browser. It should be noted that running the above code when no browser is open can cause an error message in Firefox. The message is something like “Firefox is already running”. This probably happens because the second call to webbrowser.open() fires before the first call has finished starting Firefox, so both calls try to launch a new browser process. How can this be avoided? A hacky solution is to call time.sleep(n) after the first call to webbrowser.open(), giving the browser time to start. Another way is to just keep a Firefox window open, so all calls to webbrowser.open() will open the links in new tabs. There are probably more elegant ways around this problem; let me know if you are aware of any.
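
Here is a minimal sketch of the time.sleep() workaround. The 5 second delay is an arbitrary guess and may need tuning for your machine:

import time
import webbrowser

urls = ["https://automatetheboringstuff.com/", "https://automatetheboringstuff.com/chapter11/"]

# launch the browser with the first URL, then give it time to start
webbrowser.open(urls[0])
time.sleep(5)

# the browser is running now, so these calls open new tabs
# instead of racing to launch new browser processes
for link in urls[1:]:
    webbrowser.open(link)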

The requests library lets us make HTTP requests without worrying about network errors, connection problems, and data compression. It can make GET, PUT, and DELETE requests, among others.
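
As a quick illustration of the other request types, here is a hedged sketch; httpbin.org is a public request-testing service, not part of this site:

import requests

# PUT some form data; httpbin echoes the request back
res = requests.put('https://httpbin.org/put', data={'key': 'value'})
print(res.status_code)

# DELETE request against the matching test endpoint
res = requests.delete('https://httpbin.org/delete')
print(res.status_code)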

In the following example, requests is used to get the homepage of this website and print some basic information about the response. The HTML response is saved to mysite.html.

import requests

res = requests.get('https://www.ploggingdev.com')
res.raise_for_status()

#print(res.text)
print("{} bytes".format(len(res.text)))
print("HTTP status code: {}".format(res.status_code))
print("response object type: {}".format(type(res)))

mysite = open("mysite.html", "wb")

print("Writing the response content to mysite.html")

for chunk in res.iter_content(10000):
    mysite.write(chunk)

mysite.close()

print("Done writing")

Output:
11681 bytes
HTTP status code: 200
response object type: <class 'requests.models.Response'>
Writing the response content to mysite.html
Done writing

Some points to note:

  • res.raise_for_status() raises a requests.exceptions.HTTPError if the server responds with an error status code, such as a 404 (see the sketch after this list)

  • When writing the response HTML to a file, the file is opened in wb (write binary) mode so that the bytes are written as-is, which maintains the Unicode encoding of the text.

  • res.iter_content(chunk_size) returns an iterator that yields the response content in chunks of up to the specified number of bytes. This is useful when working with large responses, since the entire response does not have to be held in memory at once.
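
A minimal sketch of raise_for_status() in action; the non-existent URL below is made up for the example:

import requests

res = requests.get('https://www.ploggingdev.com/no-such-page/')
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as exc:
    # prints something like: 404 Client Error: Not Found for url: ...
    print(exc)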

If you don’t already know about Unicode and character sets, read this post.

Once the HTML content of a webpage has been retrieved, we need a library to parse the HTML. This is where BeautifulSoup comes in.

A BeautifulSoup object is created by passing in HTML content. The HTML content can be in the form of res.text when using the requests module, or can come from a text file.
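
For example, here is a minimal sketch that parses the mysite.html file saved earlier (the filename is carried over from the requests example above):

import bs4

with open('mysite.html') as mysite:
    soup = bs4.BeautifulSoup(mysite, "html.parser")

print(type(soup))  # <class 'bs4.BeautifulSoup'>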

Briefly, BeautifulSoup lets us:

  • find elements in the HTML using the select() method. select() takes a CSS selector, so the selection can be made using HTML tags, ids, and classes. Additionally, attributes can also be specified.

  • retrieve the data associated with an attribute using the get() method. A short example of both methods follows this list.
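
Here is a short sketch of select() and get(), using a made-up HTML snippet:

import bs4

html = '<html><head><title>Example</title><meta name="keywords" content="python, scraping"></head></html>'
soup = bs4.BeautifulSoup(html, "html.parser")

# select() returns a list of elements matching the CSS selector
print(soup.select('title')[0].getText())  # Example

# get() retrieves the value of the named attribute
print(soup.select('meta[name="keywords"]')[0].get('content'))  # python, scraping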

Learn more about how to use BeautifulSoup by following the links at the end of this post.

Coming to web scraping this website, what am I going to scrape? The URL, title, and keywords associated with every article.

How will I accomplish this?

  • Hard code the URL of the first blog post

  • Create lists to hold the urls, titles, and keywords for every article

  • Inside a while True: loop, record the current blog URL

  • Fetch the content hosted at the current blog URL using requests

  • Use BeautifulSoup to parse the current page and extract its title and keywords

  • Store the title and keywords for the current post

  • Try to locate the link that leads to the next post and follow it

  • If the link to the next post is not found, it means that we have reached the latest blog post and it’s time to stop scraping

I added keywords to all blog posts recently, and this will be an opportunity to check whether the keywords have made it into every post. Why wouldn’t the keywords make it into the blogs if I added them? I use Hugo for this site. Sometimes if there is a typo while specifying the keywords (e.g. an extra comma), then the <meta name="keywords"> tag won’t be generated.

Code:

import requests
import bs4

print("Fetching all blog posts")

current_url = 'https://www.ploggingdev.com/2016/11/hello-world/'

urls = list()
titles = list()
keywords = list()

while True:
    urls.append(current_url)

    res = requests.get(current_url)
    res.raise_for_status()

    current_page = bs4.BeautifulSoup(res.text,"html.parser")
    
    current_title = current_page.select('title')[0].getText()
    titles.append(current_title)

    current_keywords = current_page.select('meta[name="keywords"]')[0].get('content')
    keywords.append(current_keywords)

    # url for the next blog post; if there is no "next" link, we are on the latest post
    try:
        current_url = current_page.select('ul[class="pager blog-pager"] > li[class="next"] > a')[0].get('href')
    except IndexError:
        break

# print all the blog posts with their urls, numbered from 1 to n

zipped = zip(range(1, len(urls)+1), titles, urls, keywords)

for blog_num, blog_title, blog_url, blog_keywords in zipped:
    print(blog_num)
    print(blog_title)
    print(blog_url)
    print(blog_keywords)
    print()

Output:

Fetching all blog posts
1
Hello World
https://www.ploggingdev.com/2016/11/hello-world/
plogging dev, hello world

2
Beginning Python 3
https://www.ploggingdev.com/2016/11/beginning-python-3/
python 3, Beginning python 3

3
Data types in Python 3
https://www.ploggingdev.com/2016/11/data-types-in-python-3/
python 3, beginning python 3, data types in python 3, datatypes in python 3, boolean in python 3, ints in python 3, floats in python 3

4
Strings in Python 3
https://www.ploggingdev.com/2016/11/strings-in-python-3/
python 3, data types in python 3, datatypes in python 3, strings in python 3

I won’t include the complete output here, but the program successfully scraped all the blog posts. You can find the output here.

Code for today’s plog:

References: