Welkin's Secret Garden

A Simple Crawler with Python

Recently, I have spent some time writing a simple crawler with Python, so I decided to take some notes.

Dependencies

Writing a crawler requires the following packages:

  1. urllib2 (send requests)
  2. BeautifulSoup (find tags)
  3. re (regular expressions)
  4. urlparse (join URLs)
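
As a quick sketch of how these come together (assuming Python 2, where urllib2 and urlparse are the module names; in Python 3 the same pieces live under urllib.request and urllib.parse):

import urllib2                   # construct and send HTTP requests
import re                        # regular expressions for filtering tags
import urlparse                  # join relative URLs against a base URL
from bs4 import BeautifulSoup    # parse HTML and find tags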

Crawl

Construct a Request

Construct a request from the URL and add a User-Agent header to it to avoid anti-scraping measures.

request = urllib2.Request(root_url)
request.add_header("User-Agent",
                   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) "
                   "AppleWebKit/602.2.14 (KHTML, like Gecko) "
                   "Version/10.0.1 Safari/602.2.14")

Read the Response

Just open the request and then read the response.

response = urllib2.urlopen(request)
html = response.read()

Parse

Parsing the HTML is the most important part of a crawler. I would like to introduce a method provided by BeautifulSoup first.

find_all( name , attrs , recursive , string , limit , **kwargs )

This helps filter out unwanted content and extract what we need. Each parameter is described below, with a short sketch after the list.

1. name: the name of the tags that we want.

You can pass it a string, a regular expression, a list, a function, or simply True, which matches every tag.

2. **kwargs: keyword arguments

If a keyword argument does not match one of the parameters listed here, find_all treats it as a filter on a tag attribute of that name. So this can be used to find tags that contain id="kin" and so on.

3. attrs: reserved keywords

If a keyword is a reserved word in Python (such as class), you can search for it with attrs, passing it like attrs={"class": "sister"}. Of course, class can also be passed through **kwargs as class_.

4. recursive: whether to search all descendants of the tag (default is True); when False, only direct children are considered.

5. string: text

It accepts the same kinds of filters as name: a string, a regular expression, a list, a function, or True.

6. limit: the maximum number of results
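
To make the list concrete, here is a small sketch of what each parameter looks like in a call (soup is the object built in the snippet below; filter values such as "kin", "sister", and "news" are only illustrative):

# name: a tag name, a regular expression, a list, or True for every tag
soup.find_all("a")
soup.find_all(re.compile(r"^b"))

# **kwargs: unrecognized keywords become attribute filters
soup.find_all(id="kin")

# attrs: reserved words such as "class" go through attrs (or use class_)
soup.find_all(attrs={"class": "sister"})
soup.find_all("a", class_="sister")

# recursive: set to False to search only direct children
soup.find_all("a", recursive=False)

# string and limit: match tag text and cap the number of results
soup.find_all("a", string=re.compile("news"), limit=3)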

Now that we know how to use find_all, the code below finds the tags named "a" whose href and id attributes match the given regular expressions.

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
links = soup.find_all("a", href=re.compile(r"http://mp.weixin.qq.com/\S+"),
                      id=re.compile(r"sogou_vr_11002601_title_\d*"))
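
In general, extracted href values may be relative; urlparse.urljoin (the reason urlparse appears in the dependency list) resolves them against the page URL before crawling further. A small sketch, with a loop and variable names of my own:

for link in links:
    href = link.get("href")                      # raw href attribute
    new_url = urlparse.urljoin(root_url, href)   # resolve against the base URL
    print link.get_text(), new_url               # anchor text and full URL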

That’s it for today.