Recently, I have spent some time writing a simple crawler in Python, so I decided to take some notes.
Dependencies
Writing a crawler requires the packages below (a sketch of the imports follows the list):
- urllib2 (sending requests)
- BeautifulSoup (finding tags)
- re (regular expressions)
- urlparse (joining URLs)
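A minimal set of imports for these notes might look like this (Python 2 here, since urllib2 and urlparse were merged into urllib in Python 3):

import urllib2                    # send HTTP requests
import re                         # regular expressions
import urlparse                   # join relative URLs onto the root URL
from bs4 import BeautifulSoup     # parse HTML and find tags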
Crawl
Construct a Request
Construct a request from a URL and add a User-Agent header to it to avoid being blocked by anti-scraping measures.
request = urllib2.Request(root_url)
request.add_header('User-Agent', 'Mozilla/5.0')  # pretend to be a normal browser
Read the response
Just open the request and then read the response.
response = urllib2.urlopen(request)
html = response.read()
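Putting the request and the read together, a small helper might look like the sketch below; the name fetch_html, the timeout value, and the error handling are my own additions, not from the original notes:

def fetch_html(url):
    # Build the request and disguise the crawler as a normal browser.
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    try:
        response = urllib2.urlopen(request, timeout=10)
        return response.read()
    except urllib2.URLError as e:
        # HTTPError is a subclass of URLError, so this covers both cases.
        print 'failed to fetch %s: %s' % (url, e)
        return None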
Parse
Parsing the HTML is the most important part of a crawler. I would like to introduce a method from BeautifulSoup first.
find_all(name, attrs, recursive, string, limit, **kwargs)
This helps filter out unwanted text and extract what we need.
1. name: the name of the tags that we want.
You can pass it a string, a regular expression, a list, a method, or simply True, which matches every tag.
2. **kwargs: keyword arguments
If a keyword argument is not one of the six parameters listed here, find_all treats it as a filter on a tag attribute. So this can be used to find tags that contain id="kin" and so on.
3. attrs: attributes whose names are reserved words
If the attribute name is a reserved word, you can search for it with attrs, passing it like attrs={"class": "sister"}. Of course, "class" can also be passed to **kwargs as class_.
4. recursive: whether to search all descendants of the tag rather than only its direct children (default is True)
5. string: text
It works like name: you can pass it a string, a regular expression, a list, a method, or simply True.
6. limit: the maximum number of results to return (see the sketch after this list for all of these in action)
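To make these parameters concrete, here is a sketch of how each one might be called; soup is assumed to be the BeautifulSoup object built from html (the constructor call appears just below), and the tag names and attribute values are only illustrative:

soup.find_all('a')                          # name: every <a> tag
soup.find_all(re.compile('^b'))             # name as a regex: <b>, <body>, ...
soup.find_all(True)                         # True matches all tags
soup.find_all(id='kin')                     # **kwargs: filter on the id attribute
soup.find_all(attrs={'class': 'sister'})    # attrs: reserved words like "class"
soup.find_all('a', class_='sister')         # the same filter via class_
soup.find_all(string=re.compile('note'))    # string: match the text inside tags
soup.find_all('a', limit=3)                 # limit: return at most three results
soup.find_all('a', recursive=False)         # recursive=False: direct children only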
So far, we have learned how to use find_all. The code below builds the soup object; with it, you can find the tags named "a" whose "href" and "id" attributes match a regular expression.
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
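The find_all call described above might look like the sketch below; the two regular expressions are placeholders of my own, and the loop also shows where urlparse from the dependency list comes in, joining relative hrefs onto root_url:

# The patterns here are placeholders, not the ones from the original notes.
links = soup.find_all('a',
                      href=re.compile(r'/view/'),
                      id=re.compile(r'^link'))

for link in links:
    # urlparse.urljoin turns a relative href into an absolute URL for the next crawl.
    new_url = urlparse.urljoin(root_url, link['href'])
    print new_url, link.get_text()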
That’s it for today.