I've recently been looking into web scraping, and found that a simple scraper is easy to build: just use BeautifulSoup to parse the pages. The key is to understand the page structure, work out the patterns in the URLs, and set up a loop that generates the URL of the next page.
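To make that pattern concrete, here is a minimal sketch of the fetch-parse-follow loop. The URL and the "next" selector are placeholders for illustration, not any real site's markup:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://example.com/listings?page=1'  # hypothetical start URL
while url is not None:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'lxml')
    # ... extract whatever you need from soup here ...
    next_link = soup.find('a', {'class': 'next'})  # hypothetical "next page" link
    url = next_link['href'] if next_link else None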

A simple application is an Indeed scraper. Indeed is a major North American job site with very comprehensive postings. This script analyzes the job titles and hiring companies returned for a given search. For example, if you want to know which kinds of jobs need machine learning the most, and which companies have such openings, you can use this scraper.

The input is a query and a location: the query can be simple (e.g. machine learning) or complex (including some keywords and excluding others), and the location can be something like Toronto, ON. The script parses each page of search results and tries to extract the URL of the next page, until the last page is reached. From each results page it collects a list of job titles and a list of companies; Indeed mixes sponsored items into the results, and those are filtered out. The search URL itself is easy to assemble, as sketched below.
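For illustration, here is one way to build such a search URL with urllib.parse so the query and location are escaped correctly. The q and l parameter names come from the full script below; the helper itself is just a sketch:

from urllib.parse import urlencode

def build_search_url(query, city, province):
    # q is the search terms, l the location, matching the scraper below
    params = urlencode({'q': query, 'l': city + ', ' + province})
    return 'https://www.indeed.ca/jobs?' + params

# build_search_url('machine learning', 'Toronto', 'ON')
# -> 'https://www.indeed.ca/jobs?q=machine+learning&l=Toronto%2C+ON'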

For example, with the query machine learning and the location Toronto, the search returns around 500 job postings. The output is the job title and company for each of those postings, from which you can analyze frequencies: how many times data scientist appears in the title list, how often banks show up among the companies, and so on.
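The frequency analysis itself is just collections.Counter over the two lists; a minimal sketch with made-up sample data:

import collections

titles = ['Data Scientist', 'Machine Learning Engineer', 'Data Scientist']  # sample data
stats = collections.Counter(titles)
print(stats.most_common(2))  # [('Data Scientist', 2), ('Machine Learning Engineer', 1)]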

from bs4 import BeautifulSoup
import urllib.request
import re
import collections
from time import sleep

# get lists of job titles and companies from a single results page
def get_title_company(url):
    try:
        html = urllib.request.urlopen(url).read()
    except Exception:
        print('invalid url!')
        return None
    soup = BeautifulSoup(html, "lxml")
    titles = soup.find_all('a', {'data-tn-element': 'jobTitle'})  # find all job title tags
    companies = soup.find_all('span', {'class': 'company'})  # find all company tags
    company_list, title_list = [], []  # store title and company strings
    # skip sponsored jobs: organic titles carry the 'turnstileLink' class
    for title in titles:
        if title['class'] == ['turnstileLink']:
            title_list.append(title.get_text())
    # skip sponsored companies: their parent element has the 'sjcl' class
    for company in companies:
        if company.parent['class'] != ['sjcl']:
            company_list.append(company.get_text("|", strip=True))
    return {'titles': title_list, 'companies': company_list, 'soup': soup}

# helper function to merge one page's results and locate the "Next" link
def update_result(final, temp):
    if temp is None:  # page failed to load, stop paging
        return None
    final['titles'].extend(temp['titles'])
    final['companies'].extend(temp['companies'])
    next_page = temp['soup'].find("span", {"class": "np"}, text=re.compile("Next"))
    return next_page

# core function: walk all result pages for a query and location
def indeed_scraping(query, city, province):
    base_url = 'https://www.indeed.ca'
    curr_url = base_url + '/jobs?q=' + query + '&l=' + city + '%2C+' + province
    result = {'titles': [], 'companies': []}
    temp_result = get_title_company(curr_url)
    next_page = update_result(result, temp_result)
    while next_page is not None:
        sleep(1)  # be polite: pause between requests
        curr_url = base_url + next_page.parent.parent.get('href')
        temp_result = get_title_company(curr_url)
        next_page = update_result(result, temp_result)
    return result

a = indeed_scraping('python', 'Calgary', 'AB')
titles_stat = collections.Counter(a['titles'])
companies_stat = collections.Counter(a['companies'])
print(titles_stat)
print('\n')
print(companies_stat)
print('\n')

# debug part
# print(len(a['titles']))
# print('\n')
# for b in a['titles']:
#     print(b)
# print('\n')
# print(len(a['companies']))
# print('\n')
# for b in a['companies']:
#     print(b)