I've recently been looking into web scraping, and found that a simple scraper is easy to build: just use BeautifulSoup to parse the pages. The key is to understand the page structure, work out the patterns in the URLs, and set up a loop that generates the URL of the next page.
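To make that pattern concrete, here is a minimal sketch of the fetch-parse-follow loop. The URL and the "next" selector are placeholders for illustration, not any real site's markup:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://example.com/listings?page=1'  # hypothetical start URL
while url is not None:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'lxml')
    # ... extract whatever you need from soup here ...
    next_link = soup.find('a', {'class': 'next'})  # hypothetical "next page" link
    url = next_link['href'] if next_link else None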

A simple application is an Indeed scraper. Indeed is a major North American job site with very comprehensive postings. This script analyzes the job titles and hiring companies returned for a given search. For example, if you want to know which kinds of jobs need machine learning the most, and which companies have such openings, you can use this scraper.

The input is a query and a location: the query can be simple (e.g. machine learning) or complex (including some keywords and excluding others), and the location can be something like Toronto, ON. The script parses each page of search results and tries to extract the URL of the next page, until the last page is reached. From each results page it collects a list of job titles and a list of companies; Indeed mixes sponsored items into the results, and those are filtered out. The search URL itself is easy to assemble, as sketched below.
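For illustration, here is one way to build such a search URL with urllib.parse so the query and location are escaped correctly. The q and l parameter names come from the full script below; the helper itself is just a sketch:

from urllib.parse import urlencode

def build_search_url(query, city, province):
    # q is the search terms, l the location, matching the scraper below
    params = urlencode({'q': query, 'l': city + ', ' + province})
    return 'https://www.indeed.ca/jobs?' + params

# build_search_url('machine learning', 'Toronto', 'ON')
# -> 'https://www.indeed.ca/jobs?q=machine+learning&l=Toronto%2C+ON'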

For example, with the query machine learning and the location Toronto, the search returns around 500 job postings. The output is the job title and company for each of those postings, from which you can analyze frequencies: how many times data scientist appears in the title list, how often banks show up among the companies, and so on.
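The frequency analysis itself is just collections.Counter over the two lists; a minimal sketch with made-up sample data:

import collections

titles = ['Data Scientist', 'Machine Learning Engineer', 'Data Scientist']  # sample data
stats = collections.Counter(titles)
print(stats.most_common(2))  # [('Data Scientist', 2), ('Machine Learning Engineer', 1)]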

from bs4 import BeautifulSoup
import urllib.request
import re
import collections
from time import sleep

# get lists of job titles and companies from a single results page
def get_title_company(url):
    try:
        html = urllib.request.urlopen(url).read()
    except Exception:
        print('invalid url!')
        return None
    soup = BeautifulSoup(html, "lxml")
    titles = soup.find_all('a', {'data-tn-element': 'jobTitle'})  # find all job title tags
    companies = soup.find_all('span', {'class': 'company'})  # find all company tags
    company_list, title_list = [], []  # store title and company strings
    # skip sponsored jobs: organic titles carry the 'turnstileLink' class
    for title in titles:
        if title['class'] == ['turnstileLink']:
            title_list.append(title.get_text())
    # skip sponsored companies: their parent element has the 'sjcl' class
    for company in companies:
        if company.parent['class'] != ['sjcl']:
            company_list.append(company.get_text("|", strip=True))
    return {'titles': title_list, 'companies': company_list, 'soup': soup}

# helper function to merge one page's results and locate the "Next" link
def update_result(final, temp):
    if temp is None:  # page failed to load, stop paging
        return None
    final['titles'].extend(temp['titles'])
    final['companies'].extend(temp['companies'])
    next_page = temp['soup'].find("span", {"class": "np"}, text=re.compile("Next"))
    return next_page

# core function: walk all result pages for a query and location
def indeed_scraping(query, city, province):
    base_url = 'https://www.indeed.ca'
    curr_url = base_url + '/jobs?q=' + query + '&l=' + city + '%2C+' + province
    result = {'titles': [], 'companies': []}
    temp_result = get_title_company(curr_url)
    next_page = update_result(result, temp_result)
    while next_page is not None:
        sleep(1)  # be polite: pause between requests
        curr_url = base_url + next_page.parent.parent.get('href')
        temp_result = get_title_company(curr_url)
        next_page = update_result(result, temp_result)
    return result

a = indeed_scraping('python', 'Calgary', 'AB')
titles_stat = collections.Counter(a['titles'])
companies_stat = collections.Counter(a['companies'])
print(titles_stat)
print('\n')
print(companies_stat)
print('\n')

# debug part
# print(len(a['titles']))
# print('\n')
# for b in a['titles']:
#     print(b)
# print('\n')
# print(len(a['companies']))
# print('\n')
# for b in a['companies']:
#     print(b)