Job Posting Crawler Project

2020-03-01 · Updated 2021-05-27
Tags: python, Python crawler
  • Target site: 前程无忧 (51job)

  • Data needed: every job posting that matches a given keyword

  • Requirements: paginate automatically and output the results

  • Module chosen: requests

  • Analysis:
    Keyword: python

    Page 1: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=

    Page 2: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,2.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=

    Comparing the two, the query string can be dropped entirely; the keyword sits in the comma-separated path and the last number before ".html" is the page index, which is the only part that changes between pages (see the sketch below). Simplified URL:

    URL: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html
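
Based on that pattern, here is a minimal sketch of how the per-page URL can be built from the keyword and the page number (build_page_url is a hypothetical helper added for illustration, not part of the original script):

    def build_page_url(keyword, page):
        # The keyword and the page number are the 7th and 9th comma-separated
        # values in the path; the other segments keep the defaults observed above.
        return ("https://search.51job.com/list/180000,000000,0000,00,9,99,"
                + keyword + ",2," + str(page) + ".html")

    # build_page_url("python", 1) reproduces the page-1 URL above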

    Code:

    import re
    import requests

    # Listing URL for the keyword "python"; the last number before ".html" is the page index
    url = "https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html"

    # A desktop User-Agent so the site serves the normal HTML page
    hd = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}

    response = requests.get(url, headers=hd)
    # If the page text comes back garbled, re-decode it with the site's GBK encoding
    data = bytes(response.text, response.encoding).decode("gbk", "ignore")

    # Total number of postings, taken from the "共 N 条职位" text on the first page
    pat_pag = "共(.*?)条职位"
    allline = re.compile(pat_pag, re.S).findall(data)[0]
    # 50 postings per listing page
    allpage = int(allline) // 50 + 1

    for i in range(0, allpage):
        print("------------ Crawling page " + str(i + 1) + " ------------")
        url = "https://search.51job.com/list/180000,000000,0000,00,9,99,python,2," + str(i + 1) + ".html"
        response = requests.get(url, headers=hd)
        thisdata = bytes(response.text, response.encoding).decode("gbk", "ignore")

        # Each posting's detail-page URL sits in the href that follows the delivery checkbox
        job_url_pat = '<em class="check" name="delivery_em" onclick="checkboxClick.this."></em>.*?href="(.*?).html'
        job_url_all = re.compile(job_url_pat, re.S).findall(thisdata)

        for job_url in job_url_all:
            thisurl = job_url + ".html"
            response = requests.get(thisurl, headers=hd)  # reuse the same UA header for the detail page
            thisdata = bytes(response.text, response.encoding).decode("gbk", "ignore")

            # Job title, company, salary, and work address on the detail page
            pat_title = '<h1 title="(.*?)"'
            pat_company = '<p class="cname">.*?title="(.*?)"'
            pat_money = '</h1><strong>(.*?)</strong>'
            pat_addr = '上班地址:</span>(.*?)</p>'
            title = re.compile(pat_title, re.S).findall(thisdata)[0]
            company = re.compile(pat_company, re.S).findall(thisdata)[0]
            money = re.compile(pat_money, re.S).findall(thisdata)[0]
            try:
                addr = re.compile(pat_addr, re.S).findall(thisdata)[0]
            except IndexError:
                addr = "N/A"  # no work address listed on the detail page

            print("-------------------")
            print(title)
            print(company)
            print(money)
            print(addr)
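
The requirements above also call for outputting the results, not just printing them. A minimal sketch of persisting the scraped fields to a CSV file (the save_jobs helper, the jobs.csv filename, and the collect-then-write structure are assumptions, not part of the original script):

    import csv

    def save_jobs(rows, path="jobs.csv"):
        # rows: list of (title, company, salary, address) tuples collected in the inner loop
        with open(path, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "company", "salary", "address"])
            writer.writerows(rows)

    # Usage: append (title, company, money, addr) to a list inside the inner loop,
    # then call save_jobs(that_list) once the crawl finishes.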