Job Posting Crawler Project

2020-03-01 · Updated 2021-05-27
Tags: python, Python crawler
  • Target site: 前程无忧 (51job)

  • Data needed: every job posting that matches a given keyword

  • Requirements: paginate automatically and output the results

  • Module chosen: requests

  • Analysis:
    Keyword: python

    Page 1: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=

    Page 2: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,2.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=

    Comparing the two, the query string can be dropped entirely; the keyword sits in the comma-separated path and the last number before ".html" is the page index, which is the only part that changes between pages (see the sketch below). Simplified URL:

    URL: https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html
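
Based on that pattern, here is a minimal sketch of how the per-page URL can be built from the keyword and the page number (build_page_url is a hypothetical helper added for illustration, not part of the original script):

    def build_page_url(keyword, page):
        # The keyword and the page number are the 7th and 9th comma-separated
        # values in the path; the other segments keep the defaults observed above.
        return ("https://search.51job.com/list/180000,000000,0000,00,9,99,"
                + keyword + ",2," + str(page) + ".html")

    # build_page_url("python", 1) reproduces the page-1 URL above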

    Code:

    import re
    import requests

    # Listing URL for the keyword "python"; the last number before ".html" is the page index
    url = "https://search.51job.com/list/180000,000000,0000,00,9,99,python,2,1.html"

    # A desktop User-Agent so the site serves the normal HTML page
    hd = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}

    response = requests.get(url, headers=hd)
    # If the page text comes back garbled, re-decode it with the site's GBK encoding
    data = bytes(response.text, response.encoding).decode("gbk", "ignore")

    # Total number of postings, taken from the "共 N 条职位" text on the first page
    pat_pag = "共(.*?)条职位"
    allline = re.compile(pat_pag, re.S).findall(data)[0]
    # 50 postings per listing page
    allpage = int(allline) // 50 + 1

    for i in range(0, allpage):
        print("------------ Crawling page " + str(i + 1) + " ------------")
        url = "https://search.51job.com/list/180000,000000,0000,00,9,99,python,2," + str(i + 1) + ".html"
        response = requests.get(url, headers=hd)
        thisdata = bytes(response.text, response.encoding).decode("gbk", "ignore")

        # Each posting's detail-page URL sits in the href that follows the delivery checkbox
        job_url_pat = '<em class="check" name="delivery_em" onclick="checkboxClick.this."></em>.*?href="(.*?).html'
        job_url_all = re.compile(job_url_pat, re.S).findall(thisdata)

        for job_url in job_url_all:
            thisurl = job_url + ".html"
            response = requests.get(thisurl, headers=hd)  # reuse the same UA header for the detail page
            thisdata = bytes(response.text, response.encoding).decode("gbk", "ignore")

            # Job title, company, salary, and work address on the detail page
            pat_title = '<h1 title="(.*?)"'
            pat_company = '<p class="cname">.*?title="(.*?)"'
            pat_money = '</h1><strong>(.*?)</strong>'
            pat_addr = '上班地址:</span>(.*?)</p>'
            title = re.compile(pat_title, re.S).findall(thisdata)[0]
            company = re.compile(pat_company, re.S).findall(thisdata)[0]
            money = re.compile(pat_money, re.S).findall(thisdata)[0]
            try:
                addr = re.compile(pat_addr, re.S).findall(thisdata)[0]
            except IndexError:
                addr = "N/A"  # no work address listed on the detail page

            print("-------------------")
            print(title)
            print(company)
            print(money)
            print(addr)
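
The requirements above also call for outputting the results, not just printing them. A minimal sketch of persisting the scraped fields to a CSV file (the save_jobs helper, the jobs.csv filename, and the collect-then-write structure are assumptions, not part of the original script):

    import csv

    def save_jobs(rows, path="jobs.csv"):
        # rows: list of (title, company, salary, address) tuples collected in the inner loop
        with open(path, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "company", "salary", "address"])
            writer.writerows(rows)

    # Usage: append (title, company, money, addr) to a list inside the inner loop,
    # then call save_jobs(that_list) once the crawl finishes.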