Scraping Job-Site Data with Python and Visualizing It [Python Basics]


The text and images in this article are sourced from the internet and are for learning and exchange only, with no commercial use; if there is any problem, please contact us promptly so we can handle it.

This article originally appeared on 青灯编程; author: 清风.

 

Preface


Basic development environment

Modules used

  • Scraping modules

import requests
import re
import parsel
import csv

  • Word-cloud modules

import jieba
import wordcloud

Analyzing the target page

 

Looking at the response in the developer tools shows that the returned data sits inside the window.__SEARCH_RESULT__ variable, so it can be matched out with a regular expression.
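As a minimal sketch of that extraction: the fragment below pulls the JSON object assigned to window.__SEARCH_RESULT__ out of a page and parses it. The HTML snippet here is a made-up stand-in for the real search-results page, which embeds a much larger object.

```python
import json
import re

# Hypothetical snippet of the search-results page; the real page embeds
# a much larger JSON object in window.__SEARCH_RESULT__.
html = '''<script>
window.__SEARCH_RESULT__ = {"engine_search_result":[{"jobid":"127676506","job_name":"Python开发"}]}
</script>'''

# Capture everything between "window.__SEARCH_RESULT__ =" and the closing </script>
raw = re.findall(r"window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})\s*</script>", html, re.S)[0]
data = json.loads(raw)
print(data["engine_search_result"][0]["jobid"])  # -> 127676506
```

Parsing the blob with json.loads instead of regexing individual fields keeps the nested structure intact, though the simple `"jobid"` regex used later in the article works fine when only the IDs are needed.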

 

"https://jobs.51job.com/beijing-ftq/127676506.html?s=01&t=0"

Each job posting's detail page has a corresponding ID. Just extract the ID values with a regex, splice each one into the detail-page URL, and then scrape the posting data from that page.

response = requests.get(url=url, headers=headers)
lis = re.findall(r'"jobid":"(\d+)"', response.text)
for li in lis:
    page_url = "https://jobs.51job.com/beijing-hdq/{}.html?s=01&t=0".format(li)

 

Although the site serves static pages, the response text comes back garbled under the default encoding, so it has to be re-decoded during scraping.
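A quick stand-alone illustration of why the text is garbled, assuming the page bytes are GBK-encoded (common on older Chinese sites): when no charset is declared, requests falls back to ISO-8859-1, which mangles the Chinese characters, while decoding with the detected encoding recovers them.

```python
# The page bytes are GBK-encoded, but without a charset header the
# default fallback decoding is ISO-8859-1, which produces mojibake.
raw_bytes = "招聘信息".encode("gbk")

garbled = raw_bytes.decode("iso-8859-1")  # what the default decoding yields
correct = raw_bytes.decode("gbk")         # what the detected encoding recovers

print(correct)  # -> 招聘信息
```

Setting `response.encoding = response.apparent_encoding`, as in the code below, applies exactly this fix: requests re-decodes the raw bytes using the charset it detects from the content.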

f = open("招聘.csv", mode="a", encoding="utf-8", newline="")
csv_writer = csv.DictWriter(f, fieldnames=["标题", "地区", "工作经验", "学历", "薪资", "福利", "招聘人数", "发布日期"])
csv_writer.writeheader()
response = requests.get(url=page_url, headers=headers)
response.encoding = response.apparent_encoding  # fix the garbled encoding
selector = parsel.Selector(response.text)
title = selector.css(".cn h1::text").get()  # job title
salary = selector.css("div.cn span::text").get()  # salary
welfare = selector.css(".jtag div.t1 span::text").getall()  # benefits
welfare_info = "|".join(welfare)
data_info = selector.css(".cn p.msg.ltype::attr(title)").get().split(" | ")
area = data_info[0]  # location
work_experience = data_info[1]  # work experience
educational_background = data_info[2]  # education
number_of_people = data_info[3]  # headcount
release_date = data_info[-1].replace("发布", "")  # posting date
all_info_list = selector.css("div.tCompany_main > div:nth-child(1) > div p span::text").getall()
all_info = "\n".join(all_info_list)
dit = {
    "标题": title,
    "地区": area,
    "工作经验": work_experience,
    "学历": educational_background,
    "薪资": salary,
    "福利": welfare_info,
    "招聘人数": number_of_people,
    "发布日期": release_date,
}
csv_writer.writerow(dit)
with open("招聘信息.txt", mode="a", encoding="utf-8") as f:
    f.write(all_info)

 

 

The steps above complete the scraping of the job data.

Quick and rough data cleaning

  • Salary

import pandas as pd

content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
salary = content["薪资"]
salary_1 = salary[salary.notnull()]
salary_count = salary_1.value_counts()
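Counting the raw salary strings is as far as the article's cleaning goes; to actually compare salaries numerically, the strings have to be parsed. Below is a hedged sketch of such a parser, assuming the common 51job format like "1-1.5万/月" or "6-8千/月" (mixed-unit variants such as "8千-1.2万/月" are not handled).

```python
import re

def parse_salary(text):
    """Parse a 51job-style salary string such as '1-1.5万/月' into a
    (low, high) tuple of monthly RMB figures. The format is an assumption
    based on typical listings and may not cover every variant on the site."""
    m = re.match(r"([\d.]+)-([\d.]+)(千|万)/(月|年)", text)
    if not m:
        return None  # unparseable entries, e.g. '面议' (negotiable)
    low, high, unit, period = m.groups()
    factor = 1000 if unit == "千" else 10000  # 千 = thousand, 万 = ten thousand
    low, high = float(low) * factor, float(high) * factor
    if period == "年":  # convert annual pay to a monthly figure
        low, high = low / 12, high / 12
    return low, high

print(parse_salary("1-1.5万/月"))  # -> (10000.0, 15000.0)
```

Mapping this over the salary column (e.g. with `Series.map`) would turn the text values into numeric ranges suitable for histograms rather than bare string counts.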

 

 

  • Education requirements

from pyecharts.charts import Bar

content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
educational_background = content["学历"]
educational_background_1 = educational_background[educational_background.notnull()]
educational_background_count = educational_background_1.value_counts().head()
print(educational_background_count)
bar = Bar()
bar.add_xaxis(educational_background_count.index.tolist())
bar.add_yaxis("学历", educational_background_count.values.tolist())
bar.render("bar.html")

 

The counts show that the largest group of postings lists no education requirement.

  • Work experience

content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
work_experience = content["工作经验"]
work_experience_count = work_experience.value_counts()
print(work_experience_count)
bar = Bar()
bar.add_xaxis(work_experience_count.index.tolist())
bar.add_yaxis("经验要求", work_experience_count.values.tolist())
bar.render("bar_experience.html")  # separate file, so the education chart is not overwritten

 

Word-cloud analysis of the required skills

import re
import imageio
import jieba
import wordcloud

py = imageio.imread("python.png")
f = open("python招聘信息.txt", encoding="utf-8")
re_txt = f.read()
result = re.findall(r"[a-zA-Z]+", re_txt)
txt = " ".join(result)  # join with spaces so the English terms stay separate
# jieba tokenization
txt_list = jieba.lcut(txt)
string = " ".join(txt_list)
# word-cloud settings
wc = wordcloud.WordCloud(
    width=1000,                # image width
    height=700,                # image height
    background_color="white",  # background color
    font_path="msyh.ttc",      # word-cloud font
    mask=py,                   # mask image shaping the cloud
    scale=15,
    stopwords={""},
    # contour_width=5,
    # contour_color="red"      # outline color
)
# feed the text into the word cloud
wc.generate(string)
# save the word-cloud image
wc.to_file(r"python招聘信息.png")
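Alongside the word-cloud image, the same extracted English tokens can be counted directly with collections.Counter to get a ranked list of technologies. The sample text below is a made-up stand-in for the contents of python招聘信息.txt.

```python
import re
from collections import Counter

# Hypothetical stand-in for the scraped contents of python招聘信息.txt
text = "熟悉Python, 了解Django或Flask, 掌握MySQL; Python经验优先, 会用Linux"

# Same extraction as the word-cloud step: keep the English tech terms only,
# lowercased so that 'Python' and 'python' count as one word
words = [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]
top = Counter(words).most_common(3)
print(top)  # -> [('python', 2), ('django', 1), ('flask', 1)]
```

This gives exact frequencies where the word cloud only gives relative sizes, which is handy as a sanity check on what the cloud emphasizes.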

 

Summary:

The data analysis here is admittedly crude; honestly, it's hard on the eyes~

That's all for "Scraping Job-Site Data with Python and Visualizing It [Python Basics]". Source link: utcz.com/z/530236.html
