Garbled Chinese text in a Python crawler

Z时代
2024-01-10
Category: Tech sharing

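The page I crawled shows up as Chinese in the debugger and also prints as Chinese, but once the text is stored in a variable its encoding comes out wrong.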

Background of the problem and what I have tried

The source code is below.

# coding: utf-8
import requests
import json
from bs4 import BeautifulSoup
 
url = 'https://www.3ajiepai.com/forum-190-1.html'
strhtml = requests.get(url)
# The target page is GBK and the response arrived garbled, so the encoding is converted here
strhtml.encoding = 'gb18030'
soup = BeautifulSoup(strhtml.content, "html.parser")
print soup.original_encoding
print soup.title
data = soup.select('#waterfall li')
 
list = []
for item in data:
    imgs = item.find('img')
    name = item.select(".xw0 a")[0]
    author = item.select(".auth.cl .a_name a")[0]
    names = name.text.encode("utf8")
    result = {
        "name": names,
    }
    print (result)
    list.append(result)
print (list)
 
# Write to a local file
test_dict = { 'start': list}
json_str = json.dumps(test_dict)
new_dict = json.loads(json_str)
with open('data.json', 'w') as f:
    json.dump(new_dict, f)
    print("Finished writing the file.")
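
A likely explanation for the symptom: the text itself is probably fine, and what looks garbled is the escaped representation. Printing a list or dict shows the repr of each string, so the UTF-8 bytes produced by `name.text.encode("utf8")` appear as `\x..` escapes, and `json.dumps` escapes every non-ASCII character as `\uXXXX` by default. Note also that setting `strhtml.encoding` only affects `strhtml.text`; `strhtml.content` is always the raw bytes, so BeautifulSoup guesses the encoding on its own unless it is given `from_encoding="gb18030"`. The sketch below assumes Python 3 and uses a made-up GB18030 byte string in place of the real response, so the markup, the title and the temporary file path are all hypothetical. It only illustrates keeping the text as Unicode and writing JSON with `ensure_ascii=False`.

```python
import json
import os
import re
import tempfile

# Stand-in for the raw response body (what requests exposes as `.content`).
# The markup and the title are invented for this example.
raw = '<li><a class="xw0">中文标题</a></li>'.encode("gb18030")

# Decode once at the boundary, then keep the text as Unicode everywhere else.
html = raw.decode("gb18030")

# Crude extraction for the demo only; the original code uses BeautifulSoup here.
name = re.search(r">([^<>]+)</a>", html).group(1)

# Encoding back to bytes is what makes the printed dict look "garbled":
# the repr of a bytes object shows escape sequences, not characters.
print(repr(name.encode("utf8")))

# Storing the Unicode string directly keeps it readable.
items = [{"name": name}]
print(items)

# ensure_ascii=False keeps Chinese characters as-is instead of \uXXXX escapes.
json_str = json.dumps({"start": items}, ensure_ascii=False)

# Write with an explicit UTF-8 encoding so the file does not depend on the locale.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(json_str)

# Read the file back to confirm the round trip preserves the text.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["start"][0]["name"])
```

Even with the default `ensure_ascii=True`, the `\uXXXX` escapes in the file are valid JSON and load back as the same Chinese text; `ensure_ascii=False` only makes the file readable to a person.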