Collecting Data on the 12 Zodiac Signs with Python and Analyzing Each Sign's Traits

Z时代
2024-01-10
Category: General

python

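A Story That Started with a Weibo Trending Topic

I. Where the Story Begins

II. Stirring Things Up, Step 1: Collecting the Images

III. Stirring Things Up, Step 2: Displaying the Images

IV. Stirring Things Up, Step 3: Promoting the Link

V. Stirring Things Up, Step 4: Statistical Analysis

1. Data processing
2. Data filtering
3. Frequency by day
4. Frequency by zodiac sign
5. Frequency by month
6. Data visualization (3 bar charts)

Final Words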

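I. Where the Story Begins

On the evening of March 29, I was sitting on the toilet, scrolling through my phone as one does, when I suddenly noticed a Weibo trending topic: #你出生那天的宇宙# (The Universe on the Day You Were Born).

In the comments, everyone seemed to have the same problem: the NASA website could not be reached (probably because heavy traffic was causing extremely high latency). As an upstanding socialist youth, how could I just stand by?
So I decided to stir things up!!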

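II. Stirring Things Up, Step 1: Collecting the Images

A simple idea came to mind: since nobody can download the images from the official site, I will collect them all and hand them out. (Collecting data? Just write a crawler, right?)
So I charged straight to the NASA website to analyze its requests. The result... well, I turned out to be one of the crowd: the images would not load for me either.
I was not going to back down over that, so I went digging through the Weibo comments, and sure enough the effort paid off: someone on Douban had already gathered all the images:

In the spirit of "borrowing what works", I decided this would be my data source (a Douban photo album).
A quick look showed that paging is controlled by an m_start parameter, with 20 images per page (m_start=0 is the first page, m_start=20 the second), so a simple loop is enough. (Note that the script below actually advances m_start by 18 per page; match the step to the number of photos the album really shows on each page.)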

import os
import re
import queue
import requests
import threading
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# Request headers copied from the browser (kept for reference; nothing below passes them)
headers = {
"Host": "www.douban.com",
"Connection": "keep-alive",
"Cache-Control": "max-age=0",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"Sec-Fetch-Dest": "document",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Fetch-Site": "none",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-User": "?1",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "zh-CN,zh;q=0.9",
"Cookie": "bid=rb_kUqiDS6k; douban-fav-remind=1; _pk_ses.100001.8cb4=*; ap_v=0,6.0; __utma=30149280.1787149566.1585488263.1585488263.1585488263.1; __utmc=30149280; __utmz=30149280.1585488263.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __yadk_uid=HNoH1YVIvD2c8HrQDWHRzyLciFJl1AVD; __gads=ID=a1f73d5d4aa31261:T=1585488663:S=ALNI_MafqKPZWHx0TGWTpKEm8TTvdC-eyQ; ct=y; _pk_id.100001.8cb4=722e0554d0127ce7.1585488261.1.1585488766.1585488261.; __utmb=30149280.10.6.1585488263"
}
# Initialize a headless Chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
# Worker: download images until the queue is empty
def downimg():
    while True:
        try:
            img_name, url = img_queue.get_nowait() # non-blocking, so idle workers exit cleanly
        except queue.Empty:
            break
        res = requests.get(url)
        data = res.content
        with open("./img/%s.webp"%img_name,"wb") as f:
            f.write(data)
        print(img_name)
# Album URL template; m_start is the paging offset
url_o = "https://www.douban.com/photos/album/1872547715/?m_start=%d"
# Crawl the image links
img_queue = queue.Queue()
for i in range(0,21):
    url = url_o%(18*i)
    driver.get(url)
    es = driver.find_elements(By.CLASS_NAME, "photo_wrap") # Selenium 4 locator API
    for e in es:
        img_e = e.find_element(By.TAG_NAME, "img")
        img_url = img_e.get_attribute("src")
        img_url = img_url.replace("photo/m/public","photo/l/public") # switch to the large image
        text_e = e.find_element(By.CLASS_NAME, "pl")
        img_date = text_e.text
        img_queue.put((img_date,img_url))
    print("Page %d crawled"%(i+1))
driver.quit()
# Download the images with several worker threads
os.makedirs("./img", exist_ok=True) # make sure the output directory exists
thread_list = []
N_thread = 5
for i in range(N_thread):
    thread_list.append(threading.Thread(target=downimg))
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()

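In short, the code uses webdriver to visit each page and collect the image URLs, then downloads and saves the images with requests across multiple threads.
With that, the image collection is basically done!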
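
The same collect-then-download split can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which avoids managing the queue and the threads by hand. This is only a minimal sketch, not the script I actually ran: `download_all`, `fetch`, and `save` are hypothetical names, and `fetch` is passed in so the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(items, fetch, save, workers=5):
    """Download (name, url) pairs concurrently.

    fetch(url) -> bytes and save(name, data) are supplied by the caller
    (hypothetical hooks; fetch could wrap requests.get(url).content).
    Returns the saved names in input order.
    """
    def job(item):
        name, url = item
        save(name, fetch(url))
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps input order and re-raises worker exceptions
        return list(pool.map(job, items))


if __name__ == "__main__":
    # Stand-in fetch/save so the sketch runs offline
    store = {}
    names = download_all(
        [("3月29日", "https://example.invalid/a.webp")],
        fetch=lambda url: url.encode("utf-8"),
        save=store.__setitem__,
    )
    print(names, len(store))
```

In a real run, `save` would write the bytes to a file under ./img, as the script above does.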

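III. Stirring Things Up, Step 2: Displaying the Images

With the images in hand, the next question was how to get them to everyone. Message each person privately? Clever me would never do that, so I decided to write a small web page for people to visit. Being far from a professional, I cobbled it together from bits and pieces, and the result looks roughly like this (你生日那天的宇宙, "The Universe on Your Birthday"):

IV. Stirring Things Up, Step 3: Promoting the Link

As for promotion, I don't know much about it and wouldn't dare claim otherwise. Naively, I decided to post on Weibo myself (thinking something like: a tool this convenient is bound to be popular, surely, yes, no doubt about it...):

Reality, of course, is cruel. As the onlookers have already guessed: nobody cared, and it sank without a trace~
After several twists and turns, and with the generous help of a popular blogger on the topic, it finally got some traffic: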

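V. Stirring Things Up, Step 4: Statistical Analysis

Although this traffic is still far from what I had imagined (the topic itself has hundreds of millions of reads, after all), I decided to run some simple statistics on yesterday's visits:

1. Data processing

After exporting the raw csv table of page-visit data from 某度统计 (a web analytics service), I did some simple processing to put it into a format that is easier to read in.

2. Data filtering

The table contains data for other pages as well as the NASA page, so the data has to be filtered: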

import re
import urllib.parse
import pandas as pd
# Load the data
data = pd.read_csv("./analyze/20200330-20200330.csv",encoding="utf-8")
# Keep only rows that belong to the NASA page and carry a valid date
data_NASA = []
for i in range(len(data)):
    url = urllib.parse.unquote(data["URL"][i])
    pv = data["PV"][i] # page views
    uv = data["UV"][i] # unique visitors
    #if url[-1] == "日" and "NaN" not in url: # a NASA page visit
    if "date=" in url and "NaN" not in url:
        try:
            data_NASA.append((re.findall(r"date=(\d+月\d+日)",url)[0],pv,uv))
        except IndexError: # no date in the expected "M月D日" form
            pass
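
As an alternative to the regular expression, the query string can be parsed with `urllib.parse`. The sketch below is illustrative only: `extract_date` is a hypothetical helper, and the example URL shape is an assumption, since the real page address is not shown in this post.

```python
from urllib.parse import parse_qs, quote, urlsplit


def extract_date(url):
    """Return the decoded `date` query parameter, or None if absent or invalid."""
    query = urlsplit(url).query
    values = parse_qs(query).get("date")  # parse_qs percent-decodes the values
    if not values or "NaN" in values[0]:
        return None
    return values[0]


# Hypothetical URL shape, used only to exercise the helper
example = "https://example.invalid/nasa?date=" + quote("3月29日")
print(extract_date(example))
```

Unlike the regular expression, this does not check that the value has the "M月D日" form, so a format check would still be needed before parsing month and day.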

3. Frequency by day

# Count how often each birthday was queried
PV_map= {}
UV_map = {}
PV_total = 0
UV_total = 0
for d in data_NASA:
    if d[0] not in PV_map.keys():
        PV_map[d[0]] = 0
        UV_map[d[0]] = 0
    PV_map[d[0]] +=  d[1] # PV
    UV_map[d[0]] +=  d[2] # UV
    PV_total += d[1]
    UV_total += d[2]
for k in PV_map.keys(): # convert counts to percentages
    PV_map[k] = PV_map[k]/PV_total*100
    UV_map[k] = UV_map[k]/UV_total*100
PVs= sorted(PV_map.items(),key=lambda x:x[1],reverse=True) # sort, largest first
UVs= sorted(UV_map.items(),key=lambda x:x[1],reverse=True) # sort, largest first
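
The sum-then-normalize pattern above can be condensed with `collections.Counter`. This is a minimal sketch under the assumption that the input is a list of (key, value) pairs; `percentage_share` is a hypothetical helper and the sample numbers are made up.

```python
from collections import Counter


def percentage_share(records):
    """Sum a value per key and return (key, percent) pairs, largest first.

    records: iterable of (key, value) pairs, e.g. (date string, UV).
    """
    totals = Counter()
    for key, value in records:
        totals[key] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # avoid dividing by zero on empty input
    return sorted(
        ((key, value / grand_total * 100) for key, value in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Made-up numbers purely for illustration
sample = [("3月29日", 3), ("1月1日", 1), ("3月29日", 1)]
print(percentage_share(sample))
```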

4. Frequency by zodiac sign

# Work out the zodiac sign for a month and day
def get_xingzuo(month, date):
    dates = (21, 20, 21, 21, 22, 22, 23, 24, 24, 24, 23, 22) # day in each month on which the next sign starts
    # Capricorn, Aquarius, Pisces, Aries, Taurus, Gemini, Cancer, Leo, Virgo, Libra, Scorpio, Sagittarius, Capricorn
    constellations = ("摩羯座", "水瓶座", "双鱼座", "白羊座", "金牛座", "双子座", "巨蟹座", "狮子座", "处女座", "天秤座", "天蝎座", "射手座", "摩羯座")
    if date < dates[month-1]:
        return constellations[month-1]
    else:
        return constellations[month]
# Count how often each zodiac sign was queried
xingzuo = ("摩羯座", "水瓶座", "双鱼座", "白羊座", "金牛座", "双子座", "巨蟹座", "狮子座", "处女座", "天秤座", "天蝎座", "射手座", "摩羯座")
xingzuo_map = {}
for x in xingzuo:
    xingzuo_map[x] = 0
xingzuo_total = 0
for d in data_NASA:
    month, day = map(int, re.findall(r"(\d+)月(\d+)日",d[0])[0]) # parse "M月D日"
    x = get_xingzuo(month,day)
    #xingzuo_map[x] += d[1] # PV
    xingzuo_map[x] += d[2] # UV
    xingzuo_total += d[2]
for k in xingzuo_map.keys():
    xingzuo_map[k] = xingzuo_map[k]/xingzuo_total*100
xingzuos= sorted(xingzuo_map.items(),key=lambda x:x[1],reverse=True) # sort, largest first
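
Sign boundaries differ by a day or so between sources, so it can help to print the cutover dates that the lookup table in get_xingzuo encodes and compare them with whichever reference you trust. The sketch below repeats the same two tuples so that it runs on its own; `sign_for` and `cutover_days` are hypothetical helper names.

```python
# Same lookup tables as get_xingzuo, repeated so this sketch is self-contained
DATES = (21, 20, 21, 21, 22, 22, 23, 24, 24, 24, 23, 22)
SIGNS = ("摩羯座", "水瓶座", "双鱼座", "白羊座", "金牛座", "双子座", "巨蟹座",
         "狮子座", "处女座", "天秤座", "天蝎座", "射手座", "摩羯座")


def sign_for(month, day):
    """Return the sign for a 1-based month and a day, using the tables above."""
    return SIGNS[month - 1] if day < DATES[month - 1] else SIGNS[month]


def cutover_days():
    """List (month, first day of the new sign, new sign) for each month."""
    return [(m, DATES[m - 1], SIGNS[m]) for m in range(1, 13)]


for month, day, sign in cutover_days():
    print("%d月%d日 -> %s" % (month, day, sign))
```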

5. Frequency by month

# Count how often each month was queried
month = [str(i)+"月" for i in range(1,13)]
month_map = {}
for m in month:
    month_map[m] = 0
month_total = 0
for d in data_NASA:
    m = d[0].split("月")[0]+"月"
    #month_map[m] += d[1] # PV
    month_map[m] += d[2] # UV
    month_total += d[2]
for k in month_map.keys():
    month_map[k] = month_map[k]/month_total*100
months= sorted(month_map.items(),key=lambda x:x[1],reverse=True) # sort, largest first

6. Data visualization (3 bar charts)

import matplotlib.pyplot as plt
# The labels are Chinese, so matplotlib needs a CJK-capable font,
# e.g. plt.rcParams["font.sans-serif"] = ["SimHei"] if that font is installed.
def plot_ranking(ranked, top_n=None):
    """Draw a horizontal bar chart from (label, percent) pairs sorted largest first."""
    ranked = ranked[:top_n] if top_n else ranked
    labels = [label for label, _ in ranked][::-1] # reversed so first place is drawn on top
    values = [value for _, value in ranked][::-1]
    fig, ax = plt.subplots()
    bars = ax.barh(labels, values, color="#6699CC") # blue by default
    medal_colors = {1: "#FFFACD", 2: "#C0C0C0", 3: "#FFA500"} # gold, silver, orange
    rank = len(bars)
    for rect in bars: # recolor the top three and write the value next to each bar
        if rank in medal_colors:
            rect.set_facecolor(medal_colors[rank])
        w = rect.get_width()
        ax.text(w, rect.get_y()+rect.get_height()/2, " %.2f%%"%w, ha="left", va="center")
        rank -= 1
    ax.set_xticks([]) # hide the x axis
    return fig
plot_ranking(UVs, top_n=10) # top 10 birthdays by unique visitors (UV)
plot_ranking(xingzuos) # zodiac sign ranking
plot_ranking(months) # month ranking
plt.show()

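The final result looks like this:

Final Words

If I can, I too hope to create some so-called "ultimate romance" amid countless keystrokes~

Finally, here are some of the images from this NASA event that I personally found the prettiest: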

Source link: utcz.com/z/530379.html