数据采集实战（五）

Z时代
2024-01-10
分类：综合

database

1. 概述

现在学校越来越重视孩子课外知识的掌握，给孩子挑选课外书一般都是参考学校或者家长之间的推荐。

有时，也会想看看在儿童阶段，目前到底流行的是些什么样的书。

于是，就简单写了这个小爬虫，采集了畅销童书的前20名。

要想采集更多的畅销童书，后者采集其他类型的畅销书，调整相应的参数和URL就可以了。

2. 采集流程

因为当当网的图书排名不需要登录就可以查看，而且采集前20名也不需要翻页，所以流程很简单，打开网页直接解析保存就可以了。

核心代码如下：

import { saveContent } from "../utils.js";
const http_prefix = "http://bang.dangdang.com/books/childrensbooks";
const age_start = 1;
const age_end = 4;
const month_start = 1;
const month_end = 11; // 目前只到11月
const age_map = { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" };
const childrensbooks = async (page) => {
    // 0~2岁， 3~6岁，7~10岁，11~14岁
    for (let i = age_start; i <= age_end; i++) {
        for (let m = month_start; m <= month_end; m++) {
            let url = `${http_prefix}/01.41.0${i}.00.00.00-month-2021-${m}-1-1-bestsell`;
            const lines = await parseData(page, url);
            // 分年龄段分月份写入csv
            await saveContent(
                `./output/dangdang-childrenbook`,
                `2021-${m}-${age_map[i]}.csv`,
                lines.join("
")
            );
        }
    }
};
// 采集字段： 排名order，书名name，评论数comment，推荐率recommend_pct，作者author，出版日期publish_date，出版社publisher
const parseData = async (page, url) => {
    await page.goto(url);
    const listContent = await page.$$("ul.bang_list>li");
    let lines = [
        "排名order,书名name,评论数comment,推荐率recommend_pct,作者author,出版日期publish_date,出版社publisher",
    ];
    for (let i = 0; i < listContent.length; i++) {
        const order = await listContent[i].$eval(
            "div.list_num",
            (node) => node.innerText
        );
        const name = await listContent[i].$eval(
            "div.name>a",
            (node) => node.innerText
        );
        const comment = await listContent[i].$eval(
            "div.star>a",
            (node) => node.innerText
        );
        const recommend_pct = await listContent[i].$eval(
            "div.star>span.tuijian",
            (node) => node.innerText
        );
        const publisher_info = await listContent[i].$$("div.publisher_info");
        const authors = await publisher_info[0].$$eval("a", (nodes) =>
            nodes.map((node) => node.innerText)
        );
        const author = authors.join("&");
        const publish_date = await publisher_info[1].$eval(
            "span",
            (node) => node.innerText
        );
        const publisher = await publisher_info[1].$eval(
            "a",
            (node) => node.innerText
        );
        const line = `${order},${name},${comment},${recommend_pct},${author},${publish_date},${publisher}`;
        lines.push(line);
        console.log(line);
    }
    return lines;
};
export default childrensbooks;

采集之后的内容分门别类的按照月份和年龄阶段保存的。

文件的内容是csv格式的（下图是其中部分字段）。

3. 总结

以上内容是通过 puppeteer 采集的，除了童书排行榜，还有图书畅销榜，新书热卖榜，图书飙升榜，特价榜，五星图书榜等等。

各个榜单的结构都类似，只需要修改上面代码中的 http_prefix，以及童书年龄阶段的循环控制等，就能采集相应数据。

4. 注意事项

爬取数据只是为了研究学习使用，本文中的代码遵守：

如果网站有 robots.txt，遵循其中的约定

爬取速度模拟正常访问的速率，不增加服务器的负担

只获取完全公开的数据，有可能涉及隐私的数据绝对不碰

以上是数据采集实战（五）的全部内容，来源链接： utcz.com/z/536148.html

数据采集实战（五）

1. 概述

2. 采集流程

3. 总结

4. 注意事项

其他人也看了：