这样的表结构设计合理吗,如果不合理该怎么改?

现在有3个表 内容表 内容主题表 内容主题关键字表
大概的DDL如下:

CREATE TABLE `book` (

`id` bigint NOT NULL,

`content` varchar(255) DEFAULT NULL,

`title` varchar(255) DEFAULT NULL,

`sale_time` datetime DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `book` VALUES (1, '内容', '标题', '2023-05-16 18:01:41');

CREATE TABLE `book_theme` (

`id` bigint NOT NULL,

`book_id` bigint DEFAULT NULL,

`theme` varchar(255) DEFAULT NULL,

`num` int DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `book_theme` VALUES (1, 1, '恐怖', 2);

INSERT INTO `book_theme` VALUES (2, 1, '玄幻', 1);

CREATE TABLE `book_theme_word` (

`id` bigint NOT NULL,

`book_id` bigint DEFAULT NULL,

`book_theme_id` bigint DEFAULT NULL,

`word` varchar(255) DEFAULT NULL,

`num` int DEFAULT NULL,

`type` tinyint DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `book_theme_word` VALUES (1, 1, 1, '鬼怪', 2, 0);

INSERT INTO `book_theme_word` VALUES (2, 1, 2, '天使', 1, 1);

大概说下关系 一个内容对应多个主题,每个主题有自己的数量
一个主题对应多个主题关键字 每个关键字有自己的数量,且有个type(0,1)区分是好的还是坏的

需求:现在要统计一些书中出现最多的主题数量top10
同时要取出top10中前3主题的下 出现最多得好的和坏的关键词前10

我现在的做法是

select group_concat(t.id) as idList,t.theme,sum(t.num) as num

from book_theme t left join book b on t.book_id=b.id

where b.id in (1)

group by t.theme

order by num desc limit 10

先取出主题的前10条记录 然后拿到book_theme表的id集合 再去找book_theme_word分别统计出好的和坏的前10的关键词

这样写有个不好的地方,group_concat有数量限制 虽然可以修改 但是这个数量会越来越大
再用group_concat的结果去book_theme表查 用in也会有限制的问题影响

请问该怎么处理好


回答:

你的思路是正确的,但是使用 group_concat 确实会存在数量限制的问题,不利于大规模的数据处理。
说一下我的思路:

  1. 首先从 book_theme 表中查询出内容 id 为 1 的主题及数量,并排序取 top 10,得到主题 id 列表 theme_ids。

    SELECT id, theme, num 

    FROM book_theme

    WHERE book_id = 1

    ORDER BY num DESC

    LIMIT 10

  2. 然后分别统计出这 10 个主题对应的好词(type = 1)和坏词(type = 0)的数量。
    好词:

    SELECT book_theme_id, word, SUM(num) AS num 

    FROM book_theme_word

    WHERE book_theme_id IN (theme_ids) AND type = 1

    GROUP BY book_theme_id, word

    ORDER BY num DESC

    LIMIT 10

    坏词:

    SELECT book_theme_id, word, SUM(num) AS num  

    FROM book_theme_word

    WHERE book_theme_id IN (theme_ids) AND type = 0

    GROUP BY book_theme_id, word

    ORDER BY num DESC

    LIMIT 10

  3. 最后 outer join 三个查询的结果,得到内容 id 为 1 的主题 top 10,以及其前 3 个主题的好词和坏词 top 10。

    SELECT 

    t.id AS theme_id,

    t.theme,

    t.num AS theme_num,

    g.word AS good_word,

    g.num AS good_word_num,

    b.word AS bad_word,

    b.num AS bad_word_num

    FROM

    (SELECT * FROM book_theme WHERE book_id = 1 ORDER BY num DESC LIMIT 10) AS t

    LEFT JOIN

    (SELECT * FROM good_word_result ORDER BY num DESC LIMIT 10) AS g

    ON t.id = g.book_theme_id

    LEFT JOIN

    (SELECT * FROM bad_word_result ORDER BY num DESC LIMIT 10) AS b

    ON t.id = b.book_theme_id

    LIMIT 3

    这种分步统计再 outer join 的查询方式,可以比较优雅地解决这种统计 top n 中的 top m 信息的问题,并且更利于大规模数据的处理。


回答:

1.保持book表不变。

2.在book_theme表中增加一个theme_type字段,用来区分主题关键字是好词还是坏词(0表示好词,1表示坏词)。

修改后的book_theme表DDL:

CREATE TABLE `book_theme` (

`id` bigint NOT NULL,

`book_id` bigint DEFAULT NULL,

`theme` varchar(255) DEFAULT NULL,

`num` int DEFAULT NULL,

`theme_type` tinyint DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

修改后的book_theme表插入数据:

INSERT INTO `book_theme` VALUES (1, 1, '恐怖', 2, 0);

INSERT INTO `book_theme` VALUES (2, 1, '玄幻', 1, 1);

修改book_theme_word表,删除type字段。

修改后的book_theme_word表DDL:

CREATE TABLE `book_theme_word` (

`id` bigint NOT NULL,

`book_id` bigint DEFAULT NULL,

`book_theme_id` bigint DEFAULT NULL,

`word` varchar(255) DEFAULT NULL,

`num` int DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

修改后的book_theme_word表插入数据:

INSERT INTO `book_theme_word` VALUES (1, 1, 1, '鬼怪', 2);

INSERT INTO `book_theme_word` VALUES (2, 1, 2, '天使', 1);

最后SQL查询看看:

WITH ranked_themes AS (

SELECT bt.*,

ROW_NUMBER() OVER (PARTITION BY bt.book_id ORDER BY bt.num DESC) AS rn

FROM book_theme bt

WHERE bt.book_id = 1

),

top_themes AS (

SELECT * FROM ranked_themes WHERE rn <= 10

),

ranked_words AS (

SELECT btm.*,

ROW_NUMBER() OVER (PARTITION BY btm.book_theme_id ORDER BY btm.num DESC) AS rn

FROM book_theme_word btm

JOIN top_themes tt ON btm.book_theme_id = tt.id

),

top_words AS (

SELECT * FROM ranked_words WHERE rn <= 10

)

SELECT tt.theme, tt.num AS theme_num, bw.word, bw.num AS word_num, tt.theme_type

FROM top_themes tt

JOIN top_words bw ON tt.id = bw.book_theme_id

ORDER BY tt.rn, bw.rn;

以上是 这样的表结构设计合理吗,如果不合理该怎么改? 的全部内容, 来源链接: utcz.com/p/945170.html

回到顶部