What can I do when jieba's word segmentation results are poor?

Z时代
2024-02-15
Category: IT

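What can I do when jieba's segmentation results are poor?
I want to build a word cloud from reviews of scenic areas. I currently segment the text with jieba and then run LDA on the segmented output to extract topics, but the top words in the extracted topics make it obvious that the segmentation is going wrong.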

Relevant code:

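import jieba
from nltk.corpus import stopwords  # assumed source of `stopwords`

# Load the Chinese stop-word list (`spark` is assumed to be an existing SparkSession)
stop_words = set(stopwords.words('chinese'))
broadcastVar = spark.sparkContext.broadcast(stop_words)
# Segment Chinese text into words
def tokenize(text):
    return list(jieba.cut(text))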
# Remove Chinese stop words
def delete_stopwords(tokens,stop_words):
    # The input is already segmented
    words = tokens  
    # Drop stop words
    filtered_words = [word for word in words if word not in stop_words]
    # Rebuild the text as a space-separated string
    filtered_text = ' '.join(filtered_words)
    return filtered_text
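# Remove punctuation and a fixed set of characters
def remove_punctuation(input_string):
    import string
    # Build a translation table that maps every punctuation mark and every character to delete to None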
    all_punctuation = string.punctuation + "！？｡。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.\t \n很好是去还不人太都中"
    translator = str.maketrans('', '', all_punctuation)
    # Use the translation table to strip all of those punctuation marks and characters
    no_punct = input_string.translate(translator)
    return no_punct
def Thematic_focus(text):
    from gensim import corpora, models
    num_words = 0
    if len(text)>200:
        num_words = 10
    elif 200>=len(text)>100:
        num_words = 8
    elif 100>=len(text)>50:
        num_words = 5
    else:
        num_words = 3
    tokens = tokenize(text)
    # Remove stop words
    stop_words = broadcastVar.value
    text = delete_stopwords(tokens,stop_words)
    # Remove punctuation
    text = remove_punctuation(text)
    # Segment again
    tokens = tokenize(text)
    print(type(tokens),type([tokens]))
    # return str(tokens)
    # Build the dictionary and the document-term (bag-of-words) matrix
    dictionary = corpora.Dictionary([tokens])
    corpus = [dictionary.doc2bow(tokens)]
    # Run the LDA model
    lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=50)
    # Extract the topics
    topics = lda_model.show_topics(num_words=num_words)
    # Return the first (and only) topic as a string
    for topic in topics:
        return str(topic)
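
One thing that may be contributing to the odd topic words: remove_punctuation deletes single characters such as 很, 中 and 人 everywhere in the string, including inside longer words, and it also deletes the spaces that separated the tokens, so the second call to tokenize sees a different string from the original review. A minimal sketch of an alternative is to filter the token list once and never re-join or re-cut it. The names `clean_tokens` and `PUNCTUATION` are hypothetical, and the punctuation set is a shortened, illustrative one rather than the full string from the code above.

```python
import string

# Assumption: a short illustrative punctuation set, not the full list from the post.
PUNCTUATION = set(string.punctuation + "！？。，、；：（）【】《》“”‘’…—～")


def clean_tokens(tokens, stop_words):
    """Filter an already-segmented token list without re-joining or re-cutting it."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue  # skip whitespace-only tokens
        if token in stop_words:
            continue  # skip whole tokens that are stop words
        if all(ch in PUNCTUATION for ch in token):
            continue  # skip tokens made up only of punctuation
        cleaned.append(token)
    return cleaned


if __name__ == "__main__":
    tokens = ["景区", "的", "人", "很多", "，", "中国", "风景", "很", "好", "！"]
    print(clean_tokens(tokens, {"的", "很", "好", "人"}))
    # Character-level deletion, by contrast, also damages longer words:
    print("很多中国".translate(str.maketrans("", "", "很中")))
```

Because the comparison is made per token, a stop word like 很 is dropped only when it stands alone, and 很多 or 中国 survive intact. The cleaned list can be passed directly to corpora.Dictionary and doc2bow.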

I'd like to make the segmentation more sensible, or find a better way to extract keywords from scenic-area reviews.

Source link: utcz.com/p/939088.html
