How do I set custom stop words for sklearn CountVectorizer?

Z时代
2024-01-10
Category: Q&A

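I am trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

In the sklearn tutorial, the term counts that are fed into LDA are computed in the following part: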

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                            max_features=n_features,
                            stop_words='english')

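As far as I can tell, the built-in stop word feature only covers English. How can I use my own stop word list?

Answer:

You can pass your own words to the stop_words parameter as a frozenset, for example: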

stop_words = frozenset(["word1", "word2", "word3"])
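
To show the custom stop words actually being applied, here is a minimal sketch that passes them to CountVectorizer with the same max_df and min_df settings as in the question. The German stop words and the three sample documents are made up for illustration. The sketch uses a plain list rather than a frozenset, because the sklearn documentation describes stop_words as accepting 'english', a list, or None.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop words and documents, for illustration only.
custom_stop_words = ["der", "die", "und"]
docs = [
    "der Hund und die Katze",
    "die Katze und der Vogel",
    "der Vogel und der Hund",
]

# Same thresholds as in the question, with the custom list
# replacing stop_words='english'.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# The learned vocabulary no longer contains any of the stop words.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)  # ['hund', 'katze', 'vogel']
```

The resulting matrix tf can then be passed to LDA in the same way as in the tutorial. Note that CountVectorizer lowercases text by default, so the stop words should be given in lowercase as well.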

Source link: utcz.com/qa/408138.html
