Notes: Extracting High-Frequency Words from News with Lucene


Goal: given a news article, find the words that occur most frequently in it. (Any article taken from the web will do.)

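Analysis: To extract the vocabulary, we build a Lucene index and read the ten most frequent terms from it. Indexing is essentially tokenization followed by construction of an inverted index: the analyzer splits the text into tokens, drops punctuation and stop words, and emits the remaining terms. In code, we call IndexReader's getTermVector to get the Terms of one field of a document (this requires the field to be indexed with term vectors stored), iterate over them to read each term's frequency, put the term/frequency pairs into a map, sort them in descending order of frequency, and take the top 10.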
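Before the Lucene version, the tokenize, count, sort, take-top-N pipeline described above can be sketched in plain Java without any Lucene classes. This is only an illustration under stated assumptions: the class name TermFrequencySketch, the tiny stop-word list, and the split-on-non-letters tokenizer are all hypothetical stand-ins, and a real Chinese analyzer such as IK segments text very differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TermFrequencySketch {

    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "in"));

    // Lowercase, split on anything that is not a letter or digit, drop stop words.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Count how often each term occurs.
    static Map<String, Integer> countTerms(List<String> terms) {
        Map<String, Integer> freq = new HashMap<>();
        for (String term : terms) {
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    // Sort by descending frequency (ties broken alphabetically) and keep at most n entries.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> freq, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        String text = "The price of oil rose, and the price of gas rose in the market.";
        // Prints: price : 2, rose : 2, gas : 1 (one per line)
        for (Map.Entry<String, Integer> e : topN(countTerms(tokenize(text)), 3)) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

Lucene does the same work at a larger scale: the analyzer plays the role of tokenize, and the stored term vector already holds the per-term counts that countTerms computes here.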

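import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
// IKAnalyzer6x (the IK Analyzer adapter for Lucene 6.x) must also be imported from your own project.

public static void main(String[] args) throws IOException {
    // Location of the news text file
    File f = new File("/Users/**/src/main/resources/news/news.txt");
    String text = textToString(f);

    // true enables IK's smart segmentation mode
    Analyzer analyzer = new IKAnalyzer6x(true);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    // CREATE overwrites any existing index in the directory
    indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    // try-with-resources closes the writer and the directory even if indexing fails
    try (Directory directory = FSDirectory.open(Paths.get("indexdir"));
         IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig)) {
        // Create a field type that specifies what is recorded when the field is indexed
        FieldType type = new FieldType();
        // Index documents, term frequencies, positions, and offsets
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        // Term vectors are required later by IndexReader.getTermVector
        type.setStoreTermVectors(true);
        type.setTokenized(true);

        // Create the document and add the news text as its "content" field
        Document doc = new Document();
        Field field = new Field("content", text, type);
        doc.add(field);
        indexWriter.addDocument(doc);
    }
}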
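The field above is indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, so each term records which documents contain it, how often, and where. As a rough mental model only, here is a toy in-memory inverted index in plain Java. The class InvertedIndexSketch and its methods are hypothetical, whitespace splitting stands in for a real analyzer, and Lucene's actual on-disk postings format is far more compact than these nested maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    // term -> (document id -> token positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Tokenize on whitespace and record the position of every token.
    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue; // only happens for empty input
            }
            postings.computeIfAbsent(tokens[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Positions of a term in one document; empty if the term does not occur there.
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return Collections.emptyList();
        }
        return docs.get(docId);
    }

    // The term frequency is simply the number of recorded positions.
    int termFrequency(String term, int docId) {
        return positions(term, docId).size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "oil price rises as oil demand grows");
        System.out.println("oil freq = " + index.termFrequency("oil", 0));
        System.out.println("oil positions = " + index.positions("oil", 0));
    }
}
```

In this model the term frequency falls out of the positions list for free, which is why an index that stores positions can also answer frequency queries.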

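import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Read the whole file into one string, one line at a time
public static String textToString(File f) {
    StringBuilder sb = new StringBuilder();
    // Assumes the news file is UTF-8 encoded; try-with-resources closes the reader even on error
    try (BufferedReader br = Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8)) {
        String str;
        while ((str = br.readLine()) != null) {
            sb.append(str).append(System.lineSeparator());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}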

Use IndexReader to query the term frequencies.

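import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public static void main(String[] args) throws IOException {
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("indexdir")))) {
        // Term vector of the "content" field of the only document (doc ID 0)
        Terms terms = reader.getTermVector(0, "content");
        if (terms == null) { System.err.println("No term vector stored for field \"content\""); return; }
        TermsEnum termsEnum = terms.iterator();
        Map<String, Long> map = new HashMap<>();
        BytesRef thisTerm;
        while ((thisTerm = termsEnum.next()) != null) {
            // Within a term vector, totalTermFreq() is the term's frequency in this document
            map.put(thisTerm.utf8ToString(), termsEnum.totalTermFreq());
        }
        List<Map.Entry<String, Long>> sortedEntries = new ArrayList<>(map.entrySet());
        sortedEntries.sort((o1, o2) -> Long.compare(o2.getValue(), o1.getValue()));
        // Print at most 10 entries; the article may contain fewer than 10 distinct terms
        for (int i = 0; i < Math.min(10, sortedEntries.size()); i++) {
            System.out.println(sortedEntries.get(i).getKey() + " : " + sortedEntries.get(i).getValue());
        }
    }
}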
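Sorting every term is fine for a single article. If the term map were much larger, one possible alternative is to keep only the k largest entries in a bounded min-heap. The sketch below is plain Java and independent of Lucene; the class TopKHeapSketch and the method topK are hypothetical names, and this is a swapped-in technique rather than what the code above does.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKHeapSketch {

    // Return the k entries with the highest counts, in descending order of count.
    static List<Map.Entry<String, Long>> topK(Map<String, Long> freq, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap ordered by count: the smallest of the current candidates is on top.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(k, (a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : freq.entrySet()) {
            // Copy the entry so the result does not depend on the source map.
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(e));
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> freq = new HashMap<>();
        freq.put("market", 5L);
        freq.put("price", 9L);
        freq.put("oil", 3L);
        freq.put("gas", 1L);
        for (Map.Entry<String, Long> e : topK(freq, 2)) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

The heap never holds more than k + 1 entries, so the cost is roughly O(n log k) instead of the O(n log n) of a full sort. For a top 10 over one article the difference is negligible.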

Running it prints the ten most frequent terms, one per line, in the form "term : frequency".

