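Notes: Extracting High-Frequency Terms from News with Lucene
Goal: given a news article, find which words occur most frequently. (Any news article grabbed from the web will do.)
Analysis: to extract terms from the text, we build a Lucene index and read the top 10 terms by frequency from it. Indexing is essentially tokenization followed by building an inverted index: tokenization strips punctuation, stop words, and similar noise from the text and produces the final terms. In code, the approach is to call IndexReader's getTermVector to get the Terms of one field of a document, read each term's frequency from it, put the term/frequency pairs into a map, sort them in descending order of frequency, and take the top 10.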
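To make the tokenize, filter, and count pipeline concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class name TermFrequencySketch, the tiny stop-word set, and the rule of splitting on non-alphanumeric characters are all assumptions made for illustration. Chinese text has no spaces between words, so a simple split like this would not work on it; that is why the indexing program relies on an IK analyzer instead.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TermFrequencySketch {
    // Hypothetical stop-word list for illustration; a real analyzer ships its own.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and", "in");

    // Lowercases the text, splits on runs of characters that are neither letters
    // nor digits (this drops punctuation), skips stop words, and counts the rest.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // prints {index=2, news=2}
        System.out.println(termFrequencies("The index of the news, and the news in the index."));
    }
}
```

The resulting map plays the same role as the term/frequency pairs that Lucene stores in a term vector for a field.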
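import java.io.*;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
// IKAnalyzer6x is not part of Lucene; import it from wherever it lives in your project.

public static void main(String[] args) throws IOException {
    // Location of the news text file
    File f = new File("/Users/**/src/main/resources/news/news.txt");
    String text = textToString(f);
    // IK Chinese analyzer; the boolean is presumably IK's useSmart flag
    Analyzer analyzer = new IKAnalyzer6x(true);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    // CREATE replaces any index that already exists in the directory
    indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // Field type that specifies what is recorded when the field is indexed
    FieldType type = new FieldType();
    // Index documents, term frequencies, positions, and offsets
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    // Term vectors must be stored so that getTermVector can read them later
    type.setStoreTermVectors(true);
    type.setTokenized(true);
    // try-with-resources closes the writer, then the directory, even on failure
    try (Directory directory = FSDirectory.open(Paths.get("indexdir"));
         IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig)) {
        // Create the document with a single "content" field
        Document doc = new Document();
        doc.add(new Field("content", text, type));
        indexWriter.addDocument(doc);
    }
}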
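// Reads the whole file into one string, using the platform default charset.
public static String textToString(File f) {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader even if reading fails part-way
    try (BufferedReader br = new BufferedReader(new FileReader(f))) {
        String str;
        while ((str = br.readLine()) != null) {
            sb.append(str).append(System.lineSeparator());
        }
    } catch (IOException e) {
        // On error, report it and return whatever was read so far
        e.printStackTrace();
    }
    return sb.toString();
}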
Next, use IndexReader to look up the term frequencies.
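import java.io.IOException;
import java.nio.file.Paths;
import java.util.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.BytesRef;

public static void main(String[] args) throws IOException {
    // Open the index written by the indexing program; both resources are closed on exit
    try (Directory directory = FSDirectory.open(Paths.get("indexdir"));
         IndexReader reader = DirectoryReader.open(directory)) {
        // Term vector of the "content" field of document 0, the only document in the index
        Terms terms = reader.getTermVector(0, "content");
        TermsEnum termsEnum = terms.iterator();
        Map<String, Integer> map = new HashMap<>();
        BytesRef thisTerm;
        while ((thisTerm = termsEnum.next()) != null) {
            String termText = thisTerm.utf8ToString();
            // On a term vector, totalTermFreq() is the term's frequency within this document
            map.put(termText, (int) termsEnum.totalTermFreq());
        }
        // Sort by frequency, highest first; Integer.compare avoids subtraction overflow
        List<Map.Entry<String, Integer>> sortedMap = new ArrayList<>(map.entrySet());
        sortedMap.sort((o1, o2) -> Integer.compare(o2.getValue(), o1.getValue()));
        // Print the top 10, or fewer if the text has fewer distinct terms
        for (int i = 0; i < Math.min(10, sortedMap.size()); i++) {
            System.out.println(sortedMap.get(i).getKey() + " : " + sortedMap.get(i).getValue());
        }
    }
}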
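The sort-and-take-top-10 step can also be pulled out into a small helper that is independent of Lucene. The following is a sketch under assumptions: the class name TopTermsSketch and the method topN are made up for illustration, and breaking ties alphabetically is a choice added here (the program above leaves the order of equal counts unspecified). It returns fewer than n entries when the map is smaller than n.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopTermsSketch {
    // Returns at most n entries (n >= 0), ordered by count descending;
    // terms with equal counts are ordered alphabetically.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> freq, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up counts, only to exercise the helper
        Map<String, Integer> freq = new HashMap<>();
        freq.put("lucene", 5);
        freq.put("index", 3);
        freq.put("term", 5);
        for (Map.Entry<String, Integer> e : topN(freq, 10)) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

Passing the map built from the TermsEnum loop to topN(map, 10) would give the same kind of list the program prints, with a deterministic order for ties.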
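Running the IndexReader program prints the ten most frequent terms of the news article, one per line, in the form term : count.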