Grouping and paginating search results with Lucene in Java

Z时代
2024-01-10
Category: IT

Grouping search results with GroupingSearch

Package org.apache.lucene.search.grouping Description

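This module groups Lucene search results: documents that share the same value in a specified single-valued field are collected together. For example, if you group by the "author" field, all documents with the same "author" value end up in one group.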

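Grouping requires a few inputs:

1. groupField: the field to group by. For example, if you group by the "author" field, every book in a group has the same author. Documents that lack this field are all placed into one separate group.

2. groupSort: how the groups are sorted.

3. topNGroups: how many top groups to keep. For example, 10 means only the top 10 groups are kept.

4. groupOffset: which slice of the top groups to retrieve. For example, 3 means the first 3 groups are skipped and 7 groups come back (assuming topNGroups is 10). This is useful for paging, for example when each page shows only 5 groups.

5. withinGroupSort: how the documents inside each group are sorted. Note the difference from groupSort, which orders the groups themselves.

6. withinGroupOffset: which slice of the top documents inside each group to retrieve.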
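The way these inputs interact can be sketched on in-memory data without Lucene. This is only an illustration under assumptions: the class GroupingParamsSketch, the Doc type, and the maxDocsPerGroup parameter are hypothetical names introduced here, groups are ordered by their best-scoring document, and documents inside a group are ordered by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingParamsSketch {

    /** Toy stand-in for an indexed document: a group field value and a relevance score. */
    static final class Doc {
        final int id;
        final String author;
        final double score;

        Doc(int id, String author, double score) {
            this.id = id;
            this.author = author;
            this.score = score;
        }
    }

    /**
     * Groups docs by author and returns the ids of the docs in the requested slice.
     * Of the top topNGroups groups, the first groupOffset are skipped; of the top
     * maxDocsPerGroup docs in each group, the first withinGroupOffset are skipped.
     */
    static Map<String, List<Integer>> group(List<Doc> docs, int topNGroups, int groupOffset,
                                            int maxDocsPerGroup, int withinGroupOffset) {
        // Sorting by score first means each group is created when its best doc is seen
        // (the assumed groupSort) and docs are appended in score order (the assumed withinGroupSort).
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());

        Map<String, List<Integer>> all = new LinkedHashMap<>();
        for (Doc d : sorted) {
            all.computeIfAbsent(d.author, k -> new ArrayList<>()).add(d.id);
        }

        Map<String, List<Integer>> slice = new LinkedHashMap<>();
        int index = 0;
        for (Map.Entry<String, List<Integer>> e : all.entrySet()) {
            if (index >= topNGroups) {
                break; // only the top N groups are kept
            }
            if (index >= groupOffset) {
                List<Integer> ids = e.getValue();
                int to = Math.min(maxDocsPerGroup, ids.size());
                int from = Math.min(withinGroupOffset, to);
                slice.put(e.getKey(), new ArrayList<>(ids.subList(from, to)));
            }
            index++;
        }
        return slice;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "alice", 0.9), new Doc(2, "bob", 0.8),
                new Doc(3, "alice", 0.7), new Doc(4, "carol", 0.6),
                new Doc(5, "bob", 0.5), new Doc(6, "dave", 0.4));
        // Keep the top 3 groups, skip the first one, keep up to 2 docs per group.
        System.out.println(group(docs, 3, 1, 2, 0)); // prints {bob=[2, 5], carol=[4]}
    }
}
```

With topNGroups set to 3 and groupOffset set to 1, two groups come back, which mirrors the "10 and 3 gives 7" arithmetic in the list above.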

Grouping search results with GroupingSearch is fairly simple.

The GroupingSearch API documentation describes the class as follows:

Convenience class to perform grouping in a non distributed environment.

WARNING: This API is experimental and might change in incompatible ways in the next release.

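Version 4.3.1 is used here.

Some important methods:

GroupingSearch.setCaching(int maxDocsToCache, boolean cacheScores): enables caching, limited to maxDocsToCache documents.

GroupingSearch.setCachingInMB(double maxCacheRAMMB, boolean cacheScores): caches the hits collected by the first search pass so the second pass can reuse them, limited to maxCacheRAMMB megabytes.

GroupingSearch.setGroupDocsLimit(int groupDocsLimit): sets how many documents are returned per group. If it is not set, one document per group is returned.

GroupingSearch.setGroupSort(Sort groupSort): sets how the groups are sorted.

Example code:

1. First, the indexing code: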


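import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
// IKAnalyzer is a third-party Chinese analyzer (IK Analyzer), not part of Lucene itself
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IndexHelper {
  private Directory directory;

  // Lazily create a single in-memory directory shared by writers and searchers
  public Directory getDirectory() {
    directory = (directory == null) ? new RAMDirectory() : directory;
    return directory;
  }

  private IndexWriterConfig getConfig() {
    return new IndexWriterConfig(Version.LUCENE_43, new IKAnalyzer(true));
  }

  // Let the IOException propagate instead of returning null, which would only cause a NullPointerException later
  private IndexWriter getIndexWriter() throws IOException {
    return new IndexWriter(getDirectory(), getConfig());
  }

  public IndexSearcher getIndexSearcher() throws IOException {
    return new IndexSearcher(DirectoryReader.open(getDirectory()));
  }

  /**
   * Create index for group test
   * @param id
   * @param author
   * @param content
   */
  public void createIndexForGroup(int id, String author, String content) {
    Document document = new Document();
    document.add(new IntField("id", id, Field.Store.YES));
    // StringField is indexed as a single token, so "author" can serve as the group field
    document.add(new StringField("author", author, Field.Store.YES));
    document.add(new TextField("content", content, Field.Store.YES));
    // A new writer is opened and closed for every document; fine for a small test, slow for bulk indexing
    try (IndexWriter indexWriter = getIndexWriter()) {
      indexWriter.addDocument(document);
      indexWriter.commit();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}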

2. Grouping:


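import java.io.IOException;

import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class GroupTest {
  public void group(IndexSearcher indexSearcher, String groupField, String content) throws IOException, ParseException {
    GroupingSearch groupingSearch = new GroupingSearch(groupField);
    // Order groups by relevance score
    groupingSearch.setGroupSort(new Sort(SortField.FIELD_SCORE));
    groupingSearch.setFillSortFields(true);
    // Cache first-pass hits (up to 4 MB, including scores) for the second pass
    groupingSearch.setCachingInMB(4.0, true);
    groupingSearch.setAllGroups(true);
    //groupingSearch.setAllGroupHeads(true);
    // Return up to 10 documents per group instead of the default 1
    groupingSearch.setGroupDocsLimit(10);
    QueryParser parser = new QueryParser(Version.LUCENE_43, "content", new IKAnalyzer(true));
    Query query = parser.parse(content);
    // Group offset 0, at most 1000 groups
    TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, 0, 1000);
    System.out.println("Total hits: " + result.totalHitCount);
    System.out.println("Groups returned: " + result.groups.length);
    for (GroupDocs<BytesRef> groupDocs : result.groups) {
      // groupValue is null for the group of documents that lack the group field
      String groupValue = groupDocs.groupValue == null ? "(none)" : groupDocs.groupValue.utf8ToString();
      System.out.println("Group: " + groupValue);
      System.out.println("Documents in group: " + groupDocs.totalHits);
      //System.out.println("groupDocs.scoreDocs.length:" + groupDocs.scoreDocs.length);
      for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
        System.out.println(indexSearcher.doc(scoreDoc.doc));
      }
    }
  }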

3. A simple test:


public static void main(String[] args) throws IOException, ParseException {
    IndexHelper indexHelper = new IndexHelper();
    indexHelper.createIndexForGroup(1,"红薯", "开源中国");
    indexHelper.createIndexForGroup(2,"红薯", "开源社区");
    indexHelper.createIndexForGroup(3,"红薯", "代码设计");
    indexHelper.createIndexForGroup(4,"红薯", "设计");
    indexHelper.createIndexForGroup(5,"觉先", "Lucene开发");
    indexHelper.createIndexForGroup(6,"觉先", "Lucene实战");
    indexHelper.createIndexForGroup(7,"觉先", "开源Lucene");
    indexHelper.createIndexForGroup(8,"觉先", "开源solr");
    indexHelper.createIndexForGroup(9,"散仙", "散仙开源Lucene");
    indexHelper.createIndexForGroup(10,"散仙", "散仙开源solr");
    indexHelper.createIndexForGroup(11,"散仙", "开源");
    GroupTest groupTest = new GroupTest();
    groupTest.group(indexHelper.getIndexSearcher(),"author", "开源");
  }
}
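The test above requests the first 1000 groups in a single call. To page through groups instead, the page number has to be turned into a group offset. The helper below is a minimal sketch of that arithmetic; GroupPaging and its method names are hypothetical. It assumes the third and fourth arguments of GroupingSearch.search are the group offset and the number of groups to return, as in the call used above, and that the total number of groups is available from the TopGroups result because setAllGroups(true) was set.

```java
public class GroupPaging {

    /** Offset of the first group on a 1-based page. */
    static int groupOffset(int page, int groupsPerPage) {
        if (page < 1 || groupsPerPage < 1) {
            throw new IllegalArgumentException("page and groupsPerPage must be at least 1");
        }
        return (page - 1) * groupsPerPage;
    }

    /** Number of pages needed to show totalGroupCount groups. */
    static int pageCount(int totalGroupCount, int groupsPerPage) {
        if (totalGroupCount < 0 || groupsPerPage < 1) {
            throw new IllegalArgumentException("invalid totalGroupCount or groupsPerPage");
        }
        return (totalGroupCount + groupsPerPage - 1) / groupsPerPage; // ceiling division
    }

    public static void main(String[] args) {
        int groupsPerPage = 5;
        int totalGroupCount = 11; // assumed value; in Lucene it would come from the grouped result
        for (int page = 1; page <= pageCount(totalGroupCount, groupsPerPage); page++) {
            // These two numbers would be passed as the last two arguments of
            // groupingSearch.search(indexSearcher, query, offset, groupsPerPage).
            System.out.println("page " + page + ": offset=" + groupOffset(page, groupsPerPage)
                    + ", limit=" + groupsPerPage);
        }
    }
}
```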

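4. Test results:

Two ways to paginate

Lucene search results can be paginated in two ways:

1. Paginate the search results directly. This works when the result set is fairly small. The core of the paging code looks like this: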


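ScoreDoc[] sd = XXX; // placeholder: the hits of a search that has already been run
// Index of the first record on the page (currentPage is 1-based)
int begin = pageSize * (currentPage - 1);
// Index just past the last record on the page, never beyond the end of the array
int end = Math.min(begin + pageSize, sd.length);
for (int i = begin; i < end; i++) {
  // code that processes search result sd[i]
}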
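The same begin/end arithmetic can be tried on a plain list. This is a minimal sketch, not Lucene code: OffsetPaging and page are hypothetical names, and the list of integers stands in for the ScoreDoc array.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OffsetPaging {

    /** Returns the hits on a 1-based page, or an empty list when the page is past the end. */
    static <T> List<T> page(List<T> hits, int currentPage, int pageSize) {
        if (currentPage < 1 || pageSize < 1) {
            throw new IllegalArgumentException("currentPage and pageSize must be at least 1");
        }
        // long arithmetic avoids int overflow for very large page numbers
        long begin = (long) pageSize * (currentPage - 1);
        if (begin >= hits.size()) {
            return Collections.emptyList();
        }
        int end = (int) Math.min(begin + pageSize, hits.size());
        return hits.subList((int) begin, end);
    }

    public static void main(String[] args) {
        List<Integer> hits = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(page(hits, 1, 3)); // prints [1, 2, 3]
        System.out.println(page(hits, 3, 3)); // prints [7]
        System.out.println(page(hits, 4, 3)); // prints []
    }
}
```

Every page requires the search to collect all hits up to the end of that page, which is why this approach suits small result sets.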

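2. Use searchAfter(...)

Lucene provides five overloads to choose from as needed. The main parameters are:

ScoreDoc after: the last ScoreDoc of the previous page, that is, the element at index length - 1 of the previous result array. Pass null for the first page.

Query query: the query to run.

int n: the number of results to return per call, that is, the page size.

A simple usage example: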


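// A Map can hold whatever the caller needs from this search
Map<String, Object> resultMap = new HashMap<String, Object>();
// null for the first page; otherwise the value saved from the previous page
ScoreDoc after = null;
Query query = XX; // placeholder: the query to run
// search is an IndexSearcher, size is the page size
TopDocs td = search.searchAfter(after, query, size);
// Total number of hits
resultMap.put("num", td.totalHits);
ScoreDoc[] sd = td.scoreDocs;
for (ScoreDoc scoreDoc : sd) {
  // process each hit as usual
}
// The last ScoreDoc on this page (index length - 1) is the cursor for the next page
if (sd.length > 0) {
  after = sd[sd.length - 1];
}
// Save after for the next search, i.e. the start of the next page
resultMap.put("after", after);
return resultMap;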
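The cursor idea behind searchAfter can be illustrated without Lucene. The sketch below is an assumption-laden model, not Lucene's implementation: SearchAfterSketch and Hit are hypothetical names, hits are ordered by descending score with ties broken by ascending document id, and each call returns the next n hits that sort strictly after the cursor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SearchAfterSketch {

    /** Toy stand-in for ScoreDoc: a document id and its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return "doc=" + doc + " score=" + score;
        }
    }

    // Higher score first; equal scores are ordered by ascending doc id.
    static final Comparator<Hit> ORDER =
            Comparator.comparingDouble((Hit h) -> -h.score).thenComparingInt(h -> h.doc);

    /** Returns up to n hits that sort strictly after the cursor; a null cursor means the first page. */
    static List<Hit> searchAfter(List<Hit> allHits, Hit after, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<Hit> sorted = new ArrayList<>(allHits);
        sorted.sort(ORDER);
        List<Hit> page = new ArrayList<>();
        for (Hit h : sorted) {
            if (after != null && ORDER.compare(h, after) <= 0) {
                continue; // at or before the cursor: already shown on an earlier page
            }
            page.add(h);
            if (page.size() == n) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit(0, 1.0f), new Hit(1, 0.5f),
                new Hit(2, 1.0f), new Hit(3, 0.5f), new Hit(4, 0.2f));
        Hit after = null;
        while (true) {
            List<Hit> page = searchAfter(hits, after, 2);
            if (page.isEmpty()) {
                break; // no more pages
            }
            System.out.println(page); // pages hold docs 0 and 2, then 1 and 3, then 4
            after = page.get(page.size() - 1); // last hit becomes the next cursor
        }
    }
}
```

Because the caller only carries the last hit forward, each request asks for a single page regardless of how deep the user has paged.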