Lucene: a query built through the query API and the same query parsed from a string do not produce the same results

Z时代
2024-01-10
Category: Q&A

I have the following code:

public static void main(String[] args) throws Throwable { 
    String[] texts = new String[]{ 
      "starts_with k mer", 
      "starts_with mer", 
      "starts_with bleue est mer", 
      "starts_with mer est bleue", 
      "starts_with mer bla1 bla2 bla3 bla4 bla5", 
      "starts_with bleue est la mer", 
      "starts_with la mer est bleue", 
      "starts_with la mer" 
    }; 
    //write: 
    Set<String> stopWords = new HashSet<String>(); 
    StandardAnalyzer stdAn = new StandardAnalyzer(Version.LUCENE_36, stopWords); 
    Directory fsDir = FSDirectory.open(INDEX_DIR); 
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_36,stdAn); 
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE); 
    IndexWriter indexWriter = new IndexWriter(fsDir,iwConf); 
    for(String text:texts) { 
     Document document = new Document(); 
     document.add(new Field("title",text,Store.YES,Index.ANALYZED)); 
     indexWriter.addDocument(document); 
    } 
    indexWriter.commit(); 
    //read 
    IndexReader indexReader = IndexReader.open(fsDir); 
    IndexSearcher indexSearcher = new IndexSearcher(indexReader); 
    //get query: 
    //Query query = getQueryFromString("mer"); 
    Query query = getQueryFromAPI("mer"); 
    //explain 
    System.out.println("======== Query: "+query+"\n"); 
    TopDocs hits = indexSearcher.search(query, 10); 
    for (ScoreDoc scoreDoc : hits.scoreDocs) { 
     Document doc = indexSearcher.doc(scoreDoc.doc); 
     System.out.println(">>> "+doc.get("title")); 
     System.out.println("Explain:"); 
     System.out.println(indexSearcher.explain(query, scoreDoc.doc)); 
    } 
} 
private static Query getQueryFromString(String searchString) throws Throwable { 
    Set<String> stopWords = new HashSet<String>(); 
    Query query = new QueryParser(Version.LUCENE_36, "title",new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse("("+searchString+") \"STARTS_WITH "+searchString+"\""); 
    return query; 
} 
private static Query getQueryFromAPI(String searchString) throws Throwable { 
    Set<String> stopWords = new HashSet<String>(); 
    Query searchStringTermsMatchTitle = new QueryParser(Version.LUCENE_36, "title", new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse(searchString); 
    PhraseQuery titleStartsWithSearchString = new PhraseQuery(); 
    titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()+" "+searchString)); 
    BooleanQuery query = new BooleanQuery(true); 
    BooleanClause matchClause = new BooleanClause(searchStringTermsMatchTitle, Occur.SHOULD); 
    query.add(matchClause);  
    BooleanClause startsWithClause = new BooleanClause(titleStartsWithSearchString, Occur.SHOULD); 
    query.add(startsWithClause); 
    return query; 
}

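Basically, I index some strings, and then I have two methods that create a Lucene query from user input: one simply builds the corresponding Lucene query string "by hand" (via string concatenation), and the other uses Lucene's API to build the query. They appear to build the same query, since the debug output shows exactly the same query string for both, but the search results are not the same: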

Running the query built via string concatenation yields (for the parameter "mer"):
title:mer title:"starts_with mer"

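In this case, indeed, when I search with it, the first results I get are the documents that match the title:"starts_with mer" part. Here is the explain for the first result: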

>>> starts_with mer 
Explain: 
1.2329358 = (MATCH) sum of: 
    0.24658716 = (MATCH) weight(title:mer in 1), product of: 
    0.4472136 = queryWeight(title:mer), product of: 
     0.882217 = idf(docFreq=8, maxDocs=8) 
     0.50692016 = queryNorm 
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of: 
     1.0 = tf(termFreq(title:mer)=1) 
     0.882217 = idf(docFreq=8, maxDocs=8) 
     0.625 = fieldNorm(field=title, doc=1) 
    0.9863486 = (MATCH) weight(title:"starts_with mer" in 1), product of: 
    0.8944272 = queryWeight(title:"starts_with mer"), product of: 
     1.764434 = idf(title: starts_with=8 mer=8) 
     0.50692016 = queryNorm 
    1.1027713 = fieldWeight(title:"starts_with mer" in 1), product of: 
     1.0 = tf(phraseFreq=1.0) 
     1.764434 = idf(title: starts_with=8 mer=8) 
     0.625 = fieldNorm(field=title, doc=1)

Running the query built with Lucene's query helper classes yields an apparently identical query:
title:mer title:"starts_with mer"

But this time the results are not the same, because the title:"starts_with mer" part is in fact not matched. Here is the explain for the first result:

>>> starts_with mer 
Explain: 
0.15185544 = (MATCH) sum of: 
    0.15185544 = (MATCH) weight(title:mer in 1), product of: 
    0.27540696 = queryWeight(title:mer), product of: 
     0.882217 = idf(docFreq=8, maxDocs=8) 
     0.312176 = queryNorm 
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of: 
     1.0 = tf(termFreq(title:mer)=1) 
     0.882217 = idf(docFreq=8, maxDocs=8) 
     0.625 = fieldNorm(field=title, doc=1)

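My question is: why don't I get the same results? I would really like to be able to use the query helper classes here, especially because there is the BooleanQuery(disableCoord) option that I want to use and that I don't know how to express directly in a Lucene query string. (Yes, my example passes "true" there; I have also tried "false", with the same result.)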

=== UPDATE

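femtoRgon's answer is great: the problem was that I was adding the whole search string as a single term, instead of first splitting it into terms and then adding each of them to the query.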

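The answer femtoRgon gives works fine if the input string consists of a single term: in that case, adding the "STARTS_WITH" text as one term and then adding the search string as a second term works.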

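However, if the user input would be tokenized into more than one term, you must first split it into terms (preferably with the same analyzer and/or tokenizer you used at index time, to get consistent results) and then add each term to the query.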

What I ended up doing was writing a function that splits the query string into terms, using the same analyzer I used for indexing:

private static List<String> getTerms(String text) throws Throwable { 
    Analyzer analyzer = getAnalyzer();  
    StringReader textReader = new StringReader(text); 
    TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME_TITLE, textReader); 
    tokenStream.reset();   
    List<String> terms = new ArrayList<String>(); 
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class); 
    while (tokenStream.incrementToken()) { 
     String term = charTermAttribute.toString(); 
     terms.add(term); 
    } 
    tokenStream.end();   // signal end of stream before closing it
    tokenStream.close(); 
    textReader.close(); 
    analyzer.close();  
    return terms; 
}

Then I first added "STARTS_WITH" as one term, and then each element of the list as a separate term:

PhraseQuery titleStartsWithSearchString = new PhraseQuery(); 
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase())); 
for(String term:getTerms(searchString)) { 
    titleStartsWithSearchString.add(new Term("title",term)); 
}
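
To illustrate why the split matters, here is a small self-contained sketch in plain Java. It does not use Lucene: analyze is a hypothetical stand-in for the analyzer (it only lower-cases and splits on whitespace), and matchesPhrase is a simplified model of an exact phrase match with no slop. All class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for analysis and exact phrase matching; not Lucene code.
class PhraseTermsSketch {

    // Hypothetical analyzer: lower-case the text and split it on whitespace.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Phrase terms: the marker first, then each analyzed term of the input.
    static List<String> phraseTerms(String searchString) {
        List<String> phrase = new ArrayList<String>();
        phrase.add("starts_with");
        phrase.addAll(analyze(searchString));
        return phrase;
    }

    // True if the phrase terms occur at consecutive positions in the document.
    static boolean matchesPhrase(List<String> docTerms, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= docTerms.size(); start++) {
            if (docTerms.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = analyze("starts_with la mer est bleue");
        // Whole input as one term: no indexed term equals "starts_with la mer".
        System.out.println(matchesPhrase(doc, Arrays.asList("starts_with la mer")));
        // One term per token: the three terms are found at positions 0..2.
        System.out.println(matchesPhrase(doc, phraseTerms("la mer")));
    }
}
```

In this model the first call reports no match and the second reports a match, which mirrors the difference between adding the whole input as one Term and adding one Term per token.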

Answer:

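I believe the problem you are running into is that you are adding the entire phrase to your PhraseQuery as a single term. In the index, as well as in the query parsed by QueryParser, it is split into the terms "starts_with" and "mer", which must be found consecutively. In the query you construct yourself, however, the PhraseQuery contains only one term, and "starts_with mer" does not exist as a single term in the index.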
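
The two explain outputs in the question are consistent with this. As a rough check, assuming the scoring follows the classic DefaultSimilarity formulas of Lucene 3.x (idf = 1 + ln(numDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared clause weights), with all boosts equal to 1), the sketch below recomputes both queryNorm values. The class and method names are made up for illustration. The second value only comes out right if the second clause is scored as a single term with docFreq = 0, which is what one would expect for a term such as "starts_with mer" that was never indexed.

```java
// Sketch: re-derive queryNorm from the explain outputs, assuming the
// classic TF-IDF formulas of Lucene 3.x DefaultSimilarity and boosts of 1.
class QueryNormCheck {

    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // queryNorm = 1 / sqrt(sum of squared clause idf values)
    static double queryNorm(double... clauseIdfs) {
        double sumOfSquares = 0.0;
        for (double w : clauseIdfs) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        int numDocs = 8;
        double merIdf = idf(8, numDocs);          // "mer" occurs in all 8 documents
        double phraseIdf = 2 * merIdf;            // idf("starts_with") + idf("mer"), both docFreq=8
        double missingTermIdf = idf(0, numDocs);  // a term that occurs in no document

        // Parsed query: title:mer plus a real two-term phrase.
        System.out.println(queryNorm(merIdf, phraseIdf));
        // API-built query: title:mer plus one term that is not in the index.
        System.out.println(queryNorm(merIdf, missingTermIdf));
    }
}
```

Run as written, it should print values close to the 0.50692016 and 0.312176 shown in the two explain outputs, up to float rounding.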

You should be able to change the bit where you construct the PhraseQuery to:

PhraseQuery titleStartsWithSearchString = new PhraseQuery(); 
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase())); 
titleStartsWithSearchString.add(new Term("title",searchString));
