Lucene: a query parsed from a string and the same query built through the query API do not produce the same results
I have the following code:
public static void main(String[] args) throws Throwable {
    String[] texts = new String[]{
        "starts_with k mer",
        "starts_with mer",
        "starts_with bleue est mer",
        "starts_with mer est bleue",
        "starts_with mer bla1 bla2 bla3 bla4 bla5",
        "starts_with bleue est la mer",
        "starts_with la mer est bleue",
        "starts_with la mer"
    };
    //write:
    Set<String> stopWords = new HashSet<String>();
    StandardAnalyzer stdAn = new StandardAnalyzer(Version.LUCENE_36, stopWords);
    Directory fsDir = FSDirectory.open(INDEX_DIR);
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_36, stdAn);
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
    for (String text : texts) {
        Document document = new Document();
        document.add(new Field("title", text, Store.YES, Index.ANALYZED));
        indexWriter.addDocument(document);
    }
    indexWriter.commit();
    //read
    IndexReader indexReader = IndexReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    //get query:
    //Query query = getQueryFromString("mer");
    Query query = getQueryFromAPI("mer");
    //explain
    System.out.println("======== Query: " + query + "\n");
    TopDocs hits = indexSearcher.search(query, 10);
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println(">>> " + doc.get("title"));
        System.out.println("Explain:");
        System.out.println(indexSearcher.explain(query, scoreDoc.doc));
    }
}
private static Query getQueryFromString(String searchString) throws Throwable {
    Set<String> stopWords = new HashSet<String>();
    Query query = new QueryParser(Version.LUCENE_36, "title", new StandardAnalyzer(Version.LUCENE_36, stopWords))
        .parse("(" + searchString + ") \"STARTS_WITH " + searchString + "\"");
    return query;
}
private static Query getQueryFromAPI(String searchString) throws Throwable {
    Set<String> stopWords = new HashSet<String>();
    Query searchStringTermsMatchTitle = new QueryParser(Version.LUCENE_36, "title", new StandardAnalyzer(Version.LUCENE_36, stopWords))
        .parse(searchString);
    PhraseQuery titleStartsWithSearchString = new PhraseQuery();
    titleStartsWithSearchString.add(new Term("title", "STARTS_WITH".toLowerCase() + " " + searchString));
    BooleanQuery query = new BooleanQuery(true);
    BooleanClause matchClause = new BooleanClause(searchStringTermsMatchTitle, Occur.SHOULD);
    query.add(matchClause);
    BooleanClause startsWithClause = new BooleanClause(titleStartsWithSearchString, Occur.SHOULD);
    query.add(startsWithClause);
    return query;
}
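Basically, I index a few strings, and then I have two methods that create a Lucene query from user input: one builds the corresponding Lucene query string "manually" (through string concatenation), and the other uses Lucene's API to build the query. They appear to build the same query, since the debug output shows exactly the same query string for both, but the search results are not the same.
Running the query built through string concatenation (with the argument "mer") yields:
title:mer title:"starts_with mer"
In this case, indeed, when I search with it, the documents matching the title:"starts_with mer" part come first. Here is the explain of the first result: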
>>> starts_with mer
Explain:
1.2329358 = (MATCH) sum of:
  0.24658716 = (MATCH) weight(title:mer in 1), product of:
    0.4472136 = queryWeight(title:mer), product of:
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.50692016 = queryNorm
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
      1.0 = tf(termFreq(title:mer)=1)
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.625 = fieldNorm(field=title, doc=1)
  0.9863486 = (MATCH) weight(title:"starts_with mer" in 1), product of:
    0.8944272 = queryWeight(title:"starts_with mer"), product of:
      1.764434 = idf(title: starts_with=8 mer=8)
      0.50692016 = queryNorm
    1.1027713 = fieldWeight(title:"starts_with mer" in 1), product of:
      1.0 = tf(phraseFreq=1.0)
      1.764434 = idf(title: starts_with=8 mer=8)
      0.625 = fieldNorm(field=title, doc=1)
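Running the query built through Lucene's query helper tools produces an apparently identical query:
title:mer title:"starts_with mer"
But this time the results are not the same, because the title:"starts_with mer" part does not actually match. Here is the explain of the first result: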
>>> starts_with mer
Explain:
0.15185544 = (MATCH) sum of:
  0.15185544 = (MATCH) weight(title:mer in 1), product of:
    0.27540696 = queryWeight(title:mer), product of:
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.312176 = queryNorm
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
      1.0 = tf(termFreq(title:mer)=1)
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.625 = fieldNorm(field=title, doc=1)
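My question is: why don't I get the same results? I would really like to be able to use the query helper tools here, especially because there is the BooleanQuery(disableCoord) option that I want to use, and I don't know how to express it directly in a Lucene query string. (Yes, my example passes "true" there; I also tried "false", with the same result.)
=== UPDATE
femtoRgon's answer is great: the problem was that I was adding the whole search string as a single term, instead of first splitting it into terms and then adding each of them to the query.
If the input string consists of a single term, femtoRgon's answer works as given: add the "STARTS_WITH" text as one term, then add the search string as a second term.
However, if the user input would be tokenized into more than one term, you have to split it into terms first (preferably with the same analyzer and/or tokenizer you used at indexing time, to get consistent results) and then add each term to the query.
What I ended up doing was writing a function that splits the query string into terms, using the same analyzer I use for indexing: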
private static List<String> getTerms(String text) throws Throwable {
    // getAnalyzer() returns the same analyzer that was used for indexing
    Analyzer analyzer = getAnalyzer();
    StringReader textReader = new StringReader(text);
    TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME_TITLE, textReader);
    tokenStream.reset();
    List<String> terms = new ArrayList<String>();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // collect every token the analyzer emits for the input text
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        terms.add(term);
    }
    // signal end of stream before closing, as the TokenStream contract expects
    tokenStream.end();
    textReader.close();
    tokenStream.close();
    analyzer.close();
    return terms;
}
Then I first add "STARTS_WITH" as one term, and then each element of the list as a separate term:
PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title", "STARTS_WITH".toLowerCase()));
for (String term : getTerms(searchString)) {
    titleStartsWithSearchString.add(new Term("title", term));
}
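Answer:
I believe the problem you are running into is that you are adding the whole phrase to your PhraseQuery as a single term. In the index, as well as in the query parsed by QueryParser, this text is split into the terms "starts_with" and "mer", which must be found consecutively. In the query you build yourself, however, the PhraseQuery contains only one term, and "starts_with mer" is not a single term in the index.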
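The explain output in the question appears consistent with this. As a rough check, assuming the default similarity's formulas (idf = 1 + ln(numDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared clause idfs), all boosts equal to 1), a second clause whose single term has docFreq = 0 reproduces the queryNorm of 0.312176 reported for the API-built query, while a two-term phrase reproduces 0.50692016. The class and method names below are made up for illustration; this is plain arithmetic, not Lucene code.

```java
class QueryNormCheck {
    // Assumed default-similarity idf: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Assumed queryNorm: 1 / sqrt(sum of squared clause idfs), boosts = 1
    static double queryNorm(double... clauseIdfs) {
        double sumOfSquares = 0.0;
        for (double v : clauseIdfs) {
            sumOfSquares += v * v;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        int numDocs = 8;
        double idfMer = idf(8, numDocs);        // "mer" occurs in all 8 documents
        double idfStartsWith = idf(8, numDocs); // so does "starts_with"

        // Parsed query: the phrase idf is the sum of the idfs of its two terms.
        double parsedPhraseIdf = idfStartsWith + idfMer;
        // API-built query: "starts_with mer" as ONE term, which no document contains.
        double fusedTermIdf = idf(0, numDocs);

        System.out.println("parsed queryNorm ~ " + queryNorm(idfMer, parsedPhraseIdf));
        System.out.println("API    queryNorm ~ " + queryNorm(idfMer, fusedTermIdf));
    }
}
```

The two printed values come out at roughly 0.5069 and 0.3122, matching the two explain listings, which suggests the second clause was looked up as a term that occurs in no document.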
You should be able to change the part where you build the PhraseQuery to:
PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title", "STARTS_WITH".toLowerCase()));
titleStartsWithSearchString.add(new Term("title", searchString));
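To see why one term per token matters, here is a toy positional index in plain Java. It is only a sketch of the idea, not how Lucene is implemented: `ToyPositionalIndex`, `analyze`, and `matchesPhrase` are hypothetical names, and the whitespace-splitting `analyze` is a stand-in for a real analyzer. A phrase matches only if each of its terms exists in the index at consecutive positions, so a "term" that contains a space can never be found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ToyPositionalIndex {
    // term -> (docId -> positions of that term in the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    void add(int docId, String text) {
        List<String> tokens = analyze(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Exact phrase match: term i must occur at position start + i.
    boolean matchesPhrase(int docId, List<String> terms) {
        if (terms.isEmpty()) return false;
        Map<Integer, List<Integer>> first = postings.get(terms.get(0));
        if (first == null || !first.containsKey(docId)) return false;
        for (int start : first.get(docId)) {
            boolean ok = true;
            for (int i = 1; i < terms.size() && ok; i++) {
                Map<Integer, List<Integer>> p = postings.get(terms.get(i));
                ok = p != null && p.containsKey(docId) && p.get(docId).contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyPositionalIndex index = new ToyPositionalIndex();
        index.add(0, "starts_with la mer est bleue");

        // The whole phrase as one term: no such term was ever indexed.
        List<String> fused = Arrays.asList("starts_with la mer");
        System.out.println("fused term matches: " + index.matchesPhrase(0, fused));

        // One term per token, produced by the same analysis used for indexing.
        List<String> split = new ArrayList<>();
        split.add("starts_with");
        split.addAll(analyze("la mer"));
        System.out.println("split terms match:  " + index.matchesPhrase(0, split));
    }
}
```

In this toy model the fused term does not match and the split terms do, which mirrors the difference between the hand-built PhraseQuery and the one produced by QueryParser.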