This post is about Apache Lucene, a “high-performance, full-featured text search engine library written entirely in Java”. If you have no idea what I am talking about, this tutorial is not for you :). Be advised that this is my first month using Lucene, so there is still a chance that everything I say here is just plain wrong :P. Also, I am currently using version 3.6.1.
While doing an assignment for my Information Retrieval class, I was faced with the problem of creating my own scoring class in Lucene. When you create a new IndexSearcher, Lucene uses DefaultSimilarity by default, which is essentially cosine similarity (in a Vector Space Model) combined with several weights, such as boosts given at indexing time, boosts given in the query, tf*idf, and the document length norm. A description of exactly how it works can be found in the Similarity class documentation and in the Lucene scoring documentation.
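To give a feeling for how those weights combine, here is a plain-Java sketch of the main per-term factors used by DefaultSimilarity in Lucene 3.x (tf, idf, and the length norm). This is a simplification: it ignores queryNorm, coord, and all boosts, and the numbers in main are made up.

```java
public class DefaultSimilaritySketch {

    // tf: square root of the term's frequency in the document
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: rarer terms (lower docFreq) score higher across the index
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    // lengthNorm: shorter fields get a higher norm
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A term appearing 4 times in a 100-term field,
        // present in 10 out of 1000 indexed documents.
        // Note the idf is squared in the practical scoring formula.
        float idf = idf(10, 1000);
        float score = tf(4) * idf * idf * lengthNorm(100);
        System.out.println(score);
    }
}
```

The idf appears squared because it contributes once on the query side and once on the document side of the cosine.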
The Lucene developers have put a lot of effort into finding a good similarity function, and their DefaultSimilarity works quite well in most cases. However, for one reason or another, you may still want to use your own custom function.
When I searched for how to do this, all the results I found were about customizing the existing similarity function by extending the class and overriding its methods to change (or “disable”) some of the weights. Such examples can be found on LuceneTutorial.com and in blog posts, and they all look like this:
```java
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new DefaultSimilarity() {
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1.0f;
    }
});
```
However, I wanted to implement my own simple similarity function entirely from scratch. It was not clear to me which method my class had to implement/override and how I should plug it into Lucene’s workflow.
After spending some hours looking around the internet, I finally found a solution in the book “Lucene in Action”, by Erik Hatcher. Do not ask me where I found an online copy of the book.
```java
public static class MyOwnScoreQuery extends CustomScoreQuery {

    private Query query;

    public MyOwnScoreQuery(Query query) {
        super(query);
        this.query = query;
    }

    @Override
    public CustomScoreProvider getCustomScoreProvider(final IndexReader reader) {
        return new CustomScoreProvider(reader) {
            @Override
            public float customScore(int doc, float subQueryScore, float valSrcScore)
                    throws IOException {
                // Insert your math here
                return 1f;
            }
        };
    }
}
```
Looking at it now, it does not seem very complicated. But arriving at this only by reading the documentation was impossible for me. As a proof of concept, I implemented a score function that is just the sum of the frequencies of the query terms in the document.
```java
@Override
public CustomScoreProvider getCustomScoreProvider(final IndexReader reader) {
    return new CustomScoreProvider(reader) {
        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore)
                throws IOException {
            // Term vector of the "contents" field for this document
            TermFreqVector freqVector = reader.getTermFreqVector(doc, "contents");
            int[] freqs = freqVector.getTermFrequencies();

            // Collect the terms of the query
            Set<Term> terms = new HashSet<Term>();
            query.extractTerms(terms);

            // Sum the frequency of each query term present in the document
            int total = 0;
            for (Term term : terms) {
                int index = freqVector.indexOf(term.text());
                if (index != -1) {
                    total += freqs[index];
                }
            }
            return total;
        }
    };
}
```
Note that I still check the index for -1 (term not found) because, by default, QueryParser uses the OR operator, so there is no guarantee that every term of the query will be present in every retrieved document.
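The need for that guard can be shown without Lucene at all. Below is a plain-Java sketch that mimics how TermFreqVector exposes its data (parallel arrays of terms and frequencies, with indexOf returning -1 for absent terms); the vocabulary and counts are made up.

```java
public class FreqSumSketch {

    // Mimics TermFreqVector: parallel arrays of terms and their frequencies
    static final String[] TERMS = {"apache", "lucene", "search"};
    static final int[] FREQS = {3, 5, 2};

    // Like TermFreqVector.indexOf: returns -1 when the term is absent
    static int indexOf(String term) {
        for (int i = 0; i < TERMS.length; i++) {
            if (TERMS[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Sum of the frequencies of the query terms present in the document
    static int sumFreqs(String[] queryTerms) {
        int total = 0;
        for (String t : queryTerms) {
            int idx = indexOf(t);
            if (idx != -1) { // with OR queries, some terms may be missing
                total += FREQS[idx];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // "lucene" contributes 5; "solr" is absent and is skipped
        System.out.println(sumFreqs(new String[]{"lucene", "solr"}));
    }
}
```

Without the -1 check, a missing term would turn into an ArrayIndexOutOfBoundsException on the frequency array.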
Now that you have your brand-new scoring class, it is time to use it! Here is a small example of how you can apply it in your program.
```java
IndexReader reader = IndexReader.open(FSDirectory.open(new File(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, FIELD, analyzer);
Query query = parser.parse("searching something");

CustomScoreQuery customQuery = new MyOwnScoreQuery(query);
ScoreDoc[] hits = searcher.search(customQuery.createWeight(searcher), null, numHits).scoreDocs;

for (int i = 0; i < hits.length; i++) {
    // iterating over the results
    // hits[i].doc gives you the doc
}
```
As also mentioned in the book, you can use your CustomScoreQuery to apply boosting depending on some rules. This can be done by chaining queries:
```java
QueryParser parser = new QueryParser(Version.LUCENE_36, FIELD, analyzer);
Query q = parser.parse("original query");
Query q2 = new MyOwnScoreQuery(q, ..., ...);
The scores obtained from the previous scorers are available through the float subQueryScore parameter of the customScore method that we override, as you can see in its documentation.
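For instance, inside customScore you could scale the score coming from the wrapped query instead of replacing it. A minimal sketch of that pattern (the 2.0f boost factor and the matchesRule flag are made-up placeholders for whatever rule you want to apply):

```java
public class BoostSketch {

    // Hypothetical boosting rule: multiply the sub-query score
    // when the document matches some condition, pass it through otherwise.
    static float boostedScore(float subQueryScore, boolean matchesRule) {
        float boost = 2.0f; // made-up boost factor
        return matchesRule ? subQueryScore * boost : subQueryScore;
    }

    public static void main(String[] args) {
        System.out.println(boostedScore(1.5f, true));  // boosted
        System.out.println(boostedScore(1.5f, false)); // unchanged
    }
}
```

Returning a multiple of subQueryScore keeps the relative ordering produced by the original similarity while promoting the documents that match your rule.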
That’s it! Since it took me quite a while to figure this out, I thought it would be a nice idea to share it. Please let me know if this was useful to you or if you find any mistakes in what I have said.
PS: Interesting presentation about Lucene that I found while writing this post: http://www.slideshare.net/nitin_stephens/lucene-basics
To be clear, you’re not creating a custom Scorer, you’re creating a CustomScoreProvider. I think you mean “Scorer” in a general sense, but in Lucene it’s an actual class name that is implemented quite differently from a CustomScoreProvider.
This helped. Thanks for the example.