Building your own Lucene Scorer

This post is about Apache Lucene, which is a “high-performance, full-featured text search engine library written entirely in Java”. If you have no idea what I am talking about, this tutorial is not for you :). Be advised that this is my first month using Lucene, so there is still a chance that everything I say here is just plain wrong :P. Also, everything here refers to version 3.6.1, which is the one I am currently using.

While doing an assignment for my Information Retrieval class, I was faced with the problem of creating my own scoring class in Lucene. When you create a new IndexSearcher, by default Lucene uses DefaultSimilarity, which is essentially cosine similarity (in a Vector Space Model) combined with several weights: boosts given when indexing, boosts given in the query, tf*idf and a document length norm. A description of exactly how it works can be found in the Similarity class documentation and in the Lucene scoring documentation.
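To make those weights concrete, here is a rough sketch in plain Java of the per-term factors as I understand DefaultSimilarity in the 3.x line to compute them. It does not call Lucene at all. The class name ClassicFactors and the helper termWeight are made up for illustration, and the sketch leaves out coord, queryNorm and the lossy one-byte encoding of norms, so the numbers will not match real Lucene scores exactly.

```java
// Illustrative only: mirrors the shape of the classic tf*idf factors,
// not the actual Lucene implementation.
final class ClassicFactors {

    // Term frequency factor: grows with the square root of the raw count.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger weight.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Length norm: shorter fields count more per matching term.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Contribution of one query term to one document (hypothetical helper).
    // idf appears squared because it enters once through the query weight
    // and once through the document weight.
    static float termWeight(int freq, int docFreq, int numDocs,
                            int numTerms, float boost) {
        float idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTerms);
    }
}
```

For example, a term that occurs 4 times in a 4-term field, with an idf of 1 and no boost, gets a weight of 2 × 1 × 1 × 0.5 = 1.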

The guys from Lucene have put a lot of effort into finding a good similarity function, and their DefaultSimilarity works quite well in most cases. However, for one reason or another you may still want to use your own custom function.

When searching for how to do this, all the results I found were about customizing the existing similarity function by extending the class and overriding its methods to change (or “disable”) some of the weights. Such examples can be found on LuceneTutorial.com and in blog posts, and they all follow that same pattern.
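The pattern those examples share can be sketched as follows. To keep the snippet compilable on its own, BaselineSimilarity is a made-up stand-in for Lucene's DefaultSimilarity, and PresenceOnlySimilarity is a hypothetical subclass; neither is a Lucene class.

```java
// Stand-in for Lucene's DefaultSimilarity so the sketch compiles by itself.
class BaselineSimilarity {
    float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}

// Customization by subclassing: keep the overall formula and
// "disable" individual weights by overriding them.
class PresenceOnlySimilarity extends BaselineSimilarity {

    // Only care whether the term occurs, not how often.
    @Override
    float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // Treat rare and common terms alike.
    @Override
    float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}
```

With real Lucene 3.x you would extend org.apache.lucene.search.DefaultSimilarity instead of the stand-in and install the subclass with IndexSearcher.setSimilarity. Note that the formula itself stays the same; only its ingredients change.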

However, I wanted to implement my own simple similarity function totally from scratch. It was not clear to me which single method my class had to implement/override and how I should add it to Lucene’s workflow.

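After spending some hours looking on the internet, I finally found a solution in the book “Lucene in Action”, by Erik Hatcher and co-authors: extend CustomScoreQuery and have it return your own CustomScoreProvider, whose customScore method computes the score. Do not ask me where I found an online copy of the book.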

Looking at it now, it does not seem very complicated, but arriving at this just by reading the documentation was impossible for me. As a proof of concept I implemented a score function that is just the sum of the frequencies of the query terms in the document.

Note that I still check whether the index of each term is -1 (term not found), because by default QueryParser uses the OR operator, so there is no guarantee that all terms from the query are going to be present in all of our retrieved documents.
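The scoring logic described above can be sketched in plain Java like this. It is not the original listing and it does not call Lucene: TermVector is a made-up stand-in that mimics the indexOf and getTermFrequencies methods of Lucene 3.x's TermFreqVector (which you would get from IndexReader.getTermFreqVector, provided the field was indexed with term vectors), and FrequencySumScore is a hypothetical helper.

```java
import java.util.List;

// Stand-in for a per-document term vector: parallel arrays of terms and counts.
final class TermVector {
    private final String[] terms;
    private final int[] freqs;

    TermVector(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
    }

    // Position of the term in the vector, or -1 if the document lacks it.
    int indexOf(String term) {
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    int[] getTermFrequencies() {
        return freqs;
    }
}

final class FrequencySumScore {

    // Score = sum of the frequencies of the query terms in the document.
    static float score(List<String> queryTerms, TermVector doc) {
        int[] freqs = doc.getTermFrequencies();
        float sum = 0f;
        for (String term : queryTerms) {
            int idx = doc.indexOf(term);
            // With OR semantics a matching document may miss some query terms.
            if (idx != -1) {
                sum += freqs[idx];
            }
        }
        return sum;
    }
}
```

In the real thing, this loop would live inside the customScore method of your CustomScoreProvider, and its return value would replace the score Lucene computed for the document.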

Now that you have your own brand-new scoring class, it is time to use it! In your program, wrap the parsed query in your CustomScoreQuery subclass and pass that to the IndexSearcher instead of the original query.

As also mentioned in the book, you can use your CustomScoreQuery to apply boosting depending on some rules. This can be done by chaining queries, with each CustomScoreQuery wrapping the previous one.

The score produced by the previous query in the chain is available as the float subQueryScore parameter of the customScore method that we override, as described in its documentation.
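Here is a small plain-Java model of how a score travels through such a chain. It does not use Lucene: ScoreStep, ScoreSteps and the two example rules are hypothetical names, and the two-argument customScore is a simplified version of the real method, which also receives the value source score.

```java
import java.util.List;

// One link in the chain: sees the document id and the score so far.
interface ScoreStep {
    float customScore(int doc, float subQueryScore);
}

final class ScoreSteps {

    // Rule 1 (made up): scale whatever the previous step produced.
    static ScoreStep multiplyBy(float factor) {
        return (doc, subQueryScore) -> subQueryScore * factor;
    }

    // Rule 2 (made up): add a fixed bonus for one favoured document.
    static ScoreStep bonusForDoc(int favouredDoc, float bonus) {
        return (doc, subQueryScore) ->
                doc == favouredDoc ? subQueryScore + bonus : subQueryScore;
    }

    // Innermost query first: each step receives the previous step's
    // result as its subQueryScore.
    static float runChain(int doc, float baseScore, List<ScoreStep> steps) {
        float score = baseScore;
        for (ScoreStep step : steps) {
            score = step.customScore(doc, score);
        }
        return score;
    }
}
```

With a chain of multiplyBy(2) followed by bonusForDoc(7, 1), a base score of 1.5 becomes 4.0 for document 7 and 3.0 for any other document. In Lucene the same effect comes from nesting, since the outer query's customScore is called with the inner query's result.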

That’s it! Since it took me a really long time to figure this out, I thought it would be a nice idea to share it. Please let me know if this was useful to you or if you find any mistakes in what I have said.

PS: Interesting presentation about Lucene that I found while writing this post: http://www.slideshare.net/nitin_stephens/lucene-basics


2 Responses to Building your own Lucene Scorer

  1. Keith says:

    To be clear, you’re not creating a custom Scorer, you’re creating a CustomScoreProvider. I think you mean to use “Scorer” in a general sense but in Lucene it’s an actual class name that is implemented much differently than a CustomScoreProvider.

  2. Omolara says:

    This helped. Thanks for the example.
