Understanding Apache Lucene's scoring algorithm


I've been working with Hibernate Search for months now, but I still can't make sense of how it ranks results. I'm satisfied with the results overall, but even the simplest test does not match my expectations.

My first test used term frequency (tf). Data:

Results I get:

  1. word
  2. word word word word
  3. word word word word word
  4. word word word word word word
  5. word word
  6. word word word

I'm really confused by this ordering. My query is quite complex, but since no other field was involved in this test, it can be simplified to: booleanjunction.should(phraseQuery).should(keywordQuery).should(fuzzyQuery)

I have the following analyzers:

 SnowballPorterFilterFactory for English

My Explanation object


Score calculation is fairly complex. You have to start from the basic equation:

score(q,d) = coord(q,d) · queryNorm(q) · ∑(t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

As you said, there is tf, the term frequency factor. Its value is the square root of the number of times the term occurs in the field.

But here, as you can see in your explanation, you also have norm (aka fieldNorm) which is used in fieldWeight calculation. Let's take your example:

eklavya eklavya eklavya eklavya eklavya

4.296241 = fieldWeight in 177, product of:
  2.236068 = tf(freq=5.0), with freq of:
    5.0 = termFreq=5.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  0.4375 = fieldNorm(doc=177)


4.391628 = fieldWeight in 170, product of:
  1.0 = tf(freq=1.0), with freq of:
    1.0 = termFreq=1.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  1.0 = fieldNorm(doc=170)

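Here, the document containing a single eklavya (doc 170) scores higher than the one containing it five times (doc 177), because fieldWeight is the product of tf, idf and fieldNorm. The idf is the same for both, and although tf is higher for doc 177 (2.236 against 1.0), its fieldNorm is much lower (0.4375 against 1.0), because doc 170 contains only one term in the field.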
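
The numbers in the two explanations can be reproduced by hand. The sketch below is only an illustration in Python and assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The fieldNorm values are copied from the explanation output rather than computed, and the function names are made up for this example.

```python
import math


def tf(freq):
    # Classic TF-IDF: square root of the raw term frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Classic TF-IDF: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of tf, idf and fieldNorm.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# fieldNorm values taken from the explanation output above.
weight_177 = field_weight(5.0, doc_freq=6, max_docs=208, field_norm=0.4375)
weight_170 = field_weight(1.0, doc_freq=6, max_docs=208, field_norm=1.0)

print(f"doc 177 (five terms): {weight_177:.4f}")
print(f"doc 170 (one term):   {weight_170:.4f}")
```

The two results should come out close to 4.2962 and 4.3916, so the shorter document wins even though its tf is lower. Lucene computes with single-precision floats, so the last digits may differ slightly.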

As the documentation says:

lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.

The more terms you have in a field, the lower its fieldNorm will be. Keep an eye on this value when you read an explanation.

So, to conclude, this example shows that the score does not depend on the term frequency alone, but also on the number of terms in the field.
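
To see how the two effects combine for the test in the question, here is a rough Python sketch. It rests on several assumptions: classic TF-IDF similarity with a length norm of 1/sqrt(number of terms); a simplified model of the lossy single-byte norm storage (the value is truncated to a few significant bits, which reproduces the 0.4375 shown above for five terms); a single term clause instead of the three should clauses; and idf left out because it is the same for every document. The function names are hypothetical.

```python
import math


def stored_norm(num_terms):
    # Length norm 1/sqrt(num_terms), truncated to a coarse value to mimic
    # the one-byte norm storage (simplified model, not Lucene's actual code).
    raw = 1.0 / math.sqrt(num_terms)
    mantissa, exponent = math.frexp(raw)  # raw == mantissa * 2**exponent
    return math.ldexp(math.floor(mantissa * 8) / 8, exponent)


def tf_times_norm(num_terms):
    # Every token in the field is the search term, so freq == num_terms.
    return math.sqrt(num_terms) * stored_norm(num_terms)


# Rank documents made of 1 to 6 repetitions of the same word, best first.
ranking = sorted(range(1, 7), key=lambda n: -tf_times_norm(n))

for n in ranking:
    print(n, "word(s): norm =", stored_norm(n),
          "tf*norm =", round(tf_times_norm(n), 4))
```

Under these assumptions the order is 1, 4, 5, 6, 2, 3 words, which is the order reported in the question. The documents with one and four words tie at exactly 1.0; Python's stable sort keeps them in input order here.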


