How Sphinx relevance ranking works

发布: 2010-09-21 21:19


Over time, we added quite a bunch of matching and ranking modes to Sphinx, and will be adding more. A number of different questions that regularly pop up, ranging from “how do I force this document ranked the 1st” from “how do I draw 1 to 5 stars depending on match quality”, do in fact boil down to matching and ranking internals. So let’s cover that: just how do matching and ranking modes work, what weighting factors contribute to the final weight and how, how does one tweak stuff, etc. And, of course, the stars, our destination.



What are matching modes?


First of all, let’s sort out those confusing modes. SphinxAPI exposes two different methods, SetMatchMode() and SetRankingMode() respectively. SphinxQL, for reference, does not, it only exposes OPTION ranker, which maps to ranking mode, but not the matching mode. What is all that about?


Matching modes are all about legacy and compatibility. And ranking modes are about how Sphinx computes relevance.


Previously, up to version 0.9.8, Sphinx only used to have matching modes and every matching mode was implemented with a different code path. Each code path implemented a different kind of both matching and ranking. For instance, SPH_MATCH_ALL required all keywords to be present, and computed document weight as phrase proximity alone. SPH_MATCH_ANY required any of the keywords, and computed weight differently. And so on.


In 0.9.8 we started a new, unified matching engine. To avoid breaking compatibility while we work on it, in 0.9.8 it was only exposed under a separate matching mode, called SPH_MATCH_EXTENDED2. By version 0.9.9 it became apparent that the new engine is stable and performant enough, and we removed all the legacy code paths in favor of the new engine. So starting from 0.9.9 all queries are now handled with a unified engine, which was not the case before and created maintenance difficulties. And so all the matching modes are now, in fact, just a legacy.


Sphinx stays compatible with those legacy modes, of course, and when you use one of those, it automatically switches to a simplified query parsing code (that completely ignores query syntax) and automatically picks a proper ranker. But that is it. Everything else is always handled with our unified matching engine. And thus, document weight (aka, @weight) only depends on the chosen ranking mode (aka, ranker). For instance, the following two queries will result in exactly the same weight (and exactly the same processing time, too):



// 1st route
$cl->SetMatchMode ( SPH_MATCH_ALL );
$cl->Query ( "hello world" );
 
// 2nd route
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$cl->SetRankingMode ( SPH_RANK_PROXIMITY );
$cl->Query ( "hello world" );


Note that second route allows you to use, say, (@title hello world) syntax because the matching mode allows for it. First one does not, because in that matching mode all special operators are ignored and @title will be interpreted as a keyword.


So SetMatchMode() does nothing more than escape the query and pick a proper ranker. It’s rather a legacy call, as there won’t be new matching modes any more (they were a temporary solution until we had a full blown query syntax in the first place, but temporary solutions are notorious for their tendency to last) and query-syntax enabled matching allows you to do everything that older modes used to provide and much more. SetRankingMode() does even less, it just lets you explicitly pick a ranker. Which brings us to the question…


What are rankers?


Ranking modes, or rankers for short, can be formally defined as functions that compute a relevance value (weight) for a given query and document arguments.


Relevance is ultimately subjective, so there’s no single one-size-fits-all ranker, and there will never be. So there can be many different factors used to compute a final weight and a myriad of ways to combine those factors into a weight, discussing that is a subject for a separate posting.


The two most important weighting factors that Sphinx computes and uses as of 1.10 are 1) classic statistical BM25 factor, used by most if not all search engines since 80s, and 2) Sphinx specific phrase proximity factor.


BM25 factor


BM25 is a floating point value that depends on frequencies of the matched keywords only. Frequencies in question are in-document and in-collection frequencies. Basically, keywords that are more rare and/or occur many times in the document yield more weight to that document.


Standard BM25 implementation is nicely covered in Wikipedia article on BM25 but Sphinx uses a slightly modified variant. First, for performance reasons we account for all the keywords occurrences in the document, and not just the matched ones. For instance, (@title “hello world”) query that only matches a single instance of “hello world” phrase in the title will result in the same BM25 with a (hello world) query that matches all the instances of both keywords everywhere in the document. Second, we don’t enforce any document attributes and therefore don’t necessarily have a document length, so we ignore document length too (equivalent to plugging b=0 into original BM25). Both changes were intentional, as in our testing using original BM25 did not result in enough ranking improvement to justify the associated performance impact. The exact BM25 computation that Sphinx uses is, in pseudo-code, as follows:



BM25 = 0
foreach ( keyword in matching_keywords )
{
n = total_matching_documents ( keyword )
N = total_documents_in_collection
k1 = 1.2
 
TF = current_document_occurrence_count ( keyword )
IDF = log((N-n+1)/n) / log(1+N)
BM25 = BM25 + TF*IDF/(TF+k1)
}
 
// normalize to 0..1 range
BM25 = 0.5 + BM25 / ( 2*num_keywords ( query ) )


TF means Term Frequency in a document being ranked. It’s based on a number of occurrences within a document but smoothed with a logarithm function, so that 1000 occurrences don’t result in 1000x improvement over just 1. TF can generally vary from 0 to 1 but, with a chosen k=1.2, it actually varies from 0.4545… to 1.


IDF means Inverse Document Frequency in the entire document set. IDF possesses lesser values for frequent words (such as “the” or “to” etc) and greater values for rare ones, with peak values being IDF=1 when a keyword occurs in exactly one document, and IDF=-1 when it occurs in every indexed document.


So, as you can see from the code above, BM25 increases when the keywords are rare and occur many times in the document and decreases when the keywords are frequent. It should be noted that overly frequent keywords that match more than a half of indexed documents actually decrease BM25! Indeed, when a keyword occurs in 90% of the documents then the documents without it are rarer gems, probably more interesting as such, and deserve more weight.


Phrase proximity factor


Phrase proximity factor, on the contrary, does not care about the keyword frequencies at all and accounts for the mutual disposition of query keywords in the document. Instead of keyword frequencies used for BM25, Sphinx analyzes keyword positions in every field and computes phrase proximity value as the longest common sub-sequence (LCS) length between the query and the document. Basically, per-field phrase proximity is a number of keywords that occurred in the document in exactly the same order as they did in the query. Here go a few examples:



1) query = one two three, field = one and two three
field_phrase_weight = 2 (because 2-keyword long "two three" subphrase matched)
 
2) query = one two three, field = one and two and three
field_phrase_weight = 1 (because single keywords matched but no subphrase did)
 
3) query = one two three, field = nothing matches at all
field_phrase_weight = 0


Per-field phrase weights are then multiplied by field weights specified with SetFieldWeights() API call (or OPTION field_weights in SphinxQL) and added altogether to produce a per-document phrase weight. Field weights default at 1, and can not be set lower than 1. In pseudo-code, the entire phrase proximity calculation looks as follows:



doc_phrase_weight = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
doc_phrase_weight += user_weight ( field ) * field_phrase_weight
}


Example:



doc_title = hello world
doc_body = the world is a wonderful place
 
query = hello world
query_title_weight = 5
query_body_weight = 3
 
title_phrase_weight = 2
body_phrase_weight = 1
doc_phrase_weight = 2*5+3*1 = 13


It’s the phrase proximity factor that guarantees that closer phrase matches will be ranked higher, and exact phrase matches will be ranked at the very top. One can use field weights to tweak and tune that behavior. For instance, in the example just above, a single-keyword match in title is made worth the same as a two-keyword phrase match in body.


Phrase proximity is by design somewhat more computationally intensive than BM25 because it needs to work through all the keyword occurrences in the matched documents and not just the documents only. Sphinx defaults to using proximity because we believe this yields better search quality. You can however choose to use a more lightweight ranker that omits the expensive proximity calculations.


Orbital view of the rankers


Phrase proximity and BM25 are the two most important factors, that is, values that contribute to the final document weight. However, the final weight value is determined by the ranker, that is, the specific function that crunches one or more factors into a single number (Also, there are other factors besides phrase weight and BM25 that Sphinx can compute and use.)


As of 1.10-beta, Sphinx has 8 different rankers, and will definitely add more in the future. Every ranker computes weight differently and thus might or might not be suitable for a particular scenario.


There are 3 simple rankers (NONE, WORDCOUNT, FIELDMASK) that do nothing, count keyword occurrences, and return matching fields bitmask, respectively. Those are useful when ranking is not needed at all, or computed somehow on application side.


There are 2 legacy rankers (PROXIMITY, MATCHANY) that rely on phrase proximity alone and are used to emulate MATCH_ALL and MATCH_ANY legacy modes respectively.


There are 3 more rankers (BM25, PROXIMITY_BM25, SPH04) that can combine phrase proximity, BM25, and other bits. Query-syntax enabled modes and SphinxQL default to PROXIMITY_BM25 now, and PROXIMITY_BM25 is strongly suggested as a drop-in replacement for PROXIMITY, too. BM25 is recommended as a reasonably good quick ranker, and also for comparison with other systems. SPH04 builds upon PROXIMITY_BM25 but additionally ranks exact field matches and field start matches higher than “just” matches.


PROXIMITY_BM25 and SPH04 are expected to yield the best quality, but your particular results may vary.


Choice of ranker can severely affect search query performance. NONE is obviously the quickest ranker, but what about the other ones? Processing the keyword positions (occurrences) is typically the most expensive part, so rankers that don’t need to do that (FIELDMASK, BM25) are always quicker than all the others. They utilize less disk IO too (no need to read in the positions). Rankers that process keyword positions (WORDCOUNT, PROXIMITY, MATCHANY, PROXIMITY_BM25, SPH04) only differ in CPU impact.


Nitty gritty ranker details


This section describes the exact algorithms Sphinx rankers use and provides pseudo-code. You can skip it freely unless you want to tweak ranking, tune field weights, etc.


While the factors might be integer, boolean, floating point or whatever else, the weight has to be a single scalar value. In Sphinx, the weight is not just scalar but an integer value. This isn’t a real constraint, floating point weight values can be mapped to integers in a variety of ways anyway.


Let’s begin with the three simplest rankers.


1) SPH_RANK_NONE ranker just assigns every document weight to 1.



weight = 1


Why use this and effectively skip ranking at all? The answer is performance. If you’re sorting search results by price, why spend CPU cycles doing expensive ranking you’re going to throw away anyway?


2) SPH_RANK_WORDCOUNT ranker counts all the keyword occurrences and multiplies them by user field weights.



weight = 0
foreach ( field in matching_fields )
weight += num_keyword_occurrences ( field )


Note that it counts all occurrences, and not the unique keywords. Therefore 3 occurrences of just 1 matching keyword will contribute exactly as much as 1 occurrence of 3 different keywords.


3) SPH_RANK_FIELDMASK ranker returns a bit mask of matched fields.



weight = 0
foreach ( field in matching_fields )
set_bit ( weight, index_of ( field ) )
// or in other words, weight |= ( 1 << index_of ( field ) )


The other five rankers are somewhat more complicated and mostly rely on phrase proximity.


4) SPH_RANK_PROXIMITY, the default ranker in SPH_MATCH_ALL legacy mode, simply passes the phrase proximity for a weight:



weight = doc_phrase_weight


By the definition of phrase weight, when documents do match the query but no sequence of two keywords matches, all such documents will receive a weight of 1. That, clearly, isn’t differentiating the results much so using PROXIMITY_BM25 ranker instead is advised. The associated searching performance impact should be negligible.


5) SPH_RANK_MATCHANY ranker, used to emulate legacy MATCH_ANY mode, combines phrase proximity and the number of matched keywords so that, with default per-field weights, a) longer sub-phrase match (aka bigger phrase proximity) in any field would rank higher, and b) in case of agreeing phrase proximity, document with more matched unique keywords would rank higher. In other words, we look at max sub-phrase match length first, and a number of unique matched keywords second. In pseudo-code,



k = 0
foreach ( field in all_fields )
k += user_weight ( field ) * num_keywords ( query )
 
weight = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
field_rank = ( field_phrase_weight * k + num_matching_keywords ( field ) )
weight += user_weight ( field ) * field_rank
}


It does not use BM25 at all because legacy mode did not use it and we need to stay compatible.


6) SPH_RANK_PROXIMITY_BM25, the default SphinxQL ranker and also the default ranker when “extended” matching mode is used with SphinxAPI, computes weight as



weight = doc_phrase_weight*1000 + integer(doc_bm25*999)


So document phrase proximity is the primary factor and BM25 is an auxiliary one that additionally sorts documents sharing the same phrase proximity. BM25 belongs to 0..1 range, so last 3 decimal digits of final weight contain scaled BM25, and all the other digits are used for the phrase weight.


7) SPH_RANK_BM25 ranker sums user weights of the matched fields and BM25.



field_weights = 0
foreach ( field in matching_fields )
field_weights += user_weight ( field )
weight = field_weights*1000 + integer(doc_bm25*999)


Almost like PROXIMITY_BM25 ranker, except that user weights are not multiplied by per-field phrase proximities. Not using phrase proximity allows the engine to evaluate the query using document lists only, and skip the processing of keyword occurrences lists. Unless your documents are extremely short (think tweets, titles, etc), occurrence lists are somewhat bigger than the document lists and take somewhat more time to process. So BM25 is a faster ranker than any of the proximity-aware ones.


Also, many other search systems either default to BM25 ranking, or even provide it as the only option. So it might make sense to use BM25 ranker when doing performance testing to make the comparison fair.


8) SPH_RANK_SPH04 ranker further improves on PROXIMITY_BM25 ranker (and introduces numbers instead of meaningful names, too, because a name would be way too complicated). Phrase proximity is still the leading factor, but, within a given phrase proximity, matches in the beginning of the field are ranked higher, and exact matches of the entire field are ranked highest. In pseudo-code,



field_weights = 0
foreach ( field in matching_fields )
{
f = 4*max_common_subsequence_length ( query, field )
if ( exact_field_match ( query, field ) )
f += 3
else if ( first_keyword_matches ( query, field ) )
f += 2
field_weights += f * user_weight ( field )
}
weight = field_weights*1000 + integer(doc_bm25*999)


Thus, when querying for “Market Street”, SPH04 will basically rank a document with exact “Market Street” match in one of the fields the highest, followed by “Market Street Grocery” that begins the field with a matching keyword, then followed by “West Market Street” that has a phrase match somewhere, and then followed by all the documents that do mention both keywords but not as a phrase (such as “Flea Market on 26th Street”).


So how do I draw those stars?


Or, more formally, how do I compute the max possible weight and scale returned weights to A-F scale, or percents, or whatever else?


As you can see from the previous section there’s no simple way to tell that. Maximum weight depends both on a chosen ranker and a particular query. For example, an upper weight bound with PROXIMITY_BM25 ranker would be



max_weight = num_keywords * sum ( user_field_weights ) * 1000 + 999


But can this upper bound ever be reached? Barely in practice, because that would require a) exact phrase matches b) in all the fields c) plus BM25 peaking at 999 which roughly translates to only using one-in-a-million keywords. Moreover, what if the query uses field limit operators, eg. @title hello world? In that case our upper bound can never be reached because we would never match any field except title. For this particular query the practical upper bound, which could possibly be reached by an “ideal” document, is different.


Therefore, computing the “ideal” maximum weight (one that can actually be reached) is really, really complicated. We could possibly do that on Sphinx side but that’s a lengthy R&D project with questionable outcome. So if you can’t live without percentiles (or stars!), you can either use the “absolute” upper bound estimate like the one given above (that might never be practically reached and result in “100% match”), or just use the maximum weight from your particular query, and rescale everything to that weight. Using multi-queries, the latter can be made pretty cheap.


So how do I rank exact field matches higher?


You use a ranker that does that.


Both SphinxAPI-default PROXIMITY and SphinxQL-default PROXIMITY_BM25 rankers do not. They only rank a longer sub-phrase match higher, but do not care as to where in the field that match occurred, and whether it matched the entire field or not.


SPH04 ranker, added in 1.10-beta, does.


So how do I force document D to rank first?


Depending on why document D needs to be ranked higher, you either use a ranker that fits your requirements better, or use Sphinx run-time expressions to compute what you need and sort the result set differently.


Example given, boosting those exact field-equals-query matches could be emulated by sorting by an expression:



SELECT *, @weight+IF(fieldcrc==$querycrc,1000,0) AS myweight ...
ORDER BY myweight DESC


where fieldcrc is CRC(field) attribute computed at indexing time and stored in the index, and querycrc is CRC(query) computed at searching time.


Example given, instead of checking for strict CRC match, you could index and store field lengths, and rank shorter fields higher by using an expression like



SELECT *, @weight+ln(len+1)*1000 AS myweight ...


Example given, to force a document rank higher when a given keyword is searched for, you create a separate field with super-important keywords, put it there, and assign a high weight to that field. (Don’t set the weight over a million though or 32-bit weight would overlap!)


So how does Sphinx ranking compare to system XYZ?


Major Web search engines (think Google) are an entirely different story, subject to another posting, perhaps. Web-scale ranking (and spam fighting) forces them to account for hundreds and thousands of factors in their ranking. Many of those factors (PageRank, page and domain age, incoming link count, ratio of code to text, etc) are not text related, however, and can also be employed with Sphinx in a particular application by using expressions. Sphinx itself is generic and its rankers only handle some text related factors, everything else has to be explicitly added on top.


However most other full-text search systems still either default to plain old BM25 for text related factors, or even limit you to it. Don’t get me wrong, BM25 is great, it’s a great weighting factor. But using it as the only ranking factor is, ahem, really last century. Sphinx proximity based rankers do a step towards improving that, and we plan taking a few more steps, too. Stay tuned, interesting things are going to happen. ;)





引用: http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/


原文: http://qtchina.tk/?q=node/510

Powered by zexport