Extended query building for the Sphinx search engine
October 13, 2007
Over the last couple of days, I have been working on the “sphinx_search” module, a Phorum module that uses the Sphinx search engine as the backend for the Phorum search interface. Sphinx is impressively fast, but its query parser is not yet fully mature and has some shortcomings. These caused us trouble when we tried to use the extended query syntax to search both the body and the subject of forum messages. In this article I will show the workarounds that were implemented in the “sphinx_search” module. Maybe these examples are useful for others who are using Sphinx as well.
Search on all words in the query
We started out with the following query, meant to express “search for bodies that contain both word1 and word2, or subjects that contain both word1 and word2”:
@body word1 word2 | @subject word1 word2
This turned out to be based on a wrong assumption. We found out that the OR operator “|” takes precedence over the implicit AND between words, so Sphinx reads the query like this:
@body word1 & (@body word2 | @subject word1) & @subject word2
This is not what we were looking for. To make the query work the way we wanted, we needed to group the query terms like this:
(@body word1 word2) | (@subject word1 word2)
Unfortunately, this query syntax is not allowed by Sphinx (yet), so we had to find a different solution. After some experimenting, we ended up with the following query:
@body word1 | @subject word1 & @body word2 | @subject word2
Sphinx reads this query as:
(@body word1 | @subject word1) & (@body word2 | @subject word2)
With this query we got good results. Note that it is slightly different from the query we tried at first: each word may now match in either field, so a message with word1 in the subject and word2 in the body matches as well. For our application this difference is no problem. In fact, it makes the “sphinx_search” module behave the same as the standard Phorum search (which searches globally over the subject and body), so it is actually better in our case.
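The grouping behaviour described in this section can be modelled in a few lines of Python. This is only a sketch of the precedence rule as we observed it (OR binds tighter than AND, and a field marker stays in effect until the next one); it is not Sphinx's actual parser. The function names build_all_words_query and group_terms are made up for this illustration, the words are assumed to be plain tokens without special characters, and the real module is written in PHP.

```python
def build_all_words_query(words, fields=("body", "subject")):
    """Build the "all words" query: one OR group per word, joined by AND."""
    groups = []
    for word in words:
        # Every word may match in any of the fields.
        groups.append(" | ".join(f"@{field} {word}" for field in fields))
    return " & ".join(groups)


def group_terms(query):
    """Toy model of the observed precedence; NOT Sphinx's real parser.

    Returns a list of OR groups. The groups themselves are ANDed together.
    """
    groups = []
    field = None
    join_with_previous = False
    for token in query.split():
        if token.startswith("@"):
            field = token  # a field marker applies to the words that follow
        elif token == "|":
            join_with_previous = True  # OR glues the next term to the last one
        elif token == "&":
            join_with_previous = False  # explicit AND starts a new group
        else:
            term = f"{field} {token}"
            if join_with_previous and groups:
                groups[-1].append(term)
            else:
                groups.append([term])  # implicit AND: new group
            join_with_previous = False
    return groups


print(build_all_words_query(["word1", "word2"]))
print(group_terms("@body word1 word2 | @subject word1 word2"))
```

Feeding the output of build_all_words_query back into group_terms gives two groups, one per word, each covering both fields. That is the grouping we were after.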
Search on any words in the query
When searching on any word in the query, things are easier. For this query it is possible to simply use one of these two syntaxes:
@body word1 | word2 | @subject word1 | word2
@body word1 | @subject word1 | @body word2 | @subject word2
Although the first notation is somewhat more elegant, I settled for the second one. It is closest to the syntax that I ended up using for the “all words” query, which made it easier to program the query builder in the module.
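To make the two notations concrete, here is a small Python sketch that produces both of them from a list of words. The function names and the fields parameter are assumptions for this example and do not come from the module, which is written in PHP. Words are assumed to be plain tokens that need no escaping.

```python
def build_any_word_query(words, fields=("body", "subject")):
    """Second notation: every word is repeated with an explicit field marker."""
    terms = [f"@{field} {word}" for word in words for field in fields]
    return " | ".join(terms)


def build_any_word_query_compact(words, fields=("body", "subject")):
    """First notation: one field marker, followed by all words ORed together."""
    per_field = [f"@{field} " + " | ".join(words) for field in fields]
    return " | ".join(per_field)


print(build_any_word_query(["word1", "word2"]))
print(build_any_word_query_compact(["word1", "word2"]))
```

The second notation is built from the same per-word field groups as the “all words” query; only the separator between the words changes from “&” to “|”.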
Search on negated words
With negated words, some words can be excluded from the search. For example, you could search for “word1 -word2” in the Phorum search. Again, we missed the option to group parts of the query, which prevented us from writing the following query:
(@body word1 -word2) | (@subject word1 -word2)
To make this query work, we finally came up with the following syntax:
@body -word2 & @subject -word2 & @body word1 | @subject word1
This query will be seen by Sphinx as:
@body -word2 & @subject -word2 & (@body word1 | @subject word1)
Again, this is not a really elegant query syntax, but at least Sphinx parses it correctly this way and it delivers the required results.
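As a rough illustration of how such a query could be assembled from what the user typed, here is a Python sketch. The helpers split_terms and build_query_with_negation are hypothetical names for this example; the module itself is written in PHP and may do this differently. The sketch assumes that words are separated by whitespace and that a leading “-” marks a negated word.

```python
def split_terms(search):
    """Split user input like "word1 -word2" into included and excluded words."""
    include, exclude = [], []
    for token in search.split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])  # strip the leading minus sign
        elif token != "-":
            include.append(token)      # a lone "-" is ignored
    return include, exclude


def build_query_with_negation(include, exclude, fields=("body", "subject")):
    """Negated words first (one term per field), then one OR group per word."""
    parts = []
    for word in exclude:
        for field in fields:
            # The excluded word must be absent from every field.
            parts.append(f"@{field} -{word}")
    for word in include:
        # An included word may match in any of the fields.
        parts.append(" | ".join(f"@{field} {word}" for field in fields))
    return " & ".join(parts)


include, exclude = split_terms("word1 -word2")
print(build_query_with_negation(include, exclude))
```

For the input “word1 -word2” this prints the same query that is shown above.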