Extended query building for the Sphinx search engine

October 13, 2007

Over the last couple of days, I have been working on the “sphinx_search” module. This is a Phorum module that implements the Sphinx search engine as the backend for the Phorum search interface. This search engine is truly amazing for its search speed. Its query parser, however, is not yet fully mature and has some shortcomings. These caused us some trouble when we tried to use the extended query syntax to define searches on both the body and the subject of forum messages. In this article I will show which workarounds were implemented in the “sphinx_search” module to get around this. Maybe these examples are useful for others who are using Sphinx as well.

Search on all words in the query

We started out with the following format for a query that should mean “search for bodies that contain both word1 and word2, or subjects that contain both word1 and word2”:

@body word1 word2 | @subject word1 word2

This turned out to be based on a wrong assumption. We found out that the OR symbol “|” takes precedence over the implicit AND between the words, making this query look like this to Sphinx:

@body word1 & (@body word2 | @subject word1) & @subject word2
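To see why the parsed form differs from the intended one, here is a small sketch that evaluates both interpretations with plain Python booleans. It does not talk to Sphinx at all; the helper names (`contains`, `intended`, `as_parsed`) and the sample message are made up for illustration.

```python
# Illustrative only: plain Python booleans standing in for Sphinx matching.

def contains(text, word):
    """True if `word` occurs as a whitespace-separated token in `text`."""
    return word in text.lower().split()

def intended(body, subject):
    # What we meant: (@body word1 word2) | (@subject word1 word2)
    return (contains(body, "word1") and contains(body, "word2")) or (
        contains(subject, "word1") and contains(subject, "word2"))

def as_parsed(body, subject):
    # What Sphinx made of it:
    # @body word1 & (@body word2 | @subject word1) & @subject word2
    return (contains(body, "word1")
            and (contains(body, "word2") or contains(subject, "word1"))
            and contains(subject, "word2"))

# A message with both words in the body and neither in the subject.
message = {"body": "word1 word2", "subject": "hello"}
print(intended(**message))   # True
print(as_parsed(**message))  # False
```

The parsed form always requires word1 in the body and word2 in the subject, so a message that has both words only in its body is missed.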

This is not what we were looking for. To make the query work the way we wanted, we needed to group the query terms like this:

(@body word1 word2) | (@subject word1 word2)

Unfortunately, this query syntax is not allowed by Sphinx (yet), so we had to find a different solution. After some experimenting, we ended up with the following query:

@body word1 | @subject word1 & @body word2 | @subject word2

This query looks like this to Sphinx:

(@body word1 | @subject word1) & (@body word2 | @subject word2)

Using this query we were able to get some good results. Note that this query is a little bit different from the query that we tried at first: it also matches a message that has word1 only in the subject and word2 only in the body, because each word may be found in either field. For our application, this difference is no problem. In fact, this latter query syntax makes the “sphinx_search” module behave the same as the standard Phorum search (which searches globally over the subject and body), so it is actually better in our case.
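As a rough sketch of how such an “all words” query string could be assembled, here is a small Python function. This is not the code from the “sphinx_search” module (which is a Phorum module); the function name `build_all_words_query` and the `FIELDS` constant are invented for this example, and the words are assumed to be already cleaned of characters that have a special meaning in the query syntax.

```python
FIELDS = ("body", "subject")  # field names as used in the examples above

def build_all_words_query(words, fields=FIELDS):
    """Build 'every word must occur in some field' without parentheses.

    Relies on "|" binding tighter than "&", as described in the article.
    """
    # One OR group per word: "@body w | @subject w"
    per_word = [" | ".join(f"@{field} {word}" for field in fields)
                for word in words]
    # AND the per-word groups together.
    return " & ".join(per_word)

print(build_all_words_query(["word1", "word2"]))
# @body word1 | @subject word1 & @body word2 | @subject word2
```

Adding a third word simply appends another OR group, so the same loop covers queries of any length.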

Search on any words in the query

When searching on any word in the query, things are easier. For this query it is possible to simply use one of these two syntaxes:

@body word1 | word2 | @subject word1 | word2
@body word1 | @subject word1 | @body word2 | @subject word2

Although the first notation is somewhat more elegant, I settled for the second one. That one is closest to the syntax that I ended up using for the “all words” query, which made it easier to program the query builder in the module.
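The second notation is easy to generate because every term is written out in full and joined with the same operator. A minimal Python sketch, assuming the words are already sanitized; the function name `build_any_word_query` is hypothetical and not taken from the module:

```python
def build_any_word_query(words, fields=("body", "subject")):
    """Build 'any word in any field' by OR-ing every field/word pair."""
    return " | ".join(f"@{field} {word}"
                      for word in words
                      for field in fields)

print(build_any_word_query(["word1", "word2"]))
# @body word1 | @subject word1 | @body word2 | @subject word2
```

The per-word ordering (body first, then subject) mirrors the “all words” syntax, so only the operator between the word groups differs.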

Search on negated words

Using negated words, some words can be excluded from the search. So you could for example search for “word1 -word2” in the Phorum search. Again, we missed the option to group parts of the query, which prevented us from running the following query:

(@body word1 -word2) | (@subject word1 -word2)

To make this query work, we finally came up with the following syntax:

@body -word2 & @subject -word2 & @body word1 | @subject word1

This query will be seen by Sphinx as:

@body -word2 & @subject -word2 & (@body word1 | @subject word1)

Again, this is not a really elegant query syntax, but at least Sphinx can parse it correctly this way and it delivers the required results.
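Put together, a builder for searches with negated words could look roughly like the Python sketch below. It is only an illustration of the syntax shown above, not the module's actual code: the function name is made up, the input is assumed to be a whitespace-separated search string in which a leading “-” marks a negated word, and rejecting searches that contain only negated words is a choice made for this example.

```python
def build_query_with_negation(search, fields=("body", "subject")):
    """Turn 'word1 -word2' into the parenthesis-free Sphinx query."""
    terms = search.split()
    # Words prefixed with "-" must not occur in any field.
    negated = [t[1:] for t in terms if t.startswith("-") and len(t) > 1]
    # All remaining words must occur in at least one field.
    wanted = [t for t in terms if not t.startswith("-")]
    if not wanted:
        # Assumption for this sketch: require at least one positive word.
        raise ValueError("need at least one non-negated word")

    # Negations first, each field spelled out separately.
    parts = [f"@{field} -{word}" for word in negated for field in fields]
    # Then one OR group per wanted word.
    parts += [" | ".join(f"@{field} {word}" for field in fields)
              for word in wanted]
    return " & ".join(parts)

print(build_query_with_negation("word1 -word2"))
# @body -word2 & @subject -word2 & @body word1 | @subject word1
```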

3 Responses to “Extended query building for the Sphinx search engine”

  1. mmakaay Says:

    For the “search on all words” query, I got an update from Andrew (the author of Sphinx), suggesting the following syntax:

    @body (word1 word2) | @subject (word1 word2)

    This syntax does indeed work and it makes the search work like the first syntax that I described in the article. Unfortunately, this syntax does not yet work correctly with negated words in the query. Those will trigger an error message. That error message is a confirmed bug, which will be fixed in a future version of Sphinx. You can follow the progress of the bug at http://www.sphinxsearch.com/bugs/view.php?id=65

  2. Chris Says:

    Evening,

    I’m working through similar, and somewhat complex challenges with the Sphinx search module, seems as if I am not alone!

    Chris

  3. mmakaay Says:

    The good news is that there have been quite a few updates since I wrote this article. I didn’t have the time to check whether I can now skip the workaround code, but looking at the commit logs in the Sphinx project, I think that things have improved a lot in the query language field.

    Good luck with your challenge! ;-)

