Phorum’s message tree sorting algorithm

March 7, 2008

While figuring out what went wrong with reverse threading support in the Phorum PHP Extension, I drew up some information on how the tree sorting code does its job, to find out where my logic went wrong in the extension code. I figured that it might be useful to others, so here’s the annotated schematics. This is not the full code solution, but it explains how Phorum’s non-recursive tree sorting algorithm works.

The message tree that has to be sorted could look like this (multiple threads in a forum):

0 (virtual root node: forum)
|
+--1
|  |
|  +--5
|  |  |
|  |  +--6
|  |
|  +--8
|  .  |
|  .  +--9
|  .
|  ..(10)
|     .        \
|     ...11     |
|        |       > some broken stale data, e.g. because node
|        +--12  |  10 was removed incorrectly.
|              /
+--2
|  |
|  +--3
|
+--4
   |
   +--7

First pass: build a parent -> childs overview

First tree ordering step: build parent -> childs relations. This must work for any input array order. During this step, we also keep track of messages for which we do not encounter a parent (stale messages).

For the above tree, we get this relation table:

+--------+-------+--------+
| parent | stale | childs |
+--------+-------+--------+
| 0      |   *   | 1/2/4  |
| 1      |       | 5/8    |
| 2      |       | 3      |
| 3      |       | -      |
| 4      |       | 7      |
| 5      |       | 6      |
| 6      |       | -      |
| 7      |       | -      |
| 8      |       | 9      |
| 9      |       | -      |
| 10     |   *   | 11     |
| 11     |       | 12     |
| 12     |       | -      |
+--------+-------+--------+

Since there is no real parent 0 (that is the virtual parent id for all
thread starter messages), that one will be marked at stale too. But we
will not do any processing on that one.

The other stale items have to be taken care of, to ensure that our message tree is consistent. The handling for stale items is that they get promoted to a higher parent node. In case the starter message for the stale message’s thread exists, the message (and its childs) gets linked to that thread starter. In case the start message does not exist, the message gets linked to parent id 0 and thus effectively represents a separate thread from here on. This is wrong but, but it is the best way to prevent problems. And remember: this shouldn’t happen anyway.

After fixing the stale data, our tree would look like this (asuming that node 10 had thread id 1):

+--------+-------+--------+
| parent | stale | childs |
+--------+-------+--------+
| 0      |       | 1/2/4  |
| 1      |       | 5/8/10 | <-- 10 added to node 1
| 2      |       | 3      |
| 3      |       | -      |
| 4      |       | 7      |
| 5      |       | 6      |
| 6      |       | -      |
| 7      |       | -      |
| 8      |       | 9      |
| 9      |       | -      |
| 10     |       | 11     |
| 11     |       | 12     |
| 12     |       | -      |
+--------+-------+--------+

Second pass: generate the sorted tree

We start at parent node 0. From there we go into the child messages one by one. For each child record, we first go into that child’s children (so it’s a depth first tree walk). It is very tempting to handle this in a classic recursive call solution, however this is not good for performance.

The Phorum team hacked up a non recursive tree sort after some beers at the MySQL conference 2007. This method uses a single loop over the messages and a stack to keep track of the tree. Here’s what this looks like for our example tree (I kept the stale data out of here, but that would work the same way once it’s tied to an existing parent). I hope you can follow this somewhat hard looking schema.

Legenda: 

  * = add a new item to the tree
  > = move down to a child node
  < = move up to a parent node 

  (N):X:Y:Z = Node N, children to process are X, Y and Z. 

action  stack0     stack1     stack2     stack3    tree order
----------------------------------------------------------------------------
  *0    (0):1/2/4                                  0
  >1    (0):2/4    (1):5/8                         0
  *1    (0):2/4    (1):5/8                         0,1
  >5    (0):2/4    (1):8      (5):6                0
  *5    (0):2/4    (1):8      (5):6                0,1,5
  >6    (0):2/4    (1):8      (5)        (6)       0,1,5
  *6    (0):2/4    (1):8      (5)        (6)       0,1,5,6
  <5    (0):2/4    (1):8      (5)                  0,1,5,6
  <1    (0):2/4    (1):8                           0,1,5,6
  >8    (0):2/4    (1)        (8):9                0,1,5,6
  *8    (0):2/4    (1)        (8):9                0,1,5,6,8
  >9    (0):2/4    (1)        (8):9                0,1,5,6,8
  >9    (0):2/4    (1)        (8)        (9)       0,1,5,6,8
  *9    (0):2/4    (1)        (8)        (9)       0,1,5,6,8,9
  <8    (0):2/4    (1)        (8)                  0,1,5,6,8,9
  <1    (0):2/4    (1)                             0,1,5,6,8,9
  <0    (0):2/4                                    0,1,5,6,8,9
  >2    (0):4      (2):3                           0,1,5,6,8,9
  *2    (0):4      (2):3                           0,1,5,6,8,9,2
  >3    (0):4      (2)        (3)                  0,1,5,6,8,9,2
  *3    (0):4      (2)        (3)                  0,1,5,6,8,9,2,3
  <2    (0):4      (2)                             0,1,5,6,8,9,2,3
  <0    (0):4                                      0,1,5,6,8,9,2,3
  >4    (0)        (4):7                           0,1,5,6,8,9,2,3
  *4    (0)        (4):7                           0,1,5,6,8,9,2,3,4
  >7    (0)        (4)        (7)                  0,1,5,6,8,9,2,3,4
  *7    (0)        (4)        (7)                  0,1,5,6,8,9,2,3,4,7
  <4    (0)        (4)                             0,1,5,6,8,9,2,3,4,7
  <0    (0)                                        0,1,5,6,8,9,2,3,4,7
----------------------------------------------------------------------------

While this was an okay view for me when tracking the algorithm, this might not be very appealing to others. So I drew up an image of the path that we take while walking the tree. Click the image below for a full view:

Tree walk path

The resulting tree order corresponds to the top to bottom order for the nodes in the example tree at the start of this document. The stack depth at the points where we add an item to the tree can be used to track at what level the node in the tree is situated. This data combined, is enough to build the sorted tree.

So, what about the reverse threading?

If reverse threading is enabled, then the nodes would be added in reverse order. So for node 0, the child nodes would be processed in the order 4, 2, 1 instead of 1, 2, 4 that we showed above. The whole tree walking schema is the same for the rest of the processing.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: