Friday, February 22, 2008

Notes from Dagstuhl I: Red-black trees

I've just returned (well almost: I'm in JFK waiting out the storm) from a Dagstuhl workshop on data structures. For those not in the know, this is one of the common retreat spots for computer scientists to get together and chat about problems in a very relaxed atmosphere (close to the true spirit of a workshop).

In that spirit, people present things that are more "uncooked", half-baked, or incomplete, and so I won't talk about too many of the presentations, except for the ones which are available online.

In this post, I'll discuss a new take on an old warhorse: the red-black tree.

A bit of background (for those unfamiliar with the story thus far):
One of the most fundamental data structures in computer science is the binary search tree. It's built over a set of items with an ordering defined on them, and has the property that, at every node, all items in the left subtree are smaller than the item at the node, which in turn is smaller than all items in the right subtree. BSTs are handy for building structures that allow for quick membership queries, or "less-than"-type queries.
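
In code, the whole structure is just a node with two children and a comparison-driven walk; here's a minimal sketch (the names are illustrative, not from any particular library):

    // A minimal BST over ints: smaller keys go left, larger keys go right.
    class BSTNode {
        int key;
        BSTNode left, right;
        BSTNode(int key) { this.key = key; }
    }

    class BST {
        BSTNode root;

        // Membership query: walk down, branching by comparison at each node.
        boolean contains(int key) {
            BSTNode x = root;
            while (x != null) {
                if      (key < x.key) x = x.left;
                else if (key > x.key) x = x.right;
                else return true;   // found it
            }
            return false;
        }
    }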

One can build a BST with depth log n for a set of n ordered items. This means that operations on this structure will take log n time in general. However, if the items can change on the fly, then it's more difficult to maintain such a structure while making sure that updates themselves are cheap (O(log n)).

Among the many solutions proposed for dynamic maintenance of BSTs is the red-black tree, proposed in its current form by Leo Guibas and Bob Sedgewick back in 1978. The red-black tree is the definitive dynamic BST data structure, at least in practice: it has worst-case O(log n) bounds for all operations, and shows up in implementations of basic structures in the STL and in many language libraries. By virtue of its being in CLRS, it has also stupefied the minds of undergraduates for many years now.

The new story:
Conceptually, the re-balancing operations used to maintain a red-black tree are not terribly hard. However, there are numerous cases to consider, especially when doing deletions, and this tends to make the actual code for these trees fairly complex, and thus potentially error-prone. In the first talk at Dagstuhl, Bob Sedgewick talked about a simpler variant of red-black trees, called left-leaning red-black trees, whose main virtue is that they are simpler to implement (many fewer lines of code) while maintaining the nice properties of red-black trees.

Bob's slides have more details: although this new structure hasn't been published anywhere, it will appear in the next edition of his book. The key idea is to first understand how a red-black tree simulates a 2-3-4 tree, a more complex search tree in which nodes can have two, three, or four children (and therefore one, two, or three keys). It's possible to construct a direct mapping between a node in a 2-3-4 tree and a small subtree in a red-black tree.
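
To make the mapping concrete, here is one way to picture the encoding (a sketch of my own, not taken from the slides): a red link glues two binary nodes together into a single "fat" 2-3-4 node, while black links are the real edges of the tree.

    // Illustrative encoding of a 3-node (keys a < b) as two binary nodes
    // joined by a red link; this Node class is a sketch, not Bob's code.
    //
    //      2-3-4 node:              red-black encoding:
    //
    //         [a b]                      b (black)
    //        /  |  \        <-->        / \
    //       L   M   R          (red)   a   R
    //                                  / \
    //                                 L   M
    //
    class Node {
        int key;
        Node left, right;
        boolean isRed;   // color of the link from this node's parent
        Node(int key, boolean isRed) { this.key = key; this.isRed = isRed; }
    }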

Once this is understood, the LLRB tree comes about by restricting the subtrees thus formed to be "left-leaning". Specifically, right-leaning red edges are not allowed, and to prevent the tree from being too skewed, more than three consecutive left-leaning red edges are disallowed as well. Doing this allows us to simplify various cases in the insertion and deletion steps, while not increasing the tree depth by too much (this last statement was argued by analogy with 2-3-4 trees).
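
To give a feel for the simplification, here is a sketch of LLRB insertion. It follows the shape of the published 2-3 variant rather than the exact code from the talk, and it reuses the Node class sketched above; the entire rebalancing reduces to three local checks on the way back up the tree:

    class LLRB {
        Node root;

        void insert(int key) {
            root = insert(root, key);
            root.isRed = false;                        // the root stays black
        }

        private Node insert(Node h, int key) {
            if (h == null) return new Node(key, true); // new nodes are red
            if      (key < h.key) h.left  = insert(h.left, key);
            else if (key > h.key) h.right = insert(h.right, key);
            // (equal keys: already present, nothing to do)

            // All of the rebalancing logic:
            if (isRed(h.right) && !isRed(h.left))     h = rotateLeft(h);
            if (isRed(h.left) && isRed(h.left.left))  h = rotateRight(h);
            if (isRed(h.left) && isRed(h.right))      flipColors(h);
            return h;
        }

        private boolean isRed(Node x) { return x != null && x.isRed; }

        private Node rotateLeft(Node h) {    // fix a right-leaning red link
            Node x = h.right;
            h.right = x.left;  x.left = h;
            x.isRed = h.isRed; h.isRed = true;
            return x;
        }

        private Node rotateRight(Node h) {   // fix two reds in a row
            Node x = h.left;
            h.left = x.right;  x.right = h;
            x.isRed = h.isRed; h.isRed = true;
            return x;
        }

        private void flipColors(Node h) {    // split a temporary 4-node
            h.isRed = true;
            h.left.isRed = false;
            h.right.isRed = false;
        }
    }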

It's a simple idea, and simplifies code by a substantial amount without sacrificing asymptotics.

17 comments:

  1. Without sacrificing asymptotics -- but any sacrifice in the constant factors?

  2. I'm not sure. Bob was arguing that the code's simplicity and brevity help offset the increased number of steps needed to traverse the (generally deeper) tree. But again, I didn't see detailed proofs.

    Although, come to think of it, I should have :), since he (and Philippe Flajolet) push the idea of exact analysis.

  3. There was some work on simplifying BST implementations back in the 90s. I fondly remember "deterministic skip lists" and another design from Arne Andersson (Weiss's textbook calls it the AA-tree, IIRC). Andersson's design, in particular, uses a similar idea of restricting the encoding of 3-nodes in a 2-3 tree to be "right-leaning". The code is very clean too. It's great to see this cute idea popularized more widely!

    A. Andersson. Balanced search trees made simple. In Proc. Workshop on Algorithms and Data Structures (WADS), pages 60-71. Springer-Verlag, 1993.
    http://user.it.uu.se/~arnea/abs/simp.html

  4. So, this is interesting. Someone actually brought up the Andersson skip list implementation during the discussion. I'm not sure how things were resolved: Raimund (Seidel) and Bob continued to discuss things later on.

  5. In this day and age of multi-level cache hierarchies and memory access that keeps getting slower relative to processor speed, a plain old red-black tree, 2-3(-4) tree, or even a skip list carries a pretty high constant factor (in terms of latency) for each pointer dereference.

    What you need is locality of reference, and better yet contiguity of reference, as you get in B-trees and all their variants, which database people adopted long ago because they were dealing with the huge latencies of disks. But even people working on memory-resident algorithms should now take these factors into account...

  6. Treaps. Do I have to say anything more?

    Yes? Oh well - they do outperform orange-purple trees any day of the week (including Sundays), and they are way, way simpler. What's much harder is the analysis, naturally.

    Skip lists are cute, but in fact inferior to yellow-gray trees, because they perform twice the number of memory allocations (a huge overhead in practice unless you write your own memory manager, and even then the locality of reference is going to be bad).

    Naturally, if the cache hierarchy is the bottleneck, B-trees or some other I/O-efficient or cache-oblivious data structure should be used.

    I remember seeing some paper comparing all these data structures, with treaps coming out on top and the rest being inferior.
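
    For concreteness, a treap does an ordinary BST insert and then rotates the new node up while its random priority beats its parent's; a minimal illustrative sketch, in the same spirit as the snippets in the post (not code from any paper):

        import java.util.Random;

        // A minimal treap: BST order on keys, max-heap order on random
        // priorities; rotations restore the heap property after insertion.
        class Treap {
            static final Random RNG = new Random();

            static class Node {
                int key;
                double priority = RNG.nextDouble();
                Node left, right;
                Node(int key) { this.key = key; }
            }

            Node root;

            void insert(int key) { root = insert(root, key); }

            private Node insert(Node h, int key) {
                if (h == null) return new Node(key);
                if (key < h.key) {
                    h.left = insert(h.left, key);
                    if (h.left.priority > h.priority) h = rotateRight(h);
                } else if (key > h.key) {
                    h.right = insert(h.right, key);
                    if (h.right.priority > h.priority) h = rotateLeft(h);
                }
                return h;
            }

            private Node rotateRight(Node h) {
                Node x = h.left;  h.left = x.right;  x.right = h;  return x;
            }

            private Node rotateLeft(Node h) {
                Node x = h.right; h.right = x.left;  x.left = h;   return x;
            }
        }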

  7. This RB implementation is quite short (but lacks deletion): http://www.eecs.usma.edu/webs/people/okasaki/sigcse05/RedBlack.java

  8. > This RB implementation is quite short (but lacks deletion):

    That is exactly the problem: deletion is complicated. AA-trees are similar in spirit to LLRB trees, yet deletion is still marginally more complicated.

  9. According to the slides, the point is that both insertion and deletion are easier. I'm just saying that there exist better implementations of insertion than the ones given as examples (the Java API and Algorithms in Java), when measured by code complexity.

    I don't know which would be a good existing implementation for deletions.

  10. Ah, seeing Suresh's #4, I realize my writing was ambiguous. To set the record straight: deterministic skip lists are by Munro, Papadakis, and Sedgewick; Andersson is responsible for the simplified binarization of 2-3 trees. Sorry for the confusion.

  11. Very timely discussion - I am about to teach search trees this Thursday (in an undergraduate algorithms class), and I am TORN between 2-3(-4) trees and RB-trees. The respective pros:

    - RB-trees: are in the book (CLRS); examples of BINARY search trees; doable in finite time

    - 2-3(-4)-trees: simple and intuitive search and insertion, although not deletion; no (risk of breaking your neck while tracing) rotations.

    Any thoughts on this choice? Also, any good lecture notes on 2-3(-4) trees out there?

    Thanks,

  12. If you look at what Bob talks about in his notes, it seems that RB trees are "merely" a simpler way of thinking about 2-3-4 trees, which are the "real" thing.

    In other words, the concept is made clear in 2-3-4 trees, and via a one-one correspondence between structures, the RB-tree simulates the 2-3-4 tree with "easier" code.

    I don't know if that helps :)

  13. > If you look at what Bob talks about in his notes,
    > it seems that RB trees are "merely" a simpler
    > way of thinking about 2-3-4 trees, which
    > are the "real" thing.

    I think it's more accurate to say RB trees are the practical way to *implement* 2-3-4 trees. But it's simpler to think in terms of 2-3-4 trees.

  14. Sure, there is a mapping between 2-3-4 trees and RB trees. Problem is, I will likely be able to do only one of them. So the dilemma remains.

    Cheers,

  15. > Sure, there is a mapping between 2-3-4
    > trees and RB trees. Problem is, I will likely
    > be able to do only one of them. So the
    > dilemma remains.

    Here's one idea. Do 2-3-4 trees in class. State the 1-1 correspondence between left-leaning red-black trees and 2-3-4 trees to make the connection with binary search trees. Assign insertion for homework.

  16. My 2 cents about this.

    http://magic.aladdin.cs.cmu.edu/2008/02/27/my-vote-goes-to-red-black/

  17. Can anyone comment on AVL trees (advantages/disadvantages)?

