GiST is implemented in PostgreSQL and found to have exceptional performance. This is the paper behind it.

A search tree in database (e.g., B+ tree) is a balanced tree with high fanout. The leaf nodes of the tree contain pointers to actual data, usually as a linked list to allow scanning.

We model a query predicate as . Pointers on a node are entries which is the key of the entry. If matches , we traverse the subtree below until all the matching data is found. The only requirement here is that the key must logically match all data stored below . This is a generalization. In B+ tree, is a range predicate and in R tree, is a region predicate such as overlaps . Note that in R tree, keys on a node (i.e., regions) may overlap and hence the key does not require all data in the tree that matches to be under .

The paper models a database search tree as a hierarchy of partitions of a dataset, which

  • requires a Boolean method to tell if is consistent with a given search key
  • requires a node splitting algorithm to group data into categories

GiST (generalized search tree) is proposed as balanced tree of variable fanout

  • root node has 2 to fanout, other nodes has to fanout, with
  • is the minimum fill factor of the tree
  • Each fanout is a tuple
    • predicate as a search key
    • pointer to some tuple in the database if it is a leaf node, or pointer to another tree node otherwise

Properties in a GiST (invariants)

  1. every node contains between and index entries unless it is the root
  2. in leaf node, , holds for the tuple
  3. in non-leaf node, , holds for any tuple reachable from
  4. root has at least two children unless it is a leaf
  5. all leaves appear on the same level

Here an entry in a child node and an entry in its parent does not require $$p’\to p$ to allow a different classification on each level of the tree. This is the requirement in R tree (containment hierarchy) but relaxed in GiST.

Implementation of GiST: It needs six primitive methods

  1. consistent()
    given entry and query predicate returns false only if is unsatisfiable. If consistent() returns true while is unsatisfiable, there will be penalty in performance but not correctness
  2. union()
    given a set of entries returns some predicates that holds for all tuples stored below through . Implementation includes but not limited to
  3. compress()
    given entry return entry which is a compressed representation of
  4. decompress()
    given where compress(), returns an entry such that . We do not require to allow lossy compression. Compress and decompress can be simply an identity function, or to make compress a function to simplify .
  5. penalty()
    Given and , returns a penalty for inserting into the subtree pointed by . This is to aid split and insert algorithm. Typical penalty metric is the increase of size from to Union of and . In case of R tree on , this can be the area increase from to the
  6. picksplit()
    given a set of entries , split into two sets and each of size at least , optionally optimize for some badness metric

If linear order exists for keys, we further requires defined. Then operations on the tree are described by the following functions

Search: Given a GiST rooted at , search predicate , find all tuples that satisfy

function Search(R, q)
    if not is_leaf(R)
        foreach E in entries_of(R)
            if consistent(E, q)
                Search(E.ptr, q)
            end
        end
    else
        if consistent(E, q)
            emit(E.ptr)
        end
    end
end

FindMin: Given a GiST rooted at , search predicate , find the minimum tuple in linear order that satisfies . This works only if there is a linear order for the index

function FindMin(R, q)
    if not is_leaf(R)
        foreach E in entries_of(R)
            # scan from left to right
            if consistent(E, q)
                return FindMin(E.ptr, q)
            end
        end
    else
        foreach E in entries_of(R)
            # scan from left to right
            if consistent(E, q)
                return E
            end
        end
        # if cannot find any entries satisfied q
        return NULL
    end
end

Next: Given a GiST rooted at , search predicate , and an entry , find the next entry in linear order that satisfies . This works only if there is a linear order for the index

function Next(R, q, E)
    # find next node N
    if E != rightmost(entries_of(R))
        N = right_of(E)
    else
        P = next node to the right of R on the same level of the tree
        # find next node P by tree traversal of sideway pointers
        if P == NULL
            # not exists
            return NULL
        end
        N = leftmost(entries_of(P))
    end
    # check next node N
    if consistent(N, q)
        return N
    else
        return NULL
    end
end

Insert: Given a GiST rooted at , an entry , and level , modify to have inserted at level

function Insert(R, E, l)
    L = ChooseSubtree(R, E, l)
    if there is room for E on L
        insert E to node L
    else
        Split(R, L, E)
    AdjustKeys(R, L)
end

ChooseSubtree: Helper function to Insert. Given a GiST rooted at , entry , and level , return the node at level that is best suited to hold entry with predicate

function ChooseSubtree(R, E, l)
    if level_of(R) == l
        return R
    else
        F = arg min Penalty(F, E) for F=(q, ptr) in entries_of(R)
        return ChooseSubtree(F.ptr, E, l)
    end
end

Split: Helper function to Insert. Given a GiST rooted at , a node on the GiST, and a new entry , update the tree with split into two and inserted into one of the split

function Split(R, N, E)
    # update N, create new node N'
    N, N' = PickSplit(Union(N, {E}))
    # create new entry to point to node N'
    q = Union(N')
    ptr = pointer to N'
    E_N' = (q, ptr')
    if there is room for E_N' on parent_of(N):
        insert E_N' to node parent_of(N)
    else:
        Split(R, parent_of(N), E_N')
    F = entry that points to N
    F.p = Union(N)
end

AdjustKeys: Helper function to Insert. Given a GiST rooted at and a node on GiST, update the GiST with ancestors of contains correct and specific keys

function AdjustKeys(R, N)
    if is_root(N)
        return
    end
    E = entry that points to N
    if E == Union(N)
        # E is an already-accurate representation of Union of entries on N
        return
    else:
        E.p = Union(N)
        AdjustKeys(R, parent_of(N))
    end
end

Delete: Given a GiST and a leaf entry update to be a balanced GiST have removed

function Delete(R, E)
    L = Search(R, E.p)  # find leaf node L contiains E
    if L == NULL
        # no-op: node L contains E not found
        return R
    end
    Remove entry E from node L
    CondenseTree(R, L)
    if length(entries_of(R)) == 1:
        R = child_of(R)
    end
    return R
end

CondenseTree: Given a GiST containing leaf node , modify the tree to maintain invariant properties. In particular, maintain node to have at least entries.

function CondenseTree(R, L)
    N = L
    Q = {}  # set of eliminated nodes
    while not is_root(N)
        P = parent_of(N)
        E_N = entry on P that points to N
        if count(entries_of(N)) < kM
            if tree is not ordered
                Q.add(elements of N)
                delete E_N from P
                AdjustKeys(R, P)
            else
                N' = neighbor node of N in order
                if count(entries_of(N')) + count(entries_of(N)) >= 2kM
                    evenly split entries on N and N' between two nodes
                else
                    place entries from N into N'
                    delete E_N from P
                    AdjustKeys(R, N')
                    AdjustKeys(R, P)
                end
            end
        end
        if E_N was not deleted from P
            AdjustKeys(R, N)
            break
        else:
            N = P
        end
    end if
    if not is_empty(Q)
        foreach E in entries_of(Q)
            Insert(R, E, level_of(E))
        end
    end
end

With this skeleton, we can implement various tree data structure. For example, B+ tree is a GiST over , and a key is a range in . B+ tree supports the following query predicates:

  1. Contains()
    Return true iff
  2. Equal()
    Return true iff

And it implements the following:

  1. consistent()
    with . If (contains query), this returns true if . If (equal query), this returns true if .
  2. union()
    for of entries returns which and
  3. compress()
    given , if is the leftmost key on a non-leaf node, return NULL; otherwise return
  4. decompress()
    if is the leftmost key on a non-leaf node, set otherwise . If is the rightmost key on a non-leaf node, let otherwise to be the value returned by Next(). If is a leaf node, set . Return
  5. penalty()
    If is the leftmost pointer on its node, return . If is the rightmost pointer on its node, return . Otherwise return
  6. picksplit()
    Left the first entries in order go to left group and the rest go in the right group. This guarantee a minimum fill facto of .

Another example, R tree over . A key is for a bounding rectangle. R tree supports the following query predicates:

  1. Contains()
    Return true iff
  2. Overlap()
    Return true iff
  3. Equal()
    Return true iff

and the implementation in GiST as follows (with key expressed bounding box, i.e., contains query):

  1. consistent()
    with a contains, overlap, or equal query, return true if Overlap(, ) is true
  2. union()
    for set of entries, return a new bounding box , which , , ,
  3. compress()
    given with as a polygon in the form of line segments, return , which , , ,
  4. decompress()
    identity function
  5. penalty()
    compute Union() then return
  6. picksplit()
    See Guttman (1984) paper “R-trees: A dynamic index structure for spatial searching” for R-tree splitting algorithm

Yet another example, RD-tree over . A key is a set and the following query predicates are supported:

  1. Contains()
    Return true iff
  2. Overlap()
    Return true iff
  3. Equal()
    Return true iff

The implementation in GiST is as follows: Predicate in an entry is

  1. consistent()
    and (or overlap or equal), returns true if Overlap() is true
  2. union()
    for set of return
  3. compress()
    Return a range set equivalent. That is, find the disjoint ranges from the set
  4. decompress()
    Connvert the range set back to sets by enumerating the elements in the range
  5. penalty()
    Return
  6. picksplit()
    See Guttman (1984) paper for the quadratic algorithm for R-tree split

Bibliographic data

@inproceedings{
   title = "Generalized Search Trees for Database Systems",
   author = "Joseph M. Hellerstein and Jeffrey F. Naughton and Avi Pfeffer",
   booktitle = "Proceedings of the 21st VLDB Conference",
   year = "1995",
   address = "Zurich, Switzerland",
   url = "http://db.cs.berkeley.edu/papers/UW-CS-TR-1274.pdf",
}