GiST is implemented in PostgreSQL and found to have exceptional performance. This is the paper behind it.

A search tree in database (e.g., B+ tree) is a balanced tree with high fanout. The leaf nodes of the tree contain pointers to actual data, usually as a linked list to allow scanning.

We model a query predicate as . Pointers on a node are entries which is the key of the entry. If matches , we traverse the subtree below until all the matching data is found. The only requirement here is that the key must logically match all data stored below . This is a generalization. In B+ tree, is a range predicate and in R tree, is a region predicate such as overlaps . Note that in R tree, keys on a node (i.e., regions) may overlap and hence the key does not require all data in the tree that matches to be under .

The paper models a database search tree as *a hierarchy of partitions of a dataset*, which

- requires a Boolean method to tell if is consistent with a given search key
- requires a node splitting algorithm to group data into categories

GiST (generalized search tree) is proposed as balanced tree of variable fanout

- root node has 2 to fanout, other nodes has to fanout, with
- is the minimum fill factor of the tree
- Each fanout is a tuple
- predicate as a search key
- pointer to some tuple in the database if it is a leaf node, or pointer to another tree node otherwise

Properties in a GiST (invariants)

- every node contains between and index entries unless it is the root
- in leaf node, , holds for the tuple
- in non-leaf node, , holds for any tuple reachable from
- root has at least two children unless it is a leaf
- all leaves appear on the same level

Here an entry in a child node and an entry in its parent does not require $$p’\to p$ to allow a different classification on each level of the tree. This is the requirement in R tree (containment hierarchy) but relaxed in GiST.

Implementation of GiST: It needs six primitive methods

- consistent()

given entry and query predicate returns false only if is unsatisfiable. If consistent() returns true while is unsatisfiable, there will be penalty in performance but not correctness - union()

given a set of entries returns some predicates that holds for all tuples stored below through . Implementation includes but not limited to - compress()

given entry return entry which is a compressed representation of - decompress()

given where compress(), returns an entry such that . We do not require to allow lossy compression. Compress and decompress can be simply an identity function, or to make compress a function to simplify . - penalty()

Given and , returns a penalty for inserting into the subtree pointed by . This is to aid split and insert algorithm. Typical penalty metric is the increase of size from to Union of and . In case of R tree on , this can be the area increase from to the - picksplit()

given a set of entries , split into two sets and each of size at least , optionally optimize for some badness metric

If linear order exists for keys, we further requires defined. Then operations on the tree are described by the following functions

Search: Given a GiST rooted at , search predicate , find all tuples that satisfy

```
function Search(R, q)
if not is_leaf(R)
foreach E in entries_of(R)
if consistent(E, q)
Search(E.ptr, q)
end
end
else
if consistent(E, q)
emit(E.ptr)
end
end
end
```

FindMin: Given a GiST rooted at , search predicate , find the minimum tuple in linear order that satisfies . This works only if there is a linear order for the index

```
function FindMin(R, q)
if not is_leaf(R)
foreach E in entries_of(R)
# scan from left to right
if consistent(E, q)
return FindMin(E.ptr, q)
end
end
else
foreach E in entries_of(R)
# scan from left to right
if consistent(E, q)
return E
end
end
# if cannot find any entries satisfied q
return NULL
end
end
```

Next: Given a GiST rooted at , search predicate , and an entry , find the next entry in linear order that satisfies . This works only if there is a linear order for the index

```
function Next(R, q, E)
# find next node N
if E != rightmost(entries_of(R))
N = right_of(E)
else
P = next node to the right of R on the same level of the tree
# find next node P by tree traversal of sideway pointers
if P == NULL
# not exists
return NULL
end
N = leftmost(entries_of(P))
end
# check next node N
if consistent(N, q)
return N
else
return NULL
end
end
```

Insert: Given a GiST rooted at , an entry , and level , modify to have inserted at level

```
function Insert(R, E, l)
L = ChooseSubtree(R, E, l)
if there is room for E on L
insert E to node L
else
Split(R, L, E)
AdjustKeys(R, L)
end
```

ChooseSubtree: Helper function to Insert. Given a GiST rooted at , entry , and level , return the node at level that is best suited to hold entry with predicate

```
function ChooseSubtree(R, E, l)
if level_of(R) == l
return R
else
F = arg min Penalty(F, E) for F=(q, ptr) in entries_of(R)
return ChooseSubtree(F.ptr, E, l)
end
end
```

Split: Helper function to Insert. Given a GiST rooted at , a node on the GiST, and a new entry , update the tree with split into two and inserted into one of the split

```
function Split(R, N, E)
# update N, create new node N'
N, N' = PickSplit(Union(N, {E}))
# create new entry to point to node N'
q = Union(N')
ptr = pointer to N'
E_N' = (q, ptr')
if there is room for E_N' on parent_of(N):
insert E_N' to node parent_of(N)
else:
Split(R, parent_of(N), E_N')
F = entry that points to N
F.p = Union(N)
end
```

AdjustKeys: Helper function to Insert. Given a GiST rooted at and a node on GiST, update the GiST with ancestors of contains correct and specific keys

```
function AdjustKeys(R, N)
if is_root(N)
return
end
E = entry that points to N
if E == Union(N)
# E is an already-accurate representation of Union of entries on N
return
else:
E.p = Union(N)
AdjustKeys(R, parent_of(N))
end
end
```

Delete: Given a GiST and a leaf entry update to be a balanced GiST have removed

```
function Delete(R, E)
L = Search(R, E.p) # find leaf node L contiains E
if L == NULL
# no-op: node L contains E not found
return R
end
Remove entry E from node L
CondenseTree(R, L)
if length(entries_of(R)) == 1:
R = child_of(R)
end
return R
end
```

CondenseTree: Given a GiST containing leaf node , modify the tree to maintain invariant properties. In particular, maintain node to have at least entries.

```
function CondenseTree(R, L)
N = L
Q = {} # set of eliminated nodes
while not is_root(N)
P = parent_of(N)
E_N = entry on P that points to N
if count(entries_of(N)) < kM
if tree is not ordered
Q.add(elements of N)
delete E_N from P
AdjustKeys(R, P)
else
N' = neighbor node of N in order
if count(entries_of(N')) + count(entries_of(N)) >= 2kM
evenly split entries on N and N' between two nodes
else
place entries from N into N'
delete E_N from P
AdjustKeys(R, N')
AdjustKeys(R, P)
end
end
end
if E_N was not deleted from P
AdjustKeys(R, N)
break
else:
N = P
end
end if
if not is_empty(Q)
foreach E in entries_of(Q)
Insert(R, E, level_of(E))
end
end
end
```

With this skeleton, we can implement various tree data structure. For example, B+ tree is a GiST over , and a key is a range in . B+ tree supports the following query predicates:

- Contains()

Return true iff - Equal()

Return true iff

And it implements the following:

- consistent()

with . If (contains query), this returns true if . If (equal query), this returns true if . - union()

for of entries returns which and - compress()

given , if is the leftmost key on a non-leaf node, return NULL; otherwise return - decompress()

if is the leftmost key on a non-leaf node, set otherwise . If is the rightmost key on a non-leaf node, let otherwise to be the value returned by Next(). If is a leaf node, set . Return - penalty()

If is the leftmost pointer on its node, return . If is the rightmost pointer on its node, return . Otherwise return - picksplit()

Left the first entries in order go to left group and the rest go in the right group. This guarantee a minimum fill facto of .

Another example, R tree over . A key is for a bounding rectangle. R tree supports the following query predicates:

- Contains()

Return true iff - Overlap()

Return true iff - Equal()

Return true iff

and the implementation in GiST as follows (with key expressed bounding box, i.e., contains query):

- consistent()

with a contains, overlap, or equal query, return true if Overlap(, ) is true - union()

for set of entries, return a new bounding box , which , , , - compress()

given with as a polygon in the form of line segments, return , which , , , - decompress()

identity function - penalty()

compute Union() then return - picksplit()

See Guttman (1984) paper “R-trees: A dynamic index structure for spatial searching” for R-tree splitting algorithm

Yet another example, RD-tree over . A key is a set and the following query predicates are supported:

- Contains()

Return true iff - Overlap()

Return true iff - Equal()

Return true iff

The implementation in GiST is as follows: Predicate in an entry is

- consistent()

and (or overlap or equal), returns true if Overlap() is true - union()

for set of return - compress()

Return a range set equivalent. That is, find the disjoint ranges from the set - decompress()

Connvert the range set back to sets by enumerating the elements in the range - penalty()

Return - picksplit()

See Guttman (1984) paper for the quadratic algorithm for R-tree split

## Bibliographic data

```
@inproceedings{
title = "Generalized Search Trees for Database Systems",
author = "Joseph M. Hellerstein and Jeffrey F. Naughton and Avi Pfeffer",
booktitle = "Proceedings of the 21st VLDB Conference",
year = "1995",
address = "Zurich, Switzerland",
url = "http://db.cs.berkeley.edu/papers/UW-CS-TR-1274.pdf",
}
```