GiST is implemented in PostgreSQL and found to have exceptional performance. This is the paper behind it.

A search tree in database (e.g., B+ tree) is a balanced tree with high fanout. The leaf nodes of the tree contain pointers to actual data, usually as a linked list to allow scanning.

We model a query predicate as \(q\). Pointers on a node are entries \(E=(p, ptr)\) which \(p\) is the key of the entry. If \(p\) matches \(q\), we traverse the subtree below \(ptr\) until all the matching data is found. The only requirement here is that the key \(p\) must logically match all data stored below \(ptr\). This is a generalization. In B+ tree, \(q\) is a range predicate \(c_1 \le i \le c_2\) and in R tree, \(q\) is a region predicate such as \((x_1, x_2, y_1, y_2)\) overlaps \(i\). Note that in R tree, keys on a node (i.e., regions) may overlap and hence the key \(p\) does not require all data in the tree that matches to be under \(ptr\).

The paper models a database search tree as *a hierarchy of partitions of a dataset*, which

- requires a Boolean method to tell if \(q\) is consistent with a given search key
- requires a node splitting algorithm to group data into categories

GiST (generalized search tree) is proposed as balanced tree of variable fanout

- root node has 2 to \(M\) fanout, other nodes has \(kM\) to \(M\) fanout, with \(2/M \le k \le 1/2\)
- \(k\) is the minimum fill factor of the tree
- Each fanout is a tuple \((p, ptr)\)
- \(p\) predicate as a search key
- \(ptr\) pointer to some tuple in the database if it is a leaf node, or pointer to another tree node otherwise

Properties in a GiST (invariants)

- every node contains between \(kM\) and \(M\) index entries unless it is the root
- in leaf node, \((p, ptr)\), \(p\) holds for the tuple \(ptr\)
- in non-leaf node, \((p, ptr)\), \(p\) holds for any tuple reachable from \(ptr\)
- root has at least two children unless it is a leaf
- all leaves appear on the same level

Here an entry \((p', ptr')\) in a child node and an entry in its parent \((p, ptr)\) does not require $$p’\to p$ to allow a different classification on each level of the tree. This is the requirement in R tree (containment hierarchy) but relaxed in GiST.

Implementation of GiST: It needs six primitive methods

- consistent(\(E,q\))

given entry \(E=(p, ptr)\) and query predicate \(q\) returns false only if \(p \wedge q\) is unsatisfiable. If consistent(\(E,q\)) returns true while \(p \wedge q\) is unsatisfiable, there will be penalty in performance but not correctness - union(\(P\))

given a set \(P\) of entries \((p_k, ptr_k)\) returns some predicates \(r\) that holds for all tuples stored below \(ptr_1\) through \(ptr_n\). Implementation includes but not limited to \(r = (p_1 \vee p_2 \vee \cdots \vee p_n)\) - compress(\(E\))

given entry \(E=(p, ptr)\) return entry \((\pi, ptr)\) which \(\pi\) is a compressed representation of \(p\) - decompress(\(E\))

given \(E=(\pi, ptr)\) where \(\pi=\)compress(\(p\)), returns an entry \((r,ptr)\) such that \(p \to r\). We do not require \(p \leftrightarrow r\) to allow lossy compression. Compress and decompress can be simply an identity function, or to make compress a function to simplify \(p\). - penalty(\(E_1, E_2\))

Given \(E_1=(p_1, ptr_1)\) and \(E_2=(p_2, ptr_2)\), returns a penalty for inserting \(E_2\) into the subtree pointed by \(ptr_1\). This is to aid split and insert algorithm. Typical penalty metric is the increase of size from \(E_1\) to Union of \(E_1\) and \(E_2\). In case of R tree on \(\mathbb{R}^2\), this can be the area increase from \(p_1\) to the \(p_1 \cup p_2\) - picksplit(\(P\))

given a set \(P\) of \(M+1\) entries \((p_k, ptr_k)\), split \(P\) into two sets \(P_1\) and \(P_2\) each of size at least \(kM\), optionally optimize for some badness metric

If linear order exists for keys, we further requires \(p_1 < p_2\) defined. Then operations on the tree are described by the following functions

Search: Given a GiST rooted at \(R\), search predicate \(q\), find all tuples that satisfy \(q\)

```
function Search(R, q)
if not is_leaf(R)
foreach E in entries_of(R)
if consistent(E, q)
Search(E.ptr, q)
end
end
else
if consistent(E, q)
emit(E.ptr)
end
end
end
```

FindMin: Given a GiST rooted at \(R\), search predicate \(q\), find the minimum tuple in linear order that satisfies \(q\). This works only if there is a linear order for the index

```
function FindMin(R, q)
if not is_leaf(R)
foreach E in entries_of(R)
# scan from left to right
if consistent(E, q)
return FindMin(E.ptr, q)
end
end
else
foreach E in entries_of(R)
# scan from left to right
if consistent(E, q)
return E
end
end
# if cannot find any entries satisfied q
return NULL
end
end
```

Next: Given a GiST rooted at \(R\), search predicate \(q\), and an entry \(E\), find the next entry in linear order that satisfies \(q\). This works only if there is a linear order for the index

```
function Next(R, q, E)
# find next node N
if E != rightmost(entries_of(R))
N = right_of(E)
else
P = next node to the right of R on the same level of the tree
# find next node P by tree traversal of sideway pointers
if P == NULL
# not exists
return NULL
end
N = leftmost(entries_of(P))
end
# check next node N
if consistent(N, q)
return N
else
return NULL
end
end
```

Insert: Given a GiST rooted at \(R\), an entry \(E\), and level \(l\), modify \(R\) to have \(E\) inserted at level \(l\)

```
function Insert(R, E, l)
L = ChooseSubtree(R, E, l)
if there is room for E on L
insert E to node L
else
Split(R, L, E)
AdjustKeys(R, L)
end
```

ChooseSubtree: Helper function to Insert. Given a GiST rooted at \(R\), entry \(E=(p, ptr)\), and level \(l\), return the node at level \(l\) that is best suited to hold entry with predicate \(E.p\)

```
function ChooseSubtree(R, E, l)
if level_of(R) == l
return R
else
F = arg min Penalty(F, E) for F=(q, ptr) in entries_of(R)
return ChooseSubtree(F.ptr, E, l)
end
end
```

Split: Helper function to Insert. Given a GiST rooted at \(R\), a node \(N\) on the GiST, and a new entry \(E=(p, ptr)\), update the tree with \(N\) split into two and \(E\) inserted into one of the split

```
function Split(R, N, E)
# update N, create new node N'
N, N' = PickSplit(Union(N, {E}))
# create new entry to point to node N'
q = Union(N')
ptr = pointer to N'
E_N' = (q, ptr')
if there is room for E_N' on parent_of(N):
insert E_N' to node parent_of(N)
else:
Split(R, parent_of(N), E_N')
F = entry that points to N
F.p = Union(N)
end
```

AdjustKeys: Helper function to Insert. Given a GiST rooted at \(R\) and a node \(N\) on GiST, update the GiST with ancestors of \(N\) contains correct and specific keys

```
function AdjustKeys(R, N)
if is_root(N)
return
end
E = entry that points to N
if E == Union(N)
# E is an already-accurate representation of Union of entries on N
return
else:
E.p = Union(N)
AdjustKeys(R, parent_of(N))
end
end
```

Delete: Given a GiST \(R\) and a leaf entry \(E=(p, ptr)\) update \(R\) to be a balanced GiST have \(E\) removed

```
function Delete(R, E)
L = Search(R, E.p) # find leaf node L contiains E
if L == NULL
# no-op: node L contains E not found
return R
end
Remove entry E from node L
CondenseTree(R, L)
if length(entries_of(R)) == 1:
R = child_of(R)
end
return R
end
```

CondenseTree: Given a GiST \(R\) containing leaf node \(L\), modify the tree to maintain invariant properties. In particular, maintain node \(L\) to have at least \(kM\) entries.

```
function CondenseTree(R, L)
N = L
Q = {} # set of eliminated nodes
while not is_root(N)
P = parent_of(N)
E_N = entry on P that points to N
if count(entries_of(N)) < kM
if tree is not ordered
Q.add(elements of N)
delete E_N from P
AdjustKeys(R, P)
else
N' = neighbor node of N in order
if count(entries_of(N')) + count(entries_of(N)) >= 2kM
evenly split entries on N and N' between two nodes
else
place entries from N into N'
delete E_N from P
AdjustKeys(R, N')
AdjustKeys(R, P)
end
end
end
if E_N was not deleted from P
AdjustKeys(R, N)
break
else:
N = P
end
end if
if not is_empty(Q)
foreach E in entries_of(Q)
Insert(R, E, level_of(E))
end
end
end
```

With this skeleton, we can implement various tree data structure. For example, B+ tree is a GiST over \(\mathbb{Z}\), and a key is a range \([x,y)\) in \(\mathbb{Z}\). B+ tree supports the following query predicates:

- Contains(\([x,y), v\))

Return true iff \(x\le v < y\) - Equal(\(x, v\))

Return true iff \(x=v\)

And it implements the following:

- consistent(\(E,q\))

\(E=(p, ptr)\) with \(p=[x_p, y_p)\). If \(q=[x_q, y_q)\) (contains query), this returns true if \((x_p < y_q) \wedge (x_q < y_p)\). If \(q=x_q\) (equal query), this returns true if \(x_p \le x_q < y_p\). - union(\(P\))

for \(P\) of entries \(([x_k,y_k), ptr_k)\) returns \([x,y)\) which \(x=\min(x_1,\cdots,x_n)\) and \(y=\max(y_1,\cdots,y_n)\) - compress(\(E\))

given \(E=([x,y), ptr)\), if \(E\) is the leftmost key on a non-leaf node, return NULL; otherwise return \(x\) - decompress(\(E\))

if \(E=(\pi,ptr)\) is the leftmost key on a non-leaf node, set \(x=-\infty\) otherwise \(x=\pi\). If \(E\) is the rightmost key on a non-leaf node, let \(y=\infty\) otherwise \(y\) to be the value returned by Next(\(E\)). If \(E\) is a leaf node, set \(y=x+1\). Return \([x,y)\) - penalty(\(E_1, E_2\))

If \(E_1\) is the leftmost pointer on its node, return \(\max(y_2-y_1, 0)\). If \(E_1\) is the rightmost pointer on its node, return \(\max(x_1-x_2,0)\). Otherwise return \(\max(y_2-y_1, 0) + \max(x_1-x_2, 0)\) - picksplit(\(P\))

Left the first \(\lfloor \vert P\vert / 2\rfloor\) entries in order go to left group and the rest go in the right group. This guarantee a minimum fill facto of \(M/2\).

Another example, R tree over \(\mathbb{R}^2\). A key is \((x_{ul}, y_{ul}, x_{lr}, y_{lr})\) for a bounding rectangle. R tree supports the following query predicates:

- Contains(\((x_{ul}, y_{ul}, x_{lr}, y_{lr}), (x'_{ul}, y'_{ul}, x'_{lr}, y'_{lr})\))

Return true iff \((x_{lr} \ge x'_{lr}) \wedge (x_{ul} \le x'_{ul}) \wedge (y_{lr} \le y'_{lr}) \wedge (y_{ul} \ge y'_{ul})\) - Overlap(\((x_{ul}, y_{ul}, x_{lr}, y_{lr}), (x'_{ul}, y'_{ul}, x'_{lr}, y'_{lr})\))

Return true iff \((x_{ul} \le x'_{lr}) \wedge (x'_{ul} \le x_{lr}) \wedge (y_{lr} \le y'_{ul}) \wedge (y'_{lr} \ge y_{ul})\) - Equal(\((x_{ul}, y_{ul}, x_{lr}, y_{lr}), (x'_{ul}, y'_{ul}, x'_{lr}, y'_{lr})\))

Return true iff \((x_{lr} = x'_{lr}) \wedge (x_{ul} = x'_{ul}) \wedge (y_{lr} = y'_{lr}) \wedge (y_{ul} = y'_{ul})\)

and the implementation in GiST as follows (with key expressed bounding box, i.e., contains query):

- consistent(\(E,q\))

\(E=(p, ptr)\) with \(q\) a contains, overlap, or equal query, return true if Overlap(\(p\), \(q\)) is true - union(\(P\))

for set \(P\) of \(n\) entries, return a new bounding box \((x_{ul}, y_{ul}, x_{lr}, y_{lr})\), which \(x_{ul}=\min(x_{ul}^{(1)},\cdots,x_{ul}^{(n)})\), \(y_{ul}=\max(y_{ul}^{(1)},\cdots,y_{ul}^{(n)})\), \(x_{lr}=\max(x_{lr}^{(1)},\cdots,x_{lr}^{(n)})\), \(y_{lr}=\min(y_{lr}^{(1)},\cdots,y_{lr}^{(n)})\) - compress(\(E\))

given \(E=(p, ptr)\) with \(p\) as a polygon in the form of line segments, return \(\pi = (x_{ul}, y_{ul}, x_{lr}, y_{lr})\), which \(x_{ul}=\forall_i\min x_{ul}^{(i)}\), \(y_{ul}=\forall_i\max y_{ul}^{(i)}\), \(x_{lr}=\forall_i\max x_{lr}^{(i)}\), \(y_{lr}=\forall_i\min y_{lr}^{(i)}\) - decompress(\(E\))

identity function - penalty(\(E_1, E_2\))

compute \(q=\)Union(\(\{E_1,E_2\}\)) then return \(\textrm{area}(q) - \textrm{area}(E_1.p)\) - picksplit(\(P\))

See Guttman (1984) paper “R-trees: A dynamic index structure for spatial searching” for R-tree splitting algorithm

Yet another example, RD-tree over \(\mathcal{P}(\mathbb{Z})\). A key is a set and the following query predicates are supported:

- Contains(\(S,T\))

Return true iff \(S\supseteq T\) - Overlap(\(S,T\))

Return true iff \(S\cap T\neq\varnothing\) - Equal(\(S,T\))

Return true iff \(S=T\)

The implementation in GiST is as follows: Predicate in an entry is \(p=\textrm{Contains}(S,v)\)

- consistent(\(E,q\))

\(E=(p, ptr)\) and \(q=\textrm{Contains}(T,v)\) (or overlap or equal), returns true if Overlap(\(S,T\)) is true - union(\(P\))

for set \(P\) of \(n\) return \(E_1.p\cup\cdots\cup E_n.p\) - compress(\(E\))

Return a range set equivalent. That is, find the disjoint ranges from the set - decompress(\(E\))

Connvert the range set back to sets by enumerating the elements in the range - penalty(\(E_1, E_2\))

Return \(\lvert E_1.p \cup E_2.p\rvert - \lvert E_1.p\rvert\) - picksplit(\(P\))

See Guttman (1984) paper for the quadratic algorithm for R-tree split

## Bibliographic data

```
@inproceedings{
title = "Generalized Search Trees for Database Systems",
author = "Joseph M. Hellerstein and Jeffrey F. Naughton and Avi Pfeffer",
booktitle = "Proceedings of the 21st VLDB Conference",
year = "1995",
address = "Zurich, Switzerland",
url = "http://db.cs.berkeley.edu/papers/UW-CS-TR-1274.pdf",
}
```