CHAPTER 2
ANALYSING DATA STRUCTURES
2.0 Stacks and Queues
---------------------
One of the important factors in algorithm design is choosing the right data
structures. Thus we need to know how long data structure operations take.
Stacks and queues are special kinds of list which can only be accessed at the
ends. Lists in general can be implemented either in arrays, or as linked lists
built from records and pointers.
A stack has two operations, PUSH which adds an item to the top of the stack and
POP which removes an item from the top of the stack (last in first out). A
stack can be implemented in an array "a", with an index "top" indicating the
top of the stack (next free space). Push and pop are then O(1) operations.
For example:
push(x)
-------
a[top] := x
top := top + 1
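For concreteness, here is a minimal sketch of the array version in Python (a
sketch only: the fixed capacity and the overflow/underflow checks are
additions, not part of the pseudocode above):

    class ArrayStack:
        def __init__(self, capacity=100):
            self.a = [None] * capacity
            self.top = 0                       # index of the next free slot

        def push(self, x):                     # O(1)
            if self.top == len(self.a):
                raise OverflowError("stack full")
            self.a[self.top] = x               # a[top] := x
            self.top += 1                      # top := top + 1

        def pop(self):                         # O(1)
            if self.top == 0:
                raise IndexError("stack empty")
            self.top -= 1
            return self.a[self.top]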
Alternatively, a stack can be implemented as a linked list of records, each
containing a pointer. This is less efficient on space and time, but there is no
pre-determined limit on the size of the stack. It is easiest to put the top of
the stack at the front of the list. Then push and pop are still O(1), e.g.:
push(x)
-------
x^.next := stack
stack := x
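A corresponding sketch of the linked-list version, where each record carries a
"next" pointer and the top of the stack is the front of the list (the Node
class stands in for the records mentioned above):

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    class LinkedStack:
        def __init__(self):
            self.stack = None                  # pointer to the top record

        def push(self, value):                 # x^.next := stack; stack := x
            self.stack = Node(value, self.stack)

        def pop(self):
            x = self.stack
            self.stack = x.next
            return x.value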
With a queue there are also two operations: ENQUEUE which adds an item to the
back of the queue, and DEQUEUE which removes an item from the front of the
queue. A queue can be handled in an array "a[1..max]" with indexes "front"
and "back", the array being handled circularly. The operations are O(1):
enqueue(x)
----------
a[back] := x
back := back + 1
if back > max then back := 1 (or use "mod")
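A sketch of the circular-array queue in Python (0-based indices instead of
a[1..max], and a count field added so that a full queue can be told apart from
an empty one -- both are assumptions, not part of the pseudocode above):

    class CircularQueue:
        def __init__(self, max_size=100):
            self.a = [None] * max_size
            self.front = 0                     # next item to dequeue
            self.back = 0                      # next free slot
            self.count = 0

        def enqueue(self, x):                  # O(1)
            if self.count == len(self.a):
                raise OverflowError("queue full")
            self.a[self.back] = x
            self.back = (self.back + 1) % len(self.a)   # wrap around ("mod")
            self.count += 1

        def dequeue(self):                     # O(1)
            if self.count == 0:
                raise IndexError("queue empty")
            x = self.a[self.front]
            self.front = (self.front + 1) % len(self.a)
            self.count -= 1
            return x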
A queue can be held in a linked list, provided an extra pointer to the last
item in the list is kept up to date. The operations are still O(1). Note
that a data structure for a queue can in fact accommodate insertions and
deletions at both ends. This is sometimes called a "double ended queue" or
"deque".
-------------------------------------------------------------------
| Push and pop in stacks can be implemented in O(1) time.         |
| Enqueue and dequeue in queues can be implemented in O(1) time.  |
-------------------------------------------------------------------
2.1 Priority Queues
-------------------
Suppose we have a set of items which is changing by having items inserted or
deleted from time to time, and for which we occasionally have to find the
largest item.
One example is a "queue" of processes with varying priorities in an operating
system. The scheduler always chooses the process on the queue with the highest
priority to run next. New processes are added to the queue from time to time.
Another example is network searching (later chapter). For example, in a chess
playing program, it can be more productive to search through board positions
in order of "strength" rather than just looking a fixed number of moves ahead.
Thus positions are entered into a priority queue as they are generated, and
the strongest position is chosen for analysis at each stage. This has the
advantage that if the program is interrupted because its time has run out,
it should still choose a reasonable move.
A priority queue can be implemented as a list sorted by priority, either in an
array or a linked list. Then insertion is O(n) (to find the right place and
shift or relink the other entries) and removal is O(1). It can also be
implemented as an unsorted list. Then insertion is O(1) and removal is O(n)
(to search for the largest).
The preferred way of implementing a priority queue which balances the costs of
insertion and removal is in a "heap" structure. This is a binary tree in which
the highest priority items climb to the top. The item at each node has higher
priority than its children. The structure of the tree stays fixed and the items
move in it, so we can use a full binary tree (i.e. one with all nodes present
down to the bottom level, and the bottom level filled in from the left).
Suppose the items are letters, with later letters having higher priority:
               X
             /   \
           /       \
         T           O
        / \         / \
      G     S     M     N
     / \   / \   /
    A   E R   A I
To add a new item, we put it in the first free position, then allow it to
climb to its rightful place:
Insert item in heap structure
-----------------------------
put the item at the next free position in the tree
while the item is not at the root and has higher priority than its parent
    swap the item with its parent
      X               X               X               X
     / \             / \             / \             / \
  ...   O         ...   O         ...   O         ...   P
       / \             / \             / \             / \
      M   N           M   N           P   N           O   N
     /               / \             / \             / \
    I               I   P           I   M           I   M
To remove the highest priority item, we take the last item, put it in place
of the deleted one, and shuffle the item DOWN the heap:
Remove item from heap structure
-------------------------------
remove the top item and replace it with the last (bottom) item
while the item has a child with higher priority
    swap the item with its higher-priority child
            X                          M
          /   \                      /   \
        /       \                  /       \
      T           P   -->        T           P
     / \         / \            / \         / \
   G     S     O     N        G     S     O     N
  / \   / \   / \            / \   / \   /
 A   E R   A I   M          A   E R   A I

and then M is sifted down (only the left subtree of T is shown at each step):

        T                   T                   T
       / \                 / \                 / \
      M   ...             S   ...             S   ...
     / \      -->        / \      -->        / \
   G     S             G     M             G     R
  / \   / \           / \   / \           / \   / \
 A   E R   A         A   E R   A         A   E M   A
The items can be held in an array item[1..n], with the root in item[1]. The
children of item[i] are item[2*i] and item[2*i+1], and the parent of item[i] is
item[i div 2]. For example, the heap left after the removal above is stored as:
      1  2  3  4  5  6  7  8  9 10 11 12
      T  S  P  G  R  O  N  A  E  M  A  I
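The following is a minimal Python sketch of a heap held this way (index 0 of
the list is left unused so that the arithmetic matches the text; the class and
method names are illustrative only):

    class MaxHeap:
        def __init__(self):
            self.item = [None]                 # item[1..n]; index 0 unused

        def insert(self, x):                   # O(log n), O(1) on average
            self.item.append(x)                # put item at next free position
            i = len(self.item) - 1
            while i > 1 and self.item[i] > self.item[i // 2]:
                self.item[i], self.item[i // 2] = self.item[i // 2], self.item[i]
                i //= 2                        # climb towards the root

        def remove(self):                      # O(log n); returns the largest item
            top = self.item[1]
            self.item[1] = self.item[-1]       # replace root with the bottom item
            self.item.pop()
            i, n = 1, len(self.item) - 1
            while 2 * i <= n:                  # sift the item down
                c = 2 * i
                if c + 1 <= n and self.item[c + 1] > self.item[c]:
                    c += 1                     # c is the higher-priority child
                if self.item[i] >= self.item[c]:
                    break
                self.item[i], self.item[c] = self.item[c], self.item[i]
                i = c
            return top

Inserting the letters of the example and then repeatedly calling remove()
returns them in decreasing priority order, which is essentially the Heapsort
mentioned below.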
Note that insert or remove involves a scan which may go from the bottom of the
tree to the top (or vice versa). For a tree of n items, this path is at most
log(n) levels long, so each operation takes O(log(n)) time.
In fact insert is on average faster than this. The reason is (roughly), that in
a full binary tree, half the items are at the bottom level, 1/4 are at the next
level up, 1/8 are two levels up and so on. Thus when a random item is inserted,
it has probability 1/2 of staying at the bottom level, 1/4 of rising one level,
1/8 of rising two levels etc. Thus the average time for insertion is:
    sigma over i of (time to rise i levels) * (probability of rising i levels)
        =  sigma i/2^i  =  O(1)          (see chapter 1)
Note that we can sort n items by inserting them into an empty priority queue
and then repeatedly removing the largest. This yields an O(n*log(n)) sorting
algorithm (Heapsort), though it is not the best sorting method in practice.
------------------------------------------------
| Insert and remove in priority queues can be  |
| implemented in O(log(n)) time - and in fact  |
| insert has average time O(1)                 |
------------------------------------------------
2.2 Dictionaries Using Hash Tables
----------------------------------
A dictionary is a collection of entries or records. Each entry has a key which
is used to search for the entry. In a real dictionary, an entry contains a word
which is the key, together with pronunciation, meaning and derivation. In a
compiler, a symbol table is a dictionary where entries describe variables. The
key is the name, and the other information is type, storage area etc. In a
business, employee records have the employee's name as the key, together with
information such as salary, age, etc. A dictionary may be kept entirely in
memory, or in a
disc database.
We only consider four operations, INSERT a new entry, DELETE an old one, SEARCH
for an entry by key, and RANGE-SEARCH, i.e. search for entries with keys in a
certain (small) range.
It is possible to keep a dictionary as an array of entries (or array of
pointers to entries if the entries have variable length). If the array is
unsorted, insert is O(1) and search is O(n). Delete is O(1) if the entry has
already been found, as we can replace the entry by the last entry in the array.
If the array is kept sorted, insert and delete are O(n), and searching is
O(log(n)) by binary chop.
A better way of implementing dictionaries is to use hash tables. The key is
converted into a number and "hashed" to make it "random", and reduced modulo
the table size to produce an index into the hash table. This index indicates
where to store the record (or a pointer to it). As two different keys may hash
to the same index by accident, it is convenient to make each table entry a
pointer to a linked list of records. As a simple example, suppose the keys are
numbers, and the hash function just reduces the number modulo 10 for a hash
table with 10 entries. We might have:
0 .
1 .
2 -> 42
3 .
4 -> 34 -> 74 -> 14
5 .
6 .
7 -> 37
8 .
9 .
The time taken to insert is O(1) (putting the new record at the front of the
relevant list). The time for search or delete is O(1 + L), where L is the
length of the relevant list. If the number of slots in the table is
proportional to the number of records n, the average list length is O(1), so
the operations take O(1) time on average. However, the table must be kept
reasonably sparse to work well in practice - usually up to about 80% full is
OK. Note however that the hash table gives no help in range searching, which
must be done by examining every record.
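A minimal Python sketch of such a chained hash table (Python's built-in hash
function and dictionary already do all of this; the sketch is only to make the
operations concrete):

    class HashTable:
        def __init__(self, size=10):
            self.table = [[] for _ in range(size)]    # one list per slot

        def _bucket(self, key):
            # hash the key, then reduce it modulo the table size
            return self.table[hash(key) % len(self.table)]

        def insert(self, key, value):          # O(1): add at the front of the list
            self._bucket(key).insert(0, (key, value))

        def search(self, key):                 # O(1 + L), L = length of the list
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None                        # not found

        def delete(self, key):                 # O(1 + L)
            bucket = self._bucket(key)
            for i, (k, v) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    return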
One of the problems with hash tables is that the size must be fixed in advance.
One solution is "extendible hashing". In essence, every time the table becomes
full, a new larger table is allocated and the old items re-processed to
position them in the new table. Suppose table sizes are powers of 2 and we
double the size of the table every time it becomes full. The cost of
re-processing the items in a table holding s of them is O(s). After n
insertions, the table has been doubled about log(n) times, at a total cost of
O(1 + 2 + 4 + ... + n) which is O(n), i.e. an average cost of O(1) time per
insertion. Thus even with extendible hashing we have:
--------------------------------------------------------------------
| Insert, delete and search in dictionaries can be implemented in  |
| AVERAGE time O(1) using hash tables. Range search is O(n).       |
--------------------------------------------------------------------
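A sketch of the doubling scheme, written as a replacement for the insert
method of the HashTable sketch above (it assumes a self.count field,
initialised to 0 in __init__; the 0.8 threshold is the rough "80% full" figure
mentioned earlier):

    def insert(self, key, value):
        self._bucket(key).insert(0, (key, value))
        self.count += 1
        if self.count > 0.8 * len(self.table):           # table becoming full
            old = self.table
            self.table = [[] for _ in range(2 * len(old))]   # twice the size
            for bucket in old:                 # re-process the old items: O(n) now,
                for k, v in bucket:            # but O(1) per insertion on average
                    self._bucket(k).insert(0, (k, v))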
Note that if we do not randomize or reduce modulo table size, we would need a
very large sparse array. Hash tables are just a way of keeping a large sparse
array in a small space. They can be useful for any sparse array or sparse
matrix problem.
Designing a hash function is essentially the same problem as designing a pseudo
random number generator. Most simple random number generators are based on
seed := (seed * a + b) mod c
which produces a sequence of integers in the range 0..(c-1). Note that the
rightmost digits are not very random, so it is best to use only the leftmost
ones. Also note that you may need tricks to avoid overflow. Unfortunately not
all choices of a, b, c produce good generators, so don't design your own, look
one up. Here are some:
seed := (seed * 31415821 + 1) mod 10^8 [use leftmost digits - Sedgewick]
seed := (seed * 1103515245 + 12345) mod 2^32 [use left 16 bits - UNIX S5]
seed := (seed * 5DEECE66D + B) mod 2^48 [hex - use left 32 bits - UNIX S5]
These can be used as hash functions by setting the seed to the key and using
the result modulo the table size. For hash table use, the "+b" is not important
and can be left out. Remaining problems with hash tables are that range-search
and similar operations are not efficient, and that the O(1) performance is not
guaranteed.
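As an illustration, here is a sketch of a hash function for string keys built
from the first generator above (treating the key's characters as the digits of
a number is an assumption; any similar scheme would do):

    def hash_key(key, table_size):
        seed = 0
        for ch in key:                         # turn the key into a number
            seed = (seed * 256 + ord(ch)) % 10**8
        seed = (seed * 31415821 + 1) % 10**8   # one step of Sedgewick's generator
        return (seed // 10**4) % table_size    # leftmost digits, reduced mod table size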
2.3 Dictionaries Using Trees
----------------------------
Binary search trees give us a method for doing search, insertion and deletion
each in O(log(n)) time for n items. Moreover, range-search is also O(log(n))
for small ranges. In a binary search tree, the items are kept in a tree in
which each item comes after all items in its left subtree, and before all items
in its right subtree. Thus:
             DOG
            /   \
         ART     THE
        /   \       \
     AND     BUS     WAR
                \
                 CAT
The search algorithm is as follows.
search (tree, word)
-------------------
if the tree is empty (nil) return failure
if word = root key, return the root entry
if word < root key, search left subtree
if word > root key, search right subtree
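A Python sketch of a binary search tree node, the search above, and the
insertion-after-unsuccessful-search described next (the Node class and its
field names are assumptions):

    class Node:
        def __init__(self, key, entry):
            self.key, self.entry = key, entry
            self.left = self.right = None

    def search(tree, word):
        if tree is None:
            return None                        # empty tree: failure
        if word == tree.key:
            return tree.entry
        if word < tree.key:
            return search(tree.left, word)
        return search(tree.right, word)

    def insert(tree, key, entry):
        # returns the root of the (sub)tree with the new entry added
        if tree is None:
            return Node(key, entry)
        if key < tree.key:
            tree.left = insert(tree.left, key, entry)
        elif key > tree.key:
            tree.right = insert(tree.right, key, entry)
        return tree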
It is easy to arrange for insertion following an unsuccessful search, or
deletion following a successful one (the entry may have to be replaced by the
leftmost entry in its right subtree). The time taken for any of these
operations is proportional to the depth of the tree, which is O(log(n)) for
well-balanced trees. Range-searching can be done using a similar algorithm,
except that when you come across an entry which is within the range, you output
it and search BOTH subtrees. The first time this happens, the path diverges and
you end up searching an area of the tree with jagged left and right hand edges,
e.g.
             /
            /
            \
           /
          /\
         /  /
        /    \
        \     \
The area between the two edges contains exactly the values within the range.
Thus the range-search takes time O(log(n)+R) where R is the number of entries
found within the range. We assume a small range, i.e. R is O(log(n)). As
random trees are well-balanced, we have:
--------------------------------------------------------------------
| Insert, delete, search and range-search in dictionaries can be   |
| implemented in AVERAGE time O(log(n)) using binary search trees  |
--------------------------------------------------------------------
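A sketch of the range-search just described, using the Node class from the
earlier sketch (output is assumed to be any function that consumes an entry):

    def range_search(tree, low, high, output):
        # visit every entry with low <= key <= high: O(log n + R)
        if tree is None:
            return
        if tree.key > low:                     # part of the range may lie to the left
            range_search(tree.left, low, high, output)
        if low <= tree.key <= high:
            output(tree.entry)
        if tree.key < high:                    # part of the range may lie to the right
            range_search(tree.right, low, high, output)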
2.4 Splay Trees
---------------
It is usually good enough to assume that the trees will be random. However, in
critical applications (e.g. real-time programs controlling equipment, or
on-line databases accessed by many people) it is necessary to have GUARANTEED
good performance and not just average good performance. Ordinary search trees
can give O(n) performance instead of O(log(n)) if the data is badly non-random.
There are many methods for guaranteeing O(log(n)) performance in trees -- 2-3-4
trees, red-black trees, AVL trees etc. Very few of these have been used in
earnest, partly because they need extra space in the nodes. The best method,
which does not need extra space, is a new one called "splay trees" which hasn't
reached the textbooks yet. It can be found in the library in the Communications
of the ACM, Volume 30, No. 3 (March 1987) in the article "Algorithm Design" by
Tarjan.
Like most tree-balancing acts, splay trees rely on ROTATION of an edge of a
binary search tree, in see-saw fashion. If x and y point to neighbouring nodes
containing A and B, with subtrees T1, T2, and T3 as in the left hand picture,
then the edge xy can be rotated right to produce the right hand picture. The
edge xy rotates like a see-saw, and the middle tree T2 slides to the other end.
    (y) B                           A (x)
       / \                         / \
      /   \                       /   \
(x) A      T3                   T1    B (y)
   / \                                / \
  /   \                              /   \
 T1    T2                           T2     T3
Rotate edge xy to the right (assuming x is left child of y)
---------------------------
swap x^.value with y^.value
y^.left := x^.left (the order of these matters a lot)
x^.left := x^.right
x^.right := y^.right
y^.right := x
swap x and y
This does not change the ordering properties of the search tree, because in
both the original and final trees the order is T1..A..T2..B..T3; in fact you
can think of the operation as a re-bracketing, in which (T1 A T2) B T3 becomes
T1 A (T2 B T3). There is a similar rotate-left operation on a rightward edge
(just swap the two pictures above).
In a splay tree, each time you retrieve an item, you make changes to the tree
to bring it to the root. The operation also spreads the tree out (splays it),
halving the depth of the items found along the path to the retrieved item.
Items off the path may have their depth increased, but only by 1 or 2 levels.
All nodes along the path are much more efficient to access later, while those
off the path are a little less efficient.
The actual splay operation on finding a particular item works as follows.
There are three cases (each with a mirror image):
The ZIG case:
    if node x is a left child of the root y (or a right child)
        rotate xy (and the operation is finished)
The ZIG-ZIG case:
    if x is a left child of y, which is a left child of z (or right/right)
        rotate yz, then rotate xy
The ZIG-ZAG case:
    if x is a left child of y, which is a right child of z (or right/left)
        rotate xy, then rotate xz
The pictures are as follows:
    (y) B                               A (x)
       / \                             / \
      /   \                           /   \
(x) A      T3  >-- ZIG -->   T1    B (y)
   / \                            / \
  /   \                          /   \
 T1    T2                      T2     T3

        (z) C                    A (x)
           / \                    / \
          /   \                  /   \
     (y) B     T4             T1    B (y)
        / \                            / \
       /   \  >-- ZIG ZIG -->         /   \
  (x) A     T3                    T2     C (z)
     / \                                  / \
    /   \                                /   \
  T1     T2                         T3    T4

  (z) A                                    B (x)
     / \                                  / \
    /   \                                /   \
  T1     C (y)                          /     \
        / \           >-- ZIG ZAG -->  (z) A         C (y)
       /   \                         / \       / \
  (x) B     T4                      /   \     /   \
     / \                          T1     T2  T3     T4
    /   \
  T2     T3
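The following Python sketch shows the three cases in code. It uses nodes with
parent pointers and moves whole nodes around, rather than swapping values
inside fixed nodes as the rotation pseudocode above does -- that is a common
alternative formulation, not the only one (the Node class is an assumption):

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = self.right = self.parent = None

    def rotate_up(x):
        # rotate the edge between x and its parent, moving x one level up
        y = x.parent
        if y.left is x:                        # x is a left child: rotate right
            t = x.right                        # the "middle" subtree (T2 above)
            y.left = t
            x.right = y
        else:                                  # x is a right child: rotate left
            t = x.left
            y.right = t
            x.left = y
        if t is not None:
            t.parent = y
        x.parent = y.parent
        if y.parent is not None:               # re-attach x where y used to hang
            if y.parent.left is y:
                y.parent.left = x
            else:
                y.parent.right = x
        y.parent = x

    def splay(x):
        # bring x to the root using the zig, zig-zig and zig-zag cases
        while x.parent is not None:
            y, z = x.parent, x.parent.parent
            if z is None:                      # zig
                rotate_up(x)
            elif (z.left is y) == (y.left is x):
                rotate_up(y)                   # zig-zig: rotate yz, then xy
                rotate_up(x)
            else:
                rotate_up(x)                   # zig-zag: rotate xy, then xz
                rotate_up(x)
        return x                               # x is now the root

After a search locates the node holding the retrieved item, calling splay on
it restructures the tree and returns the new root.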
For an example of the way the algorithm works, suppose we retrieve A and then C
from the unbalanced tree shown below. Retrieving A causes 4 ZIG-ZIG operations
which bring A to the top; then retrieving C causes a ZAG-ZIG, a ZIG-ZIG and a
ZAG operation to bring C to the top.
                I         A                     C
               /           \                   / \
              H             H                 /   \
             /             / \              A       F
            G             F   I              \     / \
           /   -->       / \    -->          B  D      H
          F             D   G                    \   / \
         /             / \                        E G   I
        E             B   E
       /                \
      D                  C
     /
    C
   /
  B
 /
A
Because there is no extra space used, a splay tree IS a normal binary search
tree at each stage. Thus other operations (insert, delete, range-search) can be
done in the normal way if desired.
The efficiency of splay trees is very hard to analyse, but the following facts
emerge. Splay trees don't guarantee O(log(n)) time on every operation, but they
DO guarantee O(k*log(n)) time on k operations. In other words, later retrievals
may take longer than log(n), but only if earlier ones took less. For example,
the two retrievals of A and C above took O(n) and O(n/2) time, but on a tree
which could only have arisen from n insertions (of A, B, C, ... in increasing
order) which took O(1) time each, giving an average of O((n+n+n/2)/n) = O(1)
time per operation, well within
the O(log(n)) guarantee. If you want a guarantee on each operation, then you
have to return to one of the traditional methods, of which red-black trees are
probably best -- see Sedgewick.
The data structure not only takes account of the tree shape, keeping it
balanced (on average), but it also takes account of the frequency of access,
giving very fast access to very frequently used items. This makes it useful for
a variety of practical applications, including priority queues.
----------------------------------------------------------------------
| Insert, delete, search and range-search in dictionaries can be     |
| implemented in GUARANTEED AVERAGE time O(log(n)) using splay trees |
----------------------------------------------------------------------
2.5 Databases and B-Trees
-------------------------
B-trees provide guaranteed O(log(n)) performance on every operation for
dictionaries, and are commonly used in disc-based databases. See chapters 15
and 18 of Sedgewick. Each node in a B-tree may be stored in one sector on
disc. It has several items and pointers instead of one (as many as will fit),
say three items and four pointers for the sake of argument. Otherwise it is
like a binary search tree. The tree has a fixed number of levels. For example,
the result of entering the letters ALGORITHM into an empty tree is:
                _ _ _
               |G|I|O|
             /   | |   \
          /      | |      \
  _ _ _ /   _ _ _| |_ _ _   \ _ _ _
 |A|_|_|   |H|_|_| |L|M|_|   |R|T|_|
The left subtree contains items less than G, the second subtree contains items
between G and I and so on.
The idea behind the insertion algorithm is to follow the pointers to the bottom
level, as you would with a binary tree, and enter the item into the appropriate
bottom level node. The items in a node are kept sorted and left-adjusted by any
crude method (the time taken for this sorting within a node is negligible
compared to reading and writing the node from and to disc). If the relevant
node is full, it is split into two nodes, and the "middle" item is passed back
up to the parent. If this is full, it must be split in turn, and if the root of
the whole tree is split, a new root is created and the tree has one more level.
For example, after A, L and G have been entered into an initially empty tree,
we need to split the node in two in order to enter the O, as follows:
                  _ _ _                 _ _ _
                 |G|_|_|               |G|_|_|
                /       \             /       \
         _ _ _     _ _ _       _ _ _     _ _ _
|A|G|L| --> |A|_|_|   |L|_|_| --> |A|_|_|   |L|O|_|
To avoid searching to the bottom level, splitting the bottom level node, and
possibly propagating the split back up the tree to the top, it is better to
split any full nodes encountered on the way DOWN the tree. This ensures that
whenever a "middle" element has to be sent up to a parent, there is room in the
parent for it. The algorithm is:
Insert item into (sub)tree
--------------------------
if the root node of the subtree is full, split it into two nodes and pass the
    middle item up to the parent (creating a new root if there is no parent)
if at the bottom level, insert the item into the node
otherwise insert into the relevant subtree
As another example, inserting an S into the tree above which contains
A,L,G,O,R,I,T,H,M involves splitting the root, then inserting S in the
rightmost node:
                 _ _ _
                |I|_|_|
              /         \
       _ _ _ /           \ _ _ _
      |G|_|_|             |O|_|_|
     /       \           /       \
  _ _ _     _ _ _     _ _ _     _ _ _
 |A|_|_|   |H|_|_|   |L|M|_|   |R|S|T|
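A Python sketch of this split-on-the-way-down insertion for the
three-items-per-node case described above. It keeps the nodes in memory as
small lists; a real disc-based B-tree would read and write whole nodes as disc
sectors, but the splitting logic is the same (the class and function names are
illustrative):

    class BNode:
        def __init__(self, keys=None, children=None):
            self.keys = keys or []             # up to 3 sorted items
            self.children = children or []     # empty at the bottom level,
                                               # otherwise len(keys)+1 subtrees

    def split_child(parent, i):
        # split the full child parent.children[i], passing its middle item up
        child = parent.children[i]
        mid = child.keys[1]
        left = BNode(child.keys[:1], child.children[:2])
        right = BNode(child.keys[2:], child.children[2:])
        parent.keys.insert(i, mid)
        parent.children[i:i + 1] = [left, right]

    def insert(root, key):
        # insert key, splitting any full node met on the way down; returns the root
        if len(root.keys) == 3:                # full root: split it, tree gains a level
            root = BNode([], [root])
            split_child(root, 0)
        node = root
        while True:
            if not node.children:              # bottom level: put the item in this node
                node.keys.append(key)
                node.keys.sort()               # crude sorting within the node
                return root
            i = sum(k < key for k in node.keys)    # which subtree to follow
            if len(node.children[i].keys) == 3:
                split_child(node, i)           # split a full child before descending
                i = sum(k < key for k in node.keys)
            node = node.children[i]

Starting from an empty node and inserting the letters A,L,G,O,R,I,T,H,M one at
a time (root = BNode(), then root = insert(root, letter) for each) reproduces
the tree drawn earlier, and inserting S then gives the tree above.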
To see that B-trees ensure O(log(n)) depth and thus O(log(n)) performance,
suppose a B-tree has k levels and n items. Each node, except at the bottom
level, has at least 2 subtrees and at most 4 subtrees, so
    1 + 2 + 4 + 8 + ... + 2^(k-1)  <=  n  <=  3 * (1 + 4 + 16 + ... + 4^(k-1))
    2^k  <=  n  <=  4^k  ( = 2^(2k) )        (roughly)
    k  <=  log(n)  <=  2k
so k = O(log(n)).
Thus we get:
--------------------------------------------------------------------
| Insert, delete, search and range-search in dictionaries can be   |
| implemented in GUARANTEED time O(log(n)) using B-trees           |
--------------------------------------------------------------------
2.6 Disjoint Sets
-----------------
The best implementation of sets depends on the operations required on them.
Suppose we need UNION = "forming the union of two (disjoint) sets", and
FIND = "finding which set an element is in" to be efficient operations
on a collection of disjoint sets. Here n is the total number of elements in
all the sets. (Insertion, deletion and membership tests might also be needed,
but they are simpler than UNION and FIND).
If we represent each set as a linked list, UNION is O(1), by linking one list
onto the end of the other, but FIND is O(n), because we have to scan the
lists. If we use an array with an entry for each element saying which set it is
in, FIND is O(1), but UNION is O(n), because we have to scan the array to find
which entries need changing.
As a compromise between these extremes, the sets can be held as parent trees,
that is trees which are not binary, but for which only parent information need
be stored, e.g.:
    A           B          C
    1           2          6
   / \          |        / | \
  3   5         4       8  7  9
A UNION would still be efficient, making one tree a subtree of the root
of the other, so to form the union of A and B:
      A                C
      1                6
    / | \            / | \
   3  5  2          8  7  9
         |
         4
If the trees are kept balanced, then a FIND operation can be done by following
parent pointers to the root, and will take O(log(n)) steps. In order to keep
the trees balanced, we keep track of their SIZE (number of items) and form
unions by making the smaller tree a subtree of the larger. We assume that the
items are numbered from 1 to n, and the trees are named after their root
elements. The items only need parent information, so we can use an array of
integers to represent the trees: parent[i] gives the parent of item i, with 0
for roots. A second array gives the size of each tree, stored at its root.
Thus the following sets, trees and arrays correspond:
S1 = {1, 3, 8}
S2 = {2, 4, 6, 7, 9}
S5 = {5, 10}
      1          2         5
     / \        / \        |
    8   3      6   4      10
                  / \
                 9   7

      item:    1  2  3  4  5  6  7  8  9 10
      parent:  0  0  1  2  0  2  4  1  4  5
      size:    3  5  0  0  2  0  0  0  0  0
We could even save space by using one array for both parent and size. The
algorithms are:
union (t1, t2)
--------------
if size[t1] < size[t2] then swap t1 and t2
link t2 to root of t1 { parent[t2] := t1 }
update sizes { size[t1] := size[t1] + size[t2]; size[t2] := 0 }
find (item)
-----------
x := item
while parent[x] <> 0 do x := parent[x]
result is x
The data structure is initialised by having each element in a tree of its own.
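A Python sketch of the array version just described (items numbered 1..n, with
index 0 unused; the class name is illustrative):

    class DisjointSets:
        def __init__(self, n):
            self.parent = [0] * (n + 1)        # parent[i] = 0 means i is a root
            self.size = [1] * (n + 1)          # sizes kept at the roots; each item starts alone

        def find(self, item):                  # O(log n): follow parents to the root
            x = item
            while self.parent[x] != 0:
                x = self.parent[x]
            return x

        def union(self, t1, t2):               # O(1): t1 and t2 must be roots
            if self.size[t1] < self.size[t2]:
                t1, t2 = t2, t1                # make t1 the larger tree
            self.parent[t2] = t1               # link the smaller tree below the larger root
            self.size[t1] += self.size[t2]
            self.size[t2] = 0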
We now prove that finding an item takes O(log(n)) steps by showing that the
depth d of a tree of size s satisfies d <= log(s). Note that this says more
than just that the trees are well-balanced, which would be d = O(log(s)).
This is true to begin with as s=1 for each initial tree, and depth d=0.
All we have to show is that whenever we form a union, the new tree also has the
same property. Suppose we form a union of trees with sizes s1,s2 and depths
d1,d2 with s1>=s2.
        .__________
       / \         \
  d1  /   \        /\   d2+1
     / s1  \      /s2\
     -------      ----
The depth of the "left hand part" of the union is d1, and of the "right hand
part" is d2+1. Suppose d1 >= d2+1. Then the new depth is
d1 <= log(s1) <= log(s1+s2)
If on the other hand d1 < d2+1, then the new depth is
d2+1 <= log(s2)+1 = log(2*s2) <= log(s1+s2)
Thus although these are not binary trees, they behave even better than binary
trees, having GUARANTEED depth at most log(n), rather than average depth
O(log(n)). The number of items at each level roughly doubles as you go down
the tree, despite the trees not being binary.
FIND can be speeded up further by changing all the items on the path to
the root so that their parents are the root. This is called PATH-COMPRESSION
and speeds up later FINDS of these items. The find algorithm becomes:
find (item)
-----------
root := item
while parent[root] <> 0 do root := parent[root]    { first find the root }
while item <> root do                              { then compress the path }
    temp := parent[item]
    parent[item] := root
    item := temp
the result is root
For example, searching for x in the following tree produces this result:
      BEFORE                  AFTER
      ------                  -----
         a                      a
        / \                 / / \ \
       b   c               b c   d x
      / \                  |     |
     d   e                 e     f
    / \
   f   x
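In the Python sketch above, the path-compressed find simply replaces the
earlier find method:

        def find(self, item):                  # with path compression
            root = item
            while self.parent[root] != 0:
                root = self.parent[root]       # first locate the root
            while item != root:                # then point everything on the path at it
                nxt = self.parent[item]
                self.parent[item] = root
                item = nxt
            return root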
This speeding up of future finds actually increases the speed of a find from
O(log(n)) to an average of almost O(1). The same goes for insertion and
membership tests, and (with a bit of extra effort) deletion.
--------------------------------------------------------------
| Unions and finds with disjoint sets can be implemented in  |
| AVERAGE time ALMOST O(1) using parent trees                |
--------------------------------------------------------------
To see ROUGHLY why this is, suppose the total number of items n is at most
2^(2^16). As this is 2^65536, i.e. roughly 10^20000, this is a safe assumption
in practice.
Now imagine the final tree as it would have been without path-compression. At
each stage in the algorithm, the current collection of trees is a collection of
subtrees of this final tree. We regard the path compression as defining
"shortcuts" in this tree.
Pretend that the number of items at each level doubles as you go down the tree.
Split the tree into two sections, the lower section consisting of the bottom 16
levels of the tree:
              /\
             /  \
            /    \          ~2^16 levels
           /      \
          / n/2^16 \
         ------------
        /     ~n     \      16 levels
       ----------------
Now the bottom level contains half the items, the next one up 1/4 of the items
etc. Thus the bottom 16 levels account for all but a fraction 1/2^16 of the
items. Thus we can estimate the size of the upper section as n/2^16, and the
size of the lower section as about n. The depth of the lower section is 16, and
the depth of the upper section is at most log(n)-16 < 2^16.
As a FIND operation climbs the tree, it spends about O(1) time (<16 steps) in
the lower section, so we only need consider the cost of the upper section. We
can also discount the last step up to the root of the relevant subtree, so each
item visited in the upper section has its parent (shortcut) altered as part of
the FIND. Each item in the upper section can only be visited (by FIND
operations) 2^16 times before its parent becomes the root of the final tree,
since its parent is moved up the tree (set equal to the root of a subtree) on
each visit. Thus the total number of visits is 2^16 * n/2^16 which is O(n)
visits spread over O(n) FIND operations, for an average of O(1) each.
The precise result (which comes from a precise version of the above argument)
is as follows. Define
    F(0) = 1   and   F(n+1) = 2^F(n)
    G(n) = the minimum k with F(k) >= n
so that G(n) <= 5 for all n <= 2^65536. Then the average time for a find is
O(G(n)), which is O(1) for all practical purposes.