CHAPTER 2 ANALYSING DATA STRUCTURES 2.0 Stacks and Queues --------------------- One of the important factors in algorithm design is choosing the right data structures. Thus we need to know how long data structure operations take. Stacks and queues are kinds of list -- kinds which can only be accessed at the ends. Lists generally can be implemented either in arrays, or as linked lists using records and pointers. A stack has two operations, PUSH which adds an item to the top of the stack and POP which removes an item from the top of the stack (last in first out). A stack can be implemented in an array "a", with an index "top" indicating the top of the stack (next free space). Push and pop are then O(1) operations. For example: push(x) ------- a[top] := x top := top + 1 Alternatively, a stack can be implemented as a linked list of records, each containing a pointer. This is less efficient on space and time, but there is no pre-determined limit on the size of the stack. It is easiest to put the top of the stack at the front of the list. Then push and pop are still O(1), e.g.: push(x) ------- x^.next := stack stack := x With a queue there are also two operations: ENQUEUE which adds an item to the back of the queue, and DEQUEUE which removes an item from the front of the queue. A queue can be handled in an array "a[1..max]" with indexes "front" and "back", the array being handled circularly. The operations are O(1): enqueue(x) ---------- a[back] := x back := back + 1 if back > max then back := 1 (or use "mod") A queue can be held in a linked list, provided an extra pointer to the last item in the list is kept up to date. The operations are still O(1). Note that a data structure for a queue can in fact accommodate insertions and deletions at both ends. This is sometimes called a "double ended queue" or "deque". -------------------------------------------------------------------- | Push and pop in stacks can be implemented in O(1) time. | | Enqueue and dequeue in queues can be implemented in O(1) time. | --------------------------------------------------------------------
2.1 Priority Queues ------------------- Suppose we have a set of items which is changing by having items inserted or deleted from time to time, and for which we occasionally have to find the largest item. One example is a "queue" of processes with varying priorities in an operating system. The scheduler always chooses the process on the queue with the highest priority to run next. New processes are added to the queue from time to time. Another example is network searching (later chapter). For example, in a chess playing program, it can be more productive to search through board positions in order of "strength" rather than just looking a fixed number of moves ahead. Thus positions are entered into a priority queue as they are generated, and the strongest position is chosen for analysis at each stage. This has the advantage that if the program is interrupted because its time has run out, it should still choose a reasonable move. A priority queue can be implemented as a sorted list, sorted by priority, either in an array or a linked list. Then insertion is O(n) (to re-sort the list) and removal is O(1). It can also be implemented as an unsorted list. Then insertion is O(1) and removal is O(n) (to search for the largest). The preferred way of implementing a priority queue which balances the costs of insertion and removal is in a "heap" structure. This is a binary tree in which the highest priority items climb to the top. The item at each node has higher priority than its children. The structure of the tree stays fixed and the items move in it, so we can use a full binary tree (i.e. one with all nodes present down to the bottom level, and the bottom level filled in from the left). Suppose the items are letters, with later letters having higher priority: X / \ T O / \ / \ G S M N /\ /\ / A E R A I To add a new item, we put it in the first free position, then allow it to climb to its rightful place: Insert item in heap structure ----------------------------- put item at next free position in tree while priority of item greater than that of parent swap item with parent X X X X / \ / \ / \ / \ ... O ... O ... O ... P / \ / \ / \ / \ M N M N P N O N / / \ / \ / \ I I P I M I M To remove the highest priority item, we take the last item, put it in place of the deleted one, and shuffle the item DOWN the heap:
Remove item from heap structure ------------------------------- remove top item and replace with bottom item while priority of item not larger than priorities of both children swap item with higher priority child X M T T T / \ / \ / ... / ... / ... T P T P M S S / \ / \ / \ / \ / \ / \ / \ G S O N G S O N G S G M G R /\ /\ /\ /\ /\ / /\ /\ /\ /\ /\ /\ A E R A I M A E R A I A E R A A E R A A E M A The items can be held in an array item[1..n], with the root in item[1]. The children of item[i] are item[2*i] and item[2*i+1], and the parent of item[i] is item[i div 2] 1 2 3 4 5 6 7 8 9 10 11 12 T S P G R O N A E M A I Note that insert or remove involve a scan which may go from the bottom of the tree to the top (or vice versa). For a tree of n items, this can be up to log(n) items. In fact insert is on average faster than this. The reason is (roughly), that in a full binary tree, half the items are at the bottom level, 1/4 are at the next level up, 1/8 are two levels up and so on. Thus when a random item is inserted, it has probability 1/2 of staying at the bottom level, 1/4 of rising one level, 1/8 of rising two levels etc. Thus the average time for insertion is: sigma (time to rise i levels) * (probability of rising i levels) sigma i/2^i = O(1) (see chapter 1) Note that we can sort n items by inserting them into an empty priority queue and then repeatedly removing the largest. This yields an O(n*log(n)) sorting algorithm (Heapsort) but it is not the best. ------------------------------------------------- | Insert and remove in priority queues can be | | implemented in O(log(n)) time - and in fact | | insert has average time O(1) | ------------------------------------------------- 2.2 Dictionaries Using Hash Tables ---------------------------------- A dictionary is a collection of entries or records. Each entry has a key which is used to search for the entry. In a real dictionary, an entry contains a word which is the key, together with pronunciation, meaning and derivation. In a compiler, a symbol table is a dictionary where entries describe variables. The key is the name, and the other information is type, storage area etc. In a business, employee records have employees names as the key, and information such as salary, age etc. A dictionary may be kept entirely in memory, or in a disc database.
We only consider four operations, INSERT a new entry, DELETE an old one, SEARCH for an entry by key, and RANGE-SEARCH, i.e. search for entries with keys in a certain (small) range. It is possible to keep a dictionary as an array of entries (or array of pointers to entries if the entries have variable length). If the array is unsorted, insert is O(1) and search is O(n). Delete is O(1) if the entry has already been found, as we can replace the entry by the last entry in the array. If the array is kept sorted, insert and delete are O(n), and searching is O(log(n)) by binary chop. A better way of implementing dictionaries is to use hash tables. The key is converted into a number and "hashed" to make it "random", and reduced modulo the table size to produce an index into the hash table. This index indicates where to store the record (or a pointer to it). As two different keys may hash to the same index by accident, it is convenient to make each table entry a pointer to a linked list of records. As a simple example, suppose the keys are numbers, and the hash function just reduces the number modulo 10 for a hash table with 10 entries. We might have: 0 . 1 . 2 -> 42 3 . 4 -> 34 -> 74 -> 14 5 . 6 . 7 -> 37 8 . 9 . The time taken to insert is O(1) (putting the new record at the front of the relevant list). The time for search or delete is O(1 + L) where L is the length of the relevant list. If the number of entries in the hash table is O(n), the average length of the lists is O(1), so the operations are O(1). However, the table must be kept reasonably sparse to work well in practice - usually up to about 80% full is OK. Note however that the hash table gives no help in range searching, which must be done by examining every record. One of the problems with hash tables is that the size must be fixed in advance. One solution is "extendible hashing". In essence, every time the table becomes full, a new larger table is allocated and the old items re-processed to position them in the new table. Suppose table sizes are powers of 2 and we double the size of the table every time it becomes full. The cost of re-processing the n items in the table is O(n). After O(n) insertions, the table size has been increased log(n) times at a cost of O(1 + 2 + 4 + ... + n) which is O(n), i.e. an average cost of O(1) time per insertion. Thus even with extendible hashing we have: --------------------------------------------------------------------- | Insert, delete and search in dictionaries can be implemented in | | AVERAGE time O(1) using hash tables. Range search is O(n). | --------------------------------------------------------------------- Note that if we do not randomize or reduce modulo table size, we would need a very large sparse array. Hash tables are just a way of keeping a large sparse array in a small space. They can be useful for any sparse array or sparse matrix problem. Designing a hash function is essentially the same problem as designing a pseudo random number generator. Most simple random number generators are based on seed := (seed * a + b) mod c
which produces a sequence of integers in the range 0..(c-1). Note that the rightmost digits are not very random, so it is best to use only the leftmost ones. Also note that you may need tricks to avoid overflow. Unfortunately not all choices of a, b, c produce good generators, so don't design your own, look one up. Here are some: seed := (seed * 31415821 + 1) mod 10^8 [use leftmost digits - Sedgewick] seed := (seed * 1103515245 + 12345) mod 2^32 [use left 16 bits - UNIX S5] seed := (seed * 5DEECE66D + B) mod 2^48 [hex - use left 32 bits - UNIX S5] These can be used as hash functions by setting the seed to the key and using the result modulo the table size. For hash table use, the "+b" is not important and can be left out. Remaining problems with hash tables are that range-search and similar operations are not efficient, and that the O(1) performance is not guaranteed. 2.3 Dictionaries Using Trees ---------------------------- Binary search trees give us a method for doing search, insertion and deletion each in O(log(n)) time for n items. Moreover, range-search is also O(log(n)) for small ranges. In a binary search tree, the items are kept in a tree in which each item comes after all items in its left subtree, and before all items in its right subtree. Thus: DOG / \ ART THE / \ \ AND BUS WAR \ CAT The search algorithm is as follows. search (tree, word) ------------------- if the tree is empty (nil) return failure if word = root key, return the root entry if word < root key, search left subtree if word > root key, search right subtree It is easy to arrange for insertion following an unsuccessful search, or deletion following a successful one (the entry may have to be replaced by the leftmost entry in its right subtree). The time taken for any of these operations is proportional to the depth of the tree, which is O(log(n)) for well-balanced trees. Range-searching can be done using a similar algorithm, except that when you come across an entry which is within the range, you output it and search BOTH subtrees. The first time this happens, the path diverges and you end up searching an area of the tree with jagged left and right hand edges, e.g.
/ / \ / /\ / / / \ \ \ The area between the two edges contains exactly the values within the range. Thus the range-search takes time O(log(n)+R) where R is the number of entries found within the range. We assume a small range, i.e. R is O(log(n)). As random trees are well-balanced, we have: -------------------------------------------------------------------- | Insert, delete, search and range-search in dictionaries can be | | implemented in AVERAGE time O(log(n)) using binary search trees | ------------------------------------------------------------------- 2.4 Splay Trees --------------- It is usually good enough to assume that the trees will be random. However, in critical applications (e.g. real-time programs controlling equipment, or on-line databases accessed by many people) it is necessary to have GUARANTEED good performance and not just average good performance. Ordinary search trees can give O(n) performance instead of O(log(n)) if the data is badly non-random. There are many methods for guaranteeing O(log(n)) performance in trees -- 2-3-4 trees, red-black trees, AVL trees etc. Very few of these have been used in earnest, partly because they need extra space in the nodes. The best method, which does not need extra space, is a new one called "splay trees" which hasn't reached the textbooks yet. It can be found in the library in the Communications of the ACM, Volume 30, No. 3 (March 1987) in the article "Algorithm Design" by Tarjan. Like most tree-balancing acts, splay trees rely on ROTATION of an edge of a binary search tree, in see-saw fashion. If x and y point to neighbouring nodes containing A and B, with subtrees T1, T2, and T3 as in the left hand picture, then the edge xy can be rotated right to produce the right hand picture. The edge xy rotates like a see-saw, and the middle tree T2 slides to the other end. (y) B A (x) / \ / \ / \ / \ (x) A T3 T1 B (y) / \ / \ / \ / \ T1 T2 T2 T3 Rotate edge xy to the right (assuming x is left child of y) --------------------------- swap x^.value with y^.value y^.left := x^.left (the order of these matters a lot) x^.left := x^.right x^.right := y^.right y^.right := x swap x and y
This does not change the ordering properties of the search tree, because in both the original and final trees, the order is T1..A..T2..B..T3, in fact you can think of the operation as a re-bracketing (T1 A T2) B T3 goes to T1 A (T2 B T3). There is a similar rotate left operation on a rightward edge (just swap the two pictures above). In a splay tree, each time you retrieve an item, you make changes to the tree to bring it to the root. The operation also spreads the tree out (splays it), halving the depth of the items found along the path to the retrieved item. Items off the path may have their depth increased, but only by 1 or 2 levels. All nodes along the path are much more efficient to access later, while those off the path are a little less efficient. The actual splay operation on finding a particular item is as follows. There are three cases (and their mirror images) as follows. The ZIG case: If node x is a left child of the root y (or a right child) rotate xy (and the operation is finished) The ZIG-ZIG case: If x is a left child of y which is a left child of z (or right/right) rotate yz, then rotate xy The ZIG-ZAG case: If x is a left child of y which is a right child of z (or right/left) rotate xy, then rotate xz The pictures are as follows: (y) B A (x) / \ / \ / \ / \ (x) A T3 >-- ZIG --> T1 B (y) / \ / \ / \ / \ T1 T2 T2 T3 (z) C A (x) / \ / \ / \ / \ (y) B T4 T1 B (y) / \ / \ / \ >-- ZIG ZIG --> / \ (x) A T3 T2 C (z) / \ / \ / \ / \ T1 T2 T3 T4 (z) A B (x) / \ / \ / \ / \ T1 (y)C A(z) C (y) / \ / \ / \ / \ >-- ZIG ZAG --> / \ / \ (x) B T4 T1 T2 T3 T4 / \ / \ T2 T3
For an example of the way the algorithm works, suppose we retrieve A and then C from the unbalanced tree shown below. Retrieving A causes 4 ZIG-ZIG operations which bring A to the top, the retrieving C causes a ZAG-ZIG, a ZIG-ZIG and a ZAG operation to bring C to the top. I A C / \ / \ H H A F / / \ \ / \ G F I B D H / --> / \ --> \ / \ F D G E G I / / \ E B E / \ D C / C / B / A Because there is no extra space used, a splay tree IS a normal binary search tree at each stage. Thus other operations (insert, delete, range-search) can be done in the normal way if desired. The efficiency of splay trees is very hard to analyse, but the following facts emerge. Splay trees don't guarantee O(log(n)) time on every operation, but they DO guarantee O(k*log(n)) time on k operations. In other words, later retrievals may take longer than log(n), but only if earlier ones took less. For example, the above two operations retrieving A and C took O(n) and O(n/2) time, but on a tree which could only have arisen from n insertion operations which took O(1) time each, giving an average of O((n+n+n/2)/n) = O(1) time each, well within the O(log(n)) guarantee. If you want a guarantee on each operation, then you have to return to one of the traditional methods, of which red-black trees are probably best -- see Sedgewick. The data structure not only takes account of the tree shape, keeping it balanced (on average), but it also takes account of the frequency of access, giving very fast access to very frequently used items. This makes it useful for a variety of practical applications, including priority queues. ------------------------------------------------------------------------ | Insert, delete, search and range-search in dictionaries can be | | implemented in GUARANTEED AVERAGE time O(log(n)) using splay trees | ------------------------------------------------------------------------ 2.5 Databases and B-Trees ------------------------- B-trees provide guaranteed O(log(n)) performance on every operation for dictionaries, and are commonly used in disc-based databases. See chapters 15 and 18 of Sedgewick. Each node in a B-tree may be stored in one sector on disc. It has several items and pointers instead of one (as many as will fit), say three items and four pointers for the sake of argument. Otherwise it is like a binary search tree. The tree has a fixed number of levels. For example, the result of entering the letters ALGORITHM into an empty tree is:
_ _ _ |G|I|O| / | | \ / | | \ _ _ _ / _ _ _| |_ _ _ \ _ _ _ |A|_|_| |H|_|_| |L|M|_| |R|T|_| The left subtree contains items less than G, the second subtree contains items between G and I and so on. The idea behind the insertion algorithm is to follow the pointers to the bottom level, as you would with a binary tree, and enter the item into the appropriate bottom level node. The items in a node are kept sorted and left-adjusted by any crude method (the time taken for this sorting within a node is negligible compared to reading and writing the node from and to disc). If the relevant node is full, it is split into two nodes, and the "middle" item is passed back up to the parent. If this is full, it must be split in turn, and if the root of the whole tree is split, a new root is created and the tree has one more level. For example, after A,L,G have been entered into the above tree, we need to split the node in two to enter the O as follows: _ _ _ _ _ _ |G|_|_| |G|_|_| _ _ _ _ _ _ / \_ _ _ _ _ _ / \ _ _ _ |A|G|L| |A|_|_| |L|_|_| |A|_|_| |L|O|_| To avoid searching to the bottom level, splitting the bottom level node, and possibly propagating the split back up the tree to the top, it is better to split any full nodes encountered on the way DOWN the tree. This ensures that whenever a "middle" element has to be sent up to a parent, there is room in the parent for it. The algorithm is: Insert item into tree --------------------- if root is full, split into two nodes and pass middle item up to parent if at bottom level, insert item into node insert into relevant subtree As another example, inserting an S into the tree above which contains A,L,G,O,R,I,T,H,M involves splitting the root, then inserting S in the rightmost node: _ _ _ |I|_|_| / \ _ _/_ \_ _ _ |G|_|_| |O|_|_| / | | \ / | | \ _ _ _ / _ _ _| |_ _ _ \ _ _ _ |A|_|_| |H|_|_| |L|O|_| |R|S|T| To see that B-trees ensure O(log(n)) depth and thus O(log(n)) performance, suppose a B-tree has k levels and n items. Each node, except at the bottom level, has at least 2 subtrees and at most 4 subtrees, so
1 + 2 + 4 + 8 + ... + 2^(k-1) <= n <= 3 * (1 + 4 + 16 + ... + 4^(k-1)) 2^k <= n <= 4^k ( = 2^2k) (roughly) k <= log(n) <= 2k so k = O(log(n)) Thus we get: -------------------------------------------------------------------- | Insert, delete, search and range-search in dictionaries can be | | implemented in GUARANTEED time O(log(n)) using B-trees | ------------------------------------------------------------------- 2.6 Disjoint Sets ----------------- The best implementation of sets depends on the operations required on them. Suppose we need UNION = "forming the union of two (disjoint) sets", and FIND = "finding which set an element is in" to be efficient operations on a collection of disjoint sets. Here n is the total number of elements in all the sets. (Insertion, deletion and membership tests might also be needed, but they are simpler than UNION and FIND). If we represent each set as a linked list, UNION is O(1), by linking one list onto the end of the other, but FIND is O(n), because we have to scan the lists. If we use an array with an entry for each element saying which set it is in, FIND is O(1), but UNION is O(n), because we have to scan the array to find which entries need changing. As a compromise between these extremes, the sets can be held as parent trees, that is trees which are not binary, but for which only parent information need be stored, e.g.: A B C 1 2 6 / \ | /|\ 3 5 4 8 7 9 A UNION would still be efficient, making one tree a subtree of the root of the other, so to form the union of A and B: A C 1 6 / | \ /|\ 3 5 2 8 7 9 | 4 If the trees are kept balanced, then a FIND operation can be done by following parent pointers to the root, and will take O(log(n)) steps. In order to keep the trees balanced, we keep track of their SIZE (number of items) and form unions by making the smaller tree a subtree of the larger. We assume that the items are numbered from 1 to n, and the trees are named after their root elements. The items only need parent information, so we can use an array of integers to represent the trees. Parent[n] gives the parent of item n, with 0 for roots. A second array gives the size of each tree.
Thus the following sets, trees and arrays correspond: S1 = {1, 3, 8} S2 = {2, 4, 6, 7, 9} S5 = {5, 10} 1 2 5 / \ / \ | 8 3 6 4 10 / \ 9 7 item: 1 2 3 4 5 6 7 8 9 10 parent: 0 0 1 2 0 2 4 1 4 5 size: 3 5 0 0 2 0 0 0 0 0 We could even save space by using one array for both parent and size. The algorithms are: union (t1, t2) -------------- if size[t1] < size[t2] then swap t1 and t2 link t2 to root of t1 { parent[t2] := t1 } update sizes { size[t1] := size[t1] + size[t2]; size[t2] := 0 } find (item) ----------- x := item while parent[x] <> 0 do x := parent[x] result is x The data structure is initialised by having each element in a tree of its own. We now prove that finding an item takes O(log(n)) steps by showing that the depth of a tree of size s is (d <= log(s)). Note that this says more than just that the trees are well-balanced, which would be (d <= O(log(s))). This is true to begin with as s=1 for each initial tree, and depth d=0. All we have to show is that whenever we form a union, the new tree also has the same property. Suppose we form a union of trees with sizes s1,s2 and depths d1,d2 with s1>=s2. .__________ / \ \ d1 / \ /\ d2+1 / s1 \ /s2\ ------- ---- The depth of the "left hand part" of the union is d1, and of the "right hand part" is d2+1. Suppose d1 >= d2+1. Then the new depth is d1 <= log(s1) <= log(s1+s2) If on the other hand d1 < d2+1, then the new depth is d2+1 <= log(s2)+1 = log(2*s2) <= log(s1+s2) Thus although these are not binary trees, they behave even better than binary trees, having GUARANTEED depth at most log(n), rather than average depth O(log(n)). The number of items at each level roughly doubles as you go down the tree, despite the trees not being binary.
FIND can be speeded up further by changing all the items on the path to the root so that their parents are the root. This is called PATH-COMPRESSION and speeds up later FINDS of these items. The find algorithm becomes: find (item) ----------- find the root ( while parent[x] <> 0 do x := parent[x] ) while parent[item] <> root do temp := parent[item] parent[item] := root item := temp the result is root For example, searching for x in the following tree produces this result: BEFORE AFTER ------ ----- a a=======---- / \ / \ \ \ b c b c d x / \ \ / d e e f / \ f x This speeding up of future finds actually increases the speed of a find from O(log(n)) to an average of almost O(1). The same goes for insertion and membership tests, and (with a bit of extra effort) deletion. --------------------------------------------------------------- | Unions and finds with disjoint sets can be implemented in | | AVERAGE time ALMOST O(1) using parent trees | --------------------------------------------------------------- To see ROUGHLY why this is, suppose the total number of items n is at most 2^(2^16). As this is 2^65536, i.e. roughly 10^20000, this seems quite likely. Now imagine the final tree as it would have been without path-compression. At each stage in the algorithm, the current collection of trees is a collection of subtrees of this final tree. We regard the path compression as defining "shortcuts" in this tree. Pretend that the number of items at each level doubles as you go down the tree. Split the tree into two sections, the lower section consisting of the bottom 16 levels of the tree: /\ / \ / \ ~2^16 levels / \ / n/2^16 \ -------------- / ~n \ 16 levels ------------- Now the bottom level contains half the items, the next one up 1/4 of the items etc. Thus the bottom 16 levels account for all but a fraction 1/2^16 of the items. Thus we can estimate the size of the upper section as n/2^16, and the size of the lower section as about n. The depth of the lower section is 16, and the depth of the upper section is at most log(n)-16 < 2^16.
As a FIND operation climbs the tree, it spends about O(1) time (<16 steps) in the lower section, so we only need consider the cost of the upper section. We can also discount the last step up to the root of the relevant subtree, so each item visited in the upper section has its parent (shortcut) altered as part of the FIND. Each item in the upper section can only be visited (by FIND operations) 2^16 times before its parent becomes the root of the final tree, since its parent is moved up the tree (set equal to the root of a subtree) on each visit. Thus the total number of visits is 2^16 * n/2^16 which is O(n) visits spread over O(n) FIND operations, for an average of O(1) each. The precise result (which comes from a precise version of the above argument) is as follows. If we define F(0)=1 and F(n+1) = 2^F(n) G(n) = minimum k with F(k) >= n, so G(n) <= 5 for all n <= 2^65536, we get that the average time for a find is O(G(n)), which is O(1) for all practical purposes.