Dict Library README ~~~~~~~~~~~~~~~~~~~ $Id$ Introduction ------------ Libdict is a compact, ANSI C library which provides access to a set of generic and flexible ``dictionary'' data structures, including a hash table, a red-black tree, a splay tree, a weight-balanced tree, a path-reduction tree, and a height-balanced (or AVL) tree. These structures can be used to efficiently store and retrieve key-data pairs. Each of these structures can be accessed using its direct API, or it can be accessed using a dictionary abstraction. All data structures have the following operations defined: - Insert: insert a key-data pair into the dictionary. - Probe: search the data structure for the given key. If found, return it's associated data, otherwise, insert the given key-data pair. - Search: search the data structure for the given key, and return the associated data, if found. - Const Search: same as search but accepts a const pointer to the data structure instead and returns a const pointer to the associated data for the key. - Remove: If found, remove the key-data pair from the data structure. - Count: return the number of key-data pairs stored in the data structure. - Empty: remove all key-data pairs from the data structure. - Walk: pass each key-data pair to a callback function, stopping when the callback functions returns 0. - Destroy: destroy the data structure and release its storaage. - Make Iterator: allocate and return an iterator, which allows one to iterate over each key-data pair in the structure efficiently. Iterators are both random and sequential (bi-directional). In addition, various structures can provide statistical information. For example, each tree structure can provide the minimum and maximum path lengths from the root to a leaf, and the hash table can provide the number of buckets used, etc. These statistical operations are not provided in the generic dictionary interface because they are not applicable to each data structure. The great advantage to using the generic interface is, by changing only the line of code which creates the dictionary, the underlying data structure and therefore the performance characteristics of the tree can be changed. Using the Library ----------------- The library allows the client to store keys and ``satellite'' data associated with each key in the tree. The client must provide the library with a comparison function which will then be used to determine the relationship between two keys. This is passed to the library as a ``dict_cmp_func'', which is defined as: typedef int (*dict_cmp_func)(const void *key1, const void *key2); This function should return 0 if the two keys are determined to be equal, less then zero if key1 < key2, and greater than zero if key1 > key2. This function must be deterministic and must always return the same result for the same keys. In addition, the client may optionally provide functions which can be used for automatically freeing the key and / or the data which is stored in the tree. These are passed to the library as ``dict_del_func'', which is defined as: typedef void (*dict_del_func)(void *); For example, if a tree was being used to store blocks of memory, indexed by allocated strings, dict_cmp_func could be the ``strcmp'', and the routines passed in for freeing the key and data could both be ``free''. The deallocation function is not required to do anything, and it should not be used as a ``notification'' in the client application that the data item is no longer in the tree. This is because these routines are invoked when ``overwriting'' the data of an item already in the tree to free the old key and data. If the deallocation mechanism is abused and is used as a notification that the item is no longer in the key, the application will think the item is no longer in the tree when in fact it is still present but only had its data changed. Red-Black Trees --------------- A red-black tree is a balanced tree structure. This structure guarantees that insertion, deletion, and search for an item has a time complexity O(lg N). A red-black tree with N data items has a height of at most 2lg(N + 1). More specifically, the insertion and deletion algorithms guarantee that no leaf has a distance from the root that is more than twice the distance of another leaf. Thus the tree remains roughly balanced and manipulation of the tree has logarithmic time complexity. This is unlike ``plain'' unbalanced search trees, where performance may, and often does, degenerate into linear time. Although the maximum number of comparisons for an insert, search, or delete for a given item in a red-black tree is bounded by 2lg N + 1, empirical studies on red-black trees which are filled with random data show that the average number of comparisons for these operations is about 1.002lg N. All data structures provide identical interfaces (apart from function name prefixes) and identical semantics for all operations defined. Thus, by presenting the interface for a particular structure, the reader will also learn how to use the rest. In this section, I will explain the interface to the red-black tree. Every data structure is opaque to the user; only a pointer to it may be stored by the client program. The function which creates a new dictionary structure is the abbreviated name of the structure, followed by a ``_new''. For example, to allocate and return a new red-black tree, the function to use would be: rb_tree *rb_tree_new(dict_cmp_func key_cmp, dict_del_func key_del, dict_del_func dat_del); The ``key_cmp'' function is REQUIRED and may not be NULL. Either ``key_del'' or ``dat_del'' may be NULL, in which case they will not be called :-). Once the tree is created, it may be used until it is no longer needed, in which case it must be destroyed with void rb_tree_destroy(rb_tree *tree, int del); This function will release all storage used by the tree and, if del is set, it will invoke the user-defined deallocation functions to free the key and / or the data, if applicable. To insert key and data pairs into the tree, use: int rb_tree_insert(rb_tree *tree, void *key, void *dat, int overwrite); This will attempt to insert the given key and data pair into ``tree''. If the item is already in the tree and ``overwrite'' is non-zero, the new data value will overwrite the old one, and the function will return a positive value. If the node is not present in the tree, it is inserted and the function will return 0. If, however, the function cannot allocate memory to perform the insertion, it will return -1. This is a more general function which also doubles as a tree lookup function: int rb_tree_probe(rb_tree *tree, void *key, void **dat); This function will search the given tree for a match for ``key''; if it is found, the data associated with the given key is stored at the location ``dat'', and the function will return 0. Otherwise, the given key and data pair is inserted into the tree and the function will return 1. ``dat''. However, if it the function was unable to allocate memory it will return -1. Probe operations provides an interface similar to, for example, the programming language awk's associative arrays, in which the first reference to a given key will bring that key and data pair into the array, and subsequent accesses will return the item originally placed into the array. Searching the tree is done using: void *rb_tree_search(rb_tree *tree, const void *key); const void *rb_tree_csearch(const rb_tree *tree, const void *key); Both will return the data associated with the given key if it present in the tree, or NULL. ``rb_tree_csearch'' will return a const pointer to the data. It is meant for sections of code which are not allowed to modify the tree, and therefore are not allowed to modify any data held in the tree. Removing an item from the tree is done using: int rb_tree_remove(rb_tree *tree, const void *key, int del); This function will return 0 if the item was found and successfully removed. If the item was not found it will return -1. Any user-defined deallocators will only be invoked if ``del'' is non-zero. To remove all key-data pairs from a tree, use: void rb_tree_empty(rb_tree *tree, int del); The tree will be completely emptied. Any user-defined deallocators will only be invoked if ``del'' is non-zero. It is possible to have the tree call a given function for each item in the tree. The callback is defined as: int (*dict_vis_func)(const void *key, void *dat); This function will be called for each key-data pair in the tree. The tree will call this function for each item in order, until the function returns zero. This can be done with: void rb_tree_walk(rb_tree *tree, dict_vis_func visit); Finally, to find out how many key-data pairs are held in the tree, use: unsigned rb_tree_count(rb_tree *tree); This sums up the all the operations which can be done on the dict data structures, using the red-black tree as an example. There are, however, more operations which can be performed on trees. unsigned rb_tree_height(const rb_tree *tree); unsigned rb_tree_mheight(const rb_tree *tree); ``rb_tree_height'' will return the length of the longest path from the root to a leaf. ``rb_tree_mheight'' will return the length of the shortest path from the root to a leaf. const void *rb_tree_min(const rb_tree *tree); const void *rb_tree_max(const rb_tree *tree); These functions will return the minimum and maximum keys in the tree, respectively. Be careful, as these may will NULL if the tree is empty. Suppose you decide you would like to use the red-black tree structure provided by this library in your application. It would be perfectly OK to use the ``rb_tree_*'' functions described here. However, suppose at a later time you decide you'd rather use the splay tree. This would require you to go through your source code and change every instance of ``rb_tree'' to ``sp_tree.'' There is a Better Way (tm) to access this library which relieves the programmer from having to know about the underlying data in use. The generic interface is defined in the dict.h file. It includes all operations such as dict_insert(), dict_probe(), dict_remove(), dict_empty(), etc. Each of these routines requires a pointer to a ``dict'' structure. However, there is no function defined which creates a new dict structure. The function which will create the generic dict structure is defined in the header file for that particular structure. For example, earlier we used the rb_tree_new() function to create a new red-black tree. Instead, we now use the rb_dict_new() function, which will create a red-black tree, but instead of returning the tree itself, it will return a dict structure which encapsulates all of the operations on the structure. We can now use the dict structure returned by rb_dict_new() with all of the functions such as dict_insert(), dict_remove(), etc, from dict.h. Thus, the programmer may change the underlying data structure used, and hence the performance characteristics of the application, by simply changing a single line of code - the line which creates the dict structure. Throughout the rest of the application, the programmer no longer has to worry about which actual structure is being used. All these advantages are available for only the most infinitesmal performance hit (the overhead of one extra function call). As mentioned earlier, iterators are also available which allow both random and sequential, bi-directional access. A routine which creates a new iterator would be something like: rb_itor *rb_itor_new(rb_tree *tree); All iterators have the following operations defined: - Valid: is the iterator positioned at a meaningful location? - Invalidate: invalidate the iterator, or position it at a sentinel location which signifies that it cannot be ``dereferenced.'' - Next: move the iterator to the next location. If it was invalid, move it to the first location in its associated dictionary. - Next N: move the iterator forward N times. If it was invalid, first move it the first location in its associated dictionary, then move it forward from there N - 1 times. - Prev: move the iterator to the previous location. If it was invalid, move it to the last location in its associated dictionary. - Prev N: move the iterator backwards N times. If it was invalid, first move it the last location in its associated dictionary, then move it backwards from there N - 1 times. - First: move the iterator to the first location in its associated dictionary. - Last: move the iterator to the last location in its associated dictionary. - Key: return the key stored at the current location of the iterator. - Data: return the data stored at the current location of the iterator. - Set Data: change the data stored the current location of the iterator. The following code fragment illustrates the use of the iterator: rb_tree *tree; rb_itor *itor; tree = rb_tree_new((dict_cmp_func), free, free); /* Insert some items, keyed by character string, into the tree here... */ /* Print every character string, in order, that is in the tree. */ itor = rb_itor_new(tree); for (rb_itor_first(itor); rb_itor_valid(itor); rb_itor_next(itor)) printf("%s\n", rb_itor_key(itor)); rb_itor_destroy(itor); Farooq Mela <fmela@sm.socccd.cc.ca.us> Mar 27, 2001