libdict

Introduction

Libdict is a compact, ANSI C library which provides access to a set of generic and flexible ``dictionary'' data structures. All algorithms used in libdict have been optimized, and, with one very small exception, are not recursive but iterative. It was written entirely by me, and is released under a BSD style licence. Libdict can be downloaded here. Descriptions of several of the algorithms employed in Libdict can be downloaded from here.

Libdict implements the following data structures, which are listed with their asymptotic expected and worst-case complexity for insert, search, and delete operations:

Data Structure	Expected Time	Worst-Case Time
AVL Tree	O(lg N)	O(lg N)
Red-Black Tree	O(lg N)	O(lg N)
Splay Tree	O(lg N) [amortized]	O(N)
Treap	O(lg N)	O(N)
Weight-balanced tree	O(lg N)	O(lg N) insert and search, O(N) delete
Path-reduction tree	O(lg N)	O(lg N) insert and search, O(N) delete
Hashtable (Chained)	O(1)	O(1) insert, O(N) search and delete

These structures can be used to efficiently store and retrieve key-data pairs. Each of these structures can be accessed using its direct API, or it can be accessed using a dictionary abstraction. Despite it's name, libdict can be used to store any kind of data and any kind of key (provided it fits into a 'void' pointer on your system). All data structures have the following operations defined:

Insert insert a key-data pair into the dictionary.
Probe search the data structure for the given key. If found, return it's associated data, otherwise, insert the given key-data pair.
Search search the data structure for the given key, and return the associated data, if found.
Const Search same as search but accepts a const pointer to the data structure instead and returns a const pointer to the associated data for the key.
Remove If found, remove the key-data pair from the data structure.
Count return the number of key-data pairs stored in the data structure.
Empty remove all key-data pairs from the data structure.
Walk pass each key-data pair to a callback function, stopping when the callback function returns 0.
Destroy destroy the data structure and release its storaage.
Make Iterator allocate and return an iterator, which allows one to iterate over each key-data pair in the structure efficiently. Iterators are both random and sequential (bi-directional).

In addition, various structures can provide statistical information. For example, each tree structure can provide the minimum and maximum path lengths from the root to a leaf, and the hash table can provide the number of buckets used, etc. These statistical operations are not provided in the generic dictionary interface because they are not applicable to each data structure.

The great advantage to using the generic interface is, by changing only the line of code which creates the dictionary, the underlying data structure and therefore its performance characteristics can be changed.

Using the Library

The library allows the client to store keys and ``satellite'' data associated with each key in a dict structure. The client must provide the library with a comparison function which will then be used to determine the relationship between two keys. This is passed to the library as a ``dict_cmp_func'', which is defined as:


typedef int		(*dict_cmp_func)(const void *key1, const void *key2);

This function should return 0 if the two keys are determined to be equal, less then zero if key1 < key2, and greater than zero if key1 > key2. This function must be deterministic and must always return the same result for the same keys. In addition, the comparison relation must be reflexive (A = A), symmetric (A = B implies B = A), and transitive (A = B and B = C implies A = C, A < B and B < C implies A < C, etc).

In addition, the client may optionally provide functions which can be used for automatically freeing the key and / or the data which is stored in the tree. These are passed to the library as ``dict_del_func'', which is defined as:


typedef void	(*dict_del_func)(void *);

For example, if a tree was being used to store blocks of memory, indexed by allocated strings, dict_cmp_func could be the ``strcmp'', and the routines passed in for freeing the key and data could both be ``free''. The deallocation function is not required to do anything, and it should not be used as a ``notification'' callback in the client application that the data item is no longer in the tree. This is because these routines are invoked when ``overwriting'' the data of an item already in the tree to free the old key and data. If the deallocation mechanism is abused and is used as a notification that the item is no longer in the key, the application will think the item is no longer in the tree when in fact it is still present but only had its data changed.

However, say we are storing a structure such as the one that follows as data in a dict structure:


struct fooey {

    char    *key;

    int	     bar;

    void    *ptr;

};

If the struct members key and ptr are allocated and we wish to free them when the structure is removed from the dict structure, we could write the following free routine:


void

free_fooey(struct fooey *f)

{

    free(f->key);

    free(f->ptr);

    free(f);

}

A pointer to this function can now be passed as the 'dat_del' argument to a function which creates a dict structure (this will be explained below).

All data structures provide identical interfaces (apart from function name prefixes) and identical semantics for all operations defined. Thus, by presenting the interface for a particular structure, the reader will also learn how to use the rest. In the following section, the interface to the red-black tree will be explained. Every data structure is opaque to the user; only a pointer to it may be stored by the client program.

Red-Black Trees

A red-black tree is a balanced tree structure. This structure guarantees that insertion, deletion, and search for an item has a time complexity O(lg N). A red-black tree with N data items has a height of at most 2lg(N + 1). More specifically, the insertion and deletion algorithms guarantee that no leaf has a distance from the root that is more than twice the distance of another leaf. Thus the tree remains roughly balanced and manipulation of the tree has logarithmic time complexity. This is unlike ``plain'' unbalanced search trees, where performance may, and often does, degenerate into linear time.

Although the maximum number of comparisons for an insert, search, or delete for a given item in a red-black tree is bounded by 2lg N + 1, empirical studies on red-black trees which are filled with random data show that the average number of comparisons for these operations is about 1.002lg N.

The following function will allocate and return a new red-black tree:


rb_tree *rb_tree_new(dict_cmp_func key_cmp, dict_del_func key_del,
					 dict_del_func dat_del);

If the ``key_cmp'' argument is NULL, then the keys which are later inserted into the tree will be cast to void pointers and their value will be compared. One will rarely want this behavior, but it is there should you require it. Either ``key_del'' or ``dat_del'' may be NULL, in which case they will not be called :-).

Once the tree is created, it may be used until it is no longer needed, in which case it must be destroyed with


void rb_tree_destroy(rb_tree *tree, int del);

This function will release all storage used by the tree and, if del is set, it will invoke the user-defined deallocation functions to free the key and / or the data, if applicable.

To insert key and data pairs into the tree, use:


int rb_tree_insert(rb_tree *tree, void *key, void *dat, int overwrite);

This will attempt to insert the given key and data pair into ``tree''. If the item is already in the tree and ``overwrite'' is non-zero, the new data value will overwrite the old one, and the function will return a positive value. If the node is not present in the tree, it is inserted and the function will return 0. If, however, the function cannot allocate memory to perform the insertion, it will return -1.

This is a more general function which also doubles as a tree search function:


int rb_tree_probe(rb_tree *tree, void *key, void **dat);

This function will search the given tree for a match for ``key''; if it is found, the data associated with the given key is stored at the location ``dat'', and the function will return 0. Otherwise, the given key and data pair is inserted into the tree and the function will return 1. However, if it the function was unable to allocate memory it will return -1. Probe operations provides an interface similar to, for example, the programming language awk's associative arrays, in which the first reference to a given key will bring that key and data pair into the array, and subsequent accesses will return the item originally placed into the array.

Searching the tree is done using:


void *rb_tree_search(rb_tree *tree, const void *key);

const void *rb_tree_csearch(const rb_tree *tree, const void *key);

Both will return the data associated with the given key if it present in the tree, or NULL. ``rb_tree_csearch'' will return a const pointer to the data. It is meant for sections of code which have a const pointer to a rb_tree, and so are not allowed to modify the tree or any data held in the tree.

Removing an item from the tree is done using:

int rb_tree_remove(rb_tree *tree, const void *key, int del);

This function will return 0 if the item was found and successfully removed. If the item was not found it will return -1. Any user-defined deallocators will only be invoked if ``del'' is non-zero.

To remove all key-data pairs from a tree, use:

void rb_tree_empty(rb_tree *tree, int del);

The tree will be completely emptied. Any user-defined deallocators will only be invoked if ``del'' is non-zero.

It is possible to have the tree call a given function for each item in the tree. The callback is defined as:


int	(*dict_vis_func)(const void *key, void *dat);

This function will be called for each key-data pair in the tree. The tree will call this function for each item in order, until the function returns zero. This is accomplished with:


void rb_tree_walk(rb_tree *tree, dict_vis_func visit);

Finally, to find out how many key-data pairs are held in the tree, use:


unsigned rb_tree_count(rb_tree *tree);

This sums up the all the operations which can be done on the dict data structures, using the red-black tree as an example. There are, however, more operations which can be performed on trees.


unsigned rb_tree_height(const rb_tree *tree);

unsigned rb_tree_mheight(const rb_tree *tree);

``rb_tree_height'' will return the length of the longest path from the root to a leaf. ``rb_tree_mheight'' will return the length of the shortest path from the root to a leaf.


const void *rb_tree_min(const rb_tree *tree);

const void *rb_tree_max(const rb_tree *tree);

These functions will return the minimum and maximum keys in the tree, respectively. Be careful, as these may will NULL if the tree is empty. These functions also allow the tree to be used as a double-ended priority queue (although there are better algorithms for that).

Suppose you decide you would like to use the red-black tree structure provided by this library in your application. It would be perfectly OK to use the ``rb_tree_*'' functions described here. However, suppose at a later time you decide you'd rather use the splay tree. This would require you to go through your source code and change every instance of ``rb_tree'' to ``sp_tree.'' There is a Better Way (tm) to access this library which relieves the programmer from having to know about the underlying data in use. There is where the generic interface of libdict comes into play.

The generic interface is defined in the dict.h file. It includes all operations such as dict_insert(), dict_probe(), dict_remove(), dict_empty(), etc. Each of these routines [they are actually macros] requires a pointer to a ``dict'' structure. However, there is no function defined which creates a new dict structure. The function which will create the generic dict structure is defined in the header file for that particular structure. For example, earlier we used the rb_tree_new() function to create a new red-black tree. Instead, we now use the rb_dict_new() function, which will create a red-black tree, but instead of returning the tree itself, it will return a dict structure which encapsulates all of the operations on the structure. We can now use the dict structure returned by rb_dict_new() with all of the functions such as dict_insert(), dict_remove(), etc, from dict.h. Thus, the programmer may change the underlying data structure used, and hence the performance characteristics of the application, by simply changing a single line of code - the line which creates the dict structure. Throughout the rest of the application, the programmer no longer has to worry about which actual structure is being used.

As mentioned earlier, iterators are also available which allow both random and sequential, bi-directional access. A routine which creates a new iterator would be something like:

rb_itor *rb_itor_new(rb_tree *tree);

All iterators have the following operations defined:

Valid is the iterator positioned at a meaningful location?
Invalidate invalidate the iterator, or position it at a sentinel location which signifies that it cannot be ``dereferenced.''
Next move the iterator to the next location. If it was invalid, move it to the first location in its associated dictionary.
Next N move the iterator forward N locations. If it was invalid, first move it the first location in its associated dictionary, then move it forward from there N - 1 times.
Prev move the iterator to the previous location. If it was invalid, move it to the last location in its associated dictionary.
Prev N move the iterator backwards N locations. If it was invalid, first move it the last location in its associated dictionary, then move it backwards from there N - 1 times.
First move the iterator to the first location in its associated dictionary.
Last move the iterator to the last location in its associated dictionary.
Key return the key stored at the current location of the iterator.
Data return the data stored at the current location of the iterator.

Set Data change the data stored the current location of the iterator.

The following program illustrates the use of a red-black tree and an iterator.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <dict.h>

void *xmalloc(size_t size);
char *xstrdup(const char *s);

int
main(void)
{
	rb_tree *tree;
	rb_itor *itor;
	FILE *fp;
	char *p;
	char buf[512];
	int rv;

	if ((fp = fopen("data.txt", "r")) == NULL) {
		fprintf(stderr, "can't open data.txt\n");
		exit(EXIT_FAILURE);
	}

	/*
	 * Set the allocation function for the library. This may not be a good idea
	 * except in trivial programs like this one.
	 */
	dict_set_malloc(xmalloc);

	tree = rb_tree_new((dict_cmp_func)strcmp, free, free);

	/*
	 * We expect the text file to contain lines like "key=value".
	 */
	 while ((fgets(buf, sizeof(buf), fp)) != NULL) {
	 	p = strchr(buf, '=');
		if (!p) {
			fprintf(stderr, "bad file syntax\n");
			exit(EXIT_FAILURE);
		}
		*p++ = '\0';
		if (!*p) {
			fprintf(stderr, "bad file syntax\n");
			exit(EXIT_FAILURE);
		}
		if ((rv = rb_tree_insert(tree, xstrdup(buf), xstrdup(p), 0)) != 0) {
			fprintf(stderr, "duplicate key in file: '%s'\n", buf);
			exit(EXIT_FAILURE);
		}
	 }
	 rewind(fp);
	 while ((fgets(buf, sizeof(buf), fp)) != NULL) {
	 	strtok(buf, '=');
		if ((p = rb_tree_search(tree, buf)) == NULL) {
			fprintf(stderr, "key '%s' not found in tree!\n", buf);
			exit(EXIT_FAILURE);
		}
		printf("%s = %s\n", buf, p);
	 }
	 printf("\n");
	 fclose(fp);

	/* Print every key and data pair that is in the tree, in order. */
	itor = rb_itor_new(tree);
	for (; rb_itor_valid(itor); rb_itor_next(itor))
		printf("%s = %s\n", rb_itor_key(itor), rb_itor_data(itor));
	rb_itor_destroy(itor);

	/* Destroy the tree and free all the data in it. */
	rb_tree_destroy(tree, TRUE);

	exit(EXIT_SUCCESS);
}

void *
xmalloc(size_t size)
{
	void *p;

	if ((p = malloc(size)) == NULL) {
		fprintf(stderr, "out of memory\n");
		exit(EXIT_FAILURE);
	}
	return p;
}

char *
xstrdup(const char *s)
{
	unsigned len;

	len = strlen(s) + 1;
	return memcpy(xmalloc(len), s, len);
}

At the risk of being repetitive, here is the same program, but this time it uses the generic interface. (Some parts of the first listing are omitted for brevity).


int
main(void)
{
	dict *dct;
	dict_itor *itor;
	FILE *fp;
	char *p;
	char buf[512];
	int rv;

	if ((fp = fopen("data.txt", "r")) == NULL) {
		fprintf(stderr, "can't open data.txt\n");
		exit(EXIT_FAILURE);
	}

	/*
	 * Set the allocation function for the library so we dont have to worry
	 * about it running out of memory. This may not be a good idea except in
	 * trivial programs like this one.
	 */
	dict_set_malloc(xmalloc);

	dct = rb_dict_new((dict_cmp_func)strcmp, free, free);

	/*
	 * We expect the text file to contain lines like "key=value".
	 */
	 while ((fgets(buf, sizeof(buf), fp)) != NULL) {
	 	strtok(buf, "\n");
	 	p = strchr(buf, '=');
		if (!p) {
			fprintf(stderr, "bad file syntax\n");
			exit(EXIT_FAILURE);
		}
		*p++ = '\0';
		if (!*p) {
			fprintf(stderr, "bad file syntax\n");
			exit(EXIT_FAILURE);
		}
		if ((rv = dict_insert(dct, xstrdup(buf), xstrdup(p), 0)) != 0) {
			fprintf(stderr, "duplicate key in file: '%s'\n", buf);
			exit(EXIT_FAILURE);
		}
	 }
	 rewind(fp);
	 while ((fgets(buf, sizeof(buf), fp)) != NULL) {
	 	strtok(buf, "=");
		if ((p = dict_search(dct, buf)) == NULL) {
			fprintf(stderr, "key '%s' not found in dict!\n", buf);
			exit(EXIT_FAILURE);
		}
		printf("%s = %s\n", buf, p);
	 }
	 printf("\n");
	 fclose(fp);

	/* Print every key and data pair that is in the dict, in order. */
	itor = dict_itor_new(dct);
	for (; dict_itor_valid(itor); dict_itor_next(itor))
		printf("%s = %s\n", dict_itor_key(itor), dict_itor_data(itor));
	dict_itor_destroy(itor);

	/* Destroy the dict and free all the data in it. */
	dict_destroy(dct, TRUE);

	exit(EXIT_SUCCESS);
}

Now, if we wanted to use some other data structure than a red-black tree, such as a splay tree, we could just change the line
dct = rb_dict_new(...) to
dct = sp_dict_new(...)'.

Copyright (C) 2001 Farooq Mela. Comments, suggestions, bug fixes, etc, are welcome.