Coursework: Week 10, Hash

The coursework this week doesn't count towards the unit. The main purpose is to investigate hash tables, which is important for next year You need the hash table chapter of the lecture notes.

make choices

The idea is to implement a hash table module. It can be designed either as a library module, making it reasonably general purpose, or it can be designed for a specific program that you have in mind. Nothing is provided for you. It is up to you to create a header file hash.h, an implementation hash.c, and to adapt a previous Makefile, and to choose what to do.

Decide what to store

You need to decide what items you want to store in the hash table.

One possibility is to just store strings (so you are implementing a hash set of strings).

Another is to store structures consisting of a string and other data, with the string as the key, e.g. a string as a word plus a word count, or a string as a name plus a telephone number. This would be a hash map from strings to some other data.

Decide on a hash function

I suggest either Java's hash function (h = 31*h +c) or Jenkin's one-at-a-time hash function. If you implement Jenkin's function, look online to see if you can find actual examples of the results it produces, to use as tests.

Decide how big the hash table will be

I recommend making the hash table size a power of two. Old books and tutorials suggest a prime number, but (a) that's obsolete - all it does is to make a slight improvement to a poor hash function and (b) it is not easy to increase the size to a different prime to make the hash table flexible.

You can either choose a fixed power of two, or double the size every time the table becomes too full. Perhaps the best is to start with a fixed power of two, and switch to making it flexible later if you have time.

Decide how to handle collisions

What happens when two strings have the same hash number (modulo the hash table size) so that they end up in the same slot?

You could investigate open addressing (also confusingly called closed hashing) so that there is only ever one item in a slot, but I don't recommend it - there aren't really any outstanding advantages to compensate for the complexity.

The simplest option is for each slot to be a (singly) linked list of nodes. That means linear search has to be used to find the right item, but that's OK if the lists are always very short.

One of the best alternatives to investigate is where each table slot contains a linked list, but the lists are stored in an array. As well as the hash table, a single array of node structures is allocated, and that array is increased in size if it runs out of space. All 'pointers' are represented by integer indexes into the array. One special linked list in the array holds all the free nodes, which are not currently being used for anything else. This variation reduces the memory and time overheads associated with allocating nodes singly, and can also halve the amount of memory being used by pointers (by using 32 bit integers instead of 64 bit pointers).

write something extra

If you have time, and if you are interested or you want more than 50% for the assignment, you can do some extra open-ended work. Write any program you want in C (of the same sort of scope as the triangle program) which involves simple classification or calculation.

Submit the source file, and a file readme.txt (or a pdf file) which describes your aims for the program and how far you got with it. There are no marks for report writing, but the report may be necessary for us to make sense of the program.

A mark out of 50 for the extra work will be awarded by swiftly reading the report, checking whether your program matches what you claim, judging the sophistication of what has been done, and checking whether the program follows the conventions and advice in the lectures. In particular, writing tests as part of the program, in the same way as the skeletons, is very much recommended (though you only need enough tests to make you confident). Also recommended is working in very small steps, one test at a time, keeping your program in a working state.

The mark will aim to make your total for the assignment meet the university scale. So assuming you get full marks for the triangle program, 10/50 means "this raises your total result from average to good", 20/50 means "overall first class", 30/50 means "overall above and beyond what was expected" and 40/50 means "publishable in a research journal".

submit

Submit your program triangle.c (not Triangle.c or any other name, unless you want to lose marks, and not the compiled program). Make sure your program compiles without warnings, runs, and doesn't still contain debugging print statements. Feel free to resubmit it if you improve it before the deadline (using the same filename).

Submit your extra program, if any, and a readme.txt file with any comments you might want to make.