Next page | Contents page |

Data structures

Some background

Suppose you run a company with millions of customers. A customer contacts you and you want to find their record quickly. How would you make that possible? Well, as a company you would probably install a database system and access it using the JDBC API in Java EE, which we will not be covering in this course.

For the moment let us suppose you don't want the added hassle of maintaining a database system. You have all the records in a java.util.List<Customer>, that we met a few pages back. How will you find a customer by name in such a list? The simplest approach would be to retrieve every item sequentially, test every name until you find the one that matches. That could of course take a rather long time.

Searching could be made faster if the list was sorted into alphabetical order of customer name and then we could use a technique called binary searching: test the middle record in the list. If the customer name matches then fine, you were very lucky. If not, look at whether the name must be earlier or later in the list. Then try the middle record in that half of the list. Repeat this process, halving the search region each time. This will find the target much more quickly: 1,000,000 = 220 (roughly) so at most 20 records would need to be checked to find the one you want.

Nevertheless be careful about sorting the list. It would be pointless doing it every time just before the search because the sorting operation itself takes a long time. So the trick would be to build the list so it is always sorted. Whenever a new customer is added, insert it at the right point in the list. For this a linked list is needed, in which each record points to the next. Then insertion only involves changing a couple of pointers. A java.util.ArrayList would not be suitable because all the records later in the list would have to be moved up one to accommodate the new record.

There is an even quicker method of finding records, using a hash table. The basic idea is to make a number (called a hash code) from the customer's name that will form the address of the record in a table. Then whenever we are given the name we can immediately retrieve the record. The only difficulty with this is that we cannot guarantee that the addresses for all customers will be unique. So the software that manages the hash table has to handle the possibility that a few customers have the same address. In that case it forms a (short) list of those customers so it can find them by searching as before. The list look-up is completely hidden within the table manager, so the user of the hash table does not need to know about it.

How to form hash codes? Well, a simple approach is to take the numeric code for each character in the name and simply add the codes up, modulo the number of entries in the table. Eg, if we keep the letter cases and the space, Barack Obama = 66 + 97 + 114 + 97 + 99 + 107 + 32 + 79 + 98 + 97 + 109 + 97 and then do a % operation to get the remainder after dividing by the size of the hash table.

Next page | Contents page |