DataBase Speed Index

the m8 speed index is a fast index with high performance
M8 - enkle løsninger
Skip Navigation LinksM8 > Articles > DataBase Speed Index

A normal database can be rather slow for looking up keys and returning rows if the index is a text column. This Binary DataBase provides a very fast way for looking up rows.

Introduction

The central point of the Speed Index is a tree structure which is serialized in an index file. When looking up the entries, the database always knows exactly where to look for the word. Therefore, even though the text file becomes rather large during the indexing, the performance is very high (has immediate response).

The central idea

The central idea is to let each letter lead to the next. If for instance the input word is 'adorno' the database has to know in advance exactly where the 'a' resides. When the database finds the 'a', the 'a' can tell where the 'b' resides, the 'b' can tell where the 'c' resides and so on up to a maximum of N lookups, where N equals the length of the word. This structure is illustrated in the figure below.

Figure 1

Here five words are encoded:

As you'll notice the words are prefixed with the initial symbol '^' and post fixed with the end symbol '$'. The initial symbol ensures that there is only one tree. If there where no initial symbol, we would need a forrest consisting of two trees with the root nodes 'a' and 'b' respectively.

The end symbol has a more important function which is illustrated in the difference between word (4) and (5). It is the end symbol which ensures that the word ben exists in the database. If there where no end symbols, only bens would exist. In other words: The end symbol ensures that a word, which is equal to the first part of another word, can be stored in the database.

This is the general idea. The challenge now is to encode this tree in a binary file in such a manner that the lookup of each character can be done in O(1) time.

note it is possible to compress this tree and gain a performance benefit. For instance the orno$ part of $adorno$ carries no additional information and seen from the point of the structure, it is irrelevant since none of the nodes has more than one child node. If we allowed more than one character per node we could let the last node contain 'orno$' instead of just 'o'. I haven't done any math on this, but it would make O(N) the worst case with a mean possibly somewhere around O(lg(N)) for natural language words. Though, the N will always be rather small for natural languages, even in the worst cases, so possibly the trouble with implementing the compression is not worth the effort.

The binary file

Now the central idea is in place, so it's time to get concrete. The binary file is where the index is stored and the tree structure should be implemented. The main criterion for my model is, that it should be able to do the lookup of each character in O(1) time. Hence the size of the file has not been of any concern. In a sample lexicon with approximately 300.000 Danish word forms, the size of the file is 119 MB. This is the cost for the high speed! As mentioned above it is possible, though, to compress the file considerably.

The general structure of the file is that it is build up by a number of blocks. This is illustrated in the figure below.

Figure 2

Each block has to fields: one field for a symbol from the alphabet and a field consisting of a number of links. Each link points to the next symbol field. In the figure the word '^adam$' is used as a case. The '^' will always be at position 0 in the file - it's the first byte. Now the first Char in the String '^adam$' is found. In the link field of the '^' symbol, the DataBase looks for the 'a'. When this is found, the 'a' points to a new block, an 'a'-block. Now the second char in the input string is found. This continues until the end symbol is reached.

But I promised a lookup of each symbol in O(1) time. Therefore the LINK field needs a structure where no iteration is required. This is illustrated in the figure below.

Figure 3

In this figure 3 important things are worth noticing.

First, the LINK field contains a number of different sub fields. The number of sub fields is equal to the True Alphabet, i.e. the alphabet minus the three special characters: the null symbol '0', the initial symbol '^' and the end symbol '$'.

Second, each sub field is exactly n Bytes long. This is a constant which is by default set to 4. If the file will grow very large it is possible that this constant need to be raised. This constant is the reason that (returning to the previous figure) the LINK field is exactly constant * no. of symbols in alphabet long.

And third, the sub fields are organized in the same order as the true alphabet. This is important because it makes it possible to locate the position of the following symbol in the file in ~ O(1) time. This is done by the formula: current position + (the constant * alphabet.IndexOf(next symbol)). This way it is possible for the database first to calculate the position of the pointer to the next symbol and then to move to this position in the file.

The DataBase file

The index file is the first half of the database. The next half is the database itself. I've chosen a rather simple database, which converts each cell in a IDataReader to a String value and writes this String value to the database file. The structure of the database is similar to the structure of the index file in that it contains a number of blocks as illustrated below.

Figure 4

Each block begins with a field length and a data field. The field length-FIELD tells how long the following field is. The data field FIELD resembles a cell in the IDataReader and contains data.

The next question is how we can know where one row stops and the next row begins. This information is stored in the index file. In Figure 2 it was illustrated how the word '^adam$' could be found in the index. The last block (represented by the symbol '$') looked like this:

Figure 5

Here the LINK field is empty. This room we use for two pointers. The first pointer (BEGIN) tells us where in the data file the requested row begins. The second pointer tells us where the next row begins (END). The pseudo code below describes how we are able to retrieve each cell in the row:

    dataFileStream.Position = BEGIN;
    while(dataFileStream.Position < END)
    {
        //Read block
    }
    

An overview of the architecture

Figure 6 provides an overview of the M8.System.Data.BinaryDataBase Class.

Figure 6

The BinaryDataBase is an abstract class that provides the base functionalities for the two classes BinaryDataBaseReader and BinaryDataBaseWriter. The BinaryDataBase implements the IDisposable interface.

The BinaryDataBaseWriter is the class that creates the database. This class inherits from the BinaryDataBase and implements a IWriteToDataBase interface. This interface has the two methods:

The Write method can take either an IDataReader object or a DataTable object as parameter. This way the BinaryDataBaseWriter can build the database either one row at a time or by passing a whole datatable.

The BinaryDataBaseReader class also inherits from the BinaryDataBase class and implements a IReadFromDataBase interface. This interface has four methods:

Finally the BinaryDataBase has an Alphabet object which can be customized.

The code

To create a new DataBase you should use the BinaryDataBaseWriter:

    M8.System.Data.IWriteToDataBase dbWriter = 
        new M8.System.Data.BinaryDataBaseWriter("indexFile.txt", "dataFile.txt", 
                                                    M8.System.IO.FileMode.Create);
    using(dbWriter as IDisposable)
    using(SqlConnection conn = new SqlConnection("ConnectionString"))
    {
        conn.Open();
        dbWriter.Open();
        
        String sql = "SELECT * FROM MySqlDataBase";
        IDataReader rdr = new SqlCommand(sql, conn).ExecuteReader();
        using(rdr as IDisposable)
        {
            while(rdr.Read())
                dbWriter.Write(rdr, "MyIndexColumn");
        }
    }
    

The main parts of this code are the first and last lines. In the first line the BinaryDataBaseWriter is instantiated. We're passing three parameters: a name for the index file, a name for the data file, and a FileMode. The FileMode can have one of the values Create or AppendOrCreate.

In the last line each IDataReader instance is written to the database. This time passing two parameters: the IDataReader and the name of the column to use as an index.

Retrieving elements from the database is equally easy:

    M8.System.Data.IReadFromDataBase rdr = 
        new M8.System.Data.BinaryDataBaseReader(hashFile, dataFile);
    using (rdr as IDisposable)
    {
        rdr.Open();
        IList<String> lst = rdr.GetRow("adorno");
        foreach(String s in lst)
        {
            Console.WriteLine(s);
        }
    }
    

Points of interest

This is a first version of the database, and as the reader will notice, there are many ways to make the database cleaner and stronger. As this happens (when I have the time to do so) I will post updates on this page.

Click here to fast forward to the code.

Download

   Søg   Kontakt    Udskriv