The aim was to build a search engine for the entire english wikipedia collection . It should be able to search based on hyperlinks, info-box, or a general query and rank the results and display the title and the link. It needs to give the result in under a second and the index size must be good.
So i achieved an index size of 15% of the original data. So lets get into the details.
First step was indexing, being a Noob in this field, the obvious choice was building an inverted index.
The idea is very simple. We will have a set of documents identified by their Wiki IDs(Document IDs) and each document has a set of words. So inverted index has a mapping of words to Wiki IDs. So given a word, we have a list of wiki Ids, the word is present in. So creating an inverted index is pretty simple. Just store a dictionary or map from string(Words) to vector/list of integers(Wiki Ids). But for the wikidump that I had (45GB) , the no of unique words were itself large. The words itself will fit in memory but not the words along with the doc ids. So we need to move the index from in memory to on disk and we weren't allowed to use databases. So the idea was to split the entire dump into chunks of say 1GB and process each chunk by creating the index in memory and then aggregate the index later.
I wrote it in C++ , so I wrote a lot of parsing to ensure speed.(We were graded based on speed and index size).
To start with you are given the wikipedia XML dump (42GB in size) . Like this one - wikipedia.xml(Smaller Version)
In C++, I couldn't find any reliable Streaming XML parser(SAX parser). I found rapidXML which was a DOM parser and was pretty fast. So the idea was to read the XML file and split it into a set of ... entries and parse each using a DOM parser. Doing this is pretty simple, you just read line by line if it contains anything other than a "" concatenate it else if it contains a "" , then concatenate it and then sent it to the DOM parser. Using stl strings proved slow . So did reading line by line and checking. So the optimization here is to read character by character and keep track of till what point the match occurs in the string "". If a mismatch occurs reset the counter to 0 else keep incrementing the counter. If the counter reaches string length of "" , then you know that you have a match.
This is still serial processing. Can you paralellize it? The point to remember is that the data is stored on a HDD. So if you spawn multiple threads that read from the same file , you are actually slowing the entire processing and most of the time spent by the hard disk is in seeking rather than reading. Harddisks are slow to seek.
So the idea was to allocate say 4GB of memory. Now seek to the 4GB point in the wikipedia dump and then keep reading character by character till the is found. Now store the file pointer value. Seek to 0 and read till the pointer value in memory and send it for processing. After this chunk is done, you will start from the pointer value and seek 4 GB after that and do the same till all the dump is processed.
Now since the data is in memory , you can paralellize the processing. Now the idea is to fix the number of threads as t . Divide the 4 GB into t blocks , keeping in mind that the blocks should end on a "" similar to the chunk. Now assign the t blocks to the t threads. I used openMP for paralellizeing.
Now to process each block is easy. Go character by character, fill a buffer with it and break every time when you have read a "" . Then process the buffer using rapid XML , tokenize it , stem it , and do other processing and finally store it in a map<string,pair > . This map stores the word to int,char mapping where int represents the count , char represents the type.( Each bit in the char represents a type, for example bit 0 represents if it was present in the infobox, bit 1 represents if it was present in the title , one more bit for the category ... so on ).
One size optimization we did here was converting all numbers into alphabets , like a = 1 , b = 2 ... so on. Only alphabets were present finally , everything else was removed. This even further reduces the index size when we compress it in the next step and doesn't reduce our retrieval accuracy by much.
This is done by each thread.
There is a global memory which is the posting list for this chunk which is a map from word to List of (WikiArticleID,Count,Char(Type Bits)) tuples.
Once the above map is created , it is merged into this main map .
Then we call on dump index. This will dump the index which is the map currently. You can store it as it is. But when you want to search , you need to read this entire index in memory to search , which is bad. You should be able to search without loading the entire block's index in memory.
To do this you store one more index.
You split the word and the posting list.
So you have a manifest index for each 4GB chunk which gives you a word->(offset in file ) mapping and then you have a huge file(idx.idx) common for all chunks which stores the length of the posting list as a int and all the posting lists in order.
The idea is to have the manifest file in memory (as it fits). If it doesn't fit, you can have a manifest file for this manifest file and so on this one layer of manifest file that fits in memory.
So how to create this?
Since this part is not threaded, keep a file pointer for idx.idx which always points to the end of file(else seek to end of file).
Make a map iterator for the in-memory posting list.
The MapIterator.first would be the word while MapIterator.second would be the posting list.
So for each word, print to the manifest file the word followed by the end-file-pointer location of idx.idx and then print to idx.idx , the size of the posting list followed by the posting list.
So if you want to search , the process would be, the manifest file is in memory , so search for the word in the manifest, you get the file pointer offset in idx.idx. So open the idx.idx file pointer and seek to it. Read the next int , this would be the size of the posting list denoted by n , and then read n*9 bytes(4 bytes for wikiArticleID,4 bytes for word count,1 byte for types present). Thus you have the posting list.
SO do this for the entire dump.
You will end up with a few manifest files and one single idx.idx.
Now remember, for any word, you need to go through all the manifest files and each manifest file will have a posting list in a different location for the same word.
So you need to do a deframenting operation. Like in this case for the word firefox, the posting list is present in 3 different locations in idx.idx
So you read all the posting lists in memory by going to each one of them and then creating a merged posting list. You store this merged posting list in defragidx.idx and then create one single merged manifest for all.
Thus after this you'll end up with one manifest index which gives word to offset and then the defragidx.idx file which is the posting list.
Doing this creates an index 15GB in size. You can reduce it.
The observation is most of the word count is in 1-10 range and you are wasting space by storing it in 8 bytes. Similarly each character takes 1 bytes , when all text is restricted to the 26 characters a-z which requires only 5 bits to represent.
We achieve this by variable byte encoding.
For integers, the way this works is by using a 7 bits in a byte to represent the number and the last bit is the continue bit.
So you keep reading a byte till the continue bit is 0 and concatenate all the 7 bits to get the required number.This is acheived by compress_unsigned_int(),compress_unsigned_long() in process.cpp .
The decoding is done by decompress() in process.cpp .
Have a look at it to find out how it's done.
For words, we have a byte for telling the no of characters and then concatenate characters represented by 5 bytes.
This is done by encode_str and decode_str.
to decode, you read the 1st byte which tells the no of data bytes and then read that many bytes and pass these two to the function.
This is pretty simple, you just read the manifest file in memory in the form of a map, then any search query comes, you tokenize, stem and get the corresponding offsets using the manifest map and then read all posting lists in memory. Then you rank them according to a criterion, I used simple TF-IDF with a weighed score for each one of the types (title, category,infobox,etc).
This resulted in an index size of 5.9GB and a query time of under 1 sec(upto 10 terms).
Please find the code at miniproject.tar.gz