Creating a single node search engine for Wikipedia

The aim was to build a search engine for the entire English Wikipedia collection. It should be able to search based on hyperlinks, infobox fields, or a general query, rank the results, and display the title and the link. It needs to return results in under a second, and the index must be compact.
I achieved an index size of 15% of the original data. So let's get into the details.

The first step was indexing. Being new to this field, the obvious choice was building an inverted index.
The idea is very simple. We have a set of documents identified by their Wiki IDs (document IDs), and each document has a set of words. An inverted index maps words to Wiki IDs: given a word, we get the list of Wiki IDs the word appears in. Creating one is straightforward: just store a dictionary or map from string (word) to a vector/list of integers (Wiki IDs). But for the wikidump I had (45GB), even the number of unique words was large. The words alone would fit in memory, but not the words along with the doc IDs. So the index had to move from memory to disk, and we weren't allowed to use databases. The plan was to split the entire dump into chunks of, say, 1GB, process each chunk by building its index in memory, and aggregate the indexes later.

I wrote it in C++, with a lot of hand-rolled parsing to ensure speed (we were graded on speed and index size).
To start with, you are given the Wikipedia XML dump (42GB in size), like this one – wikipedia.xml (smaller version).

Step-1 (indexer.cpp):
In C++, I couldn't find any reliable streaming XML parser (SAX parser). I found RapidXML, which is a DOM parser and is pretty fast. So the idea was to read the XML file, split it into a set of `<page>…</page>` entries, and parse each with the DOM parser. Doing this naively is simple: read line by line, concatenating lines until one contains a `</page>`, then send the accumulated entry to the DOM parser. Using STL strings proved slow, and so did reading line by line and checking. The optimization here is to read character by character and keep track of how far the input matches the string `</page>`. On a mismatch, reset the counter to 0; otherwise keep incrementing it. When the counter reaches the length of `</page>`, you know you have a match.
This is still serial processing. Can you parallelize it? The point to remember is that the data is stored on an HDD. If you spawn multiple threads reading from the same file, you actually slow the whole thing down: the disk spends most of its time seeking rather than reading, and hard disks are slow to seek.

So the idea was to allocate, say, 4GB of memory. Seek to the 4GB point in the Wikipedia dump and keep reading character by character until a `</page>` is found. Store that file offset. Seek back to 0, read up to that offset into memory, and send it for processing. After this chunk is done, start from the stored offset, seek 4GB beyond it, and repeat until the whole dump is processed.

Now that the data is in memory, you can parallelize the processing. Fix the number of threads as t. Divide the 4GB into t blocks, keeping in mind that each block should end on a `</page>`, just like the chunk itself. Then assign the t blocks to the t threads. I used OpenMP for parallelizing.

Step-2 (process.cpp):
Processing each block is easy. Go character by character, filling a buffer, and break every time you have read a `</page>`. Then process the buffer using RapidXML, tokenize it, stem it, do the other processing, and finally store it in a map<string, pair<int,char>>. This maps each word to a (count, type) pair, where the int is the count and each bit of the char represents a type: for example, bit 0 marks presence in the infobox, bit 1 presence in the title, one more bit for the category, and so on.
One size optimization we did here was converting all numbers into letters (a = 1, b = 2, and so on). In the end only letters were kept; everything else was removed. This further reduces the index size when we compress it in the next step, without hurting retrieval accuracy much.
This is done by each thread.
There is a global structure, the posting list for this chunk, which is a map from word to a list of (WikiArticleID, Count, Char(type bits)) tuples.
Once a thread's map is created, it is merged into this main map.

Step-3 (process.cpp):
Then we dump the index, which currently is the in-memory map. You could store it as-is, but then searching would require reading the entire index back into memory, which is bad. You should be able to search without loading a whole block's index into memory.
To do this you store one more index.
You split the word from its posting list.
So you have a manifest index for each 4GB chunk, which gives a word -> (offset in file) mapping, and one huge file (idx.idx), common to all chunks, which stores the length of each posting list as an int followed by the posting list itself, in order.

The idea is to keep the manifest file in memory (it fits). If it didn't fit, you could build a manifest for the manifest, and so on, until one layer of manifest fits in memory.

So how do we create this?
Since this part is not threaded, keep a file pointer for idx.idx that always points to the end of the file (else seek to the end).
Make a map iterator for the in-memory posting list.
MapIterator->first is the word and MapIterator->second is its posting list.

For each word, write to the manifest file the word followed by the current end-of-file offset of idx.idx, then write to idx.idx the size of the posting list followed by the posting list itself.

To search, the process is: the manifest file is in memory, so look up the word in the manifest to get the file offset in idx.idx. Open idx.idx and seek to that offset. Read the next int; this is the size of the posting list, n. Then read n*9 bytes (4 bytes for WikiArticleID, 4 bytes for word count, 1 byte for the types present). You now have the posting list.

Do this for the entire dump.
You will end up with a few manifest files and one single idx.idx.
Now remember, for any word you need to go through all the manifest files, and each manifest file will point to a different location in idx.idx for the same word.

Manifest 1
firefox 90223

Manifest 2
firefox 126637

Manifest 3
firefox 76859

So you need a defragmenting operation. In this case the posting list for the word firefox is present in 3 different locations in idx.idx.
You read all these posting lists into memory, one after the other, and create a single merged posting list. You store this merged list in defragidx.idx and create one merged manifest for all of them.
After this you'll end up with one manifest index, which maps word to offset, and the defragidx.idx file, which holds the posting lists.

Doing this creates an index 15GB in size. You can reduce it.
The observation is that most word counts are in the 1–10 range, so storing them in a fixed-width integer wastes space. Similarly, each character takes 1 byte, when all text is restricted to the 26 characters a–z, which need only 5 bits each.

We achieve this with variable byte encoding.
For integers, each byte uses 7 bits for the data, with the eighth bit as a continuation bit.
You keep reading bytes until the continuation bit is 0 and concatenate all the 7-bit groups to get the number. This is done by compress_unsigned_int() and compress_unsigned_long() in process.cpp.
The decoding is done by decompress() in process.cpp.
Have a look at it to see how it's done.
For words, one byte gives the number of characters, followed by the characters themselves, each represented in 5 bits.
This is done by encode_str and decode_str.
To decode, you read the 1st byte, which tells you the length, then read the data bytes and pass both to the function.

Searching is pretty simple: read the manifest file into memory as a map. When a query comes, tokenize it, stem it, get the corresponding offsets from the manifest map, and read all the posting lists into memory. Then rank them according to a criterion; I used simple TF-IDF with a weighted score for each of the types (title, category, infobox, etc.).

This resulted in an index size of 5.9GB and a query time of under 1 sec (up to 10 terms).
Please find the code at miniproject.tar.gz

Hand Tracking And Gesture Detection (OpenCV)

The aim of the project was to devise a program that detects our hands, tracks them in real time, and performs some gesture recognition. It is to be done with simple signal processing performed on images obtained from a regular laptop web camera. It was an under-two-week project, so the threshold values in the code, including those for the Canny filter, need a bit more tweaking. It doesn't work well with a changing background; with a better background detection and subtraction algorithm you can get better results.

  1. Detecting Background
    Given the feed from the camera, the first thing to do is to remove the background. We use a running average over a sequence of images to get the average image, which will be the background.
     CurBG[i][j] = alpha*CurBG[i][j] + (1-alpha)*CurFrame[i][j]
    This works because the background is assumed to be mostly static. For stationary pixels the weighted averaging has no effect, since alpha*x + (1-alpha)*x = x, while constantly changing pixels are not part of the background and get weighed down. So with every iteration the stationary pixels (the background) become more and more prominent, while the moving ones fade out. After a few iterations you get an average image that contains only the background. In this case even my face is part of the background, as the program needs to detect only my hands.
  2. Background Subtraction
    A simple method to start with is subtracting the pixel values. However, this can produce negative values and values greater than 255, the maximum an 8-bit channel can store. And what if we have a black background? Nothing gets subtracted in that case. Instead we use an inbuilt background subtractor based on a Gaussian mixture-based background/foreground segmentation algorithm. Background subtraction involves calculating a reference image, subtracting each new frame from it, and thresholding the result, which yields a binary segmentation highlighting the regions of non-stationary objects. We then use erosion and dilation to make the changes more prominent.
  3. Contour Extraction
    Contour extraction is performed using OpenCV's inbuilt edge extraction function. It uses a Canny filter. You can tweak the parameters to get better edge detection.
  4. Convex Hull and Defects

    Now, given the set of points for the contour, we find the smallest-area convex hull that covers the contour. The observation here is that the convex hull points are most likely to be on the fingers, as they are the extremities, so this fact can be used to detect the number of fingers. But since our entire arm is in frame, there will be other points of convexity too. So we find the convexity defects, i.e., between each arm of the hull we find the deepest point of deviation on the contour.
  5. Tracking and Finger Detection
    The defect points are most likely to be the centers of the finger valleys, as pointed out by the picture. The average of all these defects is bound to lie near the center of the palm, but it is a very rough estimate. So we average them to find this rough palm center. Assuming the palm is angled such that it is roughly a circle, we take the 3 points closest to the rough palm center and find the center and radius of the circle passing through them. This gives the center of the palm. Due to noise this center keeps jumping, so to stabilize it we average it over a few iterations. The radius of the palm indicates its depth, and we know its center, so we can track the position of the palm in real time and even estimate its depth from the radius. The next challenge is detecting the number of fingers, for which we use a couple of observations. For each maximal defect point, which will be a fingertip, there will be 2 minimal defect points marking the valleys. The maximum and the 2 minima should form a triangle, with the distances from the maximum to each minimum roughly equal. Also, the minima should be on or pretty close to the circumference of the palm; we use this factor too. Finally, the ratio of the palm radius to the length of the finger triangle should be roughly constant. Using these properties we get the list of maximal defect points that satisfy the conditions, and from that the number of fingers. If the number of fingers is 0, the user is showing a fist.

In theory it all sounds perfect. In practice the finger count given by the program is a bit so-so, so use it at your own risk. I didn't get time to play with the thresholds; you should get pretty good results with the right ones (the ratio of radius to finger length, the length of a finger, etc.). Currently, with the program as given, you can move the mouse pointer and click.
The code is located at:
The report is located at:

Dynamically adding entries to ListView

To create a ListView, you create an ArrayAdapter and supply it with an ArrayList of strings.
So to add entries, all you need to do is add them to your ArrayList and then notify your adapter by calling notifyDataSetChanged().


Here's the code

public class YourClass extends ListActivity {
  ArrayList<String> listItems = new ArrayList<String>();
  ArrayAdapter<String> adapter;

  public void onCreate(Bundle icicle) {
    super.onCreate(icicle);
    adapter = new ArrayAdapter<String>(this, android.R.layout.simple_list_item_1, listItems);
    setListAdapter(adapter);
  }

  // To add stuff to the list
  public void addItems(View v) {
    listItems.add("Whatever entry");
    adapter.notifyDataSetChanged();
  }
}
PS: You need to call notifyDataSetChanged() only in the UI thread.
To run code in the UI thread, do this:

((Activity) Your_Activity_Context_Variable).runOnUiThread(new Runnable() {
        public void run() {
            // Whatever code to be run in the UI thread, insert here
        }
});
Resize EBS device on AWS

You cannot directly resize an EBS device like you would a disk on a computer, but it's pretty simple.
1) Go to your EC2 console and open Snapshots.
2) Create a snapshot of the EBS volume you want to expand. It will take some time.
3) Right click on that snapshot -> Create Volume -> enter the final size of your EBS volume.
4) Now shut down the EC2 instance the old EBS volume is attached to. Detach the old EBS volume by going to EBS Volumes -> select the old volume -> Detach.
5) Now right click on the new volume, click Attach, and mention the mount point (if it's a root EBS volume, use /dev/sda1).
6) If you had any Elastic IPs associated with the instance, reattach them.
7) Start your instance.
You are now done!

Unable to extract EDID:NVIDIA Linux

Sometimes Xorg fails to start with the NVIDIA binary drivers if the EDID cannot be extracted.

cat /var/log/Xorg.0.log | grep -i EDID

If you get a line that says unable to extract EDID,
then boot into Windows and download SoftMCCS by EnTech.
Go to File -> Save EDID and save it as edid.bin (don't save it in raw or any other format).
Copy this to /etc/X11 on your Linux install.
Then add this to your xorg.conf under the Device section:

 Option "CustomEDID" "DFP-0:/etc/X11/edid.bin"

assuming DFP-0 is the display in use. If you don't know that,
run "xrandr" in a terminal and it will tell you which port your display is connected to.

Multiple Views Android

Many times, especially in games, you may need multiple views: a basic OpenGL renderer that occupies the entire surface, plus other views like TextViews or AdViews (to promote your apps) layered on top of it occasionally. I recently ran into one such issue in my game, where I needed to add a view for displaying advertisement banners. I used Android's RelativeLayout rather than the default LinearLayout or nested LinearLayouts. So this is the code for 2 views: the main one is the AndEngine surface and the other is an advertising banner (Tap for Tap).

So here is the code:

    protected void onSetContentView() {
            final RelativeLayout relativeLayout = new RelativeLayout(this);
            final FrameLayout.LayoutParams relativeLayoutLayoutParams = new FrameLayout.LayoutParams(FILL_PARENT, FILL_PARENT);

            //Create as many views as you want
            AdView adView = new AdView(this);                                     //View1
            adView.layout(0, 0, 50, 100);

            this.mRenderSurfaceView = new RenderSurfaceView(this);                //View2

            final LayoutParams surfaceViewLayoutParams = new RelativeLayout.LayoutParams(super.createSurfaceViewLayoutParams());
            surfaceViewLayoutParams.addRule(RelativeLayout.CENTER_IN_PARENT);     //Mention the relative layout rules

            relativeLayout.addView(this.mRenderSurfaceView, surfaceViewLayoutParams); //Add the views in order
            relativeLayout.addView(adView, this.createAdViewLayoutParams());

            this.setContentView(relativeLayout, relativeLayoutLayoutParams);
    }

    private LayoutParams createAdViewLayoutParams() {
            final LayoutParams adViewLayoutParams = new LayoutParams(400, 50);
            return adViewLayoutParams;
    }
Add this to your AndEngine activity and you are set to go.
To make a view visible or invisible, call .setVisibility(View.VISIBLE) or .setVisibility(View.INVISIBLE).
PS: You need to do the above in the UI thread.



Android app on Google Play

Finally I published a game under my long-unused Google Market account. It is a fun puzzle game: you need to get the blue coins to the goal without hitting the vortexes or the red coins. You can kill the red coins by pushing them into the vortexes. Try not to use brute force; space out the coins using the bricks and things get a lot simpler.

I did it using AndEngine. It's an undocumented engine, but the forums more than make up for that: they're very active and have some helpful people too. The tutorials you find through a Google search are mostly outdated and use older versions of AndEngine that are no longer in use. The best way to learn it is by looking at the examples and the forum; the flow is pretty easy to understand. After many hours of debugging concurrency issues and forum reading, it's done. Designing the levels took even more time than programming the game. I created a level designer based on Python Tkinter, which is also included. Create a level, save it as the next consecutive number under assets/level/, and recompile and run.

Here are some screenshots.






Here is the link to the code :

You need tkinter for python to run the level creator.