So what we have here is two sets of data and we needed to do a subtraction of one set from another, a the two sets in question can be arbitrary and not necessarily be same in size.
So we have two sets A, B of IDs. A is the search result set and B is the set of the ID of records already have been viewed.
So basically we want A - B.
Those of you who suggest we use an SQL with condition in WHERE clause
say like WHERE record_id NOT IN (b0, b1, b2,b3, .... bn) can skip reading this post.
Now why do I say the problem is singular,
- The data filtering needs to be done for a web search page so it must be fast.
- I already said both the sets can be arbitrary in size. A < B , A = B, A > B.
Now how to go about this problem. What I initially thought of was if we can sort the sets A, B in increasing order. After we have sorted the lists we can find out the most common overlapping part among the both sets.
Say A = { 8374, 4430, 2393, 1002 , 763472, 9878, 10232}
B = {7656, 4532, 2674}
now min(A) = 1002, max (A) = 763472
Min (B) = 2009, Max(B) = 7656
Now we know the end points of the already IDs that are already viewed. Thus in the Set B we can ignore all the values that are outside these boundaries. So all the values in set A lower than min(B) and greater than max(B) can be simply ignored.
After this we are again left with a possibly smaller set of the same initial problem.
The answer hopefully shall be when we discover bit map indexes. The basic theory can be obtained from here.
http://www.akadia.com/services/ora_bitmapped_index.html
So the idea here is using of memory based bit map indexes. Now what we intend to here is create a long stream of bitmaps. This stream of bitmap will be as long as the maximum record
possible. Here I would like to point out that this number is much larger than either Set A or B.
So technically set A, B will be a subset of the super set which contains all the records. Say its length is N.
So we have say a bitmap index N bits long. When this bitmap is loaded in memory it will take approximately 2.5 MBs of space for N = 20 million.
Now we will set all the bits in the bitmap stream to 1 or switch them on.
After this we will traverse the set B and XOR each Bith bit in the bitmap index.
Say for the above example we will XOR the 2674, 4532 and 7656th bits on the bitmap. Now when we display the search results we have this set of values which are already viewed turned as off. This is how far we have gone. I will continue to post once I find a code sample and implement something.
MIT HACKMEM Count
int bitcount(unsigned int n)
{
/* works for 32-bit numbers only */
/* fix last line for 64-bit numbers replace 63 by 1023*/
register unsigned int tmp;
tmp = n - ((n >> 1) & 033333333333)
- ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
PS: Thanks to vikasB and anshumG, yawarA for their help.

1 comment:
nice blog Amresh!!
waiting for more of such easy to understand and good to use articles..
Post a Comment