Solr Group Truncate vs Collapse

Few months ago, I was working with an eCommerce company. Its products had different sellers with different prices of each seller. We called product data with seller data and price together as an offer. In document structure, we maintained document for offer and duplicated products data into offer document. For example

There are 2 offers for product p1
o11 having selling price Rs. 10
o12 having selling price Rs. 9

There are 2 offers for product p2
o21 having selling price Rs. 8
o22 having selling price Rs. 12

Below are 4 offer documents.

{ offerId: o11, productId: p1, price: 10 }
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }
{ offerId: o22, productId: p2, price: 12 }

Solr did not had facility to Store offers as a child documents of a product document that time. We had to duplicate the product data into offer data. To query product we grouped the offers with the productId, sorted by price asc and truncated into one document; signifying lowest selling price offer as product. We faced several performance issues with these group queries. Check apache wiki to know in detail about field collapsing using group.


For example, displaying products with lowest price, we applied the following query and obtained following offers with lowest prices of products.

group=true&group.sort=price asc&group.field=productId&group.truncate=true&sort=popularity asc

{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }

This type of queries were very slow because Solr performs operations in following steps.
1. filter
2. sort
3. group
4. group sort
5. group truncate

Not considering O(n) time operations in filtering. Sorting takes O(n log n) time,
Then group takes O(n) and group sort takes O(n log m) time, where m is average group size or average number of offers per product
Overall O(n log n) . When number of documents is high, the grouping operation takes hell lot of time.

We compromised with the low performance of grouping using caching of results. But there was always a small problem and a known bug.

If user applied a desc sort on price, we needed to show him lowest offer price with product but the lowest price sorted in desc order. Consider the same 4 offer documents, what will be the output?

group=true&group.sort=price asc&group.field=productId&group.truncate=true&sort=price desc

Expected
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }

Actual
{ offerId: o21, productId: p2, price: 8 }
{ offerId: o12, productId: p1, price: 9 }

Now let's analyse why did it happen step by step?

0. Document
{ offerId: o11, productId: p1, price: 10 }
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }
{ offerId: o22, productId: p2, price: 12 }

1. filter -- ignore O(n)

2. sort -- O(n log n)
{ offerId: o22, productId: p2, price: 12 }
{ offerId: o11, productId: p1, price: 10 }
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }

3. group -- O(n)
[{ offerId: o22, productId: p2, price: 12 },{ offerId: o21, productId: p2, price: 8 }]
[{ offerId: o11, productId: p1, price: 10 }.{ offerId: o12, productId: p1, price: 9 }]

4. group sort -- O(n log m)
[{ offerId: o21, productId: p2, price: 8 },{ offerId: o22, productId: p2, price: 12 }]
[{ offerId: o12, productId: p1, price: 9 },{ offerId: o11, productId: p1, price: 10 }]

5. group truncate -- O(n/m)
{ offerId: o21, productId: p2, price: 8 }
{ offerId: o12, productId: p1, price: 9 }

So the output were not in desc order in cases where seller price differ a lot.

It took a lot of effort to optimize performance by caching results and caching various parts of results and optimizing several small codes. Because of this debugging got difficult. Inserted logs and included several logging mechanisms. But code got messy and looked dirty. Solr 4.1 got a feature block indexing in which a document can have child documents. We researched on it but it did not work for us.

At the end, the company consulted Solr consultants, even who couldn't help us themselves because there was no option in Solr for solve our problems. But they connected to Solr committers and got a patch containing a collapse parser written for this slow group response case. This parser gave us more than 3-4 times faster results. Few of our heavy queries which took around 19 seconds came down to 900 ms. You can't measure speed on numbers because this depends on the number of documents (memory and CPU) and specifically average number of documents in a group (m). We will be verifying this as well. This parser actually solved our other problem of the known bug as well. The new parser itself had several smaller bugs, which step by step got resolved. Along with the consultants, even I started discussing with the committers the collapse parser behavior in Jira and debugging Solr code as well.

Check apache wiki to know in detail about field collapsing using collapse and you can go through expanding results as well.


Let me show you how this parser works.

Group query
group=true&group.sort=price asc&group.field=productId&group.truncate=true&sort=popularity asc

Transformed Collapse query
fq={!collapse field="productId" min="price"}

collapse is equivalent to group.truncate
field is equivalent to group.field
min is equivalent to group.sort

If everything is equivalent, how can it give high performance and faster results?

Let's leave the internal implementation for now and do the logical analysis. The parser is applied as a filter query instead of group.
The time complexity for filtering is O(n)
Then for sorting, but sorting will be applied on n/m documents and not all n documents, so sorting will be of order O(n/m log n). Overall O(n/m log n) .
This is why I said the overall time complexity depends specifically on average number of document per group (m).

And now about the known bug. Now the query we pass will be

fq={!collapse field="productId" min="price"}&sort=price desc

Now let's analyse why did it happen step by step?

0. Document
{ offerId: o11, productId: p1, price: 10 }
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }
{ offerId: o22, productId: p2, price: 12 }

1. filter -- do not ignore -- O(n)
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }

2. sort -- O(n/m log n)
{ offerId: o12, productId: p1, price: 9 }
{ offerId: o21, productId: p2, price: 8 }

Now the question is why the parser does this thing in time complexity O(n), apart from sorting. If you notice a requirement we just needed the lowest price offer; and not all the offer sorted by price.
Suppose you have to write a good (obviously) algorithm to do the same thing, you will do the same in O(n). I haven’t seen the parser code, but I can guess it anyway.
Just maintain a HashTable and iterate through each document. Check if the table contain the productId as key. If it doesn’t, save the productId as key with the document as value; else get the table document, compare their prices, and save in table the document with smaller price. At the end, you will have in values the required documents with lowest prices.

You can find the code to compare these two parsers in the below link. These are not the solr code. Its just algorithms implementation and comparison of their performance.
Download
I hope this article did not bore you much. For any help regarding Solr you can contact me or my friend Kanishka.

deepakmishra117@gmail.com


kanishka.pahwa@gmail.com

Comments

Popular posts from this blog

Caterpillar Uneaten Leaves Problem – HackerEarth Interview Question

Swimming

The Undertaker - A True Legend