Curb the Memory Usage of Faceted Indexes February 16, 2017

View all articles from Gyroscope Development Blog

Every since the release of Gyroscope 9.4, a multi-core version of the faceted navigation template has been included in the code generator. It has been improved incrementally over the releases.

Faceted navigation is an effective mode of user interaction that summarizes a large collection of records by their common attributes. For example, 10,000 automobile models can be grouped into makes, years, models, colors, etc. These attributes are called Facets, or Dimensions. Within each dimension, the faceted search interface offers a breakdown and a count in each category, also known as a refinement count. Upon selecting a dimension option, say, "Color: Red", the search interface updates all the other dimensions to show the breakdown of only red cars. For example, all the "makes" of red cars and all the "models" that have red cars. As the user slice and dice the facets, he/she is interacting with not just the search results, but also the distribution, or Shape of the data set as a whole.

Database queries for grouping and counting dimension values are computationally expensive. On a large data collection, counting on a single dimension takes much longer than that on a smaller data set. One of the main challenges of implementing faceted search with PHP and MySQL is "high dimensionality". Heck, even a reasonable number of facets could easily stress the system.

Consider an e-commerce site that presents shoes in 4 dimensions: Color, Size, Price and Brand. The conventional approach is to spend the time to count the variety of colors, then sizes, then prices, then brands. Bare in mind that PHP does not take advantage of multiple CPU cores. However, when multiple queries are sent to MySQL, concurrent queries are spread over various cores.

The multi-core variant of the faceted search in Gyroscope (internally codenamed "gNavi") addresses performance in two ways: 1. using parallel database queries (MySQLI_ASYNC) in combination of channel splitting and tagging; 2. caching the refinement counts of each dimension per filter permutation in a single object for each cache-invalidation.

Now let's look at the above two points in detail. In theory PHP provides a method to execute asynchronous MySQL queries. The script does not wait for the database response. It enters a wait-loop until all the in-flight queries are accounted for. In practice this can get messy, as the queries may come back in a different order from how they were sent. gNavi prepares a separate database connection for each query (as CPU cores are spread by connections). It then tag each connection with an SQL alias. For example, the first 3 queries are sent out as:

select 1 as nav_linkid, count(carid) as c, make from...
 select 2 as nav_linkid, count(carid) as c, model from...
 select 3 as nav_linkid, count(carid) as c, color from...

As the queries are made, a connection registry is created to anticipate their responses:

Link ID		Arrival
1		2nd
2		1st
3		3rd

After all the results come back, gNavi harvests the responses in the original sending order.

The second optimization is caching. The combination of selected dimension filters forms a "query signature". Answers to identical, repeated questions are stored as a sub-entry of one giant cache object. This is so that the navigation cache can be cleared, or "invalidated" at once if needed to.

At first glance, the above design seems reasonable. Now consider the following queries:

select 1 as nav_linkid, count(carid) as c, make from...
 select 2 as nav_linkid, count(carid) as c, make from...
 select 3 as nav_linkid, count(carid) as c, make from...

The queries are all identical; but they are tagged with different connection IDs based on execution state. In a 6-dimension interface, each query cache is duplicated in 6 possible positions. This is compounded with the permutation of other queries and other filters. Channel splitting/tagging has made the cache size exponentially larger than necessary.

Starting Gyroscope 10.6, the code generator strips the connection IDs. In addition, a configurable "cachemax" quota is set to limit the total number of elements of the navigation object. The default value is 1000. This is to ensure that the memory usage has a ceiling.

But that's not all. Things can still go wrong if the gNavi is incorrectly used.

Avoid Time-stamping

When a blog uses faceted search, it's common to compare the publishing date with the current time. Enforcing the date is easily implemented in the sqlfilters function:

$now=time();
 $filters=" and published and publishdate<=$now"

The above code makes the queries non-cache-able, as each second the cache is invalidated. Instead we use the beginning of the day to lock the query for a day. In Gyroscope there's a convenience function for that, defined in forminput.php:

$now=date2stamp(date('Y-n-j'));

Avoid Search Terms

If keyword search is enabled, the cache storage can be easily exhausted as there can be infinite number of keyword choices. In Gyroscope 10.6, the cachemax value is dynamically set to 0, effectively disabling the cache in the presence of a keyword search. One may be concerned that the search interface may not be able to handle high traffic - realistically this issue is not as bad as it seems, especially when the search term is piped into a full text search engine such as Sphinx. The returned result can be capped to a much smaller data set - something MySQL can easily handle without caching. This leaves more space to cache the finite number of existing choices.

Geo-Partitioning

The faceted navigation can be used together with reverse Geo-IP lookup to serve region-specific content. Like the opposite of the channel splitting issue where the same query is cached multiple times, Geo-specific queries must be treated differently even if they carry the same set of dimension filters. Conveniently, gNavi has a "swap prefix" mechanism that can be repurposed to store per-region results. Visitors from Ontario, Canada have their dedicated "cachemax" quota; visitors from Georgia, USA has their own cache storage. The question then becomes: how do we invalidate all the regional caches?

Again, 10.6 has the following code baked in to create a swap registry:

$swaps=cache_get('gswaps');
 if (!$swaps){
     $swaps=array($swapprefix=>1);
     cache_set('gswaps',$swaps,$gnavcacheexp);
 } else {
     $swaps[$swapprefix]=1;
     cache_set('gswaps',$swaps,$gnavcacheexp);
 }

The entire navigation cache can then be cleared with the new memcache function:

cache_clearnav('gswaps','gnavobj');

To sum up, here's a checklist to protect your faceted index memory consumption: