Cliqz's major open source contribution: Keyvi, the key value index

Keyvi, the key value index developed by Cliqz is open source.

Hendrik

Cliqz combines the power of data, search and browser technologies to create a new web user experience based on “searching directly in the browser”. We are passionate users of open source technologies. We use a large variety of open source software such as Hadoop, Redis, Nginx, and Docker. Today, we are excited to announce the open source release of keyvi, our first big contribution back to the community.

Keyvi – the short form for “Key value index” – defines a special subtype of the popular key value store (KVS) technologies. As you can imagine from the name, keyvi is an immutable key value store, therefore an index not a store. That sounds like a step back, but it is the foundation of keyvi’s strengths: high compression ratio and extreme scalability. So if you need online read/writes keyvi is not for you, however, if your use case is mostly reads and infrequent writes you might be interested in checking keyvi out.

Keyvi takes a different approach than existing NoSQL engines. Keyvi is based on finite state transducer (FST) technology, building and persisting FST’s compact and fast. You might know FST to be used for autosuggest e.g. in ElasticSearch/Lucene. We took a step further: Keyvi is a full-blown FST based key value index.

Let’s have a look at the 4 main areas: compression, scalability, use cases, and additional FST use cases.

Use cases

Keyvi can store arbitrary key value store data. The simplest are key-only. At Cliqz we use this type for example for filtering domains or URLs. Then we have the integer value type, which is nice for any kind of weighting. Furthermore, we use this type at Cliqz for autosuggest features, see below. The most flexible value type is JSON values. At Cliqz we use that type in various ways dealing with our data, while the size of the JSON value range from a couple of bytes to a MB. Note that JSON values are internally compressed.

Example for storing indexing some data from the Yelp academic dataset:

For this example, we extracted the business_id(key) and attributes (value) and stored the data in Redis and keyvi. Additionally in ‘Redis compressed’ we use a very similar method to compress the values. Note: This very small example is just for the purpose of showing compression ratio. You can build GB indexes the same way, while compression usually gets better with more data.

High compression

Keyvi’s compression is the result of prefix and suffix compression: Like a trie structure common prefixes are shared. If the data uses URL’s, labels, or are any form of natural language as key, keyvi will be able to compress them very well. In addition FST’s also compress suffixes, including the value. So sparse values lead to a high compression as well. But even if you do not have either of the two, keyvi’s internal pointer structure is very compact, because of its efficient use of locality.

Scalability

Keyvi’s indexes are binary files which are directly read from or to disk, using shared memory (mmap). That means indexes can be accessed in parallel from different threads and/or processes. That allows both a highly scalable architecture utilizing all CPU cores on a machine as well as resilience to failures. You can use keyvi as an embedded database or a micro-service.

Lookup is based on a very simple (sparse) array. In a nutshell, a lookup is a sequence of lookups in the array:

A lookup operation has O(n) complexity while n is the number of characters in the key.

Additional FST use cases

As FSTs use character-by-character transitions, you do not need to have the full key. That makes it a natural choice for features like autosuggest, which return you the best matches given a prefix. With keyvi you can build autosuggest engines with gigabytes of data. In order to find best matches highly efficient, the completion dictionary type uses inner weights. These inner weights are used for routing the completion engine to the best paths. Just for the record, the current autosuggest for German at Cliqz uses more than 100 GB.

Try it out

We warmly invite you to take a look and try it out. We are using keyvi as high performance low-latency engine within Cliqz. It is a young project, still in its ramp up phase, stable but it is not a dump of some old software stack. We fully empathize one of the main ideas of open source:“release early, release often”.

With the 1st release we open source the main engine and a python extension. You can find our repository at http://github.com/cliqz/keyvi.

Cliqz's first major open source contribution: Keyvi, the key value index