JanusGraph Deep Dive (Part 2): Demystify indexing

Boxuan Li
6 min readMay 28, 2022

It’s been almost a year since I wrote the first blog post of the JanusGraph deep dive series. If you haven’t read it already, Data Layout in JanusGraph helps you understand the basics of the data model. In this blog post, I will demystify indexing in JanusGraph by explaining the internal mechanism behind them. Understanding the internals could help you choose the best indexing approach and potentially boost your query performance. You don’t have to but it is recommended that you read this user guide first.

Composite Index

Composite Index Basics

Many will confuse the composite index concept with multi-column indexes (a.k.a. composite index) in MySQL. In MySQL and some other databases, a composite index is an index that covers multiple columns. In JanusGraph, however, a composite index could cover one or more columns and thus is totally different from MySQL’s concept. Please think of “composite index” as a “regular index”. We call it a “regular index” because it is just like an index you would create in a traditional database (with only equality lookup support). For example, if you create an index for the “name” vertex property:

// open a schema management session
mgmt = graph.openManagement()
// fetch the property key you want to index
name = mgmt.getPropertyKey(‘name’)
// build a composite index against this property
mgmt.buildIndex(‘byNameComposite’,Vertex.class).addKey(name).buildCompositeIndex()
// commit index changes
mgmt.commit()

then you could quickly find the vertex by its name. Note that you never explicitly state that you want to use the index. JanusGraph query optimizer automatically picks up the best index for you, although you could tune the index selection algorithm using query.index-select-strategy option.

Composite Index Internals

Composite indexes are stored in the same storage backend as your graph data (nodes and edges), although graph data are stored in the edgestore “store”, while composite indexes are stored in the graphindex “store”. Depending on your storage backend, “store” can mean different things. If you are using Cassandra as the storage backend, then your graph data will be stored in the edgestore table while composite indexes will be stored in the graphindex table in the same keyspace.

Composite indexes are very fast and efficient but limited to equality lookups for a particular, previously-defined combination of property keys.

The mechanism behind composite indexes is fairly simple. A composite index record stores the indexed property value(s) as the partition key, and the vertex or edge id as the value (Fun fact: you could also index a vertex property by its meta-property, although I haven’t seen a real-world use case yet).

Depending on your storage backend, primary data update and composite index update may or may not be atomic. In general, since your primary data and composite index sit in the same storage backend, composite indices are fast and are less likely to be stale or inconsistent with the primary data.

Composite Index Caveats

Note that composite indexes are in favor of high cardinality data (I try not to use the word “selectivity” because people often have opposite interpretations of it). For example, if you have hundreds of thousands of vertices with the same value for a property, then in general you should consider using a mixed index rather than a composite index. Since JanusGraph does not have an option to index labels, you might tend to create a property (let’s call it “type”) which essentially just replicates the label. Then you might naturally index the “type” property using a composite index. This works well when you don’t have a lot of vertices/edges with the same “type”, but when you have a million of vertices with the same “type”, you will likely encounter memory and/or performance problems. But why? Recall that composite indexes store your indexed values as partition keys. If you have one million vertices with the same “type”, then there will be one million rows with the same partition key. This is a big anti-pattern in most databases. In this case, you should consider using a mixed index.

Mixed Index

Mixed Index Basics

A mixed index is essentially an index stored in an external index backend that is different from your storage backend which stores your primary graph data. JanusGraph supports three index backends: Elasticsearch, Apache Solr and Apache Lucene. Apache Lucene is a Java library which can only be used locally where JanusGraph resides, and thus is not applicable when you have multiple JanusGraph instances. Elasticsearch and Apache Solr are distributed engines built on top of Apache Lucene, and thus are suitable for a distributed setup. Mixed index creation is very similar to composite index creation, and you can learn more in the official documentation. Of course, if you don’t need mixed index functionality, then you don’t need an external index backend at all.

Recall that in the Composite Index Basics, we mentioned that you never explicitly state that you want to use the index? This typically also holds for mixed index. There are some heuristics used by the query optimizer. For example, if you have a query that can be satisfied by either a composite index or a mixed index, then the composite index is preferred due to its better performance. The source code can be found here if you are interested.

There is one exception, though. Although gremlin queries should be sufficient most of the time, JanusGraph does allow you to construct a query against your index backend explicitly and directly via Direct Index Query. This is particularly useful when you have a query pattern that is not natively supported by JanusGraph.

Mixed Index Internals

Different from composite indexes, mixed indexes are stored in your external index backend. Therefore, the search features, performance and scalability all depend on your choice of index backend. In general, mixed indexes provide more flexibility than composite indexes and support additional condition predicates beyond equality.

Note that primary data update and mixed index update are NOT atomic, no matter what storage backend and index backend you use. Upon transaction commit, JanusGraph always updates storage backend (including graph data and composite indexes) first, and then index backend. Therefore, compared to composite indexes, mixed indices are more likely to be inconsistent with primary data.

Mixed Index Caveats

When hitting a composite index, you will likely find that a query such as the one below

g.V().has(“propKey”, “propValue”).toList()

returns results in a consistent order in the sense that if you execute the query again, you get same results in the same order. In other words, the results are ordered in some deterministic way. In fact, as long as your storage backend follows BigTable Data Model, which is the case for the officially supported storage backends, the results would be in a consistent order. Mixed indexes, on the contrary, might not return results in a consistent order. This might matter if you are using the following pattern to retrieve results:

g.V().has(“propKey”, “propValue”).range(0, 10000).toList()
g.V().has(“propKey”, “propValue”).range(10000, 20000).toList()
...
g.V().has(“propKey”, “propValue”).range(90000, 100000).toList()

Although the above “pagination” is discouraged in JanusGraph, sometimes users do need it. If the above queries hit a mixed index (you can use profile step to check if a mixed index is used), then there is no guarantee that the results are contiguous.

Another problem that sometimes occurs is inconsistency between mixed index and primary data. During transaction commit phase, if your data gets written into the storage backend successfully but fails to write into the external index backend, the commit still succeeds, causing inconsistency between mixed index and primary data. You could write your own programs such as a MapReduce/Spark job to detect and reconcile such inconsistency, or you could use the JanusGraph transaction recovery utility.

This concludes this blog post. We didn’t cover Relation Indexes (a.k.a. Vertex-Centric Indexes or in abbreviation, VCIs) because it deserves an individual post. Stay tuned for the next post, in which I’ll talk about how to accelerate your edge queries with and without VCIs.

--

--

Boxuan Li

Database core developer at Microsoft & Maintainer of JanusGraph