JanusGraph Deep Dive (Part 3): Optimize Edge Queries
This is the 3rd post of the JanusGraph Deep Dive series. If you haven’t already, you might be interested in the other two posts: Data Layout in JanusGraph and Demystify Indexing. In this blog post, we will discuss how to speed up your edge queries without or with vertex-centric indexes.
How is an Edge Stored?
This is already documented in the official doc, but not clear and detailed enough. Understanding the edge layout could help you decide the best way to define your edge label, whether you should and how you could create vertex-centric indexes.
Let’s start by creating an edge label and an edge from an empty graph.
gremlin> graph = JanusGraphFactory.open("inmemory")
gremlin> mgmt = graph.openManagement()
gremlin> g = graph.traversal()
gremlin> v1 = g.addV("person").next()
gremlin> v2 = g.addV("person").next()
gremlin> e = g.V(v1).addE("follow").to(v2).next()
gremlin> mgmt = graph.openManagement()
gremlin> e = g.E().next()
The above code snippet shows the procedure to create a new in-memory graph, create an edge label named “follow”, and create an edge with a “follow” label. In the end, we retrieve the id of the label “follow”:
1045, and the id of the edge:
In JanusGraph, an edge id consists of four parts: relation id, out-vertex id, edge label id, and in-vertex id. They are encoded using alphanumerics and concatenated into a single id. Let’s decode them:
gremlin> import org.janusgraph.util.encoding.LongEncoding;
==> output omitted
From the above, we can see that “172” is 1550 (relation id) encoded, “38w” is 4208 (out-vertex id) encoded, “t1” is 1045 (edge label id) encoded, and “37s” is 4168 (in-vertex id) encoded. Relation id is a unique identifier auto-generated by JanusGraph. Sometimes it is also referred to as edge id. To avoid confusion, we will call the complete id (e.g.
172–38w-t1–37s) as “full id”, and call the relation id (e.g.
1550) as “edge id”. With that in mind, let’s look at the edge layout in JanusGraph.
If you have read the first post of the series, Data Layout in JanusGraph, then you must know that columns are sorted within a partition (a vertex). In some databases, this column is referred to as the “clustering key” or “sort key”. Anyways, we shall keep in mind that for a particular vertex, all its edges are sorted by “column”.
We can see a column consists of four parts: 1) label id + direction, 2) sort key, 3) adjacent vertex id, and 4) edge id (relation id). Note that the ordering is fixed and can be crucial to query performance. Label id is the id of the edge label. Direction denotes whether the edge goes from current vertex to neighbor vertex, or from neighbor vertex to current vertex. The sort key is an optional field that could speed up certain queries — we will talk about it later. Adjacent vertex id is the id of the neighbor vertex. Finally, edge id is the relation id generated by JanusGraph that is unique across edges. As you can see, JanusGraph does not directly store edge’s full id. This is for performance reasons which we will talk about next.
How Edge Query works
For convenience, from now on, let’s assume we are only looking up edges connected to a single vertex, v1. Recall that in the first post, we explain that property falls in the range [0x40, 0x60) while edge falls in the range [0x60, 0x80) in binary (JanusGraph stores everything in binary). That is, if you don’t give any information about the four parts we just mentioned, JanusGraph will execute one query against the storage backend with the following condition:
key = v1.id() AND 0x60 ≤ column < 0x80
Usually, the more information you provide, the more JanusGraph could narrow down its searching scope, the less in-memory filtering is needed, and the more efficient your query would be.
Pushing down Predicates to Storage
The process of “narrowing down search scope” is also known as “predicate pushdown”. Based on the information extracted from the query, JanusGraph tries pushing down predicates as many as possible to the storage backend. For example, instead of searching column in the range of [0x40, 0x60), if JanusGraph knows the result falls in the range of [0x40, 0x40F), then it can pass this much tighter bound to the storage backend. This is very efficient because it means less data needs to be looked up in the storage backend and also less communication cost.
Let’s take a look at the edge layout again. For now, let’s ignore the sort key — we will cover that later. Let’s take a look at a few examples and see how the predicates could be pushed down for better performance. Note that you can inspect how JanusGraph translates your Gremlin query into a backend range query using the profile step.
// query Q1 translates to: [0x70A0, 0x70A1)
// query Q2 translates to: [0x70A0802000, 0x70A0802001)
// query Q3 translates to: [0x70A080200080081A, 0x70A080200080081B)
The first query Q1 looks up outgoing edge(s) with label “follows” from vertex v1. JanusGraph constructs the range [0x70A0, 0x70A1) out of the label id (“follows”) and direction (outgoing).
The second query Q2 narrows the scope by adding another condition: the other endpoint vertex must be v2. Given the label id + direction, and adjacent vertex id (note that here we don’t have sort key) info, JanusGraph constructs the range [0x70A0802000, 0x70A0802001) which is much tighter than the previous range. Therefore, even if there are many outgoing “follows” edges from v1, we could load only those connecting to v2 from the storage backend.
The third query Q3 may look quite different from Q1 and Q2, but it in fact is just Q2 with an additional edge relation id constraint. Recall that a complete id of an edge consists of four parts? They are out-vertex id, in-vertex id, edge label id and relation id. That’s basically everything stored in a “column”, except the edge direction. Note that JanusGraph stores each unique edge twice — one for each end vertex of the edge. Given the complete edge id, JanusGraph will load the edge stored with the out-vertex. Thus, the edge direction is always outgoing. Since JanusGraph has knowledge of everything in the “column”, it can recover that information and load the exact edge from the storage backend.
When Predicate Pushdown fails
Predicate pushdown sounds great, right? However, it does not always work well. Recall that the four parts (label id + direction, sort key, adjacent vertex id, and edge id) are in a certain order! Therefore, a predicate is only useful when any other predicate before it is known. For example, only knowing the adjacent vertex id but not the label id is not helpful — JanusGraph would not be able to push down the predicates to the storage backend in this case. Rather, JanusGraph would simply load all edges and do in-memory filtering to find out edges connecting to the target vertex. To better illustrate that, let’s see the following example:
// recap: query Q1 translates to: [0x70A0, 0x70A1)
// recap: query Q2 translates to: [0x70A0802000, 0x70A0802001)
// query Q4 translates to: [0x60, 0x80)
Does the result of query Q4 surprise you? It seems very similar to Q1 and Q2, and maybe even better than Q1 since a neighbor vertex id sounds much more valuable than a label id. Yet Q4 translates into [0x60, 0x80), a very inefficient query that loads all edges connected to vertex v1! After loading all edges from the storage backend, JanusGraph will then do in-memory filtering to retain only edges that are connected to v2. This could damage your performance especially if v1 is a super node with hundreds of thousands of edges.
So what’s wrong with our query Q4? Recall that “column” always starts with the label id. Without knowing the edge label, knowing the neighbor vertex id, unfortunately, does not help and JanusGraph has to load all edges. In this case, the predicate pushdown fails because no predicate is pushed down to the storage backend.
Two small Reads vs One big Read (Advanced)
Now that you have an understanding of how JanusGraph constructs query constraints into a range query and pushes down the predicates, let’s look at another interesting example. This is an advanced topic and please feel free to skip this section.
// query Q2 translates to: [0x70A0802000, 0x70A0802001)
// query Q5 translates to: [0x70A1802000, 0x70A1802001)
We can see query Q2 and query Q5 are only different in the direction. The resulting backend queries, therefore, have differences in only one byte that represents the direction.
// Both queries below translate to:
// [0x70A0802000,0x70A0802001) AND [0x70A1802000,0x70A1802001)
// Query Q6
// Query Q7
Query Q6 uses a union step to combine two queries together, and the resulting backend queries are also a union of the previous two backend queries. Query Q7 is a simplified version of Q6 and results in the same union of backend queries. Two backend queries generally mean two independent queries against the storage backend. This sounds reasonable, doesn’t it? But this approach has a drawback: it needs to fire two queries to the storage backend. Sometimes it is better to fire one larger query rather than two smaller queries. In this case, you could do the following:
// Query Q8 translates to: [0x70A0,0x70A2)
In the above example, we disable AdjacentVertexFilterOptimizerStrategy so that JanusGraph won’t try “folding” the “where” condition into the query. The backend query is just the same as that of
g.V(v1).bothE("follows"). Of course, whether doing so is beneficial depends on your data distribution and your storage backend settings.
Sort Key and Vertex-Centric Indexes
Thanks for your patience and we finally come to this part. You must remember that there is an important part in the edge layout that we haven’t discussed: the sort key. So, what is the sort key? A sort key is a localized structure that you define to accelerate equivalence and range queries. Talk is cheap, let me show you the code:
// open a schema management session
mgmt = graph.openManagement()
// create a property key time
time = mgmt.makePropertyKey("time").dataType(Integer.class).make()
// create an edge label 'connects' with 'time' as sort key
// commit schema change
// add a new edge "connects" between v1 and v2
e2 = g.V(v1).addE("connects").to(v2).next()
Note how we define the sort key for the “follow” edge label. Now when JanusGraph stores a “follow” edge, it will store its “time” property in the “sort key” part of the “column”.
As a consequence, now queries targeting “follow” edges can be accelerated by the “time” property as shown below:
// query Q9 translates to: [0x70E0,0x70E1)
// query Q10 translates to: [0x70E000E299752E,0x70E000E299752F)
Usage of the sort key has an important caveat: it makes adjacent vertex query and edge id query very inefficient. See the below example:
// query Q11 translates to: [0x70E0,0x70E1)
// query Q12 translates to: [0x70E0,0x70E1)
Apparently, predicates are not pushed down in Q11 and Q12. Recall that Q2 is almost identical to Q11:
g.V(v1).outE(“follows”).where(__.inV().is(v2)).next()The adjacent vertex predicate is pushed down in Q2 but not in Q11. This means Q11 translates to an expensive backend call, and JanusGraph needs to filter results in memory. Same thing happens to query Q12. Recall that in the section When Predicate Pushdown fails, we mentioned that
The four parts (label id + direction, sort key, adjacent vertex id, and edge id) are in a certain order. Ordering is fixed and can be crucial to query performance.
At that time, we didn’t define
sort key and thus sort key field is ignored by JanusGraph. Now that we have it defined when creating our edge label, it must not be ignored. As you can see from the above examples, defining
sort key helps queries involving that
sort key as constraint, but harms queries with neighbor vertex id or edge id as a constraint. In short, if you have use cases where you want to find edge(s) between two particular vertices, or if you know the edge id, you should not define a sort key for this edge label.
A vertex-centric index is similar to a materialized view in Cassandra, or a local secondary index in Amazon DynamoDB. Essentially, when you create a vertex-centric index for an edge label, then edges with that label will be stored in an alternative format apart from their original format. See official documentation for an introduction to its usage.
You may notice that the usage of vertex-centric index sounds exactly the same as a sort key. Bingo! In fact, a vertex-centric index is implemented leveraging the sort key feature. Let’s say you have an edge label “follows” (without defining a sort key) and a corresponding vertex-centric index on property “time”. Then an edge will be stored twice in two formats:
// ordinary format
vertex id | label + direction | neighbor id | edge id | value
// vertex-centric index format
vertex id | label + direction | time | neighbor id | edge id | value
Now you can effectively push down all predicates we talked about earlier. JanusGraph will figure out whether it should load the ordinary edge or the vertex-centric index.
Despite its benefits, vertex-centric index comes with additional cost. If you have one vertex-centric index defined for each edge label, then your total edge storage overhead will double. Of course, the write performance is also impacted, which generally holds for all index types.
This is so far the longest post in the JanusGraph Deep Dive series. We covered the internals of edge storage and how JanusGraph pushes down predicates to optimize your queries. Hopefully, understanding internals can help you make your data modeling and queries more optimal. Remember, a rule of thumb to tune your performance is to check the profiling result to inspect the underlying backend query and also the latency of your query.