JanusGraph Deep Dive (Part 3): Optimize Edge Queries

How is an Edge Stored?

This is already documented in the official doc, but not clear and detailed enough. Understanding the edge layout could help you decide the best way to define your edge label, whether you should and how you could create vertex-centric indexes.

gremlin> graph = JanusGraphFactory.open("inmemory")
==>standardjanusgraph[inmemory:[127.0.0.1]]
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@fa5f81c
gremlin> mgmt.makeEdgeLabel("follow").make()
==>follow
gremlin> mgmt.commit()
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[inmemory:[127.0.0.1]], standard]
gremlin> v1 = g.addV("person").next()
==>v[4208]
gremlin> v2 = g.addV("person").next()
==>v[4168]
gremlin> e = g.V(v1).addE("follow").to(v2).next()
==>e[172-38w-t1-37s][4208-follow->4168]
gremlin> graph.tx().commit()
==>null
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@7f42e06e
gremlin> mgmt.getEdgeLabel("follow").id()
==>1045
gremlin> e = g.E().next()
==>e[172-38w-t1-37s][4208-follow->4168]
gremlin> e.id()
==>172-38w-t1-37s
gremlin> import org.janusgraph.util.encoding.LongEncoding;
==> output omitted
gremlin> LongEncoding.decode("172")
==>1550
gremlin> LongEncoding.decode("38w")
==>4208
gremlin> LongEncoding.decode("37s")
==>4168
gremlin> LongEncoding.decode("t1")
==>1045
Edge Layout

How Edge Query works

For convenience, from now on, let’s assume we are only looking up edges connected to a single vertex, v1. Recall that in the first post, we explain that property falls in the range [0x40, 0x60) while edge falls in the range [0x60, 0x80) in binary (JanusGraph stores everything in binary). That is, if you don’t give any information about the four parts we just mentioned, JanusGraph will execute one query against the storage backend with the following condition:

key = v1.id() AND 0x60 ≤ column < 0x80

Pushing down Predicates to Storage

The process of “narrowing down search scope” is also known as “predicate pushdown”. Based on the information extracted from the query, JanusGraph tries pushing down predicates as many as possible to the storage backend. For example, instead of searching column in the range of [0x40, 0x60), if JanusGraph knows the result falls in the range of [0x40, 0x40F), then it can pass this much tighter bound to the storage backend. This is very efficient because it means less data needs to be looked up in the storage backend and also less communication cost.

Edge Layout
// query Q1 translates to: [0x70A0, 0x70A1)
g.V(v1).outE("follows").next()
// query Q2 translates to: [0x70A0802000, 0x70A0802001)
g.V(v1).outE("follows").where(__.inV().is(v2)).next()
// query Q3 translates to: [0x70A080200080081A, 0x70A080200080081B)
g.E(e.id()).next()

When Predicate Pushdown fails

Predicate pushdown sounds great, right? However, it does not always work well. Recall that the four parts (label id + direction, sort key, adjacent vertex id, and edge id) are in a certain order! Therefore, a predicate is only useful when any other predicate before it is known. For example, only knowing the adjacent vertex id but not the label id is not helpful — JanusGraph would not be able to push down the predicates to the storage backend in this case. Rather, JanusGraph would simply load all edges and do in-memory filtering to find out edges connecting to the target vertex. To better illustrate that, let’s see the following example:

// recap: query Q1 translates to: [0x70A0, 0x70A1)
g.V(v1).outE("follows").next()
// recap: query Q2 translates to: [0x70A0802000, 0x70A0802001)
g.V(v1).outE("follows").where(__.inV().is(v2)).next()
// query Q4 translates to: [0x60, 0x80)
g.V(v1).outE().where(__.inV().is(v2)).next()

Two small Reads vs One big Read (Advanced)

Now that you have an understanding of how JanusGraph constructs query constraints into a range query and pushes down the predicates, let’s look at another interesting example. This is an advanced topic and please feel free to skip this section.

// query Q2 translates to: [0x70A0802000, 0x70A0802001)
g.V(v1).outE("follows").where(__.inV().is(v2)).next()
// query Q5 translates to: [0x70A1802000, 0x70A1802001)
g.V(v1).inE("follows").where(__.outV().is(v2)).next()
// Both queries below translate to:
// [0x70A0802000,0x70A0802001) AND [0x70A1802000,0x70A1802001)
// Query Q6
g.V(v1).union(__.outE("follows").where(__.inV().is(v2)), __.inE("follows").where(__.outV().is(v2))).next()
// Query Q7
g.V(v1).bothE("follows").where(__.otherV().is(v2)).next()
// Query Q8 translates to: [0x70A0,0x70A2)
g.withoutStrategies(AdjacentVertexFilterOptimizerStrategy.class).V(v1).bothE("follows").where(otherV().is(v2)).next()

Sort Key and Vertex-Centric Indexes

Sort Key

Thanks for your patience and we finally come to this part. You must remember that there is an important part in the edge layout that we haven’t discussed: the sort key. So, what is the sort key? A sort key is a localized structure that you define to accelerate equivalence and range queries. Talk is cheap, let me show you the code:

// open a schema management session
mgmt = graph.openManagement()
// create a property key time
time = mgmt.makePropertyKey("time").dataType(Integer.class).make()
// create an edge label 'connects' with 'time' as sort key
mgmt.makeEdgeLabel('connects').sortKey(time).make()
// commit schema change
mgmt.commit()
// add a new edge "connects" between v1 and v2
e2 = g.V(v1).addE("connects").to(v2).next()
Edge Layout
// query Q9 translates to: [0x70E0,0x70E1)
g.V(v1).outE("connects").next()
// query Q10 translates to: [0x70E000E299752E,0x70E000E299752F)
g.V(v1).out("connects").has("time", 1654224174).next()
// query Q11 translates to: [0x70E0,0x70E1)
g.V(v1).outE("connects").where(__.inV().is(v2))
// query Q12 translates to: [0x70E0,0x70E1)
g.E(e2.id()).next()

Vertex-centric Index

A vertex-centric index is similar to a materialized view in Cassandra, or a local secondary index in Amazon DynamoDB. Essentially, when you create a vertex-centric index for an edge label, then edges with that label will be stored in an alternative format apart from their original format. See official documentation for an introduction to its usage.

// ordinary format
vertex id | label + direction | neighbor id | edge id | value
// vertex-centric index format
vertex id | label + direction | time | neighbor id | edge id | value

Conclusion

This is so far the longest post in the JanusGraph Deep Dive series. We covered the internals of edge storage and how JanusGraph pushes down predicates to optimize your queries. Hopefully, understanding internals can help you make your data modeling and queries more optimal. Remember, a rule of thumb to tune your performance is to check the profiling result to inspect the underlying backend query and also the latency of your query.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Boxuan Li

Boxuan Li

Maintainer of JanusGraph, a popular distributed graph database