JanusGraph Deep Dive: Data layout in JanusGraph
This article is the first in the JanusGraph Deep Dive series. It assumes basic knowledge of JanusGraph and Cassandra, and that you have read the official JanusGraph Data Model documentation, although a thorough understanding is not required.
This article is organized as follows: we start from an empty JanusGraph instance and Cassandra database, then add some simple vertices, edges, and properties step by step. After each step, we inspect the raw data stored in Cassandra and explain it, which helps us understand how the data is organized and stored.
Let’s start a new JanusGraph instance by connecting to a Cassandra database.
graph = JanusGraphFactory.open("conf/janusgraph-cql.properties")
cqlsh> select * from janusgraph.edgestore;
key | column1 | value
-----+---------+-------
Background knowledge: JanusGraph stores all vertices (including their properties), edges (including their properties), and vertex-centric indexes (a.k.a. VCI) in the edgestore table.
According to https://docs.janusgraph.org/master/advanced-topics/data-model/, JanusGraph stores data using the Bigtable data model. Each row is uniquely identified by a key (the key column here) and comprises an arbitrary number of cells. Each cell consists of a column (column1) and a value (value).
In the JanusGraph-Cassandra backend, key is the partition key and column1 is the clustering key; together they form the primary key. See https://blog.devgenius.io/cassandra-primary-vs-partitioning-vs-clustering-keys-3b3fa0e317f4 if you are unfamiliar with these Cassandra terms (partition/clustering/primary key).
A primary key uniquely identifies one row in Cassandra. Given a partition key (key), Cassandra quickly locates the partition by hashing the key (consistent hashing). A partition is the group of rows that share the same partition key, and its rows are sorted by column1, so once the partition is located, Cassandra can use binary search to find the particular row you are looking for.
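The two-step lookup described above can be sketched in a few lines of Python. This is only a toy model, not how Cassandra is implemented: a dict lookup stands in for hashing the partition key, and the keys, columns, and values are hypothetical placeholders rather than real JanusGraph data.

```python
import bisect

# Toy model: partition key -> list of (column1, value) cells, sorted by column1.
# All byte strings below are made-up placeholders for illustration.
partitions = {
    b"k1": [(b"\x02", b"exists"), (b"\x50", b"prop"), (b"\x70", b"edge")],
    b"k2": [(b"\x02", b"exists")],
}

def get_cell(partitions, key, column1):
    """Return the value stored at (key, column1), or None if absent."""
    cells = partitions.get(key)          # step 1: locate the partition by its key
    if cells is None:
        return None
    columns = [column for column, _ in cells]
    i = bisect.bisect_left(columns, column1)  # step 2: binary search on sorted column1
    if i < len(columns) and columns[i] == column1:
        return cells[i][1]
    return None

print(get_cell(partitions, b"k1", b"\x50"))
```

Python compares bytes objects lexicographically by unsigned byte value, which is why a plain sorted list is enough to mimic the clustering order in this sketch.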
But how are these three columns (key, column1, value) used in JanusGraph?
v = graph.addVertex()
cqlsh> select * from janusgraph.edgestore;
 key                | column1 | value
--------------------+---------+------------
 0xe000000000000080 | 0x02    | 0x0001049c
(1 rows)
We added a very simple vertex: default label, no properties. What is stored? key stores the ID of this vertex. The column1 value 0x02 stands for existence; JanusGraph uses this row to determine whether the vertex exists.
v2 = graph.addVertex()
cqlsh> select * from janusgraph.edgestore;
 key                | column1 | value
--------------------+---------+------------
 0xe000000000000080 | 0x02    | 0x0001049c
 0x1000000000000080 | 0x02    | 0x00010482
(2 rows)
Now we have two vertices. Next, we add an edge labeled “connect” between them and query the table again.
cqlsh> select * from janusgraph.edgestore;
 key                | column1            | value
--------------------+--------------------+------------------------------
 0xe000000000000080 | 0x02               | 0x0001049c
 0xe000000000000080 | 0x70a080201080081c | 0x
 0x0000000000000415 | 0x02               | 0x00010880
 0x0000000000000415 | 0x10c0             | 0xa072741e636f6e6e6563f40480
 0x0000000000000415 | 0x10c2801800       | 0x8f00018e008080
 0x0000000000000415 | 0x10c2801c00       | 0x9981018e008180
 0x0000000000000415 | 0x10c2802000       | 0xad80018e008280
 0x0000000000000415 | 0x10c2802400       | 0x9981018e008380
 0x0000000000000415 | 0x10c2802800       | 0xae80018e008480
 0x0000000000000415 | 0x10c2802c00       | 0xb082018e008680
 0x0000000000000415 | 0x10c2803000       | 0xb382018e008780
 0x0000000000000415 | 0x10c4             | 0x00800c80
 0x0000000000000415 | 0x10c8             | 0x008005c5f775054d481480
 0x1000000000000080 | 0x02               | 0x00010482
 0x1000000000000080 | 0x70a180216080081c | 0x
(15 rows)
Ignore the rows with key = 0x0000000000000415; they record the schema of the edge label “connect”. There are two rows for the edge itself:
 0xe000000000000080 | 0x70a080201080081c | 0x
 0x1000000000000080 | 0x70a180216080081c | 0x
By default, an edge is stored twice, once under each endpoint, so that both endpoints are aware of it. column1 encodes the edge, while value holds its properties (empty here, because this edge has none).
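To make the "stored twice" idea concrete, here is a minimal Python sketch of a toy store that writes one edge into both endpoint partitions. The helper names (put, add_edge, columns_of) are hypothetical and not part of JanusGraph; the byte values are copied from the two rows shown above, and the sketch makes no claim about which row is the outgoing side.

```python
# Toy stand-in for edgestore: partition key -> {column1: value}.
store = {}

def put(store, key, column1, value=b""):
    """Write a single cell into the partition identified by key."""
    store.setdefault(key, {})[column1] = value

def add_edge(store, endpoint_a, column_a, endpoint_b, column_b, properties=b""):
    """Record one edge as two cells, one in each endpoint's partition."""
    put(store, endpoint_a, column_a, properties)
    put(store, endpoint_b, column_b, properties)

def columns_of(store, key):
    """Return the sorted column1 values found in one partition."""
    return sorted(store.get(key, {}))

a = bytes.fromhex("e000000000000080")
b = bytes.fromhex("1000000000000080")
add_edge(store, a, bytes.fromhex("70a080201080081c"),
         b, bytes.fromhex("70a180216080081c"))

# Each endpoint can see the edge by reading only its own partition.
print(columns_of(store, a))
print(columns_of(store, b))
```

The cost of this layout is a second write per edge; the benefit is that looking up the edges of either vertex touches only that vertex's partition.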
Now let’s add a property to the vertex!
 key                | column1            | value
--------------------+--------------------+------------------------------
 0xe000000000000080 | 0x02               | 0x0001049c
 0xe000000000000080 | 0x50c0             | 0xa0626fe20c9c
 0xe000000000000080 | 0x70a080201080081c | 0x
 0x0000000000000415 rows omitted
 0x0000000000000805 rows omitted
 0x1000000000000080 | 0x02               | 0x00010482
 0x1000000000000080 | 0x70a180216080081c | 0x
(27 rows)
Similarly, there are several rows with key = 0x0000000000000805; they record the schema of the property “name”. Apart from those, there is one new row:
0xe000000000000080 | 0x50c0 | 0xa0626fe20c9c
which records the property we just added. Note that the column1 of a property sorts before that of an edge. In fact, the first byte of a property's column1 falls in the range [0x40, 0x60), while that of an edge falls in the range [0x60, 0x80). See the IDHandler::getBounds method in the janusgraph-core package.
Recall that Cassandra sorts rows by clustering key (column1 in this case). Therefore, when a query needs only properties or only edges, JanusGraph can fetch them quickly by reading one contiguous range of column1 values within the vertex's partition.
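The ranges above can be tried out on the column1 values seen so far. The following Python sketch is an illustration only: classify and slice_columns are hypothetical helpers, not JanusGraph functions, and the sketch assumes the ranges apply to the first byte of column1, which matches the rows shown in this article.

```python
import bisect

def classify(column1: bytes) -> str:
    """Classify a column1 value by its first byte, using the ranges quoted above."""
    first = column1[0]
    if 0x40 <= first < 0x60:
        return "property"
    if 0x60 <= first < 0x80:
        return "edge"
    return "other"  # e.g. the 0x02 existence row

def slice_columns(sorted_columns, lower, upper):
    """Return the columns in [lower, upper) with two binary searches."""
    lo = bisect.bisect_left(sorted_columns, lower)
    hi = bisect.bisect_left(sorted_columns, upper)
    return sorted_columns[lo:hi]

# column1 values of the partition 0xe000000000000080, in clustering order.
columns = sorted(bytes.fromhex(h) for h in ["02", "50c0", "70a080201080081c"])

properties = slice_columns(columns, b"\x40", b"\x60")
edges = slice_columns(columns, b"\x60", b"\x80")
print([c.hex() for c in properties], [c.hex() for c in edges])
```

Because the columns are already sorted, each slice is a contiguous run, so a properties-only or edges-only read never has to scan the other kind of row.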