Configure JanusGraph 0.6.0 for Spark

Boxuan Li
Aug 31, 2021 · 5 min read

JanusGraph is a distributed graph database. It implements the Apache TinkerPop framework and thus supports OLAP queries on the Spark engine. This tutorial guides you through running JanusGraph on Spark. We will cover three modes: Spark local, Spark standalone cluster, and Spark on YARN.

OLAP is supported by TinkerPop’s SparkGraphComputer engine

This tutorial assumes you have basic knowledge of JanusGraph and Spark, and that you want to combine these two powerful tools. It does not require you to have run Spark applications before. It uses Cassandra as the storage backend; using HBase should be similar, but I have not tested it.

You can download JanusGraph 0.6.0 from the project's release page.
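Throughout this tutorial, the conf/hadoop-graph/read-cql*.properties files tell the Hadoop input format where your Cassandra data lives. Besides the Spark settings discussed in each section below, the storage-related part of such a file looks roughly like this (a minimal sketch; the hostname, port, and keyspace are assumptions for a local, single-node Cassandra holding a default JanusGraph keyspace):

# read the graph from Cassandra via the CQL input format (assumed local setup)
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
# assumed Cassandra connection details; adjust to your cluster
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph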

Spark Local

It is fairly easy to run JanusGraph in Spark local mode. It requires neither a running Spark cluster nor any additional configuration. Just run the following on the Gremlin Console (./bin/gremlin.sh):

:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open("conf/hadoop-graph/read-cql.properties")
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark local mode

It returns the number of nodes stored in JanusGraph, in this case 10002501 nodes. Of course, Spark local mode is usually for testing purposes; in production environments you would want to use a Spark standalone cluster or a Spark YARN cluster. See the next sections.
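Counting nodes is just a sanity check; any read-only OLAP traversal can be submitted the same way. For example, the following groups vertices by label (a small illustration that assumes nothing about your schema):

// count vertices per label; runs on Spark just like g.V().count()
g.V().groupCount().by(label)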

Spark Standalone Cluster

You need a running Spark standalone cluster for this mode. If you already have one, skip the next section. Otherwise, follow the steps in the next section to launch a Spark standalone cluster locally.

Start a Spark standalone cluster

Step 1. Download https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz. You may also use the Spark 3.0.3 version.

Step 2. Uncompress it, enter the sbin directory, and then run start-all.sh. This will start a Spark master instance and several workers.

➜ sbin ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/logs/spark-liboxuan-org.apache.spark.deploy.master.Master-1-liboxuans-MacBook-Pro.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/logs/spark-liboxuan-org.apache.spark.deploy.worker.Worker-1-liboxuans-MacBook-Pro.local.out

Step 3. Visit http://localhost:8080/ and verify your cluster is up. You should be able to see something like this:

Spark Standalone Cluster

Note that the Spark master is at spark://liboxuans-MacBook-Pro.local:7077. Therefore, in conf/hadoop-graph/read-cql-standalone-cluster.properties, we need to set spark.master accordingly.

Run SparkGraphComputer on Gremlin Console

Now that you have a Spark standalone cluster running, we can run SparkGraphComputer on the Gremlin Console. Before that, we need to set up some configuration.

The JanusGraph distribution already contains a default configuration in conf/hadoop-graph/read-cql-standalone-cluster.properties, but you need to amend it for your cluster. Your Spark configs should look similar to this:

spark.master=spark://liboxuans-MacBook-Pro.local:7077
spark.executor.memory=1g
spark.executor.extraClassPath=/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

Note that it is important not to miss the wildcard * in spark.executor.extraClassPath. Also, make sure this path is available to both the master and the workers. If workers are running on different machines, copy the files to the corresponding directories on the worker machines.
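For example, something like the following would copy the JanusGraph libraries to a remote worker (worker-host is a hypothetical hostname; the destination directory must match spark.executor.extraClassPath):

# hypothetical worker host; destination must match spark.executor.extraClassPath
rsync -a /Users/liboxuan/Downloads/janusgraph-0.6.0/lib/ worker-host:/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/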

Now we can run SparkGraphComputer on the Gremlin Console!

:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-cql-standalone-cluster.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark standalone cluster mode

Great! We can see it works without too much effort.

Spark on YARN Cluster

Some might prefer to run SparkGraphComputer on a YARN cluster. This is much trickier and not documented on the official JanusGraph website. Because TinkerPop and JanusGraph do not ship with YARN dependencies, there are a few dependencies you need to include manually. I only tested this on a pseudo-distributed YARN cluster running locally, but it should be roughly the same for a real distributed setting.

If you already have a YARN cluster running, skip the next section. Otherwise, you can follow the steps in the next section to launch a YARN cluster.

Launch a YARN Cluster (Hadoop 2.7)

Step 1. Download Hadoop 2.7.0 from https://hadoop.apache.org/release/2.7.0.html

Step 2. Put the HADOOP_HOME variable into your .bashrc file:

export HADOOP_HOME="/Users/liboxuan/Downloads/hadoop-2.7.0"

Step 3. Open etc/hadoop/core-site.xml and add the following

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Step 4. Open etc/hadoop/hdfs-site.xml and add the following

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Step 5. Open etc/hadoop/mapred-site.xml and add the following

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

Step 6. Open etc/hadoop/yarn-site.xml and add the following

<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>

Step 7. Run hdfs namenode -format

Step 8. Run sbin/start-all.sh

Now if you run the jps command in a terminal, you should be able to see the NameNode, SecondaryNameNode, DataNode, NodeManager, and ResourceManager processes.

You should also be able to see your YARN cluster at http://localhost:8088/cluster

Prepare the JanusGraph configuration

Now that we have a Hadoop 2.7 cluster up and running, we can prepare the JanusGraph configuration.

Step 1. Download spark-3.0.0-bin-hadoop2.7 from https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz

Step 2. cp spark-3.0.0-bin-hadoop2.7/jars/spark-yarn_2.12-3.0.0.jar /Users/liboxuan/Downloads/janusgraph-0.6.0/lib

Step 3. cp hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-server-*.jar /Users/liboxuan/Downloads/janusgraph-0.6.0/lib

Step 4. Make a spark-gremlin.zip and put it under a temporary folder (the zip file should contain everything under spark-3.0.0-bin-hadoop2.7/jars, with the guava and commons-text jars replaced by the versions from janusgraph-0.6.0/lib). The exact steps are as follows:

cd /Users/liboxuan/Downloads/janusgraph-0.6.0
mkdir tmp
cd /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/jars
cp * /Users/liboxuan/Downloads/janusgraph-0.6.0/tmp
cd /Users/liboxuan/Downloads/janusgraph-0.6.0/tmp
rm guava-14.0.1.jar commons-text-1.6.jar
cp ../lib/guava-29.0-jre.jar .
cp ../lib/commons-text-1.9.jar .
zip spark-gremlin.zip *.jar

Step 5. Create a JanusGraph configuration file. For convenience, I copied conf/hadoop-graph/read-cql-standalone-cluster.properties as conf/hadoop-graph/read-cql-yarn.properties and amended the Spark configuration part as follows:

spark.master=yarn
spark.submit.deployMode=client
spark.yarn.archive=/Users/liboxuan/Downloads/janusgraph-0.6.0/tmp/spark-gremlin.zip
spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*:/Users/liboxuan/Downloads/hadoop-2.7.0/etc/hadoop
spark.executor.extraClassPath=./__spark_libs__/*:/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*:/Users/liboxuan/Downloads/hadoop-2.7.0/etc/hadoop

Run SparkGraphComputer on Gremlin Console

Don't forget to add the Hadoop configuration directory to the classpath:

export HADOOP_CONF_DIR="${HADOOP_HOME}/etc/hadoop"
export CLASSPATH="${HADOOP_CONF_DIR}"

Now we can run SparkGraphComputer on the Gremlin Console:

:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-cql-yarn.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark Yarn cluster mode

Sources and further reading:

  1. https://docs.janusgraph.org/master/advanced-topics/hadoop/
  2. http://yaaics.blogspot.com/2017/07/configuring-janusgraph-for-spark-yarn.html
  3. https://tinkerpop.apache.org/docs/current/recipes/#olap-spark-yarn

