Configure JanusGraph 0.6.0 for Spark

OLAP traversals are supported through TinkerPop's SparkGraphComputer engine.

Spark Local

It is fairly easy to run JanusGraph in Spark local mode. It requires neither a running Spark instance nor any additional configuration. Just run the following in the Gremlin console (./bin/gremlin.sh):

:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark local mode

Spark Standalone Cluster

You need to have a Spark standalone cluster running. If you already have one running, skip the next section. Otherwise, follow the steps in the next section to launch a Spark standalone cluster locally.

Start a Spark standalone cluster

Step 1. Download https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz. You may also download the spark-3.0.3 version instead.
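Between downloading the archive and running start-all.sh, the tarball needs to be extracted. A minimal sketch, assuming the download landed in the current directory (the file name and resulting path are illustrative, and the extraction is guarded so the snippet is safe to run anywhere):

```shell
# Illustrative only: extract the Spark tarball if it is present in the
# current directory; the version and path are assumptions, not requirements.
SPARK_TGZ=spark-3.0.0-bin-hadoop2.7.tgz
if [ -f "$SPARK_TGZ" ]; then
  tar -xzf "$SPARK_TGZ"
fi
# The sbin directory used in the next step lives under the extracted folder:
echo "sbin path: $(pwd)/spark-3.0.0-bin-hadoop2.7/sbin"
```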

➜ sbin ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/logs/spark-liboxuan-org.apache.spark.deploy.master.Master-1-liboxuans-MacBook-Pro.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/logs/spark-liboxuan-org.apache.spark.deploy.worker.Worker-1-liboxuans-MacBook-Pro.local.out
Spark Standalone Cluster

Run SparkGraphComputer on Gremlin Console

Now that a Spark standalone cluster is running, we can run SparkGraphComputer from the Gremlin console. Before that, we need to set up a few configuration options.

spark.master=spark://liboxuans-MacBook-Pro.local:7077
spark.executor.memory=1g
spark.executor.extraClassPath=/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-cql-standalone-cluster.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark standalone cluster mode

Spark on YARN Cluster

Some might prefer to run SparkGraphComputer on a YARN cluster. This is much trickier and is not documented on the official JanusGraph website. Because TinkerPop and JanusGraph don't ship with YARN dependencies, there are a few dependencies you need to include manually. I only tested this on a pseudo-distributed YARN cluster running locally, but it should be roughly the same for a real distributed setting.

Launch YARN Cluster (Hadoop 2.7)

Step 1. Download Hadoop 2.7.0 from https://hadoop.apache.org/release/2.7.0.html

export HADOOP_HOME="/Users/liboxuan/Downloads/hadoop-2.7.0"
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
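With the config files in place, the cluster still needs to be formatted and started before it is "up and running". This is the standard Hadoop 2.x pseudo-distributed startup sequence, sketched here as run from $HADOOP_HOME and guarded so the snippet is inert when Hadoop is not actually installed at that location:

```shell
# Standard pseudo-distributed startup for Hadoop 2.x, run from $HADOOP_HOME.
# Guarded so the snippet is a no-op when Hadoop is not installed here.
if [ -x bin/hdfs ]; then
  bin/hdfs namenode -format -nonInteractive   # first run only
  sbin/start-dfs.sh                           # NameNode + DataNode
  sbin/start-yarn.sh                          # ResourceManager + NodeManager
  STARTED=yes
else
  STARTED=no
fi
echo "hadoop started: $STARTED"
```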

Prepare for JanusGraph configurations

Now that we have a Hadoop 2.7 cluster up and running, we can prepare the JanusGraph configuration.

cd /Users/liboxuan/Downloads/janusgraph-0.6.0
mkdir tmp
cd /Users/liboxuan/Downloads/spark-3.0.0-bin-hadoop2.7/jars
cp * /Users/liboxuan/Downloads/janusgraph-0.6.0/tmp
cd /Users/liboxuan/Downloads/janusgraph-0.6.0/tmp
rm guava-14.0.1.jar commons-text-1.6.jar
cp ../lib/guava-29.0-jre.jar .
cp ../lib/commons-text-1.9.jar .
zip spark-gremlin.zip *.jar
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.archive=/Users/liboxuan/Downloads/janusgraph-0.6.0/tmp/spark-gremlin.zip
spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*:/Users/liboxuan/Downloads/hadoop-2.7.0/etc/hadoop
spark.executor.extraClassPath=./__spark_libs__/*:/Users/liboxuan/Downloads/janusgraph-0.6.0/lib/*:/Users/liboxuan/Downloads/hadoop-2.7.0/etc/hadoop

Run SparkGraphComputer on Gremlin Console

Don't forget to add the Hadoop conf directory to the classpath:

export HADOOP_CONF_DIR="${HADOOP_HOME}/etc/hadoop"
export CLASSPATH="${HADOOP_CONF_DIR}"
:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-cql-yarn.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Running Spark traversal in Spark Yarn cluster mode

Boxuan Li

Maintainer of JanusGraph, a popular distributed graph database