Bullet on Spark

This section explains how to set up and run Bullet on Spark.

Configuration

Bullet Spark is configured at run time through a YAML settings file. Any setting you do not override defaults to its value in bullet_spark_defaults.yaml. The comments in that defaults file describe what each setting does.
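As a rough sketch of what an override file can look like, the snippet below writes a settings file containing a single override. The key name bullet.spark.data.producer.class.name is an assumption and should be checked against the comments in bullet_spark_defaults.yaml. The class name com.example.MyDataProducer and the file name are hypothetical placeholders.

```shell
# Write a minimal settings file; any setting not listed keeps its default.
# ASSUMPTION: verify the key name against bullet_spark_defaults.yaml.
# com.example.MyDataProducer is a hypothetical class name.
SETTINGS_FILE="${SETTINGS_FILE:-./my-bullet-spark-settings.yaml}"

cat > "$SETTINGS_FILE" <<'EOF'
bullet.spark.data.producer.class.name: "com.example.MyDataProducer"
EOF

echo "Wrote $SETTINGS_FILE"
```

Keeping the file limited to the settings you actually change makes it easier to see how your deployment differs from the defaults.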

Installation

Download the Bullet Spark standalone jar from JCenter.

If you are using Bullet Kafka as your pluggable PubSub, download its fat jar from JCenter. Otherwise, plug in your own PubSub jar or use the RESTPubSub that is built into bullet-core and can be turned on in the API.

To use Bullet Spark, you need to implement your own Data Producer trait in a JVM-based project. There are two ways to implement it, as described in the Spark Architecture section. Include the Bullet artifact and the Spark dependencies in your pom.xml or the equivalent file for your build tool. The artifacts are available through JCenter. Here is an example for Scala and Maven:

<repositories>
    <repository>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
        <id>central</id>
        <name>bintray</name>
        <url>http://jcenter.bintray.com</url>
    </repository>
</repositories>
<properties>
    <scala.version>2.11.7</scala.version>
    <scala.dep.version>2.11</scala.dep.version>
    <spark.version>2.3.0</spark.version>
    <bullet.spark.version>0.1.1</bullet.spark.version>
</properties>

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.dep.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.dep.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>com.yahoo.bullet</groupId>
    <artifactId>bullet-spark</artifactId>
    <version>${bullet.spark.version}</version>
</dependency>

You can also add <classifier>sources</classifier> or <classifier>javadoc</classifier> to the bullet-spark dependency if you want the sources or the javadoc.

Launch

After you have implemented your own data producer and built a jar, you can launch your Bullet Spark application. Here is an example command for a YARN cluster:

./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.yahoo.bullet.spark.BulletSparkStreamingMain \
    --queue <your queue> \
    --executor-memory 12g \
    --executor-cores 2 \
    --num-executors 200 \
    --driver-cores 2 \
    --driver-memory 12g \
    --conf spark.streaming.backpressure.enabled=true \
    --conf spark.default.parallelism=20 \
    ... # other Spark settings
    --jars /path/to/your-data-producer.jar,/path/to/your-pubsub.jar \
    /path/to/downloaded-bullet-spark-standalone.jar \
    --bullet-spark-conf /path/to/your-settings.yaml

You can pass other Spark settings by adding one --conf key=value flag per setting to the command. For the full list of settings, refer to the Spark Configuration.
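To illustrate the one-flag-per-setting pattern, here is a small sketch that expands a list of key=value pairs into repeated --conf flags. The helper name conf_flags is hypothetical, and the two settings are the ones already used in the example command above. It only prints the flags and does not invoke spark-submit.

```shell
# Hypothetical helper: turn key=value pairs into repeated --conf flags.
conf_flags() {
    out=""
    for kv in "$@"; do
        out="$out --conf $kv"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${out# }"
}

# Dry run: print the flags that would be added to the spark-submit command.
conf_flags spark.streaming.backpressure.enabled=true spark.default.parallelism=20
```

The printed flags can be pasted into the spark-submit command in place of the elided Spark settings.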

For other platforms, you can find the corresponding commands in the Spark Documentation.