Apache Zeppelin is an interactive computational environment built on Apache Spark like the IPython Notebook. With Apache PredictionIO (incubating) and Spark SQL, you can easily analyze your collected events when you are developing or tuning your engine.
The following instructions assume that you have the command
sbt accessible in your shell's search path. Alternatively, you can use the
sbt command that comes with Apache PredictionIO (incubating) at
Export Events to Apache Parquet
PredictionIO supports exporting your events to Apache Parquet, a columnar storage format that allows you to query quickly.
Let's export the data we imported in Recommendation Engine Template Quick Start, and assume the App ID is 1.
$ $PIO_HOME/bin/pio export --appid 1 --output /tmp/movies --format parquet
After the command has finished successfully, you should see something similar to the following.
1 2 3 4 5 6 7 8 9 10 11
root |-- creationTime: string (nullable = true) |-- entityId: string (nullable = true) |-- entityType: string (nullable = true) |-- event: string (nullable = true) |-- eventId: string (nullable = true) |-- eventTime: string (nullable = true) |-- properties: struct (nullable = true) | |-- rating: double (nullable = true) |-- targetEntityId: string (nullable = true) |-- targetEntityType: string (nullable = true)
Building Zeppelin for Apache Spark 1.2+
Start by cloning Zeppelin.
$ git clone https://github.com/apache/incubator-zeppelin.git
Build Zeppelin with Hadoop 2.4 and Spark 1.2 profiles.
$ cd zeppelin $ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
Now you should have working Zeppelin binaries.
First, start Zeppelin.
$ bin/zeppelin-daemon.sh start
By default, you should be able to access Zeppelin via web browser at http://localhost:8080. Create a new notebook and put the following in the first cell.
Performing Analysis with Zeppelin
If all steps above ran successfully, you should have a ready-to-use analytics environment by now. Let's try a few examples to see if everything is functional.
In the second cell, put in this piece of code and run it.
1 2 3
%sql SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events GROUP BY entityType, event, targetEntityType
We can also easily plot a pie chart.
%sql SELECT event, COUNT(*) AS c FROM events GROUP BY event
And see a breakdown of rating values.
1 2 3
%sql SELECT properties.rating AS r, COUNT(*) AS c FROM events WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r