Solution Resort

Learn, try and share…

Install Python 3.6 on Ubuntu 16.04

This should save you some time finding the right commands to run:

Install Docker on Ubuntu 16.04

Feel free to grab:

How to rebase and squash git commits

Configure PyCharm CE to work with Apache Spark

This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with version 2.1).

First, create a new Pure Python PyCharm project.

Now copy the content of the word-count script into your project. Your IDE should complain about the following line

from pyspark.sql import SparkSession

because it doesn’t know where pyspark.sql, which is part of the Python Spark library, lives.

In order to tell PyCharm where the Python Spark libraries are, you need to go to Preferences->Project->Project Structure and add the zip files (pyspark.zip and py4j-<version>-src.zip) under $SPARK_HOME/python/lib to the content root. $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark yet, you can download it here.

Next, go to Run -> Edit Configurations

and create a new configuration using the default Python configuration profile, then add the following environment variables:

SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python

Then specify the name of your main .py script and the location of your text file where you want the words to be counted.

Finally, run your new configuration and it should do a word count job using Apache Spark.
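The word-count script the guide refers to isn’t linked here, so below is a minimal sketch of what such a script might look like. The file name, app name and structure are assumptions, not the original code; the pure `count_words` helper mirrors what the Spark pipeline computes.

```python
# word_count.py -- hypothetical minimal word-count script for the guide above
from collections import Counter


def count_words(lines):
    """Pure-Python word counting, mirroring what the Spark job computes."""
    words = (word for line in lines for word in line.split())
    return dict(Counter(words))


if __name__ == "__main__":
    import sys
    # Requires pyspark on the PYTHONPATH, as configured in the guide above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    path = sys.argv[1]  # the text file set in your run configuration
    counts = (spark.sparkContext.textFile(path)
              .flatMap(lambda line: line.split())   # split lines into words
              .map(lambda word: (word, 1))          # pair each word with 1
              .reduceByKey(lambda a, b: a + b))     # sum the counts per word
    for word, n in counts.collect():
        print(word, n)
    spark.stop()
```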

Have fun!

Boilerplate – Apache Spark with Spring profile

This is a boilerplate you can try out to get started with Apache Spark (version 2.1.0) with Spring profile quickly:

This should allow you to configure environment specific properties (i.e. path to read some input file) really easily.
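For example, with Spring profiles you would typically keep one properties file per environment; a hypothetical sketch (file and property names are assumptions, not from the boilerplate):

```properties
# application-dev.properties
input.file.path=/tmp/dev/input.txt

# application-prod.properties
input.file.path=/data/prod/input.txt
```

You then pick the environment at launch time, e.g. with `-Dspring.profiles.active=dev`, and the Spark job reads `input.file.path` without any code changes.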

Boilerplate – Groovy and Spock

This is a boilerplate you can try out to get started with Groovy and Spock quickly.

I created this because it can be tricky to find all the right dependencies & plugins to get started with Spock. This should get you started in no time.

Submit Spark Streaming job without waiting for it to finish

A Spark Streaming job is different from an ordinary Spark job: it runs 24/7 and never stops until you tell it to. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes to making it work with Spark Streaming jobs, things get a bit tricky.

Consider the following workflow, taken from a disaster recovery process:
1. Run an ordinary Spark job (i.e. an ETL process) to recover the data from a backup.
2. Run some checks to make sure the recovered data is correct.
3. Start the Spark Streaming job.

By default, when a job is submitted via Oozie, the submission process is blocked until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem: Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU cores and 2G of RAM as a minimum to run any Spark job (1 core and 1G of RAM per process: it uses one process for Oozie itself, and another for the submission of the Spark job).

Well, the good news is there is an option that tells spark-submit not to wait for the job to finish once it has been submitted.

When it’s set to false, the spark-submit process exits with return code 0 as soon as the job is submitted, and of course, the Oozie job finishes as well.
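Assuming a YARN deployment (the original post’s reference is missing, so this is my best guess at the setting meant here), the option would be `spark.yarn.submit.waitAppCompletion`:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  your-streaming-job.jar
```

With `waitAppCompletion=false`, the client process returns as soon as the application is accepted by YARN instead of polling until it completes.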

Be careful not to blindly use this option everywhere. In the disaster recovery example above, only the Spark Streaming step should use it, not the disaster recovery itself, or you will end up starting the streaming job without the disaster recovery having been done at all.

This option is not an obvious one, and I only came across it while reading about Spark’s configuration. Hope this helps others having similar issues.

Scalable reporting solutions with Elasticsearch

See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially if you have a legacy platform)

MySQL in Docker without losing data after rebuild

When it comes to running database services, or anything stateful, in Docker containers, the first question is often: what happens to my data after the container is destroyed or rebuilt?

The simple answer is you can use Docker Data Volumes.

After reading a few articles as well as trying it out myself, the easiest and cleanest way I found is to create a Data Container with a volume first, and then tell your MySQL container to use the data volume on that container.

This can be done with just two commands:

# create the volume container and mount /var/lib/mysql
docker create -v /var/lib/mysql --name mysql_data_store busybox /bin/true

# start the mysql container and tell it to use the data volume on the
# mysql_data_store container (do not use the -d option if you are running CoreOS)
docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest

Now if you kill and remove the MySQL container and recreate it by mounting the same data volume again, none of your data will be lost, because the data volume itself has not been destroyed. The MYSQL_ROOT_PASSWORD option is redundant the second time you run the container, as the database has already been initialised.
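The rebuild cycle described above can be sketched as follows (reusing the container names from the commands earlier):

```shell
# destroy the mysql container; the data container keeps the volume alive
docker stop mysql && docker rm mysql

# recreate it against the same data volume -- existing databases survive
docker run --volumes-from mysql_data_store --name mysql -d mysql:latest
```

Only `docker rm -v` (or removing the data container itself with its volume) would actually delete the data.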

I’ve not tried this on production yet but will do soon on some hobby projects.

More reading:

Run Docker and Docker Compose in a Vagrant box

I recently created a project which provisions a Vagrant VM (Ubuntu trusty) with everything necessary installed to run docker & docker compose.

The main reason I created this is that it gives you an isolated environment to run things without going through the hassle of installing docker & docker compose, which can be quite annoying if, for example, you are running Mac OS X and want to use your own Docker Hub repo.

All you need to do is install Vagrant, which can be much easier than installing docker & docker compose, depending on the OS you use.
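A minimal Vagrantfile along these lines might look like the sketch below (the box name and the pinned compose version are assumptions, not necessarily what the project uses):

```ruby
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  # install docker via Vagrant's built-in docker provisioner
  config.vm.provision "docker"

  # install docker compose with a shell provisioner
  config.vm.provision "shell", inline: <<-SHELL
    curl -L https://github.com/docker/compose/releases/download/1.8.0/docker-compose-`uname -s`-`uname -m` \
      -o /usr/local/bin/docker-compose
    chmod +x /usr/local/bin/docker-compose
  SHELL
end
```

After a `vagrant up`, both tools are available inside the VM, leaving your host machine untouched.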

Simply check it out and follow the instructions.

