This should save you some time finding the right commands to run: https://github.com/shenghuahe/ubuntu1604_python36_installer
Author: Richard
This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with Spark 2.1).
First, Create a new Pure Python PyCharm project.
Now copy the content of https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py into your project. Your IDE should complain about the following line
from pyspark.sql import SparkSession
because it doesn’t know where pyspark.sql, which is part of the Python Spark library, is located.
In order to tell PyCharm where the Python Spark libraries are, you need to go to Preferences -> Project -> Project Structure and add the zip files under $SPARK_HOME/python/lib to the content root. $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark yet, you can get it here: http://spark.apache.org/downloads.html
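For reference, the archives in question look something like this (a sketch assuming a stock Spark 2.x download; the exact py4j version varies between releases):

```shell
# List the archives to add as content roots; you should see pyspark.zip
# and a py4j-*-src.zip (the py4j version depends on your Spark release).
ls "$SPARK_HOME/python/lib"
```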
Next, go to Run -> Edit Configurations
and create a new configuration using the default Python configuration profile and add the following environment variables,
PYSPARK_PYTHON=python3
SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python
Then specify the name of your main .py script and the location of your text file where you want the words to be counted.
Finally, run your new configuration and it should do a word count job using Apache Spark.
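If you want to sanity-check the same setup outside PyCharm, the equivalent command line looks roughly like this (the script and input file names are placeholders for your own paths):

```shell
# Run the wordcount example directly with spark-submit;
# wordcount.py and input.txt are placeholders for your own files.
PYSPARK_PYTHON=python3 "$SPARK_HOME/bin/spark-submit" wordcount.py input.txt
```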
This is a boilerplate you can try out to get started quickly with Apache Spark (version 2.1.0) and Spring profiles: https://github.com/shenghuahe/sparkwithspringprofile
This should make it really easy to configure environment-specific properties (e.g. the path to an input file).
This is a boilerplate you can try out to get started with Groovy and Spock quickly: https://github.com/shenghuahe/groovywithspock
I created this because it can be tricky to find all the right dependencies & plugins to get started with Spock. This should get you started in no time.
A Spark Streaming job is different from an ordinary Spark job: it runs 24/7 and never stops until you tell it to. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes to making it work with Spark Streaming jobs as well, things get a bit tricky.
Consider the following workflow, an example taken from a disaster recovery process:
1. Run an ordinary Spark job (e.g. an ETL process) to recover the data from a backup.
2. Run some checks to make sure the recovered data is correct.
3. Start the Spark Streaming job.
By default, when a job is submitted via spark-submit.sh, the submission process blocks until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem: Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU cores and 2 GB of RAM as a minimum to run any Spark job (1 core and 1 GB of RAM per process; it uses one process for Oozie itself, and another for the submission of the Spark job).
Well, the good news is there is an option that tells spark-submit not to wait after it submits a job: spark.yarn.submit.waitAppCompletion (when running on YARN).
When it’s set to false, the spark-submit process exits with return code 0 as soon as the job is submitted, and of course, the Oozie job will finish as well.
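A sketch of such a fire-and-forget submission, using the spark.yarn.submit.waitAppCompletion configuration described in the AWS post linked below (the job name is a placeholder, and this applies to YARN cluster mode):

```shell
# Submit the streaming job without blocking until it finishes;
# my_streaming_job.py is a placeholder for your own application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my_streaming_job.py
```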
Be careful not to use this option blindly everywhere. In the disaster recovery example above, only the Spark Streaming step should use it, not the recovery steps themselves, or you will end up starting the streaming job without the disaster recovery having been done at all.
This option is not an obvious one, and I only came across it while reading https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/. Hope this helps others having similar issues.
See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially relevant if you have a legacy platform).
When it comes to running database services, or anything stateful, in Docker containers, the first question is often “what happens to my data?” after the container is destroyed or rebuilt.
The simple answer is you can use Docker Data Volumes.
After reading a few articles as well as trying it out myself, the easiest and cleanest way I found is to create a Data Container with a volume first, and then tell your MySQL container to use the data volume on that container.
This can be done with just two commands:
# create the volume container and mount /var/lib/mysql
docker create -v /var/lib/mysql --name mysql_data_store busybox /bin/true
# start the MySQL container and tell it to use the volume from the
# mysql_data_store container. Do not use the -d option if you are running it on CoreOS.
docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest
Now, if you kill and remove the MySQL container and recreate it with the same data volume mounted, none of your data is lost, because the data volume itself was never destroyed. The MYSQL_ROOT_PASSWORD option is redundant the second time you run the container, as the MySQL data directory has already been initialized.
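To see this in action, you can destroy the container and bring it back against the same volume container (a sketch; the container names match the commands above):

```shell
# Remove the running MySQL container; the mysql_data_store
# volume container, and the data in it, survive.
docker rm -f mysql

# Recreate it against the same volume; existing databases are still there.
docker run --volumes-from mysql_data_store --name mysql -d mysql:latest
```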
I’ve not tried this on production yet but will do soon on some hobby projects.
I recently created https://github.com/richardhe-awin/vagrant-docker, which provisions a Vagrant VM (Ubuntu Trusty) with everything necessary installed to run Docker and Docker Compose.
The main reason I created this is that it gives you an isolated environment to run things in, without going through the hassle of installing Docker and Docker Compose, which can be quite annoying if you are running, say, Mac OS X and want to use your own Docker Hub repo.
All you need to do is install Vagrant, which, depending on the OS you use, can be much easier than installing Docker and Docker Compose.
Simply check it out and follow the instructions.
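The usual Vagrant workflow applies (a sketch only; the repo’s own README is authoritative):

```shell
# Clone the repo and bring up the VM; provisioning installs Docker
# and Docker Compose inside it.
git clone https://github.com/richardhe-awin/vagrant-docker
cd vagrant-docker
vagrant up
vagrant ssh   # drop into the VM and use docker / docker-compose there
```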