Configure PyCharm CE to work with Apache Spark

This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with version 2.1).

First, create a new Pure Python PyCharm project.

Now copy the content of https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py into your project. Your IDE should complain about the following line

from pyspark.sql import SparkSession

because it doesn’t know where to find pyspark.sql, which is part of the Python Spark library.

In order to tell PyCharm where the Python Spark libraries are, you need to go to Preferences -> Project -> Project Structure and add the zip files under $SPARK_HOME/python/lib to the content root. $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark, you can download it here: http://spark.apache.org/downloads.html
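
On a typical Spark 2.x download, the zip files to add look something like this (the exact py4j version varies by Spark release):

$SPARK_HOME/python/lib/py4j-0.10.4-src.zip
$SPARK_HOME/python/lib/pyspark.zip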

Next, go to Run -> Edit Configurations

and create a new configuration using the default Python configuration profile, then add the following environment variables:

PYSPARK_PYTHON=python3
SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python

Then specify the name of your main .py script and the location of the text file you want the words counted in.
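
For reference, the linked wordcount.py boils down to roughly the following sketch (trimmed; it assumes the path of the text file is passed in as the first program argument):

import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # the first program argument is the path of the text file to count words in
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    for (word, count) in counts.collect():
        print("%s: %i" % (word, count))
    spark.stop()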

Finally, run your new configuration and it should do a word count job using Apache Spark.

Have fun!

Boilerplate – Apache Spark with Spring profile

This is a boilerplate you can try out to get started quickly with Apache Spark (version 2.10) and Spring profiles: https://github.com/shenghuahe/sparkwithspringprofile

This should allow you to configure environment-specific properties (e.g. the path of an input file to read) really easily.

Boilerplate – Groovy and Spock

This is a boilerplate you can try out to get started with Groovy and Spock quickly: https://github.com/shenghuahe/groovywithspock

I created this because it can be tricky to find all the right dependencies & plugins to get started with Spock. This should get you started in no time.

Submit Spark Streaming job without waiting for it to finish

A Spark Streaming job is different from an ordinary Spark job: it runs 24/7 and never stops until you tell it to. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes to making it work with Spark Streaming jobs as well, things get a bit tricky.

Consider the following workflow, an example taken from a disaster recovery process:
1. Run an ordinary Spark job (i.e. an ETL process) to recover the data from a backup.
2. Run some checks to make sure the recovered data is correct.
3. Start the Spark Streaming job.

By default, when a job is submitted via spark-submit, the submission process is blocked until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem: Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU cores and 2 GB of RAM as a minimum to run any Spark job (1 core and 1 GB of RAM per process; it uses one process for Oozie itself, and another one for the submission of the Spark job).

Well, the good news is there is an option to tell spark-submit not to wait when it submits a job, which is
spark.yarn.submit.waitAppCompletion=false

When it’s set to false, the spark-submit process will exit with return code 0 as soon as the job is submitted, and of course the Oozie job will finish as well.
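
For example, a submission command for the streaming job would look something like this (the script name and master settings here are placeholders, not taken from a real job):

spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=false my_streaming_job.py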

Be careful not to blindly use this option everywhere. In the above disaster recovery example, only the Spark Streaming part should use this option, not the disaster recovery jobs themselves, or you will end up starting the streaming job without the disaster recovery having been done at all.

This option is not an obvious one, and I only came across it while reading https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/. Hope this helps others having similar issues.

Scalable reporting solutions with Elasticsearch

See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially if you have a legacy platform)

MySQL in Docker without losing data after rebuild

When it comes to running database services, or anything else stateful, in Docker containers, the first question is often: “what happens to my data after the container is destroyed or rebuilt?”

The simple answer is you can use Docker Data Volumes.

After reading a few articles as well as trying it out myself, the easiest and cleanest way I found is to create a Data Container with a volume first, and then tell your MySQL container to use the data volume on that container.

This can be done with just two commands:

# create the volume container and mount /var/lib/mysql
docker create -v /var/lib/mysql --name mysql_data_store busybox /bin/true

# start the mysql container and tell it to use the created volume on the mysql_data_store container
# (do not use the -d option if you are running it with CoreOS)
docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest

Now if you kill and remove the MySQL container and recreate it by mounting the same data volume again, none of your data will be lost, because the data volume has not been destroyed. The MYSQL_ROOT_PASSWORD option will be redundant the second time you run the container, as the MySQL data directory has already been initialised.
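
For example, rebuilding the MySQL container looks roughly like this:

# destroy the mysql container; the mysql_data_store container and its volume are untouched
docker rm -f mysql
# recreate it against the same data volume; the existing databases will still be there
docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest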

I’ve not tried this in production yet, but will do soon on some hobby projects.

More reading:
https://docs.docker.com/engine/userguide/containers/dockervolumes/
https://github.com/docker-library/docs/tree/master/mysql#where-to-store-data

Run Docker and Docker Compose in a Vagrant box

I recently created https://github.com/richardhe-awin/vagrant-docker, which provisions a Vagrant VM (Ubuntu Trusty) with everything necessary installed to run Docker and Docker Compose.

The main reason I created this is that it gives you an isolated environment to run things in, without going through the hassle of installing Docker and Docker Compose, which can be quite annoying if you are running, say, Mac OS X and want to use your own Docker Hub repo.

All you need to do is install Vagrant, which can be much easier than installing Docker and Docker Compose, depending on the OS you use.

Simply check it out and follow the instructions.
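
Assuming the repo follows the standard Vagrant workflow (the exact steps are in its README), getting into the box is roughly:

git clone https://github.com/richardhe-awin/vagrant-docker.git
cd vagrant-docker
vagrant up
vagrant ssh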

How to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager on a Mac

Feel free to skip some of the steps if you already have certain packages installed

Get Cask
brew install caskroom/cask/brew-cask

Get Vagrant & Vagrant plugins
brew cask install virtualbox
brew cask install vagrant
brew cask install vagrant-manager
vagrant plugin install vagrant-hostmanager

Install Hadoop
git clone git@github.com:richardhe-awin/vagrant-hadoop-cluster.git
cd vagrant-hadoop-cluster
vagrant up

Configure Cloudera Manager (mostly referenced from http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/)

  1. Go to http://hadoop-master:7180/ (you might have to wait a few minutes for the service to boot up before this is available) and log in with admin/admin.
  2. Choose to use the Express version and continue
  3. When you are asked to enter the host names, enter hadoop-node1 and hadoop-node2 and click Search. You should see the two hosts come up; confirm them.
  4. Keep the default options until you get to the page asking “Login to all hosts as”. Change this to “Another user”, enter “vagrant” as the username and “vagrant” again in the password fields. Click Next and it should start installing (this will take a while).
  5. On the “Cluster Setup” page, choose “Custom Services” and select the following: HDFS, Hive, Hue, Impala, Oozie, Solr, Spark, Sqoop2, YARN and ZooKeeper. Click Continue.
  6. On the next page, you can select what services end up on what nodes. Usually Cloudera Manager chooses the best configuration here, but you can change it if you want. For now, click Continue.
  7. On the “Database Setup” page, leave it on “Use Embedded Database.” Click Test Connection (it says it will skip this step) and click Continue.
  8. Click Continue on the “Review Changes” step. Cloudera Manager will now try to configure and start all services.
  9. Done!

Example of caching MVC response using Filesystem cache in ZF2

The code is available on my Github: https://github.com/shenghuahe/zf2-cache-mvc-response

Configuration

The filesystem cache is configured within the Application module's module.config.php, and the construction of the cache adapter is therefore delegated to Zend\Cache\Service\StorageCacheAbstractServiceFactory.

Response Caching

The caching of the MVC response is done through event listeners. This separates the concerns and keeps the code decoupled and reusable.

Check out the two methods loadPageCache() and savePageCache() within Application/Module.php

savePageCache() is attached to the MvcEvent::EVENT_RENDER event with a very low priority. This makes sure $e->getResponse()->getContent() is populated before adding it to the cache.

loadPageCache() is attached to the MvcEvent::EVENT_ROUTE event with a low priority. This allows all other attached listeners to run first before the cached response data is loaded. If the response data is in the cache, $e->getResponse()->setContent() will be called and the response object will be returned. This stops all subsequent listeners attached to the same event from executing. You might wonder why the savePageCache() method no longer gets run either, since it is attached to a different event (EVENT_RENDER). The trick is actually done within Zend\Mvc\Application::run() by the following block of code:

$result = $events->trigger(MvcEvent::EVENT_ROUTE, $event, $shortCircuit);
if ($result->stopped()) {
    $response = $result->last();
    if ($response instanceof ResponseInterface) {
        $event->setTarget($this);
        $event->setResponse($response);
        $events->trigger(MvcEvent::EVENT_FINISH, $event);
        $this->response = $response;
        return $this;
    }
}

You can see that $result->stopped() returns true in this case and that the $result object is an instance of Zend\EventManager\ResponseCollection. The last result is the response object with the data retrieved from the cache!
