Solution Resort

Learn, try and share…

Category: Hadoop

Configure PyCharm CE to work with Apache Spark

This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with version 2.1).

First, create a new Pure Python PyCharm project.

Now copy the content of the word count example script to your project. Your IDE should complain at the following line:

from pyspark.sql import SparkSession

because it doesn’t know where to find pyspark.sql, which is part of the Python Spark library.

To tell PyCharm where the Python Spark libraries are, go to Preferences->Project->Project Structure and add the zip files under $SPARK_HOME/python/lib to the content root. $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark yet, you can get it from the Apache Spark website.
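Adding the zip files to the content root has roughly the same effect as putting them on the Python import path. Here is a minimal sketch of doing that by hand, assuming the archives sit directly under $SPARK_HOME/python/lib. The helper names spark_lib_zips and add_spark_to_path are made up for illustration and are not part of Spark or PyCharm.

```python
import glob
import os
import sys


def spark_lib_zips(spark_home):
    """Return the sorted .zip archives found under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_spark_to_path(spark_home):
    """Prepend the Spark Python sources and library zips to sys.path.

    Returns the entries that should now be importable from.
    """
    entries = [os.path.join(spark_home, "python")] + spark_lib_zips(spark_home)
    # Insert in reverse so the final order on sys.path matches `entries`.
    for entry in reversed(entries):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return entries
```

Calling add_spark_to_path(os.environ["SPARK_HOME"]) at the top of a script is a quick way to check that the import resolves outside the IDE as well.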

Next, go to Run -> Edit Configurations, create a new configuration using the default Python configuration profile, and add the following environment variables:

SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python
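If the run configuration later fails on the import, it can help to confirm that both variables actually reached the Python process. The following is only a sketch: check_spark_env is a hypothetical helper, and it checks nothing beyond the two variables listed above.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems with SPARK_HOME / PYTHONPATH (empty if fine)."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems

    # PYTHONPATH is expected to contain <SPARK_HOME>/python.
    expected = os.path.normpath(os.path.join(spark_home, "python"))
    paths = [
        os.path.normpath(p)
        for p in env.get("PYTHONPATH", "").split(os.pathsep)
        if p
    ]
    if expected not in paths:
        problems.append("PYTHONPATH does not include " + expected)
    return problems
```

Printing check_spark_env() as the first line of the main script shows which variable, if any, is missing.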

Then specify your main .py script and the location of the text file whose words you want to count.

Finally, run your new configuration and it should do a word count job using Apache Spark.
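For reference, a word count job of this kind can be sketched as below. It is modeled on the word count example that ships with Spark and may differ from the script this post originally pointed to. It assumes the environment variables above are set so that pyspark can be imported, and the function names are chosen for illustration.

```python
import sys
from operator import add


def split_words(line):
    """Split one line of text into words on single spaces."""
    return line.split(" ")


def word_count(path):
    """Count word occurrences in the text file at `path` using Spark."""
    # Imported here so the module can be loaded even without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
    # Each row of the text DataFrame has a single column holding one line.
    lines = spark.read.text(path).rdd.map(lambda row: row[0])
    counts = (
        lines.flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(add)
    )
    result = counts.collect()
    spark.stop()
    return result


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
    else:
        for word, count in word_count(sys.argv[1]):
            print("%s: %i" % (word, count))
```

The text file location goes into the run configuration as the script's single argument.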

Have fun!

Scalable reporting solutions with Elasticsearch

See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially relevant if you have a legacy platform).

How to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager on a Mac

Feel free to skip any of the steps if you already have the relevant packages installed.

Get Cask
brew install caskroom/cask/brew-cask

Get Vagrant & Vagrant plugins
brew cask install virtualbox
brew cask install vagrant
brew cask install vagrant-manager
vagrant plugin install vagrant-hostmanager

Install Hadoop
git clone
cd vagrant-hadoop-cluster
vagrant up

Configure Cloudera Manager (mostly referenced from

  1. Go to http://hadoop-master:7180/ (you might have to wait a few minutes for the service to boot up before this is available) and log in with admin/admin.
  2. Choose to use the Express version and continue
  3. When you are asked to enter the host names, enter hadoop-node1 and hadoop-node2 and click Search. You should see the two hosts come up; confirm them.
  4. Keep using the default options until you get to the page asking “Login to all hosts as”. Change this to “Another user”, enter “vagrant” as the username, and enter “vagrant” again in the password fields. Click Next and it should start installing (this will take a while).
  5. On the “Cluster Setup” page, choose “Custom Services” and select the following: HDFS, Hive, Hue, Impala, Oozie, Solr, Spark, Sqoop2, YARN and ZooKeeper. Click Continue.
  6. On the next page, you can select which services end up on which nodes. Cloudera Manager usually chooses a sensible configuration here, but you can change it if you want. For now, click Continue.
  7. On the “Database Setup” page, leave it on “Use Embedded Database.” Click Test Connection (it says it will skip this step) and click Continue.
  8. Click Continue on the “Review Changes” step. Cloudera Manager will now try to configure and start all services.
  9. Done!
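Step 1 mentions that Cloudera Manager can take a few minutes to come up after vagrant up. Rather than refreshing the browser, you could poll the URL with a small script. This is only a sketch: wait_for_http is a hypothetical helper, and the timeout and interval values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=600.0, interval_s=10.0, fetch=None, sleep=time.sleep):
    """Poll `url` until a server answers; return True if it did, False on timeout.

    `fetch` and `sleep` can be swapped out, which keeps the function testable.
    """
    if fetch is None:
        def fetch(u):
            urllib.request.urlopen(u, timeout=5).close()

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            fetch(url)
            return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status: it is up.
            return True
        except OSError:
            # Connection refused, host not resolvable yet, timeout, etc.
            if time.monotonic() >= deadline:
                return False
            sleep(interval_s)
```

For example, wait_for_http("http://hadoop-master:7180/") returns True once the login page can be reached, or False after ten minutes.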

