Install Python and Jupyter

  • Download and install a Python 2.7.x release if Python is not already present in /usr/local/bin (PySpark does not support Python 3 yet).
  • Install pip and virtualenv
$ curl -O https://bootstrap.pypa.io/get-pip.py  # download get-pip.py
$ python get-pip.py       # install pip
$ pip install virtualenv  # install virtualenv
  • Create a separate virtualenv for your playground (not mandatory, but recommended).
$ virtualenv sparkenv   # create virtualenv named sparkenv
$ source sparkenv/bin/activate # activate the virtualenv
  • Install Jupyter
$ pip install jupyter
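A quick sanity check, as a minimal sketch (it only assumes the sparkenv virtualenv is still active), that the interpreter Jupyter will run on is the virtualenv's Python 2.7:

$ python - <<'EOF'
import sys
# Both values should point at the sparkenv virtualenv's Python 2.7.x
print(sys.executable)        # should live under .../sparkenv/bin/
print(sys.version_info[:2])  # should be (2, 7)
EOF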

Download and Setup Spark in OSX

  • Download Scala version 2.10.x
  • Download sbt
  • Set SCALA_HOME and add it to your PATH:
$ export SCALA_HOME=/Users/kalyan/scala-2.10.4
$ export PATH=$PATH:$SCALA_HOME/bin
  • Go to the Spark root directory and build the assembly from the command line:
$ sbt/sbt clean assembly
  • Then start up Spark, also from the Spark root folder, to verify the build; a quick check of the PySpark layout follows below:
$ ./bin/spark-shell
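Before wiring Spark into Jupyter, it is worth confirming that the build left the PySpark pieces in place. A minimal sketch, assuming SPARK_HOME is set to your Spark root directory (the py4j version in your distribution may differ from 0.9):

$ python - <<'EOF'
import glob, os
spark_home = os.environ["SPARK_HOME"]  # must point at the Spark root directory
# The Jupyter setup below adds these two paths to PYTHONPATH
print(os.path.isdir(os.path.join(spark_home, "python", "pyspark")))            # expect True
print(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))  # expect one zip file
EOF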
  • Make sure you have activated the virtualenv created earlier (sparkenv).
  • Create a new kernel (not mandatory, but a good practice):
$ python -m ipykernel install --user --name sparkkernel --display-name "sparkkernel"
  • The kernel spec, which can be customized if needed, lives under ~/Library/Jupyter/kernels/<kernel-name>/kernel.json (e.g. /Users/kalyan/Library/Jupyter/kernels/vdiag/kernel.json):
{
    "display_name": "pyspark",
    "argv": [
        "/Users/gopalk/Work/WS/pyenvs/vdiag/bin/python",
        "-m",
        "ipykernel",
        "-f",
        "{connection_file}"
    ],
    "language": "python"
}
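To confirm that the kernel is registered, here is a minimal sketch using jupyter_client (installed as part of Jupyter); the name sparkkernel comes from the install command above:

$ python - <<'EOF'
from jupyter_client.kernelspec import KernelSpecManager
# Maps registered kernel names to the directories holding their kernel.json files
specs = KernelSpecManager().find_kernel_specs()
print(specs.get("sparkkernel"))  # should print the kernels/sparkkernel directory
EOF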
  • Set the required environment variables and start the notebook. The environment variables can also be set in the kernel spec file, but that is not recommended for portability reasons.
$ export SPARK_HOME=/path/to/spark   # the Spark root directory built above
$ export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
$ export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
$ jupyter notebook
  • The notebook is now ready, with the PySpark modules available to import.
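As a final check, a minimal sketch of a smoke test you can run in a notebook cell opened with the sparkkernel kernel (the local master and app name here are arbitrary choices, not from the setup above):

from pyspark import SparkConf, SparkContext

# Build a local SparkContext and run a trivial job through it
conf = SparkConf().setMaster("local[2]").setAppName("jupyter-smoke-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count())  # expect 50
sc.stop()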