February 2, 2020

[Note] PySpark on Windows

Purpose: Learning and testing only; not for production.
Environment: Win10 64-bit
Prerequisites: Java 8 or later, Anaconda

  1. Create an Anaconda environment for PySpark with Python 3.7.  Mine is called "develop".
  2. conda activate develop (or your environment's name)
  3. conda install pyspark jupyter
  4. Download 'winutils.exe' from the bin folder of https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1
  5. Save winutils.exe to C:\opt\spark-hadoop\bin (any location works, as long as winutils.exe sits in a "bin" subfolder)
  6. Set an environment variable pointing at the folder above "bin":  set HADOOP_HOME=C:\opt\spark-hadoop
  7. Type "pyspark" to launch the PySpark (Python) shell.  For Jupyter, just type "jupyter notebook".  A quick smoke test follows this list.
