February 2, 2020

[Note] PySpark on Windows

Purpose: Learning and testing only; not for production.
Environment: Win10 64-bit
Prerequisites: Java 8 or later, Anaconda

  1. Create an Anaconda environment for PySpark with Python 3.7.  Mine is called "develop".
  2. conda activate develop (or your environment's name)
  3. conda install pyspark jupyter
  4. Download 'winutils.exe' from the bin folder of https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1
  5. Save winutils.exe to C:\opt\spark-hadoop\bin (any location works, as long as winutils.exe sits in a "bin" subfolder)
  6. Set an environment variable pointing at the folder above "bin":  set HADOOP_HOME=C:\opt\spark-hadoop
  7. Type "pyspark" to launch the PySpark (Python) shell.  For Jupyter, just type "jupyter notebook".  A quick smoke test follows this list.
