Step 1: Preparation

- Install Java JDK (1.8 or 11)

- Install Python 3.9.x (you can verify both installations as shown below)
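
To confirm that Java and Python are installed and reachable from the Path, you can run the following in a terminal (the reported version numbers will depend on your installation):

java -version

python --version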

Step 2: Download Apache Spark

1. Navigate to https://spark.apache.org/downloads.html.

2. Choose the Spark release and package type. This tutorial uses Spark 3.0.3 with the package type "Pre-built for Apache Hadoop 2.7", so the downloaded file should be spark-3.0.3-bin-hadoop2.7.tgz.


After downloading, open a new terminal and type

certutil -hashfile C:\Java\spark-java-tutorial\spark-3.0.3-bin-hadoop2.7.tgz SHA512

to verify the checksum.
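
certutil should print output similar to the lines below (the actual digest is 128 hexadecimal characters; the exact wording can vary slightly between Windows versions). Compare the digest against the SHA512 checksum published alongside the download link on the Spark download page:

SHA512 hash of C:\Java\spark-java-tutorial\spark-3.0.3-bin-hadoop2.7.tgz:
<128 hexadecimal characters>
CertUtil: -hashfile command completed successfully.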

Step 3: Configure environment variables

- Create an empty C:\Hadoop\bin folder and a C:\Spark folder

- Copy the winutils.exe file from https://github.com/cdarlint/winutils/raw/master/hadoop-2.7.3/bin/winutils.exe to the C:\Hadoop\bin folder (Spark needs this Hadoop helper binary to run on Windows)

- Copy the extracted spark-3.0.3-bin-hadoop2.7 folder into the C:\Spark directory

- Create the following environment variables:

SPARK_HOME=C:\Spark\spark-3.0.3-bin-hadoop2.7

HADOOP_HOME=C:\Hadoop

and add %SPARK_HOME%\bin and %HADOOP_HOME%\bin to the Path variable (see the command-line sketch below).
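
If you prefer the command line over the System Properties dialog, the two variables can be set for the current user from a Command Prompt as sketched below (adjust the paths if you extracted Spark elsewhere). The Path entries are easier to add through Control Panel > System > Advanced system settings > Environment Variables, since setx would replace the existing Path value:

setx SPARK_HOME "C:\Spark\spark-3.0.3-bin-hadoop2.7"

setx HADOOP_HOME "C:\Hadoop"

The new values only take effect in terminals opened after the change.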


Step 4: Launch Spark and test

Close the old terminal, open a new one, and type:

spark-shell


Open a web browser and navigate to http://localhost:4040/ to view the Spark web UI for the running shell.
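
As a quick sanity check, you can also ask the shell for its version; assuming the shell started without errors, both commands should report 3.0.3:

scala> sc.version

scala> spark.version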



TEST

Create a text file 'testhadoop.txt' in the directory where you launched spark-shell, containing a line of digits and a line of text (as in the output below), then type in the shell:

scala> val x = sc.textFile("testhadoop.txt")

x: org.apache.spark.rdd.RDD[String] = testhadoop.txt MapPartitionsRDD[1] at textFile at <console>:24


scala> x.take(11).foreach(println)

12345678910

HelloWorld


scala> val y = x.map(_.reverse)

y: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:25


scala> y.take(11).foreach(println)

01987654321

dlroWolleH
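
As an optional next step, a few more RDD operations can be tried on the same x; these are only illustrative, and their results depend on the contents of your file:

scala> x.count()                              // number of lines in the file

scala> x.map(_.length).collect()              // length of each line

scala> x.filter(_.contains("Hello")).collect()  // keep only lines containing "Hello"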


That is all you need to start familiarizing yourself with Apache Spark.