Step 1: Preparation
- Install the Java JDK (1.8 or 11)
- Install Python 3.9.x
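To confirm that both are installed and on your PATH, you can run a quick check in a new terminal (a simple sanity check; adjust if you manage versions differently):

java -version
python --version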
Step 2: Download Apache Spark
1. Navigate to https://spark.apache.org/downloads.html.
2. Choose the Spark release (3.0.3) and the package type (Pre-built for Apache Hadoop 2.7), then download the .tgz archive.
After downloading, open a new terminal and run
certutil -hashfile C:\Java\spark-java-tutorial\spark-3.0.3-bin-hadoop2.7.tgz SHA512
to verify the checksum against the value published on the download page.
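Then extract the archive. One option is the tar command bundled with Windows 10 and later (the path below follows the example above and is an assumption; use your own download location, or a tool such as 7-Zip):

cd C:\Java\spark-java-tutorial
tar -xvzf spark-3.0.3-bin-hadoop2.7.tgz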
Step 3: Configure environment variables
- Create an empty 'C:\Hadoop\bin' folder and a 'C:\Spark' folder
- Download the winutils.exe file from https://github.com/cdarlint/winutils/raw/master/hadoop-2.7.3/bin/winutils.exe and place it in the C:\Hadoop\bin folder
- Copy the extracted spark-3.0.3-bin-hadoop2.7 folder into the C:\Spark directory
- Create the environment variables
SPARK_HOME=C:\Spark\spark-3.0.3-bin-hadoop2.7
HADOOP_HOME=C:\Hadoop
and add %SPARK_HOME%\bin and %HADOOP_HOME%\bin to the Path variable (one way to do this from the command line is shown below).
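If you prefer the command line to the System Properties > Environment Variables dialog, a sketch using setx from a Command Prompt (note that setx truncates values longer than 1024 characters, so editing Path itself through the dialog is safer; the new values only take effect in terminals opened afterwards):

setx SPARK_HOME "C:\Spark\spark-3.0.3-bin-hadoop2.7"
setx HADOOP_HOME "C:\Hadoop"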
Step 4: Launch Spark and test
Close the old terminal, open a new one, and type:
spark-shell
Open a web browser and navigate to http://localhost:4040/ to see the Spark web UI.
TEST
Create a text file 'testhadoop.txt' in the directory where you launched spark-shell (in this example it contains the two lines shown in the output below) and type:
scala> val x = sc.textFile("testhadoop.txt")
x: org.apache.spark.rdd.RDD[String] = testhadoop.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> x.take(11).foreach(println)
12345678910
HelloWorld
scala> val y = x.map(_.reverse)
y: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:25
scala> y.take(11).foreach(println)
01987654321
dlroWolleH
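As a further exercise, here is a minimal word-count sketch over the same RDD (it assumes the example file above; for a larger file the counts aggregate across all lines, and the output order may vary):

scala> val counts = x.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.collect().foreach(println)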
That is all you need to start getting familiar with Apache Spark.