Friday, June 16, 2017

Objective Apache Spark and Scala Questions and Answers

21. What is an “Accumulator”?
"Accumulators" are shared variables that tasks can only add to, through an associative and commutative operation, which lets Spark compute them efficiently in parallel. They are often used as a lightweight debugging aid: similar to "Hadoop Counters", accumulators can count the number of "events" that occur in a program. Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can add support for new types.
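A minimal sketch in the Spark shell, assuming the usual sc (SparkContext) is available; the accumulator name and sample data are illustrative:

```scala
// Count blank lines as a side effect of a distributed operation.
val blankLines = sc.longAccumulator("blankLines") // add-only from tasks
val lines = sc.parallelize(Seq("spark", "", "scala", ""))
lines.foreach { line =>
  if (line.isEmpty) blankLines.add(1)             // associative, commutative add
}
println(s"Blank lines: ${blankLines.value}")      // the value is read on the driver
```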

22. Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local file system
Amazon S3
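The storage system is selected by the URI scheme of the path; a hedged sketch with placeholder paths:

```scala
// The same textFile API reads from any supported filesystem.
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val fromLocal = sc.textFile("file:///tmp/input.txt")
val fromS3    = sc.textFile("s3a://my-bucket/input.txt") // needs the hadoop-aws module on the classpath
```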

23. What is “YARN”?
"YARN" (Yet Another Resource Negotiator) is Hadoop's cluster resource management layer, often described as a large-scale, distributed operating system for big data applications. It is not part of Spark itself; rather, Spark can run on YARN, which then provides a central resource management platform to deliver scalable operations across the cluster.
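Running on YARN is usually selected at submit time (spark-submit --master yarn); a minimal programmatic sketch, assuming the Hadoop/YARN configuration is on the classpath and using a placeholder application name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to passing --master yarn to spark-submit.
val conf = new SparkConf()
  .setAppName("YarnExample")
  .setMaster("yarn")            // hand resource negotiation to YARN
val sc = new SparkContext(conf)
```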

24. List the benefits of Spark over MapReduce.
Due to the availability of in-memory processing, Spark performs processing around 10-100x faster than Hadoop MapReduce.
Unlike MapReduce, Spark provides built-in libraries that perform multiple kinds of work from the same core engine, such as batch processing, streaming, machine learning, and interactive SQL queries.
MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage (see the sketch after this list).
Spark is well suited to iterative computation, while MapReduce is not.
Additionally, Spark keeps working data in memory whereas Hadoop MapReduce stores intermediate data on disk. Hadoop uses replication to achieve fault tolerance, while Spark uses a different data storage model, resilient distributed datasets (RDDs), which recover from failures through lineage in a way that minimizes network input and output.
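A minimal sketch of the caching advantage, assuming sc from the Spark shell and a placeholder input path:

```scala
// Cache once, then reuse across several actions without re-reading from storage.
val data = sc.textFile("hdfs:///data/points.txt").cache()
val total  = data.count()                              // first action: reads, then caches
val errors = data.filter(_.contains("ERROR")).count()  // served from memory
```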

25. What is a “Spark Executor”?
When "SparkContext" connects to a cluster manager, it acquires "Executors" on the cluster's worker nodes. "Executors" are Spark processes that run computations and store the application's data on the worker node. "SparkContext" then sends tasks to the executors to run.
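Executor resources are requested through configuration; a hedged sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorDemo")
  .set("spark.executor.memory", "2g")    // heap size per executor
  .set("spark.executor.cores", "2")      // cores per executor
  .set("spark.executor.instances", "4")  // number of executors (on YARN)
```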

26. List the various types of “Cluster Managers” in Spark.
The Spark framework supports three kinds of Cluster Managers:
Standalone (Spark's own built-in manager)
Apache Mesos
YARN
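The manager is selected by the master URL handed to Spark; a hedged sketch using placeholder hosts and the usual default ports:

```scala
import org.apache.spark.SparkConf

new SparkConf().setMaster("spark://master:7077") // Standalone
new SparkConf().setMaster("mesos://master:5050") // Apache Mesos
new SparkConf().setMaster("yarn")                // YARN (reads the Hadoop configuration)
```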

27. What is a “worker node”?
"Worker node" refers to any node that can run the application code in a cluster; the cluster manager launches executors on worker nodes.

28. Define “PageRank”.
"PageRank" measures the importance of each vertex in a graph: a vertex ranks highly when many high-ranking vertices link to it. Spark ships an implementation in its GraphX library.
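A minimal GraphX sketch, assuming sc from the Spark shell and a placeholder edge-list file of "source target" pairs:

```scala
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")
val ranks = graph.pageRank(0.0001).vertices // iterate until changes fall below the tolerance
ranks.take(5).foreach(println)              // (vertexId, rank) pairs
```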

29. Can we do real-time processing using Spark SQL?
Not in the strict sense. Spark SQL itself queries data at rest rather than live streams. It can, however, be combined with Spark Streaming, which processes live data in small micro-batches, giving near-real-time rather than true real-time results.

30. What is the biggest shortcoming of Spark?
Spark utilizes more memory compared to Hadoop MapReduce, since it keeps data cached in memory for fast access.
Also, Spark Streaming is not actually streaming: it processes data in micro-batches, so some window functions cannot properly work on top of micro-batching.
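A hedged sketch of the micro-batch model; the socket source and durations are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))   // the "stream" is cut into 1-second batches
val lines = ssc.socketTextStream("localhost", 9999) // e.g. fed by `nc -lk 9999`
// Windows are assembled from whole micro-batches, not from individual records.
lines.window(Seconds(30), Seconds(10)).count().print()
ssc.start()
ssc.awaitTermination()
```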

Read More Questions:
Apache Spark and Scala Interview Questions Part1
Apache Spark and Scala Interview Questions Part2
Apache Spark and Scala Interview Questions Part3
