Friday, June 16, 2017

Realtime Apache and Scala Interview Questions And Answers

1. What is Apache Spark?
Wikipedia defines Apache Spark “an open source cluster computing framework originally developed in the AMPLab at University of California, Berkeley but was later donated to the Apache Software Foundation where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provides performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms.”
Spark is essentially a fast and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow with in-memory computing functionalities. Apache Spark can run on Hadoop, as a standalone system or on the cloud. Spark is capable of accessing diverse data sources including HDFS, HBase, Cassandra among others

2. Explain the key features of Spark.
• Spark allows Integration with Hadoop and files included in HDFS.
• It has an independent language (Scala) interpreter and hence comes with an interactive language shell.
• It consists of RDD’s (Resilient Distributed Datasets), that can be cached across computing nodes in a cluster.
• It supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing. Additionally, some of the salient features of Spark include:
Lighting fast processing: When it comes to Big Data processing, speed always matters, and Spark runs Hadoop clusters way faster than others. Spark makes this possible by reducing the number of read/write operations to the disc. It stores this intermediate processing data in memory.
Support for sophisticated analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. This allows users to combine all these capabilities in a single workflow.
Real-time stream processing: Spark can handle real-time streaming. MapReduce primarily handles and processes previously stored data even though there are other frameworks to obtain real-time streaming.  Spark does this in the best way possible.

3. What is “RDD”?
RDD stands for Resilient Distribution Datasets: a collection of fault-tolerant operational elements that run in parallel. The partitioned data in RDD is immutable and is distributed in nature.

4. How does one create RDDs in Spark?
In Spark, parallelized collections are created by calling the SparkContext “parallelize” method on an existing collection in your driver program.
                val data = Array(4,6,7,8)
                val distData = sc.parallelize(data)
Text file RDDs can be created using SparkContext’s “textFile” method. Spark has the ability to create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, among others. Spark supports text files, “SequenceFiles”, and any other Hadoop “InputFormat” components.
                 val inputfile = sc.textFile(“input.txt”)

5. What does the Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.

6. Define “Partitions”.
A “Partition” is a smaller and logical division of data, that is similar to the “split” in Map Reduce. Partitioning is the process that helps derive logical units of data in order to speed up data processing.
Here’s an example:  val someRDD = sc.parallelize( 1 to 100, 4)
Here an RDD of 100 elements is created in four partitions, which then distributes a dummy map task before collecting the elements back to the driver program.

7. What operations does the “RDD” support?
Transformations
Actions

8. Define “Transformations” in Spark.
“Transformations” are functions applied on RDD, resulting in a new RDD. It does not execute until an action occurs. map() and filer() are examples of “transformations”, where the former applies the function assigned to it on each element of the RDD and results in another RDD. The filter() creates a new RDD by selecting elements from the current RDD.

9. Define “Action” in Spark.
An “action” helps in bringing back the data from the RDD to the local machine. Execution of “action” is the result of all transformations created previously. reduce() is an action that implements the function passed again and again until only one value is left. On the other hand, the take() action takes all the values from the RDD to the local node.

10. What are the functions of “Spark Core”?
The “SparkCore” performs an array of critical functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and output functionalities. RDD in Spark Core makes it fault tolerance. RDD is a collection of items distributed across many nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.

Read More Questions:
Apache and Scala Interview Questions Part1
Apache and Scala Interview Questions Part2
Apache and Scala Interview Questions Part3

No comments: