Top 40 Spark Interview Questions and Answers

Building on the Hadoop and MapReduce ecosystem, Apache Spark is an open-source, lightning-fast computation technology that supports a range of computational methods for fast and efficient processing.
The main contributing factor to the processing speed of Spark applications is its in-memory cluster computation.
This article helps software engineers and data engineers prepare for the most common Spark interview questions. The questions are based on core Spark concepts and range from basic to advanced.
Top 40 Spark Interview Questions
1. Can you tell me what Apache Spark is about?
In terms of big data processing and analysis, Apache Spark is an open-source framework engine known for its speed and ease of use. Additionally, it has modules for graph processing, machine learning, streaming, SQL, etc. With Spark, you can run in-memory computations and cyclic data flows, either in cluster mode or standalone mode, and access diverse data sources such as HBase, HDFS, Cassandra, etc.
2. What are the features of Apache Spark?
- Apache Spark facilitates high data processing speeds by reducing read/write operations to disk. It is up to about 100x faster for in-memory computation and about 10x faster for on-disk computation.
- Spark provides over 80 high-level operators that facilitate the easy development of parallel applications.
- Spark’s DAG execution engine increases the speed of data processing with its in-memory computation feature. Additionally, this supports data caching and reduces the time it takes to retrieve data from the disk.
- Spark codes can be reused for batch-processing, data streaming, ad-hoc queries, etc.
- Spark supports fault tolerance with RDD. Using Spark RDDs, no data is lost in case of worker node failures.
- Spark supports real-time stream processing. The earlier MapReduce framework was limited to processing only existing data.
- Spark transformations using Spark RDDs are lazy. They do not generate results right away, but they create new RDDs based on existing RDDs. Lazy evaluation increases the efficiency of the system.
- Multiple Language Support: Spark supports multiple languages like R, Scala, Python, and Java, which provides dynamicity and helps to overcome the Hadoop limitation of only being able to develop applications using Java.
- Spark also integrates with the Hadoop YARN cluster manager, making it flexible.
- Has support for Spark GraphX for graph parallelism, Spark SQL, machine learning libraries, etc.
- Apache Spark is considered a more cost-effective solution when compared to Hadoop, since Hadoop requires large amounts of storage and data centers while processing data.
- Apache Spark has a large, active developer community engaged in continuous development, and it is one of the most active projects of the Apache Software Foundation.
3. What is RDD?
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that are operated on in parallel. RDD data is partitioned, distributed, and immutable. RDDs can be created in two ways, as sketched below:
- Parallelized collections: Created by parallelizing an existing collection so that it can be processed simultaneously.
- Hadoop datasets: Created from files in HDFS or other Hadoop-supported storage systems.
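As a minimal sketch (assuming an existing SparkContext named sc, with an illustrative file path), both kinds of RDDs can be created like this:

```scala
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val parallelized = sc.parallelize(Seq(1, 2, 3, 4, 5))    // parallelized collection
val fromStorage = sc.textFile("hdfs:///data/input.txt")  // Hadoop dataset (path is illustrative)

println(parallelized.count())  // an action triggers the actual computation
```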
4. What does DAG refer to in Apache Spark?
DAG stands for Directed Acyclic Graph, i.e., a graph with no directed cycles. It has a finite number of vertices and edges, and each edge is directed from one vertex to another. In Spark, the vertices represent RDDs and the edges represent the operations to be performed on those RDDs.
5. List the types of deployment modes in Spark.
There are 2 deploy modes in Spark. They are:
- Client Mode: The deploy mode is said to be client mode when the Spark driver runs on the machine from which the Spark job is submitted.
- The main disadvantage of this mode is that if that machine fails, the entire job fails.
- This mode supports both the interactive shell and job-submission commands.
- Its performance is the poorest of the modes, so it is not preferred in production environments.
- Cluster Mode: The deploy mode is said to be cluster mode when the driver does not run on the machine from which the Spark job was submitted.
- The Spark job launches the driver component within the cluster as a sub-process of the ApplicationMaster.
- The spark-submit command is the only way to deploy in this mode (the interactive shell is not supported).
- Because the driver program runs inside the ApplicationMaster, it is re-instantiated if it fails.
- A dedicated cluster manager (such as standalone, YARN, Apache Mesos, or Kubernetes) allocates the resources needed for the job to run.
In addition to the above two modes, when the application has to be run locally for unit testing and development, the deployment mode is known as “Local Mode”. Here, the job runs in a single JVM on a single machine, which is inefficient because at some point resources will run short and jobs will fail. It is also not possible to scale up resources in this mode because of the limited memory and space of a single machine.
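As a rough illustration (the application name and master URL below are placeholders, not taken from this article), local mode is selected through the master setting when building the SparkSession, while on a real cluster the master and the client/cluster deploy mode are usually supplied via spark-submit:

```scala
import org.apache.spark.sql.SparkSession

// Local mode: driver and executors run inside a single JVM -- handy for tests and development.
val spark = SparkSession.builder()
  .appName("deploy-mode-demo")
  .master("local[*]")   // use all cores of the local machine
  .getOrCreate()

// On a cluster, you would typically omit .master() here and pass the
// cluster manager and deploy mode to spark-submit instead.
```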
6. What are receivers in Apache Spark Streaming?
Receivers consume data from different sources and then move it into Spark for processing. They are created using the streaming context as long-running tasks that are scheduled to run in a round-robin manner, with each receiver occupying a single core. Data streaming is accomplished by running the receivers on various executors. Depending on how the data is sent to Spark, there are two types of receivers:
- Reliable receivers: The receiver sends an acknowledgement to the data source once the data has been received and replicated in Spark storage.
- Unreliable receivers: No acknowledgement is sent back to the data source.
7. What is the difference between repartition and coalesce?
Repartition | Coalesce |
---|---|
Repartition can increase or decrease the number of data partitions. | Coalesce can only reduce the number of data partitions. |
Repartition creates new partitions and performs a full shuffle, producing evenly distributed data. | Coalesce reuses existing partitions, so less data is shuffled, but the resulting partitions may be unevenly sized. |
Repartition internally calls coalesce with the shuffle parameter enabled, which makes it slower than coalesce. | Coalesce is faster than repartition; however, if the partitions end up unequal in size, subsequent processing may be slightly slower. |
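For instance, a minimal sketch (the dataset and partition counts are purely illustrative, assuming an existing SparkSession spark):

```scala
// Assumes an existing SparkSession `spark`.
val df = spark.range(0, 1000000)         // a simple Dataset for illustration

val repartitioned = df.repartition(200)  // full shuffle; can increase or decrease partitions
val coalesced = df.coalesce(10)          // no full shuffle; can only reduce partitions

println(repartitioned.rdd.getNumPartitions)  // 200
println(coalesced.rdd.getNumPartitions)      // 10
```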
8. What are the data formats supported by Spark?
For efficient reading and processing, Spark supports both raw and structured file formats, such as Parquet, JSON, XML, CSV, RC, Avro, TSV, etc.
9. What do you understand by Shuffling in Spark?
The process of redistributing data across partitions, which may or may not cause data to move across JVM processes or executors on separate machines, is known as shuffling (or repartitioning). A partition is nothing more than a smaller logical division of data.
Shuffling is an expensive operation; it is triggered by wide operations such as grouping, aggregating by key, and joins, and Spark itself does not automatically control how the data ends up partitioned.
10. What is YARN in Spark?
- YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management technology. One of Spark's key features is that it can run on YARN, which then provides a central platform for resource management and scalable operations across the cluster.
- In short, Spark is a data processing tool, while YARN is a cluster management technology.
Spark Interview Questions for Experienced
11. How is Apache Spark different from MapReduce?
MapReduce | Apache Spark |
---|---|
MapReduce only processes data batch-by-batch. | Apache Spark can process data in real-time and in batches. |
Large data sets are slow to process with MapReduce. | Big data processing with Apache Spark is approximately 100 times faster than MapReduce. |
HDFS (Hadoop Distributed File System) stores data for MapReduce, so obtaining the data takes a long time. | Spark stores data in memory (RAM), making it easier and faster to retrieve data when needed. |
Because MapReduce heavily relies on disk, it is a high-latency framework. | A low-latency computation framework, Spark supports in-memory data storage and caching. |
MapReduce jobs require an external scheduler. | Due to Spark’s in-memory data computation, it has its own job scheduler. |
12. Explain the working of Spark with the help of its architecture.
Through the SparkSession object, Spark applications run as independent sets of processes coordinated by the driver program. The cluster manager (resource manager) allocates worker nodes, and tasks are assigned to them following the one-task-per-partition principle.
Iterative algorithms benefit from caching data across multiple iterations. Each task applies its unit of work to the dataset within its partition and produces a new partitioned dataset. The results are sent back to the main driver application for further processing or stored on disk.
13. What is the working of DAG in Spark?
A Directed Acyclic Graph is a graph with finite vertices and edges. The vertices represent RDDs, and the edges represent sequential operations on those RDDs. The DAG is submitted to the DAG Scheduler, which splits the graph into stages based on the transformations applied to the data; the stage view shows the details of the RDDs belonging to each stage. The flow is as follows:
- The first step is to interpret the code with the help of an interpreter. The Scala interpreter interprets code written in Scala.
- When the code is entered in the Spark console, Spark creates an operator graph.
- The operator graph is submitted to the DAG Scheduler when the action is called on Spark RDD.
- The DAG Scheduler divides the operators into stages of tasks. During this stage, detailed step-by-step operations are performed on the input data. After that, the operators are pipelined together.
- The stages are then passed to the Task Scheduler, which launches the tasks via the cluster manager. The tasks within a stage can run independently of one another.
- The task is then executed by the worker nodes.
An RDD keeps track of a pointer to one or more parent RDDs, along with metadata about its relationship to the parent. For example, for the operation val childB = parentA.map(...), the child RDD childB keeps track of its parent parentA. This chain of dependencies is called the RDD lineage.
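The lineage can be inspected directly. A small sketch (the RDD contents are illustrative, assuming an existing SparkContext sc):

```scala
val parentA = sc.parallelize(1 to 10)
val childB = parentA.map(_ * 2)
val filtered = childB.filter(_ > 5)

// toDebugString prints the lineage (the chain of parent RDDs) of this RDD.
println(filtered.toDebugString)
```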
14. When do you use Client and Cluster deployment modes?
- Cluster mode should be used when the client machines are not close to the cluster. It avoids the network latency that would otherwise arise from the driver communicating with the executors, and unlike Client mode, the whole job is not lost if the submitting machine goes offline.
- Client mode can be used for deployment when the client machine is part of the cluster. Because the machine is inside the cluster, there are no network latency issues, and since the maintenance of the cluster is already handled, failures are less of a concern.
15. How is Spark Streaming implemented in Spark?
Streaming is one of Spark’s most important features. It is simply an extension to the Spark API for processing data from multiple sources in real-time.
- The data from sources like Kafka, Kinesis, Flume, etc, are processed and sent to various destinations, such as databases, dashboards, machine learning APIs, or even file systems. Data is divided into streams (similar to batches) and processed accordingly.
- Spark Streaming supports highly scalable, fault-tolerant continuous stream processing, which is commonly used for fraud detection, website monitoring, clickstream analysis, IoT (Internet of Things) sensors, etc.
- Spark Streaming divides the data from the data stream into batches of X seconds, called DStreams (Discretized Streams). Internally, a DStream is nothing more than a sequence of RDDs. The Spark application processes these RDDs using the Spark APIs, and the results of this processing are again returned in batches. A minimal word-count sketch is shown below.
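As a rough sketch of the DStream API (the host and port are placeholders, assuming an existing SparkContext sc):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Form a batch (one RDD) every 10 seconds from lines received on a socket.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```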
16. Can you write a spark program that checks whether a given keyword exists in a huge text file?
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sc.textFile("test_file.txt")  # sc is an existing SparkContext
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
17. What can you say about Spark Datasets?
Spark Datasets are SparkSQL data structures that combine the benefits (such as data manipulation by lambdas) of RDDs with the Spark SQL-optimised execution engine. Spark introduced this in version 1.6.
- Spark datasets are strongly typed structures that represent structured queries along with encoders.
- As well as providing type safety to the data, they provide an object-oriented programming interface.
- The datasets are more structured and have lazy query expressions that help trigger the actions. Datasets combine the strengths of both RDDs and Dataframes. A dataset is a logical plan that informs the computational query about the need for data production. The physical query plan is formed once the logical plan has been analyzed and resolved.
Datasets have the following characteristics:
- Optimized queries: Spark Datasets use the Tungsten and Catalyst Query Optimizer frameworks to provide optimized queries. The Catalyst Query Optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten improves the execution speed of Spark jobs by exploiting the hardware architecture of the Spark execution platform.
- Compile-time analysis: Datasets can check syntax and analyze queries at compile time, which is technically not possible with RDDs or Dataframes.
- Conversions: A type-safe Dataset can be converted into an “untyped” Dataframe using the methods provided by the implicit DatasetHolder class:
- toDS(): Dataset[T]
- toDF(): DataFrame
- toDF(colNames: String*): DataFrame
- Implementation of datasets is much faster than that of RDDs, which increases system performance.
- Since the datasets are both queryable and serializable, they can be easily stored in any persistent storage.
- Less Memory Consumed: Spark utilizes the caching feature to provide a more optimal data layout. As a result, less memory is consumed.
- A single API is provided for both Java and Scala, the languages in which Apache Spark is most widely used, so libraries can handle different types of inputs with less effort. A short example follows.
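For illustration, a minimal Scala sketch (the case class and values are made up, assuming an existing SparkSession spark):

```scala
import spark.implicits._   // brings toDS()/toDF() into scope

case class Person(name: String, age: Int)

// A strongly typed Dataset[Person] created from a local collection.
val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// Lambdas keep compile-time type safety; the field access below is checked by the compiler.
val adults = people.filter(p => p.age >= 18)

// Converting the type-safe Dataset into an "untyped" DataFrame.
val df = people.toDF()
```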
18. Define Spark DataFrames.
Spark Dataframes are distributed collections of data organized into named columns, similar to tables in SQL. A Dataframe is equivalent to a table in a relational database and is primarily optimized for big data operations.
Dataframes can be created from a variety of data sources, such as external databases, existing RDDs, Hive tables, etc. They have the following features (a short creation example is given after the list):
- Spark Dataframes are capable of processing data in sizes ranging from Kilobytes to Petabytes on a single node to large clusters.
- They support different data formats, such as CSV, Avro, Elasticsearch, etc, as well as different storage systems, such as HDFS, Cassandra, and MySQL.
- Using SparkSQL catalyst optimizer, state-of-the-art optimization is achieved.
- SparkCore makes it easy to integrate Spark Dataframes with major Big Data tools.
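A minimal sketch of creating a Dataframe from a file (the path, options, and column name are placeholders, assuming an existing SparkSession spark):

```scala
// Read a CSV file into a DataFrame, inferring the schema from the data.
val salesDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/sales.csv")   // illustrative path

salesDf.printSchema()
salesDf.groupBy("region").count().show()   // assumes a "region" column exists
```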
19. Define Executor Memory in Spark
Every application developed in Spark has the same fixed number of cores and the same fixed heap size defined for its executors. The heap size is what is referred to as executor memory; it is controlled via the spark.executor.memory property and can also be set with the --executor-memory flag of spark-submit. Each worker node gets one executor allocated per application, and the executor memory is a measure of how much of the worker node's memory the application will use.
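As an illustrative sketch (the values are arbitrary, not recommendations), executor memory and cores can be set when building the session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-memory-demo")
  .config("spark.executor.memory", "4g")  // heap size per executor
  .config("spark.executor.cores", "5")    // cores per executor
  .getOrCreate()
```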
20. What are the functions of SparkCore?
SparkCore is the engine used to handle large-scale distributed and parallel data processing. The Spark core consists of a distributed execution engine that offers various APIs in Java, Python, and Scala that can be used to develop distributed ETL applications.
Spark Core performs many important functions, including memory management, job monitoring, fault tolerance, storage system interactions, job scheduling, and basic I/O support. Additional libraries built on top of Spark Core enable diverse workloads for SQL, streaming, and machine learning. Among its responsibilities are:
- Recovering from a fault
- Interactions between memory management and storage systems
- Monitoring, scheduling, and distribution of jobs
- Functions of basic I/O
21. What do you understand by worker node?
Worker nodes are the nodes that run the application code in a cluster. The Spark driver program listens for and accepts incoming connections from its executors, which run on the worker nodes. A worker node is like a slave node in that it receives work from a master node and executes it. Worker nodes process the data and report the resources they use back to the master; the master decides what resources need to be allocated, and tasks are then scheduled on the worker nodes based on availability.
22. What are some of the disadvantages to using Spark in applications?
There are certain drawbacks to using Apache Spark in applications, despite it being a powerful data processing engine. Here are a few of them:
- Spark makes use of more storage space when compared to MapReduce or Hadoop which may lead to certain memory-based problems.
- Developers must be careful when running applications. The work should be distributed across multiple clusters instead of running everything on a single node.
- Since Spark makes use of “in-memory” computations, they can be a bottleneck to cost-efficient big data processing.
- While using files from the local file system, the files must be accessible at the same location on all worker nodes when working in cluster mode, since the task execution is shuffled between various worker nodes based on resource availability. Files must be copied to all worker nodes, or a separate network-mounted file-sharing system must be in place.
- One of the biggest problems with Spark is handling a large number of small files. HDFS works best with a limited number of large files rather than many small ones. When small gzipped files are used, Spark must keep them in memory and on the network to uncompress them, so much of the time is spent unzipping files in sequence and repartitioning the resulting RDDs into a manageable format, which requires extensive shuffling. As a result, Spark spends a lot of time preparing data instead of processing it.
- Due to its inability to handle many concurrent users, Spark does not work well in multi-user environments.
23. Using Spark, how can data transfers be minimized?
The process of shuffling corresponds to data transfers. Spark applications run faster and more reliably when these transfers are minimized, and there are several ways to do this (a short sketch follows the list):
- Broadcast variables make joins between a large and a small RDD more efficient.
- Accumulators are used to update variable values in parallel during execution.
- Another way is to avoid the operations that trigger these reshuffles.
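A minimal sketch of the first two mechanisms (the data is made up, assuming an existing SparkContext sc):

```scala
// Broadcast a small lookup table so every executor receives one read-only copy,
// instead of shipping it along with every task.
val countryNames = Map("IN" -> "India", "US" -> "United States")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "US", "IN", "XX"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
println(resolved.collect().mkString(", "))

// An accumulator aggregates values from all tasks back to the driver.
val badRecords = sc.longAccumulator("badRecords")
codes.foreach(code => if (!broadcastNames.value.contains(code)) badRecords.add(1))
println(badRecords.value)   // 1 in this illustrative data
```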
24. What is SchemaRDD in Spark RDD?
A SchemaRDD is an RDD of Row objects, where each Row wraps basic values (such as integers or strings) together with schema information describing the data type of each column. SchemaRDDs were designed to make developers' lives easier when debugging code and running unit tests on the SparkSQL modules. They resemble the schema of a relational database in that they describe the structure of the RDD, and in addition to the basic RDD functionality, they provide some of the relational query interfaces of SparkSQL.
Here is an example: suppose you have an RDD named Person that represents a person's data. The SchemaRDD describes what each row of the Person RDD consists of, for example attributes such as name and age. (In later Spark versions, the SchemaRDD was renamed DataFrame.)
25. Which module does Apache Spark use to implement SQL?
SparkSQL is a powerful module that performs relational data processing along with Spark’s functional programming abilities. In addition, this module supports either SQL or Hive Query Language. It also supports different data sources and helps developers write powerful SQL queries with code transformations.
SparkSQL has four major libraries:
- Data Source API
- DataFrame API
- Interpreter & Catalyst Optimizer
- SQL Service
Structured and semi-structured data can be used in Spark SQL in the following ways:
- Spark supports DataFrame abstraction in Python, Scala, and Java, along with good optimization techniques.
- SparkSQL supports data reads and writes in various structured formats, including JSON, Hive, Parquet, etc.
- SparkSQL allows querying data both within the Spark program and via external tools that connect to JDBC/ODBC.
- SparkSQL is recommended in Spark applications since it allows developers to load data, query it from databases, and write the results to the destination (a short example follows).
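For example, a minimal sketch (the data and column names are made up, assuming an existing SparkSession spark):

```scala
import spark.implicits._

// Register a DataFrame as a temporary view and query it with SQL.
val people = Seq(("Ann", 30), ("Bob", 15)).toDF("name", "age")
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()
```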
26. What are the different persistence levels in Apache Spark?
Spark persists intermediate data from different shuffle operations automatically. However, it is recommended that you call the persist() method on the RDD. For storing RDDs on memory, disk, or both, there are different levels of persistence. Persistence levels available in Spark are:
- MEMORY_ONLY: This is the default persistence level; RDDs are stored as deserialized Java objects in the JVM. If an RDD is too large to fit in memory, some partitions are not cached and are recomputed as and when needed.
- MEMORY_AND_DISK: The RDDs are stored again as deserialized Java objects on the JVM. If the memory is insufficient, then partitions not fitting on the memory will be stored on disk, and the data will be read as and when needed from the disk.
- MEMORY_ONLY_SER: The RDD is stored as serialized Java objects (one byte array per partition).
- MEMORY_AND_DISK_SER: Like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly.
- DISK_ONLY: The RDD partitions are stored on disk only.
- OFF_HEAP: This level is the same as MEMORY_ONLY_SER, except that data is stored off-heap.
Persistence levels are used in the persist() method as follows:
rdd.persist(StorageLevel.<level_value>)
Persistence Level | Space Consumed | CPU Time | In Memory? | On Disk? |
---|---|---|---|---|
MEMORY_ONLY | High | Low | Yes | No |
MEMORY_ONLY_SER | Low | High | Yes | No |
MEMORY_AND_DISK | High | Medium | Some | Some |
MEMORY_AND_DISK_SER | Low | High | Some | Some |
DISK_ONLY | Low | High | No | Yes |
OFF_HEAP | Low | High | Yes (but off-heap) | No |
27. What are the steps to calculate the executor memory?
Consider the following details regarding the cluster:
There are 10 nodes
Each node has 15 cores
RAM of each node = 61GB
We use the following approach to determine the number of cores per executor:
The number of cores determines the number of tasks an executor can run concurrently. As a general rule of thumb, 5 is the optimal value.
Next, we calculate the number of executors:
Number of executors per node = number of cores per node / number of concurrent tasks
= 15 / 5
= 3
Total number of executors = number of nodes x number of executors per node
= 10 x 3
= 30
Finally, the executor memory follows from the RAM available on each node:
Executor memory = RAM per node / number of executors per node
= 61 GB / 3
≈ 20 GB per executor (in practice slightly less, to leave headroom for the OS and cluster-manager overhead).
28. Why do we need broadcast variables in Spark?
Developers use broadcast variables to keep a read-only variable cached on each machine instead of shipping a copy of it with every task. This lets every node receive a copy of a large input dataset efficiently, and Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication cost.
29. Differentiate between Spark Datasets, Dataframes, and RDDs.
Criteria | Spark Datasets | Spark Dataframes | Spark RDDs |
---|---|---|---|
Representation of Data | Spark Datasets is a combination of Dataframes and RDDs with features like static type safety and object-oriented interfaces. | Spark Dataframe is a distributed collection of data that is organized into named columns. | Spark RDDs are a distributed collection of data without schema. |
Optimization | Datasets make use of the Catalyst optimizer for optimization. | Dataframes also make use of the Catalyst optimizer for optimization. | There is no built-in optimization engine. |
Schema Projection | Datasets find out the schema automatically using the SQL engine. | Dataframes also find the schema automatically. | The schema needs to be defined manually in RDDs. |
Aggregation Speed | Dataset aggregation is faster than RDD but slower than Dataframes. | Aggregations are faster in Dataframes due to the provision of easy and powerful APIs. | RDDs are slower than both the Dataframes and the Datasets while performing even simple operations like data grouping. |
30. Can Apache Spark be used along with Hadoop? If yes, then how?
Definitely! Compatibility with Hadoop is one of Spark's key features. Together they form a powerful framework, since the combination lets Spark's processing power leverage Hadoop's YARN and HDFS features.
Hadoop can be integrated with Spark in the following ways:
- HDFS: Spark can be configured to run atop HDFS in order to take advantage of the distributed replication feature.
- MapReduce: Additionally, Spark can be configured to run alongside MapReduce in the same or a different Hadoop cluster, so that Spark handles real-time processing while MapReduce handles batch processing.
- Spark applications can be configured to run on YARN, a cluster management framework.
31. What are Sparse Vectors? How are they different from dense vectors?
There are two parallel arrays in a sparse vector, one for indices and one for values. Non-zero values are stored in these vectors to save space.
val sparseVec = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
- In the above example, we have a vector of size 5, but the non-zero values appear only at indices 0 and 4.
- A sparse vector is particularly useful when there are very few non-zero values. If there are only a few zero values, then dense vectors should be used. Sparse vectors would introduce the overhead of indices, which would adversely affect performance.
- Dense vectors are defined as follows:
val denseVec = Vectors.dense(4405d, 260100d, 400d, 5.0, 4.0, 198.0, 9070d, 1.0, 1.0, 2.0, 0.0)
- The choice between sparse and dense vectors has no impact on the results of calculations, but an inappropriate choice affects the amount of memory consumed and the speed of calculation.
32. How are automatic cleanups triggered in Spark to handle accumulated metadata?
Cleanup tasks can be triggered automatically either by setting the spark.cleaner.ttl parameter or by dividing long-running jobs into batches and writing the intermediate results to disk.
33. How is Caching relevant in Spark Streaming?
Data streams are divided into batches of X seconds by Spark Streaming, known as DStreams. DStreams enable developers to cache the data in memory, which can be very useful if the data is used for multiple calculations.
Data can be cached using the cache() method or the persist() method with an appropriate persistence level. For input streams that receive data over the network (such as Kafka, Flume, etc.), the default persistence level replicates the data on two nodes to ensure fault tolerance.
- Using the cache method:
cachedDf = dframe.cache()
- Using the persist method for caching:
persistDf = dframe.persist(StorageLevel.MEMORY_ONLY)
Caching has the following advantages:
- As Spark computations are expensive, caching enables data reuse, which in turn allows computations to be reused and costs to be saved.
- Reusable computations save a lot of time.
- Worker nodes are able to perform/execute more jobs by saving time during computation execution.
34. Define Piping in Spark.
Following the UNIX Standard Streams convention, Apache Spark provides the pipe() method on RDDs, which lets different parts of a job be written in any language that can read from standard input and write to standard output.
Using the pipe() method, an RDD transformation can be written that reads each element of the RDD as a String, passes it to an external process that manipulates it as needed, and returns the output as Strings.
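A small sketch (the script name is a placeholder; any executable that reads stdin and writes stdout would do, and it must be available on every worker):

```scala
// Assumes an existing SparkContext `sc`.
val data = sc.parallelize(Seq("spark", "pipe", "example"))

// Each element is written to the external process's stdin as a line;
// each line the process writes to stdout becomes an element of the result RDD.
val piped = data.pipe("./transform.sh")   // illustrative script name
piped.collect().foreach(println)
```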
35. What API is used for Graph Implementation in Spark?
Spark provides a powerful API called GraphX that extends Spark RDDs to support graphs and graph-based computations. It introduces the Resilient Distributed Property Graph, an extension of the Spark RDD that is a directed multigraph with parallel edges.
User-defined properties are associated with each edge and vertex, and parallel edges allow multiple relationships between the same pair of vertices. GraphX exposes a number of operators, such as subgraph, mapReduceTriplets, and joinVertices, for computing over graphs, and it also includes a large collection of graph builders and algorithms that simplify graph analytics tasks.
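A minimal sketch (the vertices and edges are made up, assuming an existing SparkContext sc):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (VertexId, property) pairs; edges carry their own property.
val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(users, follows)
println(s"Vertices: ${graph.vertices.count()}, Edges: ${graph.edges.count()}")

// Built-in algorithms, e.g. PageRank, run directly on the property graph.
val ranks = graph.pageRank(0.001).vertices
```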
36. How can you achieve machine learning in Spark?
MLlib, the machine learning library provided by Spark, is robust and scalable. It implements easy, scalable ML algorithms and covers tasks such as classification, regression, clustering, dimensionality reduction, collaborative filtering, etc.
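As a brief illustration (the data is synthetic and the parameters arbitrary, assuming an existing SparkSession spark):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// A tiny DataFrame with a "features" vector column, as expected by the spark.ml API.
val data = spark.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")

// Fit a k-means model with two clusters and show the cluster assignments.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.transform(data).show()
```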
Conclusion
We have seen the most common Spark interview questions in this article. Spark is one of the fastest-growing cluster computing platforms.
It was designed to process big data more efficiently while remaining compatible with existing big data tools and a wide range of libraries.
By integrating the power of different computational models seamlessly, you can build fast and powerful applications. Spark has become a hot and lucrative technology, and knowing Spark can open doors to new, better, and more challenging careers for Software Developers and Data Engineers.
