SparkContext allows your Spark application to access the Spark cluster with the help of a resource manager. The resource manager can be one of these three: Spark Standalone, YARN, or Apache Mesos.
The different contexts in which it can run are local, yarn-client, a Mesos URL, or a Spark URL.
Hence, SparkContext provides various functions in Spark, such as getting the current status of the Spark application, setting the configuration, cancelling a job, cancelling a stage, and much more. It is the entry point to Spark functionality; thus, it acts as the backbone of a Spark application.
Spark Stage – An Introduction to the Physical Execution Plan
A stage is a physical unit of the execution plan, with the boundary of a stage in Spark marked by shuffle dependencies.
Ultimately, submission of a Spark stage triggers the execution of a series of dependent parent stages. Also, there is a first Job Id present at every stage, which is the id of the job that submitted the stage in Spark.
There are two types of stages in Spark:
- ShuffleMapStage in Spark
- ResultStage in Spark
ResultStage is the final stage in a job; it applies a function to one or many partitions of the target RDD in Spark. It also computes the result of an action.
Apache Spark Executor for Executing Spark Tasks
Some conditions in which we create an Executor in Spark are:
- When CoarseGrainedExecutorBackend receives a RegisteredExecutor message. This applies only to Spark Standalone and YARN.
- When a LocalEndpoint is created for local mode.
We can create the Spark Executor by using the following:
- The Executor ID.
- SparkEnv, through which we can access the local MetricsSystem and BlockManager, as well as the local serializer.
- The Executor's hostname.
- A collection of user-defined JARs to add to the tasks' classpath. By default, it is empty.
- A flag indicating whether it runs in local or cluster mode (disabled by default, i.e. cluster mode is preferred).
Moreover, when creation is successful, an INFO message pops up in the logs. That is:
INFO Executor: Starting executor ID [executorId] on host [executorHostname]
Basically, heartbeater is a daemon ScheduledThreadPoolExecutor with a single thread.
We call this thread pool the driver-heartbeater.
Thread Pool — ThreadPool Property
We have also learned how Spark Executors are helpful for executing tasks. The major advantage we have learned is that we can have as many executors as we want. Therefore, executors help to enhance the performance of the system.
Spark RDD – Introduction, Features & Operations of RDD
- Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions caused by node failures.
- Distributed, since the data resides on multiple nodes.
- Dataset represents the records of the data you work with. The user can load the dataset externally, from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure.
Apache Spark evaluates RDDs lazily: a transformation is computed only when its result is needed, which saves a lot of time and improves efficiency. Transformations run the first time they are used in an action, so that Spark can pipeline them. Also, the programmer can call a persist method to state which RDD they want to reuse in future operations.
Transformations in Spark are of two types:
a. Narrow Transformations – each output partition depends on at most one partition of the parent RDD (for example, map and filter), so no shuffle is needed.
b. Wide Transformations – the data required to compute an output partition may live in many partitions of the parent RDD (for example, groupByKey and reduceByKey), so a shuffle is needed.
An action in Spark returns the final result of the RDD computations. It triggers execution of the lineage graph: the data is loaded into the original RDD, all intermediate transformations are applied, and the final result is returned.
Because of the above-stated limitations of RDD, DataFrame and Dataset evolved to make Spark more versatile.