spark
Start a Spark application
TLDR
Note: the quick-reference entries below describe the Laravel Spark installer, a PHP scaffolding tool that happens to share the spark command name; they are unrelated to Apache Spark's spark-submit, which the rest of this page documents.
Register your API token
Display the currently registered API token
Create a new Spark project
Create a new Spark project with Braintree stubs
Create a new Spark project with team-based billing stubs
SYNOPSIS
spark-submit [options] <application jar | python file | R file> [application arguments]
PARAMETERS
--master <master_url>
The URL of the master for the cluster (e.g., local, yarn, spark://host:port).
--deploy-mode <client|cluster>
Whether to deploy the driver program on the client machine (client) or on a worker node inside the cluster (cluster).
--class <main_class>
The main class of your application (required for Java/Scala applications).
--name <name>
A descriptive name for your Spark application, visible in the Spark UI.
--conf <key=value>
Arbitrary Spark configuration properties in key=value format. Can be specified multiple times.
--executor-memory <amount>
Amount of memory to use per executor process (e.g., 1G, 512M).
--num-executors <num>
Number of executors to launch for the application.
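A representative invocation combining the options above is sketched here; the class name, JAR path, and input path (com.example.WordCount, /opt/jobs/wordcount.jar, hdfs:///data/input.txt) are illustrative placeholders rather than values from any particular deployment.

  # Launch a Scala/Java application on YARN in cluster mode (class and paths are placeholders)
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.WordCount \
    --name wordcount-job \
    --conf spark.sql.shuffle.partitions=200 \
    --executor-memory 2G \
    --num-executors 4 \
    /opt/jobs/wordcount.jar hdfs:///data/input.txt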
DESCRIPTION
Apache Spark is a unified analytics engine for large-scale data processing. There is no single native "spark" command in the way ls or grep exist in Linux; users interact with Spark through client scripts, and spark-submit is the core utility for launching a Spark application on a cluster. It handles the submission of applications written in Scala, Java, Python, or R to cluster managers such as YARN, Mesos, Kubernetes, or Spark's own standalone manager.

spark-submit configures the application's resources, dependencies, and entry point, then dispatches it for distributed execution. It is the essential tool for deploying and running distributed data processing workloads with Spark, hiding cluster management details that developers would otherwise have to handle themselves.
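As a sketch of the two most common cases, assuming illustrative file names (analysis.py, etl-assembly.jar) and placeholder hosts, the first command runs a Python script locally for quick testing and the second submits a packaged Scala/Java application to a standalone cluster:

  # Run a PySpark script locally, using all available CPU cores
  spark-submit --master "local[*]" analysis.py

  # Submit a packaged Scala/Java application to a standalone cluster master
  spark-submit --master spark://master-host:7077 --class com.example.ETLJob etl-assembly.jar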
CAVEATS
The term "spark" in the context of Linux typically refers to the Apache Spark distributed computing framework, not a standalone native Linux command like ls. Users interact with Spark primarily through wrapper scripts like spark-submit, spark-shell, or pyspark. Using these commands requires a pre-installed Apache Spark distribution and a Java Runtime Environment (JRE), along with Python or R for respective language applications. The specific behavior and available options of spark-submit can vary slightly depending on the Spark version and the underlying cluster manager being used.
CLUSTER MANAGERS
spark-submit is agnostic to the cluster manager, allowing applications to run seamlessly on Spark's own standalone cluster manager, YARN (Yet Another Resource Negotiator), Apache Mesos, or Kubernetes. The --master option specifies which manager to connect to.
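The accepted --master URL formats are illustrated below; host names and ports are placeholders, and app.py stands in for any application file.

  spark-submit --master "local[4]" app.py                        # local mode with 4 worker threads
  spark-submit --master spark://master-host:7077 app.py          # Spark standalone cluster (default port 7077)
  spark-submit --master yarn app.py                              # YARN; cluster location comes from HADOOP_CONF_DIR
  spark-submit --master k8s://https://k8s-apiserver:6443 app.py  # Kubernetes API server
  spark-submit --master mesos://mesos-master:5050 app.py         # Apache Mesos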
APPLICATION PACKAGING
Spark applications for Java/Scala are typically packaged as JAR files, while Python applications are .py files, often accompanied by additional .zip, .egg, or .py files for dependencies. spark-submit distributes these artifacts to the cluster nodes.
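A sketch of how dependencies are shipped, using the standard --jars and --py-files options with placeholder file and class names:

  # Distribute an extra JAR to executors alongside a Scala/Java application
  spark-submit --class com.example.Job --jars libs/extra-lib.jar target/job.jar

  # Distribute zipped Python dependencies alongside a PySpark application
  spark-submit --py-files deps.zip job.py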
HISTORY
Apache Spark originated at UC Berkeley's AMPLab in 2009 and was open-sourced in 2010. It was designed to address the limitations of Hadoop MapReduce, particularly its disk-intensive operations and poor fit for iterative algorithms, and its in-memory processing model delivered substantially better performance for such workloads. Spark became a top-level Apache Software Foundation project in 2014. The spark-submit script, alongside other client utilities such as spark-shell, has been a foundational part of the ecosystem since its early development, providing a consistent, user-friendly interface for launching and managing distributed applications without requiring developers to handle JVM processes or cluster resource allocation directly.