Apache Spark is a fast, general-purpose cluster computing system for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Features
Find below some of the main features of Apache Spark:
- Speed (Spark can run up to 100 times faster than Hadoop MapReduce for large-scale data processing.)
- Powerful Caching (Spark provides powerful in-memory caching and disk persistence capabilities.)
- Deployment (Spark clusters can be deployed through Spark's own standalone cluster manager, or on Hadoop YARN, Mesos, or Kubernetes.)
- Real-Time (Spark provides real-time computation and low latency thanks to in-memory computation.)
- Polyglot (Spark provides high-level APIs in Java, Scala, Python, and R, so Spark code can be written in any of these four languages.)
Use Cases
Find below some examples of possible use cases:
- Performing compute-intensive tasks
- Performing various relational operations (e.g. text search or simple data operations) on both internal and external data sources
- Performing Machine Learning (ML) tasks such as feature extraction, classification, regression, clustering, recommendation, and more
Resources
Find below some interesting links providing more information on Apache Spark: