Apache Spark is an open-source, distributed, general-purpose cluster-computing framework.
- Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.
- In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.
- In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large-scale sorting using Spark.
- Spark had in excess of 1000 contributors in 2015, making it one of the most active projects in the Apache Software Foundation and one of the most active open source big data projects.
- Given the popularity of the platform, by 2014 paid programs like General Assembly and free fellowships like The Data Incubator had started offering customized training courses.
...
Item | MapReduce | Apache Spark |
---|---|---|
Data Processing | batch processing | batch processing + real-time (streaming) processing |
Processing Speed | slower than Apache Spark because of disk I/O latency | up to 100x faster in memory and 10x faster on disk |
Category | data processing engine | data processing engine |
Cost | less costly than Apache Spark | costlier because of the large amount of RAM required |
Scalability | scalable, up to thousands of nodes in a single cluster | scalable, up to thousands of nodes in a single cluster |
Machine Learning | integrates with external libraries such as Apache Mahout for machine learning | has built-in machine learning APIs (MLlib) |
Compatibility | compatible with most data sources and file formats | can integrate with all data sources and file formats supported by a Hadoop cluster |
Security | more mature security features than Apache Spark | security features are still evolving and maturing |
Scheduler | depends on an external scheduler | has its own scheduler |
Fault Tolerance | uses replication for fault tolerance | uses RDD lineage and other data storage models for fault tolerance |
Ease of Use | more complex than Apache Spark because of low-level Java APIs | easier to use because of rich high-level APIs |
Duplicate Elimination | not supported | processes every record exactly once, eliminating duplicates |
Language Support | primary language is Java; languages like C++, Ruby, Python, Perl, and Groovy are also supported | supports Java, Scala, Python, and R |
Latency | very high latency | much lower latency than MapReduce |
Complexity | code is hard to write and debug | code is easier to write and debug |
Apache Community | open-source framework for processing data | open-source framework for processing data at higher speed |
Coding | more lines of code | fewer lines of code |
Interactive Mode | no interactive mode | interactive shell available |
Infrastructure | commodity hardware | mid- to high-end hardware (more RAM) |
SQL | supported through Hive Query Language | supported through Spark SQL |
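The batch-processing model in the first table row can be illustrated with the classic word-count example. This is a hypothetical pure-Python sketch of the MapReduce pattern, not actual Hadoop or Spark API code: a map phase emits (word, 1) pairs, a shuffle phase groups them by key, and a reduce phase sums each group.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce word-count pattern (not Hadoop API code):
# map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark is fast", "mapreduce is batch", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"])  # 2
print(counts["is"])     # 3
```

In real MapReduce each phase writes its output to disk before the next phase reads it, which is the I/O latency the table refers to; Spark chains the same logical steps in memory.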
Key differences between MapReduce and Apache Spark
- MapReduce is strictly disk-based while Apache Spark uses memory and can use a disk for processing.
- MapReduce and Apache Spark both have similar compatibility in terms of data types and data sources.
- The primary difference between MapReduce and Spark is that MapReduce persists intermediate results to disk, while Spark keeps them in memory as Resilient Distributed Datasets (RDDs).
- Hadoop MapReduce is meant for data that does not fit in the memory whereas Apache Spark has a better performance for the data that fits in the memory, particularly on dedicated clusters.
- Hadoop MapReduce can be the more economical option, given commodity hardware and Hadoop-as-a-service offerings, while Apache Spark becomes cost-effective when large amounts of memory are available.
- Apache Spark and Hadoop MapReduce are both fault tolerant, but comparatively Hadoop MapReduce is more fault tolerant than Spark.
- Hadoop MapReduce requires core Java programming skills, while programming in Apache Spark is easier as it has an interactive mode.
- Although both tools are used for processing big data, Spark can execute batch-processing jobs 10 to 100 times faster than MapReduce.
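The fault-tolerance contrast above (replication versus RDDs) can be sketched in a few lines. This is a hypothetical toy model, with illustrative names, of the idea behind Spark's lineage-based recovery: instead of reading a replicated copy, a lost partition is recomputed from its recorded parent and transformation.

```python
# Hypothetical toy model of lineage-based recovery (the idea behind Spark RDDs),
# in contrast to MapReduce/HDFS-style replication. All names are illustrative.
class LineageDataset:
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, if this is a root dataset
        self.parent = parent        # upstream dataset in the lineage graph
        self.transform = transform  # function applied to the parent's data
        self._cache = None          # in-memory copy, which may be "lost"

    def map(self, fn):
        return LineageDataset(parent=self, transform=lambda data: [fn(x) for x in data])

    def compute(self):
        if self._cache is None:     # recompute from lineage when data is missing
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.transform(self.parent.compute())
        return self._cache

    def lose_partition(self):
        self._cache = None          # simulate losing the in-memory data

base = LineageDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.compute())   # [2, 4, 6]
doubled.lose_partition()
print(doubled.compute())   # recomputed from lineage: [2, 4, 6]
```

The design trade-off: replication pays a storage cost up front, while lineage pays a recomputation cost only when a failure actually occurs.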
When to use MapReduce:
- Linear processing of large datasets
- When no intermediate results are required
When to use Apache Spark:
- Fast and interactive data processing
- Joining Datasets
- Graph processing
- Iterative jobs
- Real-time processing
- Machine Learning
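Why iterative jobs favor Spark can be shown with a small sketch. This is a hypothetical pure-Python illustration (not Spark API code): an expensive dataset is materialized once and reused across iterations, the way `RDD.cache()` works, instead of being re-read from disk on every pass as in chained MapReduce jobs.

```python
# Hypothetical sketch of why Spark suits iterative jobs: cache the input once
# and iterate in memory, instead of re-reading it on every pass.
load_calls = 0

def load_dataset():
    global load_calls
    load_calls += 1          # stands in for an expensive disk/HDFS read
    return list(range(1, 6))

# Without caching: every iteration re-reads the input (MapReduce style).
total_uncached = 0
for _ in range(3):
    total_uncached += sum(load_dataset())

# With caching: read once, then iterate in memory (Spark style).
cached = load_dataset()
total_cached = sum(sum(cached) for _ in range(3))

print(load_calls)                      # 4: three uncached reads plus one cached read
print(total_uncached == total_cached)  # True: same answer, fewer reads
```

Machine-learning training loops and iterative graph algorithms repeat this read-compute cycle many times, which is why the in-memory version wins.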
Accelerating Apache Spark with in-memory DB (Redis)
Even though Apache Spark performs better than MapReduce, you may still want to boost processing performance further; an in-memory database such as Redis can help. The architecture below shows how to combine Redis with Apache Spark:
...
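The caching pattern behind this architecture is a read-through cache: before recomputing an expensive result, a Spark job first checks Redis for a precomputed value, and writes the result back on a miss. The sketch below is a hypothetical stand-in in which a plain dict plays the role of the Redis server; in a real deployment a Redis client (for example redis-py's `get`/`set`) would replace it.

```python
# Hypothetical read-through cache sketch: a plain dict stands in for Redis.
# In a real setup, a Redis client's get/set calls would replace the dict.
redis_stub = {}

def expensive_aggregate(key, values):
    # stands in for a costly Spark computation (e.g. a large aggregation)
    return sum(values)

def cached_aggregate(key, values):
    hit = redis_stub.get(key)          # 1. try the in-memory store first
    if hit is not None:
        return hit, True               # cache hit: no recomputation needed
    result = expensive_aggregate(key, values)
    redis_stub[key] = result           # 2. write back for later jobs
    return result, False

value1, hit1 = cached_aggregate("sales:2024", [10, 20, 30])
value2, hit2 = cached_aggregate("sales:2024", [10, 20, 30])
print(value1, hit1)  # 60 False  (computed, then cached)
print(value2, hit2)  # 60 True   (served from the cache)
```

Because Redis keeps its data in memory and is shared across jobs, repeated queries over the same keys skip Spark recomputation entirely, which is where the extra speedup comes from.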