| Excerpt |
|---|
| Apache Spark is an open-source distributed general-purpose cluster-computing framework. |
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The architecture below shows how Apache Spark is composed and how it interacts with other components.
...
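As a minimal illustration of this composition, here is a hypothetical sketch of a driver program connecting to a cluster; the master URL, application name, and workload are placeholder assumptions, not a prescribed setup:

```scala
import org.apache.spark.sql.SparkSession

object ClusterHello {
  def main(args: Array[String]): Unit = {
    // The driver connects to the cluster manager named in `master`;
    // "spark://master:7077" is a placeholder standalone-cluster URL.
    val spark = SparkSession.builder()
      .appName("cluster-hello")
      .master("spark://master:7077")
      .getOrCreate()

    // Work is declared on the driver but executed in parallel by executors.
    val sum = spark.sparkContext.parallelize(1 to 1000000).sum()
    println(s"sum = $sum")

    spark.stop()
  }
}
```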
At first glance, the diagram above looks similar to the MapReduce architecture; the table below summarizes the differences:
| Item | MapReduce | Apache Spark |
|---|---|---|
| Data Processing | batch processing only | batch processing plus real-time data processing |
| Processing Speed | slower than Apache Spark because of disk I/O latency | up to 100x faster in memory and 10x faster on disk |
| Category | data processing engine | data processing engine |
| Cost | cheaper than Apache Spark | more expensive because of its large RAM requirements |
| Scalability | scalable, limited to about 1,000 nodes in a single cluster | scalable, limited to about 1,000 nodes in a single cluster |
| Machine Learning | relies on external tools such as Apache Mahout for machine learning | has built-in machine learning APIs (MLlib) |
| Compatibility | compatible with most data sources and file formats | integrates with all data sources and file formats supported by the Hadoop cluster |
| Security | more mature security features than Apache Spark | security features are still evolving and maturing |
| Scheduler | depends on an external scheduler | has its own scheduler |
| Fault Tolerance | uses replication for fault tolerance | uses RDDs and other data storage models for fault tolerance |
| Ease of Use | comparatively complex because of its Java APIs | easier to use because of its rich APIs |
| Duplicate Elimination | not supported | processes every record exactly once, which eliminates duplicates |
| Language Support | primary language is Java; C, C++, Ruby, Python, Perl, and Groovy are also supported | supports Java, Scala, Python, and R |
| Latency | very high latency | much lower latency than the MapReduce framework |
| Complexity | hard to write and debug code | easy to write and debug |
| Apache Community | open-source framework for processing data | open-source framework for processing data at higher speed |
| Coding | more lines of code | fewer lines of code |
| Interactive Mode | not interactive | interactive |
| Infrastructure | commodity hardware | mid- to high-end hardware |
| SQL | supported through Hive Query Language | supported through Spark SQL |
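The "Coding" and "Ease of Use" rows are easy to demonstrate. Below is a minimal word-count sketch in Spark's Scala API; the input path "input.txt" is a placeholder. The equivalent MapReduce job typically needs separate Mapper, Reducer, and driver classes.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // The whole map/shuffle/reduce pipeline is a handful of chained calls.
    val counts = spark.sparkContext
      .textFile("input.txt")            // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```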
Key differences between MapReduce and Apache Spark
- MapReduce is strictly disk-based, while Apache Spark works in memory and can use disk for processing.
- MapReduce and Apache Spark have similar compatibility in terms of data types and data sources.
- The primary difference is that MapReduce persists intermediate results to storage, while Spark uses Resilient Distributed Datasets (RDDs); see the sketch after this list.
- Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark performs better on data that fits in memory, particularly on dedicated clusters.
- Hadoop MapReduce can be an economical option thanks to Hadoop-as-a-service offerings, while Apache Spark is cost-effective only when plenty of memory is available.
- Both are fault tolerant, but comparatively Hadoop MapReduce is more fault tolerant than Spark.
- Hadoop MapReduce requires core Java programming skills, while programming in Apache Spark is easier because it has an interactive mode.
- Although both tools are used for processing Big Data, Spark can execute batch-processing jobs 10 to 100 times faster than MapReduce.
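To make the RDD point above concrete, here is a minimal sketch of caching an RDD in memory; the log-file path is a placeholder. The RDD's lineage (textFile, then filter) lets Spark recompute a lost partition instead of restoring it from replicas, and persisting keeps the data in RAM across repeated actions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-caching").getOrCreate()

    // Lineage: textFile -> filter. If an executor dies, Spark replays
    // this recipe for the lost partitions rather than using replication.
    val errors = spark.sparkContext
      .textFile("app.log")                   // placeholder path
      .filter(_.contains("ERROR"))
      .persist(StorageLevel.MEMORY_ONLY)     // keep the result in RAM

    // Both actions below reuse the cached data; a MapReduce equivalent
    // would re-read the input from disk each time.
    println(s"error lines: ${errors.count()}")
    errors.take(5).foreach(println)

    spark.stop()
  }
}
```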
When to use MapReduce:
- Linear processing of large datasets
- No intermediate results required
When to use Apache Spark:
- Fast and interactive data processing
- Joining Datasets
- Graph processing
- Iterative jobs (see the sketch after this list)
- Real-time processing
- Machine Learning
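To show why iterative jobs favor Spark, here is a small gradient-descent sketch on synthetic data; the dataset, learning rate, and iteration count are toy assumptions. The points are cached once, so every iteration runs against memory instead of re-reading input from disk the way a chain of MapReduce jobs would.

```scala
import org.apache.spark.sql.SparkSession

object IterativeJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterative-job").getOrCreate()

    // Synthetic (x, y) points following y = 3x, cached because every
    // iteration below makes a full pass over the same dataset.
    val points = spark.sparkContext
      .parallelize(1 to 10000)
      .map(i => (i.toDouble, 3.0 * i))
      .cache()

    var w = 0.0        // weight to learn; should converge toward 3.0
    val lr = 1e-8      // learning rate tuned to this toy data scale
    for (_ <- 1 to 20) {
      // One in-memory pass per iteration; no disk I/O between passes.
      val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= lr * gradient
    }

    println(s"learned w = $w (true value 3.0)")
    spark.stop()
  }
}
```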
Accelerating Apache Spark with an in-memory DB (Redis)
Even though Apache Spark performs better than MapReduce, you may still want to squeeze out more processing performance; an in-memory database such as Redis can help. The architecture below shows how to combine Redis with Apache Spark:
...
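One common integration pattern is to publish Spark results into Redis so downstream consumers can read them at in-memory speed. Here is a sketch of that pattern using the Jedis client; the Redis host/port, key prefix, and input path are placeholder assumptions, not an official recipe. (A dedicated spark-redis connector also exists, exposing Redis data as RDDs and DataFrames.)

```scala
import org.apache.spark.sql.SparkSession
import redis.clients.jedis.Jedis

object SparkToRedis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-to-redis").getOrCreate()

    val counts = spark.sparkContext
      .textFile("input.txt")            // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Open one Redis connection per partition rather than per record;
    // "localhost:6379" is a placeholder address.
    counts.foreachPartition { partition =>
      val jedis = new Jedis("localhost", 6379)
      partition.foreach { case (word, n) =>
        jedis.set(s"wordcount:$word", n.toString)   // hypothetical key scheme
      }
      jedis.close()
    }

    spark.stop()
  }
}
```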