| Excerpt |
|---|
| Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. |
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The architecture diagram below shows how Apache Spark is composed and how it interacts with other components.
...
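To make "implicit data parallelism and fault tolerance" concrete, here is a minimal sketch in Scala. The local master URL and the object name are placeholders for illustration: the programmer writes only the map/reduce logic, while Spark partitions the data, runs the work in parallel across executors, and recomputes lost partitions from the RDD's lineage if a node fails.

```scala
import org.apache.spark.sql.SparkSession

object ParallelSum {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for illustration; on a real cluster the master URL
    // would point at YARN, Kubernetes, or a standalone Spark master.
    val spark = SparkSession.builder()
      .appName("parallel-sum")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // The collection is split into partitions and the map/reduce runs on them
    // in parallel; if an executor is lost, Spark recomputes the missing
    // partitions from lineage rather than relying on data replication.
    val sumOfSquares = sc.parallelize(1 to 1000000)
      .map(n => n.toLong * n)
      .reduce(_ + _)

    println(s"Sum of squares: $sumOfSquares")
    spark.stop()
  }
}
```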
At first glance, the diagram above looks similar to the MapReduce architecture; the table below shows the differences:
| Item | MapReduce | Apache Spark |
|---|---|---|
| Data Processing | Batch processing only | Batch processing plus real-time (streaming) processing |
| Processing Speed | Slower than Apache Spark because of disk I/O latency | Up to 100x faster in memory and up to 10x faster on disk |
| Category | Data processing engine | Data processing engine |
| Cost | Less costly than Apache Spark | More costly because of the large amount of RAM required |
| Scalability | Both are scalable, limited to 1,000 nodes in a single cluster | Both are scalable, limited to 1,000 nodes in a single cluster |
| Machine Learning | Integrates with Apache Mahout for machine learning | Has built-in machine learning APIs (MLlib) |
| Compatibility | Compatible with most data sources and file formats | Can integrate with all data sources and file formats supported by a Hadoop cluster |
| Security | More mature security than Apache Spark | Security features are still evolving and maturing |
| Scheduler | Depends on an external scheduler | Has its own scheduler |
| Fault Tolerance | Uses replication for fault tolerance | Uses RDD lineage and other data storage models for fault tolerance |
| Ease of Use | More complex than Apache Spark because of its low-level Java APIs | Easier to use because of its rich APIs |
| Duplicate Elimination | Not supported | Processes every record exactly once, which eliminates duplicates |
| Language Support | Primary language is Java; C, C++, Ruby, Python, Perl, and Groovy are also supported | Supports Java, Scala, Python, and R |
| Latency | Very high latency | Much lower latency than the MapReduce framework |
| Complexity | Code is harder to write and debug | Code is easier to write and debug |
| Apache Community | Open-source framework for processing data | Open-source framework for processing data at higher speed |
| Coding | More lines of code | Fewer lines of code (see the word-count sketch after this table) |
| Interactive Mode | Not interactive | Interactive (e.g., spark-shell) |
| Infrastructure | Commodity hardware | Mid- to high-end hardware |
| SQL | Supported through Hive Query Language | Supported through Spark SQL (see the Spark SQL sketch after this table) |
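To illustrate the "fewer lines of code" and "rich APIs" rows, here is a minimal word-count sketch in Scala; the input path and local master URL are placeholders. The equivalent classic MapReduce program typically needs separate Mapper, Reducer, and driver classes.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]")   // placeholder: use a cluster master URL in production
      .getOrCreate()
    val sc = spark.sparkContext

    // The whole MapReduce word-count program collapses into a few chained
    // transformations; "input.txt" is a placeholder path.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

For the SQL row, a brief Spark SQL sketch: a small, made-up DataFrame is registered as a temporary view and queried with ordinary SQL, the same kind of query that Hive QL would translate into MapReduce jobs.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory data standing in for a Hive table.
    val sales = Seq(("books", 12.0), ("music", 7.5), ("books", 3.25))
      .toDF("category", "amount")
    sales.createOrReplaceTempView("sales")

    // Runs on Spark's engine instead of being compiled to MapReduce jobs.
    spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category")
      .show()

    spark.stop()
  }
}
```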
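Both sketches run unchanged with `spark-submit` or inside `spark-shell`; only the master URL and input paths need to change on a real cluster.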
Key differences between MapReduce and Apache Spark
...