Apache Spark is an open-source distributed general-purpose cluster-computing framework.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The architecture below shows how Apache Spark is composed and how it interacts with other components.
...
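The "implicit data parallelism" mentioned above can be illustrated with a minimal, framework-free Python sketch (this is not Spark's actual API; the partitioning helpers here are hypothetical): the same user function is applied independently to each partition of the dataset, which is what lets a runtime like Spark execute the partitions in parallel across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Divide the dataset into n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def transform(partition):
    """The same user function is applied to every partition."""
    return [x * x for x in partition]

data = list(range(10))
partitions = split_into_partitions(data, 3)

# Each partition is independent of the others, so a cluster runtime
# is free to process them in parallel (threads stand in for workers here).
with ThreadPoolExecutor() as pool:
    results = list(pool.map(transform, partitions))

# Flatten the per-partition results back into one dataset.
squared = [x for part in results for x in part]
print(squared)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The key point is that `transform` never sees the whole dataset, only its partition; that independence is what both MapReduce and Spark exploit for scale-out.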
Looking at the diagram above, the architecture appears similar to MapReduce's. The table below summarizes the differences:
Item | MapReduce | Apache Spark |
---|---|---|
Data Processing | Batch processing only | Batch processing plus (near) real-time stream processing |
Processing Speed | Slower than Apache Spark because of disk I/O between stages | Up to 100x faster in memory and about 10x faster on disk |
Category | Data processing engine | Data processing engine |
Cost | Less costly than Apache Spark | Costlier because it requires large amounts of RAM |
Scalability | Both are scalable, limited to about 1,000 nodes in a single cluster | Both are scalable, limited to about 1,000 nodes in a single cluster |
Machine Learning | Integrates with Apache Mahout for machine learning | Has built-in machine learning APIs (MLlib) |
Compatibility | Compatible with most data sources and file formats | Integrates with all data sources and file formats supported by the Hadoop cluster |
Security | More secure than Apache Spark | Security features are still evolving and maturing |
Scheduler | Depends on an external scheduler | Has its own scheduler |
Fault Tolerance | Uses replication for fault tolerance | Uses RDD lineage and other data storage models for fault tolerance |
Ease of Use | Somewhat complex compared with Apache Spark because of its Java APIs | Easier to use because of its rich APIs |
Duplicate Elimination | Not supported | Apache Spark processes every record exactly once, eliminating duplicates |
Language Support | Primary language is Java, but C, C++, Ruby, Python, Perl, and Groovy are also supported | Supports Java, Scala, Python, and R |
Latency | Very high latency | Much lower latency than the MapReduce framework |
Complexity | Hard to write and debug code | Easy to write and debug code |
Apache Community | Open-source framework for processing data | Open-source framework for processing data at higher speed |
Coding | More lines of code | Fewer lines of code |
Interactive Mode | Not interactive | Interactive |
Infrastructure | Commodity hardware | Mid- to high-end hardware |
SQL | Supported through Hive Query Language | Supported through Spark SQL |
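To make the "Coding" and "Ease of Use" rows concrete, here is a small, framework-free Python sketch (not actual Hadoop or Spark code) of word count written in the two styles: the explicit map/shuffle/reduce phases that the MapReduce model imposes, versus the single chained expression that Spark's rich API encourages.

```python
from collections import defaultdict, Counter

lines = ["spark is fast", "mapreduce is batch", "spark is interactive"]

# --- MapReduce style: explicit map, shuffle, and reduce phases ---

# Map phase: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
mr_counts = {word: sum(counts) for word, counts in grouped.items()}

# --- Spark style: one chained expression over the dataset ---
# (In real PySpark this would be a chain of RDD transformations
# ending in reduceByKey; Counter stands in for that step here.)
spark_counts = dict(Counter(word for line in lines for word in line.split()))

print(mr_counts == spark_counts)  # True
```

Both versions compute the same result; the difference is that the MapReduce style forces every job into the three fixed phases, while the Spark style lets the whole computation be expressed as one pipeline, which is where the "fewer lines of code" advantage comes from.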
Key differences between MapReduce and Apache Spark
...