Model-based Performance Evaluation of Batch and Stream Applications for Big Data

Johannes Kroß and Helmut Krcmar

2017 IEEE 25th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 80-86

September 2017 · doi:DOI 10.1109/MASCOTS.2017.21


Batch and stream processing represent the two main approaches implemented by big data systems such as Apache Spark and Apache Flink. Although only stream applications are intended to satisfy real-time requirements, both approaches are required to meet certain response time constraints. In addition, cluster architectures continuously expand and computing resources constitute high investments and expenses for organizations. Therefore, planning required capacities and predicting response times is crucial. In this work, we present a performance modeling and simulation approach by using and extending the Palladio component model. We predict performance metrics of batch and stream applications and its underlying processing systems by the example of Apache Spark on Apache Hadoop. Whereas most related work concentrates on one specific processing technique and focuses on the metric response time, we propose a general approach and consider the utilization of resources as well. In different experiments we evaluated our approach using applications and data workloads of the HiBench benchmark suite. The results indicate accurate predictions for upscaling cluster sizes as well as workloads with errors less than 18%.