International Journal of Computer Science Engineering Techniques

ISSN: 2455-135X

Volume 6, Issue 3 | Published: June – 2022

Author

Kuladeep Sandra

Abstract

Enterprises increasingly depend on real-time stream processing for fraud detection, operational analytics, recommendation, and anomaly detection. Apache Flink and Apache Spark Structured Streaming are the two dominant open-source options, and they represent meaningfully different architectural choices: Flink is a true streaming engine with per-event semantics and a continuous-execution model, while Spark Structured Streaming is a micro-batch system that unifies streaming with the broader Spark batch ecosystem. This paper presents a systematic comparison and grounds the discussion in five years of production experience operating Spark Structured Streaming and Kafka in a banking and insurance environment, during which the Kafka topic count grew from 2 to approximately 340. The paper addresses three research questions: how the architectures differ and what those differences imply for latency, throughput, and state; what the practical operational challenges of running Spark Structured Streaming at enterprise scale are, and where Flink would offer advantages; and how practitioners should choose between the two based on workload characteristics. The conclusion is that the choice is not purely technical: it depends on latency requirements, state complexity, organizational ecosystem, and team expertise. For the majority of enterprise use cases, including the ones in our environment, Spark Structured Streaming is the more pragmatic choice. For the minority that require sub-second latency or genuinely complex stateful processing, Flink is worth the steeper learning curve.
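The latency implication of the two execution models can be illustrated with a toy calculation. The sketch below is not taken from the paper's measurements: it assumes a fixed trigger interval, ignores processing time and scheduling overhead, and the function names and arrival times are hypothetical. It shows only that, under these assumptions, an event in a micro-batch system waits for the next trigger boundary, whereas a per-event engine can begin processing on arrival.

```python
from statistics import mean


def micro_batch_wait_ms(arrival_ms: int, trigger_interval_ms: int) -> int:
    """Toy model: an event waits until the next trigger boundary at or after its arrival."""
    remainder = arrival_ms % trigger_interval_ms
    return 0 if remainder == 0 else trigger_interval_ms - remainder


def per_event_wait_ms(arrival_ms: int) -> int:
    """Toy model: a per-event engine starts processing as soon as the event arrives."""
    return 0


# Hypothetical arrival times (milliseconds) and a hypothetical 1-second trigger interval.
arrivals = list(range(0, 10_000, 137))
waits = [micro_batch_wait_ms(t, 1_000) for t in arrivals]

if __name__ == "__main__":
    print(f"mean micro-batch wait: {mean(waits):.0f} ms, worst case: {max(waits)} ms")
    print(f"per-event wait: {per_event_wait_ms(arrivals[0])} ms")
```

In this simplified model the added wait is bounded by the trigger interval, which is why sub-second requirements are the cases where a micro-batch design becomes the limiting factor.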

Keywords

Apache Flink, Spark Structured Streaming, Kafka, latency, throughput, stateful processing, event time, windowing

Conclusion

Returning to the three research questions:

RQ1. Flink and Spark Structured Streaming differ fundamentally in their execution models (true streaming versus micro-batching), and the difference manifests in latency (Flink lower), throughput (comparable), state management (Flink stronger for very large state), and operational characteristics (Flink more demanding, Spark more familiar).

RQ2. Spark Structured Streaming at enterprise scale has predictable operational challenges, which we documented in Section 4: checkpointing, the interaction between micro-batch sizing and file output, watermark business logic, monitoring, and sink idempotence. None is fatal. All require deliberate engineering. Flink would help with the sub-second-latency cases that micro-batching cannot reach, and with the complex stateful workloads that Spark’s state stores were not designed for.

RQ3. The choice between the two is not purely technical. Latency requirements, state complexity, ecosystem fit, and team expertise all matter, and the right answer for an organization is the one that balances those factors against its starting conditions. For most enterprise teams running at our scale and with our mix of use cases, Spark Structured Streaming is the pragmatic choice. For teams with sub-second latency requirements or unusually large state, Flink is worth the steeper learning curve.

The closing observation is that streaming architecture decisions are more durable than they first appear. A team that picks one engine builds operational expertise, monitoring tooling, library code, and team intuition around that engine, and the cost of switching is not the technical work of migration but the loss of all that accumulated context. We have chosen to invest in Spark Structured Streaming and to evaluate Flink case by case for individual workloads where it would be measurably better. After five years and approximately 340 topics of operational experience, we have not yet found a case where the migration cost was justified. We expect that to change for some specific workload eventually, and when it does, the answer will not be to rebuild everything on Flink; it will be to run Flink alongside Spark for the workload that needs it.
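The per-workload evaluation described above can be sketched as a small decision helper. This is an illustrative simplification, not a procedure the paper defines: it encodes only the two technical triggers named in the conclusion (sub-second latency and unusually large or complex state) and defaults to the incumbent engine otherwise. Ecosystem fit and team expertise are deliberately left to human judgment, and the `Workload` type, field names, and the 1,000 ms threshold constant are assumptions introduced for the example.

```python
from dataclasses import dataclass

# Assumption: "sub-second" is read as a required latency strictly below 1,000 ms.
SUB_SECOND_MS = 1_000


@dataclass(frozen=True)
class Workload:
    name: str
    max_latency_ms: int            # end-to-end latency the business requires
    complex_or_large_state: bool   # a judgment call made by the owning team


def recommend_engine(workload: Workload) -> str:
    """Return a starting recommendation; it does not replace a measured evaluation."""
    if workload.max_latency_ms < SUB_SECOND_MS or workload.complex_or_large_state:
        return "flink"
    # Default to the engine the organization already operates.
    return "spark-structured-streaming"


if __name__ == "__main__":
    # Hypothetical workloads, for illustration only.
    for w in (
        Workload("fraud-scoring", 200, False),
        Workload("hourly-aggregates", 60_000, False),
        Workload("long-lived-sessions", 5_000, True),
    ):
        print(w.name, "->", recommend_engine(w))
```

A helper like this is useful mainly as a checklist: a "flink" result flags a workload for closer evaluation rather than committing the team to a migration.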

Real-Time Stream Processing with Apache Flink vs Spark Structured Streaming: An Enterprise Comparison