
International Journal of Computer Science Engineering Techniques

ISSN: 2455-135X

Volume 6, Issue 2 | Published: April 2022

Author

Jeevan Krishna Paruchuri

Abstract

Real-time fraud detection in payment authorization workflows imposes a particularly demanding combination of constraints. Tens of thousands of transactions per second must be scored within an end-to-end latency budget of two hundred milliseconds. The features that the scoring model consumes must reflect a recent enough view of the underlying entities to capture fraudulent activity that occurred seconds earlier. The system must continue to detect fraud even under partial component failure. This paper presents a case study of a production fraud detection system operated at a large payments processor. It processes between ten thousand and fifty thousand transactions per second under a two-hundred-millisecond SLA. The system combines event-driven ingestion through Apache Kafka, Apache Spark Structured Streaming for feature aggregation, ScyllaDB for ultra-low-latency feature lookups, a custom C++ inference engine with AVX-512 optimization, and C++ as the primary serving language. The paper documents the engineering decisions that enabled the SLA to be met under steady-state and peak load. A central decision was the migration of the scoring path from Python to C++ with AVX-512 SIMD intrinsics, which reduced p99 latency by approximately 4×. The paper also reports two production incidents observed over a two-year operational window and the post-incident changes that followed. Neither incident produced fraud leakage, illustrating that graceful degradation and observability are as important as raw performance to the operational viability of fraud detection systems.

Keywords

fraud detection, real-time streaming, machine learning inference, low-latency processing, financial transactions, feature engineering

Conclusion

Real-time fraud detection in payment authorization imposes a combination of latency, accuracy, and operational reliability requirements that few other production ML applications match. The system described in this paper (event-driven ingestion through Apache Kafka, Apache Spark Structured Streaming for feature aggregation, ScyllaDB for online feature caching, a custom C++ inference engine with AVX-512 optimization for model inference, and Java as the primary serving language) satisfies a two-hundred-millisecond end-to-end SLA at tens of thousands of transactions per second, and has done so reliably across two years of production operation interrupted by only two minor incidents. Neither incident resulted in fraud leakage, demonstrating that the explicit graceful degradation paths and the monitoring instrumentation described in Sections 8 and 9 do their intended work. The principal lessons are that latency is a feature deserving the same engineering attention as accuracy; that the bottleneck in a real production workload is rarely where the team’s intuition expects it (in this case, ScyllaDB network I/O rather than model inference); that connection pooling and request batching often produce larger improvements than algorithmic optimization; that graceful degradation under partial failure is preferable to clean failure; and that shadow deployment combined with canary rollouts is the only reliable way to catch model regressions before they affect customers. Future work in real-time retraining, graph-based fraud signals, and explainability will extend the system in directions that the current architecture supports but does not yet exploit. As payment fraud continues to evolve and as the models defending against it become more sophisticated, the operational discipline required to run these systems well will remain at least as important as the modeling techniques themselves.
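The preference for graceful degradation over clean failure can be illustrated with a minimal sketch. It is not the system described in the paper: the names `Decision`, `score_transaction`, `fallback_score`, and the threshold `FALLBACK_AMOUNT_LIMIT` are hypothetical, and the feature store and model are stand-in callables. The sketch assumes only that a failed or timed-out feature lookup should yield a conservative rule-based score, flagged as degraded, rather than an error on the authorization path.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical threshold for the rule-based fallback (an assumption, not from the paper).
FALLBACK_AMOUNT_LIMIT = 500.0


@dataclass(frozen=True)
class Decision:
    score: float     # fraud score in [0, 1]
    degraded: bool   # True when the fallback path produced the score


def fallback_score(amount: float) -> float:
    """Conservative rule used when features are unavailable."""
    return 0.9 if amount > FALLBACK_AMOUNT_LIMIT else 0.1


def score_transaction(
    entity_id: str,
    amount: float,
    fetch_features: Callable[[str], Dict[str, float]],
    model: Callable[[Dict[str, float], float], float],
) -> Decision:
    """Score with the model when features arrive; degrade instead of failing."""
    try:
        features = fetch_features(entity_id)
    except (TimeoutError, ConnectionError):
        # Partial failure: keep answering, and mark the decision for monitoring.
        return Decision(fallback_score(amount), degraded=True)
    return Decision(model(features, amount), degraded=False)


# Stand-ins for a feature store client and a model (both hypothetical).
def healthy_store(entity_id: str) -> Dict[str, float]:
    return {"txn_count_1m": 3.0}


def failing_store(entity_id: str) -> Dict[str, float]:
    raise TimeoutError("feature lookup exceeded its budget")


def toy_model(features: Dict[str, float], amount: float) -> float:
    return min(1.0, 0.05 * features["txn_count_1m"])


if __name__ == "__main__":
    print(score_transaction("card-1", 20.0, healthy_store, toy_model))
    print(score_transaction("card-1", 900.0, failing_store, toy_model))
```

Carrying the `degraded` flag on every decision is one way to let monitoring count how often the fallback path is taken, which ties graceful degradation to observability.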


Real-Time Fraud Detection and Feature Store Design Patterns for Streaming ML in Financial Services
