| 000 | 08218nam a22002657a 4500 | ||
|---|---|---|---|
| 008 | 201210s2024 a|||f bm|| 00| 0 eng d | ||
| 024 | 7 |
_a0009-0007-1846-825X _2ORCID |
|
| 040 |
_aEG-CaNU _cEG-CaNU |
||
| 041 | 0 |
_aeng _beng _bara |
|
| 082 | _a610 | ||
| 100 | 0 |
_aTawfik Yasser Tawfik Abseif _93569 |
|
| 245 | 1 |
_aKeyed Watermarks _b: A Fine-grained Watermark Generation for Apache Flink _c/Tawfik Yasser Tawfik Abseif |
|
| 260 | _c2024 | ||
| 300 |
_a66 p. _bill. _c21 cm. |
||
| 500 | _3Supervisor: Mohamed ElHelw | ||
| 502 | _aThesis (M.A.)—Nile University, Egypt, 2024 . | ||
| 504 | _a"Includes bibliographical references" | ||
| 505 | 0 | _aContents: Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV 0.1 Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IX List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X Chapters: 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Background and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.3 Keyed Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Publications in the Course of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Background and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Stateful Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Notions of Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Processing-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 Event-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 Out-of-order Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Tumbling window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Sliding window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.3 Session window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Watermarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 VII 3. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1 Punctuations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4 Order-agnostic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.5 Ordered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.6 Timestamp Frontiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.7 Keyed Watermark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.8 Extended Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4. Keyed Watermarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.1 Watermarks in Apache Flink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2 Overview of Flink Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3 Vanilla Watermark Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.4 Keyed Watermark Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5. Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1.1 How we built the cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 6. Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1.3 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 | |
| 520 | 3 | _aAbstract: Big Data Stream processing engines, exemplified by Apache Flink, employ windowing techniques to manage unbounded streams of events. Aggregating relevant data within Windows is important for event-time windowing due to its impact on result accuracy. A pivotal role in this process is attributed to watermarks, unique timestamps signifying event progression in time. The existing watermark generation method within Apache Flink, operating at the input stream level, exhibits a bias towards faster sub-streams, causing the omission of events from slower counterparts. Our analysis determined that Apache Flink’s standard watermark generation approach results in an approximate 33% data loss when 50% of median-proximate keys experience delays. Furthermore, this loss exceeds 37% in cases where 50% of randomly selected keys encounter delays. In this thesis, we introduce an approach termed keyed watermarks to address data loss concerns and enhance data processing precision to a minimum of 99% in most scenarios. Moreover, our proposed solution reduces the latency of the watermark processing with multiple parallelism degrees for the watermark generator which enhances the scalability of stream processing in Apache Flink. Our strategy facilitates distinct progress monitoring by creating individualized watermarks for each logical sub-stream (key). Within our investigation, we delineate the essential architectural and API modifications requisite for integrating keyed watermarks while also highlighting our experience in navigating the expansion of Apache Flink’s extensive codebase. Moreover, we conduct a comparative evaluation between the efficacy of our approach and the conventional watermark generation technique concerning the accuracy of event-time tracking, the latency of watermark processing, and the growth of Flink’s maintained state. 0.1 Keywords: Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink | |
| 546 | _aText in English, abstracts in English and Arabic | ||
| 650 | 4 | _ainformatics | |
| 655 | 7 |
_2NULIB _aDissertation, Academic _9187 |
|
| 690 | _ainformatics | ||
| 942 |
_2ddc _cTH |
||
| 999 |
_c10897 _d10897 |
||