Keyed Watermarks : A Fine-grained Watermark Generation for Apache Flink / Tawfik Yasser Tawfik Abseif
Material type: Text
Language: English
Summary language: English, Arabic
Publication details: 2024
Description: 66 p. ill. 21 cm
DDC classification: 610
| Item type | Current library | Call number | Status | Date due | Barcode |
|---|---|---|---|---|---|
| Thesis | Main library | 610/T.Y.K/2024 | Not for loan | | |
Supervisor:
Mohamed ElHelw
Thesis (M.A.), Nile University, Egypt, 2024.
"Includes bibliographical references"
Contents:
Abstract
0.1 Keywords
Dedication
Acknowledgments
List of Tables
List of Figures
Chapters:
1. Introduction
1.1 Problem Statement
1.2 Organization of the Thesis
1.2.1 Background and Concepts
1.2.2 Related Work
1.2.3 Keyed Watermarks
1.2.4 Experimental Evaluation
1.2.5 Conclusions and Future Work
1.3 Publications in the Course of this Thesis
2. Background and Concepts
2.1 Stream Processing
2.1.1 Latency
2.1.2 Throughput
2.1.3 Stateful Processing
2.2 Notions of Time
2.2.1 Processing-time
2.2.2 Event-time
2.2.3 Out-of-order Processing
2.3 Windowing
2.3.1 Tumbling window
2.3.2 Sliding window
2.3.3 Session window
2.4 Watermarking
3. Related Work
3.1 Punctuations
3.2 Buffering
3.3 Watermarks
3.4 Order-agnostic
3.5 Ordered
3.6 Timestamp Frontiers
3.7 Keyed Watermark
3.8 Extended Approaches
4. Keyed Watermarks
4.1 Watermarks in Apache Flink
4.2 Overview of Flink Pipelines
4.3 Vanilla Watermark Generation
4.4 Keyed Watermark Generation
5. Experimental Evaluation
5.1 Experimental Setup
5.1.1 How we built the cluster
5.2 Datasets
5.3 Results
6. Conclusions and Future Work
6.1 Conclusion
6.1.1 Problem Statement
6.1.2 Proposed Solution
6.1.3 Findings
6.2 Future Work
Bibliography
Abstract:
Big data stream processing engines, exemplified by Apache Flink, employ windowing techniques
to manage unbounded streams of events. Aggregating the relevant data within each window is important
for event-time windowing because it determines result accuracy. Watermarks, unique timestamps that
signify the progression of event time, play a pivotal role in this process. The existing watermark
generation method in Apache Flink operates at the input stream level and is biased towards
faster sub-streams, causing events from slower sub-streams to be dropped. Our analysis determined
that Apache Flink’s standard watermark generation approach results in approximately 33% data
loss when 50% of median-proximate keys experience delays. Furthermore, this loss exceeds 37%
when 50% of randomly selected keys encounter delays.
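The bias described above can be illustrated with a toy model. The sketch below is not Flink code and does not reproduce the thesis's measurements. It assumes a single stream-level watermark equal to the largest timestamp seen so far minus a fixed out-of-orderness bound, and it treats an event as dropped when its timestamp is not greater than that watermark on arrival. The function name `count_late_events` and the sample events are hypothetical.

```python
# Toy model (assumption): one watermark for the whole input stream,
# computed as max timestamp seen so far minus a fixed bound.
def count_late_events(events, max_out_of_orderness):
    """Count events that arrive at or behind the stream-level watermark.

    events: list of (key, event_time) tuples in arrival order.
    """
    max_ts = float("-inf")
    late = 0
    for _key, ts in events:
        watermark = max_ts - max_out_of_orderness
        if ts <= watermark:
            late += 1  # a fast key has already pushed the watermark past this event
        max_ts = max(max_ts, ts)
    return late


# Hypothetical arrival order: the "slow" key lags behind the "fast" key.
events = [("fast", 10), ("fast", 20), ("slow", 5), ("fast", 30), ("slow", 12)]
print(count_late_events(events, 5))  # 2: both "slow" events are behind the watermark
```

In this model every dropped event belongs to the slower key, even though its events arrive in order relative to each other.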
In this thesis, we introduce an approach termed keyed watermarks to address data loss concerns
and raise data processing precision to at least 99% in most scenarios. Moreover, our
proposed solution reduces the latency of watermark processing when the watermark generator runs
with multiple degrees of parallelism, which enhances the scalability of stream processing in Apache Flink.
Our strategy tracks progress separately for each logical sub-stream (key) by generating an
individual watermark for it. We describe the architectural and API
modifications required to integrate keyed watermarks and report our experience
extending Apache Flink’s extensive codebase. Finally, we compare
our approach with the conventional watermark generation technique
in terms of the accuracy of event-time tracking, the latency of watermark processing, and the growth
of Flink’s maintained state.
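The idea of one watermark per key can be sketched in the same toy style. This is only an illustration of the concept, not the thesis's implementation inside Flink, which this record does not show. It assumes each key keeps its own largest timestamp, and an event counts as late only when it falls behind its own key's watermark. The function name `count_late_events_keyed` and the sample events are hypothetical.

```python
# Toy model (assumption): one watermark per key, each computed as that
# key's max timestamp seen so far minus a fixed bound.
def count_late_events_keyed(events, max_out_of_orderness):
    """Count events that arrive at or behind their own key's watermark.

    events: list of (key, event_time) tuples in arrival order.
    """
    max_ts_per_key = {}
    late = 0
    for key, ts in events:
        max_ts = max_ts_per_key.get(key, float("-inf"))
        if ts <= max_ts - max_out_of_orderness:
            late += 1  # late relative to this key's own progress only
        max_ts_per_key[key] = max(max_ts, ts)
    return late


# Hypothetical arrival order: the "slow" key lags behind the "fast" key.
events = [("fast", 10), ("fast", 20), ("slow", 5), ("fast", 30), ("slow", 12)]
print(count_late_events_keyed(events, 5))  # 0: the slow key is judged on its own progress
```

The per-key dictionary also hints at the cost the abstract mentions: the tracked state grows with the number of keys.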
0.1 Keywords:
Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink
Text in English, abstracts in English and Arabic