
Keyed Watermarks: A Fine-grained Watermark Generation for Apache Flink / Tawfik Yasser Tawfik Abseif

Material type: Text
Language: English
Summary language: English, Arabic
Publication details: 2024
Description: 66 p., ill., 21 cm
DDC classification: 610
Holdings:
Item type: Thesis
Current library: Main library
Call number: 610/T.Y.K/2024
Status: Not for loan

Supervisor:
Mohamed ElHelw

Thesis (M.A.), Nile University, Egypt, 2024.

"Includes bibliographical references"

Contents:
Abstract
0.1 Keywords
Dedication
Acknowledgments
List of Tables
List of Figures
Chapters:
1. Introduction
1.1 Problem Statement
1.2 Organization of the Thesis
1.2.1 Background and Concepts
1.2.2 Related Work
1.2.3 Keyed Watermarks
1.2.4 Experimental Evaluation
1.2.5 Conclusions and Future Work
1.3 Publications in the Course of this Thesis
2. Background and Concepts
2.1 Stream Processing
2.1.1 Latency
2.1.2 Throughput
2.1.3 Stateful Processing
2.2 Notions of Time
2.2.1 Processing-time
2.2.2 Event-time
2.2.3 Out-of-order Processing
2.3 Windowing
2.3.1 Tumbling window
2.3.2 Sliding window
2.3.3 Session window
2.4 Watermarking
3. Related Work
3.1 Punctuations
3.2 Buffering
3.3 Watermarks
3.4 Order-agnostic
3.5 Ordered
3.6 Timestamp Frontiers
3.7 Keyed Watermark
3.8 Extended Approaches
4. Keyed Watermarks
4.1 Watermarks in Apache Flink
4.2 Overview of Flink Pipelines
4.3 Vanilla Watermark Generation
4.4 Keyed Watermark Generation
5. Experimental Evaluation
5.1 Experimental Setup
5.1.1 How we built the cluster
5.2 Datasets
5.3 Results
6. Conclusions and Future Work
6.1 Conclusion
6.1.1 Problem Statement
6.1.2 Proposed Solution
6.1.3 Findings
6.2 Future Work
Bibliography

Abstract:
Big data stream processing engines, such as Apache Flink, use windowing to manage unbounded
streams of events. In event-time windowing, result accuracy depends on aggregating all relevant
events within their windows. Watermarks, special timestamps that signal how far event time has
progressed, play a pivotal role in this process. Apache Flink’s existing watermark generation
method operates at the level of the input stream and is biased towards faster sub-streams, so
events from slower sub-streams are omitted. Our analysis shows that Apache Flink’s standard
watermark generation approach loses approximately 33% of the data when 50% of the keys closest
to the median are delayed, and more than 37% when 50% of randomly selected keys are delayed.
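
The bias towards faster sub-streams can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's implementation and not Flink's API: the function name `run_global_watermark`, the sample events, the window size of 10, and the per-event watermark update (Flink emits watermarks periodically) are all hypothetical simplifications. It models one watermark shared by the whole stream and tumbling event-time windows with no allowed lateness.

```python
def run_global_watermark(events, window_size, max_out_of_orderness=0):
    """Simulate a single, stream-wide watermark (simplified model).

    events: iterable of (key, event_time) pairs in arrival order.
    Returns (accepted, dropped) lists of events.
    """
    watermark = float("-inf")
    accepted, dropped = [], []
    for key, ts in events:
        # Last timestamp that belongs to the tumbling window containing ts.
        window_last_ts = (ts // window_size + 1) * window_size - 1
        # The event is late if the watermark has already passed its window.
        if window_last_ts <= watermark:
            dropped.append((key, ts))
        else:
            accepted.append((key, ts))
        # Bounded-out-of-orderness style update: the watermark trails the
        # largest timestamp seen so far, regardless of which key produced it.
        watermark = max(watermark, ts - max_out_of_orderness - 1)
    return accepted, dropped


# Hypothetical input: key "a" is fast, key "b" arrives with a delay.
EVENTS = [("a", 0), ("a", 5), ("a", 12), ("b", 1), ("b", 3), ("a", 15), ("b", 11)]

accepted, dropped = run_global_watermark(EVENTS, window_size=10)
print(dropped)  # -> [('b', 1), ('b', 3)]
```

In this toy input the fast key "a" pushes the shared watermark past the first window before the delayed events of key "b" arrive, so only events from the slower key are discarded.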
In this thesis, we introduce an approach termed keyed watermarks to address data loss and raise
data processing accuracy to at least 99% in most scenarios. Our solution also reduces watermark
processing latency when the watermark generator runs at multiple degrees of parallelism, which
improves the scalability of stream processing in Apache Flink. The approach tracks progress
separately by generating an individual watermark for each logical sub-stream (key). We describe
the architectural and API modifications required to integrate keyed watermarks and report on our
experience navigating and extending Apache Flink’s extensive codebase. Finally, we compare our
approach with the conventional watermark generation technique in terms of the accuracy of
event-time tracking, the latency of watermark processing, and the growth of Flink’s maintained
state.
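
The idea of one watermark per logical sub-stream can be sketched as follows. This is a simplified model under stated assumptions, not the thesis's implementation and not Flink's API: the function name `simulate`, the `"__all__"` placeholder key, the sample events, the window size of 10, and the per-event watermark update are hypothetical. With `keyed=False` all events share one watermark; with `keyed=True` each key advances its own.

```python
def simulate(events, window_size, keyed):
    """Compare a shared watermark with per-key watermarks (simplified model).

    events: iterable of (key, event_time) pairs in arrival order.
    Returns (dropped, watermarks): the late events and the final watermark
    stored for each tracked sub-stream.
    """
    watermarks = {}  # one entry per key when keyed, a single entry otherwise
    dropped = []
    for key, ts in events:
        wm_key = key if keyed else "__all__"
        wm = watermarks.get(wm_key, float("-inf"))
        # Last timestamp that belongs to the tumbling window containing ts.
        window_last_ts = (ts // window_size + 1) * window_size - 1
        # Late if the watermark tracked for this event has passed its window.
        if window_last_ts <= wm:
            dropped.append((key, ts))
        # The watermark trails the largest timestamp seen for wm_key.
        watermarks[wm_key] = max(wm, ts - 1)
    return dropped, watermarks


# Hypothetical input: key "a" is fast, key "b" arrives with a delay.
EVENTS = [("a", 0), ("a", 5), ("a", 12), ("b", 1), ("b", 3), ("a", 15), ("b", 11)]

print(simulate(EVENTS, window_size=10, keyed=False)[0])  # -> [('b', 1), ('b', 3)]
print(simulate(EVENTS, window_size=10, keyed=True)[0])   # -> []
```

In the keyed run the delayed key "b" is measured only against its own progress, so none of its events are late. The cost is one watermark entry per key, which is consistent with the abstract listing the growth of maintained state among the evaluated metrics.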
0.1 Keywords:
Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink

Text in English, abstracts in English and Arabic
