MARC details
| 000 -LEADER |
| fixed length control field |
08218nam a22002657a 4500 |
| 008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION |
| fixed length control field |
201210s2024 a|||f bm|| 00| 0 eng d |
| 024 7# - Author Identifier |
| Standard number or code |
0009-0007-1846-825X |
| Source of number or code |
ORCID |
| 040 ## - CATALOGING SOURCE |
| Original cataloging agency |
EG-CaNU |
| Transcribing agency |
EG-CaNU |
| 041 0# - Language Code |
| Language code of text |
eng |
| Language code of abstract |
eng |
| -- |
ara |
| 082 ## - DEWEY DECIMAL CLASSIFICATION NUMBER |
| Classification number |
610 |
| 100 0# - MAIN ENTRY--PERSONAL NAME |
| Personal name |
Tawfik Yasser Tawfik Abseif |
| 245 1# - TITLE STATEMENT |
| Title |
Keyed Watermarks |
| Remainder of title |
: A Fine-grained Watermark Generation for Apache Flink |
| Statement of responsibility, etc. |
/Tawfik Yasser Tawfik Abseif |
| 260 ## - PUBLICATION, DISTRIBUTION, ETC. |
| Date of publication, distribution, etc. |
2024 |
| 300 ## - PHYSICAL DESCRIPTION |
| Extent |
66 p. |
| Other physical details |
ill. |
| Dimensions |
21 cm. |
| 500 ## - GENERAL NOTE |
| Materials specified |
Supervisor: Mohamed ElHelw |
| 502 ## - Dissertation Note |
| Dissertation type |
Thesis (M.A.)--Nile University, Egypt, 2024. |
| 504 ## - Bibliography |
| Bibliography |
Includes bibliographical references |
| 505 0# - Contents |
| Formatted contents note |
Contents:
Abstract
0.1 Keywords
Dedication
Acknowledgments
List of Tables
List of Figures
Chapters:
1. Introduction
1.1 Problem Statement
1.2 Organization of the Thesis
1.2.1 Background and Concepts
1.2.2 Related Work
1.2.3 Keyed Watermarks
1.2.4 Experimental Evaluation
1.2.5 Conclusions and Future Work
1.3 Publications in the Course of this Thesis
2. Background and Concepts
2.1 Stream Processing
2.1.1 Latency
2.1.2 Throughput
2.1.3 Stateful Processing
2.2 Notions of Time
2.2.1 Processing-time
2.2.2 Event-time
2.2.3 Out-of-order Processing
2.3 Windowing
2.3.1 Tumbling window
2.3.2 Sliding window
2.3.3 Session window
2.4 Watermarking
3. Related Work
3.1 Punctuations
3.2 Buffering
3.3 Watermarks
3.4 Order-agnostic
3.5 Ordered
3.6 Timestamp Frontiers
3.7 Keyed Watermark
3.8 Extended Approaches
4. Keyed Watermarks
4.1 Watermarks in Apache Flink
4.2 Overview of Flink Pipelines
4.3 Vanilla Watermark Generation
4.4 Keyed Watermark Generation
5. Experimental Evaluation
5.1 Experimental Setup
5.1.1 How we built the cluster
5.2 Datasets
5.3 Results
6. Conclusions and Future Work
6.1 Conclusion
6.1.1 Problem Statement
6.1.2 Proposed Solution
6.1.3 Findings
6.2 Future Work
Bibliography |
| 520 3# - Abstract |
| Abstract |
Big data stream processing engines, exemplified by Apache Flink, employ windowing techniques to manage unbounded streams of events. Aggregating relevant data within windows is important for event-time windowing due to its impact on result accuracy. A pivotal role in this process is attributed to watermarks, unique timestamps signifying event progression in time. The existing watermark generation method within Apache Flink, operating at the input stream level, exhibits a bias towards faster sub-streams, causing the omission of events from slower counterparts. Our analysis determined that Apache Flink’s standard watermark generation approach results in approximately 33% data loss when 50% of median-proximate keys experience delays. Furthermore, this loss exceeds 37% in cases where 50% of randomly selected keys encounter delays.

In this thesis, we introduce an approach termed keyed watermarks to address data loss concerns and enhance data processing precision to a minimum of 99% in most scenarios. Moreover, our proposed solution reduces the latency of the watermark processing with multiple parallelism degrees for the watermark generator, which enhances the scalability of stream processing in Apache Flink. Our strategy facilitates distinct progress monitoring by creating individualized watermarks for each logical sub-stream (key). Within our investigation, we delineate the essential architectural and API modifications requisite for integrating keyed watermarks while also highlighting our experience in navigating the expansion of Apache Flink’s extensive codebase. Moreover, we conduct a comparative evaluation between the efficacy of our approach and the conventional watermark generation technique concerning the accuracy of event-time tracking, the latency of watermark processing, and the growth of Flink’s maintained state.

Keywords: Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink |
| 546 ## - Language Note |
| Language Note |
Text in English, abstracts in English and Arabic |
| 650 #4 - Subject |
| Subject |
informatics |
| 655 #7 - Index Term-Genre/Form |
| Source of term |
NULIB |
| focus term |
Dissertation, Academic |
| 690 ## - Subject |
| School |
informatics |
| 942 ## - ADDED ENTRY ELEMENTS (KOHA) |
| Source of classification or shelving scheme |
Dewey Decimal Classification |
| Koha item type |
Thesis |
| 655 #7 - Index Term-Genre/Form |
| -- |
187 |
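
Illustrative note on the abstract: the contrast it draws between a stream-level watermark and one watermark per key can be sketched in a few lines of Python. This is a toy simulation, not Apache Flink code and not the thesis's implementation. The function name `count_dropped`, the sample events, and the rule "the watermark is the largest event time seen so far" (no out-of-orderness allowance) are all assumptions made for illustration.

```python
from collections import defaultdict


def count_dropped(events, window, per_key):
    """Count events discarded as late (hypothetical helper, toy model).

    events  -- (key, event_time) pairs in arrival order
    window  -- tumbling window size, in the same units as event_time
    per_key -- True: one watermark per key; False: one for the whole stream
    """
    # Assumption: the watermark is simply the largest event time seen so far.
    watermark = defaultdict(lambda: float("-inf"))
    dropped = 0
    for key, ts in events:
        wm_id = key if per_key else "__stream__"
        window_end = (ts // window + 1) * window
        # An event is late if the watermark already passed the end of its window.
        if window_end <= watermark[wm_id]:
            dropped += 1
        watermark[wm_id] = max(watermark[wm_id], ts)
    return dropped


# A fast key runs ahead; a slow key delivers its first window only afterwards.
events = (
    [("fast", t) for t in range(0, 21)]
    + [("slow", t) for t in range(0, 10)]
    + [("fast", t) for t in range(21, 30)]
)

print("stream-level watermark drops:", count_dropped(events, 10, per_key=False))
print("per-key watermark drops:", count_dropped(events, 10, per_key=True))
```

In this toy input the single stream-level watermark is driven forward by the fast key, so every event of the slow key arrives after its window has closed, while per-key tracking keeps them all. The loss percentages reported in the abstract come from the thesis's own experiments and are not reproduced by this sketch.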