Implementation of Work Flow Integration Techniques in Tavaxy for Bioinformatics Applications / Shady Alaa Eldin Eissa
Material type: Text
Language: English
Summary language: English
Publication details: 2013
Description: 155 p. ill. 21 cm
DDC classification: 610
| Item type | Current library | Call number | Status | Date due | Barcode |
|---|---|---|---|---|---|
| Thesis | Main library | 610/ SE.I 2013 | Not For Loan | | |
Supervisor: Mohamed Abou El-Hoda
Thesis (M.A.)—Nile University, Egypt, 2013 .
"Includes bibliographical references"
Contents:
1 Introduction
1.1 Biological Background
1.2 Bioinformatics
1.3 Sequence Analysis
1.4 Workflows and workflow management systems
1.5 Cloud Computing
1.6 Challenge
1.7 Publications
1.8 Thesis outline
2 Sequence Analysis Introduction
2.1 Introduction
2.2 Sequencing technologies
2.2.1 Sequencing technologies comparison
2.3 Data formats
2.3.1 FASTA
2.3.2 FASTQ
2.3.3 SAM/BAM
2.3.4 VCF
2.4 Software tools
2.4.1 Utility Tools
2.4.2 BLAST
2.4.3 Alignment
2.4.4 Mapping
2.4.5 Variant Calling
3 Related Work
3.1 Introduction
3.2 Workflow Systems
3.2.1 Formalism
3.2.2 Formal description
3.2.3 Workflow editor
3.2.4 Application domain
3.2.5 Control flow
3.2.6 Loops
3.2.7 Data flow
3.2.8 Storage
3.2.9 Execution engine
3.2.10 Execution middleware for grid or cluster
3.2.11 Enactment layer
3.2.12 Support for legacy code
3.2.13 Hierarchical workflows
3.2.14 Error handling
3.2.15 Workflow repository
3.3 Workflow interoperability
3.3.1 Interoperability between Taverna and Galaxy
3.4 Workflow patterns
3.5 Workflows and Cloud computing
3.5.1 Galaxy and the cloud
4 Taverna and Galaxy
4.1 Introduction
4.2 Taverna
4.2.1 Architecture
4.2.2 Workflow description language
4.2.3 Workflow execution engine
4.3 Galaxy
4.3.1 Architecture
4.3.2 Workflow description language
4.3.3 Workflow execution engine
4.4 Comparison of Taverna and Galaxy
5 Tavaxy
5.1 Introduction
5.2 Tavaxy architecture
5.3 Workflow pattern database
5.3.1 Control patterns
5.3.2 Advanced data patterns
5.4 Workflow authoring tool
5.5 Workflow description language
5.5.1 Tool description file
5.6 Workflow mapper
5.6.1 Executable workflow
5.6.2 Translating workflow description languages
5.6.3 Workflow execution optimization
5.7 Integrating Galaxy and Taverna workflows in Tavaxy
5.8 Workflow engine
6 Cloud computing support inside Tavaxy
6.1 Introduction
6.2 Amazon Web Services
6.3 elasticHPC
6.4 Scenarios
6.5 Tavaxy on the Cloud
6.6 Tool on the Cloud
6.7 Subworkflow on the Cloud
7 Case studies and Experiments
7.1 Overview
7.2 Case study I
7.2.1 Composing heterogeneous sub-workflows on Tavaxy
7.2.2 Measuring the performance
7.3 Case study II
7.3.1 Metagenomics workflow
7.3.2 Measuring the performance
7.4 Use of cloud computing
8 Conclusion and future work
8.1 Conclusion
8.1.1 Pattern based workflows interoperability
8.1.2 Use of patterns
8.1.3 Cloud computing support
8.2 Publications
8.3 Future work
A Translating SCUFL to tSCUFL
A.1 Introduction
A.1.1 Taverna language
A.1.2 Tavaxy language
A.1.3 Galaxy language
A.2 Simple Workflow example
A.2.1 SCUFL representation
A.2.2 Galaxy JSON representation
A.3 Translating processors and links
A.3.1 Translation with replacement
A.3.2 Translation without replacement
A.4 Translating dependency links
A.5 Translating conditionals
A.6 Translating Iterations
B Workflow management systems survey
B.1 Introduction
B.2 General features
B.3 Supported workflows
B.4 Workflow execution
B.5 Storage and sharing
C Patterns implementation
C.1 Introduction
C.2 Switch pattern
C.3 Data select pattern
C.4 Data merge pattern
C.5 Iteration pattern
Bibliography
Abstract:
Sequence analysis is one of the most important domains within bioinformatics. This
is because the sequence of nucleotides in DNA determines the function and structure of
the produced protein, which in turn controls the functions and reproduction of the whole
cell.
The sequence analysis domain includes versatile tasks such as sequence alignment,
protein structure prediction, and database search, among others. These tasks are
addressed using a growing set of tools, which differ in terms of usage and
computational requirements. Recent analysis endeavors require the use of multiple tools
in a pipelined fashion. Workflow management systems have emerged to facilitate the
design and execution of such pipelines.
In the last few years, many workflow systems have been developed. Within the bioinformatics
community, Taverna and Galaxy are the most popular. However, there is an
interoperability problem between them: a workflow developed for one system cannot
run on the other. Furthermore, not all of these systems can easily handle the large amounts
of biological data produced by Next Generation Sequencing technologies,
and cloud computing, an important emerging paradigm, is not fully utilized in an
easy-to-use way.
In this thesis, we present Tavaxy, a pattern-based bioinformatics workflow management
system with cloud computing support. It brings together the features of Taverna
and Galaxy: a workflow written for either system can be imported into
Tavaxy, modified within its environment, and executed. In addition, workflows within
Tavaxy can run on cloud computing infrastructure. We also provide a set of data patterns
that facilitate large-scale parallel processing of large datasets.
Text in English, abstracts in English.