- \chapter{Introduction}
- % general information and intro
- %Understanding how things in our cosmos work, was and still is a pleasure, that the human being always wants to fulfill.
- Understanding the biological code of living things is a continually evolving task that plays a significant part in many aspects of our lives. Research in this area provides knowledge that supports progress in medicine, agriculture and other fields \cite{ju_21, wang_22, mo_83}.
- Insight into this biological code can be gained by storing and studying the information embedded in genomes \cite{dna_structure}. Since life is complex, this information is extensive and requires a large amount of storage space \cite{alok17, survey}.\\
- % ...Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
- % compression values and goals
- %With compression algorithms and their implementation in tools, the problem of storing information got smaller.
- Compression algorithms and their implementations have helped to reduce the problem of storing this information. Compressed data requires less space and therefore less time to be transported over networks \cite{Shannon_1948}. This advantage scales with the amount of data and, since genetic information needs a lot of storage even in a compressed state, further improvements are welcome \cite{moffat_arith}. Since this field is relatively new compared to others, such as theoretical computer science, which laid the foundation for compression algorithms, there is much to discover and new findings are not unusual \cite{Shannon_1948, sam12, geco}. From some of these findings, new tools can be developed. In general, they focus on improving at least one of two factors: the speed at which data is compressed and the compression ratio, meaning the ratio between the sizes of the uncompressed and the compressed data \cite{moffat_arith, alok17, Shannon_1948}.\\
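- To make the second factor concrete: one common convention, assumed here for illustration and not necessarily the one used in the cited works, defines the compression ratio as \[ r = \frac{S_{\mathrm{raw}}}{S_{\mathrm{comp}}}, \] where $S_{\mathrm{raw}}$ and $S_{\mathrm{comp}}$ are hypothetical symbols for the file size before and after compression. Under this convention, a 100\,MB file reduced to 25\,MB yields $r = 4$; some authors report the inverse quotient instead.\\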
- % ...
- % more exact explanation
- % actual focus in short and simple terms
- New discoveries about universal rules in the stochastic organization of genomes might provide a basis for new algorithms and therefore for new genome compression tools or improvements to existing ones \cite{pet21}. The aim of this work is to analyze the current state of the art of compression tools for biological data and of the probabilistic algorithms they implement. Furthermore, this work will determine whether there is room for optimization.\\
- The discussion will include a high-level analysis of how and where this new approach could be implemented and which problems might need to be addressed in the process.\\
- % focus and structure of work in greater detail
- To establish common ground, the first pages give the reader a brief overview of the structure of human DNA, including explanations of some basic terms used in biology and computer science. The first step into the theory of genome compression is a description of the differences between common file formats used to store genome information. A section on the foundations of compression follows. It analyzes the differences between compression approaches, covers some history of coding theory and leads to a closer look at the fundamentals of state-of-the-art compression algorithms. The chapter ends with a few pages on implementations of compression algorithms in relevant tools.\\
- In order to measure an optimization, a baseline must be set. Therefore, the efficiency and effectiveness of suitable state-of-the-art tools will be measured. To be as precise as possible, the middle part of this work focuses on setting up an environment, selecting input data, installing and executing the tools and finally measuring and documenting the results.\\
- These results, combined with an understanding of how the tools work, will show whether an improvement is needed and which factor it should focus on. The final part of this work discusses the properties of a possible optimization, how its feasibility could be determined and which problems such a project would need to overcome.\\
- % todo:
- % explain: coding
- % find uniform representation for: {letter;symbol;char} {dna;genome;sequence}
- % todo 22-11-07: change symbol to char and text to message. This might need a forward reference to shannons work, which is best placed in the intro
- %- COMPRESSION APPROACHES
- % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
- % - HUFFMAN ENCODING
- % - PROBABILITY APPROACHES (WITH BASE?)
- %
- %- COMPARING TOOLS
- %
- %- POSSIBLE IMPROVEMENT
- % - \acs{DNA}S STOCHASTICAL ATTRIBUTES
- % - IMPACT ON COMPRESSION