\chapter{Introduction}
% general information and intro
Understanding how our cosmos works was, and still is, a desire that human beings have always sought to fulfill. Insights into the most basic form of organic life are possible by storing and studying the information embedded in genetic codes. Since life is complex, this information is vast and requires a lot of memory.\\
% ...Communication with other researchers means sending huge chunks of data through cables or over the air, which costs time and makes raw data vulnerable to errors.\\
% compression values and goals
Compression tools mitigate the problem of storing this information. Compressed data requires less space and therefore less time to be transported over networks. This advantage scales with data size, and since genetic information needs a lot of storage even in a compressed state, improvements are welcome. Compared to fields like computation theory and general-purpose compression, genome compression is relatively new, so there is much to discover and new findings are not unusual. From some of these findings, new tools can be developed. Ideally, they improve two factors: the speed at which data is compressed and the compression ratio, i.e.\ the ratio between the uncompressed and the compressed data size.\\
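As a working definition (the exact metric may be refined in later chapters, and the choice of units is an assumption here), the compression ratio $r$ for an input of size $s_u$ bytes compressed to $s_c$ bytes can be sketched as:
\begin{equation}
	r = \frac{s_u}{s_c}
\end{equation}
Under this definition, a higher $r$ means better compression; for example, reducing a file from 100\,MB to 25\,MB yields $r = 4$.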
% ...
% more exact explanation
% actual focus in short and simple terms
New discoveries about universal rules in the stochastic organization of genomes might provide a basis for new algorithms, and therefore for new genome compression tools or improvements to existing ones. The aim of this work is to analyze the current state of the art in probabilistic compression tools and their algorithms, and ultimately to determine whether the mentioned discoveries are already used. \texttt{might be thrown out due to time limitations} -> If this is not the case, an analysis will follow of how and where this new approach could be implemented and whether it would improve compression methods.\\
% focus and structure of work in greater detail
To establish common ground, the first pages give the reader a quick overview of the structure of human DNA. There is also a brief explanation of some basic terms used in biology and computer science. The knowledge base of this work is formed by describing the differences between the file formats used to store genome data. In addition, a section relevant to compression follows, covering the state of the art in coding theory.\\
To measure an improvement, a baseline must first be set. Therefore, the efficiency and effectiveness of suitable state-of-the-art tools will be measured. To be as precise as possible, the main part of this work focuses on setting up an environment, picking input data, installing and executing tools, and finally measuring and documenting the results.\\
With this information, a static code analysis of the mentioned tools follows. This will show where a possible new algorithm, or an improvement to an existing one, could be implemented. Running a proof-of-concept implementation under the same conditions and comparing its runtime and compression ratio against the defined baseline shows the potential of the new approach for compression with probability-based algorithms.
% todo:
% explain: coding
% find uniform representation for: {letter;symbol;char} {dna;genome;sequence}
% todo 22-11-07: change symbol to char and text to message. This might need a forward reference to Shannon's work, which is best placed in the intro
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%- COMPARING TOOLS
%
%- POSSIBLE IMPROVEMENT
% - \acs{DNA}S STOCHASTIC ATTRIBUTES
% - IMPACT ON COMPRESSION