\chapter{Introduction}
% general information and intro
%Understanding how things in our cosmos work, was and still is a pleasure, that the human being always wants to fulfill.
Understanding the biological code of living organisms is a continuously evolving task that matters for many aspects of our lives. Research in this area yields knowledge that supports progress in medicine, agriculture, and other fields \cite{ju_21, wang_22, mo_83}.
Insight into this biological code can be gained by storing and studying the information embedded in genomes \cite{dna_structure}. Because life is complex, the amount of information is large, and storing it requires a large amount of memory \cite{alok17, survey}.\\
% ...Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
% compression values and goals
Compression algorithms and the tools that implement them have reduced the problem of storing this information. Compressed data requires less space and therefore less time to be transferred over networks \cite{Shannon_1948}. This advantage scales with the amount of data, and since genetic information needs a lot of storage even in compressed form, improvements are welcome \cite{moffat_arith}. Compared to fields such as theoretical computer science, which laid the foundation for compression algorithms, genome compression is relatively new, so there is much to discover and new findings are not unusual \cite{Shannon_1948}. Some of these findings can lead to new tools. In general, such tools aim to improve at least one of two factors: the speed at which data is compressed and the compression ratio, meaning the ratio between the sizes of the uncompressed and the compressed data \cite{moffat_arith, alok17, Shannon_1948}.\\
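As a point of reference, the compression ratio can be written as a quotient of sizes. The following is a sketch under one common convention and not a definition taken from the cited sources; the symbols $S_{\mathrm{uncompressed}}$ and $S_{\mathrm{compressed}}$ are introduced here only for illustration and denote the size of the data before and after compression:
\begin{equation}
	\mathrm{ratio} = \frac{S_{\mathrm{uncompressed}}}{S_{\mathrm{compressed}}}
\end{equation}
For example, a file of 100\,MB that compresses to 25\,MB has a ratio of $100/25 = 4$ under this convention. Some sources report the inverse quotient instead, so the convention in use should always be stated.\\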
% ...
% more exact explanation
% actual focus in short and simple terms
New discoveries about universal rules in the stochastic organization of genomes might provide a basis for new algorithms, and therefore for new genome compression tools or improvements to existing ones \cite{pet21}. The aim of this work is to analyze the current state of the art of compression tools for biological data and of the probabilistic algorithms they implement. Furthermore, this work will determine whether there is room for improvement.\\
The discussion will include a brief analysis of how and where this new approach could be implemented and which problems may need to be addressed in the process.\\
% focus and structure of work in greater detail
To establish common ground, the first pages give the reader a quick overview of the structure of human DNA, along with a basic explanation of some fundamental terms used in biology and computer science. The first step into the theory of genome compression is taken by describing the differences between common file formats used to store genome information. A section that is relevant for understanding compression follows. It analyzes the differences between compression approaches, covers some of the history of coding theory, and leads to a closer look at the fundamentals of state-of-the-art compression algorithms. The chapter ends with a few pages on relevant tools that implement compression algorithms.\\
To measure an improvement, a baseline must be set. Therefore, the efficiency and effectiveness of suitable state-of-the-art tools will be measured. To be as precise as possible, the middle part of this work focuses on setting up an environment, selecting input data, installing and executing the tools, and finally measuring and documenting the results.\\
These results, combined with an understanding of how the tools work, will show whether an improvement is needed and which factor it should focus on. The end of this work discusses the properties of a possible improvement, how its feasibility could be determined, and which problems such a project would need to overcome.\\
% todo:
% explain: coding
% find uniform representation for: {letter;symbol;char} {dna;genome;sequence}
% todo 22-11-07: change symbol to char and text to message. This might need a forward reference to shannons work, which is best placed in the intro
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%- COMPARING TOOLS
%
%- POSSIBLE IMPROVEMENT
% - \acs{DNA}S STOCHASTICAL ATTRIBUTES
% - IMPACT ON COMPRESSION