%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%-
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}'S STOCHASTIC ATTRIBUTES
%- IMPACT ON COMPRESSION
\chapter{Compression approaches}
Data compression aims to produce an output that is smaller than its input. In many settings, genome compression among them, the compression should ideally be lossless. This means that decompressing the compressed data recovers exactly the information contained in the original data. Lossy compression, on the other hand, may discard parts of the data during compression in order to increase the compression ratio. The discarded parts are typically not required to convey the original information, which is why lossy schemes work well for certain audio and image formats or for network protocols that stream audio and video live.
For \acs{DNA}, lossless compression is required. To be precise, lossy compression is not an option, because there is no dispensable data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete.
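To make the lossless requirement concrete, the following Python sketch shows the natural baseline of two bits per nucleotide. It is purely illustrative, not code from any tool discussed later; the symbol table and the function names \texttt{pack}/\texttt{unpack} are chosen freely for this example. The assertion at the end demonstrates the exact round trip.
\begin{verbatim}
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq):
    """Pack a nucleotide string into two bits per symbol."""
    bits = 1                           # sentinel bit keeps leading zeros
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    return bits.to_bytes((bits.bit_length() + 7) // 8, "big")

def unpack(data):
    """Exact inverse of pack(): the round trip loses no information."""
    bits = int.from_bytes(data, "big")
    out = []
    while bits > 1:                    # stop when only the sentinel is left
        out.append(BASE[bits & 0b11])
        bits >>= 2
    return "".join(reversed(out))

seq = "ACGTACCA"
assert unpack(pack(seq)) == seq        # lossless: full information recovered
\end{verbatim}
Two bits per symbol already reduce an eight-bit-per-character text representation to a quarter of its size; every dedicated \acs{DNA} compressor has to improve on this baseline.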
\section{Huffman coding}
% list of algos and the tools that use them
The well-known Huffman coding is used in several tools for genome compression. This section gives the reader a general impression of how the algorithm works without going into detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The code is organized as a binary tree, for which a few simple rules apply:
% binary view for alphabet
% length n of sequence to compress
% greedy algo
\begin{itemize}
\item every symbol of the alphabet is a leaf
\item at every node, the right branch is labelled 1 and the left branch 0
\item every symbol has a weight, defined by the frequency with which the symbol occurs in the input text
\item the greater a node's weight, the higher the probability that its symbol is read next in the input, and the shorter its code should therefore be
\end{itemize}
The construction starts with the nodes of lowest weight and builds up to the highest. Each step merges the two lightest nodes into a new subtree, so that in the finished tree the leftmost branch is the shortest and the rightmost the longest. The leftmost branch ends at the symbol with the highest weight, i.e. the symbol that occurs most often in the input data.
Following the path from the root to a leaf yields the binary representation of that symbol. For the alphabet described above, a fixed-width representation could initially look like \texttt{A -> 00, C -> 01, G -> 10, T -> 11}. Given a sequence with the occurrence counts \texttt{A: 10, C: 8, G: 3, T: 2}, the corresponding Huffman tree yields the codes \texttt{A -> 0, C -> 10, G -> 110, T -> 111}; encoding the 23 symbols then takes $10 \cdot 1 + 8 \cdot 2 + 3 \cdot 3 + 2 \cdot 3 = 41$ bits instead of the $23 \cdot 2 = 46$ bits of the fixed-width code.
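The construction can be reproduced in a few lines of Python. The sketch below is a minimal illustration only; the function names and the canonical tie-breaking are chosen for this text and are not taken from any of the compared tools. It first derives the code lengths by greedily merging the two lightest nodes, then assigns the bit strings canonically so that the shortest code sits on the leftmost, all-zero path, as the rules above suggest.
\begin{verbatim}
import heapq

def huffman_code_lengths(freqs):
    """Greedy Huffman step: repeatedly merge the two lightest nodes."""
    lengths = {s: 0 for s in freqs}
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)                       # unique tie-breaker for equal weights
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)    # lightest node
        w2, _, b = heapq.heappop(heap)    # second lightest node
        for s in a + b:                   # each merge deepens all leaves below
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, a + b))
        tie += 1
    return lengths

def canonical_codes(freqs):
    """Assign bits so shorter codes (heavier symbols) take the 0-branches."""
    lengths = huffman_code_lengths(freqs)
    order = sorted(freqs, key=lambda s: (lengths[s], -freqs[s]))
    codes, code, prev = {}, 0, lengths[order[0]]
    for s in order:
        code <<= lengths[s] - prev        # shift down when codes get longer
        codes[s] = format(code, "0{}b".format(lengths[s]))
        prev = lengths[s]
        code += 1
    return codes

print(canonical_codes({"A": 10, "C": 8, "G": 3, "T": 2}))
# -> {'A': '0', 'C': '10', 'G': '110', 'T': '111'}
\end{verbatim}
Note that swapping the 0 and 1 labels at any node yields an equally valid Huffman code of the same total length; the canonical assignment is merely the convention that matches the tree layout described above.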
% definition of terms: alphabet,
% leaves
% weights
% paths
% (genomic squeeze <- official | unofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
\section{Probability approaches}