%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%-
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}'S STOCHASTIC ATTRIBUTES
%- IMPACT ON COMPRESSION
\newcommand{\mycomment}[1]{}
% entropy: see fim documentation, grundlagen2
% merge dna nucleotide into a single chapter -> structure of dna. also drop the chapter heading (too generic)
% file structure <-> datatypes. describe in more detail: e.g. file formats to store dna
% drop 3.2.1
\chapter{Compression approaches}
The goal of data compression is to generate an output that is smaller than the input data. In many cases, as in genome compression, the compression is ideally lossless: for every compressed file, decompressing it recovers the full information that was available in the original data. Lossy compression, on the other hand, may discard parts of the data during compression in order to increase the compression rate. The discarded parts are typically not necessary to convey the original information. This works well for certain audio and image files, or for network protocols used to transmit live video/audio streams.
For \acs{DNA}, lossless compression is needed. To be precise, lossy compression is not an option, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Both are described in detail below.\\
\section{Dictionary coding}
Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a sequence of characters representing a full word or just a part of it. The following short digression illustrates this for a better understanding of dictionary coding, but is of no great relevance to the focus of this work:
% digression
Looking at the string 'stationary', it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that is to be compressed.
% end digression
The dictionary should only store strings that occur in the input data. Moreover, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore, the dictionary is built from the input data itself: each first occurrence is left uncompressed, and every later occurrence of a string points back to its first occurrence. Since such a 'pointer' needs less space than the string it points to, the overall size decreases.\\
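To make the pointer idea concrete, the following Python sketch implements a toy dictionary coder. It is a minimal illustration, not the scheme of any particular tool: the token format, the minimum match length of four and the linear search are all simplifying assumptions made here.
\begin{verbatim}
MIN_MATCH = 4  # assumption: shorter matches are cheaper kept as literals

def compress(data):
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # search the already processed prefix for the longest earlier match
        for j in range(i):
            k = 0
            while (i + k < len(data) and j + k < i
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= MIN_MATCH:
            out.append((best_dist, best_len))  # pointer to first occurrence
            i += best_len
        else:
            out.append(data[i])                # literal symbol
            i += 1
    return out

def decompress(tokens):
    data = []
    for t in tokens:
        if isinstance(t, tuple):               # pointer token
            dist, length = t
            for _ in range(length):
                data.append(data[-dist])
        else:                                  # literal token
            data.append(t)
    return "".join(data)

seq = "ACGTACGTACGTTTTT"
assert decompress(compress(seq)) == seq
\end{verbatim}
For the sequence above, the coder emits the four literals \texttt{A, C, G, T} followed by two pointer tokens and a few literal \texttt{T}s; the pointers are what save space once the referenced strings are long enough.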
% unusable due to the lack of probability
\mycomment{
% - known algo
\subsection{The LZ Family}
The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77} and \texttt{LZ78}, which are short for Lempel--Ziv 1977 and 1978; the number at the end indicates when the algorithm was published. Today, derivatives of \ac{LZ77} are widely used in Unix compression tools such as gzip. Those tools are also used for compressing \ac{DNA}.\\
\ac{LZ77} basically works by removing repetitions of a string or substring and replacing them with the information where the first occurrence can be found and how long the match is. Typically this pair is stored in a few bytes, with more bits spent on the distance to the first occurrence than on the length, because usually less than one byte is required to store the length.\\
% example
}
\section{Entropy coding}
\subsection{Huffman coding}
The well-known Huffman coding is used in several tools for genome compression. This subsection gives the reader a general impression of how the algorithm works, without going into detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is represented as a binary tree, to which a few simple rules apply:
% binary view for alphabet
% length n of the sequence to compress
% greedy algo
\begin{itemize}
\item every symbol of the alphabet is a leaf
\item the right branch of every node is marked with a 1, the left one with a 0
\item every symbol has a weight, defined by the frequency with which the symbol occurs in the input text
\item the more weight a node has, the higher the probability that its symbol is read next in the symbol sequence
\end{itemize}
The tree is built starting from the nodes with the lowest weight and working up to the highest: in each step, the two nodes with the smallest weights are merged under a new parent node. As a result, the leftmost branch is the shortest and the rightmost the longest. The leftmost branch ends in the symbol with the highest weight, i.e. the symbol that occurs most often in the input data.
Following one path from the root to a leaf yields the binary representation of one symbol. For comparison, the plain ASCII encoding of the four-letter alphabet described above is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}, i.e. eight bits per symbol. Consider an imaginary sequence with the following character counts: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information, a weight is calculated for each character, for example by dividing its number of occurrences by the length of the sequence. With the corresponding tree created from these weights, the binary representation of each symbol changes to \texttt{A -> 0, C -> 11, G -> 101, T -> 100}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
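The construction described above can be reproduced in a few lines of Python. The following is a minimal sketch of the textbook greedy algorithm using a min-heap; the exact codewords depend on how ties between equal weights are broken, but the code lengths agree with the example.
\begin{verbatim}
import heapq

def huffman_codes(counts):
    # one entry per symbol: (weight, tie-breaker, partial code table)
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        # greedy step: merge the two subtrees with the lowest weights
        w1, _, t1 = heapq.heappop(heap)
        w2, i2, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}        # left branch: 0
        merged.update({s: "1" + c for s, c in t2.items()})  # right branch: 1
        heapq.heappush(heap, (w1 + w2, i2, merged))
    return heap[0][2]

# character counts from the example above
codes = huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2})
# yields A -> 0, C -> 11, G -> 101, T -> 100 (1, 2, 3 and 3 bits)
\end{verbatim}
Note that the frequent symbol \texttt{A} receives a single bit while the rare \texttt{T} needs three, which is exactly where the compression gain comes from; the code table itself must be stored alongside the compressed data, as mentioned above.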
\section{Implementations}
\subsection{GeCo}
\subsection{Genie}
\subsection{SAMtools}
\mycomment{
\subsection{\ac{CABAC}}
% a form of entropy coding
% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
\section{Implementations}
% SAM - LZ4 src: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
% GeCo - arithmetic coding
% Genie - CABAC
% following text is irrelevant. Just describe the used algorithms in the comparison chapter and refer to their base algorithms
% mix of Huffman and lz77
The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To be more specific, the raw data is first compressed with \ac{LZ77}, and the resulting output is then shortened using Huffman coding.
% huffman - little endian
% lz77 compressed - big endian (least significant byte first/most left)
}