|
|
|
|
% finite precision
|
|
|
The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that the resulting number does not exceed a certain length. Digital data types are limited in their capacity: an unsigned 64-bit integer, for example, can represent values up to $2^{64}-1$, i.e. any number between 0 and 18,446,744,073,709,551,615. That might seem like a great amount at first, but with an unfavorable alphabet that extends the result's length by one bit for each symbol that is read, only texts of length 63 can be encoded (62 if \acs{EOF} is excluded).
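The worst case above can be made concrete with a small sketch (an illustration assumed for this text, not part of any coder): if every symbol narrows the coding interval by one bit, a 64-bit register is exhausted after 63 symbols.

```python
# Illustrative sketch: halving a 64-bit interval once per symbol,
# as the unfavorable alphabet described above would force.
WIDTH_BITS = 64
low, high = 0, (1 << WIDTH_BITS) - 1   # full unsigned 64-bit range

symbols_encoded = 0
while high - low > 1:                  # interval can still tell two values apart
    mid = (low + high) // 2
    low = mid + 1                      # worst case: one bit consumed per symbol
    symbols_encoded += 1

print(symbols_encoded)  # → 63
```

After 63 halvings the interval no longer distinguishes two values, matching the limit of 63 encodable symbols stated above.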
|
|
|
|
|
|
|
|
|
|
\subsection{Huffman encoding}
\label{k4:huff}
|
|
|
% list of algos and the tools that use them
|
|
|
D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed by Shannon and Fano and named after its developers. The \ac{SF} coding is not used today because Huffman coding is superior in both efficiency and effectivity. % todo any source to last sentence. Rethink the use of finite in the following text
|
|
|
|
|
\section{DEFLATE}
|
|
|
% mix of huffman and LZ77
|
|
|
The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. It is used in well-known tools like gzip.
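DEFLATE is directly accessible from Python's standard library, whose \texttt{zlib} module wraps the reference implementation; a minimal round-trip on repetitive (DNA-like) input shows the effect:

```python
import zlib

# zlib.compress() wraps a DEFLATE stream in a small zlib header/trailer;
# a repetitive input gives LZ77 plenty of back-references to exploit.
data = b"GATTACA" * 100
compressed = zlib.compress(data, level=9)

assert zlib.decompress(compressed) == data   # lossless round trip
print(len(data), len(compressed))            # compressed is far smaller
```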
|
|
|
|
|
|
|
|
|
|
Data is split into blocks. Each block stores a header consisting of three bits. A single block can be stored in one of three forms, each of which is represented by an identifier stored in the last two bits of the header.
|
|
|
|
|
\begin{itemize}
    \item \texttt{00} No compression.
    \item \texttt{01} Compressed with a fixed set of Huffman codes.
    \item \texttt{10} Compressed with dynamic Huffman codes.
\end{itemize}
|
|
|
|
|
The remaining combination \texttt{11} is reserved and marks a faulty block. The third, leading bit is set to flag the last block of the data \cite{rfc1951}.
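Reading these three header bits can be sketched as follows (an illustration, not a full DEFLATE parser; DEFLATE packs bits least-significant first):

```python
# Hypothetical helper: extract the three DEFLATE block-header bits
# from the first byte of a block.
def read_block_header(first_byte: int) -> tuple[bool, int]:
    bfinal = bool(first_byte & 0b1)      # bit 0: last-block flag
    btype = (first_byte >> 1) & 0b11     # bits 1-2: block type identifier
    if btype == 0b11:
        raise ValueError("reserved block type 11 marks a faulty stream")
    return bfinal, btype

print(read_block_header(0b101))  # → (True, 2): final block, dynamic Huffman
```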
|
|
|
|
|
% lz77 part
|
|
|
|
|
As described in \ref{k4:lz77}, compression with \acs{LZ77} results in literals and pointers, each pointer being represented by a match length and the distance back to the literal it points to.\\
|
|
|
|
|
The \acs{LZ77} algorithm is executed before the Huffman algorithm. The further compression steps differ from the already described algorithms and are covered in the remainder of this section.\\
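The shape of the intermediate \acs{LZ77} output can be illustrated with a toy matcher (a greedy sketch for illustration only; real DEFLATE uses hash chains and a 32 KiB window):

```python
# Hypothetical illustration: LZ77-style output as literals plus
# (length, distance) back-references, the input to the Huffman stage.
def lz77_tokens(text: str, min_match: int = 3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for dist in range(1, i + 1):           # search the already-seen prefix
            length = 0
            while (i + length < len(text)
                   and text[i + length] == text[i + length - dist]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, dist
        if best_len >= min_match:
            tokens.append((best_len, best_dist))   # back-reference
            i += best_len
        else:
            tokens.append(text[i])                 # literal
            i += 1
    return tokens

print(lz77_tokens("GATTGATT"))  # → ['G', 'A', 'T', 'T', (4, 4)]
```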
|
|
|
|
|
% huffman part
|
|
|
|
|
Besides the header bits and the data itself, each block stores two Huffman code trees: one encodes literals and lengths, the other distances. Both are stored in a compact form. This is achieved by adding two rules on top of those described in \ref{k4:huff}: codes of identical length are ordered lexicographically by the characters they represent, and shorter codes precede longer codes.\\
|
|
|
|
|
To illustrate this with an example:
|
|
|
|
|
For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, occurring more often than the other two characters, the codes would change to a representation like this:
|
|
|
|
|

\sffamily
\begin{footnotesize}
    \begin{longtable}[h]{ p{.3\textwidth} p{.3\textwidth}}
        \toprule
        \textbf{Symbol} & \textbf{Huffman code}\\
        \midrule
        A & 0\\
        C & 10\\
        G & 11\\
        \bottomrule
    \end{longtable}
\end{footnotesize}
\rmfamily
|
|
|
|
|
Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To keep the codes prefix-free, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is assigned to \texttt{C}.\\
|
|
|
|
|
With these simple rules, the alphabet itself can be compressed too: instead of storing the codes, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing many blocks of data, a smaller alphabet description can make a relevant difference.
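The reconstruction of the codes from their lengths can be sketched as follows (following the canonical-code procedure of RFC 1951, section 3.2.2; the function name is chosen here for illustration):

```python
# Rebuild canonical Huffman codes from code lengths alone:
# shorter codes first, ties broken lexicographically by symbol.
def codes_from_lengths(lengths: dict[str, int]) -> dict[str, str]:
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # moving to a longer length appends zeros
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

print(codes_from_lengths({"A": 1, "C": 2, "G": 2}))
# → {'A': '0', 'C': '10', 'G': '11'}
```

Given only the lengths 1, 2 and 2, this reproduces exactly the codes from the table above, which is why storing the lengths suffices.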
|
|
|
|
|
+
|
|
|
|
|
% example header, alphabet, data block?
|
|
|
|
|
|
|
|
\section{Implementations in Relevant Tools}
|
|
|
\subsection{} % geco

\subsection{} % genie

\subsection{} % samtools
|
|
|
|
|
|
|
|
|
|
|