|
|
|
|
% finite precision
|
|
|
The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that the resulting number does not exceed a certain length. Digital data types are limited in their capacity: an unsigned 64-bit integer, for example, can represent values up to $2^{64}-1$, i.e. any number between 0 and 18,446,744,073,709,551,615. That might seem like a great amount at first, but with an unfavorable alphabet that extends the result's length by one bit for each symbol that is read, only texts of length 63 can be encoded (62 if \acs{EOF} is excluded).
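The worst case above can be made concrete with a small sketch (an illustration assumed for this text, not part of any coder): if every symbol narrows the coding interval by one bit, a 64-bit register is exhausted after 63 symbols.

```python
# Illustrative sketch: halving a 64-bit interval once per symbol,
# as the unfavorable alphabet described above would force.
WIDTH_BITS = 64
low, high = 0, (1 << WIDTH_BITS) - 1   # full unsigned 64-bit range

symbols_encoded = 0
while high - low > 1:                  # interval can still tell two values apart
    mid = (low + high) // 2
    low = mid + 1                      # worst case: one bit consumed per symbol
    symbols_encoded += 1

print(symbols_encoded)  # → 63
```

After 63 halvings the interval no longer distinguishes two values, matching the limit of 63 encodable symbols stated above.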
|
|
|
|
|
|
|
|
|
|
\subsection{Huffman encoding}
\label{k4:huff}
|
|
|
% list of algos and the tools that use them
|
|
|
D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed by Shannon and Fano and named after its developers. The \ac{SF} coding is not used today because Huffman coding is superior in both efficiency and effectivity. % todo any source to last sentence. Rethink the use of finite in the following text
|
|
|
|
|
\section{DEFLATE}
|
|
|
% mix of huffman and LZ77
|
|
|
The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. It is used in well-known tools like gzip.
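DEFLATE is directly accessible from Python's standard library, whose \texttt{zlib} module wraps the reference implementation; a minimal round-trip on repetitive (DNA-like) input shows the effect:

```python
import zlib

# zlib.compress() wraps a DEFLATE stream in a small zlib header/trailer;
# a repetitive input gives LZ77 plenty of back-references to exploit.
data = b"GATTACA" * 100
compressed = zlib.compress(data, level=9)

assert zlib.decompress(compressed) == data   # lossless round trip
print(len(data), len(compressed))            # compressed is far smaller
```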
|
|
|
|
|
|
|
|
|
|
Data is split into blocks. Each block stores a header consisting of three bits. A single block can be stored in one of three forms, each of which is represented by an identifier stored in the last two bits of the header.
|
|
|
|
|
\begin{itemize}
    \item \texttt{00} No compression.
    \item \texttt{01} Compressed with a fixed set of Huffman codes.
    \item \texttt{10} Compressed with dynamic Huffman codes.
\end{itemize}
|
|
|
|
|
The remaining combination \texttt{11} is reserved and marks a faulty block. The third, leading bit is set to flag the last block of the data \cite{rfc1951}.
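Reading these three header bits can be sketched as follows (an illustration, not a full DEFLATE parser; DEFLATE packs bits least-significant first):

```python
# Hypothetical helper: extract the three DEFLATE block-header bits
# from the first byte of a block.
def read_block_header(first_byte: int) -> tuple[bool, int]:
    bfinal = bool(first_byte & 0b1)      # bit 0: last-block flag
    btype = (first_byte >> 1) & 0b11     # bits 1-2: block type identifier
    if btype == 0b11:
        raise ValueError("reserved block type 11 marks a faulty stream")
    return bfinal, btype

print(read_block_header(0b101))  # → (True, 2): final block, dynamic Huffman
```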
|
|
|
|
|
% lz77 part
|
|
|
|
|
As described in \ref{k4:lz77}, compression with \acs{LZ77} results in literals and pointers, each pointer being represented by a match length and the distance back to the literal it points to.\\
|
|
|
|
|
The \acs{LZ77} algorithm is executed before the Huffman algorithm. The further compression steps differ from the already described algorithms and are covered in the remainder of this section.\\
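The shape of the intermediate \acs{LZ77} output can be illustrated with a toy matcher (a greedy sketch for illustration only; real DEFLATE uses hash chains and a 32 KiB window):

```python
# Hypothetical illustration: LZ77-style output as literals plus
# (length, distance) back-references, the input to the Huffman stage.
def lz77_tokens(text: str, min_match: int = 3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for dist in range(1, i + 1):           # search the already-seen prefix
            length = 0
            while (i + length < len(text)
                   and text[i + length] == text[i + length - dist]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, dist
        if best_len >= min_match:
            tokens.append((best_len, best_dist))   # back-reference
            i += best_len
        else:
            tokens.append(text[i])                 # literal
            i += 1
    return tokens

print(lz77_tokens("GATTGATT"))  # → ['G', 'A', 'T', 'T', (4, 4)]
```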
|
|
|
|
|
% huffman part
|
|
|
|
|
Besides the header bits and the data itself, each block stores two Huffman code trees: one encodes literals and lengths, the other distances. Both are stored in a compact form. This is achieved by adding two rules on top of those described in \ref{k4:huff}: codes of identical length are ordered lexicographically by the characters they represent, and shorter codes precede longer codes.\\
|
|
|
|
|
To illustrate this with an example:
|
|
|
|
|
For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, occurring more often than the other two characters, the codes would change to a representation like this:
|
|
|
|
|

\sffamily
\begin{footnotesize}
    \begin{longtable}[h]{ p{.3\textwidth} p{.3\textwidth}}
        \toprule
        \textbf{Symbol} & \textbf{Huffman code}\\
        \midrule
        A & 0\\
        C & 10\\
        G & 11\\
        \bottomrule
    \end{longtable}
\end{footnotesize}
\rmfamily
|
|
|
|
|
Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To keep the codes prefix-free, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is assigned to \texttt{C}.\\
|
|
|
|
|
With these simple rules, the alphabet itself can be compressed too: instead of storing the codes, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing many blocks of data, a smaller alphabet description can make a relevant difference.
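The reconstruction of the codes from their lengths can be sketched as follows (following the canonical-code procedure of RFC 1951, section 3.2.2; the function name is chosen here for illustration):

```python
# Rebuild canonical Huffman codes from code lengths alone:
# shorter codes first, ties broken lexicographically by symbol.
def codes_from_lengths(lengths: dict[str, int]) -> dict[str, str]:
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # moving to a longer length appends zeros
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

print(codes_from_lengths({"A": 1, "C": 2, "G": 2}))
# → {'A': '0', 'C': '10', 'G': '11'}
```

Given only the lengths 1, 2 and 2, this reproduces exactly the codes from the table above, which is why storing the lengths suffices.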
|
|
|
|
|
+
|
|
|
|
|
% example header, alphabet, data block?
|
|
|
|
|
|
|
|
\section{Implementations in Relevant Tools}
|
|
|
\subsection{} % geco

\subsection{} % genie

\subsection{} % samtools
|
|
|
|
|
|
|
|
|
|
|