3 years ago · efd24c0c69
--- a/latex/tex/kapitel/k4_algorithms.tex
+++ b/latex/tex/kapitel/k4_algorithms.tex
@@ -146,6 +146,7 @@ Intervals for the first symbol would be represented by natural numbers between 0
 
				 % finite precision
			
 
				 The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that the resulting number does not exceed a certain length. Digital data types are limited in their capacity: an unsigned 64-bit integer, for example, can store any number between 0 and $2^{64}-1$, that is 18,446,744,073,709,551,615. That might seem like a great amount at first, but with an unfavorable alphabet that extends the length of the result by one for each symbol read, only texts of length 63 can be encoded (62 if \acs{EOF} is excluded).
			
 
				 
			
 
				+\label{k4:huff}
			
 
				 \subsection{Huffman encoding}
			
 
				 % list of algos and the tools that use them
			
 
				 D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similarly. \ac{SF} coding is no longer used today, because Huffman coding is superior in both efficiency and effectiveness. % todo any source to last sentence. Rethink the use of finite in the following text
			
@@ -190,34 +191,42 @@ most formats, used for persisting \acs{DNA}, store more than just nucleotides an
 
				 \section{DEFLATE}
			
 
				 % mix of huffman and LZ77
			
 
				 The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. It is used in well-known tools like gzip.
			
 
				-
			
 
				-\subsubsection{misc}
			
 
				-
			
 
				-%check if (small) text coding is done with this:
			
 
				-Arithmetic Asymmetric numeral systems ?
			
 
				-Modified -> used in cram
			
 
				-
			
 
				+Data is split into blocks. Each block starts with a header of three bits. A block can be stored in one of three forms, each represented by an identifier stored in the last two bits of the header.
			
 
				+\begin{itemize}
			
 
				+	\item \texttt{00}		No compression.
			
 
				+	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
			
 
				+	\item \texttt{10}		Compressed with dynamic Huffman codes.
			
 
				+\end{itemize}
			
 
				+The last combination, \texttt{11}, is reserved and marks a faulty block. The remaining leading bit is set to flag the last data block \cite{rfc1951}.
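The header layout described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool discussed here: the function name `parse_block_header` and the label strings are hypothetical. It assumes the bit order of RFC 1951, where header bits are read starting at the least significant bit of the first byte, so the final-block flag comes first and the two type bits follow.

```python
# Hypothetical helper: decode the three header bits of a DEFLATE block.
# Assumption: bits are consumed least significant bit first (RFC 1951).
BLOCK_TYPES = {
    0b00: "stored",    # no compression
    0b01: "fixed",     # fixed Huffman codes
    0b10: "dynamic",   # dynamic Huffman codes
    0b11: "reserved",  # marks a faulty block
}


def parse_block_header(first_byte):
    """Return (is_last_block, block_type) for the first byte of a block."""
    is_last = bool(first_byte & 0b1)        # first header bit: last-block flag
    block_type = (first_byte >> 1) & 0b11   # next two bits: block form
    return is_last, BLOCK_TYPES[block_type]


if __name__ == "__main__":
    for byte in (0x01, 0x03, 0x04, 0x06):
        print(f"{byte:#04x}", parse_block_header(byte))
```

The remaining five bits of the first byte already belong to the block contents and are ignored by this sketch.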
			
 
				+% lz77 part
			
 
				+As described in \ref{k4:lz77}, compression with \acs{LZ77} yields literals, a length for each literal, and pointers, each represented by the distance between the pointer and the literal it points to.\\
			
 
				+The \acs{LZ77} algorithm is executed before the Huffman algorithm. The subsequent compression steps differ from the algorithm already described and are covered in the remainder of this section.\\
			
 
				+% huffman part
			
 
				+Besides the header bits and a data block, two Huffman code trees are stored. One encodes literals and lengths, the other distances. Both are stored in a compact form. This is achieved by adding two rules to those described in \ref{k4:huff}: codes of identical length are ordered lexicographically by the characters they represent, and shorter codes precede longer codes.\\
			
 
				+To illustrate this with an example:
			
 
				+For a text consisting of \texttt{C} and \texttt{G}, the following codes would be assigned for an encoding with two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which occurs more often than the other two characters, the codes would change to a representation like this:
			
 
				+
			
 
				+\sffamily
			
 
				+\begin{footnotesize}
			
 
				+  \begin{longtable}[h]{ p{.3\textwidth} p{.3\textwidth}}
			
 
				+    \toprule
			
 
				+     \textbf{Symbol} & \textbf{Huffman code}\\
			
 
				+    \midrule
			
 
				+			A &	0\\
			
 
				+			C	&	10\\
			
 
				+			G & 11\\
			
 
				+    \bottomrule
			
 
				+  \end{longtable}
			
 
				+\end{footnotesize}
			
 
				+\rmfamily
			
 
				+
			
 
				+Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented by a 0. To keep the codes prefix-free, the two remaining codes must not start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the numerically smaller code is assigned to \texttt{C}.\\
			
 
				+With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing many blocks of data, a smaller alphabet can make a relevant difference.
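To make the reconstruction from code lengths concrete, the following Python sketch rebuilds the codes of the example table from nothing but the lengths 1, 2 and 2. It follows the counting procedure given in RFC 1951. The function name `canonical_codes` and the dictionary-based interface are assumptions made for this illustration. Symbols are taken in sorted order, which matches the lexicographic rule.

```python
# Sketch: rebuild canonical Huffman codes from code lengths only.
# canonical_codes is a hypothetical name; the procedure follows RFC 1951.
def canonical_codes(lengths):
    """Map each symbol to its code (as a bit string), given code lengths.

    A length of 0 means the symbol is unused and receives no code.
    """
    max_len = max(lengths.values())

    # Count how many codes exist for each code length.
    count = [0] * (max_len + 1)
    for length in lengths.values():
        if length > 0:
            count[length] += 1

    # Smallest code value for each length: shorter codes precede longer ones.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code

    # Within one length, codes are handed out in lexicographic symbol order.
    codes = {}
    for symbol in sorted(lengths):
        length = lengths[symbol]
        if length > 0:
            codes[symbol] = format(next_code[length], f"0{length}b")
            next_code[length] += 1
    return codes


if __name__ == "__main__":
    print(canonical_codes({"A": 1, "C": 2, "G": 2}))
```

For the lengths of the example table this yields 0 for A, 10 for C and 11 for G, so a decoder that only knows the code lengths arrives at the same codes as the encoder.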
			
 
				+
			
 
				+% example header, alphabet, data block?
			
 
				 
			
 
				 \section{Implementations in Relevant Tools}
			
 
				 \subsection{} % geco
			
 
				 \subsection{} % genie
			
 
				 \subsection{} % samtools 
			
 
				 
			
 
				-\mycomment{
			
 
				-\subsection{\ac{CABAC}}
			
 
				-% a form of entropy coding
			
 
				-% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
			
 
				-
			
 
				-
			
 
				-\section{Implementations}
			
 
				-% SAM - LZ4 src: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
			
 
				-% GeCo - arithmetic coding
			
 
				-% Genie - CABAC
			
 
				-
			
 
				-% following text is irelevant. Just describe used algorithms in comparison chapter and refere to their base algo
			
 
				-
			
 
				-% mix of Huffman and LZ77
			
 
				-The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To get more specific, the raw data is compressed with \ac{LZ77} and remaining data is shortened by using Huffman coding. 
			
 
				-% huffman - little endian
			
 
				-% LZ77 compressed - big endian (least significant byte first/most left)
			
 
				-}
			
--- a/latex/tex/literatur.bib
+++ b/latex/tex/literatur.bib
@@ -67,7 +67,7 @@
 
				   publisher    = {Oxford University Press ({OUP})},
			
 
				 }
			
 
				 
			
 
				-@TechReport{defalte,
			
 
				+@TechReport{rfc1951,
			
 
				   author    = {L. Peter Deutsch},
			
 
				   date      = {1996-05},
			
 
				   title     = {{DEFLATE} Compressed Data Format Specification version 1.3},
			
@@ -240,21 +240,6 @@
 
				   publisher    = {{IBM}},
			
 
				 }
			
 
				 
			
 
				-@TechReport{rfc1951,
			
 
				-  author       = {L. Peter Deutsch},
			
 
				-  institution  = {RFC Editor},
			
 
				-  title        = {DEFLATE Compressed Data Format Specification version 1.3},
			
 
				-  note         = {\url{http://www.rfc-editor.org/rfc/rfc1951.txt}},
			
 
				-  number       = {1951},
			
 
				-  type         = {RFC},
			
 
				-  url          = {http://www.rfc-editor.org/rfc/rfc1951.txt},
			
 
				-  howpublished = {Internet Requests for Comments},
			
 
				-  issn         = {2070-1721},
			
 
				-  month        = {May},
			
 
				-  publisher    = {RFC Editor},
			
 
				-  year         = {1996},
			
 
				-}
			
 
				-
			
 
				 @Article{ieee-float,
			
 
				   title   = {IEEE Standard for Floating-Point Arithmetic},
			
 
				   doi     = {10.1109/IEEESTD.2019.8766229},