
deflate finished for now

u 3 years ago
parent
commit
efd24c0c69
2 changed files with 35 additions and 41 deletions
  1. 34 25
      latex/tex/kapitel/k4_algorithms.tex
  2. 1 16
      latex/tex/literatur.bib

+ 34 - 25
latex/tex/kapitel/k4_algorithms.tex

@@ -146,6 +146,7 @@ Intervals for the first symbol would be represented by natural numbers between 0
 % finite precision
 The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that the resulting number does not exceed a certain length. Digital data types are limited in their capacity: an unsigned 64-bit integer, for example, stores 64 bits and can therefore represent any number between 0 and $2^{64}-1$, that is 18.446.744.073.709.551.615. That might seem like a large amount at first, but with an unfavorable alphabet that extends the length of the result by one for each symbol that is read, only texts with a length of 63 can be encoded (62 if \acs{EOF} is excluded).
 
+\label{k4:huff}
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similarly. The \ac{SF} coding is not used today, because Huffman coding is superior in both efficiency and effectiveness. % todo any source to last sentence. Rethink the use of finite in the following text
@@ -190,34 +191,42 @@ most formats, used for persisting \acs{DNA}, store more than just nucleotides an
 \section{DEFLATE}
 % mix of huffman and LZ77
 The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. It is used in well-known tools like gzip.
-
-\subsubsection{misc}
-
-%check if (small) text coding is done with this:
-Arithmetic Asymmetric numeral systems ?
-Modified -> used in cram
-
+Data is split into blocks. Each block starts with a header consisting of three bits. A block can be stored in one of three forms, each of which is identified by the last two bits of the header.
+\begin{itemize}
+	\item \texttt{00}		No compression.
+	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
+	\item \texttt{10}		Compressed with dynamic Huffman codes.
+\end{itemize}
+The last combination, \texttt{11}, is reserved and marks a faulty block. The remaining, leading bit of the header is set to flag the last data block \cite{rfc1951}.
+% lz77 part
+As described in \ref{k4:lz77}, a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between the pointer and the literal it points to.\\
+The \acs{LZ77} algorithm is executed before the Huffman algorithm. The further compression steps differ from the algorithm already described and are covered in the remainder of this section.\\
+% huffman part
+Besides the header bits and a data block, two Huffman code trees are stored. One encodes literals and lengths, the other encodes distances. Both are stored in a compact form. This is achieved by adding two rules on top of the rules described in \ref{k4:huff}: codes of identical length are ordered lexicographically, following the characters they represent, and shorter codes precede longer codes.\\
+To illustrate this with an example:
+For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which occurs more often than the other two characters, the codes would change to a representation like this:
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[h]{ p{.3\textwidth} p{.3\textwidth}}
+    \toprule
+     \textbf{Symbol} & \textbf{Huffman code}\\
+    \midrule
+			A &	0\\
+			C	&	10\\
+			G & 11\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the numerically smaller code is set to represent \texttt{C}.\\
+With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a smaller alphabet can make a relevant difference.
+
+% example header, alphabet, data block?
 
 \section{Implementations in Relevant Tools}
 \subsection{} % geco
 \subsection{} % genie
 \subsection{} % samtools 
 
-\mycomment{
-\subsection{\ac{CABAC}}
-% a form of entropy coding
-% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
-
-
-\section{Implementations}
-% SAM - LZ4 src: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
-% GeCo - arithmetic coding
-% Genie - CABAC
-
-% following text is irelevant. Just describe used algorithms in comparison chapter and refere to their base algo
-
-% mix of Huffman and LZ77
-The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To get more specific, the raw data is compressed with \ac{LZ77} and remaining data is shortened by using Huffman coding. 
-% huffman - little endian
-% LZ77 compressed - big endian (least significant byte first/most left)
-}
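
Note on the added DEFLATE paragraphs: the three-bit block header and the rule that only code lengths need to be stored can be checked with a small sketch. The following Python is an illustration only and is not part of the commit. The function names `parse_block_header` and `canonical_codes` are hypothetical, and the sketch assumes the bit order of RFC 1951 (the final-block flag is read first, followed by the two type bits, least significant bit first) and symbols that can be sorted.

```python
def parse_block_header(first_byte):
    """Read the 3-bit block header from the first byte of a block.

    Assumes the block starts at bit 0 of `first_byte` and that bits are
    consumed starting with the least significant bit.
    """
    is_final = bool(first_byte & 1)        # flag for the last data block
    block_type = (first_byte >> 1) & 0b11  # two bits selecting the block form
    names = {
        0b00: "no compression",
        0b01: "fixed Huffman",
        0b10: "dynamic Huffman",
        0b11: "reserved",
    }
    return is_final, names[block_type]


def canonical_codes(lengths):
    """Rebuild Huffman codes from code lengths alone.

    `lengths` maps each symbol to its code length in bits (0 = unused).
    Shorter codes precede longer ones, and codes of equal length are
    assigned in the sort order of their symbols.
    """
    max_len = max(lengths.values())

    # Count how many codes exist for each length.
    count = [0] * (max_len + 1)
    for n in lengths.values():
        if n:
            count[n] += 1

    # Smallest numerical code value for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for n in range(1, max_len + 1):
        code = (code + count[n - 1]) << 1
        next_code[n] = code

    # Hand out consecutive values to the symbols of each length.
    codes = {}
    for symbol in sorted(lengths):
        n = lengths[symbol]
        if n:
            codes[symbol] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes


if __name__ == "__main__":
    print(parse_block_header(0x4B))
    print(canonical_codes({"A": 1, "C": 2, "G": 2}))
```

Called with the lengths 1, 2 and 2 for \texttt{A}, \texttt{C} and \texttt{G}, `canonical_codes` yields the same codes as the table in the added text, so a decoder only needs the lengths to rebuild the alphabet.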

+ 1 - 16
latex/tex/literatur.bib

@@ -67,7 +67,7 @@
   publisher    = {Oxford University Press ({OUP})},
 }
 
-@TechReport{defalte,
+@TechReport{rfc1951,
   author    = {L. Peter Deutsch},
   date      = {1996-05},
   title     = {{DEFLATE} Compressed Data Format Specification version 1.3},
@@ -240,21 +240,6 @@
   publisher    = {{IBM}},
 }
 
-@TechReport{rfc1951,
-  author       = {L. Peter Deutsch},
-  institution  = {RFC Editor},
-  title        = {DEFLATE Compressed Data Format Specification version 1.3},
-  note         = {\url{http://www.rfc-editor.org/rfc/rfc1951.txt}},
-  number       = {1951},
-  type         = {RFC},
-  url          = {http://www.rfc-editor.org/rfc/rfc1951.txt},
-  howpublished = {Internet Requests for Comments},
-  issn         = {2070-1721},
-  month        = {May},
-  publisher    = {RFC Editor},
-  year         = {1996},
-}
-
 @Article{ieee-float,
   title   = {IEEE Standard for Floating-Point Arithmetic},
   doi     = {10.1109/IEEESTD.2019.8766229},