
Finished Huffman for now. In case I can't remember: start expanding arithmetic coding

u 3 years ago
Commit
7a8ef71f9d
2 changed files with 62 additions and 40 deletions
  1. +60 -38
      latex/tex/kapitel/k4_algorithms.tex
  2. +2 -2
      latex/tex/literatur.bib

+ 60 - 38
latex/tex/kapitel/k4_algorithms.tex

@@ -39,9 +39,9 @@ For \acs{DNA} a lossless compression is needed. To be preceice a lossy compressi
 \subsection{Dictionary coding}
 \label{k4:dict}
 Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. For a better understanding, this is illustrated by a short example:
-% exkurs
+% demo substrings
 Looking at the string 'stationary', it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that should be compressed. 
-% end exkurs
+% end demo
 The dictionary should only store strings that occur in the input data. Storing a dictionary in addition to the (compressed) input data would also be a waste of resources. Therefore the dictionary is built from the input data itself. The first occurrence of each string is left uncompressed; every occurrence after the first one points back to that first occurrence. Since this 'pointer' needs less space than the string it points to, the size decreases.\\
 
 Unfortunately, known implementations, like those of the LZ family, do not use probabilities to compress and are therefore out of scope for this work. Since finding repeating sections and their locations might also be improved, this chapter remains nonetheless.
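
To make the pointer idea concrete, here is a minimal Python sketch (an editorial illustration of the principle only, not one of the LZ-family implementations mentioned above; the word-level tokenisation and the lit/ref token format are assumptions made for this example):

def encode(words):
    """Replace every repeated word by a pointer to its first occurrence."""
    first_seen = {}                      # word -> index of its first occurrence
    tokens = []
    for index, word in enumerate(words):
        if word in first_seen:
            tokens.append(("ref", first_seen[word]))   # pointer: cheaper than the word
        else:
            first_seen[word] = index
            tokens.append(("lit", word))               # first occurrence stays literal
    return tokens

def decode(tokens):
    words = []
    for kind, value in tokens:
        words.append(words[value] if kind == "ref" else value)
    return words

text = "the station is near the old station".split()
assert decode(encode(text)) == text
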
@@ -72,25 +72,33 @@ transmitter and receiver are changed to compression/encoding and decompression/d
 
 Shannon's entropy provides a formula to determine the 'uncertainty of a probability distribution' over a finite probability space.
 
-%H(X) \defd \Sum{x\in X, prob(x)\neq0}{}{prob(x) * log_2(frac{1}{prob(x)})} \equiv  - \Sum { x\in X, prob(x)\neq0 } {} {prob(x) * log_2 (prob(x))}.
-\begin{figure}[H]
-  \centering
-  \includegraphics[width=12cm]{k4/shannon_entropy.png}
-  \caption{Shannons definition of entropy.}
-  \label{k4:entropy}
-\end{figure}
-
-He defined entropy as shown in figure \ref{f4:entropy}. Let X be a finite probability space. Then x in X are possible final states of an probability experimen over X. Every state that actually occours, while executing the experiment generates infromation which is meassured in \textit{Bits} with the part of the formular displayed in \ref{k4:info-in-bits}\cite{delfs_knebl,Shannon_1948}:
-
-%\bein{math}
-% log_2(frac{1}{prob(x)}) \equiv - log_2(prob(x)).
-%\end{math} 
-\begin{figure}[H]
-  \centering
-  \includegraphics[width=8cm]{k4/information_bits.png}
-  \caption{The amount of information measured in bits, in case x is the end state of a probability experiment.}
-  \label{f4:info-in-bits}
-\end{figure}
+\begin{equation}\label{eq:entropy}
+%\resizebox{.9 \textwidth}{!}
+%{$
+  H(X):=\sum\limits_{x\in X,\, prob(x)\neq 0} prob(x) \cdot \log_2\left(\frac{1}{prob(x)}\right)
+ \equiv-\sum\limits_{x\in X,\, prob(x)\neq 0} prob(x) \cdot \log_2 (prob(x)).
+%$}
+\end{equation}
+
+%\begin{figure}[H]
+%  \centering
+%  \includegraphics[width=12cm]{k4/shannon_entropy.png}
+%  \caption{Shannons definition of entropy.}
+%  \label{k4:entropy}
+%\end{figure}
+
+He defined entropy as shown in equation \eqref{eq:entropy}. Let $X$ be a finite probability space. Then the $x \in X$ are the possible final states of a probability experiment over $X$. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{bits} with the part of the formula displayed in \eqref{eq:info-in-bits} \cite{delfs_knebl,Shannon_1948}:
+
+\begin{equation}\label{eq:info-in-bits}
+ \log_2\left(\frac{1}{prob(x)}\right) \equiv - \log_2(prob(x)).
+\end{equation} 
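
As an illustration, with probabilities assumed purely for this example (they are not taken from any dataset discussed in this work): for a four-symbol alphabet with $prob(A)=\frac{1}{2}$, $prob(C)=\frac{1}{4}$ and $prob(G)=prob(T)=\frac{1}{8}$, equation \eqref{eq:entropy} evaluates to
\begin{equation*}
  H(X)=\frac{1}{2}\log_2 2+\frac{1}{4}\log_2 4+2\cdot\frac{1}{8}\log_2 8=0.5+0.5+0.75=1.75,
\end{equation*}
so on average 1.75 bits of information are generated per observed symbol.
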
+
+%\begin{figure}[H]
+%  \centering
+%  \includegraphics[width=8cm]{k4/information_bits.png}
+%  \caption{The amount of information measured in bits, in case x is the end state of a probability experiment.}
+%  \label{f4:info-in-bits}
+%\end{figure}
 
 %todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
 
@@ -102,40 +110,54 @@ He defined entropy as shown in figure \ref{f4:entropy}. Let X be a finite probab
 Arithmetic coding is an approach to solve the problem of wasting memory due to the overhead created by encoding alphabets of certain lengths in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if it were possible to encode two letters of the alphabet with one bit each and the remaining one with a two-bit combination. This approach is not possible because the letters would not be clearly distinguishable: the two-bit letter could be interpreted either as the letter it should represent or as two one-bit letters.
 % check this wording 'simulating' with sources 
 % this is called subdividing
-Arithmetic coding works by translating a n-letter alphabet into a n-letter binary encoding. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an intervall between two floating point numbers in the space between 0.0 and 1.0 (exclusively). This intervall is determined by its distribution in the input text (intervall start) and the the start of the next character (intervall end). To encode a sequence of characters subdividing is used.
+Arithmetic coding works by translating an n-letter symbol chain, or string, into a single binary codeword. 
+% wtf does a n-letter binary encoding looks like?
+This is possible by projecting the input text onto a floating-point number. Every character in the alphabet is represented by an interval between two floating-point numbers in the space between 0.0 and 1.0 (exclusive). This interval is determined by the character's distribution in the input text (interval start) and the start of the next character's interval (interval end). The whole interval is composed of all subintervals that reference symbols from the alphabet; it extends from 0.0 to 1.0. To encode a sequence of characters, subdividing is used.
 % exkurs on subdividing?
 This means the interval start of the character is noted and its interval is split into smaller intervals with the same ratios as the initial intervals between 0.0 and 1.0. Within this subdivided interval, the second character is chosen. This process is repeated until an interval for the last character is chosen.\\
 To encode in binary, the binary floating-point representation of a number inside the interval of the last character is calculated, using a process similar to the one described above, called subdividing.
-% its finite subdividing because processors bottleneck floatingpoints 
+% its finite subdividing because of processor architecture
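
A minimal Python sketch of this interval subdivision (an editorial illustration; the four-symbol model and its probabilities are assumptions made for the example, and exact fractions are used instead of floating-point numbers to sidestep the precision limit mentioned in the comment above):

from fractions import Fraction

# assumed probability model: each symbol owns a subinterval of [0, 1)
MODEL = {"A": (Fraction(0), Fraction(1, 2)),
         "C": (Fraction(1, 2), Fraction(3, 4)),
         "G": (Fraction(3, 4), Fraction(7, 8)),
         "T": (Fraction(7, 8), Fraction(1))}

def encode(text):
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        sym_low, sym_high = MODEL[symbol]
        width = high - low
        # subdivide the current interval with the same ratios as the model
        low, high = low + width * sym_low, low + width * sym_high
    return low, high   # any number inside [low, high) identifies the whole text

print(encode("ACGT"))  # a narrow interval; a binary fraction inside it is the code
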
 
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed by Shannon and Fano and named after its developers. The \ac{SF} coding is not used today, due to Huffman coding's superiority in both efficiency and effectiveness. % todo any source to last sentence. Rethink the use of finite in the following text
 Even though his work was released in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with more general usage \cite{rfcgzip}.\\
-Compression with the Huffman algorithm works only on finite alphabets. It also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bits, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \cite{huf52}. 
-As with other codings, a set of symbols must be defined. For any text constructed with symbols from mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for given text. The binary tree will be constructed after following guidelines:
+Compression with the Huffman algorithm also provides a solution to the problem of waste through unused bits for certain alphabet lengths, described at the beginning of \ref{k4:arith}. Huffman did not save more than one symbol in a single bit, as is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for the symbols used in the text that should be compressed \cite{huf52}. 
+As with other codings, a set of symbols must be defined. For any text constructed with symbols from the mentioned alphabet, a binary tree is built, which determines how the symbols will be encoded. As in arithmetic coding, the probability of each letter is calculated for the given text. The binary tree is constructed according to the following guidelines \cite{alok17}:
 % greedy algo?
 \begin{itemize}
   \item Every symbol of the alphabet is one leaf.
  \item The right branch from every node is marked with a 1, the left one with a 0.
-  \item Every symbol got a weight, the weight is defined by the frequency the symbol occours in the input text.
+  \item Every symbol gets a weight; the weight is defined by the frequency with which the symbol occurs in the input text. This might be a fraction between 0 and 1 or an integer count. In this scenario it will be described as the former.
  \item The more weight a leaf has, the higher the probability that its symbol is read next in the symbol sequence.
  \item The leaf with the highest probability is positioned furthest to the left and the one with the lowest probability furthest to the right in the tree. 
 \end{itemize}
 %todo tree building explanation
-Constructing the tree begins with as many nodes as there are symbols, in the alphabet. 
 % storytime might need to be rearranged
-A often mentioned difference between \acs{FA} and Huffman coding, is that first is working top down while the latter is working bottom up. This means the tree starts with the lowest weights. The nodes that are not leafs have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes.\\
-So starting with the two lowest weightened symbols, a node is added to connect both. from there on, the two leafs will only get rearranged through the rearrangement of their temporary root node. With the added, blank node the count of available nodes got down by one. The new node weights as much as the summ of weights of its child nodes. Now the two lowest weights are paired as described until there are only two subtrees left which can be combined by a root.\\
-With the fact in mind, that left branches are assigned with 0 and right branches with 1, following a path until a leaf is reached reveals the encoding for this particular leaf. Since high weightened and therefore often occuring leafs are positioned to the left, short paths lead to them and so only few bits are needed to encode them. Following the tree on the other side, the symbols occur more rarely, paths get longer and so do the bit counts. % todo <- rewrite '...counts'
-
-
-In our case a four letter alphabet, containing \texttt{A, C, G and T} is sufficient.
-The process of compressing starts with the nodes with the lowest weight and buids up to the hightest. Each step adds nodes to a tree where the most left branch should be the shortest and the most right the longest. The most left branch ends with the symbol with the highest weight, therefore occours the most in the input data.
-Following one path results in the binary representation for one symbol. For an alphabet like the one described above, the binary representation encoded in ASCI is shown here \texttt{A -> 01000001, C -> 01000011, G -> 01010100, T -> 00001010}. An imaginary sequence, that has this distribution of characters \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weighting would be calculated for each character by dividing one by the characters occurence. With a corresponding tree, created from with the weights, the binary data for each symbol would change to this \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree msut be saved for the decompression process.\\
-% todo shannon fano mention. SF might be older than huffman and inspired it?
-% -> yes
-
+An often mentioned difference between \acs{SF} and Huffman coding is that the former works top down while the latter works bottom up. This means the tree is built starting with the lowest weights. The nodes that are not leaves have no value ascribed to them; they only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
+
+Given \texttt{K(W,L)} as a node structure, with the weight or probability \texttt{$w_{i}$} and the codeword length \texttt{$l_{i}$} for the node \texttt{$K_{i}$}, let \texttt{$L_{av}$} be the average codeword length over \texttt{L} for a finite chain of symbols whose distribution is mapped onto \texttt{W} \cite{huf52}.
+\begin{equation}\label{eq:huf}
+  L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
+\end{equation}
+Equation \eqref{eq:huf} describes the quantity the tree construction aims to minimise. The upper bound \texttt{n} is the number of symbols in the alphabet, i.e. the number of codewords. The tuple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, which also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, an element of \texttt{L} refers to a sequence of zeros and ones. Since there is no fixed length for codewords in this coding method, the premise of a \texttt{prefix free code} must be adhered to. This means no codeword may match a prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords, but adding a codeword like 01 or 00 would make the set invalid, because of the prefix 0, which is already a codeword on its own.\\
+With all important elements described, the sum resulting from this equation is the average length, in bits, that a symbol of the encoded input text requires to be stored \cite{huf52, moffat20}.
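
The two ingredients of this paragraph, the prefix-free property and the average length from \eqref{eq:huf}, can be checked with a few lines of Python (an editorial sketch; the function names are arbitrary and the codeword sets are the ones from the illustration above):

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def average_length(weights, lengths):
    """L_av from eq:huf -- the sum over w_i * l_i."""
    return sum(w * l for w, l in zip(weights, lengths))

assert is_prefix_free(["0", "10", "11"])            # the valid set from the text
assert not is_prefix_free(["0", "10", "11", "01"])  # "0" is a prefix of "01"
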
+
+% example
+% todo illustrate
+For this example a four-letter alphabet, containing \texttt{A, C, G and T}, will be used. The exact input text is not relevant, since only the resulting probabilities are needed. With a distribution like \texttt{<A, $\frac{11}{100}=0.11$>, <C, $\frac{71}{100}=0.71$>, <G, $\frac{13}{100}=0.13$>, <T, $\frac{5}{100}=0.05$>}, the probability of each symbol is calculated by dividing the number of times the symbol occurs by the message length.\\ 
+For an alphabet like the one described above, the binary representation encoded in \acs{ASCII} is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. The average length of any symbol encoded in \acs{ASCII} is eight bits, while only four of the available $2^8$ combinations are used, an overhead of 252 unused bit combinations. For this example it is more vivid to use an imaginary encoding format without overhead. It would result in an average codeword length of two, because four symbols need a minimum of two bits each ($2^2=4$).\\
+So, starting with the two lowest-weighted symbols, a node is added to connect both.\\
+\texttt{<A, T>, <C>, <G>}\\
+With the added blank node, the count of available nodes goes down by one. The new node weighs as much as the sum of the weights of its child nodes, so the probability of 0.16 is assigned to \texttt{<A,T>}. From there on, the two leaves are only rearranged through the rearrangement of their temporary root node. Now the two lowest weights are paired as described, until only two subtrees or nodes are left, which are combined by a root.\\
+\texttt{<G, <A, T>>, <C>}\\
+The node \texttt{<G, <A, T>>} has a probability of 0.29. Adding the last node \texttt{C} results in a root node with a probability of 1.0.\\
+Keeping in mind that left branches are assigned a 0 and right branches a 1, following a path until a leaf is reached reveals the encoding for this particular leaf. With a corresponding tree, created from the weights, the binary sequences encoding the alphabet would look like this:\\
+\texttt{C -> 0, G -> 11, A -> 100, T -> 101}.\\ 
+Since heavily weighted and therefore often occurring leaves are positioned to the left, short paths lead to them and so only a few bits are needed to encode them. Following the tree to the other side, the symbols occur more rarely, paths get longer and so do the codewords. Applying \eqref{eq:huf} to this example results in 1.45 bits per encoded symbol, so compared to the imaginary two-bit encoding the text would require a little more than one bit less storage for every two symbols.\\
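
The construction above can be reproduced with a short Python sketch (an editorial illustration using a greedy merge over a heap; the probabilities are the ones from this example, while the 0/1 orientation of the branches is an arbitrary choice, so only the codeword lengths, not the exact bit patterns, are reproduced):

import heapq
from itertools import count

def huffman_lengths(probabilities):
    """Merge the two lightest subtrees until one tree is left; return codeword lengths."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), {symbol: 0}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, depths1 = heapq.heappop(heap)   # the two lowest weights ...
        p2, _, depths2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}  # ... move one level down
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.11, "C": 0.71, "G": 0.13, "T": 0.05}
lengths = huffman_lengths(probs)                       # codeword lengths: C=1, G=2, A=3, T=3
average = sum(probs[s] * lengths[s] for s in probs)    # 0.71*1 + 0.13*2 + 0.11*3 + 0.05*3
print(lengths, round(average, 2))                      # -> 1.45 bits per symbol
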
+% impacting the dark ground called reality
+Leaving theory and entering practice brings some details that lessen this improvement somewhat. A few bytes are added through the need to store the information contained in the tree. Also, as described in % todo add ref to k3 formats
+most formats used for persisting \acs{DNA} store more than just nucleotides and therefore require more characters. What compression ratios implementations of Huffman coding provide will be discussed in \ref{k5:results}.\\
 
 \section{DEFLATE}
 % mix of huffman and lz77

+ 2 - 2
latex/tex/literatur.bib

@@ -1,4 +1,4 @@
-@Article{Al_Okaily_2017,
+@Article{alok17,
   author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
   date         = {2017-04-01},
   journaltitle = {Journal of Computational Biology},
@@ -204,7 +204,7 @@
   year        = {1952},
 }
 
-@Article{Moffat_2020,
+@Article{moffat20,
   author       = {Alistair Moffat},
   date         = {2020-07},
   journaltitle = {{ACM} Computing Surveys},