
little improvement on arith

u 3 years ago
parent commit
349590e008
3 changed files with 46 additions and 6 deletions
  1. latex/tex/kapitel/k1_introduction.tex (+1 -0)
  2. latex/tex/kapitel/k4_algorithms.tex (+21 -6)
  3. latex/tex/literatur.bib (+24 -0)

+ 1 - 0
latex/tex/kapitel/k1_introduction.tex

@@ -18,6 +18,7 @@ With this information, a static code analysis of mentioned tools follows. This w
 % todo: 
 %   explain: coding 
 %   find uniform representation for: {letter;symbol;char} {dna;genome;sequence}
+% todo 22-11-07: change symbol to char and text to message. This might need a forward reference to Shannon's work, which is best placed in the intro
 
 %- COMPRESSION APPROACHES
 % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}

+ 21 - 6
latex/tex/kapitel/k4_algorithms.tex

@@ -34,7 +34,7 @@ Data contains information. In digital data, clear physical limitations delimit
 The boundaries of information, when it comes to storage capabilities, can be illustrated with the example mentioned above. A drive with a capacity of 1\acs{GB} could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be to store just the text of every page in \acs{UTF-16}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
 % excurs end
 In contrast to lossless compression, lossy compression may exclude parts of the data during the compression process in order to increase the compression rate. The excluded parts are typically not necessary to preserve the original information. This works with certain audio and picture formats, and in network protocols \cite{cnet13}.
-For \acs{DNA} a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two major approaches are known: dictionary coding and entropy coding. Both are described in detail below \cite{cc14}.\\
+For \acs{DNA} a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two major approaches are known: dictionary coding and entropy coding. Methods from both fields that have gained recognition are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
 
 \subsection{Dictionary coding}
 \label{k4:dict}
@@ -107,11 +107,26 @@ He defined entropy as shown in equation \eqref{eq:entropy}. Let X be a finite prob
 
 \subsection{Arithmetic coding}
 \label{k4:arith}
-Arithmetic coding is an approach to solve the problem of wasting memory due to the overhead created by encoding alphabets of certain lengths in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if there were a possibility to encode two letters of the alphabet with one bit each and the remaining one with a two-bit combination. This approach is not possible because the letters would not be clearly distinguishable. The two-bit letter could be interpreted either as the letter it should represent or as two one-bit letters.
-% check this wording 'simulating' with sources 
-% this is called subdividing
-Arithmetic coding works by translating an n-letter symbol chain or string into an n-letter binary encoding.
-% wtf does an n-letter binary encoding look like?
+This coding method is an approach to solve the problem of wasting memory due to the overhead created by encoding alphabets of certain lengths in binary. For example: encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if there were a possibility to encode more than one letter in two bits.\\
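+With a uniform distribution, a three-letter alphabet carries only $\log_{2} 3 \approx 1.585$ bits of information per letter, so a fixed two-bit code wastes roughly $0.415$ bits per letter.\\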
+Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \autocite{ris76}. % Besides information theory and math, he also published work about DNA
+This work's goal was to define an algorithm that requires no blocking, meaning the input text can be encoded as a whole instead of being split into smaller texts or single symbols that are encoded separately. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.
+
+The coding algorithm works with probabilities for the symbols of an alphabet. For any text, the alphabet is defined as the set of individual symbols used in the text. The probability of a single symbol is given by its distribution: in a text of \texttt{n} symbols in which the first letter of the alphabet occurs \texttt{o} times, its probability is $\frac{o}{n}$.\\
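+For example, in the text \texttt{ABAC} the alphabet is $\{A, B, C\}$ and $n = 4$; the symbol \texttt{A} occurs $o = 2$ times, so $p_{A} = \frac{2}{4} = 0.5$, while $p_{B} = p_{C} = \frac{1}{4}$.\\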
+
+% todo rethink this equation stuff (and compare it to the original work <-compl.)
+\begin{itemize}
+  \item $p_{i}$ represents the probability of the symbol at position \texttt{i} in the alphabet; $j$ denotes the alphabet position of the symbol processed in the current step.
+  \item $L$ will contain the bit sequence with which the text is encoded. This sequence can be seen as a single value: a fraction between 0 and 1 which gets more precise with every processed symbol.
+  \item $R$ represents the product of the probabilities of the processed symbols.
+\end{itemize}
+
+\begin{equation}
+  L_{i} = L_{i-1} + R_{i-1} \cdot \sum_{k=1}^{j-1} p_{k}
+\end{equation}
+\begin{equation}
+  R_{i} = R_{i-1} \cdot p_{j}
+\end{equation}
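+Reusing the probabilities from the example above, encoding the text \texttt{AAB} (starting with $L_{0} = 0$ and $R_{0} = 1$) proceeds as follows:
+\begin{itemize}
+  \item \texttt{A} ($j = 1$): $L_{1} = 0 + 1 \cdot 0 = 0$, $R_{1} = 1 \cdot 0.5 = 0.5$
+  \item \texttt{A} ($j = 1$): $L_{2} = 0$, $R_{2} = 0.5 \cdot 0.5 = 0.25$
+  \item \texttt{B} ($j = 2$): $L_{3} = 0 + 0.25 \cdot 0.5 = 0.125$, $R_{3} = 0.25 \cdot 0.25 = 0.0625$
+\end{itemize}
+Any value in $[L_{3}, L_{3} + R_{3}) = [0.125, 0.1875)$ then identifies \texttt{AAB}, given its length.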
+
 This is possible by projecting the input text onto a floating-point number. Every character of the alphabet is represented by an interval between two floating-point numbers in the space between 0.0 and 1.0 (exclusively). This interval is determined by the character's distribution in the input text (interval start) and the start of the next character (interval end). The whole interval is the sum of all subintervals that reference symbols from the alphabet. It extends from 0.0 to 1.0. To encode a sequence of characters, subdividing is used.
 % excursus on subdividing?
 This means the interval start of the character is noted, and its interval is split into smaller intervals with the same ratios as the initial intervals between 0.0 and 1.0. Within this, the second character is chosen. This process is repeated until an interval for the last character is chosen.\\
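+The following listing is an illustrative sketch of this subdividing process in Python, not an optimised implementation; floating-point arithmetic is used for readability, while practical coders rely on integer arithmetic with renormalisation \cite{moffat_arith}.
+\begin{verbatim}
+# Sketch of arithmetic encoding via the L/R update rules above.
+# probs maps each alphabet symbol to its probability (assumed given).
+def encode(text, probs):
+    symbols = list(probs)       # fixed alphabet order
+    L, R = 0.0, 1.0
+    for s in text:
+        j = symbols.index(s)
+        cum = sum(probs[symbols[k]] for k in range(j))  # p_1 + ... + p_(j-1)
+        L += R * cum            # move the interval start
+        R *= probs[s]           # shrink the interval width
+    return L, R                 # any value in [L, L+R) identifies the text
+
+L, R = encode("AAB", {"A": 0.5, "B": 0.25, "C": 0.25})
+# -> L = 0.125, R = 0.0625, matching the worked example above
+\end{verbatim}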

+ 24 - 0
latex/tex/literatur.bib

@@ -216,4 +216,28 @@
   publisher    = {Association for Computing Machinery ({ACM})},
 }
 
+@Article{moffat_arith,
+  author       = {Alistair Moffat and Radford M. Neal and Ian H. Witten},
+  date         = {1998-07},
+  journaltitle = {{ACM} Transactions on Information Systems},
+  title        = {Arithmetic coding revisited},
+  doi          = {10.1145/290159.290162},
+  number       = {3},
+  pages        = {256--294},
+  volume       = {16},
+  publisher    = {Association for Computing Machinery ({ACM})},
+}
+
+@Article{ris76,
+  author       = {J. J. Rissanen},
+  date         = {1976-05},
+  journaltitle = {{IBM} Journal of Research and Development},
+  title        = {Generalized Kraft Inequality and Arithmetic Coding},
+  doi          = {10.1147/rd.203.0198},
+  number       = {3},
+  pages        = {198--203},
+  volume       = {20},
+  publisher    = {{IBM}},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}