3 years ago · 1c0d65af47
--- a/latex/result/thesis.pdf
+++ b/latex/result/thesis.pdf
--- a/latex/tex/kapitel/k4_algorithms.tex
+++ b/latex/tex/kapitel/k4_algorithms.tex
@@ -231,7 +231,7 @@ Given \texttt{K(W,L)} as a node structure, with the weigth or probability as \te
 
															 \begin{equation}\label{eq:huf}
														
 
															   L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
														
 
															 \end{equation}
														
 
															-The equation \eqref{eq:huf} describes the path, to the desired state, for the tree. The upper bound \texttt{n} is assigned the length of the input text. The touple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, that also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, an element in \texttt{l} contains a sequence of zeros and ones. Since there in this coding method, there is no fixed length for codewords, the premise of \texttt{prefix free code} must be adhered to. This means there can be no codeword that match the sequence of any prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords but adding a codeword like 01 or 00 would make the set invalid because of the prefix 0, which is already a single codeword.\\
														
 
															+The equation \eqref{eq:huf} describes the path to the desired state for the tree. The upper bound \texttt{n} is assigned the length of the input text. The tuple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, which also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, an element in \texttt{l} contains a sequence of zeros and ones. Since this coding method has no fixed length for codewords, the premise of a \texttt{prefix-free code} must be adhered to. This means that no codeword may match a prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords, but adding a codeword like 01 or 00 would make the set invalid because of the prefix 0, which is already a codeword on its own.\\
														
 
															 With all important elements described: the sum that results from this equation is the average length a symbol in the encoded input text will require to be stored \cite{huf52, moffat20}.
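The prefix-free premise and the sum in \eqref{eq:huf} can be sketched in a few lines of Python. This is a minimal illustration, not part of the analyzed tools: the function names and the weights are assumptions chosen for the example, and only the codewords 0, 10, 11 and 01 are taken from the text.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True


def average_length(weights, lengths):
    """Compute L_av as the sum of w_i * l_i over all symbols."""
    return sum(w * l for w, l in zip(weights, lengths))


valid = ["0", "10", "11"]
print(is_prefix_free(valid))           # True
print(is_prefix_free(valid + ["01"]))  # False, because 0 is a prefix of 01

# Hypothetical weights (probabilities), one per codeword in `valid`.
weights = [0.5, 0.25, 0.25]
lengths = [len(code) for code in valid]
print(average_length(weights, lengths))  # 1.5
```

With these assumed weights, a symbol needs 1.5 bits on average instead of the two bits a fixed-length code for three symbols would require.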
														
 
															 % example
														
@@ -259,11 +259,11 @@ The average length for any symbol encoded in \acs{ASCII} is eight, while only us
 
															 \rmfamily
														
 
															 The exact input text is not relevant, since only the resulting probabilities are needed. To make this example more illustrative, possible occurrences are listed in the rightmost column of \ref{t:huff-pre}. The probability for each symbol is calculated by dividing the number of times the symbol occurred by the message length. The occurrences and the resulting probabilities, on a scale between 0.0 and 1.0, are shown in \ref{t:huff-pre} \cite{huf52}.\\
														
 
															-Creating a tree will be done bottom up. In the first step, for each symbol from the alphabet, a node without any connection is formed .\\
														
 
															+Creating a tree is done bottom-up. In the first step, for each symbol from the alphabet, a node without any connection is formed.\\
														
 
															 \texttt{<A>, <T>, <C>, <G>}\\
														
 
															-Starting with the two lowest weightened symbols, a node is added to connect both. With the added, blank node the count of available nodes got down by one. The new node weights as much as the sum of weights of its child nodes so the probability of 0.16 is assigned to \texttt{<A,T>}.\\
														
 
															+Starting with the two lowest-weighted symbols, a node is added to connect both. With the added, blank node the count of available nodes is reduced by one. The new node weighs as much as the sum of the weights of its child nodes, so the probability of 0.16 is assigned to \texttt{<A,T>}.\\
														
 
															 \texttt{<A, T>, <C>, <G>}\\
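The bottom-up construction can be sketched with a priority queue. The following Python snippet is an illustration under stated assumptions: the occurrence counts are hypothetical (chosen so that \texttt{A} and \texttt{T} together account for 16 of 100 symbols, matching the 0.16 above), and the function name is not taken from any cited source.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return {symbol: codeword length} by merging the two lightest nodes repeatedly."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Every symbol below the new parent node moves one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


# Hypothetical occurrence counts for a message of 100 symbols.
counts = {"A": 11, "T": 5, "C": 71, "G": 13}
lengths = huffman_code_lengths(counts)
print(lengths)  # {'A': 3, 'T': 3, 'C': 1, 'G': 2}

total = sum(counts.values())
print(sum(counts[s] * lengths[s] for s in counts) / total)  # 1.45
```

The sketch only tracks codeword lengths, which is all that \eqref{eq:huf} needs. Assigning the actual bit patterns would require keeping the tree structure as well.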
														
@@ -316,16 +316,16 @@ The information on the following pages was received through static code analysis
 
															 This tool has three development stages. The first, \acs{GeCo}, was released in 2016. It has the smallest codebase, with only eleven C files. The two following extensions, \acs{GeCo}2, released in 2020, and the latest version \acs{GeCo}3, have bigger codebases \cite{geco-repo}. They also provide features like the usage of a neural network, which are of no help for this work. Since the files providing arithmetic coding functionality do not differ between the three versions, the first release was analyzed.\\
														
 
															 % explain header files
														
 
															 The header files, that this tool includes in \texttt{geco.c}, can be split into three categories: basic operations, custom operations and compression algorithms. 
														
 
															-The basic operations include header files for general purpose functions, that can be found in almost any c++ Project. The provided functionality includes operations for text-output on the command line inferface, memory management, random number generation and several calculations on up to real numbers.\\
														
 
															-Custom operations happens to include general purpose functions too, with the difference that they were written, altered or extended by \acs{GeCo}s developer. The last category cosists of several C Files, containing implementations of two arithmetic coding implementations: \textbf{first} \texttt{bitio.c} and \texttt{arith.c}, \textbf{second} \texttt{arith\_aux.c}.\\
														
 
															-The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver and Radford Neal (is only mentioned in the latter). Comparing the two files, \texttt{bitio.c} has less code, shorter comments and much more not functioning code sections. Overall the conclusion would be likely that \texttt{arith.c} is some kind of official release, wheras \texttt{bitio.c} severs as a experimental  file for the developers to create proof of concepts. The described files adapt code from Armando J. Pinho licenced by University of Aveiro DETI/IEETA written in 1999.\\
														
 
															+The basic operations include header files for general purpose functions that can be found in almost any C++ project. The provided functionality includes operations for text output on the command line interface, memory management, random number generation and several calculations on numbers from natural to real.\\
														
 
															+Custom operations happen to include general purpose functions too, with the difference that they were written, altered or extended by \acs{GeCo}'s developer. The last category consists of several C files, containing implementations of two arithmetic coding algorithms: \textbf{first} \texttt{bitio.c} and \texttt{arith.c}, \textbf{second} \texttt{arith\_aux.c}.\\
														
 
															+The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver and Radford Neal. Comparing the two files, \texttt{bitio.c} has less code, shorter comments and many more non-functional code sections. Overall, the likely conclusion is that \texttt{arith.c} is some kind of official release, whereas \texttt{bitio.c} serves as an experimental file for the developers to create proofs of concept. The described files adapt code from Armando J. Pinho, licensed by University of Aveiro DETI/IEETA and written in 1999.\\
														
 
															 The second implementation was also licensed by University of Aveiro DETI/IEETA, but no author is mentioned. Judging from the function names and the length of the function bodies, \texttt{arith\_aux.c} could serve as a wrapper for basic functions that are often used in arithmetic coding.\\
														
 
															 Since original versions of the files licensed by University of Aveiro could not be found, there is no way to determine if the files comply with their originals or if changes have been made. This should be considered while following the static analysis.
														
 
															 Following function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. In this function the actual arithmetic encoding is executed. This function has no redirects to other files and only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only \cite{geco-repo}.
														
 
															 % if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
														
 
															-Following function calls int the \texttt{compressor} section of \texttt{geco.c}, to find the call of \texttt{arith.c} no sign of multithreading could be identified. This fact leaves additional optimization possibilities and will be discussed at the end of this work.
														
 
															+While following function calls in the \texttt{compressor} section of \texttt{geco.c}, to find the locations where \texttt{arith.c} gets executed, no sign of multithreading could be identified. This fact leaves additional optimization possibilities.
														
 
															 %useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \cite{moffat_arith}.
														
 
															 %\subsection{genie} % genie
														
@@ -342,13 +342,13 @@ Data is split into blocks. Each block stores a header consisting of three bit. A
 
															 	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
														
 
															 	\item \texttt{10}		Compressed with dynamic Huffman codes.
														
 
															 \end{itemize}
														
 
															-The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.
														
 
															+The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.\\
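Reading the three header bits can be sketched as follows. This is a minimal illustration that assumes the bit packing of RFC 1951, where the final-block flag is the least significant bit of the first byte and the two type bits follow; the function and dictionary names are chosen for this example only.

```python
# Block types as listed above; 11 is reserved and marks an error.
BLOCK_TYPES = {
    0b00: "stored (no compression)",
    0b01: "fixed Huffman codes",
    0b10: "dynamic Huffman codes",
    0b11: "reserved (error)",
}


def parse_block_header(first_byte):
    """Split the first three bits of a block into (is_last_block, block_type)."""
    is_last = bool(first_byte & 0b1)    # lowest bit: last-block flag
    btype = (first_byte >> 1) & 0b11    # next two bits: compression type
    return is_last, BLOCK_TYPES[btype]


print(parse_block_header(0b101))  # (True, 'dynamic Huffman codes')
print(parse_block_header(0b010))  # (False, 'fixed Huffman codes')
```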
														
 
															 % lz77 part
														
 
															-As described in \ref{k4:lz} a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.
														
 
															+%As described in \ref{k4:lz} the compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.
														
 
															 The \acs{LZ77} algorithm is executed before the Huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.\\
														
 
															 % Huffman part
														
 
															-Besides header bit and a data block, two Huffman code trees are store. One encodes literals and lenghts and the other distances. They happen to be in a compact form. This is archived by a addition of two rules on top of the rules described in \ref{k4:huff}: Codes of identical lengths are orderd lexicographically, directed by the characters they represent. And the simple rule: shorter codes precede longer codes.
														
 
															+Besides the header bits and a data block, two Huffman code trees are stored. One encodes literals and lengths and the other distances. They happen to be in a compact form. This is achieved by the addition of two rules on top of the rules described in \ref{k4:huff}: codes of identical length are ordered lexicographically, directed by the characters they represent, and shorter codes precede longer codes.
														
 
															 To illustrate this with an example:
														
 
															 For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which would occur more often than the other two characters, the codes would change to a representation like this:
														
@@ -366,7 +366,7 @@ For a text consisting out of \texttt{C} and \texttt{G}, following codes would be
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															-Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to contain a leading 0. \texttt{C} precedes \texttt{G} lexicographically, therefor the (in a numerical sense) smaller code is set to represent \texttt{C}.\\
														
 
															+Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented by a 0. To maintain prefix-free codes, the two remaining codes are not allowed to contain a leading 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is set to represent \texttt{C}.\\
														
 
															 With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a smaller alphabet can make a relevant difference.\\
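How codes can be rebuilt from stored code lengths alone is sketched below. The snippet follows the counting scheme given in RFC 1951 for canonical codes; the function name and the dictionary-based interface are assumptions made for this illustration.

```python
def canonical_codes(lengths):
    """Rebuild codewords from {symbol: code length}, following the two ordering rules."""
    max_len = max(lengths.values())
    # Count how many codes exist for each length (length 0 means "unused").
    bl_count = [0] * (max_len + 1)
    for length in lengths.values():
        if length:
            bl_count[length] += 1
    # Determine the numerically smallest code for each length.
    code = 0
    next_code = [0] * (max_len + 1)
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes in lexicographic symbol order.
    codes = {}
    for sym in sorted(lengths):
        length = lengths[sym]
        if length:
            codes[sym] = format(next_code[length], "0{}b".format(length))
            next_code[length] += 1
    return codes


print(canonical_codes({"A": 1, "C": 2, "G": 2}))
# {'A': '0', 'C': '10', 'G': '11'}
```

For the three-character alphabet from the example, storing the lengths 1, 2, 2 is enough to recover the codes 0, 10 and 11.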
														
 
															 % example header, alphabet, data block?
														
@@ -379,7 +379,7 @@ The complete structure is displayed in \ref{k4:cram-struct}. The following parag
 
															 \begin{figure}[H]
														
 
															   \centering
														
 
															   \includegraphics[width=15cm]{k4/cram-structure.png}
														
 
															-  \caption{\acs{CRAM} file format structure \cite{bam}.}
														
 
															+  \caption{File Format Structure of Samtools \acs{CRAM} \cite{bam}.}
														
 
															   \label{k4:cram-struct}
														
 
															 \end{figure}
														
--- a/latex/tex/kapitel/k5_feasability.tex
+++ b/latex/tex/kapitel/k5_feasability.tex
@@ -59,8 +59,6 @@ Reading from /proc/cpuinfo reveals processor specifications. Since most of the i
 
															   \item address sizes: 36 bits physical, 48 bits virtual
														
 
															 \end{itemize}
														
 
															-Full CPU secificaiton can be found in \ref{a5:cpu}.
														
 
															-
														
 
															 % explanation on some entry: https://linuxwiki.de/proc/cpuinfo
														
 
															 %\begin{em}
														
 
															 %processor	: 7
														
@@ -92,16 +90,12 @@ Full CPU secificaiton can be found in \ref{a5:cpu}.
 
															 %power management:
														
 
															 %\end{em}
														
 
															-The installed \ac{RAM} was offering a total of 16GB with four 4GB instances. 
														
 
															+The installed \ac{RAM} offered a total of 16~\acs{GB}, split across four 4~\acs{GB} modules. 
														
 
															 The specifications relevant for this paper are listed below:
														
 
															-\noindent Command used to list 
														
 
															-\begin{lstlisting}[language=bash]
														
 
															-   dmidecode --type 17
														
 
															-\end{lstlisting}
														
 
															 \begin{itemize}
														
 
															   \item{Total/Data Width: 64 bits}
														
 
															-  \item{Size: 4GB}
														
 
															+  \item{Size: 4~\acs{GB}}
														
 
															   \item{Type: DDR3}
														
 
															   \item{Type Detail: Synchronous}
														
 
															   \item{Speed/Configured Memory Speed: 1600 Megatransfers/s}
														
@@ -132,13 +126,13 @@ For this paper relevant specifications are listed below:
 
															 %
														
 
															 \section{Operating System and Additionally Installed Packages}
														
 
															-To leave the testing environment in a consistent state, not project specific processes running in the background, should be avoided. 
														
 
															+To keep the testing environment in a consistent state, background processes that are not project-specific should be avoided. 
														
 
															 Due to following circumstances, a current Linux distribution was chosen as a suitable operating system:
														
 
															 \begin{itemize}
														
 
															-  \item{factors that interfere with a consistent efficiency value should be avoided}
														
 
															-  \item{packages, support and user experience should be present to an reasonable amount}
														
 
															+  \item{Factors that interfere with a consistent efficiency value should be avoided.}
														
 
															+  \item{Packages, support and user experience should be present to a reasonable extent.}
														
 
															 \end{itemize}
														
 
															-Some background processes will run while the compression analysis is done. This is owed to the demand of an increasingly complex operating system to execute complex programs. Considering that different tools will be exeuted in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The boundary set by the time limitation for this work rejects named alternatives. 
														
 
															+Some background processes will run while the compression analysis is done. This is owed to the demand of an increasingly complex operating system to execute complex programs. Considering that different tools will be executed in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The time limit of this work rules out the mentioned alternatives. 
														
 
															 %By comparing the values of explained factors, a sweet spot can be determined:
														
 
															 % todo: add preinstalled package/program count and other specs
														
 
															 \textbf{Debian GNU/Linux} version \textbf{11} was chosen because it features enough packages to run every tool without spending too much time on the setup.\\
														
@@ -170,12 +164,12 @@ Following criteria is reqiured for test data to be appropriate:
 
															 \end{itemize}
														
 
															 A second, bigger set of test files was required. This would verify that the test results are not limited to small files. A minimum average file size of one gigabyte was set as a boundary. This corresponds to over five times the size of the first set.\\
														
 
															 % data gathering
														
 
															-Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The ensembl database featured defined criteria, so the first available set called:
														
 
															+Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The Ensembl database met the defined criteria, so the first available set called:
														
 
															 \texttt{Homo\_sapiens.GRCh38.dna.chromosome}
														
 
															-were chosen \cite{ftp-ensembl}. This sample includes 20 chromosomes, whereby considering the filenames, one chromosome is contained in each single file. After retrieving and unpacking the files, write privileges on them was withdrawn. So no tool could alter any file contents, without sufficient permission.
														
 
															-Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since available servers \cite{ftp-ensembl, ftp-ncbi, ftp-igsr} offer several thousand files, stored in variating, deep directory structures, mapping filesize, filetype and file path takes too much time and resources for the scope of this work. This problematic combined with a easily triggered overflow in the samtools library, resulted in a set of several, manualy searched and tested \acs{FASTq} files. Compared to the first set, there is a noticable lack of quantity, but the filesizes happen to be of a fortunate distribution. With pairs of two files in the ranges of 0.6, 1.1, 1.2 and one file with a size of 1.3 gigabyte, effects on scaling sizes should be clearly visible.\\
														
 
															+was picked \cite{ftp-ensembl}. This sample includes 20 chromosomes, whereby, judging by the filenames, each file contains a single chromosome. After retrieving and unpacking the files, write privileges on them were withdrawn, so that no tool could alter any file contents without sufficient permission.
														
 
															+Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since available servers \cite{ftp-ensembl, ftp-ncbi, ftp-igsr} offer several thousand files, stored in varying, deep directory structures, mapping file size, file type and file path takes too much time and resources for the scope of this work. This problem, combined with an easily triggered overflow in the samtools library, resulted in a set of several manually searched and tested \acs{FASTq} files. Compared to the first set, there is a noticeable lack of quantity, but the file sizes happen to be of a fortunate distribution. With pairs of two files in the ranges of 0.6, 1.1, 1.2 and one file with a size of 1.3 gigabytes, effects on scaling sizes should be clearly visible.\\
														
 
															 \mycomment{
														
 
															 %make sure this needs to stay.
														
--- a/latex/tex/kapitel/k6_results.tex
+++ b/latex/tex/kapitel/k6_results.tex
@@ -1,17 +1,16 @@
 
															 \chapter{Results and Discussion}
														
 
															-The tables \ref{a6:compr-size} and \ref{a6:compr-time} contain raw measurement values for the two goals, described in \ref{k5:goals}. The table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds. \ref{a6:compr-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, the a name scheme is used, to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the number prefix \texttt{1.} was used, the second set is marked with a \texttt{2.}. For example, the fourth file of each test, in tables are named like this \texttt{File 1.4} and \texttt{File 2.4}. The name of the associated source file for the first set is:
														
 
															+The tables \ref{a6:compr-size} and \ref{a6:compr-time} contain raw measurement values for the two goals described in \ref{k5:goals}. The table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds. \ref{a6:compr-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, a naming scheme is used to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the number prefix \texttt{1.} was used, the second set is marked with a \texttt{2.}. For example, the fourth file of each test set is named \texttt{File 1.4} and \texttt{File 2.4} in the tables. The name of the associated source file for the first set is:
														
 
															 \texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}
														
 
															 Since the source files of the second set are not named as consistently as in the first one, a third column was added in \ref{k6:set2size}, which maps the table ID to the source file name.\\
														
 
															 The files contained in each test set, as well as their size can be found in the tables \ref{k6:set1size} and \ref{k6:set2size}.
														
 
															-The first test set contained a total of 2.8 \acs{GB} unevenly spread over 21 files, while the second test set contained 7 \acs{GB} in total, with a quantity of seven files.\\
														
 
															+The first test set contained a total of 2.8~\acs{GB} unevenly spread over 21 files, while the second test set contained 7~\acs{GB} in total, with a quantity of seven files.\\
														
 
															-\label{k6:set1size}
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[h]{ p{.4\textwidth} p{.4\textwidth}} 
														
 
															-    \caption[First Test Set Files and their Sizes in MB]                       % Caption für das Tabellenverzeichnis
														
 
															+    \caption[First Test Set Files and their Sizes in \acs{MB}]                       % Caption für das Tabellenverzeichnis
														
 
															         {Files contained in the First Test Set and their Sizes in \acs{MB}} % Caption für die Tabelle selbst
														
 
															         \\
														
 
															     \toprule
														
@@ -39,15 +38,15 @@ The first test set contained a total of 2.8 \acs{GB} unevenly spread over 21 fil
 
															 			File 1.20& 62.483\\
														
 
															 			File 1.21& 45.289\\
														
 
															     \bottomrule
														
 
															+		\label{k6:set1size}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															-\label{k6:set2size}
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.4\textwidth}} 
														
 
															-    \caption[Second Test Set Files and their Sizes in MB]                       % Caption für das Tabellenverzeichnis
														
 
															+    \caption[Second Test Set Files and their Sizes in \acs{MB}]                       % Caption für das Tabellenverzeichnis
														
 
															         {Files contained in the Second Test Set, their Sizes in \acs{MB} and Source File Names} % Caption für die Tabelle selbst
														
 
															         \\
														
 
															     \toprule
														
@@ -61,18 +60,19 @@ The first test set contained a total of 2.8 \acs{GB} unevenly spread over 21 fil
 
															 			File 2.6& 1071.095& SRR002818.recal.fastq\\
														
 
															 			File 2.7& 1240.564& SRR002819.recal.fastq\\
														
 
															     \bottomrule
														
 
															+	\label{k6:set2size}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															+
														
 
															 \section{Interpretation of Results}
														
 
															 The units milliseconds and bytes store a high precision. Unfortunately, they are harder to read and compare by the reader's eye alone. Therefore the data was altered. Sizes in \ref{k6:sizepercent} are displayed as percentages in relation to the respective source file. This means that the compression with \acs{GeCo} on:
														
 
															 \texttt{Homo\_sapiens.GRCh38.dna.chromosome.11.fa}
														
 
															 resulted in a compressed file which was only 17.6\% as big.
														
 
															-Runtimes in \ref{k6:time} were converted into seconds and have been rounded to two decimal places.
														
 
															-Also a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
														
 
															-\label{k6:sizepercent}
														
 
															+Runtimes in \ref{k6:time} were converted into seconds and have been rounded to two decimal places. Also, a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
														
 
															+
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
														
@@ -107,13 +107,13 @@ Also a line was added to the bottom of each table, showing the average percentag
 
															       &&&\\
														
 
															 			\textbf{Total}& 18.98& 24.99& 22.71\\
														
 
															     \bottomrule
														
 
															+		\label{k6:sizepercent}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															 Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} method improved this by roughly 2.5\%. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4\% comes with a comparatively great sacrifice in time.\\
														
 
															-\label{k6:time}
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
														
@@ -147,6 +147,7 @@ Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} m
 
															       &&&\\
														
 
															       \textbf{Total}&42.57&2.09&9.32\\
														
 
															     \bottomrule
														
 
															+		\label{k6:time}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
@@ -161,7 +162,6 @@ Before interpreting this data further, a quick view into development processes:
 
															 %For the second set of test data, the file identifier was set to follow the scheme \texttt{File 2.x} where x is a number between zero and seven. While the first set of test data had names that matched the file identifiers, considering its numbering, the second set had more variating names. The mapping between identifier and file can be found in \ref{}. % todo add test set tables
														
 
															 Reviewing \ref{k6:recal-time} one will notice, that \acs{GeCo} reached a runtime over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an m indicates how many minutes each run took.
														
 
															-\label{k6:recal-size}
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
														
@@ -182,11 +182,11 @@ Reviewing \ref{k6:recal-time} one will notice, that \acs{GeCo} reached a runtime
 
															       &&&\\
														
 
															 			\textbf{Total}& 1.07& 7.11& 6.17\\
														
 
															     \bottomrule
														
 
															+		\label{k6:recal-size}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															-\label{k6:recal-time}
														
 
															 \sffamily
														
 
															 \begin{footnotesize}
														
 
															   \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
														
@@ -207,14 +207,15 @@ Reviewing \ref{k6:recal-time} one will notice, that \acs{GeCo} reached a runtime
 
															       &&&\\
														
 
															 			\textbf{Total}& 1m43.447& 13.474& 20.567\\
														
 
															     \bottomrule
														
 
															+		\label{k6:recal-time}
														
 
															   \end{longtable}
														
 
															 \end{footnotesize}
														
 
															 \rmfamily
														
 
															-In both tables \ref{k6:recal-time} and \ref{k6:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{k6:recal-size} a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven were the one with the greatest size (\~1.3 Gigabyte). Closely folled by file one and two (\~1.2 Gigabyte). 
														
 
															+In both tables, \ref{k6:recal-time} and \ref{k6:recal-size}, the already identified pattern can be observed. Looking at the compression ratio in \ref{k6:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the one with the greatest size ($\sim$1.3~\acs{GB}), closely followed by files one and two ($\sim$1.2~\acs{GB}). 
														
 
															 \section{View on Possible Improvements}
														
 
															-So far, this work went over formats for storing genomes, methods to compress files (in mentioned formats) and through tests where implementations of named algorithms compress several files and analyzed the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools and takes more time to run through. So in this testrun, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM} with the mix of Huffman coding and \acs{LZ77}, or Samtools custom compression format \acs{CRAM}. Comparing results in \autocite{survey}, supports this statement. This study used \acs{FASTA}/Multi-FASTA files from 71MB to 166MB and found that \acs{GeCo} had a variating compression ratio from 12.34 to 91.68 times smaller than the input reference and also resulted in long runtimes up to over 600 minutes \cite{survey}. Since this study focused on another goal than this work and therefore used different test variables and environments, the results can not be compared. But what can be taken from this, is that arithmetic coding, at least in \acs{GeCo} is in need of a runtime improvement.\\
														
 
															+So far, this work has covered formats for storing genomes, methods to compress files in these formats, and tests in which implementations of the named algorithms compressed several files, followed by an analysis of the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools but takes more time to run. So in this test run, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM}, with its mix of Huffman coding and \acs{LZ77}, or Samtools' custom compression format \acs{CRAM}. The results in \autocite{survey} support this statement. That study used \acs{FASTA}/Multi-FASTA files from 71~\acs{MB} to 166~\acs{MB} and found that \acs{GeCo} produced output between 12.34 and 91.68 times smaller than the input reference, with long runtimes of up to over 600 minutes \cite{survey}. Since that study pursued a different goal than this work and therefore used different test variables and environments, the results cannot be compared directly. What can be taken from this is that arithmetic coding, at least in \acs{GeCo}, is in need of a runtime improvement.\\
														
 
															 The actual mathematical proof of such an improvement, the planning of an implementation and the development of a proof of concept will be a rewarding but time- and resource-consuming project. Dealing with those tasks would go beyond the scope of this work. But in order to widen the foundation for these tasks, the rest of this work will consist of considerations and problem analyses that should be thought about and dealt with in order to develop an improvement.
														
 
															 S.V. Petoukhov described his prepublished findings, which are under ongoing research, about the distribution of nucleotides \cite{pet21}. With the probability of one nucleotide, in a sequence of sufficient length, estimations about the direct neighbours of this nucleotide might be revealed. This can be illustrated in this formula \cite{pet21}:\\
														
@@ -234,7 +235,7 @@ Further he described that there might be a simliarity between nucleotides.
 
															 The exemplary probabilities he displayed are reprinted in \ref{k6:pet-prob}. Notable are the similarities in the distribution of \%A and \%G as well as of \%C and \%T. They align up to the third digit after the decimal point. According to Petoukhov, this regularity is found in the genomes of humans, some animals, plants, bacteria and more \cite{pet21}.\\
														
 
															 % begin optimization 
														
 
															-Considering this and the measured results, an improvement in the arithmetic coding process and therefore in \acs{GeCo}s efficiency, would be a good start to equalize the great gap in the compression duration. Combined with a tool that is developed with todays standards, there is a possibility that even greater improvements could be archived.\\
														
 
															+Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good start to close the great gap in compression duration. Combined with a tool that is developed to today's standards, there is a possibility that even greater improvements could be achieved.\\
														
 
															 % simple theoretical approach
														
 
															 What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probabilities of each symbol in the alphabet. The simplest way to do that is to parse the whole sequence from start to end and increase a counter for each nucleotide that is parsed. 
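The counting approach described here can be sketched in a few lines. This is only an illustration and not the way \acs{GeCo} or Samtools determine their models; the function name \texttt{nucleotide\_probabilities} and the restriction to the four symbols A, C, G and T are assumptions made for this example.

```python
from collections import Counter


def nucleotide_probabilities(sequence: str) -> dict[str, float]:
    """Estimate order-0 probabilities by visiting every symbol once.

    Hypothetical helper for illustration: symbols other than
    A, C, G and T (for example N) are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[n] for n in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {n: counts[n] / total for n in "ACGT"}


# Toy input; a real run would stream a whole chromosome.
probs = nucleotide_probabilities("AACCCTAACCCT")
```

The cost of this baseline grows linearly with the sequence length, which is the part the improvement approach discussed here aims to reduce.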
														
 
															 With the new findings by Petoukhov taken into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of the other nucleotides by calculation, rather than by counting each one.
														
@@ -248,7 +249,7 @@ This approach throws a few questions that need to be answered in order to plan a
 
															 \end{itemize}
														
 
															 % first bulletpoint
														
 
															-The question for how many probabilities are needed, needs to be answered, to start working on any kind of implementation. This question will only get answered by theoretical prove. It could happen in form of a mathematical equation, which proves that counting all occurrences of one nucleotide reveals can be used to determin all probabilities. 
														
 
															+The question of how many probabilities are needed has to be answered before work on any kind of implementation can start. This question can only be answered by a theoretical proof. It could take the form of a mathematical equation which proves that counting all occurrences of one kind of nucleotide can be used to determine the probabilities of the other nucleotides. 
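To make the idea more concrete, the following sketch derives a full distribution from the count of a single nucleotide. It rests on an unproven assumption: that the four probabilities fall into two pairs of nearly equal values. Which nucleotide forms the pair with the counted one is passed in as a parameter, because that is exactly what the proof mentioned above would have to establish. The function name \texttt{derive\_distribution} and its parameters are hypothetical.

```python
def derive_distribution(sequence: str, counted: str, partner: str) -> dict[str, float]:
    """Approximate all four probabilities from one counted nucleotide.

    Assumption (not proven here): `partner` is as probable as `counted`,
    and the two remaining nucleotides share the rest equally.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    p = sequence.count(counted) / len(sequence)
    if p > 0.5:
        raise ValueError("pairing assumption violated: 2 * p exceeds 1")
    rest = (1.0 - 2.0 * p) / 2.0
    return {n: (p if n in (counted, partner) else rest) for n in "ACGT"}


# Toy input in which the pairing assumption happens to hold exactly.
estimate = derive_distribution("ACCGGT", counted="C", partner="G")
```

Note that counting a single symbol still reads the whole sequence; the sketch only reduces the number of counters. Whether this saves a measurable amount of time, and how large the approximation error is on real genomes, would have to be tested.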
														
 
															 %Since this task is time and resource consuming and there is more to discuss, finding a answer will be postponed to another work. 
														
 
															 %One should keep in mind that this is only one of many approaches. Any prove of other approaches which reduces the probability determination, can be taken in instead. 
														
@@ -310,15 +311,15 @@ If there space for improvement in the parsing/counting process, what problems ne
 
															 % C is approximately T => bytes(genome) - bytes(T) = 2*bytes(A) = 2*bytes(G) = bytes(A) + bytes(T)
														
 
															 % bulletpoint 3
														
 
															-Another important question that needs answered would be: If Petoukhovs findings will show that, through simliarities in the distribution of each nucleotide, one can lead to the aproximation of the other three. Entropy codings work with probabilities, how does that affect the coding mechanism?
														
 
															-With a equal probability for each nucleotide, entropy coding can not be treated as a whole. This is due to the fact, that Huffman coding makes use of differing probabilities. A equal distribution means every character will be encoded in the same length which would make the encoding process unnecessary. Arithmetic coding on the other hand is able to handle equal probabilities.
														
 
															+The last point refers to the possibility that Petoukhov's findings will show that the similarities in the distribution are universal. Entropy coding works with probabilities, so how would that affect the coding mechanism?
														
 
															+With an equal probability for each nucleotide, entropy coding cannot be treated as a whole. This is because Huffman coding makes use of differing probabilities. An equal distribution means every character will be encoded with the same length, which would make the encoding process less useful. Arithmetic coding, on the other hand, is able to handle equal probabilities.
														
 
															 Genomes obviously contain chains of repeating nucleotides. One example is \texttt{File 1.10}, which contains this subsequence:
														
 
															 \texttt{AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCC} 
														
 
															-Without determining probabilities, one can see that the amount of \texttt{C}s outnumbers \texttt{T}s and \texttt{A}s. With the whole 133258320 symbols 130\acs{MB}, the probability distribution will align more. The following values have been roundet down: \texttt{A $\approx$ 0.291723, C $\approx$ 0.207406, G $\approx$ 0.208009, T $\approx$ 0.2928609}. The pattern described by S. Petoukhov is recognizable. But by cutting out a subsection, of relevant size, with unequal distributions will have an impact on the probabilities of the whole sequence. 
														
 
															+Without determining probabilities, one can see that the number of \texttt{C}s exceeds the number of \texttt{T}s and \texttt{A}s. Across the whole sequence of 133258320 symbols (130~\acs{MB}), the probability distribution aligns more. The following values have been rounded down: \texttt{A $\approx$ 0.291723, C $\approx$ 0.207406, G $\approx$ 0.208009, T $\approx$ 0.2928609}. The pattern described by S. Petoukhov is recognizable. But cutting out a subsection of relevant size with an unequal distribution will have an impact on the probabilities of the whole sequence. 
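The difference between the local and the global distribution can be quantified with the order-0 Shannon entropy, which gives the average number of bits per symbol an ideal entropy coder would need. The sketch below is an illustration under simple assumptions: it uses only the subsequence quoted above and the rounded whole-file probabilities stated in the text, and the helper name \texttt{shannon\_entropy} is chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Order-0 Shannon entropy in bits per symbol; zero entries are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Subsequence quoted in the text from File 1.10.
subsequence = "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCC"
counts = Counter(subsequence)
local = [counts[n] / len(subsequence) for n in "ACGT"]

# Rounded whole-file probabilities as stated in the text (A, C, G, T).
reported = [0.291723, 0.207406, 0.208009, 0.2928609]

h_uniform = shannon_entropy([0.25] * 4)  # 2 bits: no gain over a fixed 2-bit code
h_reported = shannon_entropy(reported)   # only slightly below 2 bits
h_local = shannon_entropy(local)         # clearly lower: the subsequence is skewed
```

Under these assumptions the whole-file distribution is close to the uniform case, so a coder that only uses symbol frequencies gains little over two bits per nucleotide, while skewed subsections leave noticeably more room for compression.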
														
 
															 If a longer sequence leads to a more equal distribution, this knowledge could be used to help determine distributions on subsequences of a sequence with equally distributed probabilities.\\
														
 
															-There are some rules that apply to any whole chromosom sequence as well as to subsequences rerefenced by \texttt{S}. With the knowledge about lenght \texttt{len(S)} and the frequency and position of one symbol e.g. \texttt{C} represented as \texttt{|C|}, rules about the enveloping sequence can be derived. The arithmetic operations on symbols $\cdot$ for consecutive repetitions and $+$ for the concatination are used. For x and y as the ammount of nucleotides before the first and after the last \texttt{C}:
														
 
															+There are some rules that apply to any whole chromosome sequence as well as to subsequences, referenced by \texttt{S}. With knowledge about the length \texttt{len(S)} and the frequency and position of one symbol, e.g. \texttt{C}, represented as \texttt{|C|}, rules about the enveloping sequence can be derived. The arithmetic operations on symbols, $\cdot$ for consecutive repetitions and $+$ for concatenation, are used. For x and y as the number of nucleotides before the first and after the last \texttt{C}:
														
 
															 \begin{itemize}
														
 
															 	\item $\frac{len(S)}{x/y-1}\cdot (|C| -1)$ determines the number of $(x \cdot N) + C$ and $C + (y \cdot N)$ sequences $\in S$. 
														
@@ -335,7 +336,7 @@ Besides multithreading, there are other methods that could impact improvement ap
 
															 \mycomment{
														
 
															 Summarizing relevant points to end this work in a final conclusion and the view in a possible future:
														
 
															 - coding algorithms did not change drastically in the last decades 
														
 
															-- improvements are archived by additions to existing algorithms and combining multiple algorithms for specific tasks
														
 
															+- improvements are achieved by additions to existing algorithms and combining multiple algorithms for specific tasks
														
 
															 - tests and comparisons have shown that arithmetic coding lacks efficiency
														
 
															 possible future events:
														
@@ -354,13 +355,13 @@ bad case
 
															 Before resulting in a final conclusion, a quick summary of important points:
														
 
															 \begin{itemize}
														
 
															 	\item coding algorithms did not change drastically in the last decades 
														
 
															-	\item improvements are archived by additions to existing algorithms and combining multiple algorithms for specific tasks
														
 
															+	\item improvements are achieved by additions to existing algorithms and combining multiple algorithms for specific tasks
														
 
															  	\item tests and comparisons have shown that arithmetic coding lacks efficiency
														
 
															 \end{itemize}
														
 
															 The goal for this new optimization approach is clearly defined. Also, a possible test environment and measurement techniques that indicate success have been tested, in this work as well as in cited works \cite{survey}. Considering how other improvements were implemented in the past shows that the way this approach would work is feasible \cite{moffat_arith}. This, combined with the last point, leads to the assumption that there is a realistic chance to optimize entropy coding, specifically the arithmetic coding algorithm.\\
														
 
															 This assumption can be consolidated by viewing best- and worst-case scenarios that could result from further research. Two variables are taken into this thought process. One is the success of the optimization approach and the other is whether Petoukhov's findings develop favorably:
														
 
															 In the best case, optimization through exact determination of the whole probability distribution is possible and Petoukhov's findings prove that his rules are universal for genomes across living organisms. This would result in faster compression with entropy coding. Depending on the dimension, either a tool that implements entropy coding only or a hybrid tool with improved efficiency in its entropy coding algorithms would set the new \texttt{state of the art}.\\
														
 
															-In a worst case szenario, the exact determination of probability distributions would not be possible. This would mean more research should be done in approximating probability distibutions. Additionally how the use of $A\approx G \approx 0.2914$ and $C\approx T\approx 0.2086$ could provide efficiency improvements in reference-free compression of whole chromosomes and general improvements in the compression of a reference genome in reference-based compression solutions \cite{survey}.\\
														
 
															+In a worst-case scenario, the exact determination of probability distributions would not be possible. This would mean more research should be done on approximating probability distributions. Additionally, it should be investigated how the use of $A\approx G \approx 0.2914$ and $C\approx T\approx 0.2086$ could provide efficiency improvements in reference-free compression of whole chromosomes and general improvements in the compression of a reference genome in reference-based compression solutions \cite{survey}.\\
														
 
															 In this scenario, Petoukhov would also be wrong about the universality of the defined rules. Considering the exemplary calculation of the probability distribution of \texttt{File 1.10}, the concern that his rules do not apply to any genome, or that he miscalculated, can be ruled out. A lack of universality would limit the range of the impact an improvement would create. The combination of a list of genomes that follow Petoukhov's rules and a list of tools that specialize in compressing those genomes would set the new goal for an optimization approach.\\
														
 
															 %From this perspective, how favorable research turns out does not determine if there will be an impact but just how far it will reach.