fixed and added tables

u 3 years ago
parent
commit
ce578296d6

Binary
latex/tex/bilder/k3/fasta-structure.png


Binary
latex/tex/bilder/k3/fastq-structure.png


Binary
latex/tex/bilder/k3/sam-structure.png


+ 0 - 0
latex/tex/bilder/k3/dict-coding.png → latex/tex/bilder/k4/dict-coding.png


+ 0 - 0
latex/tex/bilder/k3/huffman-tree.png → latex/tex/bilder/k4/huffman-tree.png


+ 3 - 2
latex/tex/docinfo.tex

@@ -39,7 +39,8 @@
 %          erkannt.
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Deutsch
-\newcommand{\hsmaabstractde}{TBD.}
+\newcommand{\hsmaabstractde}{Verschiedene Algorithmen werden verwendet, um sequenzierte DNA zu speichern. Eine neue Entdeckung darüber, wie die Bausteine der DNA angeordnet sind, bietet die Möglichkeit, vorhandene Kompressionsmethoden zum Speichern von sequenzierter DNA zu verbessern.\\
+Diese Arbeit vergleicht vier weit verbreitete Kompressionsmethoden und analysiert die darin verwendeten Algorithmen. Aus den Ergebnissen lässt sich der Schluss ziehen, dass Verbesserungen in der Implementierung der arithmetischen Codierung möglich sind. Die abschließende Diskussion betrachtet mögliche Vorgehensweisen zur Verbesserung und welche Aufgaben diese mit sich bringen könnten.}
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Englisch
-\newcommand{\hsmaabstracten}{TBD.}
+\newcommand{\hsmaabstracten}{A variety of algorithms is used to compress sequenced DNA. New findings on the patterns in which the building blocks of DNA are distributed might provide a chance to improve long-used compression algorithms. The comparison of four widely used compression methods and an analysis of the algorithms they implement lead to the conclusion that improvements are feasible. The closing discussion provides insights into possible improvement approaches and the challenges they might involve.}

+ 1 - 0
latex/tex/kapitel/Abstract.tex

@@ -0,0 +1 @@
+A variety of algorithms is used to compress sequenced DNA. New findings on the patterns in which nucleotides are distributed in DNA might provide a chance to improve long-used compression algorithms. The comparison of four widely used compression methods and an analysis of the algorithms they implement lead to the conclusion that improvements are feasible. The closing discussion provides insights into possible improvement approaches and the challenges they might involve.

+ 63 - 0
latex/tex/kapitel/a5_feasability.tex

@@ -0,0 +1,63 @@
+\label{a5:cpu}
+\textbf{CPU specification. Due to redundancy, the information is limited to the last core, beginning at:} processor : 7\\
+\noindent
+\begin{lstlisting}[language=bash]
+	cat /proc/cpuinfo 
+\end{lstlisting}
+processor : 0\\
+...\\
+
+processor	: 7\\
+vendor\_id	: GenuineIntel\\
+cpu family	: 6\\
+model		: 58\\
+model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz\\
+stepping	: 9\\
+microcode	: 0x15\\
+cpu MHz		: 2412.891\\
+cache size	: 8192 KB\\
+physical id	: 0\\
+siblings	: 8\\
+core id		: 3\\
+cpu cores	: 4\\
+apicid		: 7\\
+initial apicid	: 7\\
+fpu		: yes\\
+fpu\_exception	: yes\\
+cpuid level	: 13\\
+wp		: yes\\
+flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant\_tsc arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4\_1 sse4\_2 x2apic popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm cpuid\_fault epb pti tpr\_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts\\
+vmx flags	: vnmi preemption\_timer invvpid ept\_x\_only flexpriority tsc\_offset vtpr mtf vapic ept vpid unrestricted\_guest\\
+bugs		: cpu\_meltdown spectre\_v1 spectre\_v2 spec\_store\_bypass l1tf mds swapgs itlb\_multihit srbds mmio\_unknown\\
+bogomips	: 6784.56\\
+clflush size	: 64\\
+cache\_alignment	: 64\\
+address sizes	: 36 bits physical, 48 bits virtual\\
+power management:\\
+
+\label{a5:pkg}
+\textbf{Manually installed packages:}\\
+autoconf\\
+automake\\
+bzip2\\
+cmake\\
+gcc\\
+git\\
+htop\\
+libbz2-dev\\
+libcurl4-gnutls-dev\\
+libhts-dev\\
+libhtscodecs2\\
+liblzma-dev\\
+libncurses5-dev\\
+libomp-dev\\
+libssl-dev\\
+zlib1g-dev\\
+openssh-client\\
+perl\\
+rsync\\
+screen\\
+sudo\\
+ufw\\
+vim\\
+wget\\

+ 75 - 49
latex/tex/kapitel/a6_results.tex

@@ -1,72 +1,98 @@
 \chapter{Erster Anhang: Lange Tabelle}
-\label{t:efficiency}
+\label{a6:testsets-time}
 
+\label{a6:set1time}
 \sffamily
 \begin{footnotesize}
   \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency]                       % Caption für das Tabellenverzeichnis
+    \caption[Compression Efficiency for First Test Set in Milliseconds]                       % Caption für das Tabellenverzeichnis
         {Compression duration meassured in milliseconds} % Caption für die Tabelle selbst
         \\
     \toprule
      \textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}} & \textbf{Samtools \acs{CRAM}} \\
     \midrule
-     File 1 & 235005& 3786& 16926\\
-     File 2 & 246503& 3784& 17043\\
-     File 3 & 20169& 3123& 13999\\
-     File 4 & 194081& 3011& 13445\\
-     File 5 & 183878& 2862& 12802\\
-     File 6 & 173646& 2685& 12015\\
-     File 7 & 159999& 2503& 11198\\
-     File 8 & 148288& 2286& 10244\\
-     File 9 & 12304& 2078& 9210\\
-     File 10 & 134937& 2127& 9461\\
-     File 11 & 136299& 2132& 9508\\
-     File 12 & 134932& 2115& 9456\\
-     File 13 & 999022& 1695& 7533\\
-     File 14 & 924753& 1592& 7011\\
-     File 15 & 852555& 1507& 6598\\
-     File 16 & 827651& 1390& 6089\\
-     File 17 & 820814& 1306& 5791\\
-     File 18 & 798429& 1277& 5603\\
-     File 19 & 586058& 960& 4106\\
-     File 20 & 645884& 1026& 4507\\
-     File 21 & 411984& 721& 3096\\
+     File 1.1 & 235005& 3786& 16926\\
+     File 1.2 & 246503& 3784& 17043\\
+     File 1.3 & 20169& 3123& 13999\\
+     File 1.4 & 194081& 3011& 13445\\
+     File 1.5 & 183878& 2862& 12802\\
+     File 1.6 & 173646& 2685& 12015\\
+     File 1.7 & 159999& 2503& 11198\\
+     File 1.8 & 148288& 2286& 10244\\
+     File 1.9 & 12304& 2078& 9210\\
+     File 1.10 & 134937& 2127& 9461\\
+     File 1.11 & 136299& 2132& 9508\\
+     File 1.12 & 134932& 2115& 9456\\
+     File 1.13 & 999022& 1695& 7533\\
+     File 1.14 & 924753& 1592& 7011\\
+     File 1.15 & 852555& 1507& 6598\\
+     File 1.16 & 827651& 1390& 6089\\
+     File 1.17 & 820814& 1306& 5791\\
+     File 1.18 & 798429& 1277& 5603\\
+     File 1.19 & 586058& 960& 4106\\
+     File 1.20 & 645884& 1026& 4507\\
+     File 1.21 & 411984& 721& 3096\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}
 \rmfamily
 
-\label{t:effectivity}
+\label{a6:testsets-size}
+\label{a6:set1size}
 \sffamily
 \begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity]                       % Caption für das Tabellenverzeichnis
-        {File sizes in different compression formats} % Caption für die Tabelle selbst
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectivity for First Test Set in Byte]                       % Caption für das Tabellenverzeichnis
+        {File Sizes for Different Formats in Byte} % Caption für die Tabelle selbst
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{Uncompressed Source File} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
+    \midrule
+			%src, geco, bam and cram in byte
+			File 1.1& 253105752& 46364770& 62048289& 55769827\\
+			File 1.2& 246230144& 49938168& 65391181& 58026123\\
+			File 1.3& 201600541& 41117340& 53586949& 47707954\\
+			File 1.4& 193384854& 39248276& 51457814& 45564837\\
+			File 1.5& 184563953& 37133480& 48838053& 43655371\\
+			File 1.6& 173652802& 35355184& 46216304& 40980906\\
+			File 1.7& 162001796& 31813760& 42371043& 38417108\\
+			File 1.8& 147557670& 30104816& 39107538& 34926945\\
+			File 1.9& 140701352& 23932541& 32708272& 29459829\\
+			File 1.10& 136027438& 27411806& 35855955& 32238052\\
+			File 1.11& 137338124& 27408185& 35894133& 32529673\\
+			File 1.12& 135496623& 27231126& 35580843& 32166751\\
+			File 1.13& 116270459& 20696778& 26467775& 23568321\\
+			File 1.14& 108827838& 18676723& 24284901& 21887811\\
+			File 1.15& 103691101& 16804782& 22486646& 20493276\\
+			File 1.16& 91844042& 16005173& 21568790& 19895937\\
+			File 1.17& 84645123& 15877526& 21294270& 20177456\\
+			File 1.18& 81712897& 16344067& 20684650& 19310998\\
+			File 1.19& 59594634& 10488207& 14616042& 14251243\\
+			File 1.20& 65518294& 13074402& 16769658& 15510100\\
+			File 1.21& 47488540& 7900773& 10477999& 9708258\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+\label{a6:set2size}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectivity for Second Test Set in Byte]                       % Caption für das Tabellenverzeichnis
+        {File Sizes for Different Formats in Byte} % Caption für die Tabelle selbst
         \\
     \toprule
-     \textbf{ID.} & \textbf{Source File} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{CRAM}} \\
+     \textbf{ID.} & \textbf{Uncompressed Source File}& \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}} & \textbf{Samtools \acs{CRAM}} \\
     \midrule
-     File 1& 253105752& 46364770& 55769827\\
-     File 2& 136027438& 27411806& 32238052\\
-     File 3& 137338124& 27408185& 32529673\\
-     File 4& 135496623& 27231126& 32166751\\
-     File 5& 116270459& 20696778& 23568321\\
-     File 6& 108827838& 18676723& 21887811\\
-     File 7& 103691101& 16804782& 20493276\\
-     File 8& 91844042& 16005173& 19895937\\
-     File 9& 84645123& 15877526& 20177456\\
-     File 10& 81712897& 16344067& 19310998\\
-     File 11& 59594634& 10488207& 14251243\\
-     File 12& 246230144& 49938168& 58026123\\
-     File 13& 65518294& 13074402& 15510100\\
-     File 14& 47488540& 7900773& 9708258\\
-     File 15& 51665500& 41117340& 47707954\\
-     File 16& 201600541& 39248276& 45564837\\
-     File 17& 193384854& 37133480& 43655371\\
-     File 18& 184563953& 35355184& 40980906\\
-     File 19& 173652802& 31813760& 38417108\\
-     File 20& 162001796& 30104816& 34926945\\
-     File 21& 147557670& 23932541& 29459829\\
+			%src, geco, bam and cram in byte
+			File 2.1& 1246731616& 12414797& 78260121& 67130756\\
+			File 2.2& 1261766002& 12363734& 80895953& 69649632\\
+			File 2.3& 657946854& 7966180& 53201724& 47175349\\
+			File 2.4& 708837816& 8499132& 54569686& 48521201\\
+			File 2.5& 1118234394& 12088239& 84764250& 75118457\\
+			File 2.6& 1123124224& 12265535& 88147227& 77826446\\
+			File 2.7& 1300825946& 12450651& 75860986& 60239362\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}

+ 4 - 0
latex/tex/kapitel/abkuerzungen.tex

@@ -16,11 +16,15 @@
   \acro{FASTq}{File Format Based on FASTA}
   \acro{FTP}{File Transfere Protocol}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
+	\acro{GB}{Gigabyte}
   \acro{GeCo}{Genome Compressor}
+	\acro{GPL}{GNU General Public License}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
   \acro{LZ77}{Lempel Ziv 1977}
   \acro{LZ78}{Lempel Ziv 1978}
+	\acro{MIT}{Massachusetts Institute of Technology}
   \acro{RAM}{Random Access Memory} 
   \acro{SAM}{Sequence Alignment Map}
+	\acro{UDP}{User Datagram Protocol}
 	\acro{UTF}{Unicode Transformation Format}
 \end{acronym}

+ 7 - 6
latex/tex/kapitel/k1_introduction.tex

@@ -1,19 +1,20 @@
 \chapter{Introduction}
 % general information and intro
-Understanding how things in our cosmos work, was and still is a pleasure, that the human being always wants to fulfill. Getting insights into the rawest form of organic life is possible through storing and studying information, embedded in genetic codes. Since live is complex, there is a lot of information, which requires a lot of memory.\\
+Understanding how things in our cosmos work was and still is a desire that human beings always want to fulfill. Getting insights into the rawest form of organic life is possible by storing and studying the information embedded in genetic codes \cite{dna_structure}. Since life is complex, there is a lot of information, which requires a lot of memory \cite{alok17, survey}.\\
 % ...Communication with other researches means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to erorrs.\\
 % compression values and goals
-With compression tools, the problem of storing information got restricted. Compressed data requires less space and therefore less time to be transported over networks. This advantage is scalable and since genetic information needs a lot of storage, even in a compressed state, improvements are welcomed. Since this field is, compared to others, like computer theory and compression approaches, relatively new, there is much to discover and new findings are not unusual. From some of this findings, new tools can be developed. They optimally increase two factors: the speed at which data is compressed and the compresseion ratio, meaning the difference between uncompressed and compressed data.\\
+With compression algorithms and their implementation in tools, the problem of storing information became smaller. Compressed data requires less space and therefore less time to be transported over networks \cite{Shannon_1948}. This advantage is scalable, and since genetic information needs a lot of storage, even in a compressed state, improvements are welcome \cite{moffat_arith}. Since this field is relatively new compared to others, like the computer theory that created the foundation for compression algorithms, there is much to discover and new findings are not unusual \cite{Shannon_1948}. From some of these findings, new tools can be developed. In general they focus on increasing at least one of two factors: the speed at which data is compressed and the compression ratio, meaning the difference in size between uncompressed and compressed data \cite{moffat_arith, alok17, Shannon_1948}.\\
 % ...
 % more exact explanation
 
 % actual focus in short and simple terms
-New discoveries in the universal rules of stochastical organization of genomes might provide a base for new algorithms and therefore new tools or an improvement of existing ones for genome compression. The aim of this work is to analyze the current state of the art for probabilistic compression tools and their algorithms, and ultimately determine whether mentioned discoveries are already used. \texttt{might be thrown out due to time limitations} -> If this is not the case, there will be an analysation of how and where this new approach could be imprelented and if it would improve compression methods.\\
+New discoveries in the universal rules of the stochastic organization of genomes might provide a base for new algorithms and therefore for new tools, or for an improvement of existing ones for genome compression \cite{pet21}. The aim of this work is to analyze the current state of the art of compression tools for biological data and the probabilistic algorithms they implement. Further, this work will determine whether there is room for improvement.\\
+The discussion will include a high-level analysis of how and where this new approach could be implemented and which problems would possibly need to be taken care of in the process.\\
 
 % focus and structure of work in greater detail 
-To reach a common ground, the first pages will give the reader a quick overview on the structure of human DNA. There will also be an superficial explanation for some basic terms, used in biology and computer science. The knowledge basis of this work is formed by describing differences in file formats, used to store genome data. In addition to this, a section relevant for compression will follow. This will go through the state of the art in coding theory.\\
-In order to meassure an improvement, first a baseline must be set. Therefore the efficiency and effecitity of suiteable state of the art tools will be meassured. To be as precise as possible, the main part of this work focuses on setting up an environment, picking input data, installing and executing tools and finaly meassuring and documenting the results.\\
-With this information, a static code analysis of mentioned tools follows. This will show where a possible new algorithm or an improvement to an existing one could be implemented. Running a proof of concept implementation under the same conditions and comparing runtime and compression ratio to the defined baseline shows the potential of the new approach for compression with probability algorithms.
+To reach a common ground, the first pages will give the reader a quick overview of the structure of human DNA. There will also be a fundamental explanation of some basic terms used in biology and computer science. The first step into the theory of genome compression will be taken by describing the differences between common file formats used to store genome information. From there, a section relevant for understanding compression will follow. It will analyze the differences between compression approaches, go over some history of coding theory and lead to a deeper look into the fundamentals of state of the art compression algorithms. The chapter will end with a few pages about implementations of compression algorithms in relevant tools.\\
+In order to measure an improvement, a baseline must be set. Therefore the efficiency and effectivity of suitable state of the art tools will be measured. To be as precise as possible, the middle part of this work focuses on setting up an environment, picking input data, installing and executing tools and finally measuring and documenting the results.\\
+These results, compared with the understanding of how the tools work, will show whether there is a need for improvement and on which factor it should focus. The end of this work will be used to discuss the properties of a possible improvement, how its feasibility could be determined and which problems such a project would need to overcome.\\
 
 % todo: 
 %   explain: coding 

+ 9 - 9
latex/tex/kapitel/k2_dna_structure.tex

@@ -28,10 +28,9 @@ To strengthen the understanding of how and where biological information is store
   \label{k2:gene-overview}
 \end{figure}
 
-All living organisms, like plants and animals, are made of cells (a human body can consist out of several trillion cells) \cite{cells}.\\
-A cell in itself is a living organism; The smallest one possible. It consists out of two layers from which the inner one is called nucleus. The nucleus contains chromosomes and those chromosomes hold the genetic information in form of \ac{DNA}. 
- 
-\acs{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggests, of two single helix. 
+All living organisms, like plants and animals, are made of cells. To get a rough impression, a human body can consist of several trillion cells.
+A cell in itself is the smallest living organism. Most cells consist of an outer section and a core called the nucleus. In \ref{k2:gene-overview} the nucleus is illustrated as a purple, circle-like shape inside a lighter circle. The nucleus contains chromosomes. Those chromosomes contain the genetic information about their organism in form of \ac{DNA} \cite{cells}.\\
+\acs{DNA} is often seen in the form of a double helix, as shown in \ref{k2:dna-struct}. A double helix consists, as the name suggests, of two single helices \cite{dna_structure}.
 
 \begin{figure}[ht]
   \centering
@@ -40,9 +39,10 @@ A cell in itself is a living organism; The smallest one possible. It consists ou
   \label{k2:dna-struct}
 \end{figure}
 
-Each of them consists of two main components: the Sugar Phosphate backbone, which is not relevant for this work and the Bases. The arrangement of Bases represents the Information stored in the \acs{DNA}. A base is an organic molecule, they are called Nucleotides \cite{dna_structure}. \\
-% describe Genomes?
-
-For this work, nucleotides are the most important parts of the \acs{DNA}. A Nucleotide can occur in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them got a Counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine.\\
-From the perspective of an computer scientist: The content of one helix must be stored, to persist the full information. In more practical terms: The nucleotides of only one (entire) helix needs to be stored physically, to save the information of the whole \acs{DNA} because the other half can be determined by ``inverting'' the stored one. An example would show the counterpart for e.g.: \texttt{adenine, guanine, adenine} chain which would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initiat. So the example would change to \texttt{AGA} in one Helix, \texttt{TCT} in the other.\\
+Each of them consists of two main components: the sugar phosphate backbone, which is not relevant for this work, and the bases. The sugar phosphate backbones are illustrated as flat stripes circulating around the horizontal line in \ref{k2:dna-struct}. Pairs of bases are symbolized as vertical bars between the sugar phosphates.
+The arrangement of bases represents the information stored in the \acs{DNA}. What is described here as a base is an organic molecule, which is also called a nucleotide \cite{dna_structure}.\\
+For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can occur in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them has a counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine.\\
+From the perspective of a computer scientist: the content of one helix must be stored to persist the full information. In more practical terms: only the nucleotides of one (entire) helix need to be stored physically to save the information of the whole \acs{DNA}. The other half can be determined by ``inverting'' the stored one.
+% todo OPT -> figure?
+An example: the counterpart of an \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. So the example would change to \texttt{AGA} in one helix and \texttt{TCT} in the other.\\
 This representation ist commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required but for now 'A', 'C', 'G' and 'T' should be the only concern.
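+
+To make the ``inverting'' step above concrete, the following minimal Python sketch (purely illustrative, not taken from any of the discussed tools) derives the complementary strand from the single stored one:
+\begin{lstlisting}[language=Python]
+# Illustrative sketch: derive the complementary strand from
+# the single stored helix (A <-> T, C <-> G).
+COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
+
+def complement(strand: str) -> str:
+    return "".join(COMPLEMENT[base] for base in strand)
+
+print(complement("AGA"))  # prints TCT
+\end{lstlisting}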

+ 59 - 40
latex/tex/kapitel/k3_datatypes.tex

@@ -25,83 +25,102 @@
 \section{File Formats used to Store DNA}
 \label{chap:file formats}
 As described in previous chapters \ac{DNA} can be represented by a string with the buildingblocks A,T,G and C. Using a common file format for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store each single symbol.\\
-The \ac{ASCII} table is a character set, registered in 1975 and to this day still in use to encode texts digitally. For the purpose of communication bigger character sets replaced \acs{ASCII}. It is still used in situations where storage is short.\\
+The \ac{ASCII} \cite{iso-ascii} table is a character set, registered in 1975 and to this day still in use to encode text digitally. For the purpose of communication, larger character sets have replaced \acs{ASCII}. It is still used in situations where storage is short.\\
 % grund dass ASCII abgelöst wurde -> zu wenig darstellungsmöglichkeiten. Pro heute -> weniger overhead pro character
-Storing a single \textit{A} with \acs{ASCII} encoding, requires 8 bit (\,excluding magic bytes and the bytes used to mark \ac{EOF})\ . Since there are at least $2^8$ or 128 displayable symbols. The buildingblocks of \acs{DNA} require a minimum of four letters, so two bits are needed
+The building blocks of \acs{DNA} require a minimum of four letters, so at least two bits are needed per symbol. Storing a single \textit{A} with \acs{ASCII} encoding requires 8 bit (excluding magic bytes and the bytes used to mark the \ac{EOF}). Since \acs{ASCII} encoding provides $2^7$ or 128 code points, this leaves a large overhead of unused combinations.\\
 % cout out examples. Might be needed later or elsewhere
 % \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. 
-In most tools, more than four symbols are used. This is due to the complexity in sequencing \acs{DNA}. It is not 100\% preceice, so additional symbols are used to mark nucelotides that could not or could only partly get determined. Further a so called quality score is used to indicate the certainty, for each single nucleotide, that it got sequenced correctly.\\
+In most tools, more than four symbols are used. This is due to the complexity of sequencing \acs{DNA}. It is not 100\% precise, so additional symbols are used to mark nucleotides that could not or could only partly be determined. Furthermore, a so-called quality score is used to indicate, for each single nucleotide, the certainty that it was sequenced correctly \cite{survey, Cock_2009}.\\
 More common everyday-usage text encodings like unicode require 16 bits per letter. So settling with \acs{ASCII} has improvement capabilities but is, on the other side, more efficient than using bulkier alternatives like unicode.\\
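+
+To illustrate the overhead argument, the following Python snippet (an illustrative sketch only, not code from any of the compared tools) packs a nucleotide string into two bit per symbol and compares the size with its 8-bit \acs{ASCII} representation:
+\begin{lstlisting}[language=Python]
+# Illustrative sketch: pack A/C/G/T into 2 bit per symbol
+# instead of the 8 bit used by one ASCII character.
+CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
+
+def pack(seq: str) -> bytes:
+    bits = 0
+    for base in seq:
+        bits = (bits << 2) | CODE[base]
+    return bits.to_bytes((2 * len(seq) + 7) // 8, "big")
+
+print(len("ACGTACGT".encode("ascii")))  # 8 bytes as ASCII
+print(len(pack("ACGTACGT")))            # 2 bytes with 2-bit packing
+\end{lstlisting}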
 
 % differences between information that is store
 Formats for storing uncompressed genomic data, can be sorted into several categories. Three noticable ones would be \cite{survey}:
 \begin{itemize}
-	\item sequenced reads
-	\item aligned data
-	\item sequence variation
+	\item Sequenced reads
+	\item Aligned data
+	\item Sequence variation
 \end{itemize}
-The categories are listed on their complexity, considering their usecase and data structure, in ascending order. Starting with sequence variation, also called haplotype describes formats storing graph based structures that focus on analysing variations in different genomes \acs{haplo, survey}. 
-Sequenced reads focus on storing continous protein chains from a sequenced genome \acs{survey}.
-Aligned data is somwhat simliar to sequenced reads with the difference that instead of a whole chain of genomes, overlapping subsequences are stored. This could be described as a rawer form of sequenced reads. This way aligned data stores additional information on how certain a specific part of a genome is read correctly.
-The focus of this work lays on compression of sequenced data but not on the likelyhood of how accurate the data might be. Therefore, only formats that include sequenced reads will be worked with.\\
+The categories are listed by their complexity, considering their use case and data structure, in ascending order. Starting with sequence variation, also called haplotype data: it describes formats storing graph based structures that focus on analysing variations between different genomes \cite{haplo, sam12}.
+Sequenced reads focus on storing continuous nucleotide chains from a sequenced genome \cite{survey}.
+Aligned data is somewhat similar to sequenced reads, with the difference that instead of one whole chain per genome, overlapping subsequences are stored. This could be described as a rawer form of sequenced reads. This way, aligned data stores additional information on how certain it is that a specific part of a genome was read correctly \cite{survey, sam12}.
+The focus of this work lies on the compression of sequenced data, but not on the likelihood of how accurate the data might be. Therefore, only formats that are able to store sequenced reads will be worked with. Note that some aligned data formats are also able to store sequenced reads, since the latter is just a less informative representation of the former \cite{survey, sam12}.\\
 
-% my criteria
+% exclusion criteria
 Several people and groups have developed different file formats to store genomes. Unfortunaly, the only standard for storing genomic data is fairly new \cite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill following criteria:\\
 \begin{itemize}
-  \item{the format has reputation}
+  \item{The format has a reputation. This can be indicated through:}
 	\begin{itemize}
-		\item through a scientific paper, that prooved its superiority to other relevant tools. 
-		\item through a broad ussage of the format determined by its use on ftp servery that focus on supporting scientific research.
+		\item{A scientific paper that proved its superiority over other relevant tools.}
+		\item{A broad usage of the format, determined by its use on \acs{FTP} servers which focus on supporting scientific research.}
 	\end{itemize}
-  \item{the format should not specialize on only one type of \acs{DNA}.}
-  \item{the format stores nucleotide seuqences and does not neccesarily include \ac{IUPAC} codes besides A, C, G and T \cite{iupac}.}
-  \item{the format is open source. Otherwise, improvements can not be tested, without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
+  \item{The format should not specialize in only one type of \acs{DNA} or target a specific technology.}
+  \item{The format stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \cite{iupac}.}
+  \item{The format is open source. Otherwise, improvements cannot be tested without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
 \end{itemize}
 
-Information on available formats where gathered through various Internet platforms \cite{ensembl, ucsc, ga4gh}. 
-Some common file formats found:
+Information on available formats was gathered through various Internet platforms \cite{ensembl, ucsc, ga4gh} and scientific papers \cite{survey, sam12, Cock_2009}.
+Some common file formats found:\\
+
 \begin{itemize}
 % which is relevant? 
-  \item{\ac{FASTA}} 
-  \item{\ac{FASTq}} 
-  \item{\ac{SAM}/\ac{BAM}} \cite{bam, sam12}
-  \item{\ac{CRAM}} \cite{bam, sam12}
-  \item{twoBit} \cite{twobit}
-  %\item{VCF} genotype format -> anaylses differences of two seq
+  \item{\ac{FASTA}/Multi-\acs{FASTA}}
+  \item{\ac{FASTq} \cite{Cock_2009}} 
+  \item{\ac{SAM}/\ac{BAM}} \cite{sam12, bam}
+  \item{\ac{CRAM}} \cite{sam12, bam}
+  %\item{twoBit} \cite{twobit}
 \end{itemize}
 
 % groups: sequence data, alignment data, haplotypic
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
-Since methods to store this kind of Data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of a inoficial standard for sequenced reads \cite{survey, geco, vertical, cram-origin}. \\
-Considering the first criteria, by searching through anonymously accesable \acs{ftp} servers, only two formats are used commonly: FASTA or its extension \acs{FASTq} and the \acs{BAM} Format \acs{ftp-igsr, ftp-ncbi, ftp-ensembl}.
-
+Since methods to store this kind of data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of an unofficial standard for sequenced reads \cite{survey, geco, sam12, vertical, cram-origin}.\\
+Considering the first criterion, by searching through anonymously accessible \acs{FTP} servers, only two formats are found to be used commonly: \acs{FASTA} or its extension \acs{FASTq} and the \acs{BAM} format \cite{ftp-igsr, ftp-ncbi, ftp-ensembl}.
+% todo Explain twobit: The last format, called twoBit is also included, because it is 
 
 \subsection{\acs{FASTA} and \acs{FASTq}}
-The rather simple \acs{FASTA} format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called header, contains a comment section starting with \texttt{>} followed by a custom text \cite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself like its size in bytes.\\
-The other section contains the sequenced genome whereas each nucleotide is represented by character \texttt{A, C, G or T}. There are three more nucleotide characters that store additional information and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G, and T} will be paid attention to.\\
-The second section can take multiple lines and is determined by a empty line. After that the file end is reached or another touple of header and sequence can be found.\\
+The rather simple \acs{FASTA} format is widely used when it comes to storing sequenced reads without a quality score \cite{sam12, survey}. Since it is an uncompressed format, \acs{FASTA} files are often transmitted compressed with an external tool like gzip \cite{ftp-ensembl, ftp-ncbi}.\\
+
+\begin{figure}[h]
+  \centering
+  \includegraphics[width=14cm]{k3/fasta-structure.png}
+  \caption{Edited example of a \acs{FASTA} file. The original was received from the Ensembl server \cite{ftp-ensembl}.}
+  \label{k3:fasta-struct}
+\end{figure}
+
+The format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called header, contains a comment section starting with \texttt{>} followed by custom text \cite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself, like its size in bytes.\\
+The other section contains the sequenced genome, where each nucleotide is represented by the character \texttt{A, C, G or T}. There are more nucleotide characters that store additional information and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G, and T} will be paid attention to \cite{iupac}.\\
+The second section can have multiple lines of sequences. A similar format is the Multi-\acs{FASTA} file format, which consists of concatenated \acs{FASTA} files \cite{survey}.\\
 % fastq
-In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, wherby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formated in this order \cite{illufastq}:
+In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, whereby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formatted in this order \cite{Cock_2009}:
 \begin{itemize}
 	\item Line 1: Sequence identifier aka. Title, starting with an @ and an optional description.
 	\item Line 2: The seuqence consisting of nucleoids, symbolized by A, T, G and C.
-	\item Line 3: A '+' that functions as a seperator. Optionally followed by content of Line 1.
+	\item Line 3: A '+' that functions as a separator or delimiter. Optionally followed by the content of Line 1.
 	\item Line 4: quality line(s). consisting of letters and special characters in the \acs{ASCII} scope.
 \end{itemize}
-The quality scores have no fixed format. To name a few, there is the sanger format, the solexa format introduced by Solexa Inc., the Illumina and the QUAL format which is generated by the PHRED software \cite{Cock_2009}.\\
-The quality value shows the estimated probability of error in the sequencing process.
+The quality scores have no fixed format. To name a few, there are the Sanger format, the Solexa format introduced by Solexa Inc., the Illumina format and the QUAL format which is generated by the PHRED software.\\
+The quality value shows the estimated probability of error in the sequencing process \cite{Cock_2009}.\\
+
+\begin{figure}[h]
+  \centering
+  \includegraphics[width=13cm]{k3/fastq-structure.png}
+  \caption{Altered example of a \acs{FASTq} file. The original was received from the NCBI server \cite{ftp-ncbi}.}
+  \label{k3:fastq-struct}
+\end{figure}
+
+In \ref{k3:fastq-struct} the described structure is illustrated. The sequence and the delimiter section were altered to illustrate the structure of this format better.
+In the header section, \texttt{SRR002906.1} is the sequence identifier and the following text is a description. In the delimiter line, the header section without the leading @ could be written again. The last line shows the header of the second sequence.\\
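+
+To make the four-line structure tangible, the following Python sketch reads such records. It is purely illustrative: it assumes well-formed input with exactly one line per section, and the file name \texttt{reads.fastq} is only a placeholder.
+\begin{lstlisting}[language=Python]
+# Illustrative sketch: iterate over the four-line records of a
+# well-formed FASTq file (one line per section assumed).
+def read_fastq(path):
+    with open(path) as handle:
+        while True:
+            header = handle.readline().rstrip()
+            if not header:                          # end of file
+                break
+            sequence = handle.readline().rstrip()
+            delimiter = handle.readline().rstrip()  # the '+' line
+            quality = handle.readline().rstrip()
+            yield header.lstrip("@"), sequence, quality
+
+# Example usage (with a hypothetical file name):
+# for name, seq, qual in read_fastq("reads.fastq"):
+#     print(name, len(seq))
+\end{lstlisting}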
 
 \label{k2:sam}
 \subsection{Sequence Alignment Map}
 % src https://github.com/samtools/samtools
-\acs{SAM} often seen in its compressed, binary representation \acs{BAM} with the fileextension \texttt{.bam}, is part of the SAMtools package, a uitlity tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by TABs \cite{bam}. It uses 7-bit US-ASCII, to be precise Charset ANSI X3.4-1968 \cite{rfcansi}. The structure is more complex than the one in \acs{FASTq} and described best, accompanied by an example:
+\acs{SAM}, often seen in its compressed, binary representation \acs{BAM} with the file extension \texttt{.bam}, is part of the SAMtools package, a utility tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by the whitespace character called tabulation, or \textbf{tab} for short \cite{sam12}. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 \cite{rfcansi}. The structure is more complex than the one of \acs{FASTq} and is best described accompanied by an example:
 
-\begin{figure}[ht]
+\begin{figure}[h]
   \centering
-  \includegraphics[width=15cm]{k3/sam-structure.png}
-  \caption{SAM/BAM file structure example}
+  \includegraphics[width=14cm]{k3/sam-structure.png}
+  \caption{\acs{SAM} file structure example \cite{bam}.}
   \label{k2:bam-struct}
 \end{figure}
 
-Compared to \acs{FASTA} \acs{SAM} and further compression forms, store more information. As displayed in \ref{k_2:bam-struct} this is done by adding, identifier for Reads e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots e.g. \textbf{ATAGCT......} in the split alignment +r004 \cite{bam}. A full description of the information stored in \acs{SAM} files would be of little value to this work, therefore further information on is left out but can be found in \cite{sam12} or at \cite{bam}.
+Compared to \acs{FASTA}, \acs{SAM} and its further compressed forms store more information. As displayed in \ref{k2:bam-struct}, this is done by adding identifiers for reads, e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots, e.g. \textbf{ATAGCT......} in the split alignment +r004 \cite{survey}. A full description of the information stored in \acs{SAM} files would be of little value to this work, therefore further information is left out but can be found in \cite{sam12} or at \cite{bam}.\\
 Samtools provide the feature to convert a \acs{FASTA} file into \acs{SAM} format. Since there is no way to calulate mentioned, additional information from the information stored in \acs{FASTA}, the converted files only store two lines. The first one stores metadata about the file and the second stores the nucleotide sequence in just one line.

+ 141 - 56
latex/tex/kapitel/k4_algorithms.tex

@@ -24,41 +24,52 @@
 % 3.2.1 raus
 
 \section{Compression aproaches}
-The process of compressing data serves the goal to generate an output that is smaller than its input data.\\
-In many cases, like in gene compressing, the compression is ideally lossless. This means it is possible for every compressed data, to receive the whole information, which were available in the origin data, by decompressing it.\\
+The process of compressing data serves the goal of generating an output that is smaller than its input \cite{dict}.\\
+In many cases, like in gene compression, the compression is ideally lossless. This means it is possible, for any compressed data, to recover the full information that was available in the original data by decompressing it. Lossy compression, on the other hand, may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to transmit the original information. This works with certain audio and picture files or with network protocols like \ac{UDP} which are used to transmit live video/audio streams \cite{rfc-udp, cnet13}.\\
+For storing \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its exact position is needed for the sequence to be complete and useful.\\
 Before going on, the difference between information and data should be emphasized.\\
 % excurs data vs information
-Data contains information. In digital data  clear, physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bits can store up to $2^11$ combinations of bits and a 1 Gigabyte drive can store no more than 1 Gigabyte data. Information on the other hand, is limited by the way how it is stored. In some cases the knowledge received in a earlier point in time must be considered too, but this can be neglected for reasons described in the subsection \ref{k4:dict}.\\
+Data contains information. In digital data, clear physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bits can store up to $2^{11}$ combinations and a 1 \acs{GB} drive can store no more than 1 \acs{GB} of data. Information, on the other hand, is limited by the way it is stored. What exactly defines information depends on multiple factors: the context in which information is transmitted and the source and destination of the information. This can be in the form of a signal transferred from one entity to another, or information that is persisted so it can be obtained at a later point in time.\\
 % excurs information vs data
-The boundaries of information, when it comes to storing capabilities, can be illustrated by using the example mentioned above. A drive with the capacity of 1 Gigabyte could contain a book in form of images, where the content of each page is stored in a single image. Another, more resourceful way would be storing just the text of every page in \acs{UTF-16}. The information, the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without loosing any information.\\
+For the scope of this work, information will be seen as the type and position of nucleotides sequenced from \acs{DNA}. To be even more precise, it is a chain of characters from an alphabet of \texttt{A, C, G and T}, since this is the \textit{de facto} standard for the digital persistence of \acs{DNA} \cite{isompeg}.
+The boundaries of information, when it comes to storage capabilities, can be illustrated by using the example mentioned above. A drive with a capacity of 1 \acs{GB} could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be to store just the text of every page in \acs{UTF}-16 \cite{isoutf}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
 % excurs end
-In contrast to lossless compression, lossy compression might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typically not necessary to persist the origin information. This works with certain audio and pictures formats, and in network protocols \cite{cnet13}.
 For \acs{DNA} a lossless compression is needed. To be precise a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two mayor approaches are known: the dictionary coding and the entropy coding. Methods from both fields, that aquired reputation, are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
 
 \subsection{Dictionary coding}
-\textbf{Disclaimer}
-Unfortunally, known implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore not in the main scope for this work. To strenghten the understanding of compression algortihms this section will remain. Also a hybrid implementation described later will use both dictionary and entropy coding.\\
 
 \label{k4:dict}
 Dictionary coding, as the name suggest, uses a dictionary to eliminate redundand occurences of strings. Strings are a chain of characters representing a full word or just a part of it. For a better understanding this should be illustrated by a short example:
 % demo substrings
 Looking at the string 'stationary' it might be smart to store 'station' and 'ary' as seperate dictionary enties. Which way is more efficient depents on the text that should get compressed. 
 % end demo
-The dictionary should only store strings that occour in the input data. Also storing a dictionary in addition to the (compressed) input data, would be a waste of resources. Therefore the dicitonary is made out of the input data. Each first occourence is left uncompressed. Every occurence of a string, after the first one, points to its first occurence. Since this 'pointer' needs less space than the string it points to, a decrease in the size is created.\\
+The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is part of the text. Each first occurrence is left uncompressed. Each occurrence of a string after the first one points either to its first occurrence or to the last replacement of its occurrence.\\
+\ref{k4:dict-fig} illustrates how this process is executed. The bar on top of the figure, which extends over the full width, symbolizes any text. The squares inside the text are repeating occurrences of text segments.
+In the dictionary coding process, the square annotated as \texttt{first occ.} is added to the dictionary. \texttt{second} and \texttt{third occ.} get replaced by a structure \texttt{<pos, len>} consisting of a pointer to the position of the first occurrence, \texttt{pos}, and the length of that occurrence, \texttt{len}.
+The bar at the bottom of the figure shows how the compressed text for this example would be structured. The dotted lines would only consist of two bytes, storing position and length, pointing to \texttt{first occ.}. Decompressing this text only requires parsing the text from left to right and replacing every \texttt{<pos, len>} with the already parsed word from the dictionary. This means jumping back to the position stored in the replacement, reading for as long as the length dictates, copying the read section, jumping forward again and pasting the section. A toy example of this encode and decode cycle is sketched after the following figure.\\
+% offsets are volatile when replacing
 
-% unuseable due to the lack of probability
-% - known algo
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=15cm]{k4/dict-coding.png}
+  \caption{Schematic sketch illustrating the replacement of multiple occurrences done in dictionary coding.}
+  \label{k4:dict-fig}
+\end{figure}
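+
+To complement the figure, a toy Python sketch of this encode and decode cycle follows. It is a simplified illustration of the \texttt{<pos, len>} idea only, with none of the window limits or bit-level packing of real implementations:
+\begin{lstlisting}[language=Python]
+# Toy sketch of dictionary coding with <pos, len> replacements.
+# Literals are kept as-is; repetitions of at least min_len
+# symbols are replaced by a (position, length) pair.
+def encode(text, min_len=3):
+    tokens, i = [], 0
+    while i < len(text):
+        best_pos, best_len = 0, 0
+        for pos in range(i):
+            length = 0
+            while (i + length < len(text)
+                   and pos + length < i
+                   and text[pos + length] == text[i + length]):
+                length += 1
+            if length > best_len:
+                best_pos, best_len = pos, length
+        if best_len >= min_len:
+            tokens.append((best_pos, best_len))
+            i += best_len
+        else:
+            tokens.append(text[i])
+            i += 1
+    return tokens
+
+def decode(tokens):
+    text = ""
+    for token in tokens:
+        if isinstance(token, tuple):     # <pos, len> replacement
+            pos, length = token
+            text += text[pos:pos + length]
+        else:                            # literal symbol
+            text += token
+    return text
+
+tokens = encode("ACGTACGTACGT")
+print(tokens)          # ['A', 'C', 'G', 'T', (0, 4), (0, 4)]
+print(decode(tokens))  # ACGTACGTACGT
+\end{lstlisting}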
+
+\label{k4:lz}
 \subsubsection{The LZ Family}
-The computer scientist Abraham Lempel and the electrical engineere Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in its name, like \texttt{LZ77 and LZ78} which are short for Lempel Ziv 1977 and 1978. The number at the end indictates when the algorithm was published. Today LZ78 is widely used in unix compression solutions like gzip and bz2. Those tools are also used in compressing \ac{DNA}.\\
-\acs{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.\\
-% example 
+The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77 and LZ78}, which are short for Lempel Ziv 1977 and 1978 \cite{lz77}. The number at the end indicates when the algorithm was published. Today LZ-based methods are widely used in Unix compression solutions like gzip. Tools like gzip and bz2 are also used in compressing \ac{DNA}.\\
+
+\acs{LZ77} basically works by removing all repetitions of a string or substring
+and replacing them with information on where to find the first occurrence and how long it is. Lempel and Ziv restricted the pointer to a limited integer range. Today a pointer, length pair is typically stored in two bytes: one bit is reserved to indicate that the next 15 bit are a position, length pair, more than 8 of those bit are available to store the pointer and the rest is reserved for storing the length. The exact amounts depend on the implementation \cite{rfc1951, lz77}. A sketch of one possible token layout follows below.
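+
+As a sketch of one possible two-byte layout (an assumed split chosen for illustration only; real implementations define their own layouts), a flag bit, a 12 bit position and a 3 bit length could be packed like this:
+\begin{lstlisting}[language=Python]
+# Hypothetical token layout: 1 flag bit, 12 bit position, 3 bit length.
+def pack_token(pos, length):
+    assert 0 <= pos < 2**12 and 0 <= length < 2**3
+    value = (1 << 15) | (pos << 3) | length  # flag bit marks a <pos, len> pair
+    return value.to_bytes(2, "big")
+
+print(pack_token(42, 5).hex())  # 8155
+\end{lstlisting}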
+% rewrite and implement this:
+%This method is limited by the space a pointer is allowed to take. Other variants let the replacement store the offset to the last replaced occurence, therefore it is harder to reach a position where the space for a pointer runs out.
 
- % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
-\ac{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.\\
+Unfortunately, implementations like the ones from the LZ family do not use probabilities to compress and are therefore not in the main scope of this work. To strengthen the understanding of compression algorithms, this section remains. It will also be useful for the explanation of a hybrid coding method, which will be described later in this chapter.\\
 
 
 \subsection{Shannons Entropy}
-The founder of information theory Claude Elwood Shannon described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
+The founder of information theory, Claude Elwood Shannon, described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal, and his findings are not only useful for forms of information transmission.
 
 % todo insert Fig. 1 shannon_1948
 \begin{figure}[H]
@@ -68,8 +79,8 @@ The founder of information theory Claude Elwood Shannon described entropy and pu
   \label{k4:comsys}
 \end{figure}
 
-Altering \ref{k4:comsys} would show how this can be applied to other technology like compression. The Information source and destination are left unchanged, one has to keep in mind, that it is possible that both are represented by the same phyiscal actor. 
-Transmitter and receiver would be changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time \cite{Shannon_1948}.\\
+Altering \ref{k4:comsys} would show how this can be applied to other technology like compression. The information source and destination are left unchanged; one has to keep in mind that both can be represented by the same physical actor.\\
+Transmitter and receiver would be changed to compression/encoding and decompression/decoding. In between those two, there is no signal but instead an arbitrary period of time \cite{Shannon_1948}.\\
 
 Shannons Entropy provides a formular to determine the 'uncertainty of a probability distribution' in a finite field.
 
@@ -88,27 +99,27 @@ Shannons Entropy provides a formular to determine the 'uncertainty of a probabil
 %  \label{k4:entropy}
 %\end{figure}
 
-He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite probability space. Then x in X are possible final states of an probability experimen over X. Every state that actually occours, while executing the experiment generates infromation which is meassured in \textit{Bits} with the part of the formular displayed in \ref{eq:info-in-bits} \cite{delfs_knebl,Shannon_1948}:
+He defined entropy as shown in \eqref{eq:entropy}. Let $X$ be a finite probability space. Then $x\in X$ are possible final states of a probability experiment over $X$. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{bits} with the part of the formula displayed in \eqref{eq:info-in-bit} \cite{delfs_knebl,Shannon_1948}:
 
-\begin{equation}\label{eq:info-in-bits}
+\begin{equation}\label{eq:info-in-bit}
  log_2(\frac{1}{prob(x)}) \equiv - log_2(prob(x)).
 \end{equation} 
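+
+As a short worked example connecting \eqref{eq:info-in-bit} to the four-letter nucleotide alphabet: if \texttt{A, C, G and T} occur with equal probability $prob(x) = \frac{1}{4}$, each symbol carries $-\log_2(\frac{1}{4}) = 2$ bit of information, and the entropy of the whole distribution is likewise $4 \cdot \frac{1}{4} \cdot 2 = 2$ bit per symbol. This matches the minimum of two bit per nucleotide mentioned in chapter \ref{chap:file formats}; skewed distributions lower this bound.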
 
 %\begin{figure}[H]
 %  \centering
-%  \includegraphics[width=8cm]{k4/information_bits.png}
-%  \caption{The amount of information measured in bits, in case x is the end state of a probability experiment.}
-%  \label{f4:info-in-bits}
+%  \includegraphics[width=8cm]{k4/information_bit.png}
+%  \caption{The amount of information measured in bit, in case x is the end state of a probability experiment.}
+%  \label{f4:info-in-bit}
 %\end{figure}
 
 %todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
 
-%This can be used to find the maximum amount of bits needed to store information.\\ 
+%This can be used to find the maximum amount of bit needed to store information.\\ 
 % alphabet, chain of symbols, kurz entropy erklären
 
 \label{k4:arith}
 \subsection{Arithmetic coding}
-This coding method is an approach to solve the problem of wasting memeory due to the overhead which is created by encoding certain lenghts of alphabets in binary. For example: Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: Less storage would be required, if there would be a possibility to encode more than one letter in two bit.\\
+This coding method is an approach to solve the problem of wasting memory due to the overhead which is created by encoding alphabets of certain lengths in binary \cite{ris76, moffat_arith}. For example: encoding a three-letter alphabet requires at least two bit per letter. Since there are four possible combinations of two bit, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if there were a possibility to encode more than one letter in two bit.\\
 Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \cite{ris76}. % Besides information theory and math, he also published stuff about dna
 This works goal was to define an algorithm that requires no blocking. Meaning the input text could be encoded as one instead of splitting it and encoding the smaller texts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.  
 
@@ -130,24 +141,28 @@ The coding algorithm works with probabilities for symbols in an alphabet. From a
 \end{equation}
 }
 
-This is possible by projecting the input text on a binary encoded fraction between 0 and 1. To get there, each character in the alphabet is represented by an interval between two floating point numbers in the space between 0.0 and 1.0 (exclusively). This interval is determined by the symbols distribution in the input text (interval start) and the the start of the next character (interval end). The sum of all intervals will result in one.\\
+%todo ~figures~
+
+This is possible by projecting the input text onto a binary encoded fraction between 0 and 1. To get there, each character in the alphabet is represented by an interval between two floating point numbers in the space between 0.0 and 1.0 (exclusive). This interval is determined by the symbol's distribution in the input text (interval start) and the start of the next character (interval end). The sum of all intervals will result in one \cite{moffat_arith}.\\
 To encode a text, subdividing is used step by step on the text symbols from start to end. The interval that represents the current character will be subdivided, meaning the chosen interval will be divided into subintervals with the proportional sizes of the intervals calculated in the beginning.\\
 To store as little information as possible, and due to the fact that fractions in binary have limited accuracy, only a single number that lies between the upper and lower end of the last interval will be stored. To encode in binary, the binary floating point representation of any number inside the interval for the last character is calculated, using a process similar to the one described above.
- To summarize the encoding process in short: 
+To summarize the encoding process in short (a code sketch follows the list) \cite{moffat_arith, witten87}:\\
+ 
 \begin{itemize}
 	\item The interval representing the first character is noted. 
 	\item Its interval is split into smaller intervals, with the ratios of the initial intervals between 0.0 and 1.0. 
 	\item The interval representing the second character is chosen.
 	\item This process is repeated, until an interval for the last character is determined.
-	\item A binary floating point number is determined wich lays in between the interval that represents the represents the last symbol. 
+	\item A binary floating point number is determined which lies within the interval that represents the last symbol.\\
 \end{itemize}
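+
+The following is a minimal sketch of this subdivision process, written for illustration only: it works with Python floats, the symbol probabilities are assumed values, and \texttt{\$} stands in for the \acs{EOF} symbol. It is not the implementation used by any of the analysed tools, which work on scaled integers as described below.
+
+\begin{lstlisting}[language=Python]
+# Minimal arithmetic-coding sketch (floating point, illustration only).
+def build_intervals(probabilities):
+    """Map each symbol to a half-open interval [low, high) inside [0, 1)."""
+    intervals, low = {}, 0.0
+    for symbol, prob in probabilities.items():
+        intervals[symbol] = (low, low + prob)
+        low += prob
+    return intervals
+
+def encode(text, probabilities):
+    """Return one number inside the final interval that represents the text."""
+    low, high = 0.0, 1.0
+    intervals = build_intervals(probabilities)
+    for symbol in text:
+        s_low, s_high = intervals[symbol]
+        width = high - low
+        high = low + width * s_high  # subdivide the current interval ...
+        low = low + width * s_low    # ... proportionally to the symbol's interval
+    return (low + high) / 2          # any value in [low, high) identifies the text
+
+# Assumed distribution; '$' plays the role of the EOF symbol.
+print(encode('CACG$', {'A': 0.1, 'C': 0.6, 'G': 0.2, '$': 0.1}))
+\end{lstlisting}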
 % its finite subdividing because of the limitation that comes with processor architecture
 
-For the decoding process to work, the \ac{EOF} symbol must be be present as the last symbol in the text. The compressed file will store the probabilies of each alphabet symbol as well as the floatingpoint number. The decoding process executes in a simmilar procedure as the encoding. The stored probabilies determine intervals. Those will get subdivided, by using the encoded floating point as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating point is found, for every new subdivision, and projecting the probabilies associated with the intervals onto the alphabet, the origin text can be read.\\
+For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol in the text. The compressed file will store the probabilities of each alphabet symbol as well as the floating point number. The decoding process executes in a similar procedure as the encoding. The stored probabilities determine intervals. Those will get subdivided, by using the encoded floating point number as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating point number is found for every new subdivision, and projecting the probabilities associated with the intervals onto the alphabet, the original text can be read \cite{witten87, moffat_arith, ris76}.\\
 % rescaling
 % math and computers
 In computers, arithmetic operations on floating point numbers are processed on integer representations of the given floating point numbers \cite{ieee-float}. The number 0.4, for example, would be represented by $4\cdot 10^{-1}$.\\
-Intervals for the first symbol would be represented by natural numbers between 0 and 100 and $... \cdot 10^-x$. \texttt{x} starts with the value 2 and grows as the intgers grow in length, meaning only if a uneven number is divided. For example: Dividing a uneven number like $5\cdot 10^-1$ by two, will result in $25\cdot 10^-2$. On the other hand, subdividing $4\cdot 10^y$ by two, with any negativ real number as y would not result in a greater \texttt{x} the length required to display the result will match the length required to display the input number.\\
+Intervals for the first symbol would be represented by natural numbers between 0 and 100 and $\ldots \cdot 10^{-x}$. \texttt{x} starts with the value 2 and grows as the integers grow in length, which only happens if an odd number is divided. For example: dividing an odd number like $5\cdot 10^{-1}$ by two will result in $25\cdot 10^{-2}$. On the other hand, subdividing $4\cdot 10^{y}$ by two, with any negative integer as \texttt{y}, would not result in a greater \texttt{x}; the length required to display the result will match the length required to display the input number \cite{witten87, moffat_arith}.\\
+
 % example
 \begin{figure}[H]
   \centering
@@ -157,14 +172,14 @@ Intervals for the first symbol would be represented by natural numbers between 0
 \end{figure}
 
 % finite percission
-The described coding is only feasible on machines with infinite percission. As soon as finite precission comes into play, the algorithm must be extendet, so that a certain length in the resulting number will not be exceeded. Since digital datatypes are limited in their capacity, like unsigned 64-bit integers which can store up to $2^64-1$ bits or any number between 0 and 18,446,744,073,709,551,615. That might seem like a great ammount at first, but considering a unfavorable alphabet, that extends the results lenght by one on each symbol that is read, only texts with the length of 63 can be encoded (62 if \acs{EOF} is exclued).
+The described coding is only feasible on machines with infinite precision \cite{witten87}. As soon as finite precision comes into play, the algorithm must be extended, so that a certain length in the resulting number will not be exceeded. Digital datatypes are limited in their capacity: an unsigned 64-bit integer, for example, can store any number between 0 and $2^{64}-1$, or 18,446,744,073,709,551,615. That might seem like a great amount at first, but considering an unfavorable alphabet that extends the result's length by one bit for each symbol that is read, only texts with a length of 63 symbols can be encoded (62 if \acs{EOF} is excluded) \cite{moffat_arith}.
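+
+One common extension, described in \cite{witten87}, is renormalisation: whenever the lower and the upper bound of the working interval share their leading digit (or bit), that digit can be emitted and both bounds rescaled, so the working precision never has to grow. The following sketch illustrates only this idea, in decimal and with assumed values; it omits the underflow case and is not a reproduction of any analysed implementation.
+
+\begin{lstlisting}[language=Python]
+# Renormalisation sketch in decimal (illustration of the idea only).
+def renormalise(low, high, output):
+    """Emit leading digits shared by low and high, then rescale both bounds."""
+    while int(low * 10) == int(high * 10):  # same leading digit after the point
+        digit = int(low * 10)
+        output.append(digit)
+        low = low * 10 - digit              # shift the shared digit out ...
+        high = high * 10 - digit            # ... and widen the working interval
+    return low, high
+
+digits = []
+low, high = renormalise(0.712, 0.7158, digits)
+print(digits, low, high)  # emits [7, 1]; bounds rescaled to roughly 0.2 and 0.58
+\end{lstlisting}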
 
 \label{k4:huff}
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similarly. The Shannon-Fano coding is not used today, due to the superiority of Huffman coding in both efficiency and effectivity. % todo any source to last sentence. 
 Even though his work was released in 1952, the method he developed is in use today, not only in tools for genome compression but also in compression tools with a more general usage \cite{rfcgzip}.\\ 
-Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bits, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \cite{huf52}. 
+Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, of waste through unused bit combinations for certain alphabet lengths. Huffman did not save more than one symbol in one bit, as is done in arithmetic coding, but he decreased the number of bit used per symbol in a message. This is possible by setting individual bit lengths for the symbols used in the text that should get compressed \cite{huf52}. 
 As with other codings, a set of symbols must be defined. For any text constructed with symbols from the mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for the given text. The binary tree will be constructed following these guidelines \cite{alok17}:
 % greedy algo?
 \begin{itemize}
@@ -177,7 +192,7 @@ As with other codings, a set of symbols must be defined. For any text constructe
 %todo tree building explanation
 An often mentioned difference between Shannon-Fano and Huffman coding is that the first works top down while the latter works bottom up. This means the tree starts with the lowest weights. The nodes that are not leaves have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
 
-Given \texttt{K(W,L)} as a node structure, with the weigth or probability as \texttt{$W_{i}$} and codeword length as \texttt{$L_{i}$} for the node \texttt{$K_{i}$}. Then will \texttt{$L_{av}$} be the average length for \texttt{L} in a finite chain of symbols, with a distribution that is mapped onto \texttt{W} \cite{huf}.
+Given \texttt{K(W,L)} as a node structure, with the weight or probability \texttt{$W_{i}$} and the codeword length \texttt{$L_{i}$} for the node \texttt{$K_{i}$}, \texttt{$L_{av}$} will be the average length for \texttt{L} in a finite chain of symbols, with a distribution that is mapped onto \texttt{W} \cite{huf52}.
 \begin{equation}\label{eq:huf}
   L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
 \end{equation}
@@ -186,26 +201,84 @@ With all important elements described: the sum that results from this equation i
 
 % example
 % todo illustrate
-For this example a four letter alphabet, containing \texttt{A, C, G and T} will be used. The exact input text is not relevant, since only the resulting probabilities are needed. With a distribution like \texttt{<A, $100\frac11=0.11$>, <C, $100\frac71=0.71$>, <G, $100\frac13=0.13$>, <T, $100\frac5=0.05$>}, a probability for each symbol is calculated by dividing the message length by the times the symbol occured.\\ 
-For an alphabet like the one described above, the binary representation encoded in ASCI is shown here \texttt{A -> 01000001, C -> 01000011, G -> 01010100, T -> 00001010}. The average length for any symbol encoded in \acs{ASCII} is eight, while only using four of the available $2^8$ symbols, a overhead of 252 unused bit combinations. For this example it is more vivid, using a imaginary encoding format, without overhead. It would result in a average codeword length of two, because four symbols need a minimum of $2^2$ bits.\\
-So starting with the two lowest weightened symbols, a node is added to connect both.\\
+For this example a four-letter alphabet, containing \texttt{A, C, G and T}, will be used. For this alphabet, the binary representation encoded in \acs{ASCII} is listed in the second column of \ref{t:huff-pre}.
+The average length for any symbol encoded in \acs{ASCII} is eight, while only four of the available $2^8$ symbols are used, an overhead of 252 unused bit combinations. For this example it is more vivid to use an imaginary encoding format without overhead. It would result in an average codeword length of two, because four symbols need a minimum of two bit each ($2^2=4$ combinations).\\
+
+\label{t:huff-pre}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Huffman example table no. 1 (pre encoding)]                       % Caption für das Tabellenverzeichnis
+        {ASCII Codes and probabilities for \texttt{A,C,G and T}} % Caption für die Tabelle selbst
+        \\
+    \toprule
+     \textbf{Symbol} & \textbf{\acs{ASCII} Code} &  \textbf{Probability} & \textbf{Occurrences}\\
+    \midrule
+			A & 0100 0001 & $\frac{11}{100}=0.11$	& 11\\
+			C & 0100 0011 & $\frac{71}{100}=0.71$ & 71\\
+			G & 0100 0111 & $\frac{13}{100}=0.13$ & 13\\
+			T & 0101 0100 & $\frac{5}{100}=0.05$	&  5\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+The exact input text is not relevant, since only the resulting probabilities are needed. To make this example more illustrative, possible occurrences are listed in the rightmost column of \ref{t:huff-pre}. The probability for each symbol is calculated by dividing the number of times the symbol occurred by the message length. The resulting probabilities on a scale between 0.0 and 1.0 for this example are shown in \ref{t:huff-pre} \cite{huf52}.\\
+The tree is created bottom up. In the first step, for each symbol from the alphabet, a node without any connection is formed.\\
+
+\texttt{<A>, <T>, <C>, <G>}\\
+
+Starting with the two lowest-weighted symbols, a node is added to connect both. With the added, blank node the count of available nodes goes down by one. The new node weighs as much as the sum of the weights of its child nodes, so the probability of 0.16 is assigned to \texttt{<A,T>}.\\
+
 \texttt{<A, T>, <C>, <G>}\\
-With the added, blank node the count of available nodes got down by one. The new node weights as much as the sum of weights of its child nodes so the probability of 0.16 is assigned to \texttt{<A,T>}. From there on, the two leafs will only get rearranged through the rearrangement of their temporary root node. Now the two lowest weights are paired as described, until there are only two subtrees or nodes left which can be combined by a root.\\
-\texttt{<C, <A, T>>, <G>}\\
-The \texttt{<C, <A, T>>} has a probability of 0.29. Adding the last node \texttt{G} results in a root node with the probability of 1.0.\\
-With the fact in mind, that left branches are assigned with 0 and right branches with 1, following a path until a leaf is reached reveals the encoding for this particular leaf. With a corresponding tree, created from with the weights, the binary sequences to encode the alphabet would look like this:\\
-\texttt{A -> 0, C -> 11, T -> 100, G -> 101}.\\ 
-Since high weightened and therefore often occuring leafs are positioned to the left, short paths lead to them and so only few bits are needed to encode them. Following the tree on the other side, the symbols occur more rarely, paths get longer and so do the codeword. Applying \eqref{eq:huf} to this example, results in 1.45 bits per encoded symbol. In this example the text would require over one bit less storage for every second symbol.\\
+
+From there on, the two leaves will only be rearranged through the rearrangement of their temporary root node. Now the two lowest weights are paired as described, until there are only two subtrees or nodes left, which can be combined by a root.\\
+
+\texttt{<G, <A, T> >, <C>}\\
+
+The subtree \texttt{<G, <A, T> >} has a probability of 0.29. Adding the last, highest-weighted node \texttt{C} results in a root node with the probability of 1.0.
+For a better understanding of this example, and to support further explanations, the resulting tree is illustrated in \ref{k4:huff-tree}.\\
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=8cm]{k4/huffman-tree.png}
+  \caption{Final version of the Huffman tree for described example.}
+  \label{k4:huff-tree}
+\end{figure}
+
+As illustrated in \ref{k4:huff-tree}, left branches are assigned 0 and right branches 1. Following a path until a leaf is reached reveals the encoding for this particular leaf. With a corresponding tree, created from the weights, the binary sequences to encode the alphabet can be seen in the second column of \ref{t:huff-post}.\\
+
+\label{t:huff-post}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Huffman example table no. 2 (post encoding)]                       % Caption für das Tabellenverzeichnis
+        {Huffman codes for \texttt{A,C,G and T}} % Caption für die Tabelle selbst
+        \\
+    \toprule
+     \textbf{Symbol} & \textbf{Huffman Code} & \textbf{Occurrences}\\
+    \midrule
+			A & 100 & 11\\
+			C & 0		&	71\\
+			G & 11	& 13\\
+			T & 101 &  5\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+Since highly weighted and therefore often occurring leaves are positioned to the left, short paths lead to them and so only a few bit are needed to encode them. Following the tree to the other side, the symbols occur more rarely, paths get longer and so do the codewords. Applying \eqref{eq:huf} to this example results in 1.45 bit per encoded symbol. In this example the text would require over one bit less storage for every second symbol \cite{huf52}.\\
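+
+To make the construction concrete, the following sketch rebuilds the tree and the codes for the example above. It is an illustration only: the use of \texttt{heapq} and the node layout are choices made for this sketch and do not reflect any of the analysed implementations.
+
+\begin{lstlisting}[language=Python]
+import heapq
+
+# Weights are the probabilities from the example table.
+weights = {'A': 0.11, 'C': 0.71, 'G': 0.13, 'T': 0.05}
+
+def huffman_codes(weights):
+    heap = [(w, i, {'symbol': s}) for i, (s, w) in enumerate(weights.items())]
+    heapq.heapify(heap)
+    counter = len(heap)
+    while len(heap) > 1:
+        w1, _, lighter = heapq.heappop(heap)   # lowest remaining weight
+        w2, _, heavier = heapq.heappop(heap)   # second lowest weight
+        # Heavier child to the left, matching the tree in the example.
+        heapq.heappush(heap, (w1 + w2, counter, {'left': heavier, 'right': lighter}))
+        counter += 1
+    codes = {}
+    def walk(node, prefix):
+        if 'symbol' in node:
+            codes[node['symbol']] = prefix
+        else:
+            walk(node['left'], prefix + '0')   # left branches are assigned 0
+            walk(node['right'], prefix + '1')  # right branches are assigned 1
+    walk(heap[0][2], '')
+    return codes
+
+codes = huffman_codes(weights)
+average = sum(weights[s] * len(c) for s, c in codes.items())
+print(codes, average)  # the codes from the table and an average of about 1.45 bit
+\end{lstlisting}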
 % impacting the dark ground called reality
-Leaving the theory and entering the practice, brings some details that lessen this improvement by a bit. A few bytes are added through the need of storing the information contained in the tree. Also, like described in \ref{chap:file formats} most formats, used for persisting \acs{DNA}, store more than just nucleotides and therefore require more characters. What compression ratios implementations of huffman coding provide, will be discussed in \ref{k5:results}.\\
+Leaving the theory and entering the practice brings some details that lessen this improvement by a bit. A few bytes are added through the need of storing the information contained in the tree. Also, as described in \ref{chap:file formats}, most formats used for persisting \acs{DNA} store more than just nucleotides and therefore require more characters \cite{Cock_2009, sam12}.\\
 
 \section{Implementations in Relevant Tools}
-This section should give the reader a quick overview, how a small variety of compression tools implement described compression algorithms. 
+This section should give the reader an overview of how a small variety of compression tools implement the described compression algorithms. It is written with the goal to compensate for a problem that occurs in scientific papers, and sometimes in technical specifications for programs: they often lack information on the implementation to a satisfying extent \cite{sam12, geco, bam}.\\ 
+The information on the following pages was obtained through static code analysis, meaning the comprehension of a program's behaviour and its interactions through the analysis of its source code. This is possible because the analysed tools are openly published and licensed under \ac{GPL} v3 \cite{geco} and \ac{MIT}/Expat \cite{bam}, which permit free use for scientific purposes \cite{gpl, mitlic}.\\
 
 \label{k4:geco}
 \subsection{\ac{GeCo}} % geco
 % differences between geco geco2 and geco3
-This tool has three development stages, the first \acs{GeCo} released in 2016 \acs{geco}. This tool happens to have the smalles codebase, with only eleven C files. The two following extensions \acs{GeCo}2, released in 2020 and the latest version \acs{GeCo}3 have bigger codebases. They also provide features like the ussage of a neural network, which are of no help for this work. Since the file, providing arithmetic coding functionality, do not differ between all three versions, the first release was analyzed.\\
+This tool has three development stages: the first \acs{GeCo}, released in 2016 \cite{geco}. This tool happens to have the smallest codebase, with only eleven C files. The two following extensions, \acs{GeCo}2, released in 2020, and the latest version \acs{GeCo}3, have bigger codebases \cite{geco-repo}. They also provide features like the usage of a neural network, which are of no help for this work. Since the files providing arithmetic coding functionality do not differ between all three versions, the first release was analyzed.\\
 % explain header files
 The header files that this tool includes in \texttt{geco.c} can be split into three categories: basic operations, custom operations and compression algorithms. 
 The basic operations include header files for general purpose functions that can be found in almost any C++ project. The provided functionality includes operations for text output on the command line interface, memory management, random number generation and several calculations on numbers up to real numbers.\\
@@ -214,10 +287,10 @@ The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver
 The second implementation was also licensed by University of Aveiro DETI/IEETA, but no author is mentioned. From interpreting the function names and considering the length of the function bodies, \texttt{arith\_aux.c} could serve as a wrapper for basic functions that are often used in arithmetic coding.\\
 Since original versions of the files licensed by University of Aveiro could not be found, there is no way to determine if the files comply with their originals or if changes have been made. This should be considered while following the static analysis.
 
-Following function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. In this function the actual artihmetic encoding is executed. This function has no redirects to other files, only one function call \texttt{ENCODE\_RENORMALISE} the remaining code consists of arithmetic operations only.
+Following function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. In this function the actual arithmetic encoding is executed. This function has no redirects to other files and only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only \cite{geco-repo}.
 % if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
 
-Following function calls int the \texttt{compressor} section of \texttt{geco.c}, to find the call of \texttt{arith.c} no sign of multithreading could be identified. This fact leaves additional optimization possibilities and will be discussed in \ref{k6:results}.
+Following the function calls in the \texttt{compressor} section of \texttt{geco.c} to find the call into \texttt{arith.c}, no sign of multithreading could be identified. This fact leaves additional optimization possibilities and will be discussed at the end of this work.
 
 %useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \cite{moffat_arith}.
 %\subsection{genie} % genie
@@ -228,7 +301,7 @@ Compression in this fromat is done by a implementation called BGZF, which is a b
 \paragraph{DEFLATE}
 % mix of huffman and LZ77
 The DEFLATE compression algorithm combines \acs{LZ77} and Huffman coding. It is used in well-known tools like gzip. 
-Data is split into blocks. Each block stores a header consisting of three bits. A single block can be stored in one of three forms. Each of which is represented by a identifier that is stored with the last two bits in the header. 
+Data is split into blocks. Each block stores a header consisting of three bit. A single block can be stored in one of three forms, each of which is represented by an identifier that is stored in the last two bit of the header. 
 \begin{itemize}
 	\item \texttt{00}		No compression.
 	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
@@ -236,13 +309,13 @@ Data is split into blocks. Each block stores a header consisting of three bits.
 \end{itemize}
 The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.
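+
+A small sketch of how such a block header could be interpreted, based only on the mapping above (the identifier values follow \cite{rfc1951}; the function and its arguments are assumptions made for this illustration, and the actual bit order inside a DEFLATE stream is defined by the specification):
+
+\begin{lstlisting}[language=Python]
+# Interpreting the three header bit of a block (sketch only).
+BLOCK_FORMS = {0b00: 'no compression',
+               0b01: 'fixed Huffman codes',
+               0b10: 'dynamic Huffman codes',
+               0b11: 'reserved (faulty block)'}
+
+def read_block_header(is_last_block, form_bits):
+    return {'last block': bool(is_last_block), 'form': BLOCK_FORMS[form_bits]}
+
+print(read_block_header(1, 0b10))  # last block, compressed with dynamic codes
+\end{lstlisting}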
 % lz77 part
-As described in \ref{k4:lz77} a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.
-The \acs{LZ77} algorithm is executed before the huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.
+As described in \ref{k4:lz}, a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between the pointer and the literal it points to.
+The \acs{LZ77} algorithm is executed before the Huffman algorithm. Further compression steps differ from the already described algorithm and are covered in the remainder of this section.\\
 
 % huffman part
-Besides header bits and a data block, two Huffman code trees are store. One encodes literals and lenghts and the other distances. They happen to be in a compact form. This archived by a addition of two rules on top of the rules described in \ref{k4:huff}: Codes of identical lengths are orderd lexicographically, directed by the characters they represent. And the simple rule: shorter codes precede longer codes.
+Besides header bit and a data block, two Huffman code trees are stored. One encodes literals and lengths, the other distances. They happen to be in a compact form. This is achieved by the addition of two rules on top of the rules described in \ref{k4:huff}: codes of identical lengths are ordered lexicographically, directed by the characters they represent. And the simple rule: shorter codes precede longer codes.
 To illustrate this with an example:
-For a text consisting out of \texttt{C} and \texttt{G}, following codes would be set for a encoding of two bit per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which would occour more often than the other two characters, the codes would change to a representation like this:
+For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bit per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which would occur more often than the other two characters, the codes would change to a representation like this:
 
 \sffamily
 \begin{footnotesize}
@@ -258,7 +331,7 @@ For a text consisting out of \texttt{C} and \texttt{G}, following codes would be
 \end{footnotesize}
 \rmfamily
 
-Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefor the (in a numerical sense) smaller code is set to represent \texttt{C}.
+Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to contain a leading 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is set to represent \texttt{C}.\\
 With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a smaller alphabet can make a relevant difference.\\
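+
+The following sketch shows how codes can be reconstructed from the code lengths alone under these two rules. It follows the procedure outlined in \cite{rfc1951} only in spirit; the function and variable names, and the alphabetical ordering of the symbols, are choices made for this illustration.
+
+\begin{lstlisting}[language=Python]
+# Rebuild canonical Huffman codes from code lengths only (sketch).
+def canonical_codes(lengths):
+    """lengths: dict mapping symbol -> code length in bit."""
+    # Shorter codes precede longer ones; equal lengths are ordered lexicographically.
+    ordered = sorted(lengths.items(), key=lambda item: (item[1], item[0]))
+    codes, code, prev_len = {}, 0, 0
+    for symbol, length in ordered:
+        code <<= (length - prev_len)           # append zeros when the length grows
+        codes[symbol] = format(code, '0{}b'.format(length))
+        code += 1                              # next code of the same length
+        prev_len = length
+    return codes
+
+# The example alphabet: 'A' is more frequent and gets a one-bit code.
+print(canonical_codes({'A': 1, 'C': 2, 'G': 2}))  # {'A': '0', 'C': '10', 'G': '11'}
+\end{lstlisting}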
 
 % example header, alphabet, data block?
@@ -266,8 +339,18 @@ BGZF extends this by creating a series of blocks. Each can not extend a limit of
 
 \subsubsection{CRAM}
 The improvement of \acs{BAM} \cite{cram-origin}, called \acs{CRAM}, also features a block structure \cite{bam}. The whole file can be separated into four sections, stored in ascending order: a File definition, a CRAM Header Container, multiple Data Containers and a final CRAM EOF Container.\\
-The File definition consists of 26 uncompressed bytes, storing formating information and a identifier. The CRAM header contains meta information about Data Containers and is optionally compressed with gzip. This container can also contain a uncompressed zero-padded section, reseved for \acs{SAM} header information \cite{bam}. This saves time, in case the compressed file is altered and its compression need to be updated. The last container in a \acs{CRAM} file serves as a indicator that the \acs{EOF} is reached. Since in addition information about the file and its structure is stored, a maximum of 38 uncompressed bytes can be reached.\\
-A Data Container can be split into three sections. From this sections the one storing the actual sequence consists of blocks itself, displayed in \ref FIGURE as the bottom row.
+The complete structure is displayed in \ref{k4:cram-struct}. The following paragraph will give a brief description of the high level view of a \acs{CRAM} file, illustrated as the uppermost bar, followed by a closer look at the Data Container, whose components are listed in the bar at the center of \ref{k4:cram-struct}. The most in-depth explanation will be given to the bottom bar, which shows the structure of so-called slices.\\
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=15cm]{k4/cram-structure.png}
+  \caption{\acs{CRAM} file format structure \cite{bam}.}
+  \label{k4:cram-struct}
+\end{figure}
+
+The File definition, illustrated on the left side of the first bar in \ref{k4:cram-struct}, consists of 26 uncompressed bytes, storing formatting information and an identifier. The CRAM header contains meta information about the Data Containers and is optionally compressed with gzip. This container can also contain an uncompressed zero-padded section, reserved for \acs{SAM} header information \cite{bam}. This saves time, in case the compressed file is altered and its compression needs to be updated. The last container in a \acs{CRAM} file serves as an indicator that the \acs{EOF} is reached. Since in addition information about the file and its structure is stored, a maximum of 38 uncompressed bytes can be reached.\\
+A Data Container can be split into three sections. Of these sections, the one storing the actual sequence consists of blocks itself, displayed in \ref{k4:cram-struct} as the bottom row.\\
+
 \begin{itemize}
 	\item Container Header.
 	\item Compression Header.
@@ -278,7 +361,9 @@ A Data Container can be split into three sections. From this sections the one st
 		\item A variable amount of External Data Blocks.
 	\end{itemize}
 \end{itemize}
-The Container Header stores information on how to decompress the data stored in the following block sections. The Compression Header contains information about what kind of data is stored and some encoding information for \acs{SAM} specific flags. The actual data is stored in the Data Blocks. Those consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: External, Huffman and two other methods which happen to be either a form of huffman coding or a shortened binary representation of integers. The External option allows to use gzip, bzip2 which is a form of multiple coding methods including run length encoding and huffman, a encoding from the LZ family called LZMA or a combination of arithmetic and huffman coding called rANS.
+
+The Container Header stores information on how to decompress the data stored in the following block sections. The Compression Header contains information about what kind of data is stored and some encoding information for \acs{SAM}-specific flags \cite{bam}. 
+The actual data is stored in the Data Blocks. Those consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: External, Huffman and two other methods which happen to be either a form of Huffman coding or a shortened binary representation of integers \cite{bam}. The External option allows the use of gzip, bzip2 (which combines multiple coding methods including run-length encoding and Huffman), an encoding from the LZ family called LZMA or a combination of arithmetic and Huffman coding called rANS \cite{sam12}.
 % possible encodings: 
 % external: no encoding or gzip, bzip2, lzma
 % huffman

+ 8 - 4
latex/tex/kapitel/k5_feasability.tex

@@ -24,8 +24,8 @@
 
 %\chapter{Analysis for Possible Compression Improvements}
 \chapter{Environment and Procedure to Determine the State of the Art Efficiency and Compression Ratio of Relevant Tools}
-% goal define
 \label{k5:goals}
+% goal define
 Since improvements must be measured, defining a baseline which would need to be beaten beforehand is necessary. Others have dealt with this task several times with common algorithms and tools, and published their results. But since the test case that needs to be built for this work is rather uncommon in its compilation, the available data are not very useful. Therefore, new test data must be created.\\
 The goal of this is to determine a baseline for the efficiency and effectivity of state of the art tools used to compress \ac{DNA}. This baseline is set by two important factors:
 
@@ -59,7 +59,7 @@ Reading from /proc/cpuinfo reveals processor specifications. Since most of the i
   \item address sizes: 36 bits physical, 48 bits virtual
 \end{itemize}
 
-Full CPU secificaiton can be found in appendix.%todo finish
+Full CPU specification can be found in \ref{a5:cpu}.
 
 % explanation on some entry: https://linuxwiki.de/proc/cpuinfo
 %\begin{em}
@@ -170,8 +170,12 @@ Following criteria is reqiured for test data to be appropriate:
 \end{itemize}
 A second, bigger set of test files was required. This would verify that the test results are not limited to small files. A minimum of one gigabyte of average file size was set as a boundary. This corresponds to over five times the size of the first set.\\
 % data gathering
-Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The ensembl database featured defined criteria, so the first available set called Homo\_sapiens.GRCh38.dna.chromosome were chosen \cite{ftp-ensembl}. This sample includes over 20 chromosomes, whereby considering the filenames, one chromosome is contained in each single file. After retrieving and unpacking the files, write privileges on them was withdrawn. So no tool could alter any file contents.
-Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since available servers \acs{ftp-ensembl, ftp-ncbi, ftp-isgr} offer several thousand files, stored in variating, deep directory structures, mapping filesize, filetype and file path takes too much time and resources for the scope of this work. This problematic combined with a easily triggered overflow in the samtools library, resulted in a set of several, manualy searched and tested \acs{FASTq} files. Compared to the first set, there is a noticable lack of quantity, but the filesizes happen to be of a fortunate distribution. With pairs of two files in the ranges of 0.6, 1.1, 1.2 and one file with a size of 1.3 gigabyte, effects on scaling sizes should be clearly visible.\\
+Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The ensembl database matched the defined criteria, so the first available set called:
+
+\texttt{Homo\_sapiens.GRCh38.dna.chromosome}
+
+was chosen \cite{ftp-ensembl}. This sample includes 20 chromosomes, whereby, judging by the filenames, each single file contains one chromosome. After retrieving and unpacking the files, write privileges on them were withdrawn, so no tool could alter any file contents without sufficient permission.
+Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since the available servers \cite{ftp-ensembl, ftp-ncbi, ftp-igsr} offer several thousand files, stored in varying, deep directory structures, mapping file size, file type and file path takes too much time and resources for the scope of this work. This problem, combined with an easily triggered overflow in the samtools library, resulted in a set of several manually searched and tested \acs{FASTq} files. Compared to the first set, there is a noticeable lack of quantity, but the file sizes happen to be of a fortunate distribution. With pairs of files in the ranges of 0.6, 1.1 and 1.2 gigabyte and one file with a size of 1.3 gigabyte, effects of scaling sizes should be clearly visible.\\
  
 % todo make sure this needs to stay.
 \noindent The following tools and parameters were used in this process:

+ 142 - 82
latex/tex/kapitel/k6_results.tex

@@ -1,51 +1,111 @@
 \chapter{Results and Discussion}
-The two tables \ref{t:effectivity}, \ref{t:efficiency} contain raw measurement values for the two goals, described in \ref{k5:goals}. The first table visualizes how long each compression procedure took, in milliseconds. The second one contains file sizes in bytes. Each row contains information about one of the files following this naming scheme:
+The tables \ref{a6:testsets-size} and \ref{a6:testsets-time} contain the raw measurement values for the two goals described in \ref{k5:goals}. The table \ref{a6:testsets-time} lists how long each compression procedure took, in milliseconds. \ref{a6:testsets-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, a naming scheme is used to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the number prefix \texttt{1.} was used, the second set is marked with a \texttt{2.}. For example, the fourth file of each test set is named \texttt{File 1.4} and \texttt{File 2.4} in the tables. The name of the associated source file for the first set is:
 
-\texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa}
+\texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}
 
-To improve readability, the filename in all tables were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder with the number following \texttt{File}.\\
+Since the source files of the second set are not named as consistently as in the first one, a third column was added in \ref{k6:set2size}, which maps the table ID to the source file name.\\
+The files contained in each test set, as well as their sizes, can be found in the tables \ref{k6:set1size} and \ref{k6:set2size}.
+The first test set contained a total of 2.8 \acs{GB} unevenly spread over 21 files, while the second test set contained 7 \acs{GB} in total, spread over seven files.\\
 
+\label{k6:set1size}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[h]{ p{.4\textwidth} p{.4\textwidth}} 
+    \caption[First Test Set Files and their Sizes in MB]                       % Caption für das Tabellenverzeichnis
+        {Files contained in the First Test Set and their Sizes in \acs{MB}} % Caption für die Tabelle selbst
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{Size in \acs{MB}} \\
+    \midrule
+			File 1.1& 241.38\\
+			File 1.2& 234.823\\
+			File 1.3& 192.261\\
+			File 1.4& 184.426\\
+			File 1.5& 176.014\\
+			File 1.6& 165.608\\
+			File 1.7& 154.497\\
+			File 1.8& 140.722\\
+			File 1.9& 134.183\\
+			File 1.10& 129.726\\
+			File 1.11& 130.976\\
+			File 1.12& 129.22\\
+			File 1.13& 110.884\\
+			File 1.14& 103.786\\
+			File 1.15& 98.888\\
+			File 1.16& 87.589\\
+			File 1.17& 80.724\\
+			File 1.18& 77.927\\
+			File 1.19& 56.834\\
+			File 1.20& 62.483\\
+			File 1.21& 45.289\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+\label{k6:set2size}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.4\textwidth}} 
+    \caption[Second Test Set Files and their Sizes in MB]                       % Caption für das Tabellenverzeichnis
+        {Files contained in the Second Test Set, their Sizes in \acs{MB} and Source File Names} % Caption für die Tabelle selbst
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{Size in \acs{MB}} & \textbf{Source File Name}\\
+    \midrule
+			File 2.1& 1188.976& SRR002905.recal.fastq\\
+			File 2.2& 1203.314& SRR002906.recal.fastq\\
+			File 2.3& 627.467& SRR002815.recal.fastq\\
+			File 2.4& 676.0& SRR002816.recal.fastq\\
+			File 2.5& 1066.431& SRR002817.recal.fastq\\
+			File 2.6& 1071.095& SRR002818.recal.fastq\\
+			File 2.7& 1240.564& SRR002819.recal.fastq\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
 \section{Interpretation of Results}
-The units milliseconds and bytes store a high precision. Unfortunately they are harder to read and compare, solely by the readers eyes. Therefore the data was altered. Sizes in \ref{t:sizepercent} are displayed in percentage, in relation to the respective source file. Meaning the compression with \acs{GeCo} on:
+The units milliseconds and bytes provide a high precision. Unfortunately they are harder to read and compare solely by the reader's eyes. Therefore the data was altered. Sizes in \ref{k6:sizepercent} are displayed in percent, in relation to the respective source file. This means the compression with \acs{GeCo} on:
 
-Homo\_sapiens.GRCh38.dna.chromosome.11.fa 
+\texttt{Homo\_sapiens.GRCh38.dna.chromosome.11.fa}
 
 resulted in a compressed file which was only 17.6\% as big.
-Runtimes in \ref{t:time} were converted into seconds and have been rounded to two decimal places.
+Runtimes in \ref{k6:time} were converted into seconds and have been rounded to two decimal places.
 Also a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
-\label{t:sizepercent}
+\label{k6:sizepercent}
 \sffamily
 \begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity]                       % Caption für das Tabellenverzeichnis
+  \begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectivity for first test set]                       % Caption für das Tabellenverzeichnis
         {File sizes in different compression formats in \textbf{percent}} % Caption für die Tabelle selbst
         \\
     \toprule
      \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
     \midrule
-			File 1& 18.32& 24.51& 22.03\\
-			File 2& 20.15& 26.36& 23.7\\
-			File 3& 19.96& 26.14& 23.69\\
-			File 4& 20.1& 26.26& 23.74\\
-			File 5& 17.8& 22.76& 20.27\\
-			File 6& 17.16& 22.31& 20.11\\
-			File 7& 16.21& 21.69& 19.76\\
-			File 8& 17.43& 23.48& 21.66\\
-			File 9& 18.76& 25.16& 23.84\\
-			File 10& 20.0& 25.31& 23.63\\
-			File 11& 17.6& 24.53& 23.91\\
-			File 12& 20.28& 26.56& 23.57\\
-			File 13& 19.96& 25.6& 23.67\\
-			File 14& 16.64& 22.06& 20.44\\
-			File 15& 79.58& 103.72& 92.34\\
-			File 16& 19.47& 25.52& 22.6\\
-			File 17& 19.2& 25.25& 22.57\\
-			File 18& 19.16& 25.04& 22.2\\
-			File 19& 18.32& 24.4& 22.12\\
-			File 20& 18.58& 24.14& 21.56\\
-			File 21& 16.22& 22.17& 19.96\\
+			%geco bam and cram in percent
+			File 1.1& 18.32& 24.51& 22.03\\
+			File 1.2& 20.28& 26.56& 23.57\\
+			File 1.3& 20.4& 26.58& 23.66\\
+			File 1.4& 20.3& 26.61& 23.56\\
+			File 1.5& 20.12& 26.46& 23.65\\
+			File 1.6& 20.36& 26.61& 23.6\\
+			File 1.7& 19.64& 26.15& 23.71\\
+			File 1.8& 20.4& 26.5& 23.67\\
+			File 1.9& 17.01& 23.25& 20.94\\
+			File 1.10& 20.15& 26.36& 23.7\\
+			File 1.11& 19.96& 26.14& 23.69\\
+			File 1.12& 20.1& 26.26& 23.74\\
+			File 1.13& 17.8& 22.76& 20.27\\
+			File 1.14& 17.16& 22.31& 20.11\\
+			File 1.15& 16.21& 21.69& 19.76\\
+			File 1.16& 17.43& 23.48& 21.66\\
+			File 1.17& 18.76& 25.16& 23.84\\
+			File 1.18& 20.0& 25.31& 23.63\\
+			File 1.19& 17.6& 24.53& 23.91\\
+			File 1.20& 19.96& 25.6& 23.67\\
+			File 1.21& 16.64& 22.06& 20.44\\
       &&&\\
-			\textbf{Total}& 21.47& 28.24& 25.59\\
+			\textbf{Total}& 18.98& 24.99& 22.71\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}
@@ -53,37 +113,37 @@ Also a line was added to the bottom of each table, showing the average percentag
 
 Overall, Samtools \acs{BAM} resulted in a 75.01\% size reduction, the \acs{CRAM} method improved this by roughly 2.3\%. \acs{GeCo} provided the greatest reduction with 81.02\%. This gap of about 4\% comes with a comparatively great sacrifice in time.\\
 
-\label{t:time}
+\label{k6:time}
 \sffamily
 \begin{footnotesize}
   \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency]                       % Caption für das Tabellenverzeichnis
+    \caption[Compression Efficiency for first test set]                       % Caption für das Tabellenverzeichnis
         {Compression duration in seconds} % Caption für die Tabelle selbst
         \\
     \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
+     \textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
     \midrule
-			File 1 & 23.5& 3.786& 16.926\\
-			File 2 & 24.65& 3.784& 17.043\\
-			File 3 & 2.016& 3.123& 13.999\\
-			File 4 & 19.408& 3.011& 13.445\\
-			File 5 & 18.387& 2.862& 12.802\\
-			File 6 & 17.364& 2.685& 12.015\\
-			File 7 & 15.999& 2.503& 11.198\\
-			File 8 & 14.828& 2.286& 10.244\\
-      File 9 & 12.304& 2.078& 9.21\\
-			File 10 & 13.493& 2.127& 9.461\\
-			File 11 & 13.629& 2.132& 9.508\\
-			File 12 & 13.493& 2.115& 9.456\\
-			File 13 & 99.902& 1.695& 7.533\\
-			File 14 & 92.475& 1.592& 7.011\\
-			File 15 & 85.255& 1.507& 6.598\\
-			File 16 & 82.765& 1.39& 6.089\\
-			File 17 & 82.081& 1.306& 5.791\\
-			File 18 & 79.842& 1.277& 5.603\\
-			File 19 & 58.605& 0.96& 4.106\\
-			File 20 & 64.588& 1.026& 4.507\\
-			File 21 & 41.198& 0.721& 3.096\\
+			File 1.1 & 23.5& 3.786& 16.926\\
+			File 1.2 & 24.65& 3.784& 17.043\\
+			File 1.3 & 2.016& 3.123& 13.999\\
+			File 1.4 & 19.408& 3.011& 13.445\\
+			File 1.5 & 18.387& 2.862& 12.802\\
+			File 1.6 & 17.364& 2.685& 12.015\\
+			File 1.7 & 15.999& 2.503& 11.198\\
+			File 1.8 & 14.828& 2.286& 10.244\\
+      File 1.9 & 12.304& 2.078& 9.21\\
+			File 1.10 & 13.493& 2.127& 9.461\\
+			File 1.11 & 13.629& 2.132& 9.508\\
+			File 1.12 & 13.493& 2.115& 9.456\\
+			File 1.13 & 99.902& 1.695& 7.533\\
+			File 1.14 & 92.475& 1.592& 7.011\\
+			File 1.15 & 85.255& 1.507& 6.598\\
+			File 1.16 & 82.765& 1.39& 6.089\\
+			File 1.17 & 82.081& 1.306& 5.791\\
+			File 1.18 & 79.842& 1.277& 5.603\\
+			File 1.19 & 58.605& 0.96& 4.106\\
+			File 1.20 & 64.588& 1.026& 4.507\\
+			File 1.21 & 41.198& 0.721& 3.096\\
       &&&\\
       \textbf{Total}&42.57&2.09&9.32\\
     \bottomrule
@@ -91,42 +151,42 @@ Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} m
 \end{footnotesize}
 \rmfamily
 
-As \ref{t:time} is showing, the average compression duration for \acs{GeCo} is at 42.57s. That is a little over 33s, or 78\% longer than the average runtime of samtools for compressing into the \acs{CRAM} format.\\
+As \ref{k6:time} shows, the average compression duration for \acs{GeCo} is at 42.57s. That is a little over 33s longer than the average runtime of samtools for compressing into the \acs{CRAM} format; samtools therefore needs roughly 78\% less time.\\
 Since \acs{CRAM} requires a file in \acs{BAM} format, the \acs{CRAM} column is calculated by adding the time needed to compress into \acs{BAM} to the time needed to compress into \acs{CRAM}. 
 While the \acs{SAM} format is required for compressing a \acs{FASTA} into \acs{BAM} and further into \acs{CRAM}, it features no compression in itself. However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might be counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. This can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines whereas \acs{SAM} files store a sequence in just one line, so converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
 % (hi)storytime
 Before interpreting this data further, a quick view into the development processes: \acs{GeCo} development stopped in the year 2016, while Samtools has been under development since 2015 to this day, with over 70 people contributing.\\
 
 % big tables
-For the second set of testdata, the file identifier was set to follow the scheme \texttt{File 2.x} where x is a number between zero and seven. While the first set of testdata had names that matched the file identifiers, considering its numbering, the second set had more variating names. The mapping between identifier and file can be found in \ref{}. % todo add testset tables
-Reviewing \ref{t:recal-time} one will notice, that \acs{GeCo} reached a runtime over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an m indicates how many minutes each run took.
+%For the second set of test data, the file identifier was set to follow the scheme \texttt{File 2.x} where x is a number between zero and seven. While the first set of test data had names that matched the file identifiers, considering its numbering, the second set had more variating names. The mapping between identifier and file can be found in \ref{}. % todo add test set tables
+Reviewing \ref{k6:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an m indicates how many minutes each run took.
 
-\label{t:recal-size}
+\label{k6:recal-size}
 \sffamily
 \begin{footnotesize}
   \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity for greater files]                       % Caption für das Tabellenverzeichnis
+    \caption[Compression Effectivity for second test set]                       % Caption für das Tabellenverzeichnis
         {File sizes in different compression formats in \textbf{percent}} % Caption für die Tabelle selbst
         \\
     \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
+     \textbf{ID.} & \textbf{\acs{GeCo}\% }& \textbf{Samtools \acs{BAM}\% }& \textbf{Samtools \acs{CRAM}\% } \\
     \midrule
 			%geco bam and cram in percent
-			File 1& 1.00& 6.28& 5.38\\
-			File 2& 0.98& 6.41& 5.52\\
-			File 3& 1.21& 8.09& 7.17\\
-			File 4& 1.20& 7.70& 6.85\\
-			File 5& 1.08& 7.58& 6.72\\
-			File 6& 1.09& 7.85& 6.93\\
-			File 7& 0.96& 5.83& 4.63\\
+			File 2.1& 1.00& 6.28& 5.38\\
+			File 2.2& 0.98& 6.41& 5.52\\
+			File 2.3& 1.21& 8.09& 7.17\\
+			File 2.4& 1.20& 7.70& 6.85\\
+			File 2.5& 1.08& 7.58& 6.72\\
+			File 2.6& 1.09& 7.85& 6.93\\
+			File 2.7& 0.96& 5.83& 4.63\\
       &&&\\
-			\textbf{Total}	1.07& 7.11& 6.17\\
+			\textbf{Total}& 1.07& 7.11& 6.17\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}
 \rmfamily
 
-\label{t:recal-time}
+\label{k6:recal-time}
 \sffamily
 \begin{footnotesize}
   \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
@@ -134,24 +194,24 @@ Reviewing \ref{t:recal-time} one will notice, that \acs{GeCo} reached a runtime
         {Compression duration in seconds} % Caption für die Tabelle selbst
         \\
     \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
+      \textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
     \midrule
 			%compress time for geco, bam and cram in seconds
-			File 1 & 1m58.427& 16.248& 23.016\\
-			File 2 & 1m57.905& 15.770& 22.892\\
-			File 3 & 1m09.725& 07.732& 12.858\\
-			File 4 & 1m13.694& 08.291& 13.649\\
-			File 5 & 1m51.001& 14.754& 23.713\\
-			File 6 & 1m51.315& 15.142& 24.358\\
-			File 7 & 2m02.065& 16.379& 23.484\\
+			File 2.1 & 1m58.427& 16.248& 23.016\\
+			File 2.2 & 1m57.905& 15.770& 22.892\\
+			File 2.3 & 1m09.725& 07.732& 12.858\\
+			File 2.4 & 1m13.694& 08.291& 13.649\\
+			File 2.5 & 1m51.001& 14.754& 23.713\\
+			File 2.6 & 1m51.315& 15.142& 24.358\\
+			File 2.7 & 2m02.065& 16.379& 23.484\\
       &&&\\
-			\textbf{Total}	 & 1m43.447& 13.474& 20.567\\
+			\textbf{Total}& 1m43.447& 13.474& 20.567\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}
 \rmfamily
 
-In both tables \ref{t:recal-time} and \ref{t:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{t:recal-size} a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven were the one with the greatest size (\~1.3 Gigabyte). Closely folled by file one and two (\~1.2 Gigabyte). 
+In both tables \ref{k6:recal-time} and \ref{k6:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{k6:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, File 2.7 was the one with the greatest size ($\sim$1.3 gigabyte), closely followed by File 2.1 and File 2.2 ($\sim$1.2 gigabyte). 
 
 \section{View on Possible Improvements}
 So far, this work went over formats for storing genomes, methods to compress files (in mentioned formats) and through tests where implementations of the named algorithms compressed several files, and analyzed the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools and takes more time to run through. So in this test run, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM} with its mix of Huffman coding and \acs{LZ77}, or Samtools custom compression format \acs{CRAM}. Comparing results in \autocite{survey} supports this statement. This study used \acs{FASTA}/Multi-FASTA files from 71MB to 166MB and found that \acs{GeCo} had a varying compression ratio from 12.34 to 91.68 times smaller than the input reference and also resulted in long runtimes of up to over 600 minutes \cite{survey}. Since this study focused on another goal than this work and therefore used different test variables and environments, the results cannot be compared directly. But what can be taken from this is that arithmetic coding, at least in \acs{GeCo}, is in need of a runtime improvement.\\
@@ -179,7 +239,7 @@ The question for how many probabilities are needed, needs to be answered, to sta
 %One should keep in mind that this is only one of many approaches. Any proove of other approaches which reduces the probability determination, can be taken in instead. 
 
 % second bullet point (mutlithreading aspect=
-The Second point must be asked, because the improvement in counting only one nucleotide in comparison to counting three, would be to little to be called relevant. Especially if multithreading is a option. Since in the static codeanalysis in \ref{k3:GeCo} revealed no multithreading, the analysis for improvements when splitting the workload onto several threads should be considered, before working on an improvement based on Petoukhovs findings. This is relevant, because some improvements, like the one described above, will loose efficiency if only subsections of a genomes are processed. A tool like OpenMC for multithreading C programs would possibly supply the required functionality to develop a prove of concept \cite{cthreading, pet21}.
+The second point must be asked because the improvement in counting only one nucleotide in comparison to counting three would be too little to be called relevant, especially if multithreading is an option. Since the static code analysis in \ref{k4:geco} revealed no multithreading, the analysis for improvements when splitting the workload onto several threads should be considered, before working on an improvement based on Petoukhov's findings. This is relevant because some improvements, like the one described above, will lose efficiency if only subsections of a genome are processed. A tool like OpenMP for multithreading C programs would possibly supply the required functionality to develop a proof of concept \cite{cthreading, pet21}.
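+
+To illustrate the direction only (not a finished design, and not code taken from any analysed tool), the following sketch splits nucleotide counting, as it would be needed for the probability determination, across worker processes and merges the partial counts. The chunking strategy and the toy input are assumptions made for this sketch.
+
+\begin{lstlisting}[language=Python]
+# Sketch: splitting nucleotide counting across processes (illustration only).
+from collections import Counter
+from multiprocessing import Pool
+
+def count_chunk(chunk):
+    return Counter(chunk)                      # nucleotide frequencies of one chunk
+
+def parallel_counts(sequence, workers=4):
+    size = max(1, len(sequence) // workers)
+    chunks = [sequence[i:i + size] for i in range(0, len(sequence), size)]
+    with Pool(workers) as pool:
+        partial = pool.map(count_chunk, chunks)
+    return sum(partial, Counter())             # merge the partial counts
+
+if __name__ == '__main__':
+    print(parallel_counts('ACGT' * 1000 + 'AAAA'))
+\end{lstlisting}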
 % theoretical improvement with pseudocode
 But what could an improvement look like, not considering possible difficulties multithreading would bring?
 To answer this, first a mechanism to measure a possible improvement must be determined. To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately this only covers loops and conditions as a whole. Therefore a more detailed view on operations must be created: 
@@ -224,10 +284,10 @@ The fact that there are obviously chains of repeating nucleotides in genomes. Fo
 Without determining probabilities, one can see that the amount of \texttt{A}s outnumbers the \texttt{T}s and neither \texttt{C} nor \texttt{G} are present. With the whole 1.2 gigabytes, the distribution will align more, but cutting out a subsection of relevant size with an unequal distribution will have an impact on the probabilities of the whole sequence. If a greater sequence would lead to a more equal distribution, this knowledge could be used to help determine distributions on subsequences of one with equally distributed probabilities.
 % length cutting
 
-
 % how is data interpreted
 % why did the tools result in this, what can we learn
 % improvements
 % - goal: less time to compress
 % 	- approach: optimize probability determination
 % 	-> how?
+

+ 78 - 8
latex/tex/literatur.bib

@@ -109,7 +109,7 @@
   url   = {https://github.com/samtools/hts-specs.},
 }
 
-@Article{bed,
+@Online{bed,
   author = {Sanger Institute, Genome Research Limited},
   date   = {2022-10-20},
   title  = {BED Browser Extensible Data},
@@ -172,16 +172,14 @@
 
 @TechReport{rfcgzip,
  author       = {L. Peter Deutsch and Jean-Loup Gailly and Mark Adler and Glenn Randers-Pehrson},
-  institution  = {RFC Editor},
+  date         = {1996-05},
   title        = {GZIP file format specification version 4.3},
-  note         = {\url{http://www.rfc-editor.org/rfc/rfc1952.txt}},
   number       = {1952},
   type         = {RFC},
-  url          = {http://www.rfc-editor.org/rfc/rfc1952.txt},
   howpublished = {Internet Requests for Comments},
   issn         = {2070-1721},
   month        = {May},
-  publisher    = {RFC Editor},
+  publisher    = {RFC Editor},
   year         = {1996},
 }
 
@@ -269,7 +267,7 @@
   publisher    = {Oxford University Press ({OUP})},
 }
 
-@Article{cram_origin,
+@Article{cram-origin,
   author       = {Markus Hsi-Yang Fritz and Rasko Leinonen and Guy Cochrane and Ewan Birney},
   date         = {2011-01},
   journaltitle = {Genome Research},
@@ -363,10 +361,11 @@
 
 @TechReport{isompeg,
   author      = {{ISO Central Secretary}},
+  date        = {2020-10},
   institution = {International Organization for Standardization},
  title       = {MPEG-G},
   language    = {en},
-  number      = {ISO/IEC 23092:2019},
+  number      = {ISO/IEC 23092-1:2020},
   type        = {Standard},
   url         = {https://www.iso.org/standard/23092.html},
  year        = {2020},
@@ -417,7 +416,7 @@
   year      = {2003},
 }
 
-@Online{gecoRepo,
+@Online{geco-repo,
   author = {Cobilab},
   date   = {2022-11-19},
   title  = {Repositories for the three versions of GeCo},
@@ -432,4 +431,75 @@
   publisher = {{MDPI} {AG}},
 }
 
+@TechReport{iso-ascii,
+  author      = {{ISO/IEC JTC 1/SC 2 Coded character sets}},
+  date        = {1998-04},
+  institution = {International Organization for Standardization},
+  title       = {Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1},
+  type        = {Standard},
+  address     = {Geneva, CH},
+  key         = {ISO8859-1:1998},
+  volume      = {1998},
+  year        = {1998},
+}
+
+@Book{dict,
+  author    = {McIntosh, Colin},
+  date      = {2013},
+  title     = {Cambridge International Dictionary of English},
+  isbn      = {9781107035157},
+  pages     = {1856},
+  publisher = {Cambridge University Press},
+}
+
+@TechReport{rfc-udp,
+  author       = {J. Postel},
+  date         = {1980-08-28},
+  institution  = {RFC Editor},
+  title        = {User Datagram Protocol},
+  doi          = {10.17487/RFC0768},
+  number       = {768},
+  pagetotal    = {3},
+  url          = {https://www.rfc-editor.org/info/rfc768},
+  howpublished = {RFC 768},
+  month        = aug,
+  publisher    = {RFC Editor},
+  series       = {Request for Comments},
+  year         = {1980},
+}
+
+@TechReport{isoutf,
+  author      = {{ISO Central Secretary}},
+  institution = {International Organization for Standardization},
+  title       = {Information technology -- Universal Coded Character Set ({UCS})},
+  number      = {ISO/IEC 10646:2020},
+  type        = {Standard},
+  year        = {2020},
+}
+
+@Article{lz77,
+  author  = {Ziv, J. and Lempel, A.},
+  title   = {A universal algorithm for sequential data compression},
+  doi     = {10.1109/TIT.1977.1055714},
+  number  = {3},
+  pages   = {337-343},
+  volume  = {23},
+  journal = {IEEE Transactions on Information Theory},
+  year    = {1977},
+}
+
+@Online{code-analysis,
+  author = {Ryan Dewhurst},
+  date   = {2022-11-20},
+  editor = {Kirsten S and Nick Bloor and Sarah Baso and James Bowie and Evgeniy Ryzhkov and Iberiam and Ann Campbell and Jonathan Marcil and Christina Schelin and Jie Wang and Fabian and Achim and Dirk Wetter},
+  title  = {Static Code Analysis},
+  url    = {https://owasp.org/www-community/controls/Static_Code_Analysis},
+}
+
+@Online{gpl,
+  title = {GNU General Public License, Version 3},
+  url   = {http://www.gnu.org/licenses/gpl-3.0.html},
+}
+
+@Online{mitlic,
+  title = {MIT License},
+  url   = {https://spdx.org/licenses/MIT.html},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 1 - 0
latex/tex/thesis.tex

@@ -196,6 +196,7 @@
 % Appendix. If you do not have an appendix, simply remove
 % this part.
 \appendix
+\input{kapitel/a5_feasability}
 \input{kapitel/a6_results}
 
 \end{document}