2 Commits efd24c0c69 ... d83d916189

Author SHA1 Message Date
  u d83d916189 improved several chapters. +30 pages. working on results. 3 years ago
  u e8426e00d1 added static code analysis for geco 3 years ago

BIN
latex/tex/bilder/k2/cell.png


BIN
latex/tex/bilder/k4/arith-resize.png


BIN
latex/tex/bilder/k4/com-sys.png


BIN
latex/tex/bilder/k4/cram-structure.png


+ 0 - 1
latex/tex/kapitel/abkuerzungen.tex

@@ -16,7 +16,6 @@
   \acro{FASTq}{File Format Based on FASTA}
  \acro{FTP}{File Transfer Protocol}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
-	\acro{GB}{Gigabyte}
   \acro{GeCo}{Genome Compressor}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
   \acro{LZ77}{Lempel Ziv 1977}

+ 3 - 3
latex/tex/kapitel/k1_introduction.tex

@@ -1,14 +1,14 @@
 \chapter{Introduction}
 % general information and intro
-Understanding how things in our cosmos work, was and still is a pleasure the human being always wants to fullfill. Getting insights into the rawest form of organic live is possible through storing and studying information, embeded in genetic codes. Since live is complex, there is a lot of information, which requires a lot of memory.\\
+Understanding how things in our cosmos work was and still is a desire that human beings have always wanted to fulfill. Getting insights into the rawest form of organic life is possible through storing and studying the information embedded in genetic codes. Since life is complex, there is a lot of information, which requires a lot of memory.\\
 % ...Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
 % compression values and goals
-With compression tools, the problem of storing information got restricted. Compressed data requires less space and therefore less time to be tranported over networks. This advantage is scaleable and since genetic information needs a lot of storage, even in a compressed state, improvements are welcomed. Since this field is, compared to others, like computer theorie and compression approchaes, relatively new, there is much to discover and new findings are not unusual. From some of this findings, new tools can be developed. They optimally increase two factors: the speed at which data is compressed and the compresseion ratio, meaning the difference between uncompressed and compressed data.\\
+With compression tools, the problem of storing information is mitigated. Compressed data requires less space and therefore less time to be transported over networks. This advantage is scalable, and since genetic information needs a lot of storage, even in a compressed state, improvements are welcome. Since this field is, compared to others like computer theory and compression approaches, relatively new, there is much to discover and new findings are not unusual. From some of these findings, new tools can be developed. They optimally improve two factors: the speed at which data is compressed and the compression ratio, meaning the relation between the size of the uncompressed and the compressed data.\\
 % ...
 % more exact explanation
 
 % actual focus in short and simple terms
-New discoveries in the universal rules of stochastical organisation of genomes might provide a base for new algoriths and therefore new tools or an improvement of existing ones for genome compression. The aim of this work is to analyze the current state of the art for probabilistic compression tools and their algorithms, and ultimately determine whether mentioned discoveries are already used. \texttt{might be thrown out due to time limitations} -> If this is not the case, there will be an analysation of how and where this new approach could be imprelented and if it would improve compression methods.\\
+New discoveries in the universal rules of the stochastic organization of genomes might provide a base for new algorithms and therefore new tools, or an improvement of existing ones, for genome compression. The aim of this work is to analyze the current state of the art of probabilistic compression tools and their algorithms, and ultimately determine whether the mentioned discoveries are already used. \texttt{might be thrown out due to time limitations} -> If this is not the case, there will be an analysis of how and where this new approach could be implemented and whether it would improve compression methods.\\
 
 % focus and structure of work in greater detail 
 To reach a common ground, the first pages will give the reader a quick overview of the structure of human DNA. There will also be a brief explanation of some basic terms used in biology and computer science. The knowledge basis of this work is formed by describing differences between the file formats used to store genome data. In addition to this, a section relevant for compression will follow, which will go through the state of the art in coding theory.\\

+ 20 - 10
latex/tex/kapitel/k2_dna_structure.tex

@@ -18,24 +18,34 @@
 %- IMPACT ON COMPRESSION
 
 \chapter{The Structure of the Human Genome and how its Digital Form is Compressed}
-\section{Structure of Human \ac{DNA}}
+\section{Structure of Human \acs{DNA}}
 To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown on the structure of any living organism.\\
-% todo add picture
-All living organisms, like plants and animals, are made of cells (a human body can consist out of several trillion cells) \cite{cells}.
-A cell in itself is a living organism; The smalles one possible. It consists out of two layers from which the inner one is called nucleus. The nucleus contains chromosomes and those chromosomes hold the genetic information in form of \ac{DNA}. 
+
+\begin{figure}[ht]
+  \centering
+  \includegraphics[width=6cm]{k2/cell.png}
+  \caption{A superficial representation of the physical positioning of genomes, showing a double helix (bottom), a chromosome (upper right) and a cell (upper center).}
+  \label{k2:gene-overview}
+\end{figure}
+
+All living organisms, like plants and animals, are made of cells; a human body can consist of several trillion cells \autocite{cells}.\\
+A cell is itself a living organism, the smallest one possible. It consists of two layers, of which the inner one is called the nucleus. The nucleus contains chromosomes, and those chromosomes hold the genetic information in the form of \ac{DNA}. 
  
-\ac{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggestes, of two single helix. 
+\acs{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggests, of two single helices. 
 
 \begin{figure}[ht]
   \centering
   \includegraphics[width=15cm]{k2/dna.png}
-  \caption{A purely diagrammatic figure of the components \ac{DNA} is made of. The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-suggar chains \cite{dna_structure}.}
+  \caption{A purely diagrammatic figure of the components \acs{DNA} is made of. The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-sugar chains \autocite{dna_structure}.}
   \label{k2:dna-struct}
 \end{figure}
 
-Each of them consists of two main components: the Suggar Phosphat backbone, which is not relevant for this work and the Bases. The arrangement of Bases represents the Information stored in the \ac{DNA}. A base is an organic molecule, they are called Nucleotides \cite{dna_structure}. \\
+Each of them consists of two main components: the sugar-phosphate backbone, which is not relevant for this work, and the bases. The arrangement of the bases represents the information stored in the \acs{DNA}. A base is an organic molecule; the bases are also called nucleotides \autocite{dna_structure}. \\
 % describe Genomes?
 
-For this work, nucleotides are the most important parts of the \ac{DNA}. A Nucleotide can occour in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them got a Counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine.\\
-From the perspective of an computer scientist: The content of one helix must be stored, to persist the full information. In more practical terms: The nucleotides of only one (entire) helix needs to be stored physically, to save the information of the whole \ac{DNA} because the other half can be determined by ``inverting'' the stored one. An example would show the counterpart for e.g.: \texttt{adenine, guanine, adenine} chain which would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initiat. So the example would change to \texttt{AGA} in one Helix, \texttt{TCT} in the other.\\
-This representation ist commonly used to store \ac{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required but for now 'A', 'C', 'G' and 'T' should be the only concern.
+For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can occur in one of four forms: adenine, thymine, guanine or cytosine. Each of them has a counterpart with which a bond can be established: adenine bonds with thymine, and guanine bonds with cytosine.\\
+From the perspective of a computer scientist: only the content of one helix must be stored to persist the full information. In more practical terms: the nucleotides of only one (entire) helix need to be stored physically to save the information of the whole \acs{DNA}, because the other half can be determined by ``inverting'' the stored one. For example, the counterpart of an \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. So the example would change to \texttt{AGA} in one helix, \texttt{TCT} in the other.\\
+This representation is commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required, but for now 'A', 'C', 'G' and 'T' should be the only concern.
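+To make the ``inverting'' step concrete, the following minimal C sketch (purely illustrative and not taken from any of the tools discussed later; the helper name \texttt{complement} is chosen here) derives the counterpart helix from a stored one:
+\begin{verbatim}
+#include <stdio.h>
+
+/* Map one nucleotide to its bonding counterpart: A <-> T and
+   C <-> G. Any other character is returned unchanged, so the
+   caller can detect unexpected symbols. */
+static char complement(char base)
+{
+    switch (base) {
+    case 'A': return 'T';
+    case 'T': return 'A';
+    case 'C': return 'G';
+    case 'G': return 'C';
+    default:  return base;
+    }
+}
+
+int main(void)
+{
+    const char *helix = "AGA";           /* stored helix  */
+    for (const char *p = helix; *p; p++) /* derived helix */
+        putchar(complement(*p));
+    putchar('\n');                       /* prints: TCT   */
+    return 0;
+}
+\end{verbatim}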

+ 59 - 44
latex/tex/kapitel/k3_datatypes.tex

@@ -4,7 +4,7 @@
 %# BASICS
 %- \acs{DNA} STRUCTURE
 %- DATA TYPES
-% - BAM/\ac{FASTQ}
+% - BAM/\acs{FASTq}
 % - NON STANDARD
 %- COMPRESSION APPROACHES
 % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
@@ -17,76 +17,91 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
-% todo: use this https://www.reddit.com/r/bioinformatics/comments/7wfdra/eli5_what_are_the_differences_between_fastq_and/
-
 % bigger picture - structure chapters like this:
 % what is it/how does it work
 % where are limits (e.g. BAM)
 % what is our focus (and maybe 'why')
 
 \section{File Formats used to Store DNA}
-\label{chap:filetypes}
-As described in previous chapters \ac{DNA} can be represented by a string with the buildingblocks A,T,G and C. Using a common fileformat for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store each single symbol.\\
-The \ac{ascii} table is a characterset, registered in 1975 and to this day still in use to encode texts digitally. For the purpose of communication bigger charactersets replaced \ac{ascii}. It is still used in situations where storage is short.
-% todo grund dass ascii abgelöst wurde -> zu wenig darstellungsmöglichkeiten. Pro heute -> weniger overhead pro character
-Storing a single \textit{A} with \ac{ascii} encoding, requires 8 bit (\,excluding magic bytes and the bytes used to mark \ac{EOF})\ . Since there are at least $2^8$ or 128 displayable symbols. The buildingblocks of \ac{DNA} require a minimum of four letters, so two bits are needed
+\label{chap:file-formats}
+As described in previous chapters, \ac{DNA} can be represented by a string over the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the amount of characters or symbols in the used alphabet defines how many bits are used to store each single symbol.\\
+The \ac{ASCII} table is a character set registered in 1975 and to this day still in use to encode text digitally. For the purpose of communication, bigger character sets have replaced \acs{ASCII}; it is still used in situations where storage is short.\\
+% reason ASCII was replaced -> too few display possibilities. Pro for today -> less overhead per character
+Storing a single \textit{A} with \acs{ASCII} encoding requires 8 bits (excluding magic bytes and the bytes used to mark \ac{EOF}), since the character set defines $2^7$ or 128 displayable symbols. The building blocks of \acs{DNA} require a minimum of four letters, so two bits per symbol are sufficient.
 % cout out examples. Might be needed later or elsewhere
 % \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. 
-In most tools, more than four symbols are used. This is due to the complexity in sequencing \ac{DNA}. It is not 100\% preceice, so additional symbols are used to mark nucelotides that could not or could only partly get determined. Further a so called quality score is used to indicate the certainty, for each single nucleotide, that is was sequenced correctly as what was stored.\\
-More common everyday-usage text encodings like unicode require 16 bits per letter. So settling with \ac{ascii} has improvement capabilitie but is, on the other side, more efficient than using bulkier alternatives like unicode.\\
+In most tools, more than four symbols are used. This is due to the complexity of sequencing \acs{DNA}: it is not 100\% precise, so additional symbols are used to mark nucleotides that could not or could only partly be determined. Further, a so-called quality score is used to indicate the certainty, for each single nucleotide, that it got sequenced correctly.\\
+More common everyday text encodings like \acs{UTF-16} require 16 bits per letter. So settling for \acs{ASCII} leaves room for improvement but is, on the other hand, more efficient than using bulkier alternatives.\\
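+As a purely illustrative sketch of the two-bit idea, using one possible mapping (\texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}, which is not prescribed by any of the formats discussed below), four nucleotides can be packed into a single byte:
+\begin{verbatim}
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+/* Two-bit codes: 00->A, 01->T, 10->G, 11->C, packed four
+   nucleotides per byte, most significant pair first. */
+static uint8_t code(char base)
+{
+    switch (base) {
+    case 'A': return 0; case 'T': return 1;
+    case 'G': return 2; default : return 3; /* 'C' */
+    }
+}
+
+int main(void)
+{
+    const char *seq = "AGATTCGA";       /* 8 ASCII bytes        */
+    size_t n = strlen(seq);
+    for (size_t i = 0; i < n; i += 4) { /* one byte per 4 bases */
+        uint8_t byte = 0;
+        for (size_t j = 0; j < 4 && i + j < n; j++)
+            byte |= code(seq[i + j]) << (6 - 2 * j);
+        printf("0x%02X ", (unsigned)byte);  /* packed output    */
+    }
+    putchar('\n');
+    return 0;
+}
+\end{verbatim}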
 
-Several people and groups have developed different fileformats to store genomes. Unfortunally for this work, there is no defined standard filetype or set of filetypes, therefore one has to gather information by themselve. In order to not go beyond scope, this work will focus only on fileformats that fullfill following criteria:\\
+% differences between the information that is stored
+Formats for storing uncompressed genomic data can be sorted into several categories. Three noticeable ones are \autocite{survey}:
 \begin{itemize}
-  \item{the format has reputation, either through a scientific paper, that prooved its superiority to other relevant tools or through a broad ussage of the format.}
-  \item{the format should no specialize on only one type of \ac{DNA}.}
-  \item{the format mainly stores nucleotide seuqences and does not neccesarily include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
-  \item{the format is open source. Otherwise improvements can not be tested, without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
-  \item{the compression methode used in the format is based on probabilities.}
+	\item sequenced reads
+	\item aligned data
+	\item sequence variation
+\end{itemize}
+The categories are listed in ascending order of complexity, considering their use case and data structure. Sequence variation, also called haplotype data, describes formats that store graph-based structures focused on analysing variations between different genomes \autocite{haplo, survey}. 
+Sequenced reads focus on storing continuous nucleotide chains from a sequenced genome \autocite{survey}.
+Aligned data is somewhat similar to sequenced reads, with the difference that instead of one whole chain, overlapping subsequences are stored. This could be described as a rawer form of sequenced reads. This way, aligned data stores additional information on how certain it is that a specific part of a genome was read correctly.
+The focus of this work lies on the compression of sequenced data, but not on the likelihood of how accurate the data might be. Therefore, only formats that include sequenced reads will be worked with.\\
+
+% my criteria
+Several people and groups have developed different file formats to store genomes. Unfortunately, the only standard for storing genomic data is fairly new \autocite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill the following criteria:\\
+\begin{itemize}
+  \item{the format has a reputation, established either}
+	\begin{itemize}
+		\item through a scientific paper that proved its superiority over other relevant tools, or
+		\item through broad usage of the format, determined by its presence on \acs{ftp} servers that focus on supporting scientific research.
+	\end{itemize}
+  \item{the format should not specialize in only one type of \acs{DNA}.}
+  \item{the format stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
+  \item{the format is open source. Otherwise, improvements cannot be tested without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
+\end{itemize}
 
 Information on available formats was gathered through various Internet platforms \autocite{ensembl, ucsc, ga4gh}. 
-Some common fileformats found:
+Some common file formats found:
 \begin{itemize}
 % which is relevant? 
-  \item{BED}  % \autocite{bed}
-  \item{CRAM} % \autocite{cram}
-  \item{FASTA} % \autocite{}
-  \item{\ac{FASTQ}} % \autocite{}
-  \item{GFF} % \autocite{}
-  \item{SAM/\ac{BAM}} % \autocite{}
-  \item{twoBit}% \autocite{}
-  \item{VCF}% \autocite{}
-
+  \item{\ac{FASTA}} 
+  \item{\ac{FASTq}} 
+  \item{\ac{SAM}/\ac{BAM}} \autocite{bam, sam12}
+  \item{\ac{CRAM}} \autocite{bam, sam12}
+  \item{twoBit} \autocite{twobit}
+  %\item{VCF} genotype format -> anaylses differences of two seq
 \end{itemize}
+
+% groups: sequence data, alignment data, haplotypic
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
-Since methods to store this kind of Data are still in development, there are many more filetypes. The few, mentioned above are used by different organisations and 
-%todo calc percentage 
-are backed by scientific papers.\\
-Considering the first criteria, by searching through anonymously accesable \ac{ftp} servers, only two formats are used commonly: FASTA or its extension \ac{FASTQ} and the \ac{BAM} Format. %todo <- add ftp servers to cite
+Since methods to store this kind of data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of an unofficial standard for sequenced reads \autocite{survey, geco, vertical, cram-origin}. \\
+Considering the first criterion, by searching through anonymously accessible \acs{ftp} servers, only two formats are commonly used: \acs{FASTA} or its extension \acs{FASTq}, and the \acs{BAM} format \autocite{ftp-igsr, ftp-ncbi, ftp-ensembl}.
 
-\subsection{\ac{FASTQ}}
-% todo add some fasta knowledge
-Is a text base format for storing sequenced data. It saves nucleotides as letters and in extend to that, the quality values are saved.
-\ac{FASTQ} files are split into multiples of four, each four lines contain the informations for one sequence. The exact structure of \ac{FASTQ} format is as follows:
-\texttt{
-Line 1: Sequence identifier aka. Title, starting with an @ and an optional description.\\
-Line 2: The seuqence consisting of nucleoids, symbolized by A, T, G and C.\\
-Line 3: A '+' that functions as a seperator. Optionally followed by content of Line 1.\\
-Line 4: quality line(s). consisting of letters and special characters in the \ac{ascii} scope.}\\
 
-The quality values have no fixed type, to name a few there is the sanger format, the solexa format introduced by Solexa Inc., the Illumina and the QUAL format which is generated by the PHRED software. 
+\subsection{\acs{FASTA} and \acs{FASTq}}
+The rather simple \acs{FASTA} format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called the header, contains a comment section starting with \texttt{>} followed by custom text \autocite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself, like its size in bytes.\\
+The other section contains the sequenced genome, where each nucleotide is represented by one of the characters \texttt{A, C, G or T}. There are three more nucleotide characters that store additional information and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G and T} will be paid attention to.\\
+The second section can take multiple lines and is terminated by an empty line. After that, either the end of the file is reached or another tuple of header and sequence follows.\\
+% fastq
+In addition to what its predecessor stores, \acs{FASTq} files contain a quality score. The file content consists of four sections, whereby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formatted in this order \autocite{illufastq}:
+\begin{itemize}
+	\item Line 1: the sequence identifier (title), starting with an @ and an optional description.
+	\item Line 2: the sequence, consisting of nucleotides symbolized by A, T, G and C.
+	\item Line 3: a '+' that functions as a separator, optionally followed by the content of line 1.
+	\item Line 4: the quality line, consisting of letters and special characters in the \acs{ASCII} range.
+\end{itemize}
+The quality scores have no fixed format. To name a few, there are the Sanger format, the Solexa format introduced by Solexa Inc., the Illumina format and the QUAL format, which is generated by the PHRED software \autocite{Cock_2009}.\\
 The quality value shows the estimated probability of error in the sequencing process.
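+To illustrate the four-line structure described above, the following minimal C sketch (a toy reader written for this explanation, not taken from any sequencing tool) reads a single \acs{FASTq} record from standard input and checks the two marker characters:
+\begin{verbatim}
+#include <stdio.h>
+#include <string.h>
+
+/* Read one FASTq record (four lines) from stdin and report the
+   sequence length. Line lengths are capped for brevity; a real
+   parser would handle arbitrarily long lines. */
+int main(void)
+{
+    char title[1024], seq[1024], sep[1024], qual[1024];
+
+    if (!fgets(title, sizeof title, stdin) ||
+        !fgets(seq,   sizeof seq,   stdin) ||
+        !fgets(sep,   sizeof sep,   stdin) ||
+        !fgets(qual,  sizeof qual,  stdin))
+        return 1;                       /* truncated record   */
+
+    if (title[0] != '@' || sep[0] != '+')
+        return 2;                       /* not a FASTq record */
+
+    seq[strcspn(seq, "\r\n")] = '\0';   /* strip line break   */
+    printf("sequence length: %zu\n", strlen(seq));
+    return 0;
+}
+\end{verbatim}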
-[...]
 
+\label{r2:sam}
 \subsection{Sequence Alignment Map}
 % src https://github.com/samtools/samtools
-\ac{SAM} often seen in its compressed, binary representation \ac{BAM} with the fileextension \texttt{.bam}, is part of the SAMtools package, a uitlity tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by TABs. It uses 7-bit US-ASCII, to be precise Charset ANSI X3.4-1968 as defined in RFC1345. The structure is more complex than the one in \ac{FASTQ} and described best, accompanied by an example:
+\acs{SAM}, often seen in its compressed, binary representation \acs{BAM} with the file extension \texttt{.bam}, is the format handled by the SAMtools package, a utility for processing SAM, BAM and CRAM files. A \acs{SAM} file is a text-based format delimited by tabs \autocite{bam}. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 \autocite{rfcansi}. The structure is more complex than that of \acs{FASTq} and is best described accompanied by an example:
 
 \begin{figure}[ht]
   \centering
   \includegraphics[width=15cm]{k3/sam-structure.png}
   \caption{SAM/BAM file structure example}
-  \label{k_datatypes:bam-struct}
+  \label{k2:bam-struct}
 \end{figure}
 
-
+Compared to \acs{FASTA}, \acs{SAM} and its further compressed forms store more information. As displayed in \ref{k2:bam-struct}, this is done by adding identifiers for reads, e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots, e.g. \textbf{ATAGCT......} in the split alignment +r004 \autocite{bam}. A full description of the information stored in \acs{SAM} files would be of little value to this work; therefore further details are left out but can be found in \autocite{sam12} or at \autocite{bam}.
+Samtools provides the feature to convert a \acs{FASTA} file into the \acs{SAM} format. Since there is no way to calculate the mentioned additional information from the information stored in \acs{FASTA}, the converted files only store two lines: the first stores metadata about the file and the second stores the nucleotide sequence in just one line.

+ 93 - 39
latex/tex/kapitel/k4_algorithms.tex

@@ -23,20 +23,22 @@
 % file structure/format <-> datatypes. länger beschreiben: e.g. File formats to store dna
 % 3.2.1 raus
 
-
 \section{Compression approaches}
 The process of compressing data serves the goal of generating an output that is smaller than its input data.\\
-In many cases, like in gene compressing, the compression is idealy lossless. This means it is possible for every compressed data, to receive the whole information, which were available in the origin data, by decompressing it.\\
+In many cases, like in genome compression, the compression is ideally lossless. This means that for every piece of compressed data, the whole information that was available in the original data can be recovered by decompressing it.\\
 Before going on, the difference between information and data should be emphasized.\\
 % excurs data vs information
-Data contians information. In digital data  clear, physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bits can store up to $2^11$ combinations of bits and a 1\acs{GB} drive can store no more than 1\acs{GB} data. Information on the other hand, is limited by the way how it is stored. In some cases the knowledge received in a earlier point in time must be considered too, but this can be neglected for reasons described in the subsection \ref{k4:dict}.\\
+Data contains information. In digital data, clear physical limitations delimit what and how much can be stored: a bit can only store 0 or 1, eleven bits can store up to $2^{11}$ combinations of bits and a 1 Gigabyte drive can store no more than 1 Gigabyte of data. Information, on the other hand, is limited by the way it is stored. In some cases the knowledge received at an earlier point in time must be considered too, but this can be neglected for reasons described in subsection \ref{k4:dict}.\\
 % excurs information vs data
-The boundaries of information, when it comes to storing capabilities, can be illustrated by using the example mentioned above. A drive with the capacity of 1\acs{GB} could contain a book in form of images, where the content of each page is stored in a single image. Another, more ressourcefull way would be storing just the text of every page in \acs{UTF-16}. The information, the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without loosing any information.\\
+The boundaries of information, when it comes to storage capabilities, can be illustrated by using the example mentioned above. A drive with a capacity of 1 Gigabyte could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be storing just the text of every page in \acs{UTF-16}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
 % excurs end
-In contrast to lossless compression, lossy compression might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typicaly not necessary to persist the origin information. This works with certain audio and pictures formats, and in network protocols \cite{cnet13}.
-For \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two mayor approaches are known: the dictionary coding and the entropy coding. Methods from both fields, that aquired reputation, are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
+In contrast to lossless compression, lossy compression may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to persist the original information. This works with certain audio and picture formats, and in network protocols \autocite{cnet13}.
+For \acs{DNA} a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two major approaches are known: dictionary coding and entropy coding. Methods from both fields that acquired reputation are described in detail below \autocite{cc14, moffat20, moffat_arith, alok17}.\\
 
 \subsection{Dictionary coding}
+\textbf{Disclaimer}
+Unfortunately, known implementations like those of the LZ family do not use probabilities to compress and are therefore not in the main scope of this work. To strengthen the understanding of compression algorithms, this section will remain. Also, a hybrid implementation described later uses both dictionary and entropy coding.\\
+
 \label{k4:dict}
 Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. For a better understanding, this should be illustrated by a short example:
 % demo substrings
@@ -44,27 +46,30 @@ Looking at the string 'stationary' it might be smart to store 'station' and 'ary
 % end demo
 The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is made out of the input data. Each first occurrence is left uncompressed. Every occurrence of a string after the first one points to its first occurrence. Since this 'pointer' needs less space than the string it points to, a decrease in size is achieved.\\
 
-
 % unuseable due to the lack of probability
 % - known algo
 \subsubsection{The LZ Family}
 The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77 and LZ78}, which are short for Lempel Ziv 1977 and 1978; the number at the end indicates when the algorithm was published. Today, derivatives of these algorithms are widely used in Unix compression solutions like gzip. Those tools are also used in compressing \ac{DNA}.\\
-\ac{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.\\
+\acs{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically this is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.\\
 % example 
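+The following illustrative C sketch performs a greatly simplified \acs{LZ77}-style pass over a short nucleotide string: it greedily searches the already processed prefix for the longest match and emits either a (distance, length) pointer or a literal. Real encoders add a bounded search window, hash chains and a compact bit layout on top of this idea:
+\begin{verbatim}
+#include <stdio.h>
+#include <string.h>
+
+/* Greedy LZ77-style factorisation: for every position, look for
+   the longest match in the text seen so far and emit either a
+   (distance, length) pointer or a literal. */
+int main(void)
+{
+    const char *in = "AGAGAGACGT";
+    size_t n = strlen(in), i = 0;
+
+    while (i < n) {
+        size_t best_len = 0, best_dist = 0;
+        for (size_t j = 0; j < i; j++) {          /* candidate start  */
+            size_t len = 0;
+            while (i + len < n && in[j + len] == in[i + len])
+                len++;
+            if (len > best_len) { best_len = len; best_dist = i - j; }
+        }
+        if (best_len >= 2) {                      /* worth a pointer  */
+            printf("(dist=%zu, len=%zu)\n", best_dist, best_len);
+            i += best_len;
+        } else {
+            printf("literal '%c'\n", in[i]);      /* first occurrence */
+            i++;
+        }
+    }
+    return 0;
+}
+\end{verbatim}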
 
  % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
 
-Unfortunally, known implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore out of scope for this work. Since finding repeting sections and their location might also be improved, this chapter will remain.\\
 
 \subsection{Shannons Entropy}
-The founder of information theory Claude Elwood Shannon described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
+The founder of information theory, Claude Elwood Shannon, described entropy in a publication of 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal, and his findings are not only useful for forms of information transmission. 
 
 % todo insert Fig. 1 shannon_1948
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=15cm]{k4/com-sys.png}
+  \caption{Schematic diagram of a general communication system by Shannon's definition \autocite{Shannon_1948}.}
+  \label{k4:comsys}
+\end{figure}
 
-Altering this figure shows how it can be used for other technology like compression.\\
-The Information source and destination are left unchanged, one has to keep in mind, that it is possible that both are represented by the same phyiscal actor. 
-transmitter and receiver are changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time \cite{Shannon_1948}.\\
+Altering figure \ref{k4:comsys} shows how it can be applied to other technology like compression. The information source and destination are left unchanged; one has to keep in mind that both can be represented by the same physical actor. 
+Transmitter and receiver would be changed to compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time \autocite{Shannon_1948}.\\
 
 Shannon's entropy provides a formula to determine the 'uncertainty of a probability distribution' over a finite probability space.
 
@@ -83,7 +88,7 @@ Shannons Entropy provides a formular to determine the 'uncertainty of a probabil
 %  \label{k4:entropy}
 %\end{figure}
 
-He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite probability space. Then x in X are possible final states of an probability experimen over X. Every state that actually occours, while executing the experiment generates infromation which is meassured in \textit{Bits} with the part of the formular displayed in \ref{eq:info-in-bits}\cite{delfs_knebl,Shannon_1948}:
+He defined entropy as shown in equation \eqref{eq:entropy}. Let X be a finite probability space. Then x in X are the possible final states of a probability experiment over X. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{Bits} with the part of the formula displayed in \eqref{eq:info-in-bits} \autocite{delfs_knebl,Shannon_1948}:
 
 \begin{equation}\label{eq:info-in-bits}
 \log_2\left(\frac{1}{prob(x)}\right) \equiv - \log_2(prob(x)).
@@ -105,7 +110,7 @@ He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite prob
 \subsection{Arithmetic coding}
 This coding method is an approach to solve the problem of wasting memory due to the overhead which is created by encoding certain lengths of alphabets in binary. For example: encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if there were a possibility to encode more than one letter in two bits.\\
 Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \autocite{ris76}. % Besides information theory and math, he also published stuff about dna
-This works goal was to define an algorithm that requires no blocking. Meaning the input text could be encoded as one instead of splitting it and encoding the smaller texts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.  
+This work's goal was to define an algorithm that requires no blocking, meaning the input text could be encoded as a whole instead of splitting it and encoding the smaller texts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \autocite{ris76}.  
 
 % unusable because equation is only half correct
 \mycomment{
@@ -140,19 +145,28 @@ To store as few informations as possible and due to the fact that fractions in b
 % its finite subdividing because of the limitation that comes with processor architecture
 
 For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol in the text. The compressed file will store the probabilities of each alphabet symbol as well as the floating-point number. The decoding process executes in a similar procedure as the encoding. The stored probabilities determine intervals. Those will get subdivided, by using the encoded floating-point number as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating-point number is found for every new subdivision, and projecting the probabilities associated with the intervals onto the alphabet, the original text can be read.\\
-% sclaing
-In computers, arithmetic operations on floating point numbers are processed with integer representations of given floating point number \cite{ieee-float}. The number 0.4 + would be represented by $4\cdot 10^-1$.\\
+% rescaling
+% math and computers
+In computers, arithmetic operations on floating-point numbers are processed with integer representations of the given floating-point number \autocite{ieee-float}. The number 0.4 would be represented by $4\cdot 10^{-1}$.\\
 Intervals for the first symbol would be represented by natural numbers between 0 and 100 and $\ldots \cdot 10^{-x}$. \texttt{x} starts with the value 2 and grows as the integers grow in length, meaning only if an uneven number is divided. For example: dividing an uneven number like $5\cdot 10^{-1}$ by two will result in $25\cdot 10^{-2}$. On the other hand, when subdividing $4\cdot 10^{y}$ by two, with any negative real number as y, \texttt{x} would not grow: the length required to display the result will match the length required to display the input number.\\
+% example
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=15cm]{k4/arith-resize.png}
+  \caption{Illustrative rescaling in the arithmetic coding process \autocite{witten87}.}
+  \label{k4:rescale}
+\end{figure}
+
 % finite percission
-The described coding is only feasible on machines with infinite percission. As soon as finite precission comes into play, the algorithm must be extendet, so that a certain length in the resulting number will not be exceeded. Since digital datatypes are limited in their capacity, like unsigned 64-bit integers which can store up to $2^64-1$ bits or any number between 0 and 18.446.744.073.709.551.615. That might seem like a great ammount at first, but considering a unfavorable alphabet, that extends the results lenght by one on each symbol that is read, only texts with the length of 63 can be encoded (62 if \acs{EOF} is exclued).
+The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that a certain length of the resulting number will not be exceeded. Digital data types are limited in their capacity; an unsigned 64-bit integer, for example, can store values up to $2^{64}-1$, i.e. any number between 0 and 18,446,744,073,709,551,615. That might seem like a great amount at first, but considering an unfavorable alphabet that extends the result's length by one on each symbol that is read, only texts with a length of 63 symbols can be encoded (62 if \acs{EOF} is excluded).
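+The interval narrowing itself can be illustrated with a short C sketch. The alphabet and probabilities below are assumptions made up for the example, and double precision is used only for readability; as discussed above, practical coders work on integer ranges with renormalisation instead:
+\begin{verbatim}
+#include <stdio.h>
+
+/* Narrow [low, high) once per input symbol, using the cumulative
+   probabilities of the (made-up) alphabet A, C, G, T, $ where
+   '$' stands for the EOF symbol. Any number inside the final
+   interval, e.g. low itself, encodes the whole text. */
+int main(void)
+{
+    const char   sym[]  = { 'A', 'C', 'G', 'T', '$' };
+    const double prob[] = { 0.4, 0.2, 0.2, 0.1, 0.1 };
+    const char  *text   = "AGA$";
+
+    double low = 0.0, high = 1.0;
+    for (const char *p = text; *p; p++) {
+        double range = high - low, cum = 0.0;
+        for (int i = 0; i < 5; i++) {
+            if (sym[i] == *p) {        /* narrow to this symbol's slice */
+                high = low + range * (cum + prob[i]);
+                low  = low + range * cum;
+                break;
+            }
+            cum += prob[i];
+        }
+        printf("after '%c': [%f, %f)\n", *p, low, high);
+    }
+    return 0;
+}
+\end{verbatim}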
 
 \label{k4:huff}
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
-D. A. Huffmans work focused on finding a method to encode messages with a minimum of redundance. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similar. The \ac{SF} coding is not used today, due to the superiority in both efficiency and effectivity, in comparison to Huffman. % todo any source to last sentence. Rethink the use of finite in the following text
-Even though his work was released in 1952, the method he developed is in use  today. Not only tools for genome compression but in compression tools with a more general ussage \cite{rfcgzip}.\\ 
-Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bits, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \cite{huf52}. 
-As with other codings, a set of symbols must be defined. For any text constructed with symbols from mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for given text. The binary tree will be constructed after following guidelines \cite{alok17}:
+D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similarly. The Shannon-Fano coding is not used today, due to Huffman coding's superiority in both efficiency and effectivity. % todo any source to last sentence. Rethink the use of finite in the following text
+Even though his work was released in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with a more general usage \autocite{rfcgzip}.\\ 
+Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, of waste through unused bits for certain alphabet lengths. Huffman did not store more than one symbol per bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for the symbols used in the text that should get compressed \autocite{huf52}. 
+As with other codings, a set of symbols must be defined. For any text constructed with symbols from the mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for the given text. The binary tree is constructed following these guidelines \autocite{alok17}:
 % greedy algo?
 \begin{itemize}
   \item Every symbol of the alphabet is one leaf.
@@ -163,14 +177,14 @@ As with other codings, a set of symbols must be defined. For any text constructe
 \end{itemize}
 %todo tree building explanation
 % storytime might need to be rearranged
-A often mentioned difference between \acs{FA} and Huffman coding, is that first is working top down while the latter is working bottom up. This means the tree starts with the lowest weights. The nodes that are not leafs have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
+An often mentioned difference between Shannon-Fano and Huffman coding is that the former works top down while the latter works bottom up. This means the tree starts with the lowest weights. The nodes that are not leaves have no value ascribed to them; they only need their weight, which is defined by the weights of their individual child nodes \autocite{moffat20, alok17}.\\
 
-Given \texttt{K(W,L)} as a node structure, with the weigth or probability as \texttt{$W_{i}$} and codeword length as \texttt{$L_{i}$} for the node \texttt{$K_{i}$}. Then will \texttt{$L_{av}$} be the average length for \texttt{L} in a finite chain of symbols, with a distribution that is mapped onto \texttt{W} \cite{huf}.
+Given \texttt{K(W,L)} as a node structure, with the weight or probability \texttt{$W_{i}$} and the codeword length \texttt{$L_{i}$} for the node \texttt{$K_{i}$}, \texttt{$L_{av}$} is then the average length over \texttt{L} for a finite chain of symbols with a distribution that is mapped onto \texttt{W} \autocite{huf}.
 \begin{equation}\label{eq:huf}
   L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
 \end{equation}
 The equation \eqref{eq:huf} describes the path to the desired state for the tree. The upper bound \texttt{n} is assigned the number of symbols in the alphabet. The tuple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, which also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, an element in \texttt{l} contains a sequence of zeros and ones. Since in this coding method there is no fixed length for codewords, the premise of a \texttt{prefix free code} must be adhered to. This means there can be no codeword that matches any prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords, but adding a codeword like 01 or 00 would make the set invalid because of the prefix 0, which is already a single codeword.\\
-With all important elements described: the sum that results from this equation is the average length a symbol in the encoded input text will require to be stored \cite{huf52, moffat20}.
+With all important elements described: the sum that results from this equation is the average length a symbol in the encoded input text will require to be stored \autocite{huf52, moffat20}.
 
 % example
 % todo illustrate
@@ -185,24 +199,47 @@ With the fact in mind, that left branches are assigned with 0 and right branches
 \texttt{A -> 0, C -> 11, T -> 100, G -> 101}.\\ 
 Since highly weighted and therefore often occurring leaves are positioned to the left, short paths lead to them, and so only few bits are needed to encode them. Following the tree to the other side, the symbols occur more rarely, paths get longer and so do the codewords. Applying \eqref{eq:huf} to this example results in 1.45 bits per encoded symbol. In this example the text would require over one bit less storage for every second symbol.\\
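+The bottom-up construction can be sketched in a few lines of C. The weights below are assumptions chosen for this illustration; with this particular choice the resulting code lengths match the codes above (\texttt{A}: 1 bit, \texttt{C}: 2 bits, \texttt{T} and \texttt{G}: 3 bits each):
+\begin{verbatim}
+#include <stdio.h>
+
+#define NSYM  4
+#define NNODE (2 * NSYM - 1)
+
+/* Huffman code lengths for a four-symbol alphabet: repeatedly
+   merge the two lightest remaining roots (O(n^2), fine for tiny
+   alphabets). A leaf's code length equals the number of parent
+   links on its way up to the root. */
+int main(void)
+{
+    const char sym[NSYM]      = { 'A', 'C', 'T', 'G' };
+    double     w[NNODE]       = { 0.70, 0.15, 0.075, 0.075 };
+    int        parent[NNODE]  = { 0 };
+    int        is_root[NNODE] = { 1, 1, 1, 1 };   /* leaves are roots   */
+
+    for (int next = NSYM; next < NNODE; next++) { /* NSYM-1 merges      */
+        int a = -1, b = -1;                       /* two lightest roots */
+        for (int i = 0; i < next; i++) {
+            if (!is_root[i]) continue;
+            if (a < 0 || w[i] < w[a])      { b = a; a = i; }
+            else if (b < 0 || w[i] < w[b]) { b = i; }
+        }
+        w[next] = w[a] + w[b];
+        parent[a] = parent[b] = next;
+        is_root[a] = is_root[b] = 0;
+        is_root[next] = 1;
+    }
+
+    double avg = 0.0;
+    for (int i = 0; i < NSYM; i++) {              /* walk leaf -> root  */
+        int len = 0;
+        for (int node = i; node != NNODE - 1; node = parent[node])
+            len++;
+        printf("%c: weight %.3f, code length %d\n", sym[i], w[i], len);
+        avg += w[i] * len;
+    }
+    printf("average code length: %.2f bits per symbol\n", avg);
+    return 0;
+}
+\end{verbatim}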
 % impacting the dark ground called reality
-Leaving the theory and entering the practice, brings some details that lessen this improvement by a bit. A few bytes are added through the need of storing the information contained in the tree. Also, like described in % todo add ref to k3 formats
-most formats, used for persisting \acs{DNA}, store more than just nucleotides and therefore require more characters. What compression ratios implementations of huffman coding provide, will be discussed in \ref{k5:results}.\\
+Leaving the theory and entering the practice brings some details that lessen this improvement a bit. A few bytes are added through the need to store the information contained in the tree. Also, as described in \ref{chap:file-formats}, most formats used for persisting \acs{DNA} store more than just nucleotides and therefore require more characters. What compression ratios implementations of Huffman coding provide will be discussed in \ref{k5:results}.\\
 
-\section{DEFLATE}
+\section{Implementations in Relevant Tools}
+This section should give the reader a quick overview of how a small variety of compression tools implement the described compression algorithms. 
+
+\subsection{\ac{GeCo}} % geco
+% geco.c: analyze data/open files, parse to determine fileformat, create alphabet
+
+% explain header files
+The header files that this tool includes in \texttt{geco.c} can be split into three categories: basic operations, custom operations and compression algorithms. 
+The basic operations include header files for general-purpose functions that can be found in almost any C or C++ project. The provided functionality includes operations for text output on the command line interface, memory management, random number generation and several numerical calculations.\\
+The custom operations happen to include general-purpose functions too, with the difference that they were written, altered or extended by \acs{GeCo}'s developer. The last category consists of several C files containing two arithmetic coding implementations: \textbf{first} \texttt{bitio.c} and \texttt{arith.c}, \textbf{second} \texttt{arith\_aux.c}.\\
+The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver and Radford Neal (the last one is only mentioned in \texttt{arith.c}). Comparing the two files, \texttt{bitio.c} has less code, shorter comments and many more non-functioning code sections. Overall, the likely conclusion is that \texttt{arith.c} is some kind of official release, whereas \texttt{bitio.c} serves as an experimental file for the developers to create proofs of concept. The described files adapt code from Armando J. Pinho, licensed by the University of Aveiro DETI/IEETA and written in 1999.\\
+The second implementation was also licensed by the University of Aveiro DETI/IEETA, but no author is mentioned. From interpreting the function names and considering the length of the function bodies, \texttt{arith\_aux.c} could serve as a wrapper for basic functions that are often used in arithmetic coding.\\
+Since the original versions of the files licensed by the University of Aveiro could not be found, there is no way to determine whether the files match their originals or whether changes have been made. This should be considered while following the static analysis.
+
+Following the function calls in all three files led to the conclusion that the most important function is \texttt{arithmetic\_encode}, defined in \texttt{arith.c}. In this function the actual arithmetic encoding is executed. It has no redirects to other files and only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only.
+% if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
+
+%useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \autocite{moffat_arith}.
+%\subsection{genie} % genie
+\subsection{Samtools} % samtools 
+\subsubsection{BAM}
+Compression in this format is done by an implementation called BGZF, which is a block compression on top of a widely used algorithm called DEFLATE.
+\label{k4:deflate}
+\paragraph{DEFLATE}
 % mix of huffman and LZ77
-The DEFLATE compression algorithm combines \ac{LZ77} and huffman coding. It is used in well known tools like gzip. 
+The DEFLATE compression algorithm combines \acs{LZ77} and Huffman coding. It is used in well-known tools like gzip. 
 Data is split into blocks. Each block stores a header consisting of three bits. A single block can be stored in one of three forms, each of which is represented by an identifier that is stored in the last two bits of the header. 
 \begin{itemize}
 	\item \texttt{00}		No compression.
 	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
 	\item \texttt{10}		Compressed with dynamic Huffman codes.
 \end{itemize}
-The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.
+The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \autocite{rfc1951}.
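+A minimal sketch of reading these header bits follows; note that, per RFC 1951, the bits of a block are packed starting at the least significant bit of each byte, so \texttt{BFINAL} is read first and the two \texttt{BTYPE} bits follow:
+\begin{verbatim}
+#include <stdio.h>
+
+/* Decode the three DEFLATE block-header bits from the first byte
+   of a block (assuming the block starts on a byte boundary). */
+int main(void)
+{
+    unsigned char first_byte = 0x05;    /* example value: 101 binary   */
+    int bfinal = first_byte & 1;        /* 1 -> this is the last block */
+    int btype  = (first_byte >> 1) & 3; /* 0 stored, 1 fixed Huffman,  */
+                                        /* 2 dynamic, 3 reserved       */
+    printf("BFINAL=%d BTYPE=%d\n", bfinal, btype);
+    return 0;
+}
+\end{verbatim}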
 % lz77 part
-As described in \ref{k4:lz77} a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.\\
-The \acs{LZ77} algorithm is executed before the huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.\\
+As described in \ref{k4:lz77}, a compression with \acs{LZ77} results in literals, a length for each repetition and pointers that are represented by the distance between the pointer and the first occurrence it points to.
+The \acs{LZ77} algorithm is executed before the Huffman algorithm. The further compression steps differ from the already described algorithms and are covered in the remainder of this section.
+
 % huffman part
-Besides header bits and a data block, two Huffman code trees are store. One encodes literals and lenghts and the other distances. They happen to be in a compact form. This archived by a addition of two rules on top of the rules described in \ref{k4:huff}: Codes of identical lengths are orderd lexicographically, directed by the characters they represent. And the simple rule: shorter codes precede longer codes.\\
+Besides the header bits and a data block, two Huffman code trees are stored: one encodes literals and lengths, the other distances. They happen to be in a compact form. This is achieved by adding two rules on top of the rules described in \ref{k4:huff}: codes of identical lengths are ordered lexicographically, directed by the characters they represent, and, as a simple second rule, shorter codes precede longer codes.
 To illustrate this with an example:
 For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which occurs more often than the other two characters, the codes would change to a representation like this:
 
@@ -220,13 +257,30 @@ For a text consisting out of \texttt{C} and \texttt{G}, following codes would be
 \end{footnotesize}
 \rmfamily
 
-Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefor the (in a numerical sense) smaller code is set to represent \texttt{C}.\\
-With this simple rules, the alphabet can be compressed too. Instead of storing codes itself, only the codelength stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a samller alphabet can make a relevant difference. 
+Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is set to represent \texttt{C}.
+With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \autocite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing many blocks of data, a smaller alphabet description can make a relevant difference.\\
 
 % example header, alphabet, data block?
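+To see DEFLATE at work in practice, the following illustrative C sketch uses zlib's \texttt{compress()} and \texttt{uncompress()} helpers, which wrap the DEFLATE stream in the zlib container (gzip and BGZF use the same algorithm with different framing), to round-trip a repetitive nucleotide string:
+\begin{verbatim}
+#include <stdio.h>
+#include <string.h>
+#include <zlib.h>   /* link with -lz */
+
+/* Compress a repetitive nucleotide string with DEFLATE (via the
+   zlib wrapper), decompress it again and check the round trip. */
+int main(void)
+{
+    const unsigned char src[] =
+        "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT";
+    unsigned char comp[128], back[128];
+    uLongf clen = sizeof comp, blen = sizeof back;
+
+    if (compress(comp, &clen, src, sizeof src) != Z_OK)
+        return 1;
+    if (uncompress(back, &blen, comp, clen) != Z_OK)
+        return 2;
+
+    printf("input: %zu bytes, deflated: %lu bytes, lossless: %s\n",
+           sizeof src, (unsigned long)clen,
+           (blen == sizeof src && memcmp(src, back, blen) == 0)
+               ? "yes" : "no");
+    return 0;
+}
+\end{verbatim}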
+BGZF extends this by creating a series of blocks, each of which cannot exceed a limit of 64 kilobytes. Each block contains a standard gzip file header, followed by compressed data.\\
 
-\section{Implementations in Relevant Tools}
-\subsection{} % geco
-\subsection{} % genie
-\subsection{} % samtools 
+\subsubsection{CRAM}
+The improvement of \acs{BAM} \autocite{cram-origin}, called \acs{CRAM}, also features a block structure \autocite{bam}. The whole file can be separated into four sections, stored in this order: a file definition, a CRAM header container, multiple data containers and a final CRAM EOF container.\\
+The file definition consists of 26 uncompressed bytes, storing formatting information and an identifier. The CRAM header contains meta information about the data containers and is optionally compressed with gzip. This container can also contain an uncompressed zero-padded section, reserved for \acs{SAM} header information \autocite{bam}. This saves time in case the compressed file is altered and its compression needs to be updated. The last container in a \acs{CRAM} file serves as an indicator that the \acs{EOF} is reached. Since information about the file and its structure is stored in addition, a maximum of 38 uncompressed bytes can be reached.\\
+A data container can be split into three sections. Of these sections, the one storing the actual sequence consists of blocks itself, displayed in \ref FIGURE as the bottom row.
+\begin{itemize}
+	\item Container Header.
+	\item Compression Header.
+	\item A variable amount of Slices.
+	\begin{itemize}
+		\item Slice Header.
+		\item Core Data Block.
+		\item A variable amount of External Data Blocks.
+	\end{itemize}
+\end{itemize}
+The Container Header stores information on how to decompress the data stored in the following block sections. The Compression Header contains information about what kind of data is stored and some encoding information for \acs{SAM}-specific flags. The actual data is stored in the Data Blocks, which consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: External, Huffman, and two other methods which happen to be either a form of Huffman coding or a shortened binary representation of integers. The External option allows the use of gzip; bzip2, which combines multiple coding methods including run-length encoding and Huffman coding; an encoding from the LZ family called LZMA; or rANS, an entropy coder that combines compression ratios close to arithmetic coding with the speed of Huffman coding.
+% possible encodings: 
+% external: no encoding or gzip, bzip2, lzma
+% huffman
+% Byte array coding -> huffman or external...
+% Beta coding -> binary representation
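+As a purely illustrative summary of this nesting (the type and field names are chosen for readability here and are not taken from the specification), the layout could be written down as follows:
+\begin{lstlisting}[language=Python]
+from dataclasses import dataclass, field
+from typing import List
+
+@dataclass
+class Slice:
+    slice_header: bytes = b""
+    core_data_block: bytes = b""
+    external_data_blocks: List[bytes] = field(default_factory=list)
+
+@dataclass
+class DataContainer:
+    container_header: bytes = b""     # how to decompress the following blocks
+    compression_header: bytes = b""   # what kind of data / SAM flag encodings
+    slices: List[Slice] = field(default_factory=list)
+
+@dataclass
+class CramFile:
+    file_definition: bytes = b""      # 26 uncompressed bytes
+    cram_header_container: bytes = b""
+    data_containers: List[DataContainer] = field(default_factory=list)
+    eof_container: bytes = b""        # at most 38 uncompressed bytes
+\end{lstlisting}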
 

+ 21 - 18
latex/tex/kapitel/k5_feasability.tex

@@ -26,21 +26,21 @@
 \chapter{Environment and Procedure to Determine the State-of-the-Art Efficiency and Compression Ratio of Relevant Tools}
 % goal define
 \label{k5:goals}
-Since improvements must be meassured, defining a baseline which would need to be beaten bevorhand is neccesary. Others have dealt with this task several times with common algorithms and tools, and published their results. But since the test case, that need to be build for this work, is rather uncommon in its compilation, the available data are not very usefull. Therefore new testdata must be created.\\
-The goal of this is, to determine a baseline for efficiendy and effectivity of state of the art tools, used to compress \ac{DNA}. This baseline is set by two important factors:
+Since improvements must be measured, it is necessary to define beforehand a baseline which would need to be beaten. Others have dealt with this task several times with common algorithms and tools, and published their results. But since the test case that needs to be built for this work is rather uncommon in its composition, the available data are not very useful. Therefore, new test data must be created.\\
+The goal of this is to determine a baseline for the efficiency and effectivity of state-of-the-art tools used to compress \ac{DNA}. This baseline is set by two important factors:
 
 \begin{itemize}
   \item Efficiency: \textbf{Duration} the Process had run for
   \item Effectivity: The difference in \textbf{Size} between input and compressed data
 \end{itemize}
 
-As a third point, the compliance that files were compressed losslessly should be verified. This is done by comparing the source file to a copy that got compressed and than decompressed again. If one of the two processes should operate lossy, a difference between the source file and the copy a difference in size should be recognizeable. 
+As a third point, it should be verified that files were compressed losslessly. This is done by comparing the source file to a copy that got compressed and then decompressed again. If one of the two processes operated lossily, a difference in size between the source file and the copy should be recognizable.
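+A minimal sketch of such a round trip, with gzip standing in only as a placeholder for whichever tool is under test and \texttt{source.fa} as a placeholder file name, could look like this:
+\begin{lstlisting}[language=bash]
+  \$ gzip -c source.fa > source.fa.gz        # compress a copy of the source
+  \$ gzip -d -c source.fa.gz > roundtrip.fa  # decompress it again
+  \$ cmp source.fa roundtrip.fa && echo lossless  # identical content implies identical size
+\end{lstlisting}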
 
 %environment, test setup, raw results
 \section{Server Specifications and Test Environment}
 To be able to recreate this setup in the future, the relevant specifications and the commands that revealed this information are listed in this section.\\
 
-Reading from /proc/cpuinfo reveals processor spezifications. Since most of the information displayed in the seven entrys is redundant, only the last entry is shown. Below are relevant specifications listed:
+Reading from /proc/cpuinfo reveals the processor specifications. Since most of the information displayed in the seven entries is redundant, only the last entry is shown. The relevant specifications are listed below:
 
 \noindent
 \begin{lstlisting}[language=bash]
@@ -133,23 +133,23 @@ For this paper relevant specifications are listed below:
 
 \section{Operating System and Additionally Installed Packages}
 To keep the testing environment in a consistent state, processes running in the background that are not project-specific should be avoided.
-Due to following circumstances, a current Linux distribution was choosen as a suiteable operating system:
+Due to the following circumstances, a current Linux distribution was chosen as a suitable operating system:
 \begin{itemize}
   \item{factors that interfere with a consistent efficiency value should be avoided}
   \item{packages, support, and user experience should be present to a reasonable amount}
 \end{itemize}
-Some backround processes will run while the compression analysis is done. This is owed to the demand of an increasingly complex operating system to execute complex programs. Considering that different tools will be exeuted in this environment, minimizing the backround processes would require building a custom operating system or configuring an existing one to fit this specific use case. The boundary set by the time limitation for this work rejects named alternatives. 
-%By comparing the values of explaied factors, a sweet spot can be determined:
-% todo: add preinstalled package/programm count and other specs
+Some background processes will run while the compression analysis is done. This is owed to the demands of an increasingly complex operating system that is needed to execute complex programs. Considering that different tools will be executed in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The time limitation of this work rules out both alternatives.
+%By comparing the values of explained factors, a sweet spot can be determined:
+% todo: add preinstalled package/program count and other specs
+\textbf{Debian GNU/Linux} version \textbf{11} was chosen; it features enough packages to run every tool without spending too much time on the setup.\\
-The graphical user interface and most other optional packages were ommited. The only additional package added in the installation process is the ssh server package. Further a list of packages required by the compression tools were installed. At last, some additional packages were installed for the purpose of simplifying work processes and increasing the safety of the environment.
+The graphical user interface and most other optional packages were omitted. The only additional package added during the installation process is the ssh server package. Furthermore, the packages required by the compression tools were installed. Lastly, some additional packages were installed for the purpose of simplifying work processes and increasing the security of the environment.
 \begin{itemize}
   \item{installation process: ssh-server}
   \item{tool requirements: git, libhts-dev, autoconf, automake, cmake, make, gcc, perl, zlib1g-dev, libbz2-dev, liblzma-dev, libcurl4-gnutls-dev, libssl-dev, libncurses5-dev, libomp-dev}
   \item{additional packages: ufw, rsync, screen, sudo} 
 \end{itemize}
 
-A complete list of installed packages as well as individual versions can be found in appendix. % todo appendix
+A complete list of installed packages as well as individual versions can be found in the appendix. % todo appendix
 
 %user@debian raw$\ cat /etc/os-release 
 %PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
@@ -163,12 +163,15 @@ A complete list of installed packages as well as individual versions can be foun
 %BUG_REPORT_URL="https://bugs.debian.org/"
 
 \section{Selection, Retrieval, and Preparation of Test Data}
-Following criteria is requiered for testdata to be appropriate:
+The following criteria are required for the test data to be appropriate:
 \begin{itemize}
-  \item{The testfile is in a format that all or at least most of the tools can work with.}
-  \item{The file is publicly available and free to use.}
+  \item{The test file is in a format that all or at least most of the tools can work with, meaning \acs{FASTA} or \acs{FASTq} files.}
+  \item{The file is publicly available and free to use (for research).}
 \end{itemize}
-Since there are multiple open \ac{FTP} servers which distribute a varety of files, finding a suiteable one is rather easy. The ensembl databse featured defined criteria, so the first suiteable were choosen: Homo\_sapiens.GRCh38.dna.chromosome. This sample includes over 20 chromosomes, whereby considering the filenames, one chromosome was contained in a single file. After retrieving and unpacking the files, write priviledges on them was withdrawn. So no tool could alter any file contents.\\
+Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The Ensembl database fulfilled the defined criteria, so the first available set, called Homo\_sapiens.GRCh38.dna.chromosome, was chosen \autocite{ftp-ensembl}. This sample includes over 20 chromosomes, whereby, judging by the filenames, one chromosome is contained in each file. After retrieving and unpacking the files, write privileges on them were withdrawn, so no tool could alter any file contents.
+A second, bigger set of test files was required to verify that the test results are not limited to small files. With a size of 1 gigabyte per file, it would hold over five times as much data as the first set.
+Finding this second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since the available servers \autocite{ftp-ensembl, ftp-ncbi, ftp-igsr} offer several thousand files, stored in varying, deep directory structures, mapping file size, file type and file path takes too much time and resources to be done in this work. This problem, combined with an easily triggered overflow in the samtools library, resulted in a set of several manually searched and tested files which lacks quantity. The variety of different species' \acs{DNA} in this set provides an additional, interesting factor.\\
+ 
 % todo make sure this needs to stay.
 \noindent The following tools and parameters were used in this process:
 \begin{lstlisting}[language=bash]
@@ -177,8 +180,8 @@ Since there are multiple open \ac{FTP} servers which distribute a varety of file
   \$ chmod -w ./*
 \end{lstlisting}
 
-The choosen tools are able to handle the \acs{FASTA} format. However Samtools must convert \acs{FASTA} files into their \acs{SAM} format bevor the file can be compressed. The compression will firstly lead to an output with \acs{BAM} format, from there it can be compressed further into a \acs{CRAM} file. For \acs{CRAM} compression, the time needed for each step, from converting to two compressions, is summed up and displayed as one. For the compression time into the \acs{BAM} format, just the conversion and the single compression time is summed up. The conversion from \acs{FASTA} to \acs{SAM} is not displayed in the results. This is due to the fact that this is no compression process, and therefor has no value to this work.\\
-Even though \acs{SAM} files are not compressed, there was a small but noticeable difference in size between the files in each format. Since \acs{FASTA} should store less information, by leaving out quality scores, this observation was counterintuitive. Comparing the first few lines showed two things: the header line were altered and newlines were removed. The alteration of the header line would result in just a few more bytes. To verify, no information was lost while converting, both files were temporarly stripped from metadata and formating, so the raw data of both files can be compared. Using \texttt{diff} showed no differences between the stored characters in each file. 
+The chosen tools are able to handle the \acs{FASTA} format. However, Samtools must convert \acs{FASTA} files into its \acs{SAM} format before a file can be compressed. The compression firstly leads to an output in \acs{BAM} format; from there, it can be compressed further into a \acs{CRAM} file. For \acs{CRAM} compression, the time needed for each step, from the conversion to the two compressions, is summed up and displayed as one value. For the compression time into the \acs{BAM} format, just the conversion and the single compression time are summed up. The conversion from \acs{FASTA} to \acs{SAM} is not displayed in the results, since it is no compression process and therefore has no value to this work.\\
+Even though \acs{SAM} files are not compressed, there can be a small but noticeable difference in size between the files in each format. Since \acs{FASTA} should store less information, by leaving out quality scores, this observation was counterintuitive. Comparing the first few lines showed two things: the header line was altered and newlines were removed. The alteration of the header line would result in just a few more bytes. To verify that no information was lost while converting, both files were temporarily stripped of metadata and formatting, so that the raw data of both files could be compared. Using \texttt{diff} showed no differences between the stored characters in each file.
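+One possible way to perform this comparison (with \texttt{file.fa} and \texttt{file.sam} as placeholder names, and assuming the sequence sits in the tenth, tab-separated column of the \acs{SAM} record) is sketched below:
+\begin{lstlisting}[language=bash]
+  \$ grep -v '^>' file.fa | tr -d '\n' > fa.raw                # drop FASTA header, join lines
+  \$ grep -v '^@' file.sam | cut -f 10 | tr -d '\n' > sam.raw  # drop SAM header, keep sequence column
+  \$ diff fa.raw sam.raw && echo identical
+\end{lstlisting}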
 % user@debian data$\ ls -l --block-size=M raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa 
 % -r--r--r-- 1 user user 242M Jun  4 10:49 raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa
 % user@debian data$\ ls -l --block-size=M samtools/files/Homo_sapiens.GRCh38.dna.chromosome.1.sam
@@ -190,7 +193,7 @@ Even though \acs{SAM} files are not compressed, there was a small but noticeable
 % remove newlines: tr -d '\n' 
 
 % convert just once. test for losslessness?
-% get testdata: wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
+% get test data: wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
 % unzip it: gzip -d ./*
 % withdraw write priv: chmod -w ./*
 
@@ -204,5 +207,5 @@ Even though \acs{SAM} files are not compressed, there was a small but noticeable
 
 % - can run recursively and threaded
 
-% - im falle von testdata: hetzer, dedizierter hardware, auf server compilen, specs aufschreiben -> 'lscpu' || 'cat /proc/cpuinfo'
+% - in the case of test data: hetzer, dedicated hardware, compile on the server, note down the specs -> 'lscpu' || 'cat /proc/cpuinfo'
 

+ 55 - 6
latex/tex/kapitel/k6_results.tex

@@ -1,12 +1,13 @@
 \chapter{Results and Discussion}
 The two tables \ref{t:effectivity}, \ref{t:efficiency} contain raw measurement values for the two goals described in \ref{k5:goals}. The first table visualizes how long each compression procedure took, in milliseconds. The second one contains file sizes in bytes. Each row contains information about one of the \texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa} files. To improve readability, the filename in all tables was replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder x with the number following \texttt{File}.\\
 
-The units milliseconds and bytes store a high persicion for measurements. Unfortunally they are harder to read and compare to the human eye. Therefore, starting with comparing sizes, \ref{t:sizepercent} contian the percentual size of each file in relation to the respective source file. The compression with \acs{GeCo} with the file Homo\_sapiens.GRCh38.dna.chromosome.11.fa resulted in a file that were only 17.6\% as big.\\
+\section{Interpretation of Results}
+The units milliseconds and bytes provide high precision for the measurements. Unfortunately, they are harder to read and compare for the human eye. Therefore, starting with the comparison of sizes, \ref{t:sizepercent} contains each file size as a percentage of the respective source file. The compression of the file Homo\_sapiens.GRCh38.dna.chromosome.11.fa with \acs{GeCo}, for example, resulted in a file that was only 17.6\% as big.\\
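+The percentages are calculated as the compressed size relative to the source size, $\mathrm{size}_{\%} = 100 \cdot \mathrm{size}_{\mathrm{compressed}} / \mathrm{size}_{\mathrm{source}}$, so a value of 17.6\% means the compressed file occupies 17.6\% of the original space.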
 
 \label{t:sizepercent}
 \sffamily
 \begin{footnotesize}
-  \begin{longtable}[r]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
     \caption[Compression Effectivity]                       % caption for the list of tables
         {File sizes in different compression formats in \textbf{percent}} % caption for the table itself
         \\
@@ -43,18 +44,16 @@ The units milliseconds and bytes store a high persicion for measurements. Unfort
 \rmfamily
 Overall, Samtools \acs{BAM} resulted in a 71.76\% size reduction; the \acs{CRAM} method improved this by roughly 2.5\%. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4\% comes with a comparatively great sacrifice in time.\\
 
-
 \label{t:time}
 \sffamily
 \begin{footnotesize}
-  \begin{longtable}[r]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
     \caption[Compression Efficiency]                       % caption for the list of tables
         {Compression duration in seconds} % caption for the table itself
         \\
     \toprule
      \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
     \midrule
-			compress time for geco, bam and cram in seconds
 			File 1 & 23.5& 3.786& 16.926\\
 			File 2 & 24.65& 3.784& 17.043\\
 			File 3 & 2.016& 3.123& 13.999\\
@@ -84,4 +83,54 @@ Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} m
 \rmfamily
 
 As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57s. That is a little over 33s, or 78\%, longer than the average runtime of Samtools for compressing into the \acs{CRAM} format.\\
-Before interpreting this data further, a quick view into development processes: \acs{GeCo} stopped development in the year 2016 while Samtools is being developed since 2015, to this day, with over 70 people contributing. Considering the data with that in mind, an improvement in \acs{GeCo}s efficiency, would be a start to equalize the great gap in the compression duration.\\
+Since \acs{CRAM} requires a file in \acs{BAM} format, the \acs{CRAM} column is calculated by adding the time needed to compress into \acs{BAM} and the time needed for the further compression into \acs{CRAM}.
+While the \acs{SAM} format is required for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it features no compression in itself. However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might be counterintuitive, since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. It can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in just one line, so converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
+% (hi)storytime
+Before interpreting this data further, a quick view into the development processes: \acs{GeCo}'s development stopped in the year 2016, while Samtools has been developed since 2015 to this day, with over 70 people contributing.\\
+% interpret bit files and compare
+
+\section{View on Possible Improvements}
+% todo explain new findings
+S. Petukhov described new findings about the distribution of nucleotides. The probability of one nucleotide in a sequence of sufficient length reveals information about its direct neighbours. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} can be determined:
+\[ \%C \approx \sum\%CN \approx \sum\%NC \approx \sum\%CNN \approx \sum\%NCN \approx \sum\%NNC \approx \sum\%CNNN \approx \sum\%NCNN \approx \sum\%NNCN \approx \sum\%NNNC \]
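+As a purely illustrative check of the first of these approximations, the probability of \texttt{C} can be compared against the summed probabilities of all duplets starting with \texttt{C} on an arbitrary sequence:
+\begin{lstlisting}[language=Python]
+from collections import Counter
+
+def check_relation(sequence, symbol="C"):
+    p_single = sequence.count(symbol) / len(sequence)
+    duplets = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
+    counts = Counter(duplets)
+    p_duplets = sum(c for d, c in counts.items() if d[0] == symbol) / len(duplets)
+    return p_single, p_duplets  # approximately equal for long sequences
+\end{lstlisting}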
+
+% begin optimization 
+Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good start to close the great gap in the compression duration. Combined with a tool that is developed with today's standards, there is a possibility that even greater improvements could be achieved.\\
+% simple theoretical approach
+What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probability of each symbol in the alphabet. The simplest way to do that is to parse the whole sequence from start to end and increase a counter for each nucleotide that is read.
+With the new findings discovered by S. Petukhov in consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine these probabilities. A possible approach would be that the probability of one nucleotide is used to determine the probabilities of the other nucleotides by a calculation, rather than by the process of counting each one.
+This approach raises a few questions that need to be answered in order to plan an implementation:
+\begin{itemize}
+	\item How many probabilities are needed to calculate the others?
+	\item Is there space for improvement in the parsing/counting process?
+	%\item Is there space for visible improvements, when only counting one nucleotide?
+	\item How can the variation between probabilities be determined?
+\end{itemize}
+
+The second point must be asked because the improvement of counting only one nucleotide, in comparison to counting three, would be too little to be called relevant.
+%todo compare time needed: to store a variable <-> parsing the sequence
+To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole. Therefore, a more detailed view of the operations must be created:
+Considering a single-threaded loop whose purpose is to count every nucleotide in a sequence, the counting process can be split into several operations, sketched by the following listing, a straightforward single-threaded rendering of the counting loop in Python.
+
+%todo use GeCo arith function with bigO
+\begin{lstlisting}[language=Python]
+def count_nucleotides(sequence, counts):
+    for next_nucleotide in sequence:   # outer loop over the whole sequence
+        for symbol in counts:          # inner loop over the alphabet
+            if symbol == next_nucleotide:
+                counts[symbol] += 1
+    return counts
+\end{lstlisting}
+
+This loop iterates over the whole sequence, counting each nucleotide. In line three, an inner loop can be found, which iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, for any sequence of length $n$ and an alphabet of size $m$, would be decreased from O($n \cdot m$) to O($n \cdot 1$), or simply O($n$) \autocite{big-o}, which is clearly an improvement in complexity and therefore also in runtime.\\
+The runtime of the calculations for the other symbols' probabilities must be considered as well and compared against the nested loop to be certain that the overall runtime was improved.
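+A minimal sketch of such a simplified pass is shown below; it assumes, purely for illustration, that counting a single reference nucleotide suffices and that the remaining probabilities would come from the calculation step discussed above, which is left out here:
+\begin{lstlisting}[language=Python]
+def count_reference_nucleotide(sequence, reference="C"):
+    # Single pass without an inner loop over the alphabet: O(n).
+    count = 0
+    for next_nucleotide in sequence:
+        if next_nucleotide == reference:
+            count += 1
+    return count
+\end{lstlisting}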
+% more realistic view on parsing todo need cites
+In practice, smarter ways are used to determine probabilities, such as splitting the sequence into multiple parts and parsing each subsequence asynchronously. The results can then either be summed up for global probabilities or used individually on each associated subsequence. Either way, the presented improvement approach should be applicable to both parsing methods.\\
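+A sketch of this chunked variant, here using Python's standard \texttt{concurrent.futures} module and summing the partial counts into global counts, could look like this:
+\begin{lstlisting}[language=Python]
+from collections import Counter
+from concurrent.futures import ProcessPoolExecutor
+
+def count_chunk(chunk):
+    return Counter(chunk)  # nucleotide counts for one subsequence
+
+def count_parallel(sequence, parts=4):
+    size = len(sequence) // parts + 1
+    chunks = [sequence[i:i + size] for i in range(0, len(sequence), size)]
+    with ProcessPoolExecutor() as pool:   # parse the subsequences asynchronously
+        partial = list(pool.map(count_chunk, chunks))
+    return sum(partial, Counter())        # summed up for global probabilities
+\end{lstlisting}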
+
+
+% how is data interpreted
+% why did the tools result in this, what can we learn
+% improvements
+% - goal: less time to compress
+% 	- approach: optimize probability determination
+% 	-> how?

+ 162 - 1
latex/tex/literatur.bib

@@ -109,7 +109,7 @@
   url   = {https://github.com/samtools/hts-specs.},
 }
 
-@Online{bed,
+@Article{bed,
   author = {Sanger Institute, Genome Research Limited},
   date   = {2022-10-20},
   title  = {BED Browser Extensible Data},
@@ -248,4 +248,165 @@
   year    = {2019},
 }
 
+@Article{big-o,
+  author = {Mala, Firdous and Ali, Rouf},
+  title  = {The Big-O of Mathematics and Computer Science},
+  doi    = {10.26855/jamc.2022.03.001},
+  pages  = {1-3},
+  volume = {6},
+  month  = {01},
+  year   = {2022},
+}
+
+@Article{sam12,
+  author       = {Petr Danecek and James K Bonfield and Jennifer Liddle and John Marshall and Valeriu Ohan and Martin O Pollard and Andrew Whitwham and Thomas Keane and Shane A McCarthy and Robert M Davies and Heng Li},
+  date         = {2021-01},
+  journaltitle = {{GigaScience}},
+  title        = {Twelve years of {SAMtools} and {BCFtools}},
+  doi          = {10.1093/gigascience/giab008},
+  number       = {2},
+  volume       = {10},
+  publisher    = {Oxford University Press ({OUP})},
+}
+
+@Article{cram-origin,
+  author       = {Markus Hsi-Yang Fritz and Rasko Leinonen and Guy Cochrane and Ewan Birney},
+  date         = {2011-01},
+  journaltitle = {Genome Research},
+  title        = {Efficient storage of high throughput {DNA} sequencing data using reference-based compression},
+  doi          = {10.1101/gr.114819.110},
+  number       = {5},
+  pages        = {734--740},
+  volume       = {21},
+  publisher    = {Cold Spring Harbor Laboratory},
+}
+
+@Online{illufastq,
+  author = {Illumina},
+  date   = {2022-11-17},
+  title  = {Illumina FASTq file structure explained},
+  url    = {https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html},
+}
+
+@TechReport{rfcansi,
+  author       = {K. Simonsen},
+  title        = {Character Mnemonics and Character Sets},
+  number       = {1345},
+  type         = {RFC},
+  howpublished = {Internet Requests for Comments},
+  issn         = {2070-1721},
+  month        = {June},
+  year         = {1992},
+}
+
+@Article{witten87,
+  author       = {Ian H. Witten and Radford M. Neal and John G. Cleary},
+  date         = {1987-06},
+  journaltitle = {Communications of the {ACM}},
+  title        = {Arithmetic coding for data compression},
+  doi          = {10.1145/214762.214771},
+  issn         = {0001-0782},
+  number       = {6},
+  pages        = {520–540},
+  url          = {https://doi.org/10.1145/214762.214771},
+  volume       = {30},
+  publisher    = {Association for Computing Machinery ({ACM})},
+}
+
+@InProceedings{geco,
+  author    = {Diogo Pratas and Armando J. Pinho and Paulo J. S. G. Ferreira},
+  booktitle = {2016 Data Compression Conference ({DCC})},
+  date      = {2016-03},
+  title     = {Efficient Compression of Genomic Sequences},
+  doi       = {10.1109/DCC.2016.60},
+  publisher = {{IEEE}},
+}
+
+@Article{survey,
+  author       = {Morteza Hosseini and Diogo Pratas and Armando Pinho},
+  date         = {2016-10},
+  journaltitle = {Information},
+  title        = {A Survey on Data Compression Methods for Biological Sequences},
+  doi          = {10.3390/info7040056},
+  number       = {4},
+  pages        = {56},
+  volume       = {7},
+  publisher    = {{MDPI} {AG}},
+}
+
+@Article{vertical,
+  author       = {Kelvin V. Kredens and Juliano V. Martins and Osmar B. Dordal and Mauri Ferrandin and Roberto H. Herai and Edson E. Scalabrin and Br{\'{a}}ulio C. {\'{A}}vila},
+  date         = {2020-05},
+  journaltitle = {{PLOS} {ONE}},
+  title        = {Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review},
+  doi          = {10.1371/journal.pone.0232942},
+  editor       = {Rashid Mehmood},
+  number       = {5},
+  pages        = {e0232942},
+  volume       = {15},
+  publisher    = {Public Library of Science ({PLoS})},
+}
+
+@Online{twobit,
+  date   = {2022-09-22},
+  editor = {UCSC University of California Santa Cruz},
+  title  = {TwoBit File Format},
+  url    = {https://genome-source.gi.ucsc.edu/gitlist/kent.git/raw/master/src/inc/twoBit.h},
+}
+
+@TechReport{isompeg,
+  author      = {{ISO Central Secretary}},
+  institution = {International Organization for Standardization},
+  title       = {MPEG-G},
+  language    = {en},
+  number      = {ISO/IEC 23092:2019},
+  type        = {Standard},
+  url         = {https://www.iso.org/standard/23092.html},
+  year        = {2019},
+}
+
+@Article{mpeg,
+  author    = {Claudio Alberti and Tom Paridaens and Jan Voges and Daniel Naro and Junaid J. Ahmad and Massimo Ravasi and Daniele Renzi and Giorgio Zoia and Paolo Ribeca and Idoia Ochoa and Marco Mattavelli and Jaime Delgado and Mikel Hernaez},
+  date      = {2018-09},
+  title     = {An introduction to {MPEG}-G, the new {ISO} standard for genomic information representation},
+  doi       = {10.1101/426353},
+  publisher = {Cold Spring Harbor Laboratory},
+}
+
+@Article{haplo,
+  author       = {Wai Yee Low and Rick Tearle and Ruijie Liu and Sergey Koren and Arang Rhie and Derek M. Bickhart and Benjamin D. Rosen and Zev N. Kronenberg and Sarah B. Kingan and Elizabeth Tseng and Fran{\c{c}}oise Thibaud-Nissen and Fergal J. Martin and Konstantinos Billis and Jay Ghurye and Alex R. Hastie and Joyce Lee and Andy W. C. Pang and Michael P. Heaton and Adam M. Phillippy and Stefan Hiendleder and Timothy P. L. Smith and John L. Williams},
+  date         = {2020-04},
+  journaltitle = {Nature Communications},
+  title        = {Haplotype-resolved genomes provide insights into structural variation and gene content in Angus and Brahman cattle},
+  doi          = {10.1038/s41467-020-15848-y},
+  number       = {1},
+  volume       = {11},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
+@Online{ftp-igsr,
+  date  = {2022-11-10},
+  title = {IGSR: The International Genome Sample Resource},
+  url   = {https://ftp.1000genomes.ebi.ac.uk},
+}
+
+@Online{ftp-ncbi,
+  date  = {2022-11-01},
+  title = {NCBI National Center for Biotechnology Information},
+  url   = {https://ftp.ncbi.nlm.nih.gov/genomes/},
+}
+
+@Online{ftp-ensembl,
+  date  = {2022-10-15},
+  title = {ENSEMBL Rapid Release},
+  url   = {https://ftp.ensembl.org},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 1 - 1
latex/tex/preambel.tex

@@ -76,7 +76,7 @@
                                   %      inline: citation in parentheses (\parencite)
                                   %      footnote: citation in footnotes (\footcite)
                                   %      plain: citation directly without parentheses (\cite)
-  style=authoryear,               % Legt den Stil für die Zitate fest
+  style=ieee,		              % Sets the citation style
                                   %      ieee: citations as numbers [1]
                                   %      alphabetic: citations as abbreviation and year [Ein05]
                                   %      authoryear: citations as author and year [Einstein (1905)]