
started on result description. Finished raw table data, needs some beautifying. Started to extend algorithms, a bit

3 years ago
parent
commit
248ef21774

+ 2 - 2
latex/tex/kapitel/abkuerzungen.tex

@@ -12,10 +12,10 @@
   \acro{DNA}{Deoxyribonucleic Acid}
   \acro{EOF}{End of File}
   \acro{FASTA}{File Format for Storing Genomic Data}
-  \acro{FASTQ}{File Format Based on FASTA}
+  \acro{FASTq}{File Format Based on FASTA}
  \acro{FTP}{File Transfer Protocol}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
-  \acro{GECO}{Genome Compressor}
+  \acro{GeCo}{Genome Compressor}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
   \acro{LZ77}{Lempel Ziv 1977}
   \acro{LZ78}{Lempel Ziv 1978}

+ 11 - 17
latex/tex/kapitel/k4_algorithms.tex

@@ -92,6 +92,10 @@ This means the interval start of the character is noted, and its interval is split
 To encode in binary, the binary floating-point representation of a number inside the interval of the last character is calculated, using a process similar to the one described above, called subdividing.
 % the subdividing is finite because processors bottleneck floating-point precision
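A minimal sketch of this subdividing in Python, assuming a fixed four-letter distribution (the probabilities are purely illustrative and not taken from any of the discussed tools):
\begin{lstlisting}[language=Python]
# Sketch only: real coders use integer arithmetic to avoid the
# floating-point precision bottleneck mentioned above.
def arithmetic_encode(sequence, probabilities):
    low, high = 0.0, 1.0
    for symbol in sequence:
        span = high - low
        cumulative = 0.0
        for sym, p in probabilities.items():
            if sym == symbol:
                # narrow the interval to this symbol's sub-interval
                high = low + span * (cumulative + p)
                low = low + span * cumulative
                break
            cumulative += p
    # any number inside the final interval identifies the sequence
    return (low + high) / 2

probs = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
print(arithmetic_encode("ACGT", probs))
\end{lstlisting}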
 
+ % (genomic squeeze <- official | unofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
+\subsection{\ac{LZ77}}
 \ac{LZ77} basically works by removing repetitions of a string or substring and replacing them with a reference: the information where to find the earlier occurrence and how long the match is. Such a reference is typically stored in two bytes, whereby more than one byte can be used to point to the earlier occurrence, because usually less than one byte is needed to store the length.
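A simplified sketch of this idea in Python (not the exact encoder of any specific tool), emitting \texttt{(offset, length, next character)} triples:
\begin{lstlisting}[language=Python]
# Sketch only: window management and the compact two-byte
# packing described above are omitted.
def lz77_encode(data, window=255):
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # overlapping matches are allowed, as in real LZ77
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

# -> [(0, 0, 'A'), (0, 0, 'C'), (2, 4, 'G'), (0, 0, 'T')]
print(lz77_encode("ACACACGT"))
\end{lstlisting}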
+
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 The well-known Huffman coding is used in several tools for genome compression. This subsection gives the reader a general impression of how this algorithm works, without going into detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree, to which a few simple rules apply:
@@ -107,26 +111,16 @@ The well-known Huffman coding is used in several tools for genome compression.
 The process of compressing starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree in which the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, i.e. the one that occurs most often in the input data.
 Following one path results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in ASCII would be \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information, a weight is assigned to each character, given by its number of occurrences. With a corresponding tree created from these weights, the binary representation of each symbol would change to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
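The tree construction can be sketched in a few lines of Python; for the distribution above, it reproduces exactly the codes \texttt{A -> 0, C -> 11, T -> 100, G -> 101} (with equal weights, the left/right labelling could differ):
\begin{lstlisting}[language=Python]
import heapq

def huffman_codes(counts):
    # one heap entry per symbol: (weight, group of symbols)
    heap = [(weight, [sym]) for sym, weight in counts.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in counts}
    while len(heap) > 1:
        w1, group1 = heapq.heappop(heap)  # lowest weight
        w2, group2 = heapq.heappop(heap)  # second lowest
        # each merge adds one bit to every code below the new node
        for sym in group1:
            codes[sym] = "0" + codes[sym]
        for sym in group2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (w1 + w2, group1 + group2))
    return codes

# -> {'A': '0', 'C': '11', 'G': '101', 'T': '100'}
print(huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2}))
\end{lstlisting}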
 
+\subsection{DEFLATE}
+% mix of huffman and lz77
+The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. It is used in well-known tools like gzip.
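Since DEFLATE is available in standard libraries, its effect is easy to try; a small Python example (the input string is made up for illustration):
\begin{lstlisting}[language=Python]
import zlib

# zlib implements DEFLATE (LZ77 + Huffman coding); gzip wraps
# the same stream in a small header plus checksum.
sequence = ("ACGT" * 64 + "AACCGGTT" * 32).encode()
compressed = zlib.compress(sequence, level=9)
print(len(sequence), "->", len(compressed), "bytes")
\end{lstlisting}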
+
+
 \subsubsection{misc}
 
 %check if (small) text coding is done with this:
-Arithmetic Asymmetric numeral systems 
-Golomb 
-Huffman 
-Adaptive 
-Canonical 
-Modified 
-Range 
-Shannon 
-Shannon–Fano 
-Shannon–Fano–Elias 
-Tunstall 
-Unary 
-Universal 
-Exp-Golomb 
-Fibonacci 
-Gamma 
-Levenshtein
+Arithmetic / Asymmetric numeral systems ?
+Modified -> used in \acs{CRAM}
 
 
 \section{Implementations in Relevant Tools}

+ 10 - 1
latex/tex/kapitel/k5_feasability.tex

@@ -25,6 +25,7 @@
 %\chapter{Analysis for Possible Compression Improvements}
 \chapter{Environment and Procedure to Determine the State-of-the-Art Efficiency and Compression Ratio of Relevant Tools}
 % goal define
+\label{k5:goals}
 Since improvements must be measured, it is necessary to define beforehand a baseline that would need to be beaten. Others have dealt with this task several times with common algorithms and tools, and published their results. But since the test case that needs to be built for this work is rather uncommon in its composition, the available data are not very useful. Therefore, new test data must be created.\\
 The goal is to determine a baseline for the efficiency and effectivity of state-of-the-art tools used to compress \ac{DNA}. This baseline is set by two important factors:
 
@@ -168,6 +169,7 @@ The following criteria are required for test data to be appropriate:
   \item{The file is publicly available and free to use.}
 \end{itemize}
 Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable one is rather easy. The Ensembl database matched the defined criteria, so the first suitable files were chosen: Homo\_sapiens.GRCh38.dna.chromosome. This sample includes over 20 chromosomes, whereby, judging by the filenames, one chromosome was contained in a single file. After retrieving and unpacking the files, write privileges on them were withdrawn, so no tool could alter any file contents.\\
+% todo make sure this needs to stay.
 \noindent The following tools and parameters were used in this process:
 \begin{lstlisting}[language=bash]
  $ wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
@@ -175,7 +177,14 @@ Since there are multiple open \ac{FTP} servers which distribute a variety of files,
  $ chmod -w ./*
 \end{lstlisting}
 
-The choosen tools are able to handle the \ac{FASTA} format. However some, like samtools, require to convert \ac{FASTA} into another format like \ac{SAM}.\\ Simply comparing the size is not sufficient, therefore both files are temporarly stripped from metadata and formating, so the raw data of both files can be compared.
+The chosen tools are able to handle the \acs{FASTA} format. However, Samtools must convert \acs{FASTA} files into its \acs{SAM} format before a file can be compressed. The compression first leads to an output in the \acs{BAM} format; from there, it can be compressed further into a \acs{CRAM} file. For \acs{CRAM} compression, the time needed for each step, from the conversion to the two compressions, is summed up and displayed as one value. For the compression time into the \acs{BAM} format, only the conversion and the single compression time are summed up. The conversion from \acs{FASTA} to \acs{SAM} itself is not displayed in the results, since it is not a compression process and therefore has no value to this work.\\
+Even though \acs{SAM} files are not compressed, there was a small but noticeable difference in size between the files in each format. Since \acs{FASTA} should store less information, by leaving out quality scores, this observation was counterintuitive. Comparing the first few lines showed two things: the header line was altered and newlines were removed. The alteration of the header line would account for just a few more bytes. To verify that no information was lost while converting, both files were temporarily stripped of metadata and formatting, so the raw data of both files could be compared. Using \texttt{diff} showed no differences between the stored characters in each file.
% user@debian data$ ls -l --block-size=M raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa
% -r--r--r-- 1 user user 242M Jun  4 10:49 raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa
% user@debian data$ ls -l --block-size=M samtools/files/Homo_sapiens.GRCh38.dna.chromosome.1.sam
% -rw-r--r-- 1 user user 238M Nov  2 14:32 samtools/files/Homo_sapiens.GRCh38.dna.chromosome.1.sam
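A sketch of this verification in Python, in the spirit of the \texttt{grep}/\texttt{tr} commands noted below (the file paths and the one-record-per-line \acs{SAM} layout are assumptions):
\begin{lstlisting}[language=Python]
# Strip headers and formatting, keep only the raw bases,
# then compare -- mirrors the grep/tr/diff approach.
def raw_bases(path):
    chunks = []
    with open(path) as f:
        for line in f:
            if line.startswith((">", "@")):
                continue  # FASTA or SAM header line
            fields = line.rstrip("\n").split("\t")
            # SAM stores the sequence in column 10; a FASTA
            # line is sequence data as a whole
            chunks.append(fields[9] if len(fields) >= 10 else fields[0])
    return "".join(chunks)

fa = raw_bases("raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa")
sam = raw_bases("samtools/files/Homo_sapiens.GRCh38.dna.chromosome.1.sam")
print("identical" if fa == sam else "differs")
\end{lstlisting}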
+
+
 
 % remove metadata: grep -E 'A|C|G|N' <sourcefile> > <destfile>
 % remove newlines: tr -d '\n' 

+ 37 - 66
latex/tex/kapitel/k6_results.tex

@@ -1,82 +1,42 @@
-% raw data and charts
-% differences in used algos/ algos in tools
-% optimization approach
-% further research focus
-% (how optimization would be recognizable in testdata)
-
 \chapter{Results and Discussion}
 
-\begin{table}[ht]
-\caption{Multi-column table}
-\begin{center}
-\begin{tabular}{cc}
-    \hline
-    \multicolumn{2}{c}{Multi-column}\\
-    X&X\\
-    \hline
-\end{tabular}
-\end{center}
-\label{tab:multicol}
-\end{table}
-
-
-\begin{table}[ht]
-\caption{Multi-column table}
-\begin{center}
-\begin{tabular}{ |p{3cm}||p{3cm}|p{3cm}|p{3cm}| }
- \hline
- \multicolumn{4}{|c|}{ratio}\\
- \hline
- h: tool v: taks&GeCo&samtools to BAM& samtools to CRAM\\
- \hline
-  method/taks& geco &sam -> bam &sam -> cram\\
-%  conversion& - &- & \\
-%  compression in ms& & & \\
-%  compression ratio& & & \\
- \hline
-\end{tabular}
-\end{center}
-\label{tab:multicol}
-\end{table}
-
 
-\begin{tabular}{ |p{3cm}||p{3cm}|p{3cm}|p{3cm}|  }
+\begin{tabular}{ |p{2cm}||p{3cm}|p{3.5cm}|p{3.5cm}|  }
  \hline
- \multicolumn{4}{|c|}{Compression time} \\
+ \multicolumn{4}{|c|}{Compression time in milliseconds} \\
  \hline
-   & \acs{GECO}& Samtools&\\
+   & \acs{GeCo}& Samtools \acs{BAM}& Samtools \acs{CRAM}\\
  \hline
- % method/taks& geco &sam -> bam &sam -> cram\\
- File 1 & 235005& 15178&  \\
- File 2 & 246503& 15211&  \\
- File 3 & 20169& 12526&  \\
- File 4 & 194081& 11986&  \\
- File 5 & 183878& 11436&  \\
- File 6 & 173646& 10738&  \\
- File 7 & 159999& 9995&  \\
- File 8 & 148288& 9142&  \\
- File 9 & 12304& 8276&  \\
- File 10 & 134937& 8460&  \\
- File 11 & 136299& 8508&  \\
- File 12 & 134932& 8467&  \\
- File 13 & 999022& 6770&  \\
- File 14 & 924753& 6309&  \\
- File 15 & 852555& 5959&  \\
- File 16 & 827651& 5481&  \\
- File 17 & 820814& 5151&  \\
- File 18 & 798429& 5012&  \\
- File 19 & 586058& 3662&  \\
- File 20 & 645884& 4025&  \\
- File 21 & 411984& 2783&  \\
+   File 1 & 235005& 3786& 16926\\
+   File 2 & 246503& 3784& 17043\\
+   File 3 & 20169& 3123& 13999\\
+   File 4 & 194081& 3011& 13445\\
+   File 5 & 183878& 2862& 12802\\
+   File 6 & 173646& 2685& 12015\\
+   File 7 & 159999& 2503& 11198\\
+   File 8 & 148288& 2286& 10244\\
+   File 9 & 12304& 2078& 9210\\
+   File 10 & 134937& 2127& 9461\\
+   File 11 & 136299& 2132& 9508\\
+   File 12 & 134932& 2115& 9456\\
+   File 13 & 999022& 1695& 7533\\
+   File 14 & 924753& 1592& 7011\\
+   File 15 & 852555& 1507& 6598\\
+   File 16 & 827651& 1390& 6089\\
+   File 17 & 820814& 1306& 5791\\
+   File 18 & 798429& 1277& 5603\\
+   File 19 & 586058& 960& 4106\\
+   File 20 & 645884& 1026& 4507\\
+   File 21 & 411984& 721& 3096\\
  \hline
 \end{tabular}
 
 
 \begin{tabular}{ |p{3cm}||p{3cm}|p{3cm}|p{3cm}|  }
  \hline
- \multicolumn{4}{|c|}{File sizes} \\
+ \multicolumn{4}{|c|}{File sizes in bytes} \\
  \hline
-   & Source file& \acs{GECO}& Samtools CRAM\\
+   & Source file& \acs{GeCo}& Samtools \acs{CRAM}\\
  \hline
   File 1& 253105752& 46364770& 55769827\\
   File 2& 136027438& 27411806& 32238052\\
@@ -101,3 +61,14 @@
   File 21& 147557670& 23932541& 29459829\\
  \hline
 \end{tabular}
+
+% raw data and charts
+% differences in used algos/ algos in tools
+% optimization approach
+% further research focus
+% (how optimization would be recognizable in testdata)
+
+% todo ms to minutes and bytes to mb. Those tables move to the appendix
+The two tables above contain the rather raw measurement values for the two goals described in \ref{k5:goals}. The first table shows how long each compression procedure took. Each row contains information about one of the \texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa} files. To improve readability, the filenames were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder x with the number following \texttt{File}.\\
+
+While \acs{GeCo} takes considerably more time to compress, an increase in effectivity, meaning a greater reduction of the file size, can be recognized.\\
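As an illustration, the compression ratio (source size divided by compressed size) for File 1 follows directly from the values in the second table:
\begin{lstlisting}[language=Python]
# Values for File 1, taken from the file size table above.
source, geco, cram = 253105752, 46364770, 55769827
print(f"GeCo ratio:          {source / geco:.2f}")  # 5.46
print(f"Samtools CRAM ratio: {source / cram:.2f}")  # 4.54
\end{lstlisting}
\acs{GeCo} reaches a ratio of roughly 5.5 on this file, while the \acs{CRAM} output stays around 4.5.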