\chapter{Results and Discussion}
Tables \ref{a6:compr-size} and \ref{a6:compr-time} contain the raw measurement values for the two goals described in \ref{k5:goals}. Table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds, while table \ref{a6:compr-size} contains the resulting file sizes in bytes. In these tables, as well as in the other tables associated with the tests in the scope of this work, a naming scheme is used to improve readability. The file names were replaced by \texttt{File} followed by two numbers separated by a period. For the first test set, the prefix \texttt{1.} was used; the second set is marked with a \texttt{2.} prefix. For example, the fourth file of each test set is named \texttt{File 1.4} and \texttt{File 2.4}, respectively. The associated source file for the first set is \texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}. Since the source files of the second set are not named as consistently as those of the first one, a third column was added to \ref{k6:set2size}, mapping each table ID to its source file name.\\
The files contained in each test set, as well as their sizes, can be found in tables \ref{k6:set1size} and \ref{k6:set2size}. The first test set contained a total of 2.8 \acs{GB}, unevenly spread over 21 files, while the second test set contained 7 \acs{GB} in total, spread over seven files.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.4\textwidth} p{.4\textwidth}}
\caption[First Test Set Files and their Sizes in MB] % caption for the list of tables
{Files contained in the First Test Set and their Sizes in \acs{MB}} % caption for the table itself
\label{k6:set1size}\\
\toprule
\textbf{ID} & \textbf{Size in \acs{MB}} \\
\midrule
File 1.1& 241.38\\
File 1.2& 234.823\\
File 1.3& 192.261\\
File 1.4& 184.426\\
File 1.5& 176.014\\
File 1.6& 165.608\\
File 1.7& 154.497\\
File 1.8& 140.722\\
File 1.9& 134.183\\
File 1.10& 129.726\\
File 1.11& 130.976\\
File 1.12& 129.22\\
File 1.13& 110.884\\
File 1.14& 103.786\\
File 1.15& 98.888\\
File 1.16& 87.589\\
File 1.17& 80.724\\
File 1.18& 77.927\\
File 1.19& 56.834\\
File 1.20& 62.483\\
File 1.21& 45.289\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.4\textwidth}}
\caption[Second Test Set Files and their Sizes in MB] % caption for the list of tables
{Files contained in the Second Test Set, their Sizes in \acs{MB} and Source File Names} % caption for the table itself
\label{k6:set2size}\\
\toprule
\textbf{ID} & \textbf{Size in \acs{MB}} & \textbf{Source File Name}\\
\midrule
File 2.1& 1188.976& SRR002905.recal.fastq\\
File 2.2& 1203.314& SRR002906.recal.fastq\\
File 2.3& 627.467& SRR002815.recal.fastq\\
File 2.4& 676.0& SRR002816.recal.fastq\\
File 2.5& 1066.431& SRR002817.recal.fastq\\
File 2.6& 1071.095& SRR002818.recal.fastq\\
File 2.7& 1240.564& SRR002819.recal.fastq\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
\section{Interpretation of Results}
The units milliseconds and bytes preserve a high precision. Unfortunately, they are hard to read and compare by eye alone, so the data was converted. Sizes in \ref{k6:sizepercent} are displayed as percentages relative to the respective source file. This means that compressing \texttt{Homo\_sapiens.GRCh38.dna.chromosome.11.fa} with \acs{GeCo} resulted in a file that was only 17.6\% as large as the original.
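Stated as a formula (a restatement of the conversion just described; the symbol names are introduced here only for illustration), the value shown for each file and tool is
\[
\mathrm{size}_{\%} = \frac{\mathrm{size}_{\mathrm{compressed}}}{\mathrm{size}_{\mathrm{original}}} \cdot 100\,\%,
\]
so a value of 17.6\% means the compressed file occupies 17.6\% of the space of its source file.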
Runtimes in \ref{k6:time} were converted from milliseconds into seconds. Additionally, a line was added at the bottom of each table, showing the average percentage or average runtime for each process.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Effectiveness for the First Test Set] % caption for the list of tables
{File sizes in different compression formats in \textbf{percent} of the original size} % caption for the table itself
\label{k6:sizepercent}\\
\toprule
\textbf{ID} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM} \%}& \textbf{Samtools \acs{CRAM} \%} \\
\midrule
%geco bam and cram in percent
File 1.1& 18.32& 24.51& 22.03\\
File 1.2& 20.28& 26.56& 23.57\\
File 1.3& 20.4& 26.58& 23.66\\
File 1.4& 20.3& 26.61& 23.56\\
File 1.5& 20.12& 26.46& 23.65\\
File 1.6& 20.36& 26.61& 23.6\\
File 1.7& 19.64& 26.15& 23.71\\
File 1.8& 20.4& 26.5& 23.67\\
File 1.9& 17.01& 23.25& 20.94\\
File 1.10& 20.15& 26.36& 23.7\\
File 1.11& 19.96& 26.14& 23.69\\
File 1.12& 20.1& 26.26& 23.74\\
File 1.13& 17.8& 22.76& 20.27\\
File 1.14& 17.16& 22.31& 20.11\\
File 1.15& 16.21& 21.69& 19.76\\
File 1.16& 17.43& 23.48& 21.66\\
File 1.17& 18.76& 25.16& 23.84\\
File 1.18& 20.0& 25.31& 23.63\\
File 1.19& 17.6& 24.53& 23.91\\
File 1.20& 19.96& 25.6& 23.67\\
File 1.21& 16.64& 22.06& 20.44\\
&&&\\
\textbf{Average}& 18.98& 24.99& 22.71\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
Overall, Samtools \acs{BAM} resulted in a size reduction of 71.76\%; the \acs{CRAM} method improved this by roughly 2.5 percentage points. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4 percentage points comes with a comparatively large sacrifice in time.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Efficiency for the First Test Set] % caption for the list of tables
{Compression duration in seconds} % caption for the table itself
\label{k6:time}\\
\toprule
\textbf{ID} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
\midrule
File 1.1 & 23.5& 3.786& 16.926\\
File 1.2 & 24.65& 3.784& 17.043\\
File 1.3 & 2.016& 3.123& 13.999\\
File 1.4 & 19.408& 3.011& 13.445\\
File 1.5 & 18.387& 2.862& 12.802\\
File 1.6 & 17.364& 2.685& 12.015\\
File 1.7 & 15.999& 2.503& 11.198\\
File 1.8 & 14.828& 2.286& 10.244\\
File 1.9 & 12.304& 2.078& 9.21\\
File 1.10 & 13.493& 2.127& 9.461\\
File 1.11 & 13.629& 2.132& 9.508\\
File 1.12 & 13.493& 2.115& 9.456\\
File 1.13 & 99.902& 1.695& 7.533\\
File 1.14 & 92.475& 1.592& 7.011\\
File 1.15 & 85.255& 1.507& 6.598\\
File 1.16 & 82.765& 1.39& 6.089\\
File 1.17 & 82.081& 1.306& 5.791\\
File 1.18 & 79.842& 1.277& 5.603\\
File 1.19 & 58.605& 0.96& 4.106\\
File 1.20 & 64.588& 1.026& 4.507\\
File 1.21 & 41.198& 0.721& 3.096\\
&&&\\
\textbf{Average}&42.57&2.09&9.32\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
As \ref{k6:time} shows, the average compression duration for \acs{GeCo} is 42.57\,s. That is a little over 33\,s longer than the average runtime of Samtools for compressing into the \acs{CRAM} format; this additional time amounts to about 78\% of \acs{GeCo}'s average runtime.\\
Since \acs{CRAM} requires a file in \acs{BAM} format, the \acs{CRAM} column is calculated by adding the time needed to compress into \acs{BAM} to the time needed to convert the resulting \acs{BAM} into \acs{CRAM}. While the \acs{SAM} format is required as an intermediate step for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it does not feature any compression itself.
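Stated explicitly (the time symbols are introduced here only for illustration), the values in the \acs{CRAM} column are therefore composed as
\[
t_{\mathrm{CRAM}} = t_{\mathrm{FASTA \rightarrow BAM}} + t_{\mathrm{BAM \rightarrow CRAM}},
\]
so for \texttt{File 1.1} the listed 16.926\,s already include the 3.786\,s spent on the \acs{BAM} step.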
However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might seem counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. It can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in just one line. Dropping the line breaks during conversion can therefore result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
% (hi)storytime
Before interpreting the data further, a quick look at the development processes: \acs{GeCo} development stopped in 2016, while Samtools has been under development since 2015 to this day, with over 70 people contributing.\\
% big tables
Reviewing \ref{k6:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying these runtimes solely in seconds, a leading number followed by an \texttt{m} indicates how many minutes each run took.
\sffamily
\begin{footnotesize}
\begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Effectiveness for the Second Test Set] % caption for the list of tables
{File sizes in different compression formats in \textbf{percent} of the original size} % caption for the table itself
\label{k6:recal-size}\\
\toprule
\textbf{ID} & \textbf{\acs{GeCo} \%}& \textbf{Samtools \acs{BAM} \%}& \textbf{Samtools \acs{CRAM} \%} \\
\midrule
%geco bam and cram in percent
File 2.1& 1.00& 6.28& 5.38\\
File 2.2& 0.98& 6.41& 5.52\\
File 2.3& 1.21& 8.09& 7.17\\
File 2.4& 1.20& 7.70& 6.85\\
File 2.5& 1.08& 7.58& 6.72\\
File 2.6& 1.09& 7.85& 6.93\\
File 2.7& 0.96& 5.83& 4.63\\
&&&\\
\textbf{Average}& 1.07& 7.11& 6.17\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
\sffamily
\begin{footnotesize}
\begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Efficiency for the Second Test Set] % caption for the list of tables
{Compression duration in seconds} % caption for the table itself
\label{k6:recal-time}\\
\toprule
\textbf{ID} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
\midrule
%compress time for geco, bam and cram in seconds
File 2.1 & 1m58.427& 16.248& 23.016\\
File 2.2 & 1m57.905& 15.770& 22.892\\
File 2.3 & 1m09.725& 07.732& 12.858\\
File 2.4 & 1m13.694& 08.291& 13.649\\
File 2.5 & 1m51.001& 14.754& 23.713\\
File 2.6 & 1m51.315& 15.142& 24.358\\
File 2.7 & 2m02.065& 16.379& 23.484\\
&&&\\
\textbf{Average}& 1m43.447& 13.474& 20.567\\
\bottomrule
\end{longtable}
\end{footnotesize}
\rmfamily
In both tables \ref{k6:recal-time} and \ref{k6:recal-size}, the already identified pattern can be observed. Looking at the compression ratios in \ref{k6:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, \texttt{File 2.7} was the largest one ($\sim$1.24 \acs{GB}), closely followed by \texttt{File 2.2} and \texttt{File 2.1} ($\sim$1.2 \acs{GB}).
\section{View on Possible Improvements}
So far, this work has covered formats for storing genomes and methods to compress files in these formats, and it has run tests in which implementations of the named algorithms compressed several files, followed by an analysis of the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools, but takes considerably more time to run. In this test run, an implementation of arithmetic coding therefore resulted in a better compression ratio than Samtools \acs{BAM}, which mixes Huffman coding and \acs{LZ77}, and than Samtools' custom compression format \acs{CRAM}. Comparing these results with \autocite{survey} supports this statement. That study used \acs{FASTA}/Multi-FASTA files from 71 \acs{MB} to 166 \acs{MB} and found that \acs{GeCo} reached compression ratios varying from 12.34 to 91.68 times smaller than the input reference, while also producing long runtimes of up to over 600 minutes \cite{survey}. Since that study pursued a different goal than this work and therefore used different test variables and environments, the results cannot be compared directly. What can be taken from it, however, is that arithmetic coding, at least as implemented in \acs{GeCo}, is in need of a runtime improvement.\\
The actual mathematical proof of such an improvement, the planning of an implementation and the development of a proof of concept would be a rewarding but time- and resource-consuming project. Dealing with these tasks would go beyond the scope of this work. In order to widen the foundation for them, the rest of this work consists of considerations and problem analyses that should be taken into account when developing such an improvement.
S.\,V. Petoukhov described his findings, which are still under ongoing research, about the distribution of nucleotides \cite{pet21}. From the probability of one nucleotide in a sequence of sufficient length, information about its direct neighbours might be revealed. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} might be determinable without counting them \cite{pet21}:
\[
\%C \approx \sum\%CN \approx \sum\%NC \approx \sum\%CNN \approx \sum\%NCN \approx \sum\%NNC \approx \sum\%CNNN \approx \sum\%NCNN \approx \sum\%NNCN \approx \sum\%NNNC
\]
% begin optimization
Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good starting point to close the large gap in compression duration. Combined with a tool that is developed to today's standards, even greater improvements might be achieved.\\
% simple theoretical approach
What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probability of each symbol in the alphabet. The simplest way to do that is to parse the whole sequence from start to end and to increase a counter for each nucleotide that is read. Taking Petoukhov's findings into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine these probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of the other nucleotides by calculation, rather than by counting each of them. This approach raises a few questions that need to be answered in order to plan an implementation \cite{pet21}:\\
\begin{itemize}
\item How many probabilities are needed to calculate the others?
\item Is there space for improvement in the parsing/counting process?
%\item Is there space for visible improvements, when only counting one nucleotide?
\item How can the variation between the probabilities be determined?
\end{itemize}
% first bullet point
The question of how many probabilities are needed has to be answered before any kind of implementation can be started. It can only be answered by theoretical proof, for example in the form of a mathematical equation which shows that counting all occurrences of one nucleotide can be used to determine all other probabilities.
%Since this task is time and resource consuming and there is more to discuss, finding an answer will be postponed to another work.
%One should keep in mind that this is only one of many approaches. Any proof of other approaches which reduces the probability determination can be taken instead.
% second bullet point (multithreading aspect)
The second question must be asked because the improvement gained by counting only one nucleotide instead of three would be too small to be called relevant, especially if multithreading is an option. Since the static code analysis in \ref{k4:geco} revealed no multithreading, the potential improvement from splitting the workload onto several threads should be analysed before working on an improvement based on Petoukhov's findings. This is relevant because some improvements, like the one described above, will lose efficiency if only subsections of a genome are processed. A tool like OpenMP for multithreading C programs would possibly supply the required functionality to develop a proof of concept \cite{cthreading, pet21}.
% theoretical improvement with pseudocode
But what could an improvement look like, leaving aside the possible difficulties multithreading would bring? To answer this, a way of measuring a possible improvement must first be chosen. To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole; therefore, a more detailed view of the involved operations must be created. Considering a single-threaded loop with the purpose of counting every nucleotide in a sequence, the counting process can be split into several operations, defined by the following pseudocode.\\
while (sequence not at end)\\
do\\
\-\hspace{0.5cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
\-\hspace{0.5cm} for (element in alphabet\_probabilities)\\
\-\hspace{0.5cm} do\\
\-\hspace{1cm} if (element equals next\_nucleotide)\\
\-\hspace{1.5cm} element = element + 1\\
\-\hspace{1cm} fi\\
\-\hspace{0.5cm} done\\
done\\
This loop iterates over a whole sequence, counting each nucleotide. The inner loop iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. With a sequence of length $n$ and an alphabet of size $m$, the complexity of this code would thus be reduced from O($n \cdot m$) to O($n \cdot 1$), or simply O($n$); since the nucleotide alphabet has a fixed size of four, the removal of the inner loop saves a constant factor of comparisons per symbol rather than changing the asymptotic class \cite{big-o}.
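To make the difference concrete, the following minimal C sketch (written for this discussion, not taken from \acs{GeCo}'s source) counts a single target nucleotide in one pass; the remaining probabilities would then have to be derived by the calculation hypothesised above. The function name and the sequence literal are chosen for illustration only.
\begin{verbatim}
/* Minimal sketch: one pass over the sequence, one comparison per symbol.
 * The probabilities of the remaining nucleotides would be derived by the
 * (yet to be proven) calculation instead of further counting passes. */
#include <stdio.h>
#include <stddef.h>

static size_t count_nucleotide(const char *sequence, size_t length, char target)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (sequence[i] == target) {   /* no inner loop over the alphabet */
            count++;
        }
    }
    return count;
}

int main(void)
{
    const char sequence[] = "ACGTACGTCCGA";   /* illustrative input only */
    size_t length = sizeof(sequence) - 1;
    double p_c = (double)count_nucleotide(sequence, length, 'C') / (double)length;
    printf("P(C) = %.3f\n", p_c);
    return 0;
}
\end{verbatim}
This loop performs exactly one comparison per nucleotide; a complete runtime comparison would additionally have to account for the cost of deriving the remaining probabilities, as noted below.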
This is still a clear improvement in the work performed per symbol and therefore also in runtime.\\
The runtime of the calculations for the other symbols' probabilities must be considered as well and compared against the nested loop, to be certain that the overall runtime is actually improved.\\
% more realistic view on parsing
%In practice, smarter ways are used to determine probabilities, like splitting the sequence into multiple parts and parsing each subsequence asynchronously.
Getting back to the question of how multithreading would impact such improvements: an implementation like the one described above could also work with multithreading, since the saving per processed symbol does not depend on how much of the sequence a single thread processes. Multiple threads, each processing a part of a sequence of length $n$, would therefore benefit in the same way as a single thread processing the whole sequence. The resulting counts can either be summed up for global probabilities or be used individually for each associated subsequence; a short sketch of such a parallel counting loop, using OpenMP, is given at the end of this chapter. Either way, the presented improvement approach should be applicable to both parsing methods.\\
This leaves a list of problems which need to be regarded when developing an improvement. If there is space for improvement in the parsing/counting process, the following problems need to be addressed:
\begin{itemize}
\item The cost of reducing one process by adding additional code must be estimated and set into relation to the saving.
\item For a tool that does not feature multithreading, how would multithreading affect the improvement results?
\end{itemize}
% bullet point 3
An important question that needs to be answered is the following: if Petoukhov's findings show that, through similarities in the distribution of the nucleotides, the probability of one can lead to an approximation of the other three, and entropy coding works with probabilities, how does that affect the coding mechanism? With an equal probability for each nucleotide, entropy coding cannot be treated as a whole. This is due to the fact that Huffman coding makes use of differing probabilities: an equal distribution means every character will be encoded with the same length, which would make the encoding process pointless. Arithmetic coding, on the other hand, is able to handle equal probabilities. In addition, there are obviously chains of repeating nucleotides in genomes. For example, \texttt{File 2.2} contains the following subsequence at line 90: \texttt{AAAAAAAAAAAAAAAAAAAAAATAAATATTTTATTT}. Without determining probabilities, one can see that the \texttt{A}s outnumber the \texttt{T}s and that neither \texttt{C} nor \texttt{G} is present. Over the whole 1.2 \acs{GB} file, the distribution will even out, but cutting out a subsection of relevant size with an unequal distribution will still have an impact on the probabilities of the whole sequence. If a longer sequence leads to a more equal distribution, this knowledge could be used to help determine the distributions of subsequences of a sequence with equally distributed probabilities.
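As a closing illustration of the multithreading consideration above, the following sketch distributes the single-nucleotide counting loop over several threads with OpenMP and applies it to the subsequence quoted above. It is a minimal example written for this discussion, not code taken from \acs{GeCo}; the function name and the compilation setup (e.g. \texttt{gcc -fopenmp}) are illustrative assumptions.
\begin{verbatim}
/* Sketch: parallel single-nucleotide counting with OpenMP.
 * Each thread counts its share of the sequence; the reduction clause
 * sums the per-thread counts into one global total. */
#include <omp.h>
#include <stdio.h>
#include <stddef.h>

static size_t parallel_count(const char *sequence, size_t length, char target)
{
    size_t count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < (long)length; i++) {
        if (sequence[i] == target) {
            count++;
        }
    }
    return count;
}

int main(void)
{
    /* Subsequence quoted above from File 2.2 (line 90). */
    const char sequence[] = "AAAAAAAAAAAAAAAAAAAAAATAAATATTTTATTT";
    size_t length = sizeof(sequence) - 1;
    printf("A: %zu of %zu symbols\n",
           parallel_count(sequence, length, 'A'), length);
    return 0;
}
\end{verbatim}
The partial counts could just as well be kept per thread to obtain per-subsequence probabilities instead of a global one, matching the two parsing methods discussed above; without OpenMP enabled, the pragma is simply ignored and the loop runs single-threaded.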