3 years ago · 4955cc970a
--- a/latex/tex/kapitel/k3_datatypes.tex
+++ b/latex/tex/kapitel/k3_datatypes.tex
@@ -50,7 +50,7 @@ Several people and groups have developed different file formats to store genomes
 
				 \begin{itemize}
			
 
				   \item{The format has reputation. This can be indicated through:}
			
 
				 	\begin{itemize}
			
 
				-		\item{A scientific paper, that prooved its superiority to other relevant tools.}
			
 
				+		\item{A scientific paper, that proved its superiority to other relevant tools.}
			
 
				 		\item{A broad usage of the format, determined by its use on FTP servers that focus on supporting scientific research.}
			
 
				 	\end{itemize}
			
 
				   \item{The format should not specialize on only one type of \acs{DNA} or target a specific technology.}
			
--- a/latex/tex/kapitel/k4_algorithms.tex
+++ b/latex/tex/kapitel/k4_algorithms.tex
@@ -43,7 +43,7 @@ Dictionary coding, as the name suggest, uses a dictionary to eliminate redundand
 
				 % demo substrings
			
 
				 Looking at the string 'stationary' it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that should get compressed. 
			
 
				 % end demo
			
 
				-The dictionary should only store strings that occour in the input data. Also storing a dictionary in addition to the (compressed) input data, would be a waste of resources. Therefore the dicitonary is part of the text. Each first occourence is left uncompressed. Each occurence of a string, after the first one, points either to to its first occurence or to the last replacement of its occurence.\\ 
			
 
				+The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is part of the text. Each first occurrence is left uncompressed. Each occurrence of a string after the first one points either to its first occurrence or to the last replacement of its occurrence.\\ 
			
 
				 \ref{k4:dict-fig} illustrates how this process is executed. The bar on top of the figure, which extends over the full width, symbolizes any text. The squares inside the text are repeating occurrences of text segments. 
			
 
				 In the dictionary coding process, the square annotated as \texttt{first occ.} is added to the dictionary. \texttt{second} and \texttt{third occ.} get replaced by a structure \texttt{<pos, len>} consisting of a pointer to the position of the first occurrence \texttt{pos} and the length of that occurrence \texttt{len}.
			
 
				 The bar at the bottom of the figure shows how the compressed text for this example would be structured. The dotted lines would only consist of two bytes, storing position and length, pointing to \texttt{first occ.}. Decompressing this text only requires parsing the text from left to right and replacing every \texttt{<pos, len>} with the already parsed word from the dictionary. This means jumping back to the parsed position stored in the replacement, reading for as long as the length dictates, copying the read section, jumping back and pasting the section.\\
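
The left-to-right decompression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any tool discussed in this work: the token layout (a plain string for uncompressed text, a tuple \texttt{(pos, len)} for a replacement), the function name \texttt{decompress}, and the use of absolute positions in the already decompressed output are all hypothetical choices made for this example.

```python
from typing import List, Tuple, Union

# Hypothetical token layout: a str is uncompressed text,
# a (pos, len) tuple points back into the text decoded so far.
Token = Union[str, Tuple[int, int]]


def decompress(tokens: List[Token]) -> str:
    """Rebuild the text by parsing the tokens from left to right."""
    text = ""
    for token in tokens:
        if isinstance(token, tuple):
            pos, length = token
            # Assumption: a reference may only cover text that is already decoded.
            if pos < 0 or length < 0 or pos + length > len(text):
                raise ValueError(f"invalid reference <{pos}, {length}>")
            # Jump back, read `length` characters, and paste them at the end.
            text += text[pos:pos + length]
        else:
            # First occurrences are stored uncompressed.
            text += token
    return text


if __name__ == "__main__":
    tokens = ["the station ", (4, 7), "ary ", (0, 4), (4, 7)]
    print(decompress(tokens))  # the station stationary the station
```

In this sketch the second and third occurrence of \texttt{station} are both expressed as \texttt{<4, 7>}, so each repetition costs one small reference instead of seven characters.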
			
@@ -275,7 +275,7 @@ For a better understanding of this example, and to help further explanations, th
 
				 
			
 
				 \begin{figure}[H]
			
 
				   \centering
			
 
				-  \includegraphics[width=8cm]{k4/huffman-tree.png}
			
 
				+  \includegraphics[width=8cm]{k4/Huffman-tree.png}
			
 
				   \caption{Final version of the Huffman tree for the described example.}
			
 
				   \label{k4:huff-tree}
			
 
				 \end{figure}
			
@@ -333,8 +333,8 @@ Following function calls int the \texttt{compressor} section of \texttt{geco.c},
 
				 Compression in this format is done by an implementation called BGZF, which is a block compression on top of a widely used algorithm called DEFLATE.
			
 
				 \label{k4:deflate}
			
 
				 \paragraph{DEFLATE}
			
 
				-% mix of huffman and LZ77
			
 
				-The DEFLATE compression algorithm combines \acs{LZ77} and huffman coding. It is used in well known tools like gzip. 
			
 
				+% mix of Huffman and LZ77
			
 
				+The DEFLATE compression algorithm combines \acs{LZ77} and Huffman coding. It is used in well-known tools like gzip. 
			
 
				 Data is split into blocks. Each block stores a header consisting of three bits. A single block can be stored in one of three forms, each of which is represented by an identifier stored in the last two bits of the header. 
			
 
				 \begin{itemize}
			
 
				 	\item \texttt{00}		No compression.
			
@@ -344,12 +344,12 @@ Data is split into blocks. Each block stores a header consisting of three bit. A
 
				 The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.
			
 
				 % lz77 part
			
 
				 As described in \ref{k4:lz} a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.
			
 
				-The \acs{LZ77} algorithm is executed before the huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.\\
			
 
				+The \acs{LZ77} algorithm is executed before the Huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.\\
			
 
				 
			
 
				-% huffman part
			
 
				+% Huffman part
			
 
				 Besides the header bits and a data block, two Huffman code trees are stored. One encodes literals and lengths and the other distances. They are stored in a compact form. This is achieved by adding two rules on top of the rules described in \ref{k4:huff}: codes of identical length are ordered lexicographically, directed by the characters they represent, and shorter codes precede longer codes.
			
 
				 To illustrate this with an example:
			
 
				-For a text consisting out of \texttt{C} and \texttt{G}, following codes would be set, for a encoding of two bit per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which would occour more often than the other two characters, the codes would change to a representation like this:
			
 
				+For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bits per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, which would occur more often than the other two characters, the codes would change to a representation like this:
			
 
				 
			
 
				 \sffamily
			
 
				 \begin{footnotesize}
			
@@ -397,10 +397,10 @@ A Data Container can be split into three sections. From this sections the one st
 
				 \end{itemize}
			
 
				 
			
 
				 The Container Header stores information on how to decompress the data stored in the following block sections. The Compression Header contains information about what kind of data is stored and some encoding information for \acs{SAM} specific flags \cite{bam}. 
			
 
				-The actual data is stored in the Data Blocks. Those consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: External, Huffman and two other methods which happen to be either a form of huffman coding or a shortened binary representation of integers \cite{bam}. The External option allows to use gzip, bzip2 which is a form of multiple coding methods including run length encoding and huffman, a encoding from the LZ family called LZMA or a combination of arithmetic and huffman coding called rANS \cite{sam12}.
			
 
				+The actual data is stored in the Data Blocks. Those consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: External, Huffman, and two other methods, which are either a form of Huffman coding or a shortened binary representation of integers \cite{bam}. The External option allows the use of gzip; bzip2, which combines multiple coding methods including run-length encoding and Huffman coding; LZMA, an encoding from the LZ family; or rANS, a combination of arithmetic and Huffman coding \cite{sam12}.
			
 
				 % possible encodings: 
			
 
				 % external: no encoding or gzip, bzip2, lzma
			
 
				-% huffman
			
 
				-% Byte array coding -> huffman or external...
			
 
				+% Huffman
			
 
				+% Byte array coding -> Huffman or external...
			
 
				 % Beta coding -> binary representation
			
 
				 
			
--- a/latex/tex/kapitel/k6_results.tex
+++ b/latex/tex/kapitel/k6_results.tex
@@ -214,10 +214,10 @@ Reviewing \ref{k6:recal-time} one will notice, that \acs{GeCo} reached a runtime
 
				 In both tables \ref{k6:recal-time} and \ref{k6:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{k6:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the one with the greatest size ($\sim$1.3 Gigabyte), closely followed by files one and two ($\sim$1.2 Gigabyte). 
			
 
				 
			
 
				 \section{View on Possible Improvements}
			
 
				-So far, this work went over formats for storing genomes, methods to compress files (in mentioned formats) and through tests where implementations of named algorithms compress several files and analyzed the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools and takes more time to run through. So in this testrun, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM} with the mix of huffman coding and \acs{LZ77}, or Samtools custom compression format \acs{CRAM}. Comparing results in \autocite{survey}, supports this statement. This study used \acs{FASTA}/Multi-FASTA files from 71MB to 166MB and found that \acs{GeCo} had a variating compression ratio from 12.34 to 91.68 times smaller than the input reference and also resulted in long runtimes up to over 600 minutes \cite{survey}. Since this study focused on another goal than this work and therefore used different test variables and environments, the results can not be compared. But what can be taken from this, is that arithmetic coding, at least in \acs{GeCo} is in need of a runtime improvement.\\
			
 
				-The actual mathematical proove of such an improvement, the planing of a implementation and the development of a proof of concept, will be a rewarding but time and ressource comsuming project. Dealing with those tasks would go beyond the scope of this work. But in order to widen the foundation for this tasks, the rest of this work will consist of considerations and problem analysis, which should be thought about and dealt with to develop a improvement.
			
 
				+So far, this work went over formats for storing genomes, methods to compress files (in the mentioned formats), and tests in which implementations of the named algorithms compressed several files, and it analyzed the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools but takes more time to run. So in this test run, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM}, with its mix of Huffman coding and \acs{LZ77}, or Samtools' custom compression format \acs{CRAM}. Comparing the results in \autocite{survey} supports this statement. That study used \acs{FASTA}/Multi-FASTA files from 71MB to 166MB and found that \acs{GeCo} had a varying compression ratio, from 12.34 to 91.68 times smaller than the input reference, and also resulted in long runtimes of up to over 600 minutes \cite{survey}. Since that study focused on a different goal than this work and therefore used different test variables and environments, the results cannot be compared directly. But what can be taken from this is that arithmetic coding, at least in \acs{GeCo}, is in need of a runtime improvement.\\
			
 
				+The actual mathematical proof of such an improvement, the planning of an implementation and the development of a proof of concept will be a rewarding but time- and resource-consuming project. Dealing with those tasks would go beyond the scope of this work. But in order to widen the foundation for these tasks, the rest of this work will consist of considerations and problem analysis, which should be thought about and dealt with to develop an improvement.
			
 
				 
			
 
				-S.V. Petoukhov described his findings about the distribution of nucleotides \cite{pet21}. With the probability of one nucleotide, in a sequence of sufficient length, information about the direct neighbours is revealed. For example, with the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N}, including \texttt{C} can be determined without counting them \cite{pet21}.\\
			
 
				+S.V. Petoukhov described his findings, which are under ongoing research, about the distribution of nucleotides \cite{pet21}. With the probability of one nucleotide, in a sequence of sufficient length, information about the direct neighbours might be revealed. For example, with the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N}, including \texttt{C} might be determinable without counting them \cite{pet21}.\\
			
 
				 %\%C ≈ Σ\%CN ≈ Σ\%NС ≈ Σ\%CNN ≈ Σ\%NCN ≈ Σ\%NNC ≈ Σ\%CNNN ≈ Σ\%NCNN ≈ Σ\%NNCN ≈ Σ\%NNNC\\
			
 
				 
			
 
				 % begin optimization 
			
@@ -235,11 +235,14 @@ This approach throws a few questions that need to be answered in order to plan a
 
				 \end{itemize}
			
 
				 
			
 
				 % first bulletpoint
			
 
				-The question for how many probabilities are needed, needs to be answered, to start working on any kind of implementation. This question will only get answered by theoretical proove. It could happen in form of a mathematical equtaion, which prooves that counting all ocurences of one nucleotide reveals can be used to determin all probabilities. Since this task is time and resource consuming and there is more to discuss, finding a answer will be postponed to another work. 
			
 
				-%One should keep in mind that this is only one of many approaches. Any proove of other approaches which reduces the probability determination, can be taken in instead. 
			
 
				+The question of how many probabilities are needed has to be answered before work on any kind of implementation can start. This question can only be answered by a theoretical proof. It could take the form of a mathematical equation which proves that counting all occurrences of one nucleotide is sufficient to determine all probabilities. 
			
 
				+%Since this task is time and resource consuming and there is more to discuss, finding a answer will be postponed to another work. 
			
 
				+%One should keep in mind that this is only one of many approaches. Any prove of other approaches which reduces the probability determination, can be taken in instead. 
			
 
				 
			
 
				 % second bullet point (mutlithreading aspect=
			
 
				-The Second point must be asked, because the improvement in counting only one nucleotide in comparison to counting three, would be to little to be called relevant. Especially if multithreading is a option. Since in the static codeanalysis in \ref{k4:geco} revealed no multithreading, the analysis for improvements when splitting the workload onto several threads should be considered, before working on an improvement based on Petoukhovs findings. This is relevant, because some improvements, like the one described above, will loose efficiency if only subsections of a genomes are processed. A tool like OpenMC for multithreading C programs would possibly supply the required functionality to develop a prove of concept \cite{cthreading, pet21}.
			
 
				+The second question must be asked because the improvement from counting only one nucleotide instead of three would be too small to be called relevant, especially if multithreading is an option. 
			
 
				+% fim: not quite clear ->
			
 
				+Since the static code analysis in \ref{k4:geco} revealed no multithreading, the possible gains from splitting the workload onto several threads should be analyzed before working on an improvement based on Petoukhov's findings. This is relevant because some improvements, like the one described above, will lose efficiency if only subsections of a genome are processed. A tool like OpenMP for multithreading C programs would possibly supply the required functionality to develop a proof of concept \cite{cthreading, pet21}.
			
 
				 % theoretical improvement with pseudocode
			
 
				 But what could an improvement look like, not considering the possible difficulties multithreading would bring?
			
 
				 To answer this, first a mechanism to determine a possible improvement must be defined. To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, this only covers loops and conditions as a whole. Therefore a more detailed view on operations must be created: 
			
@@ -276,7 +279,7 @@ If there space for improvement in the parsing/counting process, what problems ne
 
				 
			
 
				 % bulletpoint 3
			
 
				 An important question that needs to be answered is: if Petoukhov's findings show that, through similarities in the distribution of each nucleotide, the probability of one leads to an approximation of the other three, and entropy codings work with probabilities, how does that affect the coding mechanism?
			
 
				-With a equal probability for each nucleotide, entropy coding can not be treated as a whole. This is due to the fact, that huffman coding makes use of differing probabilities. A equal distribution means every character will be encoded in the same length which would make the encoding process unnecessary. Arithmetic coding on the other hand is able to handle equal probabilities.
			
 
				+With an equal probability for each nucleotide, entropy coding cannot be treated as a whole. This is due to the fact that Huffman coding makes use of differing probabilities. An equal distribution means every character will be encoded with the same length, which would make the encoding process unnecessary. Arithmetic coding, on the other hand, is able to handle equal probabilities.
			
 
				 Genomes obviously contain chains of repeating nucleotides. For example, \texttt{File 2.2} contains this subsequence at line 90:
			
 
				 
			
 
				 \texttt{AAAAAAAAAAAAAAAAAAAAAATAAATATTTTATTT} 
			
--- a/latex/tex/literatur.bib
+++ b/latex/tex/literatur.bib
@@ -1,5 +1,8 @@
 
				+% todo sort alphabetically 
			
 
				+% separate online sources
			
 
				+
			
 
				 @Article{alok17,
			
 
				-  author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
			
 
				+  author       = {A. Al-Okaily and B. Almarri and S. {Al Yami} and C. Huang},
			
 
				   date         = {2017-04-01},
			
 
				   journaltitle = {Journal of Computational Biology},
			
 
				   title        = {Toward a Better Compression for {DNA} Sequences Using Huffman Encoding},
			
@@ -7,20 +10,11 @@
 
				   number       = {4},
			
 
				   pages        = {280--288},
			
 
				   volume       = {24},
			
 
				-  publisher    = {Mary Ann Liebert Inc},
			
 
				-}
			
 
				-
			
 
				-@Online{bam,
			
 
				-  author  = {The SAM/BAM Format Specification Working Group},
			
 
				-  date    = {2022-08-22},
			
 
				-  title   = {Sequence Alignment/Map Format Specification},
			
 
				-  url     = {https://github.com/samtools/hts-specs},
			
 
				-  urldate = {2022-09-12},
			
 
				-  version = {44b4167},
			
 
				+  publisher    = {Mary Ann Liebert Inc.},
			
 
				 }
			
 
				 
			
 
				 @Article{Cock_2009,
			
 
				-  author       = {Peter J. A. Cock and Christopher J. Fields and Naohisa Goto and Michael L. Heuer and Peter M. Rice},
			
 
				+  author       = {P. Cock and C. Fields and N. Goto and M. Heuer and P. Rice},
			
 
				   date         = {2009-12},
			
 
				   journaltitle = {Nucleic Acids Research},
			
 
				   title        = {The Sanger {FASTQ} file format for sequences with quality scores, and the Solexa/Illumina {FASTQ} variants},
			
@@ -32,7 +26,7 @@
 
				 }
			
 
				 
			
 
				 @Article{cells,
			
 
				-  author       = {Eva Bianconi and Allison Piovesan and Federica Facchin and Alina Beraudi and Raffaella Casadei and Flavia Frabetti and Lorenza Vitale and Maria Chiara Pelleri and Simone Tassani and Francesco Piva and Soledad Perez-Amodio and Pierluigi Strippoli and Silvia Canaider},
			
 
				+  author       = {E. Bianconi and A. Piovesan and F. Facchin and A. Beraudi and R. Casadei and F. Frabetti and L. Vitale and M. Pelleri and S. Tassani and F. Piva and S. Perez-Amodio and P. Strippoli and S. Canaider},
			
 
				   date         = {2013-07},
			
 
				   journaltitle = {Annals of Human Biology},
			
 
				   title        = {An estimation of the number of cells in the human body},
			
@@ -44,7 +38,7 @@
 
				 }
			
 
				 
			
 
				 @Article{dna_structure,
			
 
				-  author       = {J. D. WATSON and F. H. C. CRICK},
			
 
				+  author       = {J. Watson and F. Crick},
			
 
				   date         = {1953-04},
			
 
				   journaltitle = {Nature},
			
 
				   title        = {Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid},
			
@@ -56,7 +50,7 @@
 
				 }
			
 
				 
			
 
				 @Article{iupac,
			
 
				-  author       = {Andrew D. Johnson},
			
 
				+  author       = {A. Johnson},
			
 
				   date         = {2010-03},
			
 
				   journaltitle = {Bioinformatics},
			
 
				   title        = {An extended {IUPAC} nomenclature code for polymorphic nucleic acids},
			
@@ -68,7 +62,7 @@
 
				 }
			
 
				 
			
 
				 @TechReport{rfc1951,
			
 
				-  author    = {L Peter Deutsch},
			
 
				+  author    = {P. Deutsch},
			
 
				   date      = {1996-05},
			
 
				   title     = {{DEFLATE} Compressed Data Format Specification version 1.3},
			
 
				   doi       = {10.17487/rfc1951},
			
@@ -88,36 +82,8 @@
 
				   publisher    = {Institute of Electrical and Electronics Engineers ({IEEE})},
			
 
				 }
			
 
				 
			
 
				-@Online{ucsc,
			
 
				-  author  = {UCSC - University of California, Santa Cruz},
			
 
				-  date    = {2022-10-28},
			
 
				-  title   = {UCSC Genome Browser},
			
 
				-  url     = {https://genome.ucsc.edu/},
			
 
				-  urldate = {2022-10-28},
			
 
				-}
			
 
				-
			
 
				-@Online{ensembl,
			
 
				-  author = {Paul Flicek},
			
 
				-  date   = {2022-10-24},
			
 
				-  title  = {ENSEMBL Project},
			
 
				-  url    = {http://www.ensembl.org/},
			
 
				-}
			
 
				-
			
 
				-@Online{ga4gh,
			
 
				-  date  = {2022-10-10},
			
 
				-  title = {Global Alliance for Genomics and Health},
			
 
				-  url   = {https://github.com/samtools/hts-specs.},
			
 
				-}
			
 
				-
			
 
				-@Online{bed,
			
 
				-  author = {Sanger Institute, Genome Research Limited},
			
 
				-  date   = {2022-10-20},
			
 
				-  title  = {BED Browser Extensible Data},
			
 
				-  url    = {https://samtools.github.io/hts-specs/BEDv1.pdf},
			
 
				-}
			
 
				-
			
 
				 @InProceedings{compr-visual,
			
 
				-  author    = {Sami Khuri and Hsiu-Chin Hsu},
			
 
				+  author    = {S. Khuri and H. Hsu},
			
 
				   booktitle = {Proceedings of the 2000 {ACM} symposium on Applied computing - {SAC} {\textquotesingle}00},
			
 
				   date      = {2000},
			
 
				   title     = {Tools for visualizing text compression algorithms},
			
@@ -126,7 +92,7 @@
 
				 }
			
 
				 
			
 
				 @Article{lcqs,
			
 
				-  author       = {Jiabing Fu and Bixin Ke and Shoubin Dong},
			
 
				+  author       = {J. Fu and B. Ke and S. Dong},
			
 
				   date         = {2020-03},
			
 
				   journaltitle = {{BMC} Bioinformatics},
			
 
				   title        = {{LCQS}: an efficient lossless compression tool of quality scores with random access functionality},
			
@@ -137,7 +103,7 @@
 
				 }
			
 
				 
			
 
				 @Book{delfs_knebl,
			
 
				-  author    = {Delfs, Hans and Knebl, Helmut},
			
 
				+  author    = {H. Delfs and H. Knebl},
			
 
				   date      = {2007},
			
 
				   title     = {Introduction to Cryptography},
			
 
				   isbn      = {9783540492436},
			
@@ -147,7 +113,7 @@
 
				 }
			
 
				 
			
 
				 @Article{cc14,
			
 
				-  author       = {Kashfia Sailunaz and Mohammed Rokibul Alam Kotwal and Mohammad Nurul Huda},
			
 
				+  author       = {K. Sailunaz and M. Kotwal and M. Huda},
			
 
				   date         = {2014-03},
			
 
				   journaltitle = {International Journal of Computer Applications},
			
 
				   title        = {Data Compression Considering Text Files},
			
@@ -159,7 +125,7 @@
 
				 }
			
 
				 
			
 
				 @Article{cnet13,
			
 
				-  author       = {Manish RajShivare and Yogendra P. S. Maravi and Sanjeev Sharma},
			
 
				+  author       = {M. RajShivare and Y. Maravi and S. Sharma},
			
 
				   date         = {2013-10},
			
 
				   journaltitle = {International Journal of Computer Applications},
			
 
				   title        = {Analysis of Header Compression Techniques for Networks: A Review},
			
@@ -171,7 +137,7 @@
 
				 }
			
 
				 
			
 
				 @TechReport{rfcgzip,
			
 
				-  author       = {L. Peter Deutsch and Jean-Loup Gailly and Mark Adler and L. Peter Deutsch and Glenn Randers-Pehrson},
			
 
				+  author       = {P. Deutsch and J. Gailly and M. Adler and G. Randers-Pehrson},
			
 
				   date         = {1996-05},
			
 
				   title        = {GZIP file format specification version 4.3},
			
 
				   number       = {1952},
			
@@ -184,13 +150,12 @@
 
				 }
			
 
				 
			
 
				 @Article{huf52,
			
 
				-  author      = {Huffman, David A.},
			
 
				+  author      = {D. A. Huffman},
			
 
				   title       = {A Method for the Construction of Minimum-Redundancy Codes},
			
 
				   number      = {9},
			
 
				   pages       = {1098-1101},
			
 
				   volume      = {40},
			
 
				   added-at    = {2009-01-14T00:43:43.000+0100},
			
 
				-  biburl      = {https://www.bibsonomy.org/bibtex/2585b817b85d7278b868329672ddded96/dret},
			
 
				   description = {dret'd bibliography},
			
 
				   interhash   = {d00a180c1c2e7851560c2d51e0fd8f92},
			
 
				   intrahash   = {585b817b85d7278b868329672ddded96},
			
@@ -203,7 +168,7 @@
 
				 }
			
 
				 
			
 
				 @Article{moffat20,
			
 
				-  author       = {Alistair Moffat},
			
 
				+  author       = {A. Moffat},
			
 
				   date         = {2020-07},
			
 
				   journaltitle = {{ACM} Computing Surveys},
			
 
				   title        = {Huffman Coding},
			
@@ -215,7 +180,7 @@
 
				 }
			
 
				 
			
 
				 @Article{moffat_arith,
			
 
				-  author       = {Alistair Moffat and Radford M. Neal and Ian H. Witten},
			
 
				+  author       = {A. Moffat and R. Neal and I. Witten},
			
 
				   date         = {1998-07},
			
 
				   journaltitle = {{ACM} Transactions on Information Systems},
			
 
				   title        = {Arithmetic coding revisited},
			
@@ -227,7 +192,7 @@
 
				 }
			
 
				 
			
 
				 @Article{ris76,
			
 
				-  author       = {J. J. Rissanen},
			
 
				+  author       = {J. Rissanen},
			
 
				   date         = {1976-05},
			
 
				   journaltitle = {{IBM} Journal of Research and Development},
			
 
				   title        = {Generalized Kraft Inequality and Arithmetic Coding},
			
@@ -247,7 +212,7 @@
 
				 }
			
 
				 
			
 
				 @Article{big-o,
			
 
				-  author = {Mala, Firdous and Ali, Rouf},
			
 
				+  author = {F. Mala and R. Ali},
			
 
				   title  = {The Big-O of Mathematics and Computer Science},
			
 
				   doi    = {10.26855/jamc.2022.03.001},
			
 
				   pages  = {1-3},
			
@@ -257,7 +222,7 @@
 
				 }
			
 
				 
			
 
				 @Article{sam12,
			
 
				-  author       = {Petr Danecek and James K Bonfield and Jennifer Liddle and John Marshall and Valeriu Ohan and Martin O Pollard and Andrew Whitwham and Thomas Keane and Shane A McCarthy and Robert M Davies and Heng Li},
			
 
				+  author       = {P. Danecek and J. Bonfield and J. Liddle and J. Marshall and V. Ohan and M. Pollard and A. Whitwham and T. Keane and S. McCarthy and R. Davies and H. Li},
			
 
				   date         = {2021-01},
			
 
				   journaltitle = {{GigaScience}},
			
 
				   title        = {Twelve years of {SAMtools} and {BCFtools}},
			
@@ -268,7 +233,7 @@
 
				 }
			
 
				 
			
 
				 @Article{cram-origin,
			
 
				-  author       = {Markus Hsi-Yang Fritz and Rasko Leinonen and Guy Cochrane and Ewan Birney},
			
 
				+  author       = {M. Fritz and R. Leinonen and G. Cochrane and E. Birney},
			
 
				   date         = {2011-01},
			
 
				   journaltitle = {Genome Research},
			
 
				   title        = {Efficient storage of high throughput {DNA} sequencing data using reference-based compression},
			
@@ -279,26 +244,22 @@
 
				   publisher    = {Cold Spring Harbor Laboratory},
			
 
				 }
			
 
				 
			
 
				-@Online{illufastq,
			
 
				-  author = {Illumina},
			
 
				-  date   = {2022-11-17},
			
 
				-  title  = {Illumina FASTq file structure explained},
			
 
				-  url    = {https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html},
			
 
				-}
			
 
				-
			
 
				 @TechReport{rfcansi,
			
 
				-  author       = {K. Simonsen and},
			
 
				-  title        = {Character Mnemonics and Character Sets},
			
 
				-  number       = {1345},
			
 
				-  type         = {RFC},
			
 
				-  howpublished = {Internet Requests for Comments},
			
 
				-  issn         = {2070-1721},
			
 
				-  month        = {June},
			
 
				-  year         = {1992},
			
 
				+    author =    {K. Simonsen},
			
 
				+    series =    {Request for Comments},
			
 
				+    number =    {1345}, 
			
 
				+    howpublished =  {RFC 1345},
			
 
				+    publisher = {RFC Editor},
			
 
				+    doi =       {10.17487/RFC1345},
			
 
				+    url =       {https://www.rfc-editor.org/info/rfc1345},
			
 
				+    title =     {{Character Mnemonics and Character Sets}},
			
 
				+    pagetotal = {103},
			
 
				+    year =      {1992},
			
 
				+    month =     {jun},
			
 
				 }
			
 
				 
			
 
				 @Article{witten87,
			
 
				-  author       = {Ian H. Witten and Radford M. Neal and John G. Cleary},
			
 
				+  author       = {I. Witten and R. Neal and J. Cleary},
			
 
				   date         = {1987-06},
			
 
				   journaltitle = {Communications of the {ACM}},
			
 
				   title        = {Arithmetic coding for data compression},
			
@@ -319,7 +280,7 @@
 
				 }
			
 
				 
			
 
				 @InProceedings{geco,
			
 
				-  author    = {Diogo Pratas and Armando J. Pinho and Paulo J. S. G. Ferreira},
			
 
				+  author    = {D. Pratas and A. Pinho and P. Ferreira},
			
 
				   booktitle = {2016 Data Compression Conference ({DCC})},
			
 
				   date      = {2016-03},
			
 
				   title     = {Efficient Compression of Genomic Sequences},
			
@@ -328,7 +289,7 @@
 
				 }
			
 
				 
			
 
				 @Article{survey,
			
 
				-  author       = {Morteza Hosseini and Diogo Pratas and Armando Pinho},
			
 
				+  author       = {M. Hosseini and D. Pratas and A. Pinho},
			
 
				   date         = {2016-10},
			
 
				   journaltitle = {Information},
			
 
				   title        = {A Survey on Data Compression Methods for Biological Sequences},
			
@@ -340,7 +301,7 @@
 
				 }
			
 
				 
			
 
				 @Article{vertical,
			
 
				-  author       = {Kelvin V. Kredens and Juliano V. Martins and Osmar B. Dordal and Mauri Ferrandin and Roberto H. Herai and Edson E. Scalabrin and Br{\'{a}}ulio C. {\'{A}}vila},
			
 
				+  author       = {K. Kredens and J. Martins and O. Dordal and M. Ferrandin and R. Herai and E. Scalabrin and B. {\'{A}}vila},
			
 
				   date         = {2020-05},
			
 
				   journaltitle = {{PLOS} {ONE}},
			
 
				   title        = {Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review},
			
@@ -352,12 +313,6 @@
 
				   publisher    = {Public Library of Science ({PLoS})},
			
 
				 }
			
 
				 
			
 
				-@Online{twobit,
			
 
				-  date   = {2022-09-22},
			
 
				-  editor = {UCSC University of California Sata Cruz},
			
 
				-  title  = {TwoBit File Format},
			
 
				-  url    = {https://genome-source.gi.ucsc.edu/gitlist/kent.git/raw/master/src/inc/twoBit.h},
			
 
				-}
			
 
				 
			
 
				 @TechReport{isompeg,
			
 
				   author      = {{ISO Central Secretary}},
			
@@ -371,16 +326,19 @@
 
				   year        = {2019},
			
 
				 }
			
 
				 
			
 
				-@Article{mpeg,
			
 
				-  author    = {Claudio Albert and Tom Paridaens and Jan Voges and Daniel Naro and Junaid J. Ahmad and Massimo Ravasi and Daniele Renzi and Giorgio Zoia and Paolo Ribeca and Idoia Ochoa and Marco Mattavelli and Jaime Delgado and Mikel Hernaez},
			
 
				-  date      = {2018-09},
			
 
				-  title     = {An introduction to {MPEG}-G, the new {ISO} standard for genomic information representation},
			
 
				-  doi       = {10.1101/426353},
			
 
				-  publisher = {Cold Spring Harbor Laboratory},
			
 
				+@ARTICLE{9455132,
			
 
				+  author={J. Voges and M. Hernaez and M. Mattavelli and J. Ostermann},
			
 
				+  journal={Proceedings of the IEEE}, 
			
 
				+  title={An Introduction to {MPEG-G}: The First Open {ISO/IEC} Standard for the Compression and Exchange of Genomic Sequencing Data},
			
 
				+  year={2021},
			
 
				+  volume={109},
			
 
				+  number={9},
			
 
				+  pages={1607-1622},
			
 
				+  doi={10.1109/JPROC.2021.3082027}
			
 
				 }
			
 
				 
			
 
				 @Article{haplo,
			
 
				-  author       = {Wai Yee Low and Rick Tearle and Ruijie Liu and Sergey Koren and Arang Rhie and Derek M. Bickhart and Benjamin D. Rosen and Zev N. Kronenberg and Sarah B. Kingan and Elizabeth Tseng and Fran{\c{c}}oise Thibaud-Nissen and Fergal J. Martin and Konstantinos Billis and Jay Ghurye and Alex R. Hastie and Joyce Lee and Andy W. C. Pang and Michael P. Heaton and Adam M. Phillippy and Stefan Hiendleder and Timothy P. L. Smith and John L. Williams},
			
 
				+  author       = {W. Low and R. Tearle and R. Liu and S. Koren and A. Rhie and D. Bickhart and B. Rosen and Z. Kronenberg and S. Kingan and E. Tseng and F. Thibaud-Nissen and F. Martin and K. Billis and J. Ghurye and A. Hastie and J. Lee and A. Pang and M. Heaton and A. Phillippy and S. Hiendleder and T. Smith and J. Williams},
			
 
				   date         = {2020-04},
			
 
				   journaltitle = {Nature Communications},
			
 
				   title        = {Haplotype-resolved genomes provide insights into structural variation and gene content in Angus and Brahman cattle},
			
@@ -390,41 +348,8 @@
 
				   publisher    = {Springer Science and Business Media {LLC}},
			
 
				 }
			
 
				 
			
 
				-@Online{ftp-igsr,
			
 
				-  date  = {2022-11-10},
			
 
				-  title = {IGSR: The International Genome Sample Resource},
			
 
				-  url   = {https://ftp.1000genomes.ebi.ac.uk},
			
 
				-}
			
 
				-
			
 
				-@Online{ftp-ncbi,
			
 
				-  date  = {2022-11-01},
			
 
				-  title = {NCBI National Center for Biotechnology Information},
			
 
				-  url   = {https://ftp.ncbi.nlm.nih.gov/genomes/},
			
 
				-}
			
 
				-
			
 
				-@Online{ftp-ensembl,
			
 
				-  date  = {2022-10-15},
			
 
				-  title = {ENSEMBL Rapid Release},
			
 
				-  url   = {https://ftp.ensembl.org},
			
 
				-}
			
 
				-
			
 
				-@Book{cthreading,
			
 
				-  author    = {Quinn, Michael J.},
			
 
				-  title     = {Parallel Programming in C with MPI and OpenMP},
			
 
				-  isbn      = {0071232656},
			
 
				-  publisher = {McGraw-Hill Education Group},
			
 
				-  year      = {2003},
			
 
				-}
			
 
				-
			
 
				-@Online{geco-repo,
			
 
				-  author = {Cobilab},
			
 
				-  date   = {2022-11-19},
			
 
				-  title  = {Repositories for the three versions of GeCo},
			
 
				-  url    = {https://github.com/cobilab},
			
 
				-}
			
 
				-
			
 
				 @Article{pet21,
			
 
				-  author    = {Sergey V. Petoukhov},
			
 
				+  author    = {S. Petoukhov},
			
 
				   date      = {2021-10},
			
 
				   title     = {Tensor Rules in the Stochastic Organization of Genomes and Genetic Stochastic Resonance in Algebraic Biology},
			
 
				   doi       = {10.20944/preprints202110.0093.v1},
			
@@ -444,7 +369,7 @@
 
				 }
			
 
				 
			
 
				 @Book{dict,
			
 
				-  author    = {McIntosh, Colin},
			
 
				+  author    = {C. McIntosh},
			
 
				   date      = {2013},
			
 
				   title     = {Cambridge International Dictionary of English},
			
 
				   isbn      = {9781107035157},
			
@@ -462,15 +387,21 @@
 
				   pagetotal    = {3},
			
 
				   url          = {https://www.rfc-editor.org/info/rfc768},
			
 
				   howpublished = {RFC 768},
			
 
				-  month        = aug,
			
 
				+  month        = {aug},
			
 
				   publisher    = {RFC Editor},
			
 
				   series       = {Request for Comments},
			
 
				   year         = {1980},
			
 
				 }
			
 
				 
			
 
				 @TechReport{isoutf,
			
 
				-  author = {ISO},
			
 
				+  author      = {{ISO/IEC JTC 1/SC 2 Coded character sets}},
			
 
				   title  = {ISO/IEC 10646:2020 UTF},
			
 
				+  date        = {2020-12},
			
 
				+  institution = {International Organization for Standardization},
			
 
				+  title       = {Information technology — Universal coded character set (UCS)},
			
 
				+  type        = {Standard},
			
 
				+  address     = {Geneva, CH},
			
 
				+  key         = {ISO10646:2020},
			
 
				 }
			
 
				 
			
 
				 @Article{lz77,
			
@@ -484,26 +415,8 @@
 
				   year    = {1977},
			
 
				 }
			
 
				 
			
 
				-@Online{code-analysis,
			
 
				-  author = {Ryan Dewhurst},
			
 
				-  date   = {2022-11-20},
			
 
				-  editor = {Kirsten S and Nick Bloor and Sarah Baso and James Bowie and Evgeniy Ryzhkov and Iberiam and Ann Campbell and Jonathan Marcil and Christina Schelin and Jie Wang and Fabian and Achim and Dirk Wetter},
			
 
				-  title  = {Static Code Analysis},
			
 
				-  url    = {https://owasp.org/www-community/controls/Static_Code_Analysis},
			
 
				-}
			
 
				-
			
 
				-@Online{gpl,
			
 
				-  title = {GNU Public License},
			
 
				-  url   = {http://www.gnu.org/licenses/gpl-3.0.html},
			
 
				-}
			
 
				-
			
 
				-@Online{mitlic,
			
 
				-  title = {MIT License},
			
 
				-  url   = {https://spdx.org/licenses/MIT.html},
			
 
				-}
			
 
				-
			
 
				 @Article{wang_22,
			
 
				-  author       = {Si-Wei Wang and Chao Gao and Yi-Min Zheng and Li Yi and Jia-Cheng Lu and Xiao-Yong Huang and Jia-Bin Cai and Peng-Fei Zhang and Yue-Hong Cui and Ai-Wu Ke},
			
 
				+  author       = {S. Wang and C. Gao and Y. Zheng and L. Yi and J. Lu and X. Huang and J. Cai and P. Zhang and Y. Cui and A. Ke},
			
 
				   date         = {2022-02},
			
 
				   journaltitle = {Molecular Cancer},
			
 
				   title        = {Current applications and future perspective of {CRISPR}/Cas9 gene editing in cancer},
			
@@ -514,7 +427,7 @@
 
				 }
			
 
				 
			
 
				 @Article{ju_21,
			
 
				-  author       = {Philomin Juliana and Ravi Prakash Singh and Jesse Poland and Sandesh Shrestha and Julio Huerta-Espino and Velu Govindan and Suchismita Mondal and Leonardo Abdiel Crespo-Herrera and Uttam Kumar and Arun Kumar Joshi and Thomas Payne and Pradeep Kumar Bhati and Vipin Tomar and Franjel Consolacion and Jaime Amador Campos Serna},
			
 
				+  author       = {P. Juliana and R. Singh and J. Poland and S. Shrestha and J. Huerta-Espino and V. Govindan and S. Mondal and L. Crespo-Herrera and U. Kumar and A. Joshi and T. Payne and P. Bhati and V. Tomar and F. Consolacion and J. Serna},
			
 
				   date         = {2021-03},
			
 
				   journaltitle = {Scientific Reports},
			
 
				   title        = {Elucidating the genetics of grain yield and stress-resilience in bread wheat using a large-scale genome-wide association mapping study with 55,568 lines},
			
@@ -525,7 +438,7 @@
 
				 }
			
 
				 
			
 
				 @Article{mo_83,
			
 
				-  author       = {Arno G. Motulsky},
			
 
				+  author       = {A. Motulsky},
			
 
				   date         = {1983-01},
			
 
				   journaltitle = {Science},
			
 
				   title        = {Impact of Genetic Manipulation on Society and Medicine},
			
@@ -536,4 +449,107 @@
 
				   publisher    = {American Association for the Advancement of Science ({AAAS})},
			
 
				 }
			
 
				 
			
 
				+
			
 
				+@Online{ftp-igsr,
			
 
				+  date  = {2022-11-10},
			
 
				+  title = {IGSR: The International Genome Sample Resource},
			
 
				+  url   = {https://ftp.1000genomes.ebi.ac.uk},
			
 
				+}
			
 
				+
			
 
				+@Online{ftp-ncbi,
			
 
				+  date  = {2022-11-01},
			
 
				+  title = {NCBI National Center for Biotechnology Information},
			
 
				+  url   = {https://ftp.ncbi.nlm.nih.gov/genomes/},
			
 
				+}
			
 
				+
			
 
				+@Online{ftp-ensembl,
			
 
				+  date  = {2022-10-15},
			
 
				+  title = {ENSEMBL Rapid Release},
			
 
				+  url   = {https://ftp.ensembl.org},
			
 
				+}
			
 
				+
			
 
				+@Book{cthreading,
			
 
				+  author    = {Quinn, Michael J.},
			
 
				+  title     = {Parallel Programming in C with MPI and OpenMP},
			
 
				+  isbn      = {0071232656},
			
 
				+  publisher = {McGraw-Hill Education Group},
			
 
				+  year      = {2003},
			
 
				+}
			
 
				+
			
 
				+@Online{geco-repo,
			
 
				+  author = {Cobilab},
			
 
				+  date   = {2022-11-19},
			
 
				+  title  = {Repositories for the three versions of GeCo},
			
 
				+  url    = {https://github.com/cobilab},
			
 
				+}
			
 
				+
			
 
				+@Online{code-analysis,
			
 
				+  author = {Ryan Dewhurst},
			
 
				+  date   = {2022-11-20},
			
 
				+  editor = {Kirsten S and Nick Bloor and Sarah Baso and James Bowie and Evgeniy Ryzhkov and Iberiam and Ann Campbell and Jonathan Marcil and Christina Schelin and Jie Wang and Fabian and Achim and Dirk Wetter},
			
 
				+  title  = {Static Code Analysis},
			
 
				+  url    = {https://owasp.org/www-community/controls/Static_Code_Analysis},
			
 
				+}
			
 
				+
			
 
				+@Online{gpl,
			
 
				+  title = {GNU General Public License},
			
 
				+  url   = {http://www.gnu.org/licenses/gpl-3.0.html},
			
 
				+}
			
 
				+
			
 
				+@Online{mitlic,
			
 
				+  title = {MIT License},
			
 
				+  url   = {https://spdx.org/licenses/MIT.html},
			
 
				+}
			
 
				+
			
 
				+@Online{bam,
			
 
				+  author  = {{The SAM/BAM Format Specification Working Group}},
			
 
				+  date    = {2022-08-22},
			
 
				+  title   = {Sequence Alignment/Map Format Specification},
			
 
				+  url     = {https://github.com/samtools/hts-specs},
			
 
				+  urldate = {2022-09-12},
			
 
				+  version = {44b4167},
			
 
				+}
			
 
				+
			
 
				+@Online{ucsc,
			
 
				+  author  = {{UCSC - University of California, Santa Cruz}},
			
 
				+  date    = {2022-10-28},
			
 
				+  title   = {UCSC Genome Browser},
			
 
				+  url     = {https://genome.ucsc.edu/},
			
 
				+  urldate = {2022-10-28},
			
 
				+}
			
 
				+
			
 
				+@Online{ensembl,
			
 
				+  author = {P. Flicek},
			
 
				+  date   = {2022-10-24},
			
 
				+  title  = {ENSEMBL Project},
			
 
				+  url    = {http://www.ensembl.org/},
			
 
				+}
			
 
				+
			
 
				+@Online{ga4gh,
			
 
				+  date  = {2022-10-10},
			
 
				+  title = {Global Alliance for Genomics and Health},
			
 
				+  url   = {https://github.com/samtools/hts-specs},
			
 
				+}
			
 
				+
			
 
				+@Online{bed,
			
 
				+  author = {{Sanger Institute, Genome Research Limited}},
			
 
				+  date   = {2022-10-20},
			
 
				+  title  = {BED Browser Extensible Data},
			
 
				+  url    = {https://samtools.github.io/hts-specs/BEDv1.pdf},
			
 
				+}
			
 
				+
			
 
				+@Online{illufastq,
			
 
				+  author = {Illumina},
			
 
				+  date   = {2022-11-17},
			
 
				+  title  = {Illumina {FASTQ} file structure explained},
			
 
				+  url    = {https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html},
			
 
				+}
			
 
				+
			
 
				+@Online{twobit,
			
 
				+  date   = {2022-09-22},
			
 
				+  editor = {{UCSC University of California Santa Cruz}},
			
 
				+  title  = {TwoBit File Format},
			
 
				+  url    = {https://genome-source.gi.ucsc.edu/gitlist/kent.git/raw/master/src/inc/twoBit.h},
			
 
				+}
			
 
				 @Comment{jabref-meta: databaseType:biblatex;}
			
 
				+