%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%-
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}S STOCHASTICAL ATTRIBUTES
%- IMPACT ON COMPRESSION

\chapter{The Structure of the Human Genome and How its Digital Form is Compressed}
\section{Structure of Human \acs{DNA}}
To clarify how and where biological information is stored, this section starts with a brief, general overview of the structure of living organisms.\\

\begin{figure}[ht]
\centering
\includegraphics[width=6cm]{k2/cell.png}
\caption{A simplified representation of where the genome is physically located, showing a double helix (bottom), a chromosome (upper right) and a cell (upper center).}
\label{k2:gene-overview}
\end{figure}

All living organisms, such as plants and animals, are made of cells. To give a rough impression, a human body consists of trillions of cells. A cell is the smallest unit of life. Most cells consist of an outer section and a core, which is called the nucleus. In Figure~\ref{k2:gene-overview}, the nucleus is illustrated as a purple, roughly circular shape inside a lighter circle. The nucleus contains chromosomes, which hold the genetic information about their organism in the form of \ac{DNA} \cite{cells}.\\

\acs{DNA} is often depicted as a double helix, as shown in Figure~\ref{k2:dna-struct}. A double helix consists, as the name suggests, of two single helices \cite{dna_structure}.

\begin{figure}[ht]
\centering
\includegraphics[width=15cm]{k2/dna.png}
\caption{A purely diagrammatic figure of the components that \acs{DNA} is made of.
The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-sugar chains \cite{dna_structure}.}
\label{k2:dna-struct}
\end{figure}

Each of the two helices consists of two main components: the sugar-phosphate backbone, which is not relevant for this work, and the bases. The sugar-phosphate backbones are illustrated as flat ribbons winding around the horizontal axis in Figure~\ref{k2:dna-struct}. Pairs of bases are symbolized as vertical bars between the sugar-phosphate backbones. The arrangement of the bases represents the information stored in the \acs{DNA}. What is described here as a base is an organic molecule; together with its sugar and phosphate group it forms a nucleotide, and this work uses the two terms interchangeably \cite{dna_structure}.\\

For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can occur in one of four forms: adenine, thymine, guanine or cytosine. Each of them has a counterpart with which it can bond: adenine bonds with thymine, and guanine bonds with cytosine.\\

From the perspective of a computer scientist, only the content of one helix must be stored to preserve the full information. In more practical terms, the nucleotides of a single (entire) helix need to be stored physically to save the information of the whole \acs{DNA}. The other half can be determined by ``inverting'' the stored one, that is, by replacing each nucleotide with its counterpart.
% todo OPT -> figure?
For example, the counterpart of the chain \texttt{adenine, guanine, adenine} is the chain \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. The example thus becomes \texttt{AGA} in one helix and \texttt{TCT} in the other.\\

This representation is commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required, but for now `A', `C', `G' and `T' are the only concern.
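
The ``inversion'' of one helix into the other can also be written as a mapping on the four-letter alphabet. The following is only a minimal sketch: the symbols $\Sigma$, $c$ and $\overline{s}$ are notation introduced here for illustration and are not taken from the cited sources.
\[
\begin{array}{rcl}
\Sigma & = & \{\texttt{A}, \texttt{C}, \texttt{G}, \texttt{T}\} \\[2pt]
c(\texttt{A}) = \texttt{T}, \quad c(\texttt{T}) = \texttt{A}, & & c(\texttt{G}) = \texttt{C}, \quad c(\texttt{C}) = \texttt{G} \\[2pt]
\overline{s} & = & c(s_1)\, c(s_2) \cdots c(s_n) \quad \mbox{for } s = s_1 s_2 \cdots s_n,\ s_i \in \Sigma
\end{array}
\]
Applied to the example above, this yields
\[
\overline{\texttt{AGA}} = c(\texttt{A})\, c(\texttt{G})\, c(\texttt{A}) = \texttt{TCT}.
\]
Because $c(c(x)) = x$ for every $x \in \Sigma$, applying the mapping twice returns the original chain. Under this sketch, storing one helix is therefore sufficient to reconstruct the other.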