- %SUMMARY
- %- ABSTRACT
- %- INTRODUCTION
- %# BASICS
- %- \acs{DNA} STRUCTURE
- %- DATA TYPES
- % - BAM/FASTQ
- % - NON STANDARD
- %- COMPRESSION APPROACHES
- % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
- % - HUFFMAN ENCODING
- % - PROBABILITY APPROACHES (WITH BASE?)
- %
- %# COMPARING TOOLS
- %-
- %# POSSIBLE IMPROVEMENT
- %- \acs{DNA}S STOCHASTICAL ATTRIBUTES
- %- IMPACT ON COMPRESSION
- \chapter{The Structure of the Human Genome and How Its Digital Form Is Compressed}
- \section{Structure of Human \acs{DNA}}
- To clarify how and where biological information is stored, this section starts with a brief, general overview of the structure of living organisms.\\
- \begin{figure}[ht]
- \centering
- \includegraphics[width=6cm]{k2/cell.png}
- \caption{A simplified representation of where the genome is physically located, showing a double helix (bottom), a chromosome (upper right) and a cell (upper center).}
- \label{k2:gene-overview}
- \end{figure}
- All living organisms, such as plants and animals, are made of cells (a human body consists of several trillion cells) \cite{cells}.\\
- A cell is itself the smallest possible living unit. In simplified terms, it consists of two layers, of which the inner one is called the nucleus. The nucleus contains the chromosomes, and those chromosomes hold the genetic information in the form of \ac{DNA}.
-
- \acs{DNA} is often depicted as a double helix. A double helix consists, as the name suggests, of two single helices.
- \begin{figure}[ht]
- \centering
- \includegraphics[width=15cm]{k2/dna.png}
- \caption{A purely diagrammatic figure of the components \acs{DNA} is made of. The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-sugar chains \cite{dna_structure}.}
- \label{k2:dna-struct}
- \end{figure}
- Each helix consists of two main components: the sugar-phosphate backbone, which is not relevant for this work, and the bases. The arrangement of the bases represents the information stored in the \acs{DNA}. A base is an organic molecule; in this work, the bases are referred to as nucleotides \cite{dna_structure}.\\
- % describe Genomes?
- For this work, nucleotides are the most important part of the \acs{DNA}. A nucleotide occurs in one of four forms: adenine, thymine, guanine or cytosine. Each of them has a counterpart with which it can form a bond: adenine bonds with thymine, and guanine bonds with cytosine.\\
- From the perspective of a computer scientist, only the content of one helix must be stored to persist the full information. In more practical terms: only the nucleotides of one (entire) helix need to be stored physically to save the information of the whole \acs{DNA}, because the other half can be determined by ``inverting'' the stored one, that is, by replacing each nucleotide with its counterpart. For example, the counterpart of the chain \texttt{adenine, guanine, adenine} is the chain \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. The example therefore becomes \texttt{AGA} in one helix and \texttt{TCT} in the other.\\
- This representation is commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required, but for now \texttt{A}, \texttt{C}, \texttt{G} and \texttt{T} are the only concern.
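
The ``inverting'' step described above can be illustrated with a minimal Python sketch. The names \texttt{COMPLEMENT} and \texttt{complement} are hypothetical and not part of this work. The sketch assumes the input contains only the four initials introduced above, and it maps each position to its counterpart exactly as in the \texttt{AGA}/\texttt{TCT} example, without considering the reading direction of the second helix.

```python
# Hypothetical helper for illustration only; not taken from any tool
# discussed in this work.

# Translation table mapping each nucleotide initial to its counterpart:
# A <-> T and G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the counterpart helix of `strand`, position by position."""
    # Assumption: only the four initials A, C, G and T are allowed.
    unknown = set(strand) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return strand.translate(COMPLEMENT)


print(complement("AGA"))  # prints TCT
```

Applying \texttt{complement} twice returns the original chain, which is why storing a single helix is sufficient to recover both.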