
Category Archives: Genome

Xenopus tropicalis v4.1 – JGI Genome Portal – Home

Posted: January 30, 2017 at 7:43 pm

Xenopus tropicalis is a unique resource for two critical areas in vertebrate biology: early embryonic development and cell biology. In the former, Xenopus laevis has led the way in identifying the mechanisms of early fate decisions, patterning of the basic vertebrate body plan, and early organogenesis. Contributions in cell biology and biochemistry include seminal work on chromosome replication, chromatin and nuclear assembly, control of cell cycle components, in vitro reconstruction of cytoskeletal element dynamics, and signaling pathways. In fact, Xenopus has become a major vertebrate model for the cellular and developmental biology research that is supported by most of the Institutes of the NIH.

In March of 2002 the Joint Genome Institute convened a meeting at the Production Genomics Facility in Walnut Creek, California. Leading researchers in the Xenopus research community were invited to discuss the overall planning and strategies for sequencing the Xenopus tropicalis genome.

BAC Sequencing for the X. tropicalis Community

The JGI has now closed the nominations for additional BAC sequencing. Please check the BAC sequencing status link for current sequence information.

Individual BACs may be obtained from the BAC/PAC Resource Center at Children's Hospital Oakland Research Institute

A Xenopus tropicalis genome project advisory board was set up to ensure an accurate and timely exchange of information between the JGI and the research community.

Genome Project Notes

X. tropicalis genome assembly 4.1 is the fourth in a series of preliminary assembly releases that are planned as part of the ongoing X. tropicalis genome project. To date, approximately 22.5 million paired end sequencing reads have been produced from libraries containing a range of insert sizes.

Our goal is to make the genome sequence of Xenopus widely and rapidly available to the scientific community. We endorse the principles for the distribution and use of large scale sequencing data adopted by the larger genome sequencing community and urge users of this data to follow them. It is our intention to publish the work of this project in a timely fashion and we welcome collaborative interaction on the project and analyses.

Continue reading here:
Xenopus tropicalis v4.1 - JGI Genome Portal - Home

Posted in Genome | Comments Off on Xenopus tropicalis v4.1 – JGI Genome Portal – Home

Dog Genome Project | Broad Institute

Posted: December 14, 2016 at 11:45 pm

The genome of the domesticated dog, a close evolutionary relative of humans, is a powerful new tool for understanding the human genome. Comparison of the dog with human and other mammals reveals key information about the structure and evolution of genes and genomes. The unique breeding history of dogs, with their extraordinary behavioral and physical diversity, offers the opportunity to find important genes underlying diseases shared between dogs and humans, such as cancer, diabetes, and epilepsy.

The Canine Genome Sequencing Project produced a high-quality draft sequence of a female boxer named Tasha. By comparing Tasha with many other breeds, the project also compiled a comprehensive set of SNPs (single nucleotide polymorphisms) useful in all dog breeds. These closely spaced genomic landmarks are critical for disease mapping. By comparing the dog, rodent, and human lineages, researchers at the Broad Institute uncovered exciting new information about human genes, their evolution, and the regulatory mechanisms governing their expression. Using SNPs, researchers describe the strikingly different haplotype structure in dog breeds compared with the entire dog population. In addition, they show that by understanding the patterns of variation in dog breeds, scientists can design powerful gene mapping experiments for complex diseases that are difficult to map in human populations.

Questions or comments, please email dog-info@broadinstitute.org .

Read more here:
Dog Genome Project | Broad Institute

Posted in Genome | Comments Off on Dog Genome Project | Broad Institute

Whole genome sequencing – Wikipedia

Posted: November 12, 2016 at 5:20 pm

"Genome sequencing" redirects here. For the sequencing only of DNA, see DNA sequencing.

Whole genome sequencing (also known as WGS, full genome sequencing, complete genome sequencing, or entire genome sequencing) is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.

Whole genome sequencing should not be confused with DNA profiling, which only determines the likelihood that genetic material came from a particular individual or group, and does not contain additional information on genetic relationships, origin or susceptibility to specific diseases.[2] Also unlike full genome sequencing, SNP genotyping covers less than 0.1% of the genome. Almost all truly complete genomes are of microbes; the term "full genome" is thus sometimes used loosely to mean "greater than 95%". The remainder of this article focuses on nearly complete human genomes.

High-throughput genome sequencing technologies have largely been used as a research tool and are currently being introduced into the clinic.[3][4][5] In the future of personalized medicine, whole genome sequence data will be an important tool to guide therapeutic intervention.[6] Gene sequencing at the SNP level is also used to pinpoint functional variants from association studies and to improve the knowledge available to researchers interested in evolutionary biology, and hence may lay the foundation for predicting disease susceptibility and drug response.

The shift from manual DNA sequencing methods such as Maxam-Gilbert sequencing and Sanger sequencing in the 1970s and 1980s to more rapid, automated sequencing methods in the 1990s played a crucial role in giving scientists the ability to sequence whole genomes.[8] Haemophilus influenzae, a commensal bacterium that resides in the human respiratory tract, was the first organism to have its entire genome sequenced; the genome was published in 1995.[9] The genomes of H. influenzae, other bacteria, and some archaea were the first to be sequenced, largely because of their small genome size; H. influenzae has a genome of 1,830,140 base pairs of DNA.[9] In contrast, eukaryotes, both unicellular and multicellular (such as Amoeba dubia and humans, Homo sapiens, respectively), have much larger genomes (see C-value paradox).[10] Amoeba dubia has a genome of 700 billion nucleotide pairs spread across thousands of chromosomes.[11] Humans have fewer nucleotide pairs (about 3.2 billion in each germ cell; the exact size of the human genome is still being revised) than A. dubia, but their genome size far outweighs that of individual bacteria.[12]

The first bacterial and archaeal genomes, including that of H. influenzae, were sequenced by shotgun sequencing.[9] In 1996, the first eukaryotic genome (that of the yeast Saccharomyces cerevisiae) was sequenced. S. cerevisiae, a model organism in biology, has a genome of only around 12 million nucleotide pairs[13] and was the first unicellular eukaryote to have its whole genome sequenced. The first multicellular eukaryote, and animal, to have its whole genome sequenced was the nematode worm Caenorhabditis elegans, in 1998.[14] Eukaryotic genomes are sequenced by several methods, including shotgun sequencing of short DNA fragments and sequencing of larger DNA clones from DNA libraries (see library (biology)) such as bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs).[15]

In 1999, the entire DNA sequence of human chromosome 22, the shortest human autosome, was published.[16] By the year 2000, the second animal and second invertebrate (yet first insect) genome was sequenced - that of the fruit fly Drosophila melanogaster - a popular choice of model organism in experimental research.[17] The first plant genome - that of the model organism Arabidopsis thaliana - was also fully sequenced by 2000.[18] By 2001, a draft of the entire human genome sequence was published.[19] The genome of the laboratory mouse Mus musculus was completed in 2002.[20]

In 2004, the Human Genome Project published the human genome.[21]

Currently, thousands of genomes have been sequenced.

Almost any biological sample containing a full copy of the DNA, even a very small amount of DNA or ancient DNA, can provide the genetic material necessary for full genome sequencing. Such samples may include saliva, epithelial cells, bone marrow, hair (as long as the hair contains a hair follicle), seeds, plant leaves, or anything else that has DNA-containing cells.

The genome sequence of a single cell selected from a mixed population of cells can be determined using techniques of single cell genome sequencing. This has important advantages in environmental microbiology in cases where a single cell of a particular microorganism species can be isolated from a mixed population by microscopy on the basis of its morphological or other distinguishing characteristics. In such cases the normally necessary steps of isolation and growth of the organism in culture may be omitted, thus allowing the sequencing of a much greater spectrum of organism genomes.[22]

Single cell genome sequencing is being tested as a method of preimplantation genetic diagnosis, wherein a cell from the embryo created by in vitro fertilization is taken and analyzed before embryo transfer into the uterus.[23] After implantation, cell-free fetal DNA can be taken by simple venipuncture from the mother and used for whole genome sequencing of the fetus.[24]

Sequencing of nearly an entire human genome was first accomplished in 2000, partly through the use of shotgun sequencing technology. While full genome shotgun sequencing for small (4,000 to 7,000 base pair) genomes was already in use in 1979,[25] broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.
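To make the pairing constraint concrete, the short sketch below checks whether two mapped read ends look like a consistent pair: opposite orientations, separated by roughly one insert length. It is a minimal illustration in Python; the positions, strand symbols, and the expected insert size and tolerance are illustrative assumptions rather than parameters of any particular aligner.

```python
# Minimal sketch of the paired-end consistency check described above.
# Coordinates, orientations, and the expected insert-size range are
# illustrative assumptions, not taken from any particular aligner.

def is_proper_pair(pos1, strand1, pos2, strand2,
                   expected_insert=500, tolerance=100):
    """Return True if two mapped read ends look like a consistent pair:
    opposite orientations, separated by roughly one insert length."""
    if strand1 == strand2:          # paired ends should point toward each other
        return False
    observed_insert = abs(pos2 - pos1)
    return abs(observed_insert - expected_insert) <= tolerance

# Example: reads mapped 480 bp apart on opposite strands are accepted.
print(is_proper_pair(10_000, "+", 10_480, "-"))   # True
print(is_proper_pair(10_000, "+", 13_000, "-"))   # False (insert too large)
```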

The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT locus,[26] although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991.[27] In 1995 the innovation of using fragments of varying sizes was introduced,[28] and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire genome of the bacterium Haemophilus influenzae in 1995,[29] and then by Celera Genomics to sequence the entire fruit fly genome in 2000,[30] and subsequently the entire human genome. Applied Biosystems, now called Life Technologies, manufactured the automated capillary sequencers utilized by both Celera Genomics and The Human Genome Project.

While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still too expensive and takes too long for commercial purposes. Since 2005 capillary sequencing has been progressively displaced by next-generation sequencing technologies such as Illumina dye sequencing, pyrosequencing, and SMRT sequencing.[31] All of these technologies continue to employ the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation.

Other technologies are emerging, including nanopore technology. Though nanopore sequencing technology is still being refined, its portability and potential capability of generating long reads are of relevance to whole-genome sequencing applications.[32]

In principle, full genome sequencing can provide raw data on all six billion nucleotides in an individual's DNA. However, it does not provide an analysis of what that information means or how it might be utilized in various clinical applications, such as in medicine to help prevent disease. Work toward that goal is continuously moving forward.

Because sequencing generates a lot of data (for example, there are approximately six billion base pairs in each human diploid genome), its output is stored electronically and requires a large amount of computing power and storage capacity. Full genome sequencing would have been nearly impossible before the advent of the microprocessor, computers, and the Information Age.
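As a rough illustration of why storage becomes a concern, the back-of-the-envelope estimate below compares the size of an assembled diploid genome with the raw reads needed to produce it. The figures assumed here (2 bits per base for assembled sequence, roughly 2 bytes per sequenced base for FASTQ with quality scores, 30-fold coverage) are common ballpark values, not numbers taken from the article.

```python
# Back-of-the-envelope storage estimate for a human diploid genome.
# Assumptions (not from the article): 2 bits/base for the assembled
# sequence, ~2 bytes per sequenced base for FASTQ (base + quality), 30x coverage.

DIPLOID_BASES = 6e9
COVERAGE = 30

assembled_bytes = DIPLOID_BASES * 2 / 8        # 2 bits per base
fastq_bytes = DIPLOID_BASES * COVERAGE * 2     # ~2 bytes per sequenced base

print(f"Assembled genome: ~{assembled_bytes / 1e9:.1f} GB")   # ~1.5 GB
print(f"Raw 30x FASTQ:    ~{fastq_bytes / 1e12:.1f} TB")      # ~0.4 TB
```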

A 2015 study[33] done at Children's Mercy Hospital in Kansas City detailed the use of full genome sequencing including full analysis. The process took a record-breaking 26 hours[34] and was done using Illumina HiSeq machines, the Edico Genome Dragen Processor, and several custom-designed software packages. Most of this acceleration was achieved using the newly developed Dragen Processor, which brought the analysis time down from 15 hours to 40 minutes.

A number of public and private companies are competing to develop a full genome sequencing platform that is commercially robust for both research and clinical use,[35] including Illumina,[36] Knome,[37] Sequenom,[38] 454 Life Sciences,[39] Pacific Biosciences,[40] Complete Genomics,[41] Helicos Biosciences,[42] GE Global Research (General Electric), Affymetrix, IBM, Intelligent Bio-Systems,[43] Life Technologies and Oxford Nanopore Technologies.[44] These companies are heavily financed and backed by venture capitalists, hedge funds, and investment banks.[45][46]

In October 2006, the X Prize Foundation, working in collaboration with the J. Craig Venter Science Foundation, established the Archon X Prize for Genomics,[47] intending to award US$10 million to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $1,000 per genome".[48] An error rate of 1 in 1,000,000 bases, out of a total of approximately six billion bases in the human diploid genome, would mean about 6,000 errors per genome. The error rates required for widespread clinical use, such as predictive medicine,[49] are currently set by over 1,400 clinical single gene sequencing tests[50] (for example, errors in the BRCA1 gene for breast cancer risk analysis).
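The 6,000-errors-per-genome figure follows directly from the prize's error ceiling; the short calculation below simply re-derives it from the numbers quoted above.

```python
# Arithmetic behind the ~6,000 errors/genome figure quoted above.
error_rate = 1 / 1_000_000        # at most one error per million bases
diploid_genome_bases = 6e9        # approximate size of a human diploid genome

expected_errors = error_rate * diploid_genome_bases
print(f"Maximum errors per genome: {expected_errors:,.0f}")   # 6,000
```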

The Archon X Prize for Genomics was cancelled in 2013, before its official start date.[51][52]

In 2007, Applied Biosystems started selling a new type of sequencer called SOLiD System.[53] The technology allowed users to sequence 60 gigabases per run.[54]

In June 2009, Illumina announced that they were launching their own Personal Full Genome Sequencing Service at a depth of 30× for $48,000 per genome.[55][56]

In August 2009, the founder of Helicos Biosciences, Stephen Quake, stated that using the company's Single Molecule Sequencer he sequenced his own full genome for less than $50,000.[57]

In November 2009, Complete Genomics published a peer-reviewed paper in Science demonstrating its ability to sequence a complete human genome for $1,700.[58][59]

In May 2011, Illumina lowered its Full Genome Sequencing service to $5,000 per human genome, or $4,000 if ordering 50 or more.[60] Helicos Biosciences, Pacific Biosciences, Complete Genomics, Illumina, Sequenom, ION Torrent Systems, Halcyon Molecular, NABsys, IBM, and GE Global appear to all be going head to head in the race to commercialize full genome sequencing.[31][61]

A series of publications in 2012 showed the utility of SMRT sequencing from Pacific Biosciences in generating full genome sequences with de novo assembly.[62]

With sequencing costs declining, a number of companies began claiming that their equipment would soon achieve the $1,000 genome: these companies included Life Technologies in January 2012,[63] Oxford Nanopore Technologies in February 2012[64] and Illumina in February 2014.[65][66]

However, as of 2015, the NHGRI estimates the cost of obtaining a whole-genome sequence at around $1,500.[67]

Full genome sequencing provides information on a genome that is orders of magnitude larger than that provided by the previous leader in genotyping technology, DNA arrays. For humans, DNA arrays currently provide genotypic information on up to one million genetic variants,[68][69][70] while full genome sequencing will provide information on all six billion bases in the human genome, or 3,000 times more data. Because of this, full genome sequencing is considered a disruptive innovation to the DNA array markets, as the accuracy of both ranges from 99.98% to 99.999% (in non-repetitive DNA regions) and the consumables cost of $5,000 per 6 billion base pairs is competitive (for some applications) with DNA arrays ($500 per 1 million base pairs).[39] Agilent, another established DNA array manufacturer, is working on targeted (selective region) genome sequencing technologies.[71] It is thought that Affymetrix, the pioneer of array technology in the 1990s, has fallen behind due to significant corporate and stock turbulence and is currently not working on any known full genome sequencing approach.[72][73][74] It is unknown what will happen to the DNA array market once full genome sequencing becomes commercially widespread, especially as companies and laboratories providing this disruptive technology start to realize economies of scale. It is postulated, however, that this new technology may significantly diminish the total market size for arrays and any other sequencing technology once it becomes commonplace for individuals and newborns to have their full genomes sequenced.[75]
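The comparison above can be sanity-checked with a short calculation. Note that the quoted 3,000-fold data ratio appears to compare a haploid genome (about 3 billion bases) with the roughly one million variants on an array; that interpretation, and the per-megabase cost breakdown, are assumptions made for illustration.

```python
# Sanity check of the array-vs-sequencing figures quoted above.
# The 3,000x ratio appears to compare a haploid genome (~3 billion bases)
# with ~1 million array variants; costs are the consumables figures from the text.

array_variants = 1e6                 # variants genotyped per array
haploid_bases = 3e9                  # haploid human genome
wgs_cost, wgs_bases = 5000, 6e9      # $5,000 per 6 billion sequenced bases
array_cost, array_bases = 500, 1e6   # $500 per 1 million genotyped bases

print(f"Data ratio: {haploid_bases / array_variants:,.0f}x")                  # 3,000x
print(f"WGS consumables:   ${wgs_cost / (wgs_bases / 1e6):.2f} per Mb")       # ~$0.83/Mb
print(f"Array consumables: ${array_cost / (array_bases / 1e6):.2f} per Mb")   # $500/Mb
```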

Whole genome sequencing has established the mutation frequency for whole human genomes. The mutation frequency in the whole genome between generations for humans (parent to child) is about 70 new mutations per generation.[76][77] An even lower level of variation was found by comparing whole genome sequences from the blood cells of a pair of 100-year-old monozygotic (identical) twins.[78] Only 8 somatic differences were found, though somatic variation occurring in less than 20% of blood cells would be undetected.

In the specifically protein coding regions of the human genome, it is estimated that there are about 0.35 mutations that would change the protein sequence between parent/child generations (less than one mutated protein per generation).[79]

In cancer, mutation frequencies are much higher, due to genome instability. This frequency can further depend on patient age, exposure to DNA damaging agents (such as UV irradiation or components of tobacco smoke) and the activity or inactivity of DNA repair mechanisms.[80] Furthermore, mutation frequency can vary between cancer types: in germline cells, the mutation rate is approximately 0.023 mutations per megabase, but this number is much higher in breast cancer (1.18-1.66 mutations per Mb), in lung cancer (17.7) or in melanomas (~33).[81]
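To put the per-megabase cancer figures on the same scale as the roughly 70 germline mutations per generation quoted above, the sketch below converts them to approximate whole-genome counts. The genome sizes used, and the midpoint value taken for breast cancer, are illustrative assumptions rather than values from the cited studies.

```python
# Illustrative conversion of the mutation figures quoted above.
# Genome sizes are the approximate values used elsewhere in this article;
# the breast cancer value is a midpoint of the quoted 1.18-1.66 range.

diploid_bases = 6.4e9
haploid_mb = 3_200                      # ~3,200 megabases per haploid genome

germline_mutations_per_generation = 70
per_base_rate = germline_mutations_per_generation / diploid_bases
print(f"Germline rate: ~{per_base_rate:.1e} mutations per base per generation")  # ~1.1e-08

for cancer, per_mb in {"breast": 1.4, "lung": 17.7, "melanoma": 33}.items():
    print(f"{cancer}: ~{per_mb * haploid_mb:,.0f} somatic mutations per genome")
```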

Inexpensive, time-efficient full genome sequencing will be a major accomplishment not only for the field of genomics, but for the entire human civilization because, for the first time, individuals will be able to have their entire genome sequenced. Utilizing this information, it is speculated that health care professionals, such as physicians and genetic counselors, will eventually be able to use genomic information to predict what diseases a person may get in the future and attempt to either minimize the impact of that disease or avoid it altogether through the implementation of personalized, preventive medicine. Full genome sequencing will allow health care professionals to analyze the entire human genome of an individual and therefore detect all disease-related genetic variants, regardless of the genetic variant's prevalence or frequency. This will enable the rapidly emerging medical fields of predictive medicine and personalized medicine and will mark a significant leap forward for the clinical genetic revolution. Full genome sequencing is clearly of great importance for research into the basis of genetic disease and has shown significant benefit to a subset of individuals with rare disease in the clinical setting.[82][83][84][85] Illumina's CEO, Jay Flatley, stated in February 2009 that "A complete DNA read-out for every newborn will be technically feasible and affordable in less than five years, promising a revolution in healthcare" and that "by 2019 it will have become routine to map infants' genes when they are born".[86] This potential use of genome sequencing is highly controversial, as it runs counter to established ethical norms for predictive genetic testing of asymptomatic minors that have been well established in the fields of medical genetics and genetic counseling.[87][88][89][90] The traditional guidelines for genetic testing have been developed over the course of several decades since it first became possible to test for genetic markers associated with disease, prior to the advent of cost-effective, comprehensive genetic screening. It is established that norms, such as in the sciences and the field of genetics, are subject to change and evolve over time.[91][92] It is unknown whether traditional norms practiced in medical genetics today will be altered by new technological advancements such as full genome sequencing.

In March 2010, researchers from the Medical College of Wisconsin announced the first successful use of whole-genome sequencing to inform the treatment of a patient.[93][94]

Currently available newborn screening for childhood diseases allows detection of rare disorders that can be prevented or better treated by early detection and intervention. Specific genetic tests are also available to determine an etiology when a child's symptoms appear to have a genetic basis. Full genome sequencing, in addition, has the potential to reveal a large amount of information (such as carrier status for autosomal recessive disorders, genetic risk factors for complex adult-onset diseases, and other predictive medical and non-medical information) that is currently not completely understood, may not be clinically useful to the child during childhood, and may not necessarily be wanted by the individual upon reaching adulthood.[95] In addition to predicting disease risk in childhood, genetic testing may have other benefits (such as discovery of non-paternity) but may also have potential downsides (genetic discrimination, loss of anonymity, and psychological impacts).[96] Many publications regarding ethical guidelines for predictive genetic testing of asymptomatic minors may therefore have more to do with protecting minors and preserving the individual's privacy and autonomy to know or not to know their genetic information, than with the technology that makes the tests themselves possible.[97]

Due to recent cost reductions (see above), whole genome sequencing has become a realistic application in DNA diagnostics. In 2013, the 3Gb-TEST consortium obtained funding from the European Union to prepare the health care system for these innovations in DNA diagnostics.[98][99] Quality assessment schemes, health technology assessments and guidelines have to be in place. The 3Gb-TEST consortium has identified the analysis and interpretation of sequence data as the most complicated step in the diagnostic process.[100] At the consortium meeting in Athens in September 2014, the consortium coined the word genotranslation for this crucial step. This step leads to a so-called genoreport. Guidelines are needed to determine the required content of these reports.

The majority of ethicists insist that the privacy of individuals undergoing genetic testing must be protected under all circumstances.[101] Data obtained from whole genome sequencing can not only reveal much information about the individual who is the source of the DNA, but can also reveal much probabilistic information about the DNA sequence of close genetic relatives.[102] Furthermore, such data can reveal much useful predictive information about relatives' present and future health risks.[103] This raises important questions about what obligations, if any, are owed to the family members of the individuals who are undergoing genetic testing. In Western/European societies, tested individuals are usually encouraged to share important information about a genetic diagnosis with their close relatives, since the importance of the diagnosis for offspring and other close relatives is usually one of the reasons for seeking genetic testing in the first place.[101] Nevertheless, a major ethical dilemma can develop when patients refuse to share information about a diagnosis of a serious genetic disorder that is highly preventable and where there is a high risk to relatives carrying the same disease mutation.[102] Under such circumstances, the clinician may suspect that the relatives would rather know of the diagnosis, and hence the clinician can face a conflict of interest with respect to patient-doctor confidentiality.[102]

Another major privacy concern is the scientific need to put information on patients' genotypes and phenotypes into public scientific databases, such as locus-specific databases.[102] Although only anonymous patient data are submitted to locus-specific databases, patients might still be identifiable by their relatives in the case of a rare disease or a rare missense mutation.[102]

The first nearly complete human genomes sequenced were those of two Caucasians in 2007 (J. Craig Venter at 7.5-fold coverage[104][105][106] and James Watson at 7.4-fold).[107][108][109] This was followed in 2008 by sequencing of an anonymous Han Chinese man (at 36-fold),[110] a Yoruban man from Nigeria (at 30-fold),[111] and a female Caucasian leukemia patient (at 33- and 14-fold coverage for tumor and normal tissues).[112] Steve Jobs was among the first 20 people to have their whole genome sequenced, reportedly for the cost of $100,000.[113] As of June 2012, there were 69 nearly complete human genomes publicly available.[114] Commercialization of full genome sequencing is in an early stage and growing rapidly.

Continue reading here:
Whole genome sequencing - Wikipedia

Posted in Genome | Comments Off on Whole genome sequencing – Wikipedia


Genome – Wikipedia

Posted: October 23, 2016 at 4:19 am

In modern molecular biology and genetics, a genome is the genetic material of an organism. It consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and the noncoding DNA,[1] as well as the genomes of the mitochondria[2] and chloroplasts.

The term genome was created in 1920 by Hans Winkler,[3] professor of botany at the University of Hamburg, Germany. The Oxford Dictionary suggests the name is a blend of the words gene and chromosome;[4] however, see omics for a more thorough discussion. A few related -ome words already existed, such as biome and rhizome, forming a vocabulary into which genome fits systematically.[5]

Some organisms have multiple copies of chromosomes: diploid, triploid, tetraploid and so on. In classical genetics, in a sexually reproducing organism (typically eukarya) the gamete has half the number of chromosomes of the somatic cell, and the genome is a full set of chromosomes in a diploid cell. The halving of the genetic material in gametes is accomplished by the segregation of homologous chromosomes during meiosis.[6] In haploid organisms, including cells of bacteria and archaea, in organelles such as mitochondria and chloroplasts, and in viruses that similarly contain genes, the single chain or set of circular or linear chains of DNA (or RNA for some viruses) likewise constitutes the genome. The term genome can be applied specifically to mean what is stored on a complete set of nuclear DNA (i.e., the "nuclear genome") but can also be applied to what is stored within organelles that contain their own DNA, as with the "mitochondrial genome" or the "chloroplast genome". Additionally, the genome can comprise non-chromosomal genetic elements such as viruses, plasmids, and transposable elements.[7]

Typically, when it is said that the genome of a sexually reproducing species has been "sequenced", it refers to a determination of the sequences of one set of autosomes and one of each type of sex chromosome, which together represent both of the possible sexes. Even in species that exist in only one sex, what is described as a "genome sequence" may be a composite read from the chromosomes of various individuals. Colloquially, the phrase "genetic makeup" is sometimes used to signify the genome of a particular individual or organism.[citation needed] The study of the global properties of genomes of related organisms is usually referred to as genomics, which distinguishes it from genetics which generally studies the properties of single genes or groups of genes.

Both the number of base pairs and the number of genes vary widely from one species to another, and there is only a rough correlation between the two (an observation known as the C-value paradox). At present, the highest known number of genes is around 60,000, for the protozoan causing trichomoniasis (see List of sequenced eukaryotic genomes), almost three times as many as in the human genome.

An analogy to the human genome stored on DNA is that of instructions stored in a book.

In 1976, Walter Fiers at the University of Ghent (Belgium) was the first to establish the complete nucleotide sequence of a viral RNA genome (bacteriophage MS2). The next year, Fred Sanger completed the first DNA genome sequence: phage ΦX174, of 5,386 base pairs.[8] The first complete genome sequences among all three domains of life were released within a short period during the mid-1990s: the first bacterial genome to be sequenced was that of Haemophilus influenzae, completed by a team at The Institute for Genomic Research in 1995. A few months later, the first eukaryotic genome was completed, with sequences of the 16 chromosomes of the budding yeast Saccharomyces cerevisiae published as the result of a European-led effort begun in the mid-1980s. The first genome sequence for an archaeon, Methanococcus jannaschii, was completed in 1996, again by The Institute for Genomic Research.

The development of new technologies has made it dramatically easier and cheaper to do sequencing, and the number of complete genome sequences is growing rapidly. The US National Institutes of Health maintains one of several comprehensive databases of genomic information.[9] The thousands of completed genome sequencing projects include those for rice, the mouse, the plant Arabidopsis thaliana, the puffer fish, and the bacterium E. coli. In December 2013, scientists first sequenced the entire genome of a Neanderthal, an extinct species of humans. The genome was extracted from the toe bone of a 130,000-year-old Neanderthal found in a Siberian cave.[10][11]

New sequencing technologies, such as massively parallel sequencing, have also opened up the prospect of personal genome sequencing as a diagnostic tool, as pioneered by Manteia Predictive Medicine. A major step toward that goal was the completion in 2007 of the full genome of James D. Watson, one of the co-discoverers of the structure of DNA.[12]

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome. The Human Genome Project was organized to map and to sequence the human genome. A fundamental step in the project was the release of a detailed genomic map by Jean Weissenbach and his team at the Genoscope in Paris.[13][14]

Reference genome sequences and maps continue to be updated, removing errors and clarifying regions of high allelic complexity.[15] The decreasing cost of genomic mapping has permitted genealogical sites to offer it as a service,[16] to the extent that one may submit one's genome to crowd sourced scientific endeavours such as DNA.land at the New York Genome Center, an example both of the economies of scale and of citizen science.[17]

Genome composition describes the makeup of the contents of a haploid genome, including genome size and the proportions of non-repetitive and repetitive DNA. By comparing the genome compositions of different genomes, scientists can better understand the evolutionary history of a given genome.

When discussing genome composition, one should distinguish between prokaryotes and eukaryotes, as there are significant differences in genome content and structure. In prokaryotes, most of the genome (85-90%) is non-repetitive DNA, meaning it is formed mainly of coding DNA, while non-coding regions make up only a small part.[18] By contrast, eukaryotes have an exon-intron organization of protein-coding genes, and the variation in repetitive DNA content among eukaryotes is extremely high. In mammals and plants, the major part of the genome is composed of repetitive DNA.[19]

Most biological entities that are more complex than a virus sometimes or always carry additional genetic material besides that which resides in their chromosomes. In some contexts, such as sequencing the genome of a pathogenic microbe, "genome" is meant to include information stored on this auxiliary material, which is carried in plasmids. In such circumstances, "genome" describes all of the genes and information on non-coding DNA that have the potential to be present.

In eukaryotes such as plants, protozoa and animals, however, "genome" carries the typical connotation of only information on chromosomal DNA. So although these organisms contain chloroplasts or mitochondria that have their own DNA, the genetic information contained within these organelles is not considered part of the genome. In fact, mitochondria are sometimes said to have their own genome, often referred to as the "mitochondrial genome". The DNA found within the chloroplast may be referred to as the "plastome".

Genome size is the total number of DNA base pairs in one copy of a haploid genome. The human nuclear genome comprises approximately 3.2 billion nucleotides of DNA, divided into 24 linear molecules, the shortest 50,000,000 nucleotides in length and the longest 260,000,000 nucleotides, each contained in a different chromosome.[21] Genome size is positively correlated with morphological complexity among prokaryotes and lower eukaryotes; however, among mollusks and all other higher eukaryotes, this correlation is no longer effective.[19][22] This phenomenon also indicates the strong influence of repetitive DNA on genome size.

Since genomes are very complex, one research strategy is to reduce the number of genes in a genome to the bare minimum and still have the organism in question survive. There is experimental work being done on minimal genomes for single cell organisms as well as minimal genomes for multi-cellular organisms (see Developmental biology). The work is both in vivo and in silico.[23][24]

A table of some significant or representative genomes appeared here, drawing on sources including the initial sequencing and analysis of the human genome.[30][31][32][61] See the lists of sequenced genomes for fuller coverage.

The proportion of non-repetitive DNA is calculated as the length of non-repetitive DNA divided by genome size. Protein-coding genes and RNA-coding genes are generally non-repetitive DNA.[66] A bigger genome does not mean more genes, and the proportion of non-repetitive DNA decreases with increasing genome size in higher eukaryotes.[19]

The proportion of non-repetitive DNA varies greatly between species. Some prokaryotes, such as E. coli, have almost exclusively non-repetitive DNA, and lower eukaryotes such as C. elegans and the fruit fly still possess more non-repetitive than repetitive DNA.[19][67] Higher eukaryotes tend to have more repetitive DNA than non-repetitive DNA; in some plants and amphibians, the proportion of non-repetitive DNA is no more than 20%, becoming a minority component.[19]

The proportion of repetitive DNA is calculated as the length of repetitive DNA divided by genome size. There are two categories of repetitive DNA in the genome: tandem repeats and interspersed repeats.[68]
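Both proportions defined above are simple ratios of sequence length to genome size; the following minimal sketch computes them for a hypothetical genome (the example lengths are made-up values, not measurements).

```python
# Minimal sketch of the genome-composition proportions defined above.
# The example lengths are made-up illustrative values.

def composition(non_repetitive_bp, repetitive_bp):
    genome_size = non_repetitive_bp + repetitive_bp
    return {
        "non_repetitive": non_repetitive_bp / genome_size,
        "repetitive": repetitive_bp / genome_size,
    }

# Hypothetical genome: 1.2 Gb non-repetitive, 2.0 Gb repetitive DNA.
props = composition(1.2e9, 2.0e9)
print({k: f"{v:.1%}" for k, v in props.items()})   # ~37.5% / ~62.5%
```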

Tandem repeats are usually caused by slippage during replication, unequal crossing-over and gene conversion.[69] Satellite DNA and microsatellites are forms of tandem repeats in the genome.[70] Although tandem repeats account for a significant proportion of the genome, the largest proportion in mammals is contributed by the other type, interspersed repeats.

Interspersed repeats mainly come from transposable elements (TEs), but they also include some protein coding gene families and pseudogenes. Transposable elements are able to integrate into the genome at another site within the cell.[18][71] It is believed that TEs are an important driving force on genome evolution of higher eukaryotes.[72] TEs can be classified into two categories, Class 1 (retrotransposons) and Class 2 (DNA transposons).[71]

Retrotransposons are transcribed into RNA, which is then copied and inserted at another site in the genome.[73] Retrotransposons can be divided into long terminal repeat (LTR) and non-long terminal repeat (non-LTR) elements.[72]

DNA transposons generally move by "cut and paste" in the genome, but duplication has also been observed. Class 2 TEs do not use RNA as an intermediate; they are common in bacteria and have also been found in metazoans.[72]

Genomes are more than the sum of an organism's genes and have traits that may be measured and studied without reference to the details of any particular genes and their products. Researchers compare traits such as chromosome number (karyotype), genome size, gene order, codon usage bias, and GC-content to determine what mechanisms could have produced the great variety of genomes that exist today (for recent overviews, see Brown 2002; Saccone and Pesole 2003; Benfey and Protopapas 2004; Gibson and Muse 2004; Reese 2004; Gregory 2005).

Duplications play a major role in shaping the genome. Duplication may range from extension of short tandem repeats, to duplication of a cluster of genes, and all the way to duplication of entire chromosomes or even entire genomes. Such duplications are probably fundamental to the creation of genetic novelty.

Horizontal gene transfer is invoked to explain how there is often an extreme similarity between small portions of the genomes of two organisms that are otherwise very distantly related. Horizontal gene transfer seems to be common among many microbes. Also, eukaryotic cells seem to have experienced a transfer of some genetic material from their chloroplast and mitochondrial genomes to their nuclear chromosomes.

See more here:
Genome - Wikipedia

Posted in Genome | Comments Off on Genome – Wikipedia

Human genome – Wikipedia

Posted: October 20, 2016 at 11:32 pm

Genomic information. Graphical representation of the idealized human diploid karyotype, showing the organization of the genome into chromosomes; the drawing shows both the female (XX) and male (XY) versions of the 23rd chromosome pair, with chromosomes aligned at their centromeres (mitochondrial DNA not shown). NCBI genome ID: 51. Ploidy: diploid. Genome size: 3,234.83 Mb (megabase pairs) per haploid genome.

The human genome is the complete set of nucleic acid sequences for humans (Homo sapiens), encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. Human genomes include both protein-coding DNA genes and noncoding DNA. Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization creates a zygote), consist of three billion DNA base pairs, while diploid genomes (found in somatic cells) have twice the DNA content. While there are significant differences among the genomes of human individuals (on the order of 0.1%),[1] these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees (approximately 4%[2]) and bonobos.

The Human Genome Project produced the first complete sequences of individual human genomes, with the first draft sequence and initial analysis being published on February 12, 2001.[3] The human genome was the first of all vertebrates to be completely sequenced. As of 2012, thousands of human genomes have been completely sequenced, and many more have been mapped at lower levels of resolution. The resulting data are used worldwide in biomedical science, anthropology, forensics and other branches of science. There is a widely held expectation that genomic studies will lead to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution.

Although the sequence of the human genome has been (almost) completely determined by DNA sequencing, it is not yet fully understood. Most (though probably not all) genes have been identified by a combination of high throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products. Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance.

There are an estimated 19,000-20,000 human protein-coding genes.[4] The estimate of the number of human genes has been repeatedly revised down from initial predictions of 100,000 or more as genome sequence quality and gene finding methods have improved, and could continue to drop further.[5][6] Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA molecules, regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined.[7]

In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome.[8][9]

The total length of the human genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, plus the X chromosome (one in males, two in females) and, in males only, one Y chromosome. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in each mitochondrion. Basic information about these molecules and their gene content, based on a reference genome that does not represent the sequence of any specific individual, is provided in the following table. (Data source: Ensembl genome browser release 68, July 2012)

Table 1 (above) summarizes the physical organization and gene content of the human reference genome, with links to the original analysis, as published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers, the distance between base pairs in the DNA double helix. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
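The length estimate described above is a straightforward multiplication. As a worked example (the 249 Mb figure is an approximate size for human chromosome 1, used here only for illustration):

```python
# Chromosome length estimated as (number of base pairs) x 0.34 nm,
# as described above. The 249 Mb example is an approximate size for
# human chromosome 1, used here only for illustration.

BASE_PAIR_SPACING_NM = 0.34

def chromosome_length_cm(base_pairs):
    return base_pairs * BASE_PAIR_SPACING_NM * 1e-9 * 100   # nm -> m -> cm

print(f"{chromosome_length_cm(249e6):.1f} cm")   # ~8.5 cm of extended DNA
```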

The number of variations is a summary of unique DNA sequence changes that have been identified within the sequences analyzed by Ensembl as of July, 2012; that number is expected to increase as further personal genomes are sequenced and examined. In addition to the gene content shown in this table, a large number of non-expressed functional sequences have been identified throughout the human genome (see below). Links open windows to the reference chromosome sequence in the EBI genome browser. The table also describes prevalence of genes encoding structural RNAs in the genome.

MicroRNA, or miRNA, functions as a post-transcriptional regulator of gene expression. Ribosomal RNA, or rRNA, makes up the RNA portion of the ribosome and is critical in the synthesis of proteins. Small nuclear RNA, or snRNA, is found in the nucleus of the cell. Its primary function is in the processing of pre-mRNA molecules and also in the regulation of transcription factors. Small nucleolar RNA, or SnoRNA, primarily functions in guiding chemical modifications to other RNA molecules.

Although the human genome has been completely sequenced for all practical purposes, there are still hundreds of gaps in the sequence. A recent study noted more than 160 euchromatic gaps, of which 50 were closed.[10] However, there are still numerous gaps in the heterochromatic parts of the genome, which are much harder to sequence due to numerous repeats and other intractable sequence features.

The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (ca. 98% of the genome) that are not used to encode proteins.

Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity.

Because non-coding DNA greatly outnumbers coding DNA, the concept of the sequenced genome has become a more focused analytical concept than the classical concept of the DNA-coding gene.[11][12]

The mutation rate of the human genome is a very important factor in calculating evolutionary time points. Researchers calculated the number of genetic variations between humans and apes. Dividing that number by the age of the fossil of the most recent common ancestor of humans and apes, researchers calculated the mutation rate. Recent studies using next-generation sequencing technologies concluded a slow mutation rate, which does not add up with the time points of human migration patterns, suggesting a new evolutionary time scale.[13] 100,000-year-old human fossils found in Israel have served to compound this newfound uncertainty about the human migration timeline.[13]

Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can lead to the production of many more unique proteins than the number of protein-coding genes.

The complete modular protein-coding capacity of the genome is contained within the exome, and consists of DNA sequences encoded by exons that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project.

Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as Uniprot.[15] Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s,[16] but several researchers pointed out in the early 1970s that the estimated mutational load from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes).[17]

The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. This difference may result from the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.

Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000, with an especially high gene density within chromosomes 19, 11, and 1 (Table 1). Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content[citation needed]. The significance of these nonrandom patterns of gene density is not well understood.[18]

Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability (Table 2). For example, the gene for histone H1a (HIST1HIA) is relatively small and simple, lacking introns and encoding an mRNA of 781 nt and a 215 amino acid protein (648 nt open reading frame). Dystrophin (DMD) is the largest protein-coding gene in the human reference genome, spanning a total of 2.2 Mb, while Titin (TTN) has the longest coding sequence (114,414 bp), the largest number of exons (363),[19] and the longest single exon (17,106 bp). Over the whole genome, the median size of an exon is 122 bp (mean = 145 bp), the median number of exons is 7 (mean = 8.8), and the median coding sequence encodes 367 amino acids (mean = 447 amino acids; Table 21 in [7]).
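The coding-length figures above follow the simple rule that each amino acid is encoded by three nucleotides, plus one stop codon; the quick check below reproduces the histone H1a and median coding-sequence numbers.

```python
# Check of the coding-length arithmetic quoted above: each amino acid
# takes 3 nucleotides, plus 3 for the stop codon.

def orf_length_nt(amino_acids):
    return (amino_acids + 1) * 3   # +1 accounts for the stop codon

print(orf_length_nt(215))   # 648 nt, matching the histone H1a figure above
print(orf_length_nt(367))   # ~1,104 nt for the median coding sequence
```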

Table 2. Examples of human protein-coding genes. Chrom, chromosome. Alt splicing, alternative pre-mRNA splicing. (Data source: Ensembl genome browser release 68, July 2012)

Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA.

Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements.

Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA).

Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human genome.[7] In addition, about 26% of the human genome is introns.[20] Aside from genes (exons and introns) and known regulatory sequences (8-20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. Recent analysis by the ENCODE project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity.[6]

It remains controversial, however, whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism.[21] Excluding protein-coding sequences, introns, and regulatory regions, the remaining non-coding DNA is composed of a variety of other elements. Many DNA sequences that do not play a role in gene expression have important biological functions. Comparative genomics studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong evolutionary pressure and positive selection.[22]

Many of these sequences regulate the structure of chromosomes by limiting the regions of heterochromatin formation and regulating structural features of the chromosomes, such as the telomeres and centromeres. Other noncoding regions serve as origins of DNA replication. Finally, several regions are transcribed into functional noncoding RNA that regulate the expression of protein-coding genes (for example[23]), mRNA translation and stability (see miRNA), chromatin structure (including histone modifications, for example[24]), DNA methylation (for example[25]), DNA recombination (for example[26]), and cross-regulate other noncoding RNAs (for example[27]). It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.[21]

Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. Table 1 shows that the number of pseudogenes in the human genome is on the order of 13,000,[28] and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during molecular evolution.

For example, the olfactory receptor gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.[29]

Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and RNA processing. Noncoding RNAs include tRNA, ribosomal RNA, microRNA, snRNA and other non-coding RNA genes, including about 60,000 long non-coding RNAs (lncRNAs).[6][30][31][32] The number of reported lncRNA genes continues to rise and the exact number in the human genome is yet to be defined, but many of them are argued to be non-functional.[33]

Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.[34]

In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of protein coding genes usually contain extensive noncoding sequences, in the form of introns, 5'-untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding genes of the human genome, the length of intron sequences is 10 to 100 times the length of exon sequences (Table 2).

The human genome has many different regulatory sequences which are crucial to controlling gene expression. Conservative estimates indicate that these sequences make up 8% of the genome,[35] but extrapolations from the ENCODE project suggest that 20%[36] to 40%[37] of the genome is gene regulatory sequence. Some types of non-coding DNA are genetic "switches" that do not encode proteins but regulate when and where genes are expressed; such elements are called enhancers.[38]

Regulatory sequences have been known since the late 1960s.[39] The first identification of regulatory sequences in the human genome relied on recombinant DNA technology.[40] Later, with the advent of genomic sequencing, the identification of these sequences could be inferred from evolutionary conservation. The evolutionary split between primates and rodents, for example, occurred 70–90 million years ago.[41] Computational comparisons that identify non-coding sequences conserved across such distances therefore point to regions likely to be important in duties such as gene regulation.[42]

Other genomes have been sequenced with the same intention of aiding conservation-guided methods, for example the pufferfish genome.[43] However, regulatory sequences disappear and re-evolve during evolution at a high rate.[44][45][46]

As of 2012, efforts had shifted toward finding interactions between DNA and regulatory proteins using ChIP-seq, or toward finding regions where the DNA is not packaged by histones (DNase hypersensitive sites); both approaches reveal where active regulatory sequences lie in the investigated cell type.[35]

Repetitive DNA sequences comprise approximately 50% of the human genome.[47]

About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies (e.g. "CAGCAGCAG...").[citation needed] The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for genealogical DNA testing and forensic DNA analysis.[48]

Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)n) are termed microsatellite sequences. Among the microsatellites, trinucleotide repeats are of particular importance, as they sometimes occur within the coding regions of protein-coding genes and their expansion can lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)n within the Huntingtin gene on human chromosome 4. Telomeres (the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)n.
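
A repeat tract such as (CAG)n can be located and sized with a simple pattern search. The sketch below uses a hypothetical sequence fragment; the flanking bases and the repeat count are invented for illustration and are not taken from the Huntingtin gene:

    import re

    # Hypothetical fragment carrying a (CAG)n tract (not real HTT sequence).
    seq = "GGC" + "CAG" * 23 + "CCGCCA"

    match = re.search(r"(?:CAG)+", seq)       # leftmost, greedily extended run of CAG units
    repeat_units = len(match.group(0)) // 3
    print(repeat_units)                       # 23 CAG units in this made-up example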

Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides long) are termed minisatellites.

Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, Alu, has about 50,000 active copies,[49] and can be inserted into intragenic and intergenic regions.[50] One other lineage, LINE-1, has about 100 active copies per genome (the number varies between people).[51] Together with non-functional relics of old transposons, they account for over half of total human DNA.[52] Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent endogenous retroviruses, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations.

Mobile elements within the human genome can be classified into LTR retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu elements, LINEs (20.4% of total genome), SVAs and Class II DNA transposons (2.9% of total genome).
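
Summing the class-by-class fractions just listed (no separate figure is given for SVA elements) shows that these categories alone cover roughly 45% of the genome, in line with the earlier statement that transposon-derived sequence, including degraded relics, accounts for over half of human DNA:

    # Genome fractions of mobile-element classes quoted above (percent of total genome).
    fractions = {"LTR retrotransposons": 8.3, "SINEs": 13.1, "LINEs": 20.4, "DNA transposons": 2.9}
    print(sum(fractions.values()))   # 44.7 percent, excluding SVAs and unclassified relics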

With the exception of identical twins, all humans show significant variation in genomic DNA sequences. The human reference genome (HRG) is used as a standard sequence reference.

There are several important points concerning the human reference genome: it does not correspond to the genome of any one individual, but is a mosaic assembled from the DNA of a small number of donors, and it is updated periodically to correct errors and fill gaps.

Most studies of human genetic variation have focused on single-nucleotide polymorphisms (SNPs), which are substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur about once in every 1,000 base pairs, on average, in the euchromatic human genome, although they do not occur at a uniform density. This underlies the popular statement that "we are all, regardless of race, genetically 99.9% the same",[53] although most geneticists would qualify it; for example, a much larger fraction of the genome is now thought to be involved in copy number variation.[54] A large-scale collaborative effort to catalog SNP variations in the human genome is being undertaken by the International HapMap Project.
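
The 99.9% figure follows directly from the quoted SNP density. A back-of-the-envelope check, assuming a euchromatic genome of roughly 3 billion bp (an illustrative round number):

    # Rough check of the "99.9% the same" statement from the quoted SNP density.
    genome_bp = 3.0e9          # assumed euchromatic genome size
    snp_rate = 1 / 1000        # ~1 SNP per 1,000 bp between two genomes
    expected_snps = genome_bp * snp_rate
    identity = 1 - snp_rate
    print(f"~{expected_snps / 1e6:.0f} million single-base differences, {identity:.1%} identity")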

The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, which is the basis of DNA fingerprinting and DNA paternity testing technologies. The heterochromatic portions of the human genome, which total several hundred million base pairs, are also thought to be quite variable within the human population (they are so repetitive and so long that they cannot be accurately sequenced with current technology). These regions contain few genes, and it is unclear whether any significant phenotypic effect results from typical variation in repeats or heterochromatin.

Most gross genomic mutations in germ cells probably result in inviable embryos; however, a number of human diseases are related to large-scale genomic abnormalities. Down syndrome, Turner syndrome, and a number of other diseases result from nondisjunction of entire chromosomes. Cancer cells frequently have aneuploidy of chromosomes and chromosome arms, although a cause-and-effect relationship between aneuploidy and cancer has not been established.

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome.[55][56]

An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence variation."[57] It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or bases.

Researchers published the first sequence-based map of large-scale structural variation across the human genome in the journal Nature in May 2008.[58][59] Large-scale structural variations are differences in the genome among people that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome sequence and others appear as re-arrangements of stretches of sequence. These variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions.

Single-nucleotide polymorphisms (SNPs) do not occur homogeneously across the human genome. In fact, there is enormous diversity in SNP frequency between genes, reflecting different selective pressures on each gene as well as different mutation and recombination rates across the genome. However, because studies of SNPs have been biased towards coding regions, the data generated from them are unlikely to reflect the overall distribution of SNPs throughout the genome. The SNP Consortium protocol was therefore designed to identify SNPs with no bias towards coding regions, and the Consortium's 100,000 SNPs generally reflect sequence diversity across the human chromosomes. The SNP Consortium aimed to expand the number of SNPs identified across the genome to 300,000 by the end of the first quarter of 2001.[60]

Changes in non-coding sequence and synonymous changes in coding sequence are generally more common than non-synonymous changes, reflecting greater selective pressure reducing diversity at positions dictating amino acid identity. Transitional changes are more common than transversions, with CpG dinucleotides showing the highest mutation rate, presumably due to deamination.
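
Transitions are substitutions within the purines (A↔G) or within the pyrimidines (C↔T); everything else is a transversion. A minimal sketch that classifies a hypothetical list of substitutions and reports the transition/transversion counts:

    PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

    def is_transition(ref: str, alt: str) -> bool:
        # A substitution is a transition if both alleles are purines or both are pyrimidines.
        pair = {ref, alt}
        return pair <= PURINES or pair <= PYRIMIDINES

    variants = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]   # hypothetical calls
    ti = sum(is_transition(r, a) for r, a in variants)
    tv = len(variants) - ti
    print(f"transitions = {ti}, transversions = {tv}")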

A personal genome sequence is a (nearly) complete sequence of the chemical base pairs that make up the DNA of a single person. Because medical treatments have different effects on different people due to genetic variations such as single-nucleotide polymorphisms (SNPs), the analysis of personal genomes may lead to personalized medical treatment based on individual genotypes.[citation needed]

The first personal genome sequence to be determined was that of Craig Venter in 2007. Personal genomes had not been sequenced in the public Human Genome Project to protect the identity of volunteers who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse population.[61] However, early in the Venter-led Celera Genomics genome sequencing effort the decision was made to switch from sequencing a composite sample to using DNA from a single individual, later revealed to have been Venter himself. Thus the Celera human genome sequence released in 2000 was largely that of one man. Subsequent replacement of the early composite-derived data and determination of the diploid sequence, representing both sets of chromosomes, rather than a haploid sequence originally reported, allowed the release of the first personal genome.[62] In April 2008, that of James Watson was also completed. Since then hundreds of personal genome sequences have been released,[63] including those of Desmond Tutu,[64][65] and of a Paleo-Eskimo.[66] In November 2013, a Spanish family made their personal genomics data obtained by direct-to-consumer genetic testing with 23andMe publicly available under a Creative Commons public domain license. This is believed to be the first such public genomics dataset for a whole family.[67]

The sequencing of individual genomes further unveiled levels of genetic complexity that had not been appreciated before. Personal genomics helped reveal the significant level of diversity in the human genome attributed not only to SNPs but also to structural variations. However, the application of such knowledge to the treatment of disease and in the medical field is only in its very beginnings.[68] Exome sequencing has become increasingly popular as a tool to aid in diagnosis of genetic disease because the exome contributes only 1% of the genomic sequence but accounts for roughly 85% of mutations that contribute significantly to disease.[69]
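
Those two percentages imply a large per-base enrichment of disease-relevant variation in the exome, which is the economic argument for exome capture. A quick calculation using the figures quoted above:

    # Per-base enrichment of disease-causing mutations in the exome vs. the whole genome.
    exome_fraction = 0.01        # exome is ~1% of the genome
    disease_fraction = 0.85      # ~85% of known disease-causing mutations fall in the exome
    print(round(disease_fraction / exome_fraction))   # ~85-fold enrichment per base sequenced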

Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene and is the most common recessive disorder in Caucasian populations, with over 1,300 different disease-causing mutations known.[70]

Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare, so genetic disorders are likewise individually rare. However, since there are many genes that can vary to cause genetic disorders, in aggregate they constitute a significant component of known medical conditions, especially in pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has been identified; currently there are approximately 2,200 such disorders annotated in the OMIM database.[70]

Studies of genetic disorders are often performed by means of family-based studies. In some instances, population-based approaches are employed, particularly in the case of so-called founder populations such as those in Finland, French Canada, Utah, Sardinia, etc. Diagnosis and treatment of genetic disorders are usually performed by a geneticist-physician trained in clinical/medical genetics. The results of the Human Genome Project are likely to provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. Parents can be screened for hereditary conditions and counselled on the consequences, the probability of inheritance, and how to avoid or ameliorate it in their offspring.

As noted above, there are many different kinds of DNA sequence variation, ranging from complete extra or missing chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic variation in human populations is phenotypically neutral, i.e. has little or no detectable effect on the physiology of the individual (although there may be fractional differences in fitness defined over evolutionary time frames). Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the clinical disease under investigation. Such studies constitute the realm of human molecular genetics.

With the advent of the Human Genome and International HapMap Project, it has become feasible to explore subtle genetic influences on many common disease conditions such as diabetes, asthma, migraine, schizophrenia, etc. Although some causal links have been made between genomic sequence variants in particular genes and some of these diseases, often with much publicity in the general media, these are usually not considered to be genetic disorders per se as their causes are complex, involving many different genetic and environmental factors. Thus there may be disagreement in particular cases whether a specific medical condition should be termed a genetic disorder. The categorized table below provides the prevalence as well as the genes or chromosomes associated with some human genetic disorders.


Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been conserved by evolution since the divergence of extant lineages approximately 200 million years ago, containing the vast majority of genes.[72][73] The published chimpanzee genome differs from that of the human genome by 1.23% in direct sequence comparisons.[74] Around 20% of this figure is accounted for by variation within each species, leaving only ~1.06% consistent sequence divergence between humans and chimps at shared genes.[75] This nucleotide by nucleotide difference is dwarfed, however, by the portion of each genome that is not shared, including around 6% of functional genes that are unique to either humans or chimps.[76]

In other words, the considerable observable differences between humans and chimps may be due as much or more to genome-level variation in the number, function and expression of genes as to DNA sequence changes in shared genes. Indeed, even within humans, there has been found to be a previously unappreciated amount of copy number variation (CNV), which can make up as much as 5–15% of the human genome. That is, between two humans there could be on the order of ±500,000,000 base pairs of difference, some of it in active genes, others inactivated, or active at different levels. The full significance of this finding remains to be seen. On average, a typical human protein-coding gene differs from its chimpanzee ortholog by only two amino acid substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee orthologs. A major difference between the two genomes is human chromosome 2, which is equivalent to a fusion product of chimpanzee chromosomes 12 and 13 (later renamed chromosomes 2A and 2B, respectively).[77]

Humans have undergone an extraordinary loss of olfactory receptor genes during our recent evolution, which explains our relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that the emergence of color vision in humans and several other primate species has diminished the need for the sense of smell.[78]

In September 2016, scientists reported that, based on human DNA genetic studies, all non-Africans in the world today can be traced to a single population that exited Africa between 50,000 and 80,000 years ago.[79]

The human mitochondrial DNA is of tremendous interest to geneticists, since it undoubtedly plays a role in mitochondrial disease. It also sheds light on human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of a recent common ancestor for all humans on the maternal line of descent (see Mitochondrial Eve).

Due to the lack of a system for checking copying errors, mitochondrial DNA (mtDNA) has a more rapid rate of variation than nuclear DNA. This roughly 20-fold higher mutation rate allows mtDNA to be used for more accurate tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, such as the migration of Native Americans from Siberia or of Polynesians from southeastern Asia. It has also been used to show that there is no trace of Neanderthal DNA in the European gene mixture inherited through the purely maternal lineage.[80] Because mtDNA is inherited in an all-or-none manner along a single maternal line, this result (no trace of Neanderthal mtDNA) would be expected unless there were a large percentage of Neanderthal ancestry, or strong positive selection for that mtDNA. For example, going back 5 generations, only 1 of your 32 ancestors contributed your mtDNA, so if one of those 32 were pure Neanderthal, you would expect ~3% of your autosomal DNA to be of Neanderthal origin, yet you would have a ~97% chance of carrying no trace of Neanderthal mtDNA.
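
The numbers in that example come from simple genealogy: n generations back there are 2^n ancestors, each expected to contribute about 1/2^n of the autosomal genome, while only one of them lies on the strict maternal line that supplies the mtDNA. A short sketch reproducing the ~3% and ~97% figures:

    # Expected autosomal contribution vs. chance of supplying the mtDNA,
    # for a single ancestor n generations back (idealized genealogy).
    n = 5
    ancestors = 2 ** n                      # 32 ancestors at generation 5
    autosomal_fraction = 1 / ancestors      # ~3.1% expected autosomal contribution
    p_no_mtdna = 1 - 1 / ancestors          # ~96.9% chance that ancestor is not the mtDNA source
    print(ancestors, f"{autosomal_fraction:.1%}", f"{p_no_mtdna:.1%}")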

Epigenetics describes a variety of features of the human genome that transcend its primary DNA sequence, such as chromatin packaging, histone modifications and DNA methylation, and which are important in regulating gene expression, genome replication and other cellular processes. Epigenetic markers strengthen and weaken transcription of certain genes but do not affect the actual sequence of DNA nucleotides. DNA methylation is a major form of epigenetic control over gene expression and one of the most highly studied topics in epigenetics. During development, the human DNA methylation profile experiences dramatic changes. In early germ line cells, the genome has very low methylation levels. These low levels generally describe active genes. As development progresses, parental imprinting tags lead to increased methylation activity.[81][82]

Epigenetic patterns can be identified between tissues within an individual as well as between individuals themselves. Identical genes that have differences only in their epigenetic state are called epialleles. Epialleles can be placed into three categories: those directly determined by an individual's genotype, those influenced by genotype, and those entirely independent of genotype. The epigenome is also influenced significantly by environmental factors. Diet, toxins, and hormones impact the epigenetic state. Studies in dietary manipulation have demonstrated that methyl-deficient diets are associated with hypomethylation of the epigenome. Such studies establish epigenetics as an important interface between the environment and the genome.[83]

Continue reading here:
Human genome - Wikipedia

Posted in Genome | Comments Off on Human genome – Wikipedia

Human Genome Project – Wikipedia

Posted: October 19, 2016 at 4:07 am

The Human Genome Project (HGP) is an international scientific research project with the goal of determining the sequence of chemical base pairs which make up human DNA, and of identifying and mapping all of the genes of the human genome from both a physical and a functional standpoint.[1] It remains the world's largest collaborative biological project.[2] The idea was picked up by the US government in 1984, when planning started; the project was formally launched in 1990 and declared complete in 2003. Funding came from the US government through the National Institutes of Health (NIH) as well as numerous other groups from around the world. A parallel project was conducted outside of government by the Celera Corporation, or Celera Genomics, which was formally launched in 1998. Most of the government-sponsored sequencing was performed in twenty universities and research centers in the United States, the United Kingdom, Japan, France, Germany, Canada, and China.[3]

The Human Genome Project originally aimed to map the nucleotides contained in a human haploid reference genome (more than three billion). The "genome" of any given individual is unique; mapping the "human genome" involves sequencing multiple variations of each gene.[4] In May 2016, scientists considered extending the HGP to include creating a synthetic human genome.[5] In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome.[6][7]

The Human Genome Project was a 13-year-long, publicly funded project initiated in 1990 with the objective of determining the DNA sequence of the entire euchromatic human genome within 15 years.[8] In May 1985, Robert Sinsheimer organized a workshop to discuss sequencing the human genome,[9] but for a number of reasons the NIH was uninterested in pursuing the proposal. The following March, the Santa Fe Workshop was organized by Charles DeLisi and David Smith of the Department of Energy's Office of Health and Environmental Research (OHER).[10] At the same time Renato Dulbecco proposed whole genome sequencing in an essay in Science.[11] James Watson followed two months later with a workshop held at the Cold Spring Harbor Laboratory.

The fact that the Santa Fe workshop was motivated and supported by a Federal Agency opened a path, albeit a difficult and tortuous one,[12] for converting the idea into public policy. In a memo to the Assistant Secretary for Energy Research (Alvin Trivelpiece), Charles DeLisi, who was then Director of OHER, outlined a broad plan for the project.[13] This started a long and complex chain of events which led to approved reprogramming of funds that enabled OHER to launch the Project in 1986, and to recommend the first line item for the HGP, which was in President Reagan's 1988 budget submission,[12] and was ultimately approved by Congress. Of particular importance in Congressional approval was the advocacy of Senator Peter Domenici, whom DeLisi had befriended.[14] Domenici chaired the Senate Committee on Energy and Natural Resources, as well as the Budget Committee, both of which were key in the DOE budget process. Congress added a comparable amount to the NIH budget, thereby beginning official funding by both agencies.

Alvin Trivelpiece sought and obtained the approval of DeLisi's proposal by Deputy Secretary William Flynn Martin. This chart[15] was used in the spring of 1986 by Trivelpiece, then Director of the Office of Energy Research in the Department of Energy, to brief Martin and Under Secretary Joseph Salgado regarding his intention to reprogram $4 million to initiate the project with the approval of Secretary Herrington. This reprogramming was followed by a line item budget of $16 million in the Reagan Administration's 1987 budget submission to Congress.[16] It subsequently passed both Houses. The Project was planned for 15 years.[17]

Candidate technologies were already being considered for the proposed undertaking at least as early as 1985.[18]

In 1990, the two major funding agencies, DOE and NIH, developed a memorandum of understanding in order to coordinate plans and set the clock for the initiation of the Project to 1990.[19] At that time, David Galas was Director of the renamed Office of Biological and Environmental Research in the U.S. Department of Energy's Office of Science and James Watson headed the NIH Genome Program. In 1993, Aristides Patrinos succeeded Galas and Francis Collins succeeded James Watson, assuming the role of overall Project Head as Director of the U.S. National Institutes of Health (NIH) National Center for Human Genome Research (which would later become the National Human Genome Research Institute). A working draft of the genome was announced in 2000 and the papers describing it were published in February 2001. A more complete draft was published in 2003, and genome "finishing" work continued for more than a decade.

The $3-billion project was formally founded in 1990 by the US Department of Energy and the National Institutes of Health, and was expected to take 15 years.[20] In addition to the United States, the international consortium comprised geneticists in the United Kingdom, France, Australia, China, and myriad other collaborators.[21]

Due to widespread international cooperation and advances in the field of genomics (especially in sequence analysis), as well as major advances in computing technology, a 'rough draft' of the genome was finished in 2000 (announced jointly by U.S. President Bill Clinton and the British Prime Minister Tony Blair on June 26, 2000).[22] This first available rough draft assembly of the genome was completed by the Genome Bioinformatics Group at the University of California, Santa Cruz, primarily led by then graduate student Jim Kent. Ongoing sequencing led to the announcement of the essentially complete genome on April 14, 2003, two years earlier than planned.[23][24] In May 2006, another milestone was passed on the way to completion of the project, when the sequence of the last chromosome was published in Nature.[25]

The project did not aim to sequence all the DNA found in human cells. It sequenced only "euchromatic" regions of the genome, which make up about 90% of the genome. The other regions, called "heterochromatic", are found in centromeres and telomeres, and were not sequenced under the project.[26]

The Human Genome Project was declared complete in April 2003. An initial rough draft of the human genome was available in June 2000, and by February 2001 a working draft had been completed and published, followed by the final sequence mapping of the human genome on April 14, 2003. Although this was reported to cover 99% of the euchromatic human genome with 99.99% accuracy, a major quality assessment of the human genome sequence published on May 27, 2004 indicated that over 92% of sampling exceeded 99.99% accuracy, which was within the intended goal.[27] Further analyses and papers on the HGP continue to be published.[28]

The sequencing of the human genome holds benefits for many fields, from molecular medicine to human evolution. The Human Genome Project, through its sequencing of the DNA, can advance many areas, including: genotyping of specific viruses to direct appropriate treatment; identification of mutations linked to different forms of cancer; the design of medications and more accurate prediction of their effects; advances in forensic applied sciences; biofuels and other energy applications; agriculture, animal husbandry, and bioprocessing; risk assessment; and bioarchaeology, anthropology and evolution. Another proposed benefit is the commercial development of genomics research related to DNA-based products, a multibillion-dollar industry.

The sequence of the DNA is stored in databases available to anyone on the Internet. The U.S. National Center for Biotechnology Information (and sister organizations in Europe and Japan) house the gene sequence in a database known as GenBank, along with sequences of known and hypothetical genes and proteins. Other organizations, such as the UCSC Genome Browser at the University of California, Santa Cruz,[29] and Ensembl[30] present additional data and annotation and powerful tools for visualizing and searching it. Computer programs have been developed to analyze the data, because the data itself is difficult to interpret without such programs. Generally speaking, advances in genome sequencing technology have followed Moore's Law, a concept from computer science which states that integrated circuits can increase in complexity at an exponential rate.[31] This means that the speed at which whole genomes can be sequenced has increased at a similar rate, as was seen during the development of the above-mentioned Human Genome Project.
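
As a purely illustrative sketch of that exponential trend, the snippet below applies a fixed doubling time; the two-year value is the classic Moore's Law figure and is an assumption here, not a measured sequencing statistic:

    # Illustrative exponential growth under an assumed doubling time (Moore's-Law-style).
    def fold_increase(years: float, doubling_time_years: float = 2.0) -> float:
        return 2 ** (years / doubling_time_years)

    print(fold_increase(10))   # ~32-fold increase over a decade at a 2-year doubling time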

The process of identifying the boundaries between genes and other features in a raw DNA sequence is called genome annotation and is in the domain of bioinformatics. While expert biologists make the best annotators, their work proceeds slowly, and computer programs are increasingly used to meet the high-throughput demands of genome sequencing projects. Beginning in 2008, a new technology known as RNA-seq was introduced that allowed scientists to directly sequence the messenger RNA in cells. This replaced previous methods of annotation, which relied on inherent properties of the DNA sequence, with direct measurement, which was much more accurate. Today, annotation of the human genome and other genomes relies primarily on deep sequencing of the transcripts in every human tissue using RNA-seq. These experiments have revealed that over 90% of genes contain at least one and usually several alternative splice variants, in which the exons are combined in different ways to produce 2 or more gene products from the same locus.[citation needed]

The genome published by the HGP does not represent the sequence of every individual's genome. It is the combined mosaic of a small number of anonymous donors, all of European origin. The HGP genome is a scaffold for future work in identifying differences among individuals. Subsequent projects sequenced the genomes of multiple distinct ethnic groups, though as of today there is still only one "reference genome."[citation needed]

Key findings of the draft (2001) and complete (2004) genome sequences include the observation that the human genome contains far fewer protein-coding genes than had been predicted, on the order of 20,000–25,000.

The Human Genome Project was started in 1990 with the goal of sequencing and identifying all three billion chemical units in the human genetic instruction set, finding the genetic roots of disease and then developing treatments. It is considered a Mega Project because the human genome has approximately 3.3 billion base-pairs. With the sequence in hand, the next step was to identify the genetic variants that increase the risk for common diseases like cancer and diabetes.[19][36]

It was far too expensive at that time to think of sequencing patients' whole genomes. So the National Institutes of Health embraced the idea for a "shortcut", which was to look just at sites on the genome where many people have a variant DNA unit. The theory behind the shortcut was that, since the major diseases are common, so too would be the genetic variants that caused them. Natural selection keeps the human genome free of variants that damage health before children are grown, the theory held, but fails against variants that strike later in life, allowing them to become quite common. (In 2002 the National Institutes of Health started a $138 million project called the HapMap to catalog the common variants in European, East Asian and African genomes.)[37]

The genome was broken into smaller pieces, approximately 150,000 base pairs in length.[36] These pieces were then ligated into a type of vector known as a bacterial artificial chromosome (BAC), derived from genetically engineered bacterial chromosomes. The vectors containing the genome fragments can be inserted into bacteria, where they are copied by the bacterial DNA replication machinery. Each of these pieces was then sequenced separately as a small "shotgun" project and assembled, and the assembled 150,000 base pair pieces were then joined to reconstruct the chromosomes. This is known as the "hierarchical shotgun" approach, because the genome is first broken into relatively large chunks, which are then mapped to chromosomes before being selected for sequencing.[38][39]
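
To get a sense of scale, one can estimate how many ~150,000 bp pieces are needed just to tile the genome end to end. The genome size below is an assumed round figure, and real BAC libraries are built with several-fold redundancy rather than a single tiling:

    # Minimum number of ~150 kb pieces needed to tile the genome once (no overlap).
    genome_bp = 3.2e9
    piece_bp = 150_000
    print(round(genome_bp / piece_bp))   # ~21,000 pieces; real libraries use many more for redundancy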

Funding came from the US government through the National Institutes of Health in the United States, and a UK charity organization, the Wellcome Trust, as well as numerous other groups from around the world. The funding supported a number of large sequencing centers including those at Whitehead Institute, the Sanger Centre, Washington University in St. Louis, and Baylor College of Medicine.[20][40]

The United Nations Educational, Scientific and Cultural Organization (UNESCO) served as an important channel for the involvement of developing countries in the Human Genome Project.[41]

In 1998, a similar, privately funded quest was launched by the American researcher Craig Venter, and his firm Celera Genomics. Venter was a scientist at the NIH during the early 1990s when the project was initiated. The $300,000,000 Celera effort was intended to proceed at a faster pace and at a fraction of the cost of the roughly $3 billion publicly funded project. The Celera approach was able to proceed at a much more rapid rate, and at a lower cost than the public project because it relied upon data made available by the publicly funded project.[42]

Celera used a technique called whole genome shotgun sequencing, employing pairwise end sequencing,[43] which had been used to sequence bacterial genomes of up to six million base pairs in length, but not for anything nearly as large as the three billion base pair human genome.

Celera initially announced that it would seek patent protection on "only 200300" genes, but later amended this to seeking "intellectual property protection" on "fully-characterized important structures" amounting to 100300 targets. The firm eventually filed preliminary ("place-holder") patent applications on 6,500 whole or partial genes. Celera also promised to publish their findings in accordance with the terms of the 1996 "Bermuda Statement", by releasing new data annually (the HGP released its new data daily), although, unlike the publicly funded project, they would not permit free redistribution or scientific use of the data. The publicly funded competitors were compelled to release the first draft of the human genome before Celera for this reason. On July 7, 2000, the UCSC Genome Bioinformatics Group released a first working draft on the web. The scientific community downloaded about 500 GB of information from the UCSC genome server in the first 24 hours of free and unrestricted access.[44]

In March 2000, President Clinton announced that the genome sequence could not be patented, and should be made freely available to all researchers. The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in market capitalization in two days.

Although the working draft was announced in June 2000, it was not until February 2001 that Celera and the HGP scientists published details of their drafts. Special issues of Nature (which published the publicly funded project's scientific paper)[45] and Science (which published Celera's paper[46]) described the methods used to produce the draft sequence and offered analysis of the sequence. These drafts covered about 83% of the genome (90% of the euchromatic regions with 150,000 gaps and the order and orientation of many segments not yet established). In February 2001, at the time of the joint publications, press releases announced that the project had been completed by both groups. Improved drafts were announced in 2003 and 2005, filling in to approximately 92% of the sequence currently.

In the IHGSC international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of many collected samples were processed as DNA resources. Thus the donor identities were protected so neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project, with most of those libraries being created by Pieter J. de Jong's lab. Much of the sequence (>70%) of the reference genome produced by the public HGP came from a single anonymous male donor from Buffalo, New York (code name RP11).[47][48]

HGP scientists used white blood cells from the blood of two male and two female donors (randomly selected from 20 of each), with each donor yielding a separate DNA library. One of these libraries (RP11) was used considerably more than others, due to quality considerations. One minor technical issue is that male samples contain just over half as much DNA from the sex chromosomes (one X chromosome and one Y chromosome) compared to female samples (which contain two X chromosomes). The other 22 chromosomes (the autosomes) are the same for both sexes.

Although the main sequencing phase of the HGP has been completed, studies of DNA variation continue in the International HapMap Project, whose goal is to identify patterns of single-nucleotide polymorphism (SNP) groups (called haplotypes, or haps). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese people in Tokyo; Han Chinese in Beijing; and the French Centre d'Etude du Polymorphisme Humain (CEPH) resource, which consisted of residents of the United States having ancestry from Western and Northern Europe.

In the Celera Genomics private-sector project, DNA from five different individuals was used for sequencing. The lead scientist of Celera Genomics at that time, Craig Venter, later acknowledged (in a public letter to the journal Science) that his DNA was one of 21 samples in the pool, five of which were selected for use.[49][50]

In 2007, a team led by Jonathan Rothberg published James Watson's entire genome, unveiling the six-billion-nucleotide genome of a single individual for the first time.[51]

The work on interpretation and analysis of genome data is still in its initial stages. It is anticipated that detailed knowledge of the human genome will provide new avenues for advances in medicine and biotechnology. Clear practical results of the project emerged even before the work was finished. For example, a number of companies, such as Myriad Genetics, started offering easy ways to administer genetic tests that can show predisposition to a variety of illnesses, including breast cancer, hemostasis disorders, cystic fibrosis, liver diseases and many others. Also, the etiologies for cancers, Alzheimer's disease and other areas of clinical interest are considered likely to benefit from genome information and possibly may lead in the long term to significant advances in their management.[37][52]

There are also many tangible benefits for biologists. For example, a researcher investigating a certain form of cancer may have narrowed down his/her search to a particular gene. By visiting the human genome database on the World Wide Web, this researcher can examine what other scientists have written about this gene, including (potentially) the three-dimensional structure of its product, its function(s), its evolutionary relationships to other human genes, or to genes in mice or yeast or fruit flies, possible detrimental mutations, interactions with other genes, body tissues in which this gene is activated, and diseases associated with this gene or other datatypes. Further, deeper understanding of the disease processes at the level of molecular biology may determine new therapeutic procedures. Given the established importance of DNA in molecular biology and its central role in determining the fundamental operation of cellular processes, it is likely that expanded knowledge in this area will facilitate medical advances in numerous areas of clinical interest that may not have been possible without them.[53]

The analysis of similarities between DNA sequences from different organisms is also opening new avenues in the study of evolution. In many cases, evolutionary questions can now be framed in terms of molecular biology; indeed, many major evolutionary milestones (the emergence of the ribosome and organelles, the development of embryos with body plans, the vertebrate immune system) can be related to the molecular level. Many questions about the similarities and differences between humans and our closest relatives (the primates, and indeed the other mammals) are expected to be illuminated by the data in this project.[37][54]

The project inspired and paved the way for genomic work in other fields, such as agriculture. For example, by studying the genetic composition of Triticum aestivum, the world's most widely grown bread wheat, great insight has been gained into the ways that domestication has impacted the evolution of the plant.[55] Which loci are most susceptible to manipulation, and how does this play out in evolutionary terms? Genetic sequencing has allowed these questions to be addressed for the first time, as specific loci can be compared in wild and domesticated strains of the plant. This will allow for advances in genetic modification in the future which could yield healthier, more disease-resistant wheat crops.

At the onset of the Human Genome Project several ethical, legal, and social concerns were raised in regards to how increased knowledge of the human genome could be used to discriminate against people. One of the main concerns of most individuals was the fear that both employers and health insurance companies would refuse to hire individuals or refuse to provide insurance to people because of a health concern indicated by someone's genes.[56] In 1996 the United States passed the Health Insurance Portability and Accountability Act (HIPAA) which protects against the unauthorized and non-consensual release of individually identifiable health information to any entity not actively engaged in the provision of healthcare services to a patient.[57]

Along with identifying all of the approximately 20,000–25,000 genes in the human genome, the Human Genome Project also sought to address the ethical, legal, and social issues that were created by the onset of the project. For that purpose the Ethical, Legal, and Social Implications (ELSI) program was founded in 1990. Five percent of the annual budget was allocated to address the ELSI issues arising from the project.[20][58] This budget started at approximately $1.57 million in 1990 and increased to approximately $18 million by 2014.[59]

Whilst the project may offer significant benefits to medicine and scientific research, some authors have emphasised the need to address the potential social consequences of mapping the human genome. "Molecularising disease and their possible cure will have a profound impact on what patients expect from medical help and the new generation of doctors' perception of illness."[60]

Read more from the original source:
Human Genome Project - Wikipedia

Posted in Genome | Comments Off on Human Genome Project – Wikipedia

About NHGRI – Genome.gov | National Human Genome Research …

Posted: July 10, 2016 at 5:53 pm

Feature Video now available The Genomic Landscape of Breast Cancer in Women of African Ancestry

On Tuesday, June 7, Olufunmilayo I. Olopade, M.D., F.A.C.P., presented The Genomic Landscape of Breast Cancer in Women of African Ancestry, the final lecture in the 2016 Genomics and Health Disparities Lecture Series. Dr. Olufunmilayo is director of the Center for Clinical Cancer Genetics at the University of Chicago School of Medicine. Read more | Watch the video

The NHGRI History of Genomics Program closed its six-part seminar series featuring Human Genome Project (HGP) participants who helped launch the HGP with the talk: The Genome is for Life, by David Bentley, D.Phil., on Thursday, May 26th. Dr. Bentley is vice president and chief scientist at Illumina Inc. His long-term research interest is the study of human sequence variation and its impact on health and disease. Read more about the series

In this issue of The Genomics Landscape, we feature the use of model organisms to explore the function of genes implicated in human disease. This month's issue also highlights a recently completed webinar series to help professionals in the health insurance industry understand genetic testing, new funding for training in genomic medicine research, and NHGRI's Genome Statute and Legislation Database. Read more

Cristina Kapusti, M.S., has been named chief of the Policy and Program Analysis Branch (PPAB) at the National Human Genome Research Institute (NHGRI). In her new role, she will oversee policy activities and evaluation as well as program reporting and assessment to support institute priorities. PPAB is a part of the Division of Policy, Communications and Education (DPCE), whose mission is to promote the understanding and application of genomic knowledge to advance human health and society. Read more

Last Updated: July 7, 2016

See the original post:
About NHGRI - Genome.gov | National Human Genome Research ...

Posted in Genome | Comments Off on About NHGRI – Genome.gov | National Human Genome Research …

Genomics – Wikipedia, the free encyclopedia

Posted: at 5:53 pm

Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism).[1][2] Advances in genomics have triggered a revolution in discovery-based research to understand even the most complex biological systems such as the brain.[3] The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome.[4] In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.[5][6]

From the Greek gen,[7] "gene" (gamma, epsilon, nu, epsilon), meaning "become, create, creation, birth", and subsequent variants: genealogy, genesis, genetics, genic, genomere, genotype, genus etc. While the word genome (from the German Genom, attributed to Hans Winkler) was in use in English as early as 1926,[8] the term genomics was coined by Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beer at a meeting held in Maryland on the mapping of the human genome in 1986.[9]

Following Rosalind Franklin's confirmation of the helical structure of DNA, James D. Watson and Francis Crick's publication of the structure of DNA in 1953 and Fred Sanger's publication of the Amino acid sequence of insulin in 1955, nucleic acid sequencing became a major target of early molecular biologists.[10] In 1964, Robert W. Holley and colleagues published the first nucleic acid sequence ever determined, the ribonucleotide sequence of alanine transfer RNA.[11][12] Extending this work, Marshall Nirenberg and Philip Leder revealed the triplet nature of the genetic code and were able to determine the sequences of 54 out of 64 codons in their experiments.[13] In 1972, Walter Fiers and his team at the Laboratory of Molecular Biology of the University of Ghent (Ghent, Belgium) were the first to determine the sequence of a gene: the gene for Bacteriophage MS2 coat protein.[14] Fiers' group expanded on their MS2 coat protein work, determining the complete nucleotide-sequence of bacteriophage MS2-RNA (whose genome encodes just four genes in 3569 base pairs [bp]) and Simian virus 40 in 1976 and 1978, respectively.[15][16]

In addition to his seminal work on the amino acid sequence of insulin, Frederick Sanger and his colleagues played a key role in the development of DNA sequencing techniques that enabled the establishment of comprehensive genome sequencing projects.[4] In 1975, he and Alan Coulson published a sequencing procedure using DNA polymerase with radiolabelled nucleotides that he called the Plus and Minus technique.[17][18] This involved two closely related methods that generated short oligonucleotides with defined 3' termini. These could be fractionated by electrophoresis on a polyacrylamide gel and visualised using autoradiography. The procedure could sequence up to 80 nucleotides in one go and was a big improvement, but was still very laborious. Nevertheless, in 1977 his group was able to sequence most of the 5,386 nucleotides of the single-stranded bacteriophage ΦX174, completing the first fully sequenced DNA-based genome.[19] The refinement of the Plus and Minus method resulted in the chain-termination, or Sanger method (see below), which formed the basis of the techniques of DNA sequencing, genome mapping, data storage, and bioinformatic analysis most widely used in the following quarter-century of research.[20][21] In the same year Walter Gilbert and Allan Maxam of Harvard University independently developed the Maxam-Gilbert method (also known as the chemical method) of DNA sequencing, involving the preferential cleavage of DNA at known bases, a less efficient method.[22][23] For their groundbreaking work in the sequencing of nucleic acids, Gilbert and Sanger shared half the 1980 Nobel Prize in chemistry with Paul Berg (recombinant DNA).

The advent of these technologies resulted in a rapid intensification in the scope and speed of completion of genome sequencing projects. The first complete genome sequence of an eukaryotic organelle, the human mitochondrion (16,568 bp, about 16.6 kb [kilobase]), was reported in 1981,[24] and the first chloroplast genomes followed in 1986.[25][26] In 1992, the first eukaryotic chromosome, chromosome III of brewer's yeast Saccharomyces cerevisiae (315 kb) was sequenced.[27] The first free-living organism to be sequenced was that of Haemophilus influenzae (1.8 Mb [megabase]) in 1995.[28] The following year a consortium of researchers from laboratories across North America, Europe, and Japan announced the completion of the first complete genome sequence of a eukaryote, S. cerevisiae (12.1 Mb), and since then genomes have continued being sequenced at an exponentially growing pace.[29] As of October 2011[update], the complete sequences are available for: 2,719 viruses, 1,115 archaea and bacteria, and 36 eukaryotes, of which about half are fungi.[30][31]

Most of the microorganisms whose genomes have been completely sequenced are problematic pathogens, such as Haemophilus influenzae, which has resulted in a pronounced bias in their phylogenetic distribution compared to the breadth of microbial diversity.[32][33] Of the other sequenced species, most were chosen because they were well-studied model organisms or promised to become good models. Yeast (Saccharomyces cerevisiae) has long been an important model organism for the eukaryotic cell, while the fruit fly Drosophila melanogaster has been a very important tool (notably in early pre-molecular genetics). The worm Caenorhabditis elegans is an often used simple model for multicellular organisms. The zebrafish Brachydanio rerio is used for many developmental studies on the molecular level, and the flower Arabidopsis thaliana is a model organism for flowering plants. The Japanese pufferfish (Takifugu rubripes) and the spotted green pufferfish (Tetraodon nigroviridis) are interesting because of their small and compact genomes, which contain very little noncoding DNA compared to most species.[34][35] The mammals dog (Canis familiaris),[36] brown rat (Rattus norvegicus), mouse (Mus musculus), and chimpanzee (Pan troglodytes) are all important model animals in medical research.[23]

A rough draft of the human genome was completed by the Human Genome Project in early 2001, creating much fanfare.[37] This project, completed in 2003, sequenced the entire genome for one specific person, and by 2007 this sequence was declared "finished" (less than one error in 20,000 bases and all chromosomes assembled).[37] In the years since then, the genomes of many other individuals have been sequenced, partly under the auspices of the 1000 Genomes Project, which announced the sequencing of 1,092 genomes in October 2012.[38] Completion of this project was made possible by the development of dramatically more efficient sequencing technologies and required the commitment of significant bioinformatics resources from a large international collaboration.[39] The continued analysis of human genomic data has profound political and social repercussions for human societies.[40]

The English-language neologism omics informally refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome or metabolome respectively. The suffix -ome as used in molecular biology refers to a totality of some sort; similarly omics has come to refer generally to the study of large, comprehensive biological data sets. While the growth in the use of the term has led some scientists (Jonathan Eisen, among others[41]) to claim that it has been oversold,[42] it reflects the change in orientation towards the quantitative analysis of the complete or near-complete assortment of all the constituents of a system.[43] In the study of symbioses, for example, researchers who were once limited to the study of a single gene product can now simultaneously compare the total complement of several types of biological molecules.[44][45]

After an organism has been selected, genome projects involve three components: the sequencing of DNA, the assembly of that sequence to create a representation of the original chromosome, and the annotation and analysis of that representation.[4]

Historically, sequencing was done in sequencing centers, centralized facilities (ranging from large independent institutions such as Joint Genome Institute which sequence dozens of terabases a year, to local molecular biology core facilities) which contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast turnaround benchtop sequencers has come within reach of the average academic laboratory.[46][47] On the whole, genome sequencing approaches fall into two broad categories, shotgun and high-throughput (aka next-generation) sequencing.[4]

Shotgun sequencing (a term often applied to Sanger-based whole-genome sequencing) is a sequencing method designed for the analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes.[48] It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. Since the chain-termination method of DNA sequencing can only be used for fairly short strands (100 to 1000 base pairs), longer DNA sequences must be broken into random small segments, which are then sequenced to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.[48][49] Shotgun sequencing is a random sampling process, requiring over-sampling to ensure that a given nucleotide is represented in the reconstructed sequence; the average number of reads by which a genome is over-sampled is referred to as coverage.[50]
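
As a rough illustration of the coverage arithmetic, here is a minimal Python sketch (not from the source) that computes depth of coverage from read count, read length, and genome size, plus the fraction of bases expected to be covered at least once under the idealized Lander-Waterman model; the function names and example numbers are purely illustrative assumptions.

```python
# Minimal sketch of sequencing coverage and expected genome completeness
# under the idealized Lander-Waterman model. Numbers are illustrative.
import math

def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total bases sequenced / genome size."""
    return num_reads * read_length / genome_size

def expected_fraction_covered(depth: float) -> float:
    """Lander-Waterman expectation for the fraction of bases covered >= once."""
    return 1.0 - math.exp(-depth)

if __name__ == "__main__":
    # purely illustrative: a ~5 Mb bacterial genome, 500,000 reads of 100 bp
    depth = coverage(num_reads=500_000, read_length=100, genome_size=5_000_000)
    print(f"~{depth:.0f}x coverage; expected fraction of bases covered "
          f"~{expected_fraction_covered(depth):.4f}")
```

With these assumed numbers the genome is over-sampled about 10-fold, yet a small fraction of bases is still expected to be missed, which is why finished sequences require additional targeted work.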

For much of its history, the technology underlying shotgun sequencing was the classical chain-termination method, which is based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.[19][51] Developed by Frederick Sanger and colleagues in 1977, it was the most widely used sequencing method for approximately 25 years. More recently, Sanger sequencing has been supplanted by "Next-Gen" sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use in 2013, primarily for smaller-scale projects and for obtaining especially long contiguous DNA sequence reads (>500 nucleotides).[52] Chain-termination methods require a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleoside triphosphates (dNTPs), and modified nucleotides (dideoxynucleotide triphosphates, ddNTPs) that terminate DNA strand elongation. These chain-terminating nucleotides lack a 3'-OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in automated sequencing machines.[4] Typically, these automated DNA-sequencing instruments (DNA sequencers) can sequence up to 96 DNA samples in a single batch (run) in up to 48 runs a day.[53]
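
The following is a simplified, illustrative sketch of the chain-termination readout in principle, not the actual chemistry or any real instrument's software: every position at which a given ddNTP is incorporated yields a terminated fragment of that length, and ordering all fragments by length across the four reactions recovers the read. The function names and the example template are hypothetical.

```python
# Idealized sketch of Sanger chain-termination readout: each base position
# yields one terminated fragment, and sorting fragments by length across
# the four ddNTP reactions reconstructs the synthesized (complementary) strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sanger_fragments(template: str):
    """Map each ddNTP base -> lengths of the terminated fragments it produces."""
    read = "".join(COMPLEMENT[b] for b in template)  # strand synthesized on the template
    fragments = {"A": [], "C": [], "G": [], "T": []}
    for i, base in enumerate(read, start=1):
        fragments[base].append(i)   # a fragment of length i ends in ddNTP 'base'
    return fragments

def read_gel(fragments):
    """Sort all fragment lengths (shortest first) to reconstruct the read."""
    by_len = sorted((length, base) for base, lens in fragments.items() for length in lens)
    return "".join(base for _, base in by_len)

if __name__ == "__main__":
    frags = sanger_fragments("TACCGGAT")
    print(read_gel(frags))   # -> ATGGCCTA, the complement of the template
```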

The high demand for low-cost sequencing has driven the development of high-throughput sequencing (or next-generation sequencing [NGS]) technologies that parallelize the sequencing process, producing thousands or millions of sequences at once.[54][55] High-throughput sequencing technologies are intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods. In ultra-high-throughput sequencing as many as 500,000 sequencing-by-synthesis operations may be run in parallel.[56][57]

Solexa, now part of Illumina, developed a sequencing method based on reversible dye-terminator technology acquired from Manteia Predictive Medicine in 2004. This technology had been invented and developed in late 1996 at Glaxo Wellcome's Geneva Biomedical Research Institute (GBRI) by Dr. Pascal Mayer and Dr. Laurent Farinelli.[58] In this method, DNA molecules and primers are first attached on a slide and amplified with polymerase so that local clonal colonies, initially coined "DNA colonies", are formed. To determine the sequence, four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera.

Decoupling the enzymatic reaction and the image capture allows for optimal throughput and theoretically unlimited sequencing capacity. With an optimal configuration, the ultimately reachable instrument throughput is thus dictated solely by the analog-to-digital conversion rate of the camera, multiplied by the number of cameras and divided by the number of pixels per DNA colony required for visualizing them optimally (approximately 10 pixels/colony). In 2012, with cameras operating at more than 10 MHz A/D conversion rates and available optics, fluidics and enzymatics, throughput can be multiples of 1 million nucleotides/second, corresponding roughly to 1 human genome equivalent at 1x coverage per hour per instrument, and 1 human genome re-sequenced (at approx. 30x) per day per instrument (equipped with a single camera). The camera takes images of the fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle.[59]
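
The throughput arithmetic described above can be made concrete with a short sketch. The constants below are the illustrative values given in the text (a 10 MHz A/D rate, a single camera, about 10 pixels per colony, and a ~3-billion-base human genome), not specifications of any particular instrument.

```python
# Sketch of the throughput calculation: A/D rate x cameras / pixels-per-colony.
AD_RATE_HZ = 10_000_000        # camera analog-to-digital conversion rate (10 MHz)
NUM_CAMERAS = 1
PIXELS_PER_COLONY = 10         # pixels needed to resolve one DNA colony
HUMAN_GENOME_BASES = 3_000_000_000

bases_per_second = AD_RATE_HZ * NUM_CAMERAS / PIXELS_PER_COLONY   # ~1e6 nt/s
bases_per_hour = bases_per_second * 3600                          # ~3.6e9 nt
bases_per_day = bases_per_second * 86400                          # ~8.6e10 nt

print(f"~{bases_per_hour / HUMAN_GENOME_BASES:.1f} genome equivalents at 1x per hour")
print(f"~{bases_per_day / HUMAN_GENOME_BASES:.0f}x coverage of one human genome per day")
```

Running the sketch reproduces the figures quoted in the text: roughly one genome equivalent at 1x per hour and close to 30x coverage of a human genome per instrument per day.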

Ion Torrent Systems Inc. developed a sequencing approach based on standard DNA replication chemistry. This technology measures the release of a hydrogen ion each time a base is incorporated. A microwell containing template DNA is flooded with a single nucleotide; if the nucleotide is complementary to the template strand, it will be incorporated and a hydrogen ion will be released. This release triggers an ISFET ion sensor. If a homopolymer is present in the template sequence, multiple nucleotides will be incorporated in a single flood cycle, and the detected electrical signal will be proportionally higher.[60]
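
Below is a minimal sketch of the flow-based signal model just described, assuming an idealized instrument: each nucleotide flood extends the nascent strand only where it complements the template, and a homopolymer run is consumed in a single flow with a proportionally larger signal. The function and the flow order are illustrative assumptions, not Ion Torrent's actual software.

```python
# Idealized flow-based sequencing signal: one (nucleotide, signal) pair per flood,
# where signal is proportional to the homopolymer run length incorporated.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flow_signals(template: str, flow_order: str = "TACG"):
    """Return (flowed nucleotide, signal) pairs for a single idealized template."""
    pos, signals = 0, []
    while pos < len(template):
        for nt in flow_order:
            run = 0
            # the flooded nucleotide is incorporated wherever it complements the
            # template; an entire homopolymer run is consumed in one flow
            while pos < len(template) and COMPLEMENT[template[pos]] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))   # detected signal height ~ run length
            if pos >= len(template):
                break
    return signals

if __name__ == "__main__":
    for nt, signal in flow_signals("AAAT"):
        print(nt, signal)   # the T flow over the "AAA" run gives a signal of 3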

[Figure captions from the original article: Overlapping reads form contigs; contigs and gaps of known length form scaffolds. Paired-end reads of next-generation sequencing data mapped to a reference genome. Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.]

Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.[4] This is needed as current DNA sequencing technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcripts (ESTs).[4]

Assembly can be broadly categorized into two approaches: de novo assembly, for genomes that are not similar to any sequenced previously, and comparative assembly, which uses the existing sequence of a closely related organism as a reference during assembly.[50] Relative to comparative assembly, de novo assembly is computationally difficult (NP-hard), which makes it less favorable for short-read NGS technologies.
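
To make the idea of overlap-based de novo assembly concrete, here is a toy greedy sketch: it repeatedly merges the pair of reads with the longest suffix-prefix overlap. Real assemblers use overlap-layout-consensus or de Bruijn graph methods and must cope with sequencing errors and repeats, none of which this does; the example reads are invented for illustration.

```python
# Toy greedy overlap-merge assembler: repeatedly merge the two reads with the
# longest suffix-prefix overlap until one contig remains. Illustration only.
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate start of an overlap in a
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads: list[str]) -> str:
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)                    # (overlap length, index i, index j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    olen = overlap(reads[i], reads[j])
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:                       # no overlaps left: just join the contigs
            return "".join(reads)
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

if __name__ == "__main__":
    print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))
    # -> ATTAGACCTGCCGGAATAC
```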

Finished genomes are defined as having a single contiguous sequence with no ambiguities representing each replicon.[61]

The DNA sequence assembly alone is of little value without additional analysis.[4] Genome annotation is the process of attaching biological information to sequences, and consists of three main steps:[62] identifying portions of the genome that do not code for proteins, identifying elements on the genome (a process called gene prediction), and attaching biological information to these elements.

Automatic annotation tools try to perform these steps in silico, as opposed to manual annotation (a.k.a. curation), which involves human expertise and potential experimental verification.[63] Ideally, these approaches co-exist and complement each other in the same annotation pipeline (also see below).

Traditionally, the basic level of annotation is using BLAST to find similarities and then annotating genomes on the basis of homologues.[4] More recently, additional information is added to the annotation platform; this additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on both curated data sources and a range of software tools in their automated genome annotation pipelines.[64] Structural annotation consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. Functional annotation consists of attaching biological information to these genomic elements.
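
As a minimal illustration of the structural-annotation step just mentioned, the sketch below scans the three forward-strand reading frames for open reading frames. Real gene prediction uses statistical gene models, splice-site detection, and homology evidence, none of which appear here; the function name and length threshold are assumptions.

```python
# Naive forward-strand ORF scan: report ATG...stop stretches in each frame.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 30):
    """Yield (start, end, frame) for forward-strand ORFs with >= min_codons codons."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                        # first start codon opens the frame
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)  # half-open coordinates [start, end)
                start = None

if __name__ == "__main__":
    # toy usage with a lowered threshold for the illustration
    print(list(find_orfs("CCATGAAATTTGGGTAACC", min_codons=2)))   # -> [(2, 17, 2)]
```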

The need for reproducibility and efficient management of the large amounts of data associated with genome projects means that computational pipelines have important applications in genomics.[65]
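
A minimal sketch of such a pipeline, assuming a simple file-based workflow: each stage writes its output to disk and is skipped on re-runs if that output already exists, a crude stand-in for the bookkeeping that dedicated workflow managers (e.g., Snakemake or Nextflow) provide. The stage names and file paths are hypothetical, and the stage bodies are placeholders rather than real tools.

```python
# Hypothetical file-based pipeline sketch: explicit, ordered, re-runnable stages.
from pathlib import Path

def stage(output: str):
    """Decorator: run the wrapped step only if its output file is missing."""
    def wrap(func):
        def run(*args, **kwargs):
            out = Path(output)
            if out.exists():
                print(f"[skip] {out} already exists")
                return out
            out.parent.mkdir(parents=True, exist_ok=True)
            func(out, *args, **kwargs)
            return out
        return run
    return wrap

@stage("results/contigs.fa")
def assemble(out: Path, reads: str):
    out.write_text(f"# contigs assembled from {reads}\n")    # placeholder work

@stage("results/annotation.gff")
def annotate(out: Path, contigs: Path):
    out.write_text(f"# annotations for {contigs}\n")          # placeholder work

if __name__ == "__main__":
    contigs = assemble("data/reads.fastq")
    annotate(contigs)
```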

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional gene-by-gene approach.

A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics.

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome.[66][67] This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly, especially because the availability of large numbers of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs. Structural genomics involves taking a large number of approaches to structure determination, including experimental methods using genomic sequences or modeling-based approaches based on sequence or structural homology to a protein of known structure or based on chemical and physical principles for a protein with no homology to any known structure. As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein function. This raises new challenges in structural bioinformatics, i.e., determining protein function from its 3D structure.[68]

Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.[69] Epigenetic modifications are reversible modifications of a cell's DNA or histones that affect gene expression without altering the DNA sequence (Russell 2010, p. 475). Two of the best-characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as differentiation, development, and tumorigenesis.[69] The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.[70]

Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods.[71] Recent studies use "shotgun" Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities.[72] Because of its power to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.[73][74]

Bacteriophages have played and continue to play a key role in bacterial genetics and molecular biology. Historically, they were used to define gene structure and gene regulation. Indeed, the first genome to be sequenced was that of a bacteriophage. However, bacteriophage research did not lead the genomics revolution, which is clearly dominated by bacterial genomics. Only very recently has the study of bacteriophage genomes become prominent, thereby enabling researchers to understand the mechanisms underlying phage evolution. Bacteriophage genome sequences can be obtained through direct sequencing of isolated bacteriophages, but can also be derived as part of microbial genomes. Analysis of bacterial genomes has shown that a substantial amount of microbial DNA consists of prophage sequences and prophage-like elements.[75] A detailed database mining of these sequences offers insights into the role of prophages in shaping the bacterial genome.[76][77]

At present there are 24 cyanobacteria for which a complete genome sequence is available. Fifteen of these cyanobacteria come from the marine environment: six Prochlorococcus strains, seven marine Synechococcus strains, Trichodesmium erythraeum IMS101 and Crocosphaera watsonii WH8501. Several studies have demonstrated how these sequences could be used very successfully to infer important ecological and physiological characteristics of marine cyanobacteria. However, many more genome projects are currently in progress, among them further Prochlorococcus and marine Synechococcus isolates, Acaryochloris and Prochloron, the N2-fixing filamentous cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as bacteriophages infecting marine cyanobacteria. Thus, the growing body of genome information can also be tapped in a more general way to address global problems by applying a comparative approach. Some new and exciting examples of progress in this field are the identification of genes for regulatory RNAs, insights into the evolutionary origin of photosynthesis, and estimation of the contribution of horizontal gene transfer to the genomes that have been analyzed.[78]

Genomics has provided applications in many fields, including medicine, biotechnology, anthropology and other social sciences.[40]

Next-generation genomic technologies allow clinicians and biomedical researchers to drastically increase the amount of genomic data collected on large study populations.[79] When combined with new informatics approaches that integrate many kinds of data with genomic data in disease research, this allows researchers to better understand the genetic bases of drug response and disease.[80][81]

The growth of genomic knowledge has enabled increasingly sophisticated applications of synthetic biology.[82] In 2010 researchers at the J. Craig Venter Institute announced the creation of a partially synthetic species of bacterium, Mycoplasma laboratorium, derived from the genome of Mycoplasma genitalium.[83]

Conservationists can use the information gathered by genomic sequencing in order to better evaluate genetic factors key to species conservation, such as the genetic diversity of a population or whether an individual is heterozygous for a recessive inherited genetic disorder.[84] By using genomic data to evaluate the effects of evolutionary processes and to detect patterns in variation throughout a given population, conservationists can formulate plans to aid a given species without as many variables left unknown as those unaddressed by standard genetic approaches.[85]

See the rest here:
Genomics - Wikipedia, the free encyclopedia

Posted in Genome | Comments Off on Genomics – Wikipedia, the free encyclopedia

The Cost of Sequencing a Human Genome

Posted: July 3, 2016 at 12:07 pm

Advances in the field of genomics over the past quarter-century have led to substantial reductions in the cost of genome sequencing. The underlying costs associated with different methods and strategies for sequencing genomes are of great interest because they influence the scope and scale of almost all genomics research projects. As a result, significant scrutiny and attention have been given to genome-sequencing costs and how they are calculated since the beginning of the field of genomics in the late 1980s. For example, NHGRI has carefully tracked costs at its funded 'genome sequencing centers' for many years (see Figure 1). With the growing scale of human genetics studies and the increasing number of clinical applications for genome sequencing, even greater attention is being paid to understanding the underlying costs of generating a human genome sequence.

Accurately determining the cost for sequencing a given genome (e.g., a human genome) is not simple. There are many parameters to define and nuances to consider. In fact, it is difficult to cite precise genome-sequencing cost figures that mean the same thing to all people because, in reality, different researchers, research institutions, and companies typically track and account for such costs in different fashions.

A genome consists of all of the DNA contained in a cell's nucleus. DNA is composed of four chemical building blocks or "bases" (for simplicity, abbreviated G, A, T, and C), with the biological information encoded within DNA determined by the order of those bases. Diploid organisms, like humans and all other mammals, contain duplicate copies of almost all of their DNA (i.e., pairs of chromosomes, with one chromosome of each pair inherited from each parent). The size of an organism's genome is generally considered to be the total number of bases in one representative copy of its nuclear DNA. In the case of diploid organisms (like humans), that corresponds to the sum of the sizes of one copy of each chromosome pair.

Organisms generally differ in their genome sizes. For example, the genome of E. coli (a bacterium that lives in your gut) is ~5 million bases (i.e., 5 megabases), that of a fruit fly is ~123 million bases, and that of a human is ~3,000 million bases (or ~3 billion bases). There are also some surprising extremes, such as the loblolly pine tree - its genome is ~23 billion bases in size, over seven times larger than ours. Obviously, the cost to sequence a genome depends on its size. The discussion below is focused on the human genome; keep in mind that a single 'representative' copy of the human genome is ~3 billion bases in size, whereas a given person's actual (diploid) genome is ~6 billion bases in size.

Genomes are large and, at least with today's methods, their bases cannot be 'read out' in order (i.e., sequenced) end-to-end in a single step. Rather, to sequence a genome, its DNA must first be broken down into smaller pieces, with each resulting piece then subjected to chemical reactions that allow the identity and order of its bases to be deduced. The established base order derived from each piece of DNA is often called a 'sequence read,' and the collection of the resulting set of sequence reads (often numbering in the billions) is then computationally assembled back together to deduce the sequence of the starting genome. Sequencing of human genomes is nowadays aided by the availability of 'reference' sequences of the human genome, which play an important role in the computational assembly process. Historically, the process of breaking down genomes, sequencing the individual pieces of DNA, and then reassembling the individual sequence reads to generate a sequence of the starting genome was called 'shotgun sequencing' (although this terminology is used less frequently today). When an entire genome is being sequenced, the process is called 'whole-genome sequencing.' See Figure 2 for a comparison of human genome sequencing methods during the time of the Human Genome Project and circa 2016.

An alternative to whole-genome sequencing is the targeted sequencing of part of a genome. Most often, this involves just sequencing the protein-coding regions of a genome, which reside within DNA segments called 'exons' and reflect the currently 'best understood' part of most genomes. For example, all of the exons in the human genome (the human 'exome') correspond to ~1.5% of the total human genome. Methods are now readily available to experimentally 'capture' (or isolate) just the exons, which can then be sequenced to generate a 'whole-exome sequence' of a genome. Whole-exome sequencing does require extra laboratory manipulations, so a whole-exome sequence does not cost ~1.5% of a whole-genome sequence. But since much less DNA is sequenced, whole-exome sequencing is (at least currently) cheaper than whole-genome sequencing.
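
To see why the data-generation component is smaller for an exome, consider a back-of-the-envelope comparison: the number of raw bases that must be sequenced scales with target size times coverage depth. The coverage depths below are assumed typical values rather than figures from the text; the ~1.5% exome fraction is taken from the text.

```python
# Back-of-the-envelope comparison of raw bases sequenced for whole-genome
# versus whole-exome sequencing. Coverage depths are assumed typical values.
GENOME_SIZE = 3_000_000_000              # haploid human reference, ~3 Gb
EXOME_SIZE = int(GENOME_SIZE * 0.015)    # ~1.5% of the genome (~45 Mb)

wgs_bases = GENOME_SIZE * 30             # assume ~30x whole-genome coverage
wes_bases = EXOME_SIZE * 100             # assume ~100x whole-exome coverage

print(f"whole-genome: ~{wgs_bases / 1e9:.0f} Gb of raw sequence")
print(f"whole-exome:  ~{wes_bases / 1e9:.1f} Gb of raw sequence")
print(f"exome requires ~{wes_bases / wgs_bases:.1%} of the raw data, "
      "before exon-capture and other library costs are added back in")
```

Under these assumptions the exome requires only a few percent of the raw sequence data of a whole genome, which is why it is cheaper despite the extra capture steps, though not 1.5% of the cost.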

Another important driver of the costs associated with generating genome sequences relates to data quality. That quality is heavily dependent upon the average number of times each base in the genome is actually 'read' during the sequencing process. During the Human Genome Project (HGP), the typical levels of quality considered were: (1) 'draft sequence' (covering ~90% of the genome at ~99.9% accuracy); and (2) 'finished sequence' (covering >95% of the genome at ~99.99% accuracy). Producing truly high-quality 'finished' sequence by this definition is very expensive; of note, the process of 'sequence finishing' is very labor-intensive and is thus associated with high costs. In fact, most human genome sequences produced today are 'draft sequences' (sometimes above and sometimes below the accuracy defined above).

There are thus a number of factors to consider when calculating the costs associated with genome sequencing. There are multiple different types and quality levels of genome sequences, and there can be many steps and activities involved in the process itself. Understanding the true cost of a genome sequence therefore requires knowledge about what was and was not included in calculating that cost (e.g., sequence data generation, sequence finishing, upfront activities such as mapping, equipment amortization, overhead, utilities, salaries, data analyses, etc.). In reality, there are often differences in what gets included when estimating genome-sequencing costs in different situations.

Below is summary information about: (1) the estimated cost of sequencing the first human genome as part of the HGP; (2) the estimated cost of sequencing a human genome in 2006 (i.e., roughly a decade ago); and (3) the estimated cost of sequencing a human genome in 2016 (i.e., the present time).

The HGP generated a 'reference' sequence of the human genome - specifically, it sequenced one representative version of all parts of each human chromosome (totaling ~3 billion bases). In the end, the quality of the 'finished' sequence was very high, with an estimated error rate of <1 in 100,000 bases; note that this accuracy is considerably higher than that of a typical human genome sequence produced today. The generated sequence did not come from one person's genome, and, being a 'reference' sequence of ~3 billion bases, really reflects half of what is generated when an individual person's ~6-billion-base genome is sequenced (see below).

The HGP involved first mapping and then sequencing the human genome. The former was required at the time because there was otherwise no 'framework' for organizing the actual sequencing or the resulting sequence data. The maps of the human genome served as 'scaffolds' on which to connect individual segments of assembled DNA sequence. These genome-mapping efforts were quite expensive, but were essential at the time for generating an accurate genome sequence. It is difficult to estimate the costs associated with the 'human genome mapping phase' of the HGP, but it was certainly in the many tens of millions of dollars (and probably hundreds of millions of dollars).

Once significant human genome sequencing began for the HGP, a 'draft' human genome sequence (as described above) was produced over a 15-month period (from April 1999 to June 2000). The estimated cost for generating that initial 'draft' human genome sequence is ~$300 million worldwide, of which NIH provided roughly 50-60%.

The HGP then proceeded to refine the 'draft' and produce a 'finished' human genome sequence (as described above), which was achieved by 2003. The estimated cost for advancing the 'draft' human genome sequence to the 'finished' sequence is ~$150 million worldwide. Of note, generating the final human genome sequence by the HGP also relied on the sequences of small targeted regions of the human genome that were generated before the HGP's main production-sequencing phase; it is impossible to estimate the costs associated with these various other genome-sequencing efforts, but they likely total in the tens of millions of dollars.

The above explanation illustrates the difficulty in coming up with a single, accurate number for the cost of generating that first human genome sequence as part of the HGP. Such a calculation requires a clear delineation about what does and does not get 'counted' in the estimate; further, most of the cost estimates for individual components can only be given as ranges. At the lower bound, it would seem that this cost figure is at least $500 million; at the upper bound, this cost figure could be as high as $1 billion. The truth is likely somewhere in between.

The above estimated cost for generating the first human genome sequence by the HGP should not be confused with the total cost of the HGP. The originally projected cost for the U.S.'s contribution to the HGP was $3 billion; in actuality, the Project ended up taking less time (~13 years rather than ~15 years) and requiring less funding - ~$2.7 billion. But the latter number represents the total U.S. funding for a wide range of scientific activities under the HGP's umbrella beyond human genome sequencing, including technology development, physical and genetic mapping, model organism genome mapping and sequencing, bioethics research, and program management. Further, this amount does not reflect the additional funds for an overlapping set of activities pursued by other countries that participated in the HGP.

As the HGP was nearing completion, genome-sequencing pipelines had stabilized to the point that NHGRI was able to collect fairly reliable cost information from the major sequencing centers funded by the Institute. Based on these data, NHGRI estimated that the hypothetical 2003 cost to generate a 'second' reference human genome sequence using the then-available approaches and technologies was in the neighborhood of $50 million.

Since the completion of the HGP and the generation of the first 'reference' human genome sequence, efforts have increasingly shifted to the generation of human genome sequences from individual people. Sequencing an individual's 'personal' genome actually involves establishing the identity and order of ~6 billion bases of DNA (rather than a ~3-billion-base 'reference' sequence; see above). Thus, the generation of a person's genome sequence is a notably different endeavor than what the HGP did.

Within a few years following the end of the HGP (e.g., in 2006), the landscape of genome sequencing was beginning to change. While revolutionary new DNA sequencing technologies, such as those in use today, were not quite implemented at that time, genomics groups continued to refine the basic methodologies used during the HGP and continued lowering the costs for genome sequencing. Considerable effort was being devoted to sequencing nonhuman genomes (much more so than human genomes), but the cost-accounting data collected at that time can be used to estimate the approximate cost that would have been associated with human genome sequencing at that time.

Based on data collected by NHGRI from the Institute's funded genome-sequencing groups, the cost to generate a high-quality 'draft' human genome sequence had dropped to ~$14 million by 2006. Hypothetically, it would have likely cost upwards of $20-25 million to generate a 'finished' human genome sequence - expensive, but still considerably less so than for generating the first reference human genome sequence.

The decade following the HGP brought revolutionary advances in DNA sequencing technologies that are fundamentally changing the nature of genomics. So-called 'next-generation' DNA sequencing methods arrived on the scene, and their effects quickly became evident in terms of lowering genome-sequencing costs (note that the NHGRI-collected cost data are retrospective in nature, and do not always accurately reflect the 'projected' costs for genome sequencing going forward).

In 2015, the most common routine for sequencing an individual's human genome involves generating a 'draft' sequence and comparing it to a reference human genome sequence, so as to catalog all sequence variants in that genome; such a routine does not involve any sequence finishing. In short, nearly all human genome sequencing in 2015 yields high-quality 'draft' (but unfinished) sequence. That sequencing is typically targeted to all exons (whole-exome sequencing) or aimed at the entire ~6-billion-base genome (whole-genome sequencing), as discussed above. The quality of the resulting 'draft' sequences is heavily dependent on the amount of average base redundancy provided by the generated data (with higher redundancy costing more).

Adding to the complex landscape of genome sequencing in 2015 has been the emergence of commercial enterprises offering genome-sequencing services at competitive pricing. Direct comparisons between commercial versus academic genome-sequencing operations can be particularly challenging because of the many nuances about what each includes in any cost estimates (with such details often not revealed by private companies). The cost data that NHGRI collects from its funded genome-sequencing groups includes information about a wide range of activities and components, such as: reagents, consumables, DNA-sequencing instruments, certain computer equipment, other equipment, laboratory pipeline development, laboratory information management systems, initial data processing, submission of data to public databases, project management, utilities, other indirect costs, labor, and administration. Note that such cost-accounting does not typically include activities such as quality assurance/quality control (QA/QC), alignment of generated sequence to a reference human genome, sequence assembly, genomic variant calling, or annotation. Almost certainly, companies vary in terms of which of the items in the above lists get included in any cost estimates, making direct cost comparisons with academic genome-sequencing groups difficult. It is thus important to consider these variables - along with the distinction between retrospective versus projected costs - when comparing genome-sequencing costs claimed by different groups. Anyone comparing costs for genome sequencing should also be aware of the distinction between 'price' and 'cost' - a given price may be either higher or lower than the actual cost.

Based on the data collected from NHGRI-funded genome-sequencing groups, the cost to generate a high-quality 'draft' whole human genome sequence in mid-2015 was just above $4,000; by late in 2015, that figure had fallen below $1,500. The cost to generate a whole-exome sequence was generally below $1,000. Commercial prices for whole-genome and whole-exome sequences have often (but not always) been slightly below these numbers.

Innovation in genome-sequencing technologies and strategies does not appear to be slowing. As a result, one can readily expect continued reductions in the cost for human genome sequencing. The key factors to consider when assessing the 'value' associated with an estimated cost for generating a human genome sequence - in particular, the amount of the genome (whole versus exome), quality, and associated data analysis (if any) - will likely remain largely the same. With new DNA-sequencing platforms anticipated in the coming years, the nature of the generated sequence data and the associated costs will likely continue to be dynamic. As such, continued attention will need to be paid to the way in which the costs associated with genome sequencing are calculated.


Last Updated: June 6, 2016

Read more from the original source:
The Cost of Sequencing a Human Genome

Posted in Genome | Comments Off on The Cost of Sequencing a Human Genome
