Mutagenesis, Vol. 17, No. 6, 457-461, November 2002
© 2002 UK Environmental Mutagen Society/Oxford University Press

Human genome sequences: enigmatic variations

Ian Dunham1

The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK


    Abstract
 
The sequence of the human genome should be completed in 2003. The next steps are to obtain accurate annotation of the genes within the sequence and to begin to define the sequences of multiple human genomes.


    Introduction
 
Early in 2001, two publications reported sequences of the human genome (IHGSC, 2001; Venter et al., 2001). The iconic nature of this event required that the publications were accompanied by a(nother) media frenzy. However, this should not obscure two features of the sequences which I believe are key. First, both sequences were incomplete. At best they were working drafts which required editing and refinement. Second, only one of the sequences, that from the public sequencing laboratories (IHGSC, 2001), was available to all with unrestricted access. As time has passed, polishing of the public version has continued and as it approaches completion the value of the restricted access sequence has inevitably diminished. Very soon now we will be able to say that we have all the human genome sequence we can get with current methods.

But is one sequence enough? Early discussion of the human genome project (HGP) naturally concentrated on the need to determine a single sequence of the genome. The purpose of this sequence would be to act as a reference or ‘gold standard’ on which to base future studies. As the project moved forwards, there was much public and media interest in the possibility that the sequence might be from a single person, perhaps even the genome sequence of a celebrated individual. Indeed, over the past two and a half years of speaking about the HGP to schoolchildren, university students, the general public, the media and even scientists, the question I have been asked most often is ‘Whose DNA was used?’ The need to attach a single face directly to the sequence has even affected some of those involved in the sequencing, if we are to believe recent reports (Wade, 2002). The practical answer to the ‘whose DNA?’ question is that the public reference sequence of the human genome is a composite and is both representative of all our genomes and not representative of any one individual. In all probability the same is largely true of the human genome sequence that was generated in the private domain. However, the desire to individualize the genome sequence reflects our innate understanding of a simple truth that is self-evident to anyone who has ever lived in a family. That is that there are both similarities and differences between us and that many of these may have a significant genetic explanation. With the reference sequence in hand, the task now for biomedical science is to exploit the information in our genetic material to study the basis of human disease. To do this we will need both to use the existing sequence and sample many more human genome sequences.

The next steps in this task can be split into two logically separate areas. In the first we wish to understand the complete ‘parts list’ as detailed in the reference sequence and to describe how the parts behave in normal and diseased states. Despite recent descriptions of an increased role for non-protein coding genes, the parts that are of interest in this area are predominantly genes and their proteins. There is now a wide battery of techniques which can be used, including gene knockouts in model organisms, cDNA, protein and tissue microarrays, 2-dimensional gel electrophoresis, etc., and new techniques continue to develop. However, I would argue that despite much activity in this area, the data generated by these techniques can only be as complete as the baseline set of genes extracted from the genome sequence and that we are still some way short of having the definitive human gene annotation. In the second area we wish to understand how genetic differences in genes between individuals correlate with occurrence of disease and other medically relevant phenotypes. Here there is also a wide range of technologies, but each technology is trying to address the same question, namely how to sample many more human genome sequences from appropriately designed studies at practical cost. I believe we are some way from having the answer to this question, but recent developments in our understanding of the diversity of human chromosomes within populations are indicating the potential directions we should take. In this review I will summarize the state of the human genome sequence, the gene annotation and new developments in understanding human genome sequence variation. I will also try to indicate the inadequacies in our current knowledge of the human genome and assess the likely ways ahead.


    The human genome sequence
 
In February 2001, two versions of the human genome sequence were reported (IHGSC, 2001; Venter et al., 2001). This was the culmination of a long period of competition by rival genome sequencing groups, one from what could be broadly characterized as the public domain and the other a private company, Celera. I will not dwell on the history of this competition as it has been covered at length by others (Davies, 2001; Sulston and Ferry, 2002), but I will recommend a particularly enlightening review by Olson (2002). The two sequences were intended to be obtained by independent strategies. The public domain sequencing groups adopted a hierarchical strategy in which a map of overlapping bacterial artificial chromosome (BAC) clones was constructed based on restriction enzyme digestion patterns and then approximately 30 000 representative BACs were shotgun sequenced (IHGSC, 2001). After assembling the BAC sequences and eliminating redundancy, the sequence obtained covered ~88% of the genome. Celera initially adopted a whole genome shotgun (WGS) approach whereby they sequenced sufficient randomly chosen clones from libraries of 2, 10 and 50 kb insert size to give a 5.1-fold depth of coverage of the genome. Although they report an attempted whole genome assembly of the WGS data combined with data from the public domain project, their biological analysis was instead based on a superior assembly (called the compartmentalized shotgun assembly) which superimposed the Celera data on the BAC sequencing data from the public domain laboratories and was estimated to include >90% of the genome. So, in practice the public domain and Celera assemblies should contain the same sequences from the public domain, with the addition of the Celera whole genome shotgun data available to Celera only.
This cross-fertilization of the Celera assembly, the unavailability of a pure WGS assembly and their reliance on the combined data for their downstream analysis meant that there was really no assessment of how well a WGS assembly would work for a vertebrate genome. Therefore, the success or otherwise of the WGS assembly has been hotly disputed, and a detailed critique and rebuttal are available elsewhere (Myers et al., 2002; Waterston et al., 2002). However, from a practical point of view the criteria that matter are the relative merits of the assemblies, accessibility and the future commitment to complete the sequence. It is on these criteria that the final quality of the sequences will be judged.
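
To give a feel for what a 5.1-fold depth of coverage means, the following minimal sketch uses the idealized Lander-Waterman estimate, in which reads are assumed to fall uniformly at random on the genome. That assumption is mine, not something either sequencing group relied on in this form, and it ignores repeats, cloning bias and assembly errors, so it describes only how many bases are expected to be hit by at least one read, not how much of a genome can actually be assembled. The function names are hypothetical.

```python
import math


def depth_of_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_fraction_covered(depth: float) -> float:
    """Idealized estimate of the fraction of bases hit by at least one read.

    Assumes reads land uniformly at random (a Poisson approximation).
    """
    return 1.0 - math.exp(-depth)


# Depth quoted in the text for the whole genome shotgun data.
wgs_depth = 5.1
print(f"{expected_fraction_covered(wgs_depth):.4f}")  # prints 0.9939
```

Even under these generous assumptions a small fraction of bases is never sampled, and in a real genome the repetitive sequence makes the gap between bases sampled and bases correctly assembled much larger.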

At the time of the publications, a companion analysis compared the broad characteristics of the two assemblies and found that at a superficial level at least they were broadly similar in content, although different in the precise distribution of contig sizes and gap distributions (Aach et al., 2001). A comparison of the orders of STS markers in the two assemblies as compared with that determined by high resolution radiation hybrid mapping found that 36% of STS pairs were present in different orders between the two assemblies (Olivier et al., 2001). A later analysis in a well-studied region of chromosome 4 again showed considerable variability between various versions of the sequence (Semple et al., 2002). However, extensive comparisons have been hampered by restricted access to the Celera data.
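
One simple way to quantify disagreement in marker order, sketched below, is to count the fraction of marker pairs whose relative order differs between two assemblies. This is an illustration of the general idea only; it is not claimed to be the metric used by Olivier et al., and the marker names and the function `discordant_pair_fraction` are hypothetical.

```python
from itertools import combinations


def discordant_pair_fraction(order_a, order_b):
    """Fraction of shared marker pairs placed in opposite relative order."""
    pos_a = {marker: i for i, marker in enumerate(order_a)}
    pos_b = {marker: i for i, marker in enumerate(order_b)}
    # Only markers present in both assemblies can be compared.
    shared = [marker for marker in order_a if marker in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )
    return discordant / len(pairs)


# Hypothetical marker orders along one chromosome segment.
assembly_1 = ["sts1", "sts2", "sts3", "sts4"]
assembly_2 = ["sts1", "sts3", "sts2", "sts4"]
fraction = discordant_pair_fraction(assembly_1, assembly_2)  # 1 of 6 pairs differ
```

A contig placed in the wrong orientation inverts every pair within it, so a measure of this kind is sensitive to exactly the orientation uncertainties that draft assemblies contain.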

The public domain sequencing groups adopted a policy of rapid and unrestricted release of genome sequence data from very early on in the project (Guyer, 1998). All assemblies of genome sequence from clones were made publicly available within 24 h of the assembly and were deposited in the community DNA sequence databases, GenBank, EMBL and DDBJ. Furthermore, the assembled genome sequence and subsequent updates have been made available through a number of public servers, including NCBI (http://www.ncbi.nlm.nih.gov/), Ensembl (http://www.ensembl.org) and the UCSC genome browser (http://genome.ucsc.edu). Access to the Celera sequence is through their webserver and allows academic scientists to download up to 1 Mb per week, free of reach-through provisions or publication restrictions, but redistribution is prohibited. Access to larger stretches of sequence requires an institutional agreement. Commercial users must either sign a material transfer agreement or subscribe or seek a licence. The restriction on redistribution prevents provision of the data through the usual public repositories of DNA sequence.

To the naive user of genome sequence what might be surprising in the few comparative analyses of the two assemblies is the relatively high level of discrepancies between them and the occurrence of misassemblies when compared with other data. However, this reflects the incomplete and transitory nature of the sequences. In fact, both sequences contain tens of thousands of gaps where the sequence has not yet been determined and a similar order of magnitude of stretches of sequence whose orientation is unknown. Neither sequence could accurately be called the sequence of the human genome; both are instead initial views of what the finished sequence will contain. For this reason the public domain sequence was referred to as a ‘draft’. As such they are suitable for a first pass analysis, but require considerable work in order to declare them complete. It is imperative that this work proceeds to ensure a high quality reference sequence on which to base all future human genome research. For the public domain project this work involves further mapping work to close holes in the minimum set of BACs to be sequenced, adding more shotgun sequence data to each BAC and resolving sequence ambiguities and gaps within BACs by directed sequencing. This work has been ongoing from before the initial publication of the draft sequence and has resulted in completed sequences of chromosomes 22, 21 and 20 (Dunham et al., 1999; Hattori et al., 2000; Deloukas et al., 2001). At the time of writing 71% of the public domain sequence has now reached this completed stage and it is anticipated that the sequence will be completed in 2003. With recent changes in Celera’s business strategy it is likely that the public domain human genome sequence will not have immediate competition. Given the unrestricted access and commitment to completion, the sequence that will stand the test of utility is the one provided by the public domain.


    Genes in the human genome sequence
 
The initial analyses of the human genome sequence tended to confirm observations that had been made on smaller sections of the genome, including the finished sequences of chromosomes 21 and 22 (Dunham et al., 1999; Hattori et al., 2000), but with the additional authority of more data. The human genome shows substantial variation in the distributions of the obvious genomic features, including genes, GC content, repetitive elements and recombination rates. The genome is packed with repetitive sequences, many derived from transposable elements, accounting for about half the total sequence. Furthermore, there appear to have been numerous segmental duplications of regions of the genome, particularly around the centromeres and telomeres, which seem to be ongoing during primate evolution (Samonte and Eichler, 2002). In fact, the genome sequence is full of information which we have as yet merely glimpsed and which will keep the research community occupied for decades to come. However, I want to focus on the two areas that I identified in the Introduction as being immediately relevant to biomedical research, the protein coding genes and human sequence variation, and to assess the status of our knowledge in these areas.

Despite the availability of the draft genome sequence, there is still considerable uncertainty about the number of protein coding genes that are contained within it. Earlier analyses based on sampling transcript collections, comparative sequencing, CpG island analysis or sequencing of substantial genomic regions had indicated that there might be anywhere between 28 000 and 120 000 genes (Antequera and Bird, 1993; Fields et al., 1994; Dunham et al., 1999; Ewing and Green, 2000; Liang et al., 2000; Roest Crollius et al., 2000). Both the public and private domain groups predicted between 30 000 and 40 000 protein coding genes from their different sequences, although their analysis methods were slightly different. Surprisingly, an assessment of the gene sets predicted by both groups suggested that while genes based on curated transcripts from the Refseq database were largely in common between the gene sets, those based on prediction were largely non-overlapping (Hogenesch et al., 2001). To further confuse matters an independent group used largely similar methods to estimate that the genome sequence contained 65 000–70 000 transcription units (Wright et al., 2001).

Where does this leave us? It is clear that there is some way to go before we have a well-defined gene annotation of the human genome. The reasons for this are several. First, the presence of gaps and misoriented contigs within the genome sequence tends to lead to gene structures being fragmented. As the sequence is completed this problem will disappear. Second, computer programs that predict human genes from genomic sequence alone are prone to both inaccuracy and over-prediction, so that they cannot be used without support from other data (see Guigo et al., 2000, for one assessment of these programs). It remains to be seen whether these problems can be resolved, but my intuition is that development of highly reliable prediction methods may depend on having the complete gene index available for human. However, the prediction methods may well prove valuable as more genomes are sequenced. Third, much of the experimental support for gene annotations in genomic sequence comes from sequencing from cDNA libraries either as ‘full-length’ cDNAs or expressed sequence tags (ESTs). These libraries are a complex mixture of alternative splices, alternative 3'-ends, partially processed and incomplete transcripts and, hence, the annotation based on these sources tends to fragment gene structures. Data from comparison of the completed chromosome 20 sequence and its gene annotation with genomic sequence from mouse and the pufferfish, Tetraodon nigroviridis, suggest that the publicly available cDNA sequences combined with comparisons at the protein level are sufficient to identify the vast majority of human genes (Deloukas et al., 2001). Unpublished data from an extensive annotation of human chromosome 22 in my group confirm this (Collins et al., unpublished data).
It is also likely that additional sequences from other vertebrate genomes, including mouse, rat and zebrafish, will assist the annotation process, although the chromosome 20 and 22 experiences do not suggest that many additional genes will be uncovered (Deloukas et al., 2001). However, there is a difference between identifying genes and elucidating their full structures, and our studies on chromosome 22 suggest that there is still a requirement for experimental validation of genes and a continuous process of updating annotation if we are to have a complete gene annotation of the genome. For future studies, whether using microarrays, proteomics or genetics, I believe it is essential that this requirement is met and it is now the time to develop large-scale annotation programmes.

It is also worth considering the accessibility of the gene annotations. There are several databases and browsers providing access to human genome gene annotations, including the Ensembl project, NCBI and the UCSC genome browser. Exposure to each of these systems unearths their strengths and weaknesses and at the current time it is necessary to experiment to see which suits your needs best. However, as the genome sequence and the annotation become more stable, I anticipate that there will be considerable improvements in the utility of these browsers.


    Variations on the sequence theme
 
Application of the genomic sequence to genetic disease requires the availability of variants within the sequence that are polymorphic within human population samples. Restriction fragment length polymorphisms and mini- and microsatellite sequences have all been used to understand genetic contributions to disease, particularly by genetic linkage analysis in the relatively rare diseases which display Mendelian segregation in families (Collins, 1995). However, for the more common diseases with a genetic component, such as diabetes or cardiovascular disease, it is widely accepted that approaches based on association are required (Risch and Merikangas, 1996) and for genome-wide association studies high densities of polymorphisms will be necessary (Kruglyak, 1999; Cardon and Bell, 2001). Fortunately, single nucleotide polymorphisms (SNPs) are ubiquitous within the genome, occurring on average every 1–2 kb when any two human chromosomes are compared (Li and Sadler, 1991; Taillon-Miller et al., 1998; Cargill et al., 1999; Halushka et al., 1999; Marth et al., 1999).

Two major approaches have been followed to generate large collections of SNPs across the public domain human genome sequence. The first takes advantage of the origin of the clones used to determine the genome sequence. Most of the clones used were derived from a single BAC library which was prepared from DNA extracted from blood of a single anonymous male. Even within this library there are, of course, two copies of the genome which are as different from each other as any other two genomes. Therefore, whenever two BAC sequences overlap with each other but are derived from different chromosome homologues there will be sequence differences, and this can be exploited to extract SNPs (Horton et al., 1998; Taillon-Miller et al., 1998; Marth et al., 1999; Dawson et al., 2001). In fact, the range of libraries used, particularly for some parts of the sequence which were completed early, included DNA from other individuals as well (see for instance Dawson et al., 2001). The second approach was pioneered by the groups involved in the SNP consortium (TSC) and essentially involved resequencing specific DNA restriction fragments from a mix of individuals selected for polymorphism discovery (Altshuler et al., 2000; Mullikin et al., 2000). As more genome sequence was accumulated these SNPs could then be placed back onto the genome. Concurrent with the genome sequence publication a set of 1 400 000 non-redundant SNPs from these two sources plus additional SNPs in public databases was mapped onto the genome (Sachidanandam et al., 2001). On average this gives ~1 SNP every 1.9 kb and it was estimated that 85% of exons will be within 5 kb of a SNP. The spectrum of nucleotide changes observed in all the SNP studies showed that transitions account for 70% of all substitutions and mutation of methyl-CpG to TpG accounts for 25% of all SNPs, as has been seen in surveys of disease mutations (Krawczak et al., 1998).
Analysis of the levels of nucleotide diversity across the genome showed that there is great variation, but that this is consistent with the standard coalescent population genetic model of human history. Other recent analyses have indicated that nucleotide diversity is also correlated with recombination rates and natural selection (Nachman, 2001).
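
The first approach described above, extracting candidate SNPs from overlapping clone sequences, can be illustrated with a minimal sketch. It assumes the two overlap sequences have already been aligned without gaps and are of equal length, which real pipelines cannot assume; real methods also weigh base quality scores to separate true polymorphisms from sequencing errors. The sequences and function names here are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def candidate_snps(seq_a, seq_b):
    """Positions where two pre-aligned, ungapped sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    differences = []
    for position, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        # Skip ambiguous base calls such as N.
        if a != b and a in "ACGT" and b in "ACGT":
            differences.append((position, a, b))
    return differences


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


# Hypothetical overlap between two clones from different homologues.
overlap_a = "ACGTTCGATCCGA"
overlap_b = "ACGTTTGATCAGA"
snps = candidate_snps(overlap_a, overlap_b)
# snps == [(5, 'C', 'T'), (10, 'C', 'A')]
transitions = [s for s in snps if is_transition(s[1], s[2])]
# Only the C/T difference at position 5 is a transition.
```

Classifying each difference in this way is what allows a survey to report the share of substitutions that are transitions.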

With so many sequence variants, the number of possible human genome sequences could be astronomically large. Surprisingly, recent results have shown that, at least in localized regions, there may only be a limited repertoire of common sequences. This may also enhance the prospects of mapping common disease genes by genome-wide association mapping. The non-random association of alleles at different loci is termed linkage disequilibrium (LD), which can be quantified in a particular data set by a variety of measures (Devlin and Risch, 1995). At the molecular level, the fact that specific sets of sequence variants occur together preferentially underlies this association. These combinations of variants are called haplotypes. Knowledge of the occurrence of one variant in a population provides information about the occurrence of another variant if they are in LD. To effectively use genome-wide association strategies to map disease genes requires some knowledge of how LD and haplotypes occur in the genome. There has been considerable theoretical and experimental work devoted to this subject (reviewed in Ardlie et al., 2002), but until recently only small regions of the genome could be sampled experimentally. For instance, Reich et al. (2001) sampled 19 different 160 kb regions of the genome and found highly variable patterns of LD. However, detailed examination of a single 500 kb region on chromosome 5q31 indicated that in fact the patterns of LD are highly structured, with stretches of high LD interrupted by short segments without LD (Daly et al., 2001). A much larger study of 20 copies of human chromosome 21 from ethnically diverse individuals using high density oligonucleotide arrays again indicated limited haplotype diversity and suggested that >80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., 2001).
Our own analysis of LD on chromosome 22 found regions of high LD that could extend over many hundreds of kilobases, including contiguous regions of low haplotype diversity extending up to 800 kb (Dawson et al., 2002). These observations of highly organized haplotype structures could be explained by limited diversity haplotype blocks separated by irregularly spaced recombination hotspots. This model is given support by high resolution analysis of recombination hotspots and LD within the human MHC class II region, where the boundaries of LD correspond precisely with 1–2 kb hotspots (Jeffreys et al., 2001). Furthermore, patterns of LD on chromosome 22 correlate strongly with genetic map distance (Dawson et al., 2002) and haplotype block boundaries appear to be highly correlated across human populations (Gabriel et al., 2002).
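
For readers unfamiliar with how LD is quantified, the sketch below computes three commonly used pairwise measures (D, |D'| and r²) for two biallelic loci from haplotype and allele frequencies. The formulas are the standard textbook definitions; which measures the cited studies actually used is not stated here, and the function name `ld_measures` and the example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the excess of the AB haplotype over its expectation under independence.
    d = p_ab - p_a * p_b
    # The largest |D| possible given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared


# Two variants that always occur together: complete LD on both measures.
d, d_prime, r2 = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)
print(f"{d_prime:.2f} {r2:.2f}")  # prints 1.00 1.00

# Unequal allele frequencies: |D'| is still 1 but r^2 is lower.
d, d_prime, r2 = ld_measures(p_ab=0.4, p_a=0.4, p_b=0.6)
print(f"{d_prime:.2f} {r2:.2f}")  # prints 1.00 0.44
```

The second example shows why the choice of measure matters: |D'| reports that no recombinant haplotype has been observed, whereas r² reports how well one variant predicts the other, which is the property that association mapping depends on.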

These data leave us with a picture of the human genome sequence in different individuals as consisting of blocks of limited haplotype diversity (and therefore limited sequence possibilities within the blocks) separated by short sequences of preferential recombination. Once the organization of these blocks is established, a relatively small number of SNPs could be used to uniquely tag each possible haplotype within a block; these are called haplotype tagging SNPs (Johnson et al., 2001). Typing a set of such SNPs across the genome effectively serves as a surrogate for determining the genome sequence, until such time as the technology and cost of resequencing become a practical proposition, and should allow genome-wide association studies using high throughput genotyping.
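
The idea of tagging haplotypes with a small number of SNPs can be made concrete with a toy example. The sketch below searches exhaustively for the smallest set of SNP positions that distinguishes every haplotype in a block. It is an illustration of the concept only and not the procedure of Johnson et al.; exhaustive search is practical only for small blocks, and the haplotypes and the function name `tag_snps` are hypothetical.

```python
from itertools import combinations


def tag_snps(haplotypes):
    """Smallest set of SNP positions whose alleles distinguish all haplotypes."""
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNP positions")
    if len(set(haplotypes)) != len(haplotypes):
        raise ValueError("haplotypes must be distinct")
    # Try subsets in order of increasing size so the first hit is minimal.
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    return tuple(range(n_sites))


# Three hypothetical common haplotypes across six SNPs in one block.
block = ["AACGTA", "AATGCA", "GACGTC"]
tags = tag_snps(block)  # (0, 2): typing two SNPs identifies each haplotype
```

In this toy block, typing two of the six SNPs recovers the same information as typing all six, which is the economy that makes genome-wide genotyping by tagging attractive.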


    Conclusions
 
Determination of the human genome sequence was a massive technological and organizational achievement and the availability of the finished reference sequence will serve as the basis for the next century of human genetics. However, major challenges still lie ahead in order to maximize the value of the sequence. High quality annotation of the sequence is a prerequisite for use in most functional genomics experiments and it is imperative to push ahead to define a definitive gene index for man. Other functional annotation of the sequence, including promoters, control elements, chromatin structures and epigenetic modification sites, as well as annotation of protein function, will be layered on top as they are elucidated. On the genetics side we are on the verge of another major programme to describe haplotype structures within human populations and, hence, to describe the rich variety of human genome sequences. The challenge then is to apply this knowledge to the study of human disease.


    Acknowledgments
 
Thanks are due to Drs Charlotte Cole and Don Powell for comments on the manuscript. The author is supported by the Wellcome Trust.


    Notes
 
1 Email: id1@sanger.ac.uk


    References
 

    Aach,J., Bulyk,M.L., Church,G.M., Comander,J., Derti,A. and Shendure,J. (2001) Computational comparison of two draft sequences of the human genome. Nature, 409, 856–859.

    Altshuler,D., Pollara,V.J., Cowles,C.R., Van Etten,W.J., Baldwin,J., Linton,L. and Lander,E.S. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407, 513–516.

    Antequera,F. and Bird,A. (1993) Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA, 90, 11995–11999.

    Ardlie,K.G., Kruglyak,L. and Seielstad,M. (2002) Patterns of linkage disequilibrium in the human genome. Nature Rev. Genet., 3, 299–309.

    Cardon,L.R. and Bell,J.I. (2001) Association study designs for complex diseases. Nature Rev. Genet., 2, 91–99.

    Cargill,M., Altshuler,D., Ireland,J. et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet., 22, 231–238.

    Collins,F.S. (1995) Positional cloning moves from perditional to traditional. Nature Genet., 9, 347–350.

    Daly,M.J., Rioux,J.D., Schaffner,S.F., Hudson,T.J. and Lander,E.S. (2001) High-resolution haplotype structure in the human genome. Nature Genet., 29, 229–232.

    Davies,K. (2001) Cracking the Genome: Inside the Race to Unlock Human DNA. Simon & Schuster, New York, NY.

    Dawson,E., Chen,Y., Hunt,S. et al. (2001) A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res., 11, 170–178.

    Dawson,E., Abecasis,G.R., Bumpstead,S. et al. (2002) A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418, 544–548.

    Deloukas,P., Matthews,L.H., Ashurst,J. et al. (2001) The DNA sequence and comparative analysis of human chromosome 20. Nature, 414, 865–871.[Medline]

    Devlin,B. and Risch,N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29, 311–322.[Web of Science][Medline]

    Dunham,I., Hunt,A.R., Collins,J.E. et al. (1999) The DNA sequence of human chromosome 22 [see comments]. Nature, 402, 489–495. [Erratum (2000) Nature, 404 (6780), 904.]

    Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes [see comments]. Nature Genet., 25, 232–234.[Web of Science][Medline]

    Fields,C., Adams,M.D., White,O. and Venter,J.C. (1994) How many genes in the human genome? [news] [see comments]. Nature Genet., 7, 345–346.[Web of Science][Medline]

    Gabriel,S.B., Schaffner,S.F., Nguyen,H. et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 2225–2229.[Abstract/Free Full Text]

    Guigo,R., Agarwal,P., Abril,J.F., Burset,M. and Fickett,J.W. (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Res., 10, 1631–1642.[Abstract/Free Full Text]

    Guyer,M. (1998) Statement on the rapid release of genomic DNA sequence. Genome Res., 8, 413.

    Halushka,M.K., Fan,J.B., Bentley,K., Hsie,L., Shen,N., Weder,A., Cooper,R., Lipshutz,R. and Chakravarti,A. (1999) Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nature Genet., 22, 239–247.

    Hattori,M., Fujiyama,A., Taylor,T.D. et al. (2000) The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature, 405, 311–319.

    Hogenesch,J.B., Ching,K.A., Batalov,S., Su,A.I., Walker,J.R., Zhou,Y., Kay,S.A., Schultz,P.G. and Cooke,M.P. (2001) A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell, 106, 413–415.

    Horton,R., Niblett,D., Milne,S., Palmer,S., Tubby,B., Trowsdale,J. and Beck,S. (1998) Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol., 282, 71–97.

    IHGSC. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    Jeffreys,A.J., Kauppi,L. and Neumann,R. (2001) Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Genet., 29, 217–222.

    Johnson,G.C., Esposito,L., Barratt,B.J. et al. (2001) Haplotype tagging for the identification of common disease genes. Nature Genet., 29, 233–237.

    Krawczak,M., Ball,E.V. and Cooper,D.N. (1998) Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Hum. Genet., 63, 474–488.

    Kruglyak,L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genet., 22, 139–144.

    Li,W.H. and Sadler,L.A. (1991) Low nucleotide diversity in man. Genetics, 129, 513–523.

    Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and Quackenbush,J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nature Genet., 25, 239–240.

    Marth,G.T., Korf,I., Yandell,M.D., Yeh,R.T., Gu,Z., Zakeri,H., Stitziel,N.O., Hillier,L., Kwok,P.Y. and Gish,W.R. (1999) A general approach to single-nucleotide polymorphism discovery. Nature Genet., 23, 452–456.

    Mullikin,J.C., Hunt,S.E., Cole,C.G. et al. (2000) An SNP map of human chromosome 22. Nature, 407, 516–520.

    Myers,E.W., Sutton,G.G., Smith,H.O., Adams,M.D. and Venter,J.C. (2002) On the sequencing and assembly of the human genome. Proc. Natl Acad. Sci. USA, 99, 4145–4146.

    Nachman,M.W. (2001) Single nucleotide polymorphisms and recombination rate in humans. Trends Genet., 17, 481–485.

    Olivier,M., Aggarwal,A., Allen,J. et al. (2001) A high-resolution radiation hybrid map of the human genome draft sequence. Science, 291, 1298–1302.

    Olson,M.V. (2002) The Human Genome Project: a player’s perspective. J. Mol. Biol., 319, 931–942.

    Patil,N., Berno,A.J., Hinds,D.A. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723.

    Reich,D.E., Cargill,M., Bolk,S. et al. (2001) Linkage disequilibrium in the human genome. Nature, 411, 199–204.

    Risch,N. and Merikangas,K. (1996) The future of genetic studies of complex human diseases. Science, 273, 1516–1517.

    Roest Crollius,H., Jaillon,O., Bernot,A. et al. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet., 25, 235–238.

    Sachidanandam,R., Weissman,D., Schmidt,S.C. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.

    Samonte,R.V. and Eichler,E.E. (2002) Segmental duplications and the evolution of the primate genome. Nature Rev. Genet., 3, 65–72.

    Semple,C.A., Morris,S.W., Porteous,D.J. and Evans,K.L. (2002) Computational comparison of human genomic sequence assemblies for a region of chromosome 4. Genome Res., 12, 424–429.

    Sulston,J. and Ferry,G. (2002) The Common Thread: A Story of Science, Politics, Ethics and the Human Genome. Bantam Press, London, UK.

    Taillon-Miller,P., Gu,Z., Li,Q., Hillier,L. and Kwok,P.Y. (1998) Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res., 8, 748–754.

    Venter,J.C., Adams,M.D., Myers,E.W. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

    Wade,N. (2002) Scientist reveals genome secret: it’s his. New York Times, April 27, 2002, Section A, p. 1.

    Waterston,R.H., Lander,E.S. and Sulston,J.E. (2002) On the sequencing of the human genome. Proc. Natl Acad. Sci. USA, 99, 3712–3716.

    Wright,F.A., Lemon,W.J., Zhao,W.D. et al. (2001) A draft annotation and overview of the human genome. Genome Biol., 2, RESEARCH0025.

Received on June 28, 2002; accepted on July 12, 2002.

