CpG islands (CGIs) are often considered gene markers, but the number of CGIs varies among mammalian genomes that have similar numbers of genes. In this study, we investigated the distribution of CGIs in the promoter regions of 3,197 human-mouse orthologous gene pairs and found that the mouse genome has notably fewer CGIs in the promoter regions and less pronounced CGI characteristics than does the human genome. We further inferred the ancestral state of CGIs using the dog genome as a reference and examined the nucleotide substitution pattern and the mutational direction in the conserved regions of human and mouse CGIs. The results reveal many losses of CGIs in both genomes, but the loss rate in the mouse lineage is two to four times the rate in the human lineage. We found an intriguing feature of CGI loss, namely that the loss of a CGI usually starts with erosion at both edges and gradually moves towards the center. We found functional bias in the genes that have lost promoter-associated CGIs in the human or mouse lineage. Finally, our analysis indicates that the association of CGIs with housekeeping genes is not as strong as previously estimated. Our study provides a detailed view of the evolution of promoter-associated CGIs in the human and mouse genomes, and our findings are helpful for understanding the evolution of mammalian genomes and the role of CGIs in gene function.
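The "CGI characteristics" mentioned above are usually summarized by GC content and the observed/expected CpG ratio. The following is a minimal sketch of those two measures, assuming the commonly cited thresholds (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6). The abstract does not state which criteria the study used, so the function names and default thresholds here are illustrative assumptions.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every CpG dinucleotide.
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cgi(seq: str, min_len: int = 200,
                   min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Apply the assumed thresholds; they are not taken from the abstract."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_oe)
```

Under this sketch, erosion at the edges of a CGI would show up as a drop in both measures when they are computed in windows near the island boundaries.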
Recently, Fryxell and Moon (2005) examined methylation-dependent transition rates (5mC deamination rates), which were calculated as the difference between the CpG transition and GpC transition rates, using 4,437 transition mutations in CpG or GpC dinucleotides. They concluded that 5mC deamination rates were highly dependent on local GC content but not on the local sequence lengths over which GC content was calculated or on the genomic regions where the mutations occurred. Here, we reexamined these statements by using 292,216 CpG → TpG/CpA and GpC → GpT/ApC mutations, about 66 times as much data. Contrary to Fryxell and Moon's conclusions, our analysis indicated that 5mC deamination rates in the human genome were dependent on both the local sequence length and the genomic region. We also provide possible explanations for their conclusions.
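The rate definition above (CpG transition rate minus GpC transition rate) can be sketched as follows. The per-site normalization and the function names are assumptions made for illustration; the abstract only states that the rate is the difference between the two transition rates.

```python
def transition_rate(n_transitions: int, n_sites: int) -> float:
    """Transitions per dinucleotide site (normalization is an assumption)."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return n_transitions / n_sites


def methylation_dependent_rate(cpg_transitions: int, cpg_sites: int,
                               gpc_transitions: int, gpc_sites: int) -> float:
    """CpG transition rate minus GpC transition rate.

    GpC sites serve as a control with the same base composition as CpG,
    so the difference is attributed to methylation-dependent deamination.
    """
    return (transition_rate(cpg_transitions, cpg_sites)
            - transition_rate(gpc_transitions, gpc_sites))
```

For example, with made-up counts of 50 transitions at 1,000 CpG sites and 10 transitions at 1,000 GpC sites, `methylation_dependent_rate(50, 1000, 10, 1000)` gives roughly 0.04. Repeating the calculation for flanking windows of different lengths, or per genomic region, is one way to probe the dependence discussed above.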
Background: The local environment of single nucleotide polymorphisms (SNPs) contains abundant genetic information for the study of mechanisms of mutation, genome evolution, and causes of diseases. Recent studies revealed that neighboring-nucleotide biases on SNPs were strong and that the genome-wide bias patterns could be represented by a small subset of the total SNPs. However, estimating the effective SNP size, the number of SNPs sufficient to represent the bias patterns observed in the whole SNP data, remains an unsolved problem. Results: To estimate the effective SNP size, we developed a novel statistical method, SNPKS, which considers both statistical and biological significance. SNPKS consists of two major steps: obtaining an initial effective size by the Kolmogorov-Smirnov test (KS test) and finding an intermediate effective size by interval evaluation. The SNPKS algorithm was implemented in computer programs and applied to real SNP data. The effective SNP size was estimated to be 38,200, 39,300, 38,000, and 38,700 in the human, chimpanzee, dog, and mouse genomes, respectively, and 39,100, 39,600, 39,200, and 42,200 in human intergenic, genic, intronic, and CpG island regions, respectively. Conclusion: SNPKS is the first statistical method to estimate the effective SNP size. It runs efficiently and greatly outperforms the algorithm implemented in SNPNB. The application of SNPKS to real SNP data revealed a similarly small effective SNP size (38,000-42,200) in the human, chimpanzee, dog, and mouse genomes as well as in human genomic regions. The findings suggest a strong influence of genetic factors across vertebrate genomes.
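To illustrate the kind of comparison a KS test makes, here is a generic two-sample Kolmogorov-Smirnov statistic with the standard large-sample critical value. It is not the SNPKS algorithm itself: the abstract does not describe how bias patterns are turned into samples or how the interval evaluation works, so the function names and inputs are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The supremum is attained at one of the observed values.
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Asymptotic two-sample critical value c(alpha) * sqrt((n + m) / (n * m))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def distributions_differ(sample_a, sample_b, alpha: float = 0.05) -> bool:
    """True if the KS statistic exceeds the asymptotic critical value."""
    d = ks_statistic(sample_a, sample_b)
    return d > ks_critical_value(len(sample_a), len(sample_b), alpha)
```

One plausible use, under these assumptions, is to compare a statistic computed on a SNP subset against the same statistic on the full data and look for the smallest subset size at which the two no longer differ.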
Sequence characterization of the genomic region of sorghum yellow seed1 shows the presence of two genes that are arranged in a head-to-tail orientation. The two duplicated gene copies, y1 and y2, are separated by a 9.084 kbp intergenic region, which is largely composed of highly repetitive sequences. y1 is the functional copy, while y2 may represent a pseudogene; there are several sequence indels and rearrangements within the putative coding region of y2. The y1 gene encodes an R2R3 type of Myb domain protein that regulates the expression of the chalcone synthase, chalcone isomerase and dihydroflavonol reductase genes required for the biosynthesis of 3-deoxyflavonoids. Expression of y1 can be observed throughout the plant, and it represents a combination of the expression patterns produced by different alleles of maize p1. Comparative sequence analysis within the coding regions and flanking sequences of y1, y2 and their maize and teosinte orthologs shows local rearrangements and insertions that may have created modified regulatory regions. These micro-colinearity modifications are possibly responsible for the differential patterns of expression in maize and sorghum floral and vegetative tissues. Phylogenetic analysis indicates that the sorghum y1 and y2 sequences may have arisen by gene duplication mechanisms and represent an evolutionarily parallel event to the duplication of the maize p2 and p1 genes.
Background: The pattern of point mutation is important for studying mutational mechanisms, genome evolution, and diseases. Previous studies of mutation direction were largely based on substitution data from a limited number of loci. To date, there is no genome-wide analysis of mutation direction or methylation-dependent transition rates in the chimpanzee or its categorized genomic regions. Results: In this study, we performed a detailed examination of mutation direction in the chimpanzee genome and its categorized genomic regions using 588,918 SNPs whose ancestral alleles could be inferred by mapping them to human genome sequences. The C → T (G → A) changes occurred most frequently in the chimpanzee genome. Each type of transition occurred approximately four times more frequently than each type of transversion. Notably, the frequency of C → T (G → A) was the highest in exons among the genomic categories regardless of whether we calculated it directly, normalized it by the nucleotide content, or removed the SNPs involved in the CpG effect. Moreover, the directionality of point mutation in exons and CpG islands was opposite to that in their corresponding intergenic regions, indicating that different forces govern the nucleotide changes. Our analysis suggests that the GC content is not in equilibrium in the chimpanzee genome. Further quantitative analysis revealed that the 5-methylcytosine deamination rates at CpG sites were highly dependent on the local GC content and the lengths of SNP flanking sequences and varied among categorized genomic regions. Conclusion: We present the first mutational spectrum, estimated by three different approaches, in the chimpanzee genome. Our results provide detailed information on recent nucleotide changes and methylation-dependent transition rates in the chimpanzee genome after its split from the human. These results have important implications for understanding genome composition evolution, mechanisms of point mutation, and other genetic factors such as selection, biased codon usage, biased gene conversion, and recombination.
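Mutation direction, as discussed above, is tallied from pairs of ancestral and derived alleles, with complementary changes reported together (for example, G → A counted with C → T). The sketch below shows one way to do that bookkeeping. The input format (a list of `(ancestral, derived)` tuples) and the function names are assumptions for illustration, and the sketch performs no normalization by nucleotide content.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def is_transition(anc: str, der: str) -> bool:
    """Purine<->purine or pyrimidine<->pyrimidine changes are transitions."""
    return anc != der and ((anc in PURINES) == (der in PURINES))


def strand_symmetric_class(anc: str, der: str) -> str:
    """Collapse complementary changes so the ancestral base is C or T."""
    if anc not in COMPLEMENT or der not in COMPLEMENT or anc == der:
        raise ValueError(f"not a valid substitution: {anc}>{der}")
    if anc in PURINES:
        anc, der = COMPLEMENT[anc], COMPLEMENT[der]
    return f"{anc}>{der}"


def mutation_spectrum(snps) -> dict:
    """Relative frequency of each of the six strand-symmetric classes."""
    counts = Counter(strand_symmetric_class(a, d) for a, d in snps)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}
```

For instance, `mutation_spectrum([("C", "T"), ("G", "A"), ("A", "C"), ("T", "A")])` assigns half of the changes to the `C>T` class, because G → A is folded into it.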
So far, there is no genome-wide estimation of the mutational spectrum in humans. In this study, we systematically examined the directionality of point mutations and the maintenance of GC content in the human genome using approximately 1.8 million high-quality human single nucleotide polymorphisms and their ancestral sequences in chimpanzees. The frequency of C → T (G → A) changes was the highest among all mutation types, and the frequency of each type of transition was approximately fourfold that of each type of transversion. In intergenic regions, when the GC content increased, the frequency of changes from G or C increased. In exons, the frequency of G:C → A:T was the highest among the genomic categories and was contributed mainly by the frequent mutations at the CpG sites. In contrast, mutations at the CpG sites, or CpG → TpG/CpA mutations, occurred less frequently in the CpG islands relative to intergenic regions with similar GC content. Our results suggest that the GC content is overall not in equilibrium in the human genome, with a trend toward shifting the human genome to be AT rich and shifting the GC content of a region to approach the genome average. Our results, which differ from previous estimates based on limited loci or on the rodent lineage, provide the first representative and reliable mutational spectrum in the recent human genome and categorized genomic regions. (c) 2006 Elsevier Inc. All rights reserved.
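The equilibrium claim above can be made concrete with the standard two-state mutation model, in which the expected equilibrium GC fraction depends only on the GC → AT and AT → GC rates. This is a textbook simplification offered as a sketch. The abstract does not say that the study used this model, and the parameter names are illustrative.

```python
def equilibrium_gc(rate_gc_to_at: float, rate_at_to_gc: float) -> float:
    """Equilibrium GC fraction v / (u + v) in a two-state mutation model.

    u = rate of GC -> AT changes per GC site
    v = rate of AT -> GC changes per AT site
    """
    u, v = rate_gc_to_at, rate_at_to_gc
    if u < 0 or v < 0 or u + v == 0:
        raise ValueError("rates must be non-negative and not both zero")
    return v / (u + v)


def is_shifting_toward_at(current_gc: float, rate_gc_to_at: float,
                          rate_at_to_gc: float) -> bool:
    """True if the current GC fraction lies above the model's equilibrium."""
    return current_gc > equilibrium_gc(rate_gc_to_at, rate_at_to_gc)
```

In this model, a region whose current GC fraction exceeds `equilibrium_gc(u, v)` is expected to lose GC over time, which matches the direction of the trend described above.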
Background: Myb proteins contain a conserved DNA-binding domain composed of one to four repeat motifs (referred to as R0R1R2R3); each repeat is approximately 50 amino acids in length, with regularly spaced tryptophan residues. Although the Myb proteins comprise one of the largest families of transcription factors in plants, little is known about the functions of most Myb genes. Here we use computational techniques to classify Myb genes on the basis of sequence similarity and gene structure, and to identify possible functional relationships among subgroups of Myb genes from Arabidopsis and rice (Oryza sativa L. ssp. indica). Results: This study analyzed 130 Myb genes from Arabidopsis and 85 from rice. The collected Myb proteins were clustered into subgroups based on sequence similarity and phylogeny. Interestingly, the exon-intron structure differed between subgroups, but was conserved in the same subgroup. Moreover, the Myb domains contained a significant excess of phase 1 and 2 introns, as well as an excess of nonsymmetric exons. Conserved motifs were detected in carboxy-terminal coding regions of Myb genes within subgroups. In contrast, no common regulatory motifs were identified in the noncoding regions. Additionally, some Myb genes with similar functions were clustered in the same subgroups. Conclusions: The distribution of introns in the phylogenetic tree suggests that Myb domains originally were compact in size; introns were inserted and the splicing sites conserved during evolution. Conserved motifs identified in the carboxy-terminal regions are specific for Myb genes, and the identified Myb gene subgroups may reflect functional conservation.
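Intron phase and exon symmetry, mentioned in the results above, follow directly from coding-exon lengths: an intron's phase is the number of coding nucleotides before it modulo 3, and an internal exon is symmetric when its two flanking introns share a phase. The sketch below assumes the input lists the coding portion of each exon in transcript order, starting at the start codon; the function names are illustrative.

```python
def intron_phases(cds_exon_lengths):
    """Phase (0, 1 or 2) of each intron, in transcript order."""
    phases = []
    coding_so_far = 0
    # There is one intron after every exon except the last.
    for length in cds_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


def internal_exon_symmetry(cds_exon_lengths):
    """For each internal exon, True if both flanking introns share a phase."""
    phases = intron_phases(cds_exon_lengths)
    return [phases[i - 1] == phases[i]
            for i in range(1, len(cds_exon_lengths) - 1)]
```

For coding-exon lengths `[100, 50, 60, 30]`, the intron phases are `[1, 0, 0]`, so the first internal exon is nonsymmetric and the second is symmetric.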
Myb domain proteins contain a conserved DNA-binding domain composed of one to four conserved repeat motifs. In animals, Myb proteins are encoded by a small gene family and commonly contain three repeat motifs (R1R2R3), whereas plant Myb proteins are encoded by a very large and diverse gene family in which a motif containing two repeats (R2R3) is the most common. In contrast to the conservation in the Myb domain, other regions of Myb proteins are highly variable. To explore the evolutionary origin of Myb genes, we cloned and sequenced Myb domains from maize and sorghum, and conducted a comprehensive phylogenetic analysis of Myb genes. The results indicate that the origins of individual Myb repeats are strikingly distinct, and that the R2 repeat has evolved more slowly than the R1 and R3 repeats. However, it is not clear which repeat is the most ancient one. The evidence also suggests that R2R3 and R1R2R3 Myb genes co-existed in eukaryotes before the divergence of plants and animals. Based on our results, we propose that R1R2R3 Myb genes were derived from R2R3 Myb genes by gain of the R1 repeat through an ancient intragenic duplication; this gain model is more parsimonious than the previous proposal that R2R3 Myb genes were derived from R1R2R3 Mybs by loss of the R1 repeat. A separate group of diverse non-typical Myb proteins exhibits a polyphyletic origin and a complex evolutionary pattern. Finally, a small group of ancient Myb paralogs that predate the amplification of current Myb genes is identified. Together, these results support a new model for the ordered evolution of the Myb gene family. (C) 2003 Elsevier B.V. All rights reserved.
Two 1R chromosomes of Secale cereale L. were isolated from one metaphase cell by means of chromosome micro-isolation, and the chromosomal DNA was amplified using the cohesive adapters single primer polymerase chain reaction (CASP-PCR) technique. The CASP-PCR products were labeled as probes. The results of Southern blot hybridization confirmed that the CASP-PCR products derived from chromosome 1R were homologous with the genomic DNA of S. cereale. Clones of the PCR products were obtained with high efficiency. Over 10 000 recombinant clones were obtained from one-tenth of the ligation mixture, which was transformed into competent E. coli DH5α cells. The size of the inserted fragments of the clones ranged from 250 bp to 500 bp. This research has established the foundation for further selection of chromosome 1R markers.