Codon Usage Bias Tutorial

Analysis of sequence data can provide valuable insight into the molecular makeup and evolutionary history of a species. Genomic databases have proven a fertile research ground for finding new genes (Zhang, 2002), identifying the ancestral origins of genes and other sequences, and sorting out the phylogenetic relationships among species (Swofford, 2000; Sokal and Michener, 1958). Recent research has suggested that one potentially valuable source for information regarding coding regions lies in the inherent degeneracy of the genetic code. Since most amino acids are represented by more than one codon (triplet of nucleotides; see Appendix A for a cross-reference chart of codons to amino acids), preferential usage of specific codons within genes can provide insight into the origins of genes and even their role within the cell. It is well established that organisms preferentially utilize one or more of the available synonymous codons for each amino acid (Grantham et al., 1980, 1981; Ikemura, 1981a; Sharp et al., 1988). This codon usage bias has been used to help identify horizontally transferred and other recently acquired genes within species (Garcia-Vallvé et al., 2000; Chou and Zhang, 1995, 1994), and has been shown to correlate with the degree to which a gene is expressed for many prokaryotic species (Gouy and Gautier, 1982; Ikemura, 1981b; Sharp and Li, 1986) and some lower eukaryotes (Sharp et al., 1988; Carbone et al., 2003). Expressivity in turn has been shown to be related to a variety of characteristics including specific amino acid selection (Akashi and Gojobori, 2002; Heizer Jr. et al., 2006).

Selective pressure to enhance translational efficiency is thought to be one of the underlying causes of this bias, and it is this translational efficiency bias that is related to gene expressivity (Ikemura, 1981b; Sharp and Li, 1986). Since the transfer RNAs associated with each codon vary in their relative abundance, efficiencies in translation can be realized when the codon served by the most abundant tRNA is employed. This is known as the tRNA adaptation theory (Garel et al., 1970; Chavancy and Garel, 1981). An additional benefit realized when preferred codons are used is an increase in the accuracy of translation, by as much as tenfold (Precup and Parker, 1987).

There are several methods for isolating and measuring codon usage bias. Those that specifically measure the bias associated with translational efficiency generally tend to rely upon knowledge of a set of highly expressed genes. These genes will adhere more strongly to the translational efficiency bias and can be used to identify preferred codons (those codons associated with more abundant tRNA species). Following are descriptions of some methods for finding translational efficiency bias.

Methods for Isolating Translational Efficiency Bias

Frequency of usage of optimal codons (FOP)

The relationship between codon usage bias and expressivity was first documented in 1981 (Ikemura, 1981a). At that time there were only a few dozen genes sequenced for E. coli. Now there are over 4,000 (NC_000913.2). While research had already noted that a bias existed (Fiers et al., 1975; Air et al., 1976; Efstratiadis et al., 1976), it was Ikemura (1981a) who identified the underlying association with expressivity. His research began by studying the correlation of codon usage bias and tRNA abundance because it was becoming clear that the bias was "mostly attributable to the availability of transfer RNA within a cell" (Post et al., 1979; Post and Nomura, 1980; Nichols and Yanofsky, 1979; Nakamura et al., 1980; Yokota et al., 1980; Ikemura, 1981a; Ikemura et al., 1980). Ikemura (1981a) found that codon usage bias did, indeed, adhere to the tRNA adaptation theories, but he also identified another interesting trend. He wrote a second paper that same year proposing that synonymous codon usage could be used as a predictor of expression rates (Ikemura, 1981b). The synonymous codons that appear most often in highly expressed genes were termed optimal or preferred codons. Ikemura (1981a) found that there was a "tendency that the genes encoding abundant protein species selectively use the codons of major isoacceptors..." Further, he found that this choice is strictly constrained by tRNA availability.

This precipitated the identification of four rules that predict the choice of major codons. His earlier research (Ikemura, 1981a) yielded the first: thiolation of uridine in the wobble position (the third and most highly variable nucleotide of a codon) of an anticodon produces a preference for using an A-terminated codon over a G-terminated codon. Other research (Grosjean et al., 1978) provided a second rule: codons of type (A or U)-(A or U)-(pyrimidine) would support an optimal interaction strength between a codon and an anticodon when the third nucleotide is C. To these Ikemura added two new constraints. The first was: "the introduction of inosine (a nucleoside formed by the deamination of adenosine; important because the nucleotide fails to form specific pair bonds with the other bases) at the wobble position may produce a possible preference for U- and C-terminated codons over the A-terminated codon, which must lead to purine-purine wobble pairing." The second was: "synonymous codon usage is governed by the most highly available tRNA." These rules and subsequent trends led to the concept of frequency of use of optimal codons.

This frequency was found to be highly correlated with protein abundance. All the rules except tRNA availability were the same from species to species and the translational efficiency attained through tRNA abundance was presumed to be the driving force behind the correlation.

Codon Adaptation Index (CAI)

In 1987 Sharp was again involved in the development of a measure of synonymous codon usage bias, called the Codon Adaptation Index (CAI) (Sharp and Li, 1987). The measure was created to address several perceived weaknesses in the existing measures. Prior to CAI the more popular measurements were essentially binary--either the codon in question was optimal or it was not. There was no gradation. Also, it was not always possible to determine whether a codon was optimal; sometimes codons had to be excluded because their status was unclear. Finally, Sharp and Li observed that no between-species comparisons could be performed because the "proportional division of the codon table into the two categories [differed from species to species]."

An already existing measure known as a codon preference statistic addressed the first two issues (Gribskov et al., 1984). This statistic is calculated as the probability of finding a particular codon in a highly expressed gene compared to the probability of finding it in a random sequence made up of the same nucleotides. Unfortunately, the codon preference statistic can produce two very different results for genes with different amino acid compositions, even if both use only optimal codons. The codon adaptation index corrected this deficiency by including normalization. This makes interspecies comparisons possible and convenient. The process of calculating CAI requires a priori knowledge of expression rates for an organism's genes. The gene set that is most highly expressed is known as the reference set. From the sequences of these genes a table of codon usage values is built. Once again Relative Synonymous Codon Usage (RSCU) values are used (the count of a codon divided by the average count for its codon family).

A weight, or relative adaptiveness, for an associated codon is calculated by dividing its RSCU value by the RSCU of its maximal sibling. The maximal sibling will have a weight of one. The codon adaptation index of a gene is calculated by taking the geometric mean of its codon's weights.
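The RSCU, weight, and CAI calculations described above can be sketched as follows. This is a toy illustration rather than the published implementation: the `families` mapping (amino acid to its synonymous codons) and the function names are assumptions, and a real analysis would use the full 61-codon table from Appendix A.

```python
from math import prod

def rscu(counts, families):
    """RSCU sketch: a codon's count divided by the mean count of its
    synonymous family. `families` maps amino acid -> list of codons."""
    vals = {}
    for codons in families.values():
        mean = sum(counts.get(c, 0) for c in codons) / len(codons)
        for c in codons:
            vals[c] = counts.get(c, 0) / mean if mean else 0.0
    return vals

def relative_adaptiveness(rscu_vals, families):
    """Each codon's RSCU divided by the RSCU of its maximal sibling,
    so the maximal sibling receives weight 1."""
    w = {}
    for codons in families.values():
        top = max(rscu_vals[c] for c in codons)
        for c in codons:
            w[c] = rscu_vals[c] / top if top else 0.0
    return w

def cai(gene_codons, weights):
    """CAI of a gene: geometric mean of the weights of its codons."""
    ws = [weights[c] for c in gene_codons if c in weights]
    return prod(ws) ** (1 / len(ws))
```

For example, if a reference set uses TTT nine times and TTC once, TTT receives weight 1 and TTC weight 1/9, and a gene's CAI is the geometric mean of the weights of the codons it actually uses.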

Methods for Simply Measuring the Degree of Bias

Scaled Chi Squared

In 1988 a Scaled Chi Squared measure was introduced as a measure of codon bias (Shields et al., 1988; Shields and Sharp, 1987). Sharp was involved, as with the clustering and CAI methods (Sharp and Li, 1987; Sharp et al., 1986), so there were similarities in the methods employed (e.g. RSCU was used along with the Chi Squared metric, though this time it was scaled). Drosophila melanogaster (the fruit fly) was the target genome. D. melanogaster is eukaryotic (vs. the single-celled prokaryotic organisms commonly studied) and is, therefore, much more complex. In this case, clustering was inappropriate since the within-species variation was continuous rather than discrete.

A silent site was defined as a synonymously variable position within a codon, and their research uncovered evidence of bias in selecting nucleotides at these positions. A Chi Squared calculation was performed that examined deviation of codon usage from expected values. "Since these values are generally highly correlated with gene length, they were then scaled with division by the number of codons in the gene (excluding Trp and Met codons, which do not contribute to chi)."
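A minimal sketch of such a scaled calculation is shown below, under the simplifying assumption that expected counts come from uniform usage within each synonymous family; the function name and `families` table are illustrative, not the published implementation.

```python
from collections import Counter

def scaled_chi_squared(gene_codons, families):
    """Chi-squared deviation of observed codon usage from uniform
    synonymous usage, scaled by the number of informative codons in
    the gene. Single-codon families (Met, Trp) are skipped because
    they cannot deviate and so contribute nothing to chi."""
    counts = Counter(gene_codons)
    chi = 0.0
    informative = 0
    for codons in families.values():
        if len(codons) == 1:
            continue  # Met, Trp: no synonymous choice
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        informative += total
        expected = total / len(codons)
        chi += sum((counts[c] - expected) ** 2 / expected for c in codons)
    return chi / informative if informative else 0.0
```

A gene that always picks the same codon from a two-codon family scores high; one that uses the family's codons evenly scores zero.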

Effective Number of Codons (Nc)

The effective number of codons for a gene is a measure of how biased a gene is in favor of a subset of codons (Wright, 1990). It was developed in 1990 as a means of determining codon usage bias with sequence information only. No a priori knowledge of tRNA concentrations or expressivity was required. There are 61 codons that can code for the 20 amino acids. The index is designed such that uniform usage of codons yields an effective number of codons of 61. If some codons are used more than others the number of effective codons begins to decline. If, for each amino acid, a single codon is used to the exclusion of its synonymous codons, an effective codon number of 20 can be attained.
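Wright's index can be sketched as below. The homozygosity estimator and the family-count constants (9 two-fold, 1 three-fold, 5 four-fold, and 3 six-fold families, plus single-codon Met and Trp) assume the standard genetic code; the `families` argument and function name are illustrative, and the published method handles missing or sparse families more carefully.

```python
from collections import Counter

def effective_number_of_codons(gene_codons, families):
    """Sketch of Wright's Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where
    Fk is the mean codon "homozygosity" over amino acids with k
    synonymous codons. Constants assume the standard genetic code."""
    counts = Counter(gene_codons)
    homozygosity = {2: [], 3: [], 4: [], 6: []}
    for codons in families.values():
        k = len(codons)
        if k == 1:
            continue  # Met, Trp contribute the constant 2 below
        n = sum(counts[c] for c in codons)
        if n <= 1:
            continue  # too few observations to estimate homozygosity
        s = sum((counts[c] / n) ** 2 for c in codons)
        homozygosity[k].append((n * s - 1) / (n - 1))
    fbar = {k: sum(v) / len(v) for k, v in homozygosity.items() if v}
    nc = 2.0  # the two single-codon amino acids
    for k, n_families in ((2, 9), (3, 1), (4, 5), (6, 3)):
        if fbar.get(k, 0) > 0:
            nc += n_families / fbar[k]
    return min(nc, 61.0)
```

Uniform usage within every family drives each Fk toward its minimum and Nc toward 61; exclusive use of one codon per amino acid drives each Fk to 1 and Nc toward 20.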

Automated Methods

Several researchers have attempted to automate the isolation of the dominant bias within an organism's genome. In these methods, the bias discovered may or may not be associated with translational efficiency; it may instead reflect other biases such as GC-content or strand bias.

Major Codon Usage (MCU)

In 1996 Kanaya et al. used principal components analysis to identify major codons (Kanaya et al., 1996, 1999). A codon frequency, or RSCU (Equation 2.3), matrix is generated whose rows represent genes and whose columns represent the codon frequencies in each gene (Figure 2.1). Principal components analysis (Hotelling, 1933; Jolliffe, 1986) (Appendix B) is performed on this matrix to derive the axis of greatest variance within this data (the first principal component). The RSCU matrix will be represented as X.

The intuition behind this approach is as follows: assume first that the primary force explaining the variance in a genome's codon usage is translational efficiency bias (i.e. highly expressed genes show a high preference for major codons while weakly expressed genes do not). It follows that highly expressed genes should be found at one end of the axis of greatest variance, and weakly expressed genes should fall at the other. To determine where each gene is to be projected upon the first principal component, a dot product is performed (principal components are eigenvectors generated from X's covariance matrix, making them unit vectors). The resultant projection is Z′1. Projections are often normalized. The prime (′) in this case indicates that this is the non-normalized version.

In order to determine which codons are "major," or preferred, it is a simple matter of determining whether or not the codon contributes to the overall ordering of genes on Z′1. This is established by computing the correlation between the usage of each codon in each gene and each gene's location on Z′1. A positive correlation indicates that the codon contributes positively to the overall ordering of Z′1 and can be considered preferred or major. More formally, this is accomplished by measuring the correlation between each column of the frequency (RSCU) matrix and Z′1 (Figure 2.2). The resultant correlations are known as factor loadings. Kanaya et al. (1996) compared the factor loadings to the preferred codons derived using the Ikemura 4-rule method (Ikemura, 1981a) in order to validate their findings. No a priori knowledge of tRNA levels is required to determine which codons are major. Note that Ikemura is a coauthor on this work, so it can be seen as removing the a priori requirement from the determination of FOP (Ikemura, 1981a).
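The projection-and-loadings procedure can be sketched in pure Python as follows. Power iteration stands in for a full PCA, all names are illustrative, and a real analysis would use a linear-algebra library; this is a sketch of the idea, not the authors' implementation.

```python
def first_pc_projection(X, iters=200):
    """Project rows of X (genes x codon frequencies) onto the first
    principal component, found by power iteration on the covariance
    matrix of the column-centered data."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    C = [[r - means[j] for j, r in enumerate(row)] for row in X]
    cov = [[sum(C[i][a] * C[i][b] for i in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    v = [1.0] * m
    for _ in range(iters):  # power iteration toward the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # dot product of each centered gene row with the unit eigenvector
    return [sum(C[i][j] * v[j] for j in range(m)) for i in range(n)], v

def factor_loadings(X, z):
    """Pearson correlation between each codon column of X and the
    projection z; a positive loading marks a candidate major codon."""
    n = len(X)
    zm = sum(z) / n
    loads = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        cm = sum(col) / n
        num = sum((c - cm) * (zz - zm) for c, zz in zip(col, z))
        den = (sum((c - cm) ** 2 for c in col) *
               sum((zz - zm) ** 2 for zz in z)) ** 0.5
        loads.append(num / den if den else 0.0)
    return loads
```

A codon column that rises with a gene's position along the axis of greatest variance gets a loading near +1; a column with no relationship gets a loading near 0.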

CAI Revisited

In 2003 Carbone et al. removed the need for a priori knowledge of gene expressivity from the CAI calculation process (Carbone et al., 2003). Theirs is a two-step approach that works first to identify the proper reference set of genes (using a greedy, hill-climbing algorithm) and then calculates the CAI score for each gene based upon this reference set. Identification of the reference set is performed by assigning a precise mathematical definition to reference set membership and then searching for the genes that match this definition. Carbone et al. define a reference set as a small set of genes (1% of the genome) characterizing a bias to which its (the reference set's) adherence is stronger than that of all other genes in the genome. The search algorithm (for the reference set) is iterative in nature. Identification of the reference set begins by considering the entire genome as a reference set. The algorithm assigns a weight to each codon based upon the codon usage in this all-inclusive reference set. The weight for a given codon is calculated as before (Equation 2.5) based upon the identified reference set. CAI scores are calculated for all genes (Equation 2.9), the list of genes is sorted by this CAI score, and then the genes in the top half of the list are kept as the new reference set. New w values are calculated, followed by new CAI values for the genes. This half-splitting technique is repeated until the reference set represents one percent of the original number of genes.
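The half-splitting loop can be sketched as follows. The function names and `families` table are illustrative, not the published implementation; note that within a family a codon's weight reduces to its raw count divided by its maximal sibling's count, since the family mean cancels out of the RSCU ratio.

```python
from math import prod

def codon_weights(genes, families):
    """Weights from the pooled codon counts of the current reference
    set: count / max sibling count (equivalent to RSCU / max RSCU)."""
    counts = {}
    for g in genes:
        for c in g:
            counts[c] = counts.get(c, 0) + 1
    w = {}
    for codons in families.values():
        top = max(counts.get(c, 0) for c in codons)
        for c in codons:
            w[c] = counts.get(c, 0) / top if top else 0.0
    return w

def sc_score(gene, w):
    """CAI-style score: geometric mean of the gene's nonzero weights."""
    vals = [w[c] for c in gene if w.get(c, 0) > 0]
    return prod(vals) ** (1 / len(vals)) if vals else 0.0

def find_reference_set(genes, families, target_frac=0.01):
    """Start from the whole genome; repeatedly weight, score, sort,
    and keep the top half until ~target_frac of the genes remain."""
    ref = list(genes)
    target = max(1, int(len(genes) * target_frac))
    while len(ref) > target:
        w = codon_weights(ref, families)
        ref.sort(key=lambda g: sc_score(g, w), reverse=True)
        ref = ref[: max(target, len(ref) // 2)]
    return ref, codon_weights(ref, families)
```

Each halving re-derives the weights from a purer set of candidate genes, so the bias the weights describe sharpens as the set shrinks.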

To prevent confusion, when the term CAI is used it will be in reference to values derived using the traditional Sharp and Li approach. When the Carbone et al. (2003) method is utilized this text will employ the terminology "self-consistent codon index (SCCI)," as described in Carbone (2006). Self-consistency refers to the definitional condition that the identified reference set adhere more strongly to the bias (which the reference set itself defines, hence self-consistent) than all other genes in the genome.

The Need for an Improved Method for Identifying Translational Efficiency Bias

Biases associated with translational efficiency are not the only biases found in prokaryotic and small eukaryotic genomes. Codon usage can also be affected by tendencies toward high or low GC-content as well as strand bias (Carbone et al., 2003, 2005). Strand bias is brought on during the processes of transcription and replication. The transcription-induced bias is thought to be caused by increased deamination of cytosine leading to C→T transitions on the non-transcribed strand (Francino et al., 1996; Francino and Ochman, 1997; Beletskii and Bhagwat, 1998). The replication-induced bias is thought to be the result of the discontinuous process for replicating the lagging strand where Okazaki fragments are synthesized and then joined. This process results in the leading strand being richer in G+T than the lagging strand (Lafay et al., 2000, 1999; Rocha et al., 1999). In some cases these biases (content and strand) can coexist with translational efficiency bias (Grocock and Sharp, 2002; Carbone et al., 2005). When this occurs the translation-driven codon usage bias can be obscured, making gene expression levels difficult to predict.

The two algorithms written by Raiford et al. (that's me) attempt to address this by steering the search away from unbalanced GC-content regions in the search space (the first algorithm) or by searching for a set of SCCI weights that better explains the high placement of known highly expressed genes in a list of genes sorted by their adherence scores (the second algorithm). These methods are known as the modified self-consistent codon index (mSCCI) and the "search-based method," respectively.

The content and opinions expressed on this Web page do not necessarily reflect the views of nor are they endorsed by the University of Montana.