Université de Bordeaux, Centre de Bioinformatique et Génomique Fonctionnelle Bordeaux, F-33000 Bordeaux, France

Equipe SYMBIOSE, INRIA Rennes Bretagne Atlantique, Campus de Beaulieu, F-35042 Rennes, France

Université de Toulouse, ENVT, UMR 1225, F-31076 Toulouse, France

INRA, UMR 1225, F-31076 Toulouse, France

Anses, Lyon Laboratory, UMR Mycoplasmoses of Ruminants, 31 Avenue Tony Garnier F-69364 Lyon cedex 07, France

CIRAD, UMR CMAEE, Campus de Baillarguet, F-34398 Montpellier, France

Université de Bordeaux, UMR 1332, 71, avenue Edouard Bourlaux, F-33140 Villenave d'Ornon, France

INRA, UMR 1332, 71, avenue Edouard Bourlaux, F-33140 Villenave d'Ornon, France

Université de Bordeaux, Laboratoire Bordelais de Recherche en Informatique, UMR 5800, F-33405 Talence, France

Abstract

Background

Substitution matrices are key parameters for the alignment of two protein sequences, and consequently for most comparative genomics studies. The composition of biological sequences can vary importantly between species and groups of species, and classical matrices such as those in the BLOSUM series fail to accurately estimate alignment scores and statistical significance with sequences sharing marked compositional biases.

Results

We present a general and simple methodology to build matrices that are especially fitted to the compositional bias of proteins. Our approach is inspired from the one used to build the BLOSUM matrices and is based on learning substitution and amino acid frequencies on real sequences with the corresponding compositional bias. We applied it to the large scale comparison of Mollicute AT-rich genomes. The new matrix, MOLLI60, was used to predict pairwise orthology relationships, as well as homolog families among 24 Mollicute genomes. We show that this new matrix enables to better discriminate between true and false orthologs and improves the clustering of homologous proteins, with respect to the use of the classical matrix BLOSUM62.

Conclusions

We show in this paper that well-fitted matrices can improve the predictions of orthologous and homologous relationships among proteins with a similar compositional bias. With the ever-increasing number of sequenced genomes, our approach could prove valuable in numerous comparative studies focusing on atypical genomes.

Background

A fundamental task in evolutionary biology and comparative genomics is to quantify the similarity between biological sequences and then assess their evolutionary relationships. The challenge is in particular to distinguish between homologous (and even orthologous and paralogous) and unrelated sequences. DNA and protein sequences similarities are usually assessed through an alignment of sequences. The scoring function is crucial in the alignment process: not only it enables to choose the best pairing between the sequences among all possible, but also it is often at the heart of the evaluation of the alignment significance. Indeed, a p-value or e-value is usually derived from the score of the alignment to assess if it is statistically different from one that can arise by chance (in a random alignment). Substitution matrices are therefore key parameters in the alignment process since they assign an elementary score to each match and mismatch between any two letters of the alphabet. This is particularly relevant in the case of protein alignment. As the protein alphabet contains 20 amino acids with different frequencies in biological sequences and different biochemical properties that may impact the protein structure, matches and mismatches between amino acids are not equipropable. It is thus crucial to take into account these differences in order to estimate accurately the similarity between sequences.

The most widely used matrices are the BLOSUM

These matrices are typically expressed in log-odds scores: the score of the pairing of two amino acids is the logarithm of the ratio of the likelihoods of this pairing under two hypotheses: homology versus chance. It reflects thus how likely two aligned residues are descendants of a common ancestral residue with respect to random pairing: if the pairing of two residues is more likely to arise by chance than by homology the score will be negative, in the opposite situation it will be positive. Consequently, two probabilities need to be computed: the one of observing the two residues aligned in an alignment of homologous sequences and the probability of observing the two residues aligned in unrelated sequences. The latter can be easily computed when assuming independence between the sequences, that is the probability of drawing at random these two residues among all residues present in the sequences, it is thus the product of their frequencies in the sequences, also called the background frequencies. The first probability, also called the target frequency, is inferred by counting the amount of each kind of substitutions in homologous sequence alignments.

For both approaches, the reference set of homologous sequences used to estimate the substitution frequencies is therefore essential and the resulting matrices will reflect both the composition and the evolutionary characteristics of these sequences. For both matrix series, this reference set was composed of "classical" sequences and together with the large size of the datasets, this gave matrices with an "average" amino acid composition. However, sequence compositions can vary drastically, and one can find groups of sequences with atypical compositions: certain types of proteins have a composition directly correlated to their function, such as transmembrane proteins which include tracts of hydrophobic residues. Additionally, at the scale of the whole genome, some species show extreme nucleotide compositions that affect their amino acid composition, one of the most extreme example being the malaria agent,

Keeping the rationale of log-odds scores in mind, we can easily figure out that the classical matrices will not perfectly fit sequences with biased amino acid composition. Basically, the likelihood of each match and mismatch under the null hypothesis of unrelated sequences will vary according to the differences in the background frequencies. For instance, for amino acids which are rare in classical sequences but much more abundant in the sequences of interest, the probability of observing their pairing just by chance would be greatly underestimated in the classical matrices and therefore their log-odds score will not reflect a "true" likelyhood ratio. As a consequence of these discrepancies the resulting alignment can be wrong (not pairing together homologous residues) and also mis-evaluated, both leading possibly to mis-interpretation of the similarity. This issue has already been pointed out from a theoretical point of view by Yu

In this paper, we address the problem of designing new substitution matrices fitted to the compositional bias of Mollicute proteins. Mollicutes are a particular class of bacteria derived from Gram-positive bacteria, including for instance the well-known

Results

Compositional bias of Mollicute proteins

The nucleotide composition of Mollicute genomes is strongly biased towards A and T, with a G+C content that varies from 21.4 to 40.0% with a median value of 27.1%.

Amino acid frequencies

**Amino acid frequencies**. Amino acid frequencies averaged over all cds of 24 whole genomes of Mollicutes versus BLOSUM62's. A complete list of genomes taken into account is provided in Additional file

MOLLI60: the new substitution matrix

The main result of this paper is the generation of a new substitution matrix, called MOLLI60, which is better fitted to the compositional bias of Mollicute protein sequences. We generated this matrix with an approach similar to the one used to generate the well-known BLOSUM matrices

**The MOLLI60 matrix**.

Click here for file

**perl scripts for building new matrices**.

Click here for file

Global features of the matrix were then investigated, such as the entropy and expected values. These are commonly computed for substitution matrices and reflect the level of stringency of the matrix. Typically, a matrix with a lower entropy and a greater expected value will reflect more diverged sequences. We reported also, in table

Comparison of entropy and expected values for several matrices

**matrix**

**entropy**

**expected**

**mean mismatch**

**mean match**

MOLLI60

0.7126

-0.5820

-1.56

6.10

BLOSUM62

0.6979

-0.5209

-1.42

5.80

BLOSUM45

0.3795

-0.2789

-1.34

7.05

However, when looking more closely at each individual element, we can see in Figure

matrix substraction: MOLLI60 - BLOSUM62

**matrix substraction: MOLLI60 - BLOSUM62**. Term by term substraction of the two matrices MOLLI60 minus BLOSUM62

The robustness of MOLLI60 with respect to the learning set composition was evaluated, using a re-sampling approach (see Methods). If we removed randomly 10% of the initial mutliple alignments, the resulting matrix was slightly modified with 3 to 8% of the positions with different values and when removing half of the data only 15% of the matrix positions were changed. Noteworthy, these changes are minor compared to the differences with BLOSUM62.

Since the initial alignments used to infer the substitutions were obtained using BLOSUM62 (not fitted to the bias), the impact of this first step was evaluated by iterating the matrix construction with MOLLI60. Actually, this recursive step appeared not necessary since the resulting matrix was very similar to the first one, showing only 10 elements over 210 with a -1 or +1 difference (less than 5% of the elements). The amount of differences is thus similar to the one obtained when removing 10% of the data.

Application of MOLLI60 in homology predictions

We used the matrix MOLLI60 to align Mollicute protein sequences and to predict homology and orthology relationships based on these alignments.

Orthology predictions

The standard method of bi-directional best hit (BDBH) was used to predict orthologous genes between two genomes and proteic sequence alignments were obtained using the program SSEARCH, using either MOLLI60 or the standard BLOSUM62 as substitution matrix (see Methods).

We focused on 3 mycoplasma genomes for which we dispose of a reference set of orthologous relationships:

When comparing orthologous predictions obtained using the standard BLOSUM62 matrix versus our matrix MOLLI60, we first noticed that alignments obtained with MOLLI60 have overall smaller e-values than the ones obtained with BLOSUM62: the median e-value of BDBH proteins obtained with BLOSUM62 is 5.9e-22 against 1.1e-31 for MOLLI60. If we consider only common pairwise relationships between the two sets, the median pairwise ratio of e-values is of 10^{11 }in favor of the alignments obtained with MOLLI60: half of the common alignments have e-values obtained with BLOSUM62 more than 10^{11 }larger than the ones obtained with MOLLI60. However, having better e-values does not necessarily mean that the alignments are better and enable to infer better biological information. What is important is to get scores or e-values that improve the discrimination between biologically meaningfull similarity and random similarity. If we consider now all the alignments obtained between all proteins (with still an e-value cut-off of 10), meaning that this set contains surely a large portion of non-homologous relationships (or poor similarity), we still get better e-values with the MOLLI60 alignments than the ones with BLOSUM62. However, the difference is significantly smaller than previously with a median ratio of only 5 (instead of 10^{11}). Thus this suggests that the observed differences in e-values between the two matrices are not systematic and depends on the significance of the alignments.

To evaluate the discriminative power of the e-values, we plotted in Figure ^{-38 }against 10^{-25}) and it is the contrary for false positive alignments (0.026 against 0.00125). Overall the mean ratio of e-values between false and true positives is of 10^{45 }for MOLLI60 against 10^{25 }for BLOSUM62, thus the discrimination between true and false positive e-values is significantly improved when using MOLLI60 (p-value of 4.96

Evalue discrimination

**Evalue discrimination**. Comparison of the distribitution of BDBH e-values between True Positive (TP in red) and False Positive (FP in green). On the left alignments were obtained with BLAST and the BLOSUM62 matrix, on the middle with SSEARCH and the BLOSUM62 matrix, and on the right with SSEARCH and the MOLLI60 matrix.

For both matrices, we computed also their sensitivity and specificity (see Methods). As the two matrices give different e-values even for the same alignment, it is not pertinent to compare their sensitivity and specificity when using a fixed e-value cut-off to predict orthologs. Therefore we represented them using the Receiver Operating Characteristic (ROC) curves

ROC curves

**ROC curves**. ROC curves of one-to-one orthologous relationship predictions using the Bi-directional Best Hit method with several best-hit criteria and substitution matrices. Either the score (bdb-score) or the e-value (bdb-evalue) was used to determine the best hit, and the alignments were obtained with SSEARCH using either the BLOSUM62 matrix or the MOLLI60 matrix. As inset, is a zoom-in on the most interesting part of the curve: where the trade-off between sensitivity and specifity is usually determined for orthology predictions. Note that sensitivity (or true positive rate) is plotted against the absolute number of false positives instead of the false positive rate as in a classical ROC curve, for lisibility reasons. FP rate can be obtained simply by dividing the amount of FP by the amount of non-orthologous relationships which is constant (884833).

**Supplementary figures**.

Click here for file

Importantly, we verified that these results were not due to the fact that the training dataset used to build the matrix had common sequences with the test dataset. To do so, we built a second matrix without using any sequence of the genomes of the test dataset (

Homologous family predictions

We applied our matrix also to predict homologous protein families among 24 Mollicute genomes. The prediction of homologous relationships is more challenging that predicting orthologous ones (and especially one-to-one relationships), since we can not rely anymore on the best hit criterion. Instead, we have to use a cut-off to decide if two proteins are homologous and a clustering approach to build groups where each member is assumed to be homologous to any other member of the group. The construction of homologous clusters was performed using a naive approach based on an e-value cut-off: any two proteins with their alignment e-value lower than the cut-off are grouped in the same cluster (this corresponds to single-linkage clustering, see the Methods section). This implies that two proteins can belong to the same cluster even if they do not share enough similarity, for instance they only need to be both related to a third protein. This transitivity property is justified by the definition of homology of a common ancestor, which implies that if a and b are homologous and so are a and c, then b and c are also homologous.

Considering such process, and as the best hit criterion is no more usefull, we guess that the impact of the substitution matrix on these predictions may be more significant than for BDBH predictions. However in the absence of a reference dataset, the estimation of sensitivity and specificity scores is not applicable. Instead, we can compare the number of clusters, their size and their phylogenetic profile (presence/absence of species). We call the pan genome the number of distinct clusters (including also the single proteins) among the set of genomes, and the core genome the number of distinct clusters containing at least one member of each species. Assuming that each cluster represents a distinct funtion, the pan genome may be seen as the set of functions one can find in the community of genomes and the core genome the set of functions shared by all genomes. The number and size of clusters depend naturally on the e-value threshold: the more stringent is the threshold the more numerous and the smaller the clusters are, in other words the larger both the pan and the core genomes are. We can expect that a good clustering would give the fewest singletons (that is single proteins without any homolog), and consequently the smallest pan genome. On the other hand, a clustering with too many large clusters may group together several distinct families and thus reduce the core genome. Therefore when comparing several clustering we should prefer the one with the largest core genome and the smallest pan genome.

Since the core and pan genomes depend on the e-value threshold, and we have seen previously that e-values are not directly comparable between the two matrices, we built several clustering for each matrix by varying the e-value cut-off and computed for each their core and pan genomes. The results are presented in Figure

Core and pan genomes

**Core and pan genomes**. Comparison of core (top) and pan (bottom) genome sizes for clustering obtained with the two matrices, as a function of the e-value threshold (left) or the number of retained alignments (right). The dash lines represent the number of unclustered proteins, ie. single proteins.

As a complementary view, we evaluated the homogeneity of the families in terms of domain compositions of their protein members. Assuming domain architecture is conserved inside families, we expect that a good clustering will have a common domain architecture inside each family and different ones between families. For 14% of the proteins we could predict a domain architecture based on the Pfam-A database (see the Methods section). For a given clustering, we computed two metrics of domain homogeneity:

Based on these results, the matrix MOLLI60 was applied for the generation of comparative data in the reference database Molligen (predictions of orthologous and homologous relationships) and it is proposed as an optional parameter for the alignment tools provided to query the database.

Discussion

We want to emphasize that, although our results are focused on a specific group of species, the Mollicutes, the method for building the matrix and the analyses of its impacts can be easily generalized to any group of species with an atypical composition.

Methodology to build the matrix

To build this new matrix, we followed the approach initially proposed by Henikoff and Henikoff to build the BLOSUM series

As for the global approach, we chose to follow the BLOSUM strategy rather than the PAM one mainly because it has been shown to perform better in homology search context

Actually, the major hurdle of our strategy is its reliance on a reference set of orthologous proteins, as opposed to methods that only adjust the standard matrices to the bias with theoritical formulas

As concerns this learning dataset, we also verified that the final matrix was not too dependant on some individual parts of this set, by a re-sampling approach. This suggested that our learning set was homogeneous enough to estimate correct substitution rates. We also noticed that a recursive step where multiple alignments would be obtained with the new matrix instead of the standard BLOSUM62 had few impacts on the resulting matrix. This suggests that substitution matrices may be less important in multiple alignment than pairwise alignment and especially for well conserved proteins. Indeed, the signal of conservation, if it exists, is easier to detect when we deal not only with two sequences but with a large set of sequences. Moreover, we used a multiple aligner, T-Coffee, which is based on consistency scores rather than substitution scores and may thus be less sensitive to differences in substitution matrices. Additionally, in multiple alignments, we align sequences already known as homologous and we do not need nor use the scoring of the alignment. In the context of pairwise alignment for homology search, substitution matrices may have more impact on the evaluation of the alignment rather than on the alignment itself (ie. which position is aligned against which one).

Finally, the last but non negligible step of the method is the optimisation of the gap parameters to the new matrix. In gapped alignments, gap penalties have a strong impact on the alignments and especially their length. Indeed, the score of a gap has to be fixed relatively to the scores of matches and mismatches. The task of adjusting the gap penalties to the matrix is not trivial and there exists no theoretical formula to relate them to some features of the matrix

Evaluation of the matrix

Pairwise alignments and similarity searches in databases are the basis of numerous comparative genomics analyses. What is important in this step is to be able to discriminate between biologically meaningfull similarities and random ones. We showed in this paper that this can be improved by using well fitted substitution matrices. We demonstrated that MOLLI60 enabled to better evaluate pairwise alignments with raw-score, as well as e-values. The difference with BLOSUM62 was particularly significant for the raw score (Figure

In this paper, we also compared two alignment programs: SSEARCH ^{2})), its implementation in SSEARCH as multi-threaded and vectorized greatly improves the runtime, which is no longer a limiting factor. Secondly, they differ in the way they evaluate alignments. BLAST computes e-values using a formula with pre-computed statistical parameters

The evaluation of the performances and impacts of the substitution matrices in homology search is a difficult task especially due to the lack of benchmarks or reference datasets, that is the knowledge of the true homologous relationships. Whereas such datasets may exist for model organisms, they lack crucially for many specific groups of species, such as Mollicutes. We first used a personal manually curated dataset of one-to-one orthologs. As the human expertise can not infer with certainty all evolutionary relationships between proteins of distant species, this set may contain errors and missing orthologs. These errors could bias our estimators of performance such as sensitivities and specificities. However, we can assume that these errors are independent of the compositional bias and will impact the results of both matrices the same way. Moreover, the initial set of orthologs which was further curated was obtained with classical predictions using BLOSUM62. Therefore, potential errors in the reference dataset will tend to favor the BLOSUM62 matrix rather than MOLLI60, and we may only underestimate the benefits of our new matrix. Indeed, if MOLLI60 did improve the ortholog predictions, the benefits were slight. This is due in part to the small size of the benchmark and also to the nature of the predictions: one-to-one orthologs are the easiest homology relationships to predict and the criterion of best hit reciprocity is very strong. The additionable value of MOLLI60 can only be shown for a small subset of the results and its impact is consequently minimized with respect to the large number of obvious one-to-one orthologs.

Concerning the homologous families analysis, to cope with the lack of "true" relationships, we used the phylogenetic profiles of the clusters as an evaluation criterion. We thus took full advantage of the large number of genomes at our disposal and their wide distribution in the evolutionary tree of Mollicutes. The computation of the core and pan genomes enables to have a first glance at the clustering properties in terms of biological considerations. The evaluation of the clustering is however based on two assumptions: that the orphan genes (genes without homolog) are rare and must be minimized and that the number of distinct clusters shared by all genomes must be maximized. These assumptions are guided by the parsimony principle. Indeed, in the last common ancestor of the considered species the core and pan genomes were equal and they diverged by gene gains and losses: the pan genome increased by gene gain and the core genome decreased by gene loss. Therefore a clustering resulting in a smaller pan genome and a larger core genome may invoke less gain and loss and be more parsimonious. It is however important to note that this hypothesis is strong and may not reflect the reality of Mollicute gene set evolution, that is the real evolutionary scenario may not be the most parsimonious. This is the reason why we performed a second evaluation based on the domain composition in and between families. This relies also on an hypothesis, that domain architectures are conserved inside the families and are specific to families.

Moreover, it concerns only a small portion of the genes, those for which the function is the best known and represented in studied organisms and this may bias the results towards more conserved genes. As the two evaluation methods relied on different assumptions and may have different biases, we argue that they complement each other and the fact that they gave the same trends strengthen our conclusion.

Conclusions

In conclusion, we showed in this paper that we can improve orthology and homology predictions of Mollicute proteins by using a well-fitted substitution matrix, rather than the standard BLOSUM62. We argue that this result can be generalized to other groups of species with a marked compositional bias. The methodology we propose to construct such well-fitted matrices is very simple to carry out and could be useful in an increasing number of studies especially in the current context of massive growth of genomic data: the new sequencing technologies enable from now on to study specific groups of genomes with their specific and atypical nucleotide or amino acid compositions, notably for sequences coming from metagenomic studies.

Methods

Building of the matrix

To generate new matrices, we followed an approach similar to the one used for the well-known BLOSUM matrices

**List of genomes used in these analyses**.

Click here for file

These families were obtained by combining pairwise orthologous predictions (by BDBH) with high significance alignments, then selecting protein sets having at most one member per species. These families were further manually curated according to the protein annotations

For each family, a multiple alignment was obtained using the program T-Coffee _{
i
}with _{
j
}is obtained with the following formula:

where _{
i
}, _{
j
}) is the observed frequency of pairwise alignment of the two amino acids _{
i
}and _{
j
}, and _{
i
}) (resp. _{
j
})) is the observed frequency of the amino acid _{
i
}. The level of clustering is determined with the parameter

For each matrix, relative entropy and expected values were obtained from the following formula:

The whole pipeline to build new matrices was implemented in perl scripts, which are provided in Additional file

Application of the matrix

Sequence data

All sequences and annotations of Mollicute proteins were retrieved from the Molligen database

Alignments and BDBH

Protein alignments were obtained with the program SSEARCH of the Fasta program package

BDBH between pairs of genomes were obtained by aligning each protein sequence of one genome against all the protein sequences of the other genome and vice versa. A pair of proteins, _{1 }and _{2}, belonging respectively to genomes _{1 }and _{2}, was called a BDBH if _{1 }is the best hit of _{2 }among all proteins of _{1 }and reciprocally _{2 }is the best hit of _{1 }among all proteins of _{2}. The best hit was determined by the e-value (if not mentionned in the text) or the raw score of the pairwise alignment. BDBH relationships can be further filtered according to their alignment features (e.g. e-value) to discard less significant pairs.

Gap penalties

The substitution matrix is not the only parameter used to score gapped alignments. For each new matrix, it is thus important to optimise gap penalties. Indeed, the score of a gap has to be fixed relatively to the scores of matches and mismatches. Our approach to determine well-adapted gap penalties was empirical: we tried several combinations of gap open and gap extend penalties, and we evaluated them with respect to the performances in predicting orthologous relationships with the BDBH approach. We chose the gap penalties that gave the best compromise between sensitivity and specificity of ortholog predictions, using the ROC curves. We performed the same analysis for the BLOSUM62 matrix. For both matrices, we chose the following penalties: 10 for the opening of a gap and 1 for each extension of a gap. We verified that both matrices along with their gap penalties gave alignments of similar length.

Robustness evaluation

To evaluate if the obtained matrix is robust with respect to the set of protein families given as input, we used a Jacknife sampling approach. We generated 10 other matrices with the same procedure but with a smaller set of protein families sampled randomly from the original one. We generated two sets of samples: keeping either 90% or 50% of the initial protein families.

Reference set of orthologous relationships

The reference set of one-to-one orthologous relationships was obtained with standard BDBH predictions between 3 well annotated

This set enables to evaluate several methods of orthology predictions, by classifying the predictions as true positive, false positive (predictions absent from the reference set) and false negative (relationships of the reference set not predicted as orthologous); we denote by TP, FP and FN the amount of predictions in each category respectively. Then we can compute sensitivity and specificity values for each method, such as

The variability of alignment evalues was analyzed with respect to the matrix and the status of the prediction (either TP or FP), using an Analysis of Variance. To test the effect of each factor and in particular if the discrimmination between TP and FP evalues is different according to the matrix used (interaction factor), we used the following model:

with

Clustering of protein families

We built homologous protein families between 24 Mollicute genomes with a simple clustering method: first we runned all-vs-all protein alignments with SSEARCH using one of the two matrices. Then, we grouped together in a same cluster any two proteins with a significant alignment, that is with an e-value less than a given threshold and covering at least 50 percent of the longest protein. In other words, we built a similarity graph with each node being a protein and each edge connecting two proteins with a significant alignment. The clusters were then the connected components of this graph, that is any two custers are completely disjoint in this graph (also known as single-linkage clustering).

Domain architectures were attributed to clusters containing at least one sequence with at least one predicted domain. We considered the domains predicted by running InterProScan on each proteic sequence against the Pfam-A database of domains

Authors' contributions

CL implemented the method and performed the analyses. CL, AB, PT and PSP designed and conceived the method and experiments. AB and PSP built and manually curated the reference datasets. CL and PT wrote the manuscript. PSP, CT, FTa and FTh wrote the grant application upon which this project was built. All authors read and approved the final manuscript.

Acknowledgements

This work was funded by the French project ANR EVOLMYCO (ANR-07-GMGE-001). We thank two anonymous reviewers for helpful remarks on the manuscript.