Efficacy of the Fuzzy Polynucleotide Space in Phylogenetic Tree Construction

The study of evolutionary relationships is an important endeavor in the field of Bioinformatics. The fuzzification of genomes led to the introduction of a “fuzzy polynucleotide space”, which has been successfully used in classification and clustering of amino acids, thereby suggesting a possible application in phylogeny. As phylogenetic trees illustrate similarities and evolutionary relationships among different taxa, through this study we attempt to determine the efficacy of the fuzzy polynucleotide space in phylogenetic tree reconstruction, and discuss its implications in evolutionary biology.


INTRODUCTION
Sequence analysis and comparative genomics play a central role in Bioinformatics. Phylogenetic relationships among organisms are established on the basis of molecular sequences, in order to understand their course of evolution and ancestry. Molecular phylogeny involves building a relationship tree that shows the probable evolution of various organisms. The conventional tree building approaches are broadly divided into: a) Distance based approaches, which take into account the evolutionary distances between all taxa, where the distance represents the number of nucleotide or amino acid changes between sequences. These include methods such as Neighbour Joining, Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Minimum Evolution, and the like. b) Character based approaches, which include methods such as Maximum Parsimony, Maximum Likelihood and Bayesian sequence analysis.
Statistical techniques have played, and will continue to play, a pivotal role in sequence analysis. The past decade has witnessed several applications of fuzzy sets and fuzzy logic in bioinformatics, including successful use in sequence alignment, DNA sequencing, clustering and classification [1,2,3,4]. Fuzzy set theory was first rendered directly accessible to sequence comparisons in the works of Sadegh-Zadeh. He introduced the concept of Fuzzy Polynucleotides [5], by transforming nucleic acid sequences into ordered fuzzy sets. The author showed that the genetic code can be considered as a 12-dimensional code, with each triplet codon XYZ having a 3 × 4 = 12 dimensional fuzzy code, and thus falling as a point in what the author termed the 12-dimensional fuzzy polynucleotide space $I^{12} = [0,1]^{12}$, where $I = [0,1] \subset \mathbb{R}$.
Torres and Nieto [6] redefined the Fuzzy Polynucleotide Space, based on the fuzzy hypercube concept proposed by Bart Kosko [7]. Taking into account the frequencies of the nucleotides at the three base sites of a codon in the coding sequence, the authors mapped a given polynucleotide onto an $I^{12}$ space, which they termed the Fuzzy Polynucleotide Space (FPNS). A sequence of any length could thus be mapped onto a 12-dimensional vector, facilitating comparison between sequences of varying lengths. A distance metric d, which determines the distance between the fuzzy vectors of any two polynucleotides, was proposed.
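The mapping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fuzzy_vector`, the base order T, C, A, G, and the decision to ignore a trailing incomplete codon are all assumptions made for the example.

```python
from collections import Counter

BASES = "TCAG"  # assumed base order; the text does not prescribe one


def fuzzy_vector(cds):
    """Map a coding sequence to a 12-dimensional vector of base frequencies.

    Components 0-3 hold the frequencies of T, C, A, G at codon position 1,
    components 4-7 those at position 2, and components 8-11 those at position 3.
    A trailing incomplete codon is ignored (an assumption of this sketch).
    """
    seq = cds.upper().replace("U", "T")
    n_codons = len(seq) // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one full codon")
    vector = []
    for pos in range(3):
        # Every third base starting at this codon position.
        counts = Counter(seq[pos:3 * n_codons:3])
        vector.extend(counts[base] / n_codons for base in BASES)
    return vector


if __name__ == "__main__":
    print(fuzzy_vector("ATGGCC"))
```

Because every component is a frequency, the vector lies in the unit hypercube regardless of sequence length, which is what allows sequences of different lengths to be compared.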
Given the fuzzy polynucleotide vectors of two sequences p and q, where $p = (p_1, p_2, \ldots, p_n)$, $q = (q_1, q_2, \ldots, q_n) \in I^n$, $n = 12$, the difference between p and q was calculated as:
$$d(p, q) = \frac{\sum_{i=1}^{n} |p_i - q_i|}{\sum_{i=1}^{n} \max\{p_i, q_i\}} \tag{1}$$
The distance metric as defined in Equation (1) is termed the NTV metric. The authors computed the fuzzy polynucleotide space for two genomes, those of E. coli and M. tuberculosis, considering only the coding regions of these genomes, and the distance between them was calculated. The approach was further extended and distances between other genomes were computed [8].
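As an illustration of the distance computation, the sketch below assumes that the NTV metric has the form d(p, q) = Σ|p_i − q_i| / Σ max(p_i, q_i). The function name `ntv_distance` and the convention of returning 0 for two all-zero vectors are choices made for this example only.

```python
def ntv_distance(p, q):
    """Distance between two fuzzy vectors, assuming the form
    sum(|p_i - q_i|) / sum(max(p_i, q_i)).
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(max(a, b) for a, b in zip(p, q))
    # Two all-zero vectors are treated as identical (assumed convention).
    return 0.0 if denominator == 0 else numerator / denominator


if __name__ == "__main__":
    print(ntv_distance([0.5, 0.5], [1.0, 0.0]))
```

Under this form the value always lies between 0 (identical vectors) and 1 (vectors with no overlapping non-zero components).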
The NTV metric has also been used for the classification of amino acids via a fuzzy equivalence relation [9]. In that study, the authors used two different distance functions, viz. the Minkowski distance function and the NTV metric. The clusters obtained using the NTV metric were the same as those obtained using the Minkowski distance metric for high values of the similarity degree. Nieto and Torres [10] have suggested the possible use of the NTV metric in phylogenetic analysis. Against this backdrop, we have attempted in this study to assess the efficacy of the NTV metric in phylogenetic reconstruction.

METHODS
The structured approach is divided into three parts. Section 2.1 deals with data collection, section 2.2 describes the salient features of sequence analysis, and the results and discussion on phylogenetic tree analysis are presented in section 2.3.

Data collection
A total of nine datasets were considered for the detailed study. However, the discussion on the results of three major datasets was considered sufficient to test the hypothesis and draw meaningful conclusions. The other datasets are available on request.
Dataset 1 comprises polyprotein-coding regions of Dengue type 3 viruses. The viral isolates were chosen from different regions of the world. Dataset 2 represents gyrase B gene sequences from members of the genus Microbacterium. Dataset 3 includes vertebrate mitochondrial cytochrome b sequences. The cyt b genes were taken from representative members of the six classes, viz. Mammalia, Reptilia, Amphibia, Aves, Chondrichthyes and Osteichthyes, of Sub-Phylum Vertebrata. The other datasets considered in the detailed analysis include gyrase B gene sequences from Burkholderia, ompA gene sequences from the genus Rickettsia, chloroplast matK gene sequences from the subfamily Tillandsioideae, VLTF-1 genes from Penguin-pox virus, low-molecular weight glutenin subunit genes from tall wheatgrass, and mitochondrial genes from the hawkmoth genus Hyles.
Only protein-coding genes were considered, and the coding sequences were extracted from the National Center for Biotechnology Information (NCBI). All the datasets comprised experimentally validated, non-redundant sequences. For the majority of the datasets, the phylogenetic relationships have been well established. Generic and species information was obtained from the taxonomy database of NCBI.

Sequence Analysis
Multiple sequence alignment was performed using ClustalW [11]. The aligned sequence data were used to determine distances using the DNADIST program of the Phylogeny Inference Package (PHYLIP) [12]. The Jukes-Cantor model was selected for determining evolutionary distances. Each sequence in every dataset was also mapped onto a 12-dimensional fuzzy vector, i.e., each sequence was represented in terms of its fuzzy polynucleotide space. Distance matrices were then computed for the same sequences using the NTV distance metric.
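For reference, the Jukes-Cantor distance between two aligned sequences can be sketched as below, using the standard correction d = −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites. This is an illustrative stand-in rather than the DNADIST implementation; the function name `jukes_cantor` and the choice to skip any column containing a gap or an ambiguous base are assumptions of the example.

```python
import math


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance between two aligned nucleotide sequences.

    Columns where either sequence has a gap or a non-ACGT symbol are
    skipped (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in valid and b in valid:
            compared += 1
            differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differences / compared
    if p >= 0.75:
        # The correction is undefined once p reaches the saturation level.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(jukes_cantor("AAAA", "AAAC"))
```

Unlike the fuzzy vector representation, this calculation compares the two sequences site by site, so it depends on the alignment.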
The Neighbour Joining (NJ) method, one of the most effective distance based methods, was used for phylogenetic tree construction. The distance matrices generated through DNADIST and the NTV metric served as input for the NEIGHBOR program of PHYLIP. The number of bootstrap replicates was set to 1000 for all trees.
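A distance matrix computed outside PHYLIP has to be written in the square distance-matrix layout that NEIGHBOR reads: the number of taxa on the first line, then one row per taxon whose name occupies the first ten characters. The helper below is a hypothetical sketch of such a writer; the name `phylip_matrix` and the six-decimal formatting are illustrative choices, and truncating names to ten characters may create duplicates that would need renaming by hand.

```python
def phylip_matrix(names, distances):
    """Format a square distance matrix as PHYLIP-style text.

    names: list of taxon labels; distances: list of rows of floats.
    """
    if len(names) != len(distances):
        raise ValueError("one row of distances is needed per taxon")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, distances):
        if len(row) != len(names):
            raise ValueError("distance matrix must be square")
        # Names are truncated or padded to exactly ten characters.
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = phylip_matrix(["taxonA", "taxonB"], [[0.0, 0.25], [0.25, 0.0]])
    print(text, end="")
```

The resulting text can be saved to a file and supplied to NEIGHBOR in place of the DNADIST output.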

Class Amphibia: Xenopus laevis (X.levis)
Class Chondrichthyes: Chimaera monstrosa (Rabbit fish)

RESULTS
For all the datasets, there was a marked difference in topology between the trees constructed using the Jukes-Cantor distance and those constructed using the NTV metric. The trees generated employing the Jukes-Cantor distance conformed to the established phylogenetic relationships for all the datasets, while the NTV metric based trees showed varying results. Figures 1(a) and 1(b) show the trees generated for Dataset 1, employing the Jukes-Cantor distance and the NTV metric respectively. As can be observed, the NTV metric fails to show distinct clusters for the viral isolates from different countries.
Fig. 1(a). Tree constructed employing the Jukes-Cantor distance model in DNADIST using the Neighbour Joining method for Dataset 1.
Fig. 1(b). Tree constructed employing the NTV distance metric and the Neighbour Joining method for Dataset 1.
Figures 2(a) and 2(b) show the difference in topology between the trees generated using the Jukes-Cantor distance and the NTV metric for Dataset 2. Figure 2(a) conforms to the established phylogeny of Microbacterium [13]; the NTV-based tree, however, shows starkly contrasting results and does not agree with the known phylogenetic relationships of the genus. For example, the NTV metric groups M. arborescens with M. aerolatum, whereas it is known to be evolutionarily closer to M. imperiale, as reflected by the Jukes-Cantor distance in Figure 2(a). The established phylogeny of the genus Microbacterium follows distinct clusters, while the NTV-based tree shows incorrect and fewer clades of taxa.
Fig. 2(a). Tree constructed employing the Jukes-Cantor distance model in DNADIST using the Neighbour Joining method for Dataset 2, using Agromyces as outgroup.
Fig. 2(b). Tree constructed employing the NTV distance metric and the Neighbour Joining method for Dataset 2, using Agromyces as outgroup.
Figures 3(a) and 3(b) similarly show the differences observed between the two methods of phylogenetic reconstruction for Dataset 3.
As can be observed, the NTV metric does not give distinct clusters for the members of the different vertebrate classes. It also groups the African savanna elephant with the guinea pig, although the two belong to different mammalian orders. While the gibbons cluster together in a single clade in both trees, the similarity between the generated trees is otherwise limited. These results are contrary to those seen in Figure 3(a), and differ from the established taxonomic relationships in vertebrates. Further, cytochrome b sequences are highly conserved among eukaryotes, and are known to reflect the established relationships among the taxa represented. The NTV metric could not correctly capture distances even among such well-documented similarities.
A. Sambarey and A. Deshpande
Fig. 3(a). Tree constructed employing the Jukes-Cantor distance model in DNADIST using the Neighbour Joining method for Dataset 3.
Fig. 3(b). Tree constructed employing the NTV distance metric and the Neighbour Joining method for Dataset 3.
The same variation in tree topologies between the two methods was seen for the phylogenetic trees constructed using the other six datasets. The NTV-based trees differed sharply from the trees expected on the basis of known relationships. Further, in all cases, the number of distinct clades was markedly lower in the NTV-based trees, in contrast to the clear clusters observed in the Jukes-Cantor based trees. This highlights the limitation of the metric in capturing evolutionary relationships among various taxa. Thus, for all the datasets, the NTV metric failed to correctly represent the phylogenetic relationships among organisms.

DISCUSSION
Some of the possible reasons for the failure of the fuzzy polynucleotide space in determining biological distances are as follows. One reason could be that the fuzzy polynucleotide representation is the same for two different sequences when one sequence is just a permutation of the triplets of the other, as pointed out by K. Sadegh-Zadeh [14]. The distance between two such sequences would be zero according to the NTV metric, even though the sequences themselves are different.
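The permutation problem can be demonstrated with a short, self-contained sketch. The helper names `fpns` and `ntv`, the base order T, C, A, G, the assumed form of the metric (sum of absolute differences divided by the sum of component-wise maxima), and the two toy sequences are all illustrative assumptions, not data from the study.

```python
def fpns(cds):
    """12 base frequencies: T, C, A, G at each of the three codon positions."""
    n = len(cds) // 3
    return [cds[pos:3 * n:3].count(base) / n
            for pos in range(3) for base in "TCAG"]


def ntv(p, q):
    """Assumed NTV form: sum(|p_i - q_i|) / sum(max(p_i, q_i))."""
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(max(a, b) for a, b in zip(p, q))
    return 0.0 if denominator == 0 else numerator / denominator


original = "ATGGCCTTA"   # codons ATG, GCC, TTA
shuffled = "TTAATGGCC"   # the same codons in a different order

# Site-by-site mismatches between the two (equal-length) sequences.
mismatches = sum(a != b for a, b in zip(original, shuffled))
distance = ntv(fpns(original), fpns(shuffled))

if __name__ == "__main__":
    print("mismatching sites:", mismatches)
    print("NTV distance:", distance)
```

The two sequences disagree at most sites, yet their fuzzy vectors are identical, so the distance between them is zero.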
Another explanation for the limitation of the NTV metric in phylogeny is that phylogeny is a depiction of evolutionary distances between sequences, and takes into account per-site substitutions in a sequence alignment. The conventional distance based approaches used for phylogenetic construction employ distance measures such as the Jukes-Cantor distance, the Kimura 2-parameter correction, etc. The Jukes-Cantor substitution model estimates the number of substitutions (synonymous and nonsynonymous alike) per site of the alignment, and hence reflects the number of changes that have occurred in the DNA over the course of evolution. Since the NTV metric is independent of sequence alignment and depends only on the relative base frequencies at each site of the codon, it does not account for evolutionary changes and hence is not an appropriate indicator of distances between biological sequences.

CONCLUDING REMARKS
This limited study suggests that the fuzzy polynucleotide formalism may not be suitable for the construction of phylogenetic trees, as it is not a true indicator of distances among biological sequences.