In addition to our own dataset of Oxytropis ITS clones comprising the ITS1, 5.8S rDNA, and ITS2 regions and the flanking 18S and 25S rDNA (also including some European Astragalus; EMBL accession numbers AM401376 to AM401574; AM943374 to AM943384; FM205750 to FM205773), we added all ITS sequences stored in the DDBJ, EMBL, and NCBI gene banks (sequences usually obtained by direct sequencing, not by cloning; downloaded in January 2008) for the genera Astragalus and Oxytropis. Solitary ITS1 and ITS2 sequences were combined if they belonged to the same source according to the original literature/description. The sequences were aligned using the POA program (Lee et al. 2002). The 5.8S rDNA was excluded from the analyses. Phylogenetic inference was performed under the maximum likelihood (ML) optimality criterion using RAxML 7.0 (Stamatakis 2006; Stamatakis et al. 2008). The program implements a new, fast ML bootstrapping procedure followed by a search for the best topology. Duplicate sequences were eliminated prior to ML analyses. Tree inference and 100 bootstrap replicates were conducted under the CAT approximation (Stamatakis 2006), but final parameter optimization was done under a GTR+Γ model. A plain text (NEXUS) file containing the complete alignment, the GenBank accession numbers, and information on the sets of identical sequences that RAxML reduced to a single sequence is available at http://www.goeker.org/mg/clustering/.
To obtain a non-nested sequence classification based on a given distance (or similarity) threshold, we use non-hierarchical single-linkage (NHSL) clustering, as does the popular tool blastclust (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html; regarding hierarchical single-linkage clustering, see, e.g., Legendre and Legendre 1998, pp. 308-312). NHSL is based on the notion of a "link". A link is defined as any distance between two objects (here, sequences) that is smaller than or equal to the predefined threshold. NHSL starts by assigning the first object to the first cluster. Each of the following objects is then considered in turn and assigned to the same cluster as a previously clustered object if the distance between the two is a link. If no such previously classified object is found, the current object is assigned to a new cluster. If several such previously classified objects are found that belong to distinct clusters, these clusters are joined.
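The procedure described above can be sketched in a few lines of Python. This is an illustration only, not the optsil or blastclust code. The function name nhsl_clusters, the union-find bookkeeping, and the representation of clusters as lists of object indices are assumptions made for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def nhsl_clusters(objects: Sequence[T],
                  distance: Callable[[T, T], float],
                  threshold: float) -> List[List[int]]:
    """Return clusters (lists of object indices) such that any two objects
    joined by a link (distance <= threshold) end up in the same cluster."""
    # Union-find forest: every object starts as the root of its own cluster.
    parent = list(range(len(objects)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit objects in input order and compare each with all earlier ones.
    for j in range(1, len(objects)):
        for i in range(j):
            if distance(objects[i], objects[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # A link to a different cluster joins the two clusters.
                    parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(len(objects)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy objects on a line; the distance is the absolute difference.
    points = [0, 2, 1, 10]
    print(nhsl_clusters(points, lambda a, b: abs(a - b), threshold=1))
```

In the toy example the third object is linked to both the first and the second, which were initially placed in separate clusters, so those two clusters are joined. The fourth object has no link and forms a cluster of its own.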
Obviously, larger threshold values will result in larger but less numerous clusters, whereas small thresholds will lead to numerous small clusters. A biologically sensible threshold is usually not known a priori. However, a particularly well-studied (and monophyletic) taxon (here, O. pilosa) can be used as a standard by determining the lowest possible threshold that results in all sequences obtained from that taxon being assigned to a single cluster. It is easy to implement an algorithm that calculates this value for a predefined group. Wirth et al. (1966, p. 61), who used similarities instead of distances, define "...a similarity value, c, which is the largest fixed linking similarity value for which the cluster is still an interlinked aggregate of specimens". For each of the n objects within the group, the distance to the least distant object that belongs to the same group is determined; the largest of these n values represents the result, analogous to c. This is due to the fact that in NHSL, a single link is sufficient for an object to be assigned to a cluster. This algorithm and NHSL have been implemented in the program optsil (Göker et al. 2009), which is available upon request. The determination of the standard threshold and NHSL relied on uncorrected distances (also called "Hamming" or "p" distances; e.g., Swofford et al. 1996, p. 455), which represent the relative number of differences between two sequences. For downloaded sequences, we relied on the NCBI taxonomy provided in the same files to assign them to genera and species. The optimal threshold calculated for O. pilosa was then applied in NHSL.
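A minimal sketch of the two calculations described above, the uncorrected distance and the standard threshold for a predefined group, might look as follows. It follows the verbal description literally (largest nearest-neighbour distance within the group) and does not separately check that the resulting threshold joins the whole group into one cluster. The function names and the treatment of gaps (columns with a gap in either sequence are skipped) are assumptions of this example and are not taken from optsil.

```python
from typing import Sequence


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Uncorrected ("p") distance: proportion of differing positions
    between two aligned sequences.

    Assumption: columns with a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if gap not in (x, y)]
    if not compared:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in compared) / len(compared)


def standard_threshold(group: Sequence[str]) -> float:
    """For each sequence, find the distance to its least distant group
    member; return the largest of these nearest-neighbour distances."""
    if len(group) < 2:
        raise ValueError("need at least two sequences")
    nearest = []
    for i, s in enumerate(group):
        nearest.append(min(p_distance(s, t)
                           for j, t in enumerate(group) if j != i))
    return max(nearest)


if __name__ == "__main__":
    # Hypothetical aligned sequences standing in for one reference taxon.
    reference = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGGAA"]
    print(standard_threshold(reference))
```

In the hypothetical example, every sequence differs from its nearest neighbour at one of ten positions, so the value returned is one tenth.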
A disadvantage of the NHSL approach is that clustering methods such as single-linkage clustering cannot be considered valid methods of phylogenetic inference, mainly because lower pairwise distances (or higher similarities) do not necessarily indicate a closer phylogenetic relationship (e.g., Felsenstein 2004, pp. 165-167). This issue has led to the widespread avoidance of UPGMA clustering (Sokal and Michener 1958) in phylogenetic studies. However, the difficulties may be tempered or even disappear if non-nested clustering is applied and if species are to be distinguished, because within-species sequence dissimilarities are expected to be rather low. The same rationale applies to the usual arguments against uncorrected distances, because unobserved, superimposed nucleotide substitutions only play a role if distances are large (e.g., Felsenstein 2004, p. 158).
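The last point can be illustrated numerically. The sketch below compares uncorrected distances with the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The Jukes-Cantor model is not used anywhere in this study; it is chosen here only as the simplest standard correction, and the function name is hypothetical.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an uncorrected distance p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Print the corrected distance and its excess over p for a few values.
    for p in (0.01, 0.02, 0.10, 0.30):
        d = jc69_distance(p)
        print(f"p = {p:.2f}  corrected = {d:.4f}  excess = {d - p:.4f}")
```

Under this model the correction is negligible at within-species levels of divergence (roughly 0.0003 at p = 0.02) but substantial for distant sequences (roughly 0.08 at p = 0.30).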
Göker and Grimm (2008) used the well-known Shannon entropy formula (Shannon 1948) to calculate the character data of hosts (plant individuals) from the character data of their associates (cloned sequences obtained from the respective plant individuals). For all sequences belonging to the same individual, the entropy of each alignment column was calculated to represent the amount of genetic divergence within each individual, using the program g2cef designed and implemented by M. Göker (downloadable from http://www.goeker.org/mg/distance/). An alignment of length n will thus result in n corrected entropy values per group. Because the variance in nucleotide characters may depend on the number of sequences, each entropy value was corrected by dividing it by the maximum possible entropy for the given number of associates, which is 0 in the case of a single associate. Accordingly, the corrected entropy values for individuals represented by a single sequence only are undefined.
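The column-wise calculation can be sketched as follows. This is not the g2cef code. The sketch assumes that sequences contain only the four nucleotides (gaps and ambiguity codes are not treated specially), that entropies are measured in bits, and that the maximum possible entropy for k sequences is obtained by spreading them as evenly as possible over the four states. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a vector of state counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def max_entropy(k: int, n_states: int = 4) -> float:
    """Entropy of k sequences spread as evenly as possible over n_states
    states (an assumed definition of the maximum possible entropy)."""
    base, extra = divmod(k, n_states)
    return entropy([base + 1] * extra + [base] * (n_states - extra))


def corrected_entropies(seqs: Sequence[str],
                        n_states: int = 4) -> List[Optional[float]]:
    """One corrected entropy value per alignment column; None where the
    value is undefined because the maximum possible entropy is 0."""
    k = len(seqs)
    h_max = max_entropy(k, n_states) if k else 0.0
    result: List[Optional[float]] = []
    for column in zip(*seqs):
        if h_max == 0.0:
            result.append(None)  # single sequence: division by zero
        else:
            counts = list(Counter(column).values())
            result.append(entropy(counts) / h_max)
    return result


if __name__ == "__main__":
    # Hypothetical clones from one plant individual.
    clones = ["ACGT", "ACGA", "ACTA", "AGTC"]
    print(corrected_entropies(clones))
    print(corrected_entropies(["ACGT"]))  # undefined for a single clone
```

Returning None for undefined values keeps columns aligned across individuals, so that downstream comparisons can skip undefined positions explicitly.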
For each pair of groups, n differences between the n corresponding corrected entropy values can be determined. Subsequently, a non-parametric Wilcoxon signed-rank test or a parametric t-test (or any other appropriate statistical test) can be applied to assess whether these differences deviate significantly from 0 and, thus, whether the two original entropy distributions are significantly different from each other. This procedure is similar to paired-site tests used to assess whether the scores of two phylogenetic trees are significantly different, given a sequence alignment of the same taxa (Felsenstein 2004, p. 364 ff.). Pairwise tests were conducted with R (R Development Core Team 2005) and restricted to those clusters obtained by NHSL that comprised more than five sequences.
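The analyses themselves were run in R. Purely as an illustration of the paired comparison, the following Python sketch computes a Wilcoxon signed-rank statistic for two vectors of corrected entropy values. It makes several simplifying assumptions that are not taken from the study: positions with an undefined value (None) or a zero difference are dropped, tied absolute differences receive average ranks, and the two-sided p-value uses a normal approximation without tie or continuity correction, which is crude for small n. The function name is hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[Optional[float]],
                         y: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Return (W+, approximate two-sided p) for paired values x and y."""
    # Keep only pairs where both values are defined and differ.
    diffs = [a - b for a, b in zip(x, y)
             if a is not None and b is not None and a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("no non-zero differences")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    # W+ is the sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Normal approximation to the null distribution of W+.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p


if __name__ == "__main__":
    # Hypothetical corrected entropy values for two groups.
    group_a = [0.0, 0.5, 0.75, None, 0.4, 0.9]
    group_b = [0.0, 0.2, 0.25, 0.1, 0.6, 0.3]
    print(wilcoxon_signed_rank(group_a, group_b))
```

For real analyses an established implementation, such as wilcox.test in R with paired = TRUE, is preferable because it handles exact p-values and ties properly.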