[Jalview-discuss] pca

Jim Procter jprocter at compbio.dundee.ac.uk
Wed Feb 17 15:20:06 GMT 2010


Dear Hadas, thanks for your email.


On 15/02/2010 14:04, Hadas Ner Gaon wrote:
> Dear members
>
> Can you please help me understand why the BLOSUM score in the PCA 
> analysis is not reciprocal.
>
> These are the output values from a PCA of 4 protein sequences.
>
>          seq1     seq2     seq3     seq4
> seq1  1292.00   930.00   643.00   631.00
> seq2   931.00  1289.00   589.00   633.00
> seq3   622.00   567.00  1338.00   768.00
> seq4   629.00   630.00   785.00  1303.00
>
> Why is the score of seq1-seq3 643 while the seq3-seq1 score is 622?
>

A good question!  In the matrix used for the PCA calculation, each 
element e(i,j) represents the sum of substitution scores for mutating 
the symbols in the i'th sequence into the corresponding symbol in the 
j'th sequence. For proteins, the scores are taken from Jalview's 
blosum62 matrix. The published BLOSUM62 table is symmetric, so the 
asymmetry must come from the scores as Jalview applies them (ie for 
some pairs of symbols the score used for the mutation in one direction 
differs from the score used in the other), and so there are often 
differences between the upper and lower triangles of 
the similarity matrix. It's simplest to consider each triangle as 
representing the 'forward' or 'backwards' mutation cost for each pair of 
sequences in the alignment.
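
To make the asymmetry concrete, here is a minimal Python sketch of the 
e(i,j) sum described above. It is not Jalview's code: the function name, 
the three-letter alphabet and the TOY_SCORES values are all hypothetical 
(they are made-up directional scores, not blosum62 entries), and the 
alignment is assumed to be gap-free.

```python
# Hypothetical directional scores: TOY_SCORES[(a, b)] is the score for
# mutating symbol a into symbol b. Made-up values, NOT blosum62 entries.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "G"): 0,
    ("R", "A"): -2, ("R", "R"): 5, ("R", "G"): -2,
    ("G", "A"): 1, ("G", "R"): -3, ("G", "G"): 6,
}


def pairwise_score_matrix(seqs, scores):
    """Return e, where e[i][j] sums scores[(x, y)] over aligned columns,
    with x taken from sequence i and y from sequence j."""
    n = len(seqs)
    return [
        [sum(scores[(x, y)] for x, y in zip(seqs[i], seqs[j]))
         for j in range(n)]
        for i in range(n)
    ]


# A toy gap-free alignment of three sequences.
alignment = ["ARGA", "AGGR", "RRGA"]
e = pairwise_score_matrix(alignment, TOY_SCORES)
for row in e:
    print(row)
```

Because the toy scores differ by direction, e[0][1] and e[1][0] come out 
different, which is the same effect as the 643 versus 622 in your table.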

As you may be aware, the matrix that I just described differs slightly 
from the one given in the 'SeqSpace' paper cited in the Jalview PCA 
documentation (http://www.jalview.org/help/html/calculations/pca.html). 
In the original paper (Casari, Sander and Valencia 1995: 
http://novacripta.cbm.uam.es/bioweb/courses/MasterBiofis0708/tema03/Casari_NatStructBiol_95.pdf 
), the matrix used for the PCA is called the comparison matrix, and 
is defined as the product of a matrix representation of the alignment, 
F, with its transpose, T(F):

  C = F x T(F)

Here, C is a symmetric n by n matrix (n being the number of sequences), 
because each element of the matrix is the count of identical pairs of 
symbols for the corresponding pair of sequences in the alignment. 
Jalview's slightly different comparison matrix calculation should, in 
theory, reflect favourable mutations between sequences in addition to 
conservation. However, in my limited tests, the resulting PCA plot often 
resembles that produced by the original algorithm's projection, so this 
refinement probably doesn't improve greatly on the SeqSpace approach.
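
For comparison, a small Python sketch of C = F x T(F) follows. It 
assumes F is a simple 0/1 encoding with one slot per (column, symbol) 
pair, which is my reading of "sum of identical pairs" rather than a 
detail checked against the paper; the function names and the toy 
alignment are hypothetical, and gaps are not handled.

```python
def one_hot_rows(seqs, alphabet):
    """Encode each aligned sequence as a flat 0/1 vector with one slot
    per (column, symbol) pair, set to 1 where the sequence has that symbol."""
    index = {sym: k for k, sym in enumerate(alphabet)}
    rows = []
    for seq in seqs:
        row = [0] * (len(seq) * len(alphabet))
        for col, sym in enumerate(seq):
            row[col * len(alphabet) + index[sym]] = 1
        rows.append(row)
    return rows


def times_transpose(f):
    """Return C = F x T(F) as a list of lists."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in f] for ri in f]


alignment = ["ARGA", "AGGR", "RRGA"]
F = one_hot_rows(alignment, "ARG")
C = times_transpose(F)
for row in C:
    print(row)
```

Each C[i][j] here is just the number of columns where sequences i and j 
carry the same symbol, so C[i][j] always equals C[j][i].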

thanks for the question, and happy Jalviewing!
Jim.

ps. this difference between SeqSpace and Jalview is not made clear in 
the documentation. This will be rectified in a future release.

-- 
-------------------------------------------------------------------
J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax: +44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.
