We develop a novel method for calculating and analyzing correlations in mutational behavior between different positions in a multiple sequence alignment. The interdependence between the residues of a protein family is represented as a matrix of correlation values that is invariant with respect to the specific amino acids, the number of sequences representing the family, the length of the sequences, residue variability, and the uniformity of data set representation. Based on the geometry of the correlation matrices, we reveal common and distinguishing properties of a few protein families, including immunoglobulins. We analyze the specific texture of these matrices, which is inherent to each family, and suggest a way to distinguish protein sequences from non-protein sets of sequences.
The role of the correlation matrix technique in classification is
discussed. We suggest that classification criteria should be based on
the residues at the positions with the highest overall correlation with
the other positions. Identifying positions of differing correlation
strength helps to reconstruct the phylogeny of protein families.
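
As a rough illustration of the idea of a position-by-position correlation matrix, the sketch below scores every pair of alignment columns and ranks positions by their overall correlation with the others. The abstract does not state which correlation measure the report uses, so this sketch substitutes mutual information normalized by joint entropy; that choice is an assumption and is not the authors' method. It is unaffected by relabeling the amino acids, but it is not claimed to satisfy the other invariances listed above. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_mi(col_a, col_b):
    """Mutual information of two columns divided by their joint entropy.

    ASSUMPTION: a stand-in measure, not the one defined in the report.
    Returns a value in [0, 1]; 0.0 when both columns are fully conserved.
    """
    joint = entropy(list(zip(col_a, col_b)))
    if joint == 0.0:
        return 0.0
    mi = entropy(col_a) + entropy(col_b) - joint
    # Clamp to absorb floating-point rounding error.
    return min(1.0, max(0.0, mi / joint))


def correlation_matrix(alignment):
    """Symmetric matrix of pairwise scores; the diagonal is left at 0.0."""
    columns = list(zip(*alignment))  # transpose rows (sequences) into columns
    n = len(columns)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = normalized_mi(columns[i], columns[j])
    return matrix


def rank_positions(matrix):
    """Position indices sorted by summed correlation with all other positions."""
    totals = [sum(row) for row in matrix]
    return sorted(range(len(matrix)), key=lambda i: totals[i], reverse=True)


# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 varies independently of them, column 3 is fully conserved.
toy_alignment = ["ACDA", "ACEA", "GTDA", "GTEA"]
toy_matrix = correlation_matrix(toy_alignment)
toy_ranking = rank_positions(toy_matrix)
```

In this toy case the two coupled columns receive the highest overall score, which mirrors the suggestion that classification criteria be drawn from the most strongly correlated positions. Real alignments would also need gap handling and sequence weighting, which this sketch omits.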
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2001/2001-08.ps.gz