The overall goal of this procedure is to use standard spreadsheet software to develop a reference for divergent proteins in a group that lacks coherent criteria for nomenclature and classification. This method can help clarify relationships between related proteins that have confusing or inconsistent nomenclature, such as the cysteine-stabilized alpha-beta superfamily of defensive peptides that includes invertebrate defensins. The main advantage of this technique is that it provides a simple visual representation of the proteins of interest.
To begin, identify the defining characteristics of the protein group of interest. For example, the CS alpha-beta fold in the solution structure of insect defensin A from Phormia terraenovae defines the CS alpha-beta superfamily. This fold also includes a smaller motif called the cysteine-stabilized helix, which is identified by CXXXC upstream of a CXC.
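As a rough illustration of the CXXXC-upstream-of-CXC pattern just described, the sketch below searches a sequence with a regular expression. The function name, the toy sequence, and the choice to allow any number of residues between the two halves of the motif are assumptions made for illustration; they are not part of the published procedure.

```python
import re

# Cysteine-stabilized helix: CXXXC followed downstream by CXC.
# Assumption: at least one residue of any kind separates the two halves.
CSH_PATTERN = re.compile(r"C.{3}C.+?C.C")


def find_csh(sequence):
    """Return (start, end) of the first CSH-like match, or None if absent."""
    match = CSH_PATTERN.search(sequence.upper())
    return match.span() if match else None


# Made-up toy sequence, not a real defensin.
toy = "AACGGGCAAAAACVCAA"
print(find_csh(toy))
```

A match only shows that the cysteine spacing is compatible with the motif; it says nothing about whether the disulfide bonds actually form.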
The four cysteines form two disulfide bonds. To complete the CS alpha-beta motif, a third disulfide bond is formed by an additional pair of cysteines. Ultimately it is critical to know at least some important features related to the structure or function of the protein.
Without this, there is no basis for generating a reference. Now, enter these defining features into a spreadsheet. Use columns for the conserved features and to represent the spaces between these features.
Keep the columns wide enough to fit numbers, and give them a consistent width. In the rows, describe the sequences. To indicate a sequence as a feature, fill in the feature box with color, using the fill function.
Then, to indicate the spacing between the features, enter the number of amino acids in the box between. Now, add representative sequences that have been previously established as members of the group based on structural databases and published results. As needed, add features that are likely to define a subgroup of sequences, such as an extra cysteine.
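The row layout described above, with feature cells alternating with counts of intervening amino acids, could be generated along these lines. This is only a sketch: the function name is hypothetical, the single letter C stands in for a color-filled feature cell, and the toy sequence is invented.

```python
def spacing_row(sequence, feature="C"):
    """Build one spreadsheet-style row: spacing counts alternating with features."""
    row = []
    gap = 0
    for residue in sequence.upper():
        if residue == feature:
            row.append(gap)      # residues since the previous feature
            row.append(feature)  # stands in for a color-filled cell
            gap = 0
        else:
            gap += 1
    row.append(gap)  # residues after the last feature
    return row


# Made-up toy sequence, not a real defensin.
toy_row = spacing_row("AACGGGCAAAAACVCAA")
print(toy_row)
```

Each returned list could be pasted into the spreadsheet as one row, with the feature cells then filled with color by hand.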
If features are missing from a given sequence, leave the box unfilled and combine it with boxes representing intervening amino acids. To do this, use the Merge and Center function. Once representative sequences have been entered, identify groups of clearly related sequences.
Then, summarize the characteristics of these groups. When the number of amino acids between features varies, use a hyphen to indicate a range or slashes to indicate a few specific numbers. As needed, creatively annotate features that may be relevant but aren't common enough to include in the reference.
For example, since cysteines are important in the superfamily, the presence of additional cysteines can be labeled. Sometimes, when adding newly identified sequences to the spreadsheet, the sequences of one species fall into several different groups of the superfamily, such as for tardigrades. Once completed, the rows of the spreadsheet can be sorted to highlight variation within a species, or variation between taxonomic groups.
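The hyphen and slash notation for variable spacing, and the sorting of rows by taxonomic group, can be sketched as follows. The cutoff of three distinct values before switching from slashes to a range is an assumption, as are the function name and the placeholder rows, which are not real data.

```python
def summarize_spacing(counts, max_listed=3):
    """Collapse the spacings seen in a group into one cell label.

    One value -> "5"; a few distinct values -> "3/5"; more -> "3-9".
    The max_listed cutoff is an assumption, not part of the protocol.
    """
    distinct = sorted(set(counts))
    if len(distinct) == 1:
        return str(distinct[0])
    if len(distinct) <= max_listed:
        return "/".join(str(n) for n in distinct)
    return f"{distinct[0]}-{distinct[-1]}"


# Hypothetical rows: (taxonomic group, species, number of cysteines).
rows = [
    ("Tardigrada", "species B", 8),
    ("Insecta", "species A", 6),
    ("Tardigrada", "species B", 6),
]

# Sort by group, then species, then cysteine count.
by_taxon = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
```

Sorting on a different key, for example the cysteine count first, would instead group rows by pattern across taxa.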
This demonstration uses the freely available MEGA 6 software. However, other software can be used similarly. To start a new alignment, select Edit/Build Alignment under the Align tab.
Then select Create a new alignment in the box that appears and click OK. Then select Protein. Now, select Insert Sequence from File in the Edit menu to import the sequences. The sequences must be in FASTA format.
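For readers unfamiliar with FASTA, each record is a header line starting with a greater-than sign, followed by one or more lines of sequence. The minimal parser below is a sketch for checking a file before import; the function name and the toy records are made up, and it does not validate residue letters.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up toy records, not real peptides.
example = """>toy_peptide_1
AACGGGCAAAAA
CVCAA
>toy_peptide_2
GCTTTCAAACIC
"""
records = parse_fasta(example)
```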
Background colors reflecting different amino acid types are shown by default and can be turned off with a toggle under the Display menu. Once all the sequences are entered, click the flexing arm icon and then Align Protein to align the sequences using the MUSCLE algorithm. If the message "Nothing selected for alignment. Select all?" pops up, choose OK.
Some parameters can be changed in the pop-up window, but for this demonstration, the defaults will suffice. Now, check the alignment based on the important features of the protein superfamily.
The top bar shows an asterisk in any position in which the amino acid is completely conserved. The initial alignment identifies three of the four conserved cysteines. One sequence is clearly misaligned.
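The asterisk bar can be reproduced outside the alignment software to double-check which columns are completely conserved. The sketch below assumes gaps are written as hyphens and that a column containing a gap does not count as conserved; the function name and the toy alignment are invented for illustration.

```python
def conservation_line(aligned):
    """Return '*' for fully conserved columns and ' ' for all others."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # Assumption: a column of gaps is not treated as conserved.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Made-up toy alignment, not real sequences.
aligned = [
    "ACGG-C",
    "TCAAKC",
    "SC-GRC",
]
marks = conservation_line(aligned)
```

If an expected cysteine column lacks an asterisk, that is a hint that one or more sequences are misaligned at that position.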
To fix a misaligned sequence, highlight the dashes and press the Delete key, taking care not to delete any amino acids. Then, move the amino acids into their proper alignment by adding spaces. After aligning the sequence manually, note that the last cysteine of the CXXXC motif is conserved throughout the alignment.
Manual adjustment is often necessary to prioritize the most important features of the sequences. The spreadsheet alignment of the previously established CS alpha-beta superfamily revealed five basic patterns of bond formation. The newly identified tardigrade sequences fell into the complete spectrum of these patterns.
Next, phylogenetic analyses were used to examine how this group of proteins may have evolved. However, the sequences are generally short and highly divergent. Thus, the resulting trees were poorly resolved and offered little insight.
The multiple sequence alignment was optimized using the reference, but there was still poor resolution in the maximum likelihood analysis and in the Bayesian phylogenetic analysis. Most clades had only low levels of support. However, five small groups were supported in at least one of the two trees.
Sequences with different cysteine numbers within a taxonomical group may be more closely related than sequences with the same pattern from different groups. After watching this video, you should have a good understanding of how to use spreadsheet software to generate a simple visualization of important protein characteristics. While attempting this procedure, it's important to remember that determining the most relevant characteristics for the group of proteins is often an iterative process, and revision of the reference will likely be necessary.
New sequences can always be added to a spreadsheet, and it is easy to have multiple versions allowing for new features to be added that may be useful for classification and analysis. Although the spreadsheet alignment allows easy visualization of structural and functional features, other methods like phylogenetic analysis can be performed to provide additional insight into evolutionary relationships.