The overall goal of this procedure is to attain an exhaustive characterization of a gene family in plants through a global approach that includes the definition of gene structure, conserved protein motifs, chromosomal location, gene duplications, and expression patterns. This method can help answer key questions in the characterization of plant gene families, such as the whole-genome identification of family members and their transcriptomic profiling through gene expression analysis. The main advantage of the whole procedure is that it provides an exhaustive list of experimental steps to achieve a complete gene family characterization.
This procedure can be applied to any gene family of any plant species for which genetic data are available. Moreover, this approach provides valuable information for identifying interesting candidates for functional studies. Demonstrating the whole procedure will be Pietro Ariani from my laboratory.
Open the BLAST web page and click on the Protein BLAST section. In the Enter Query Sequence field, enter the amino acid sequence of the protein that will be used as the probe to identify the other family members. Use a protein that has all the important features that characterize the family.
Next, in the Choose Search Set field, select the reference protein database and the organism of interest. Now, in the Program Selection field, select the PSI-BLAST algorithm. If desired, click on Algorithm parameters to adjust advanced parameters such as the maximum number of target sequences and the scoring matrix.
For this search, the E-value threshold is decreased from 0.005 to 0.001. Click the BLAST button to run the analysis. From all the sequences displaying relevant matches to the query, deselect the entries which clearly do not belong to the family.
Do so by clicking on the tick in the Select for PSI-BLAST column. Then run a second PSI-BLAST iteration by clicking the Go button. All of the newly identified sequences will be highlighted in yellow.
Again, deselect the hits that do not belong to the family. Continue this process until the algorithm no longer finds any relevant new entry or reaches convergence. Once all the false positives are weeded out, collect the amino acid sequences in a FASTA-formatted file and upload the file into the bioinformatics software suite.
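As a rough illustration of the collection step, the curated hits can be written out in FASTA format with a few lines of Python. This is only a minimal sketch: the function name `to_fasta`, the identifiers, and the short sequences are hypothetical placeholders, not entries from the actual search.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Duplicate identifiers (e.g. the same hit kept in several
    PSI-BLAST iterations) are written only once.
    """
    lines = []
    seen = set()
    for identifier, sequence in records:
        if identifier in seen:
            continue
        seen.add(identifier)
        # Remove whitespace and normalize to upper case.
        seq = "".join(sequence.split()).upper()
        lines.append(f">{identifier}")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical identifiers and sequences, for illustration only.
hits = [
    ("candidate_1", "MSTLLCAVCLE"),
    ("candidate_2", "mkk cpichef"),
    ("candidate_1", "MSTLLCAVCLE"),
]
fasta_text = to_fasta(hits)
print(fasta_text)
```

The resulting text can be saved to a file and imported into whichever alignment tool is used for the next step.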
Next, select all the imported sequences in the list and click the Align/Assemble button in the toolbar. Then click Pairwise/Multiple Alignment, select MUSCLE alignment, and click OK. The alignment will now be made using the defaults. Now visualize the sequence logo of the alignment and inspect the sequences visually to search for false positives.
Further analyses of protein physical parameters and domains, chromosomal distribution, duplications, and exon-intron organization are described in the text protocol. Next, analyze the relationships among the ATL family members through the construction of a high-quality phylogenetic tree and the definition of a family nomenclature. First, retrieve the Arabidopsis thaliana ATL sequences, required as references for the grapevine gene nomenclature, from the UniProt database.
Then make a FASTA file including all the nucleotide sequences of the grapevine and Arabidopsis gene family members for the phylogenetic analysis. Now browse to the Phylogeny.fr home page and select the Phylogeny Analysis pipeline. The One Click mode is suitable in most cases, but if needed, it is possible to select specific settings using the Advanced mode or even to run a fully customized analysis using the A la Carte mode.
Next, input a name for the analysis and upload the FASTA file. Then click Submit to run the analysis. If this procedure results in an error message, complete each step of the Phylogeny.fr pipeline individually.
First, go to the MUSCLE software homepage and upload the FASTA file. Select Pearson FASTA as output format and submit. Next, download the file in FASTA format and eliminate any poorly aligned positions.
Open the Gblocks server tool, upload the alignment FASTA file, select DNA as the type of sequence, and choose the stringency that best fits the analysis. For the grapevine ATL gene family alignment, select all three options proposed for a less stringent selection because of the high sequence divergence. Then click Get Blocks to run the analysis.
To save the results, click on Resulting alignment at the bottom of the output page and make a new FASTA file. Gblocks eliminates poorly aligned positions and divergent regions of a DNA or protein alignment so that it becomes more suitable for phylogenetic analysis. For this step, empirical parameter selection for the column stringency is crucial to ensure the reliability of the tree.
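To give a feel for what alignment curation does, here is a deliberately simplified Python sketch that drops alignment columns containing too many gaps. It is not the Gblocks algorithm, which also considers conservation and the length of contiguous blocks; the function name `drop_gappy_columns`, the `max_gap_fraction` parameter, and the toy alignment are assumptions made for illustration.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    `alignment` maps sequence names to aligned sequences of equal length.
    This is a simplified stand-in for alignment curation, not Gblocks itself.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {
        name: "".join(row[i] for i in keep)
        for name, row in zip(names, rows)
    }


# Toy DNA alignment, for illustration only.
toy = {"geneA": "ATG-CA", "geneB": "ATGGCA", "geneC": "A-G-CA"}
curated = drop_gappy_columns(toy, max_gap_fraction=0.5)
print(curated)
```

Raising `max_gap_fraction` keeps more columns, which loosely mirrors choosing a less stringent selection for a divergent family.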
To complete the tree, return to the Phylogeny.fr home page. There, select the A la Carte pipeline and deselect the following options: multiple alignment and alignment curation. Next, click Create workflow, upload the Gblocks-curated FASTA file, select the bootstrapping procedure, and use default parameters and settings.
Finally, click Submit to run the analysis. This completes the step-by-step process. Now that the phylogenetic tree is made, collapse the poorly supported branches with bootstrap values of less than 70%. Then download the final results in the Newick format for further analysis.
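The collapsing of poorly supported branches can also be sketched in plain Python for readers who prefer to post-process the Newick file themselves. The sketch below assumes a simple Newick string in which internal node labels are numeric bootstrap values and no quoted labels or comments are present; the helper names and the example tree are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (no quotes or comments)."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"label": label, "length": length, "children": children}

    return node()


def collapse_low_support(node, threshold=70.0):
    """Attach the children of weakly supported internal nodes to their parent."""
    new_children = []
    for child in node["children"]:
        collapse_low_support(child, threshold)
        weak = bool(child["children"]) and child["label"] != "" \
            and float(child["label"]) < threshold
        if weak:
            for grandchild in child["children"]:
                # Preserve root-to-tip distances by adding the removed branch.
                if grandchild["length"] is not None and child["length"] is not None:
                    grandchild["length"] += child["length"]
                new_children.append(grandchild)
        else:
            new_children.append(child)
    node["children"] = new_children
    return node


def to_newick(node):
    """Serialize the nested dicts back to Newick (without the final ';')."""
    text = ""
    if node["children"]:
        text = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    text += node["label"]
    if node["length"] is not None:
        text += f":{node['length']:g}"
    return text


# Hypothetical tree: the (C, D) clade has a bootstrap value of 40.
tree = parse_newick("((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.3)40:0.02,E:0.5);")
collapsed = to_newick(collapse_low_support(tree, threshold=70.0)) + ";"
print(collapsed)
```

In this toy tree the clade supported at 40 is dissolved into a polytomy, while the clade supported at 95 is kept.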
The text protocol provides instructions on assigning a gene name based on the phylogeny. For this protocol, make a tab-delimited text file of RMA-normalized expression values taken from a gene expression library. Organize the values for each family gene in rows that follow the same layout.
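A tab-delimited matrix of this kind can be produced with Python's standard `csv` module. The sketch below assumes one row per gene and one column per sample, with a header row of sample names; the function name, gene names, sample names, and values are all hypothetical.

```python
import csv
import io


def write_expression_table(samples, expression, handle):
    """Write a tab-delimited matrix: header of sample names, one row per gene."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene"] + list(samples))
    for gene, values in expression.items():
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values")
        writer.writerow([gene] + [f"{value:.2f}" for value in values])


# Hypothetical samples and expression values, for illustration only.
samples = ["leaf", "berry", "root"]
expression = {
    "gene_1": [120.5, 80.0, 15.25],
    "gene_2": [10.0, 300.75, 42.0],
}
buffer = io.StringIO()
write_expression_table(samples, expression, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Writing to a real file instead only requires passing an open file handle in place of the in-memory buffer.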
Next, perform the hierarchical bi-clustering analysis using the MultiExperiment Viewer (MeV) software. First, upload the working data matrix by selecting the text file. Select single-color array and remove the tick from Load Annotation when an automatic annotation is not provided.
Select the upper-leftmost expression value of the expression table preview and click the Load button. Next, adjust the data by applying log2 transformation and gene/row normalization. Then set the proper scale limit.
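The two data adjustments can be illustrated outside the viewer with a short Python sketch. It assumes that gene/row normalization means centering each row on its mean and scaling by its sample standard deviation; the exact formula used by the software should be checked in its manual. The function names and the example row are hypothetical.

```python
import math
from statistics import mean, stdev


def log2_transform(row):
    """Apply log2 to every value; values must be strictly positive."""
    return [math.log2(value) for value in row]


def normalize_row(row):
    """Center a row on its mean and scale by its sample standard deviation.

    Assumption: this z-score form stands in for gene/row normalization.
    """
    m = mean(row)
    sd = stdev(row)
    if sd == 0:
        return [0.0] * len(row)  # a flat profile carries no pattern
    return [(value - m) / sd for value in row]


# Hypothetical expression values for one gene across three samples.
raw = [2.0, 8.0, 32.0]
adjusted = normalize_row(log2_transform(raw))
print(adjusted)
```

After this adjustment every gene has a comparable scale, so the clustering groups genes by the shape of their profile and not by absolute expression level.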
Now calculate the hierarchical clustering. In the ordering optimization field, select optimize gene leaf order and optimize sample leaf order. Next, in the linkage method selection field, select average linkage clustering, and in the distance metric selection field, choose Pearson correlation.
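For readers who want to see what these settings amount to, the following is a naive Python sketch of average linkage clustering with a Pearson correlation distance, taken here as one minus the correlation coefficient. It does not perform leaf order optimization and is not the viewer's implementation; the function names and gene profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Distance defined as 1 - Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Naive agglomerative clustering; returns the list of merges in order."""
    clusters = [(name,) for name in profiles]
    merges = []

    def cluster_distance(c1, c2):
        # Average linkage: mean of all pairwise distances between members.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(pearson_distance(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical expression profiles across four samples.
profiles = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 4.0, 6.0, 8.5],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
}
merges = average_linkage(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

Genes with similarly shaped profiles merge first at a distance near zero, while anticorrelated profiles join last at a distance near two.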
Then click OK to run the analysis. View the results in the left panel of the window and export the heat map by clicking Save Image in the File menu. The grapevine gene most similar to Arabidopsis thaliana ATL2, according to a BLASTp search, was used to survey the ATL family members in the grapevine genome.
The PSI-BLAST analysis converged after a few cycles, providing a list of putative genes in the gene family. The presence of the canonical RING-H2 domain in each candidate was evaluated by visual inspection of the MUSCLE alignment. This narrowed the search to 96 grapevine genes.
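Visual inspection can be complemented by a quick pattern screen. The sketch below uses a regular expression built on a commonly cited general RING-H2 consensus, C-X2-C-X(9-39)-C-X(1-3)-H-X(2-3)-H-X2-C-X(4-48)-C-X2-C. This spacing is an assumption that was not stated in this protocol and should be checked against the published definition of the family under study; the function name and the synthetic sequences are hypothetical.

```python
import re

# Assumed general RING-H2 consensus (eight metal ligands, His at positions 4 and 5).
RING_H2 = re.compile(
    r"C.{2}C.{9,39}C.{1,3}H.{2,3}H.{2}C.{4,48}C.{2}C"
)


def has_ring_h2(protein_sequence):
    """Return True if the sequence contains a match to the assumed consensus."""
    return RING_H2.search(protein_sequence.upper()) is not None


# Synthetic sequences built only to exercise the pattern; not real proteins.
synthetic_hit = (
    "MA" + "C" + "AA" + "C" + "A" * 12 + "C" + "A" + "H" + "AA" + "H"
    + "AA" + "C" + "A" * 10 + "C" + "AA" + "C" + "AK"
)
synthetic_miss = synthetic_hit.replace("H", "C")  # both His ligands removed

print(has_ring_h2(synthetic_hit), has_ring_h2(synthetic_miss))
```

A screen like this can flag candidates for closer inspection, but it does not replace checking the alignment, since a pattern match alone says nothing about the surrounding context.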
A phylogenetic analysis comparing the identified genes with the Arabidopsis ATL gene family was used to establish a nomenclature. Thirteen of the 96 grapevine ATLs received a specific identifier as an Arabidopsis ATL gene ortholog. Mapping the identified genes to the grapevine global gene expression atlas using a hierarchical bi-clustering analysis showed that all 96 are expressed in at least one tissue and that five expression clusters can be identified.
A similar approach was applied to study the expression of these genes in response to biotic stresses. The results indicated a direct involvement of the ATL gene family in the response to pathogen infection, with 62 genes showing a significant modulation in at least two conditions. After watching this video, you should have a good understanding of how to use the software needed to characterize a gene family.
While attempting this procedure, it's important to carefully define the characteristics of a given gene family and the corresponding probe used for the whole-genome identification of family members. Following this procedure, it is possible to investigate gene family evolution and putative function by building additional phylogenetic trees based on species other than those required for assigning the gene family nomenclature.