Facilitating analysis of immunological data with visual analytic techniques. While the capacity to collect and store data has advanced rapidly, the ability to process and analyze it has, in comparison, made little progress. As a result, biomedical laboratories often hold large data sets that are not analyzed effectively or efficiently.
With that, potentially rich and powerful information is lost in the abyss of storage systems. Visual analytics, or VA, has emerged as a new way to analyze large, complex data sets. VA techniques are based on visualizations, which allow analysts to use their visual intelligence to spot patterns in data, such as general trends or outliers.
These quick visualizations allow for rapid formation of hypotheses while exploring data. The flexibility of VA tools allows the analyst to zoom in, drill down, and build connections across multiple data sets while exploring their relationships. Through the application of VA to integrated data sources, the user can reveal new and important findings.
Paired analysis is one VA approach, in which a VA tool expert and a technical expert, also known as a domain expert, work together. The domain expert asks biologically relevant questions about the data. The VA tool expert then creates visualizations that may help reveal patterns that answer these questions or lead to further exploration. This process can be iterated to build different visualizations that provide insight.
We set out to test the suitability of a paired analysis VA approach for a large, complex biomedical data set. In preliminary pilot experiments, we evaluated several of the existing VA tools for the current problem. We chose Tableau, by Tableau Software, as the tool most suitable for the task at hand.
The selection criteria in these pilot experiments were based on subjective parameters, such as user friendliness and overall usability, as well as objective technical features, such as the range of interaction techniques and visualization features. We have here a data set in a Microsoft Excel spreadsheet, typical of a laboratory working in the field of infectious diseases. This set contains a subject identifier; data on variation in genetic DNA sequences, in this case NFKBIA single nucleotide polymorphisms, or SNPs, for the subject; and the observed concentration of several biological molecules, in this case cytokines produced by immune cells of the subject after stimulation of the immune cells with specific stimuli.
We'll now scroll down the spreadsheet to give you a sense of the volume of this dataset. We are interested in finding out if there's a general relationship between the genotype, that is, the different SNPs of, in this case, the NFKBIA gene, and the cytokine response observed after stimulation.
We will now connect the dataset with Tableau, making sure that we import the NFKBIA table. You can see on the left side that Tableau is connected to the correct table and has automatically separated the column variables into what Tableau calls dimensions and measures. Dimensions are simply the columns that categorize the data, and measures are the columns that hold quantitative values.
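For viewers who want to see the dimension and measure split outside Tableau, here is a minimal Python sketch. It is not Tableau's actual logic: the column names and values are hypothetical, and we simply assume that a column is a measure when every one of its values parses as a number.

```python
# Hypothetical rows mimicking the spreadsheet; column names are assumptions.
rows = [
    {"subject": "S01", "genotype": "GG", "marker": "TNF alpha", "concentration": "12.5"},
    {"subject": "S02", "genotype": "AG", "marker": "TNF alpha", "concentration": "30.1"},
]


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_columns(rows):
    """Classify columns: all-numeric columns are measures, the rest are dimensions."""
    dimensions, measures = [], []
    for column in rows[0]:
        if all(is_number(row[column]) for row in rows):
            measures.append(column)
        else:
            dimensions.append(column)
    return dimensions, measures


dimensions, measures = split_columns(rows)
print("dimensions:", dimensions)
print("measures:", measures)
```

A real tool also has to handle dates, identifiers stored as numbers, and empty cells, which this sketch ignores.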
For this visualization, we'll now plot stimulus concentration levels against observed cytokine response concentration. We now average the values of the cytokine concentration levels. The order of the concentration levels is wrong, but it is quite easy to quickly re-sort this.
Then we can switch the view to fit the screen and allow easier visualization of the data. Since we want to investigate how to differentiate between the different genotypes, all we have to do is drop the genotype dimension into this color section. The visualization automatically and immediately separates based on genotype.
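The averaging and color-by-genotype steps amount to grouping the data by concentration level and genotype and taking the mean of each group. A rough Python sketch of that idea follows, using made-up records; the level names, genotypes, and values are assumptions, not data from our spreadsheet.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (stimulus level, genotype, cytokine concentration).
records = [
    ("low", "GG", 10.0),
    ("low", "GG", 14.0),
    ("low", "AG", 20.0),
    ("high", "GG", 40.0),
    ("high", "AG", 50.0),
    ("high", "AG", 70.0),
]


def average_by_level_and_genotype(records):
    """Average the cytokine concentration for each (level, genotype) pair."""
    groups = defaultdict(list)
    for level, genotype, value in records:
        groups[(level, genotype)].append(value)
    return {key: mean(values) for key, values in groups.items()}


averages = average_by_level_and_genotype(records)
for (level, genotype), value in sorted(averages.items()):
    print(level, genotype, value)
```

Each (level, genotype) pair in the result corresponds to one colored bar in the view.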
Now we can try different display formats. For example, a line graph might better reveal a pattern we want to capture. There are obviously many other options.
The biologist in this paired analysis suggests that we start off by exploring the production of one of the cytokine markers, called TNF alpha, after stimulation with a reagent called 3M-002. To do so, we need to filter the marker dimension to TNF alpha and the stimulus dimension to 3M-002. To make the filtering process more flexible, we can choose the show quick filter option for both the marker and stimulus dimensions, making sure that each is a single value list.
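Conceptually, a single value filter keeps only the rows whose marker and stimulus match the one selected value for each. The sketch below shows that idea in Python under stated assumptions: the rows, field names, and the second marker are hypothetical, and `quick_filter` is our own illustrative helper, not a Tableau function.

```python
# Hypothetical rows; field names and values are assumptions for illustration.
rows = [
    {"marker": "TNF alpha", "stimulus": "3M-002", "genotype": "GG", "value": 40.0},
    {"marker": "TNF alpha", "stimulus": "LPS", "genotype": "GG", "value": 25.0},
    {"marker": "other cytokine", "stimulus": "3M-002", "genotype": "AG", "value": 15.0},
]


def quick_filter(rows, **selected):
    """Keep rows whose fields equal every selected single value."""
    return [
        row for row in rows
        if all(row[field] == value for field, value in selected.items())
    ]


subset = quick_filter(rows, marker="TNF alpha", stimulus="3M-002")
print(subset)
```

Changing either keyword argument mirrors picking a different value in the corresponding filter list.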
This visualization clearly shows a difference in TNF alpha production after different levels of 3M-002 stimulation, separated by genotype in different colors. We can choose any other combination of marker and stimulus filter values, and the visualization will change accordingly. Similar to Excel, we can build different visualizations in separate tabs. For presentation purposes, we can also generate a summary view of multiple analyses.
In this case, we've investigated the production of TNF alpha across several subjects with different NFKBIA SNP genotypes. In this demonstration, we successfully produced a series of powerful visualizations in about a minute and 30 seconds using a paired analysis VA approach. A similar set of visualizations typically requires a biomedical researcher 30 minutes to generate in Excel.
The previous example was a simple two-dimensional analysis. The true power of VA is the ability to visualize multiple dimensions at the same time. For example, Tableau supports analysis between data sets through logical joins on key values.
Here are two spreadsheets placed in the same workbook. The first data set is the one from the previous demonstration, and the other is a data set of cells analyzed by a technique called flow cytometry for the production of multiple cytokines in the same cell at the same time, a measure called polyfunctionality degree, or PFD. You can name the sheets so it's easier to identify them during the import stage.
This allows Tableau to connect the two spreadsheets. After choosing the multiple tables option, you can use the add new table feature to join the two tables. This feature adds the second spreadsheet to the first and uses join statements to combine the data sets on identical keys, such as cell type, concentration level, stage, group, stimulus, and subject identifier.
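The join itself can be pictured as matching rows from the two sheets whenever all of their key columns are identical. Here is a small Python sketch of that idea. The sheets, key names, and values are hypothetical, and we assume an inner join for illustration; the join type Tableau applies depends on how the connection is configured.

```python
# Hypothetical key columns shared by both sheets (names are assumptions).
KEYS = ("subject", "stimulus", "level")

cytokine_sheet = [
    {"subject": "S01", "stimulus": "LPS", "level": "low", "genotype": "GG", "cytokine": 12.0},
    {"subject": "S02", "stimulus": "LPS", "level": "low", "genotype": "AG", "cytokine": 20.0},
]
flow_sheet = [
    {"subject": "S01", "stimulus": "LPS", "level": "low", "pfd2": 4.0, "pfd3": 1.0},
]


def inner_join(left, right, keys):
    """Combine rows from both tables whose key columns are identical."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


joined = inner_join(cytokine_sheet, flow_sheet, KEYS)
print(joined)
```

Subject S02 has no matching flow cytometry row in this toy data, so it drops out of the joined result, while non-key columns such as genotype carry over from the first sheet.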
Notice that the dimensions are separated by spreadsheet name. This allows us to use the dimensions that were not part of the logical join statement. The definition of polyfunctionality, for example, is the percentage of cells that produce more than one cytokine.
For example, a cell that makes two cytokines has a PFD of two, and a cell making three cytokines has a PFD of three. Here we create one calculated field to combine these values into one measure that we can use in a visual display. Now we can start building the visualization.
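As an illustration only, a calculated field of this kind can be thought of as a small formula applied to every row. The Python sketch below assumes the combination is a simple sum of the percentages of cells with a PFD of two and a PFD of three; the column names, the values, and the choice of a sum are all our assumptions, not the exact formula used in the workbook.

```python
# Hypothetical per-sample percentages of cells with each PFD (assumed columns).
samples = [
    {"subject": "S01", "pfd2_percent": 4.0, "pfd3_percent": 1.5},
    {"subject": "S02", "pfd2_percent": 2.0, "pfd3_percent": 0.5},
]


def polyfunctional_percent(sample):
    """Assumed calculated field: sum of the percentages for PFD 2 and PFD 3."""
    return sample["pfd2_percent"] + sample["pfd3_percent"]


for sample in samples:
    sample["polyfunctional_percent"] = polyfunctional_percent(sample)
    print(sample["subject"], sample["polyfunctional_percent"])
```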
First, we plot concentration levels against PFDs over two and, as in the last demo, take the average value of PFDs greater than two. We also arrange the concentration labels from low to high by setting the order manually. Since genotype information is only available for some subjects in this group, we need to filter out the rows of data that don't contain genotype information.
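These two housekeeping steps, ordering the labels manually and dropping rows without a genotype, can be sketched in Python as follows. The level names, rows, and the use of `None` for a missing genotype are assumptions made for the example.

```python
# Assumed manual ordering of concentration labels, low to high.
LEVEL_ORDER = ["low", "medium", "high"]

# Hypothetical joined rows; None marks a subject without genotype data.
rows = [
    {"level": "high", "genotype": "GG", "pfd": 6.0},
    {"level": "low", "genotype": None, "pfd": 2.0},
    {"level": "medium", "genotype": "AG", "pfd": 3.0},
    {"level": "low", "genotype": "GG", "pfd": 1.0},
]


def prepare(rows):
    """Drop rows lacking a genotype, then sort by the manual level order."""
    with_genotype = [row for row in rows if row["genotype"] is not None]
    return sorted(with_genotype, key=lambda row: LEVEL_ORDER.index(row["level"]))


prepared = prepare(rows)
print([(row["level"], row["genotype"]) for row in prepared])
```

An explicit order list is needed because sorting the labels alphabetically would put "high" before "low".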
Just as before, we can quickly drop the genotype into the color label, allowing us to differentiate each genotype as well. Then we can switch the view to fit the screen and allow easier visualization of the data. We can also change the bar graph to, for example, a line graph.
This provides a good sense of how the cytokine response and the PFD response vary according to the patterns specific for each genotype. You immediately notice that the NFKBIA SNP with the GG genotype has a different response pattern compared to the other genotypes. We can explore this further by investigating the impact of different stimuli on this pattern.
Note that after adding LPS in the stimulus dimension, the three main genotypes display a similar PFD level at all concentrations, but with the 3M-002 stimulus, only the GG genotype shows a sharp change in PFD from low to high concentration of stimulus. This finding allows us to generate a hypothesis to test in future experiments, namely that the type of stimulus impacts PFD. In the last two demonstrations, we saw the rapid generation of visualizations to detect potentially meaningful patterns both within and between data sets.
The power of visual analytics can be rapidly extended to large data sets, scaling up the dimensions of analysis depending on the application and integrating information across vast data sets, for example, the many data silos generated in cohort studies. VA is a highly transferable approach that can potentially be applied to any domain with large amounts of many different types of data, including categorical and numerical data sets. The VA approach offers two main advantages.
One, flexible hypothesis generation. The user can generate hypotheses about the data on the spot, derived from current findings, and rapidly create new visualizations that explore each hypothesis. Two, time saving. The usability and efficiency of VA tools are their main advantage over traditional information visualization tools.
Graphing with traditional methods may take several working days to complete what is readily accomplished in two to three hours on a VA platform such as Tableau. Clearly, there are, and likely will be, other application platforms, each with specific advantages and disadvantages. Approaching this task with paired analysis clearly adds to the overall benefit of a VA-based approach to the analysis of complex multidimensional data.