Scientific data has grown increasingly complex and rich over the last couple of decades, yet scientists continue to use methods of organization that no longer meet their expanding data needs. The main advantage of the technique described in this video is that it provides a rigorous data pipeline and storage structure while maintaining flexibility for data analysis. To begin evaluating the dataset of interest, download the example code and databases shown in this table.
Next, use this graphical representation of a multidimensional database to evaluate whether the dataset of interest is indeed multidimensional. The data needs to meet two conditions to benefit from this database organization. First, the data must be able to be visualized in a multidimensional form.
And second, greater scientific insight must be gained by relating a specific experimental outcome to any of those dimensions. Relational databases store information in the form of tables, organized in rows and columns, which can be used to link identifying information within the database. Multidimensionality is handled by relating different fields, such as a table's columns and the individual tables themselves, to each other.
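As a rough illustration of this structure, a minimal sketch in Python is shown below; the table and column names (CellLineID, MutationStatus, Value) are hypothetical examples, not the schema used in this protocol.

```python
import pandas as pd

# A minimal sketch of two related tables; field names are hypothetical.
cell_lines = pd.DataFrame({
    "CellLineID": ["CL01", "CL02"],
    "MutationStatus": ["positive", "negative"],
    "Family": ["A", "B"],
})

data_values = pd.DataFrame({
    "EntryID": [1, 2, 3],
    "CellLineID": ["CL01", "CL01", "CL02"],  # key shared with cell_lines
    "Value": [0.81, 0.63, 0.42],
})

# Relating the tables through the shared CellLineID column links every
# measurement to the identifying information of its cell line.
linked = data_values.merge(cell_lines, on="CellLineID")
print(linked)
```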
First, organize the data files so they have well-thought-out, unique names. Good practice with file-naming conventions and folder/subfolder structures allows for broad database scalability without compromising the readability of accessing files manually. Date files in a consistent format and name subfolders according to their metadata.
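One way such a convention might be encoded is sketched below; the folder layout and metadata fields (cell line, pattern type, acquisition date, sample number) are hypothetical examples of the principle, not the naming scheme prescribed by this protocol.

```python
from pathlib import Path

# A minimal sketch of a metadata-driven naming convention; the folder layout
# and field names are assumptions made for illustration only.
def build_data_path(root, cell_line, pattern_type, acq_date, sample_id):
    """Return a unique, human-readable path such as
    Data/CL01/isotropic/2020-01-15_CL01_isotropic_s03.csv"""
    filename = f"{acq_date}_{cell_line}_{pattern_type}_s{sample_id:02d}.csv"
    return Path(root) / cell_line / pattern_type / filename

print(build_data_path("Data", "CL01", "isotropic", "2020-01-15", 3))
```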
As the database structure is designed, draw out the relationships between the fields in different tables. Create README documentation that describes the database and the relationships that were created. It can be either graphical, like this figure, or text-based.
Once an entry is linked between different tables, all of the associated information is related to that entry and can be used in complex queries to filter down to the desired information. Make the end result similar to this example, where the differing characteristics of individuals are related to those individuals' associated experimental data. The same was done by relating the pattern type and data type columns to matching entries in the main DataValues table to explain the various shorthand notations.
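To make the idea of such a query concrete, a sketch in Python follows; the file names (CellLines.csv, PatternTypes.csv, DataValues.csv) and column names are assumptions used for illustration, not the exact files supplied with this article.

```python
import pandas as pd

# A sketch of a "complex query" across linked tables, using hypothetical
# exported CSV files and column names.
cell_lines = pd.read_csv("CellLines.csv")
pattern_types = pd.read_csv("PatternTypes.csv")
data_values = pd.read_csv("DataValues.csv")

# Join the main table to its helper tables, then filter down to the entries
# of interest (e.g., anisotropic patterns from mutation-positive lines).
query = (
    data_values
    .merge(cell_lines, on="CellLineID")
    .merge(pattern_types, on="PatternID")
)
subset = query[(query["MutationStatus"] == "positive")
               & (query["PatternName"] == "anisotropic")]
print(subset[["EntryID", "CellLineID", "PatternName", "Value"]])
```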
Identify all the various experiments and data analysis methods that might lead to data collection, along with the normal data storage practices for each data type. Work with version control software such as Git, hosted on a platform like GitHub, to ensure the necessary consistency and version control while minimizing user burden. Make sure to create a procedure for consistent naming and storing of data to allow for an automated pipeline.
Use any convenient programming language to generate new data entries for the database. Create small helper tables in separate files that can guide automated selection of data. These files serve as a template of possibilities for the pipeline to operate under and are easy to edit.
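A minimal sketch of how helper tables might guide automated selection is shown here; the file names (DataTypes.csv) and columns are hypothetical, and the supplemental code provided with this article may handle the selection differently.

```python
import pandas as pd

# Helper tables stored as small, easily edited CSV files; names are assumed.
data_types = pd.read_csv("DataTypes.csv")  # e.g., columns: DataTypeID, Name

# Present only the allowed options so new entries use valid values.
print("Available data types:")
for i, name in enumerate(data_types["Name"]):
    print(f"  [{i}] {name}")

choice = int(input("Select a data type: "))
selected_data_type = data_types.iloc[choice]
print("Selected:", selected_data_type["Name"])
```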
To generate new data entries for the data pipeline, write code similar to the example shown here, which is provided in the supplemental files with this article. This allows the helper tables to serve as the inputs from which the user selects. From here, assemble a new spreadsheet of file locations by combining the new entries with the previous entries.
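The supplemental code may be implemented differently; as a rough illustration of combining new entries with the existing list, a Python sketch with hypothetical file and column names could look like this:

```python
import pandas as pd

# Existing spreadsheet of file locations; file name is an assumption.
previous = pd.read_csv("FileLocations.csv")

# A new entry built from the user's helper-table selections (hypothetical values).
new_entries = pd.DataFrame({
    "CellLineID": ["CL01"],
    "PatternID": ["aniso"],
    "DataTypeID": ["actin_oop"],
    "FilePath": ["Data/CL01/anisotropic/2020-01-15_CL01_anisotropic_s03.csv"],
})

# Append the new entries to the previous ones and save the merged spreadsheet.
merged = pd.concat([previous, new_entries], ignore_index=True)
merged.to_csv("FileLocations.csv", index=False)
```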
The code shown here and provided in the supplemental files can be used to automate this process. Afterwards, check the merged spreadsheet for duplicates using the code shown here to automate this step. Additionally, check the spreadsheet for errors using an automated method that notifies the user of the reason for and the location of each error.
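The supplemental code is the reference for these checks; purely as an illustration of the idea, a sketch in Python (with assumed file and column names) might flag duplicates and common errors like this:

```python
import os
import pandas as pd

# Automated duplicate and error checks on the merged spreadsheet;
# file and column names are hypothetical.
merged = pd.read_csv("FileLocations.csv")

# Report exact duplicate rows and where they occur.
duplicates = merged[merged.duplicated(keep="first")]
for idx in duplicates.index:
    print(f"Row {idx}: duplicate entry, consider removing")

# Report rows with missing required fields or file paths that do not exist.
required = ["CellLineID", "PatternID", "DataTypeID", "FilePath"]
for idx, row in merged.iterrows():
    for field in required:
        if pd.isna(row[field]):
            print(f"Row {idx}: missing value in '{field}'")
    if pd.notna(row["FilePath"]) and not os.path.exists(row["FilePath"]):
        print(f"Row {idx}: file not found at {row['FilePath']}")
```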
Furthermore, you can write code that checks the compiled database and identifies any missing or bad data points. Manually remove the bad points without losing the integrity of the database, using code similar to what is shown here. Repeat these steps to add more data points.
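As a sketch of removing flagged points while keeping the rest of the database intact, the following Python example drops specific rows and keeps a record of them; the row indices and file names are hypothetical, and the supplemental code may take a different approach.

```python
import pandas as pd

# Remove user-identified bad entries without disturbing the remaining rows.
merged = pd.read_csv("FileLocations.csv")

bad_rows = [4, 17]  # hypothetical indices flagged as bad or missing during review

# Keep a record of what was removed, then drop the rows and reindex so the
# remaining entries stay consistent for the downstream pipeline.
merged.loc[bad_rows].to_csv("RemovedEntries.csv", index=False)
merged = merged.drop(index=bad_rows).reset_index(drop=True)
merged.to_csv("FileLocations.csv", index=False)
```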
Then use the file locations to generate a data value spreadsheet. Also, create an updated list of entries that can be accessed to identify file locations or be merged with future entries.
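A rough sketch of building the data value spreadsheet from the file locations is shown below; extract_value() is a hypothetical stand-in for whatever analysis produces each measurement, and all file and column names are assumptions.

```python
import pandas as pd

# Build DataValues.csv from the list of file locations; names are hypothetical.
def extract_value(path):
    # Placeholder: read a per-sample results file and return its measurement.
    return pd.read_csv(path)["OOP"].iloc[0]

locations = pd.read_csv("FileLocations.csv")

rows = []
for _, entry in locations.iterrows():
    rows.append({
        "CellLineID": entry["CellLineID"],
        "PatternID": entry["PatternID"],
        "DataTypeID": entry["DataTypeID"],
        "Value": extract_value(entry["FilePath"]),
    })

pd.DataFrame(rows).to_csv("DataValues.csv", index=False)
```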
To begin database creation, first create a blank database document and load the helper tables for the cell lines, data types, and pattern types. Go to the External Data menu, select Text File import, click on Browse, and then select the desired file. In the Import Wizard, select Delimited and hit Next. Select First Row Contains Field Names and Comma for the delimiter type.
After clicking on Next, select the default field options and then select No primary key. Click on Next and then Finish. Next, load the data type and pattern type tables by repeating these same steps.
Next, load the data value table. Go to the External Data menu, select Text File import, click on Browse, and then select the desired file. In the Import Wizard, select Delimited and hit Next.
Select First Row Contains Field Names and Comma for the delimiter type. After clicking on Next, select the default field options and then select Let Access add primary key. Click on Next and then Finish.
Now create the relationships by selecting Database Tools, going to Relationships, and dragging all of the tables onto the board. Then go to Edit Relationships and select Create New. Select the table and column names, and then click on the Join Type that points to the helper tables.
After each desired relationship is set up, go to Create, select Query Design, and select or drag all relevant tables into the top window. In this example, the cell lines, data values, data types, and pattern types tables are shown. The relationships should be set up automatically based on the previous relationship design.
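For readers more comfortable with code, the kind of joined, aggregating query that the graphical Query Design builds can be sketched in SQL; the example below runs against a small in-memory SQLite database with assumed table and column names, and is only an illustration of the concept, not the Access schema used in this protocol.

```python
import sqlite3

# Build a tiny in-memory database with hypothetical tables and values.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CellLines (CellLineID TEXT, MutationStatus TEXT);
CREATE TABLE PatternTypes (PatternID TEXT, PatternName TEXT);
CREATE TABLE DataTypes (DataTypeID TEXT, DataTypeName TEXT);
CREATE TABLE DataValues (CellLineID TEXT, PatternID TEXT, DataTypeID TEXT, Value REAL);

INSERT INTO CellLines VALUES ('CL01', 'positive'), ('CL02', 'negative');
INSERT INTO PatternTypes VALUES ('iso', 'isotropic'), ('aniso', 'anisotropic');
INSERT INTO DataTypes VALUES ('oop', 'actin OOP');
INSERT INTO DataValues VALUES
  ('CL01', 'aniso', 'oop', 0.81), ('CL01', 'iso', 'oop', 0.35),
  ('CL02', 'aniso', 'oop', 0.88), ('CL02', 'iso', 'oop', 0.31);
""")

# A joined query with totals, analogous to the graphical Query Design.
query = """
SELECT c.MutationStatus, p.PatternName, AVG(v.Value) AS MeanValue, COUNT(*) AS N
FROM DataValues AS v
JOIN CellLines AS c ON v.CellLineID = c.CellLineID
JOIN PatternTypes AS p ON v.PatternID = p.PatternID
JOIN DataTypes AS d ON v.DataTypeID = d.DataTypeID
WHERE d.DataTypeName = 'actin OOP'
GROUP BY c.MutationStatus, p.PatternName;
"""
for row in con.execute(query):
    print(row)
```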
Now, fill out the query columns for the desired results. For this dataset, go to Show and select Totals. Fill out the first column, the second column, and the third column as shown here.
Fill out the fourth column, the fifth column, and the sixth column as well. When finished filling out the columns, save and run the query. For this sample experimental data, use a one-way analysis of variance with Tukey's test for mean comparisons between the various conditions.
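As a sketch of that statistical step, assuming the query results were exported to a CSV with hypothetical 'Condition' and 'OOP' columns, the analysis could be run in Python as follows; the actual analysis accompanying this protocol may be implemented differently.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Query results exported from the database; file and column names are assumed.
results = pd.read_csv("QueryResults.csv")

# One-way ANOVA across the conditions.
groups = [g["OOP"].values for _, g in results.groupby("Condition")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"One-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

# Tukey's HSD for pairwise mean comparisons between conditions.
tukey = pairwise_tukeyhsd(endog=results["OOP"], groups=results["Condition"])
print(tukey.summary())
```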
When given a multitude of possible configurations, it can be difficult to identify where novel relationships exist using manual data aggregation methods. Here, the organization of subcellular actin filaments across multiple conditions was measured using the degree of orientational order, obtained by querying the database in different configurations. The anisotropic and isotropic datasets show vastly different OOPs, which was expected since fibronectin micropatterning heavily influences tissue organization.
However, there were no significant differences between mutation status conditions when comparing the isotropic tissues. Conversely, the patterned tissues were statistically less organized in the positive control cell line. This relationship held even when the data was aggregated by different families versus the positive and negative controls.
If needed, the data can be parsed further. As an example, here actin OOP was plotted against each individual's age at the time of biopsy, separated by mutation status and family, to illustrate aggregation against a clinical variable. With this dataset, there is no correlation between actin organization and an individual's age.
This shows how the same data can be analyzed in different combinations, and how easily the normally difficult task of aggregating data that falls under multiple classes can be accomplished using databases. This protocol for creating a data organizational pipeline and generating a database provides the scientific rigor that is essential in this age of large-volume data collection.