Method Article
Many researchers generate "medium-sized", low-velocity, multi-dimensional data, which can be managed more efficiently with databases than with spreadsheets. Here we provide a conceptual overview of databases, including visualizing multi-dimensional data, linking tables in relational database structures, mapping semi-automated data pipelines, and using the database to elucidate the meaning of the data.
Science relies on increasingly complex data sets for progress, but common data management methods such as spreadsheet programs are inadequate for the growing scale and complexity of this information. While database management systems have the potential to rectify these issues, they are not commonly utilized outside of business and informatics fields. Yet, many research labs already generate "medium-sized", low-velocity, multi-dimensional data that could greatly benefit from implementing similar systems. In this article, we provide a conceptual overview explaining how databases function and the advantages they provide in tissue engineering applications. Structural fibroblast data from individuals with a lamin A/C mutation were used to illustrate examples within a specific experimental context. Examples include visualizing multi-dimensional data, linking tables in a relational database structure, mapping a semi-automated data pipeline to convert raw data into structured formats, and explaining the underlying syntax of a query. Outcomes from analyzing the data were used to create plots of various arrangements, and significant differences in cell organization in aligned environments were demonstrated between the positive control of Hutchinson-Gilford progeria, a well-known laminopathy, and all other experimental groups. In comparison to spreadsheets, database methods were enormously time-efficient, simple to use once set up, allowed immediate access to original file locations, and increased data rigor. In response to the National Institutes of Health (NIH) emphasis on experimental rigor, it is likely that many scientific fields will eventually adopt databases as common practice due to their strong capability to effectively organize complex data.
In an era where scientific progress is heavily driven by technology, handling large amounts of data has become an integral facet of research across all disciplines. The emergence of new fields such as computational biology and genomics underscores how critical the proactive utilization of technology has become. These trends are certain to continue due to Moore's law and steady progress gained from technological advances1,2. One consequence, however, is the rising quantities of generated data that exceed the capabilities of previously viable organization methods. Although most academic laboratories have sufficient computational resources for handling complex data sets, many groups lack the technical expertise necessary to construct custom systems suited for developing needs3. Having the skills to manage and update such data sets remains critical for efficient workflow and output. Bridging the gap between data and expertise is important for efficiently handling, updating, and analyzing a broad spectrum of multifaceted data.
Scalability is an essential consideration when handling large data sets. Big data, for instance, is a flourishing area of research that involves revealing new insights from processing data characterized by huge volumes, large heterogeneity, and high rates of generation, such as audio and video4,5. Using automated methods of organization and analysis is mandatory for this field to appropriately handle torrents of data. Many technical terms used in big data are not clearly defined, however, and can be confusing; for instance, "high-velocity" data is often associated with millions of new entries per day, whereas "low-velocity" data might be only hundreds of entries per day, such as in an academic lab setting. Although there are many exciting findings yet to be discovered using big data, most academic labs do not require the scope, power, and complexity of such methods for addressing their own scientific questions5. While it is undeniable that scientific data grow increasingly complex with time6, many scientists continue to use methods of organization that no longer meet their expanding data needs. For example, convenient spreadsheet programs are frequently used to organize scientific data, but at the cost of being unscalable, error-prone, and time-inefficient in the long run7,8. Conversely, databases are an effective solution to this problem, as they are scalable, relatively cheap, and easy to use in handling the varied data sets of ongoing projects.
Immediate concerns that arise when considering schemas of data organization are cost, accessibility, and time investment for training and usage. Frequently used in business settings, database programs are either relatively inexpensive or free, far more economical than the funding required to support big data systems. In fact, a variety of both commercially available and open source software exists for creating and maintaining databases, such as Oracle Database, MySQL, and Microsoft (MS) Access9. Many researchers would also be encouraged to learn that several MS Office academic packages come with MS Access included, further minimizing cost considerations. Furthermore, nearly all developers provide extensive documentation online, and there is a plethora of free online resources such as Codecademy, W3Schools, and SQLBolt to help researchers understand and utilize structured query language (SQL)10,11,12. Like any programming language, learning how to use databases and code in SQL takes time to master, but with the ample resources available the process is straightforward and well worth the effort invested.
Databases can be powerful tools for increasing data accessibility and ease of aggregation, but it is important to discern which data would most benefit from greater control of organization. Multi-dimensionality refers to the number of conditions against which a measurement can be grouped, and databases are most powerful when managing many different conditions13. Conversely, information with low dimensionality is simplest to handle using a spreadsheet program; for example, a data set containing years and a value for each year has only one possible grouping (measurements against years). High-dimensional data, such as from clinical settings, would require a large degree of manual organization to maintain effectively, a tedious and error-prone process beyond the scope of spreadsheet programs13. Non-relational (NoSQL) databases also fulfill a variety of roles, primarily in applications where data do not organize well into rows and columns14. In addition to being frequently open source, these organizational schemas include graphical associations, time series data, and document-based data. NoSQL databases scale better than SQL databases but cannot create complex queries, so relational databases are better in situations that require consistency, standardization, and infrequent large-scale data changes15. Databases are best at effectively grouping and updating data into the large array of conformations often needed in scientific settings13,16.
The main intent of this work, therefore, is to inform the scientific community about the potential of databases as scalable data management systems for "medium-sized", low-velocity data, as well as to provide a general template using specific examples of patient-sourced cell-line experiments. Other similar applications include geospatial data of river beds, questionnaires from longitudinal clinical studies, and microbial growth conditions in growth media17,18,19. This work highlights common considerations for and the utility of constructing a database coupled with the data pipeline necessary to convert raw data into structured formats. The basics of database interfaces and coding for databases in SQL are provided and illustrated with examples to allow others to gain the knowledge applicable to building basic frameworks. Finally, a sample experimental data set demonstrates how easily and effectively databases can be designed to aggregate multifaceted data in a variety of ways. This information provides context, commentary, and templates for assisting fellow scientists on the path towards implementing databases for their own experimental needs.
For the purposes of creating a scalable database in a research laboratory setting, data from experiments using human fibroblast cells was collected over the past three years. The primary focus of this protocol is to report on the organization of computer software to enable the user to aggregate, update, and manage data in the most cost- and time-efficient manner possible, but the relevant experimental methods are provided as well for context.
Experimental setup
The experimental protocol for preparing samples has been described previously20,21, and is presented briefly here. Constructs were prepared by spin-coating rectangular glass coverslips with a 10:1 mixture of polydimethylsiloxane (PDMS) and curing agent, then applying 0.05 mg/mL fibronectin in either an unorganized (isotropic) arrangement or a micropatterned arrangement of 20 µm lines with 5 µm gaps (lines). Fibroblast cells were seeded at passage 7 (or passage 16 for positive controls) onto the coverslips at optimal densities and left to grow for 48 h, with media changed after 24 h. The cells were then fixed using 4% paraformaldehyde (PFA) solution and 0.0005% nonionic surfactant, after which the coverslips were immunostained for cell nuclei (4′,6-diamidino-2-phenylindole [DAPI]), actin (Alexa Fluor 488 phalloidin), and fibronectin (polyclonal rabbit anti-human fibronectin). A secondary stain for fibronectin using goat anti-rabbit IgG antibodies (Alexa Fluor 750 goat anti-rabbit) was applied, and a preservation agent was mounted onto all coverslips to prevent fluorescence fading. Nail polish was used to seal coverslips onto microscope slides, which were then left to dry for 24 h.
Fluorescence images were obtained as described previously20 using a 40x oil immersion objective coupled with a digital charge coupled device (CCD) camera mounted on an inverted motorized microscope. Ten randomly selected fields of view were imaged for each coverslip at 40x magnification, corresponding to a 6.22 pixels/µm resolution. Custom-written codes were used to quantify different variables from the images describing the nuclei, actin filaments, and fibronectin; corresponding values, as well as organization and geometry parameters, were automatically saved in data files.
Cell lines
More extensive documentation on all sample data cell lines can be found in prior publications20. To describe briefly, the data collection was approved and informed consent was performed in accordance with UC Irvine Institutional Review Board (IRB # 2014-1253). Human fibroblast cells were collected from three families with different variations of the lamin A/C (LMNA) gene mutation: heterozygous LMNA splice-site mutation (c.357-2A>G)22 (family A); LMNA nonsense mutation (c.736 C>T, pQ246X) in exon 423 (family B); and LMNA missense mutation (c.1003C>T, pR335W) in exon 624 (family C). Fibroblast cells were also collected from other individuals in each family as related mutation-negative controls, referred to as "Controls", and others were purchased as unrelated mutation-negative controls, referred to as "Donors". As a positive control, fibroblast cells from an individual with Hutchinson-Gilford progeria (HGPS) were purchased; these had been grown from a skin biopsy taken from an 8-year-old female patient with HGPS possessing a LMNA G608G point mutation25. In total, fibroblasts from 22 individuals were tested and used as data in this work.
Data types
Fibroblast data fell into one of two categories: cellular nuclei variables (i.e., percentage of dysmorphic nuclei, area of nuclei, nuclei eccentricity)20 or structural variables stemming from the orientational order parameter (OOP)21,26,27 (i.e., actin OOP, fibronectin OOP, nuclei OOP). This parameter is equal to the maximum eigenvalue of the mean order tensor of all the orientation vectors, and it is defined in detail in previous publications26,28. These values are aggregated into a variety of possible conformations, such as values against age, gender, disease status, presence of certain symptoms, etc. Examples of how these variables are used can be found in the results section.
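As a concrete illustration, the OOP defined above can be computed directly from a set of orientation angles. The sketch below is in Python rather than the custom MATLAB codes used in the published analysis; the function name and the angle-list input format are assumptions made for illustration. For the 2D case, the maximum eigenvalue of the traceless mean order tensor T = ⟨2 u uᵀ − I⟩ with u = (cos θ, sin θ) reduces to √(⟨cos 2θ⟩² + ⟨sin 2θ⟩²).

```python
import math

def orientational_order_parameter(angles_rad):
    """Compute the 2D orientational order parameter (OOP) from a list of
    orientation angles in radians. The OOP is the maximum eigenvalue of the
    mean order tensor T = <2 u u^T - I>, u = (cos t, sin t); for this
    traceless symmetric 2x2 tensor it equals sqrt(<cos 2t>^2 + <sin 2t>^2).
    OOP = 1 for perfectly aligned orientations, 0 for fully isotropic ones."""
    n = len(angles_rad)
    c = sum(math.cos(2 * a) for a in angles_rad) / n  # mean tensor component T11
    s = sum(math.sin(2 * a) for a in angles_rad) / n  # mean tensor component T12
    return math.hypot(c, s)
```

A set of identical angles yields an OOP of 1 (perfect alignment, as in the micropatterned "lines" condition), while angles spread uniformly over [0, π) yield an OOP near 0 (the isotropic condition).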
Example codes and files
The example codes and other files based on the data above can be downloaded with this paper, and their names and types are summarized in Table 1.
NOTE: See Table of Materials for the software versions used in this protocol.
1. Evaluate if the data would benefit from a database organization scheme
2. Organize the database structure
NOTE: Relational databases store information in the form of tables. Tables are organized in schema of rows and columns, similar to spreadsheets, and can be used to link identifying information within the database.
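A minimal sketch of such a linked-table structure is shown below, using Python's built-in SQLite driver for portability (the protocol itself uses MS Access); all table and column names here are hypothetical, chosen only to echo the cell-line and measurement data described above. The key idea is the foreign key: each measurement row stores an identifier that points back to one row of subject metadata.

```python
import sqlite3

# Hypothetical linked-table sketch: CellLines holds subject metadata,
# Measurements holds per-image values, and line_id links the two.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CellLines (
    line_id     INTEGER PRIMARY KEY,
    designator  TEXT NOT NULL,          -- e.g. 'Patient', 'Control', 'Donor'
    mutation    TEXT,
    age         INTEGER
);
CREATE TABLE Measurements (
    meas_id     INTEGER PRIMARY KEY,
    line_id     INTEGER REFERENCES CellLines(line_id),
    pattern     TEXT,                   -- 'isotropic' or 'lines'
    actin_oop   REAL,
    nuclei_area REAL
);
""")
con.execute("INSERT INTO CellLines VALUES (1, 'Patient', 'LMNA c.357-2A>G', 45)")
con.execute("INSERT INTO Measurements VALUES (1, 1, 'lines', 0.82, 152.3)")
# A JOIN follows the foreign key so one query combines metadata and values.
row = con.execute("""
    SELECT c.designator, m.pattern, m.actin_oop
    FROM Measurements m JOIN CellLines c ON m.line_id = c.line_id
""").fetchone()
```

Because subject metadata lives in exactly one row, correcting a cell line's details updates every associated measurement at once, which is a key advantage over duplicating that information across spreadsheet rows.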
3. Set up and organize the pipeline
4. Create the database and queries
NOTE: If tables store information in databases, then queries are requests to the database for information matching specific criteria. There are two methods of creating the database: starting from a blank document or starting from the existing files. Figure 4 shows a sample query using SQL syntax, designed to run against the database relationships shown in Figure 2.
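The specific query of Figure 4 is not reproduced here, but a query of the same general shape can be illustrated with a hypothetical results table (table and column names are assumptions): filter rows with WHERE, collapse them into per-group aggregates with GROUP BY, and order the output.

```python
import sqlite3

# Hypothetical table of per-image results, used only to demonstrate
# the structure of a filtering-and-aggregating SQL query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Results (designator TEXT, pattern TEXT, actin_oop REAL)")
con.executemany("INSERT INTO Results VALUES (?, ?, ?)", [
    ("Patient", "lines", 0.80), ("Patient", "lines", 0.84),
    ("Control", "lines", 0.90), ("Control", "isotropic", 0.30),
])
rows = con.execute("""
    SELECT designator, AVG(actin_oop) AS mean_oop, COUNT(*) AS n
    FROM Results
    WHERE pattern = 'lines'          -- filter rows before grouping
    GROUP BY designator              -- one output row per experimental group
    ORDER BY designator
""").fetchall()
```

Re-aggregating the same data against a different condition (e.g., age or disease status) requires only changing the GROUP BY column, which is what makes multi-dimensional grouping nearly effortless compared with rebuilding a spreadsheet layout.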
5. Move the output tables to a statistical software for significance analysis
Multi-dimensionality of the data
In the context of the example data set presented here, the subjects, described in the Methods section, were divided into groups of individuals from the three families with the heart disease-causing LMNA mutation ("Patients"), related mutation-negative controls ("Controls"), unrelated mutation-negative controls ("Donors"), and an individual with Hutchinson-Gilford progeria syndrome (HGPS) as a positive control.
Technical discussion of the protocol
The first step when considering the use of databases is to evaluate if the data would benefit from such an organization.
The next essential step is to create an automated code that will ask the minimum input from the user and generate the table data structure. In the example, the user entered the category of data type (cell nuclei or structural measurements), cell lines' subject designator, and number of files being selected. The rele...
The authors have nothing to disclose.
This work is supported by the National Heart, Lung, and Blood Institute at the National Institutes of Health, grant number R01 HL129008. The authors especially thank the LMNA gene mutation family members for their participation in the study. We also would like to thank Linda McCarthy for her assistance with cell culture and maintaining the lab spaces, Nasam Chokr for her participation in cell imaging and the nuclei data analysis, and Michael A. Grosberg for his pertinent advice with setting up our initial Microsoft Access database as well as answering other technical questions.
Name | Company | Catalog Number | Comments |
4′,6-diamidino-2-phenylindole (DAPI) | Life Technologies, Carlsbad, CA | ||
Alexa Fluor 488 Phalloidin | Life Technologies, Carlsbad, CA | ||
Alexa Fluor 750 goat anti-rabbit | Life Technologies, Carlsbad, CA | ||
digital CCD camera ORCAR2 C10600-10B | Hamamatsu Photonics, Shizuoka Prefecture, Japan | ||
fibronectin | Corning, Corning, NY | ||
IX-83 inverted motorized microscope | Olympus America, Center Valley, PA | ||
Matlab R2018b | Mathworks, Natick, MA | ||
MS Access | Microsoft, Redmond, WA | ||
paraformaldehyde (PFA) | Fisher Scientific Company, Hanover Park, IL | ||
polyclonal rabbit anti-human fibronectin | Sigma Aldrich Inc., Saint Louis, MO | ||
polydimethylsiloxane (PDMS) | Ellsworth Adhesives, Germantown, WI | ||
Prolong Gold Antifade | Life Technologies, Carlsbad, CA | ||
rectangular glass coverslips | Fisher Scientific Company, Hanover Park, IL | ||
Triton-X | Sigma Aldrich Inc., Saint Louis, MO |