MVApp

Glittery multivariate analysis platform for all kinds of data. Follow us on twitter @MVApp007.

The app is available here or you can run it locally from your device by typing the following command in your R window:

    install.packages("shiny")
    library("shiny")
    shiny::runGitHub("mmjulkowska/MVApp", "mmjulkowska")

(It will take some time on the first run, while all the required libraries are installed and loaded.)

Purpose statement - What is MVApp for?

MVApp was created to streamline data analysis for all kinds of biological queries - from investigating mutant phenotypes and examining the effects of an experimental treatment to studying natural variation in any biological system.

We believe that MVApp will enhance data transparency and standardize data curation and analysis in the scientific community by empowering researchers to perform complex analyses without extensive knowledge of R or statistics, as well as improve data analysis literacy in the wider scientific community.

Although the MVApp development team is buried armpit-deep in Plant Science, we are trying to make the App as applicable as possible for all biological disciplines and beyond. If you have any suggestions on other types of analyses we can include, please check out our guidelines on how to contribute.

Currently, MVApp has the following features:

  1. Identification of outliers using different methods based on one or multiple phenotypes
  2. Summary of the data dynamics by fitting simple functions or polynomial curves to data points
  3. Hypothesis testing using parametric and non-parametric tests, including testing the assumptions of normality and equal variance
  4. Correlation analysis of all measured traits in the experiment or within a specific subset of data
  5. Reduction of data dimensionality and identifying the traits that explain the most data variance using principal component analysis and multidimensional scaling
  6. Clustering individual samples using hierarchical or k-means clustering
  7. Estimation of broad-sense heritability of measured traits
  8. Quantile regression analysis that allows the identification of traits with significant contribution to traits of major interest

You can read the instructions below, or watch one of our video-tutorials on youtube.

How to cite the MVApp:

The app is not published yet, but you can find the pre-print version of the MVApp manuscript on figshare: Julkowska, Magdalena; Saade, Stephanie; Agarwal, Gaurav; Gao, Ge; Pailles, Yveline; Morton, Mitchell; Awlia, Mariam; Tester, Mark (2018): MVAPP – Multivariate analysis application for streamlined data analysis and curation. figshare. Paper.

If you wish to cite the app itself, please use the following: Julkowska, M.M., Saade, S., Gao, G., Morton, M.J.L., Awlia, M., Tester, M.A., “MVApp.pre-release_v2.0 mmjulkowska/MVApp: MVApp.pre-release_v2.0”, DOI: 10.5281/zenodo.1067974

Table of contents:

1. DATA UPLOAD

2. SPATIAL VARIATION

3. CURVE FITTING

4. OUTLIER SELECTION

5. DATA EXPLORATION

6. CORRELATIONS

7. PRINCIPAL COMPONENT ANALYSIS

8. MULTIDIMENSIONAL SCALING

9. HIERARCHICAL CLUSTER ANALYSIS

10. K-MEANS CLUSTER ANALYSIS

11. HERITABILITY

12. QUANTILE REGRESSION

1. DATA UPLOAD

Data format:

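MVApp can handle .csv files containing at least the following columns: Genotype, one or more Independent Variables (e.g. treatment) and one or more Dependent Variables (phenotypes).

If you have a timeseries experiment, or any other gradient, and you want to fit curves to your data, the input data should also include columns containing the continuous Independent Variable (e.g. time) and the Sample ID.

Your data should look similar to the Example dataset, with the ID and TIME columns being optional: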

mvapp_data
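As a rough illustration of this layout, the sketch below builds a small table in R and writes it to a .csv file. All column names and values (Genotype, Treatment, ID, Time, AREA, LENGTH) are hypothetical and only meant to show the shape of a suitable input file, not the actual Example dataset.

```r
# Hypothetical example data; all column names and values are illustrative.
set.seed(1)
design <- expand.grid(
  Genotype  = c("Col-0", "mutant1"),   # genotype column
  Treatment = c("Control", "Salt"),    # an Independent Variable
  ID        = 1:3,                     # optional sample identifier
  Time      = 0:4                      # optional continuous Independent Variable
)

# One or more numeric phenotype columns (Dependent Variables)
design$AREA   <- round(10 + 2 * design$Time + rnorm(nrow(design)), 2)
design$LENGTH <- round(5 + 0.5 * design$Time + rnorm(nrow(design), sd = 0.3), 2)

# Write the table to a temporary .csv file that could then be uploaded
csv_path <- file.path(tempdir(), "mvapp_input_example.csv")
write.csv(design, csv_path, row.names = FALSE)

# Read it back to check the layout
str(read.csv(csv_path))
```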

GO BACK TO TABLE OF CONTENTS

Upload and annotate your data

To upload your data, navigate to the “Upload your data” tab in the uppermost panel. Click on the “Browse” button and locate your .csv data file:

01_data_upload_01

Select the columns pertaining to Genotype, Independent Variables, Dependent Variables (phenotypes):

01_data_upload_02

01_data_upload_03

Optionally, continuous Independent Variables and Sample IDs:

01_data_upload_04

Finally, click on the “Click to set the data” button to finalise data upload with selected columns and annotations (unselected columns from the original dataset will be dropped at this point).

View the newly uploaded dataset in “New Data” sub-tab:

01_data_upload_05

GO BACK TO TABLE OF CONTENTS

2. SPATIAL VARIATION

Why test spatial variation?

Any environment where plants are grown is susceptible to spatial variation, resulting in, for example, light, temperature and humidity gradients. These can affect the growth of the plants, and thus the results obtained. While we encourage experimental designs that minimize the effect of spatial variation, MVApp also lets you test for spatial effects in your data.

At the moment MVApp does not include specific models to correct for spatial variation, as these require advanced statistical insight and commercial R-packages which we cannot provide in a reliable way. Nevertheless, we are happy to receive contributions from the community.

Upload the spatial information into the MVApp

In order to test the spatial effects, you need to include the columns containing spatial information in the uploaded dataset. You can do this by ticking the box “Data contains information on spatial distribution of collected data?” and selecting the column(s) containing spatial information (more than one column can be added):

spatial_dataupload

Examine the effect of spatial variation on individual phenotypes

By navigating to the Spatial Variation tab, you can select the phenotype you want to examine and the spatial components whose effect will be tested. Once you have selected everything, click on the “Unleash spatial viz” button to run the analysis. In the main panel, you will find a graph showing the trait value across the spatial gradient for visual examination:

spatial_dataselect

When you scroll down, you will find the ANOVA results of the effect of the selected spatial components:

spatial_datatest
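To get a feeling for what such a test does, the sketch below runs an ANOVA of one trait against two spatial components in plain R. The tray layout, the column names (ROW, COLUMN, AREA) and the choice of treating positions as factors are assumptions made for illustration; the exact model fitted inside MVApp may differ.

```r
set.seed(42)
# Hypothetical tray layout: 6 rows x 8 columns, with a gradient along the rows
layout <- expand.grid(ROW = 1:6, COLUMN = 1:8)
layout$AREA <- 20 + 1.5 * layout$ROW + rnorm(nrow(layout), sd = 1)

# Treat the positions as factors and test their effect on the trait
fit <- aov(AREA ~ factor(ROW) + factor(COLUMN), data = layout)
res <- summary(fit)[[1]]
print(res)

# p-values for the two spatial components (rows of the ANOVA table)
p_row <- res[["Pr(>F)"]][1]
p_col <- res[["Pr(>F)"]][2]
```

A small p-value for a spatial component suggests that the position of a sample influenced the measured trait.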

You can also perform the spatial variation analysis on multiple spatial components by selecting more than one component in the “Spatial Variable” window. Press the “Unleash spatial viz” button to update the analysis:

spatial_multi_iv

If you collected data across multiple time points / treatments, and would like to perform spatial analysis separately for each time point, you can select “Subset the data?” checkbox and select the specific subset you would like to explore. Please press “Unleash spatial viz” button to update the analysis:

spatial_subset

GO BACK TO TABLE OF CONTENTS

3. CURVE FITTING

Why model your data?

If you have a continuous Independent Variable in your experiment, you might want to estimate how your Dependent Variables change across it. For example, you could investigate the dynamics of plant/bacterial growth over time, or the dose dependency of a phenotypic response to a chemical treatment. Fitting curves will allow you to observe and model these response dynamics.

Fit simple functions

At the moment, MVApp helps you to fit simple functions: linear, quadratic, exponential and square root functions. For these functions, we fit a linear model (using lm() function) between the continuous Independent Variable indicated in the “Time” column and the Dependent Variable (phenotype).

Modelling of the non-linear functions also relies on fitting a linear model: the Dependent Variable is transformed so that a linear model can be fitted.

MVApp extracts the model parameters: y-intercept (“INTERCEPT”) and the first regression coefficient (“DELTA”), as well as the r2 values to determine model performance.
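The idea can be sketched in a few lines of R. The transformations used below (square root for the quadratic, logarithm for the exponential and square for the square root function) are assumptions made for this illustration and are not confirmed by the text above; the data are simulated.

```r
set.seed(7)
# Hypothetical growth data for one sample: exponential growth with noise
time <- 0:10
y <- 2 * exp(0.3 * time) * exp(rnorm(length(time), sd = 0.05))

# Assumed transformations of the Dependent Variable for each simple function
transforms <- list(
  linear      = function(v) v,
  quadratic   = function(v) sqrt(v),
  exponential = function(v) log(v),
  square_root = function(v) v^2
)

# Fit a linear model to each transformed response and collect the parameters
fits <- do.call(rbind, lapply(names(transforms), function(nm) {
  yt  <- transforms[[nm]](y)
  fit <- lm(yt ~ time)
  data.frame(
    model     = nm,
    INTERCEPT = unname(coef(fit)[1]),
    DELTA     = unname(coef(fit)[2]),
    r2        = summary(fit)$r.squared
  )
}))
print(fits)

# The model with the highest r2 would be suggested as the best fit
best <- as.character(fits$model[which.max(fits$r2)])
```

Note that INTERCEPT and DELTA are on the scale of the transformed Dependent Variable, so they are only directly comparable between samples fitted with the same function.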

GO BACK TO TABLE OF CONTENTS

Fit curves with MVApp

First, in the side panel, select the Independent Variable(s) you wish to group your samples by, and the Dependent Variable you would like to model.

02_curve_fit_01

02_curve_fit_02

If you do not know which function will fit best, you can click on “Unleash model estimation” button. The best model will be indicated based on the r2 values presented in the table:

02_curve_fit_03

Decide which model you would like to apply to the entire dataset and click “Unleash the model”:

02_curve_fit_04

Whenever you fit curves to the chosen Dependent Variable (phenotype), MVApp will automatically calculate the coefficient of determination (r2), which indicates how well the fitted function models the observed data. The number of samples that have a poor fit (r2 < 0.7) will be indicated in the message box above the table:

02_curve_fit_05

You can change the threshold of the r2 by changing the r2 cut-off value:

02_curve_fit_06

You can view the lowest r2 values by sorting the samples based on r2 by clicking on the r2 column sorting arrow:

02_curve_fit_07

You can download your data without the samples showing poor r2 fit by scrolling down and clicking on “Download curated data with r2 > cut-off” button:

02_curve_fit_08

GO BACK TO TABLE OF CONTENTS

Visualise goodness of fit of the dynamic curves with fit-plots

You can examine how well your data fit the selected model by viewing fit-plots. The sample names are composed as “Genotype_IndependentVariable_SampleID”. You can either scroll through the sample list or type in the sample name:

02_curve_fit_09

If you are having trouble interpreting the graphs, or need a quick figure legend describing the graph, you can select “show figure legend” to display the default version of the figure legend:

02_curve_fit_10

You can view multiple fit-plots simultaneously. The plots can be sorted by either increasing or decreasing r2 values:

02_curve_fit_11

You can scroll through the individual graphs with the slider on the right side of the main window:

02_curve_fit_12

GO BACK TO TABLE OF CONTENTS

Assess and compare the dynamics between Genotypes and / or Independent Variables

Finally, you can compare how the calculated DELTAs or Coefficients extracted from the models differ between your genotypes and other Independent Variables, such as “treatment”.

You can examine the differences by clicking on sub-tab “Examine differences”. The message box at the top provides ANOVA results, with the p-value threshold indicated below the graph:

02_curve_fit_13

You can use raw data, or data with r2 above the threshold value:

02_curve_fit_14

By scrolling further down, you will find a panel to control the design of the graph, as well as threshold p-value for the ANOVA:

02_curve_fit_15

You can change the graph to, for example, bar graph, remove background or determine what is represented by the error bars. The default figure legend will update automatically too:

02_curve_fit_18

By scrolling further down, you can find the significant groups, as calculated per Tukey’s pairwise test, with the same p-value threshold as ANOVA:

02_curve_fit_20

By scrolling down even further, you will find a table containing the summary statistics for your data for all fitted values:

02_curve_fit_21

GO BACK TO TABLE OF CONTENTS

Fit polynomial curves with MVApp

If your data shows signs of complex dynamics across your continuous Independent Variable (often particularly applicable for long time-series), you might consider fitting a polynomial curve. The splines usually have very high r2 values, and are therefore not included in the “model estimation” for the best fitting curves.

For the cubic splines, we use lm(phenotype ~ bs(time, knots=X)) function in R, where you can indicate the position of a knot in the “timepoint to split the cubic spline” box.

002_curves_cubicspline_01

The fit-plots for the cubic splines carry a dashed diagonal line at the knot position.

002_curves_cubicspline_02

For the smoothed splines, we use the smooth.spline() function in R, and you can choose between automatic and user-defined degrees of freedom. The user-defined degrees of freedom can be set with the “Number of degrees of freedom” slider.

002_curves_smoothedspline_01

The fit-plots for the smoothed splines are drawn with purple lines.

002_curves_smoothedspline_02

In case you choose to fit smoothed splines with automatically determined degrees of freedom, they will be displayed in the last column of the table in the sub-tab “Modelled data”. Please be aware that the degree of freedom might differ between individual samples.

002_curves_smoothedspline_03
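Both spline types can be tried out directly in R with the functions named above. The sketch below uses simulated data for a single sample; the knot position (10) and the user-defined degrees of freedom (5) are arbitrary values chosen for illustration.

```r
library(splines)  # ships with R; provides bs()

set.seed(3)
# Hypothetical sigmoidal growth curve for one sample
time <- seq(0, 20, by = 1)
phenotype <- 5 + 10 / (1 + exp(-(time - 10) / 2)) + rnorm(length(time), sd = 0.3)

# Cubic spline with a single knot at time point 10
cubic_fit <- lm(phenotype ~ bs(time, knots = 10))
cubic_r2  <- summary(cubic_fit)$r.squared

# Smoothed spline with automatically determined degrees of freedom
# (smooth.spline() uses generalized cross-validation by default) ...
auto_fit <- smooth.spline(time, phenotype)
# ... or with a user-defined value
user_fit <- smooth.spline(time, phenotype, df = 5)

c(cubic_r2 = cubic_r2, auto_df = auto_fit$df, user_df = user_fit$df)
```

Because the automatic degrees of freedom are estimated from each sample's own data, they will generally differ between samples.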

Although the polynomial functions commonly have better fit than the simple functions, like linear, quadratic or exponential ones, they often result in more coefficients describing the dynamics. So if you would like to use curve-fitting for simplifying your data, using polynomial functions might not be the best choice.

002_curves_cubicspline_03

GO BACK TO TABLE OF CONTENTS

4. OUTLIER SELECTION

Why identify potential outliers?

If you are familiar with large(ish)-scale experiments, you have probably had to curate your data, removing outlier samples that stem from experimental errors or human mistakes in recording the data, to avoid drawing spurious conclusions from unrepresentative data.

You likely identified these “weird” samples by simple graphical means, or based on their distance from the median in terms of the Standard Deviation or the Interquartile Range.

MVApp helps to automatically highlight potential outliers based on a single or multiple Dependent Variables, using various approaches. However, be careful: outliers should not be removed automatically. It is good practice to justify outlier samples, perhaps by referring to notes or images taken during the experiment that might explain the unusual result. It is possible that a “potential outlier” is in fact a valuable, if extreme, result.

GO BACK TO TABLE OF CONTENTS

Highlight potential outliers

Begin by choosing which dataset you would like to use and whether to remove samples with missing values. If you have performed curve fitting, you can also use the dataset curated by r2 values:

03_outliers_01

03_outliers_02

Select the Independent Variable(s) by which to group the samples, and whether you would like to select outliers based on one, some or all phenotypes / measured traits:

03_outliers_03

Next, select which method you would like to use to highlight potential outliers. MVApp provides the following methods:

03_outliers_04

You can now click on the “unleash the outlier highlighter” button to view the curated data. In the main tab, a message will appear indicating the number of potential outliers highlighted, together with a table of your data. If you scroll to the right, you will see the columns marked “outl_Dependent Variable” (for example “outl_AREA”), where “true” indicates that the sample is an outlier within its group, as defined by the genotype / day / Independent Variables selected. If you are considering all Dependent Variables, the final column will indicate (as “true”) whether a given sample is a potential outlier in a number of Dependent Variables that meets or exceeds the user-defined threshold. The number of identified outliers is shown in the text box above the table:

003_outliers_spare_02

If you decided to select outliers based on all Dependent Variables and to remove the entire sample, rather than replacing individual outlying values with NA (empty cells), use the slider in the side panel to set the number of Dependent Variables in which a given sample must be an outlier in order to be considered an outlier across the whole experiment, i.e. the samples that are extreme across many phenotypes and thus should be removed from the data analysis.

003_outliers_spare_03

As you slide the slider, the number of the outliers in the message box will change.

003_outliers_spare_04
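The logic of per-trait flags combined with a threshold can be sketched as follows. The 1.5 x interquartile range rule used here is only one possible method and is an assumption for this example; the data, the grouping by Genotype and the threshold of 2 are likewise illustrative.

```r
set.seed(11)
dat <- data.frame(
  Genotype = rep(c("A", "B"), each = 10),
  AREA     = c(rnorm(10, 50, 2), rnorm(10, 70, 2)),
  LENGTH   = c(rnorm(10, 10, 0.5), rnorm(10, 14, 0.5))
)
# Plant one obviously extreme sample
dat$AREA[3]   <- 120
dat$LENGTH[3] <- 30

# TRUE when a value lies more than 1.5 x IQR outside the quartiles of its group
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * diff(q)
  x < q[1] - fence | x > q[2] + fence
}

# One "outl_" column per trait, evaluated within each genotype
traits <- c("AREA", "LENGTH")
for (tr in traits) {
  dat[[paste0("outl_", tr)]] <- ave(dat[[tr]], dat$Genotype, FUN = is_outlier) == 1
}

# Flag a sample when it is an outlier in at least `threshold` traits
threshold <- 2
dat$outlier_count <- rowSums(dat[paste0("outl_", traits)])
dat$outlier <- dat$outlier_count >= threshold
dat[dat$outlier, ]
```

Raising the threshold flags fewer samples, which mirrors the change in the message box as you move the slider.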

GO BACK TO TABLE OF CONTENTS

Examine the data with and without potential outliers

After highlighting potential outliers in your data, you can examine how the data look with and without them.

Go to sidepanel sub-tab “Tweak the graphs” and select the Dependent Variable you wish to examine. You can also select the type of plot you would like: box plot, scatter plot or bar plot (we find box plots to be the most informative).

Click on the main panel sub-tab “Graph containing outliers” to see your plots prior to removing potential outliers.

03_outliers_08

You can reduce the number of genotypes you are examining simultaneously by changing the slider “Show … number of samples”:

03_outliers_09

You can change the portion of the data that you examine by changing the slider “Plot portion of the data starting from the element number …”:

03_outliers_10

If you wish to color code your samples or split the plots based on your Independent Variables, you can do this in the side panel:

03_outliers_11

If you want to alter the order of your samples, you can swap the order of the Independent Variables in the sidebar sub-panel “Outlier selection”:

mvapp_outlier_graph_even_nicer

You can also change the order of the samples by adjusting the order of the Independent Variables in the side panel:

003_outliers_spare_08

GO BACK TO TABLE OF CONTENTS

Compare the data with outliers removed

If you want to look at the graphs with potential outliers removed (as highlighted in main panel “The outliers test”), click on the main panel “Graphs with outliers removed”. You can click between “Graph containing outliers” and “Graph with outliers removed” to compare both datasets.

003_outliers_spare_05

003_outliers_spare_06

IMPORTANT NOTE: This outlier test was developed to facilitate data curation. Please do NOT remove any data before making absolutely sure that there is a very good reason that the sample is not representative. We recommend downloading the dataset with outliers highlighted, manually removing samples that you can reasonably explain, and reuploading the curated dataset before continuing with your analysis.

GO BACK TO TABLE OF CONTENTS

Calculate summary statistics

MVApp can calculate summary statistics functions (e.g. mean, median, standard deviation) in the main panel sub-tab “Summary data”.

Select the dataset that you would like to use in the upper left corner of the main tab, and select the functions that you want to be calculated in the upper right corner. Then click “Unleash summary statistics”:

03_outliers_17

A table containing all the calculations will appear in the main panel. You can download the data to your computer by clicking the “Download summary statistics data” button:

03_outliers_19
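For reference, a comparable summary table can be produced in base R with aggregate(). The data and the grouping columns below are hypothetical, and the three functions shown are only a selection.

```r
dat <- data.frame(
  Genotype  = rep(c("A", "B"), each = 6),
  Treatment = rep(c("Control", "Salt"), times = 6),
  AREA      = c(10, 12, 11, 13, 12, 14, 20, 22, 21, 23, 22, 24)
)

# Mean, median and standard deviation per Genotype x Treatment group
summary_stats <- aggregate(
  AREA ~ Genotype + Treatment,
  data = dat,
  FUN  = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

# aggregate() returns a matrix column; flatten it into ordinary columns
summary_stats <- do.call(data.frame, summary_stats)
print(summary_stats)
```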

GO BACK TO TABLE OF CONTENTS

5. DATA EXPLORATION

Once your data is nice and clean and ready to go, it is time to start having a proper look at it. A good place to start is to check out how your data is distributed using histograms and boxplots, grouping samples according to your various Independent Variables. From these you can get an idea of how your different genotypes are behaving, how your treatments are affecting your phenotypes, and how variable your data is.

Beyond eyeballing, you can apply statistical tests such as ANOVA to test whether there are significant differences between groups. These are all easy things to do in MVApp, which also helps you check the assumptions of these statistical tests, such as normal distribution and homoscedasticity (i.e. equal variance).

In the side panel, you can choose:

04_explore_01

Once the choices are made, the user can proceed to the different tests available in the DATA EXPLORATION tab.

GO BACK TO TABLE OF CONTENTS

Examine distribution

Your histograms will appear in the “Testing normal distribution” sub-tab, where you can choose between “Histograms with counts on y-axis” and “Histograms with density on y-axis”.

04_explore_03

04_explore_04

You can split the graphs by Independent Variable, by ticking the box “Split the graph?” and choosing the Independent Variable that you would like to use for splitting:

04_explore_05

You can subset your data even further, by ticking the box “Subset the data?” in the main window, and selecting yet another one of the Independent Variables, and selecting specific value for this variable that will be used for the displayed graphs:

04_explore_06

From these plots, you can look at the spread of your data across the Independent Variable groupings selected in the side-panel. Below the histograms, you will find a message that summarizes the groups/subgroups that seem not to have a normal distribution, i.e. those for which the p-value of the Shapiro-Wilk test is smaller than the p-value threshold selected in the side-panel. Normal distribution is a requirement for performing an ANOVA test (less so for large sample sizes).

04_explore_08

If you want to see the detailed results of the Shapiro-Wilk test for all groups/subgroups along with their QQ-plots, tick the checkbox “See detailed Shapiro-Wilk test and QQ-plots”. The table shows the p-value of the Shapiro-Wilk test performed for each group/subgroup. If the p-value of the Shapiro-Wilk test for a group is larger than the selected p-value threshold, in the final column the group will be noted with “Data has NORMAL distribution”.

04_explore_09

The Shapiro-Wilk test is affected by sample size (the more the merrier), so you are strongly encouraged to check the QQ-plots as well. If you ticked the “See detailed Shapiro-Wilk test and QQ-plots?” checkbox, sliders for the QQ-plots also appear. These sliders help you choose the optimum number of columns and plots for display. The first slider, “Display QQ plots in … columns:”, sets the number of columns used to display the QQ-plots. If the number of plots is too large to be displayed all at once in the window, a second slider, “Plot portion of the data starting from element number…”, appears, and you can choose the portion of plots to be displayed.
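The per-group normality check can be reproduced in R roughly as follows. The data are simulated, the 0.05 threshold is an example value, and the wording of the label for non-normal groups is made up for this sketch.

```r
set.seed(5)
dat <- data.frame(
  Genotype = rep(c("A", "B"), each = 100),
  AREA     = c(rnorm(100, mean = 50, sd = 5),   # roughly normal
               rexp(100, rate = 1 / 50))        # strongly skewed
)

p_threshold <- 0.05

# Shapiro-Wilk p-value for each group
p_values <- sapply(split(dat$AREA, dat$Genotype),
                   function(x) shapiro.test(x)$p.value)

result <- data.frame(
  group   = names(p_values),
  p_value = unname(p_values),
  # The second label is illustrative only
  verdict = ifelse(unname(p_values) > p_threshold,
                   "Data has NORMAL distribution",
                   "Data does not appear normally distributed")
)
print(result)

# Theoretical vs. sample quantiles for one group;
# draw them with qqnorm(x) followed by qqline(x)
qq <- qqnorm(dat$AREA[dat$Genotype == "A"], plot.it = FALSE)
```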

Based on the results obtained in this sub-tab, you can better judge whether to rely on the Bartlett or the Levene test for equal variances in the following sub-tab.

GO BACK TO TABLE OF CONTENTS

Examine variance

In the “Testing equal variance” sub-tab, you can have a look at the results of the Bartlett test and the Levene test of equal variances between the different groups and for each sub-group. Equal variance, or homoscedasticity, is also a requirement for performing an ANOVA test.

In the main window, you see the boxplots for each group: left - the observed data (y), middle - the data with the subtracted median (y-med(y)), right - the absolute deviations from the median (abs(y-med(y))).

04_explore_11

If you scroll lower, you will see the results of both the Bartlett and Levene tests. The null hypothesis of both tests is that the variances across the groups are the same. The Bartlett test is more powerful when the data come from a normal distribution, while the Levene test is more robust to departures from normality.

The first table displays the results of the Bartlett test and the second table displays those of the Levene test. The tables show the p-value of the tests performed for each group/subgroup. If the p-value of the test for a group is larger than the selected p-value threshold, the groups are noted as “Equal”. In this case there is not enough evidence to reject the null hypothesis, and the variances are considered equal. If the p-value of the test for a group is smaller than the selected p-value threshold, the groups are noted as “Not equal”. In this case, the null hypothesis is rejected and the variances are considered not equal.

04_explore_12
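Both tests can be approximated in base R. The Levene test is written out here as a one-way ANOVA on the absolute deviations from the group medians, matching the abs(y-med(y)) boxplot described above; whether MVApp computes it in exactly this way is an assumption. The data and the threshold are illustrative.

```r
set.seed(9)
dat <- data.frame(
  Genotype = factor(rep(c("A", "B", "C"), each = 30)),
  AREA     = c(rnorm(30, 50, 2), rnorm(30, 50, 2), rnorm(30, 50, 10))
)

# Bartlett test of equal variances
bartlett_p <- bartlett.test(AREA ~ Genotype, data = dat)$p.value

# Levene test, median-centred variant: one-way ANOVA on abs(y - med(y))
abs_dev  <- abs(dat$AREA - ave(dat$AREA, dat$Genotype, FUN = median))
levene_p <- anova(lm(abs_dev ~ dat$Genotype))[["Pr(>F)"]][1]

p_threshold <- 0.05
verdict <- function(p) if (p > p_threshold) "Equal" else "Not equal"
c(Bartlett = verdict(bartlett_p), Levene = verdict(levene_p))
```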

As indicated previously, the results of this sub-tab and the previous sub-tab are needed for the ANOVA test performed in the following sub-tab. ANOVA assumes that the data come from a normal distribution and that the variances are equal.

GO BACK TO TABLE OF CONTENTS

One / two sample test

In this subtab you can explore the differences between a certain value and your sample, or between two selected samples, with one/two sample t-test or Kolmogorov-Smirnov test (for non-parametric samples).

First, select which test you would like to perform and how you wish to group your samples in the left panel of the main window:

04_explore_14

In the case of the one-sample t-test, you should enter the “mu value” to test for a significant difference between this value and the mean of your selected sample group. The results of the one-sample t-test will be displayed above the boxplot:

04_explore_15

For the two-sample t-test or Kolmogorov-Smirnov test, you should select two specific samples. The results of the test will be shown above the graph:

04_explore_16
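The equivalent tests in base R are t.test() and ks.test(). The sketch below uses two simulated samples; the mu value of 10 is arbitrary, and the settings shown are the R defaults, which may not match the options used inside MVApp.

```r
set.seed(21)
a <- rnorm(20, mean = 10, sd = 1)
b <- rnorm(20, mean = 13, sd = 1)

# One-sample t-test: does the mean of `a` differ from mu = 10?
one_sample <- t.test(a, mu = 10)

# Two-sample t-test (R's default is the Welch variant, unequal variances)
two_sample <- t.test(a, b)

# Two-sample Kolmogorov-Smirnov test
ks <- ks.test(a, b)

c(one_sample = one_sample$p.value,
  two_sample = two_sample$p.value,
  ks         = ks$p.value)
```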

GO BACK TO TABLE OF CONTENTS

Test significant differences between groups

In this sub-tab, you can check for significant differences in the means between different groups using analysis of variance (ANOVA) or a non-parametric test (Kruskal-Wallis). A text box displays the p-value of the ANOVA test between different groups of Independent Variables:

04_explore_17

If the p-value of the ANOVA test for a group is larger than the selected p-value threshold, groups are noted with “NO significant difference in means”. In this case there is not enough evidence to reject the null hypothesis and the means of the groups are assumed equal. If the p-value of the ANOVA test for a group is smaller than the selected p-value threshold, groups are noted with “SIGNIFICANT difference in means”. In this case the null hypothesis, where the means of the group are considered equal, is rejected and the means of the groups can be considered significantly different.

A second text box displays the significant groups based on Tukey’s pairwise comparison. Groups that share a common letter do not have significantly different means for the selected Dependent Variable.

04_explore_18

Boxplots display the distribution of the data for a specific trait (dependent variable) for the levels of the independent variable. The boxplots can be split by the second Independent Variable, which can be selected once you tick the checkbox for “Split graph?”. You can also change the main Independent Variable to compare differences between Genotypes across the individual subsets:

04_explore_19

You can also run a non-parametric test, by selecting it from the drop-down menu:

04_explore_20

In the case of a non-parametric test, the Wilcoxon / Mann-Whitney test will be used to make the pairwise comparisons between individual groups. The results of the test are displayed in the lower text box:

04_explore_22
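The parametric and non-parametric routes described in this section can be sketched in base R as follows. The data are simulated, and the "holm" correction for the pairwise Wilcoxon tests is simply the R default; the correction used by MVApp is not stated here. Deriving the group letters from the Tukey results requires an additional step that is not shown.

```r
set.seed(13)
dat <- data.frame(
  Genotype = factor(rep(c("A", "B", "C"), each = 15)),
  AREA     = c(rnorm(15, 50, 3), rnorm(15, 51, 3), rnorm(15, 65, 3))
)

# Parametric route: one-way ANOVA followed by Tukey's pairwise comparison
fit     <- aov(AREA ~ Genotype, data = dat)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
tukey   <- TukeyHSD(fit)$Genotype   # matrix with a "p adj" column
print(tukey)

# Non-parametric route: Kruskal-Wallis followed by pairwise Wilcoxon tests
kw_p     <- kruskal.test(AREA ~ Genotype, data = dat)$p.value
pairwise <- pairwise.wilcox.test(dat$AREA, dat$Genotype,
                                 p.adjust.method = "holm")
print(pairwise$p.value)
```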

GO BACK TO TABLE OF CONTENTS

Two-way ANOVA

In this sub-tab you can explore the effect of two Independent Variables and the interaction between them. Select Independent Variable 1 and 2 from the drop-down menu in the main window:

04_explore_23

You can additionally subset your data for yet another Independent Variable, by selecting the “Subset the data?” box, and choosing the Independent Variable and a specific subset to be displayed / analysed:

04_explore_24

The results of the two-way ANOVA are displayed in the text box below the interaction plot. Independent Variable 1 (IV1) is the variable selected in the left-most drop-down menu, while Independent Variable 2 (IV2) is the variable selected in the center drop-down menu:

04_explore_25

If you scroll down, you can see the residual plot of the two-way ANOVA shown above. The residuals should not show any pattern; a patternless plot indicates that the linear model underlying the ANOVA describes the data adequately. If this is not the case, the results of the two-way ANOVA should not be trusted:

04_explore_27
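A two-way ANOVA with interaction, together with the values needed for a residual plot, can be obtained in R as sketched below. The variable names and the simulated effects are illustrative only.

```r
set.seed(17)
dat <- expand.grid(
  Genotype  = c("A", "B"),
  Treatment = c("Control", "Salt"),
  rep       = 1:10
)
# Salt reduces AREA, and more strongly in genotype B (an interaction)
effect <- with(dat, 50 - 10 * (Treatment == "Salt") -
                         10 * (Treatment == "Salt" & Genotype == "B"))
dat$AREA <- effect + rnorm(nrow(dat), sd = 2)

# IV1 * IV2 expands to both main effects plus their interaction
fit <- aov(AREA ~ Genotype * Treatment, data = dat)
tab <- summary(fit)[[1]]
print(tab)

# Residuals vs. fitted values: these should scatter around zero without a trend
res <- data.frame(fitted = fitted(fit), residual = residuals(fit))
# To draw the plot: plot(res$fitted, res$residual); abline(h = 0, lty = 2)
```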

GO BACK TO TABLE OF CONTENTS

6. CORRELATIONS

This tab is used to check whether and how strongly your selected dependent variables (phenotypes) are related by creating a correlation matrix of the selected variable pairs. Correlation coefficients and p-values are provided for each variable pair.

Select the dataset

First of all, select the dataset you would like to use to perform the correlation analysis. If you did not perform outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

Note: be aware of outliers in your selected data. Although the correlation can still be calculated when the data contain outliers, they can significantly skew the correlation coefficient and make it inaccurate. You can spot them visually in the scatterplot, or use the outlier removal feature in the Data Curation tab and select the outlier-removed data for your correlation analysis.

If you want to include / exclude some of the Dependent Variables from your data, you can do so by selecting or deselecting them from the “Choose from Dependent Variables to be plotted” window.

05_correlate_01

GO BACK TO TABLE OF CONTENTS

Select the correlation method

There are two methods you can choose from to calculate the correlation coefficients. The default, Pearson r correlation, is the most widely used correlation statistic for measuring the degree of relationship between linearly related variables. It assumes that the variables to be correlated approximate a normal distribution and follow a linear relationship (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). Alternatively, you can use Spearman correlation, which assesses statistical associations based on the ranks of the variables instead of the variables themselves, and does not make any assumptions about the distribution of your data.

05_correlate_01a

If you scroll down, you will find a table containing the coefficient of determination (R2) and the p-value for the goodness of fit:

picture1b
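For comparison, the two correlation methods are available in base R through cor() and cor.test(). The traits below are simulated and their names are hypothetical.

```r
set.seed(23)
n <- 50
dat <- data.frame(AREA = rnorm(n, 50, 5))
dat$LENGTH <- 0.2 * dat$AREA + rnorm(n, sd = 0.5)   # linearly related to AREA
dat$COLOUR <- rnorm(n)                               # unrelated trait

# Correlation matrices for all trait pairs
pearson_r  <- cor(dat, method = "pearson")
spearman_r <- cor(dat, method = "spearman")

# Coefficient, its square and the p-value for a single pair
pair <- cor.test(dat$AREA, dat$LENGTH, method = "pearson")
c(r  = unname(pair$estimate),
  R2 = unname(pair$estimate)^2,
  p  = pair$p.value)
```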

GO BACK TO TABLE OF CONTENTS

Correlation for subsetted data

The default is to perform correlation analysis across all dependent variables (phenotypes) across all independent variables. You can also choose to use a subset of your data (for example, phenotypes under a certain treatment, or from a certain day) to examine the correlation.

Tick the checkbox “Subset your data for correlation analysis” and choose the specific subset to display the correlation for from the dropdown menu:

05_correlate_02a

Once you choose to subset your correlation, you will see a message displaying the top 5 most variable correlation pairs. These pairs of Dependent Variables are determined by examining the variance in the R2 of the correlation between the individual subsets of the Independent Variable. The pairs are NOT selected based on the p-values of the correlation, so the variance in correlation should be examined in more detail before drawing any conclusions:

05_correlate_02b

GO BACK TO TABLE OF CONTENTS

Customize the correlation plot

You can choose the plotting method. The default method is “circle” where the correlation strength between individual Dependent Variables is represented by the size and color of the circle:

05_correlate_02

Some of the correlation plot methods represent the correlation strength only with color, such as the “number” method, where the correlation coefficient values are reflected by different levels of the color scale:

05_correlate_03

You can change the plot type, and plot the correlations between individual traits with full square matrix, or using lower or upper portion of the correlation matrix:

05_correlate_04

You can also mark non-significant correlations with a cross, by ticking the box “indicate non-significant correlation”, located lower in the sidebar panel, and setting the p-value threshold:

05_correlate_05

GO BACK TO TABLE OF CONTENTS

Scatterplots

To examine the correlation between selected Dependent Variables in more detail, you can use a scatterplot. The data used for this graph are exactly the same as you chose in the “correlation plot” tab. From the sidebar panel, choose the two Dependent Variables that you wish to plot on the x- and y-axis respectively, and the Independent Variable that you would like to use to color-code the graph:

05_correlate_06

You can choose to further subset your data by ticking the checkbox “Subset the data?” and selecting an Independent Variable for which you wish to subset:

05_correlate_07

By moving your pointer over the graph, you will get specific information on the samples represented by the individual data points. The sample identifier consists of the GENOTYPE, Independent Variable, Timepoint and Sample ID (selected in the “Data upload” tab). The R2 and p-value for the individual correlations are reported in the figure legend, which you can view by selecting the “Show the figure legend” checkbox:

05_correlate_08

The scatterplot is interactive, so you can hide specific subsets, indicated by individual colors, from the graph. Please NOTE that this will not affect the R2 and p-value presented in the figure legend, since all the values used for the original graph are still included in those calculations:

05_correlate_09

GO BACK TO TABLE OF CONTENTS

7. PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is often used to simplify data into fewer dimensions and to check which traits explain the majority of the variation found in the population studied. However, PCA is often not explored to its full potential. You can, for example, run PCA on data subsetted by an Independent Variable (e.g. treatment or a specific timepoint), or run PCA separately on those subsets to see how much each of your Dependent Variables (traits) contributes to explaining the observed variation. MVApp allows you to do all this!

Select data, subsets, and Dependent Variables

Select the dataset you would like to analyse from the dropdown menu at the top of the side panel. If you have not performed outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

06_pca_01

Subsequently, select which Dependent Variables you would like to use for PCA:

06_pca_02

You can additionally choose to scale the data (recommended if the values of individual Dependent Variables differ in scale) and to run PCA on a specific subset of your data. After selecting all of the above, click “Unleash the PCA monster”.

06_pca_03
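Why scaling is recommended can be seen in a small base R sketch. It uses `prcomp()` from the stats package purely for illustration (which PCA function MVApp calls internally is not stated here), and the simulated traits and their units are assumptions.

```r
# Hypothetical traits measured on very different scales.
set.seed(3)
traits <- data.frame(
  root_length = rnorm(50, mean = 100, sd = 20),   # e.g. mm
  leaf_number = rnorm(50, mean = 8,   sd = 1),    # counts
  biomass     = rnorm(50, mean = 0.5, sd = 0.1)   # e.g. g
)

pca_raw    <- prcomp(traits, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(traits, center = TRUE, scale. = TRUE)

# Without scaling, the trait with the largest variance dominates PC1;
# with scaling, all traits enter the analysis with unit variance.
round(abs(pca_raw$rotation[, "PC1"]), 3)
round(abs(pca_scaled$rotation[, "PC1"]), 3)
```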

You can view the selected dataset in the first tab called “Selected dataset”:

06_pca_04

The specific subset (scaled or non-scaled) can be viewed in the tab “Final data for PCA”:

06_pca_05

GO BACK TO TABLE OF CONTENTS

Visualize the principal components

In the sub-tab called “Eigenvalues”, you will find the scree plot showing the main principal components generated from the PCA. The principal components are shown in descending order of the percentage of variance explained by each component. You can download the plot as a “.pdf” file by clicking on the “Download plot” button above the graph:

06_pca_06

Below the graph, you can view a default figure legend:

06_pca_07

If you scroll down in the main window, you will find a table summarizing the eigenvalue of each principal component (comp), the percentage of variance it explains, and the cumulative percentage, which adds up to 100% over all components. The table can be downloaded as a “.csv” file by clicking on the “Download table” button:

06_pca_08
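The three columns of such an eigenvalue table are related in a simple way, sketched below in base R. The built-in `iris` dataset stands in for your own data, and the column names of `eig_table` are assumptions, not the exact headers used by the app.

```r
# PCA on scaled example data (iris is shipped with R).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2   # variance captured by each component
eig_table <- data.frame(
  comp                   = paste0("comp ", seq_along(eigenvalues)),
  eigenvalue             = eigenvalues,
  percentage_of_variance = 100 * eigenvalues / sum(eigenvalues),
  cumulative_percentage  = 100 * cumsum(eigenvalues) / sum(eigenvalues)
)
eig_table

# The table could then be written out with, for example:
# write.csv(eig_table, "pca_eigenvalues.csv", row.names = FALSE)
```

With scaled data, the eigenvalues sum to the number of Dependent Variables, so an eigenvalue above 1 means the component explains more variance than a single original trait.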

GO BACK TO TABLE OF CONTENTS

Visualize the contribution of each Dependent Variable to the principal components

In the sub-tab ‘Contribution per variable’, you can visualize the contribution of each Dependent Variable to the selected principal components. Select the principal components to be plotted on the x- and y-axis from the drop-down menus below the graph. The values in brackets on the x- and y-axis indicate the percentage of the variance explained:

06_pca_09

GO BACK TO TABLE OF CONTENTS

What are the principal component coordinates for individual samples?

By scrolling down, you can find the PC coordinates of each sample represented as a scatter plot. The x- and y-axis are controlled by the same dropdown menus as the contribution plots. You can color the plot by any Independent Variable that you select from the dropdown menu.

06_pca_10

You can check if there is a separation in the PC coordinates among different genotypes / treatments / timepoints in your samples by changing the color-coding of the graph on the left panel:

06_pca_11

GO BACK TO TABLE OF CONTENTS

Explain principal components by examining the contribution of Dependent Variables

In the sub-tab ‘Contribution per PC’, the contribution of the Dependent Variables to each PC is displayed. You can download the associated graphs by clicking the “Download plot” button:

06_pca_12

If you scroll down, you will find a table summarizing the contribution of the Dependent Variables to each PC/dimension (Dim). You can download the table containing the percentage contribution data per PC as a “.csv” file by clicking the “Download the data” button:

06_pca_14
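One common way to express the percentage contribution of each variable to a PC is the squared loading multiplied by 100, so that the contributions within each dimension sum to 100%. The base R sketch below uses that definition with `prcomp()` and the `iris` example data; we assume, but have not confirmed here, that the values reported by the app follow the same definition.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings are unit-length eigenvectors, so their squares sum to 1 per PC.
contrib <- 100 * pca$rotation^2
colnames(contrib) <- paste0("Dim.", seq_len(ncol(contrib)))

round(contrib, 2)   # rows = Dependent Variables, columns = dimensions
colSums(contrib)    # each dimension adds up to 100
```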

GO BACK TO TABLE OF CONTENTS

8. MULTIDIMENSIONAL SCALING

Multidimensional scaling (MDS) is a multivariate data analysis approach that is used to visualize the similarity/dissimilarity between samples by plotting points in two dimensions. The input data for MDS is a dissimilarity matrix representing the distances among pairs of objects. MDS is mathematically and conceptually similar to PCA and factor analysis, but PCA is more focused on the dimensions themselves and seeks to identify the traits that explain the most variance, whereas MDS is more focused on the relationships found between the scaled objects.
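As a minimal sketch of the idea, classical MDS can be run in base R with `dist()` and `cmdscale()`. The use of `cmdscale()`, Euclidean distances and the `iris` example data are assumptions for illustration and do not necessarily match the functions MVApp calls.

```r
X <- scale(iris[, 1:4])                # scaled example data
d <- dist(X, method = "euclidean")     # dissimilarity matrix between samples
mds <- cmdscale(d, k = 2)              # classical MDS in two dimensions
colnames(mds) <- c("Dim.1", "Dim.2")
head(mds)

# With Euclidean distances, the first MDS axis matches the first principal
# component up to its sign, which illustrates how closely the methods are related.
abs(cor(mds[, 1], prcomp(X)$x[, 1]))
```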

Select data, subsets, and dependent variables

Select the dataset you would like to analyze from the dropdown menu at the top of the side panel. If you have not performed outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

07_mds_01

Subsequently, select which Dependent Variables you would like to use in MDS:

07_mds_02

You can additionally select whether you would like to scale the data (recommended if the values of individual Dependent Variables differ in scale), and run MDS on a specific subset of your data. If you would like to segregate your scaled samples into a number of clusters, you can select the “Cluster samples using k-means” checkbox and choose the number of clusters you would like to use.

After selecting all the above options, you can click on “Unleash the power of MDS”. You will then view the selected dataset in the first tab called “Selected dataset”:

07_mds_03

The specific subset (scaled or non-scaled) can be found in the tab “Final data for MDS”:

07_mds_04

GO BACK TO TABLE OF CONTENTS

Multidimensional scaling of individual samples

In the sub-tab “MDS of the samples”, you can view a scatter plot showing the two dimensions resulting from the MDS. If the k-means clustering option was selected, the individual samples are displayed in colors corresponding to their clusters. This plot can be downloaded as a “.pdf” file by clicking the “Download plot” button:

07_mds_05

By hovering your pointer over the graph, you will find specific information regarding your samples. The sample identifier represents the Genotype, Independent Variable, Time/Gradient and Sample ID (selected in the “Data upload” tab).

07_mds_06

If you scroll down, you will see the table summarizing the coordinates of individual samples as calculated with MDS, including the k-means cluster assignment if the k-means clustering option was chosen. The table can be downloaded as a “.csv” file by clicking on the “Download table” button.

07_mds_07

GO BACK TO TABLE OF CONTENTS

Multidimensional scaling of the selected Dependent Variables

In the sub-tab “Scaling of traits”, you can find an MDS performed on the selected Dependent Variables. The plot showing the coordinates of each Dependent Variable is displayed and color-coded by cluster number if that option was included. This kind of plot can provide you with an insight into the relationships between the measured traits:

07_mds_08
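Scaling traits rather than samples amounts to computing distances between the columns of the data instead of the rows. A minimal base R sketch, again assuming `cmdscale()` and the `iris` example data for illustration only:

```r
X <- scale(iris[, 1:4])

# Transposing the data makes each trait one "object" for the distance matrix.
trait_dist <- dist(t(X))
trait_mds  <- cmdscale(trait_dist, k = 2)
colnames(trait_mds) <- c("Dim.1", "Dim.2")
trait_mds   # one row of coordinates per Dependent Variable
```

Traits that end up close together in this plot behave similarly across the samples.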

GO BACK TO TABLE OF CONTENTS

9. HIERARCHICAL CLUSTER ANALYSIS

Hierarchical cluster analysis is an algorithmic approach to find discrete groups with varying degrees of (dis)similarity. The samples are hierarchically organised depending on the selected method and may be presented as a dendrogram. Hierarchical clustering is commonly used in discretising largely continuous ecological phenomena to aid structure detection and hypothesis generation. For example, if data were collected along a gradient, cluster analysis may help to identify distinct regions therein which may correspond to an ecologically meaningful grouping. Similarly, hierarchical cluster analysis can be applied to phenotypic data from an experiment performed with and without stress, to group genotypes that show similar responses to stress conditions.

The hierarchical clustering approach was previously used for clustering the Arabidopsis accessions based on their root and shoot responses to salt stress. Now you can perform this analysis within minutes on your own data using MVApp.

Selecting the data

Select the dataset to analyse from the dropdown menu at the top of the side panel. If you did not perform outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

Subsequently, select which Dependent Variables you want to use in the Hierarchical Clustering. Please be aware that, unlike in PCA, a greater number of traits will result in a larger number of groups being identified. We therefore advise limiting the number of Dependent Variables used as input for Hierarchical Clustering Analysis:

08_hclust_02

You can additionally select whether you would like to scale the data (recommended if the values of individual Dependent Variables differ in scale), perform the Hierarchical Clustering on the mean data (means are calculated per Genotype, Independent Variable and Time point selected in the “Data upload” tab), or run Hierarchical Clustering on a specific subset of your data. At this point you should also select the method for clustering the samples.

You can choose between the following:

08_hclust_03
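The effect of the clustering method and of clustering on mean values can be sketched in base R with `hclust()`. The two methods shown ("ward.D2" and "average") are only examples of what `hclust()` supports and are not meant to reproduce the list in the screenshot; the `iris` data and the use of `Species` as the grouping column are assumptions.

```r
X <- scale(iris[, 1:4])

# Same distance matrix, two different agglomeration methods.
hc_ward    <- hclust(dist(X), method = "ward.D2")
hc_average <- hclust(dist(X), method = "average")

# Clustering on mean values per group instead of on individual samples.
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
rownames(group_means) <- as.character(group_means$Species)
hc_means <- hclust(dist(scale(group_means[, -1])), method = "ward.D2")

# plot(hc_ward, labels = FALSE)   # dendrogram of individual samples
# plot(hc_means)                  # dendrogram of group means
```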

After selecting all of the above you can click “Unleash cluster analysis”. You can view the selected dataset in the first tab called “Selected dataset”, while the specific subset (scaled or non-scaled) used for the Hierarchical Clustering is displayed in the tab “Final data used for HClust”:

08_hclust_04

GO BACK TO TABLE OF CONTENTS

View the clusters and select the similarity distance for cluster separation

The relationship between the accessions is established depending on the selected dependent variables and the method. In the sub-tab “Clustering your HOT HOT data” you can view a heatmap of the selected dependent variables displayed as separate rows, while the individual (or mean) values corresponding to individual samples will be displayed in separate columns.

08_hclust_05

If you scroll down, you will find a dendrogram representing individual samples that are clustered as in the heatmap above, but now you will be able to see the (dis)similarity distance between the samples. Enter the distance at which you wish to separate the data into the clusters in the window “Separate clusters at:”.

08_hclust_06

As soon as you enter a value, the message above the dendrogram will change, displaying the number of clusters. Please be aware that having too many clusters might not be informative and will significantly slow, or even crash, the cluster validation step.

08_hclust_07
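Separating clusters at a chosen distance corresponds to cutting the dendrogram at that height. In base R this can be sketched with `cutree()`; the cut height of 10 is an arbitrary assumption and should be read off your own dendrogram.

```r
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")

cut_height <- 10                       # assumed value for "Separate clusters at:"
clusters <- cutree(hc, h = cut_height)

length(unique(clusters))               # number of clusters at this distance
table(clusters)                        # number of samples per cluster

# A lower cut height can only keep or increase the number of clusters.
length(unique(cutree(hc, h = cut_height / 2)))
```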

If you scroll even further down, you will find a table containing the cluster ID for your specific samples.

08_hclust_08

Cluster Validation

In the sub-tab “Cluster validation” you will find a message box listing all the dependent variables for which ANOVA found a significant effect of the clusters. In the graph below the message box you will find a box-plot representing the individual clusters; the letters above the graph indicate the significance groups calculated using Tukey’s HSD test for pairwise comparisons.

08_hclust_10
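The validation step can be approximated in base R with `aov()` and `TukeyHSD()`. This sketch assumes three clusters and tests a single trait from the `iris` example data; the significance letters shown in the app are not reproduced here.

```r
dat <- iris[, 1:4]
hc <- hclust(dist(scale(dat)), method = "ward.D2")
dat$cluster <- factor(cutree(hc, k = 3))    # assumed number of clusters

# One-way ANOVA: does the trait differ between clusters?
fit <- aov(Sepal.Length ~ cluster, data = dat)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
anova_p

# Pairwise comparisons between clusters with Tukey's HSD test.
tukey <- TukeyHSD(fit, "cluster")
tukey$cluster    # columns: diff, lwr, upr, p adj
```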

You can view individual dependent variables by selecting them from a drop-down menu “View” above the box-plot.

08_hclust_11

GO BACK TO TABLE OF CONTENTS

10. K-MEANS CLUSTER ANALYSIS

K-means clustering is often used to find groups in a data set when the categories or groups in the data are unknown. The K-means algorithm assigns the individuals to a number of centroids defined by the user. The Euclidean distance between each individual and each cluster mean is computed, and the individual is assigned to the closest centroid, so that the samples within the same cluster (K) are as similar as possible. This analysis is useful to confirm the user’s hypotheses about the existence of possible groups or to detect unidentified groups in complex datasets.
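The assignment of samples to their closest centroid can be illustrated in base R with `kmeans()`. The number of clusters (3), the random seed and the `iris` example data are assumptions for this sketch.

```r
set.seed(42)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 25)   # 25 random starts, best one kept

km$centers          # cluster means (centroids)
table(km$cluster)   # number of samples per cluster

# Euclidean distance of every sample to every centroid;
# the nearest centroid should agree with the assigned cluster.
d_to_centroids <- as.matrix(dist(rbind(km$centers, X)))[-(1:3), 1:3]
nearest <- unname(apply(d_to_centroids, 1, which.min))
mean(nearest == km$cluster)
```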

Selecting the data

Select the dataset to analyse from the dropdown menu at the top of the side panel. If you did not perform outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

09_kmclust_01

Subsequently, select which Dependent Variables you want to use in the K-means clustering. We recommend that you only include non-redundant, informative traits, which can be selected by visualizing the correlations between all traits. The use of highly correlated traits could increase the importance of these specific traits and skew the clustering. Thus, excluding redundant traits is advised.

You can additionally select whether you would like to scale the data (recommended if the values of individual Dependent Variables are in different units), perform the K-means Clustering on the mean data (means are calculated per Genotype, Independent Variable and Time points selected in “Data upload” tab), or run K-means Clustering on a specific subset of your data.

09_kmclust_02

You can view the selected dataset in the first tab called “Selected dataset”, while the specific subset (scaled or non-scaled) used for the K-means clustering is displayed in the tab “Final data used for K-means”:

09_kmclust_03

Optimal cluster number estimation

The number of clusters or centroids (K) must be defined by the user. If you do not know what the best number of clusters for your data is, you can run a preliminary analysis in the sub-tab “Optimal number of clusters” by clicking the “Unleash optimal cluster number estimation” button in the side panel.

It will take some time for this analysis to finish - so please be patient…

09_kmclust_04

After the estimation step finishes running, the graphs will appear in the main window in the sub-tab “Optimal number of clusters”. The graphs represent graphical methods for estimating the number of clusters.

The first graph visualizes the elbow method. You can identify the optimal number of clusters by finding the point at which the line makes its sharpest turn, the so-called “elbow”.

09_kmclust_05
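The quantity typically plotted for the elbow method is the total within-cluster sum of squares for increasing numbers of clusters. A minimal base R sketch, assuming the `iris` example data and a range of 1 to 8 clusters:

```r
set.seed(42)
X <- scale(iris[, 1:4])

k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})
data.frame(k = k_values, total_within_ss = wss)

# plot(k_values, wss, type = "b",
#      xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
```

The curve always decreases as clusters are added; the "elbow" is the point after which additional clusters reduce the within-cluster sum of squares only marginally.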

If you scroll down, you will see the graph representing the “silhouette method”, where the optimal number of clusters is indicated by a dashed line.

09_kmclust_06
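The silhouette method picks the number of clusters with the highest average silhouette width. The sketch below computes it by hand in base R so that it needs no additional packages; the helper `mean_silhouette()` is hypothetical and written for this example, and the app itself may rely on a dedicated package instead.

```r
set.seed(42)
X <- scale(iris[, 1:4])
D <- as.matrix(dist(X))

# Average silhouette width for a given cluster assignment.
mean_silhouette <- function(D, cluster) {
  ids <- sort(unique(cluster))
  s <- sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)            # singleton cluster: width set to 0
    a <- sum(D[i, own]) / (sum(own) - 1)    # mean distance within own cluster
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(D[i, cluster == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  mean_silhouette(D, kmeans(X, centers = k, nstart = 25)$cluster)
})
best_k <- k_values[which.max(avg_sil)]
data.frame(k = k_values, average_silhouette = avg_sil)
best_k
```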

Scrolling even lower, you will find a message box displaying the results of the cumulative tests, using 30 different indices, indicating the best cluster number according to the majority rule.

09_kmclust_07

Below the message box, you will find the graphical representation of the cumulative cluster number indicated by different methods.

09_kmclust_08

Performing k-means clustering

Once you have decided on the number of clusters to be used for k-means clustering, enter the cluster number in the “Cluster number” box and click “Unleash cluster analysis” in the side panel.

09_kmclust_09

The results of the k-means clustering will be displayed in sub-tab “K-means clustering plots”, where you can view the individual samples plotted in the order of the selected Dependent Variable from the drop-down menu “Variable to plot” in the main window. The colors represent the individual clusters.

09_kmclust_10

You can split the graph by selecting the “Split the graph” checkbox and the Independent Variable to split by, and modify the appearance of the graph using the boxes on the right-hand side above the graph.

09_kmclust_12

In the sub-tab “K-means clustering scatter plots”, you can plot the correlation between two Dependent Variables, selected from the drop-down menus in the upper left corner above the plot.

09_kmclust_13

You can split the graph by selecting the “Split the graph” checkbox and the Independent Variable to split by, and modify the appearance of the graph using the boxes on the right-hand side above the graph.

09_kmclust_14

Finally, in the sub-tab “K-means clustering data table” you can view the table showing which of your samples belong to which cluster. The column with the cluster identity is at the far right of the table. You can download the table as a “.csv” file by clicking on the button “Download data”.

09_kmclust_15

11. HERITABILITY

Heritability is the proportion of the phenotypic variance that can be attributed to genetic variance. This statistic is important in genetics for assessing whether a trait is heritable (genetically controlled). MVApp allows you to calculate the broad-sense heritability, which is the ratio of total genetic variance to total phenotypic variance.

NOTE! Please, be aware that in order to estimate heritability, you should have at least 5 different genotypes.
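For the simplest case of a single experiment with a balanced design, broad-sense heritability can be estimated from a one-way ANOVA. The base R sketch below is an illustration under several assumptions and is not the model used by MVApp: the data are simulated, there is no year or location effect, and the phenotypic variance is taken on a genotype-mean basis (genetic variance plus residual variance divided by the number of replications).

```r
# Simulated data: 10 genotypes with 4 replicates each (assumed values).
set.seed(7)
n_geno <- 10
n_rep  <- 4
genotype  <- factor(rep(paste0("G", 1:n_geno), each = n_rep))
g_effect  <- rnorm(n_geno, sd = 2)                 # true genetic effects
phenotype <- 10 + rep(g_effect, each = n_rep) + rnorm(n_geno * n_rep, sd = 1)

# Mean squares from the one-way ANOVA.
tab  <- summary(aov(phenotype ~ genotype))[[1]]
ms_g <- tab[["Mean Sq"]][1]    # between genotypes
ms_e <- tab[["Mean Sq"]][2]    # residual

# Variance components for a balanced design: E[MS_g] = var_e + n_rep * var_g.
var_g <- max((ms_g - ms_e) / n_rep, 0)
var_e <- ms_e

# Broad-sense heritability on a genotype-mean basis (assumption, see above).
H2 <- var_g / (var_g + var_e / n_rep)
H2
```

With only a handful of genotypes the estimate of the genetic variance becomes very unstable, which is one reason a minimum number of genotypes is advised.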

Selecting the data

Select the dataset to analyse from the dropdown menu at the top of the side panel. If you did not perform outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

10_heritability_01

If you performed your experiment across different years or experimental batches, please select the column indicating the year / experimental batch from the drop-down menu “Select column containing experimental batch / year”. If you do not have this information, or your data was collected from one experiment, select “none”. The model will still be able to run.

The same applies for the drop-down menu “Select column containing location”.

10_heritability_02

Subsequently, enter the number of replications per location and per year.

10_heritability_03

If your data contains different treatments, you can split your data by selecting “Split the data?” checkbox and selecting an Independent Variable from the drop-down menu for which you wish to split.

10_heritability_04

GO BACK TO TABLE OF CONTENTS

Estimated broad-sense heritability

In the main window, the message box will give a summary of information entered (number of replications, number of years/locations and unique values per year/location) and the summary of the model used to calculate heritability.

10_heritability_05

In case you wish to subset your data even further, you can do it by selecting “Subset the data?” checkbox and selecting an Independent Variable from the drop-down menu for which you wish to subset, as well as specific subset to be displayed. As soon as you do that, the estimated broad-sense heritability values will adjust.

10_heritability_06

You can compare the heritability values between individual subsets by changing them in the drop-down menu “Use subset”.

10_heritability_07

GO BACK TO TABLE OF CONTENTS

12. QUANTILE REGRESSION

Quantile regression estimates the conditional quantiles of a response variable in a linear model, providing a more complete view of possible causal relationships between variables. It minimizes asymmetrically weighted absolute errors (plain absolute errors in the case of the median) and can provide a more comprehensive analysis of the effect of the predictors on the response variable than mean regression. Linear quantile regression is related to linear least-squares regression, as both study the relationship between the predictor variables and the response variable. The difference is that least-squares regression models the conditional mean of the response variable, whereas quantile regression models its conditional quantiles. It is especially useful in applications where extremes are important, such as environmental studies where upper quantiles of yield are critical.

When should you use it?

Quantile regression estimates are more robust against outliers in the response. If your response variable has potential outliers or extreme values, ordinary least squares (OLS) regression is strongly affected, because the mean is sensitive to outliers, and you can use median regression as a substitute. If your errors are non-normal, OLS is inefficient, whereas quantile regression is robust. If your data fail to satisfy the assumption of homoscedasticity of the error terms, you can also use this technique, as it does not require that assumption. Beyond that, quantile regression provides a richer characterization of the data, allowing us to consider the impact of an explanatory variable on the entire distribution of the response, not merely its conditional mean.
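The idea behind quantile regression can be sketched in base R by minimizing the so-called check (or pinball) loss directly. This is a rough illustration only: the data are simulated, the helpers `check_loss()` and `fit_quantile()` are hypothetical, and the general-purpose optimizer `optim()` gives an approximate solution. Dedicated tools such as `rq()` from the quantreg package are the usual choice in practice, and this sketch does not claim to reproduce what MVApp does internally.

```r
# Simulated data with a few extreme outliers in the response.
set.seed(11)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 50

# Check loss: residuals above the fitted line are weighted by tau,
# residuals below it by (1 - tau). For tau = 0.5 this is median regression.
check_loss <- function(beta, tau) {
  res <- y - (beta[1] + beta[2] * x)
  sum(res * (tau - (res < 0)))
}

ols_fit <- coef(lm(y ~ x))   # least-squares fit, used as the starting point

fit_quantile <- function(tau) {
  optim(ols_fit, check_loss, tau = tau, method = "Nelder-Mead")$par
}

# Intercept and slope for a lower, median and upper quantile.
taus <- c(0.1, 0.5, 0.9)
qr_coefs <- t(sapply(taus, fit_quantile))
rownames(qr_coefs) <- paste0("tau = ", taus)

rbind(OLS = ols_fit, qr_coefs)
```

Plotting the fitted coefficient of an explanatory variable against the quantile level, as in the columns of `qr_coefs`, gives the kind of quantile plot described further below.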

Select the dataset

Select which dataset you would like to use to perform quantile regression from the drop-down menu at the top of the side panel. If you did not perform outlier removal or curve fitting, the “outliers removed” and “r2 fitted curves curated data” will not work properly, so please do not select them.

11_qr_01

GO BACK TO TABLE OF CONTENTS

Select response, explanatory variables, subsets

Select the phenotype you want as the response of your quantile regression. You can only choose one variable.

11_qr_02

Select the Independent Variables by which to subset the data. You can choose a maximum of two variables.

11_qr_03

Then choose the explanatory variables of your quantile regression model. You can choose any number of explanatory variables.

11_qr_04

You can also choose a p-value threshold to test the significance of the explanatory variables. You have the option to scale the data, which might be useful if your variables are in different units. After you select all of the necessary parameters, you can click on the “Unleash the power of Quantile Regression” button. The selected dataset will be displayed in the “Selected Dataset” sub-tab in the main window.

11_qr_04a

The specific subset, selected for the quantile regression with / without scaling is displayed in “Final data for analysis” sub-tab.

11_qr_05

GO BACK TO TABLE OF CONTENTS

Results of quantile regression

The results of the quantile regression model can be seen in the sub-tab ‘Modelled data’. You can choose a specific subset to view from the drop-down menu “Use subset” above the message box.

11_qr_06

The message box displays the significant phenotypes for lower, median and upper quantiles of the response for the particular subset chosen from the drop down list. You can choose the subset whose result you want to see in the message box.

The results from all the quantile regression models for different subsets are tabulated. The table can be downloaded as a “.csv” file by clicking the button “Download modelled data” containing all the results.

11_qr_07

GO BACK TO TABLE OF CONTENTS

Visualize the quantile regression results

The plots of the regression models are displayed in the sub-tab ‘Quantile plots’. You can choose the Independent Variable by which you want to group your plot. If you have chosen two Independent Variables to subset your data, you can also choose the value of the other subset variable for which you want to see the results.

If you view a single plot, then choose the particular phenotype you want to view.

The coefficients of the phenotype are plotted against the quantile level. Colored dots indicate that the variable is significant at that quantile level, while crosses indicate that it is not. The different colors represent the unique values of the grouping variable. The different lines can be used to compare the behavior of phenotypes in different conditions or on different days, depending on the grouping variable. The plot can be downloaded by using the “Download plot” button above the plots.

11_qr_08

If you want to view the results of all your phenotypes, you can choose ‘multiple plots’ from “View plots as:”. The panel displaying the plots will update automatically.

11_qr_09

You can view the contribution of different Dependent Variables by using a scroll bar “Show plot of variables starting from…”

11_qr_10

GO BACK TO TABLE OF CONTENTS

Quantile plots

If you choose to view your plot as a single plot, the quantile plot of the chosen phenotype will be displayed. The coefficients of the phenotype are plotted against the quantile level. Colored dots indicate that the variable is significant at that quantile level, while crosses indicate that it is not. The different colors represent the unique values of the grouping variable. The different lines can be used to compare the behavior of phenotypes in different conditions or on different days, depending on the grouping variable. The plot can be downloaded by using the “Download plot” button at the bottom.

single plot

If you choose to view your plots as multiple plots, the quantile plots of all the phenotypes will be displayed. You can download these plots using the “Download plot” button at the bottom.

multiple plots

If you have more than four explanatory variables, then you can use the slider to view more plots.

multiple plots slider

GO BACK TO TABLE OF CONTENTS