MorphoJ uses three types of collections of data: projects, datasets and data matrices.
A project is a collection of data and analyses. It is self-contained in that it contains all the data and other information needed for the analyses. Whatever data are required need to be imported into the project in order to be available for analysis or visual display in MorphoJ. There are no limits to the items that can be included in a single project, although it normally makes sense to include only the information that is used together in the same study. For instance, it might make sense to include all the data to be used for one paper or for an MSc thesis in the same project, but it is doubtful whether it is useful to combine studies on human skull variation and the evolution of Drosophila wings in the same project.
Only one project can be open in MorphoJ at any time. Each project can be saved to disk as a single file. The resulting project files can be exchanged among users and provide the complete details of the analyses of the project.
A dataset is a single, much more homogeneous collection of data items. A dataset with landmark data will have the same landmarks for every observation and will only contain information about one set of individuals. A dataset is not necessarily self-contained, because information from outside it may be needed for the analysis or for displaying the results. Typically, a dataset includes all the data that need to be considered together in a single analysis.
Another use of datasets is to store results of analyses. When running analyses such as principal component analysis, canonical variate analysis or regression, MorphoJ makes new datasets with transformed versions of the data such as PC scores, CV scores or the residual and predicted values from regression. These values can be exported from the MorphoJ datasets for use in other programs (spreadsheets, statistics programs, etc.).
The distinction between projects and datasets may be easiest to understand in some examples.
A study of the evolution of skeletal traits in a group of rodents may be a good example of the kinds of data that can be included in a single project. The study may include three-dimensional data on landmarks of the skull, two-dimensional data on the mandible and scapula, and additional three-dimensional data on the pelvis. In this case, the data on the skull, mandible, scapula and pelvis would each be one dataset. It doesn't matter that for some specimens the skull may be present but no data may be available because of damage to the material. This arrangement allows the user to do both separate analyses of single datasets and combined analyses of multiple datasets (if the specimens can be matched appropriately).
Additional information that can be included in the project might be data on the location and habitat where the specimens were collected, one or more phylogenies of the group, and resources for visualization of the results, such as wireframe diagrams or an outline drawing for the views of structures in the two-dimensional datasets. This information can be made available for use throughout the project.
Other kinds of data that might reasonably be included in single projects are separate datasets concerning landmarks of the flowers and leaves of plants, or datasets of landmarks of the facial soft tissue, skull structures and maybe brain scans of human subjects.
MorphoJ can easily split datasets into multiple datasets according to criteria of group membership of the observations. It is much harder to combine multiple datasets into more extensive ones. Therefore, it is often advantageous to use more inclusive definitions of datasets at the beginning when importing data into MorphoJ.
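The idea of splitting by group membership can be illustrated outside MorphoJ with a minimal Python sketch. This is not MorphoJ's own code or file format: the record layout, the classifier name `species` and the function `split_by_classifier` are hypothetical and serve only to show why splitting is the easy direction.

```python
from collections import defaultdict


def split_by_classifier(observations, classifier):
    """Return one list of observations per value of the given classifier."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs[classifier]].append(obs)
    return dict(groups)


# Hypothetical observations: an identifier, one classifier and one variable.
observations = [
    {"id": "sp01", "species": "A", "centroid_size": 10.2},
    {"id": "sp02", "species": "B", "centroid_size": 11.5},
    {"id": "sp03", "species": "A", "centroid_size": 9.8},
]

subsets = split_by_classifier(observations, "species")
```

Splitting only needs the classifier value of each observation. Combining datasets in the other direction would additionally require checking that the landmarks and variables match, which is one reason to start with an inclusive dataset.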
A dataset can itself consist of several parts. Most datasets contain several data matrices, for instance, a matrix of raw coordinates, a matrix of shape coordinates after Procrustes fit, a matrix of centroid sizes, or a matrix of covariates. Each data matrix contains a certain type of data. Usually, MorphoJ keeps track of these matrices automatically, but quite often the user is asked to select a certain type of data from a dataset. In that case, MorphoJ will use the corresponding data matrix for the analysis.
Each data matrix contains one or more variables. Some variables can be used and interpreted individually, such as centroid size or log-transformed centroid size (both in the data matrix of type Centroid Size). Other data types, such as the landmark coordinates after a Procrustes fit, are not interpretable individually, and MorphoJ therefore uses them together for all subsequent analyses.
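The relation between a dataset, its data matrices and their variables might be pictured as in the following Python sketch. The classes `DataMatrix` and `Dataset` are hypothetical illustrations, not MorphoJ's internal classes. The function `centroid_size` uses the standard definition (the square root of the summed squared distances of the landmarks from their centroid), and the natural logarithm is assumed for the log-transformed value.

```python
import math
from dataclasses import dataclass, field


@dataclass
class DataMatrix:
    kind: str        # type of data, e.g. "Raw Coordinates" or "Centroid Size"
    variables: list  # variable names, one per column
    rows: list       # one list of values per observation


@dataclass
class Dataset:
    name: str
    matrices: dict = field(default_factory=dict)  # data type -> DataMatrix

    def add(self, matrix):
        self.matrices[matrix.kind] = matrix

    def select(self, kind):
        """Return the data matrix that holds the requested type of data."""
        return self.matrices[kind]


def centroid_size(landmarks):
    """Square root of the summed squared distances of landmarks from their centroid."""
    n = len(landmarks)
    dims = len(landmarks[0])
    centroid = [sum(p[d] for p in landmarks) / n for d in range(dims)]
    return math.sqrt(
        sum((p[d] - centroid[d]) ** 2 for p in landmarks for d in range(dims))
    )


# One hypothetical specimen with four 2D landmarks at the corners of a square.
wings = Dataset("wings")
wings.add(DataMatrix(
    "Raw Coordinates",
    ["x1", "y1", "x2", "y2", "x3", "y3", "x4", "y4"],
    [[0.0, 0.0, 2.0, 0.0, 2.0, 2.0, 0.0, 2.0]],
))

sizes = []
for row in wings.select("Raw Coordinates").rows:
    landmarks = [row[i:i + 2] for i in range(0, len(row), 2)]
    cs = centroid_size(landmarks)
    sizes.append([cs, math.log(cs)])

wings.add(DataMatrix("Centroid Size", ["Centroid Size", "Log Centroid Size"], sizes))
```

In this sketch the two variables of the Centroid Size matrix can each be read on their own, whereas the eight coordinate columns only make sense when taken together as one configuration of landmarks.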
Each project has a tree structure that reflects the dependencies among the datasets and analyses contained in the project. Analyses are attached to the datasets on which they are performed, and the derived datasets with results are in turn attached to the analyses.
The tree structure cannot reflect all the possible dependencies in a project, because some analyses such as regressions or partial least squares analyses can link datasets across the branches of the project tree. Those analyses are attached to one of the respective branches of the tree.
Because multiple datasets and analyses may be available in the project side by side and because the data can be changed (by including or excluding observations, or by editing classifiers and covariates), there is a danger that different analyses reflect different states of the data. MorphoJ addresses this problem by automatically updating the datasets and analyses that are affected by a change. Depending on the type of analysis, this can take some time (e.g. for permutation tests). Therefore, it is preferable to make changes to datasets (searching for outliers etc.) before running most of the analyses.
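The following Python sketch illustrates how a project tree with cross-branch links could propagate a change to everything that depends on it. It is only a conceptual model under stated assumptions: the class `Node`, the attribute `extra_inputs` and the function `mark_stale` are hypothetical names, and MorphoJ's actual update mechanism is not documented here.

```python
class Node:
    """A dataset or analysis in a project tree (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.extra_inputs = []  # datasets used from other branches of the tree
        self.stale = False
        if parent is not None:
            parent.children.append(self)


def dependents(node, all_nodes):
    """Children of a node plus any analyses that use it as a cross-branch input."""
    linked = [n for n in all_nodes if node in n.extra_inputs]
    return node.children + linked


def mark_stale(changed, all_nodes):
    """Flag the changed node and everything downstream; return names in visiting order."""
    order, stack, seen = [], [changed], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        node.stale = True
        order.append(node.name)
        stack.extend(reversed(dependents(node, all_nodes)))
    return order


# Hypothetical project: a PLS analysis sits in the skull branch
# but also depends on the mandible dataset.
project = Node("project")
skull = Node("skull", project)
pca = Node("PCA of skull", skull)
scores = Node("PC scores", pca)
mandible = Node("mandible", project)
pls = Node("PLS skull-mandible", skull)
pls.extra_inputs.append(mandible)
nodes = [project, skull, pca, scores, mandible, pls]

affected = mark_stale(mandible, nodes)
```

In this model, a change to the mandible dataset marks the PLS analysis for recomputation even though it is attached to the skull branch, while the PCA of the skull is left untouched. The more analyses hang below a dataset, the more work each change triggers, which is why editing the data first is cheaper.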