The DataSHaPER (Data Schema and Harmonization Platform for Epidemiological Research) is both a scientific approach and a suite of practical tools. Its primary aims are to facilitate the prospective harmonization of emerging biobanks, to provide a template for retrospective pooling, and to support the development of questionnaires and information-collection devices even when pooling data with other biobanks is not foreseen.
Its basic structure reflects a four-step approach to harmonization that is applied to each new harmonization action as it arises.
In the context of the DataSHaPER, the term "variables" refers to the primary units of interest in a statistical analysis (e.g. current smoker [yes/no], or body mass index as a quantitative trait). An important distinction is drawn between such variables and the specific "assessment items" that are collected by a particular study (e.g. questions in a questionnaire or individual physical measures). Crucially, it is variables, not assessment items, that are harmonized between studies. This is what makes harmonization both flexible and robust, because a given variable may be built from different assessment items in different studies.
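The distinction between variables and assessment items can be illustrated with a minimal sketch. The study labels, item names ("smoke_now", "cigs_per_day") and derivation rules below are hypothetical and are not taken from the DataSHaPER; they only show how one variable might be built from different assessment items in two studies.

```python
# Hypothetical sketch: one variable ("current smoker"), two study-specific derivations.
# All item names are invented for illustration.

def current_smoker_study_a(items: dict) -> bool:
    # Assumed: Study A asks directly "Do you currently smoke?" (yes/no).
    return items["smoke_now"] == "yes"


def current_smoker_study_b(items: dict) -> bool:
    # Assumed: Study B records cigarettes smoked per day;
    # any positive count is treated here as current smoking.
    return items["cigs_per_day"] > 0


# Two participants, one from each study, answering different questions.
participant_a = {"smoke_now": "yes"}
participant_b = {"cigs_per_day": 0}

smoker_a = current_smoker_study_a(participant_a)
smoker_b = current_smoker_study_b(participant_b)
```

Both functions return the same kind of value, so downstream analyses can treat "current smoker" identically regardless of which study supplied the data.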
Structurally, the DataSHaPER is a dynamically evolving entity with two primary components: the DataSchema Platform and the Harmonization Platform.
A DataSchema identifies and describes a thematic set of core variables that are of particular value in a specified scientific setting. In other words, it encapsulates step 1* of the harmonization process. Each DataSchema has a hierarchical structure: variables are nested in domains that are in turn grouped within themes.
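One way to picture this hierarchy is as nested records. The following is only a sketch under stated assumptions: the class names, field names, and the example themes, domains and variables are hypothetical and do not reproduce any actual DataSchema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Variable:
    name: str
    definition: str


@dataclass
class Domain:
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Theme:
    name: str
    domains: List[Domain] = field(default_factory=list)


@dataclass
class DataSchema:
    name: str
    themes: List[Theme] = field(default_factory=list)

    def variable_paths(self) -> List[Tuple[str, str, str]]:
        """List every variable as a (theme, domain, variable) path."""
        return [
            (theme.name, domain.name, variable.name)
            for theme in self.themes
            for domain in theme.domains
            for variable in domain.variables
        ]


# Hypothetical example content, invented for illustration only.
example = DataSchema(
    name="Example schema",
    themes=[
        Theme("Lifestyle", [
            Domain("Tobacco use", [
                Variable("current_smoker", "Current smoker (yes/no)"),
            ]),
        ]),
        Theme("Physical measures", [
            Domain("Anthropometry", [
                Variable("bmi", "Body mass index"),
            ]),
        ]),
    ],
)
```

Calling `example.variable_paths()` walks the themes, then the domains, then the variables, which mirrors the nesting described above.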
The DataSchema Platform contains a growing number of such DataSchemas, each with its own scientific purpose. The platform also contains associated support material including variable definitions, links to relevant ontologies, and access to reference questionnaires and operating procedures that have been selected or developed to reliably generate the variables in each DataSchema.
Each DataSchema in the DataSchema Platform is partnered with a corresponding Harmonization Unit that provides a foundation for harmonizing studies relative to that particular schema (harmonization step 2*).
At present, the Harmonization Platform is under construction. Ultimately, it will contain a growing number of Harmonization Units, each associated with its corresponding DataSchema. In the longer term, the Harmonization Platform will also address harmonization steps 3* and 4* by including direct links to algorithmic pairing rules and to IT platforms that enable data to be pooled.
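Since the Harmonization Platform is still under construction, the form that pairing rules and pooling will take is not specified here. Purely as an illustration, and assuming a pairing rule is a function that turns one study's assessment items into a schema variable, the idea might be sketched as follows. The rule table, study labels and item names are all hypothetical.

```python
# Hypothetical pairing rules: (study, variable) -> function of one participant's items.
# Study A is assumed to record metric units, Study B pounds and inches.
PAIRING_RULES = {
    ("study_a", "bmi"): lambda items: items["weight_kg"] / items["height_m"] ** 2,
    ("study_b", "bmi"): lambda items: (
        (items["weight_lb"] * 0.45359237) / (items["height_in"] * 0.0254) ** 2
    ),
}


def harmonize(study, records, variables):
    """Apply the study's pairing rules to each record, giving harmonized rows."""
    rows = []
    for items in records:
        row = {"study": study}
        for variable in variables:
            row[variable] = PAIRING_RULES[(study, variable)](items)
        rows.append(row)
    return rows


def pool(datasets, variables):
    """Harmonize each study's records and concatenate them into one dataset."""
    pooled = []
    for study, records in datasets.items():
        pooled.extend(harmonize(study, records, variables))
    return pooled


# Usage: two studies with different assessment items, pooled on one variable.
datasets = {
    "study_a": [{"weight_kg": 70.0, "height_m": 1.75}],
    "study_b": [{"weight_lb": 160.0, "height_in": 68.0}],
}
pooled = pool(datasets, ["bmi"])
```

Keeping the rules in a table keyed by study and variable means a new study can be added by registering its own rules, without changing the pooling code.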