The DataSchema Platform lies at the heart of the DataSHaPER. Each DataSchema in the platform addresses step 1* of the four-step approach to harmonization. A DataSchema is aimed primarily at providing a template for prospective harmonization between biobanks, i.e. a tool to help design biobanks and their information acquisition tools so as to optimize the potential for future pooling. By its very nature, however, when combined with its complementary Harmonization Unit, a DataSchema also provides a powerful template for the retrospective harmonization and pooling of data already collected by studies that were not prospectively harmonized. Finally, because it is based on variables selected according to carefully considered scientific criteria (see text box), a DataSchema also serves as a guide to help emerging biobanks and cohort studies select suitable information items and collection tools, even when data pooling with other studies is not foreseen.
A DataSchema contains a thematic set of core variables that can be constructed from assessment items derived from the questionnaires, physical measures, biochemical measures, or electronic registries associated with individual studies. In other words, each DataSchema represents a library of variables that are of significant relevance to the specific scientific area it addresses and that meet the criteria outlined in the table. Each DataSchema also contains supplementary descriptive information that complements the list of variables itself. This includes variable definitions, links to relevant ontologies, and access to reference questionnaires and operating procedures that have been selected or developed to reliably generate the variables in the DataSchema.
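The structure described above can be pictured as a small data model. The following is a minimal sketch in Python, not the platform's actual implementation: the class names, field names, and the example variable `current_smoker` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Variable:
    """One core variable in a DataSchema (hypothetical model)."""
    name: str
    definition: str
    # Kinds of assessment items the variable can be constructed from,
    # e.g. "questionnaire", "physical measure", "biochemical measure", "registry".
    sources: List[str] = field(default_factory=list)
    # Supplementary descriptive information (all fields are assumptions).
    ontology_links: List[str] = field(default_factory=list)
    reference_documents: List[str] = field(default_factory=list)


@dataclass
class DataSchema:
    """A thematic library of core variables for one scientific area."""
    title: str
    scientific_area: str
    variables: Dict[str, Variable] = field(default_factory=dict)

    def add(self, variable: Variable) -> None:
        # Variables are keyed by name, so each name appears once.
        self.variables[variable.name] = variable

    def variables_from(self, source: str) -> List[str]:
        # Names of variables that can be built from a given kind of source.
        return sorted(
            v.name for v in self.variables.values() if source in v.sources
        )


# Illustrative usage with an invented variable.
schema = DataSchema(title="Example DataSchema", scientific_area="baseline")
schema.add(
    Variable(
        name="current_smoker",
        definition="Whether the participant currently smokes (illustrative).",
        sources=["questionnaire"],
    )
)
print(schema.variables_from("questionnaire"))
```

Keying the variables by name keeps the list restricted to distinct entries, which fits the aim of a deliberately limited set of core variables.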
The core variables in each DataSchema form one component of a three-level nested hierarchy:
Criteria for selecting the individual variables in a DataSchema
At all times the aim is to restrict a DataSchema to an appropriately limited number of the most important variables. The number required must be tailored to the specific scientific purposes.
At present the DataSchema Platform contains two completed DataSchemas: the Generic baseline DataSchema and the CPT DataSchema, which has been developed for the Canadian Partnership for Tomorrow (CPT) project. Several other disease-specific DataSchemas are currently under development.
Generic baseline DataSchema
The "Generic baseline DataSchema" aims to support the construction of cross-sectional baseline questionnaires in general purpose biobanks enrolling middle-aged participants.
CPT DataSchema
The "CPT DataSchema" is aimed at supporting the harmonization of the five population-based cohorts that together compose the Canadian Partnership for Tomorrow (CPT) project.