
DataSchemas and the DataSchema Platform

The DataSchema Platform lies at the heart of the DataSHaPER. Each DataSchema in the platform addresses step 1 of the four-step approach to harmonization. A DataSchema is primarily a template for prospective harmonization between biobanks: a tool to help design biobanks and their information acquisition tools so as to optimize the potential for future pooling. By its very nature, however, when combined with its complementary Harmonization Unit, a DataSchema also provides a powerful template for retrospective harmonization and pooling of data already collected by studies that were not prospectively harmonized. Finally, because it is based on variables selected according to carefully considered scientific criteria (see the selection criteria below), a DataSchema also serves as a guide to help emerging biobanks and cohort studies select suitable information items and collection tools, even when data pooling with other studies is not foreseen.

A DataSchema contains a thematic set of core variables that can be constructed from assessment items derived from questionnaires, physical measures, biochemical measures, or electronic registries associated with individual studies. In other words, each DataSchema is a library of variables that are of significant relevance to the specific scientific area it addresses and that meet the selection criteria listed below. Each DataSchema also contains supplementary descriptive information that complements the list of variables: variable definitions, links to relevant ontologies, and access to reference questionnaires and operating procedures selected or developed to reliably generate the variables in the DataSchema.

The core variables in each DataSchema form one component of a three-level nested hierarchy:

1. Theme: General area of interest. Each theme subsumes one or more domains.
2. Domain: Risk factor or outcome of interest. Each domain subsumes a number of variables.
3. Variable: Primary unit of interest for a statistical analysis.
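
The three-level hierarchy can be pictured as nested records. The Python sketch below is illustrative only: the class names, field names, the helper `iter_variables`, and the example theme, domain, and variables are all assumptions made for this example, not the DataSHaPER's actual data model or content.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Variable:
    """Primary unit of interest for a statistical analysis."""
    name: str
    definition: str = ""  # supplementary descriptive information (assumed field)


@dataclass
class Domain:
    """Risk factor or outcome of interest; subsumes a number of variables."""
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Theme:
    """General area of interest; subsumes one or more domains."""
    name: str
    domains: List[Domain] = field(default_factory=list)


def iter_variables(themes: List[Theme]) -> Iterator[Tuple[str, str, str]]:
    """Yield (theme, domain, variable) name triples in hierarchy order."""
    for theme in themes:
        for domain in theme.domains:
            for variable in domain.variables:
                yield theme.name, domain.name, variable.name


# Hypothetical example content, not taken from any real DataSchema.
example_schema = [
    Theme("Lifestyle", [
        Domain("Tobacco use", [
            Variable("current_smoker", "Participant currently smokes tobacco"),
            Variable("age_started_smoking"),
        ]),
    ]),
]
```

Walking the structure with `iter_variables(example_schema)` gives one flat row per variable, each labeled with its theme and domain, which is a convenient shape for comparing the variables that different studies can generate.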

Criteria for selecting the individual variables in a DataSchema

  • The variable is of substantial relevance to the scientific area addressed by the DataSchema: a variable may be selected because it is of direct relevance in its own right, because it is a qualifying variable that influences the interpretation of other variables of direct interest, or because it is seen as a potentially important confounder or source of bias
  • Where relevant, each identified level of response in a categorical variable is of high enough prevalence to ensure that sufficient power can realistically be obtained for meaningful analysis
  • The assessment items required to generate the variable can be obtained validly and assessed reliably
  • The assessment items required to generate the variable can be collected in a manner that entails an acceptable impact on participants
  • The assessment items required to generate the variable can be collected at acceptable cost

At all times the aim is to restrict a DataSchema to an appropriately limited number of the most important variables. The number required must be tailored to the specific scientific purposes.

Available DataSchemas

At present the DataSchema Platform contains two completed DataSchemas: the Generic baseline DataSchema and the CPT DataSchema, which has been developed for the Canadian Partnership for Tomorrow (CPT) project. Several other disease-specific DataSchemas are currently under development.

Generic DataSchemas

Generic baseline DataSchema
The "Generic baseline DataSchema" aims to support the construction of cross-sectional baseline questionnaires in general purpose biobanks enrolling middle-aged participants.

Partner DataSchemas

CPT DataSchema
The "CPT DataSchema" is aimed at supporting the harmonization of the five population-based cohorts that together compose the Canadian Partnership for Tomorrow (CPT) project.
© 2005 Public Population Project in Genomics.
All rights reserved.