Digital Curation Processes at DataFirst
DataFirst is involved in the entire Data Curation Lifecycle to support the research process. Our Digital Curation Reference Model shows how we curate data for reuse and the other services we offer. The model also shows how DataFirst supports the virtuous cycle of reuse: we work with data depositors to improve the quality of their data products, based on feedback from researchers.
More detailed information on Digital Curation policies and procedures at DataFirst can be found in our Digital Curation Document.
Stage 1 in our Reference Model (Data Collection) is outside our workflows, as we do not collect data. We source data and documents from ongoing institutional partnerships and via the academic and policy research community. View the information for Depositors.
Stage 2. Managing Dataset Deposits
We refer to the dataset at this stage as the Deposit Dataset, depicted as A in our model. View the Depositors Data Description Form.
Collections Policy – DataFirst accepts deposits of unit record data from census or survey research, or administrative records.
Formats – DataFirst accepts data files in any format, but our preferred format is Stata.
Documentation – Background documentation helps support data re-use. Any documents pertaining to the data collection process should be deposited with the data files, e.g., questionnaires, codebooks, and reports.
Data Ownership – Depositors must ensure they have the rights to deposit data to be shared by DataFirst.
Stage 3. Assuring Data
Data Checks and Cleaning
All datasets deposited with us undergo quality checks to confirm the accuracy and usability of the data. Anomalies in data files and documents are corrected in consultation with Depositors. Errors and corrections are recorded as Data Quality Notes in the metadata for each dataset.
Disclosure Control
Assuring data involves undertaking disclosure control (anonymisation) to ensure the final shared data files do not contain personal information that could be used to identify individuals. View our DataFirst Disclosure Control Flowchart.
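As a rough illustration of what a disclosure control check can involve, the sketch below drops direct identifiers and then counts the smallest group of records that share the same combination of indirect identifiers. It is not the procedure in the DataFirst Disclosure Control Flowchart; the function names, field names, and records are all hypothetical and made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List

Record = Dict[str, str]


def drop_direct_identifiers(records: Iterable[Record], identifiers: List[str]) -> List[Record]:
    """Return copies of the records without the named direct-identifier fields."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


def smallest_group_size(records: List[Record], quasi_identifiers: List[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values.

    A small value means some respondents may still be singled out.
    Returns 0 when there are no records.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


# Made-up example records; the field names are assumptions, not a DataFirst schema.
records = [
    {"name": "A. Person", "id_number": "0001", "province": "WC", "age_group": "30-39"},
    {"name": "B. Person", "id_number": "0002", "province": "WC", "age_group": "30-39"},
    {"name": "C. Person", "id_number": "0003", "province": "GP", "age_group": "40-49"},
]

shared = drop_direct_identifiers(records, ["name", "id_number"])
k = smallest_group_size(shared, ["province", "age_group"])
print(shared[0])
print(k)
```

In this made-up sample the third record is alone in its province and age group, so the smallest group size is 1. A curator would typically respond by recoding or suppressing values before sharing the file.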
Versioning
Data files with data quality changes receive new version numbers. File naming and versioning follow the Data Documentation Initiative (DDI) standard. DataFirst versions at file level as well as at dataset level, so individual data files within a dataset may not have the same version numbers. The version number of the dataset will be that of the latest data file. Notes on this are included in the metadata for new versions. The advantage of this approach is that researchers only need to download and recheck the files that have changed, not the files that are unchanged. View our data preparation flow diagram.
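The file-level versioning rule above can be sketched in a few lines. This is a minimal illustration, not DataFirst's tooling: it assumes version strings of the form "v1.2", reads "latest" as the highest version number, and uses hypothetical file names and function names.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as 'v1.2' or '1.2' into (1, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("vV").split("."))


def dataset_version(file_versions: Dict[str, str]) -> str:
    """Return the highest file version, taken here as the dataset version."""
    return max(file_versions.values(), key=parse_version)


def files_to_download(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List files that are new or whose version differs between two releases."""
    return sorted(name for name, version in new.items() if old.get(name) != version)


# Hypothetical releases: only the person file changed in the second release.
release_1 = {"household.dta": "v1.0", "person.dta": "v1.0"}
release_2 = {"household.dta": "v1.0", "person.dta": "v1.1"}

print(dataset_version(release_2))
print(files_to_download(release_1, release_2))
```

Comparing versions as tuples of integers rather than as strings keeps "v1.10" ahead of "v1.9", which plain string comparison would get wrong.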
Stages 4 and 5. Describing the Data and Documentation (Metadata Creation)
Extensive provenance and usage information is created for each dataset in our collection. This metadata is created according to the DDI data description standard using free Nesstar markup software available from the World Bank’s International Household Survey Network.
Stage 6. Preserving (Archiving) Datasets
An archival version of all iterations of each dataset is retained by DataFirst. Archival copies are securely preserved and migrated as technology changes, to ensure they are always accessible. The archived copy is called the Preservation Dataset shown as B in our model.
Stage 7. Disseminating Datasets
Stage 7 is where we make datasets available for reuse. The dataset at this stage is referred to as the Dissemination Dataset, depicted as C in our reference model. DataFirst disseminates datasets under Creative Commons open access and use licenses, namely:
Creative Commons Attribution (CC BY) License, which allows any use as long as the user cites the data producer and DataFirst as the data distributor, and sends us a link to any publications based on the data. The data should be cited according to our recommended citation.
Creative Commons Attribution-ShareAlike (CC BY-SA) License, as above, but this license requires the user to distribute any work based on the data under the same license as the original.
Creative Commons Attribution-NonCommercial (CC BY-NC) License, as above, but data access and use must be for non-commercial purposes only.
Stage 8. Support for Research Data Management
We support researchers, research projects and government agencies to manage and share their data or deposit data with us for sharing. Read about our Data Curation Training Workshops or contact us for further information.
Stage 9. Support Data Analysis
DataFirst gives training in Data Analysis to build quantitative skills in South Africa and other African countries. Read about our training courses and workshops.
Stage 10. Support Data Citation
We help researchers cite our datasets in publications that use the data. The metadata for each dataset includes a Citation field which shows how to cite the dataset. Read our advice on how to cite data according to international citation standards.
Stage 11. Track Data Usage
We keep track of research publications based on our data and link these to the relevant datasets. You can see works citing our datasets from the “Related Publications” tab on the landing page for each of our datasets. We also report data usage statistics to Depositors.
Stages 12 and 13. Data Quality Feedback
These stages depict how we utilise the “virtuous cycle of open data” to advance the quality of the data and metadata. In these stages, we draw on user feedback and consult with depositors to fix data errors or document data issues. Researchers may produce secondary datasets based on their analysis of our datasets and this type of dataset is depicted as D in our reference model.