Postgraduate Data Management Plan

This template will help you to plan how you will manage your data, code and documentation throughout your research project. You might not be able to fill it out comprehensively at first; if so, you can treat the last question as your ‘to do’ list, reminding you of gaps you need to fill in as things become clearer.

Each question is accompanied by guidance and examples to help you answer it. Further guidance is available from the Research Data Service web pages: http://www.bath.ac.uk/research/data/. Write as much or as little as you feel you need to answer the question. Once you are happy with your answer, delete the accompanying guidance and examples. You should also delete this introductory text before submitting your plan for evaluation.

1  Overview

1.1  Project name

1.2  Plan author

1.3  Project description

Provide two or three sentences summarising your project's research questions and data needs.

2  Compliance

2.1  With what legislative, contractual and policy requirements must the project comply?

Provide a high-level summary of what you will need to achieve while managing your data, and a list of the documents from which these requirements come. You do not need to provide the practical details of how you will do these things until later in the plan.

If you will be working with personal data, you will need to take measures to comply with the Data Protection Act. Consult the University's guidance to find out what this will mean in practice for your research: http://www.bath.ac.uk/data-protection/guidance/

University policies that might be relevant to your project are listed below.

•  University of Bath Research Data Policy: http://www.bath.ac.uk/research/data/policy/research-data-policy.html

•  University of Bath IT Security Policy: http://www.bath.ac.uk/bucs/aboutbucs/policies-guidelines/policies-it-security.html

•  University of Bath Information Classification Framework: http://www.bath.ac.uk/university-secretary/guidance-policies/information_security.html

The policies of major funders may be found here: http://www.bath.ac.uk/research/data/compliance/funders/

Examples:

•  Informed consent must be obtained from participants for data to be retained, shared, and used for new purposes.

•  Access to the data must be restricted to my supervisor and me.

•  The data underlying published results must be kept for at least ten years.

•  All published papers must include a data access statement.

•  The project is sponsored by an industrial partner and is covered by a collaboration agreement and my studentship agreement. Under the terms of these agreements, any data relating to the industrial partner must be checked and approved by them prior to being shared.

•  Sources: University of Bath Research Data Policy (link); EPSRC Policy Framework on Research Data (link).

3  Gathering data

3.1  What data will the project require?

Give a brief description of the data you will need to gather in order to answer your research questions. You do not have to go into fine detail, particularly if you will be dealing with a wide variety of data, but the assessor should be able to see what sorts of data you will be dealing with (e.g. qualitative or quantitative, output by a device or manually compiled).

In what formats will your data files be? If you will be using proprietary software to work with them, would you still be able to access the data if you no longer had access to that software? If not, consider some alternative formats you might use instead or as well. The UK Data Service has a list of recommended file formats for commonly used file types: http://ukdataservice.ac.uk/manage-data/format/recommended-formats.aspx.

How much data will you need to gather? For digital data, try to give this in terms of bytes (MB, GB, TB). You may find it easiest to calculate this in terms of the number of files of each type you expect to gather, and how big they are on average; or in terms of the number of experiments/observations you intend to perform and their average data output. For non-digital data, consider how many notebooks or filing cabinet drawers you might need.

Examples:

•  I expect to record 30 one-hour interviews; these will be stored as MP3s of about 60 MB each. I will transcribe these into Microsoft Word (.docx) documents of about 100 KB each.

•  I will require SEM images correlated with environmental sensor readings. The images will be saved in uncompressed TIFF 6.0 format. The sensor data will be saved in the proprietary format of the instrument then converted to CSV. I expect to run about 100 experiments, generating about 1 GB each.

•  I expect my consent forms to fill one ring binder and completed questionnaires to fill two filing cabinet drawers.

•  I will start with an initial conditions dataset of about 10 GB in Microsoft Excel (.xlsx) format. I will then run 3-5 simulations, each of which will generate 4 TB of temporary data, of which I will only retain summary data amounting to a 100 GB CSV file.
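The 'number of files times average size' calculation can be sketched in a few lines of Python. This is only an illustration: the figures are those of the interview example above, decimal units are assumed (1 MB = 1000 KB), and the function names are made up for this sketch.

```python
# Rough storage estimate: number of files multiplied by average file size.
# Decimal units are assumed here (1 MB = 1000 KB, 1 GB = 1000 MB).

KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB


def total_bytes(items):
    """Sum count * average size over (label, count, avg_size_bytes) tuples."""
    return sum(count * size for _label, count, size in items)


def human_readable(n_bytes):
    """Format a byte count using the largest decimal unit that fits."""
    for unit, factor in (("TB", 1_000 * GB), ("GB", GB), ("MB", MB), ("KB", KB)):
        if n_bytes >= factor:
            return f"{n_bytes / factor:.2f} {unit}"
    return f"{n_bytes} B"


# Figures taken from the interview example above.
interview_study = [
    ("MP3 recordings", 30, 60 * MB),
    ("Word transcripts", 30, 100 * KB),
]

estimate = total_bytes(interview_study)
print(human_readable(estimate))  # prints "1.80 GB"
```

An estimate like this is usually dominated by one file type (here the audio), so it is worth checking that figure most carefully.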

3.2  How will these data be gathered?

Are the data you need already available from elsewhere: from literature, from online data repositories, or from other researchers’ websites? It is good practice to show in your plan that you have checked.

Will you gather your data from experiments, observations or simulations? What equipment or instruments will you use? How will you capture observations or conversations?

Examples:

•  I will record interviews with my participants using a digital audio recorder, then transcribe them into text.

•  I will test my catalyst under a number of conditions, then submit samples of the products to analysis facilities.

•  I will generate data using model code that I’ve written, then process the data in various ways to produce visualisations.

•  I will take high-resolution digital photographs of artefacts recovered in the field, and send some samples off for analysis.

•  I will combine existing data from sources such as ... and re-analyse them to derive new conclusions.

3.3  What original software, if any, will the project create?

Give a brief description of any scripts, libraries, plug-ins, software tools or applications you plan to develop as part of your research. What programming language do you intend to use? How will you handle dependencies of your code? To what level of quality will you develop your code:

•  Sufficient to allow your own workflow to be re-run?

•  Sufficient to support variant workflows in other projects?

•  Formal software product with successive version releases?

How will you ensure that level of quality is achieved?

Examples:

•  I will not develop any original software.

•  I will automate my analysis workflow with a script written in Python v3. I will install any third-party dependencies from PyPI and note the version numbers I used in inline code comments.

•  To facilitate translation and analysis of the instrument data, I will write a reusable C++ library developed and documented using CWEB.
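As an alternative to noting version numbers by hand in code comments, the installed versions of third-party Python packages could be recorded automatically. The sketch below uses the standard library's importlib.metadata; the helper names and the package list are hypothetical and should be replaced with whatever your own script imports.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


def as_requirements(versions):
    """Render pinned 'name==version' lines, skipping packages not installed."""
    return [
        f"{name}=={ver}"
        for name, ver in sorted(versions.items())
        if ver is not None
    ]


if __name__ == "__main__":
    # Hypothetical dependency list; replace with the packages you actually use.
    found = installed_versions(["numpy", "pandas"])
    print("\n".join(as_requirements(found)))
```

Saving the printed lines alongside the script gives a record of the exact versions used, in a form that pip can read back.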

4  Working with data

4.1  Where and how will the data be stored?

Choose a storage solution that will keep your data safe (from theft, accidental loss, and corruption) and secure (from unauthorised access), and that can accommodate all the data you outlined above. For digital data, you should use the University’s managed data storage (the X Drive) as your primary data storage area unless you have a good reason not to (e.g. insufficient capacity).

How will your data be backed up? With the University’s managed data storage, this is handled by Computing Services. If you are reliant on standalone computers and external drives, you will need to put your own backup strategy in place. Follow the 3-2-1 rule: at least 3 copies, on at least 2 different types or makes of medium, with at least 1 kept physically distant from the others. Perform a test restore on a regular basis to ensure your strategy is working.
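One way to check a test restore is to compare checksums of the original files with those of the restored copy. The following is a minimal sketch, assuming both copies are reachable as ordinary folders; the function names are illustrative and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(original_root, restored_root):
    """Return (missing, changed): files absent from, or altered in, the restore."""
    original = checksum_manifest(original_root)
    restored = checksum_manifest(restored_root)
    missing = sorted(set(original) - set(restored))
    changed = sorted(
        name
        for name in original
        if name in restored and original[name] != restored[name]
    )
    return missing, changed
```

If both returned lists are empty, every file in the original folder was restored with identical contents.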

With non-digital data, consider taking copies as backups: either physical copies to be kept in a separate building (but equally securely), or digital copies.

Examples:

•  My primary copy is on the University’s managed data storage (the X Drive), to which both my supervisor and I have access. The X Drive is backed up daily by Computing Services. When working away from a secure and reliable network connection, I will synchronise the files I need between the X Drive and my local hard drive beforehand and immediately afterwards.

•  My participants’ responses will be kept in a locked drawer within my supervisor’s locked office. For the purposes of backup, I will scan the responses and keep the digital copies in an encrypted folder on the X Drive.

•  A raw copy of my data will be stored temporarily at the facility where I will run my experiments. Once I have finished processing them, my summary data will be transferred securely to the X Drive, and the data at the facility deleted.

•  Most of my data are stored in my supervisor’s area of the X Drive, but data from my statistical modelling will be stored by my CDT at the University of Bristol.

4.2  How will access be controlled?

During your project, it is usually a good idea to keep your data private so you can publish your results before others have a chance to do so. You need to be especially cautious if you are working with personal, confidential, or environmentally sensitive data, as the consequences of unauthorised access could be severely damaging.

Who needs access to the data? Normally this will be you and your supervisor, but may include members of your research group or collaborators outside it. Can they all access your secure storage? If not, describe how you will transfer data securely to and from them.

Examples:

•  Only my supervisor and I will have access to my data during the project. We will have the only copies of the key to the locked filing cabinet, and the decryption password to the encrypted folder.

•  Others in my research group and my supervisor’s industrial partners will need to see some of my data. I will arrange for the industrial partners to have temporary Computing Services accounts for accessing the appropriate area of the X Drive.

•  Data will be exchanged as needed with the project team at Exeter. Data will be transferred to them via files.bath, and received back through their institutional OneDrive for Business. In both cases the transfers will be password protected (using previously agreed passwords) and use encrypted connections.

4.3  How will the data be organised?

What folder structure will you use within your storage area? If different files have different access permissions, your first set of folders should reflect the different access groups. Beyond that, your folders should group together files by the task they support (e.g. work package/task numbers), the subject of the data (e.g. sample number, run/survey number, company name), or the type of data (e.g. raw, derived).

How will you name your folders and files? Pick some elements to include (e.g. date, subject code, type) that will help you to navigate through lists and recognise the file you want. You do not have to use exactly the same pattern for all your files, but you will find it easier if you always write elements the same way and in the same order. Note that dates in YYYYMMDD or YYYY-MM-DD format will sort chronologically.
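To illustrate why a fixed element order and YYYYMMDD dates help, here is a small Python sketch. The pattern (date, subject code, type, joined with underscores) and the subject codes are assumptions made for the example, not a required convention.

```python
from datetime import date


def data_filename(collected, subject_code, data_type, extension):
    """Build '<YYYYMMDD>_<subject>_<type>.<ext>' with elements in a fixed order."""
    return f"{collected:%Y%m%d}_{subject_code}_{data_type}.{extension}"


# Hypothetical files, deliberately listed out of date order.
names = [
    data_filename(date(2024, 11, 5), "P07", "interview", "mp3"),
    data_filename(date(2024, 2, 19), "P03", "interview", "mp3"),
    data_filename(date(2023, 12, 1), "P01", "interview", "mp3"),
]

# A plain alphabetical sort now matches the order of collection.
for name in sorted(names):
    print(name)
```

Generating names with a function like this, rather than typing them, also guarantees that each element is always written the same way.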

If you are likely to have several versions of some of your files, how will you keep track of which version is which? An effective solution on a small scale is to append a numeric version number to the end of the filename. At larger scales, it is better to use a version control system such as Git or Mercurial.
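For the small-scale approach, appending the version number can be automated so that the suffix is always written the same way. This sketch assumes a two-digit '_vNN' suffix before the file extension and, as a further assumption, gives a file with no suffix the version '_v01'; the function name is illustrative.

```python
import re
from pathlib import Path

# Matches a file stem ending in '_v' followed by exactly two digits.
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<number>\d{2})$")


def next_version_name(filename):
    """Return the filename with its '_vNN' suffix incremented, or '_v01' added."""
    path = Path(filename)
    match = VERSION_PATTERN.match(path.stem)
    if match:
        stem, number = match["stem"], int(match["number"]) + 1
    else:
        stem, number = path.stem, 1
    return f"{stem}_v{number:02d}{path.suffix}"


print(next_version_name("survey_v03.csv"))  # prints "survey_v04.csv"
```

Zero-padding the number keeps versions in order when files are sorted by name.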

Examples:

•  I use the structure <experiment>/<date>/<reagent>-<replicate-number>.

•  A folder for each project phase, and within those a folder for each interview.

•  I use folder names to organise the data, and then the equipment/model automatically numbers all files created within that folder.

•  Each filename starts with the date on which the data were collected in YYYYMMDD format.

•  As I survey new cohorts, data are appended to the dataset and saved as a new file. The version number is appended to the filename in the form ‘_v00’.

•  I will use a Git repository hosted on the University's GitHub to manage the code that I write.

4.4  What documentation will accompany the data?

What would you need to know to reproduce the data or to write up your methods and results at the end of your study? If someone else in your lab or a reader of your papers wanted to replicate your analyses, what would they need to know? If you have used abbreviations or codes in your data, how will others know what they mean?

This type of detail is particularly important to record because it is often glossed over in published outputs, where the general method and conclusions are more important than the fine detail.

Once you’ve decided what information should be recorded, you should think about how best to associate it with the data. You may be able to embed this information in the data or files, but it is generally easiest to record information in a ‘readme’ file that you store with your data. You could think about setting up a template to make this quicker for new data.
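A readme template could be as simple as a text file with fixed headings, or it could be filled in by a short script. The sketch below shows the second option in Python; the headings and the function name are assumptions chosen for illustration, so adapt them to what your own data need.

```python
from datetime import date
from string import Template

# Hypothetical headings; adjust to suit your discipline and data.
README_TEMPLATE = Template("""\
Title: $title
Creator: $creator
Date collected: $collected
Method: $method
Files: $files
Abbreviations and codes: $codes
""")


def make_readme(title, creator, collected, method, files, codes):
    """Fill in the template; 'codes' maps each abbreviation to its meaning."""
    code_lines = "; ".join(f"{key} = {meaning}" for key, meaning in codes.items())
    return README_TEMPLATE.substitute(
        title=title,
        creator=creator,
        collected=collected.isoformat(),
        method=method,
        files=", ".join(files),
        codes=code_lines,
    )


text = make_readme(
    title="Pilot survey",
    creator="A. Student",
    collected=date(2024, 3, 1),
    method="Online questionnaire",
    files=["20240301_pilot.csv"],
    codes={"NR": "no response"},
)
print(text)
```

The returned text can then be saved as a readme file in the same folder as the data it describes.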