Research data management hands-on activities
Research data management from conception to publication
Background
Procedures
Case study
Store and backup data
Activity 1 -- Ensure that I always have local copies
Activity 2 -- Give myself a 99.999999% assured backup
Describe, document and organize data
Activity 3 -- Automatically share project documents, code and data across devices and collaborators
Activity 4 -- Track versions of code and data
Share data with colleagues
Activity 5 -- Make my code and selected data available for others to use
Activity 6 -- Create a data citation
Activity 7 -- Get a PURL to your data
Plan for data management
Activity 8 -- Data Management Planning: Use the DMPTool to write a data management plan
Research data management from conception to publication
Background
Ensuring that researchers do not lose data means employing data storage methods that 1) use multiple storage media, 2) employ bit-checking software to ensure data is not corrupted and 3) adhere to desired security and privacy objectives. This exercise takes us through the configuration and use of several example storage environments so you can select the best approach for your own data storage needs. By the end you should:
1. Be able to identify types of storage environments (e.g. local disk, cloud-sync, disk snapshots/backups)
2. Understand the technical and policy issues surrounding the choice of storage environments
3. Be prepared to create your own data storage environment.
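The "bit-checking" idea mentioned above can be sketched in a few lines of Python: record a checksum manifest when data is stored, then re-verify it later to detect silent corruption. This is a minimal illustration of the technique, not any particular product's implementation; the file names are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(folder: Path) -> dict:
    """Map each file in a folder to its checksum at backup time."""
    return {p.name: sha256_of(p) for p in folder.iterdir() if p.is_file()}

def verify(folder: Path, manifest: dict) -> list:
    """Return the names of files whose current checksum no longer matches."""
    return [name for name, digest in manifest.items()
            if sha256_of(folder / name) != digest]

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        data = Path(d) / "data_file.txt"
        data.write_text("This is my data file. It has version 1.0.")
        manifest = build_manifest(Path(d))
        print(verify(Path(d), manifest))    # nothing corrupted yet: []
        data.write_bytes(b"corrupted bits")  # simulate bit-rot
        print(verify(Path(d), manifest))    # ['data_file.txt']
```

Real storage systems run checks like this automatically on a schedule; the point is simply that fixity is a comparison between a stored digest and a fresh one.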
Procedures
For the purpose of this exercise we will work with three types of storage environments: local disk, cloud-synchronization services, and disk/file versioning and backup systems. The following table outlines the features and limitations of each type of service.
Type of service: Local disk
Examples: Hard drive, flash drive, external USB disk
Benefits: High-speed I/O, local availability, easy to secure physically
Limitations: Most susceptible to "bit-rot," risk of data loss if the physical device is stolen, few services to manage versions

Type of service: Cloud-sync service and network drives
Examples: Google Drive, Box, Dropbox, Amazon S3, network shares
Benefits: High replication, built-in bit-checking tools (sometimes), cross-device availability, built-in sharing and security features
Limitations: Security and IRB issues are higher risk with network availability, not a true backup unless implemented in a certain way, may require network connectivity for access

Type of service: Disk or file version/backup systems
Examples: Time Machine (OS X), File History (Windows 8), GitHub, Exversion
Benefits: Provides true snapshots that can be recovered/restored, can be automated, may include descriptive metadata, can preserve files as a group enabling batched recovery or access
Limitations: Some solutions rely on local disk while others use a cloud service, most technically complex to set up and administer, considerable overhead associated with proper administration
Case study
I have a set of data derived from a database that is not considered to be "protected" in the IRB sense of the word but came with licensing restrictions from the database vendor who runs the resource. The agreement I signed allows me to publish derivative analyses of the data but not the core dataset itself. In order to do the analysis I am designing my own algorithm using a set of programming scripts. My data management approach for this project is to 1) ensure I always have a copy of my source data, 2) track versions of my python script code alongside versions of the code output and 3) be able to publish the code and excerpts of the dataset that the code can process so that others can reproduce and build on my work. In addition, I want to be able to easily access these files from multiple machines and share the code and source files with my collaborators.
I happen to know a lot about my storage options and want to experiment with a number of solutions, using different options for each step in the process. From an article on hard drive failure rates (http://bit.ly/hddfailure), for example, I know that the Annualized Failure Rate (AFR) for hard drives is around 1.4% on average and that as drives age the chance that they fail climbs considerably, with AFRs growing to 8.6% in year 3 (http://bit.ly/hddfailure_afr). Considering my compliance requirements, my cost sensitivity and my interest in being 'in the cloud' as much as possible, I decide to set up the following data management infrastructure:
Step / Service selected / Activity / Why?
Store and back up data
-- Ensure that I always have local copies / Time Machine on my Mac / Activity 1 / Easy to set up, fast, and cheap storage that is roughly 92% reliable (see the articles cited above)
-- Give myself a 99.999999% assured backup / Amazon S3 service / Activity 2 / S3 is not as intuitive as Google Drive or Dropbox but it does automatic version control and bit-fixity checking
Describe, document and organize data
-- Automatically replicate the current version of project documents, code and data across devices and collaborators / Google Drive / Activity 3 / Google Drive is easy to use and cross-device compatible. It includes version control and allows me to set detailed permissions.
-- Track versions of code and data / GitHub / Activity 4 / GitHub performs true version control over code and forces me to create good metadata. I can even leverage the GitHub wiki and other collaboration features.
Share data with colleagues
-- Make my code and selected data available for others to use / GitHub / Activity 5 / All I have to do is make the right version of the data public!
-- Create a data citation / Dataverse / Activity 6 / Keeps my data someplace persistent even after I delete or abandon my GitHub account
-- Get a PURL to my data / EZID / Activity 7 / I have my data in GitHub or some other repository but I need a permanent link to it
Plan for data management
-- Create a data management plan / DMP Tool / Activity 8 / I need a tool to help me create a data management plan for a grant submission
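The failure-rate figures cited above translate directly into the value of keeping independent copies. Here is a back-of-the-envelope sketch; it assumes drive failures are statistically independent, which real-world correlated failures (same batch, same power surge, same office fire) can violate, so treat the numbers as an upper bound on safety.

```python
# AFR figures from the hard-drive articles cited above.
afr_avg = 0.014    # ~1.4% average annualized failure rate
afr_year3 = 0.086  # ~8.6% AFR for drives in their third year

def annual_loss_probability(afr: float, copies: int) -> float:
    """Chance of losing ALL copies in a year, assuming independent failures."""
    return afr ** copies

for copies in (1, 2, 3):
    p = annual_loss_probability(afr_year3, copies)
    print(f"{copies} independent cop{'y' if copies == 1 else 'ies'} "
          f"on year-3 drives: {(1 - p) * 100:.4f}% chance of survival")
```

One aging drive gives you roughly the "92% reliable" figure from the table; a second independent copy already pushes survival above 99%, which is why the plan pairs local Time Machine snapshots with a cloud copy.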
Store and backup data
Activity 1 -- Ensure that I always have local copies
While setting up Time Machine or Windows File History is not too difficult, it is hard to do in this workshop given our lack of external drives! Try out the OS X tutorial (http://support.apple.com/kb/ht1427), the Lifehacker article on Windows backup options (http://bit.ly/windowsfilehistory) or this Ubuntu guide to backups in Linux (https://help.ubuntu.com/community/BackupYourSystem).
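If you cannot run Time Machine or File History here, the core idea behind both tools, a timestamped snapshot of a folder, can be sketched with the Python standard library. This is a toy model for intuition, not a replacement for a real backup tool; the folder names are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    dest = backup_root / stamp
    shutil.copytree(source, dest)  # full copy of the folder tree
    return dest

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        src = Path(d) / "project"
        src.mkdir()
        (src / "data_file.txt").write_text("This is my data file.")
        backups = Path(d) / "backups"
        backups.mkdir()
        print(snapshot(src, backups))  # e.g. .../backups/2013-11-19_101500
```

Real tools like Time Machine add hard-link deduplication (compare rsync's --link-dest), so unchanged files cost no extra space across snapshots; this sketch naively copies everything each time.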
Activity 2 -- Give myself a 99.999999% assured backup
While I am confident in my everyday Time Machine backup, I also want to make sure that I have certain files stored in a secure and positively backed-up space. To do this I am going to manually upload my dataset into Amazon S3. Amazon S3 (Simple Storage Service) stores files in a system with some important features, including 1) automatic file replication across storage media and geographic regions, 2) bit-fixity checking to make sure the file you upload is the file you download and 3) file version history (versioning), along with access logging. We will set each of these services up.
1. Login to Amazon using the credentials for this class (note all files and credentials will be deleted at the end of the workshop)
- Login URL: https://researchbench.signin.aws.amazon.com/console
- User: ask instructor
- Password: ask instructor
2. Once you log in you should see the main console.
3. Click on "S3"
The S3 console allows you to create file folders, called buckets in S3, and to upload files one by one. There are also programmatic tools in the Amazon AWS API toolkit to automate this, as well as S3-integrated file transfer programs like Cyberduck (see below).
4. Click on "Create bucket" and name the bucket with the convention yyyy_dlab_workshop_yourinitials (e.g. 2013_dlab_workshop_etm).
5. Click on "Set Up Logging" and fill out the logging setup using the bucket you created as your target bucket.
6. Click "Create"
7. Look through the properties screen once the bucket has been created and see if you can answer the following questions.
8. Questions:
- Can I turn my bucket into a website?
- What is Amazon Glacier and how does it relate to S3?
- What permission options do I have?
- What is logging?
9. Let's turn on versioning so that we can track changes to our files over time. If you are not already looking at the properties for your bucket click on the "Properties" button and click on "Versioning." Click on "Enable Versioning" and click ok.
10. Before we upload a file, let's create one. Open your favorite text editor and create a text file with the content "This is my data file. It has version 1.0." Save the file with the file name data_file.txt
11. Let's upload a file to our bucket. Click on your bucket name on the left hand side of the screen. Once in your bucket click on "Actions > Upload."
12. Click on "Add Files" and add the data file you just created
13. Click on "Start Upload"
14. Once your file is uploaded, take a look at it. Right click on the file name and choose "Download" or "Open."
15. Let's look at the file properties as well. Right click on your file again and choose properties. This opens the properties window up on the right hand side of the screen.
16. Look through the properties and note the details in the table.
AWS Feature / Function
Storage Class / Storage class switches between standard and reduced redundancy storage. Reduced redundancy storage is only 99.99% durable (e.g. 1 object in 10,000 may become corrupted) while standard redundancy is 99.999999999%.
Server Side Encryption / Encrypts the file before it stores it on disk. This ensures that your files are encrypted on Amazon servers - not as secure as client side encryption
Permissions / Allows you to share or make public specific files in a bucket
Metadata / Shows the metadata tags assigned to your object. Many of these tags are considered "actionable" by S3 meaning that Amazon takes action based on their presence and value
17. You can add metadata tags to your buckets and file that make them easier to work with. For example you may want to tag the files with your research project name.
18. To add a research project name tag to your file click on the metadata stanza in properties and type in your tag name (research_project) and tag value (D-Lab workshop). It is worth noting that this was broken in Safari when I did it!
19. You can also add multiple versions of a file. Edit your file to change the version number in the text to "2.0."
20. Upload the file again using the same process as above. Amazon S3 should detect that you are adding the same file by matching file name.
21. With your new file uploaded you can click on the "Transfers" button and see a history of file actions (the screenshot below contains more file actions than we have covered in this worksheet)
22. You can also see all of the file versions by clicking on the "Show" button next to versions
23. Using the "Download" function in the right click menu for the file make sure you can retrieve both version 1.0 and version 2.0 of your file.
24. Questions:
- Amazon S3 supports server-side encryption; where can you set this property?
- How could you compare the persistence of a file on Amazon S3 versus one stored on your local hard drive?
- Is storing data on Amazon S3 a breach of IRB protocol or University policy?
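One concrete way to approach the comparison question above: for a single-part (non-multipart) upload, the ETag that S3 shows in the object's properties is the MD5 digest of the file, so you can checksum your local copy and compare it against what S3 reports. Fetching the ETag itself would need an API call or the console; the local half of the comparison is sketched here.

```python
import hashlib

def local_etag(path: str) -> str:
    """MD5 hex digest of a file -- this is what S3 reports as the ETag
    for a single-part (non-multipart) upload of the same bytes."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    import os, tempfile
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("This is my data file. It has version 1.0.")
    # Compare this digest with the ETag shown in the S3 properties pane;
    # a mismatch means the local and remote copies have diverged.
    print(local_etag(path))
    os.remove(path)
```

Note that multipart uploads produce a different, composite ETag, so this check only applies to files uploaded in one piece (as in this activity).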
25. While the Amazon S3 web service is OK, you can also manage the service using any number of tools. We will familiarize ourselves with "Cyberduck."
26. Download cyberduck from http://cyberduck.ch/
27. Launch Cyberduck and click on "Open Connection"
28. Select Amazon S3 from the option list and use the following credentials (Note - these credentials will only work on 11/19/2013)
- Access Key ID: Ask instructor or create your own
- Secret Access Key: Ask instructor or create your own
29. Browse the file structure in CyberDuck and take notice of available features. In general you should find that CyberDuck allows you to upload files pretty easily but also makes it a bit more difficult to look at file properties.
30. Check out the bucket properties by right clicking on the bucket and choosing "Info"
31. Let's update our data file to version 3.0 and upload it. Edit your text file and click on "Action > Upload." Did it work? In my experience Cyberduck throws some errors but does upload the file.
32. Ready for some advanced work? Try the "Action > Synchronize" activity. What happens?
33. As you can see, turning on logging is both good and bad: logs are great, but the AWS S3 logs are so detailed they can be overwhelming.
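The versioning behavior you enabled in this activity, where every overwrite gets a new version and older versions remain retrievable, can be modeled in a few lines to build intuition for what the "Show" versions button displays. This is a toy model, not the S3 API; the class and method names are invented for illustration.

```python
class VersionedBucket:
    """Toy model of an S3 bucket with versioning enabled."""

    def __init__(self):
        self._objects = {}  # key -> list of file contents, oldest first

    def put(self, key, body):
        """Store a new version under `key`; return its version number.
        An overwrite never destroys earlier versions."""
        self._objects.setdefault(key, []).append(body)
        return len(self._objects[key]) - 1

    def get(self, key, version=None):
        """Return the latest version by default, or a specific older one."""
        versions = self._objects[key]
        return versions[-1] if version is None else versions[version]

bucket = VersionedBucket()
bucket.put("data_file.txt", "This is my data file. It has version 1.0.")
bucket.put("data_file.txt", "This is my data file. It has version 2.0.")
print(bucket.get("data_file.txt"))             # latest: the 2.0 text
print(bucket.get("data_file.txt", version=0))  # the 1.0 upload is still there
```

This mirrors what you saw in steps 19-23: uploading data_file.txt a second time did not replace the first copy, it stacked a new version on top of it.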
Describe, document and organize data
Activity 3 -- Automatically share project documents, code and data across devices and collaborators.
While Amazon S3 is great for archival file storage, we have seen that it can be somewhat cumbersome for collaboration and sharing. AWS requires each person to have an AWS account and to manually sync files (*Note - there are programs that use S3 as a back-end and provide the sort of continuous backup services we are describing in this section). Ease of use and transparency are two key issues for anyone implementing a backup, and for that reason it is important to identify products that allow us to track versions and sync data every time a file is saved. This process, also known as continuous data protection (CDP) or real-time backup, is an excellent intermediate step that recognizes that most changes made throughout the day do not need persistent archival versions in systems like S3.

For this exercise we will use Google Drive, but equally good services include Box, Dropbox, Microsoft SkyDrive, and Amazon Cloud Drive (which uses S3!). More options can be found at http://creativeoverflow.net/the-10-best-alternatives-to-dropbox/.
Note: You may already have Google Drive or another service like it set up on your machine. If you do not - set it up today! If you do, read through the exercise and look for opportunities to explore. Re-installing Google Drive on your machine will just cause problems, so don't do it!
- Let's start by installing Google Drive on your computer if you have not already. You can get Google Drive at https://tools.google.com/dlpage/drive
- Install the software and log in. Allow the installer to place the folder in your home directory
- If you have time/interest, you can also install the Google Drive app on your smartphone or tablet (iOS and Android). Log in and browse around
Google Drive and other services automatically synchronize and track versions of your data by attaching operational hooks to your operating system that copy files to the cloud service every time you save a document. There is nothing magic here; in fact, the tools to do this have been around for a very long time (rsync). These file-syncing tools are useful, however, because they include data sharing and collaboration features, because they are multi-platform and because they work continuously.
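The rsync-style logic the paragraph above alludes to, copying a file only when it is missing or has changed, is simple to sketch. Real sync clients add deletion handling, conflict detection, and filesystem change notifications instead of re-reading files; this minimal one-way version just shows the core comparison, with hypothetical folder names.

```python
import shutil
from pathlib import Path

def sync(source: Path, dest: Path) -> list:
    """One-way sync: copy files that are missing or whose contents differ.
    Returns the names of the files that were copied."""
    copied = []
    dest.mkdir(parents=True, exist_ok=True)
    for src_file in source.iterdir():
        if not src_file.is_file():
            continue
        target = dest / src_file.name
        if not target.exists() or target.read_bytes() != src_file.read_bytes():
            shutil.copy2(src_file, target)  # copy2 preserves timestamps
            copied.append(src_file.name)
    return copied

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        local = Path(d) / "local"
        local.mkdir()
        (local / "notes.txt").write_text("draft 1")
        cloud = Path(d) / "cloud"
        print(sync(local, cloud))  # first pass copies: ['notes.txt']
        print(sync(local, cloud))  # nothing changed: []
```

A real continuous-sync client runs logic like this whenever the operating system reports that a watched file was saved, which is exactly the "operational hook" described above.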