D4.1: C4DM Data Management Policy (Proposal)

Introduction

Why manage research data?

Benefits

Barriers to data publication

Costs

Data

Data repository

Depositor

Metadata

Data management

Data curation

Data preservation

Data confidentiality

Sensitive data

Intellectual Property Rights

C4DM Data Management System

General Policies

Roles and responsibilities

Repository policies for the data management system at C4DM

Content coverage

Scope

Kinds of research data

Status of research data

Data file formats

Volume and size limitations

Dataset versions

Metadata

Metadata types and sources

Metadata collection

Reuse of metadata

Submission of data

Eligible depositors

Ingestion process

Public data workflow

Private data workflow

Access and reuse of data

Private Repository Data

Access to data objects

Use and reuse of data objects

Public Repository Data

Access to data objects

Use and reuse of data objects

Sharing data with other repositories

Cover sheet

Data citation

Tracking users and use statistics

Preservation of data

Retention period

File preservation and sustainability

Fixity and authenticity

Withdrawal of data from the public repository

Succession plans

Appendix A: Use cases

Case 1: Data set connected to a publication (anyone)

Case 2: Data set obtained from another university under a non-disclosure agreement (Million Song Database)

Case 3: MIR ground truth data (BEATLES)

Case 4: Collaborative project among different institutions (JOSH)

Introduction

Why manage research data?

Open Access[1],[2] and Reproducible Research[3] are two concepts that are gaining importance in the research community. Increasingly, public research funders require that data produced in the course of the projects they fund be made publicly available. These requirements aim mainly at increasing transparency, and at fostering reuse and repurposing of research data. Furthermore, many scientific journals now require that data supporting an article be publicly available, either through the journal’s own repository or elsewhere.

Benefits

There are also more personal reasons why researchers should make data management an integral part of their research practice, apart from the previously mentioned top-down requirements. In fact, it could be argued that good data management practices are essential for being a good researcher. As stated in Queen Mary’s “Guidelines on Good Practice in Research”[4]:

Researchers should keep clear and accurate records of the procedures followed and the approvals granted during the research process, including records kept of the interim results obtained as well as the results of the final outcomes. This is necessary not only as a means of demonstrating proper practice, but also in case questions are subsequently asked about either the conduct of the research or the results obtained. Researchers should be aware that the data generated is the property of the College. Data generated in the course of research should be kept securely in paper or electronic format, as appropriate (a minimum of 10 years is recommended).

Several personal benefits can derive from good research data management practice. For example:

- More effective research, meaning reduced risk of duplicated data and effort

- Reduced risk of loss of important data

- Potentially more citations of papers (and citation of data itself) from data reuse

Regarding citations, research assessment is currently based on the number, quality and impact of articles in journals and conferences, and not on published datasets. However, according to a recent publication[5] by the Digital Curation Centre (DCC), “This is likely to change with evidence that making data related to an article publicly available correlates with higher citation rates, at least in fields that have built the necessary repositories, standards and collaborative culture.”

Furthermore, good data management is also beneficial for C4DM as a group. In particular, possible short and long term benefits include:

- More continuity in research through internal data reuse

- Greater visibility through an increased number of citations, contributing to more opportunities for collaboration and better recruitment

- Increased transparency of research results, leading to a better reputation

- Faster and easier creation of teaching material

- Engagement with industry and the research community through data reuse

The extent of these benefits has been assessed using metrics that emerged from the JISC Managing Research Data Programme[6].

Barriers to data publication

There are several barriers that prevent researchers from publishing their data. To give a few examples:

- The feeling of ownership (“I collected it; why should I share it?”)

- The need to spend time organizing and preparing the data for publication

- The feeling that data might be misinterpreted; the feeling that the “data is not good enough”

- The lack of technical (e.g. repository) and legal (confidentiality, copyrights) support from the institution

These barriers can be overcome by collaboration between all the parties involved (researchers, project managers, supervisors, repository administrators, IT services, C4DM, QMUL). Roles and responsibilities must be explicitly defined for this collaboration to take place efficiently.

Costs

Apart from the benefits, there are of course costs involved in implementing a data management policy, in terms of both human and financial resources. These costs mainly include the initial setup of a data management system, and its sustainability, including maintenance, extensions, and user training and support.

Glossary

A glossary of important terms related to data management is necessary to help the reader to better understand the remainder of this document.

Data

A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.[7]

Data repository

A digital repository is a mechanism for managing and storing digital content. Repositories can be subject or institutional in their focus. Putting content into an institutional repository enables staff and institutions to manage and preserve it, and therefore derive maximum value from it. A repository can support research, learning, and administrative processes. Repositories use open standards to ensure that the content they contain is accessible in that it can be searched and retrieved for later use. The use of these agreed international standards allows mechanisms to be set up which import, export, identify, store and retrieve the digital content within the repository.[8]

Depositor

A person who deposits data into a repository for long term preservation.

Metadata

Data about other data.[9]

In a data repository, metadata is normally divided into three categories:

- Descriptive: information describing the content of the data

- Administrative: information used by the repository to manage the data (e.g. depositor’s name, date of deposit, format, version, licenses)

- Structural: information that describes the relations between different data objects in a repository.

Data management

Research data management concerns the organization of data, from its entry to the research cycle through to the dissemination and archiving of valuable results. It aims to ensure reliable verification of results, and permits new and innovative research built on existing information.[10]

Data curation

Data curation is about ensuring that project results are fit to archive, and that valued research assets remain fit for reuse.[11]

Data preservation

Data preservation is about ensuring that what is handed over to a repository or publisher remains fit for secondary use in the longer term (e.g. 10 years post-project).[12]

Data confidentiality

Data confidentiality is a property of data, usually resulting from legislative measures, which prevents it from unauthorized disclosure.[13]

Sensitive data

Sensitive data refers to information covering:

- The racial or ethnic origin of the Data Subject

- Political opinions

- Religious or other beliefs of a similar nature

- Membership of trade unions

- Physical or mental health or condition

- Sexual life

- The commission of any offense or criminal records

Sensitive data must be collected using an opt-in and should be carefully handled. Other classes of data that might be regarded as sensitive are data relating to children and financial information.[14]

Intellectual Property Rights

The term “Intellectual property rights” refers to the assignment of property rights through patents, copyrights and trademarks. These property rights allow the holder to exercise a monopoly on the use of the item for a specified period.

By restricting imitation and duplication, monopoly power is conferred, but the social costs of monopoly power may be offset by the social benefits of higher levels of creative activity encouraged by the monopoly earnings.[15]

C4DM Data Management System

The “Sustainable Management of Digital Music Research Data” project has developed, along with this document, a prototype Data Management System (C4DM-DMS) that aims to support and encourage researchers at C4DM to preserve their research data and share it with their colleagues and the wider research community.

Currently, the system is based on the digital repository software DSpace[16]. The purpose of the repository is to store research datasets that have been curated in order to be usable by people other than the author/depositor. Note, however, that since the repository supports versioning, datasets do not have to be in a final, stable version in order to be accepted for publication.

The system supports two levels of data sharing:

- Private data sharing: includes all datasets of interest for researchers at C4DM that the author/depositor does not want to make public, or that for different reasons cannot be publicly shared. The latter might include legally obtained copyrighted material (e.g. ripped audio CDs), or datasets acquired from other research groups under a non-disclosure license.

- Public data sharing: datasets meant to be openly shared with the research community. Normally, these would be datasets accompanying an experiment described in a publication, and might be used to independently reproduce, verify, and compare the results, or to run a different experiment.

Access to the protected parts of the repository is regulated through authentication based on EECS – QMUL credentials, or a combination of username and password, after the new user has been accepted by the system administrators.

The use of the C4DM-DMS is regulated by the policies described in the remainder of this document. Different policies apply to the two sharing levels (usually more restrictive for the public data). A clear distinction is made whenever the policies differ.

General Policies

(To be superseded by QMUL policies when these are finalised.)

  1. In order to comply with funders’ recent policies (e.g. EPSRC[17]), all new research proposals must include research data management plans or protocols that explicitly address data capture, management, integrity, confidentiality, retention, sharing and publication.
  2. The University/School/C4DM will provide training, support, advice and, where appropriate, guidelines and templates for research data management execution and planning.
  3. The University/School/C4DM will provide mechanisms and services for storage, backup, registration, deposit and retention of research data assets in support of current and future access, during and after completion of research projects.
  4. Research data of future historical interest, and all research data that represent records of the University/School/C4DM, including data that substantiate research findings, will be offered and assessed for deposit and retention in the University/School/C4DM repository.
  5. Any data that has been already published elsewhere, for example in an international data service, domain repository or website (e.g. Internet Archive[18], Isophonics[19], MIREX website[20]), should also be registered with the University/School/C4DM’s data repository.
  6. Research data management plans must ensure that research data are available for access and re-use where appropriate and under appropriate safeguards.
  7. The legitimate interests of the subjects of research data must be protected.
  8. The rights to reuse or publish research data should be regulated using appropriate licenses. Research data should normally be openly available for re-use, unless restrictions (e.g. due to conditions of funding) apply.

Roles and responsibilities

C4DM is responsible for providing:

- Mechanisms and services for storage, backup, registration, deposit and retention of research data assets produced by C4DM researchers, in support of current and future access, during and after completion of research projects. (Under current plans, this responsibility is likely to pass to the College in the future.)

- Training, support, advice and, where appropriate, guidelines and templates for research data management, research data management plans, and use of the research data management system.

QMUL is responsible for providing:

- Legal advice and support on matters relating to copyrighted and confidential data.

Project managers (PIs, PhD supervisors) are responsible for:

- Having research data management plans in place for projects, in order to ensure that research data are available for access and re-use where appropriate and under appropriate safeguards. The legitimate interests of the subjects of research data must be protected.

Researchers (including research students) are responsible for:

- Complying with QMUL, C4DM, and specific funders’ data management policies.

- Participating in training and making use of all the available resources provided by C4DM and QMUL.

- Selecting the data appropriate for long-term preservation.

- Curating the data to ensure that it is fit for reuse in the long term.

- Depositing the data in the appropriate digital repository, including all the necessary descriptive metadata.

- Ensuring that any data that is retained elsewhere, for example in an international data service or domain repository, is either transferred to, or at least registered with, the institutional/group repository.

- Making sure that all copyrights are respected, and that confidential data is not published. In case of doubt, the researcher should seek legal advice on the matter.

Repository policies for the data management system at C4DM

This section covers specific policies referring to the C4DM data management system and guidance for compiling Data Management Plans.

Content coverage

Scope

The C4DM-DMS can only be used to store and publish data related to research conducted at the Centre for Digital Music, Queen Mary, University of London. Any data published on the repository (both for private and public sharing) that is not considered appropriate by the administrators of the repository will be removed.

Kinds of research data

Any kind of research data that is included in the scope of the C4DM-DMS (see previous point) can be published and shared on the system, as long as it is adequately documented in the accompanying metadata (see the Metadata section for more information). Acceptable research data includes, but is not limited to:

- Data used in scientific experiments for which the results have or have not been published (e.g. audio recordings, MIDI files, musical scores)

- Models and simulations, including the model, its parameters, and the data generated as output (e.g. auditory models and the resulting auditory streams; output from music analysis algorithms)

- Data derived from processing of other data (e.g. automatically generated music transcriptions)

- Reference data (e.g. manually annotated ground truth data, original musical scores)

- Observations (e.g. recorded and transcribed interviews, field notes)

- Accompanying material to a scientific publication (e.g. figures, data resulting from statistical analysis)

- Other supplementary objects

Publications are not included in the scope of the repository, as it is expected that these will be placed in a dedicated publications repository (Publists), although references to publications related to data published on the repository are encouraged. Likewise, software is expected to be published at SoundSoftware.ac.uk.

Status of research data

Datasets can be published at different stages of development:

- Raw/preliminary data (e.g. unprocessed audio recordings and videos)

- Used but not ready for publication (e.g. unsorted data used in a preliminary experiment)

- Ready for release (e.g. data that has been curated in order to make it reusable by other people)

Normally, only data that is ready for publication should be made available on the public part of the C4DM-DMS, and with all the necessary metadata (see also the Metadata section). Data at other stages of development should be kept for private sharing only. Depositors of these early-stage data are nevertheless encouraged to attach as much metadata as possible.

Data file formats

Any data file format can be uploaded to the C4DM-DMS, as long as the content is included in the scope of the repository. If the format has a MIME type, this will be stored in the descriptive metadata.
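As an illustration only, the lookup of a MIME type from a file name could be sketched as follows. This is not the repository's actual mechanism: the function name `describe_format` and the `SUPPORTED_MIME_TYPES` set are hypothetical, and the set simply mirrors the single wav entry of Table 1.

```python
import mimetypes

# Assumption: formats with these MIME types count as Supported. The
# authoritative list is Table 1, which currently names only wav.
SUPPORTED_MIME_TYPES = {"audio/x-wav"}

# Registry built from Python's bundled extension-to-type table.
_registry = mimetypes.MimeTypes()


def describe_format(filename):
    """Return (mime_type, is_supported) for a file name.

    mime_type is None when the extension is unknown, in which case the
    file would simply be stored without a MIME type in its metadata.
    """
    mime_type, _encoding = _registry.guess_type(filename)
    return mime_type, mime_type in SUPPORTED_MIME_TYPES


if __name__ == "__main__":
    for name in ("take1.wav", "notes.unknownext"):
        print(name, describe_format(name))
```

The exact string returned for a given extension depends on the Python version, so a real deployment would more likely rely on the format registry of the repository software itself.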

Data file formats are either Supported or Unsupported. A Supported format means, for example, that the descriptive metadata found in the header of the file can be automatically extracted and indexed by the repository. For certain Supported file formats (e.g. audio, images, PDF), preview functionalities are also offered. Files in an Unsupported format will simply be stored in the repository: the depositor will have to edit all the metadata fields manually or using the appropriate batch processes.

Table 1: List of supported formats

Format / MIME type / Metadata extraction / Preview
wav / audio/x-wav / Yes / Yes

Volume and size limitations

There are in principle no volume or size limitations. However, in the case of particularly large files (e.g. video recordings), alternative solutions to the normal upload procedures should be discussed with the repository administrators.

Dataset versions

The C4DM-DMS supports versioning of datasets. (Note: at time of writing, this functionality is incomplete.) This means that datasets can be updated at any time with new material. Previous versions of datasets will be kept for the record, and be available for download. The most recent version of the dataset will always be presented as the default version, with the possibility to browse older versions. The system will automatically take care of naming conventions and generate a cover sheet with all the necessary information to keep track of which version of the dataset has been retrieved, including instructions on how to correctly cite the specific version of the dataset.

Metadata

Metadata is arguably the most important part of the submission into a repository, because it enables other people to find, understand, and reuse the dataset. It is important that all the basic descriptive metadata (Dublin Core simple metadata set) is attached to the relevant data files as early as possible, even before the dataset is to be submitted to the repository. This should ensure that the metadata is accurate, and will save time during the submission process.

C4DM’s metadata policies also comply with the EPSRC Policy Framework on Research Data[21]:

Research organizations will ensure that appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet; in each case the metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier. […]

Where access to the data is restricted the published metadata should also give the reason and summarize the conditions that must be satisfied for access to be granted. For example ‘commercially confidential’ data, in which a business organization has a legitimate interest, might be made available to others subject to a suitable legally enforceable non-disclosure agreement.

Metadata types and sources

Descriptive metadata

  • Dublin Core simple metadata set (author, title, date of publication, etc.), which will be shared with other repositories. These metadata are compulsory: the dataset will not be accepted if they are missing.
  • Methods used to create the data, references to publications describing these methods, and/or the software used to produce the data, possibly with links to the exact version, if available, on SoundSoftware.ac.uk.
  • Domain-specific metadata, encoded in ad-hoc schemas, which might be implemented on a case-by-case basis, in order to accommodate the wide range of data types produced at C4DM. New metadata schemas might cover specific experiments/collections, or be suitable for wider purposes. Some domain-specific metadata schemas, derived for example from the Music Ontology, are already available, and can be used to describe Supported file types (see Table 1). Only the repository administrators are allowed to create new metadata schemas. If you think that a new schema should be implemented and included in the system, please discuss this with the administrators. For useful advice, see the DDI (Data Documentation Initiative).
  • Any other important descriptive information that does not fit into the built-in metadata schemas can be attached to the data as a serialized metadata file, saved in a standard format (e.g. XML).

It is the responsibility of the depositor to provide the repository with descriptive metadata. Descriptive metadata will be used for searching the repository, and it is thus in the interest of the depositor to provide metadata that is as accurate as possible. As already mentioned, for Supported formats some of the descriptive metadata might be automatically imported during the submission process, but the depositor will have to make sure that the imported metadata is correct and accurate.

Administrative metadata: technical information (e.g. deposit date, access rights, version number) that is automatically included by the repository.

Structural metadata: this metadata is created by the repository software and used to manage links between objects.

Metadata collection

Basic descriptive metadata:

- Manual ingestion: descriptive metadata can be typed in for each deposited item through the repository’s web interface.

- Automatic ingestion for Supported file types: the basic descriptive metadata can be attached to the header of the file and automatically imported by the repository.

- Automatic ingestion from serialized metadata files: descriptive metadata can be imported from serialized metadata files (e.g. XML, CSV).

Other descriptive metadata:

- Manual ingestion: a text field for any additional descriptive metadata is available for manual entry.

- Attached serialized metadata files: if available, these files can be saved along with the data files and indexed for free text search.

Access to metadata

Anyone may have access to the metadata free of charge. This includes metadata for both the private (to be confirmed) and public sections of the repository.

In accordance with the EPSRC Policy Framework, since the data on the private part of the repository will not be available for download, the reasons for not publishing them, and the conditions for granting access to them must be clearly displayed. The repository will automatically include this information based on the information provided by the depositor. This also applies to data in the public part of the repository that have a particular license agreement.

Reuse of metadata

Descriptive metadata might be subject to a specific license, which must be clearly stated by the depositor during the deposit process.

Users are not allowed to use any metadata for commercial purposes, unless they have the written consent of the owner of the material.

Metadata will be made open to harvesting by repository search engines and other open repositories using the OAI-PMH protocol.
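To make the harvesting step more concrete, the following is a minimal sketch of how an external harvester might form an OAI-PMH `ListRecords` request and read Dublin Core titles from the response. The base URL and the sample response are placeholders invented for illustration; only the `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Dublin Core element namespace, as used inside oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(oai_response_xml):
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(oai_response_xml)
    return [element.text for element in root.iter(f"{DC_NS}title")]


# Hand-written sample response, not output from a real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Listening test results</dc:title>
          <dc:creator>Example, Peter</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # Hypothetical endpoint; the real address depends on the deployment.
    print(list_records_url("https://repository.example.org/oai/request"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A complete harvester would also need to follow resumption tokens for large result sets, which this sketch leaves out.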

Submission of data

Eligible depositors

Only academics, researchers, research students, and staff at the Centre for Digital Music (C4DM), Queen Mary, University of London, are allowed to deposit material into the repository. To be able to deposit data, users should sign in using their EECS credentials.

On a case-by-case basis, permission to submit data to the repository might be granted to external depositors, for example to people from other institutions working on a project in collaboration with C4DM. External users will be asked to register with the system, and the administrator will have to approve the registration.

Ingestion process

The data ingestion process under DSpace follows a specified schema, defined in the system. In general, a dataset can be ingested in several different ways, for example through the web interface, the batch item importer, or using the SWORD protocol. These will create a “submission in progress” entry in the repository. At this point, a submission workflow is started, in which the metadata content is checked, and eventually licenses are added. There are two standard basic workflows, one for public data and one for private data (see following section for details). Depending on the type of data to be deposited, variations to the two workflows can also be created; see Appendix A or discuss this with the system administrator.

Once the dataset goes through the entire workflow (for example, it is approved by the designated person), the dataset is installed and archived, and officially appears on the repository.

Public data workflow

  • Data quality requirements: the validity and authenticity of the content of a submission is the responsibility of the depositor, and is not checked by the repository. The repository will not perform any kind of quality review (peer-review or other) prior to the publication of the data.
  • Metadata quality requirements: the repository will automatically check, prior to publication, that all the mandatory metadata has been filled in by the depositor. The repository reserves the right to automatically add additional information to the data, such as administrative metadata, structural metadata, persistent digital object identifiers (handle.net), and to apply predefined naming conventions to files.
  • Rights and ownership: at the end of the ingestion workflow, the depositor will be asked to select a license under which to publish the data and metadata (e.g. Creative Commons, Data Commons, BSD). Observe that different licenses may apply to different objects in a dataset or collection. Based on the selected license, the data might be given restricted access. For example, copyrighted audio files might only be available to registered users within C4DM, while the metadata will be available for public consultation. A link to the location of the original recording might be added to the metadata. See IPR in Databases within the UK (McGeever 2007, DCC) for further advice.
  • Repository terms and conditions: at the end of the ingestion process, the depositor will also be asked to accept the Repository Terms and Conditions, in which the responsibilities of both the repository and the depositor will be clearly stated.
  • Embargo status: depending on the type of license chosen during the submission process, an automatic embargo period might be set up by the repository. The depositor will also be able to set up a specific embargo period. At the end of the embargo period the data will automatically be published for public access. During the embargo period, the basic Dublin Core descriptive metadata will be available for consultation. The date on which the embargo ends will also be clearly displayed.
  • Confidentiality and disclosure: it is the responsibility of the depositor to ensure that confidential data is appropriately anonymised, and that all the required permissions have been granted prior to submission. See the Preservation of data section for more details on withdrawal of data that might infringe confidentiality laws. Questions regarding confidentiality should be referred to the Director of C4DM in the first instance.
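The automatic check of mandatory metadata mentioned in the workflow above could look roughly like the sketch below. It is an assumption-laden illustration: this document does not enumerate which Dublin Core elements are mandatory beyond author, title and date of publication, so `REQUIRED_FIELDS` and the function names are hypothetical.

```python
# Assumption: these three Dublin Core elements stand in for the compulsory
# set; the authoritative list would be defined by the repository itself.
REQUIRED_FIELDS = ("title", "creator", "date")


def missing_fields(metadata):
    """Return the required fields that are absent or blank in a metadata dict."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def ready_for_publication(metadata):
    """A submission passes the check only when no required field is missing."""
    return not missing_fields(metadata)


if __name__ == "__main__":
    draft = {"title": "Listening test results", "creator": "  "}
    print(missing_fields(draft))
    print(ready_for_publication(draft))
```

Returning the list of missing fields, rather than a bare yes or no, lets the submission interface tell the depositor exactly what still has to be filled in.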

Private data workflow

No restrictions are imposed on data deposited in the private repository, although common practice suggests that metadata should be as complete and descriptive as possible to facilitate the reuse of data. The Dublin Core simple metadata set remains compulsory. Copyrights should be clearly indicated, if known, to allow publication on the public repository at a later stage.

Access and reuse of data

Private Repository Data

Access to data objects

Data published on the private repository will only be accessible to academics, researchers, research students, and staff at the Centre for Digital Music (C4DM), Queen Mary, University of London.

Use and reuse of data objects

Unless this contradicts conditions of funding or supply of data, the right to reuse data published on the private repository will be granted, for research purposes, to registered users working at C4DM.

Public Repository Data

Access to data objects

Open Access is granted in principle to all data published on the public repository. In specific cases, when only metadata is freely available, data might be made available on request.

Use and reuse of data objects

The terms and conditions for reusing data are defined by the license attached to each dataset by the depositor at the time of publication. Normally this would involve the citation of the data source and/or associated journal or conference publications in any published work using the dataset.

Sharing data with other repositories

In addition to the automatic harvesting of metadata, the repository will support exporting datasets through standard protocols such as SWORD[22]. This is not an automatic process.

Cover sheet

An automatically generated cover sheet will be attached to any material retrieved, shared or exported from the C4DM repository. The cover sheet will contain a reference to the source repository (in case the material was found through a repository search engine or other source), and a summary of the descriptive metadata (i.e. the compulsory Dublin Core simple metadata set) and of the administrative metadata (i.e. date of deposit, version, license, and date of retrieval).
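As a rough illustration of the information listed above, a plain-text cover sheet could be assembled as in the following sketch. The layout, the citation wording, the function name `build_cover_sheet` and the example handle are all hypothetical; the policy only fixes which pieces of information the cover sheet contains.

```python
from datetime import date


def build_cover_sheet(metadata, handle, version, license_name,
                      deposited, retrieved=None):
    """Return a plain-text cover sheet for one retrieved dataset version."""
    retrieved = retrieved or date.today()
    # Hypothetical citation layout: creator (year), title, version, handle URL.
    citation = (
        f"{metadata['creator']} ({metadata['date'][:4]}). "
        f"{metadata['title']}, version {version}. "
        f"https://hdl.handle.net/{handle}"
    )
    lines = [
        "Source repository: C4DM Data Management System",
        f"Title: {metadata['title']}",
        f"Creator: {metadata['creator']}",
        f"Date of publication: {metadata['date']}",
        f"Date of deposit: {deposited.isoformat()}",
        f"Version: {version}",
        f"License: {license_name}",
        f"Date of retrieval: {retrieved.isoformat()}",
        f"Cite as: {citation}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "title": "Listening test results",
        "creator": "Example, P.",
        "date": "2012-05-01",
    }
    # "12345/678" is a made-up handle used only for this example.
    print(build_cover_sheet(example, "12345/678", 2, "CC BY", date(2012, 5, 2)))
```

Including the version and the retrieval date in the sheet is what allows a reader to tell later which version of a dataset a given download corresponds to.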

Data citation

Each entry in the repository is assigned a unique, persistent digital object identifier using the Handle System[23]. Instructions on how to correctly cite the data are provided in the cover sheet.

Tracking users and use statistics

Anonymous usage data might be collected to produce usage statistics.

Preservation of data

Retention period

According to the EPSRC Policy Framework:

Research organisations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10-years from the date that any researcher ‘privileged access’ period expires or, if others have accessed the data, from last date on which access to the data was requested by a third party; all reasonable steps will be taken to ensure that publicly-funded data is not held in any jurisdiction where the available legal safeguards provide lower levels of protection than are available in the UK.

Once published, data will be retained indefinitely, unless otherwise specified by the depositor, but always in compliance with other policies in place.

File preservation and sustainability

All data files will be preserved in their original format. The files will not be automatically converted to other formats for sustainability purposes. Where appropriate (e.g. audio and video files, text documents), the repository will provide direct support for visualization, but there is no guarantee that this service will work in the future. It is the sole responsibility of the depositor to guarantee the compatibility and long-term usability of data. For this reason, it is advisable, when possible, to use open standard formats, as opposed to proprietary formats.

Fixity and authenticity

The repository will ensure that the data files are preserved without modification. For this purpose, checksums computed with hashing functions such as CRC, MD5, and SHA-1 will be offered for every file available for download.
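A user who has downloaded a file can verify it against the published values by recomputing the checksums locally. The sketch below shows one way to do this with the Python standard library; the function name `file_checksums` and the choice of the CRC-32 variant are assumptions, since the policy does not say which CRC is meant or how the values are published.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=65536):
    """Compute CRC-32, MD5 and SHA-1 of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large audio or video files
        # never have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {
        "crc32": f"{crc & 0xFFFFFFFF:08x}",
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
    }


def matches_published(path, published):
    """Compare locally computed checksums with the published ones."""
    local = file_checksums(path)
    return all(local[name] == value.lower() for name, value in published.items())
```

If any of the values differ, the file was altered or corrupted somewhere between the repository and the local copy and should be downloaded again.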

Withdrawal of data from the public repository

The C4DM public repository follows a “Publish everything, withdraw if necessary” policy when the copyrights of certain data are unclear or undefined. Requests for withdrawal of published material will be accepted and evaluated on a case-by-case basis. If data is withdrawn from the public repository, the data will remain available on the private repository, unless otherwise specified. The metadata will remain available on the public repository, together with a “tombstone” link describing the reasons for withdrawing it.

Succession plans

Although the repository is meant to remain active indefinitely, reasons may arise for its closure. The administrators will make sure that all the data stored in the repository can be easily transferred to a different repository, for example by supporting standard repository data exchange protocols such as SWORD. Also, DSpace allows entire repositories to be exported and cloned.

AppendixA:Usecases

Case1:Datasetconnectedtoapublication(anyone)

Peter,aresearchstudentatC4DM,hassubmittedapapertoajournal.Thepaperdescribesanexperimentalsetupinwhichheranalisteningtestwithalargenumberofsubjectsinordertoverifyamodelheimplemented.Theresultsofthestatisticalanalysisofthedatacollectedfromthelisteningtestarealsoincludedinthepaper.Thereviewersindicatedthat,accordingtothejournal’spolicies,Peterhastomakethedatafromthelisteningtestandthemodel’sdataopenlyaccessible.

UsingC4DM’sdatamanagementsystem,Petercreatesadataset(Collection)namedafterthepaper’stitle,addingallthebasicmetadata(author,dateofcreation,briefdescription),aswellasareferencetothepaper.ThenheuploadsthroughthewebuserinterfacetheproperlyanonymisedlisteningtestresultsasCommaSeparatedValues(.csv),andincludesaREADME.txtfiledescribinghowthedatawascollected,andwhatvaluesareincluded.Peterchoosesanappropriatelicenseforthistypeofpotentiallysensitivedatatoautomaticallyattachwhenthedatasetisretrieved.

Healsouploadsthedatafileswiththemodel’ssettings.Inthemetadataforthesefiles,heincludesalinktotheversionofthesoftwareimplementationofthemodelheusedintheexperiment,whichisstoredoncode.soundsoftware.ac.uk.Thesystemautomaticallycreatesapermanentlink for the dataset usingtheHandlesystem,andautomaticallygeneratesacitationtobeusedtorefertothedatasetinpublications.

Peter can embargo the dataset until the paper is accepted and goes to press. Until that date, only the metadata will be available on the public repository. After the embargo expires, the complete dataset will be available for download.
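The embargo rule amounts to a date comparison, sketched below. The function name `public_view` and the labels `"metadata"` and `"files"` are hypothetical, and treating the embargo end date itself as the first day of full availability is an assumption made for the example.

```python
from datetime import date
from typing import Optional


def public_view(today: date, embargo_end: Optional[date]) -> set:
    """Return which parts of a dataset the public repository exposes on a given day."""
    if embargo_end is not None and today < embargo_end:
        return {"metadata"}           # embargo still running: metadata only
    return {"metadata", "files"}      # no embargo, or embargo expired


embargo_end = date(2012, 6, 1)        # placeholder date
print(public_view(date(2012, 5, 31), embargo_end))
print(public_view(date(2012, 6, 1), embargo_end))
```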

When someone retrieves the dataset, all the data files are packaged in a zip file, and an automatically generated cover page is added with the basic metadata, references to the C4DM data repository, and the license.
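The packaging step could look roughly like the following sketch, which uses Python's standard `zipfile` module. The cover sheet file name `COVER.txt`, its layout, and the sample metadata and license text are assumptions made for illustration, not the system's actual output.

```python
import io
import zipfile


def package_dataset(files: dict, metadata: dict, license_text: str) -> bytes:
    """Bundle data files with a generated plain-text cover sheet into a zip archive."""
    cover_lines = ["C4DM data repository", ""]
    cover_lines += [f"{key}: {value}" for key, value in metadata.items()]
    cover_lines += ["", "License:", license_text]

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("COVER.txt", "\n".join(cover_lines))  # hypothetical name
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Placeholder content, for illustration only.
bundle = package_dataset(
    {"results.csv": b"subject,score\n1,4\n", "README.txt": b"How the data was collected.\n"},
    {"author": "Peter", "description": "Listening test results"},
    "Example license text",
)
print(len(bundle), "bytes")
```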

Case 2: Data set obtained from another university under a non-disclosure agreement (Million song database)

Mary, an RA at C4DM, is using a very large dataset of audio files she obtained from another institution to run some MIR tasks. Since the audio files are copyrighted, she had to accept a non-disclosure agreement. This agreement covers all researchers at C4DM, and she knows her colleagues might also be interested in using the dataset. She has copied the data onto the c4dm-datasets virtual disk on landin, but she is not sure everyone knows about it. Besides, she wants to make sure that the non-disclosure agreement is clear to everyone. She has also collected some extra metadata about the dataset that she wants to publish.

Mary creates a dataset on C4DM’s data management system, where she adds the basic metadata about the dataset, including a reference to the original source, and the non-disclosure agreement. She then links the audio files on c4dm-datasets to the repository: in this way, the audio files remain on the file server, but they can all be accessed directly from the repository. Mary sets up the audio files to be accessible only by authorised users (i.e. C4DM members). She also uploads the additional metadata in RDF/XML. These additional files are set to be openly accessible, like the basic metadata.

Someone browsing the repository will be able to find out about the dataset and follow the reference to the original author. At the same time, they will be able to download the additional metadata uploaded by Mary.

Case 3: MIR ground truth data (BEATLES)

John is a lecturer at C4DM. He supervises several PhD students working on different aspects of MIR. To test their algorithms, these students use a large set of audio recordings made available by John. To create this dataset, John purchased a collection of CDs and had an intern rip the tracks. In the process, ID3 tags were added containing information such as artist, album, track number and genre. All the files are available from the c4dm-datasets folder on landin, but they are uncategorized and difficult to explore. Along with the audio files, a number of manual annotations have been created to be used as ground-truth to test the algorithms.

In order to make the recordings more easily accessible, John creates a dataset on the C4DM data management system to hold them. The audio files remain on the file server, but the system automatically extracts all the available metadata from the file headers and from the ID3 tags, and makes it searchable. Furthermore, John adds references to the original source of each track, and to where it can be obtained (e.g. the CD’s catalogue number, or the link on the iTunes store). When new files are added to the folder on landin, the system automatically adds them to the dataset.
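Picking up new files from the shared folder can be thought of as a set difference between the folder listing and the files already registered. The sketch below shows only that step, under the assumption that both are available as lists of relative paths; the function names and sample paths are hypothetical, and ID3 extraction is left out because it would need a third-party tag-reading library.

```python
def find_new_files(folder_listing, known_files):
    """Return paths present in the folder but not yet registered in the dataset."""
    known = set(known_files)
    return sorted(path for path in set(folder_listing) if path not in known)


def sync(folder_listing, dataset_files):
    """Append newly found files to the dataset; return (updated list, added files)."""
    added = find_new_files(folder_listing, dataset_files)
    return list(dataset_files) + added, added


# Placeholder paths, for illustration only.
dataset = ["cd1/track01.wav", "cd1/track02.wav"]
folder = ["cd1/track01.wav", "cd1/track02.wav", "cd2/track01.wav"]
dataset, added = sync(folder, dataset)
print(added)
```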

The basic metadata for the audio dataset are available for search and use on the public website as well: subsets of the dataset can be created by searching for certain terms. The audio files can only be retrieved from the repository by authorised users, but anyone can follow the links to obtain the metadata describing the CD from which the tracks were ripped.

A separate dataset is also created for the annotations, and links are made to the audio files dataset. The annotations dataset is completely Open Access.

Information is provided on how to cite the dataset in a publication, in case it is reused by other researchers.

Case 4: Collaborative project among different institutions (JOSH)

Tim, a lecturer at C4DM, is the coordinator of a project involving several research centres in the UK interested in digital audio. The aim of the project is to create a large database of multi-track audio recordings on which to run various experiments. These tracks might be commercial, free for non-commercial use, or home recorded.

Tim creates a dataset entry in the C4DM data management system for the database. He also creates user accounts for all the partners involved in the project, so that they can deposit files through the web interface.

Apart from basic descriptive metadata, which can be automatically imported if available or typed in via the user interface, the depositors are encouraged to attach as much additional metadata as they can in order to make the dataset easily searchable. This can be done by attaching serialised metadata files (XML) or by typing keywords. The database can be searched via tags or free text, and subsets can be downloaded as a .zip archive.
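Selecting a subset by tag or free text can be sketched as a simple filter. The record structure (`name`, `tags`, `description`), the sample entries and the case-insensitive matching are assumptions made for this example; the actual system would rely on the repository's own search index.

```python
def search(recordings, tag=None, text=None):
    """Return recordings matching a tag and/or a free-text term (case-insensitive)."""
    results = []
    for rec in recordings:
        if tag is not None and tag.lower() not in {t.lower() for t in rec["tags"]}:
            continue
        if text is not None and text.lower() not in rec["description"].lower():
            continue
        results.append(rec)
    return results


# Placeholder records, for illustration only.
recordings = [
    {"name": "session1", "tags": ["drums", "rock"], "description": "Live drum kit, 8 tracks"},
    {"name": "session2", "tags": ["vocals"], "description": "Home recorded vocal takes"},
    {"name": "session3", "tags": ["Rock", "guitar"], "description": "Studio guitar overdubs"},
]
subset = search(recordings, tag="rock")
print([rec["name"] for rec in subset])
```

The resulting subset is the list of records that would then be bundled into the .zip archive for download.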

A different license can also be chosen for each recording, so that the copyrighted material is available only to authorised users, while the rest can be retrieved by anyone.
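Per-recording licensing implies an access check at retrieval time, roughly as sketched below. The license labels (`commercial`, `non-commercial`, `home-recorded`) and the function name are hypothetical, chosen to match the three kinds of tracks described in this use case.

```python
# Hypothetical labels: recordings under these licenses need an authorised account.
RESTRICTED_LICENSES = {"commercial"}


def can_retrieve(recording: dict, user_is_authorised: bool) -> bool:
    """Anyone may retrieve unrestricted recordings; restricted ones need authorisation."""
    if recording["license"] in RESTRICTED_LICENSES:
        return user_is_authorised
    return True


# Placeholder records, for illustration only.
tracks = [
    {"name": "album_track", "license": "commercial"},
    {"name": "free_track", "license": "non-commercial"},
    {"name": "demo_track", "license": "home-recorded"},
]
public_names = [t["name"] for t in tracks if can_retrieve(t, user_is_authorised=False)]
print(public_names)
```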
