ISO TC 46/SC 4 N 595

Date: 2006-02-6

ISO/WD XXXXX

ISO TC 46/SC 4/WG

Secretariat: ANSI

Information and documentation - The WARC File Format

Warning

This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.

Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Copyright notice

This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO.

Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO's member body in the country of the requester:

[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the working document has been prepared.]

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.

Contents

1 Scope
2 Normative references
3 Terms, definitions and acronyms
3.1 Terms and definitions
3.1.1 WARC record
3.1.2 WARC record content block
3.1.3 WARC record payload
3.1.4 WARC record header
3.1.5 WARC named fields
3.1.6 WARC logical record
3.2 Acronyms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Target-URI
5.13 WARC-Truncated
5.14 WARC-Warcinfo-ID
5.15 WARC-Filename
5.16 WARC-Profile
5.17 WARC-Identified-Payload-Type
5.18 WARC-Segment-Number
5.19 WARC-Segment-Origin-ID
5.20 WARC-Segment-Total-Length
6 WARC Record Types
6.1 General
6.2 'warcinfo'
6.3 'response'
6.3.1 General
6.3.2 for 'http' and 'https' schemes
6.3.3 for other URI schemes
6.4 'resource'
6.4.1 General
6.4.2 for 'http' and 'https' schemes
6.4.3 for 'ftp' scheme
6.4.4 for 'dns' scheme
6.4.5 for other URI schemes
6.5 'request'
6.5.1 General
6.5.2 for 'http' and 'https' schemes
6.5.3 for other URI schemes
6.6 'metadata'
6.7 'revisit'
6.7.1 General
6.7.2 Profile: Identical Payload Digest
6.7.3 Profile: Server Not Modified
6.7.4 Other profiles
6.8 'conversion'
6.9 'continuation'
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
8.1 General
8.2 application/warc
8.3 application/warc-fields
9 IANA considerations
Annex A (informative) Compression recommendations
A.1 General
A.2 Record-at-time compression
A.3 GZIP WARC file name suffix
Annex B (informative) WARC file size and name recommendations
Annex C (informative) Examples of WARC records
C.1 Example of 'warcinfo' record
C.2 Example of 'request' record
C.3 Example of 'response' record
C.4 Example of 'resource' record
C.5 Example of 'metadata' record
C.6 Example of 'revisit' record
C.7 Example of 'conversion' record
C.8 Example of segmentation ('continuation' record)
Annex D (informative) Use cases for writing WARC records

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75% of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO/WD XXXXX was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC 4, Technical interoperability. It is derived from a working specification created in the context of an open-source software project and previously published in a series of drafts to prepare for publication as an Internet RFC.

Introduction

Web sites and web pages emerge and disappear from the World Wide Web every day. For the past ten years, memory organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies. Starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g., links to other pages, images, videos, scripting or style instructions), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.
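As an illustration only, the crawl loop described above can be sketched as follows. The sketch assumes a hypothetical in-memory mapping, `FAKE_WEB`, in place of real network retrieval; the function name `crawl`, the link pattern and the `max_pages` policy are likewise assumptions made for the example. An operational crawler applies far richer policies.

```python
import re
from collections import deque

# Hypothetical stand-in for the web: URL -> page content (assumption for illustration).
FAKE_WEB = {
    "http://example.org/": '<a href="http://example.org/a">A</a> <img src="http://example.org/logo.png">',
    "http://example.org/a": '<a href="http://example.org/">home</a> <a href="http://example.org/b">B</a>',
    "http://example.org/b": "no links here",
    "http://example.org/logo.png": "binary image data",
}

# Simplified link extraction: absolute URLs in href or src attributes only.
LINK_PATTERN = re.compile(r'(?:href|src)="([^"]+)"')


def crawl(seeds, fetch, max_pages=100):
    """Visit URLs in first-in, first-out order, saving each page and queueing its links."""
    frontier = deque(seeds)   # list of URLs still to visit
    saved = {}                # URL -> saved content
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()
        if url in saved:
            continue          # already captured
        content = fetch(url)
        if content is None:
            continue          # nothing retrievable at this URL
        saved[url] = content
        for link in LINK_PATTERN.findall(content):
            if link not in saved:
                frontier.append(link)
    return saved


pages = crawl(["http://example.org/"], FAKE_WEB.get)
```

Here `pages` holds every object reachable from the seed URL; it is this set of saved objects that a container format then has to store and manage.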

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g., entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) must be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC file format has been used by the Internet Archive (IA) since 1996 to manage billions of objects, and is also used by several national libraries.
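The container idea can be illustrated with a minimal sketch. The layout used below (a few `Name: value` text lines, a blank line, the raw data block, and a trailing separator) is a simplified assumption for illustration and is not the normative record syntax, which is defined in the body of this document. The function names are hypothetical; the field names `WARC-Type` and `Content-Length` are taken from the named fields listed in the table of contents.

```python
import io


def write_record(stream, headers, block):
    """Append one record: text header lines, a blank line, then the raw data block."""
    lines = [f"{name}: {value}" for name, value in headers.items()]
    lines.append(f"Content-Length: {len(block)}")
    stream.write(("\r\n".join(lines) + "\r\n\r\n").encode("utf-8"))
    stream.write(block)
    stream.write(b"\r\n\r\n")  # separator so that records can simply be concatenated


def read_records(stream):
    """Yield (headers, block) pairs from a stream of concatenated records."""
    while True:
        line = stream.readline()
        if not line:
            return                 # end of file
        if line == b"\r\n":
            continue               # skip separator lines between records
        headers = {}
        while line not in (b"\r\n", b""):
            name, _, value = line.decode("utf-8").rstrip("\r\n").partition(": ")
            headers[name] = value
            line = stream.readline()
        # The block is read by length, so it may contain any bytes at all.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


archive = io.BytesIO()
write_record(archive, {"WARC-Type": "response"}, b"HTTP/1.1 200 OK\r\n\r\nhello")
write_record(archive, {"WARC-Type": "metadata"}, b"\x00\x01binary\r\n\r\ndata")
archive.seek(0)
records = list(read_records(archive))
```

Because each block is delimited by its declared length rather than by its content, the container needs no knowledge of what the block holds, which is the property the paragraph above relies on.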

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open-source Heritrix [HERITRIX] web crawler), managing, accessing, and exchanging content. How WARC files are created, and how resources are stored and rendered, will depend on the implementation of the software and applications concerned.

The files that make up a website harvested from the Internet are stored as the payloads of WARC records within WARC files. The different parts of one website are not necessarily stored in the same WARC file.

To render an archived website for future users, access software may therefore need to retrieve records from several WARC files. External indexes are recommended to speed up access to the archives.
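The role of an external index can be sketched as follows. This is an assumption-laden illustration, not a prescribed index format: the index here is a plain mapping from URI to file name, byte offset and length, the archive files are held in memory, and the names `store`, `retrieve` and `crawl-0001.warc` are hypothetical.

```python
import io


def store(files, index, file_name, uri, payload):
    """Append a payload to the named file and record its location in the index."""
    stream = files.setdefault(file_name, io.BytesIO())
    stream.seek(0, io.SEEK_END)
    offset = stream.tell()
    stream.write(payload)
    index[uri] = (file_name, offset, len(payload))


def retrieve(files, index, uri):
    """Look up the location of a URI in the index and read only those bytes."""
    file_name, offset, length = index[uri]
    stream = files[file_name]
    stream.seek(offset)
    return stream.read(length)


files, index = {}, {}
# Parts of the same website end up in different files.
store(files, index, "crawl-0001.warc", "http://example.org/", b"<html>home page</html>")
store(files, index, "crawl-0002.warc", "http://example.org/logo.png", b"PNG image bytes")
store(files, index, "crawl-0001.warc", "http://example.org/style.css", b"body { color: black }")

home = retrieve(files, index, "http://example.org/")
logo = retrieve(files, index, "http://example.org/logo.png")
```

With such an index, access software can go directly to the file and offset it needs instead of scanning every archive file in sequence.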

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.

Information and documentation - The WARC File Format

1 Scope

This International Standard specifies the WARC file format. The goals of the format are:

to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, DNS, and FTP;

to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding);

to support data compression and maintain data record integrity;

to store all control information from the harvesting protocol (e.g., request headers), not just response information;

to store the results of data transformations linked to other stored data;

to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);

to be extended without disruption to existing functionality;

to support handling of overly long records by truncation or segmentation where desired.