DCPD: Common Formats

For additional DCPD information, see:

  1. Introduction
  2. Framework
  3. Envelope
  4. Extensions
  5. Common Formats

Common Formats

Some subformats are likely to be used in several contexts. They are described here once, to avoid duplicate definitions or descriptions elsewhere.

METADATA: File metadata
RELATIVE-SOURCE: Relative bibliographic reference
SOURCE: Bibliographic reference
TEXT: Textual information
URI and URL: Data references
VERSION: Version or release information

METADATA

Identifies the current data file, and provides information relevant to its publication.

{
  id : string;                -- req
  
  date-and-time : string[25]; -- req
  release : VERSION;          -- req
  reason : TEXT;              -- opt
  publisher : string;         -- opt
  
  history : [                 -- opt
    { id, date-and-time, release, reason, publisher }, -- 0 or more
  ]
}
id : string
unique ID for this particular file in its current release. (Probably a UUID (see RFC 4122 for details), but should in general allow for any ID data.)
date-and-time : string;
date and time of publication; ISO 8601 extended format (YYYY-MM-DDThh:mm:ss±hh:mm)
release : VERSION;
See VERSION below.
reason : TEXT;
reason for release. See TEXT below.
publisher : string;
who is responsible for this release? who should be asked about possible errors?
history : [ { id, date-and-time, release, reason, publisher : …as above… }, ];
array of id/date-and-time/release/reason/publisher for earlier releases.

It would be possible to have only history, but include current release as ‘last’ entry, and drop the individual id/date-and-time/… fields. It might even be better to do so …

An additional id that identifies the entire set of releases, and to which the history data refers, is probably also required.
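
As an illustration only, the record above could be built as follows. This is a minimal sketch, assuming a plain Python dict as the representation, a random UUID as id, and a plain string in place of the TEXT and VERSION subformats; the function names make_metadata and next_release are hypothetical and not part of DCPD.

```python
import uuid
from datetime import datetime, timezone

# Fields that are copied into a history entry (taken from the struct above).
RELEASE_FIELDS = ("id", "date-and-time", "release", "reason", "publisher")


def make_metadata(release: str, publisher: str = "", reason: str = "") -> dict:
    """Build a METADATA record for a new release (hypothetical helper)."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return {
        "id": str(uuid.uuid4()),            # assumption: a random UUID is used as id
        "date-and-time": now.isoformat(),   # ISO 8601 extended format with UTC offset
        "release": release,                 # VERSION, kept as a string here
        "reason": reason,                   # TEXT, simplified to a plain string here
        "publisher": publisher,
        "history": [],
    }


def next_release(current: dict, release: str, reason: str = "") -> dict:
    """Publish a new release and move the current one into history."""
    earlier = {field: current[field] for field in RELEASE_FIELDS}
    new = make_metadata(release, current["publisher"], reason)
    new["history"] = current["history"] + [earlier]
    return new


first = make_metadata("1.0", publisher="Example Publisher")
second = next_release(first, "1.1", reason="Corrected two diagrams")
print(len(second["history"]))
```

Each release gets its own id here, which is why a separate id for the whole set of releases would still be needed.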

SOURCE

(Bibliographic reference. For future specification.)

It should allow:

  • at least an informal text-only description of the source used,

  • preferably a full bibliographical reference of the file(s) used, as well as of the source from which they were scanned, i.e. book, periodical or other published or unpublished source, as well as a URI/URL to any digital copy. The latter probably has standard formats that can be reused (see Wikipedia:Citing sources for an example).

  • May also include accepted references, such as L/N or Betts ids

RELATIVE-SOURCE

(Partial bibliographic reference. For future specification.)

Intended to be used as a complement to a SOURCE reference, and to provide information about the location (problem number, page, column, chapter, section and perhaps volume) within the work described by SOURCE. That is, SOURCE + RELATIVE-SOURCE should (informally) produce a full bibliographic reference.

Basically, RELATIVE-SOURCE is only expected to say ‘problem 328’, ‘page 5’, ‘issue 12, page 15’, ‘volume 3, issue 2, front cover’ or similar partial information. It must be possible to distinguish fields that are left unspecified because SOURCE should contain that information from fields that are deliberately set to ‘empty’.

TEXT

Text is free text intended for a human reader.

The exact form is probably something determined by the relevant specification, but three formats seem desirable to allow:

Plain text.
No additional mark-up. None.
Unicode as base character code with UTF-8 encoding.
Markdown.
Lightly marked-up text. The CommonMark specification may be a reasonable level to start with, but it may need to be restricted as well as extended to fit DCPD.
Extended Syntax support might be desirable, but CommonMark does allow for HTML tags to be used, so the need for Extended Syntax is probably limited.
CommonMark presupposes Unicode.
URIs may need to be restricted to in-line data, and HTML blocks probably also need to be restricted.
HTML.
Some reasonable subset of HTML tags. To start with, whatever is needed for Markdown-to-HTML translation. (This seems to be at least these tags: block blockquote code em h1 h2 h3 h4 h5 h6 hr li ol p pre strong table, plus standard entities.) Tags such as a, iframe and img may need to be restricted to inline or local URIs (i.e. data:… and file:///…).

<img> is needed to support inclusion of chess diagram images in TEXT/Markdown or TEXT/HTML as long as standard character code support is lacking.

This type may be expanded into:

{
  loc  : "direct" or "indirect"    -- req
  type : "text" or "md" or "html"  -- req
  data : string or URL             -- req
}

A loc = “direct” indicates that the data field contains the relevant information, while loc = “indirect” indicates that data is a URL that refers to a UTF-8 byte-stream of data. (If other character encodings are required, a field to identify them is obviously needed.)
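
The direct/indirect distinction can be sketched as below. This is only an illustration, assuming the expanded TEXT record is held as a Python dict and that indirect data uses a local file: URL; the function name resolve_text is hypothetical.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname

TEXT_TYPES = ("text", "md", "html")


def resolve_text(text: dict) -> str:
    """Return the character data of an expanded TEXT record (hypothetical helper)."""
    if text["type"] not in TEXT_TYPES:
        raise ValueError(f"unknown TEXT type: {text['type']!r}")
    if text["loc"] == "direct":
        # The data field already holds the text itself.
        return text["data"]
    if text["loc"] == "indirect":
        # The data field is a URL that refers to a UTF-8 byte-stream.
        parts = urlparse(text["data"])
        if parts.scheme != "file":
            raise ValueError("this sketch only handles file: URLs")
        with open(url2pathname(parts.path), "rb") as stream:
            return stream.read().decode("utf-8")
    raise ValueError(f"unknown loc value: {text['loc']!r}")


print(resolve_text({"loc": "direct", "type": "text", "data": "Mate in two."}))
```

The type field is only validated here; rendering Markdown or HTML is left to whatever consumes the returned string.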

URI and URL

The term URI corresponds to the standard definition of the term. (See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier and the standard documents cited by that web page.) These references are all in a context where a full web browser may be needed to interpret data, and where a human user is the primary recipient of the information.

For computer use, any web-related surroundings (JavaScript-based etc.) are undesirable. This excludes sites such as Google Books, HathiTrust Digital Library and others from being used to provide, say, single pages: these sites typically produce a web reader environment, with controls for browsing, searching, zooming etc., i.e. they provide a fairly complex environment useful for a general reader using a web browser, but not useful for computer access to requested raw data.

Thus, the current DCPD use of the term URL implies that a successful request will produce a response that contains a single data stream that should be interpreted in the context of the request. That context implies that a request for a page image should return a page image, not an entire PDF file, or an HTML page containing JavaScript that puzzles together separate 256x256 image fragments into such an image.

At present the only URL allowed uses the ‘file’ scheme (see RFC 8089) and refers only to local files. The file structure model can be illustrated as follows. A published collection is stored on a local disk in a structure similar to:

.../DIRECTORY
	collection.dcp
	image-1.png, image-2.png, ...

The file that describes a collection (i.e. collection.dcp in this case) refers to the files using URLs in the format

file:/image-1.png  (or possibly file:///image-1.png)

Also, the only form of image data that is supported (for now) is PNG.
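
One possible reading of this model is sketched below. It assumes that the path in the URL is interpreted relative to the directory that holds the collection file, which the text above suggests but does not state outright; the function name resolve_local_image is hypothetical.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_local_image(collection_file: str, url: str) -> Path:
    """Map a DCPD file: URL to a path next to the collection file (hypothetical helper)."""
    parts = urlparse(url)
    if parts.scheme != "file":
        raise ValueError("only the 'file' scheme is allowed at present")
    if parts.netloc not in ("", "localhost"):
        raise ValueError("only local files are allowed at present")
    # Assumption: the URL path is relative to the collection's own directory.
    base = Path(collection_file).resolve().parent
    target = (base / unquote(parts.path).lstrip("/")).resolve()
    if base not in target.parents:
        raise ValueError("URL points outside the collection directory")
    if target.suffix.lower() != ".png":
        raise ValueError("only PNG images are supported for now")
    return target


print(resolve_local_image("/data/DIRECTORY/collection.dcp", "file:/image-1.png"))
```

Both file:/image-1.png and file:///image-1.png give the same result in this sketch, since the two forms parse to the same path.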

Further developments

The restriction to PNG files seems easy enough to lift to allow other raster-graphic formats to be used. The support of such formats should probably be guided by existing recommendations of file formats suitable for digital preservation. (Examples of main formats often mentioned in such recommendations: PNG, TIFF. Some restriction may be necessary, for example concerning multi-image TIFF.)

The restriction to individual local files is probably also easy enough to lift. Two possibilities are:

scheme zip: local ZIP file archive
zip:/<subfolder>/<archive>.zip?<internal-path>/file.png
(should be similar to jar:, see https://docs.oracle.com/javase/6/docs/api/java/net/JarURLConnection.html)
? The use of ! as separator in jar: seems odd. Is it legal?

Again, ZIP and TAR are two formats recommended as archive formats.

scheme pdf: local PDF file
Not an existing scheme
pdf:/<subfolder>/<file>.pdf?page=<nr> or #<nr> or something similar (nr is the PDF page number, not the printed page number)

Among document formats, PDF/A, EPUB and OpenOffice (.odt, perhaps also .sxw) are often recommended for archive use.

In addition, when Markdown is used for including chess diagrams in TEXT, file:/// will work, but data: may be more convenient, probably limited to mediatype ‘image/png’.
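
To illustrate the inline alternative, a PNG diagram could be embedded in Markdown roughly as follows. This is a sketch under the assumption that base64 encoding is used; the function names png_data_uri and markdown_image are hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the first eight bytes of every PNG file


def png_data_uri(png_bytes: bytes) -> str:
    """Encode PNG data as a data: URI with mediatype image/png (hypothetical helper)."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not PNG data")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded


def markdown_image(alt_text: str, png_bytes: bytes) -> str:
    """Return a Markdown image reference that carries the image inline."""
    return f"![{alt_text}]({png_data_uri(png_bytes)})"


# Only the signature is used here, to keep the example short; a real
# diagram would pass the full contents of the PNG file.
print(markdown_image("Diagram 1", PNG_SIGNATURE))
```

The result is self-contained, at the cost of a TEXT value that grows by about a third of the image size.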

As far as remote access is concerned, allowing URLs to include an authority field (e.g. ‘//host.example.com’) does not seem impossible.

Further developments: Integrity assurance

For some (all?) links to files or file containers that do not provide any kind of time-stamped integrity assurance, it may be desirable to include such information together with the URI/URL.

A simple form of this would be a date and time when the link was established, and a digital hash of the contents retrieved.
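
Such a record might look like the sketch below. The field names, the choice of SHA-256 and the function names are assumptions made for illustration; none of them are defined by DCPD.

```python
import hashlib
from datetime import datetime, timezone


def link_with_integrity(url: str, content: bytes) -> dict:
    """Record when a link was established and a hash of what it returned."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return {
        "url": url,
        "established": now.isoformat(),     # date and time the link was established
        "hash-algorithm": "sha256",         # assumption: SHA-256 as the digest
        "hash": hashlib.sha256(content).hexdigest(),
    }


def still_intact(link: dict, content: bytes) -> bool:
    """Check freshly retrieved content against the recorded hash."""
    digest = hashlib.new(link["hash-algorithm"], content).hexdigest()
    return digest == link["hash"]


link = link_with_integrity("file:/image-1.png", b"example image bytes")
print(still_intact(link, b"example image bytes"))
```

Storing the algorithm name next to the hash leaves room for changing the digest later without invalidating older records.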

VERSION

Specifies the version of formats, extensions, etc.

The version format is patterned on one used for software releases: <major>.<minor>.<build>.<state>

major : integer 0–MAX
large, significant or important changes to envelope contents. Change of format or extensions. Not necessarily backwards compatible.
minor : integer 0–MAX
minor changes to envelope contents or changes of optional format / extension format or data. Almost always backwards compatible.
build : integer 0–MAX
individual changes. Intended to be incremented at every change or save or commit, or equivalent. May be present in versions of published collections, but is primarily intended to be used for internal or private releases.
state : string
“PRIVATE” - private publication of work-in-progress
“REVIEW” - internal publication of work-in-progress for review
“PRELIMINARY” - official publication of preliminary release

A fourth state (“PUBLISHED”) is not intended to be used explicitly, but expressed through the absence of any other state label.

This is expressed as a string, in the format “DIGITS.DIGITS[.DIGITS[.STATE]]” where square brackets represent optional contents.

A struct-based representation is also possible.
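
A minimal sketch of both representations follows: the string format above is parsed into a small struct. It assumes that an absent build is kept as None and that an absent state means “PUBLISHED”; the names Version and parse_version are hypothetical.

```python
import re
from typing import NamedTuple, Optional

EXPLICIT_STATES = ("PRIVATE", "REVIEW", "PRELIMINARY")

# DIGITS.DIGITS[.DIGITS[.STATE]]
VERSION_PATTERN = re.compile(r"([0-9]+)\.([0-9]+)(?:\.([0-9]+)(?:\.([A-Z]+))?)?")


class Version(NamedTuple):
    major: int
    minor: int
    build: Optional[int] = None
    state: str = "PUBLISHED"     # implied when no state label is present


def parse_version(text: str) -> Version:
    """Parse a VERSION string into its struct form (hypothetical helper)."""
    match = VERSION_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a VERSION string: {text!r}")
    major, minor, build, state = match.groups()
    if state is not None and state not in EXPLICIT_STATES:
        # "PUBLISHED" is never written out, so it is rejected here as well.
        raise ValueError(f"unknown state: {state!r}")
    return Version(
        int(major),
        int(minor),
        None if build is None else int(build),
        state if state is not None else "PUBLISHED",
    )


print(parse_version("1.2"))
print(parse_version("1.2.34.REVIEW"))
```

Because the struct is a tuple of major, minor and build, versions that all carry a build number compare in the expected order.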