DCPD: Common Formats

For additional DCPD information, see:

  1. Introduction
  2. Framework
  3. Envelope
  4. Extensions
  5. Common Formats

Common Formats

Some subformats seem likely to be used in multiple contexts. They will be described here to avoid having multiple definitions or descriptions all over the place.

METADATA: File metadata
SOURCE: Bibliographic reference
REFERENCE: Detailed source reference
TEXT: Textual information
URI and FILE-URL: Data references
VERSION: Version or release information

METADATA

Identifies the current data file, and provides information relevant to its publication.

{
  id : string;                -- req
  
  history : [                 -- req
    { release-id, date-and-time,
      version, reason, publisher }, -- 1 or more
  ]
}
id : string
unique ID for this document in any of its releases. (Probably a UUID (see RFC 4122 for details), but should in general allow for any ID data.) Assigned when the document is created and not changed later.
history : [ { release-id, date-and-time, version, reason, publisher : …as above… }, ];
array of release-id/date-and-time/version/reason/publisher records, from the first to the current release. The order of entries should probably follow publication history, and should not be changed afterwards. (date-and-time should also reflect publication history.)

The entries of the records of the history array are:

release-id : string
unique ID for this particular file in its current release. (Probably a UUID, as above.)
date-and-time : string;
date and time of publication; ISO 8601 extended format (YYYY-MM-DDThh:mm:ss±hh:mm)
version : VERSION;
See Common Formats: VERSION.
reason : TEXT;
reason for release. See Common Formats: TEXT.
publisher : string;
who is responsible for this release? who should be asked about possible errors?
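
The METADATA structure above can be sketched in Python as follows. This is only an illustration, under several assumptions: records are plain dicts, both IDs are UUIDs, VERSION is simplified to a string, and the function names new_metadata and add_release are hypothetical, not part of DCPD.

```python
import uuid
from datetime import datetime, timezone


def add_release(metadata, version, reason, publisher, when=None):
    """Append a release entry to the history; earlier entries are left untouched."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "release-id": str(uuid.uuid4()),  # assumption: a UUID per release
        # ISO 8601 extended format with UTC offset, e.g. 2024-01-02T03:04:05+00:00
        "date-and-time": when.isoformat(timespec="seconds"),
        "version": version,  # simplified: VERSION as a plain string
        "reason": {"loc": "direct", "type": "text", "data": reason},
        "publisher": publisher,
    }
    metadata["history"].append(entry)
    return entry


def new_metadata(version, reason, publisher, when=None):
    """Create a METADATA record with its first (required) history entry."""
    metadata = {"id": str(uuid.uuid4()), "history": []}
    add_release(metadata, version, reason, publisher, when)
    return metadata
```

The document id is generated once in new_metadata and never touched again, while every call to add_release creates a fresh release-id and appends to the end of the history.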

SOURCE

Bibliographic information. Not entirely specified yet – ideas follow below.

Simple source

Can be used for anything, no matter what.

This is the absolutely most basic information, and is intended for situations when nothing else fits, or when the transcriber can’t make out more complex details.

{
	author : TEXT;   -- opt
	title : TEXT;    -- opt
	info: TEXT;      -- req
}

author : any text identifying the person or persons who created or compiled or edited (or even performed?) the source.

title : short description of the source. Its title, if it has one.

info : any additional description.

Basic book

{
	author : TEXT;            -- opt
	title : TEXT;             -- req
	publisherAndPlace : TEXT; -- opt
	publYear : TEXT;          -- opt
	
	info : TEXT;              -- opt

}

The purpose is to identify the source. All important info for that purpose (on the title page and, for books, its reverse) should be transcribed here. While most fields are optional, they are not intended to be left empty. However, a specific volume may lack some information, and in such cases individual fields may be NULL. Information that is ‘known’ but not present should be placed in info, or perhaps in brackets in the appropriate field.

More complex structures (e.g. books that are part of a series of volumes, such as the Olms Tchaturangavidja series of reprints) are documented in the info field. ISBN info (from the cover of the book, not from the title page) also goes in info.

Basic serial

… to be done

Electronic source

A book/serial/other available through electronic means. Not entirely figured out, but will probably include fields such as

	file : FILE-URL;  -- file:// 
	uri : URI;        -- other type of URI or URL (such as URN)
	
	signature : ...;  -- digital signature associated
	hash : ...        -- digital hash (if digital signature is not available)
	
	notes : TEXT;

along with a standard SOURCE field.

file identifies a directly accessible source (possibly behind access control). ftp:// and similar should be ok. http:// OK only if the response is a single stream of content.

uri is anything else, including web-readers, from which manual downloading is required.

signature and hash are intended for quality control, and should be those of the file source.

General thoughts.

These are the original notes on SOURCE. They are being refined above, and will probably go away eventually:

(Bibliographic reference. For future specification.)

It should allow:

  • at least an informal text-only description of the source used,

  • preferably a full bibliographical reference of the file(s) used, as well as of the source from which they were scanned, i.e. book, periodical or other published or unpublished source, as well as URI/FILE-URL to any digital copy. The latter probably has standard formats that can be reused (see Wikipedia:Citing sources for an example).

  • May also include accepted references, such as L/N or Betts ids

  • May also need a classification to show if the reference is to a main entry (problem, article, obituary), or provides additional information (erratum, comment, …) to the main entry. Or … that kind of information may need a separate format to keep SOURCE ‘pure’.

REFERENCE

Detailed reference to a source. For future specification.

Intended to be used for referring to the ‘exact’ source of information, preferably to page/problem level. A REFERENCE refers to a SOURCE (implicitly or explicitly), but it also includes the additional details needed to make it easy to find the actual information.

Basically, REFERENCE is expected to add ‘problem 328’, ‘page 5’ or ‘issue 12, page 15’ or ‘volume 3, issue 2, front cover, verso’ or similar detailed information to a SOURCE. As SOURCE always refers to a source from which the transcription is made, there seems to be no need to refer to sources of information outside that source in a REFERENCE.

Probable contents:

{
	volume : string;  -- opt;
	issue : string;   -- opt;
	section : string; -- opt;
	chapter : string; -- opt;
	page : string;    -- opt;
	column : string;  -- opt;
  ... more?
}
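
As an illustration of how such a record could be turned into the kind of phrase quoted above (‘issue 12, page 15’), here is a small Python sketch. It assumes exactly the six tentative fields listed, stored in a dict; the function name format_reference is hypothetical.

```python
# Field order follows the tentative REFERENCE structure, coarse to fine.
REFERENCE_FIELDS = ("volume", "issue", "section", "chapter", "page", "column")


def format_reference(ref):
    """Render the fields that are present as a human-readable phrase."""
    unknown = set(ref) - set(REFERENCE_FIELDS)
    if unknown:
        raise ValueError(f"unknown REFERENCE fields: {sorted(unknown)}")
    parts = [f"{name} {ref[name]}" for name in REFERENCE_FIELDS if ref.get(name)]
    return ", ".join(parts)
```

For example, format_reference({"issue": "12", "page": "15"}) gives "issue 12, page 15". All values stay strings, so page designations such as "xii" or "front cover, verso" need no special handling.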

(Question: what to do about problem number? Is that part of a reference, or is that part of a problem record? It is probably both …, as the source may be misprinted.)

(Question: A reference probably needs to contain information about what kind of information it refers to. In a problem record, there’s the main problem, but there may also be corrections of errors, added information, solution, and so on. Some of that should be in the SOURCE, but the solution may appear in, say, the next volume of a SOURCE. In such a case, the reference may need to say “solution: not present” so that later editors can supply missing info.)

TEXT

Text is free text intended for a human reader.

The exact form is probably something determined by the relevant specification, but three formats seem desirable to allow:

Plain text.
No additional mark-up. None. Non-printable characters need to be specified (white space – U+0020 – is required, horizontal tab may not be; lf vs crlf also needs to be decided).
Unicode (release? does it matter?) as base character code with UTF-8 encoding.
Note: This subformat does not allow images to be used, e.g. for diagrams, which Markdown and HTML sub-formats do.
Markdown.
Lightly marked-up text. The CommonMark specification may be a reasonable level to start with, but it may need to be restricted as well as extended to fit DCPD.
Extended Syntax support might be desirable, but CommonMark does allow for HTML tags to be used, so the need for Extended Syntax is probably limited.
CommonMark presupposes Unicode.
URIs may need to be restricted to in-line data, and HTML blocks probably also need to be restricted.
HTML.
Some reasonable subset of HTML tags. To start with, whatever is needed for Markdown-to-HTML translation. This seems to be at least these tags: block blockquote code em h1 h2 h3 h4 h5 h6 hr li ol p pre strong table, plus standard entities. Tags such as a, iframe and img may need to be restricted to inline or local URIs (i.e. data:… and file:///…).

<img> is needed to support inclusion of chess diagram images in TEXT/Markdown or TEXT/HTML as long as standard character code support is lacking.

If allowing scanned information (entirely or partially) is important, TEXT may be expanded to include an image file URL. However, aggregations of text and images are probably best done by the Markdown or HTML sub-formats.

TEXT may be represented by:

{
  loc  : "direct" or "indirect"          -- req
  type : "text" or "markdown" or "html"  -- req
  data : string or FILE-URL              -- req
}

A loc = “direct” indicates that the data field contains the relevant information, while loc = “indirect” indicates that data is a URL that refers to a UTF-8 byte-stream of data. (If other character encodings are required, a field to identify them is obviously needed.)
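
A minimal Python sketch of how a reader might handle the two loc values. The function read_text and its fetch parameter are hypothetical: fetch stands for whatever mechanism returns the raw bytes behind a FILE-URL, and is passed in by the caller.

```python
TEXT_TYPES = ("text", "markdown", "html")


def read_text(record, fetch=None):
    """Return the content of a TEXT record as a string.

    fetch is a caller-supplied function mapping a URL to bytes;
    it is only needed for loc = "indirect".
    """
    if record["type"] not in TEXT_TYPES:
        raise ValueError(f"unknown TEXT type: {record['type']!r}")
    if record["loc"] == "direct":
        # data already holds the text itself
        return record["data"]
    if record["loc"] == "indirect":
        if fetch is None:
            raise ValueError("indirect TEXT requires a fetch function")
        # data is a URL; the byte stream behind it is UTF-8
        return fetch(record["data"]).decode("utf-8")
    raise ValueError(f"unknown loc: {record['loc']!r}")
```

Keeping retrieval outside read_text means the same function works whether the bytes come from a local file, an archive, or a test stub.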

URI and FILE-URL

The term URI corresponds to the standard definition of the term. (See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier and the standard documents cited by that web page.) These references are all in a context where a full web browser may be needed to interpret data, and where a human user is the primary recipient of the information.

For computer use, any web-related surroundings (JavaScript-based etc.) are undesirable. This excludes sites such as Google Books, HathiTrust Digital Library and others from being used to provide, say, single pages: these sites (and others) typically produce a web reader environment, with controls for browsing, searching, zooming etc., i.e. they provide a fairly complex environment useful for a general reader using a web browser, but not useful for computer access to the requested raw data.

Thus, the current DCPD use of the term FILE-URL implies that a successful request will produce a response that contains a single data stream that should be interpreted in the context of the request. That context implies that a request for a page image should return a page image, not an entire PDF file, or an HTML page containing JavaScript that puzzles together separate 256x256 image fragments into such an image.

At present the only FILE-URL allowed uses the ‘file’ scheme (see RFC 8089) and refers only to local files. The file structure model can be illustrated as follows. A published collection is stored on a local disk in a structure similar but not necessarily identical to:

.../DIRECTORY
	collection.dcp
	image-1.png, image-2.png, ...

The file that describes a collection (i.e. collection.dcp in this case) refers to the files using URLs in the format

file:/image-1.png  (or possibly file:///image-1.png)

which are then interpreted relative to the directory in which that file is located.

Also, the only form of image data that is supported (for now) is PNG.
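
The resolution rule above might be sketched in Python like this. The function name resolve_file_url is hypothetical, and the check that the result stays inside the collection directory is an added assumption, not something the text above requires.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_file_url(url, collection_file):
    """Map a FILE-URL to a local path relative to the collection file's directory."""
    parts = urlparse(url)
    if parts.scheme != "file":
        raise ValueError("only the 'file' scheme is allowed at present")
    if parts.netloc not in ("", "localhost"):
        raise ValueError("only local files are allowed at present")
    # file:/image-1.png and file:///image-1.png both yield the path /image-1.png;
    # the leading slash is dropped so the path is relative to the collection.
    relative = unquote(parts.path).lstrip("/")
    base = Path(collection_file).resolve().parent
    target = (base / relative).resolve()
    if base not in target.parents:
        # assumption: references may not escape the collection directory
        raise ValueError("FILE-URL points outside the collection directory")
    if target.suffix.lower() != ".png":
        raise ValueError("only PNG images are supported for now")
    return target
```

Both spellings of the URL resolve to the same file next to collection.dcp, which is the behaviour described above.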

Further developments

The restriction to PNG files seems easy enough to lift to allow other raster-graphic formats to be used. The support of such formats should probably be guided by existing recommendations of file formats suitable for digital preservation. (Examples of main formats often mentioned in such recommendations: PNG, TIFF. Some restriction may be necessary, for example concerning multi-image TIFF.)

The restriction to individual local files is probably also easy enough to lift. Two possibilities are:

scheme zip: local ZIP file archive
zip:/<subfolder>/<archive>.zip?<internal-path>/file.png
(should be similar to jar:, see https://docs.oracle.com/javase/6/docs/api/java/net/JarURLConnection.html)
(Question: the use of ! as separator in jar: seems odd. Is it legal?)

Again, ZIP and TAR are two formats recommended as archive formats.

scheme pdf: local PDF file
Not an existing scheme
pdf:/<subfolder>/<file>.pdf?page=<nr> or #<nr> or something similar (nr is the PDF page number, not the printed page number)

For document archive formats, PDF/A, EPUB and OpenOffice (.odt, perhaps also .sxw) are often recommended for archive use.

In addition, when Markdown is used for including chess diagrams in TEXT, file:/// will work, but data: may be more convenient, probably limited to media type ‘image/png’.

As far as remote access is concerned, allowing FILE-URLs to include an authority field (e.g. ‘//host.example.com’) does not seem impossible.

Further developments: Integrity assurance

For some (all?) links to files or file containers that do not provide any kind of time-stamped integrity assurance, it may be desirable to include such information together with the URI/FILE-URL.

A simple form of this would be a date and time when the link was established, and a digital hash of the contents retrieved.
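
A minimal Python sketch of such a record, assuming SHA-256 as the hash and a dict with hypothetical field names (url, retrieved, hash-algorithm, hash); none of these choices are fixed by DCPD.

```python
import hashlib
from datetime import datetime, timezone


def make_link_record(url, content, when=None):
    """Record when a link was established and a hash of the bytes retrieved."""
    when = when or datetime.now(timezone.utc)
    return {
        "url": url,
        "retrieved": when.isoformat(timespec="seconds"),  # ISO 8601 extended
        "hash-algorithm": "sha256",  # assumption: SHA-256
        "hash": hashlib.sha256(content).hexdigest(),
    }


def verify_link_record(record, content):
    """Return True if content still matches the hash stored in the record."""
    return hashlib.sha256(content).hexdigest() == record["hash"]
```

A later reader can fetch the link again and call verify_link_record to detect whether the content has changed since the link was established.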

VERSION

Specifies version of formats, extensions, etc.

The version format is Semantic Versioning. This is commonly used for software, and may be summarized as <major>.<minor>.<patch>[<state>]

major : integer 0–MAX
large, significant or important changes to envelope contents. Change of format or extensions. Not necessarily backwards compatible.
minor : integer 0–MAX
minor changes to envelope contents or changes of optional format / extension format or data. Almost always backwards compatible.
patch : integer 0–MAX
Intended to be incremented at least for each distributed correction.
state : string
The following state terms are suggested:
“-PRIVATE” - private publication of work-in-progress
“-REVIEW” - internal publication of work-in-progress for review
“-PREL” - official publication of preliminary release

A struct-based representation would also be possible.
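
One way such a struct-based representation could look, sketched in Python. The names Version and parse_version are hypothetical, and accepting only the three suggested state terms is an assumption; ordering rules for states are not addressed here.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int
    state: str = ""  # "" for a regular release


# Three dot-separated non-negative integers, optionally followed by a state term.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(-PRIVATE|-REVIEW|-PREL)?")


def parse_version(text):
    """Parse a version string into a Version struct."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid version string: {text!r}")
    major, minor, patch, state = match.groups()
    return Version(int(major), int(minor), int(patch), state or "")


def format_version(version):
    """Inverse of parse_version."""
    return f"{version.major}.{version.minor}.{version.patch}{version.state}"
```

Storing the numeric parts as integers avoids the string-comparison pitfall where "1.10.0" would sort before "1.2.0".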