HHdbVI: PGN Names 1

Once the underlying PGN file format is reasonably OK (see earlier post), the main problem of extracting relevant information from the HHdbVI database is how to identify and interpret it.

For the purpose of this post, I want to be able to identify names and parts of names to the degree that the PGN standard requires.

PGN Rules of Names

In PGN files, the White and Black tag pairs are used to store player names.

Tag pairs are specified in section 8.1, and the string token used to hold the tag value is specified in section 7. Important to note is the requirement:

Currently, a string is limited to a maximum of 255 characters of data. 

PGN rules for name tags are specified in section 8.1.1.5 of the standard:

The names are given as they would appear in a telephone directory. The
family or last name appears first. If a first name or first initial is
available, it is separated from the family name by a comma and a space. 
Finally, one or more middle initials may appear. (Wherever a comma
appears, the very next character should be a space. Wherever an initial
appears, the very next character should be a period.) If the name is
unknown, a single question mark should appear as the tag value.

The reference to some kind of telephone directory format is unclear; it is assumed to be informational only. The separation of first names and middle initials appears to be US-centric.

Examples of correctly formed names:

Surname
Surname, F.
Surname, First
Surname, F. M.
Surname, First M.
Surname, First M. M.

Incorrectly formed names:

Surname, F                  (should be: Surname, F.)
Surname, First Middle       (           Surname, First M.)
Surname, F. Main L.         (           Surname, F. M. L.)
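
The examples above can be turned into a small check. The following is only a sketch of my reading of the rule, not something the PGN standard provides: the function name is_pgn_name is made up for illustration, and the assumptions that a surname may contain several space-separated words, and that words consist of letters, apostrophes and hyphens, are mine.

```python
import re

# Assumption: a name "word" consists of letters, apostrophes and hyphens.
_WORD = r"(?:[^\W\d_]|['\-])+"
# Assumption: a surname is one or more such words separated by single spaces.
_SURNAME = _WORD + r"(?: " + _WORD + r")*"
# Either a single initial followed by a period, or a first name of 2+ letters.
_FIRST = r"(?:[^\W\d_]\.|[^\W\d_]{2,})"
# Zero or more middle initials, each preceded by a space and ending in a period.
_MIDDLES = r"(?: [^\W\d_]\.)*"

_NAME_RE = re.compile(_SURNAME + r"(?:, " + _FIRST + _MIDDLES + r")?")


def is_pgn_name(value: str) -> bool:
    """Return True if value looks like one well-formed PGN name (hypothetical helper)."""
    # A single question mark is the PGN notation for an unknown name.
    return value == "?" or _NAME_RE.fullmatch(value) is not None


for candidate in ("Surname, First M. M.", "Surname, F", "Surname, F. Main L."):
    print(candidate, "->", is_pgn_name(candidate))
```

With this pattern the six correctly formed examples are accepted and the three incorrectly formed ones are rejected. It checks a single name only; a tag value with several names would first have to be split into individual names.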

(An aside: It can probably be concluded that the PGN format is not well adapted to transcriptions of languages in which a single initial is transcribed into more than one letter. See https://en.wikipedia.org/wiki/Romanization_of_Russian for several examples, and note that the Cyrillic letters Ю and Я are so transcribed in all standards except one. Also note that Ъ and Ь transcribe into non-alphabetical characters, but I can’t say if this would ever occur in practice.)

PGN also provides a notation for multiple names as one tag value, using ‘:’ to separate individual names (chapter 9).

Surname, A.:Surname, B.

All names are stored in the string token of the White tag pair. As PGN limits string tokens to 255 characters, and as there is no convention for splitting a string over multiple lines, it is difficult to reach any other conclusion than that tag pair lines are allowed to be more than 255 characters long.

HHdbVI Use of Names

As HHdbVI reuses the White tag pair for names of composers, the rules that PGN formulates are presumably still relevant. However, HHdbVI does not follow the rules mentioned above; whether this is by choice or by necessity is not known. Some examples:

[White "Amelung=F"]
[White "Kling=J Horwitz=B"]
[White "Van der Heijden=H"]
[White "unknown"]

The most obvious difference is that HHdbVI uses ‘=’ where PGN stipulates ‘, ’ (comma followed by space) to separate the surname from first names or initials.

It also uses ' ' (space) to separate two names, where PGN requires ‘:’. Space is also used to separate components of compound names (Van der Heijden).

‘unknown’ always appears by itself, and not together with any other names.

PGN requires these names to appear as:

[White "Amelung, F."]
[White "Kling, J.:Horwitz, B."]
[White "Van der Heijden, H."]
[White "?"]  
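
The double role of the space character can be made concrete with a few lines of Python. This is only an illustration of the problem, not a conversion method; naive_split is a made-up name, and it deliberately treats every space as a name separator.

```python
def naive_split(white_value: str) -> list[str]:
    """Split an HHdbVI White value as if every space ended a name."""
    return white_value.split(" ")


for value in ("Kling=J Horwitz=B", "Van der Heijden=H"):
    print(value, "->", naive_split(value))
# Kling=J Horwitz=B -> ['Kling=J', 'Horwitz=B']
# Van der Heijden=H -> ['Van', 'der', 'Heijden=H']
```

The first value splits into the two composers as intended, while the compound surname in the second falls apart into three pieces.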

There are also some less common name formats.

Some entries appear to have double initials:

Kondratev=VI
Kondratev=VN
Kuznetsov=AG
Kuznetsov=AP
Tkachenko=SI
Tkachenko=SN

Some entries appear to be abbreviated first names:

Berger=Ja
Lasker=Em
Lasker=Ed
Schmidt=Pa
Schmidt=Pe

(It should be noted that there are also ‘normal’ names of Berger=J, Lasker=E and Schmidt=P present in HHdbVI.)
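
To get an overview of how common each form is, the part after ‘=’ can be classified mechanically. The sketch below assumes that a single name has already been isolated, and that initials are plain uppercase ASCII letters; the function name and the category labels are my own, not HHdbVI terminology.

```python
import re


def classify_given_part(name: str) -> str:
    """Classify the text after '=' in one HHdbVI name (hypothetical labels)."""
    _surname, sep, given = name.partition("=")
    if not sep:
        return "surname only"
    if re.fullmatch(r"[A-Z]", given):
        return "single initial"
    # Assumption: a run of uppercase letters is a sequence of initials.
    if re.fullmatch(r"[A-Z]{2,}", given):
        return "multiple initials"
    # Assumption: uppercase followed by lowercase is an abbreviated first name.
    if re.fullmatch(r"[A-Z][a-z]+", given):
        return "abbreviated first name"
    return "other"


for name in ("Lasker=E", "Lasker=Em", "Kondratev=VI"):
    print(name, "->", classify_given_part(name))
# Lasker=E -> single initial
# Lasker=Em -> abbreviated first name
# Kondratev=VI -> multiple initials
```

Anything that lands in "other" would need to be inspected by hand.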

Some entries contain a ‘NN’ entry, for example:

[White "Kling=J Horwitz=B NN"]

These appear to indicate that one or more names are missing from the tag. They (always?) appear together with an ‘MC’ entry in the Black tag value, which indicates that the White field could not hold all names, and that the full name list instead appears in a comment placed before the first move in movetext data.

The NN entry is presumably only there for visual purposes: the documentation indicates that the ‘MC’ text in the Black tag value is the actual indication.

Example:

[White "Kling=J Horwitz=B NN"]
[Black "(+0130.11e7g7) MC U1"]
...

{Kling=J Horwitz=B Campbell=J Healey=F Zytogorsky=L. U1: Rusz=A HHdbVI 21-2-2016.

The comment field is used for other purposes (as shown in the example), so extracting the actual names poses an additional problem.
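
Whether ‘NN’ really always appears together with ‘MC’ could be checked by classifying each pair of White and Black tag values. The following is a sketch only: it assumes that both values can be treated as whitespace-separated tokens, which is what the single example above suggests but is not confirmed by the documentation, and the function names are invented for illustration.

```python
def has_token(tag_value: str, token: str) -> bool:
    """Return True if token occurs as a whitespace-separated item in tag_value."""
    return token in tag_value.split()


def nn_mc_status(white: str, black: str) -> str:
    """Report how 'NN' in the White value relates to 'MC' in the Black value."""
    nn = has_token(white, "NN")
    mc = has_token(black, "MC")
    if nn and mc:
        return "NN and MC"
    if nn:
        return "NN without MC"
    if mc:
        return "MC without NN"
    return "neither"


print(nn_mc_status("Kling=J Horwitz=B NN", "(+0130.11e7g7) MC U1"))
# NN and MC
```

Counting the four outcomes over the whole database would show whether the "NN without MC" and "MC without NN" cases occur at all.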

And finally, a number of entries contain ‘&’:

Dinis de Sousa=J & P
Grytsynyak=I & S
Melnichenko=I & L
Melnichenko=L & I
Van der Heijden=T & H

It probably does not need to be mentioned, but several names in HHdbVI consist only of a surname without any initials or first names, such as:

[White "Collijn"]

and there are entries with several such names:

[White "Bernard Carlier Leger Verdoni"]

Names/surnames may also contain one or more apostrophes:

Abu'n Na'm
d'Amelio=C
d'Auriac=A
Dell'Ovio=V
D'Hondt=H
D'Orville=A
L'Hermet=R
O'Brien=P
O'Donovan=J
't Hart=W

Converting HHdbVI names to standard PGN names or to some other format requires identifying the extent of each individual name. Standard PGN makes that easy: if more than one name is present, there should be a ‘:’ to show where one name stops and the next one begins. With HHdbVI this job is done by a space character, but as surnames may also contain spaces (several examples are shown above), the task is not entirely trivial. For that reason, approaches to conversion will be discussed in a future post.