Headword identification

[Diagram: X.txt -> (A) -> Xhw0.txt -> (B) -> Xhw1.txt -> (C) -> Xhw2.txt]

The diagram indicates that the (final) headword file Xhw2.txt for a particular dictionary, X, is constructed by a sequential application of three programs, the first of which uses the adjusted digitization X.txt.

The Python programs for a given dictionary are in one of the downloads for that dictionary; specifically, they are in the Xxml.zip download.

The beginning data file, X.txt, for a given dictionary is in one of the downloads for that dictionary; specifically, it is in the Xtxt.zip download.

Each digitization (X.txt) represents a dictionary. It is helpful for the developer to have in mind a simple model of a dictionary. In our model, a digitized dictionary consists of three strands:

* the lines of the digitization (lines 1 to n of the text file X.txt)
* the entries of the dictionary (identified by headword)
* the pages of the dictionary (corresponding to the scanned images from which the digitization is typed)

When this model is applied to X.txt, the result is an ancillary text file of metadata. Each line of that file associates a headword with the sequence of digitization lines belonging to its entry and with the page on which the entry begins.

For example:

1-001:a:2,27
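
To make the record layout concrete, here is a minimal Python sketch that parses one such line. It assumes the three colon-separated fields are, in order, the beginning page, the headword, and a comma-separated pair giving the first and last line numbers of the entry in X.txt (1-based, inclusive). That reading of "2,27" is an assumption, and the names HeadwordRecord, parse_headword_line and entry_lines are hypothetical; they are not taken from the programs in Xxml.zip.

```python
from typing import List, NamedTuple


class HeadwordRecord(NamedTuple):
    """One line of headword metadata (hypothetical structure)."""
    page: str        # beginning page, e.g. "1-001"
    headword: str    # e.g. "a"
    first_line: int  # assumed: first line of the entry in X.txt (1-based)
    last_line: int   # assumed: last line of the entry in X.txt (inclusive)


def parse_headword_line(line: str) -> HeadwordRecord:
    """Split 'page:headword:first,last' into its parts."""
    # Take the page from the left and the line range from the right,
    # so that a headword containing ':' would not break the split.
    page, rest = line.strip().split(":", 1)
    headword, line_range = rest.rsplit(":", 1)
    first, last = (int(n) for n in line_range.split(","))
    return HeadwordRecord(page, headword, first, last)


def entry_lines(record: HeadwordRecord, lines: List[str]) -> List[str]:
    """Return the digitization lines of an entry (assumes 1-based, inclusive)."""
    return lines[record.first_line - 1:record.last_line]


record = parse_headword_line("1-001:a:2,27")
print(record.page, record.headword, record.first_line, record.last_line)
```

Under these assumptions, the example line says that the entry for the headword a begins on page 1-001 and occupies lines 2 through 27 of X.txt.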

It is convenient to separate this construction of headwords into three parts.