Previous topic

Headword identification

Next topic

Xhw1.txt : normalized headwords

This Page

Xhw0.txt : crude headwordsΒΆ

program invocation:

python26 hw0.py X.txt Xhw0.txt

The hw0 program reads the digitization file X.txt line by line and keeps track of four pieces of information:

  • the current line number of the digitization
  • the current page number
  • the current headword
  • the line number where the current headword was identified

It is programmatically trivial to identify the current line number of the digitization.

To keep track of the current page number, each line is checked, using a regular expression, for a certain pattern, which is present in all the digitizations to represent line breaks. The general form of this pattern is [PageY]. The specific form of Y varies with the digitization, but is generally something like:

  • a digit sequence 1234 , representing the scan page number
  • a page-column form, like 123-a, 123-b
  • a volume-page form, like 1-123, 2-1003

The pattern Y also may contain, especially in the newer digitizations, the number of subsequent digitization lines belonging to the page or column; but the programs described here do not make use of this datum.

To keep track of the current headword, each line is checked, using a regular expression, for a certain pattern. This pattern is usually represented by regular expression with one capture group, whose contents represent the crude headword. In a few cases, the pattern may have a separate capture group for the homonym number of the headword. In a few older digitizations, there may also be a third capture group representing a refined headword. For a few dictionaries, headwords appear in one of several patterns.

The headword pattern is generally kept in a separate module, headword.py, which is used by hw0. This module is also used by the program which constructs an xml version of the digitization.

When a line of the digitization is identified as representing a headword, the hw0 program:

  • writes to Xhw0.txt the previous current headword, the scan page number at which the corresponding entry begins, and the beginning and ending line numbers of the digitization corresponding to the entry; for example:: 1-001:a:2,27 Of course, this step is not applicable to the first headword.
  • updates the current headword and its line number.

Finally, after the last line of the digitization, the last headword entry is written to Xhw0.txt.

The hw0 program also takes care of such things as ignoring lines of the digitization that properly are part of no headword entry. These ignored lines might include lines digitizing title pages, prefaces, and appendices.

Parts of the digitization just before or just after sections containing words that begin with a given letter are viewed as a minor problem. Ideally, some such lines should be ignored in the construction of Xhw0.txt, since they pertain to no headword. However, the current versions of hw0 generally do not attempt to exclude such lines.

A headword identified by headword.py and written by hw0 is termed crude because it may not be appropriate for serving as a key by which a search would find the entry. The next step in the processing, Xhw1.txt : normalized headwords , changes crude headwords to normalized headwords.