python26 hw0.py X.txt Xhw0.txt
The hw0 program reads the digitization file X.txt line by line and keeps track of four pieces of information:
It is programmatically trivial to identify the current line number of the digitization.
To keep track of the current page number, each line is checked, using a regular expression, for a certain pattern, which is present in all the digitizations to represent line breaks. The general form of this pattern is [PageY]. The specific form of Y varies with the digitization, but is generally something like:
The pattern Y also may contain, especially in the newer digitizations, the number of subsequent digitization lines belonging to the page or column; but the programs described here do not make use of this datum.
To keep track of the current headword, each line is checked, using a regular expression, for a certain pattern. This pattern is usually represented by regular expression with one capture group, whose contents represent the crude headword. In a few cases, the pattern may have a separate capture group for the homonym number of the headword. In a few older digitizations, there may also be a third capture group representing a refined headword. For a few dictionaries, headwords appear in one of several patterns.
The headword pattern is generally kept in a separate module, headword.py, which is used by hw0. This module is also used by the program which constructs an xml version of the digitization.
When a line of the digitization is identified as representing a headword, the hw0 program:
Finally, after the last line of the digitization, the last headword entry is written to Xhw0.txt.
The hw0 program also takes care of such things as ignoring lines of the digitization that properly are part of no headword entry. These ignored lines might include lines digitizing title pages, prefaces, and appendices.
Parts of the digitization just before or just after sections containing words that begin with a given letter are viewed as a minor problem. Ideally, some such lines should be ignored in the construction of Xhw0.txt, since they pertain to no headword. However, the current versions of hw0 generally do not attempt to exclude such lines.
A headword identified by headword.py and written by hw0 is termed crude because it may not be appropriate for serving as a key by which a search would find the entry. The next step in the processing, Xhw1.txt : normalized headwords , changes crude headwords to normalized headwords.