Digitization adjustment

[Diagram: rectangles X_orig.txt, X_orig_utf8.txt, X_v0.txt and X.txt, joined left to right by arrows; labels A, B, C]

The digitization of a particular dictionary, X, is actually represented by a sequence of files, as indicated in the diagram. The triangular arrows in the diagram indicate that a computer program processes the file named in the rectangle to the left of the arrow, and the result is the file named in the rectangle to the right of the arrow.

The processing programs are written in version 2.6 of the Python programming language. They very likely run as expected under the more recent version 2.7. However, they do not run under version 3 of Python.

All programs mentioned are executed under the Red Hat Enterprise Linux 5 operating system. However, due to the portability of Python, they should also work under other operating systems. Only modules in the Python Standard Library are used.

The Python programs for a given dictionary are in one of the downloads for that dictionary; specifically, they are in the Xxml.zip download.

The data files (X_orig.txt, etc.) for a given dictionary are in one of the downloads for that dictionary; specifically, they are in the Xtxt.zip download.

X_orig.txt

This is the original digitization prepared by Thomas Malten and his group. It is composed primarily of 7-bit ASCII characters, but typically also uses several extended ASCII characters. One characteristic of this file makes it awkward to process with current computer programs and text editors: the extended ASCII characters are represented in the cp1252 encoding, whereas the utf-8 encoding is now much more commonly supported by software tools. For this reason, the next form of the original digitization is constructed.
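As a small illustration of why the encoding matters, the following sketch (written for Python 3, unlike the framework's own programs) takes one extended character, the micro sign, chosen only as an example, and shows that its cp1252 byte is not valid utf-8 on its own. The variable names are illustrative and do not come from the framework.

```python
# One extended character as it is stored in a cp1252 file: the micro sign.
raw = b"\xb5"
as_text = raw.decode("cp1252")

# The same single byte is not a valid utf-8 sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# In utf-8 the character occupies two bytes.
as_utf8_bytes = as_text.encode("utf-8")

print(repr(as_text), valid_utf8, as_utf8_bytes)
```

A tool that assumes utf-8 therefore cannot read the original file directly, which is the motivation for the re-encoded form described next.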

X_orig_utf8.txt

This form of the digitization is the same as that above, except for the encoding of the extended ASCII characters: this file is the utf-8 encoding of X_orig.txt. It is constructed by the program:

python cp1252_to_utf8.py X_orig.txt X_orig_utf8.txt

This utf-8 version of the original digitization should be equivalent to the original digitization. As one check of this equivalence, a program reconstructs a cp1252 encoding from the utf-8 encoding:

python utf8_to_cp1252.py X_orig_utf8.txt X_orig_cp1252.txt

Then, the unix ‘diff’ utility compares the reconstructed cp1252 version with the original:

diff -w X_orig_cp1252.txt X_orig.txt

When diff produces no output, the two files are deemed identical. The ‘-w’ option means that differences in white space are ignored in the comparison.

No further use is made of the X_orig_cp1252.txt file, and it is deleted.

No further use is made of the X_orig.txt file, but of course it is not deleted.
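The conversion and the round-trip check described above can be sketched as follows. This is a minimal Python 3 illustration, not the framework's own conversion programs; the function names recode and equal_ignoring_whitespace and the sample bytes are hypothetical, and the comparison is only a rough analogue of ‘diff -w’.

```python
def recode(data, source_encoding, target_encoding):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)


def equal_ignoring_whitespace(a, b):
    """Compare two byte strings line by line, ignoring all white space."""
    def squeeze(data):
        # Drop every white space character inside each line.
        return [b"".join(line.split()) for line in data.splitlines()]
    return squeeze(a) == squeeze(b)


# Hypothetical sample: mostly 7-bit ASCII plus two extended characters.
original = b"<P>caf\xe9 {\xb5 test\r\n"

as_utf8 = recode(original, "cp1252", "utf-8")      # X_orig.txt -> X_orig_utf8.txt
round_trip = recode(as_utf8, "utf-8", "cp1252")    # back to cp1252 for the check

print(equal_ignoring_whitespace(round_trip, original))
```

One caveat with this sketch: Python's cp1252 codec rejects the five byte values that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a file containing any of them would need separate handling.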

Updates: X_v0.txt and X.txt

X.txt – and its possible predecessors X_v0.txt, X_v1.txt, etc. – is an updated version of the digitization.

The reasons for updates are quite varied, and these reasons are difficult to summarize. See Discussion of Correction Types.

A particular change to the original digitization may later be judged to be erroneous or undesirable. Also, corrections are likely to be noticed over time, rather than all at once. Thus, the process of corrections needs to be accomplished in a controlled manner that can be revised when required. This process needs to be repeatable and extendable.

In this section, the techniques of applying changes to the digitization are described.

There are essentially two techniques of correction that have been developed. Each of these is conceptually simple: a correction file is applied to the previous form of the digitization, the result being a new form of the digitization.

Also, each of these simple steps can be part of a sequence of simple steps; the first of these simple steps is applied to X_orig_utf8.txt and the last of these simple steps results in X.txt. X.txt is the ‘final’ form of the digitization.

Global update

In the first type of update, the corrections are applied as global updates:

python update.py <input> <change> <output> <log>

python update.py X_orig_utf8.txt change_01.txt X_v0.txt updatelog1.txt

Here is what a hypothetical change_01.txt file might contain:

sep=:
n~:J:str: normalize HK
\{µ:{T:re
 compar$: compar.:re:typo

The first line, ‘sep=:’, specifies the separator used for the fields in subsequent lines of the change_01.txt file.

Each subsequent line of the file is interpreted as a sequence of 3 or 4 fields.

The optional fourth field is a comment.

The third field is either ‘str’ or ‘re’, and indicates whether the first two fields represent a simple string substitution or a regular expression substitution.

The program update.py reads each line of the <input> digitization; it applies all of the changes in the <change> file to that line; it then writes the final form of that line to the <output> file. In case the line is changed, it also gathers information which will be written to the <log> file.

In our example,
  • Each occurrence of the string ‘n~’ is changed to the string ‘J’; this brings the file into compliance with the standard Harvard-Kyoto transliteration for the palatal nasal.

  • Each occurrence of the regular expression ‘\{µ’ is changed to the string ‘{T’. The backslash is required since the ‘{’ character has special meaning in regular expressions. In this case, the change could also be accomplished by a simple string substitution, ‘{µ:{T:str’, and no backslash would be needed.

    Note

    This example illustrates that the <change> file should also be in the utf-8 encoding.

  • Each occurrence of the string ‘ compar’ that occurs at the end of a line is replaced by the string ‘ compar.’.
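The global update logic described above can be sketched as follows. This is a minimal Python 3 illustration under stated assumptions, not the actual update.py: the function names parse_changes, apply_changes and update are hypothetical, blank lines in the change file are assumed to be skipped, and the log is represented simply as a list of (line number, old line, new line) tuples because the real log format is not described here.

```python
import re


def parse_changes(text):
    """Parse change-file text into (old, new, kind, comment) tuples."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("sep="):
        raise ValueError("first line must be 'sep=<separator>'")
    sep = lines[0][len("sep="):]
    changes = []
    for line in lines[1:]:
        if not line:
            continue  # assumption: blank lines are skipped
        fields = line.split(sep)  # no stripping: leading spaces are significant
        if len(fields) not in (3, 4):
            raise ValueError("expected 3 or 4 fields: %r" % line)
        old, new, kind = fields[:3]
        if kind not in ("str", "re"):
            raise ValueError("third field must be 'str' or 're': %r" % line)
        comment = fields[3] if len(fields) == 4 else ""
        changes.append((old, new, kind, comment))
    return changes


def apply_changes(line, changes):
    """Apply every change, in order, to one line of the digitization."""
    for old, new, kind, _comment in changes:
        if kind == "str":
            line = line.replace(old, new)
        else:
            line = re.sub(old, new, line)
    return line


def update(input_lines, changes):
    """Return the updated lines and a log of the lines that changed."""
    output, log = [], []
    for lnum, line in enumerate(input_lines, 1):
        new_line = apply_changes(line, changes)
        if new_line != line:
            log.append((lnum, line, new_line))
        output.append(new_line)
    return output, log


# The hypothetical change_01.txt from the text, and made-up input lines.
CHANGE_TEXT = "\n".join([
    "sep=:",
    "n~:J:str: normalize HK",
    "\\{µ:{T:re",
    " compar$: compar.:re:typo",
])
sample_lines = ["jn~Ana {µ", "voir compar", "unchanged"]
new_lines, change_log = update(sample_lines, parse_changes(CHANGE_TEXT))
print(new_lines)
```

Because the fields are split on the separator without any escaping, this sketch assumes the separator never occurs inside a search or replacement string; choosing a different ‘sep=’ value is the natural way around that.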

Note

The framework which this documentation describes was developed over a 6-12 month period. In applications of the framework early in this period, the ‘Line by line update’ was not used. Hence, the ‘global update’ technique was applied even to cases where the so-called global change in fact changed only one line.

Line by line update

In the second type of update, the corrections are applied to specific lines of the digitization; a specific line is identified by its line number in the digitization file, with the first line of the digitization file having line number 1, the second line having line number 2, etc.

Here is the program to apply:

python updateByLine.py <input> <change> <output>

python updateByLine.py X_v0.txt changeByLine.txt X.txt

Here is a sample <change> file:

43384 old <P>{%°as2t2a-sa1¤hasra-%} nt.  (2x8) 16.000.
43384 new <P>{%°as2t2a-sa1¤hasra-%} nt.  (2 × 8) 16.000.
110648 old <P>{@SVAD-@} {%-svadate ({%-ti%}) ({%sva1dati%}, mauv. var. pour {%-kha1dati%}) ;
110648 new <P>{@SVAD-@} {%-svadate%} ({%-ti%}) ({%sva1dati%}, mauv. var. pour {%-kha1dati%}) ;

The <change> file for the updateByLine program consists of pairs of lines:

<lnum> old <oldtext>
<lnum> new <newtext>

The logic of the updateByLine program is quite simple:

  • Read all the lines of the <input> digitization into an array of input strings.
  • Read the lines of the <change> file into an array of change objects: <lnum>, <oldtext>, <newtext>.
  • For each change object:
      • find the corresponding <input> string, and verify that it matches <oldtext>;
      • change the corresponding <input> string to <newtext>.
  • Write the array of (modified) <input> strings to <output>.
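The steps above can be sketched as follows. This is a minimal Python 3 illustration, not the actual updateByLine.py: the function names are hypothetical, and raising an error when a line does not match <oldtext> is an assumption, since the behavior of the real program in that case is not stated here.

```python
def split_change_line(line):
    """Split '<lnum> <tag> <text>' into (int lnum, tag, text)."""
    parts = line.split(" ", 2)
    if len(parts) < 2:
        raise ValueError("malformed change line: %r" % line)
    text = parts[2] if len(parts) == 3 else ""
    return int(parts[0]), parts[1], text


def parse_line_changes(change_lines):
    """Turn old/new pairs of lines into (lnum, oldtext, newtext) tuples."""
    if len(change_lines) % 2 != 0:
        raise ValueError("change lines must come in old/new pairs")
    changes = []
    for i in range(0, len(change_lines), 2):
        old_lnum, old_tag, old_text = split_change_line(change_lines[i])
        new_lnum, new_tag, new_text = split_change_line(change_lines[i + 1])
        if (old_tag, new_tag) != ("old", "new") or old_lnum != new_lnum:
            raise ValueError("bad pair starting at change line %d" % (i + 1))
        changes.append((old_lnum, old_text, new_text))
    return changes


def update_by_line(input_lines, changes):
    """Apply each change to its 1-based line after verifying the old text."""
    output = list(input_lines)
    for lnum, old_text, new_text in changes:
        if output[lnum - 1] != old_text:
            # Assumption: a mismatch is treated as a fatal error.
            raise ValueError("line %d does not match <oldtext>" % lnum)
        output[lnum - 1] = new_text
    return output


# Made-up three-line digitization and a one-pair change file.
digitization = ["first line", "nt.  (2x8) 16.000.", "third line"]
change_file = [
    "2 old nt.  (2x8) 16.000.",
    "2 new nt.  (2 × 8) 16.000.",
]
updated = update_by_line(digitization, parse_line_changes(change_file))
print(updated[1])
```

Verifying <oldtext> before replacing is what makes the step safely repeatable: if the line numbering of the input has shifted, the mismatch is detected instead of silently overwriting the wrong line.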

Special update

For a small number of dictionaries, the original digitization is split into more than one piece. For example, in the Sabdakalpadruma dictionary, a special purpose program, update1.py, splits skd_orig_utf8.txt into skd_v0.txt (the body of the dictionary) and skd_preface.txt (the dictionary preface):

python26 update1.py skd_v0.txt skd_v1.txt skd-preface.txt

No further work is done with skd_preface.txt. Various global and line-by-line updates are applied to skd_v0.txt and result in skd.txt.

For Vachaspatyam, the original digitization is separated into three parts:

python26 update1.py vcp_orig.txt vcp_orig0.txt vcp_preface.txt vcp_end.txt

TODO: Need to find the other dictionaries with this ‘special’ update.