python26 hw1.py Xhw0.txt Xhw1.txt Xhw1_note.txt
The hw1 program applies certain normalizations to the crude headwords in the Xhw0.txt file, and writes the normalized form to the Xhw1.txt file. Most headwords are not changed by this normalization. For those headwords which are changed, the crude and normalized forms are written to the Xhw1_note.txt file.
No changes are made to the page numbers or line numbers. The Xhw1.txt and Xhw0.txt files have the same number of lines.
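The flow described above can be sketched as follows. This is a minimal illustration in modern Python, not the actual hw1.py code: the function name `run_hw1` and the `normalize` argument are hypothetical, and the sketch works on plain headword strings, leaving aside the page and line-number fields whose exact layout in the Xhw0.txt file is not described here.

```python
def run_hw1(crude_headwords, normalize):
    """Hypothetical sketch: return (normalized headwords, change notes)."""
    normalized = []
    notes = []
    for crude in crude_headwords:
        norm = normalize(crude)
        # One output per input, so the line count is preserved.
        normalized.append(norm)
        # Only changed headwords are recorded in the notes.
        if norm != crude:
            notes.append("'%s' => '%s'" % (crude, norm))
    return normalized, notes
```

Because every input headword yields exactly one output headword, the two lists always have the same length, while the notes list holds only the headwords that were actually changed.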
The general term normalize means ‘to bring (someone or something) back to a usual or expected state or condition’. In our case of normalizing crude headwords in a Sanskrit dictionary, here are some of the kinds of changes that are made:
Remove non-headword data from the crude headword. An example from skd:
1-001: 'aMzumatI strI' => 'aMzumatI' :115,118
Remove alternate spellings. An example from skd:
2-144: 'kube(ve)raH' => 'kuberaH' :71062,71072
Here, the text indicates a variant spelling 'kuveraH'. There is certainly some justification for allowing access to the entry via this alternate spelling. However, in the version of the skd dictionary as of this writing, this alternate spelling is effectively discarded.
Change certain features of the spelling. An example from skd:
2-139: 'kuNDaM' => 'kuNDam' :70382,70383
Here, a final anusvara ('M') is changed to 'm'. It is debatable whether this change should be made in this ``hw1`` step.
Remove Sanskrit accents. Two examples from sch:
053-3: 'apsavyA3' => 'apsavyA' :3891,3891
087-2: 'astaryà' => 'astarya' :6352,6352
For our further work in displays, the headword is a *key* by which a dictionary entry can be found. So, it is a piece of *metadata* to the digitization. This normalization does not mean that we are changing the digitization. Rather, we are changing the key metadata to a more useful form.
Remove avagraha. An example from sch:
1-040: 'adho'zukaM' => 'adhozukam' :6059,6060
Remove miscellaneous unhelpful coding. For example, in skd the nukta is often coded with the retroflex soft consonants, as 'D.' or 'D2':
2-309: 'garuD2aH' => 'garuDaH' :96107,96189
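The kinds of changes listed above can be illustrated with a short sketch. It is written in modern Python and is only an assumption about how such rules might be expressed; it is not the actual hw1.py logic, and the function name `normalize_headword` is hypothetical. The rules are deliberately crude (for instance, every digit is dropped) and are tuned only to the examples shown in this section.

```python
import re
import unicodedata


def normalize_headword(crude):
    """Hypothetical normalizer; not the actual hw1.py logic."""
    parts = crude.split()
    if not parts:
        return crude
    # Keep only the first token, dropping trailing data such as 'strI'.
    hw = parts[0]
    # Drop parenthesized alternate spellings: 'kube(ve)raH' -> 'kuberaH'.
    hw = re.sub(r"\([^)]*\)", "", hw)
    # Drop accents written as combining marks: decompose, then remove marks.
    hw = "".join(c for c in unicodedata.normalize("NFD", hw)
                 if unicodedata.category(c) != "Mn")
    # Drop digits, covering both 'apsavyA3' and the 'D2' nukta coding.
    hw = re.sub(r"\d", "", hw)
    # Drop the alternative nukta coding 'D.'.
    hw = hw.replace("D.", "D")
    # Drop avagraha, coded as an apostrophe.
    hw = hw.replace("'", "")
    # Change a final anusvara 'M' to 'm'.
    if hw.endswith("M"):
        hw = hw[:-1] + "m"
    return hw
```

Applied to the crude headwords quoted above, this sketch reproduces the normalized forms shown, for example `normalize_headword("adho'zukaM")` gives `'adhozukam'`. A headword that needs no change, such as 'aMzumatI', is returned as-is.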
There are certain kinds of potential normalizations of spelling which are not currently employed.
Changing headword spelling of nouns from the nominative singular form to a stem form, or vice versa. In other words, there are different conventions for headword spelling of nouns. For instance:
ap90,skd rAmaH <-> sch,pwg,mw rAma
ap90,skd vanaM <-> sch,pwg,mw vana
Root entries in dhatupatha form or in normalized form:
skd,wil gama <-> ap90,pwg,mw gam
Use of homorganic nasal or anusvara:
ap90 gaMgA <-> wil,skd,mw gaGgA (Harvard-Kyoto)
Duplication of consonant after semi-vowel ‘r’:
skd kAryyaM <-> wil kAryya <-> ap90 kArya
Clearly, such differences in spelling cause difficulties in comparing entries across different dictionaries.
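To make the difficulty concrete, here is a rough sketch of a comparison key that maps some of the variant spellings above to a common form. It is only an illustration under strong assumptions: the function name `comparison_key` and its rules are hypothetical, the input is assumed to be Harvard-Kyoto spelling as in the examples, and the rules are far too crude for general use (stripping a final 'H', for instance, is wrong for many stems). The root-form difference (gama <-> gam) is not handled at all.

```python
import re

# Harvard-Kyoto homorganic nasal for each class of stop (assumed coding).
NASAL = {"k": "G", "g": "G", "c": "J", "j": "J", "T": "N",
         "D": "N", "t": "n", "d": "n", "p": "m", "b": "m"}


def comparison_key(headword):
    """Hypothetical key for comparing headwords across dictionaries."""
    key = headword
    # Nominative singular to stem, for a-stems only: 'rAmaH', 'vanaM'.
    if key.endswith("aH") or key.endswith("aM"):
        key = key[:-1]
    # Undo consonant doubling after 'r': 'kAryya' -> 'kArya'.
    key = re.sub(r"r([kgcjTDtdpbmnyv])\1", r"r\1", key)
    # Anusvara before a stop becomes the homorganic nasal: 'gaMgA' -> 'gaGgA'.
    key = re.sub(r"M([kgcjTDtdpb])",
                 lambda m: NASAL[m.group(1)] + m.group(1), key)
    return key
```

With these rules, 'kAryyaM', 'kAryya' and 'kArya' all receive the key 'kArya', and 'gaMgA' and 'gaGgA' both receive 'gaGgA'. The choice of which spelling serves as the common form is arbitrary here.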
A potential advantage of having dictionaries in digital form is that it should be possible to create a global headword list for all the dictionaries, constructed in such a way that such differences in spelling conventions are taken into account.
However, a fully realized solution to this problem is not currently available; this would provide a good research project.