Xhw1.txt : normalized headwords

program invocation:

python26 hw1.py Xhw0.txt Xhw1.txt Xhw1_note.txt

The hw1 program applies certain normalizations to the crude headwords in the Xhw0.txt file, and writes the normalized form to the Xhw1.txt file. Most headwords are not changed by this normalization. For those headwords which are changed, the crude and normalized forms are written to the Xhw1_note.txt file.

The page number and line numbers are not changed. The Xhw1.txt file and the Xhw0.txt file have the same number of lines.
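The bookkeeping described above can be sketched as follows. This is only an illustration, not the actual hw1.py. The record layout `page:headword:line1,line2` assumed for Xhw0.txt and the names `split_record` and `apply_normalization` are assumptions made for this example. The note layout imitates the note lines quoted later on this page.

```python
def split_record(line):
    """Split one record into (page, headword, line numbers).

    Assumes the hypothetical layout 'page:headword:line1,line2'.
    """
    page, rest = line.rstrip("\n").split(":", 1)
    headword, linenums = rest.rsplit(":", 1)
    return page, headword, linenums


def apply_normalization(records, normalize):
    """Return (normalized records, note lines) for a list of crude records."""
    normalized, notes = [], []
    for record in records:
        page, crude, linenums = split_record(record)
        new = normalize(crude)
        # Page and line numbers are copied unchanged: one output per input.
        normalized.append("%s:%s:%s" % (page, new, linenums))
        if new != crude:
            # Only changed headwords are reported in the note list.
            notes.append("%s:  '%s' => '%s'  :%s" % (page, crude, new, linenums))
    return normalized, notes
```

Because every input record yields exactly one output record, the two files necessarily have the same number of lines, and the note list is normally much shorter.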

The general term normalize means ‘to bring (someone or something) back to a usual or expected state or condition’. When normalizing crude headwords in a Sanskrit dictionary, these are some of the kinds of changes that are made:

  • Remove non-headword data from the crude headword. An example from skd:

    1-001:  'aMzumatI strI' => 'aMzumatI'  :115,118
    
  • Remove alternate spellings. An example from skd:

     2-144:  'kube(ve)raH' => 'kuberaH'  :71062,71072
    
    Here, the text indicates a variant spelling 'kuveraH'.
    There is certainly some justification for allowing access to the
    entry via this alternate spelling.
    However, in the version of the skd dictionary as of this writing,
    this alternate spelling is effectively discarded.
    
  • Change certain features of the spelling. An example from skd:

     2-139:  'kuNDaM' => 'kuNDam'  :70382,70383
    
    Here, a final anusvara ('M') is changed to 'm'.
    It is debatable whether this change should be made in this hw1 step.
    
  • Remove Sanskrit accents. Two examples from sch:

     053-3:  'apsavyA3' => 'apsavyA'  :3891,3891
     087-2:  'astaryà' => 'astarya'  :6352,6352
    
    For our later work on displays, the headword is the key by which
    a dictionary entry can be found.  So, it is a piece of metadata about the
    digitization.  This normalization does not change the digitization
    itself.  Rather, it changes the key metadata to a more useful form.
    
  • Remove avagraha. An example from sch:

    1-040:  'adho'zukaM' => 'adhozukam'  :6059,6060
    
  • Remove miscellaneous unhelpful coding. For example, in skd the nukta on retroflex soft consonants is often coded as 'D.' or 'D2':

    2-309:  'garuD2aH' => 'garuDaH'  :96107,96189
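
The kinds of changes listed above can be illustrated with a small function. This is a minimal sketch inferred only from the examples on this page, not the rules actually used by hw1.py. The name `normalize_headword`, the order of the steps, and the decision to drop every digit (which covers both the 'A3' accent coding and the 'D2' nukta coding) are assumptions.

```python
import re
import unicodedata


def normalize_headword(crude):
    """Apply example-driven normalizations to one crude headword (sketch)."""
    # Remove non-headword data: keep only the first space-separated token.
    hw = crude.split(" ")[0]
    # Remove alternate spellings given in parentheses, e.g. 'kube(ve)raH'.
    hw = re.sub(r"\([^)]*\)", "", hw)
    # Remove accents written as combining marks, e.g. the grave in 'astaryà'.
    hw = unicodedata.normalize("NFD", hw)
    hw = "".join(ch for ch in hw if not unicodedata.combining(ch))
    # Remove digit codings such as 'A3' (accent) and 'D2' (nukta).
    hw = re.sub(r"\d", "", hw)
    # Remove the other nukta coding, 'D.'.
    hw = hw.replace("D.", "D")
    # Remove avagraha, coded as an apostrophe.
    hw = hw.replace("'", "")
    # Change a final anusvara 'M' to 'm'.
    if hw.endswith("M"):
        hw = hw[:-1] + "m"
    return hw
```

For instance, `normalize_headword("kube(ve)raH")` gives `'kuberaH'`, and `normalize_headword("adho'zukaM")` gives `'adhozukam'`, matching the note lines quoted above.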
    

There are certain kinds of potential normalizations of spelling which are not currently employed.

  • Changing headword spelling of nouns from the nominative singular form to a stem form, or vice versa. Different dictionaries follow different conventions for the headword spelling of nouns. For instance:

    ap90,skd  rAmaH  <-> sch,pwg,mw rAma
    ap90,skd  vanaM  <-> sch,pwg,mw vana
    
  • Root entries in dhatupatha form or in normalized form:

    skd,wil   gama  <-> ap90,pwg,mw  gam
    
  • Use of homorganic nasal or anusvara:

    ap90 gaMgA  <-> wil,skd,mw gaGgA  (Harvard-Kyoto)
    
  • Duplication of consonant after semi-vowel ‘r’:

    skd kAryyaM <-> wil kAryya <-> ap90 kArya
    

Clearly, such differences in spelling cause difficulties in comparing entries across different dictionaries.

A potential advantage of having dictionaries in digital form is that it should be possible to create a global headword list for all the dictionaries, built in such a way that these differences in spelling conventions are taken into account.

However, a fully realized solution to this problem is not currently available; this would provide a good research project.
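
As a very partial illustration of what such a comparison might involve, the sketch below folds three of the four spelling conventions listed above into one lossy comparison key. It is not part of the hw1 program. The name `comparison_key`, the `CLASS_NASAL` table, and the choice of the stem form as the common target are assumptions, and the rules are deliberately crude: a word that genuinely ends in 'am' would also lose its ending. The key is meant only for matching entries across dictionaries, never for display.

```python
import re

# Assumed Harvard-Kyoto class nasals, keyed by the first letter of the
# following stop (aspirates such as 'kh' start with the same letter).
CLASS_NASAL = {
    "k": "G", "g": "G",
    "c": "J", "j": "J",
    "T": "N", "D": "N",
    "t": "n", "d": "n",
    "p": "m", "b": "m",
}


def comparison_key(headword):
    """Fold some spelling conventions into one lossy key (sketch)."""
    # Anusvara before a stop -> homorganic nasal: gaMgA -> gaGgA.
    key = re.sub(
        r"M([kgcjTDtdpb])",
        lambda m: CLASS_NASAL[m.group(1)] + m.group(1),
        headword,
    )
    # Consonant doubled after 'r' -> single consonant: kAryya -> kArya.
    key = re.sub(r"r([kgcjTDtdpbyvlmnN])\1", r"r\1", key)
    # Nominative singular endings of a-stems -> stem: rAmaH, vanaM -> rAma, vana.
    key = re.sub(r"a(H|M|m)$", "a", key)
    return key
```

With these rules 'kAryyaM', 'kAryya' and 'kArya' all map to the key 'kArya'. The root convention (gama versus gam) is left out on purpose: without knowing that an entry is a root, 'gama' cannot be told apart from a noun stem ending in 'a'.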