Construction of xml file

Inputs: X.txt, Xhw2.txt. Output: X.xml.

program invocation:

python make_xml.py X.txt Xhw2.txt X.xml

Once X.xml has been constructed, we need to construct an associated X.dtd file that defines the xml structure of X.xml. Then, the following command (under linux) validates X.xml against that structure:

xmllint --noout --valid X.xml

An xml version of the digitization is easily converted into an SQL database table, which in turn allows convenient random access by headword key for web display applications.
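As a rough illustration of this kind of keyed access, the sketch below loads one record into an in-memory SQLite table and retrieves it by headword. The table name `skd`, the column names `key`, `lnum` and `data`, the index name, and the abbreviated record text are all assumptions made for this example; they are not necessarily the schema produced by the actual sqlite construction.

```python
import sqlite3

# Hypothetical one-line xml record, abbreviated for the example.
record = ("<H1><h><key1>aH</key1><key2>aH</key2></h>"
          "<body>...</body><tail><L>3</L><pc>1-001</pc></tail></H1>")

conn = sqlite3.connect(":memory:")
# Assumed schema: headword key, record number, xml text of the record.
conn.execute("CREATE TABLE skd (key TEXT, lnum INTEGER, data TEXT)")
# An index on the headword key makes lookups by key fast.
conn.execute("CREATE INDEX skd_key ON skd (key)")
conn.execute("INSERT INTO skd VALUES (?, ?, ?)", ("aH", 3, record))
conn.commit()

def lookup(conn, key):
    """Return the xml records stored under the given headword key."""
    cur = conn.execute(
        "SELECT data FROM skd WHERE key = ? ORDER BY lnum", (key,))
    return [row[0] for row in cur]

rows = lookup(conn, "aH")
```

A web display application would then parse or transform the returned xml text for the requested headword.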

Another advantage of an xml version is that it imposes a validatable markup structure on the digitization. For example, many digitizations use informal markup such as {%and%} to indicate that ‘and’ is italicized in the text; implicit in such markup is that the opening ‘{%’ must be matched by the closing ‘%}’. When this markup in X.txt is converted to <i>and</i> in X.xml, standard xml tools can detect errors such as missing markup. For instance, if X.xml has ‘<i>and’ (with no balancing </i>), this is caught by checking whether X.xml is well-formed; the error message then leads back to the error ‘{%and’ (with no closing ‘%}’) in X.txt, which can be corrected.
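The following minimal sketch shows the idea. It assumes that the opening and closing markers are replaced independently of each other, which is what makes an unbalanced ‘{%’ show up as an unbalanced <i>; the function names and the `entry` wrapper element are hypothetical, and the conversion in a given make_xml.py may be done differently.

```python
import xml.etree.ElementTree as ET

def convert_italics(text):
    """Replace the informal {% and %} markers with <i> and </i> tags."""
    return text.replace("{%", "<i>").replace("%}", "</i>")

def is_well_formed(xml_text):
    """Return True if xml_text parses as xml, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = "<entry>this {%and%} that</entry>"
bad = "<entry>this {%and that</entry>"   # the closing %} is missing

good_ok = is_well_formed(convert_italics(good))
bad_ok = is_well_formed(convert_italics(bad))
```

Here `good_ok` is True and `bad_ok` is False: the parser rejects the second string because the <i> element is never closed.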

Four general principles guide the construction of X.xml for the various dictionaries:

  1. Transform the informal markup in the digitization to an xml form
  2. Aggregate the headword entries of the digitization into a single xml record
  3. Add a small amount of additional markup for identifying the page number and headword for an aggregated headword entry
  4. Use a line-break element to preserve the lines of the digitization

The details of the internal markup are summarized in a ‘meta’ file for each dictionary. This meta file is named ‘X-meta.txt’ and is in the Xtxt.zip download for dictionary X (for instance, skd-meta.txt shows the markup conventions used in the Sabdakalpadruma dictionary, and is available in the skdtxt.zip download). This X-meta.txt file also comprises the main part of the developer documentation for dictionary X; see for instance SKD Sabda-kalpadruma (Developer notes).

The rest of this documentation describes in some detail the make_xml.py program for a particular dictionary, Sabdakalpadruma (skd). The informal markup in the digitization skd.txt is simpler than in many dictionaries.

The construction of skd.xml consists of three parts:
  1. The head part:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE skd SYSTEM "skd.dtd">
    <!-- Copyright Universitat Koln 2013 -->
    <skd>

  2. The body part, with one entry for each headword line of skdhw2.txt. This part is described below.

  3. The tail part, which is just one line ‘closing’ the root element:

    </skd>
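A minimal sketch of this three-part layout follows. The function name `write_xml`, its parameters, and the choice to write each head declaration on its own line are assumptions for illustration; only the head and tail text itself is taken from the description above.

```python
import io

def write_xml(records, out, dictcode="skd"):
    """Write the head, the body records (one per line), and the tail."""
    # Head part: xml declaration, doctype, copyright comment, root start tag.
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<!DOCTYPE %s SYSTEM "%s.dtd">\n' % (dictcode, dictcode))
    out.write('<!-- Copyright Universitat Koln 2013 -->\n')
    out.write('<%s>\n' % dictcode)
    # Body part: one already-constructed record per line.
    for record in records:
        out.write(record + '\n')
    # Tail part: close the root element.
    out.write('</%s>\n' % dictcode)

buf = io.StringIO()
write_xml(['<H1>...</H1>'], buf)      # placeholder record
output_lines = buf.getvalue().splitlines()
```

In practice `out` would be a file opened for writing with UTF-8 encoding rather than an in-memory buffer.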

Here is how the third record of the body part is constructed. It is based on the third line of the skdhw2.txt headword list:

1-001:aH:41,47

and the corresponding lines 41-47 of skd.txt:

<HI>aH, puM, (atati sarvvaM vyApnoti iti ataterDaH)
<>viSNuH | iti medinI | [Page1-001-b+ 41]
<>“akAro viSNuruddiSTa ukArastu mahezvaraH |
<>makAra ucyate brahmA praNavena trayo matAH” ||
<>iti durgAdAsadhRtavacanaM | (klI | brahma | yathA, --
<>a i u e o om kalAzca mUlaM brahma iti
<>kIrttitam, iti agnipurANam |)
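The lookup just described can be sketched as follows. The sketch assumes that a headword-list line has exactly the form page:headword:start,end and that the line numbers are 1-based and inclusive, as the example suggests; the function names and the dummy line list are hypothetical.

```python
def parse_hw2(line):
    """Split a headword-list line of the form page:headword:start,end."""
    page, key, linenums = line.rstrip("\n").split(":")
    start, end = (int(n) for n in linenums.split(","))
    return page, key, start, end

def entry_lines(all_lines, start, end):
    """Return lines start..end of the digitization (1-based, inclusive)."""
    return all_lines[start - 1:end]

page, key, start, end = parse_hw2("1-001:aH:41,47")

# Dummy stand-in for the lines of skd.txt, just to show the slicing.
dummy = ["line %d" % n for n in range(1, 101)]
chunk = entry_lines(dummy, start, end)
```

For the example line this yields page `1-001`, headword `aH`, and the seven lines numbered 41 through 47.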

Here is the corresponding third record of skd.xml, whose construction will be explained (this record comprises just one line of skd.xml, but we have added line breaks for the sake of this explanation):

<H1>
 <h><key1>aH</key1><key2>aH</key2></h>
 <body>
  <HI/><s>aH, puM, (atati sarvvaM vyApnoti iti ataterqaH)</s>
  <lb/><s>vizRuH . iti medinI . </s>[Page1-001-b+ 41]
  <lb/><s>“akAro vizRuruddizwa ukArastu maheSvaraH .</s>
  <lb/><s>makAra ucyate brahmA praRavena trayo matAH” ..</s>
  <lb/><s>iti durgAdAsaDftavacanaM . (klI . brahma . yaTA, --</s>
  <lb/><s>a i u e o om kalASca mUlaM brahma iti</s>
  <lb/><s>kIrttitam, iti agnipurARam .)</s>
 </body>
 <tail><L>3</L><pc>1-001</pc></tail>
</H1>

The whole entry is an ‘<H1>’ element in the xml file.

This <H1> element is composed of a sequence of three child elements:
  • h , the head of the record, containing the headword in two forms, key1 and key2. key1 is derived from the skdhw2 record. key2 is derived from the headword portion of the first line of the digitization, as specified in headword.py; namely the characters between <HI> and the comma. In this case, key2 is the same as key1. key1 will be the ‘primary’ key of the database table for skd (skd.sqlite).

  • body , the body of the text. This corresponds to the lines of the digitization, with these differences:

    1. The pseudo xml elements <HI> and <> of the digitization become empty xml elements <HI/> and <lb/> (‘lb’ for ‘line-break’)
    2. The rest of each line of the digitization represents Sanskrit Devanagari in skd. This is enclosed in the ‘<s>’ element in the xml. However, notice that the metadata [Page1-001-b+ 41] of the second line is not within the <s> element.
    3. The Devanagari Sanskrit, now in the <s> element, is converted from the Harvard-Kyoto transliteration to SLP1 transliteration, primarily because the display software expects data in the <s> element to be in SLP1.
  • tail , the tail of the record, containing two pieces of metadata:
    1. <L>3</L> indicates this is the 3rd record
    2. <pc>1-001</pc> indicates this record starts on scan page 1-001 (page 001 of volume 1). The 1-001 is from the skdhw2 record.
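The assembly of one record can be sketched as below. This is not the actual make_xml.py code: the function names are hypothetical, the small `HK_TO_SLP1` table is only a partial stand-in for the transcoder module, the rule that trailing [Page...] metadata stays outside <s> is inferred from the example record, and key2 is assumed to be already in SLP1. No escaping of xml special characters is attempted.

```python
import re

# Partial HK -> SLP1 table; a stand-in for the real transcoder module.
HK_TO_SLP1 = {
    "lRR": "X", "RR": "F", "lR": "x", "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J", "Th": "W", "Dh": "Q",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "R": "f", "G": "N", "J": "Y", "T": "w", "D": "q", "N": "R",
    "z": "S", "S": "z", "|": ".",
}
# Longest keys first, so that "dh" is tried before any shorter key.
_HK_RE = re.compile("|".join(
    re.escape(k) for k in sorted(HK_TO_SLP1, key=len, reverse=True)))

def hk_to_slp1(text):
    """Transcode HK text to SLP1 using the partial table above."""
    return _HK_RE.sub(lambda m: HK_TO_SLP1[m.group(0)], text)

def body_line(line):
    """Convert one digitization line to its xml form."""
    if line.startswith("<HI>"):
        tag, rest = "<HI/>", line[len("<HI>"):]
    elif line.startswith("<>"):
        tag, rest = "<lb/>", line[len("<>"):]
    else:
        tag, rest = "", line
    # Keep trailing [Page...] metadata outside the <s> element.
    m = re.match(r"^(.*?)(\[Page[^\]]*\])?$", rest)
    text, meta = m.group(1), m.group(2) or ""
    return "%s<s>%s</s>%s" % (tag, hk_to_slp1(text), meta)

def make_record(lnum, page, key1, key2, lines):
    """Build the one-line <H1> record from its head, body and tail."""
    head = "<h><key1>%s</key1><key2>%s</key2></h>" % (key1, key2)
    body = "<body>%s</body>" % "".join(body_line(x) for x in lines)
    tail = "<tail><L>%d</L><pc>%s</pc></tail>" % (lnum, page)
    return "<H1>%s%s%s</H1>" % (head, body, tail)

lines = [
    "<HI>aH, puM, (atati sarvvaM vyApnoti iti ataterDaH)",
    "<>viSNuH | iti medinI | [Page1-001-b+ 41]",
]
record = make_record(3, "1-001", "aH", "aH", lines)
```

Applied to the first two lines of the example entry, this reproduces the corresponding part of the record shown above, on a single line.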

This completes the overview of the make_xml.py program.

The program imports two modules, in addition to a few standard library modules.

The headword module consists of just one line for skd; it is used in determining ‘key2’:

reHeadword = r'^<HI>(.*?)$'
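A small sketch of how this pattern might be applied to the first line of an entry follows. As written, the pattern captures everything after <HI> up to the end of the line; how make_xml.py narrows that to the headword is not shown here, so the split on the first comma is an assumption based on the description of key2 above, and the function name `extract_key2` is hypothetical.

```python
import re

reHeadword = r'^<HI>(.*?)$'   # as given in the headword module for skd

def extract_key2(line):
    """Return the headword portion of a <HI> line, or None if no match."""
    m = re.search(reHeadword, line)
    if m is None:
        return None
    # Assumption: the headword ends at the first comma.
    return m.group(1).split(",")[0].strip()

key2 = extract_key2("<HI>aH, puM, (atati sarvvaM vyApnoti iti ataterDaH)")
no_key = extract_key2("<>viSNuH | iti medinI |")
```

For the example entry `key2` is `aH`, and a continuation line that does not begin with <HI> gives None.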

The transcoder module is used to convert Sanskrit text from HK to SLP1 transliteration; it makes use of the hk_slp1.xml specification of the details of this transcoding. For more on the usage of the module, see Transcoding (Developers).