Transcoding (Developers)


In the context of Sanskrit dictionaries, the term transcoding originally referred to a method for converting one representation of Sanskrit to another. However, the concept is not limited to Sanskrit. For instance, a similar technique applies to converting one representation of European-language diacritics to another, such as Anglicized Sanskrit to Unicode.

Around 2006, Malcolm Hyman developed a method of converting the SLP1 transliteration of Sanskrit into Devanagari Unicode. Technically, this was a C program based upon the Unix FLEX program. Variations of this technique were developed for the displays of the Monier-Williams Sanskrit-English Dictionary on the Cologne Sanskrit Lexicon web site.

This FLEX approach had two drawbacks. First, it required compiling the FLEX-generated C code separately for each target operating system. Second, the syntax of the defining FLEX file required some understanding of the C programming language.

In 2009, Ralph Bunker realized that the details of a particular transcoding could be represented by a relatively simple XML document, which could be processed by a general-purpose transcoding program. He originally implemented such a program in Java. Soon after, Jim Funderburk ported Bunker’s Java program to PHP and then to Python, and these implementations are used on this website. Bunker has since extended his Java program and XML file structures beyond those that are used here.

Python usage

Here are the basic steps to use the Python transcoding.
  1. Import the transcoder module:

    import transcoder
  2. Establish the directory that contains the transcoder xml files:

    transcoder.transcoder_set_dir("") # use current directory
  3. Transcode a string:

    y = transcoder.transcoder_processString(x,'<from>','<to>')  # general form
    y = transcoder.transcoder_processString(x,'hk','slp1')      # for example, HK to SLP1

The program looks for the file ‘hk_slp1.xml’ in the transcoder directory. From this file, it constructs a finite state machine data structure representing the transcoding. The input string ‘x’ is operated on by the finite state machine to produce the transcoded output ‘y’.

  4. Transcode the parts of a string which are text contents of a particular xml element. For instance:

    x =  "The word <SA>azva</SA> means 'horse'."
    y = transcoder.transcoder_processElements(x,'hk','slp1','SA')
    print(y)

    results in:

    The word aSva means 'horse'.

    since ‘z’ in HK corresponds to ‘S’ in slp1, as the hk_slp1.xml file specifies.
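One way such element-wise transcoding could be implemented is with a regular expression and a callback. The following is only a sketch and not the module's actual code: `process_elements` and `hk_to_slp1_stub` are hypothetical names, and the stub applies just the single ‘z’ to ‘S’ rule. It mirrors the output shown above, in which the SA tags do not appear in the result.

```python
import re

def hk_to_slp1_stub(text):
    """Hypothetical stand-in for the real transcoder: only HK 'z' -> SLP1 'S'."""
    return text.replace('z', 'S')

def process_elements(line, tagname, transcode):
    """Transcode the text inside each <tagname>...</tagname>; leave the rest alone."""
    tag = re.escape(tagname)
    pattern = re.compile(r'<%s>(.*?)</%s>' % (tag, tag))
    # The callback receives each match and returns the transcoded element text.
    return pattern.sub(lambda m: transcode(m.group(1)), line)

x = "The word <SA>azva</SA> means 'horse'; zebra is unchanged."
print(process_elements(x, 'SA', hk_to_slp1_stub))
```

Text outside the named element, such as the ‘z’ in "zebra", is not touched, which is the point of restricting the transcoding to one element.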

Sample xml structures

Here are parts of the hk_slp1.xml transcoding file used for ap90:

<fsm start='INIT' inputDecoding='UTF-8' outputEncoding='UTF-8'>
 <!-- Nov 23, 2013
  Used only by
  Dec 18, 2013. Add coding 'au0' for HK to transcode to 'au' for slp1,
    to handle words like 'titau0' -->
<e> <s>INIT</s> <in>R</in> <out>f</out></e>
<e> <s>INIT</s> <in>RR</in> <out>F</out></e>

<e> <s>INIT</s> <in>ai</in> <out>E</out></e>
<e> <s>INIT</s> <in>au</in> <out>O</out></e>
<e> <s>INIT</s> <in>au0</in> <out>au</out></e>
<e> <s>INIT</s> <in>kh</in> <out>K</out></e>
<e> <s>INIT</s> <in>gh</in> <out>G</out></e>
<e> <s>INIT</s> <in>G</in> <out>N</out></e>
<e> <s>INIT</s> <in>ch</in> <out>C</out></e>
<e> <s>INIT</s> <in>jh</in> <out>J</out></e>
<e> <s>INIT</s> <in>J</in> <out>Y</out></e>
<e> <s>INIT</s> <in>T</in> <out>w</out></e>
<e> <s>INIT</s> <in>Th</in> <out>W</out></e>
<e> <s>INIT</s> <in>D</in> <out>q</out></e>
<e> <s>INIT</s> <in>Dh</in> <out>Q</out></e>
<e> <s>INIT</s> <in>N</in> <out>R</out></e>
<e> <s>INIT</s> <in>th</in> <out>T</out></e>
<e> <s>INIT</s> <in>dh</in> <out>D</out></e>
<e> <s>INIT</s> <in>ph</in> <out>P</out></e>
<e> <s>INIT</s> <in>bh</in> <out>B</out></e>
<e> <s>INIT</s> <in>z</in> <out>S</out></e>
<e> <s>INIT</s> <in>S</in> <out>z</out></e>
<!-- <e> <s>INIT</s> <in>R/</in> <out>f</out></e> 20131115 changed -->
<!-- <e> <s>INIT</s> <in>RR/</in> <out>F</out></e> 20131115 changed -->
<!-- <e> <s>INIT</s> <in>L/</in> <out>x</out></e>  20131115 changed -->

<e> <s>INIT</s> <in>R</in> <out>f</out></e>
<e> <s>INIT</s> <in>RR</in> <out>F</out></e>
<e> <s>INIT</s> <in>L</in> <out>L</out></e> <!-- 20131115 -->
<e> <s>INIT</s> <in>Lh</in> <out>|</out></e> <!-- 20131123 -->
<e> <s>INIT</s> <in>lR</in> <out>x</out></e> <!-- 20131115 -->
<e> <s>INIT</s> <in>lRR</in> <out>X</out></e> <!-- 20131115 -->
<e> <s>INIT</s> <in>|</in> <out>.</out></e> <!-- 20131123 danda-->
<e> <s>INIT</s> <in>MM</in> <out>M~</out></e><!-- 20131124 candrabindu-->


Any characters not specified in the transcoder file are passed through; that is why this file does not have to show that ‘a’ in HK corresponds to the same ‘a’ in slp1, even though it might be ‘cleaner’ to do so.
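To illustrate how rules of this shape and the pass-through behaviour fit together, here is a minimal sketch. It is not the transcoder module's code: `load_rules` and `transcode` are hypothetical names, the XML is a three-rule excerpt, only the single INIT state is handled, and preferring the longest matching `<in>` string (so that ‘RR’ wins over ‘R’) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-rule excerpt in the same shape as hk_slp1.xml.
SAMPLE_XML = """<fsm start='INIT' inputDecoding='UTF-8' outputEncoding='UTF-8'>
<e> <s>INIT</s> <in>z</in> <out>S</out></e>
<e> <s>INIT</s> <in>R</in> <out>f</out></e>
<e> <s>INIT</s> <in>RR</in> <out>F</out></e>
</fsm>"""

def load_rules(xml_text):
    """Return a dict mapping each <in> string to its <out> string."""
    root = ET.fromstring(xml_text)
    return {e.findtext('in'): e.findtext('out') for e in root.findall('e')}

def transcode(text, rules):
    """Apply the longest matching rule at each position; copy anything else."""
    max_len = max((len(k) for k in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'RR' is preferred over 'R'.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule matched here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return ''.join(out)

rules = load_rules(SAMPLE_XML)
print(transcode('azva', rules))    # 'a' and 'v' have no rule and pass through
print(transcode('pitRRn', rules))  # 'RR' is matched as one unit
```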

Unicode characters should be represented in the \uxxxx form, where xxxx is the four-digit hexadecimal code point. For instance, in as_roman.xml (conversion from Anglicized Sanskrit to Roman Unicode), we can have:

<e> <s>INIT</s> <in>a10</in> <out>\u00e2</out> </e>

Here \u00e2 is converted by the transcoder program to â (Latin small letter ‘a’ with circumflex).
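A conversion of this kind could be sketched as follows. This is an illustration under assumptions, not the transcoder's own routine: `decode_u_escapes` is a hypothetical name, and the sketch handles only a backslash, a ‘u’ and exactly four hexadecimal digits.

```python
import re

def decode_u_escapes(text):
    """Replace each backslash-u escape (four hex digits) with its character."""
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

# A raw string keeps the backslash literal, as it would appear in the xml file.
out_text = r'\u00e2'
print(decode_u_escapes(out_text))
```

Text without any escape is returned unchanged, so the same function can be applied to every `<out>` value.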

PHP usage

The PHP version is functionally similar to the Python version. Because it parses the XML files with SimpleXML, PHP 5 or later is required.

Here is how the same tasks might look in PHP:

transcoder_set_dir("");  // sets transcoder directory to current directory
$y = transcoder_processString($x,"as","roman");  // transcode string in $x
$y = transcoder_processElements($x,"slp1",$filter,"SA"); // transcode the text of SA elements from slp1 to the encoding named in $filter