YAT Yates Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2014


The original digitization is the file yat_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may run the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference // describes this coding system.

The file yat_orig_utf8.txt is a conversion of yat_orig.txt to the more common utf-8 encoding. The file yat.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
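
As an illustration of the conversion described above, here is a minimal Python sketch that re-encodes cp1252 bytes as utf-8. The function names are hypothetical, and this is not necessarily how yat_orig_utf8.txt was actually produced.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the same text as utf-8."""
    # Python raises UnicodeDecodeError for the few byte values that are
    # undefined in cp1252 (0x81, 0x8d, 0x8f, 0x90, 0x9d).
    return data.decode("cp1252").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Convert a whole file; binary mode leaves line endings untouched."""
    with open(src, "rb") as f:
        raw = f.read()
    with open(dst, "wb") as f:
        f.write(cp1252_to_utf8(raw))
```

A call such as convert_file("yat_orig.txt", "yat_orig_utf8.txt") would then write the utf-8 copy next to the original.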

Several non-ASCII characters occur in yat.txt; each is listed here with its Unicode code point, number of occurrences, and Unicode name:

¦  (\u00a6) 45211 := BROKEN BAR
Æ  (\u00c6)    34 := LATIN CAPITAL LETTER AE
æ  (\u00e6)    69 := LATIN SMALL LETTER AE
‘  (\u2018)     1 := LEFT SINGLE QUOTATION MARK
’  (\u2019)     1 := RIGHT SINGLE QUOTATION MARK
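
A tally of this kind can be regenerated with a few lines of Python. The sketch below is only one possible way to do it; the function name is hypothetical, and the rows it returns carry the same four fields as the list above.

```python
import unicodedata
from collections import Counter


def non_ascii_tally(text: str) -> list[tuple[str, str, int, str]]:
    """Return (character, code point, count, Unicode name) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"\\u{ord(ch):04x}", n, unicodedata.name(ch, "UNKNOWN"))
        for ch, n in sorted(counts.items())
    ]
```

Applied to the full text of yat.txt (read with encoding="utf-8"), each returned row can be printed in the layout used above.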

The {X...X} style of coding serves several purposes:

{#X#}   48296 : Devanagari text, coded in HK (Harvard-Kyoto)
{%X%}   55674 : italic text
{??}        1 : uncodable text
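
The three brace forms can be picked out with a single regular expression. The following is a minimal sketch, assuming that the spans do not nest; the function name, the labels, and the sample line in the usage note are illustrative only.

```python
import re

# One alternative per brace form: {#X#}, {%X%}, and the literal {??}.
# DOTALL lets a span continue across a line break within an entry.
BRACE_RE = re.compile(
    r"\{#(?P<skt>.*?)#\}|\{%(?P<ital>.*?)%\}|(?P<unk>\{\?\?\})",
    re.DOTALL,
)


def brace_spans(text: str) -> list[tuple[str, str]]:
    """Return (kind, content) pairs in order of appearance."""
    spans = []
    for m in BRACE_RE.finditer(text):
        if m.group("skt") is not None:
            spans.append(("devanagari_hk", m.group("skt")))
        elif m.group("ital") is not None:
            spans.append(("italic", m.group("ital")))
        else:
            spans.append(("uncodable", ""))
    return spans
```

For a made-up line such as "<HI>{#a#}¦ {%m.%} x {??}", this yields one Devanagari span, one italic span, and one uncodable marker.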

The following <x> type tags are found:

<HI>    45205 : Headword
<>      24834 : Line break
<F>..</F>   2 : Footnote
<g></g>     2 : Greek text (uncoded)
<H>        49 : Headline for letter break
<HS>        1 : In preface
<P>         6 : Paragraph break, in preface
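
The tag counts can be checked with a similar tally. This sketch assumes that every tag has the simple form shown above (letters only, no attributes); the function name is hypothetical.

```python
import re
from collections import Counter

# Matches <>, <HI>, <F>, </F>, and so on: an optional slash, then letters only.
TAG_RE = re.compile(r"<(/?)([A-Za-z]*)>")


def tag_tally(text: str) -> Counter:
    """Count each distinct tag, opening and closing forms separately."""
    return Counter(f"<{slash}{name}>" for slash, name in TAG_RE.findall(text))
```

Note that this counts <F> and </F> separately, whereas the list above counts each <F>..</F> pair once.
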
Page breaks are coded as [Page...].
Pages are like [Page-X+nn], where X has one of several forms:
(a) -title-Y (Y = ii, iii, iv)
(b) 908 (an intentionally blank page, before addenda)
(c) Y (with Y = 924-927; a page missing from pdf used for digitization)
(d) Y-C where Y = 001 to 923 (except 908) and C = ‘a’ or ‘b’ (column)
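
The four forms can be told apart programmatically. The sketch below is tentative: the exact marker syntax is inferred from the pattern [Page-X+nn] given above, so the regular expression tolerates an optional hyphen after "Page" and optional spaces before nn, and nn is returned as a plain integer because its meaning is not stated here. All names are hypothetical.

```python
import re

# [Page, optional hyphen, X (no ']' or '+'), '+', optional spaces, digits, ]
PAGE_RE = re.compile(r"\[Page-?(?P<x>[^\]+]+)\+\s*(?P<nn>\d+)\]")


def classify_page(x: str) -> str:
    """Map X to one of the forms (a)-(d), or 'unknown'."""
    if re.fullmatch(r"title-(ii|iii|iv)", x):
        return "title"                      # form (a)
    m = re.fullmatch(r"(\d{3})(?:-([ab]))?", x)
    if m is None:
        return "unknown"
    page, column = int(m.group(1)), m.group(2)
    if column is not None:                  # form (d): page plus column
        return "column" if 1 <= page <= 923 and page != 908 else "unknown"
    if page == 908:
        return "blank"                      # form (b)
    if 924 <= page <= 927:
        return "missing"                    # form (c)
    return "unknown"


def parse_page_markers(text: str) -> list[tuple[str, str, int]]:
    """Return (X, form, nn) for every page marker found in text."""
    return [
        (m.group("x"), classify_page(m.group("x")), int(m.group("nn")))
        for m in PAGE_RE.finditer(text)
    ]
```

Markers that come back as 'unknown' would indicate either a coding error or a mismatch between this sketch and the real marker syntax.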

The lines of the digitization correspond to actual text lines.

Headword coding is exemplified by: <HI>{#a#}¦
More generally, <HI>{#X#}¦ where X is in HK coding.
More typical is <HI>{#akSa|pATika (kaH)#}¦
Here, the vertical bar ‘|’ represents a ‘period’ (dot) in the printed text, which
is described in a footnote on page 1 as follows:
“The dots beneath or between the letters distinguish or separate
<>the component parts of compound words, and so point out their deriva-

The headwords are ordered according to Sanskrit alphabetical order.
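
The headword coding lends itself to a small parser. In the sketch below, the first whitespace-delimited token of the HK text is treated as the headword proper and any following parenthetical is left in the raw text; that split is an assumption of the example, not something stated in these notes, and the function name is hypothetical.

```python
import re

# A headword line starts with <HI>{#X#}¦, where X is HK-coded.
HEADWORD_RE = re.compile(r"^<HI>\{#(?P<hk>.*?)#\}¦")


def parse_headword(line: str):
    """Return the raw HK text, its '|'-separated parts, and a bar-free key, or None."""
    m = HEADWORD_RE.match(line)
    if m is None:
        return None
    hk = m.group("hk")
    tokens = hk.split()
    stem = tokens[0] if tokens else ""
    return {
        "hk": hk,                       # e.g. akSa|pATika (kaH)
        "parts": stem.split("|"),       # component parts of a compound
        "key": stem.replace("|", ""),   # headword without the separators
    }
```

For the example line <HI>{#akSa|pATika (kaH)#}¦, the parts are akSa and pATika, and the key is akSapATika.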

Sanskrit in the text usually appears in the European Indological form, which is coded in yat.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses a Latin letter ‘x’ (a-z, A-Z), optionally followed by a number; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in yat.txt in this coding, with their approximate frequency:

a10      6 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a4    1943 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
d2      71 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
e4     319 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
i10      9 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i4     293 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE
n2       9 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
o4      63 := ó  (\u00f3)  LATIN SMALL LETTER O WITH ACUTE
r2       3 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
s4       1 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE  (text has grave)
                           originally coded as 's12'
s2       6 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
t2      11 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
u4      25 := ú  (\u00fa)  LATIN SMALL LETTER U WITH ACUTE

Generally, vowels with an acute accent (a4, i4, etc.) are this dictionary’s way of representing long vowels, which in current usage would be written with a macron instead of an acute accent.
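
The AS letter-number codes can be turned into the Unicode characters listed above by attaching a combining diacritic to the letter and normalizing. The following is a minimal sketch: the mapping of ‘dash below’ to COMBINING MACRON BELOW is an assumption, the names are hypothetical, and a letter directly followed by an unrelated number in running text would need separate handling.

```python
import re
import unicodedata

# AS number -> Unicode combining diacritic (general scheme).
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash below (assumed: macron below)
    "7": "\u0308",   # umlaut (diaeresis)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

# Two-digit codes are tried first; the lookahead skips longer digit runs.
AS_RE = re.compile(r"([A-Za-z])(10|11|[1-7])(?!\d)")


def as_to_unicode(text: str) -> str:
    """Replace each AS letter-number code with the precomposed character."""
    def repl(m: re.Match) -> str:
        return unicodedata.normalize("NFC", m.group(1) + AS_DIACRITICS[m.group(2)])
    return AS_RE.sub(repl, text)
```

For example, as_to_unicode("a4") gives á and as_to_unicode("d2") gives ḍ, matching the entries in the list above.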



<?xml version="1.0" encoding="UTF-8"?>
<!-- yat.dtd
 May 28, 2014
-->

<!ELEMENT  yat (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s|lb|F|g|H " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic -->
<!ELEMENT s (#PCDATA |lb)*> <!-- Sanskrit, in HK transliteration  -->
<!ELEMENT lb EMPTY>  <!-- line break -->
<!ELEMENT g  EMPTY>  <!-- Greek text, not coded -->
<!ELEMENT H  EMPTY>  <!-- Headline, usu. letter break -->
<!ELEMENT F (#PCDATA |lb)*> <!-- Footnote -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>

<!-- attributes  -->