CAE Cappeller Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2008

Metadata

The original digitization is file cae_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file cae_orig_utf8.txt is a conversion of cae_orig.txt to the more common utf-8 encoding. The file cae.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
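As a rough illustration of the conversion just described, the following Python sketch re-encodes a cp1252 file as utf-8. The function name cp1252_to_utf8 is hypothetical, and this is not necessarily how cae_orig_utf8.txt was actually produced.

```python
def cp1252_to_utf8(src, dst):
    """Re-encode the file at src (cp1252) as utf-8 and write it to dst.

    Returns the number of characters converted.  Python's cp1252 codec
    raises UnicodeDecodeError on the few byte values that cp1252 leaves
    undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d).
    """
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="cp1252", newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
    return len(text)
```

A call such as cp1252_to_utf8("cae_orig.txt", "cae_orig_utf8.txt") would perform the conversion, assuming both paths are relative to the working directory.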

Several non-ASCII characters occur in cae.txt. Each is listed here with its Unicode code point, its count, and its Unicode name:

­  (\u00ad)  2681 := SOFT HYPHEN  indicates first of two alternate headwords
                     Both headwords are coded, with same 'definition'
°  (\u00b0)  8393 := DEGREE SIGN
±  (\u00b1)   318 := PLUS-MINUS SIGN  (symbol used in scan)
¶  (\u00b6)     2 := PILCROW SIGN
·  (\u00b7) 40081 := MIDDLE DOT (precedes a gender, etc.)
î  (\u00ee)     1 := LATIN SMALL LETTER I WITH CIRCUMFLEX
‚  (\u201a)     1 := SINGLE LOW-9 QUOTATION MARK
†  (\u2020)   587 := DAGGER  (in text)
€  (\u20ac)     2 := EURO SIGN
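A tally like the one above can be produced with a few lines of Python. This is a minimal sketch, not the program that generated the table; the function names count_non_ascii and format_report are hypothetical.

```python
import unicodedata
from collections import Counter


def count_non_ascii(text):
    """Return a Counter of the characters in text with code point above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts):
    """Format the counts as lines resembling the table, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        # Fall back to "UNKNOWN" for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append("%s  (\\u%04x) %5d := %s" % (ch, ord(ch), n, name))
    return lines
```

Applied to the full text of cae.txt (read with encoding="utf-8"), format_report(count_non_ascii(text)) would give one line per distinct non-ASCII character.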

The {X...X} style of coding occurs in one form:

{#...#} : devanagari text, coded in HK.

There is no pseudo-xml type of coding.

Page breaks are coded as [PageX], where X ranges from 1 to 670.
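The page-break markers can be picked out with a simple regular expression. The sketch below assumes the page number is written as plain digits directly after "Page"; whether the numbers are zero-padded is not stated here, so the pattern accepts either form. The names PAGE_RE and page_numbers are hypothetical.

```python
import re

# "[Page" followed by one or more digits and a closing bracket.
PAGE_RE = re.compile(r"\[Page(\d+)\]")


def page_numbers(text):
    """Return the page numbers of all [PageX] markers in text, in order."""
    return [int(m.group(1)) for m in PAGE_RE.finditer(text)]
```

Comparing the result against list(range(1, 671)) would be one way to check that no page marker is missing or duplicated.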

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded.

Headword coding is exemplified by:
.{#a#}1{#a°#} or
.{#a#}1{#a°#}1
Here is the regular expression used in Python programs to recognize headwords;
it is in headword.py:
r'^[.]{(.*?)}.{(.*?)}([0-9]?)'

The first group is key1, the second is key2. The third may be empty or
a homonym number, a single digit n. Both key1 and key2 are coded in HK.
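To show how the three groups come out, here is a small sketch that applies the regular expression quoted above to the two example headword lines. The wrapper function parse_headword is hypothetical and is not taken from headword.py.

```python
import re

# The pattern quoted above; Python treats the braces as literal characters
# here because they do not form a valid {m,n} repetition.
HEADWORD_RE = re.compile(r'^[.]{(.*?)}.{(.*?)}([0-9]?)')


def parse_headword(line):
    """Return (key1, key2, hom) for a headword line, or None if no match."""
    m = HEADWORD_RE.match(line)
    if m is None:
        return None
    return m.groups()
```

Note that the captured keys still include the surrounding # markers (for example "#a#"), and that the homonym group is the empty string when no digit follows the second brace pair.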

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text appears in the European Indological form, which is coded in cae.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses a Latin letter 'x' (a-z, A-Z), possibly followed by a number; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in cae.txt in this coding, with their approximate frequency:

A1     9 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1   528 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
d2    40 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
h2     1 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1     1 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1   138 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
m2     2 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
n1    29 := n* (\u006e\u0304) LATIN SMALL LETTER N, COMBINING MACRON
n2   393 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3     3 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
R2     4 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2   159 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
s2   434 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4   330 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4    85 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t2    47 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1     1 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1    63 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
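The AS codes tabulated above can be turned into Unicode by replacing each number with the corresponding combining mark and normalizing to NFC. The following is a minimal sketch under stated assumptions, not the converter used by the project: the names AS_DIACRITICS and as_to_unicode are hypothetical, and the choice of COMBINING MACRON BELOW for "dash below" is a guess.

```python
import re
import unicodedata

# AS number -> Unicode combining character.
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash below (assumed: COMBINING MACRON BELOW)
    "7": "\u0308",   # umlaut
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

# Try the two-digit codes first so that "a10" is not read as "a1" + "0".
AS_RE = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text):
    """Convert AS letter-number codes in text to Unicode letters (NFC)."""
    def repl(m):
        return m.group(1) + AS_DIACRITICS[m.group(2)]
    return unicodedata.normalize("NFC", AS_RE.sub(repl, text))
```

Since the pattern matches any letter followed by one of these numbers, it should only be applied to text known to be AS-coded. A digit that follows a letter for some other reason would be converted by mistake. Where Unicode has no precomposed character, as with n1, the result stays a letter followed by a combining mark, in line with the table.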

Within {} (HK-coded devanagari), x3 indicates an udatta accent on 'x'.
The combinations that occur, with their counts:
A3  1915
a3  8567
I3   540
i3  1547
U3   297
u3  1068
R3   496
e3   460
o3   364
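Counts of this kind can be gathered by looking for letter-plus-3 codes only inside brace-delimited spans. The sketch below is an assumption about how such a tally might be made, with hypothetical names BRACE_RE, ACCENT_RE and count_udatta; it is not the program that produced the figures above.

```python
import re
from collections import Counter

BRACE_RE = re.compile(r"\{(.*?)\}")   # contents of each {...} span
ACCENT_RE = re.compile(r"[A-Za-z]3")  # a letter followed by the accent code 3


def count_udatta(text):
    """Count letter+3 codes that occur inside {...} spans of text."""
    counts = Counter()
    for span in BRACE_RE.findall(text):
        counts.update(ACCENT_RE.findall(span))
    return counts
```

Restricting the search to the brace spans matters because the same letter-number pairs outside braces belong to the AS coding, where x3 means a dot above.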

Svarita accent does not appear in the digitization of devanagari. It is
likely that svarita accents do occur in the text, but that they have been
coded as 'R'; an example is seen under aBva (slp1).

DTD

cae.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- cae.dtd
 June 10, 2014

-->
<!ELEMENT  cae (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "s " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in AS transliteration  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
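To make the element structure concrete, the sketch below builds a single H1 record in the shape the DTD describes, using Python's standard xml.etree.ElementTree module. The function make_record and all sample values are hypothetical, and ElementTree does not validate against a DTD, so this only illustrates the nesting.

```python
import xml.etree.ElementTree as ET


def make_record(key1, key2, body_text, L, pc, hom=None):
    """Build an H1 element with the children h, body and tail."""
    h1 = ET.Element("H1")
    h = ET.SubElement(h1, "h")
    ET.SubElement(h, "key1").text = key1
    ET.SubElement(h, "key2").text = key2
    if hom is not None:
        # hom is optional in the DTD (hom?).
        ET.SubElement(h, "hom").text = hom
    ET.SubElement(h1, "body").text = body_text
    tail = ET.SubElement(h1, "tail")
    ET.SubElement(tail, "L").text = L
    ET.SubElement(tail, "pc").text = pc
    return h1
```

For example, ET.tostring(make_record("agni", "agni", "fire", "1", "1"), encoding="unicode") yields a single-line string with h, body and tail nested inside H1.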