CCS Cappeller Sanskrit Wörterbuch (Developer notes)

Date of digitization: 2008

Metadata

The original digitization is the file ccs_orig.txt, which is coded in the cp1252 (Windows-1252) encoding and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this coding system.

The file ccs_orig_utf8.txt is a conversion of ccs_orig.txt to the more common utf-8 encoding. The file ccs.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
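The conversion from cp1252 to utf-8 described above can be sketched in Python as follows. This is only an illustration, not necessarily the procedure that produced ccs_orig_utf8.txt; the function name convert_cp1252_to_utf8 is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src, dst):
    """Re-encode a cp1252 file as utf-8 and return the decoded text.

    Hypothetical helper. Bytes are read and written directly so that
    line endings are preserved exactly as they are in the source file.
    """
    text = Path(src).read_bytes().decode("cp1252")
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

A call such as convert_cp1252_to_utf8("ccs_orig.txt", "ccs_orig_utf8.txt") would perform the conversion. Python's strict cp1252 decoder raises an error on the few byte values that the code page leaves undefined, which may be a useful check on the input.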

Several extended ASCII characters occur in ccs.txt. Each is listed here with its code point, its frequency, and its Unicode name:

¦  (\u00a6) 30003 := BROKEN BAR
§  (\u00a7)     6 := SECTION SIGN
ª  (\u00aa) 11822 := FEMININE ORDINAL INDICATOR
    ª = udatta,  ªª = svarita
°  (\u00b0)  6673 := DEGREE SIGN
µ  (\u00b5)   225 := MICRO SIGN
·  (\u00b7)     3 := MIDDLE DOT
º  (\u00ba)     1 := MASCULINE ORDINAL INDICATOR
Ä  (\u00c4)    40 := LATIN CAPITAL LETTER A WITH DIAERESIS
Ç  (\u00c7)    90 := LATIN CAPITAL LETTER C WITH CEDILLA
Ö  (\u00d6)    35 := LATIN CAPITAL LETTER O WITH DIAERESIS
Ü  (\u00dc)   195 := LATIN CAPITAL LETTER U WITH DIAERESIS
ß  (\u00df)  2057 := LATIN SMALL LETTER SHARP S
ä  (\u00e4)  5135 := LATIN SMALL LETTER A WITH DIAERESIS
ç  (\u00e7)    22 := LATIN SMALL LETTER C WITH CEDILLA
ö  (\u00f6)  3213 := LATIN SMALL LETTER O WITH DIAERESIS
ü  (\u00fc)  6713 := LATIN SMALL LETTER U WITH DIAERESIS
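A table of this kind can be regenerated with a short Python sketch like the one below. The function names are hypothetical, and the output format merely imitates the listing above.

```python
import unicodedata
from collections import Counter


def count_non_ascii(text):
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts):
    """Return one report line per character, sorted by code point."""
    lines = []
    for ch in sorted(counts):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {counts[ch]:5d} := {name}")
    return lines
```

Applied to the full text of ccs.txt (read as utf-8), format_report(count_non_ascii(text)) would give a listing that can be compared against the one above.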

The {X...X} style of coding serves several purposes:

{#X#}  47585 : Devanagari text, coded in HK (Harvard-Kyoto) transliteration
{%X%}  40433 : italic text
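As an illustration, the two kinds of {X...X} spans can be pulled out of a line with regular expressions. This is a minimal sketch: the function name markup_spans is hypothetical, and it assumes that spans do not nest and do not run across lines.

```python
import re

# {#X#} : Sanskrit text in HK transliteration
SANSKRIT_RE = re.compile(r"\{#(.*?)#\}")
# {%X%} : italic text
ITALIC_RE = re.compile(r"\{%(.*?)%\}")


def markup_spans(line):
    """Return the contents of the {#...#} and {%...%} spans in a line."""
    return {
        "sanskrit": SANSKRIT_RE.findall(line),
        "italic": ITALIC_RE.findall(line),
    }
```

For a made-up line such as '.{#aMza#}¦ {%m.%} Theil', this yields 'aMza' as Sanskrit text and 'm.' as italic text.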

The following <x> type tags are found in ccs.txt:

<H>    43  := Headline, at letter breaks
<UL>  567  := The string '|<UL>' is used for column breaks.
           Note: These are currently unused. A small improvement
           would convert them to forms like [PageXXX-2].

Page breaks are coded as [PageX-1], where X = 001 to 541. See the note regarding <UL>.
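The small improvement mentioned in the note on <UL> might be sketched as follows. This is an assumption-laden illustration and not existing project code: the function name mark_column_breaks is hypothetical, and it assumes that each '|<UL>' starts the next column of the page named by the most recent [PageX-1] marker.

```python
import re

# Matches either a page marker such as [Page001-1] or a column break '|<UL>'.
TOKEN_RE = re.compile(r"\[Page(\d{3})-(\d)\]|\|<UL>")


def mark_column_breaks(text):
    """Replace each '|<UL>' with a [PageXXX-n] marker for the next column."""
    page = None
    column = 1

    def repl(match):
        nonlocal page, column
        if match.group(1) is not None:
            # A page marker: remember the page and column, keep it unchanged.
            page, column = match.group(1), int(match.group(2))
            return match.group(0)
        if page is None:
            # Column break before any page marker: leave it untouched.
            return match.group(0)
        column += 1
        return f"[Page{page}-{column}]"

    return TOKEN_RE.sub(repl, text)
```

Under these assumptions, the first '|<UL>' after [Page001-1] becomes [Page001-2].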

The lines of the digitization generally represent individual words of the text; the line breaks of the printed text are often coded with a vertical bar ‘|’ character. The lines of the digitization therefore do not correspond to lines of the printed text.

Headword forms are exemplified by
.{#a#}^1¦ and
.{#aMzakalpanA#}¦
The general form is
.{#X#}¦ or .{#X#}^H¦
where
X is the headword coded in HK transliteration (with ª or ªª for accents)
H is the homonym number (1, 2, etc.)
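The headword pattern above can be recognized with a regular expression, as in the following sketch. The function name parse_headword is hypothetical, and the 'unaccented' field (the headword with every ª removed) is an added convenience, not something the digitization itself defines.

```python
import re

BROKEN_BAR = "\u00a6"   # the ¦ character that ends the headword form
ACCENT_MARK = "\u00aa"  # the ª character used for accents

# .{#X#}¦  or  .{#X#}^H¦  at the start of a line
HEADWORD_RE = re.compile(r"^\.\{#(.+?)#\}(?:\^(\d+))?" + BROKEN_BAR)


def parse_headword(line):
    """Return headword data for a headword line, or None for other lines."""
    match = HEADWORD_RE.match(line)
    if match is None:
        return None
    headword = match.group(1)
    homonym = int(match.group(2)) if match.group(2) else None
    return {
        "headword": headword,
        "homonym": homonym,
        "unaccented": headword.replace(ACCENT_MARK, ""),
    }
```

For the two examples given above, this returns headword 'a' with homonym 1, and headword 'aMzakalpanA' with no homonym number.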

The headwords are ordered according to Sanskrit alphabet ordering.

Some Sanskrit in the text appears in the European Indological form, which is coded in ccs.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses a Latin letter x (a-z, A-Z), possibly followed by a number; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in ccs.txt in this coding, with their approximate frequencies:

a1     1 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10   349 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
A10     8 := Â  (\u00c2)  LATIN CAPITAL LETTER A WITH CIRCUMFLEX
C2    89 := Ç  (\u00c7)  LATIN CAPITAL LETTER C WITH CEDILLA
c2    47 := ç  (\u00e7)  LATIN SMALL LETTER C WITH CEDILLA
d2    26 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
i10    82 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
m2     8 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
n1    11 := n̄   (\u006e\u0304)  LATIN SMALL LETTER N WITH COMBINING MACRON
n2   231 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n5    27 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
r2    86 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
R2     5 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
t2    28 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
u10    38 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
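A conversion from these AS codes to Unicode could be sketched as below. The mapping contains only the combinations listed above; the function name as_to_unicode is hypothetical. Because letter-digit sequences can occur elsewhere in the file for other reasons, the sketch assumes it is applied only to text already known to be in AS coding.

```python
import re

# AS code -> Unicode, restricted to the combinations listed for ccs.txt.
AS_TO_UNICODE = {
    "a1": "\u0101", "a10": "\u00e2", "A10": "\u00c2",
    "C2": "\u00c7", "c2": "\u00e7", "d2": "\u1e0d",
    "i10": "\u00ee", "m2": "\u1e43", "n1": "n\u0304",
    "n2": "\u1e47", "n5": "\u00f1", "r2": "\u1e5b",
    "R2": "\u1e5a", "t2": "\u1e6d", "u10": "\u00fb",
}

# Longer codes come first so that 'a10' is tried before 'a1'; the
# lookahead keeps a code from matching inside a longer run of digits.
_codes = sorted(AS_TO_UNICODE, key=len, reverse=True)
AS_RE = re.compile("(?:" + "|".join(map(re.escape, _codes)) + r")(?!\d)")


def as_to_unicode(text):
    """Replace each known AS letter-number code with its Unicode character."""
    return AS_RE.sub(lambda m: AS_TO_UNICODE[m.group(0)], text)
```

For instance, the made-up input 'Ra10ma' becomes 'Râma'.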

DTD

ccs.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ccs.dtd
 June 30, 2014

-->
<!ELEMENT  ccs (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA)*> <!-- italic -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in HK transliteration  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->