STC Stchoupak Dictionnaire Sanscrit-Français (Developer notes)

Date of digitization: 2006

Metadata

The original digitization is the file stc_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file stc_orig_utf8.txt is a conversion of stc_orig.txt to the more common utf-8 encoding. The file stc.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
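
The re-encoding step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the notes do not say which tool actually produced stc_orig_utf8.txt.

```python
from pathlib import Path


def cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode them as UTF-8.

    Python's cp1252 codec raises UnicodeDecodeError on the five byte
    values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    return data.decode("cp1252").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Hypothetical helper: re-encode the file `src` into the file `dst`."""
    Path(dst).write_bytes(cp1252_to_utf8(Path(src).read_bytes()))
```

Under these assumptions, convert_file("stc_orig.txt", "stc_orig_utf8.txt") would perform the conversion. Working on bytes rather than text-mode file handles avoids any platform-dependent newline translation.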

Several non-ASCII characters occur in stc.txt. Each is listed with its Unicode code point, its number of occurrences, and its Unicode name:

¤  (\u00a4)    59 := CURRENCY SIGN
«  (\u00ab)    43 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
°  (\u00b0) 24624 := DEGREE SIGN
»  (\u00bb)    43 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
Ç  (\u00c7)  1260 := LATIN CAPITAL LETTER C WITH CEDILLA
×  (\u00d7)    20 := MULTIPLICATION SIGN
ç  (\u00e7)  7228 := LATIN SMALL LETTER C WITH CEDILLA
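
A frequency list of this kind can be regenerated with a short script. The following is a minimal sketch, not the script used for these notes; the function names are hypothetical and the output layout only imitates the table above.

```python
from collections import Counter
import unicodedata


def non_ascii_counts(text: str) -> Counter:
    """Count every character whose code point lies outside the ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts: Counter) -> list:
    """Return one report line per character, sorted by code point."""
    lines = []
    for ch in sorted(counts):
        # Fall back to a placeholder for code points without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {counts[ch]:5d} := {name}")
    return lines
```

To apply it to the dictionary, one would read stc.txt as UTF-8, pass the text to non_ascii_counts, and print the lines returned by format_report.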

The {X...X} style of coding serves several purposes:

{@X@}  25228 : bold text
{%X%}  44619 : italic text
{^X^}        : super-script
{??}       1 : unreadable text
{T..T}     5 : In preface material
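
As an illustration of how the brace markup might be processed, the sketch below rewrites the bold, italic, and superscript forms as the b, i, and sup elements named in stc.dtd. The pairing of each brace form with an element is an assumption drawn from the element names, and braces_to_tags is a hypothetical helper, not part of the actual conversion programs.

```python
import re

# Each brace form paired with an assumed replacement element.
_MARKUP = [
    (re.compile(r"\{@(.*?)@\}"), r"<b>\1</b>"),       # {@X@}  bold
    (re.compile(r"\{%(.*?)%\}"), r"<i>\1</i>"),       # {%X%}  italic
    (re.compile(r"\{\^(.*?)\^\}"), r"<sup>\1</sup>"), # {^X^}  superscript
]


def braces_to_tags(line: str) -> str:
    """Rewrite {@..@}, {%..%} and {^..^} spans on one line as XML-style tags."""
    for pattern, replacement in _MARKUP:
        line = pattern.sub(replacement, line)
    return line
```

The non-greedy (.*?) keeps two spans on the same line from being merged into one. Spans that cross a line boundary are not handled by this sketch.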

The following <x> type tags are found in stc.txt:

<F>...</F>  8 : Footnote
<g></g>     1 : Greek, uncoded
<H>        48 : Headline (letter breaks)
<P>     46807 : Paragraph
<Title>     1 : on line 2 only

Page breaks are coded as [Page...]. The usual form is [PageP-C], where P is the page number (from 1 to 895) and C is the column number (‘1’ or ‘2’).

Exceptions:

[PageP] where P = 270, 271, 272, 580
[PageR] where R = -II, -III-1, -III-2, IV (in preface material)
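
The page-break forms listed above could be recognized with a small parser such as the one below. It is a sketch only: parse_page_break and its return convention (a tuple of page and column, with None for a missing column) are assumptions made for the example.

```python
import re

_PAGE = re.compile(r"\[Page([^\]]*)\]")
_PAGE_COL = re.compile(r"(\d+)-([12])")


def parse_page_break(marker: str):
    """Classify a [Page...] marker.

    Returns (page, column) for the usual form, (page, None) for the
    column-less exceptions, (label, None) for preface labels such as
    '-III-1', and None if the string is not a page-break marker.
    """
    m = _PAGE.fullmatch(marker)
    if m is None:
        return None
    ref = m.group(1)
    pc = _PAGE_COL.fullmatch(ref)
    if pc:
        return (int(pc.group(1)), int(pc.group(2)))
    if ref.isdigit():
        return (int(ref), None)
    return (ref, None)
```

The sketch does not check that P lies between 1 and 895, nor that column-less markers are limited to the four pages listed as exceptions.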

Headword coding is exemplified by:

<P>{@adhvan-@}

The general form is <P>{@X@}, where X (key1) is coded in AS transliteration.
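
A line can be tested for the headword pattern with a regular expression. The following sketch assumes that a headword entry always begins its line with <P>{@; the helper name headword is hypothetical.

```python
import re

# <P> at the start of the line, immediately followed by a bold span.
_HEADWORD = re.compile(r"^<P>\{@(.*?)@\}")


def headword(line: str):
    """Return the AS-coded headword that opens the line, or None."""
    m = _HEADWORD.match(line)
    return m.group(1) if m else None
```

The returned string is left exactly as coded, including any trailing hyphen, since the notes do not say how key1 is normalized.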

The headwords are ordered according to Sanskrit alphabet ordering.

Lines of the digitization generally represent sections of the text, rather than individual printed lines.

Sanskrit in the text appears in the European Indological form, which is coded in stc.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly followed by a number. In the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = accent grave
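
One way to turn AS codes into Unicode is to map each number to a combining mark and let NFC normalization compose the result. The sketch below follows the general scheme only. The choice of combining code points (for instance U+0331 for "dash below") is an assumption, as_to_unicode is a hypothetical name, and a digit sequence such as 21 is read as 2 followed by 1.

```python
import re
import unicodedata

# AS number -> combining mark (assumed code points).
COMBINING = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash (macron) below
    "7": "\u0308",   # umlaut / diaeresis
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

_CODE = re.compile(r"10|11|[1-7]")
_AS = re.compile(r"([A-Za-z])((?:10|11|[1-7])+)")


def as_to_unicode(text: str) -> str:
    """Replace letter-number AS codes with precomposed Unicode characters."""
    def repl(m):
        marks = "".join(COMBINING[c] for c in _CODE.findall(m.group(2)))
        return unicodedata.normalize("NFC", m.group(1) + marks)
    return _AS.sub(repl, text)
```

This sketch is applied blindly to any letter followed by digits, so it would corrupt page-break markers such as [Page12-2]; such markup would need to be protected first. A table-driven mapping built from the codes actually found in stc.txt would be safer wherever the file departs from the general scheme.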

The following AS codes occur in stc.txt. Each is listed with its approximate frequency, the Unicode character it represents, and that character's Unicode name.

Note: Many of these codes represent diacritics used in French or other European languages.

A1   734 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 29231 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10  1375 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11  6976 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a7     3 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
c7     2 := ĉ  (\u0109)  LATIN SMALL LETTER C WITH CIRCUMFLEX
d2  1489 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    65 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
E10    21 := Ê  (\u00ca)  LATIN CAPITAL LETTER E WITH CIRCUMFLEX
e10  4092 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11  5945 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
E4    15 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4 44747 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7     8 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2   478 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1   256 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1  8003 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10  1423 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i7   172 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
L2    15 := Ḷ  (\u1e36)  LATIN CAPITAL LETTER L WITH DOT BELOW
l2    18 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
M2    44 := Ṃ  (\u1e42)  LATIN CAPITAL LETTER M WITH DOT BELOW
m2  3621 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
N2    28 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  5538 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
N3    32 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3  1174 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N5    89 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5   989 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o10   532 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
o7     2 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2   817 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  4875 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
R21   107 := Ṝ  (\u1e5c)  LATIN CAPITAL LETTER R WITH DOT BELOW AND MACRON
S2   377 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2  8135 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
T2    87 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  2832 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1   143 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  2945 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10   477 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11   285 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u7     1 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS

DTD

stc.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- stc.dtd
 May 17, 2013
 June 22, 2013
-->
<!ELEMENT  stc (H1)*>
<!ELEMENT H1 (h,body,tail) >

<!ENTITY % misc_empty "F | g | b | i | sup |H | P |br" >
<!ENTITY % body_elts "  %misc_empty; " >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA)>

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in stc.txt ?-->
<!ELEMENT g (#PCDATA)> <!-- Greek (only once) -->
<!ELEMENT F (#PCDATA | b | br)*> <!-- Footnote -->
<!ELEMENT P EMPTY > <!-- Paragraph -->
<!ELEMENT H EMPTY > <!-- head-line -->
<!ELEMENT b (#PCDATA) > <!-- bold -->
<!ELEMENT i (#PCDATA | br)* > <!-- italic -->
<!ELEMENT sup (#PCDATA) > <!-- superscript -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
<!-- attributes  -->
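
To illustrate the record shape that stc.dtd describes, the sketch below parses an invented record and checks only its top-level structure. The sample content, including the key1, key2, L and pc values, is made up for the example and is not taken from the actual XML file. Python's xml.etree.ElementTree does not validate against a DTD, so full validation would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; every value here is invented for illustration.
SAMPLE = (
    "<H1><h><key1>adhvan</key1><key2>adhvan-</key2></h>"
    "<body><b>adhvan-</b> <i>m.</i> chemin</body>"
    "<tail><L>1</L><pc>12-2</pc></tail></H1>"
)


def check_record(xml_text: str) -> bool:
    """Check that a record is H1 = (h, body, tail) and h = (key1, key2)."""
    record = ET.fromstring(xml_text)
    if record.tag != "H1":
        return False
    if [child.tag for child in record] != ["h", "body", "tail"]:
        return False
    return [child.tag for child in record.find("h")] == ["key1", "key2"]
```

With the sample above, check_record(SAMPLE) succeeds, and ET.fromstring(SAMPLE).findtext("h/key1") returns the key1 value. The mixed content of body and tail is not inspected.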