SCH Schmidt Nachträge zum Sanskrit-Wörterbuch (Developer notes)

Date of digitization: 2008

Metadata

The original digitization is file schmidt_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file schmidt_orig_utf8.txt is a conversion of schmidt_orig.txt to the more common utf-8 encoding. The file sch.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
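A conversion of this kind can be sketched in a few lines of Python. This is only an illustration of re-encoding cp1252 as utf-8, not the procedure actually used to produce schmidt_orig_utf8.txt; the function name is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: str, dst: str) -> None:
    """Re-encode a cp1252 text file as utf-8 (illustrative helper)."""
    # Work on bytes so that the original line endings are preserved.
    raw = Path(src).read_bytes()
    text = raw.decode("cp1252")
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # File names taken from the notes above; adjust the paths as needed.
    convert_cp1252_to_utf8("schmidt_orig.txt", "schmidt_orig_utf8.txt")
```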

Several non-ASCII characters occur in sch.txt. They are listed here with their Unicode code point, number of occurrences, and Unicode name:

¦  (\u00a6) 28759 := BROKEN BAR
§  (\u00a7)     3 := SECTION SIGN
ª  (\u00aa)     4 := FEMININE ORDINAL INDICATOR
°  (\u00b0)  2254 := DEGREE SIGN
³  (\u00b3)     1 := SUPERSCRIPT THREE
µ  (\u00b5) 29125 := MICRO SIGN
º  (\u00ba) 12875 := MASCULINE ORDINAL INDICATOR
½  (\u00bd)     1 := VULGAR FRACTION ONE HALF
Ä  (\u00c4)    37 := LATIN CAPITAL LETTER A WITH DIAERESIS
Ç  (\u00c7)     1 := LATIN CAPITAL LETTER C WITH CEDILLA
Ö  (\u00d6)    40 := LATIN CAPITAL LETTER O WITH DIAERESIS
Ü  (\u00dc)   127 := LATIN CAPITAL LETTER U WITH DIAERESIS
ß  (\u00df)  1283 := LATIN SMALL LETTER SHARP S
à  (\u00e0)    15 := LATIN SMALL LETTER A WITH GRAVE
ä  (\u00e4)  2536 := LATIN SMALL LETTER A WITH DIAERESIS
è  (\u00e8)     1 := LATIN SMALL LETTER E WITH GRAVE
ö  (\u00f6)  1408 := LATIN SMALL LETTER O WITH DIAERESIS
ù  (\u00f9)     1 := LATIN SMALL LETTER U WITH GRAVE
ü  (\u00fc)  3480 := LATIN SMALL LETTER U WITH DIAERESIS
…  (\u2026)  4242 := HORIZONTAL ELLIPSIS
€  (\u20ac) 28764 := EURO SIGN
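A frequency table like the one above could be regenerated with a short script. The following is a minimal sketch that assumes sch.txt is read as utf-8; the function names are hypothetical and the output layout only approximates the listing above.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_counts(counts: Counter) -> list:
    """Format one line per character, ordered by code point."""
    lines = []
    for ch in sorted(counts):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {counts[ch]:5d} := {name}")
    return lines


if __name__ == "__main__":
    with open("sch.txt", encoding="utf-8") as f:
        for line in format_counts(non_ascii_counts(f.read())):
            print(line)
```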

The {X...X} style of coding serves several purposes:

{# #}  57882  : {#X#} X is coded in HK; appears in headword coding (see below)
{% %}  11793  : italic text
{kh}  2  : twice in hw='nirjanAvasara' (underlined)
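The {# #} and {% %} spans can be pulled out of a line with regular expressions. This is a sketch under the assumption that spans do not nest and do not cross line boundaries; the names used are hypothetical.

```python
import re

# Non-greedy matches, assuming the markup never nests.
HASH_SPAN = re.compile(r"\{#(.*?)#\}")
ITALIC_SPAN = re.compile(r"\{%(.*?)%\}")


def markup_spans(line: str) -> dict:
    """Return the contents of the {#...#} and {%...%} spans in a line."""
    return {
        "hash": HASH_SPAN.findall(line),
        "italic": ITALIC_SPAN.findall(line),
    }
```

For a made-up line such as `.{#a#}100{#a°#}^2¦ {%Adj.%}` this returns the two headword spans under "hash" and one italic span.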

There is no pseudo-xml type coding in sch.txt.

Page breaks are coded as [Pageppp.c], where ppp is the zero-filled page number from the scan and c = 1-3 is the column number. In a few cases, an 'a' is suffixed to the page number.
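Page-break markers of this shape can be located with a regular expression. The sketch below assumes that the page number is written as digits and that the optional 'a' directly follows them, before the period; both are readings of the description rather than checked facts, and the names are hypothetical.

```python
import re

# [Pageppp.c] with an optional 'a' after the page number (assumed placement).
PAGE_BREAK = re.compile(r"\[Page(?P<page>\d+)(?P<suffix>a?)\.(?P<col>[1-3])\]")


def page_breaks(text: str) -> list:
    """Return (page, suffix, column) for each page-break marker found."""
    return [
        (int(m["page"]), m["suffix"], int(m["col"]))
        for m in PAGE_BREAK.finditer(text)
    ]
```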

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded.

Headword coding is exemplified by: .{#a#}100{#a°#}^2¦
The general form is .{#X#}100{#Y#}^h¦ where:
X (key1) is coded in Harvard-Kyoto transliteration, and is normalized to remove such things as accents.
Y (key2) is coded in Anglicized-Sanskrit (AS) transliteration, which has provision for accents.
h is the homonym number.
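The general headword form can be parsed with a single regular expression. This is a minimal sketch: it treats "100" as a literal part of the pattern and the ^h homonym part as optional (the DTD declares hom as optional); the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Headword(NamedTuple):
    key1: str           # X, Harvard-Kyoto
    key2: str           # Y, AS transliteration
    hom: Optional[str]  # homonym number, if present
    rest: str           # text after the broken bar


HEADWORD = re.compile(
    r"^\.\{#(?P<key1>.*?)#\}100\{#(?P<key2>.*?)#\}"
    r"(?:\^(?P<hom>\d+))?¦(?P<rest>.*)$",
    re.DOTALL,
)


def parse_headword(line: str) -> Optional[Headword]:
    """Parse a line that starts with headword coding, else return None."""
    m = HEADWORD.match(line)
    if m is None:
        return None
    return Headword(m["key1"], m["key2"], m["hom"], m["rest"])
```

Applied to the example above, `parse_headword(".{#a#}100{#a°#}^2¦")` gives key1 `a`, key2 `a°` and hom `2`.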

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text appears in the European Indological form, which is coded in sch.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters 'x' (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations are, in the general scheme:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in sch.txt in this coding, with their approximate frequency:

Note on Accents:  When combined with a long vowel, accents are represented
in Unicode by code points:
\u0300  COMBINING GRAVE ACCENT
\u0301  COMBINING ACUTE ACCENT
The visual display of these combining characters varies in quality, and
is often either invisible or awkwardly placed.

A1  1131 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 31679 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a13   33 := ā̀  (\u0101\u0300) LATIN SMALL LETTER A WITH MACRON and anudatta accent
A14    2 := Ā́ (\u0100\u0301) LATIN CAPITAL LETTER A WITH MACRON and udatta accent
a14  144 := ā́ (\u0101\u0301) LATIN SMALL LETTER A WITH MACRON and udatta accent
a18    1 := ā (\u0101) LATIN SMALL LETTER A WITH MACRON and 'crescent' (undone)
a3     6 := à (\u00e0) LATIN SMALL LETTER A with anudatta accent
a4   934 := á (\u00e1) LATIN SMALL LETTER A with udatta accent
a7     5 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
d2  1983 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2     7 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
d3     1 := ḋ  (\u1e0b)  LATIN SMALL LETTER D WITH DOT ABOVE
e4    45 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
h2   966 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1    15 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1  8094 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10   12 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i13    3 := ī̀ (\u012b\u0300) LATIN SMALL LETTER I WITH MACRON and anudatta accent
i14   51 := ī́ (\u012b\u0301) LATIN SMALL LETTER I WITH MACRON and udatta accent
I4     1 := Í (\u0049\u0301) LATIN CAPITAL LETTER I and udatta accent
i4   125 := í (\u00ed) LATIN SMALL LETTER I and udatta accent
k2     1 := ḳ  (\u1e33)  LATIN SMALL LETTER K WITH DOT BELOW (significance?)
l2    10 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2  2786 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3     1 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
N2     1 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  5885 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3  1902 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n5  1059 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o4    27 := ó  (\u00f3) LATIN SMALL LETTER O and udatta accent
o7     1 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
r12    1 := ṝ (\u1e5d) LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
R2   182 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  3911 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r24   13 := ṛ́ (\u1e5b\u0301) LATIN SMALL LETTER R WITH DOT BELOW and udatta accent
r3     4 := ṙ  (\u1e59)  LATIN SMALL LETTER R WITH DOT ABOVE
S2    12 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2  7179 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S3     1 := Ṡ  (\u1e60)  LATIN CAPITAL LETTER S WITH DOT ABOVE
s3     2 := ṡ  (\u1e61)  LATIN SMALL LETTER S WITH DOT ABOVE
S4  3769 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4  7217 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t10    1 := ṭ  (\u1e6d) LATIN SMALL LETTER T WITH 2 dots below.
T2     6 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  5040 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1     3 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  2976 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u14   21 := ū́ (\u016b\u0301) LATIN SMALL LETTER U WITH MACRON and udatta accent
u3     8 := ù (\u00f9) LATIN SMALL LETTER U with anudatta accent
U4     1 := Ú (\u00da) LATIN CAPITAL LETTER U and udatta accent
u4    90 := ú (\u00fa) LATIN SMALL LETTER U with udatta accent
E409     1 := E409 Literary citation
E467     1 := E467 Literary citation
H43     1 := H43 Literary citation
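The table above can be turned into a lookup for converting AS-coded text to Unicode. The following is a sketch covering only a handful of entries from the table; it assumes that every letter followed by digits is an AS code, which would misread a letter followed by a literal number, so it is not suitable for running over whole entries unchanged. The names are hypothetical.

```python
import re

# A small subset of the table above; extend as needed.
AS_TO_UNICODE = {
    "a1": "\u0101", "A1": "\u0100", "a4": "\u00e1", "a14": "\u0101\u0301",
    "i1": "\u012b", "u1": "\u016b", "r2": "\u1e5b", "h2": "\u1e25",
    "m2": "\u1e43", "n2": "\u1e47", "n3": "\u1e45", "n5": "\u00f1",
    "t2": "\u1e6d", "d2": "\u1e0d", "s2": "\u1e63", "s4": "\u015b",
    "S4": "\u015a",
}

# One letter plus all following digits, so that "a14" is not read as "a1" + "4".
AS_CODE = re.compile(r"[A-Za-z]\d+")


def as_to_unicode(text: str) -> str:
    """Replace known AS codes with Unicode; leave unknown codes untouched."""
    return AS_CODE.sub(lambda m: AS_TO_UNICODE.get(m.group(0), m.group(0)), text)
```

For example, `as_to_unicode("kr2s2n2a")` yields `kṛṣṇa`. Accented long vowels such as a14 come out as a base character plus a combining accent, as described in the note on accents.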

DTD

sch.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- sch.dtd
 Mar 18, 2014

-->
<!ELEMENT  sch (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | wide | sic)*> <!-- italic, Sanskrit, in AS transliteration -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in AS transliteration  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
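Records that follow the element structure of this DTD can be read with Python's standard library. The sample record below is invented for illustration (in particular, the contents of L and pc are placeholders, since their format is not described here), and the function name is hypothetical. Note that xml.etree.ElementTree does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical record shaped after the DTD: H1 = (h, body, tail).
SAMPLE = (
    "<sch>"
    "<H1><h><key1>a</key1><key2>a°</key2><hom>2</hom></h>"
    "<body><i>Adj.</i> sample text</body>"
    "<tail><L>1</L><pc>placeholder</pc></tail></H1>"
    "</sch>"
)


def read_entries(xml_text: str) -> list:
    """Return one dict per H1 record with its header, body text, and tail."""
    root = ET.fromstring(xml_text)
    entries = []
    for h1 in root.findall("H1"):
        entries.append({
            "key1": h1.findtext("h/key1"),
            "key2": h1.findtext("h/key2"),
            "hom": h1.findtext("h/hom"),  # None when the record has no hom
            "body": "".join(h1.find("body").itertext()),
            "L": h1.findtext("tail/L"),
            "pc": h1.findtext("tail/pc"),
        })
    return entries
```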