Table Of Contents

Previous topic

SCH Schmidt Nachträge zum Sanskrit-Wörterbuch (Developer notes)

Next topic

SKD Sabda-kalpadruma (Developer notes)

This Page

SHS Shabda-Sagara Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is file shs_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file shs_orig_utf8.txt is a conversion of shs_orig.txt to the more common utf-8 encoding. The file shs.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.

There are several extended ascii codes occurring in shs.txt:

¦  (\u00a6) 47317 := BROKEN BAR
°  (\u00b0)     1 := DEGREE SIGN
º  (\u00ba)     3 := MASCULINE ORDINAL INDICATOR
Æ  (\u00c6)    47 := LATIN CAPITAL LETTER AE
æ  (\u00e6)   229 := LATIN SMALL LETTER AE
Π (\u0152)     3 := LATIN CAPITAL LIGATURE OE
œ  (\u0153)   214 := LATIN SMALL LIGATURE OE
‘  (\u2018)    11 := LEFT SINGLE QUOTATION MARK
’  (\u2019)    12 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)   730 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)   726 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)     1 := DOUBLE LOW-9 QUOTATION MARK
…  (\u2026)   179 := HORIZONTAL ELLIPSIS

The {X...X} style of coding serves several purposes:

{#X#}   208025 : {#X#} devanagari text, coded with HK
                 Note: Vertical bar '|' is used for danda in HK.
{%X%}       10 : italic text
{??}        10 : unreadable text

The <> style of coding is used as follows:

<>     53027  :  beginning of normal line
<HI>   47313  :  beginning of Headword line
<P>       69  :  beginning of paragraph (only in preface)
<H>       49  :  Headline. Primarily at 'letter' breaks in lexicon
<Poem>X</Poem> 9 : Delimit a Poem.
Page breaks are coded as [Page...].
In general, Page breaks have the form [PageX], where X usually has the
form PPP-C+ nn, where
PPP is 3-digit page number, 001 to 839
C is column identifier, a or b
nn is the number of following text lines in column C of page PPP.
In the title page,
X = ‘-titel+ 21’.
X = 001-a+ 43, 840-b and 840-1b contain list of abbreviations

The lines of the digitization represent lines of the text.

Headword coding is exemplified by: <HI>{#a#}¦
Here is the regular expression used in python programs to recognize headwords:
reHeadword = u’^<HI>{#(.*?)#}¦’
The headword appears in Devanagari in the text, and is encoded with the
Kyoto-Harvard transliteration.

The headwords are ordered according to Sanskrit alphabet ordering.

Some Sanskrit in the text appears in the European Indological form, which is coded in shs.txt with the the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x (a-z,A-Z), possibly with suffixed numbers; the letter-number combinations are, in the general scheme:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circonflex (hat)
x11 = accent grave

Here are the characters that occur in shs.txt in this coding, with their approximate frequency:

Note:
The Indological form used in this text is different in most respects from the
current conventions.
Most notable is the use of the acute accent (coded as X4 for various X).
When X is a vowel, X4 represents the long vowel, which would be X1 (X with
macron) in current usage.
When X is a consonant, we have not established the relationship of X4 to current
conventions.  In Unicode representation of X4 in the displays, a single
unicode code point is used if possible; if not possible, the Unicode
COMBINING ACUTE ACCENT (\u0301) is appended to X; the quality of display of
the combining accent varies, but is generally poor with current browsers.

A4  1673 := Á  (\u00c1) LATIN CAPITAL LETTER A WITH ACUTE
a4  3511 := á (\u00e1) LATIN SMALL LETTER A WITH ACUTE
E4   321 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4   500 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
I4   385 := Í  (\u00cd) LATIN CAPITAL LETTER I WITH ACUTE
i4   273 := í (\u00ed) LATIN SMALL LETTER I WITH ACUTE
o4   163 := ó  (\u00f3) LATIN SMALL LETTER O WITH ACUTE
U4    62 := Ú  (\u00da) LATIN CAPITAL LETTER I WITH ACUTE
u4   153 := ú (\u00fa) LATIN SMALL LETTER U and udatta accent

Two other diacritical combinations are:
a10     1 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11     1 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE

Here are the AS codes for accented consonants that appear in the text;
When the COMBINING ACUTE ACCENT (\u0301) is
used, we show in this file the display as the letter followed by the single quote;
however, the combining accent is used in the web displays.

B4     1 := B' (\u0042\u0301) LATIN CAPITAL LETTER B WITH ACUTE
c4     1 := ć  (\u0107)  LATIN SMALL LETTER C WITH ACUTE
d4    54 := d' (\u0064\u0301) LATIN SMALL LETTER D WITH ACUTE
D4    54 := D' (\u0044\u0301) LATIN CAPITAL LETTER D WITH ACUTE
G4     3 := Ǵ  (\u01f4)  LATIN CAPITAL LETTER G WITH ACUTE
g4     2 := ǵ  (\u01f5)  LATIN SMALL LETTER G WITH ACUTE
h4     3 := h' (\u0068\u0301) LATIN SMALL LETTER H WITH ACUTE
H4    11 := H' (\u0048\u0301) LATIN CAPITAL LETTER H WITH ACUTE
J4     1 := J' (\u004a\u0301) LATIN CAPITAL LETTER J WITH ACUTE
K4     7 := Ḱ  (\u1e30)  LATIN CAPITAL LETTER K WITH ACUTE
k4     5 := ḱ  (\u1e31)  LATIN SMALL LETTER K WITH ACUTE
L4     1 := Ĺ  (\u0139)  LATIN CAPITAL LETTER L WITH ACUTE
M4     5 := Ḿ  (\u1e3e) LATIN CAPITAL LETTER M WITH ACUTE
m4     3 := ḿ  (\u1e3f) LATIN SMALL LETTER M WITH ACUTE
N4   695 := Ń  (\u0143)  LATIN CAPITAL LETTER N WITH ACUTE
n4   263 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
P4     3 := Ṕ  (\u1e54)  LATIN CAPITAL LETTER P WITH ACUTE
R4     2 := Ŕ  (\u0154)  LATIN CAPITAL LETTER R WITH ACUTE
r4     7 := ŕ  (\u0155)  LATIN SMALL LETTER R WITH ACUTE
s4   111 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
S4   687 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
T4    14 := T' (\u0054\u0301) LATIN CAPITAL LETTER T WITH ACUTE
t4    11 := t' (\u0074\u0301) LATIN SMALL LETTER T WITH ACUTE
V4     4 := V' (\u0056\u0301) LATIN CAPITAL LETTER V WITH ACUTE
w4     1 := ẃ  (\u1e83)  LATIN SMALL LETTER W WITH ACUTE
y4     1 := ý (\u00fd) LATIN SMALL LETTER Y WITH ACUTE
Y4     4 := Ý (\u00dd) LATIN CAPITAL LETTER Y WITH ACUTE

Finally, ‘0’ is used in HK-coded Devanagari to indicate an abbreviated form. So, this might be mistaken for an AS code. Here are the occurences:

A0   417
a0   800
d0     1
e0     4
H0     1
i0    44
I0     2
k0     1
o0    49
O4    19
R0     1
T0     3
U0     1
u0    69

DTD

shs.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- shs.dtd
 May 10, 2014

-->
<!ELEMENT  shs (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s |lb | HI | H | Poem" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | wide | sic)*> <!-- italic text -->
<!ELEMENT s (#PCDATA | lb)*> <!-- Sanskrit, in HK transliteration  -->
<!ELEMENT lb EMPTY>  <!-- line break -->
<!ELEMENT HI EMPTY>  <!-- line break, and begin headword -->
<!ELEMENT H EMPTY>  <!-- Header -->
<!ELEMENT Poem (s | lb)*>  <!-- Poem -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->