BHS Edgerton Buddhist Hybrid Sanskrit Dictionary (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file bhs_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. In Emacs, for example, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file bhs_orig_utf8.txt is a conversion of bhs_orig.txt to the more common utf-8 encoding. The file bhs.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
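The conversion from cp1252 to utf-8 can be sketched in Python as follows. This is only an illustration of the re-encoding step; the function name convert_cp1252_to_utf8 is hypothetical and is not taken from the project's scripts.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src, dst):
    """Re-encode the file at src (cp1252) as UTF-8 at dst.

    Returns the number of characters converted.
    """
    raw = Path(src).read_bytes()
    # Decoding raises UnicodeDecodeError on the few byte values
    # that cp1252 leaves undefined (for example 0x81).
    text = raw.decode("cp1252")
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)
```

A call such as convert_cp1252_to_utf8("bhs_orig.txt", "bhs_orig_utf8.txt") would then produce the utf-8 copy described above.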

Several non-ASCII characters occur in bhs.txt. Each is listed with its Unicode code point, its number of occurrences, and its Unicode name:

¤  (\u00a4)    63 := CURRENCY SIGN
§  (\u00a7)   734 := SECTION SIGN
°  (\u00b0) 21747 := DEGREE SIGN
ç  (\u00e7)     2 := LATIN SMALL LETTER C WITH CEDILLA
œ  (\u0153)     1 := LATIN SMALL LIGATURE OE
‘  (\u2018)   891 := LEFT SINGLE QUOTATION MARK
’  (\u2019)   949 := RIGHT SINGLE QUOTATION MARK
…  (\u2026)  4092 := HORIZONTAL ELLIPSIS
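A tally of this kind can be produced with a few lines of Python. The sketch below is an assumption about how such a list might be generated, not the program actually used; the function names non_ascii_counts and format_tally are hypothetical.

```python
from collections import Counter
import unicodedata


def non_ascii_counts(text):
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_tally(counts):
    """Render the counts, in code point order, one character per line."""
    lines = []
    for ch in sorted(counts):
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {counts[ch]:5d} := {name}")
    return lines
```

Reading bhs.txt with encoding="utf-8" and passing its contents to non_ascii_counts would give the counts; format_tally lays them out in the style of the list above.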

The {X...X} style of coding serves several purposes:

{%..%}   24894 :  italic text
{@..@}   33113 :  bold text
{--ux}  1  : on line 2, before text proper starts
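As an illustration of how the italic and bold codes might be turned into markup, here is a minimal Python sketch. The target tags i and b are borrowed from the element names in bhs.dtd, but the function markup_to_tags itself is hypothetical, and the sketch only handles spans that open and close on the same line.

```python
import re

# Non-greedy patterns, so that several spans on one line are kept apart.
ITALIC = re.compile(r"\{%(.*?)%\}")
BOLD = re.compile(r"\{@(.*?)@\}")


def markup_to_tags(line):
    """Replace {%..%} with <i>..</i> and {@..@} with <b>..</b>."""
    line = ITALIC.sub(r"<i>\1</i>", line)
    line = BOLD.sub(r"<b>\1</b>", line)
    return line
```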

The following <x> type tags are found in bhs.txt:

<g></g>  3  : Greek, uncoded
<>   69943  : line break
<H>     40  : Headline, letter breaks
<P>  17836  : Paragraph, begin headword
Page breaks are coded as [Page...].
The implicit first Page number is [Page001-a+61], occurring on line 1.
The first explicit marker is [Page001-b+ 61], occurring on line 62.
The last is [Page623-b+ 7], occurring on line 87773.
There are 1315 such lines

Since 623*2 = 1246, there are 1315 + 1 - 1246 = 70 extra Page lines (the + 1 accounts for the implicit first page marker). The extra Page lines are of the form [Pagexxx-1a+n], or the same with -1b, -2a, -2b, -3a or -3b; these alternate forms are used when one or more letter breaks occur on a page. The lines of the digitization correspond to lines of the text.
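The page markers can be picked apart with a regular expression. The following Python sketch is an assumption based only on the marker forms quoted above; the field names page, section, column and n are hypothetical labels, and the meaning of the trailing number is not interpreted here.

```python
import re

# [Page001-b+ 61], [Page623-b+ 7], and the alternate forms -1a, -1b, -2a, ...
PAGE_RE = re.compile(r"\[Page(\d+)-(\d?)([ab])\+\s*(\d+)\]")


def parse_page_marker(text):
    """Return the parts of the first page marker in text, or None."""
    m = PAGE_RE.search(text)
    if m is None:
        return None
    page, section, column, n = m.groups()
    return {
        "page": int(page),
        # 0 means the plain form without a digit before the column letter.
        "section": int(section) if section else 0,
        "column": column,
        "n": int(n),
    }
```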

Headword coding is exemplified by:

<P>{@a-, an-@}
<P>[2 {@am2s4a-dha1tri1@}, see {@am2sa-@}].
<P>{@-am2s4ika@}
<P>{@am2s4u@}

Here is the regular expression used in Python programs to recognize headwords:

reHeadword = r'^<P>.*?{@(.*?)@}'
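To show the expression at work, here is a small Python sketch applied to the sample lines above. The helper function headword is hypothetical; only the pattern itself comes from these notes.

```python
import re

# The headword pattern from these notes: the first {@..@} span
# on a line that begins with <P>.
reHeadword = r'^<P>.*?{@(.*?)@}'


def headword(line):
    """Return the headword of a line, or None if the line has none."""
    m = re.search(reHeadword, line)
    return m.group(1) if m else None
```

Note that for a cross-reference line such as the second sample, the pattern captures the first bold span (am2s4a-dha1tri1), not the target of "see".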

The headword is coded in AS.

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text appears in the European Indological form, which is coded in bhs.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses a Latin letter x (a-z, A-Z), possibly followed by a number; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent
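The general scheme lends itself to a table-driven conversion using Unicode combining marks. The Python sketch below is an illustration under stated assumptions, not the transcoder used for the dictionary: the names COMBINING and as_to_unicode are hypothetical, "dash below" is taken to mean the combining macron below, and combined codes such as r21 or a9 are left unchanged. It converts every letter followed by a listed number, so it would also alter such sequences outside Sanskrit text.

```python
import re
import unicodedata

# Number suffix -> Unicode combining mark, following the general scheme.
COMBINING = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below (assumed reading of "dash below")
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

AS_TOKEN = re.compile(r"([A-Za-z])(\d+)")


def as_to_unicode(text):
    """Replace AS letter-number codes by precomposed Unicode letters."""
    def repl(m):
        mark = COMBINING.get(m.group(2))
        if mark is None:
            return m.group(0)  # unknown suffix: leave the code as it is
        # NFC composes letter + mark into a single character where one exists.
        return unicodedata.normalize("NFC", m.group(1) + mark)
    return AS_TOKEN.sub(repl, text)
```

For example, as_to_unicode("am2s4a-dha1tri1") gives aṃśa-dhātrī.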

Here are the characters that occur in bhs.txt in this coding, with their approximate frequency:

A1   262 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 83276 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     4 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11    28 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a2     2 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
a3     2 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a4     7 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
a7   238 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
a9     3 := LATIN SMALL LETTER A WITH MACRON and dot below. Unrendered.
d2  3638 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2     7 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e10    16 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11    19 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
e3     1 := ė  (\u0117)  LATIN SMALL LETTER E WITH DOT ABOVE
e4   561 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7     4 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2  7917 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1    28 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1 11701 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10     4 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
K4    56 := Ḱ  (\u1e30)  LATIN CAPITAL LETTER K WITH ACUTE
l2   180 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2 18665 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
n2 13289 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3  3139 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N5     1 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5  3988 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1    23 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o10     9 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
o7    11 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2    43 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  5307 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21    10 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
S2     6 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2 15603 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4  3046 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4 12641 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2     4 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  6839 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1     8 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  6426 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10     1 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u11     1 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u7   102 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
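Frequencies like those above can be approximated by counting every letter-number combination in the file. The Python sketch below is one assumed way of doing so (the name as_code_counts is hypothetical); since it counts any letter followed by digits, wherever it occurs, the results are approximate.

```python
from collections import Counter
import re

# A Latin letter followed by one or more digits, e.g. a1, m2, a10, r21.
AS_CODE = re.compile(r"[A-Za-z]\d+")


def as_code_counts(lines):
    """Count AS letter-number codes over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(AS_CODE.findall(line))
    return counts
```

Passing an open file object for bhs.txt (opened with encoding="utf-8") as the lines argument would tally the whole digitization.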

DTD

bhs.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- bhs.dtd
 June 17, 2014

-->
<!ELEMENT  bhs (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "br | g | b | i " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in bhs.txt -->
<!ELEMENT g (#PCDATA)> <!-- Greek  -->
<!ELEMENT b (#PCDATA | g | br)*> <!-- bold  -->
<!ELEMENT i (#PCDATA | br | g )*> <!-- italic  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->