BEN Benfey Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is file ben_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file ben_orig_utf8.txt is a conversion of ben_orig.txt to the more common utf-8 encoding. The file ben.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
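
The conversion from cp1252 to utf-8 can be sketched in a few lines of Python. This is only an illustration, not the script actually used to produce ben_orig_utf8.txt; the function names are hypothetical. Note that Python's cp1252 codec raises an error on the five byte values that cp1252 leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d).

```python
from pathlib import Path


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the text as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Hypothetical helper: convert a whole file, e.g. ben_orig.txt."""
    Path(dst).write_bytes(cp1252_to_utf8(Path(src).read_bytes()))


# In cp1252, byte 0xA6 is BROKEN BAR and byte 0xE7 is c-cedilla.
sample = b"<P>{#a#}\xa6 {%a-,%} \xe7"
converted = cp1252_to_utf8(sample)
print(converted.decode("utf-8"))
```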

Several non-ASCII characters occur in ben.txt. Each is listed below with its code point, its number of occurrences, and its Unicode name:

¤  (\u00a4)   237 := CURRENCY SIGN: indicates a 'long-short' vowel
¦  (\u00a6) 17324 := BROKEN BAR
º  (\u00ba)   511 := MASCULINE ORDINAL INDICATOR
Æ  (\u00c6)     9 := LATIN CAPITAL LETTER AE
Ç  (\u00c7)  2532 := LATIN CAPITAL LETTER C WITH CEDILLA
æ  (\u00e6)    50 := LATIN SMALL LETTER AE
ç  (\u00e7)  7431 := LATIN SMALL LETTER C WITH CEDILLA
œ  (\u0153)    28 := LATIN SMALL LIGATURE OE
‘  (\u2018)   169 := LEFT SINGLE QUOTATION MARK
’  (\u2019)   172 := RIGHT SINGLE QUOTATION MARK
„  (\u201e)     2 := DOUBLE LOW-9 QUOTATION MARK
†  (\u2020)  1386 := DAGGER
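
A table of this kind could be regenerated with a short Python sketch such as the following. The function names are hypothetical and the sample line is taken from the headword examples in these notes; the real input would be the full text of ben.txt.

```python
from collections import Counter
import unicodedata


def non_ascii_counts(text):
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(text):
    """Format the counts, one line per character, sorted by code point."""
    lines = []
    for ch, n in sorted(non_ascii_counts(text).items()):
        name = unicodedata.name(ch, "NO DESCRIPTION")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


sample = "<P>{#aTani#}\u00a6 and {#aTanI#} {%at2ani\u00a410,%}"
for line in report(sample):
    print(line)
```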

The {X...X} style of coding serves several purposes:

{# #}  22319  : {#X#}  Sanskrit coded in HK. Probably only in Headwords
{% %}  57590  : italic text. Usually (always?) Sanskrit coded in AS.
{@ @}  41385  : bold text. Usually (always?) Sanskrit coded in AS.
{??}     232  : unreadable text.
{| |}      2  : only once - wide spacing in text
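
As an illustration, the three most frequent markup forms can be counted with regular expressions. This is a minimal sketch with hypothetical names. It only counts spans that open and close on the same line, so its totals may differ from the figures above if those were obtained differently.

```python
import re
from collections import Counter

# One pattern per markup form; braces are escaped for clarity.
MARKUP = {
    "{# #}": re.compile(r"\{#(.*?)#\}"),
    "{% %}": re.compile(r"\{%(.*?)%\}"),
    "{@ @}": re.compile(r"\{@(.*?)@\}"),
}


def count_markup(lines):
    """Count complete same-line spans of each markup form."""
    counts = Counter()
    for line in lines:
        for label, pattern in MARKUP.items():
            counts[label] += len(pattern.findall(line))
    return counts


sample_lines = [
    "<P>{@I.@} {#kun~c#}\u00a6 {%KUN4CH,%}",
    "<P>{#aTani#}\u00a6 and {#aTanI#} {%at2ani\u00a410,%}",
]
print(count_markup(sample_lines))
```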

The following <x> type tags are found in ben.txt:

<g></g>  1513  : Greek text
<>      76322  : Beginning of 'normal' lines
<H>        46  : Section header.
<HI>      593  : Appear only in the Additions and Corrections Section
<NI>        1  : Appears in Preface
<P>     17455  : Part of Headword coding.
Page breaks are coded as [Page...].
In the body of the dictionary, the specific form is
[Pagepppp-c+ nn] where
pppp is the page number (zero-filled), from 0001 to 1127,
c is the column: ‘a’ or ‘b’,
nn is the number of (following) lines in the page-column.
The first page break is 0001-a and the last is 1127-b.
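
A page-break line of this form could be parsed as sketched below. The pattern assumes the '+' and the single space in [Pagepppp-c+ nn] are literal, and the line count 50 in the sample is made up for illustration; the names PageBreak and parse_page_break are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumes '+' and the following space are literal characters.
PAGE_RE = re.compile(r"\[Page(\d{4})-([ab])\+ (\d+)\]")


class PageBreak(NamedTuple):
    page: int        # 1 .. 1127
    column: str      # 'a' or 'b'
    line_count: int  # number of following lines in the page-column


def parse_page_break(line: str) -> Optional[PageBreak]:
    """Return the page-break fields, or None if the line has no page break."""
    m = PAGE_RE.search(line)
    if m is None:
        return None
    return PageBreak(int(m.group(1)), m.group(2), int(m.group(3)))


print(parse_page_break("[Page0001-a+ 50]"))
```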

The lines of the digitization correspond to the lines of the text.

Headword coding is generally exemplified by:
<P>{#a’#}¦ {%a-,%}
Some other much less frequently occurring forms are:
<P>1. {#Uh#}¦ {%U10H,%}
<P>† 1. {#kal#}¦ {%KAL,%}
<P>{@I.@} {#kun~c#}¦ {%KUN4CH,%}
<P>{#aGgurIyaka#}¦ {%an3guri10yaka = an3guli10-
<P>{#aTani#}¦ and {#aTanI#} {%at2ani¤10,%}
The most common headword form may be abstracted as:
<P>{#X#}¦ {%Y%} where X is in Harvard-Kyoto coding and Y is in
Anglicized Sanskrit coding.
The most common form (identifying key1,key2) is matched by the regular expression
r'^<P>.*?{#(.*?)#}¦ {%(.*?)%}'

This covers all but about 350 of the headwords, and specifies key1 and key2.
However, for completeness, we use the form
reHeadword = r'^<P>.*?{#(.*?)#}¦' which only identifies key1.
In make_xml, an attempt is made to recognize some of the other forms.
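
The two patterns can be tried on the example headword lines above. This is a sketch, not the code in make_xml: it assumes the second capture group of the two-key pattern is meant to be (.*?), it escapes the braces for safety, and parse_headword is a hypothetical helper.

```python
import re

# Two-key form: captures key1 (HK) and key2 (AS).
RE_KEY1_KEY2 = re.compile(r"^<P>.*?\{#(.*?)#\}\u00a6 \{%(.*?)%\}")
# Fallback form: captures key1 only.
RE_HEADWORD = re.compile(r"^<P>.*?\{#(.*?)#\}\u00a6")


def parse_headword(line):
    """Return (key1, key2), (key1, None), or None for a non-headword line."""
    m = RE_KEY1_KEY2.match(line)
    if m:
        return m.group(1), m.group(2)
    m = RE_HEADWORD.match(line)
    if m:
        return m.group(1), None
    return None


examples = [
    "<P>1. {#Uh#}\u00a6 {%U10H,%}",
    "<P>{#aGgurIyaka#}\u00a6 {%an3guri10yaka = an3guli10-",
    "<P>{#aTani#}\u00a6 and {#aTanI#} {%at2ani\u00a410,%}",
]
for line in examples:
    print(parse_headword(line))
```

With these patterns key2 keeps its trailing comma, and the second and third examples yield only key1, which is consistent with the remark that some headwords are not covered by the two-key form.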

Note that the part between <P> and {# is usually absent (it is present in
431 cases) and is not interpreted by the current programs. When present, it
often contains a ‘homonym’ number.

The headwords are ordered according to Sanskrit alphabet ordering.

Most Sanskrit in the text appears in the European Indological form, which is coded in ben.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses Latin letters x (a-z, A-Z), possibly followed by a number. In the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = accent grave
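
The general scheme above can be turned into Unicode by appending the matching combining mark to the letter and normalizing to NFC. The sketch below follows the general scheme only; the choice of combining characters (for example U+0331 for 'dash below') is an assumption, the function name is hypothetical, and the result may differ from the display conventions actually used for this dictionary.

```python
import re
import unicodedata

# AS number suffix -> Unicode combining mark (assumed mapping).
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below ('dash below')
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}
# Two-digit suffixes are tried before one-digit suffixes.
AS_CODE = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text):
    """Replace letter+number codes by letter+combining mark, then compose."""
    def repl(m):
        return m.group(1) + AS_DIACRITICS[m.group(2)]
    return unicodedata.normalize("NFC", AS_CODE.sub(repl, text))


print(as_to_unicode("ra10ma"))
print(as_to_unicode("an3guri10yaka"))
```

Any letter that is directly followed by a digit for another reason would also be converted, so in practice the function would be applied only to text already known to be in AS coding, such as the contents of {% %} spans.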

Here are the characters that occur in ben.txt in this coding, with their approximate frequency:

Note that the palatal sibilants are coded using the c-cedilla: Ç, ç,
rather than with an Anglicized Sanskrit code.
a1    10 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
A1     2 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
A10  1477 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a10 33177 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11     4 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a2     2 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
a3     2 := à (\u00e0) LATIN SMALL LETTER A with grave (anudatta) accent
a4    46 := á  (\u00e1) LATIN SMALL LETTER A with acute (udatta) accent
a7    21 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
d2  1299 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2   157 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e1     4 := ē  (\u0113)  LATIN SMALL LETTER E WITH MACRON
e10    34 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e4    12 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7     5 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2   253 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
H2     4 := Ḥ  (\u1e24)  LATIN CAPITAL LETTER H WITH DOT BELOW
i1     9 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
I10   143 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i10  6531 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i110     1 := NO DESCRIPTION
i2     4 := NO DESCRIPTION
i4    22 := í (\u00ed) LATIN SMALL LETTER I and acute (udatta) accent
i7     2 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
k2     1 := ḳ  (\u1e33)  LATIN SMALL LETTER K WITH DOT BELOW
L2     1 := Ḷ  (\u1e36)  LATIN CAPITAL LETTER L WITH DOT BELOW
l2    20 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2     6 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3     1 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
M4     1 := NO DESCRIPTION
m4     1 := NO DESCRIPTION
M5    61 := M̄ (\u004d\u0304) LATIN CAPITAL LETTER M, COMBINING TILDE
m5  1746 := m̄ (\u006d\u0304) LATIN SMALL LETTER M, COMBINING TILDE, not represented
n10     1 := NO DESCRIPTION
N2   171 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  4267 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
N3    79 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3   780 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N4    76 := Ń  (\u0143)  LATIN CAPITAL LETTER N WITH ACUTE
n4  6292 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
N5     1 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5     4 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1     5 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o10    76 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
O10     1 := Ô  (\u00d4)  LATIN CAPITAL LETTER O WITH CIRCUMFLEX
o4    42 := ó  (\u00f3) LATIN SMALL LETTER O and acute (udatta) accent
O7     1 := Ö  (\u00d6)  LATIN CAPITAL LETTER O WITH DIAERESIS
o7   458 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
r1     1 := NO DESCRIPTION
r10     1 := NO DESCRIPTION
R2   519 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  5223 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
S2     1 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2     1 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
s4     3 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2   194 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  2525 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
t4     1 := NO DESCRIPTION
t5     1 := NO DESCRIPTION
u1     9 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
U10   141 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u10  2143 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u4    13 := ú (\u00fa) LATIN SMALL LETTER U and acute (udatta) accent
u5     1 := ũ  (\u0169)  LATIN SMALL LETTER U WITH TILDE
u7     8 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
y7     1 := ŷ  (\u0177)  LATIN SMALL LETTER Y WITH CIRCUMFLEX

Notes:

1. Those marked 'NO DESCRIPTION' may be instances of miscoding; they have
  not been further examined.
2. The m5, M5 codes correspond to unicode sequences using a COMBINING TILDE;
  in web page displays, the quality of these displays is sometimes poor.
3. The vowels with accents (like a4) could represent either Sanskrit accents
  (udatta for a4) or acute accents of European languages.
4. Sanskrit long vowels are represented in this dictionary with the
  circumflex, such as ra10ma (râma).
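
Approximate frequencies like those in the table above could be gathered as sketched here. The sketch assumes AS codes are only counted inside italic {% %} and bold {@ @} spans that open and close on one line; the names are hypothetical and the counts it produces need not match the table exactly.

```python
import re
from collections import Counter

# Italic and bold spans, where AS-coded Sanskrit is expected.
AS_SPAN = re.compile(r"\{%(.*?)%\}|\{@(.*?)@\}")
# A letter followed by an AS number suffix.
AS_CODE = re.compile(r"[A-Za-z](?:10|11|[1-7])")


def as_code_counts(lines):
    """Count letter-number codes found inside italic and bold spans."""
    counts = Counter()
    for line in lines:
        for italic, bold in AS_SPAN.findall(line):
            counts.update(AS_CODE.findall(italic or bold))
    return counts


sample_lines = [
    "<P>1. {#Uh#}\u00a6 {%U10H,%}",
    "<P>{@I.@} {#kun~c#}\u00a6 {%KUN4CH,%}",
]
print(as_code_counts(sample_lines))
```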

DTD

ben.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ben.dtd
 Mar 26, 2014

-->
<!ELEMENT  ben (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s |b | g | lb | P" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic, Sanskrit, in AS transliteration -->
<!ELEMENT b (#PCDATA | lb)*> <!-- bold text -->
<!ELEMENT s (#PCDATA | lb)*> <!-- Sanskrit Devanagari, in HK transliteration  -->
<!ELEMENT g (#PCDATA)> <!-- Greek text (empty) -->
<!ELEMENT lb EMPTY> <!-- line break -->
<!ELEMENT P EMPTY> <!-- headword indicator -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
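
To illustrate the record structure that this DTD describes, the following sketch builds one H1 element with Python's standard xml.etree.ElementTree module. It is not taken from make_xml; the function name and all field values are hypothetical, and the body is left as plain text without i, s, b, g, lb or P children.

```python
import xml.etree.ElementTree as ET


def make_record(key1, key2, body_text, lnum, page_col, hom=None):
    """Build an H1 element with h, body and tail children, in DTD order."""
    h1 = ET.Element("H1")
    h = ET.SubElement(h1, "h")
    ET.SubElement(h, "key1").text = key1
    ET.SubElement(h, "key2").text = key2
    if hom is not None:  # hom is optional in the DTD
        ET.SubElement(h, "hom").text = hom
    ET.SubElement(h1, "body").text = body_text
    tail = ET.SubElement(h1, "tail")
    ET.SubElement(tail, "L").text = lnum
    ET.SubElement(tail, "pc").text = page_col
    return h1


# All values below are made up for illustration.
record = make_record("Uh", "U10H", "...", "1", "0001-a", hom="1")
print(ET.tostring(record, encoding="unicode"))
```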