BOP Bopp Glossarium Sanscritum (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is the file bop_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file bop_orig_utf8.txt is a conversion of bop_orig.txt to the more common utf-8 encoding. The file bop.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
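The re-encoding step described above can be sketched in a few lines of Python. This is only an illustration, not the script that was actually used to produce bop_orig_utf8.txt; the function names cp1252_bytes_to_utf8 and convert_file are hypothetical.

```python
def cp1252_bytes_to_utf8(raw: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the resulting text as utf-8."""
    return raw.decode("cp1252").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Rewrite a cp1252-encoded file as utf-8 (paths are supplied by the caller)."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1252_bytes_to_utf8(raw))
```

A call such as convert_file("bop_orig.txt", "bop_orig_utf8.txt") would then perform the conversion. Working on bytes rather than text-mode files keeps line endings untouched.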

Several non-ASCII (extended ASCII) characters occur in bop.txt. Each is listed below with its code point, its frequency, and its Unicode name:

£  (\u00a3)     8 := POUND SIGN  'roof' sign on vowel; precedes vowel
¤  (\u00a4)    70 := CURRENCY SIGN  indicating short vowel; precedes vowel
¦  (\u00a6) 10495 := BROKEN BAR  ends headword
§  (\u00a7)    37 := SECTION SIGN  paragraph sign
«  (\u00ab)  1118 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
º  (\u00ba)    88 := MASCULINE ORDINAL INDICATOR
»  (\u00bb)  1109 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
æ  (\u00e6)     1 := LATIN SMALL LETTER AE
ç  (\u00e7)     5 := LATIN SMALL LETTER C WITH CEDILLA
œ  (\u0153)    16 := LATIN SMALL LIGATURE OE
“  (\u201c)     1 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)    14 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)    11 := DOUBLE LOW-9 QUOTATION MARK
…  (\u2026)   755 := HORIZONTAL ELLIPSIS
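A census like the one above can be reproduced with the standard library. The following is a minimal sketch, assuming the text has already been read into a Python string; the function name non_ascii_census and the sample string are hypothetical and not taken from bop.txt.

```python
import unicodedata
from collections import Counter


def non_ascii_census(text: str) -> list:
    """Return (character, count, Unicode name) for each non-ASCII character,
    sorted by code point."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, n, unicodedata.name(ch, "UNKNOWN"))
        for ch, n in sorted(counts.items())
    ]


# Hypothetical sample line, for illustration only.
sample = "<HI>{#aMz#}\u00a6 \u00abx\u00bb \u2026 \u00a6"
for ch, n, name in non_ascii_census(sample):
    print(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
```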

The {X...X} style of coding serves several purposes:

{#X#}  40080  : Devanagari text, coded in HK (Harvard-Kyoto)
{%X%}  23434  : italic text
{| |}    658  : widely spaced text. Often combined with {%%}.
{??}      12  : unreadable text
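The four brace codes above can be located with regular expressions. The sketch below is one possible approach, assuming each marked span opens and closes on the same line; BRACE_PATTERNS and count_brace_markup are hypothetical names.

```python
import re

# One pattern per brace code; non-greedy so adjacent spans stay separate.
BRACE_PATTERNS = {
    "devanagari": re.compile(r"\{#(.*?)#\}"),
    "italic": re.compile(r"\{%(.*?)%\}"),
    "wide": re.compile(r"\{\|(.*?)\|\}"),
    "unreadable": re.compile(r"\{\?\?\}"),
}


def count_brace_markup(text: str) -> dict:
    """Count occurrences of each brace code in the given text."""
    return {name: len(pat.findall(text)) for name, pat in BRACE_PATTERNS.items()}
```

Because widely spaced text is often nested inside italic text, as in {%{|X|}%}, such a span is counted once under each of the two codes.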

A pseudo-xml type of markup also occurs in the digitization. The following <x> type tags are found in bop.txt:

<F>X</F>  70 : Footnote (a footnote at bottom of page is inserted at its
               point of reference)
<g></g> 1715 : Greek text - This is not coded.
<>     18436 : At beginning of most lines.
<H>       43 : Section heading, such as for words beginning with a new letter.
<HI>   11112 : Part of headword coding
<NI>       1 : Occurs in prefatory material
<P>        3 : Paragraph marker. Twice in Preface; once for a comment
               at the retroflex nasal.

Page breaks are coded as [Page...]. More specifically, the usual coding is [Pagexxx-y+ nn], where xxx is the page number, y is the column (a or b), and nn is the number of lines in column ‘y’ of page ‘xxx’.

Some pages are coded as [Pagexxx-zy+ nn], where ‘z’ is 1, 2 or 3; these are cases of ‘letter breaks’, where the words beginning with one letter end and the words beginning with the next letter begin.
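
The page-break coding can be parsed with a single regular expression. This is a sketch under stated assumptions: the page number is taken to be a run of digits, the marker strings used in the checks are invented for illustration, and the names PAGE_RE, PageBreak and parse_page_break are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# [Pagexxx-y+ nn] or [Pagexxx-zy+ nn]; z (1, 2 or 3) is optional.
PAGE_RE = re.compile(r"\[Page(\d+)-([123]?)([ab])\+\s*(\d+)\]")


class PageBreak(NamedTuple):
    page: int                    # xxx
    letter_break: Optional[int]  # z, or None for the usual coding
    column: str                  # y: "a" or "b"
    line_count: int              # nn


def parse_page_break(text: str) -> Optional[PageBreak]:
    """Return the first page-break marker found in text, or None."""
    m = PAGE_RE.search(text)
    if m is None:
        return None
    page, z, column, nn = m.groups()
    return PageBreak(int(page), int(z) if z else None, column, int(nn))
```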

The lines of the digitization correspond to lines of the text.

Headword coding is exemplified by: <HI>{#aMz#}¦ The general form is <HI>{#X#}¦ , where X (key1) is coded in Harvard-Kyoto transliteration. For words with multiple homonym entries, such as <HI>1. {#a#}¦ , the general form may be written as <HI>h. {#X#}¦ , where h is the homonym number.

A similar form is exemplified by: <HI>c. {#adhi#}¦ This does not represent a headword, but a prefix form of the root headword under which it appears.
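
The three headword forms described above can be told apart programmatically. The following is a minimal sketch; it assumes that a numeric label is a homonym number and that any alphabetic label (such as c.) marks a prefix form, which goes beyond what is stated here. HEADWORD_RE and parse_headword are hypothetical names.

```python
import re
from typing import Optional

# <HI>{#X#}¦  |  <HI>h. {#X#}¦  |  <HI>c. {#X#}¦
HEADWORD_RE = re.compile(r"<HI>(?:([0-9]+|[a-z]+)\. )?\{#(.+?)#\}\u00a6")


def parse_headword(line: str) -> Optional[dict]:
    """Extract key1 (in HK), the homonym number, and the prefix-form flag."""
    m = HEADWORD_RE.search(line)
    if m is None:
        return None
    label, key1 = m.groups()
    is_number = label is not None and label.isdigit()
    return {
        "key1": key1,
        "hom": int(label) if is_number else None,
        # Assumption: an alphabetic label marks a prefix form, not a headword.
        "prefix_form": label is not None and not is_number,
    }
```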

The headwords are ordered according to Sanskrit alphabet ordering.

The text contains words in several languages. Some of this text is presented using a European alphabet, where letters have various kinds of accents or other markings. Such decorated text is coded in bop.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly with suffixed numbers. The letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = accent grave

Here are the characters that occur in bop.txt in this coding, with their approximate frequency:

A1     2 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a10  1345 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11    39 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a4    49 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
a7    25 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
C4    22 := Ć  (\u0106)  LATIN CAPITAL LETTER C WITH ACUTE
c4    46 := ć  (\u0107)  LATIN SMALL LETTER C WITH ACUTE
d2     1 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
E10     2 := Ê (\u00ca)  LATIN CAPITAL LETTER E WITH CIRCUMFLEX
e10   383 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11     2 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
e14   112 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX and ACUTE (rendering)
e4    59 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e5     8 := ẽ  (\u1ebd)  LATIN SMALL LETTER E WITH TILDE
e7    62 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
G4     8 := Ǵ  (\u01f4)  LATIN CAPITAL LETTER G WITH ACUTE
g4    21 := ǵ  (\u01f5)  LATIN SMALL LETTER G WITH ACUTE
i1     3 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10   129 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i11     2 := ì  (\u00ec)  LATIN SMALL LETTER I WITH GRAVE
i4     5 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE
i7     3 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
k4     1 := ḱ  (\u1e31)  LATIN SMALL LETTER K WITH ACUTE
n2     5 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3    21 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
O1     1 := Ō  (\u014c)  LATIN CAPITAL LETTER O WITH MACRON
o10   235 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
o11     5 := ò  (\u00f2)  LATIN SMALL LETTER O WITH GRAVE
o4    35 := ó  (\u00f3) LATIN SMALL LETTER O WITH ACUTE
o7    18 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
r2     8 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21     2 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
s4     1 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t2     8 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1     1 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1     1 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10   256 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11    84 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u3     1 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u4    10 := ú (\u00fa) LATIN SMALL LETTER U WITH ACUTE
U7     4 := Ü  (\u00dc)  LATIN CAPITAL LETTER U WITH DIAERESIS
u7    36 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
z4     4 := ź  (\u017a)  LATIN SMALL LETTER Z WITH ACUTE
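
The general AS scheme can be turned into Unicode by appending the matching combining mark to the base letter and normalizing to NFC. The sketch below is an assumption-laden illustration, not the conversion code used for the dictionary: it covers only the suffixes of the general scheme, maps "dash below" to COMBINING MACRON BELOW as a guess, and leaves combined codes such as r21 and e14 unchanged. AS_DIACRITICS and as_to_unicode are hypothetical names.

```python
import re
import unicodedata

# AS numeric suffix -> Unicode combining mark (general scheme only).
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below (assumed reading of "dash below")
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

AS_RE = re.compile(r"([A-Za-z])(\d+)")


def as_to_unicode(text: str) -> str:
    """Replace letter-number AS codes with precomposed Unicode letters."""
    def repl(m):
        mark = AS_DIACRITICS.get(m.group(2))
        if mark is None:
            return m.group(0)  # unknown suffix: leave the code untouched
        return unicodedata.normalize("NFC", m.group(1) + mark)
    return AS_RE.sub(repl, text)
```

Note that such a function should only be applied to spans known to hold AS-coded text, since any letter followed by digits elsewhere in the file would also be rewritten.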

DTD

bop.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- bop.dtd
 April 5, 2014

-->
<!ELEMENT  bop (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s  | lb |HI|P|g|F" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | wide | sic | lb | g)*> <!-- italic, Sanskrit, in AS transliteration -->
<!ELEMENT s (#PCDATA | lb)*> <!-- Sanskrit, in slp1 transliteration  -->
<!ELEMENT g (#PCDATA)> <!-- Greek (placeholder), always empty  -->
<!ELEMENT F (#PCDATA  | %body_elts;)*>  <!-- footnote -->
<!ELEMENT lb EMPTY>
<!ELEMENT HI EMPTY>
<!ELEMENT P EMPTY>

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->