BUR Burnouf Dictionnaire Sanscrit-Français (Developer notes)

Date of digitization: 2008

Metadata

The original digitization is the file bur_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file bur_orig_utf8.txt is a conversion of bur_orig.txt to the more common utf-8 encoding. The file bur.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
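The conversion from cp1252 to utf-8 can be illustrated with a minimal Python sketch. This is not necessarily how bur_orig_utf8.txt was produced; the function name cp1252_to_utf8 and the sample bytes are assumptions made for illustration.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the text as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


# Hypothetical sample: in cp1252, byte 0xA6 is BROKEN BAR and
# byte 0xE7 is LATIN SMALL LETTER C WITH CEDILLA.
sample = b".{#a#}\xa6 \xe7a"
converted = cp1252_to_utf8(sample)
print(converted.decode("utf-8"))
```

For a whole file, the same function could be applied to the bytes read from bur_orig.txt in binary mode, with the result written out in binary mode.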

Several non-ASCII (extended ASCII) characters occur in bur.txt, listed here with their counts:

¦  (\u00a6) 19774 := BROKEN BAR
§  (\u00a7)   129 := SECTION SIGN
°  (\u00b0)     9 := DEGREE SIGN
Ç  (\u00c7)   489 := LATIN CAPITAL LETTER C WITH CEDILLA
ç  (\u00e7)  8210 := LATIN SMALL LETTER C WITH CEDILLA

Note: Several additional non-ASCII characters occur in bur_orig_utf8.txt;
these were removed in the course of editing changes, and thus do not
occur in the current bur.txt:
£  (\u00a3)     7 := POUND SIGN
¤  (\u00a4) 22092 := CURRENCY SIGN
¥  (\u00a5)     2 := YEN SIGN
µ  (\u00b5)  3825 := MICRO SIGN
¼  (\u00bc)     1 := VULGAR FRACTION ONE QUARTER
Æ  (\u00c6)     1 := LATIN CAPITAL LETTER AE
Ø  (\u00d8)     1 := LATIN CAPITAL LETTER O WITH STROKE
ø  (\u00f8)     1 := LATIN SMALL LETTER O WITH STROKE
œ  (\u0153)     1 := LATIN SMALL LIGATURE OE
’  (\u2019)     2 := RIGHT SINGLE QUOTATION MARK
”  (\u201d)     5 := RIGHT DOUBLE QUOTATION MARK
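A listing in the format above could be generated along the following lines. This is only a sketch: the function name non_ascii_report and the sample string are hypothetical, and the counts shown in this document were not necessarily produced this way.

```python
from collections import Counter
import unicodedata


def non_ascii_report(text):
    """Return (char, code, count, name) for each non-ASCII character in text."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"\\u{ord(ch):04x}", n, unicodedata.name(ch, "UNKNOWN"))
        for ch, n in sorted(counts.items())
    ]


# Hypothetical sample text, not taken from bur.txt.
sample = ".{#a#}\u00a6 \u00e7a \u00a7 \u00e7"
for ch, code, n, name in non_ascii_report(sample):
    print(f"{ch}  ({code}) {n:5d} := {name}")
```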

The {X...X} style of coding serves several purposes:

{#X#}  19902  := Devanagari, coded as HK (primarily in headwords)
{%X%}  71661  := italic
{??}      16  := unreadable
{@@}     161  := bold
{^X^}    386  := superscript

The <> style of coding is used as follows:

<g></g>  669 := Greek text (not coded but identified)
<H>       48  := Header (letter change)
<P>    14887  := Sub-headwords identified by <P>{%...%}
There are a few (< 100) other instances of <P>, indicating a
paragraph break in the text.

The general form of a page break is [PageX-C] where
X = PPP is the page number, and C = 1 or 2 is the column number.
For pages with one or more letter breaks, X may have the form PPPY,
where Y is ‘a’ or ‘b’.

There are two missing (presumed blank) pages, 633 and 634, which
appear in the digitization as [Page633] and [Page634].

The first page number of the dictionary body is 005-1.
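The page-break forms described above can be recognized with a single regular expression. The following sketch is one possible reading of that description; the names PageBreak, PAGE_BREAK and find_page_breaks are hypothetical, and the page 120a in the sample is invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class PageBreak(NamedTuple):
    page: int               # PPP, the page number
    suffix: str             # "a" or "b" for letter-break pages, else ""
    column: Optional[int]   # 1 or 2; None for forms such as [Page633]


# [PagePPP-C], [PagePPPY-C], or the column-less form [PagePPP].
PAGE_BREAK = re.compile(r"\[Page(\d{3})([ab]?)(?:-([12]))?\]")


def find_page_breaks(text):
    """Return all page breaks found in text, in order of appearance."""
    return [
        PageBreak(int(m.group(1)), m.group(2),
                  int(m.group(3)) if m.group(3) else None)
        for m in PAGE_BREAK.finditer(text)
    ]


sample = "[Page005-1] ... [Page120a-2] ... [Page633]"
for pb in find_page_breaks(sample):
    print(pb)
```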

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded.

Headword coding is exemplified by: .{#a#}¦
The general form is .{#X#}¦
where X is coded in Harvard-Kyoto transliteration.
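A minimal sketch of extracting the headword from a line of this form follows. The names HEADWORD and headword are hypothetical, and the text after the broken bar in the sample is invented.

```python
import re

# A headword line starts with ".{#", then X in Harvard-Kyoto, then "#}" and a broken bar.
HEADWORD = re.compile(r"^\.\{#(.+?)#\}\u00a6")


def headword(line):
    """Return the HK headword of a headword line, or None for other lines."""
    m = HEADWORD.match(line)
    return m.group(1) if m else None


print(headword(".{#a#}\u00a6 rest of entry"))
print(headword("<P>{%sub-headword%}"))
```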

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text usually appears in the European Indological form, which is coded in bur.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters x (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = accent grave

This same scheme is used to code the diacriticals in French text.
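The letter-number scheme can be turned into Unicode by appending the corresponding combining mark and normalizing to NFC. The sketch below is an assumption-laden illustration rather than the project's own converter: the names COMBINING, AS_CODE and as_to_unicode are hypothetical, the code 21 (as in r21) is read as dot below plus macron, and "dash below" is mapped to COMBINING MACRON BELOW. It converts every letter-number sequence it sees, so it does not distinguish letter-number sequences that are not AS codings.

```python
import re
import unicodedata

# AS number -> combining mark(s). 21 is assumed to mean dot below + macron.
COMBINING = {
    "1": "\u0304",    # macron
    "2": "\u0323",    # dot below
    "3": "\u0307",    # dot above
    "4": "\u0301",    # acute
    "5": "\u0303",    # tilde
    "6": "\u0331",    # macron below (assumed for "dash below")
    "7": "\u0308",    # diaeresis (umlaut)
    "10": "\u0302",   # circumflex
    "11": "\u0300",   # grave
    "21": "\u0323\u0304",
}

# Two-digit codes are listed first so that "a10" is not read as "a1" + "0".
AS_CODE = re.compile(r"([A-Za-z])(10|11|21|[1-7])")


def as_to_unicode(text):
    """Replace AS letter-number codes by precomposed Unicode letters."""
    def repl(m):
        return unicodedata.normalize("NFC", m.group(1) + COMBINING[m.group(2)])
    return AS_CODE.sub(repl, text)


print(as_to_unicode("kr2s2n2a"))   # Sanskrit example
print(as_to_unicode("e4te4"))      # French example
```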

Here are the characters that occur in bur.txt in this coding, with their approximate frequency:

NOTE: Many of these codings occur in French words, or in words of other European languages.

A1     1 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 30672 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10  2854 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11  3226 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
A2     5 := NOT AS
a2    89 := NOT AS
d2  2099 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2     2 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
E10     1 := Ê  (\u00ca)  LATIN CAPITAL LETTER E WITH CIRCUMFLEX
e10  2932 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11  3371 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
e4 23456 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7    86 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
f1   104 := NOT AS
F1     1 := NOT AS
f2   366 := NOT AS
F2     2 := NOT AS
i1  6986 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
I1     1 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i10  1194 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i7   123 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
L2     1 := Ḷ  (\u1e36)  LATIN CAPITAL LETTER L WITH DOT BELOW
l2    84 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
L21     1 := Ḹ  (\u1e38)  LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
m2   353 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
N2     1 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  4983 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
N3     1 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3  1728 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n4   963 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
N5     1 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5  1484 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o10   640 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
o11     1 := ò  (\u00f2)  LATIN SMALL LETTER O WITH GRAVE
R2     1 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  5484 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21   338 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
R21     1 := Ṝ  (\u1e5c)  LATIN CAPITAL LETTER R WITH DOT BELOW AND MACRON
s2  5953 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
T2     2 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  3713 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1     1 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  3416 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10   453 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11   276 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u7     5 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS

DTD

bur.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- bur.dtd
 June 17, 2014
-->
<!ELEMENT  bur (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i|b|P|H | g|s" >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA )*>
<!-- special_chars -->
<!ELEMENT i (#PCDATA |g)*>  <!-- italic -->
<!ELEMENT b (#PCDATA)>  <!-- bold -->
<!ELEMENT s (#PCDATA)>  <!-- Devanagari in HK -->
<!ELEMENT P EMPTY>  <!-- Paragraph -->
<!ELEMENT H EMPTY>  <!-- Header -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in bur.txt -->
<!ELEMENT g (#PCDATA)> <!-- Greek  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
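To illustrate the structure the DTD describes, the sketch below checks the top-level shape of a single H1 record using only the Python standard library. It is not a DTD validator: it inspects direct children only and ignores nested content models. The function name h1_matches_dtd and the sample record (including its key, L and pc values) are hypothetical placeholders, not taken from the dictionary's XML.

```python
import xml.etree.ElementTree as ET

# Element names allowed inside <body>, per the body_elts entity.
BODY_ELTS = {"i", "b", "P", "H", "g", "s"}


def h1_matches_dtd(h1):
    """Shallow check that an H1 element has the shape (h, body, tail)."""
    if h1.tag != "H1" or [c.tag for c in h1] != ["h", "body", "tail"]:
        return False
    h, body, tail = h1
    if [c.tag for c in h] != ["key1", "key2"]:
        return False
    if any(c.tag not in BODY_ELTS for c in body):
        return False
    return all(c.tag in {"L", "pc"} for c in tail)


# Hypothetical record with placeholder values.
record = (
    "<H1><h><key1>agni</key1><key2>agni</key2></h>"
    "<body><s>agni</s> <i>m.</i> feu.</body>"
    "<tail><L>1</L><pc>005-1</pc></tail></H1>"
)
print(h1_matches_dtd(ET.fromstring(record)))
```

Full validation against bur.dtd would need a validating XML parser, which the standard library's ElementTree does not provide.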