MD Macdonell Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2006

Metadata

The original digitization is the file md_orig.txt, which is coded in the cp1252 (Windows-1252) encoding and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file md_orig_utf8.txt is a conversion of md_orig.txt to the more common utf-8 encoding.
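The conversion from cp1252 to utf-8 can be sketched in Python as follows. This is only an illustration; the function names are hypothetical, and the script actually used to produce md_orig_utf8.txt is not described in these notes.

```python
def cp1252_bytes_to_utf8(data):
    """Re-encode cp1252 bytes as UTF-8 bytes."""
    # Python's "cp1252" codec raises UnicodeDecodeError on the few byte
    # values that Windows-1252 leaves undefined.
    return data.decode("cp1252").encode("utf-8")


def convert_file(src_path, dst_path):
    """Hypothetical helper: convert one file, working on raw bytes so
    that line endings are carried over unchanged."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1252_bytes_to_utf8(data))
```

A call such as convert_file("md_orig.txt", "md_orig_utf8.txt") would then produce the utf-8 file.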

The md_orig files do not code separate lines of text on separate lines of the digitization, but use a certain system involving the ‘|’ character. By making use of this system, a program split_line.py constructs md_split.txt from md_orig_utf8.txt; the lines of md_split.txt are approximately the same as the lines of the scan. The program unsplit_line.py does the inverse conversion, just to be sure no information is lost.

The file md.txt is also in the utf-8 encoding, and incorporates various editing changes to md_split.txt, such as corrections of typographical errors.

Several non-ASCII characters occur in md.txt. They are listed here with their counts:

¤  (\u00a4)  3396 := CURRENCY SIGN (significance unclear)
¦  (\u00a6) 20752 := BROKEN BAR
§  (\u00a7)     2 := SECTION SIGN  (in header part of digitization file)
°  (\u00b0)  4815 := DEGREE SIGN
±  (\u00b1)   218 := PLUS-MINUS SIGN  (with or without)
²  (\u00b2)    25 := SUPERSCRIPT TWO
¹  (\u00b9)   500 := SUPERSCRIPT ONE
½  (\u00bd)     6 := VULGAR FRACTION ONE HALF
‘  (\u2018)     1 := LEFT SINGLE QUOTATION MARK
’  (\u2019)     1 := RIGHT SINGLE QUOTATION MARK
‡  (\u2021)  2844 := DOUBLE DAGGER  (used to indicate that a sandhi is
                     to be applied, such as vi‡ati under kram;
                     represented in displays as underline '_')

The square-root symbol is represented in the digitization as the
four-character sequence '(**)'.
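A count of this kind can be reproduced with a short Python sketch. The function names are hypothetical, and the output format only approximates the table above; this is not the program that produced the original figures.

```python
from collections import Counter
import unicodedata


def count_non_ascii(text):
    """Count every character whose code point lies outside ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(counts):
    """Format the counts as lines resembling the table above."""
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:6d} := {name}")
    return lines
```

To apply it, read md.txt with encoding="utf-8" and pass the text to count_non_ascii.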

The {X...X} style of coding serves several purposes:

{#X#}   21219 : Devanagari text, coded in HK
{@X@}   45777 : bold Sanskrit text, coded in AS
{%X%}  133029 : italic text
{??}        1 : unreadable text
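The {X...X} spans can be picked out of a line with regular expressions, as in the sketch below. It assumes that a span never crosses a line boundary, which these notes do not state; the names SPAN_PATTERNS and find_spans are hypothetical.

```python
import re

# One pattern per span type; the lazy ".*?" stops at the first closing
# delimiter of the same kind.
SPAN_PATTERNS = {
    "devanagari": re.compile(r"\{#(.*?)#\}"),
    "bold": re.compile(r"\{@(.*?)@\}"),
    "italic": re.compile(r"\{%(.*?)%\}"),
}


def find_spans(line):
    """Return the contents of each span type found in one line."""
    return {kind: pat.findall(line) for kind, pat in SPAN_PATTERNS.items()}
```

Where one span is nested inside another, as in {@{#jakS#}@}, the outer span's content is returned with the inner markup still in place.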

The following <x> type tags are found in md.txt:

<H1> 20748 : Begins new headword
<>   62243 : Begins normal line
<H>     43 : Headline.  Used at beginning of words starting with a new letter
<g>X</g> 9 : Greek. X is coded in an undocumented transliteration

Page breaks are coded as [Page...].
The first page number is [Page1-1], occurring on line 7 of md.txt.
In general, a page number has the form [PageX-C], where X is the page number and
C is the column number (1, 2, or 3).
X is generally a sequence of digits. However, it may have the form
UY, where U is a sequence of digits and Y is 'a', 'b', 'c', 'd', or 'e'; these
forms occur on pages containing headwords that start with two or more different
letters.
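The page-break form described above can be matched with a regular expression. The following is a minimal sketch; PAGE_RE and parse_page_breaks are hypothetical names, and the lettered page used in the test is an invented example of the UY form, not a value taken from md.txt.

```python
import re

# [PageX-C]: X is digits with an optional letter a-e, C is column 1-3.
PAGE_RE = re.compile(r"\[Page(\d+)([a-e]?)-([123])\]")


def parse_page_breaks(line):
    """Return (page number, letter suffix, column) for each page break."""
    return [
        (int(num), suffix, int(col))
        for num, suffix, col in PAGE_RE.findall(line)
    ]
```

The letter suffix is returned as an empty string when the page number consists of digits only.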

The lines of the digitization approximately represent lines of the scanned text; see the discussion of md_split.txt above.

There are several headword forms in the digitization.
<H1>{#a#}^1¦ headword=a, homonym=1
or
<H1>{@{#jakS#}@}^2¦~
or
<H1>{#[jaMh#}¦~
or
<H1>{@{#akS#}@}¦~
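The four headword forms above can be recognized with one regular expression, sketched below. The names are hypothetical, and the sketch does not interpret the '[' in {#[jaMh#} or the trailing '~', whose meaning is not explained here; both are left as found.

```python
import re

# <H1>, optional bold wrapper {@...@}, the {#...#} headword,
# an optional ^N homonym number, then the broken bar.
HEADWORD_RE = re.compile(
    r"^<H1>(?P<bold>\{@)?\{#(?P<key>.*?)#\}(?:@\})?(?:\^(?P<hom>\d+))?\u00a6"
)


def parse_headword(line):
    """Return headword details for an <H1> line, or None otherwise."""
    m = HEADWORD_RE.match(line)
    if m is None:
        return None
    return {
        "key": m.group("key"),
        "hom": m.group("hom"),  # None when there is no homonym number
        "bold": m.group("bold") is not None,
    }
```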

Most Sanskrit in the body of definitions is printed in a variant of the European Indological form, which is coded in md.txt as a variant of the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters x (a-z, A-Z), possibly with suffixed numbers; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

The text represents udatta and svarita accents in Indological text by an acute accent and a tilde, respectively, above the accented vowel.

Here are the characters that occur in md.txt in this coding, with their approximate frequency:

A1  1364 := Â  (\u00c2)  LATIN CAPITAL LETTER A WITH CIRCUMFLEX
a1 24672 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11  1042 := â'   (\u00e2\u0301)  LATIN SMALL LETTER A WITH CIRCUMFLEX and ACUTE
a12     9 := â* (\u00e2\u0306) LATIN SMALL LETTER A WITH CIRCUMFLEX and BREVE
a14    11 := â'   (\u00e2\u0301)  LATIN SMALL LETTER A WITH CIRCUMFLEX and ACUTE
a15     9 := â~ (\u00e2\u0303) LATIN SMALL LETTER A WITH CIRCUMFLEX and TILDE
a3     1 := a3  (in Greek only)
a4  5583 := á (\u00e1) LATIN SMALL LETTER A WITH ACUTE
a5   162 := â'   (\u00e2\u0301)  LATIN SMALL LETTER A WITH CIRCUMFLEX and ACUTE
a8    29 := ă (\u0103) LATIN SMALL LETTER A WITH BREVE
e1     3 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11     2 := ê'  (\u00ea\u0301)  LATIN SMALL LETTER E WITH CIRCUMFLEX and ACUTE
e4   292 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
H1 20748 := H1 in <H1> tag
h4     2 := h4  (twice in Greek text)
i1  6826 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
I1    70 := Î  (\u00ce)  LATIN CAPITAL LETTER I WITH CIRCUMFLEX
i11   337 := î'  (\u00ee\u0301)  LATIN SMALL LETTER I WITH CIRCUMFLEX and ACUTE
i12     3 := î*  (\u00ee\u0306) LATIN SMALL LETTER I WITH CIRCUMFLEX and BREVE
i14     3 := î'  (\u00ee\u0301)  LATIN SMALL LETTER I WITH CIRCUMFLEX and ACUTE
i4   924 := í (\u00ed) LATIN SMALL LETTER I WITH ACUTE
i5    29 := î'  (\u00ee\u0301)  LATIN SMALL LETTER I WITH CIRCUMFLEX and ACUTE
i8    17 := ĭ (\u012d) LATIN SMALL LETTER I WITH BREVE
i9     5 := ĭ (\u012d) LATIN SMALL LETTER I WITH BREVE
m2     1 := m%  (\u006d\u0310)  LATIN SMALL LETTER M WITH CHANDRABINDU
N3    12 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3  1076 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N5    31 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5   959 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o11     1 := ô'  (\u00f4\u0301)  LATIN SMALL LETTER O WITH CIRCUMFLEX and ACUTE
o4   223 := ó  (\u00f3) LATIN SMALL LETTER O WITH ACUTE
U1    53 := Û  (\u00db)  LATIN CAPITAL LETTER U WITH CIRCUMFLEX
u1  2601 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11   164 := û'  (\u00fb\u0301)  LATIN SMALL LETTER U WITH CIRCUMFLEX and ACUTE
u12     1 := û* (\u00fb\u0306) LATIN SMALL LETTER U WITH CIRCUMFLEX and BREVE
u14     7 := û' (\u00fb\u0301)  LATIN SMALL LETTER U WITH CIRCUMFLEX and ACUTE
u4   681 := ú (\u00fa) LATIN SMALL LETTER U WITH ACUTE
u5    15 := û' (\u00fb\u0301)  LATIN SMALL LETTER U WITH CIRCUMFLEX and ACUTE
u7     2 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
u8    11 := ŭ (\u016d) LATIN SMALL LETTER U WITH BREVE
u9     2 := ŭ (\u016d) LATIN SMALL LETTER U WITH BREVE
w11     1 := w11 once, in Greek
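A conversion from these letter-number codes to Unicode might look like the sketch below. It covers only a handful of rows from the table above, and it assumes that any letter followed by digits is a candidate code; real text may contain letter-digit sequences that are not AS codes, so unknown combinations (such as the H1 of the <H1> tag) are left untouched. The names AS_TO_UNICODE and as_to_unicode are hypothetical.

```python
import re

# A small subset of the table above; extend as needed.
AS_TO_UNICODE = {
    "a1": "\u00e2",         # a with circumflex
    "a4": "\u00e1",         # a with acute
    "a11": "\u00e2\u0301",  # a with circumflex and combining acute
    "i1": "\u00ee",         # i with circumflex
    "u1": "\u00fb",         # u with circumflex
    "n3": "\u1e45",         # n with dot above
    "n5": "\u00f1",         # n with tilde
}

# A letter followed by all of its digits, so "a11" is tried before "a1".
AS_CODE_RE = re.compile(r"[A-Za-z]\d+")


def as_to_unicode(text):
    """Replace known AS codes; leave anything else unchanged."""
    return AS_CODE_RE.sub(
        lambda m: AS_TO_UNICODE.get(m.group(0), m.group(0)), text
    )
```

Taking all digits after the letter in one match is what keeps a11 from being read as a1 followed by a literal 1.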

In addition to these codings, various Sanskrit letters are represented in Macdonell’s Indological text by the use of italics. All these details are described by Macdonell on the last few pages of the Preface to the text.

DTD

md.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- md.dtd
 June 1, 2014

-->
<!ELEMENT  md (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s |lb |b | g|H" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic text -->
<!ELEMENT b (#PCDATA | lb | i | s)*> <!-- bold text -->
<!ELEMENT s (#PCDATA | lb)*> <!-- Sanskrit, in HK transliteration  -->
<!ELEMENT lb EMPTY>
<!ELEMENT H EMPTY>
<!ELEMENT g (#PCDATA)> <!-- Greek -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
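
As an illustration of the structure this DTD describes, the following Python sketch reads the h element of a single record with the standard library's ElementTree. The sample record is invented for the example and is not taken from the actual XML file; in particular, the contents shown for L and pc are placeholders, since their meaning is not documented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical record shaped like the DTD: H1 contains h, body, tail.
SAMPLE_RECORD = (
    "<H1><h><key1>a</key1><key2>a</key2><hom>1</hom></h>"
    "<body><i>pron.</i> text<lb/></body>"
    "<tail><L>1</L><pc>1-1</pc></tail></H1>"
)


def read_header(record_xml):
    """Return key1, key2 and hom (None if absent) from one H1 record."""
    root = ET.fromstring(record_xml)
    h = root.find("h")
    return {
        "key1": h.findtext("key1"),
        "key2": h.findtext("key2"),
        "hom": h.findtext("hom"),
    }
```

Because hom is optional in the DTD, findtext returns None for records without a homonym number.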