MW72 Monier-Williams Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is file mw72_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file mw72_orig_utf8.txt is a conversion of mw72_orig.txt to the more common utf-8 encoding. The file mw72.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
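The re-encoding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script actually used to produce mw72_orig_utf8.txt; the function name cp1252_to_utf8 is hypothetical.

```python
from pathlib import Path


def cp1252_to_utf8(src: str, dst: str) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper).

    Working on bytes keeps the original line endings untouched.
    """
    raw = Path(src).read_bytes()
    text = raw.decode("cp1252")  # raises UnicodeDecodeError on undefined bytes
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # File names as given in these notes; adjust paths as needed.
    cp1252_to_utf8("mw72_orig.txt", "mw72_orig_utf8.txt")
```

Python's cp1252 codec rejects the few byte values that Windows-1252 leaves undefined, so a decoding error here points to a stray byte in the input rather than to a silent mis-conversion.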

Several non-ASCII characters occur in mw72.txt; they are listed here with their Unicode code points and frequencies:

¤  (\u00a4)  7612 := CURRENCY SIGN
¦  (\u00a6) 25839 := BROKEN BAR
§  (\u00a7)    13 := SECTION SIGN
µ  (\u00b5)     1 := MICRO SIGN
º  (\u00ba) 21110 := MASCULINE ORDINAL INDICATOR
Æ  (\u00c6)    65 := LATIN CAPITAL LETTER AE
Ç  (\u00c7)   127 := LATIN CAPITAL LETTER C WITH CEDILLA
×  (\u00d7)    21 := MULTIPLICATION SIGN
ß  (\u00df)     3 := LATIN SMALL LETTER SHARP S
æ  (\u00e6)   299 := LATIN SMALL LETTER AE
ç  (\u00e7)     2 := LATIN SMALL LETTER C WITH CEDILLA
Π (\u0152)     1 := LATIN CAPITAL LIGATURE OE
œ  (\u0153)   372 := LATIN SMALL LIGATURE OE
‘  (\u2018) 12420 := LEFT SINGLE QUOTATION MARK
’  (\u2019) 12261 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)    10 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)    10 := RIGHT DOUBLE QUOTATION MARK
†  (\u2020)    42 := DAGGER
‡  (\u2021)     4 := DOUBLE DAGGER
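A frequency table of this kind can be regenerated with the Python standard library. The sketch below is an assumption about how such a count might be made, not the program that produced the figures above; the names non_ascii_counts and format_report are hypothetical.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character whose code point lies outside ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts: Counter) -> list[str]:
    """Format counts in the 'char (\\uXXXX) count := NAME' layout used above."""
    lines = []
    for ch, n in sorted(counts.items()):  # sorted by code point
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


if __name__ == "__main__":
    with open("mw72.txt", encoding="utf-8") as f:
        for line in format_report(non_ascii_counts(f.read())):
            print(line)
```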

The {X...X} style of coding serves several purposes:

{#X#}   25955 : {#X#} devanagari text, coded with HK
{%X%}  207551 : italic text
{??}     243  : unreadable or uncodeable text
{@X@}       1 : bold text (in title page)
{|X|}       2 : widely spaced text (in title page)
{e  1  := in preface
{s  1  := in preface
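The paired brace codes can be picked out with a single regular expression. The following is a minimal sketch that assumes each span opens and closes on the same line (spans that run across line breaks would need the lines joined first); the names BRACE_RE, KIND and brace_spans are hypothetical.

```python
import re

# Opening delimiter '{' + marker, content, the same marker + '}'.
BRACE_RE = re.compile(r"\{([#%@|])(.*?)\1\}")

KIND = {
    "#": "devanagari-HK",  # {#X#}
    "%": "italic",         # {%X%}
    "@": "bold",           # {@X@}
    "|": "wide",           # {|X|}
}


def brace_spans(line: str) -> list[tuple[str, str]]:
    """Return (kind, content) for each brace-coded span found in one line."""
    return [(KIND[m.group(1)], m.group(2)) for m in BRACE_RE.finditer(line)]
```

The unreadable-text code {??} carries no content and is deliberately not matched by this pattern; it can be counted separately with line.count("{??}").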

The <> style of coding is used as follows:

<F>X</F>  34 : footnote
<P>  55545  :  Paragraph begin (starts a new H1/H2 headword)
<g></g> 1709:  Greek. Uncoded
<H>  66: Centered headline, in preface and at letter breaks
<>  216787  : begin normal line.
<HI> 324 : Headline. All in preface, except for
     line 3997 <HI>{%--At2avi-s4ikhara, a1s,%} m. pl., N. of a people or
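Because <P>, <>, <H> and <HI> appear at the start of a line, a line can be classified by its leading tag. This sketch assumes those four tags are the only line-starting ones and treats everything else (page breaks, for instance) as "other"; the names used are hypothetical.

```python
import re

# Alternatives are tried left to right; the empty alternative matches '<>'.
LINE_TAG_RE = re.compile(r"^<(P|HI|H|)>")

LINE_KIND = {
    "P": "paragraph",
    "HI": "headline",
    "H": "centered headline",
    "": "normal",
}


def line_kind(line: str) -> str:
    """Classify a line of the digitization by its leading <> tag."""
    m = LINE_TAG_RE.match(line)
    return LINE_KIND[m.group(1)] if m else "other"
```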
Page breaks are coded as [PageX+ nn], where nn is the number of lines.
In the body of the dictionary, X has the form
pppp-c, where pppp is the page number (from 0001 to 1186)
and c identifies the column (‘a’, ‘b’, or ‘c’).
In the preface, X has the form
R-c or R, where R is a lower-case Roman numeral.
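The page-break code can be parsed as follows. This is a sketch under the assumption that the literal layout is exactly "[Page", X, "+", optional spaces, nn, "]"; the example strings in the usage note are constructed from that description rather than copied from the file, and the names PAGE_RE and parse_page_break are hypothetical.

```python
import re

PAGE_RE = re.compile(
    r"\[Page(?:(\d{4})-([abc])"          # body: pppp-c
    r"|([ivxlcdm]+)(?:-([abc]))?)"       # preface: R-c or R
    r"\+\s*(\d+)\]"                      # '+ nn]'
)


def parse_page_break(s: str):
    """Return a dict describing the first page-break code in s, or None."""
    m = PAGE_RE.search(s)
    if m is None:
        return None
    page, col, roman, roman_col, nlines = m.groups()
    if page is not None:
        return {"page": int(page), "column": col,
                "lines": int(nlines), "preface": False}
    return {"page": roman, "column": roman_col,
            "lines": int(nlines), "preface": True}
```

For example, parse_page_break("[Page0001-a+ 50]") yields page 1, column "a" and 50 lines, while a preface code such as "[Pagexii+ 40]" yields the Roman numeral "xii" with no column.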

The lines of the digitization represent lines of the text.

In constructing an xml version and its derivative displays, headwords
are identified in one of two line-starting forms:
<P>.{#X#}¦ where X is the HK transliteration of Devanagari.
<P>.{#akra#}¦ 2 X is HK, ‘2’ is homonym
<P>{%Aktu, us,%} X is in AS transliteration
<P>1. {%akra, as, a1, am,%} ‘1’ is homonym
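The four line-starting forms can be recognized with two regular expressions. The sketch below follows the forms literally as listed here and is not the headword code actually used for the xml version; in particular, taking the text before the first comma as the AS headword is an assumption, and the names HK_HEAD_RE, AS_HEAD_RE and parse_headword are hypothetical.

```python
import re

# <P>.{#X#}¦ optionally followed by a homonym number
HK_HEAD_RE = re.compile(r"^<P>\.\{#(?P<key>.+?)#\}¦(?:\s*(?P<hom>\d+)\b)?")
# <P>{%X, ...%} optionally preceded by 'n. ' as homonym number
AS_HEAD_RE = re.compile(r"^<P>(?:(?P<hom>\d+)\.\s*)?\{%(?P<key>.+?)%\}")


def parse_headword(line: str):
    """Return (coding, headword, homonym-or-None), or None if no headword."""
    m = HK_HEAD_RE.match(line)
    if m:
        return ("HK", m.group("key"), m.group("hom"))
    m = AS_HEAD_RE.match(line)
    if m:
        # Assumption: the headword proper precedes the first comma.
        key = m.group("key").split(",")[0].strip()
        return ("AS", key, m.group("hom"))
    return None
```

Lines that do not begin with <P> return None, which matches the decision to leave mid-line headwords out of the headword list.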

Other headwords can be identified by the form exemplified by:
{%--An6s4a-karan2a, am,%}.
However, in the headword generation used here these are NOT included,
since they are technically different by virtue of occurring in the middle
of lines. A good additional task for the interested user would be
to extend the headword list to include these forms.

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text often appears in the European Indological form, which is coded in mw72.txt with the AS (Anglicized Sanskrit) coding. This coding is also used to represent non-Sanskrit words, notably in the etymological material referring to cognate words.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly with suffixed numbers. In the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent
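The suffix scheme lends itself to a conversion based on Unicode combining marks followed by NFC normalization. This is a minimal sketch, not the conversion used for the displays: the choice of combining characters (for example U+0331 for "dash below") and the rule that other multi-digit suffixes such as 14 or 21 stack their single-digit marks in order are assumptions, and the names MARKS and as_to_unicode are hypothetical. It should be applied only to text known to be in AS coding, since any letter directly followed by digits is treated as a coded letter.

```python
import re
import unicodedata

MARKS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # line (dash) below
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

AS_CODE_RE = re.compile(r"([A-Za-z])(\d+)")


def as_to_unicode(text: str) -> str:
    """Replace letter+number AS codes by the corresponding Unicode letters."""
    def repl(m):
        letter, digits = m.groups()
        if digits in MARKS:                      # single mark, incl. 10 and 11
            marks = MARKS[digits]
        elif all(d in MARKS for d in digits):    # stacked marks, e.g. 14, 21
            marks = "".join(MARKS[d] for d in digits)
        else:
            return m.group(0)                    # unknown code: leave as is
        return unicodedata.normalize("NFC", letter + marks)
    return AS_CODE_RE.sub(repl, text)
```

Where Unicode has a precomposed letter, NFC normalization produces it; otherwise the result stays a base letter plus combining mark, as with d4.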

The Preface of the dictionary describes the correspondence between the ‘Indo-Romanic’ and ‘Nagari’ alphabets. In particular, the Indo-Romanic representations of this text differ in several points from the Monier-Williams dictionary of 1899.

In the displays, we have attempted to mimic the text in Unicode. Notably, ‘c4’ represents the unaspirated palatal and ‘c4h’ the aspirated palatal (c with acute accent). The palatal nasal is ‘n3’ (n with dot above). The guttural nasal is a special case: the text represents it as an ‘n’ with joining dot, which has no Unicode equivalent. In mw72_orig_utf8.txt it is represented as ‘n.¤’; in mw72.txt this is changed to ‘n4’ (n with acute accent). Since ‘n4’ does not occur elsewhere, the substitution causes no confusion. These facts are taken into account in converting the AS headwords to normalized slp1 spelling.
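The guttural-nasal substitution can be sketched as a guarded replacement. The guard makes the "does not occur elsewhere" condition explicit by refusing to run on text that already contains ‘n4’. Handling a capital ‘N.¤’ in the same way is an assumption, and the function name normalize_guttural_nasal is hypothetical.

```python
import re

JOINING_DOT_N = re.compile(r"([nN])\.¤")


def normalize_guttural_nasal(text: str) -> str:
    """Rewrite 'n.¤' as 'n4', refusing input where 'n4' is already present."""
    if re.search(r"[nN]4", text):
        raise ValueError("input already contains 'n4'; substitution is ambiguous")
    # \g<1> keeps the case of the letter; the literal 4 follows it.
    return JOINING_DOT_N.sub(r"\g<1>4", text)
```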

Here are the characters that occur in mw72.txt in this coding, with their approximate frequency:

A1  2380 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 157766 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10    77 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11    18 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a2     3 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
a4    22 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE (udatta accent)
a7    13 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
C4  2601 := Ć  (\u0106)  LATIN CAPITAL LETTER C WITH ACUTE
c4 18081 := ć  (\u0107)  LATIN SMALL LETTER C WITH ACUTE
d2  8218 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    36 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
d4     1 := d' (\u0064\u0301) LATIN SMALL LETTER D WITH ACUTE
e1    22 := ē  (\u0113)  LATIN SMALL LETTER E WITH MACRON
e10    72 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
e11     3 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
e14     4 := ē'  (\u0113\u0301)  LATIN SMALL LETTER E WITH MACRON AND ACUTE
E4     2 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4    54 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7    19 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
g4     3 := ǵ  (\u01f5)  LATIN SMALL LETTER G WITH ACUTE
h2  2529 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1   137 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1 47968 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10    25 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i11     9 := ì  (\u00ec)  LATIN SMALL LETTER I WITH GRAVE
i4     7 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE (udatta accent)
i7    35 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
k2     4 := ḳ  (\u1e33)  LATIN SMALL LETTER K WITH DOT BELOW
k4     3 := ḱ  (\u1e31)  LATIN SMALL LETTER K WITH ACUTE
l2   125 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2  2662 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3    11 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
N2     1 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2 29010 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3  5561 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N4     1 := Ń  (\u0143)  LATIN CAPITAL LETTER N WITH ACUTE
n4  7611 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
n5     3 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
n6  3199 := ṉ  (\u1e49)  LATIN SMALL LETTER N WITH LINE BELOW
o1    39 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o10    74 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
o11     4 := ò  (\u00f2)  LATIN SMALL LETTER O WITH GRAVE
o4    15 := ó  (\u00f3)  LATIN SMALL LETTER O WITH ACUTE (udatta accent)
o7    26 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
p4     2 := ṕ  (\u1e55)  LATIN SMALL LETTER P WITH ACUTE
R2  2312 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2 19641 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21     1 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
s2     7 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4  8090 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4 25895 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2    30 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2 14889 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
t4     1 := t' (\u0074\u0301) LATIN SMALL LETTER T WITH ACUTE
U1   193 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1 14412 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10    53 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11    47 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
u14     1 := ū'  (\u016b\u0301)  LATIN SMALL LETTER U WITH MACRON AND ACUTE
u4    15 := ú  (\u00fa)  LATIN SMALL LETTER U WITH ACUTE (udatta accent)
u7    41 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
y10     1 := ŷ  (\u0177)  LATIN SMALL LETTER Y WITH CIRCUMFLEX
y4     8 := ý  (\u00fd)  LATIN SMALL LETTER Y WITH ACUTE
z2     3 := ẓ  (\u1e93)  LATIN SMALL LETTER Z WITH DOT BELOW
z4     6 := ź  (\u017a)  LATIN SMALL LETTER Z WITH ACUTE

DTD

mw72.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- mw72.dtd
 May 18, 2014

-->
<!ELEMENT  mw72 (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s|lb | HI | H | F |P|g" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic -->
<!ELEMENT s (#PCDATA | lb)*> <!-- Sanskrit, in HK transliteration  -->

<!ELEMENT lb EMPTY>  <!-- line break -->
<!ELEMENT HI EMPTY>  <!-- line break, and begin headword -->
<!ELEMENT H EMPTY>  <!-- Header -->
<!ELEMENT P EMPTY>  <!-- Paragraph -->
<!ELEMENT F (#PCDATA  | %body_elts;)*> <!-- footnote -->
<!ELEMENT g (#PCDATA )><!-- Greek (usu. empty) -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->