MWE Monier-Williams English-Sanskrit Dictionary (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file mwe_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The mwe_orig_utf8.txt file uses the utf-8 encoding. The mwe.txt file is based on mwe_orig_utf8.txt, and includes corrections.
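The re-encoding from cp1252 to utf-8 can be sketched in Python as follows. This is a minimal illustration, not the project's actual conversion script; the function name cp1252_to_utf8 and the sample bytes are assumptions made for the example.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the text as utf-8."""
    return data.decode("cp1252").encode("utf-8")


# In cp1252 the curly single quotes are the single bytes 0x91 and 0x92.
# The sample word is hypothetical, not taken from mwe_orig.txt.
sample = b"\x91abandon\x92"
converted = cp1252_to_utf8(sample)
print(converted.decode("utf-8"))
```

For whole files, the same effect can be had by reading with open(path, encoding="cp1252") and writing with open(path, "w", encoding="utf-8").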

Several non-ASCII characters occur in mwe.txt (shown with code point, count, and Unicode name):

¯  (\u00af)     1 := MACRON
æ  (\u00e6)     5 := LATIN SMALL LETTER AE
Π (\u0152)    23 := LATIN CAPITAL LIGATURE OE
œ  (\u0153)     4 := LATIN SMALL LIGATURE OE
‘  (\u2018) 12271 := LEFT SINGLE QUOTATION MARK
’  (\u2019) 12254 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)     3 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)     2 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)     1 := DOUBLE LOW-9 QUOTATION MARK
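A listing of this kind could be regenerated with a short Python script such as the sketch below. The function names and the sample string are assumptions for illustration; to reproduce the real counts, the text of mwe.txt (read as utf-8) would be passed in instead.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character whose code point lies outside ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts: Counter) -> list:
    """Format one line per character, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNNAMED")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


# Hypothetical sample text, not taken from mwe.txt.
sample = "\u2018Abide\u2019 \u00e6 \u2018x\u2019"
counts = non_ascii_counts(sample)
report = format_report(counts)
for line in report:
    print(line)
```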

The {X...X} style of coding serves several purposes:

{#...#} 127534   : Devanagari coded as HK
{%...%}  63497   : italic
{??}       169   : question regarding coding, or unreadable text
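The brace markup can be located with regular expressions. The following Python sketch counts each kind of span; the pattern names and the sample line are hypothetical, and the patterns assume that a span does not cross a line break.

```python
import re

# One pattern per markup type described above (names are illustrative).
MARKUP_PATTERNS = {
    "devanagari_hk": re.compile(r"\{#(.*?)#\}"),
    "italic": re.compile(r"\{%(.*?)%\}"),
    "query": re.compile(r"\{\?\?\}"),
}


def count_markup(text: str) -> dict:
    """Return the number of occurrences of each markup type."""
    return {name: len(p.findall(text)) for name, p in MARKUP_PATTERNS.items()}


# Hypothetical line in the style of mwe.txt.
sample = "To abandon, {#tyaj#} {%v. a.%} {#hA#} {??}"
counts = count_markup(sample)
print(counts)
```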

The following <x> type tags are found in mwe.txt:

<HI>  32379  : at beginning of line, indicating start of headword
<H>      37  : at start of line; a 'headline' (various usage)
<HS>     22  : only in Preface
<P>      85  : only in Preface

Page breaks are coded as [Page...]. More specifically, they have the form
[Pageppp-c+ n]
where
ppp = 3-digit page number (pagination starts anew for each volume)
c = column: a or b, indicating the first or second column
n = number of lines in the following column of text
Page breaks in the preface have the form
[PageP+ n], where P is 2 through 17, except for page 5, which uses
[Page5-c+ n], where c is a, b, or c.
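The page-break forms described above can be parsed with a single regular expression. The sketch below is an assumption about how one might do this in Python; the class and function names are illustrative, and the sample page numbers and line counts are invented.

```python
import re
from typing import List, NamedTuple, Optional


class PageBreak(NamedTuple):
    page: int               # ppp (body) or P (preface)
    column: Optional[str]   # 'a', 'b' or 'c'; None when no column is given
    line_count: int         # n, lines in the following column of text


# Matches both [Pageppp-c+ n] and the preface form [PageP+ n].
PAGE_BREAK = re.compile(r"\[Page(\d+)(?:-([a-c]))?\+\s*(\d+)\]")


def find_page_breaks(text: str) -> List[PageBreak]:
    """Return all page-break markers found in text, in order."""
    return [
        PageBreak(int(m.group(1)), m.group(2), int(m.group(3)))
        for m in PAGE_BREAK.finditer(text)
    ]


# Hypothetical markers covering the three forms.
sample = "[Page2+ 40] preface [Page5-c+ 12] more [Page017-b+ 63]"
breaks = find_page_breaks(sample)
for pb in breaks:
    print(pb)
```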

The headwords are ordered according to English alphabet ordering.

Some Sanskrit text appears in European transliteration; such text is coded with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses a Latin letter x (a-z, A-Z), possibly followed by a number. In the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent
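One way to turn AS codes into Unicode is to append the matching combining mark to the letter and normalize to NFC. The Python sketch below illustrates that approach; it is not the converter used by the project, and the mapping of "dash below" to U+0331 and the function name are assumptions. The pattern refuses a code that is followed by a further digit, so a token such as L176197 is left untouched, but ordinary letter-digit sequences in the text could still be misread as AS codes.

```python
import re
import unicodedata

# AS number -> Unicode combining mark (assumed correspondence).
AS_COMBINING = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below ("dash below")
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

# A letter, then an AS number that is not followed by another digit.
AS_TOKEN = re.compile(r"([A-Za-z])(10|11|[1-7])(?!\d)")


def as_to_unicode(text: str) -> str:
    """Replace AS letter-number codes with precomposed Unicode letters."""
    def repl(m):
        combined = m.group(1) + AS_COMBINING[m.group(2)]
        return unicodedata.normalize("NFC", combined)
    return AS_TOKEN.sub(repl, text)


print(as_to_unicode("S4iva, Kr2shn2a"))
```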

Here are the characters that occur in mwe.txt in this coding, with their approximate frequency:

a1      7 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
A11     2 := À  (\u00c0)  LATIN CAPITAL LETTER A WITH GRAVE
A4     63 := Á  (\u00c1)  LATIN CAPITAL LETTER A WITH ACUTE
a4    362 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
d2      2 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2      1 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
E4      8 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
i1      2 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
I4     14 := Í  (\u00cd)  LATIN CAPITAL LETTER I WITH ACUTE
i4     89 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE
n2     18 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n4      6 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
O10     1 := Ô  (\u00d4)  LATIN CAPITAL LETTER O WITH CIRCUMFLEX
R2      8 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2     15 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
R6      1 := Ṟ  (\u1e5e)  LATIN CAPITAL LETTER R WITH LINE BELOW
S4     48 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4     29 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t2      5 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U4      3 := Ú  (\u00da)  LATIN CAPITAL LETTER U WITH ACUTE
u4     65 := ú  (\u00fa)  LATIN SMALL LETTER U WITH ACUTE
L176197     1 := occurs in first line of mwe.txt (not AS)

DTD

mwe.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- mwe.dtd
 Oct, 2013

-->
<!ELEMENT  mwe (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts " s  | lb  | H | i" >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA)*>

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT H EMPTY > <!-- a 'title' -->
<!ELEMENT lb EMPTY > <!-- line break -->
<!ELEMENT i (#PCDATA | lb)* > <!-- italic -->
<!ELEMENT s (#PCDATA | lb | i )*> <!-- Devanagari, in slp transliteration  -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
<!ATTLIST C n (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11) #IMPLIED>
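
A record shaped according to this DTD could be read with Python's standard xml.etree.ElementTree, as in the sketch below. Only the element names come from the DTD; the sample record, its values, and the function name are invented for illustration. ElementTree checks well-formedness only and does not validate against mwe.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the key, L and pc values are made up.
sample = (
    "<mwe>"
    "<H1><h><key1>abandon</key1><key2>abandon</key2></h>"
    "<body><i>v. a.</i> <s>tyaj</s><lb/></body>"
    "<tail><L>1</L><pc>001-a</pc></tail></H1>"
    "</mwe>"
)


def read_records(xml_text: str) -> list:
    """Extract headword, record number, page reference and <s> text."""
    root = ET.fromstring(xml_text)
    records = []
    for h1 in root.findall("H1"):
        records.append({
            "key1": h1.findtext("h/key1"),
            "L": h1.findtext("tail/L"),
            "pc": h1.findtext("tail/pc"),
            "sanskrit": [s.text for s in h1.iter("s")],
        })
    return records


records = read_records(sample)
print(records)
```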