WIL Wilson Sanskrit-English Dictionary (Developer notes)

Date of digitization: 2006

Metadata

The original digitization is the file wil_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file wil_orig_utf8.txt is a conversion of wil_orig.txt to the more common utf-8 encoding. The file wil.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
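A conversion of this kind can be sketched in a few lines of Python. This is only an illustration of re-encoding cp1252 text as utf-8, not the script actually used to produce wil_orig_utf8.txt; the function name is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # newline="" disables newline translation, so the original line
    # endings are copied through unchanged.
    with src.open("r", encoding="cp1252", newline="") as fin, \
         dst.open("w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


# Possible usage, assuming the files are in the current directory:
# convert_cp1252_to_utf8(Path("wil_orig.txt"), Path("wil_orig_utf8.txt"))
```

Python's cp1252 codec raises an error on the few byte values that Windows-1252 leaves undefined, which makes unexpected bytes visible rather than silently altering them.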

Several non-ASCII characters occur in wil.txt (character, code point, count, Unicode name):

¦  (\u00a6) 44577 := BROKEN BAR
°  (\u00b0)     5 := DEGREE SIGN
²  (\u00b2) 44747 := SUPERSCRIPT TWO
¼  (\u00bc)     1 := VULGAR FRACTION ONE QUARTER
½  (\u00bd)     3 := VULGAR FRACTION ONE HALF
Æ  (\u00c6)    48 := LATIN CAPITAL LETTER AE
æ  (\u00e6)   163 := LATIN SMALL LETTER AE
œ  (\u0153)    54 := LATIN SMALL LIGATURE OE
‘  (\u2018)     8 := LEFT SINGLE QUOTATION MARK
’  (\u2019)     7 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)     2 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)     2 := RIGHT DOUBLE QUOTATION MARK
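A tally in this layout could be regenerated with a short Python sketch such as the following. The function names are hypothetical, and the counts depend on the version of wil.txt being read.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character whose code point lies outside ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts: Counter) -> list:
    """Format the counts like the table above, ordered by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


# Possible usage, assuming wil.txt is in the current directory:
# with open("wil.txt", encoding="utf-8") as f:
#     print("\n".join(format_report(non_ascii_counts(f.read()))))
```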

The {X...X} style of coding serves several purposes:

{#X#}  200898 : Devanagari text, coded in HK (Harvard-Kyoto)
{%X%}   10790 : italic text
{@X@}   10790 : bold text
{??}        1 : unreadable text
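The brace markup listed above can be picked out with one regular expression. The following is a minimal sketch; the sample line in the usage comment is invented for illustration and is not quoted from wil.txt.

```python
import re

# Matches {#X#}, {%X%} and {@X@}; the backreference \1 requires the
# closing marker to be the same character as the opening marker.
MARKUP = re.compile(r"\{([#%@])(.*?)\1\}")
KINDS = {"#": "devanagari", "%": "italic", "@": "bold"}


def find_markup(line: str) -> list:
    """Return (kind, content) pairs for each brace-coded span in line."""
    return [(KINDS[m.group(1)], m.group(2)) for m in MARKUP.finditer(line)]


def count_unreadable(line: str) -> int:
    """Count the {??} markers for unreadable text."""
    return line.count("{??}")


# Example (invented line):
# find_markup(".{#aMza#}¦ m. {%A share%}")
```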

The following <x> type tags are found in wil.txt:

<H>        47 : letter breaks
<pic>       1 : picture
<g>X</g>   22 : Greek; the coding scheme for X is unclear
<A></A>     1 : (Arabic?)
<R></R>     1 : Arabic
<ar></ar>   2 : Persian

Page breaks are coded as [PageX], where X is the page number, from 1 to 982. Each page has two columns, but the column breaks are not marked in the digitization.
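The tags and page-break markers described above can be scanned with two small regular expressions, as in this sketch. It assumes that the page number inside [PageX] is written as plain digits (with or without zero padding) and that each marker applies to the lines following it; neither assumption is confirmed by these notes.

```python
import re
from collections import Counter

PAGE_BREAK = re.compile(r"\[Page(\d+)\]")   # assumed form of [PageX]
OPEN_TAG = re.compile(r"<([A-Za-z]+)>")     # opening tags only


def count_tags(text: str) -> Counter:
    """Count opening <x> tags by tag name."""
    return Counter(m.group(1) for m in OPEN_TAG.finditer(text))


def lines_with_pages(lines):
    """Yield (page, line), where page is the number of the most recent
    page-break marker seen before the line (None before the first)."""
    page = None
    for line in lines:
        yield page, line
        for m in PAGE_BREAK.finditer(line):
            page = int(m.group(1))
```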

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded.

Headword coding is exemplified by: .{#a#}¦
The general form is .{#X#}¦
where X (key1) is coded in Harvard-Kyoto transliteration.
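A line can be tested for this headword form, and key1 extracted, with a sketch like the one below. The function name is hypothetical.

```python
import re

# A headword line starts with a period, then {#X#}, then a broken bar.
HEADWORD = re.compile(r"^\.\{#(.+?)#\}\u00a6")


def parse_headword(line: str):
    """Return (key1, rest_of_line) for a headword line, else None."""
    m = HEADWORD.match(line)
    if m is None:
        return None
    return m.group(1), line[m.end():].strip()
```

Lines that do not begin with the .{#X#}¦ pattern, such as page-break lines, return None and can be skipped or attached to the preceding entry.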

The headwords are ordered according to Sanskrit alphabet ordering.
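Sanskrit alphabet ordering differs from ASCII ordering, so a custom sort key is needed to check or reproduce it. The sketch below assumes the common Harvard-Kyoto letter inventory and the conventional varnamala sequence; the placement of anusvara (M) and visarga (H), and any finer conventions used in this dictionary, are assumptions.

```python
# Harvard-Kyoto letters in an assumed Sanskrit alphabetical sequence.
HK_ORDER = [
    "a", "A", "i", "I", "u", "U", "R", "RR", "lR", "lRR",
    "e", "ai", "o", "au", "M", "H",
    "k", "kh", "g", "gh", "G", "c", "ch", "j", "jh", "J",
    "T", "Th", "D", "Dh", "N", "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m", "y", "r", "l", "v", "z", "S", "s", "h",
]
RANK = {tok: i for i, tok in enumerate(HK_ORDER)}
# Try longer tokens first so that "kh" is not read as "k" + "h".
TOKENS = sorted(HK_ORDER, key=len, reverse=True)


def hk_sort_key(word: str) -> list:
    """Map an HK string to a list of ranks usable as a sort key."""
    key, i = [], 0
    while i < len(word):
        for tok in TOKENS:
            if word.startswith(tok, i):
                key.append(RANK[tok])
                i += len(tok)
                break
        else:
            key.append(len(HK_ORDER))  # unknown characters sort last
            i += 1
    return key
```

For example, sorted(words, key=hk_sort_key) places "a" before "A" and "ka" before "kha" before "ga", which plain string sorting does not.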

Some Sanskrit in the text appears in the European Indological form, which is coded in wil.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin letters x (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in wil.txt in this coding, with their approximate frequency:

a1     1 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     7 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
A4  1765 := Á  (\u00c1)  LATIN CAPITAL LETTER A WITH ACUTE
a4  4038 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
a7     7 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
E4   441 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4   702 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7     9 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
I10    66 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i10    13 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
I12   238 := Ǐ  (\u01cf)  LATIN CAPITAL LETTER I WITH CARON
i12    92 := ǐ  (\u01d0)  LATIN SMALL LETTER I WITH CARON
I4   406 := Í  (\u00cd)  LATIN CAPITAL LETTER I WITH ACUTE
i4   401 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE
i7    36 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
O4    28 := Ó  (\u00d3)  LATIN CAPITAL LETTER O WITH ACUTE
o4   259 := ó  (\u00f3)  LATIN SMALL LETTER O WITH ACUTE
o7    27 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
u10     2 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
U4    21 := Ú  (\u00da)  LATIN CAPITAL LETTER U WITH ACUTE
u4   238 := ú  (\u00fa)  LATIN SMALL LETTER U WITH ACUTE
u7     7 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
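The table above can be turned directly into a lookup for rendering AS-coded text as Unicode. The sketch below uses only the codes listed in the table; the function name is hypothetical, and any letter-number sequence not in the table is left unchanged.

```python
import re

# Letter-number codes exactly as listed in the table above.
AS_TO_UNICODE = {
    "a1": "ā", "a10": "â", "A4": "Á", "a4": "á", "a7": "ä",
    "E4": "É", "e4": "é", "e7": "ë",
    "I10": "Ī", "i10": "î", "I12": "Ǐ", "i12": "ǐ",
    "I4": "Í", "i4": "í", "i7": "ï",
    "O4": "Ó", "o4": "ó", "o7": "ö",
    "u10": "ū", "U4": "Ú", "u4": "ú", "u7": "ü",
}

# \d+ is greedy, so "a10" is read as one code rather than "a1" + "0".
AS_CODE = re.compile(r"[A-Za-z]\d+")


def as_to_unicode(text: str) -> str:
    """Replace known AS codes with the corresponding Unicode letters."""
    return AS_CODE.sub(lambda m: AS_TO_UNICODE.get(m.group(0), m.group(0)), text)
```

This should be applied only to text outside the {#X#} spans, since HK-coded Sanskrit is not in the AS coding.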

Note:  Several consonants appear in the scan with a trailing apostrophe
(sometimes printed as an 'acute accent' over the letter).  The digitization
sometimes represents such a consonant X as X4 and sometimes as X'.
The 'change_01' update changes all such consonant codes X4 to X'.
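The normalization described in this note could be expressed as a single substitution. The following sketch is not the actual 'change_01' script, which is not shown here; it assumes that any Latin consonant immediately followed by a lone digit 4 is an instance of the X4 coding.

```python
import re

# A Latin consonant followed by "4" that is not part of a longer number.
CONSONANT_4 = re.compile(r"([b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z])4(?!\d)")


def consonant_x4_to_apostrophe(text: str) -> str:
    """Rewrite consonant codes X4 as X' and leave vowel codes alone."""
    return CONSONANT_4.sub(r"\1'", text)
```

Vowel codes such as a4 or i4 are untouched, so the acute-accent vowels in the table above keep their coding.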

DTD

wil.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- wil.dtd
 June 25, 2014

-->
<!ELEMENT  wil (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s|b|br|H|g|pic|A|R|ar " >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | br)*> <!-- italic -->
<!ELEMENT b (#PCDATA | br)*> <!-- bold -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in HK transliteration  -->
<!ELEMENT br EMPTY> <!-- line break -->
<!ELEMENT H EMPTY> <!-- headline (at letter breaks) -->
<!ELEMENT g (#PCDATA)> <!-- Greek, in unknown transliteration  -->
<!ELEMENT pic (#PCDATA)> <!-- label for an inline picture (once) -->
<!ELEMENT A EMPTY> <!-- Persian or Arabic, not coded -->
<!ELEMENT ar (#PCDATA)> <!-- Persian or Arabic, unknown transliteration -->
<!ELEMENT R EMPTY> <!-- Persian or Arabic, not coded -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->