AE Apte Student’s English-Sanskrit Dictionary (Developer notes)

Date of digitization: 2006

Metadata

The original digitization is the file ae_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file ae_orig_utf8.txt is a conversion of ae_orig.txt to the more common utf-8 encoding. The file ae.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
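
The conversion from cp1252 to utf-8 described above can be sketched in Python as follows. This is only an illustration, not necessarily how ae_orig_utf8.txt was actually produced; the function name cp1252_to_utf8 is hypothetical.

```python
from pathlib import Path


def cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved.
    with open(src, "r", encoding="cp1252", newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


if __name__ == "__main__":
    cp1252_to_utf8(Path("ae_orig.txt"), Path("ae_orig_utf8.txt"))
```

Python's cp1252 codec decodes strictly, so a byte that is undefined in cp1252 (for example 0x81) raises UnicodeDecodeError rather than being silently replaced.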

Several non-ASCII characters occur in ae.txt. They are listed here with their Unicode code points and frequencies:

°  (\u00b0)   248 := DEGREE SIGN
º  (\u00ba)     1 := MASCULINE ORDINAL INDICATOR  (probably should be \u00b0)
æ  (\u00e6)    12 := LATIN SMALL LETTER AE
œ  (\u0153)     6 := LATIN SMALL LIGATURE OE
‘  (\u2018) 10881 := LEFT SINGLE QUOTATION MARK
’  (\u2019) 10944 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)     9 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)     9 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)     2 := DOUBLE LOW-9 QUOTATION MARK
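
A listing of this kind can be regenerated with a short script. The sketch below is one possible approach, assuming the text has already been read as utf-8; the function names are hypothetical and the output layout only approximates the table above.

```python
import unicodedata
from collections import Counter


def non_ascii_census(text: str) -> Counter:
    """Count each character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_census(counts: Counter) -> list[str]:
    """Format the counts as 'char (code point) frequency := Unicode name'."""
    rows = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "NO DESCRIPTION")
        rows.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return rows


if __name__ == "__main__":
    with open("ae.txt", encoding="utf-8") as f:
        for row in format_census(non_ascii_census(f.read())):
            print(row)
```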

The {X...X} style of coding serves several purposes:

{# #}  69878  : {#X#} devanagari text, coded in HK
{@ @}  31845  : bold text. English only?
{% %}  34103  : italic text. English only?
{| |}      4  : {|X|} text is widely spaced
{??},{?} 590  : missing text.
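
The {X...X} spans can be located with a single regular expression. The following is a minimal sketch, assuming that spans of the same kind do not nest; the names SPAN_RE, KIND, iter_spans, count_spans and count_missing are hypothetical, and the sample line in the usage section is made up for illustration.

```python
import re
from collections import Counter

# Opening delimiter is '{' plus one of # @ % |; the closing delimiter
# repeats the same character before '}'.
SPAN_RE = re.compile(r"\{([#@%|])(.*?)\1\}", re.DOTALL)
MISSING_RE = re.compile(r"\{\?\??\}")  # matches {?} and {??}

KIND = {"#": "devanagari-HK", "@": "bold", "%": "italic", "|": "spaced"}


def iter_spans(text):
    """Yield (kind, content) for each {X...X} span in text."""
    for m in SPAN_RE.finditer(text):
        yield KIND[m.group(1)], m.group(2)


def count_spans(text):
    """Count spans by kind."""
    return Counter(kind for kind, _ in iter_spans(text))


def count_missing(text):
    """Count the missing-text markers {?} and {??}."""
    return len(MISSING_RE.findall(text))


if __name__ == "__main__":
    sample = "<><P>{@Abbreviate,@} {%v. t.%} {#saMkzip#} {??}"  # invented line
    print(list(iter_spans(sample)))
    print(count_spans(sample), count_missing(sample))
```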

There is some pseudo-xml type coding in ae.txt:

<>  54832  := Beginning of text line.  Note that empty lines do not have this.
<P>  11396  := <><P> begins a headword line
<H>  35  := Header, for Letter section header, and a few other purposes.
<C1>  81  := table Column 1, in abbreviation appendix
<C2>  81  := table Column 2, in abbreviation appendix
<H1>  2  := Section header in appendix
<H2>  1  := Section sub-header in appendix
<HS>  1  := Head in preface

Page breaks are coded as [Page1]...[Page502].
The lines of the digitization correspond to the lines of the printed text.
The preface material is also digitized: lines 1-525 of ae.txt are the title and preface.
Lines 66640-66862 code a section of abbreviations.
Headwords look like <><P>{@Abbreviate,@}
Note that the headword is capitalized.
Sometimes multiple headwords are shown: <><P>{@A, An@}
A blank line precedes each headword.
The headwords are English.
reHeadword = r'^<><P>{@(.*?)@}'
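
As an illustration of how the headword pattern and the page-break markers might be used together, here is a minimal sketch. It assumes that a [PageN] marker applies to the lines that follow it and that multiple headwords are separated by commas; neither assumption is stated in these notes, and the function name iter_headwords is hypothetical.

```python
import re

# Same pattern as reHeadword, with the braces escaped for clarity.
RE_HEADWORD = re.compile(r"^<><P>\{@(.*?)@\}")
RE_PAGE = re.compile(r"\[Page(\d+)\]")


def iter_headwords(lines):
    """Yield (line_number, page, headwords) for each headword line."""
    page = None  # no page marker seen yet
    for lnum, line in enumerate(lines, start=1):
        m_page = RE_PAGE.search(line)
        if m_page:
            page = int(m_page.group(1))
        m = RE_HEADWORD.match(line)
        if m:
            # Drop the trailing comma, then split "A, An" into ["A", "An"].
            raw = m.group(1).rstrip(", ")
            yield lnum, page, [w.strip() for w in raw.split(",") if w.strip()]


if __name__ == "__main__":
    with open("ae.txt", encoding="utf-8") as f:
        for lnum, page, words in iter_headwords(f):
            print(lnum, page, words)
```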

The headwords are in English alphabetical order.

Sanskrit in the text generally appears in Devanagari, which is coded as {#X#} where X is in the Harvard-Kyoto transliteration. Some Sanskrit and other non-English words contain various diacritical marks in the text, and are coded in ae.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses Latin letters x (a-z, A-Z), possibly followed by a number; the letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in ae.txt in this coding, with their approximate frequency:

a1    22 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10    28 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a4     3 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE (udatta accent)
a7     2 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
C1    81 := Column 1, in abbreviations
C2    81 := Column 2, in abbreviations
H1     2 := Section header in appendix
H2     1 := Section sub-header in appendix
i1     3 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10     4 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i4     1 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE (udatta accent)
R1    12 := NO DESCRIPTION
R2     5 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
S4     1 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
u10     1 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u3     1 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE (anudatta accent)
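
A letter-number code of the general AS scheme can be turned into a Unicode character by appending the corresponding combining mark and normalizing. The sketch below follows the general scheme only; the table for ae.txt may assign individual codes differently, so the project's own transcoder files remain authoritative. The names AS_COMBINING and as_to_unicode are hypothetical. The function is meant for isolated AS-coded words: applied to whole lines it would also rewrite tags such as <C1> and page markers such as [Page1].

```python
import re
import unicodedata

# AS number -> Unicode combining mark (general scheme).
AS_COMBINING = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash below (combining macron below)
    "7": "\u0308",   # umlaut (diaeresis)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

AS_RE = re.compile(r"([A-Za-z])(\d+)")


def as_to_unicode(word: str) -> str:
    """Convert AS letter-number codes in a single word to Unicode."""
    def repl(m):
        mark = AS_COMBINING.get(m.group(2))
        if mark is None:
            return m.group(0)  # unknown number: leave the text unchanged
        # NFC composes letter + mark into one code point when one exists.
        return unicodedata.normalize("NFC", m.group(1) + mark)
    return AS_RE.sub(repl, word)


if __name__ == "__main__":
    for code in ("a1", "a10", "R2", "S4"):
        print(code, as_to_unicode(code))
```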

DTD

ae.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ae.dtd
 March 31, 2014

-->
<!ELEMENT  ae (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s |b | lb | quest" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb | quest)*> <!-- italic text  -->
<!ELEMENT b (#PCDATA | lb | quest)*> <!-- bold  -->
<!ELEMENT s (#PCDATA | lb | quest)*> <!-- Sanskrit, in slp1 transliteration  -->
<!ELEMENT lb EMPTY> <!-- line break  -->
<!ELEMENT quest (#PCDATA | lb)*> <!-- question. Normally empty -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->