PE Puranic Encyclopedia (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file pe_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file pe_orig_utf8.txt is a conversion of pe_orig.txt to the more common utf-8 encoding. The file pe.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
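The conversion from cp1252 to utf-8 described above could be reproduced along the following lines. This is a minimal sketch, not the script actually used for the project; the function name convert_cp1252_to_utf8 is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # Decode strictly: Python raises UnicodeDecodeError on the few byte
    # values that cp1252 leaves undefined, which flags a wrong source encoding.
    text = src.read_bytes().decode("cp1252")
    # Work on bytes so that line endings are carried over unchanged.
    dst.write_bytes(text.encode("utf-8"))
```

A call such as convert_cp1252_to_utf8(Path("pe_orig.txt"), Path("pe_orig_utf8.txt")) would then produce the utf-8 copy, assuming both files live in the current directory.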

Several non-ASCII characters occur in pe.txt. Each line below gives the character, its Unicode code point, the number of occurrences, and the Unicode name:

¤  (\u00a4)     1 := CURRENCY SIGN
½  (\u00bd)     1 := VULGAR FRACTION ONE HALF
×  (\u00d7)     4 := MULTIPLICATION SIGN
‘  (\u2018)   896 := LEFT SINGLE QUOTATION MARK
’  (\u2019)   892 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)  2046 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)  1951 := RIGHT DOUBLE QUOTATION MARK
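An inventory like the one above can be produced with the standard library. The sketch below is one possible way to do it, assuming the whole file fits in memory; the function name non_ascii_inventory is hypothetical and the column widths only approximate the listing above.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Return one report line per non-ASCII character found in text."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    lines = []
    for ch, n in sorted(counts.items()):
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x})  {n:>5} := {name}")
    return lines
```

It might be applied to the text of pe.txt after reading it with encoding="utf-8".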

The {X...X} style of coding serves several purposes:

{#X#}   26 : Devanagari text, coded with HK (after some letters)
{@X@}    2 : bold text
{%X%} 3408 : italic text
{??}    19 : unreadable text
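The brace codes listed above can be pulled out of the text with simple regular expressions. This is a hedged sketch: the names BRACE_PATTERNS and brace_spans are hypothetical, and it assumes that a span never contains its own closing delimiter and may run across a line break.

```python
import re

# One non-greedy pattern per {X...X} code; re.S lets a span cross line
# breaks (an assumption about the data, not something the notes state).
BRACE_PATTERNS = {
    "devanagari": re.compile(r"\{#(.*?)#\}", re.S),
    "bold": re.compile(r"\{@(.*?)@\}", re.S),
    "italic": re.compile(r"\{%(.*?)%\}", re.S),
}


def brace_spans(text):
    """Return (spans, unreadable): span contents by kind, and the {??} count."""
    spans = {kind: pat.findall(text) for kind, pat in BRACE_PATTERNS.items()}
    unreadable = text.count("{??}")
    return spans, unreadable
```

The lengths of the returned lists can be compared with the counts given in the list above.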

The <> style of coding is used as follows:

<F>...</F>  101 : Footnote
<Poem>...</Poem> 118 : Poem
<>   96356 : Line break
<HI>  8805 : Headword
<NI> 10510 : Subsection
<H>     27 : Title (letter breaks)
<P>      8 : Paragraph
<C10>    6  :  Table Column marker
<C11>    1  :  Table Column marker
<C12>    1  :  Table Column marker
<C1>   635  :  Table Column marker
<C2>   563  :  Table Column marker
<C3>   245  :  Table Column marker
<C4>   101  :  Table Column marker
<C5>    63  :  Table Column marker
<C6>    29  :  Table Column marker
<C7>    19  :  Table Column marker
<C8>     7  :  Table Column marker
<C9>     7  :  Table Column marker
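A tally of the <> codes similar to the list above can be sketched as follows. The names TAG and tag_counts are hypothetical, and the pattern assumes that every code is either the empty <> or a name made of letters and digits, optionally preceded by a slash.

```python
import re
from collections import Counter

# Matches the empty line-break code <> as well as named codes such as
# <HI>, <C10> and closing codes such as </F>.
TAG = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)?>")


def tag_counts(text):
    """Count each distinct <...> code occurring in text."""
    return Counter(m.group(0) for m in TAG.finditer(text))
```

Unlike the list above, this sketch counts opening and closing codes separately, so <F> and </F> appear as two entries.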

Page breaks are coded as [Page...]. There are two forms.

First form: [PagePPP-C+ n], where PPP is the page number from 001 to 900 and C is the column (a or b).

Second form: [PagePPP+ n], where PPP runs from 901 to 922 (single-column material).
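
Both page-break forms can be recognized by a single pattern. The following is a minimal sketch with hypothetical names (PAGE_BREAK, parse_page_breaks); it assumes that n is a run of decimal digits, since these notes do not say what n stands for.

```python
import re

# Page number (three digits), optional "-a" or "-b" column, then "+ n".
# Assumption: n is written as decimal digits after a single space.
PAGE_BREAK = re.compile(r"\[Page(\d{3})(?:-([ab]))?\+ (\d+)\]")


def parse_page_breaks(text):
    """Return (page, column, n) tuples; column is None for single-column pages."""
    return [(int(p), c or None, int(n)) for p, c, n in PAGE_BREAK.findall(text)]
```

A marker of the second form yields None as its column, which makes the single-column pages 901 to 922 easy to tell apart.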

Headword coding is exemplified by: <HI>ABHAYAM.

Here is the regular expression used in Python programs to recognize headwords:

reHeadword = r'<HI>([A-Z][A-Z1-9]*)'

The first group is key1; it is coded in AS, in all capital letters.
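
As an illustration, the headword expression can be applied to a fragment of text as follows. The helper find_headwords and the sample fragment are hypothetical; only the regular expression itself comes from these notes.

```python
import re

# The headword pattern quoted in these notes, written with straight quotes.
reHeadword = r'<HI>([A-Z][A-Z1-9]*)'


def find_headwords(text):
    """Return the key1 value (first group) of every headword in text."""
    return re.findall(reHeadword, text)


# Made-up sample lines in the style of pe.txt.
sample = "<HI>ABHAYAM. A king.<>\n<HI>A1DI. Another entry.<>"
print(find_headwords(sample))
```

The match stops at the first character that is neither a capital letter nor one of the digits 1-9, so the period after the headword is not part of key1.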

The headwords are arranged in Sanskrit alphabetical order.

Most Sanskrit in the text appears in the European Indological form, which is coded in pe.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin letters x (a-z, A-Z), possibly followed by a number. In the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in pe.txt in this coding, with their approximate frequency:

A1  7721 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 67706 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
d2  4974 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2   380 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
h2   439 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
H2     9 := Ḥ  (\u1e24)  LATIN CAPITAL LETTER H WITH DOT BELOW
I1  1625 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1 19497 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
l2     2 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2    13 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
M3   180 := Ṁ  (\u1e40)  LATIN CAPITAL LETTER M WITH DOT ABOVE
m3  2500 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
N2  1010 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2 20671 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
N3   356 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3  3161 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N5   197 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5  2679 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
R2  1625 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  6630 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
S2  1027 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2 17116 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4 12639 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4 11533 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2   528 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  5280 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1   614 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  4898 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
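
The general scheme and the table above suggest a simple way to turn AS text into Unicode: replace each number with the matching combining mark and let NFC normalization compose the result. This is only a sketch under stated assumptions. The names COMBINING, AS_PAIR and as_to_unicode are hypothetical, "dash below" is taken to be COMBINING MACRON BELOW, and the pattern is naive in that it also rewrites letter-digit pairs that are not AS, such as those inside page-break markers.

```python
import re
import unicodedata

# AS number -> Unicode combining mark, following the general scheme.
COMBINING = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # "dash below", assumed to be COMBINING MACRON BELOW
    "7": "\u0308",   # umlaut (diaeresis)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

# A Latin letter followed by an AS number; 10 and 11 are tried before 1.
AS_PAIR = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text):
    """Convert AS letter-number pairs to precomposed Unicode letters."""
    def repl(m):
        return m.group(1) + COMBINING[m.group(2)]
    # NFC composes letter + combining mark where a precomposed form exists.
    return unicodedata.normalize("NFC", AS_PAIR.sub(repl, text))
```

In practice one would probably restrict the conversion to the text outside markup codes, or use an explicit table of only the pairs listed above.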

DTD

pe.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- pe.dtd
 June 24, 2013

-->
<!ELEMENT  pe (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % misc_empty "F | br | b | i | s | C1 | C2|
  C3 | C4 |C5 | C6 | C7 | C8 | C9 | C10 | C11 | NI | H" >
<!ENTITY % body_elts "  %misc_empty; | Poem  " >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA )*>
<!ELEMENT s (#PCDATA )*>
<!ELEMENT C1 EMPTY > <!-- Column 1 -->
<!ELEMENT C2 EMPTY > <!-- Column 2 -->
<!ELEMENT C3 EMPTY > <!-- Column 3 -->
<!ELEMENT C4 EMPTY>  <!-- Column 4 -->
<!ELEMENT C5 EMPTY>  <!-- Column 5 -->
<!ELEMENT C6 EMPTY>  <!-- Column 6 -->
<!ELEMENT C7 EMPTY>  <!-- Column 7 -->
<!ELEMENT C8 EMPTY>  <!-- Column 8 -->
<!ELEMENT C9 EMPTY>  <!-- Column 9 -->
<!ELEMENT C10 EMPTY > <!-- Column 10 -->
<!ELEMENT C11 EMPTY > <!-- Column 11 -->
<!ELEMENT H EMPTY > <!-- Header -->
<!ELEMENT NI EMPTY>  <!-- Numbered paragraph -->
<!ELEMENT F (#PCDATA | i | br | Poem | NI)*>  <!-- Footnote -->
<!ELEMENT Poem (#PCDATA | i | br | F)*>  <!-- Verses -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in pe.txt -->
<!ELEMENT b (#PCDATA | br)*> <!-- bold  -->
<!ELEMENT i (#PCDATA | br )*> <!-- italic  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
<!-- attributes  -->
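
To make the content model concrete, the sketch below builds a small record shaped like the DTD (an H1 with h, body and tail) and reads it back with the standard library. The sample values, including the contents of L and pc, are made up for illustration, and the function name records is hypothetical. Note that xml.etree.ElementTree does not validate against a DTD, so this only shows the intended shape.

```python
import xml.etree.ElementTree as ET

# A made-up record following the content model H1 -> (h, body, tail).
SAMPLE = (
    "<pe>"
    "<H1><h><key1>ABHAYAM</key1><key2>ABHAYAM</key2></h>"
    "<body><i>sample</i> text<br/>more text</body>"
    "<tail><L>1</L><pc>001-a</pc></tail></H1>"
    "</pe>"
)


def records(xml_text):
    """Return one dict per H1 record with its key, tail fields and body text."""
    root = ET.fromstring(xml_text)
    out = []
    for h1 in root.findall("H1"):
        out.append({
            "key1": h1.findtext("h/key1"),
            "L": h1.findtext("tail/L"),
            "pc": h1.findtext("tail/pc"),
            # itertext() concatenates the text inside body, dropping markup.
            "body": "".join(h1.find("body").itertext()),
        })
    return out
```

Actual validation against pe.dtd would need a validating parser, which the standard library does not provide.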