PGN Personal and Geographical Names in the Gupta Inscriptions (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is the file pgn_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file pgn_orig_utf8.txt is a conversion of pgn_orig.txt to the more common utf-8 encoding. The file pgn.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
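
The re-encoding step can be sketched as follows. This is a minimal illustration, not the script that was actually used to produce pgn_orig_utf8.txt; the function names cp1252_to_utf8 and convert_file are hypothetical.

```python
from pathlib import Path


def cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode them as utf-8."""
    return data.decode("cp1252").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Re-encode a whole file; bytes are used so line endings are untouched."""
    Path(dst).write_bytes(cp1252_to_utf8(Path(src).read_bytes()))


# Assumed usage, with the file names mentioned above:
# convert_file("pgn_orig.txt", "pgn_orig_utf8.txt")
```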

Several non-ASCII characters occur in pgn.txt. Each is listed here with its Unicode code point, its count, and its Unicode name:

§  (\u00a7)    31 := SECTION SIGN
©  (\u00a9)     1 := COPYRIGHT SIGN
º  (\u00ba)    14 := MASCULINE ORDINAL INDICATOR
‘  (\u2018)  1192 := LEFT SINGLE QUOTATION MARK
’  (\u2019)  1174 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)   400 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)   388 := RIGHT DOUBLE QUOTATION MARK
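
A table of this kind can be regenerated with a short script. The following is a sketch under the assumption that every character above code point 127 is of interest; the function names are hypothetical, and the column widths only approximate the layout above.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text):
    """Count every character whose code point lies outside 7-bit ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts):
    """Return one report line per character, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x})  {n:>5} := {name}")
    return lines


# Assumed usage:
# with open("pgn.txt", encoding="utf-8") as f:
#     print("\n".join(format_report(non_ascii_counts(f.read()))))
```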

The {X...X} style of coding serves several purposes:

{# #}   470  := {#X#} Devanagari text, coded in HK (Harvard-Kyoto)
{% %}  3397  := italic text
{??}      4  := unreadable text
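
As an illustration, the brace codes can be picked out of a line with regular expressions. This sketch assumes that a coded span never crosses a line boundary, which the notes above do not state; the names used are hypothetical.

```python
import re

RE_DEVANAGARI = re.compile(r"\{#(.*?)#\}")   # {#X#}: Devanagari text in HK
RE_ITALIC = re.compile(r"\{%(.*?)%\}")       # {%X%}: italic text
RE_UNREADABLE = re.compile(r"\{\?\?\}")      # {??}: unreadable text


def brace_spans(line):
    """Return the coded spans found in one line of pgn.txt."""
    return {
        "devanagari": RE_DEVANAGARI.findall(line),
        "italic": RE_ITALIC.findall(line),
        "unreadable": len(RE_UNREADABLE.findall(line)),
    }
```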

The <> style of coding is used as follows:

<>  10022  := beginning of ordinary line
<H>  183  :=  A Heading
<HI>  1273  := Italic, possible heading
<P>  4049  := Begin Paragraph
<C1>  56  := Begin Column 1 in a table, in preface section
<C2>  56  := Begin Column 2 in a table
<C3>  55  := Begin Column 3 in a table
<HS>  2  := In preface section
<Picture>  1  := In appendix
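
Counts like those above can be reproduced by tallying the tag at the start of each line. This is a sketch that assumes every <> style tag appears at the beginning of a line, as the description of <> suggests; the function name is hypothetical.

```python
import re
from collections import Counter

# A tag is '<', zero or more letters or digits, then '>', at the line start.
RE_LINE_TAG = re.compile(r"^<([A-Za-z0-9]*)>")


def count_line_tags(lines):
    """Tally the leading <...> tag of each line; untagged lines are skipped."""
    counts = Counter()
    for line in lines:
        m = RE_LINE_TAG.match(line)
        if m:
            counts["<" + m.group(1) + ">"] += 1
    return counts
```
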

Page breaks are coded as [Page...]. The general form is [PageX+ n], where ‘n’ is the number of lines that follow on page ‘X’. X takes several forms:

(a) title-Y, where Y is a lower-case roman numeral from ‘i’ to ‘xxviii’ (1 to 28)
(b) -PPP, where PPP is a zero-padded page number from 001 to 360
(c) -Y, where Y is an upper-case roman numeral from ‘I’ to ‘X’
(d) PPP-c, where PPP is a page number from 361 to 378 and ‘c’ is a column (‘a’ or ‘b’)
(e) -PPP, where PPP is a page number from 379 to 382
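
The forms of X listed above can be recognized with a small parser. This is a sketch only: the patterns are derived from the description here rather than checked against pgn.txt, forms (b) and (e) are treated as one kind because they share the same shape, and all names are hypothetical.

```python
import re

RE_PAGE_BREAK = re.compile(r"\[Page(?P<x>[^+\]]*)\+\s*(?P<n>\d+)\]")

# (kind, pattern for X); the kind labels are illustrative, not from the source.
X_FORMS = [
    ("title", re.compile(r"^title-(?P<page>[ivxlc]+)$")),      # form (a)
    ("numbered", re.compile(r"^-(?P<page>\d{3})$")),           # forms (b), (e)
    ("roman", re.compile(r"^-(?P<page>[IVXLC]+)$")),           # form (c)
    ("column", re.compile(r"^(?P<page>\d{3})-(?P<col>[ab])$")),  # form (d)
]


def parse_page_break(text):
    """Return a dict describing the first page break in text, or None."""
    m = RE_PAGE_BREAK.search(text)
    if m is None:
        return None
    x = m.group("x").strip()
    for kind, pattern in X_FORMS:
        fm = pattern.match(x)
        if fm:
            result = {"kind": kind, "lines": int(m.group("n"))}
            result.update(fm.groupdict())
            return result
    return {"kind": "unknown", "lines": int(m.group("n")), "x": x}
```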

The lines of the digitization generally represent lines of the text.

While this text is not a dictionary, we have identified many (484, as of this writing) sections of the text as if they were dictionary entries. The headwords for these entries are not in a dictionary alphabetical order.

Most of these headwords are identified in the coding of pgn.txt with the Python regular expression

reHeadword = r'^<P>[0-9.() ]+{%(.*?)%}'

where the captured group is the headword, in AS coding. For instance:

<P>1. {%Gupta:%}
<P>2. {%Ghat2otkaca:%}
<P>(2) {%Daks2in2a1patha%}
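
Applied to the three sample lines, the expression behaves as sketched below. The braces are escaped here for readability; Python treats the unescaped form the same way. The function name find_headwords is hypothetical.

```python
import re

RE_HEADWORD = re.compile(r"^<P>[0-9.() ]+\{%(.*?)%\}")


def find_headwords(lines):
    """Return the captured headword (AS coding) of each matching line."""
    headwords = []
    for line in lines:
        m = RE_HEADWORD.match(line)
        if m:
            headwords.append(m.group(1))
    return headwords


samples = [
    "<P>1. {%Gupta:%}",
    "<P>2. {%Ghat2otkaca:%}",
    "<P>(2) {%Daks2in2a1patha%}",
]
print(find_headwords(samples))
```

Note that in the first two samples the colon lies inside the italic span, so it is part of the captured group; whether it should be stripped from the headword is left open here.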

Most Sanskrit words in the text appear in the European Indological form, which is coded in pgn.txt with the AS (Anglicized Sanskrit) transliteration. Some non-Sanskrit words may also be coded with this transliteration.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly with suffixed numbers. The letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in pgn.txt in this coding, with their approximate frequency:

A1   296 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1  6280 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a2     1 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
d2   416 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    27 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e4     1 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
h2    44 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1     9 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1  1272 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
l2     5 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
M3     1 := Ṁ  (\u1e40)  LATIN CAPITAL LETTER M WITH DOT ABOVE
m3   291 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
N2     2 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  1768 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3   309 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n5   185 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1     2 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o7     1 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2    52 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2   443 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
S2    54 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2  1307 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4   602 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4   838 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2     5 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2   921 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1    20 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1   469 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u7    13 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
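
A conversion from AS coding to the Unicode forms listed above can be sketched with combining marks and NFC normalization. This is an illustration, not the conversion code used for the digitization. The choice of U+0331 for "dash below" is an assumption, and the function name as_to_unicode is hypothetical.

```python
import re
import unicodedata

# AS number suffix -> Unicode combining mark.
AS_MARKS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash below (assumed: combining macron below)
    "7": "\u0308",   # umlaut / diaeresis
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

# Try the two-digit suffixes first so that 'x10' is not read as 'x1' + '0'.
RE_AS = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text):
    """Replace each letter-number pair by the precomposed Unicode letter."""
    def repl(m):
        return unicodedata.normalize("NFC", m.group(1) + AS_MARKS[m.group(2)])
    return RE_AS.sub(repl, text)
```

For example, as_to_unicode("Daks2in2a1patha") gives Dakṣiṇāpatha. The sketch converts any letter that is immediately followed by one of these numbers, so it should only be applied to spans known to be in AS coding.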

Two other letter-number combinations occur in the HK coding of Devanagari:
zlo0, an abbreviation of zloka (2 times)
pR0, an abbreviation of pRSTha  (5 times)

The pdf to which the digitization corresponds was obtained from archive.org, as file personalgeograph00sharuoft_bw.pdf. A complete set of page-by-page bookmarks was generated to accompany the downloadable pgn_bookmark.pdf.

DTD

pgn.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- pgn.dtd
 April 15, 2014

-->
<!ELEMENT  pgn (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |s | lb | H | HI | P" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic, Sanskrit, in AS transliteration -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in AS transliteration  -->
<!ELEMENT lb EMPTY> <!-- line break -->
<!ELEMENT H EMPTY> <!-- heading -->
<!ELEMENT HI EMPTY> <!-- italic, possible heading -->
<!ELEMENT P EMPTY> <!-- Paragraph, also, in headword -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
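
To make the content model concrete, the following sketch builds one record with the shape H1 (h, body, tail) using the Python standard library. It does not validate against pgn.dtd. The DTD does not explain the L and pc elements, so the values passed for them are placeholders, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_record(key1, key2, body_text, l_value, pc_value):
    """Build an H1 element with children h, body and tail, in that order."""
    h1 = ET.Element("H1")
    h = ET.SubElement(h1, "h")
    ET.SubElement(h, "key1").text = key1   # headword in slp1
    ET.SubElement(h, "key2").text = key2   # headword in AS
    ET.SubElement(h1, "body").text = body_text
    tail = ET.SubElement(h1, "tail")
    ET.SubElement(tail, "L").text = l_value    # placeholder value
    ET.SubElement(tail, "pc").text = pc_value  # placeholder value
    return h1


def has_h1_shape(h1):
    """Check the child order required by the H1 and h declarations."""
    if h1.tag != "H1" or [c.tag for c in h1] != ["h", "body", "tail"]:
        return False
    h_children = [c.tag for c in h1.find("h")]
    return h_children in (["key1", "key2"], ["key1", "key2", "hom"])


record = make_record("gupta", "Gupta", "sample body text", "1", "001")
print(ET.tostring(record, encoding="unicode"))
```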