INM Index to the Names in the Mahabharata (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file inm_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file inm_orig_utf8.txt is a conversion of inm_orig.txt to the more common utf-8 encoding. The file inm.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
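The re-encoding from cp1252 to utf-8 described above can be sketched in a few lines of Python. This is only an illustration of the technique, not necessarily how inm_orig_utf8.txt was produced; the function names cp1252_bytes_to_utf8 and convert_file are hypothetical.

```python
from pathlib import Path


def cp1252_bytes_to_utf8(data):
    """Decode cp1252 bytes and re-encode them as utf-8."""
    # Strict decoding: bytes that cp1252 leaves undefined raise an error
    # instead of being silently replaced.
    return data.decode("cp1252").encode("utf-8")


def convert_file(src, dst):
    """Write a utf-8 copy of the cp1252 file `src` to `dst`."""
    # Working on raw bytes keeps the original line endings untouched.
    Path(dst).write_bytes(cp1252_bytes_to_utf8(Path(src).read_bytes()))


# Example call, using the file names given above:
# convert_file("inm_orig.txt", "inm_orig_utf8.txt")
```

Decoding strictly is a deliberate choice in this sketch: a failure points at a byte that needs manual attention rather than hiding it.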

Several non-ASCII characters occur in inm.txt. Each line below gives the character, its code point, the number of occurrences, and its Unicode name:

¤  (\u00a4)    44 := CURRENCY SIGN
¦  (\u00a6) 13741 := BROKEN BAR
§  (\u00a7) 28805 := SECTION SIGN
º  (\u00ba) 20179 := MASCULINE ORDINAL INDICATOR
Æ  (\u00c6)     1 := LATIN CAPITAL LETTER AE
Ç  (\u00c7)  9391 := LATIN CAPITAL LETTER C WITH CEDILLA
×  (\u00d7)     1 := MULTIPLICATION SIGN
æ  (\u00e6)     7 := LATIN SMALL LETTER AE
ç  (\u00e7) 16408 := LATIN SMALL LETTER C WITH CEDILLA
œ  (\u0153)    13 := LATIN SMALL LIGATURE OE
‘  (\u2018)    42 := LEFT SINGLE QUOTATION MARK
’  (\u2019)    44 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)  3588 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)  3578 := RIGHT DOUBLE QUOTATION MARK
†  (\u2020)  9066 := DAGGER
‡  (\u2021)    13 := DOUBLE DAGGER
…  (\u2026)  2545 := HORIZONTAL ELLIPSIS
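A table of this kind can be regenerated with a short script. The following is a minimal sketch, assuming the text has already been read as utf-8; the helper names count_non_ascii and format_report are hypothetical, and the output layout only approximates the listing above.

```python
import unicodedata
from collections import Counter


def count_non_ascii(text):
    """Count every character whose code point lies outside the ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts):
    """Return one report line per character, ordered by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        # Fall back to a placeholder when Unicode has no name for the character.
        name = unicodedata.name(ch, "NO DESCRIPTION")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


# Example call, assuming inm.txt is in the current directory:
# with open("inm.txt", encoding="utf-8") as f:
#     print("\n".join(format_report(count_non_ascii(f.read()))))
```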

The {X...X} style of coding serves several purposes:

{#X#}      2 : Text in Devanagari, coded as HK (Harvard-Kyoto); only in title material
{%X%}  72502 : italic text
{@X@}  58275 : bold text
{|X|}    359 : widely spaced text
{??}     102 : unreadable text
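The brace codes can be recognized with one regular expression that requires the same delimiter character on both sides of the enclosed text. The sketch below counts the codes in a single line; it assumes that a code never spans a line break and that codes of the same kind are not nested, neither of which is confirmed here. The names MARKUP, RE_MARKUP and count_markup are hypothetical.

```python
import re
from collections import Counter

# Delimiter character -> meaning, as listed above.
MARKUP = {
    "#": "devanagari (HK)",
    "%": "italic",
    "@": "bold",
    "|": "widely spaced",
}

# {cX...Xc}: an opening brace, a delimiter, the text, the same delimiter, a closing brace.
RE_MARKUP = re.compile(r"\{([#%@|])(.*?)\1\}")
RE_UNREADABLE = re.compile(r"\{\?\?\}")


def count_markup(line):
    """Count the brace codes that occur in one line of text."""
    counts = Counter(MARKUP[m.group(1)] for m in RE_MARKUP.finditer(line))
    unreadable = len(RE_UNREADABLE.findall(line))
    if unreadable:
        counts["unreadable"] = unreadable
    return counts
```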

The following <x> type tags are found in inm.txt:

<>  86324  : beginning of 'normal' line
<g></g>  10888  : Greek text (per the comment in inm.dtd)
<F>X</F> 28  : Footnote
<P>  1848  : Begin Paragraph
<H>  155  : 'Header' line
<HI>  16368  : Headword indicator
<NI>  6  : usage unclear
<HS>  2  : only in title material

Page breaks are coded as [Page...].
The first page break for the body of the text is [Page001-a+ 48], at line 4152 of
inm.txt. Page breaks in the body of the text have the general form
[PagePPP-C+ N], where PPP is the page number, C is the column (‘a’ or ‘b’), and
N is the number of lines of text on the subsequent page.
Line 102878 of inm.txt is the last line of the body of the dictionary, on
page 787.
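
The [PagePPP-C+ N] form can be parsed with a regular expression. The following is a minimal sketch based only on the general form and the one example given above; other [Page...] variants, such as any used outside the body of the text, are deliberately not matched. The names RE_PAGE and parse_page_break are hypothetical.

```python
import re

# [PagePPP-C+ N]: page number, column letter, then the line count.
RE_PAGE = re.compile(r"\[Page(\d+)-([ab])\+ *(\d+)\]")


def parse_page_break(line):
    """Return (page, column, line_count) for a body page break, else None."""
    m = RE_PAGE.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), int(m.group(3))
```

For the first body page break, parse_page_break("[Page001-a+ 48]") yields page 1, column "a", and a line count of 48.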

The lines of the digitization represent lines of the text.

Headword coding is exemplified by:
<HI>{@X@}¦ or <HI>[{@X@}]¦
The Python programs recognize headwords with the following regular expression,
which is defined in headword.py:
reHeadword = r'^<HI>.*?{@(.*?)@}.*?¦'
The headword is coded in AS.
There are some homonyms coded as, for example,
<HI>{@Ça1n2d2ili1@}^1,¦
But our program logic ignores them.
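
To illustrate how the headword pattern behaves on the two coding forms and on the homonym example, here is a small sketch. The pattern is the one quoted from headword.py; the wrapper function extract_headword is hypothetical and is not taken from that file.

```python
import re

# Pattern as quoted from headword.py, written with straight quotes.
reHeadword = r'^<HI>.*?{@(.*?)@}.*?¦'


def extract_headword(line):
    """Return the AS-coded headword of a <HI> line, or None for other lines."""
    m = re.search(reHeadword, line)
    return m.group(1) if m else None
```

Because the capture group covers only the text between {@ and @}, a trailing homonym marker such as ^1 is not part of the returned headword.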

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text appears in the European Indological form, which is coded in inm.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses Latin letters x (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in inm.txt in this coding, with their approximate frequency:

A1  3181 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 89026 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     5 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a4  1055 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE (udatta accent)
a9     3 := NO DESCRIPTION
d2  5186 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    23 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e1     1 := ē  (\u0113)  LATIN SMALL LETTER E WITH MACRON
e7     4 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2 17149 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1   136 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1 19191 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
l2     1 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
M2     6 := Ṃ  (\u1e42)  LATIN CAPITAL LETTER M WITH DOT BELOW
m2 17152 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3    15 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
N2    34 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2 23534 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
N3     2 := Ṅ  (\u1e44)  LATIN CAPITAL LETTER N WITH DOT ABOVE
n3  3958 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n4     2 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
N5     2 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5  3409 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1     1 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
O7     2 := Ö  (\u00d6)  LATIN CAPITAL LETTER O WITH DIAERESIS
o7     8 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2   764 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2 13860 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21    39 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
s2     9 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
s4     1 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2     3 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2  8250 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1    74 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  6773 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON

Letter-number combinations that are not AS occur in the tags <C1>, etc.
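
The AS coding can be converted to Unicode by attaching combining marks to the base letter and normalizing. The sketch below is one possible approach, not the conversion used by the project's programs. It assumes that a number such as 21 is read as the marks 2 and 1 in sequence, it skips anything inside <...> tags, and it leaves a letter-number pair unchanged when no single precomposed character results (so a9, which has no description above, stays as it is). The names AS_MARKS and as_to_unicode are hypothetical.

```python
import re
import unicodedata

# AS number -> Unicode combining mark, following the general scheme above.
AS_MARKS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute accent
    "5": "\u0303",   # tilde
    "6": "\u0331",   # dash below (combining macron below)
    "7": "\u0308",   # umlaut
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave accent
}

RE_AS = re.compile(r"([A-Za-z])((?:10|11|[1-7])+)")
RE_CODE = re.compile(r"10|11|[1-7]")
RE_TAG = re.compile(r"(<[^>]*>)")


def _convert_token(m):
    marks = "".join(AS_MARKS[code] for code in RE_CODE.findall(m.group(2)))
    composed = unicodedata.normalize("NFC", m.group(1) + marks)
    # Keep the original text unless a single precomposed character results.
    return composed if len(composed) == 1 else m.group(0)


def as_to_unicode(text):
    """Convert AS letter-number codes outside of <...> tags to Unicode."""
    parts = RE_TAG.split(text)
    # With a capturing group, re.split puts the tags at the odd indices.
    return "".join(
        part if i % 2 else RE_AS.sub(_convert_token, part)
        for i, part in enumerate(parts)
    )
```

A plain letter that happens to be followed by a digit, for example in a citation, would also be converted if a precomposed character exists, so real text may need additional safeguards.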

DTD

inm.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- inm.dtd
 June 8, 2013

-->
<!ELEMENT  inm (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % misc_empty "F | FEND | br | g | b | i | s | C1 | C2| C2H |
  C3 | C3H |C4 |C5 |C6 | NI | H" >
<!ENTITY % body_elts "  %misc_empty;  " >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA )*>
<!-- special_chars UNUSED-->
<!ELEMENT C1 EMPTY > <!-- Column 1 -->
<!ELEMENT C2 EMPTY > <!-- Column 2 -->
<!ELEMENT C2H EMPTY > <!-- Column 2 -->
<!ELEMENT C3H EMPTY > <!-- Column 3 -->
<!ELEMENT C3 EMPTY > <!-- Column 3 -->
<!ELEMENT C4 EMPTY>  <!-- Column 4 -->
<!ELEMENT C5 EMPTY>  <!-- Column 5 -->
<!ELEMENT C6 EMPTY>  <!-- Column 6 -->
<!ELEMENT NI EMPTY>  <!-- ? -->
<!ELEMENT H EMPTY>  <!-- ? -->
<!ELEMENT F (#PCDATA | i | br)*>  <!-- Footnote -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in inm.txt -->
<!ELEMENT g (#PCDATA)> <!-- Greek  -->
<!ELEMENT b (#PCDATA | g | br)*> <!-- bold  -->
<!ELEMENT i (#PCDATA | br | g | C1)*> <!-- italic  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
<!-- attributes  -->
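
The top-level structure required by the DTD, H1 (h,body,tail) with h (key1,key2), can be spot-checked with the standard library. The sketch below is not a DTD validator, since it checks only the order of these child elements; the function name check_h1 and all values in the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_h1(xml_text):
    """Return a list of structural problems in one <H1> record (empty if none)."""
    problems = []
    h1 = ET.fromstring(xml_text)
    if h1.tag != "H1":
        problems.append(f"root is <{h1.tag}>, expected <H1>")
    if [child.tag for child in h1] != ["h", "body", "tail"]:
        problems.append("children of <H1> must be h, body, tail in that order")
    h = h1.find("h")
    if h is not None and [child.tag for child in h] != ["key1", "key2"]:
        problems.append("children of <h> must be key1, key2 in that order")
    return problems


# Hypothetical record; the key, text, L and pc values are made up.
sample = (
    "<H1><h><key1>agni</key1><key2>agni</key2></h>"
    "<body><b>Agni</b>, sample text.<br/></body>"
    "<tail><L>1</L><pc>001-a</pc></tail></H1>"
)
```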