VEI The Vedic Index of Names and Subjects (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file vei_orig.txt, which is encoded in cp1252 (Windows-1252) and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.

The file vei_orig_utf8.txt is a conversion of vei_orig.txt to the more common utf-8 encoding. The file vei.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
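The conversion from cp1252 to utf-8 can be sketched in Python as follows. This is only an illustration of the idea, not necessarily the procedure that produced vei_orig_utf8.txt; the function names cp1252_to_utf8 and convert_file are hypothetical.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the text as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def convert_file(src_path, dst_path):
    """Hypothetical helper: convert a whole file, e.g. vei_orig.txt."""
    # Binary mode avoids any newline translation during the conversion.
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1252_to_utf8(raw))


# Bytes 0x91 and 0x92 are the cp1252 single quotation marks; 0xe6 is 'ae'.
sample = b"\x91Akra\x92 \xe6"
print(cp1252_to_utf8(sample).decode("utf-8"))
```

A call such as convert_file("vei_orig.txt", "vei_orig_utf8.txt") would then write the converted copy, assuming the input contains only bytes that cp1252 defines.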

Several non-ASCII characters occur in vei.txt. Each is listed with its code point, its number of occurrences, and its Unicode name:

¤  (\u00a4)    20 := CURRENCY SIGN
°  (\u00b0)     2 := DEGREE SIGN
º  (\u00ba)    15 := MASCULINE ORDINAL INDICATOR
×  (\u00d7)     6 := MULTIPLICATION SIGN
æ  (\u00e6)    43 := LATIN SMALL LETTER AE
ç  (\u00e7)     2 := LATIN SMALL LETTER C WITH CEDILLA
œ  (\u0153)    20 := LATIN SMALL LIGATURE OE
‘  (\u2018)  7803 := LEFT SINGLE QUOTATION MARK
’  (\u2019)  7816 := RIGHT SINGLE QUOTATION MARK
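A listing of this kind can be produced with a short script. The sketch below is an assumption about how such counts might be gathered, not the program actually used; count_non_ascii and report are hypothetical names, and the sample string is made up.

```python
from collections import Counter
import unicodedata


def count_non_ascii(text):
    """Count every character whose code point lies outside ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(text):
    """Format one line per character: glyph, code point, count, name."""
    lines = []
    for ch, n in sorted(count_non_ascii(text).items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append("%s  (\\u%04x) %5d := %s" % (ch, ord(ch), n, name))
    return lines


# Made-up sample text; in practice one would read vei.txt as utf-8.
sample = "\u2018Akra\u2019 5\u00b0 \u00d7 2"
for line in report(sample):
    print(line)
```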

The {X...X} style of coding serves several purposes:

{@X@}  9032  : bold text
{%X%} 14997  : italic text
{|X|}    7   : widely spaced text
{??}   167   : unreadable text
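As an illustration, the three paired codes can be recognized with a single regular expression that uses a backreference to require the same delimiter on both sides. This is a minimal sketch; the names SPAN_RE, KINDS, markup_spans and count_unreadable are hypothetical and do not come from the project's programs.

```python
import re

# {@X@}, {%X%} and {|X|}: the closing delimiter must repeat the opening one.
SPAN_RE = re.compile(r"\{([@%|])(.*?)\1\}")
KINDS = {"@": "bold", "%": "italic", "|": "spaced"}


def markup_spans(line):
    """Return (kind, text) pairs for each {X...X} span in a line."""
    return [(KINDS[m.group(1)], m.group(2)) for m in SPAN_RE.finditer(line)]


def count_unreadable(line):
    """Count the {??} markers for unreadable text."""
    return line.count("{??}")


print(markup_spans("<P>{@Akra.@} see {%ibid.%}"))
```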

The following <x> type tags are found in vei.txt:

<F>X</F> 11177  : Footnote
<g></g>    147  : Placeholder for Greek text
<>       37064  : start of new line
<C1>        33  : column 1
<C2>        32  : column 2
<C3>        28  : column 3
<H>         46  : Title line for Letter Break and other purposes
<HI>      4569  : Used in Sanskrit Index and English Index coding at end of Vol II
<HI1>     1701  : Used for sub-items in English Index
<HS>         3  : Used in preface material
<NI>         1  : Used in preface material
<P>       5243  : start of new paragraph. Identifies headwords

Page breaks are coded as [PageV-X+ n], where:

  V is the volume (‘1’ or ‘2’)
  n is the number of lines in the following page
  X is one of:
    PPP        for a typical page
    PPPS       where S is ‘a’ or ‘b’ and represents the column in a multi-column page (for the Sanskrit Index at end of Volume 2)
    title-PPP  where PPP is 001 or 002 (V=1)
    R          where R is a lower case Roman numeral (in preface)
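
The page-break coding described above could be parsed along the following lines. This is a sketch under assumptions: PPP is taken to be a run of digits, the preface pages are taken to carry a volume number like the others, and the name parse_page_break and the sample strings in the usage line are invented for illustration.

```python
import re

# [PageV-X+ n]: V is the volume, X the page identifier, n the line count.
PAGE_RE = re.compile(r"\[Page([12])-([^+\]]+)\+\s*(\d+)\]")


def parse_page_break(text):
    """Return a dict describing the first page break in text, or None."""
    m = PAGE_RE.search(text)
    if m is None:
        return None
    volume, x, nlines = m.group(1), m.group(2), int(m.group(3))
    if re.fullmatch(r"\d+", x):
        kind = "page"        # PPP
    elif re.fullmatch(r"\d+[ab]", x):
        kind = "column"      # PPPS, S is 'a' or 'b'
    elif re.fullmatch(r"title-\d+", x):
        kind = "title"       # title-PPP
    elif re.fullmatch(r"[ivxlcdm]+", x):
        kind = "preface"     # lower case Roman numeral
    else:
        kind = "unknown"
    return {"volume": volume, "page": x, "kind": kind, "lines": nlines}


print(parse_page_break("[Page1-001+ 35]"))
```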

In the digitization, the Title and Preface pages occur between
1-544 and 2-001.

The lines of the digitization generally represent lines of the text.

Headword coding is exemplified by: <P>{@Akra.@} and <P>1. {@Aks2a,@}
Here is the regular expression used in Python programs to recognize headwords:
reHeadword = r'<P>[1-8. ]*{@(.*?)@}'
In examples like <P>1. {@Aks2a,@}, ‘1.’ represents a homonym number; but these
are not ignored in vei.xml.
The headword is spelled in AS (with first letter capitalized).
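
A short sketch of how this regular expression behaves on the two example lines follows. The helper name find_headword is hypothetical. Note that the captured group keeps the trailing punctuation found inside the bold markup.

```python
import re

# The headword pattern quoted above, written with plain ASCII quotes.
reHeadword = r'<P>[1-8. ]*{@(.*?)@}'


def find_headword(line):
    """Return the text inside the bold markup of a headword line, or None."""
    m = re.search(reHeadword, line)
    return m.group(1) if m else None


print(find_headword('<P>{@Akra.@}'))      # Akra.
print(find_headword('<P>1. {@Aks2a,@}'))  # Aks2a,
```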

The headwords are ordered according to Sanskrit alphabet ordering.

Sanskrit in the text appears in the European Indological form, which is coded in vei.txt with the AS (Anglicized Sanskrit) coding. Some words in other languages, such as German and French, also have diacritical marks that are coded by this AS scheme.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly with suffixed numbers. The letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = accent aigu
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = accent grave

Here are the characters that occur in vei.txt in this coding, with their approximate frequency:

A1  1623 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 25565 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     5 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11     1 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a4     5 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE (udatta accent)
a7   227 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
d2   715 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
e1     1 := ē  (\u0113)  LATIN SMALL LETTER E WITH MACRON
e11     2 := è  (\u00e8)  LATIN SMALL LETTER E WITH GRAVE
E4     2 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4   163 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
e7     1 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2   561 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1     6 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1  6155 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i4     1 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE (udatta accent)
l2    17 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2  4900 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3     1 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
n2  9132 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3  1009 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n5  1041 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1     8 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o4     1 := ó  (\u00f3)  LATIN SMALL LETTER O WITH ACUTE (udatta accent)
o7   146 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2   453 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  2582 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
r21    2 := ṝ  (\u1e5d)  LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
S2    62 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s2  5538 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4  3887 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4  4434 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t2  1752 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1    41 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  2438 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u4     1 := ú  (\u00fa)  LATIN SMALL LETTER U WITH ACUTE (udatta accent)
U7    37 := Ü  (\u00dc)  LATIN CAPITAL LETTER U WITH DIAERESIS
u7   304 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
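
One way to turn AS codes into the Unicode characters listed above is to append combining diacritics to the base letter and then normalize to NFC. The sketch below is an assumption about how this could be done, not the project's conversion program. The choice of combining marks (in particular U+0331 for "dash below") and the names AS_DIACRITICS and as_to_unicode are hypothetical. Tags such as <C1> and page breaks such as [Page1-...] are left untouched, because they contain letter-digit pairs that are not AS codes; ordinary digits that happen to follow a letter in running text would still be misread by this simple approach.

```python
import re
import unicodedata

# AS digit code -> combining diacritic (assumed correspondence).
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below, assumed for "dash below"
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

# A letter followed by one or more codes; "10" and "11" are tried first,
# so r21 is read as r + 2 + 1 while a10 is read as a + 10.
AS_TOKEN = re.compile(r"([A-Za-z])((?:10|11|[1-7])+)")
CODE = re.compile(r"10|11|[1-7]")
# Tags and page breaks are protected from conversion.
PROTECTED = re.compile(r"(<[^>]*>|\[Page[^\]]*\])")


def _convert_token(m):
    marks = "".join(AS_DIACRITICS[c] for c in CODE.findall(m.group(2)))
    return unicodedata.normalize("NFC", m.group(1) + marks)


def as_to_unicode(text):
    """Convert AS letter-number codes to precomposed Unicode characters."""
    parts = PROTECTED.split(text)
    # re.split with a capturing group puts the protected pieces at odd indexes.
    return "".join(
        p if i % 2 else AS_TOKEN.sub(_convert_token, p)
        for i, p in enumerate(parts)
    )


print(as_to_unicode("<P>1. {@Aks2a,@}"))
```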

DTD

vei.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- vei.dtd
 July 22, 2013

-->
<!ELEMENT  vei (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "F |  br | H | b | i | s | C1 | C2 | C3 | g" >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA )*>

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in bhs.txt -->
<!ELEMENT F (#PCDATA | br | i | b | g)*> <!-- Footnote  -->
<!ELEMENT s (#PCDATA | br)*> <!-- Devanagari, in HK transliteration  -->
<!ELEMENT H EMPTY> <!--   -->
<!ELEMENT b (#PCDATA | br)*> <!-- bold  -->
<!ELEMENT i (#PCDATA | br )*> <!-- italic  -->
<!ELEMENT C1 EMPTY> <!--  Column 1  -->
<!ELEMENT C2 EMPTY> <!--  Column 2  -->
<!ELEMENT C3 EMPTY> <!--  Column 3  -->
<!ELEMENT g (#PCDATA) > <!-- Placeholder for Greek text -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
<!-- attributes  -->