SNP Sanskrit Names of Plants (Developer notes)

Date of digitization: 2014

Metadata

The original digitization is file snp_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file snp_orig_utf8.txt is a conversion of snp_orig.txt to the more common utf-8 encoding. The file snp.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
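The conversion from cp1252 to utf-8 can be sketched as follows. This is only a minimal illustration, not necessarily the script that produced snp_orig_utf8.txt; the function name convert_cp1252_to_utf8 is hypothetical. Reading and writing bytes keeps the original line endings unchanged.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # Decode strictly: a byte that cp1252 does not define raises
    # UnicodeDecodeError instead of being silently replaced.
    text = src.read_bytes().decode("cp1252")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # File names as given in these notes; run only where snp_orig.txt exists.
    source = Path("snp_orig.txt")
    if source.exists():
        convert_cp1252_to_utf8(source, Path("snp_orig_utf8.txt"))
```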

Several non-ASCII characters occur in snp.txt, listed here with their code points and counts:

¤  (\u00a4)    32 := CURRENCY SIGN  Used to represent 'breve'.
                     For example, in aja1ji1¤ the final vowel is printed
                     with both a macron and a breve, indicating that
                     the vowel may be long or short.
                     There is no single Unicode code point representing
                     'i with macron and breve'.
×  (\u00d7)     2 := MULTIPLICATION SIGN
ç  (\u00e7)     2 := LATIN SMALL LETTER C WITH CEDILLA
‘  (\u2018)     5 := LEFT SINGLE QUOTATION MARK
’  (\u2019)     5 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)    14 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)    14 := RIGHT DOUBLE QUOTATION MARK
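A tally of this kind could be produced with a short script along the following lines. This is a sketch only; the helper names non_ascii_counts and report are hypothetical, and the output layout merely imitates the table above.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(text: str) -> list:
    """Format one line per non-ASCII character, ordered by code point."""
    lines = []
    for ch, n in sorted(non_ascii_counts(text).items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


if __name__ == "__main__":
    # Small inline sample; for the real file, pass the decoded text of snp.txt.
    for line in report("aja1ji1\u00a4 \u00d7 \u00a4"):
        print(line)
```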

The {X...X} style of coding serves one purpose in snp.txt:

{%X%}   1187 : italic text

The following <x> type tags are found in snp.txt:

<>   2271  := normal beginning of line
<H>   464  := headline, used primarily for headwords
<P>  1756  := paragraph
<HI>   42  := headline, used primarily in preface materials
<g></g> 1  := Greek language (uncoded)

Page breaks are coded as [Page...].
In general, a page break has the form [PageX+ n], where X is the page number and
n is the number of lines in the digitization for page X.
The forms of X are:
title-Y (Y = v,VI)
-PPP where PPP = 520 to 611 for the first part
PPP = 425 to 465 for the second part
In SNP, these page numbers refer to parts of two different publications,
as indicated in snpheader.xml.
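The page-break form described above can be recognized with a regular expression, as in this sketch. The exact spacing inside the brackets is an assumption drawn from the description "[PageX+ n]", the sample strings are constructed for illustration rather than copied from snp.txt, and the names PAGE_BREAK, PageBreak and parse_page_break are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# X is either "title-Y" or "-PPP"; n is the line count for that page.
# Assumption: optional whitespace between "+" and n, as in "[PageX+ n]".
PAGE_BREAK = re.compile(r"\[Page(?P<x>title-[A-Za-z]+|-\d+)\+\s*(?P<n>\d+)\]")


class PageBreak(NamedTuple):
    page: str        # the X part, e.g. "-520" or "title-v"
    line_count: int  # the n part


def parse_page_break(line: str) -> Optional[PageBreak]:
    """Return the first page break found in a line, or None."""
    m = PAGE_BREAK.search(line)
    if m is None:
        return None
    return PageBreak(m.group("x"), int(m.group("n")))


if __name__ == "__main__":
    # Constructed examples, not taken from the digitization.
    print(parse_page_break("[Page-520+ 34]"))
    print(parse_page_break("[Pagetitle-v+ 12]"))
```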

The lines of the digitization generally represent lines of the text.

Headword coding is exemplified by: .{#a#}100{#a°#}^2¦
The headword forms in snp are exemplified by:
<H>{%kat2utumbi1%} (part 1)
<H>atimuktaka (part 2)
The headword is coded in AS.
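As an illustration, the two headword forms shown above could be extracted from a line as follows. This sketch covers only these two forms, and it assumes that the headword immediately follows the <H> tag; the names HEADLINE and headword_from_line are hypothetical.

```python
import re
from typing import Optional

# Part 1 form: <H>{%...%}   Part 2 form: <H> followed by a plain word.
# Note that "<HI>" lines do not match, because ">" must directly follow "<H".
HEADLINE = re.compile(r"^<H>(?:\{%(?P<italic>[^%]*)%\}|(?P<plain>[^{<\s]+))")


def headword_from_line(line: str) -> Optional[str]:
    """Return the AS-coded headword of an <H> line, or None."""
    m = HEADLINE.match(line)
    if m is None:
        return None
    if m.group("italic") is not None:
        return m.group("italic")
    return m.group("plain")


if __name__ == "__main__":
    print(headword_from_line("<H>{%kat2utumbi1%}"))
    print(headword_from_line("<H>atimuktaka"))
```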

Sanskrit in the text appears in the European Indological form, which is coded in snp.txt with the AS (Anglicized Sanskrit) coding.

The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters x (a-z, A-Z), possibly with suffixed numbers. The letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut (diaeresis)
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in snp.txt in this coding, with their approximate frequency:

A1    14 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1  1385 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     1 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11     1 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
a7     4 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
d2   128 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    30 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e4    10 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
E7     3 := Ë  (\u00cb)  LATIN CAPITAL LETTER E WITH DIAERESIS
e7     1 := ë  (\u00eb)  LATIN SMALL LETTER E WITH DIAERESIS
h2    13 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
i1   671 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i7     1 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
m2    35 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
m3     1 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
n2   286 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3    87 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n5    34 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
r2   156 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
s2   398 := ṣ  (\u1e63)  LATIN SMALL LETTER S WITH DOT BELOW
S4    45 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4   950 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
T2     1 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2   257 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
u1   269 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON

K358, S88     1 := (not AS) used in the bibliographic entry for part 2 of the digitization
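The letter-number scheme lends itself to a mechanical conversion: replace each number suffix by the matching Unicode combining mark, then normalize to NFC so that precomposed letters such as ā result. The following is a minimal sketch under stated assumptions, not the converter used for the displays. The names AS_MARKS, AS_TOKEN and as_to_unicode are hypothetical, and mapping "dash below" to COMBINING MACRON BELOW is an assumption. As the K358, S88 case shows, a letter followed by a digit is not always AS, so a function like this should be applied only to spans known to be AS-coded.

```python
import re
import unicodedata

# Combining marks for the AS number suffixes listed above.
AS_MARKS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron below (assumed equivalent of "dash below")
    "7": "\u0308",   # diaeresis
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

# Two-digit suffixes are listed first so "a10" is not read as "a1" + "0".
AS_TOKEN = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text: str) -> str:
    """Convert AS letter-number pairs to precomposed Unicode letters."""
    decomposed = AS_TOKEN.sub(
        lambda m: m.group(1) + AS_MARKS[m.group(2)], text
    )
    return unicodedata.normalize("NFC", decomposed)


if __name__ == "__main__":
    print(as_to_unicode("kat2utumbi1"))
```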

DTD

snp.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- snp.dtd
 May 28, 2014

-->
<!ELEMENT  snp (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |lb |g |P |H |HI" >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA | lb)*> <!-- italic text-->
<!ELEMENT lb EMPTY>  <!-- line break -->
<!ELEMENT g EMPTY>  <!-- Greek text -->
<!ELEMENT P EMPTY>  <!-- 'Paragraph' -->
<!ELEMENT H EMPTY>  <!-- Headline -->
<!ELEMENT HI EMPTY>  <!-- Headline -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->