PW Böhtlingk Sanskrit-Wörterbuch in kürzerer Fassung (Developer notes)

Date of digitization: 2007

Metadata

The original digitization is file pw_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file pw_orig_utf8.txt is a conversion of pw_orig.txt to the more common utf-8 encoding. The file pw.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
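
A conversion of this kind can be sketched in Python as follows. This is only an illustration, not the program actually used to produce pw_orig_utf8.txt; the function name is an assumption. Decoding raises UnicodeDecodeError if the input contains a byte that cp1252 leaves undefined.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as utf-8 (hypothetical helper)."""
    # newline="" keeps the original line endings unchanged on read and write.
    with open(src, "r", encoding="cp1252", newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
```

Usage would be along the lines of convert_cp1252_to_utf8(Path("pw_orig.txt"), Path("pw_orig_utf8.txt")).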

Several non-ASCII characters occur in pw.txt. Each is listed with its code point, its approximate count, and its Unicode name:

£  (\u00a3)    14 := POUND SIGN
¦  (\u00a6) 135784 := BROKEN BAR   (ends headword )
¨  (\u00a8)   170 := DIAERESIS
ª  (\u00aa) 28038 := FEMININE ORDINAL INDICATOR
       ª = udAtta accent  = raised devanagari 'u' (text) = '/' in slp1
       ªª = svarita accent = vertical bar above (text) = '^' in slp1
®  (\u00ae)  7966 := REGISTERED SIGN
                    (following words, joined by _, are Latin genus-species)
¯  (\u00af) 79942 := MACRON  ¯X == <ls>X</ls>  (X extends to space or Macron)
°  (\u00b0) 20493 := DEGREE SIGN
²  (\u00b2) 78388 := SUPERSCRIPT TWO (English subheading = ²a) ²b) etc.)
³  (\u00b3) 38162 := SUPERSCRIPT THREE  (Number subhead = ³1) ³2) etc.)
´  (\u00b4)   115 := ACUTE ACCENT
¹  (\u00b9)  3874 := SUPERSCRIPT ONE  (Greek subhead = ¹a) ¹b) etc.)
»  (\u00bb)  1405 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
Ä  (\u00c4)     1 := LATIN CAPITAL LETTER A WITH DIAERESIS
Ö  (\u00d6)     6 := LATIN CAPITAL LETTER O WITH DIAERESIS
Ü  (\u00dc)   356 := LATIN CAPITAL LETTER U WITH DIAERESIS
ß  (\u00df)     2 := LATIN SMALL LETTER SHARP S
ä  (\u00e4) 24061 := LATIN SMALL LETTER A WITH DIAERESIS
é  (\u00e9)     1 := LATIN SMALL LETTER E WITH ACUTE
ê  (\u00ea)     5 := LATIN SMALL LETTER E WITH CIRCUMFLEX
ë  (\u00eb)     1 := LATIN SMALL LETTER E WITH DIAERESIS
ö  (\u00f6) 13559 := LATIN SMALL LETTER O WITH DIAERESIS
ü  (\u00fc) 36320 := LATIN SMALL LETTER U WITH DIAERESIS
ý  (\u00fd)   132 := LATIN SMALL LETTER Y WITH ACUTE
                     may represent a superscript '2'. Always with literary
                    source 'VP' ?
ƒ  (\u0192) 11531 := LATIN SMALL LETTER F WITH HOOK
                     ƒPage1.001-2ƒ
†  (\u2020) 30879 := DAGGER
                     {%†Soma-Stengel%}  (Soma non-italic, Stengel italic)
•  (\u2022) 203444 := BULLET  precedes gender
…  (\u2026) 681351 := HORIZONTAL ELLIPSIS
                      A space, within a grouping
‹  (\u2039) 140933 := SINGLE LEFT-POINTING ANGLE QUOTATION MARK
                      ‹X› = non-italic text
›  (\u203a) 141104 := SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
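
An inventory like the one above can be regenerated with a short script. The following is a minimal sketch, assuming the file is read as utf-8 lines; the function name is hypothetical and this is not necessarily how the table was originally produced.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(lines):
    """Return (char, code point, count, Unicode name) for each non-ASCII char."""
    counts = Counter(ch for line in lines for ch in line if ord(ch) > 127)
    return [
        (ch, f"\\u{ord(ch):04x}", n, unicodedata.name(ch, "UNKNOWN"))
        for ch, n in sorted(counts.items())  # sorted by code point
    ]
```

For example, passing the open file object for pw.txt (opened with encoding="utf-8") yields one row per character, in the same order as the table.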

The {X...X} style of coding serves several purposes:

#{X}    :  devanagari text, coded with HK
{%X%} 194080 : italic text
† within italic text indicates that the following word is not italic,
  e.g. {%†Soma-Stengel%} (Soma non-italic, Stengel italic)
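
The italic coding and the dagger convention can be illustrated with a small sketch. It assumes that the "following word" ends at the first non-word character (as in the Soma-Stengel example); that boundary rule and the function name are assumptions, not something the digitization documents.

```python
import re

RE_ITALIC = re.compile(r"\{%(.*?)%\}")   # contents of {%...%}
RE_DAGGER_WORD = re.compile(r"†(\w+)")   # assumed: word = run of word characters


def italic_pieces(span):
    """Split the contents of one {%...%} span into (text, is_italic) pieces."""
    pieces = []
    pos = 0
    for m in RE_DAGGER_WORD.finditer(span):
        if m.start() > pos:
            pieces.append((span[pos:m.start()], True))
        pieces.append((m.group(1), False))  # word after the dagger: not italic
        pos = m.end()
    if pos < len(span):
        pieces.append((span[pos:], True))
    return pieces
```

Here RE_ITALIC.findall(line) extracts the italic spans of a line, and italic_pieces splits one span at the dagger-marked words.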

The following <x> type tags are found in pw.txt:

<*Caus.>  10  :=   preceded by M-dash
<Caus.>  1687  :=
<*Intens.>  1  :=
<+>  7834  :=  'mdash Mit'
<A></A>  47  := Arabic text  (also |<A></A>)
</G>  1  :=  error. 94079 for </g>
<gr>  1  :=    coding error, for <g>
<g>X</g>  183  := Greek.  X coded in unknown transliteration
 |<RUSSISCH></RUSSISCH>  1  :=  Russian
<sz></sz>  1  := metric symbols
<= ‹Fährmann>  1  :=  {%Ferge.%} |<= ‹Fährmann>›
   "|<= ‹Fährmann>›" is not in scan. Maybe a more modern form. Corrected
<?>  42  :=  unreadable ? (usage unclear)
<ATI>  1  :=   error: appears gratuitous @ line 36743
<H1>  135784  := headword
|<sic>  28  := indication of error in scan ?

Page breaks are coded as ƒPage1.070-3ƒ
In general:
ƒPageV.PPP-Cƒ V = volume 1-3, PPP = page within volume, C = column (1, 2, or 3)
The first page number is ƒPage1.001-1ƒ, occurring (wrongly) at the END of
line 5.
A few page breaks show a variant coding:
3.84-2, 3.85-1, 3.85-3, 3.87-3, 4.23-1,
4.84-2, 4.95-2, 4.0152-3, 5.67-2, 5.99-1,
7.50-1, 7.54-3
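
A sketch of how page breaks might be extracted, flagging the variant codings. It assumes the variants are also wrapped as ƒPage...ƒ and differ only in the number of page digits; the function name and the 4-tuple layout are illustrative.

```python
import re

# Volume, page, and column; the page is allowed any number of digits so that
# variant codings such as 3.84-2 or 4.0152-3 are also recognized.
RE_PAGE = re.compile(r"ƒPage(\d+)\.(\d+)-(\d)ƒ")


def page_breaks(line):
    """Return (volume, page, column, is_standard) for each page break in line."""
    found = []
    for m in RE_PAGE.finditer(line):
        vol, page, col = m.groups()
        # is_standard is True only for the three-digit PPP form.
        found.append((int(vol), int(page), int(col), len(page) == 3))
    return found
```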

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded. The Addenda Title/Foreword pages are not coded.

Headword coding:
<H1>X¦
Structure of X:
000{Y}1{Z}W (000 may be 100, perhaps others - significance unclear)
where Y and Z are the ‘key1’ and ‘key2’ forms of the headword, coded in the
Kyoto-Harvard transliteration, and W, if present, represents a homonym, in the
form ^n (where n is a sequence of 1 or more digits).

Here is the regular expression used in python programs to recognize headwords:
reHeadword = r'^<H1>...{(.*?)}1{(.*?)}([^}]*?)¦'
The headwords are ordered according to Sanskrit alphabet ordering.
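
A pattern of this shape might be applied as in the following sketch. The braces are escaped here for clarity, the sample line in the usage note is made up, and the helper name is hypothetical.

```python
import re

# <H1>, three characters (e.g. 000 or 100), {key1}, the digit 1, {key2},
# an optional homonym marker W, then the broken bar that ends the headword.
RE_HEADWORD = re.compile(r"^<H1>...\{(.*?)\}1\{(.*?)\}([^}]*?)¦")
RE_HOMONYM = re.compile(r"^\^(\d+)$")


def parse_headword(line):
    """Return (key1, key2, homonym number or None), or None if no headword."""
    m = RE_HEADWORD.match(line)
    if m is None:
        return None
    key1, key2, w = m.groups()
    hom = RE_HOMONYM.match(w)
    return key1, key2, (int(hom.group(1)) if hom else None)
```

For a made-up line such as "<H1>000{akza}1{a/kza}^2¦ ...", the sketch returns the two keys and the homonym number 2.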

Anglicized Sanskrit (AS) coding appears in literary citation abbreviations and elsewhere when the scan shows ‘indological’ coding of Sanskrit. The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters ‘x’ (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations are, in the general scheme:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

However, non-standard AS abounds; notably ‘7’ represents circumflex (rather than the usual AS diaeresis). Here are the AS codes that occur, with their approximate frequencies:

A1    13 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1     2 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
A4    23 := Á  (\u00c1)  LATIN CAPITAL LETTER A WITH ACUTE
a4    25 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE
A7 27809 := Â  (\u00c2)  LATIN CAPITAL LETTER A WITH CIRCUMFLEX
a7 10753 := â  (\u00e2)  wrong LATIN SMALL LETTER A WITH CIRCUMFLEX
C2 11942 := Ç  (\u00c7)  LATIN CAPITAL LETTER C WITH CEDILLA
c2  1822 := ç  (\u00e7)  LATIN SMALL LETTER C WITH CEDILLA
C3     3 := Ċ  (\u010a)  LATIN CAPITAL LETTER C WITH DOT ABOVE
c3     2 := ċ  (\u010b)  LATIN SMALL LETTER C WITH DOT ABOVE
C4    12 := Ć  (\u0106)  LATIN CAPITAL LETTER C WITH ACUTE
c4     1 := ć  (\u0107)  LATIN SMALL LETTER C WITH ACUTE
d2   531 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2  1161 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
G3     7 := Ġ  (\u0120)  LATIN CAPITAL LETTER G WITH DOT ABOVE
g3     4 := ġ  (\u0121)  LATIN SMALL LETTER G WITH DOT ABOVE
G4  7638 := Ǵ  (\u01f4)  LATIN CAPITAL LETTER G WITH ACUTE (= 'j')
g4  1173 := ǵ  (\u01f5)  LATIN SMALL LETTER G WITH ACUTE (= 'j')
h2    37 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
H3     2 := Ḣ  (\u1e22)  LATIN CAPITAL LETTER H WITH DOT ABOVE
I1    14 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON (prob shld be I7)
i1    98 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON (prob shld be i7)
I4     2 := Í  (\u00cd)  LATIN CAPITAL LETTER I WITH ACUTE
i4    19 := í  (\u00ed)  LATIN SMALL LETTER I WITH ACUTE
i5     2 := ĩ  (\u0129)  LATIN SMALL LETTER I WITH TILDE
I7  1017 := Î  (\u00ce)  LATIN CAPITAL LETTER I WITH CIRCUMFLEX
i7  2033 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
K4  5274 := Ḱ  (\u1e30)  LATIN CAPITAL LETTER K WITH ACUTE (= 'c')
k4   398 := ḱ  (\u1e31)  LATIN SMALL LETTER K WITH ACUTE
M5   715 := M̄ (\u004d\u0304) LATIN CAPITAL LETTER M, COMBINING MACRON (AS 5 is nominally tilde)
m5   554 := m̄ (\u006d\u0304) LATIN SMALL LETTER M, COMBINING MACRON, not represented
N2  1914 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2  3997 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3     4 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N4   908 := Ń  (\u0143)  LATIN CAPITAL LETTER N WITH ACUTE  (palatal nasal)
n4   284 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
N5   772 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE (guttural nasal)
n5   606 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
R2  3611 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  1325 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
t2   781 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
T2  1452 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
U1     2 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1    37 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
U4     4 := Ú (\u00da) LATIN CAPITAL LETTER U WITH ACUTE
u4    12 := ú (\u00fa) LATIN SMALL LETTER U WITH ACUTE
U7   232 := Û  (\u00db) LATIN CAPITAL LETTER U WITH CIRCUMFLEX
u7   780 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
Quite a few other letter-number combinations occur but do not represent AS
codings; these are often in literary references.
These non-AS instances are:
A10     1 := not AS
a1036     1 := not AS
A105     2 := not AS
A108     1 := not AS
A124     1 := not AS
A136     1 := not AS
A143     1 := not AS
A17     1 := not AS
a1793     1 := not AS
A18     1 := not AS
A1936     2 := not AS
a1936     7 := not AS
A197     2 := not AS
A2     5 := not ls
a2     5 := not ls
A205     1 := not AS
A23     1 := not AS
A25     1 := not AS
A263     1 := not AS
A280     1 := not AS
A282     1 := not AS
A292     1 := not AS
A3     3 := not AS
A314     1 := not AS
A339     1 := not AS
A37     2 := not AS
A388     1 := not AS
A404     1 := not AS
A432     1 := not AS
A438     1 := not AS
A470     1 := not AS
A476     1 := not AS
A479     1 := not AS
A495     1 := not AS
A5     4 := not AS
a5     6 := not AS
A513     1 := not AS
A53     1 := not AS
A543     1 := not AS
A554     1 := not AS
A568     1 := not AS
A6     5 := not AS
a6     3 := not AS
A61     1 := not AS
A630     1 := not AS
a70     1 := not AS
A72     2 := not AS
A74     7 := not AS
a74     1 := not AS
a77     2 := not AS
A774     1 := not AS
A8     2 := not AS
a8     3 := not AS
A806     1 := not AS
A92     1 := not AS
A99     1 := not AS
B2     1 := Ḅ  not AS
B6     1 := Ḇ  not AS
C1     1 := not AS
c1     1 := not AS
C17     1 := not AS
c210     1 := not AS
C23     2 := not AS
c268     1 := not AS
C27     3 := not AS
C29     1 := not AS
c6     1 := not AS
d175     1 := not AS
d24     1 := not AS
D260     1 := not AS
D277     1 := not AS
d3     1 := not AS
D3     1 := not AS
d4     1 := not AS
D4     1 := not AS
E4     1 := not AS
e4     9 := not AS
f14     1 := not AS
g2     1 := not AS
H4    17 := not AS
h4    12 := not AS
h42     1 := not AS
H5     1 := not AS
h8     1 := not AS
I10     1 := not AS
I121     1 := not AS
I136     1 := not AS
I2     2 := Ị  not AS
i2     2 := not AS
I3    20 := İ  not AS
g47     1 := not AS
G5     4 := not AS
g5    11 := not AS
h0     1 := not AS
J2     1 := not AS
J4     2 := not AS
j8     1 := not AS
K1     1 := not AS
k2     1 := not AS
K2     5 := not AS
k3     2 := not AS
K3     4 := not AS
k410     1 := not AS
K4120     1 := not AS
K4156     1 := not AS
K4184     1 := not AS
K4187     1 := not AS
K4196     1 := not AS
K42     1 := not AS
K422     1 := not AS
K44     4 := not AS
K445     1 := not AS
K456     1 := not AS
K469     1 := not AS
K496     1 := not AS
K5     5 := not AS
k5     5 := not AS
l10     1 := not AS
L4     1 := not AS
M197     1 := not AS
M2     1 := not AS
m2     1 := not AS
M4     6 := not AS
m4     7 := not AS
M50     1 := not AS
M8     1 := not AS
N0     1 := not AS
N1     1 := not AS
n13     2 := not AS
n1885     1 := not AS
N22     1 := not AS
n27     1 := not AS
N29     1 := not AS
n35     1 := not AS
N6     1 :=  not AS
n6     1 :=   not AS
n7     2 := not AS
o1     1 := not AS
O10     1 := not AS
o4     9 := not AS
o474     1 := not AS
o487     1 := not AS
P2     4 := not AS
p2    58 := not AS
P3     1 := not AS
P4     7 := not AS
P5     1 := not AS
r1     1 := not AS
R140     1 := not AS
R18     1 := not AS
R20     1 := not AS
r22     1 := not AS
R3     1 := not AS
r3     2 := not AS
R4     7 := not AS
r4     6 := not AS
R5     2 := not AS
r6     1 := not AS
R6     5 := not AS
s0     1 := not AS
S1     1 := not AS
S10     1 := not AS
S173     1 := not AS
S2     1 := not AS
s2     1 := not AS
S22     1 := not AS
S3     1 := not AS
S4     5 := not AS
S440     1 := not AS
S5     1 := not AS
S7     1 := not AS
S84     1 := not AS
S95     1 := not AS
S96     1 := not AS
T11     1 := not AS
T26     1 := not AS
t3     1 := not AS
T4     5 := not AS
T9     1 := not AS
U10     4 := not AS
u112     1 := not AS
u2     2 := not AS
U3     1 := not AS
u5     2 := not AS
U6     2 := not AS
V2     7 := not AS
v2     1 := not AS
V27     1 := not AS
v3     1 := not AS
V3     1 := not AS
V4     1 := not AS
w4     2 := not AS
w5     1 := not AS
x10     1 := not AS
x11     2 := not AS
x12     2 := not AS
x13     1 := not AS
x14     1 := not AS
x15     1 := not AS
x16     1 := not AS
x17     1 := not AS
x18     1 := not AS
x19     1 := not AS
x20     2 := not AS
x22     1 := not AS
x24     1 := not AS
x26     1 := not AS
x6     1 := not AS
x8     3 := not AS
Z4     1 := not AS
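
Because AS and non-AS letter-number combinations look alike, a conversion to Unicode has to be driven by an explicit table rather than by a general rule. The following is a minimal sketch using only an excerpt of the AS codes listed above; the function name is hypothetical and this is not the conversion program actually used.

```python
import re

# Excerpt of the AS table above (code -> Unicode character).
AS_TO_UNICODE = {
    "A7": "\u00c2", "a7": "\u00e2",   # A/a with circumflex
    "C2": "\u00c7", "c2": "\u00e7",   # C/c with cedilla
    "G4": "\u01f4", "g4": "\u01f5",   # G/g with acute
    "K4": "\u1e30", "k4": "\u1e31",   # K/k with acute
    "N2": "\u1e46", "n2": "\u1e47",   # N/n with dot below
    "R2": "\u1e5a", "r2": "\u1e5b",   # R/r with dot below
    "i7": "\u00ee", "u7": "\u00fb",   # i/u with circumflex
}

# A letter followed by ALL adjacent digits, so that 'K4196' is looked up as a
# whole (and left alone) rather than being read as 'K4' followed by '196'.
RE_CANDIDATE = re.compile(r"[A-Za-z]\d+")


def as_to_unicode(text):
    """Replace known AS codes; leave every other letter-number run untouched."""
    return RE_CANDIDATE.sub(
        lambda m: AS_TO_UNICODE.get(m.group(0), m.group(0)), text
    )
```

Matching the full digit run is the design choice that keeps combinations such as K4196 or A1936 from being partially converted.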

DTD

pw.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- pw.dtd
 June 6, 2014

-->
<!ELEMENT  pw (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "i |noti |s|divm|gram|g|ls|plus|Caus|Intens|sic|A|
   RUSSISCH" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT i (#PCDATA |gram|s|noti|ls|sic|g)*> <!-- italic -->
<!ELEMENT noti (#PCDATA |gram|ls|g|s|i)*> <!-- non-italic -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit, in AS transliteration  -->
<!ELEMENT divm (#PCDATA ) > <!-- section marker -->
<!ELEMENT gram (#PCDATA ) > <!-- Grammatical category -->
<!ELEMENT g (#PCDATA)> <!-- Greek  -->
<!ELEMENT A EMPTY> <!-- Arabic -->
<!ELEMENT RUSSISCH EMPTY> <!-- Russian  (once) -->
<!ELEMENT ls (#PCDATA )*> <!-- literary sources -->
<!ELEMENT plus EMPTY> <!-- sub-heading begin -->
<!ELEMENT Caus EMPTY> <!-- causal sub-heading begin -->
<!ELEMENT Intens EMPTY> <!-- intensive sub-heading begin (only once) -->
<!ELEMENT sic EMPTY > <!-- Undocumented -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->

<!ATTLIST divm type CDATA #REQUIRED > <!--e = English, g=Greek, n=Number  -->
<!ATTLIST divm n CDATA #REQUIRED > <!-- name of section  -->
<!ATTLIST gram n CDATA #REQUIRED > <!-- name of grammatical type -->
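
As an illustration of the structure this DTD describes, the sketch below reads one H1 record with Python's standard xml.etree.ElementTree. The sample record in the usage note is made up to fit the DTD (its text, L, and pc values are not taken from pw.xml), and the function name is hypothetical. ElementTree does not validate against the DTD.

```python
import xml.etree.ElementTree as ET


def read_entry(xml_record):
    """Extract head, literary sources, and tail fields from one <H1> record."""
    h1 = ET.fromstring(xml_record)
    h = h1.find("h")
    body = h1.find("body")
    tail = h1.find("tail")
    return {
        "key1": h.findtext("key1"),
        "key2": h.findtext("key2"),
        "hom": h.findtext("hom"),          # None when <hom> is absent
        "sources": [ls.text for ls in body.iter("ls")],
        "L": tail.findtext("L"),
        "pc": tail.findtext("pc"),
    }
```

A made-up record for trying this out:

```python
sample = (
    '<H1><h><key1>a</key1><key2>a</key2><hom>1</hom></h>'
    '<body><divm type="n" n="1"/> <i>Beispiel</i> <ls>RV.</ls></body>'
    '<tail><L>1</L><pc>1-001-a</pc></tail></H1>'
)
entry = read_entry(sample)
```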