PWG Böhtlingk and Roth Grosses Petersburger Wörterbuch (Developer notes)

Date of digitization: 2006

Metadata

The original digitization is file pwg_orig.txt; it is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

pwg_orig_utf8.txt is the UTF-8 encoding of the original file. pwg_orig_utf8.txt is decomposed into three parts:

pwg0.txt = the body of the dictionary
pwgheader.txt = title page, etc.
pwgdel.txt = some erroneously included lines
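
The re-encoding from cp1252 to UTF-8 can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the digitization; the function name cp1252_to_utf8 is hypothetical. Note that Python's cp1252 codec raises an error on the few byte values that Windows-1252 leaves undefined.

```python
from pathlib import Path


def cp1252_to_utf8(src, dst):
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # Work on bytes so that line endings are passed through unchanged.
    text = Path(src).read_bytes().decode("cp1252")
    Path(dst).write_bytes(text.encode("utf-8"))


# Example call, assuming pwg_orig.txt is in the current directory:
# cp1252_to_utf8("pwg_orig.txt", "pwg_orig_utf8.txt")
```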

pwg.txt is constructed by incorporating various editing changes to pwg0.txt. As of this writing, the editing changes are in the files change_01.txt and change_02.txt.

There are several non-ASCII characters in pwg.txt, listed here with their code point and number of occurrences:

   (\u00a0)     5 := NO-BREAK SPACE
¤  (\u00a4) 827943 := CURRENCY SIGN ¯{¤X¤} == <ls>X</ls>
¦  (\u00a6) 122744 := BROKEN BAR (ends headword)
§  (\u00a7)    14 := SECTION SIGN
ª  (\u00aa) 190685 := FEMININE ORDINAL INDICATOR
       ª = udAtta accent  = raised devanagari 'u' (text) = '/' in slp1
       ªª = svarita accent = vertical bar above (text) = '^' in slp1
«  (\u00ab)    59 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
¯  (\u00af) 413971 := MACRON  See 'CURRENCY SIGN' above
°  (\u00b0) 83348 := DEGREE SIGN
±  (\u00b1)     1 := PLUS-MINUS SIGN
²  (\u00b2) 23976 := SUPERSCRIPT TWO (English subheading = ²a) ²b) etc.)
³  (\u00b3) 64369 := SUPERSCRIPT THREE (Number subhead = ³1) ³2) etc.)
´  (\u00b4)     1 := ACUTE ACCENT
¸  (\u00b8) 102935 := CEDILLA
           ¸ = anudAtta accent = horizontal bar below (text) = '\' in slp1
¹  (\u00b9)  1501 := SUPERSCRIPT ONE (Greek subhead = ¹a) ¹b) etc.)
º  (\u00ba)     4 := MASCULINE ORDINAL INDICATOR
»  (\u00bb)    59 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
¼  (\u00bc)     2 := VULGAR FRACTION ONE QUARTER
½  (\u00bd)    10 := VULGAR FRACTION ONE HALF
Ä  (\u00c4)     1 := LATIN CAPITAL LETTER A WITH DIAERESIS
Ç  (\u00c7)   627 := LATIN CAPITAL LETTER C WITH CEDILLA
Ö  (\u00d6)   127 := LATIN CAPITAL LETTER O WITH DIAERESIS
Ü  (\u00dc)   358 := LATIN CAPITAL LETTER U WITH DIAERESIS
à  (\u00e0)     2 := LATIN SMALL LETTER A WITH GRAVE
ä  (\u00e4) 28789 := LATIN SMALL LETTER A WITH DIAERESIS
é  (\u00e9)     2 := LATIN SMALL LETTER E WITH ACUTE
ë  (\u00eb)     6 := LATIN SMALL LETTER E WITH DIAERESIS
ö  (\u00f6) 16974 := LATIN SMALL LETTER O WITH DIAERESIS
ü  (\u00fc) 40981 := LATIN SMALL LETTER U WITH DIAERESIS
“  (\u201c)     3 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)    13 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)    11 := DOUBLE LOW-9 QUOTATION MARK
†  (\u2020) 51869 := DAGGER †{X} codes wide-spacing in X.
‡  (\u2021)     1 := DOUBLE DAGGER
•  (\u2022) 124338 := BULLET
…  (\u2026) 2896355 := HORIZONTAL ELLIPSIS
    used in place of a space to mark 'connected' text, e.g., literary citations
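
A table like the one above can be regenerated from the text with the standard library. The following is a minimal sketch, not the tool that produced the counts shown here; count_non_ascii is a hypothetical name, and the input is assumed to be already decoded to a Python string.

```python
import unicodedata
from collections import Counter


def count_non_ascii(text):
    """Return (char, code point, count, Unicode name) rows for non-ASCII chars."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    rows = []
    for ch, n in sorted(counts.items()):
        # Fall back to "UNKNOWN" for code points without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, "\\u%04x" % ord(ch), n, name))
    return rows
```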

The {X...X} style of coding serves several purposes:

{#...#}  450077  : devanagari text, coded as HK with some extensions:
  MM = candra-bindu
  ª  = udAtta accent
  ªª = svarita accent
  ¸  = anudAtta accent
  |  = danda
{%.. %}  36609 : italic text
{??}, {?}  385 : unreadable text
{@..@}  55913 : bold
¯{¤X¤}  827943 X is literary citation
†{X} 51869  means wide-spacing in X
{Ç}  623   Significance unclear. Sometimes, {Ç}{?} indicates a metric pattern
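
As an illustration of how these codes might be turned into tags, here is a small Python sketch. Only the ¯{¤X¤} to <ls>X</ls> correspondence and the accent equivalents are stated in these notes; the other tag names are borrowed from the DTD as an assumption, the function names are hypothetical, and nested braces are not handled. The content of {#...#} is left in HK; no transliteration is attempted.

```python
import re


def accents_to_slp1(hk):
    """Map the accent signs to their slp1 equivalents."""
    # Replace the doubled sign first so 'ªª' is not read as two udAttas.
    return hk.replace("ªª", "^").replace("ª", "/").replace("¸", "\\")


# (pattern, replacement) pairs; tag names other than <ls> are assumptions.
_RULES = [
    (re.compile(r"\{#(.*?)#\}"), r"<s>\1</s>"),       # devanagari text
    (re.compile(r"\{%(.*?)%\}"), r"<i>\1</i>"),       # italic text
    (re.compile(r"¯\{¤(.*?)¤\}"), r"<ls>\1</ls>"),    # literary citation
    (re.compile(r"†\{(.*?)\}"), r"<wide>\1</wide>"),  # wide spacing
]


def braces_to_tags(line):
    """Rewrite the brace codes of one line as XML-style tags."""
    for pattern, repl in _RULES:
        line = pattern.sub(repl, line)
    return line
```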

The following <x> type tags are found in pwg.txt:

<A></A> Arabic
<R></R> Russian
<g>X</g> Greek (non-standard coding system adapted from AS, with a=alpha, etc.)
<F>X</F> Footnote
<H1> introduces a headword
-<P>- {#X#} indicates a paragraph; typically a prefix form for a verb.
<H>  76  : a page header, as for beginning of a letter, as <H>{#A#}
<?>  5  : Scan unclear
<NV></NV> 21 times.  incomplete integration of text improvement sections
<VN></VN> 3 times.  Similar
<sic> 1 : undocumented
<UL>  6 : undocumented

Page breaks are coded as [Page0v.xxxx], where v = 1-7 is the volume and xxxx is
the zero-filled page number from the scan. Pages are numbered consecutively
within each volume. Each scan page contains two columns, which are given
individual page numbers. Thus, Page01.0001 and Page01.0002 appear on the first
scan page, etc.
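
A page-break marker can be picked apart with a regular expression, as in this sketch. The function name is hypothetical, and the scan-page calculation assumes that the two-columns-per-scan pattern of the example holds throughout a volume.

```python
import re

# One digit for the volume (1-7), four digits for the column/page number.
PAGE_RE = re.compile(r"\[Page0([1-7])\.(\d{4})\]")


def parse_page_break(marker):
    """Return (volume, page, scan_page), or None if marker is not a page break."""
    m = PAGE_RE.fullmatch(marker)
    if m is None:
        return None
    volume = int(m.group(1))
    page = int(m.group(2))
    # Pages 1 and 2 share scan page 1, pages 3 and 4 share scan page 2, etc.
    scan_page = (page + 1) // 2
    return volume, page, scan_page
```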

The lines of the digitization generally represent ‘sections’ of the text; the actual line-breaks of the text are not coded.

Headwords are coded as <H1>000{X}1{Y}¦ where X and Y are
in Harvard-Kyoto transliteration, and X is normalized to remove such things
as accents.
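
For example, a headword line could be recognized as follows. This sketch assumes that every headword follows the literal pattern given above; parse_headword is a hypothetical name and the sample line in the usage note is invented.

```python
import re

# <H1>000{X}1{Y}¦ with X = normalized key and Y = form as printed.
HEADWORD_RE = re.compile(r"<H1>000\{(?P<key>[^}]*)\}1\{(?P<form>[^}]*)\}¦")


def parse_headword(line):
    """Return (X, Y) for a headword line, or None for any other line."""
    m = HEADWORD_RE.match(line)
    if m is None:
        return None
    return m.group("key"), m.group("form")
```

With the invented line "<H1>000{agni}1{agniª}¦ m.", the function returns the pair ("agni", "agniª").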

The headwords are ordered according to Sanskrit alphabet ordering.
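
A sort key for Harvard-Kyoto strings in Sanskrit alphabetical order might look like the sketch below. It is a simplification and not the ordering routine used for the dictionary: the letter list is the usual HK inventory, multi-letter tokens are matched greedily (so a sequence such as t followed by h is read as th), and any refinements in how the dictionary orders anusvara or visarga are ignored.

```python
# Usual HK inventory in Sanskrit alphabetical order (an assumption, see above).
HK_ORDER = [
    "a", "A", "i", "I", "u", "U", "R", "RR", "lR", "lRR",
    "e", "ai", "o", "au", "M", "H",
    "k", "kh", "g", "gh", "G",
    "c", "ch", "j", "jh", "J",
    "T", "Th", "D", "Dh", "N",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v", "z", "S", "s", "h",
]
RANK = {tok: i for i, tok in enumerate(HK_ORDER)}
# Try longer tokens first so that 'kh' is not split into 'k' + 'h'.
TOKENS_LONGEST_FIRST = sorted(HK_ORDER, key=len, reverse=True)


def hk_sort_key(word):
    """Return a list of ranks usable as a sort key for an HK string."""
    key = []
    i = 0
    while i < len(word):
        for tok in TOKENS_LONGEST_FIRST:
            if word.startswith(tok, i):
                key.append(RANK[tok])
                i += len(tok)
                break
        else:
            # Unknown character: sort it after all known letters.
            key.append(len(HK_ORDER))
            i += 1
    return key
```

A list of headwords can then be ordered with sorted(words, key=hk_sort_key).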

Volume 1 contains words a-au (words beginning with ‘a’ to words beginning ‘au’)
Volume 2 k-ch
Volume 3 j-dh
Volume 4 n-ph
Volume 5 b-m
Volume 6 y-v
Volume 7 z-h; then, from page 7.1685 to the end, a section of
“Verbesserungen und Nachträge zum ganzen Werke”
(“Corrections and additions to the whole work”).

Sanskrit in the text appears mostly in devanagari and is coded as {#X#}. Some words are coded with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses a Latin letter x (a-z, A-Z), possibly followed by a number; in the general scheme, the letter-number combinations are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut (diaeresis)
x10 = circumflex (hat)
x11 = grave accent

This coding appears notably in the literary citations of the text, but occasionally is used to code Greek and French.
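
The general scheme can be applied mechanically by appending a Unicode combining mark and normalizing, as in this sketch. The choice of combining marks is an assumption based on the names in the list above, and as_to_unicode is a hypothetical name. The sketch covers only the general scheme and not the PWG-specific readings tabulated in these notes (for instance C2 as Ç), and it will misread any letter-digit sequence that is not meant as AS.

```python
import re
import unicodedata

# AS number -> Unicode combining mark (assumed from the names in the list).
AS_MARKS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # line below
    "7": "\u0308",   # diaeresis
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}
# Try the two-digit codes 10 and 11 before the one-digit codes.
AS_RE = re.compile(r"([A-Za-z])(1[01]|[1-7])")


def as_to_unicode(text):
    """Replace AS letter-number pairs by precomposed Unicode where possible."""
    def repl(m):
        combined = m.group(1) + AS_MARKS[m.group(2)]
        return unicodedata.normalize("NFC", combined)
    return AS_RE.sub(repl, text)
```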

Here are the characters that occur in pwg.txt in this coding, with their approximate frequency:

A1     7 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1   253 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
A10 135212 := Â  (\u00c2)  LATIN CAPITAL LETTER A WITH CIRCUMFLEX
a10 17059 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a11     3 := à  (\u00e0)  LATIN SMALL LETTER A WITH GRAVE
A2     9 := Ạ  (\u1ea0)  LATIN CAPITAL LETTER A WITH DOT BELOW
a2     3 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
A3    40 := À  (\u00c0) LATIN CAPITAL LETTER A WITH GRAVE
a3     3 := à (\u00e0) LATIN SMALL LETTER A WITH GRAVE
A4    12 := Á  (\u00c1) LATIN CAPITAL LETTER A WITH ACUTE
a4   126 := á  (\u00e1) LATIN SMALL LETTER A WITH ACUTE
a5     1 := ã   (\u00e3) LATIN SMALL LETTER A WITH TILDE
a7     1 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
B2     2 := Ḅ  (\u1e04)  LATIN CAPITAL LETTER B WITH DOT BELOW
C2 90690 := Ç  (\u00c7)  LATIN CAPITAL LETTER C WITH CEDILLA
c2  4225 := ç  (\u00e7)  LATIN SMALL LETTER C WITH CEDILLA
c4     2 := ć  (\u0107)  LATIN SMALL LETTER C WITH ACUTE
d2   900 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2  1794 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
d3     1 := ḋ  (\u1e0b)  LATIN SMALL LETTER D WITH DOT ABOVE
D3     1 := Ḋ  (\u1e0a)  LATIN CAPITAL LETTER D WITH DOT ABOVE
d4     1 := d' (\u0064\u0301)      Latin small letter d with acute (in Greek)
D4     1 := D'  (\u0044\u0301) latin capital letter D with acute
d6     3 := ḏ  (\u1e0f)  LATIN SMALL LETTER D WITH LINE BELOW
e10    21 := ê  (\u00ea)  LATIN SMALL LETTER E WITH CIRCUMFLEX
E10     2 := Ê  (\u00ca)  LATIN CAPITAL LETTER E WITH CIRCUMFLEX
E2     1 := Ẹ  (\u1eb8)  LATIN CAPITAL LETTER E WITH DOT BELOW
e3     9 := ė  (\u0117)  LATIN SMALL LETTER E WITH DOT ABOVE
E4     1 := É  (\u00c9)  LATIN CAPITAL LETTER E WITH ACUTE
e4   409 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
E7     1 := Ë   (\u00cb)  LATIN CAPITAL LETTER E WITH DIAERESIS
F4     1 := F' (\u0046\u0301) LATIN CAPITAL LETTER F WITH ACUTE
G1     9 := Ḡ  (\u1e20)  LATIN CAPITAL LETTER G WITH MACRON
G10    10 := Ĝ  (\u011c)  LATIN CAPITAL LETTER G WITH CIRCUMFLEX
g10     1 := ĝ  (\u011d)  LATIN SMALL LETTER G WITH CIRCUMFLEX
g3     1 := ġ  (\u0121)  LATIN SMALL LETTER G WITH DOT ABOVE
G4 24815 := Ǵ  (\u01f4)  LATIN CAPITAL LETTER G WITH ACUTE
g4  1567 := ǵ  (\u01f5)  LATIN SMALL LETTER G WITH ACUTE
H10     1 := Ĥ  (\u0124)  LATIN CAPITAL LETTER H WITH CIRCUMFLEX
H2     3 := Ḥ  (\u1e24)  LATIN CAPITAL LETTER H WITH DOT BELOW
h2   110 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
h4    47 := h' (\u0068\u0301) latin small letter h with acute
h5     6 :=  h5  (greek, eta with tilde)
H7     1 := NO DESCRIPTION
h7     1 := ḧ  (\u1e27) LATIN SMALL LETTER H WITH DIAERESIS (in Greek)
i1     5 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
I1     2 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
I10  6041 := Î  (\u00ce)  LATIN CAPITAL LETTER I WITH CIRCUMFLEX
i10  3413 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
I3    21 := İ  (\u0130)  LATIN CAPITAL LETTER I WITH DOT ABOVE
I4     1 := Í (\u00cd)  LATIN CAPITAL LETTER I WITH ACUTE
i4    56 := í (\u00ed) LATIN SMALL LETTER I WITH ACUTE
I5     1 := Ĩ  (\u0128)  LATIN CAPITAL LETTER I WITH TILDE
i5     8 := ĩ  (\u0129)  LATIN SMALL LETTER I WITH TILDE
i7     4 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
K2     2 := Ḳ  (\u1e32)  LATIN CAPITAL LETTER K WITH DOT BELOW
k2     2 := ḳ  (\u1e33)  LATIN SMALL LETTER K WITH DOT BELOW
K4 20461 := Ḱ  (\u1e30)  LATIN CAPITAL LETTER K WITH ACUTE
k4   690 := ḱ  (\u1e31)  LATIN SMALL LETTER K WITH ACUTE
l2     6 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
L2     1 := Ḷ  (\u1e36)  LATIN CAPITAL LETTER L WITH DOT BELOW
M1    70 :=  Not AS, In Devanagari text
M2    53 := Ṃ  (\u1e42)  LATIN CAPITAL LETTER M WITH DOT BELOW
m2     2 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
M3     5 := Ṁ  (\u1e40)  LATIN CAPITAL LETTER M WITH DOT ABOVE
m3     1 := ṁ  (\u1e41)  LATIN SMALL LETTER M WITH DOT ABOVE
M4     4 := Ḿ  (\u1e3e)  LATIN CAPITAL LETTER M WITH ACUTE
M5  3378 := M̄ (\u004d\u0304) LATIN CAPITAL LETTER M + COMBINING MACRON (the AS digit 5 denotes a tilde)
m5   802 := m̄ (\u006d\u0304) LATIN SMALL LETTER M + COMBINING MACRON (the AS digit 5 denotes a tilde), not represented
N2  4695 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2 12297 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3     2 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
N4 14496 := Ń  (\u0143)  LATIN CAPITAL LETTER N WITH ACUTE
n4   365 := ń  (\u0144)  LATIN SMALL LETTER N WITH ACUTE
N5  4254 := Ñ  (\u00d1)  LATIN CAPITAL LETTER N WITH TILDE
n5   738 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o10     6 := ô  (\u00f4)  LATIN SMALL LETTER O WITH CIRCUMFLEX
O10     2 := Ô  (\u00d4)  LATIN CAPITAL LETTER O WITH CIRCUMFLEX
o3    21 :=  Not AS;  In Devanagari text
o4    64 := ó  (\u00f3) LATIN SMALL LETTER O and udatta accent
p2     1 := ?  latin small letter p with dot below - smudge?
p3     1 := ṗ  (\u1e57)  LATIN SMALL LETTER P WITH DOT ABOVE
R1     5 :=  Not AS;  In Devanagari text
R2 39466 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  1964 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
R4     2 := Ŕ  (\u0154)  LATIN CAPITAL LETTER R WITH ACUTE
r4     3 := ŕ  (\u0155)  LATIN SMALL LETTER R WITH ACUTE
S2     3 := Ṣ  (\u1e62)  LATIN CAPITAL LETTER S WITH DOT BELOW
s3     1 := ṡ  (\u1e61)  LATIN SMALL LETTER S WITH DOT ABOVE
S4     1 := Ś  (\u015a)  LATIN CAPITAL LETTER S WITH ACUTE
s4     4 := ś  (\u015b)  LATIN SMALL LETTER S WITH ACUTE
t2  1394 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
T2  9749 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t4     5 := t' (\u0074\u0301) latin small letter t with acute
u0     4 := u  In Devanagari after 'a', a-u diaeresis.  Addition to HK coding
u1    20 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
U10  2117 := Û  (\u00db)  LATIN CAPITAL LETTER U WITH CIRCUMFLEX
u10  1156 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
u11     1 := ù  (\u00f9)  LATIN SMALL LETTER U WITH GRAVE
U2     1 := Ụ  (\u1ee4)  LATIN CAPITAL LETTER U WITH DOT BELOW (possible smudge in pwg)
u2     1 := ụ  (\u1ee5)  LATIN SMALL LETTER U WITH DOT BELOW
U3     4 := Not AS;  In Devanagari text
u3     4 := ù (\u00f9) LATIN SMALL LETTER U WITH GRAVE
u4    58 := ú (\u00fa) LATIN SMALL LETTER U WITH ACUTE
u5     9 := ũ  (\u0169)  LATIN SMALL LETTER U WITH TILDE
u7     3 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
w4    12 := ẃ (\u1e83) LATIN SMALL LETTER W WITH ACUTE (in Greek)
w5     4 :=  w5  (greek, omega with tilde)

There are several differences between how the AS coding of PWG represents features of Sanskrit and how the AS coding of the Monier-Williams dictionary of 1899 does. For instance, long vowels appear with a circumflex in PWG, but with a macron in MW. A full description of these differences has not yet been compiled.

In the course of making the above table, several instances were found where a letter-number combination of the digitization should not be interpreted and displayed as in the table. Some of these were corrected in the digitization by inserting a space between the letter and the digit; it is likely there are other such changes that should be made to improve the digitization.
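
One way to look for further candidates is to tally letter-number combinations outside the devanagari spans, where such combinations are not AS. The sketch below is a rough aid under that assumption, not the procedure that produced the table; count_as_codes is a hypothetical name, and rare codes in the result would still need to be checked by hand against the scan.

```python
import re
from collections import Counter

# A letter followed by one of the AS numbers (10 and 11 tried first).
AS_RE = re.compile(r"[A-Za-z](?:1[01]|[1-7])")
# Devanagari spans {#...#} are coded in HK, so they are skipped.
DEVA_RE = re.compile(r"\{#.*?#\}")


def count_as_codes(text):
    """Count letter-number combinations found outside {#...#} spans."""
    outside = DEVA_RE.sub(" ", text)
    return Counter(AS_RE.findall(outside))
```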

DTD

pwg.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- pwg.dtd
 Nov 22, 2013

-->
<!ELEMENT  pwg (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "br | g |  i |ls |P |s | lang |  UL | H | divm | gram
 | F | wide | NV | VN" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in slp1 -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in pwg.txt -->
<!ELEMENT P EMPTY > <!-- 'Paragraph' marker (pre-verb prefix) -->
<!ELEMENT sic EMPTY > <!-- Undocumented -->
<!ELEMENT UL EMPTY > <!-- Undocumented -->
<!ELEMENT NV (#PCDATA | %body_elts; )* > <!-- Undocumented -->
<!ELEMENT VN (#PCDATA | %body_elts; )* > <!-- Undocumented -->
<!ELEMENT H EMPTY > <!-- Mark a letter break ? -->
<!ELEMENT divm (#PCDATA ) > <!-- section marker -->
<!ELEMENT gram (#PCDATA ) > <!-- Grammatical category -->
<!ELEMENT lang (#PCDATA)> <!-- Various foreign languages -->
<!ELEMENT F (#PCDATA | %body_elts;)*> <!-- Footnote -->
<!ELEMENT wide (#PCDATA | i)*> <!-- text with wide spacing  -->
<!ELEMENT g (#PCDATA)> <!-- Greek  -->
<!ELEMENT i (#PCDATA | wide | sic)*> <!-- italic  -->
<!ELEMENT ls (#PCDATA | i | wide | lang | s)*> <!-- literary sources -->
<!ELEMENT s (#PCDATA)> <!-- Sanskrit devanagari, in slp1 transliteration  -->
<!ELEMENT hom (#PCDATA)> <!-- homonym -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- unused -->
<!ELEMENT b (#PCDATA | g | br)*> <!-- bold  -->

<!-- attributes  -->
<!ATTLIST lang n CDATA #REQUIRED > <!-- Arabic, Russian, Greek -->
<!ATTLIST divm type CDATA #REQUIRED > <!--e = English, g=Greek, n=Number  -->
<!ATTLIST divm n CDATA #REQUIRED > <!-- name of section  -->
<!ATTLIST gram n CDATA #REQUIRED > <!-- name of grammatical type -->
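
A document shaped like this DTD can be read with a generic XML parser. The sketch below uses Python's xml.etree.ElementTree, which does not validate against the DTD. The sample record is invented for illustration, including the format of the L and pc values, and summarize is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented sample record that follows the element structure of the DTD.
SAMPLE = (
    "<pwg>"
    "<H1><h><key1>agni</key1><key2>agni/</key2></h>"
    "<body><s>agni/</s> <gram n=\"m.\">m.</gram> "
    "<i>Feuer</i> <ls>RV. 1,1,1.</ls></body>"
    "<tail><L>1</L><pc>1-0001</pc></tail></H1>"
    "</pwg>"
)


def summarize(xml_text):
    """Return key1, L, pc and the literary sources of each H1 record."""
    root = ET.fromstring(xml_text)
    out = []
    for entry in root.iter("H1"):
        out.append({
            "key1": entry.findtext("h/key1"),
            "L": entry.findtext("tail/L"),
            "pc": entry.findtext("tail/pc"),
            # itertext() also collects text inside child elements of <ls>.
            "sources": ["".join(ls.itertext()) for ls in entry.iter("ls")],
        })
    return out
```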