VCP Vacaspatya (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is the file vcp_orig_cp1252.txt, which uses the cp1252 encoding and is best viewed in a text editor that supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding system. The internet reference http://www.cp1252.com/ describes this encoding.
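As an alternative to switching the editor's coding system, the file can be decoded in a script. The following is a minimal sketch, assuming Python 3; the function names and the output filename in the usage comment are hypothetical and not part of the project's tooling.

```python
def decode_cp1252(raw: bytes) -> str:
    """Decode raw bytes that were saved in the cp1252 encoding."""
    return raw.decode("cp1252")


def convert_file_to_utf8(src: str, dst: str) -> int:
    """Read src as cp1252, write it to dst as UTF-8, return the character count."""
    with open(src, "rb") as f:
        text = decode_cp1252(f.read())
    # newline="" keeps the original line endings unchanged.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return len(text)


# Hypothetical usage:
# convert_file_to_utf8("vcp_orig_cp1252.txt", "vcp_orig_utf8.txt")
```

The byte-level helper is kept separate so the decoding step can be checked on a short byte string without touching the large file.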

There are several extended ASCII characters in vcp.txt, listed here with their Unicode code points and occurrence counts:

¦  (\u00a6) 48353 := BROKEN BAR
‘  (\u2018)   248 := LEFT SINGLE QUOTATION MARK
’  (\u2019)   249 := RIGHT SINGLE QUOTATION MARK
“  (\u201c) 76768 := LEFT DOUBLE QUOTATION MARK
”  (\u201d) 76493 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)     2 := DOUBLE LOW-9 QUOTATION MARK
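A table of this kind can be regenerated by counting every character outside the 7-bit ASCII range. The sketch below is one possible way to do it, assuming the text has already been decoded to a Python string; the function names are hypothetical and this is not necessarily the script that produced the counts above.

```python
from collections import Counter
import unicodedata


def count_non_ascii(text: str) -> Counter:
    """Count each character whose code point lies above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def format_report(counts: Counter) -> list:
    """Return one report line per character, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"{ch}  (\\u{ord(ch):04x}) {n:5d} := {name}")
    return lines


# Hypothetical usage on a short made-up line:
# for line in format_report(count_non_ascii("<HI>{@a@}\u00a6 \u201cx\u201d")):
#     print(line)
```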

The {X...X} style of coding serves two purposes in vcp.txt:

{??} 4580  : Unreadable
       50  : Revised (Jul 2014), due to corrections
{@X@} 48365 : Headword coding (X in Harvard-Kyoto transliteration)

The <> style of coding is used as follows:

<HI>  51721  : At beginning of line, indicating start of headword
<>   355827  :  At beginning of other 'normal' lines
<H>    66  : At start of line. A 'headline' (various usage)
<P>  2331  : At start of line, indicating a paragraph indentation
<Picture> 71 : Diagrams in text
The rest are 'Column' indicators for textual tabular arrays:
<C1>  70  :
<C2>  70  :
<C3>  70  :
<C4>  70  :
<C5>  68  :
<C6>  66  :
<C7>  4  :
<C8>  4  :
<C9>  4  :
<C10>  4  :
<C11>  4  :
<C12>  4  :
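
Counts like those above can be approximated by tallying the tag that opens each line. This is a minimal sketch, assuming the input is an iterable of decoded lines; the pattern and function name are hypothetical, and tags that occur in the middle of a line are not counted by it.

```python
import re
from collections import Counter

# Matches a <...> tag at the very start of a line; the name may be empty, as in "<>".
LINE_TAG = re.compile(r"^<([A-Za-z0-9]*)>")


def count_line_tags(lines) -> Counter:
    """Tally the leading <...> tag of each line that has one."""
    counts = Counter()
    for line in lines:
        m = LINE_TAG.match(line)
        if m:
            counts["<%s>" % m.group(1)] += 1
    return counts
```

Running it over the lines of vcp.txt and printing `counts.most_common()` would give a quick consistency check against the table.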

Page breaks are coded as [Page...]. For example,
[Page0035-a+ 31] indicates page 35, column a (1st col.), with 31 lines in the column.
The pagination is continuous through all 6 volumes; the last dictionary
page is [Page5441-b+ 31].

A handful of page designations don’t follow this pattern:
[Page1595+ 37] a table, so not two columns
[Page1596+ 31] ditto
[Page2764+ 39] a table, but columns not marked (correct?)
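
Both the regular two-column form and the exceptional column-less form can be read with a single pattern. The following sketch assumes the markers appear exactly as shown above (four-digit page, optional lowercase column letter, then the line count); the regular expression and function name are hypothetical.

```python
import re

# Page number, optional "-<column letter>", "+", spaces, line count.
PAGE_MARK = re.compile(r"\[Page(\d+)(?:-([a-z]))?\+ +(\d+)\]")


def parse_page_marks(text: str) -> list:
    """Return (page, column or None, line_count) for each page marker in text."""
    marks = []
    for m in PAGE_MARK.finditer(text):
        page, column, nlines = m.groups()
        marks.append((int(page), column, int(nlines)))
    return marks
```

For a column-less marker such as [Page1595+ 37] the column is returned as None, which makes the exceptions easy to list.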

Headwords are coded in vcp.txt as:
^<HI>{@X@}¦ where X is the slp1 coding (of the Devanagari text)
(Note: the original digitization, vcp_orig_cp1252.txt, uses HK coding.)

Here is the regular expression used in Python programs to recognize headwords:
reHeadword = r'^<HI>{@(.*?)@}¦'
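
As an illustration of how this pattern might be applied, the sketch below collects the headword from every line that matches. The pattern is the one quoted above, written with plain ASCII quotes; the function name and the sample lines in the usage comment are made up for the example.

```python
import re

# The headword pattern from these notes; "{" and "}" are matched literally here.
reHeadword = r'^<HI>{@(.*?)@}¦'


def extract_headwords(lines) -> list:
    """Return the text between {@ and @} for each headword line."""
    headwords = []
    for line in lines:
        m = re.search(reHeadword, line)
        if m:
            headwords.append(m.group(1))
    return headwords


# Hypothetical usage with invented lines:
# extract_headwords(["<HI>{@aMSa@}¦ pu0", "<>continuation line"])
```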

The headwords are ordered according to Sanskrit alphabet ordering. However, about 2% of the identified headwords are out of alphabetical order.
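
One way to locate such out-of-order headwords is to compare each headword with its predecessor under a Sanskrit sort key. This is only a sketch: it assumes the headwords are in slp1, it assumes the conventional ordering of slp1 letters given in SLP1_ORDER, and the function names are hypothetical. It is not necessarily the procedure behind the 2% figure.

```python
# Assumed Sanskrit alphabetical order of slp1 letters (vowels, M, H, consonants).
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
_RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def slp1_key(word: str) -> list:
    """Sort key: the rank of each letter; unknown characters sort last."""
    return [_RANK.get(ch, len(SLP1_ORDER)) for ch in word]


def out_of_order(headwords) -> list:
    """Return (index, previous, current) wherever current sorts before previous."""
    problems = []
    for i in range(1, len(headwords)):
        if slp1_key(headwords[i]) < slp1_key(headwords[i - 1]):
            problems.append((i, headwords[i - 1], headwords[i]))
    return problems
```

A pairwise comparison of this kind flags the point where the sequence steps backwards; a single misplaced entry can therefore be reported once rather than for every later headword.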

The introduction is not coded in vcp.txt; it appears in the separate file vcp-preface.txt. A short ending section appears in the separate file vcp-end.txt.

Sanskrit in the text appears in Devanagari, and is coded in vcp.txt with the HK (Harvard-Kyoto) coding, with two variants:

MM codes chandra-bindu
D. (and D.h) codes 'nukta'. This '.' coding is left in vcp.txt, but
   removed in vcp.xml.

Headwords are also coded in HK.

There is no Anglicized Sanskrit coding in vcp.txt.

The scanned images in the pdfs vcp_title and vcp1_bookmark are from Thomas Malten; those in the remaining pdfs (vcp2_bookmark, ..., vcp6_bookmark) are almost all from http://archive.org downloads, as files vacaspatyam02tarkuoft.pdf, ..., vacaspatyam06tarkuoft.pdf. However, a few (12) pages missing from these last 5 archive.org pdfs were obtained from the high-resolution scans used for digitization by Malten.

DTD

vcp.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- skd.dtd
 Oct, 2013

-->
<!ELEMENT  vcp (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts " s  | lb |C | HI |P | H| Picture|edit" >
<!-- h element -->
<!ELEMENT h  (key1,key2)>
<!ELEMENT key1 (#PCDATA)>
<!ELEMENT key2 (#PCDATA)*>

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT C EMPTY > <!-- 'column' in table -->
<!ELEMENT HI EMPTY > <!-- begin headword -->
<!ELEMENT P EMPTY > <!-- begin 'paragraph', used irregularly -->
<!ELEMENT H EMPTY > <!-- a 'title' -->
<!ELEMENT lb EMPTY > <!-- line break -->
<!ELEMENT edit EMPTY > <!-- marks point where vcp.txt edited -->
<!ELEMENT s (#PCDATA  )*> <!-- Devanagari, in HK transliteration  -->
<!ELEMENT Picture EMPTY > <!-- text diagram -->
<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >
<!ATTLIST edit type CDATA #IMPLIED>
<!-- attributes  -->
<!ATTLIST C n (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12) #IMPLIED>
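
To show how a record shaped by this DTD might be read, the sketch below parses a small invented H1 entry with Python's standard xml.etree.ElementTree module. The sample record, including the contents of the L and pc elements, is made up for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# An invented record that follows the element structure of vcp.dtd.
SAMPLE = (
    "<vcp>"
    "<H1><h><key1>aMSa</key1><key2>aMSa</key2></h>"
    "<body><HI/><s>aMSa</s> pu0<lb/><s>BAge</s></body>"
    "<tail><L>1</L><pc>0001-a</pc></tail></H1>"
    "</vcp>"
)


def read_entries(xml_text: str) -> list:
    """Return one dict per H1 record with its keys, <s> texts, and tail fields."""
    root = ET.fromstring(xml_text)
    entries = []
    for h1 in root.iter("H1"):
        entries.append({
            "key1": h1.findtext("h/key1"),
            "key2": h1.findtext("h/key2"),
            "sanskrit": [s.text or "" for s in h1.findall("body/s")],
            "L": h1.findtext("tail/L"),
            "pc": h1.findtext("tail/pc"),
        })
    return entries
```

For the full vcp.xml, an incremental parser such as ET.iterparse may be preferable to loading the whole file at once.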