ACC Catalogus Catalogorum (Developer notes)

Date of digitization: 2013

Metadata

The original digitization is file acc_orig.txt, which is coded in the cp1252 (windows 1252) encoding, and is best viewed in a text editor which supports this encoding. For example, in Emacs, one may use the command revert-buffer-with-coding-system and then select cp1252 as the coding. The internet reference http://www.cp1252.com/ describes this coding system.

The file acc_orig_utf8.txt is a conversion of acc_orig.txt to the more common utf-8 encoding. The file acc.txt is also in the utf-8 encoding, and incorporates various editing changes, such as corrections of typographical errors.
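
The conversion from cp1252 to utf-8 described above can be reproduced in a few lines. This is only a sketch of one way to do it, not necessarily how acc_orig_utf8.txt was actually produced; the function name is hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    # Work on raw bytes so that line endings are passed through unchanged.
    # Python's cp1252 codec raises UnicodeDecodeError on the five byte
    # values that Windows-1252 leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d).
    text = src.read_bytes().decode("cp1252")
    dst.write_bytes(text.encode("utf-8"))
```

A call such as convert_cp1252_to_utf8(Path("acc_orig.txt"), Path("acc_orig_utf8.txt")) would then write the re-encoded copy.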

Several non-ASCII characters occur in acc.txt. Each row below gives the character, its Unicode code point, its number of occurrences, and its Unicode name:

¤  (\u00a4)  9302 := CURRENCY SIGN
¦  (\u00a6) 48869 := BROKEN BAR
§  (\u00a7)     1 := SECTION SIGN
«  (\u00ab)     1 := LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
¯  (\u00af)     1 := MACRON
º  (\u00ba)    44 := MASCULINE ORDINAL INDICATOR
»  (\u00bb)     1 := RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
Ç  (\u00c7)  7682 := LATIN CAPITAL LETTER C WITH CEDILLA
ç  (\u00e7) 15152 := LATIN SMALL LETTER C WITH CEDILLA
‘  (\u2018)    91 := LEFT SINGLE QUOTATION MARK
’  (\u2019)    91 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)     2 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)     3 := RIGHT DOUBLE QUOTATION MARK
„  (\u201e)     1 := DOUBLE LOW-9 QUOTATION MARK
†  (\u2020)     2 := DAGGER
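
An inventory like the one above can be regenerated with the standard library. The following is a minimal sketch, assuming the file content has already been read into a string; the function name is hypothetical.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text: str) -> list[tuple[str, str, int, str]]:
    """Return (character, code point, count, Unicode name) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"\\u{ord(ch):04x}", n, unicodedata.name(ch, "UNKNOWN"))
        for ch, n in sorted(counts.items())  # sorted by code point
    ]
```

Applied to the text of acc.txt (read with encoding="utf-8"), this should yield rows comparable to the table, although the counts will drift as the file is edited.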

The {X...X} style of coding serves several purposes:

{#X#}    49906 : Devanagari text, coded in HK (Harvard-Kyoto) transliteration
{%X%}    1461 : italic text
{??}     22 : unreadable text
{@X@}    1 : in a header
{|X|}    1 : widely spaced text (preface material)
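
The brace markup can be picked out with a single regular expression. This sketch assumes that a marked span never crosses a line boundary and that the delimiter characters do not occur inside X; the names MARKUP, KINDS and find_markup, and the sample line in the test, are illustrative only.

```python
import re

# {cXc} where c is one of the delimiter characters listed above.
# {??} (unreadable text) encloses no text and is deliberately not matched here.
MARKUP = re.compile(r"\{([#%@|])(.*?)\1\}")

KINDS = {
    "#": "devanagari (HK)",
    "%": "italic",
    "@": "header",
    "|": "widely spaced",
}


def find_markup(line: str) -> list[tuple[str, str]]:
    """Return (kind, enclosed text) for each {cXc} span in one line."""
    return [(KINDS[m.group(1)], m.group(2)) for m in MARKUP.finditer(line)]
```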

The following <x> type tags are found in acc.txt:

<HI>  53695  : Marks headwords
<>  30653  : start of new line
<P>  166  :  start of new line
<HI1>  26991  : start of new line
<H>  52  : Title line
<F>..</F>  24  : Footnote

Page breaks are coded as [Page...], more specifically as [PageV-PPP-C+ N] or [PageV-PPP+ N].

The [PageV-PPP-C+ N] form is the usual form; in this form,

V is the volume (1, 2 or 3)
PPP is the page within the volume
C is the column, usually ‘a’ or ‘b’; sometimes ‘a1’ or ‘b1’
N is usually a number with 1 or 2 digits (the number of lines in the following column); occasionally it is ‘No’ (in Corrections sections)

The [PageV-PPP+ N] form is used for title and preface pages.
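
The two page-break forms can be parsed with one pattern in which the column part is optional. This is a sketch only: the exact character set of PPP and C is an assumption (anything other than '-', '+', ']' and whitespace is accepted), and the names PageBreak, PAGE_RE and parse_page_breaks are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class PageBreak(NamedTuple):
    volume: int            # V
    page: str              # PPP, kept as text to preserve leading zeros
    column: Optional[str]  # C, or None for the [PageV-PPP+ N] form
    n: str                 # N: usually a line count, occasionally 'No'


PAGE_RE = re.compile(
    r"\[Page(\d)-([^-+\]\s]+)(?:-([^-+\]\s]+))?\+ ([^\]\s]+)\]"
)


def parse_page_breaks(line: str) -> list[PageBreak]:
    """Return every page-break marker found in one line of acc.txt."""
    return [
        PageBreak(int(m.group(1)), m.group(2), m.group(3), m.group(4))
        for m in PAGE_RE.finditer(line)
    ]
```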

The lines of the digitization represent lines of the text.

Headwords are coded in the general form <HI>{#X#}¦ where X is coded in Harvard-Kyoto transliteration.
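
A headword line of this general form can be recognized as sketched here. The sketch assumes that <HI> starts the line and that the broken bar immediately follows the closing #}; lines that deviate from the general form are simply reported as None. The names and the sample lines in the test are hypothetical.

```python
import re
from typing import Optional

# <HI>{#X#}¦ at the start of a line; X is the HK-coded headword.
HEADWORD_RE = re.compile(r"<HI>\{#(.+?)#\}\u00a6")


def headword(line: str) -> Optional[str]:
    """Return the HK headword of a <HI> line, or None if the line has none."""
    m = HEADWORD_RE.match(line)
    return m.group(1) if m else None
```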

The headwords are ordered according to Sanskrit alphabet ordering.
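
Since HK uses upper-case letters and digraphs, plain string sorting does not reproduce Sanskrit alphabetical order. The sort key below is one possible sketch: it splits an HK string into sounds by longest match and ranks them in the conventional varnamala order. The token list (for example lR for vocalic l) is an assumption about the HK variant in use, and the placement of anusvara and visarga in the printed dictionary may follow finer rules than this simple ranking.

```python
import re

# HK sounds in conventional Sanskrit alphabetical order (assumed variant).
HK_ORDER = [
    "a", "A", "i", "I", "u", "U", "R", "RR", "lR", "lRR",
    "e", "ai", "o", "au", "M", "H",
    "k", "kh", "g", "gh", "G",
    "c", "ch", "j", "jh", "J",
    "T", "Th", "D", "Dh", "N",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v", "z", "S", "s", "h",
]
RANK = {tok: i for i, tok in enumerate(HK_ORDER)}

# Longest tokens first so that "kh" wins over "k"; "." catches anything else.
TOKEN_RE = re.compile(
    "|".join(sorted(map(re.escape, HK_ORDER), key=len, reverse=True)) + "|.",
    re.DOTALL,
)


def hk_sort_key(word: str) -> list[int]:
    """Map an HK string to a list of ranks; unknown characters sort last."""
    return [RANK.get(tok, len(HK_ORDER)) for tok in TOKEN_RE.findall(word)]
```

Usage would be sorted(headwords, key=hk_sort_key); for instance kRSNa then sorts after kavi and before khaga, which plain string comparison gets wrong.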

Sanskrit in the text appears in the European Indological form, which is coded in acc.txt with the AS (Anglicized Sanskrit) coding. The general AS scheme, as described in CDSL.pdf, uses Latin alphabetical letters x (a-z, A-Z), possibly with suffixed numbers; the letter-number combinations in the general scheme are:

x1 = macron
x2 = dot below
x3 = dot above
x4 = acute accent
x5 = tilde
x6 = dash below
x7 = umlaut
x10 = circumflex (hat)
x11 = grave accent

Here are the characters that occur in acc.txt in this coding, with their approximate frequency:

A1  2636 := Ā  (\u0100)  LATIN CAPITAL LETTER A WITH MACRON
a1 98121 := ā  (\u0101)  LATIN SMALL LETTER A WITH MACRON
a10     1 := â  (\u00e2)  LATIN SMALL LETTER A WITH CIRCUMFLEX
a2     1 := ạ  (\u1ea1)  LATIN SMALL LETTER A WITH DOT BELOW
a4     1 := á  (\u00e1)  LATIN SMALL LETTER A WITH ACUTE (udatta accent)
a7    59 := ä  (\u00e4)  LATIN SMALL LETTER A WITH DIAERESIS
d2  4314 := ḍ  (\u1e0d)  LATIN SMALL LETTER D WITH DOT BELOW
D2    90 := Ḍ  (\u1e0c)  LATIN CAPITAL LETTER D WITH DOT BELOW
e4     4 := é  (\u00e9)  LATIN SMALL LETTER E WITH ACUTE
h2   581 := ḥ  (\u1e25)  LATIN SMALL LETTER H WITH DOT BELOW
I1 27135 := Ī  (\u012a)  LATIN CAPITAL LETTER I WITH MACRON
i1 22218 := ī  (\u012b)  LATIN SMALL LETTER I WITH MACRON
i10     4 := î  (\u00ee)  LATIN SMALL LETTER I WITH CIRCUMFLEX
i7     2 := ï  (\u00ef)  LATIN SMALL LETTER I WITH DIAERESIS
l2    16 := ḷ  (\u1e37)  LATIN SMALL LETTER L WITH DOT BELOW
m2  3723 := ṃ  (\u1e43)  LATIN SMALL LETTER M WITH DOT BELOW
n1  4290 := n̄  (\u006e\u0304)  LATIN SMALL LETTER N, COMBINING MACRON
N2     6 := Ṇ  (\u1e46)  LATIN CAPITAL LETTER N WITH DOT BELOW
n2 20345 := ṇ  (\u1e47)  LATIN SMALL LETTER N WITH DOT BELOW
n3  1945 := ṅ  (\u1e45)  LATIN SMALL LETTER N WITH DOT ABOVE
n5  3057 := ñ  (\u00f1)  LATIN SMALL LETTER N WITH TILDE
o1     3 := ō  (\u014d)  LATIN SMALL LETTER O WITH MACRON
o7    14 := ö  (\u00f6)  LATIN SMALL LETTER O WITH DIAERESIS
R2   256 := Ṛ  (\u1e5a)  LATIN CAPITAL LETTER R WITH DOT BELOW
r2  6440 := ṛ  (\u1e5b)  LATIN SMALL LETTER R WITH DOT BELOW
T2   146 := Ṭ  (\u1e6c)  LATIN CAPITAL LETTER T WITH DOT BELOW
t2 14610 := ṭ  (\u1e6d)  LATIN SMALL LETTER T WITH DOT BELOW
U1    21 := Ū  (\u016a)  LATIN CAPITAL LETTER U WITH MACRON
u1  6283 := ū  (\u016b)  LATIN SMALL LETTER U WITH MACRON
u10     1 := û  (\u00fb)  LATIN SMALL LETTER U WITH CIRCUMFLEX
U7     2 := Ü  (\u00dc)  LATIN CAPITAL LETTER U WITH DIAERESIS
u7   839 := ü  (\u00fc)  LATIN SMALL LETTER U WITH DIAERESIS
S6     1 := Not AS.
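
The letter-number scheme lends itself to a mechanical conversion: replace each number by the corresponding combining diacritic and let Unicode normalization pick the precomposed character where one exists. The following is a sketch under a strong assumption, namely that every letter directly followed by one of these numbers is AS coding; that does not hold everywhere (see the S6 row above), so real text needs additional context checks. The names are hypothetical.

```python
import re
import unicodedata

# AS number -> Unicode combining mark, following the general scheme.
AS_DIACRITICS = {
    "1": "\u0304",   # macron
    "2": "\u0323",   # dot below
    "3": "\u0307",   # dot above
    "4": "\u0301",   # acute
    "5": "\u0303",   # tilde
    "6": "\u0331",   # macron (dash) below
    "7": "\u0308",   # diaeresis (umlaut)
    "10": "\u0302",  # circumflex
    "11": "\u0300",  # grave
}

# Two-digit codes are tried first so that "a10" is not read as "a1" + "0".
AS_RE = re.compile(r"([A-Za-z])(10|11|[1-7])")


def as_to_unicode(text: str) -> str:
    """Convert AS letter-number codes to Unicode letters with diacritics."""
    def repl(m: re.Match) -> str:
        return m.group(1) + AS_DIACRITICS[m.group(2)]

    # NFC composes letter + mark where a precomposed character exists;
    # n1 stays as n followed by COMBINING MACRON, as in the table.
    return unicodedata.normalize("NFC", AS_RE.sub(repl, text))
```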

The palatal sibilants are represented in the text and digitization with the (extended-ascii) characters Ç and ç.
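
If output in the now more common IAST convention is wanted, the two cedilla characters can be mapped to Ś and ś. This is a sketch only, and it assumes that every Ç or ç in the processed text is a palatal sibilant; any genuine cedilla in non-Sanskrit words would be altered as well.

```python
# Map the cedilla characters used for the palatal sibilant to IAST.
PALATAL_TO_IAST = str.maketrans({"\u00c7": "\u015a", "\u00e7": "\u015b"})


def palatal_to_iast(text: str) -> str:
    """Replace Ç/ç by Ś/ś (hypothetical helper)."""
    return text.translate(PALATAL_TO_IAST)
```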

DTD

acc.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!-- acc.dtd
 June 10, 2014

-->
<!ELEMENT  acc (H1)*>
<!ELEMENT H1 (h,body,tail) >
<!ENTITY % body_elts "F |  br | H | b | i | s" >
<!-- h element -->
<!ELEMENT h  (key1,key2,hom?)>
<!ELEMENT key1 (#PCDATA) > <!-- in slp1 -->
<!ELEMENT key2 (#PCDATA )><!-- in AS -->

<!ELEMENT body (#PCDATA  | %body_elts;)*>
<!ELEMENT br EMPTY > <!-- line breaks in bhs.txt -->
<!ELEMENT F (#PCDATA | br)*> <!-- Footnote  -->
<!ELEMENT s (#PCDATA | br)*> <!-- Devanagari, in HK transliteration  -->
<!ELEMENT H EMPTY> <!--   -->
<!ELEMENT b (#PCDATA | br)*> <!-- bold  -->
<!ELEMENT i (#PCDATA | br )*> <!-- italic  -->

<!-- tail -->
<!ELEMENT tail (#PCDATA | L | pc )*>
<!ELEMENT L (#PCDATA) >
<!ELEMENT pc (#PCDATA) >

<!-- attributes  -->
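
A record shaped according to this DTD can be read with the standard library, as sketched below. The sample record is invented for illustration (its text, L and pc values are not taken from the dictionary), and xml.etree.ElementTree does not validate against the DTD; it only parses well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical record following the element structure of acc.dtd.
SAMPLE = """<acc>
<H1><h><key1>rAmAyaRa</key1><key2>Ra1ma1yan2a</key2></h><body><s>rAmAyaNa</s> by Va1lmi1ki.<br/>Oxf. 1.</body><tail><L>1</L><pc>1-001-a</pc></tail></H1>
</acc>"""


def read_entries(xml_text: str) -> list[dict]:
    """Extract keys, flattened body text and tail fields from each H1 record."""
    root = ET.fromstring(xml_text)
    entries = []
    for h1 in root.iter("H1"):
        entries.append({
            "key1": h1.findtext("h/key1"),   # slp1
            "key2": h1.findtext("h/key2"),   # AS
            # itertext() drops the inline tags (s, i, b, br, ...) and keeps the text.
            "body": "".join(h1.find("body").itertext()),
            "L": h1.findtext("tail/L"),
            "pc": h1.findtext("tail/pc"),
        })
    return entries
```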