Abstract. The Perseus Digital Library includes tools for morphological analysis; encoding and presentation of lexica; metadata and cataloguing; abstract mapping of various SGML and XML DTDs; and display of document sections. The system was developed for Ancient Greek and has been extended to support Latin and Italian. We describe how we are generalizing this document management system for other languages and for use by other projects. The original implementation did not clearly separate infrastructure from project-specific data and configuration. Determining which data elements, template files, and routines are part of the system and which are part of the various corpora has helped us determine what is crucial to the infrastructure of a multi-lingual digital library, which naming conventions and meta-data standards should be shared among co-operating projects, and what features can be configurable by individual projects. One goal of the present work is to make the Perseus DL infrastructure available as open-source software.
The Perseus Project manages a digital library containing almost 9 million words of Greek, over 4 million words of Latin, and growing corpora in Italian and German, as well as 55 million words of English. We have developed tools for morphological analysis; encoding and presentation of lexica; metadata and cataloguing; abstract mapping of various SGML and XML DTDs; and display of document sections. We are now working on generalizing this document management system for use by other projects. In this paper, we explain our strategies for packaging this existing infrastructure, describe the problems we have run into and how we solved them, and describe the system itself and how other projects expect to use it. The Perseus Digital Library is online at http://www.perseus.tufts.edu.
Our document management system and linguistic tools were originally designed for Ancient Greek. We extended the system to Latin several years ago, and more recently have added full support for Italian; this fall, we have begun work on support for Arabic. Supporting a language in our system includes encoding one or more lexica in SGML, adding this language's inflexion rules to the morphological analysis database, and adding texts to the digital library. We use the lexicon to seed the morphological database, then make new words available as they appear in new texts added to the collection. Section 2 describes the morphological analysis program, and section 3 describes our methods for encoding lexica and presenting them in the digital library.
Although we use the TEI DTD for new texts, we have a variety of older texts that use different DTDs, and we have also incorporated materials from other projects that use other DTDs, or use the TEI in different ways. Our system manages varying DTDs by mapping relevant structural features to abstract structures. For example, "chapter 5" might be represented by the fifth occurrence of a "<chapter>" element, by "<div2 type=chapter n=5>", by "<milestone unit=chapter id="Ch5" />", or in some other way. The mapping scheme converts all of these to the abstract "chapter 5". Thereafter, routines processing the XML files need not be concerned with the details of their original DTDs. We describe the abstract structure mapping further in section 4 below.
One of the strengths of our document management system is its flexible, modular display front end. We present texts in HTML over the Web, but have also experimented with presentation in XML and in Adobe's PDF. Readers can request display of a section of a document, using the standard citation scheme for the text if there is one (for example, book and line for Homer's Iliad, or book, chapter, and verse for the Bible). The document system identifies the desired section based on structural metadata provided by the DTD mapping engine and converts it to display format based on a template supplied by the corpus editor or digital librarian. User preferences also affect the display; in particular, since we cannot yet assume that all potential readers have access to Unicode display fonts, we can convert non-Roman characters into any of several popular fonts for display.
Additional modules in the display system manage implicit searching of feature databases (for geography, timelines, and keyword lookups) and automatic connection of texts to morphological analyses, glosses, and collocation data.
In generalizing this document management system for use outside the Perseus Project, we have identified various problems and limitations. Most important, the original implementation did not clearly separate infrastructure from project-specific data and configuration. Determining which data elements, template files, and routines are part of the system and which are part of the various corpora has helped us determine what is crucial to the infrastructure of a multi-lingual digital library, which naming conventions and meta-data standards should be shared among co-operating projects, and what features can be configurable by individual projects. Section 5 discusses what we have learned from the exercise of generalizing the system.
Morpheus, the rule-based morphological system, is the foundation of linguistic analysis in the Perseus Digital Library. It was first developed for Greek by Gregory Crane in 1985, extended to Latin in 1996, and extended to Italian in 1999. Morpheus maintains separate databases for morphological information ("what are the endings of the present tense, active voice?") and lexical information ("is volo a regular verb?" "what is the stem of femina?"). This allows new forms to be added easily, usually automatically: if the stem ama- (Latin, = "love") is known to belong to a first-conjugation verb, forms like amavisti, amabantur, ames will be recognized wherever they appear in the texts.
The original implementation, Greek Morpheus, can handle regular verbs and nouns, irregular verbs (in Greek, mostly suppletive) and nouns, verb prefixes (a very common kind of derivation), and the various dialects of Greek in common use in the archaic and classical periods. Virtually all inflections in Greek are endings, though many past-tense verb forms take a prefix (the "temporal augment") and some stems are formed by reduplication of the first consonant. Morpheus therefore assumes that inflected words can be divided into stems and endings. The stems are related to lexical headwords (e.g. the stems pemp- and pepomph- belong to the verb pempô, "send") so that tools using Morpheus can offer definitions as well as morphological analyses. For each stem, moreover, Morpheus knows the relevant grammatical category (the "conjugation" or "declension"), which determines the possible endings. It can then recognize that pempoimi is a valid form, but pempeiên is not: both use endings for the first person singular, present optative active, but only the first of these endings is appropriate for the verb pempô.
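The stem-and-ending model described above can be illustrated with a minimal sketch. This is not the Morpheus implementation; the table layout, the class names (`omega_pres`, `other_pres`, `perfect_act`), and the tiny data set are assumptions made purely for illustration, using the transliterated forms from the paragraph above.

```python
# Toy stem and ending tables. Class names and contents are illustrative
# assumptions, not Morpheus's actual data.
STEMS = {
    "pemp": ("pempô", "omega_pres"),
    "pepomph": ("pempô", "perfect_act"),
}
ENDINGS = {
    "omega_pres": {"ô": "1st sg pres ind act", "oimi": "1st sg pres opt act"},
    "other_pres": {"eiên": "1st sg pres opt act"},   # a class pempô does not belong to
    "perfect_act": {"a": "1st sg perf ind act"},
}

def analyze(word):
    """Return (headword, analysis) pairs for every valid stem + ending split."""
    results = []
    for i in range(1, len(word)):
        stem, ending = word[:i], word[i:]
        if stem not in STEMS:
            continue
        headword, infl_class = STEMS[stem]
        # Only endings belonging to the stem's own class are accepted.
        analysis = ENDINGS[infl_class].get(ending)
        if analysis is not None:
            results.append((headword, analysis))
    return results

print(analyze("pempoimi"))   # accepted: ending fits the stem's class
print(analyze("pempeiên"))   # rejected: ending belongs to another class
```

Because each stem carries its headword, the same lookup that validates a form also yields the dictionary entry to link to.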
When the Perseus Project received a grant from the National Endowment for the Humanities to add coverage of Roman art, Roman history, and Latin literature, one of the first necessary steps was to generalize Morpheus to handle Latin. Because Latin morphology, like Greek, uses endings, this was straightforward. The only processes in Latin that were not already accounted for in Greek were assimilation of prefixes and syncopation of certain endings. In Greek, most verb prefixes end in vowels (epi-, kata-, apo-, and so on), while Latin has many prefixes that end in consonants (ad-, in-, sub-). These prefixes may be left as they are (adfero, "carry to") or may be assimilated to a following consonant (affero); current printed Latin editions may do either. Morpheus had to recognize that affero = ad- + fero.
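A rough sketch of how assimilated prefixes might be undone follows. It handles only the simplest case, in which the prefix's final consonant becomes a copy of the following consonant (ad- + fero giving affero); other assimilation patterns are ignored. The prefix list, the toy lexicon, and the function names are assumptions for illustration, not part of Morpheus.

```python
PREFIXES = ("ad", "in", "sub")   # consonant-final prefixes mentioned in the text
KNOWN_VERBS = {"fero"}           # toy lexicon, assumed for this example

def split_prefix(word):
    """Yield (prefix, base) candidates, undoing assimilation to a doubled consonant."""
    for prefix in PREFIXES:
        n = len(prefix)
        if len(word) <= n:
            continue
        if word.startswith(prefix):
            # Unassimilated spelling, e.g. "adfero"
            yield prefix, word[n:]
        elif word.startswith(prefix[:-1]) and word[n - 1] == word[n]:
            # Assimilated spelling, e.g. "af" + "fero"
            yield prefix, word[n:]

def analyze_prefixed(word):
    """Keep only the candidates whose base is a known verb."""
    return [(p, b) for p, b in split_prefix(word) if b in KNOWN_VERBS]

print(analyze_prefixed("affero"))
print(analyze_prefixed("adfero"))
```

The candidate generator deliberately over-produces; checking the base against the lexicon is what filters out accidental doubled consonants.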
Latin also has a class of perfect endings that can be syncopated: amavisti, the full form (= "thou hast loved"), frequently appears as amasti. The syncopated endings could be considered simply alternate endings, analogous to the dialectal variations in Greek, but it is convenient to recognize the syncopation because this is how the forms are usually presented in textbooks and student grammars.
Certain clitic particles in Latin are conventionally written as suffixes (-que, -ve, -ne), and Morpheus has to recognize those as well, but this rarely presents problems as there are few cases where the form is ambiguous. That is, forms like eque which admit of two analyses (vocative of equus, "O horse," or e + -que, "and out of") are quite rare.
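The enclitic case, including the rare ambiguity, can be sketched as follows. The form list and glosses are a toy stand-in taken from the example in the text; the function name and return shape are assumptions for illustration.

```python
ENCLITICS = ("que", "ve", "ne")
# Toy form list; glosses follow the example in the text.
KNOWN_FORMS = {
    "eque": "vocative of equus",
    "e": "preposition, 'out of'",
}

def analyze_enclitic(word):
    """Return every reading: the whole word, and base + enclitic where the base is known."""
    readings = []
    if word in KNOWN_FORMS:
        readings.append((word, None))
    for clitic in ENCLITICS:
        base = word[: -len(clitic)]
        if word.endswith(clitic) and base in KNOWN_FORMS:
            readings.append((base, clitic))
    return readings

print(analyze_enclitic("eque"))   # two readings: the ambiguous case
```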
Extension of Morpheus to Italian was straightforward, since Italian morphology works on the same principles as Greek and Latin. We expect that Morpheus in its present form will work for any Indo-European language, and for any other language whose morphology is based primarily on endings. Other languages present more problems. Currently, we are planning work on Arabic, related to one of our collaborators' projects in the history of science. Morpheus is not the best tool for Arabic morphology, which is based on vowel changes and infixation as well as affixation. Moreover, standard Arabic texts do not even contain the vowels, which means it is necessary to parse a form in context to recognize which of several possible forms it is. We expect, therefore, to use a different morphological analysis engine for Arabic.
Because we use the TEI DTD for our texts, it was a natural choice for our lexica as well. We do not use the strict TEI dictionary tag set, however, because older print dictionaries are not completely consistent in their structure and the strict structures associated with the <entry> tag set do not allow for the inevitable variation that occurs in these dictionaries. For this reason, we use the much less strict <entryfree> syntax for most of the dictionaries in the digital library.
The actual display of dictionary entries is handled by the document management system described below. The lexica are also integrated with Morpheus: every Greek and Latin word anywhere in the Perseus digital library is linked to a word study tool, based on the morphological analysis. Users can click on a word and see an automatically generated hypertext giving the morphological analysis and other resources based on the dictionary headword. These resources include a short definition (automatically extracted from the lexicon), word frequency charts, links to searching tools, links to grammar helps and, of course, links to the full definitions of the word in the dictionary.
Tagging all of our lexica according to a consistent format such as the TEI has allowed us to develop several scalable tools to extract and re-present the knowledge encoded in these documents. As noted above, the Perseus morphological analysis system maintains a separate database for lexical information. This has allowed us to develop programs to extract lists of lexical forms from dictionary entries, check them against the existing lexical database, and add new entries where appropriate. Thus, when the National Endowment for the Humanities provided funding to enter the standard unabridged Greek-English lexicon (LSJ), we were able to add approximately 70,000 extra words to the lexical database with little extra hand work. Similarly, one of the first steps in expanding the morphological analysis system to Italian was entering an Italian dictionary, extracting lexical information, and using it to create a new lexical database. Our work on Arabic will begin with entry of Lane's Arabic-English Lexicon.
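The extract-check-add cycle described above might look roughly like the following sketch. It assumes a well-formed XML lexicon in which each `entryFree` carries its headword in a `key` attribute, as in the sample entry in the appendix; the sample data, the function name, and the use of a plain set as the lexical database are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed stand-in for a lexicon file; real entries are far richer.
LEXICON_XML = """
<body>
  <entryFree key="a)/ggelos"><orth>a)/ggelos</orth>, messenger</entryFree>
  <entryFree key="fe/rw"><orth>fe/rw</orth>, bear, carry</entryFree>
</body>
"""

def new_headwords(lexicon_xml, lexical_db):
    """Return headword keys found in the lexicon but absent from lexical_db."""
    root = ET.fromstring(lexicon_xml)
    keys = [entry.get("key") for entry in root.iter("entryFree")]
    return [k for k in keys if k is not None and k not in lexical_db]

lexical_db = {"fe/rw"}                        # headwords already known
added = new_headwords(LEXICON_XML, lexical_db)
lexical_db.update(added)                      # add only the new entries
print(added)
```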
We can extract additional information from lexica. We have developed programs that extract definitions from the lexica and generate lists of words with similar definitions, or possible synonyms, using vector-space document similarity models. This works not only for other words in the same lexicon, but for words in other lexica, even in other languages. In addition, we have written programs to scan dictionary entries and extract short definitions that we can present to end-users as part of the word study tool, or can include in a vocabulary list for students.
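A vector-space comparison of definitions can be sketched in a few lines. The version below uses raw term counts and cosine similarity; a production system would more plausibly add term weighting and stop-word handling, and the headwords and definitions here are hypothetical.

```python
import math
from collections import Counter

def vector(definition):
    """Bag-of-words term-frequency vector for one definition."""
    return Counter(definition.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical short definitions keyed by headword (two different languages).
definitions = {
    "angelos": "messenger envoy",
    "nuntius": "messenger envoy reporter",
    "equus": "horse steed",
}

def most_similar(headword):
    """Return the other headword whose definition is closest to this one."""
    target = vector(definitions[headword])
    scored = [(cosine(target, vector(d)), w)
              for w, d in definitions.items() if w != headword]
    return max(scored)[1]

print(most_similar("angelos"))
```

Because the comparison runs over the English definitions rather than the headwords, it works across lexica in different source languages.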
These lexica also provide important data to the Perseus citation and cross referencing engine. This tool allows us to display links to other texts that cite the document currently being displayed. A simple example is a commentary, which explicitly talks about another text. For example, when a reader views the text of Thucydides or Homer's Iliad, we are able to show notes from several commentaries about these texts. Much more exciting, however, is the fact that each of these citations is also displayed as a link from the cited text back to the commentary. For example, a reader of Herodotus 3.119 might be interested to know that Jebb cites this passage in his commentary on Sophocles' Antigone. Our text display system generates a link to Jebb's commentary when a user is reading this passage in the text of Herodotus. Lexica are rich with the sorts of citations that make this display system truly useful. The LSJ Greek-English lexicon, for example, contains more than 200,000 citations of texts that exist in the Perseus digital library. Each of these citations is converted into a link allowing users to see that the dictionary offers specific suggestions about the way that a word is being used in a particular context. For example, a person reading Homer's Odyssey 16.323 will see an active link from the word phere to the section of the LSJ entry for pherô that cites this passage.
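The reversal of citations amounts to inverting an index. A minimal sketch, with informal placeholder strings standing in for the real document and passage identifiers:

```python
from collections import defaultdict

# (citing document, cited passage) pairs; identifiers are informal placeholders.
CITATIONS = [
    ("Jebb, commentary on Antigone", "Herodotus 3.119"),
    ("LSJ, entry pherô", "Odyssey 16.323"),
    ("LSJ, entry angelos", "Iliad 2.26"),
]

def build_reverse_index(citations):
    """Map each cited passage to the list of documents that cite it."""
    index = defaultdict(list)
    for citing, cited in citations:
        index[cited].append(citing)
    return index

reverse = build_reverse_index(CITATIONS)
# Links to show a reader of Herodotus 3.119:
print(reverse["Herodotus 3.119"])
```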
All of the lexica that have been made publicly available in the Perseus Digital Library are general dictionaries, designed to provide broad coverage of a language. These sorts of dictionaries, however, often cannot provide the level of detail that is necessary to understand how a single author is using a word. For this reason, we are working on several specialized lexica for classical authors such as Homer and Pindar. These dictionaries can be integrated into the word study tool and displayed when users are reading works by one of these authors, while they can also provide information that can be used in all of the knowledge management tools described above. Because these reversible citations and automatically presented dictionary definitions seem to us an effective way to integrate linguistic information into the presentation of a text, we are also working on specialized lexica for English authors, notably the Shakespeare lexica of Dyce, Onions, and Schmidt. We will therefore need to enhance our linguistic infrastructure to offer help to readers of texts in the primary language of the digital library.
The Perseus text processing system manages XML and SGML texts encoded according to various different DTDs. The key to the system is the mapping of specific SGML elements to abstract structural elements. If a user wishes to read Our Mutual Friend, book 3, chapter 6, or if a commentary refers to Iliad, book 22, line 361, the document management system can identify this section of the text by its citation scheme (by book and chapter, or book and line), no matter what DTD was used for Dickens or for Homer.
In addition, the text processing system manages multiple versions of the same text. Just as structural elements are mapped to abstract structures, so texts are mapped to abstract works (called "abstract bibliographic objects," or ABO). A user reading Homer may begin with the Greek text, but can also move to the English text of the same section. Similarly, a commentary written with reference to the Greek text can be offered to readers of the English text.
Using our system, digital librarians create partial mappings between elements in a DTD (e.g., div1, div2, and lb) and abstract structural elements (act, scene, and line) from which the text processing system generates lookup tables (indices) of the elements so mapped. Thus what is encoded as <div2 type="scene"> in one document and as <scene> in another are both indexed as an abstract, structural "scene." This mapping hides the use of different DTDs from the higher-level processing routines.
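The effect of such a mapping can be shown with a small sketch. It is not the Perseus indexer (a Perl version appears in the appendix); the mapping table, the function name, and the simplified rule that a `type` or `unit` attribute names the abstraction directly are assumptions for illustration. The occurrence counter is kept per document and is not reset inside parent elements.

```python
import xml.etree.ElementTree as ET

# Partial mapping from concrete element names to abstract ones (illustrative).
TAG_MAP = {"act": "act", "scene": "scene", "lb": "line", "l": "line"}

def index_structure(xml_text, tag_map=TAG_MAP):
    """Return (abstract_type, n) for each mapped element, in document order."""
    counters = {}
    rows = []
    for el in ET.fromstring(xml_text).iter():
        # A type or unit attribute names the abstraction directly;
        # otherwise fall back to the tag-name mapping.
        abstract = el.get("type") or el.get("unit") or tag_map.get(el.tag)
        if abstract is None:
            continue
        counters[abstract] = counters.get(abstract, 0) + 1
        # Use the explicit number if present, else the occurrence count.
        n = el.get("n") or str(counters[abstract])
        rows.append((abstract, n))
    return rows

doc_a = '<text><div1 type="act" n="1"><div2 type="scene" n="2"/></div1></text>'
doc_b = '<text><act n="1"><scene/><scene/></act></text>'
print(index_structure(doc_a))
print(index_structure(doc_b))
```

Both documents yield rows of the same abstract types, so later routines can look up "act 1, scene 2" without knowing which encoding was used.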
These abstractions facilitate the implementation of knowledge discovery tools, including full-text searching (based on words, not mere strings), identification of toponyms and generation of maps, identification of dates and generation of timelines, and implicit keyword searching.
The Perseus Digital Library is well known among classicists. Other scholars wishing to make electronic editions of Greek and Latin texts have wanted to use its resources, in particular the lexica and morphological analysis facilities. Until recently, they could only do this by making explicit links from their HTML texts to the Perseus site.
A collaborating project, the Stoa Consortium at the University of Kentucky, had begun to develop tools of its own, but it quickly became clear that this would be too much work for the Stoa's limited resources. At the same time, other projects also expressed interest in the digital library toolset. We therefore decided to make the toolset available. It will ultimately be generally available under an open-source license, though it is not yet ready for general release.
The Perseus toolset was written for one project, operating one digital library on one web server. Naturally, it was not written with portability in mind. In the course of converting "project" code to "product" code, we discovered that we had not clearly distinguished infrastructure, project-specific data, and configuration data. Getting all this straight has helped us identify the real core of each of the modules in the system.
We have worked on meta-data standards and naming rules for texts and ABOs. While each project could determine its own naming rules, it is convenient if projects that are to share data can ensure there are no name conflicts. In the Perseus Digital Library, texts are stored in SGML or XML files with descriptive names (for example, soph.oc_eng.sgml for an English translation of Oedipus at Colonus by Sophocles). When the text is normalized, its XML version receives an internal name like 1999.01.0190.xml, which serves to identify the text to the rest of the system. The formal name of this text is Perseus:text:1999.01.0190, and each derived file (normalized XML, lookup table, citation list, and so on) has a file name incorporating the numbers 1999.01.0190. The formal naming scheme has three parts: the naming authority ("Perseus"), the object type ("text"), and the specific object identifier ("1999.01.0190"). The naming authority section of the name is crucial for federation of libraries: a link to "Perseus:text:1999.01.0190" is a link to a text in the Perseus library, not the local library. If two co-operating digital libraries were to have copies of the same SGML source file, they could use the same name for it.
Naming for ABOs is similar. Oedipus at Colonus, for example, is known to the digital library as Perseus:abo:tlg,0011,007, while Hamlet is Perseus:abo:shak,hamlet. Every text that is a version of one of these plays (a translation, a particular edition) has a meta-data record that indicates it is a version of this abstract bibliographic object; every commentary, similarly, has a meta-data record declaring it a commentary on the abstract bibliographic object. Just as for texts, the formal name falls into three parts: naming authority, object type, and specific identifier. The intention here is that co-operating libraries will use the same identifiers for ABOs even if they include different versions of the texts. For example, if a hypothetical co-operating library called Livres were to include French translations of these two plays, the French texts might be called Livres:text:2000.01.0001 and Livres:text:2000.01.0002, but they would be declared versions of Perseus:abo:tlg,0011,007 and Perseus:abo:shak,hamlet respectively. But if the Livres library were to produce an edition of, say, Racine's Phèdre, it would assign its own ABO identifier, perhaps Livres:abo:r1.
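The three-part names and the "version of" relation lend themselves to a short sketch. The record table below is hypothetical (it reuses the example identifiers from the text), and the function names are assumptions for illustration.

```python
from typing import NamedTuple

class Name(NamedTuple):
    authority: str
    object_type: str
    identifier: str

def parse_name(formal_name):
    """Split 'authority:type:identifier' into its three parts."""
    authority, object_type, identifier = formal_name.split(":", 2)
    return Name(authority, object_type, identifier)

# Hypothetical 'version of' records: text name -> ABO name.
VERSION_OF = {
    "Perseus:text:1999.01.0190": "Perseus:abo:tlg,0011,007",
    "Livres:text:2000.01.0001": "Perseus:abo:tlg,0011,007",
}

def versions_of(abo):
    """All texts declared to be versions of the given ABO, across libraries."""
    return sorted(t for t, a in VERSION_OF.items() if a == abo)

print(parse_name("Perseus:text:1999.01.0190"))
print(versions_of("Perseus:abo:tlg,0011,007"))
```

Because the naming authority is part of every name, texts from two libraries can point at the same ABO without their own identifiers colliding.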
Meta-data assertions about the texts come from two different places, the TEI header and a hand-maintained database. The TEI header supplies most of the Dublin Core fields that we use: title, creator, contributor, language, and source (from the <sourceDesc> element). Our texts have DC type "text"; we also use type "image" for the pictures in our digital library. From the TEI header we also determine some project-specific meta-data elements, in particular the funder and the citation scheme (as described in section 4 above). Co-operating projects will use these fields in the same way by virtue of using the DTD in the same way.
The hand-maintained meta-database includes the Dublin Core relation field, which we use to relate works to collections and other groupings of texts. We also use the relation field to indicate the ABO that a particular SGML file is a version of or a commentary on, if any. We use the Dublin Core date field with the "Available" qualifier to indicate the date on which our electronic version of the text became available; we do not currently record the creation date of the original work, the publication date of the print edition we worked from, or any of the other various relevant dates. We use the Dublin Core identifier field for those few texts that are in HTML rather than SGML or XML; it holds a URL for the file. One important project-specific field that is maintained by hand is a publication status: public, in development, or restricted access. Co-operating projects will need to maintain these fields as well. Currently, these fields can be updated by a web-based application or by editing a canonical textual version of the database.
We do not currently use the Dublin Core publisher, format, coverage, rights, or subject fields for texts.
The basic display mechanism uses HTML templates to format pages, and stylesheets (written in CoST) to turn XML into HTML. We allow projects, or collections within projects, to override all or part of the default stylesheet or template, so that their texts can have a distinctive appearance. Because the Perseus Digital Library already uses this facility extensively, it is easy to provide it to co-operating projects as well.
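The override behaviour can be pictured as a layered lookup in which the most specific setting wins and everything else falls through to the default. The sketch below is only an analogy in Python (the actual stylesheets are written in CoST); the file names and the three-level layering are hypothetical.

```python
from collections import ChainMap

# Hypothetical template fragments at three levels of specificity.
DEFAULT_TEMPLATE = {"header": "default-header.html", "footer": "default-footer.html"}
PROJECT_OVERRIDES = {"header": "project-header.html"}
COLLECTION_OVERRIDES = {}

def resolve_templates(collection, project, default):
    """Most specific setting wins; anything not overridden falls through."""
    return dict(ChainMap(collection, project, default))

templates = resolve_templates(COLLECTION_OVERRIDES, PROJECT_OVERRIDES, DEFAULT_TEMPLATE)
print(templates)
```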
Agosti, Maristella, Fabio Crestani, Massimo Melucci. 1998. "On the Use of Information Retrieval Techniques for the Automatic Construction of Hypertext." Information Processing and Management 32:2, 133-144.
Arms, William Y. 2000. Digital Libraries. Cambridge: MIT Press.
Birnbaum, David, and David A. Mundie. 1999. "The Problem of Anomalous Data." Markup Languages: Theory and Practice 1.4, 1-14.
Burnard, Lou. 1995. "What is SGML, and How Does It Help?" Computers and the Humanities 29, 41-50. http://www.uic.edu/orgs/tei/sgml/teiedw25/.
Crane, Gregory. 1991. "Generating and Parsing Classical Greek." Literary and Linguistic Computing 6, 243-245.
Crane, Gregory. 1998. "New Technologies for Reading: The Lexicon and the Digital Library." Classical World 91, 471-501.
Crane, Gregory. 2000. "Extending a Digital Library: Beginning a Roman Perseus." New England Classical Journal 27, 140-160.
Lane, Edward William. 1863. An Arabic-English Lexicon. London: Williams and Norgate.
Liddell, Henry George, Robert Scott, Sir Henry Stuart Jones, Roderick McKenzie. 1843. A Greek-English Lexicon. Ninth edition, 1940. Oxford University Press.
Lesk, Michael. 1997. Practical Digital Libraries: Books, Bytes, and Bucks. San Francisco: Morgan Kaufmann Publishers.
Lubell, Joshua. 1999. "Structured Markup on the Web: A Tale of Two Sites." Markup Languages: Theory and Practice 1.3, 7-22.
Rydberg-Cox, Jeffrey A. 2000. "Word Co-Occurrence and Lexical Acquisition in Ancient Greek Texts." Literary and Linguistic Computing 15, 121-129.
Rydberg-Cox, Jeffrey A. (forthcoming) "Mining Data from the Electronic Greek Lexicon." Classical Journal.
Rydberg-Cox, Jeffrey A., Robert F. Chavez, Anne Mahoney, David A. Smith, Gregory R. Crane. 2000. "Knowledge Management in the Perseus Digital Library." Ariadne 25, http://www.ariadne.ac.uk/issue25/rydberg-cox/
Smith, David A., Anne Mahoney, Jeffrey A. Rydberg-Cox. 2000. "Management of XML Documents in an Integrated Digital Library." Proceedings of Extreme Markup Languages 2000, 219-224.
Sperberg-McQueen, C., and L. Burnard. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago: Text Encoding Initiative.
This routine creates the lookup table that abstracts from the specific element names used in the DTD. It is run on each new or changed text. Ideally, the lists of elements to be indexed, mappings from concrete tags to abstract structural elements, and elements to be suppressed would be in an external table rather than directly in the code. Adding support for another DTD requires modifying these lists, which therefore are at least partly project-specific configuration rather than code.
Notes about significant features are interspersed with the code.
#!/usr/bin/perl
use XML::Parser;
## useful tags to index
my %idxTags = map { $_, 1 } qw(div div0 div1 div2 div3 div4 div5 div6 div7
                               group text front body back milestone pb cb lb l
                               frag entry entryfree orth
                               poem speech sp section
                               head figure docauthor doctitle
                               pageinfo printpgno illus);
## Elements to be mapped onto abstractions
my %tagTypes = (
    'pb' => 'page',          # e.g., the <pb> element denotes an abstract page
    'cb' => 'column',
    'lb' => 'line',
    'l' => 'line',
    'speech' => 'sp',
    'frag' => 'fragment',
    'pageinfo' => 'spage',
    'printpgno' => 'page',
    'illus' => 'figure',
    'entryfree' => 'entry',
);
my %fakeEmpty = (
    'controlpgno' => 1,
    'pageinfo' => 1,
    'printpgno' => 1,
);
my %suppress = (
    'tei.header' => 1,
    'teiheader' => 1,
    'note' => 1,
    'verse' => 1,
    'oracle' => 1,
    'quotedtext' => 1,
    'quote' => 1,
    'castgroup' => 1,
    'list' => 1,
    'table' => 1,
    'rdg' => 1,
    'bibl' => 1, # mostly to kill
);
my %printContent = (
    head => 1,
    orth => 1,
    figure => 1,
    illus => 1,
    docauthor => 1,
    doctitle => 1,
);
## The Expat XML parser uses call-back routines
my $parser = XML::Parser->new(Handlers => {Start   => \&handle_start,
                                           End     => \&handle_end,
                                           Char    => \&handle_char,
                                           Default => \&handle_default});
my $pstack = 0;
## We might need to defer some lines if they happen inside others.
my @defer = ();
## Keep track of element context
my @context = ();
## Keep track of active elements
my @suppress = (0);
my $chunk = 0;
## Keep track of language for heads
my(@lang) = ('');
my $pfile = shift @ARGV;
if ($pfile ne '') {
    $parser->parsefile($pfile);
}
else {
    $parser->parse(*STDIN);
}
foreach my $defLine (@defer) {
    print $defLine, "\n";
}
sub handle_start {
    ## Called for start of a new element
    my $p = shift;
    my $el = shift;
    my %atts = @_;
    ## This may be useful later...
    delete $atts{'teiform'};
    my $curContext = join("",@context);
    push @context, make_start_tag($el,\%atts);
    push @suppress, ($suppress{lc($el)} ? 1 : $suppress[$#suppress]);
    my $newlang = $atts{'lang'} ? lc($atts{'lang'}) : $lang[$#lang];
    push @lang, $newlang;
    print make_start_tag($el,\%atts) if $pstack && !$suppress[$#suppress];
    $el = lc $el;
    ++$pstack if $printContent{$el};
    return if $pstack > 1;
    ## If the current tag's purpose in life is to print its content,
    ## but it's being suppressed, bail out.
    return if $printContent{$el} && $suppress[$#suppress];
    if (!$idxTags{$el} or ($pstack and !$printContent{$el})) {
        if ($chunk and $atts{'id'} ne '') {
            my $idLine = join("\t", $p->current_byte, $p->depth, 'id',
                              $atts{'id'}, 0, $curContext);
            if ($pstack) {
                push @defer, $idLine;
            }
            else {
                foreach my $defLine (@defer) {
                    print $defLine, "\n";
                }
                @defer = ();
                print $idLine, "\n";
            }
        }
        return;
    }
    return if ($el eq 'milestone') && ($atts{'unit'} eq 'para');
    ## Only throw out PHI prose lines for now.
    return if ($el eq 'lb') && (lc($atts{'ed'}) eq 'phi');
    my $n = $atts{'n'};
    my $id = $atts{'id'};
    ## Are we an empty tag? If not, tielut will blow away our state when we
    ## exit this tag. tielut loads the output of this routine into a database
    my $isEmpty = 0;
    $isEmpty = 1 if $fakeEmpty{$el} || ($p->recognized_string =~ /\/>$/);
    my $type = $tagTypes{$el};
    if ($atts{'name'} ne '') { # For old Perseus texts.
        $type = $atts{'name'};
    }
    elsif ($type eq 'line' and lc($n) eq 'tr') {
        $type = 'tr line';
        $n = '-1';
    }
    elsif ($atts{'type'} ne '' and $el ne 'entry' and $type ne 'entry') {
        $type = $atts{'type'};
    }
    elsif ($atts{'unit'} ne '') {
        $type = $atts{'unit'};
    }
    if ($atts{'ed'} ne '' and lc($atts{'ed'}) ne 'p' and $type) {
        $type = "$atts{'ed'} $type";
    }
    $type = $el if $type eq '';
    ## Some DIVs are troublesome.
    $suppress[$#suppress] = 1 if $suppress{$type};
    if ($type eq 'line' and $n eq '') {
        $n = '-1';
    }
    ## If the line is split, only count the initial one.
    return if ($type eq 'line'
               and (lc($atts{'part'}) eq 'm' or lc($atts{'part'}) eq 'f'));
    $n = $atts{'key'} if defined($atts{'key'});
    $chunk++;
    foreach my $defLine (@defer) {
        print $defLine, "\n";
    }
    @defer = ();
    ## Output: byte position in XML file, nest depth of tags, abstract element type,
    ## counter (how many of this element we've seen), whether this is an empty "marker"
    ## element as opposed to a container, and the list of tags open at this point in
    ## the XML
    if ($n =~ /=/ and !defined($atts{'key'})) {
        my $count = 0;
        foreach my $i (split /:/, $n) {
            my($curType,$curN) = split /=/, $i, 2;
            print "\n" if $count++;
            print join("\t", $p->current_byte, $p->depth, lc($curType), $curN,
                       $isEmpty, $curContext);
        }
    }
    else {
        print join("\t", $p->current_byte, $p->depth, lc($type), $n,
                   $isEmpty, $curContext);
    }
    if ($id ne '') {
        my $idLine = join("\t", $p->current_byte, $p->depth, 'id', $id,
                          $isEmpty, $curContext);
        if ($pstack) {
            push @defer, $idLine;
        }
        else {
            print "\n", $idLine;
        }
    }
    if ($type eq 'card' and $n ne '') {
        print "\n", join("\t", $p->current_byte, $p->depth, 'line', $n,
                         $isEmpty, $curContext);
    }
    ## Only suppress the content of suppressed elements, not their LUT lines.
    if ($pstack and !$suppress[$#suppress]) {
        print "\t";
        print "" if $lang[$#lang];
    }
    else {
        print "\n";
    }
    1;
}
sub handle_end {
    ## Called for end of an element
    my $p = shift;
    my $el = shift;
    pop @context;
    my $oldlang = pop @lang;
    --$pstack if $printContent{lc($el)};
    return if pop(@suppress);
    if ($pstack) {
        ## Close the tag that handle_start echoed with make_start_tag
        print "</$el>";
    }
    elsif ($printContent{lc($el)}) {
        print " " if $oldlang;
        print "\n";
    }
    1;
}
sub handle_char {
    ## Called for contents of elements
    my($p,$s) = @_;
    return unless $pstack;
    return if $suppress[$#suppress];
    $s = $p->original_string; # we want all characters escaped
    $s =~ s/\n/ /gs;
    print $s;
}
sub handle_default {
    ## Do nothing.
}
sub make_start_tag {
    my $el = shift;
    my $atts = shift;
    my $res = "<$el";
    while (my($att,$val) = each %$atts) {
        ## Escape markup-significant characters as entity references
        $val =~ s/\&/\&amp;/g;
        $val =~ s/\</\&lt;/g;
        $val =~ s/>/\&gt;/g;
        $val =~ s/\"/\&quot;/g;
        $val =~ s/\'/\&apos;/g;
        $res .= " $att=\"$val\"";
    }
    $res .= ">";
    $res;
}
This is a relatively short entry from the Greek-English Lexicon, showing the use of the <entryfree> element and the inconsistent structure of the print entry. Greek is encoded in Beta-code, as devised by the Thesaurus Linguae Graecae project. Note the <sense> tags and their attributes: the first group, which could be labelled "I." but is not, centers on the idea of a messenger, while the only sense in the second group, labelled "II.", is a cult title for a god. The lexicon supplies citations for each sense, and we have encoded them using the ABO codes for the authors and works they refer to.
<entryFree key="a)/ggelos"><orth extent=full lang=greek>a)/ggelos</orth>,
<gen lang=greek>o(</gen>, <gen lang=greek>h(</gen>,
<tr>messenger, envoy</tr>,
<bibl n="Perseus:abo:tlg,0012,001:2:26"><author>Il.</author><biblScope>2.26</biblScope></bibl>, etc.;
<foreign lang=greek>di' a)gge/lwn o(mile/ein tini/</foreign> <bibl n="Perseus:abo:tlg,0016,001:5:92"><author>Hdt.</author><biblScope>5.92</biblScope></bibl>.<foreign lang=greek>z/</foreign>,
cf. <bibl><title>SIG</title><biblScope>229.25</biblScope></bibl> (<placeName>Erythrae</placeName>):—
prov., <foreign lang=greek>*)ara/bios a)/</foreign>., of a loquacious person, <bibl n="Perseus:abo:tlg,0541,001:32"><author>Men.</author><biblScope>32</biblScope></bibl>.
<sense n="2" level="3"> generally, <tr>one that announces</tr> or <tr>tells</tr>, e.g. of birds of augury,
<bibl n="Perseus:abo:tlg,0012,001:24:292"><author>Il.</author><biblScope>24.292</biblScope></bibl>,
<bibl n="Perseus:abo:tlg,0012,001:296"><biblScope>296</biblScope></bibl>;
<foreign lang=greek>*mousw=n a)/ggelos</foreign>, of a poet, <bibl n="Perseus:abo:tlg,0002,001:769"><author>Thgn.</author><biblScope>769</biblScope></bibl>;
<foreign lang=greek>a)/ggele e)/aros . . xelidoi=</foreign> <bibl n="Perseus:abo:tlg,0261,001:74"><author>Simon</author><biblScope>74</biblScope></bibl>;
<foreign lang=greek>a)/. a)/fqoggos</foreign>, of a beacon, <bibl n="Perseus:abo:tlg,0002,001:549"><author>Thgn.</author><biblScope>549</biblScope></bibl>;
of the nightingale, <foreign lang=greek>o)/rnis . . *dio\s a)/</foreign>. <bibl n="Perseus:abo:tlg,0011,005:149"><author>S.</author><title>El.</title><biblScope>149</biblScope></bibl>:
c. gen. rei, <foreign lang=greek>a)/. kakw=n e)mw=n</foreign> <bibl n="Perseus:abo:tlg,0011,002:277"><author>Id.</author><title>Ant.</title><biblScope>277</biblScope></bibl>;
<foreign lang=greek>a)/ggelon glw=ssan lo/gwn</foreign> <bibl n="Perseus:abo:tlg,0006,008:203"><author>E.</author><title>Supp.</title><biblScope>203</biblScope></bibl>;
<foreign lang=greek>ai)/sqhsis h(mi=n a)/.</foreign> <bibl n="Perseus:abo:tlg,2000,001:5:3:3"><author>Plot.</author><biblScope>5.3.3</biblScope></bibl>;
neut. pl., <foreign lang=greek>a)/ggela ni/khs</foreign> <bibl n="Perseus:abo:tlg,2045,001:34:226"><author>Nonn.</author><title>D.</title><biblScope>34.226</biblScope></bibl>. </sense>
<sense n="3" level="3"> <tr>angel</tr>,
<bibl n="Perseus:abo:tlg,0527,001:28:12"><author>LXX</author> <title>Ge.</title><biblScope>28.12</biblScope></bibl>,
al., <bibl n="Perseus:abo:tlg,0031,001:1:24"><title>Ev.Matt.</title><biblScope>1.24</biblScope></bibl>,
al., <bibl n="Perseus:abo:tlg,0018,001:2:604"><author>Ph.</author><biblScope>2.604</biblScope></bibl>, etc. </sense>
<sense n="4" level="3"> in later philos., <tr>semi-divine being</tr>,
<foreign lang=greek>h(liakoi\ a)/.</foreign> <bibl n="Perseus:abo:tlg,2003,001:4:141b"><author>Jul.</author><title>Or.</title><biblScope>4.141b</biblScope></bibl>,
cf. <bibl n="Perseus:abo:tlg,2023,006:2:6"><author>Iamb.</author><title>Myst.</title><biblScope>2.6</biblScope></bibl>,
<bibl><author>Procl.</author></bibl> <tr>in R.</tr><bibl><biblScope>2.243</biblScope></bibl> K.;
<foreign lang=greek>a)/. kai\ a)rxa/ggeloi</foreign> <bibl><title>Theol.Ar.</title><biblScope>43.10</biblScope></bibl>,
cf. <bibl n="Perseus:abo:tlg,4066,003:183"><author>Dam.</author><title>Pr.</title><biblScope>183</biblScope></bibl>,
al.: also in mystical and magical writings, <bibl><author>Herm.</author></bibl> ap. <bibl><author>Stob.</author><biblScope>1.49.45</biblScope></bibl>,
<bibl><title>PMag.Lond.</title><biblScope>46.121</biblScope></bibl>, etc. </sense>
<sense n="II" level="2"> title of Artemis at Syracuse, <bibl><author>Hsch.</author></bibl></sense>
</entryFree>