:::::::::::::: ./readme.all :::::::::::::: :::::::::::::: ./parsed/prd/readme.prd :::::::::::::: README file for parsed (.prd) files This directory contains files bracketed for syntactic structure. The input files are of the .dps type; that is, the turns are joined so that "sentences" are not split across turns. POS tag information is not included in these files (see mrg/). The restart annotations of the dysluency files is passed through to these files, although syntactic annotators used their own judgement in bracketing the restarts. Because of time constraints, disagreements between the dysfluency and syntactic annotators were not resolved and thus there are some discrepancies between the dysfluency files and the parsed files in the interpretation of restarts. Bracketing policy is described in detail in the Bracketing Guidelines (parseguide1.ps) and Bracketing Switchboard (parseguide2.ps) in the docs/ directory). The files were created from input with the turns automatically joined, run through FIDDITCH (Don Hindle's deterministic parser), simplified by an emacs-lisp program called "crunch", and then corrected by a human annotator. :::::::::::::: ./parsed/mrg/readme.mrg :::::::::::::: README file for Merged (.mrg) files The files in this directory were automatically created by inserting the part of speech tags from a tagged text file (.pos file) into a parsed text file (.prd file). The tags are inserted as nodes immediately dominating the terminals. The -NONE- node means that there is no part of speech for that terminal symbol (i.e., it is a "null element"; see docs/parseguide1.ps). :::::::::::::: ./dysfl/dff/readme.dff :::::::::::::: README file for dysfluency-annotated files This directory contains files annotated for dysfluencies. There are 1126 files organized in 12 directories numbered 00 to 12. Note that only the first 639 files have a corresponding parsed file in this release. All dysfluency annotation was done by hand and consists of marking speech units, various types of words (fillers, editing terms, discourse markers, etc.) and restarts. For details of the annotation system, see the Dysfluency Annotation Stylebook (dflguide.ps) for the Switchboard Corpus in the /docs directory. :::::::::::::: ./dysfl/mgd/readme.mgd :::::::::::::: README file for .mgd files This hierarchy contains files with part-of-speech tags and dysfluency annotation created by automatically merging the .pos and .dff files. There are 1126 files organized in 12 directories numbered 00 to 12. Note that only the first 639 files have a corresponding parsed file in this release. :::::::::::::: ./dysfl/dps/readme.dps :::::::::::::: README file for .dps files This directory contains files with part-of-speech tags and dysfluency annotation. In addition turns have been joined where in the original format they continued across the other speaker's turn. Thus in these files, any sentence uttered by one speaker but split over two turns is joined together and appears in its entirety in the first of the turns. For this reason there are some "empty" turns included in the files. There are 1126 files organized in 12 directories numbered 00 to 12. Note that only the first 639 files have a corresponding parsed file in this release. The files are created by automatically merging the dysfluency-annotated and part-of-speech tagged files and then automatically joining the turns. :::::::::::::: ./docs/readme.txt :::::::::::::: This directory contains documentation of the annotation styles used by Treebank. Note that formal papers are presented in both PostScript and Adobe Acrobat "pdf" formats. tagguide1.{ps,pdf} - part-of-speech stylebook tagguide2.{ps,pdf} - addendum to the part-of-speech stylebook, outlining changes and new policy for Switchboard dflguide.{ps,pdf} - dysfluency annotation stylebook for Switchboard parseguide1.{ps,pdf} - syntactic annotation stylebook Treebank II (note this file is very large) parseguide2.{ps,pdf} - addendum to the syntactic annotation stylebook outlining changes and new policy for Switchboard arpa94.{ps,pdf} - Marcus et al. 1994, "The Penn Treebank: Annotating Predicate Argument Structure", presented at the ARPA Human Language Technology Workshop, March 1994. Introduction to Treebank II bracketing style. bracket.txt - informal notes with additional information on new policies affecting Switchboard and Brown changes.txt - a summary of tag and label changes affecting Switchboard and Brown annotation README.cdrom2 - the README file from the Penn Treebank 2 release containing information on the WSJ and Atis data :::::::::::::: ./docs/readme.cd2 :::::::::::::: This is a revisedtrimmed-down version of the main README file that accompanied the Penn Treebank Release 2 CDROM, which featured a million words of 1989 Wall Street Journal material annotated in Treebank II style. Portions of that README file are being included here with Penn Treebank Release 3 in order to provide additional information about material that was present in the earlier release. The Treebank II bracketing style, which is designed to allow the extraction of simple predicate-argument structure, is described in doc/arpa94 and the new bracketing style manual (in doc/manual/). In addition, there is a small sample of ATIS-3 material, also annotated in Treebank II style. INVENTORY and DESCRIPTIONS The directory structure of this release is similar to the previous release. doc/ --Documentation. This directory contains information about who the annotators of the Penn Treebank are and what they did as well as LaTeX files of the Penn Treebank's Guide to Parsing and Guide to Tagging. parsed/ --Parsed Corpora. These are skeletal parses, without part-of-speech tagging information. To reflect the change in style from our last release, these files now have the extension of .prd. atis/ --Air Travel Information System transcripts. April 1994 Approximately 5000 words of ATIS3 material. The material has a limited number of sentence types. It was created by Don Hindle's Fidditch and corrected once by a human annotator (Grace Kim). wsj/ --1989 Wall Street Journal articles. November 1993 Most of this material was processed from our -October 1994 previous release using tgrep "T" programs. However, the 21 files in the 08 directory and the file wsj_0010 were initially created using the FIDDITCH parser (partially as an experiment, and partly because the previous release of these files had significant technical problems). All of the material was hand-corrected at least once, and about half of it was revised and updated by a different annotator. The revised files are likely to be more accurate, and there is some individual variation in accuracy. The file doc/wsj.wha lists who did the correction and revision for each directory. tagged/ --Tagged Corpora. atis/ --Air Travel Information System transcripts. April 1994 The part-of-speech tags were inserted by Ken Church's PARTS program and corrected once by a human annotator (Robert MacIntyre). wsj --'88-'89 Wall Street Journal articles. Winter These files have not been reannotated since the -Spring 1990 previous release. However, a number of technical bugs have been fixed and a few tags have been corrected. See tagged/README.pos for details. The new work in Release 2 was funded by the Linguistic Data Consortium. Previous versions of this data were primarily funded by DARPA and AFOSR jointly under grant No. AFOSR-90-006, with additional support by DARPA grant No. N0014-85-K0018 and by ARO grant No. DAAL 03-89-C0031 PRI. Seed money was provided by the General Electric Corporation under grant No. J01746000. We gratefully acknowledge this support. Richard Pito deserves special thanks for providing the tgrep tool, which proved invaluable both for preprocessing the parsed material and for checking the final results. We are also grateful to AT&T Bell Labs for permission to use Kenneth Church's PARTS part-of-speech labeller and Donald Hindle's Fidditch parser. Finally, we are very grateful to the exceptionally competent technical support staff of the Computer and Information Science Department at the University of Pennsylvania, including Mark-Jason Dominus, Mark Foster, and Ira Winston. :::::::::::::: ./tagged/pos/readme.pos :::::::::::::: README file for (POS-)tagged files This directory contains files tagged for Part of Speech. There are 1126 files organized in 12 directories numbered 00 to 12. Note that not all files found here have a corresponding parsed file in this release. Originally, each of the texts was run through PARTS (Ken Church's stochastic part-of-speech tagger) or Eric Brill's tagger and then corrected by a human annotator. The square brackets surrounding phrases in the texts are the output of a stochastic NP parser that is part of PARTS and are best ignored. Words are separated from their part-of-speech tag by a forward slash. In cases of uncertainty concerning the proper part-of-speech tag, words are given alternate tags, which are separated from one another by a vertical bar. The order in which the alternate tags appear is not significant, but has not been standardized. In the Switchboard data, there are also tags including carets (^), which indicate various kinds of transcription errors. The part-of-speech tags used are described in detail in the POS tagging guides (tagguide1.ps and tagguide2.ps in the docs/ directory). :::::::::::::: ./readme.1st :::::::::::::: README for Penn Treebank CDROM Release 3 ======================================== This CDROM contains the following previously released material: + WSJ tagged and parsed text + Atis tagged and parsed text + Brown tagged text (parsed text is new) and the following new material: + Switchboard tagged, dysfluency-annotated, and parsed text + Brown parsed text For information about the WSJ and Atis data, see the README.cdrom2 in the docs/ directory. The Switchboard dataset includes: + 1126 files tagged and dysfluency-annotated + 650 files parsed (a subset of the 1126) These files are organized into 3 subdirectories, named "2","3","4", according to the initial digit of the 4-digit file-id number. The number of files per directory is shown here: tagged and dysfluency-annotated: 2/ 455 files 3/ 477 files 4/ 194 files parsed: 2/ 236 files 3/ 260 files 4/ 154 files The Brown Corpus dataset includes the following Brown subsets: + cf popular lore + cg belles lettres, biography, memoires, etc. + ck general fiction + cl mystery and detective fiction + cm science fiction + cn adventure and western fiction + cp romance and love story + cr humor all subsets are complete except /cf which contains files 1-32 and /cg which contains files 1-36. Directory structure readme.1st - this file readme.all - concatenation of all other readme files in this release (these are all found within the directories listed below) tagged/ - part-of-speech tags only pos/ atis/ brown/c*/ swbd/{2,3,4}/ wsj/ dysfl/ dff/ - dysfluency annotation only swbd/{2,3,4}/ mgd/ - dysfluency annotation and part-of-speech tags swbd/{2,3,4}/ dps/ - dysfluency annotation, part-of-speech tags and turns joined swbd/{2,3,4}/ parsed/ prd/ - syntactic annotation only atis/ brown/c*/ swbd/{2,3,4}/ wsj/ mrg/ - syntactic annotation and part- of-speech tags atis/ brown/c*/ swbd/{2,3,4}/ wsj/ docs/ - annotation style manuals and other information