Treebank-3
| Item Name: | Treebank-3 |
| Authors: | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor |
| LDC Catalog No.: | LDC99T42 |
| ISBN: | 1-58563-163-9 |
| Data Type: | text |
| Data Source(s): | microphone speech, newswire, telephone speech, transcribed speech, varied |
| Project(s): | GALE, TIDES |
| Application(s): | natural language processing, parsing, tagging |
| Language(s): | English |
| Language ID(s): | eng |
| Distribution: | 1 CD, Web Download |
| Member fee: | $0 for 1999 members |
| Non-member Fee: | US $3150.00 |
| Reduced-License Fee: | US $1575.00 |
| Extra-Copy Fee: | US $150.00 |
| Non-member License: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor 1999 Treebank-3 Linguistic Data Consortium, Philadelphia |
Introduction
This CD-ROM contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in
Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
and the following new material:
- Switchboard tagged, dysfluency-annotated, and parsed text
- Brown parsed text
The Treebank bracketing style is designed to allow the extraction of
simple predicate/argument structure. Over one million words of text
are provided with this bracketing applied.
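As a rough illustration of how predicate/argument structure might be read off the bracketing, the sketch below parses one bracketed tree and pairs each NP-SBJ subject with the verb of its sister VP. The sample sentence is invented for illustration and is not taken from the corpus; the function names (parse_tree, words, subject_verb_pairs) are hypothetical helpers, and the code assumes each tree may be wrapped in an extra pair of unlabeled parentheses and that empty elements carry the -NONE- tag.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""  # an outer wrapper bracket may have no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def words(node):
    """Return the words under a node, skipping -NONE- empty elements."""
    label, children = node
    if label == "-NONE-":
        return []
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def subject_verb_pairs(node):
    """Collect (subject text, verb) for every node with NP-SBJ and VP children."""
    _, children = node
    subtrees = [c for c in children if not isinstance(c, str)]
    subj = next((c for c in subtrees if c[0].startswith("NP-SBJ")), None)
    vp = next((c for c in subtrees if c[0] == "VP"), None)
    pairs = []
    if subj is not None and vp is not None:
        # Take the first verb-tagged child of the VP (VB, VBD, VBZ, ...).
        verb = next((c[1][0] for c in vp[1]
                     if not isinstance(c, str) and c[0].startswith("VB")), None)
        pairs.append((" ".join(words(subj)), verb))
    for c in subtrees:
        pairs.extend(subject_verb_pairs(c))
    return pairs


# Invented example sentence, not corpus data.
SAMPLE = ("( (S (NP-SBJ (DT The) (NN company)) "
          "(VP (VBD bought) (NP (DT the) (NN plant))) (. .)))")

if __name__ == "__main__":
    print(subject_verb_pairs(parse_tree(SAMPLE)))
```

This handles only the simplest case; auxiliaries, coordination, and trace coindexation would need additional logic.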
Data
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall
Street Journal (WSJ) collection of 98,732 stories for syntactic
annotation. These 2,499 stories have been distributed in both the Treebank-2
(LDC95T7) and Treebank-3 (LDC99T42) releases of
PTB. Treebank-2 includes the raw text for each story. Three "map"
files, available in a compressed file via ftp,
relate the 2,499 PTB filenames to the
corresponding WSJ DOCNO strings in TIPSTER.
Updates
After publication, it was discovered that not all of the PostScript (*.ps)
files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please
go to the addenda for a list of the files available.