CCGbank README
==============

Publication title: CCGbank 1.1

Authors: Julia Hockenmaier and Mark Steedman

Addresses:

Julia Hockenmaier is currently at:
  Institute for Research in Cognitive Science
  University of Pennsylvania
  3401 Walnut Street, Suite 400A
  Philadelphia, PA 19104-6228, USA

Mark Steedman
  School of Informatics
  University of Edinburgh
  2 Buccleuch Place
  Edinburgh, EH8 9LW
  Scotland, United Kingdom

Email: juliahr@cis.upenn.edu, steedman@inf.ed.ac.uk

Data type: Text

Data sources: The parsed Wall Street Journal subcorpus of the Penn Treebank II

Project: Edinburgh Wide-Coverage CCG parsing project.
http://groups.inf.ed.ac.uk/ccg

The purpose of this project is to develop wide-coverage statistical parsers
for Combinatory Categorial Grammar (CCG). CCGbank has been used to develop
state-of-the-art wide-coverage statistical CCG parsers (Hockenmaier and
Steedman, 2002; Clark, Hockenmaier and Steedman, 2002; Hockenmaier, 2003a,b;
Clark and Curran, 2003, 2004), as well as CCG supertaggers (Clark and Curran,
2004). So far, these parsers have been used for semantic role labeling
(Gildea and Hockenmaier, 2003), to create Discourse Representation Theory
structures (Bos et al., 2004), and in question-answering systems (Clark,
Steedman and Curran, 2004).

Applications: Parsing, natural language processing

Languages: American English

License:

Grant: EPSRC grant GR/M96889 and an EPSRC studentship

Copyright: Julia Hockenmaier and Mark Steedman.
Portions (c) Trustees of the University of Pennsylvania.

Corpus structure and data attributes:

Data type: Text.

File format: There are three different file formats:
  - human-readable HTML files that contain the syntactic derivations and the
    predicate-argument structure,
  - predicate-argument-structure files that contain the predicate-argument
    structure representation of each sentence (for evaluation), and
  - derivation files that contain the syntactic derivations (to train
    parsers).
These file formats are described in detail in the appendix of the tech report.

Number of files:
  2,338 files in HTML version (including index.html files)
  2,312 files in AUTO version
  2,312 files in PARG version

File format: ASCII, HTML

Size of the data: There are three versions of the same data (HTML, AUTO and
PARG), corresponding to 48,934 sentences or 1,148,426 tokens of annotated
text. The total size of the corpus is 340 MB. The HTML version is 220 MB, the
PARG version is 46 MB, and the AUTO version is 74 MB.

Description of the contents of every directory:

Each file format has its own directory tree (HTML, PARG, AUTO). Within each
of these directories, the file structure parallels that of the original Penn
Treebank II.

The LEX directory contains two lexicons extracted from sections 00 and 02-21.
Each word-category pair is followed by its frequency, the probability of the
word given the category, and the probability of the category given the word.

The RAW directory contains the raw text of sections 00 and 23 (including only
those sentences for which CCGbank has a derivation).

This updated version 1.1 corrects a misalignment of sentences between the
PARG and AUTO files, as well as a problem with some original POS tags in the
AUTO files. It also contains the ccgbank.00-24.t2c TGrep2 file, which was not
included on the previous CD.
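
As an illustration of the lexicon entries described for the LEX directory, the
following Python sketch reads one entry per line. It is not part of the
distribution. It assumes (this README does not state the exact layout) that
each line holds five whitespace-separated fields in the order word, category,
frequency, P(word | category), P(category | word). The names LexiconEntry,
parse_lexicon_line and read_lexicon are hypothetical, and the field order
should be checked against the actual files before use.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LexiconEntry:
    """One word-category pair with its statistics (hypothetical container)."""
    word: str
    category: str
    frequency: int
    p_word_given_cat: float
    p_cat_given_word: float


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one lexicon line.

    ASSUMPTION: five whitespace-separated fields, in the order
    word, category, frequency, P(word|category), P(category|word).
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    word, category, freq, p_wc, p_cw = fields
    return LexiconEntry(word, category, int(freq), float(p_wc), float(p_cw))


def read_lexicon(lines: Iterable[str]) -> List[LexiconEntry]:
    """Parse every non-blank line of an iterable of lexicon lines."""
    return [parse_lexicon_line(line) for line in lines if line.strip()]
```

A file could then be read with `read_lexicon(open(path, encoding="ascii"))`,
where `path` points at one of the two lexicon files. Because CCG categories
such as (S[dcl]\NP)/NP contain no spaces, splitting on whitespace is enough
under the layout assumed here.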