The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium
(LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three
releases of 500,000 words of MASC data developed as part of the
American National Corpus (ANC) project. MASC I consists of approximately
80,000 words of contemporary spoken and written American English annotated for
a variety of linguistic phenomena. The MASC
project is sponsored by the National Science Foundation and was established
to address, to the extent possible, many of the obstacles to the creation of
large-scale, robust, multiply-annotated corpora of English covering a wide range
of genres of written and spoken language data. Researchers from Vassar College,
Columbia University and the International Computer Science Institute, University
of California at Berkeley are the principal participants; the WordNet
project provides consulting.
The source texts in MASC I are drawn from the open portion of the American
National Corpus (ANC) Second Release LDC2005T35, which includes written
texts and spoken transcripts of American English from a broad range of genres
produced since 1990, and from the Language
Understanding Annotation Corpus (LU Corpus) LDC2009T09, a collection of
various genres including broadcast, newswire, email and telephone speech annotated
for committed belief, event and entity coreference, dialog acts and temporal
relations. All of the data in MASC I has validated annotations for
token, part of speech, sentence boundary, noun chunks, verb chunks, named entities
and Penn Treebank syntax.
Full-text FrameNet annotations
are available for seventeen texts and WordNet word sense annotations are available
for 1000 occurrences of each of fifty-three words. Annotations of all or portions
of the sub-corpus for a wide variety of other linguistic phenomena have been
contributed by other projects. Software and services available from the ANC
project website enable transduction of MASC into a wide variety of physical formats.
The MASC directory contains two folders: masc-1.0.3 and masc_wordsense.
masc-1.0.3 contains the actual MASC corpus and consists of two folders, spoken
and written. The spoken folder contains data and annotations for
spoken material, and the written folder contains the same for written texts.
The files in each of the respective folders have naming conventions that describe
the contents of the file. masc_wordsense contains the MASC sentence samples with word sense annotations
using WordNet sense numbers as the annotation values.
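
As a rough illustration of the layout described above, the following Python sketch walks the spoken and written folders under masc-1.0.3 and counts the files in each by extension. The folder names come from the description above; the function name summarize_masc and the root path passed to it are hypothetical, and the sketch makes no assumption about the actual file naming conventions or file types in the release.

```python
from collections import Counter
from pathlib import Path


def summarize_masc(root):
    """Count files under masc-1.0.3/spoken and masc-1.0.3/written by extension.

    `root` is assumed to be the MASC directory that contains masc-1.0.3.
    """
    corpus = Path(root) / "masc-1.0.3"
    summary = {}
    for modality in ("spoken", "written"):
        folder = corpus / modality
        if not folder.is_dir():
            # Folder missing: report an empty count rather than failing.
            summary[modality] = Counter()
            continue
        # Walk the folder recursively and tally files by extension.
        summary[modality] = Counter(
            p.suffix or "(none)" for p in folder.rglob("*") if p.is_file()
        )
    return summary
```

A call such as summarize_masc("/path/to/MASC"), where the path is a placeholder for the local copy of the release, returns one Counter per folder, which gives a quick check that both halves of the corpus were unpacked.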
Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus, LDC2010T22.
Portions © 2000 The Associated Press, © 1987-1989 Dow Jones &
Company, Inc., © 2000 New York Times, © 1997-2002, 2010 Trustees of
the University of Pennsylvania
© 2010 Linguistic Data Consortium, Trustees of the
University of Pennsylvania. All Rights Reserved.