Classic Corpora in the Catalog: Arabic Treebank

The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the DARPA TIDES program and, later, the DARPA GALE and BOLT programs. The original focus was Modern Standard Arabic (MSA), a variety that is not natively spoken and is not homogeneously acquired across its community of writers and readers. In addition to the expected issues associated with complex data annotation, LDC encountered several challenges unique to a highly inflected language with a rich history of traditional grammar. To design an annotation system for Arabic, LDC drew on traditional Arabic grammar and on established and modern grammatical theories of MSA, in combination with the Penn Treebank approach to syntactic annotation (Maamouri et al., 2004). LDC departed from traditional grammar when necessary and when other syntactic approaches were found to better account for the data. LDC also developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01), which greatly benefited ATB development. Revisions to the annotation guidelines during the DARPA GALE program, principally related to tokenization and syntactic annotation, improved inter-annotator agreement and parsing scores.

ATB corpora were annotated for morphology, part of speech, gloss, and syntactic structure. Data sets based on MSA newswire and developed under the revised annotation guidelines include Arabic Treebank: Part 1 v 4.1 (LDC2010T13), Arabic Treebank: Part 2 v 3.1 (LDC2011T09), and Arabic Treebank: Part 3 v 3.2 (LDC2010T08). Other genres are represented in Arabic Treebank – Broadcast News v 1.0 (LDC2012T07) and Arabic Treebank – Weblog (LDC2016T02).

LDC’s later work on Egyptian Arabic treebanks in the DARPA BOLT program benefited from the strides made in its MSA treebank annotation pipeline. To address the challenges presented by informal, dialectal material, collaborator Columbia University provided a normalized Arabic orthography to account for instances of Romanized script (Arabizi) in the data and developed a morphological analyzer (CALIMA) in parallel, working in a tight feedback loop with LDC’s annotation team. SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the former used for MSA tokens and the latter for Egyptian Arabic tokens. Resulting corpora include BOLT Egyptian Arabic Treebank – Discussion Forum (LDC2018T23), Conversational Telephone Speech (LDC2021T12), and SMS/Chat (LDC2021T17).

ATB corpora and their related releases are available for licensing to LDC members and non-members. For more information about licensing LDC data, visit Obtaining Data.

Classic Corpora in the Catalog: CSR

The CSR (continuous speech recognition) corpus series was developed in the early 1990s under DARPA’s Spoken Language Program to support research on large-vocabulary CSR systems. 

CSR-I (WSJ0) Complete (LDC93S6A) and CSR-II (WSJ1) Complete (LDC94S13A) contain speech read from a machine-readable corpus of Wall Street Journal news text, along with spontaneous dictation by journalists of hypothetical news articles and corresponding transcripts.

The text in CSR-I (WSJ0) was selected to fall within either a 5,000-word subset or a 20,000-word subset. Audio includes speaker-dependent and speaker-independent sections as well as sentences with verbalized and nonverbalized punctuation (Doddington, 1992). CSR-II features “Hub and Spoke” test sets that include a 5,000-word subset and a 64,000-word subset. Both data sets were collected using two microphones – a close-talking Sennheiser HMD414 and a second microphone of varying type.

WSJ0 Cambridge Read News (LDC95S24) was developed by Cambridge University and consists of native British English speakers reading CSR WSJ news text, specifically sentences from the 5,000-word and 64,000-word subsets. All speakers also recorded a common set of 18 adaptation sentences.

The CSR corpora continue to have value for the research community. CSR-I (WSJ0) target utterances were used in the CHiME2 and CHiME3 challenges, which focused on distant-microphone automatic speech recognition in real-world environments. CHiME2 WSJ0 (LDC2017S10) and CHiME2 Grid (LDC2017S07) each contain over 120 hours of English speech from a noisy living room environment. CHiME3 (LDC2017S24) consists of 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio.

CSR-I target utterances were also used in the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. DIRHA English WSJ Audio (LDC2018S01) comprises approximately 85 hours of real and simulated read speech from native American English speakers in an apartment setting with typical domestic background noises and inter/intra-room reverberation effects.

Multi-Channel WSJ Audio (LDC2014S03), designed to address the challenges of speech recognition in meetings, contains 100 hours of audio from British English speakers reading sentences from WSJ0 Cambridge Read News. There were three recording scenarios: a single stationary speaker, two stationary overlapping speakers, and a single moving speaker.

All CSR corpora and their related data sets are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

Classic Corpora in LDC’s Catalog: AMR

Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a rooted, directed graph that represents its whole-sentence meaning and is conventionally written in a tree-like notation. AMR uses PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and more to represent the semantic structure of a sentence largely independently of its syntax.
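As an illustration, the often-cited AMR for "The boy wants to go" pairs the sentence with a graph in which the variable for "boy" is reentrant: the boy is the agent (ARG0) of both want-01 and go-01, which is how AMR encodes within-sentence coreference. The sketch below, a simplified parser written for this article and not part of any LDC tooling, reads such an AMR (in its parenthesized tree-like notation) into a flat list of triples; it handles only variables, concepts, and roles, not string literals or other features of full AMR:

```python
import re

def parse_amr(s):
    """Parse a simplified AMR string into (source, relation, target) triples.
    Illustrative only: supports variables, concepts, and nested roles."""
    tokens = re.findall(r'[()]|[^\s()]+', s)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                       # consume '('
        var = tokens[pos]; pos += 1    # variable, e.g. 'w'
        pos += 1                       # consume '/'
        concept = tokens[pos]; pos += 1
        triples = [(var, 'instance', concept)]
        while tokens[pos] != ')':
            role = tokens[pos].lstrip(':'); pos += 1
            if tokens[pos] == '(':     # nested node
                child_var, child_triples = parse_node()
                triples.append((var, role, child_var))
                triples.extend(child_triples)
            else:                      # bare variable: a reentrancy
                triples.append((var, role, tokens[pos])); pos += 1
        pos += 1                       # consume ')'
        return var, triples

    return parse_node()[1]

amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""
for triple in parse_amr(amr):
    print(triple)
# Note that b appears as ARG0 of both w (want-01) and g (go-01):
# the written form is a tree, but the meaning it encodes is a graph.
```

Reading the notation into triples like this makes the graph structure explicit, which is why reentrant variables, rather than repeated concepts, are used for coreference.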

LDC’s Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12), Release 2.0 (LDC2017T10), and Release 3.0 (LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text, and it includes multi-sentence annotations.

LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07) and 2.0 (LDC2021T13) developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian and Chinese Mandarin translations of a subset of sentences from AMR 2.0.

Visit LDC’s Catalog for more details about these publications.