


SAID

Item Name: SAID
Authors: Koenraad Kuiper, Heather McCann, Heidi Quinn, Therese Aitchison, and Kees van der Veer
LDC Catalog No.: LDC2003T10
ISBN: 1-58563-268-6
Release Date: Jun 26, 2003
Data Type: text
Data Source(s): dictionaries
Application(s): machine translation, parsing
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2003 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Koenraad Kuiper, et al. 2003. SAID. Linguistic Data Consortium, Philadelphia.

Introduction

SAID (A Syntactically Annotated Idiom Dataset) was published by the Linguistic Data Consortium (LDC) under catalog number LDC2003T10 and ISBN 1-58563-268-6.

The purpose of this corpus is to provide data for investigating the structural configurations in which English idioms are typically found. The assumption was that, since idioms are phrasal lexical items (PLIs), they would have idiosyncratic structural properties. For studying the structural properties of phrasal lexical items, the data is more useful if it is syntactically annotated.

Data

The data was originally drawn from four dictionaries of English idioms.

Only citation forms, suitably adapted for this purpose, were used. The citation files were amalgamated. The rationale for the selection was that these are among the biggest and most comprehensive lists of English idioms.

There are 13,467 phrasal lexical items in this corpus.

The analysis of the phrasal lexical items was manual, while the bracketing symmetry was checked computationally.
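
A bracketing symmetry check of this kind can be sketched in a few lines of Python. This is only an illustration: the square-bracket notation, the category labels, and the function name brackets_balanced are assumptions made for the example, not the notation or tooling actually used for SAID.

```python
def brackets_balanced(annotation: str, open_ch: str = "[", close_ch: str = "]") -> bool:
    """Return True if every opening bracket has a matching closing bracket.

    The bracket characters are assumptions; the dataset's own notation may differ.
    """
    depth = 0
    for ch in annotation:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                # A closing bracket appeared before its opening bracket.
                return False
    # Any bracket still open at the end makes the annotation asymmetric.
    return depth == 0


# Hypothetical annotated idiom, not taken from the dataset.
print(brackets_balanced("[VP [V kick] [NP [Det the] [N bucket]]]"))
```

A check like this only confirms that brackets pair up. It says nothing about whether the labels or the constituent boundaries are linguistically correct, which is why the analysis itself was done by hand.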

In order to facilitate machine manipulation of the annotated data, the manual analysis was converted to PROLOG format. This involved expansions of those PLIs which had optional constituents so that both the case with and the case without the options were made available.
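
The expansion of optional constituents can be illustrated with a short Python sketch. It assumes, purely for the example, that an optional constituent is written in parentheses and that parentheses are not nested. The function name expand_optional and the sample idiom are hypothetical, and the sketch works on plain strings rather than on the PROLOG terms in the release.

```python
import re
from itertools import product

# Assumed notation: "(...)" marks an optional constituent, without nesting.
OPTIONAL = re.compile(r"\(([^()]*)\)")


def expand_optional(form: str) -> list[str]:
    """Return every variant of a citation form, with and without each option."""
    # With one capturing group, re.split alternates fixed text (even indices)
    # and the contents of each optional group (odd indices).
    parts = OPTIONAL.split(form)
    choices = [[p] if i % 2 == 0 else [p, ""] for i, p in enumerate(parts)]
    variants = []
    for combo in product(*choices):
        # Collapse the double space left behind when an option is dropped.
        text = " ".join("".join(combo).split())
        if text not in variants:
            variants.append(text)
    return variants


# Hypothetical citation form, not taken from the dataset.
for variant in expand_optional("a (real) pain in the neck"):
    print(variant)
```

A form with n optional constituents yields up to 2^n variants, so the expanded data can hold more entries than the original citation forms.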

The files are provided in text format, with each record separated by a carriage return.
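
Reading such a file into individual records might look like the following Python sketch. The function name split_records and the sample strings are hypothetical. Because the exact line terminator is not specified beyond "carriage return", the sketch relies on str.splitlines, which accepts a bare carriage return, a line feed, or both together.

```python
def split_records(raw: str) -> list[str]:
    """Split raw file contents into non-empty records, one per line.

    str.splitlines treats "\r", "\n", and "\r\n" (among other line
    boundaries) as separators, so the file's exact terminator does not matter.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Hypothetical file contents, not taken from the dataset.
sample = "record one\rrecord two\r\nrecord three\n"
print(split_records(sample))
```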

Sponsorship

The New Zealand Vice Chancellors' Committee

The University of Canterbury

Updates

There are no updates available at this time.

Content Copyright

Portions © 2003 Koenraad Kuiper, Heather McCann, Heidi Quinn, Therese Aitchison, Kees van der Veer



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.