Hansard French/English
| Item Name: | Hansard French/English |
| Authors: | Salim Roukos, David Graff, and Dan Melamed |
| LDC Catalog No.: | LDC95T20 |
| ISBN: | 1-58563-048-9 |
| Data Type: | text |
| Data Source(s): | government documents |
| Application(s): | machine translation |
| Language(s): | English, French |
| Language ID(s): | ENG, FRA |
| Distribution: | 1 CD |
| Member Fee: | $0 for 1995, 1996, 1997 members |
| Non-member Fee: | US$6500.00 |
| Reduced-License Fee: | US$3250.00 |
| Extra-Copy Fee: | US$150.00 |
| Non-member License: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Salim Roukos, David Graff, and Dan Melamed. 1995. Hansard French/English. Linguistic Data Consortium, Philadelphia. |

The Hansard Corpus consists of parallel texts in English and
Canadian French, drawn from official records of the
proceedings of the Canadian Parliament. While the content is
therefore limited to legislative discourse, it spans a broad
assortment of topics and the stylistic range includes
spontaneous discussion and written correspondence along with
legislative propositions and prepared speeches.

The collection presented here has been assembled by the LDC by
way of archives from two distinct secondary sources. Material
from one time period of parliamentary proceedings was acquired
through the IBM T. J. Watson Research Center, while material
from another period was acquired through Bell Communications
Research Inc. (Bellcore). The combined collection covers a
time span from the mid-1970s through 1988, with no apparent
duplication between the two data sources.

Aside from covering different time periods, the two archives
are organized differently and have undergone different
amounts and kinds of processing in being prepared as a
parallel language resource. In addition, the Bellcore set
itself comprises two distinct types of data -- one appears to
be the main parliamentary proceedings (similar in nature to
the IBM set), while the other consists of transcripts from
committee hearings.

The three sets have been kept distinct in this publication, and each
is described in greater detail in separate documentation files on the
CD-ROM.

The three sets have the following in common:
- They are rendered here using the 8-bit ISO-Latin1 character
encoding standard.
- They use a minimal amount of SGML tagging to identify
sentences or paragraphs.
- All sets are organized using a parallel file structure, in
which the content of a given English text file is matched by
the content of a corresponding French text file.
- The SGML text files for the IBM and the Bellcore
committee-hearings data are published in compressed form,
using the freely available GNU Zip utility (gzip). The Bellcore
main-session files are not compressed.
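
The shared properties listed above (ISO-Latin1 encoding, light SGML
tagging, optional gzip compression) suggest a simple way to load one
text file. The following is a minimal Python sketch, not the
distribution's own tooling. It assumes one sentence or paragraph per
line and that compressed files carry a .gz suffix; neither detail is
stated here, and the documentation files on the CD-ROM are the
authority. The name read_text_units and the generic tag pattern are
hypothetical, since the actual SGML tag names are not given.

```python
import gzip
import re
from pathlib import Path

# Matches any SGML-style tag. The real tag names are not given in this
# description, so the pattern is deliberately generic (an assumption).
TAG_PATTERN = re.compile(r"<[^>]+>")


def read_text_units(path):
    """Return the non-empty, tag-stripped lines of one corpus file.

    Files whose name ends in .gz are decompressed on the fly; all other
    files are read directly. Both are decoded as ISO-8859-1 (ISO-Latin1).
    Assumes one sentence or paragraph per line.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="iso-8859-1") as handle:
        stripped = (TAG_PATTERN.sub("", line).strip() for line in handle)
        return [text for text in stripped if text]
```

Decoding explicitly as ISO-8859-1 matters on systems whose default
encoding is UTF-8, where accented French characters would otherwise
fail to decode or be misread.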

The three sets differ as follows:
- The IBM collection is presented as a sequence of parallel
sentences (there are nearly 2.87 million parallel sentence
pairs in the set).
- The Bellcore data are presented as sequences of paragraphs.
- The Bellcore main-session data are accompanied by mapping
files that provide computed paragraph alignments and
word-token correspondences; no additional alignment data are
provided for the Bellcore committee texts (and none are needed
for the IBM sentences).
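
Since the IBM set is described as a sequence of parallel sentences
that needs no separate alignment data, pairing can be sketched
positionally. The Python example below assumes that the i-th sentence
of an English file corresponds to the i-th sentence of the matching
French file; that reading is an assumption drawn from the description
above, and the function name pair_sentences is hypothetical. It does
not apply to the Bellcore paragraph data, whose main-session
alignments come from the mapping files.

```python
def pair_sentences(english, french):
    """Pair the i-th English sentence with the i-th French sentence.

    Both arguments are lists of sentences already extracted from a pair
    of corresponding files. A length mismatch is reported rather than
    silently truncated, since it would indicate a misaligned file pair.
    """
    if len(english) != len(french):
        raise ValueError(
            f"sentence counts differ: {len(english)} English "
            f"vs {len(french)} French"
        )
    return list(zip(english, french))
```

Raising an error on a count mismatch is a design choice for this
sketch: one dropped sentence would shift every later pair, which is
worse for translation work than stopping early.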