Web 1T 5-gram, 10 European Languages Version 1
| Item Name: | Web 1T 5-gram, 10 European Languages Version 1 |
| Authors: | Thorsten Brants, Alex Franz |
| LDC Catalog No.: | LDC2009T25 |
| ISBN: | 1-58563-525-1 |
| Release Date: | Oct 20, 2009 |
| Data Type: | text |
| Data Source(s): | web collection |
| Application(s): | language identification, language modeling, machine learning, machine translation |
| Language(s): | Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish |
| Language ID(s): | ces, deu, fra, ita, nld, pol, por, ron, spa, swe |
| Distribution: | 7 DVD |
| Member fee: | $0 for 2009 members |
| Non-member Fee: | US$150.00 |
| Reduced-License Fee: | N/A |
| Extra-Copy Fee: | US$150.00 |
| Non-member License: | yes |
| Member License: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Thorsten Brants, Alex Franz. 2009. Web 1T 5-gram, 10 European Languages Version 1. Linguistic Data Consortium, Philadelphia. |
Introduction
Web 1T 5-gram, 10 European Languages Version 1 was created by Google,
Inc. It consists of word n-grams and their observed frequency counts for ten
European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese,
Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams
(single words) to five-grams. The n-gram counts were generated from approximately
one hundred billion word tokens of text for each language, or approximately one trillion
total tokens.
The n-grams were extracted from publicly accessible web pages from October
2008 to December 2008. This data set contains only n-grams that appeared at
least 40 times in the processed sentences. Less frequent n-grams were discarded.
While the aim was to identify and collect pages in the target languages only,
some text from other languages is likely present in the final data.
This dataset is useful for statistical language modeling, with applications
including machine translation and speech recognition, among other uses.
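The counting and cutoff described above can be illustrated with a minimal sketch. It is not the procedure Google used: the function names `count_ngrams` and `apply_cutoff` are hypothetical, sentences are assumed to be already tokenized, and no sentence-boundary markers are added. Only the n-gram orders (1 to 5) and the threshold of 40 come from the description above.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[list[str]], max_order: int = 5) -> Counter:
    """Count every n-gram of order 1..max_order inside each tokenized sentence."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_order + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def apply_cutoff(counts: Counter, min_count: int = 40) -> dict:
    """Keep only n-grams observed at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input (made up for illustration): one sentence seen 40 times, another once.
sentences = [["das", "ist", "gut"]] * 40 + [["das", "war", "gut"]]
kept = apply_cutoff(count_ngrams(sentences))
```

In this toy input, every n-gram containing "war" occurs only once and is discarded, while the n-grams of the repeated sentence reach the threshold and are kept.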
Data
The input encoding of documents was automatically detected, and all text was
converted to UTF-8.
The following table contains statistics for the entire release.
File sizes (entire corpus): approximately 27.9 GB of bzip2-compressed text files
| Total number of tokens: | 1,306,807,412,486 |
| Total number of sentences: | 150,727,365,731 |
| Total number of unigrams: | 95,998,281 |
| Total number of bigrams: | 646,439,858 |
| Total number of trigrams: | 1,312,972,925 |
| Total number of fourgrams: | 1,396,154,236 |
| Total number of fivegrams: | 1,149,361,413 |
| Total number of n-grams: | 4,600,926,713 |
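As a quick consistency check on the statistics table, the per-order counts can be summed and compared with the reported total. The sketch below copies the figures from the table. The variable names are illustrative, and the derived average sentence length is only a rough indication, since the release notes here do not say whether the token total includes any boundary markers.

```python
# Figures copied from the statistics table.
ngram_counts = {
    1: 95_998_281,      # unigrams
    2: 646_439_858,     # bigrams
    3: 1_312_972_925,   # trigrams
    4: 1_396_154_236,   # fourgrams
    5: 1_149_361_413,   # fivegrams
}
total_tokens = 1_306_807_412_486
total_sentences = 150_727_365_731
reported_total_ngrams = 4_600_926_713

# Sum of the per-order counts; should equal the reported total.
total_ngrams = sum(ngram_counts.values())

# Rough average sentence length in tokens.
avg_tokens_per_sentence = total_tokens / total_sentences

print(total_ngrams == reported_total_ngrams)
print(round(avg_tokens_per_sentence, 2))
```

The per-order counts add up to the reported total of 4,600,926,713 n-grams, and the average comes to roughly 8.7 tokens per sentence.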
Samples
For an example of the data in this corpus please examine this sample file.
Content Copyright
Portions © 2009 Google Inc., © 2009 Trustees of the University of
Pennsylvania