


Web 1T 5-gram Version 1

Item Name: Web 1T 5-gram Version 1
Authors: Thorsten Brants, Alex Franz
LDC Catalog No.: LDC2006T13
ISBN: 1-58563-397-6
Release Date: Sep 19, 2006
Data Type: text
Data Source(s): web collection
Application(s): language modeling, machine learning, natural language processing
Language(s): English
Language ID(s): eng
Distribution: 6 DVD
Member fee: $0 for 2006 members
Non-member Fee: US$150.00
Reduced-License Fee: US$150.00
Extra-Copy Fee: US$150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Thorsten Brants, Alex Franz. 2006. Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia.

Introduction

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

Source Data

The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.

Character Encoding

The input encoding of each document was automatically detected, and all text was converted to UTF-8.

Tokenization

The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

  • Hyphenated words are usually separated, and hyphenated numbers usually form one token.
  • Sequences of numbers separated by slashes (e.g., in dates) form one token.
  • Sequences that look like URLs or email addresses form one token.
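
The exceptions listed above can be approximated with a small regular-expression tokenizer. This is only an illustrative sketch: the actual tokenizer used to build the corpus is not published here, the pattern and the name `tokenize` are assumptions, and treating the hyphen between words as its own token is a guess at what "separated" means.

```python
import re

# Illustrative approximation only; this is NOT the tokenizer used to build the corpus.
TOKEN_RE = re.compile(
    r"""
    (?:https?://|www\.)[^\s]+         # URL-like sequence, kept as one token
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email-like sequence, kept as one token
  | \d+(?:[/-]\d+)+                   # numbers joined by slashes or hyphens, one token
  | \w+                               # plain word or number
  | [^\w\s]                           # any other symbol, e.g. a hyphen between words
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("state-of-the-art results from 9/19/2006"))
```

The alternatives are tried in order, so URL-like, email-like, and numeric sequences are claimed before the generic word rule can split them.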

Data Sizes

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens:    1,024,908,267,229
Number of sentences:    95,119,665,584
Number of unigrams:         13,588,391
Number of bigrams:         314,843,401
Number of trigrams:        977,069,902
Number of fourgrams:     1,313,818,354
Number of fivegrams:     1,176,470,663
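
Two derived figures follow from the table above: the average sentence length and the total number of n-gram entries. The short sketch below computes both. The variable names are illustrative, and it assumes each count in the table is the number of entries of that order.

```python
# Figures copied from the table above.
tokens = 1_024_908_267_229
sentences = 95_119_665_584
ngram_entries = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

avg_tokens_per_sentence = tokens / sentences
total_ngram_entries = sum(ngram_entries.values())

print(f"{avg_tokens_per_sentence:.2f} tokens per sentence on average")
print(f"{total_ngram_entries:,} n-gram entries across all five orders")
```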

Sample Data

The following is an example of the 3-gram data contained in this corpus:

ceramics collectables collectibles	55
ceramics collectables fine	130
ceramics collected by	52
ceramics collectible pottery	50
ceramics collectibles cooking	45
ceramics collection ,	144
ceramics collection .	247
ceramics collection </S>	120
ceramics collection and	43
ceramics collection at	52
ceramics collection is	68
ceramics collection of	76
ceramics collection |	59
ceramics collections ,	66
ceramics collections .	60
ceramics combined with	46
ceramics come from	69
ceramics comes from	660
ceramics community ,	109
ceramics community .	212
ceramics community for	61
ceramics companies .	53
ceramics companies consultants	173
ceramics company !	4432
ceramics company ,	133
ceramics company .	92
ceramics company </S>	41
ceramics company facing	145
ceramics company in	181
ceramics company started	137
ceramics company that	87
ceramics component (	76
ceramics composed of	85
ceramics composites ferrites	56
ceramics composition as	41
ceramics computer graphics	51
ceramics computer imaging	52
ceramics consist of	92
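
The sample rows above suggest a simple record layout: the n-gram's tokens separated by single spaces, then a tab, then the count. A minimal reader under that assumption might look like the sketch below. The function names are hypothetical, and the path passed to `iter_ngrams` is whatever gzip'ed file from the distribution you want to read.

```python
import gzip


def parse_ngram_line(line):
    """Split one record into (tuple of tokens, integer count)."""
    # Split on the last tab so the count is isolated from the n-gram text.
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return tuple(ngram.split(" ")), int(count)


def iter_ngrams(path):
    """Stream records from a gzip-compressed, UTF-8 encoded n-gram file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_ngram_line(line)
```

For example, `parse_ngram_line("ceramics comes from\t660")` returns `(("ceramics", "comes", "from"), 660)`. Streaming line by line avoids loading a whole file into memory, which matters given the corpus size.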

The following is an example of the 4-gram data in this corpus:

serve as the incoming	92
serve as the incubator	99
serve as the independent	794
serve as the index	223
serve as the indication	72
serve as the indicator	120
serve as the indicators	45
serve as the indispensable	111
serve as the indispensible	40
serve as the individual	234
serve as the industrial	52
serve as the industry	607
serve as the info	42
serve as the informal	102
serve as the information	838
serve as the informational	41
serve as the infrastructure	500
serve as the initial	5331
serve as the initiating	125
serve as the initiation	63
serve as the initiator	81
serve as the injector	56
serve as the inlet	41
serve as the inner	87
serve as the input	1323
serve as the inputs	189
serve as the insertion	49
serve as the insourced	67
serve as the inspection	43
serve as the inspector	66
serve as the inspiration	1390
serve as the installation	136
serve as the institute	187
serve as the institution	279
serve as the institutional	461
serve as the instructional	173
serve as the instructor	286
serve as the instructors	161
serve as the instrument	614
serve as the instruments	193
serve as the insurance	52
serve as the insurer	82
serve as the intake	70
serve as the integral	68
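
For language modeling, counts like these are typically turned into relative frequencies of the last word given the preceding words. The sketch below does this for a handful of rows copied from the 4-gram sample above. The function names are illustrative, and because the normalization runs only over the rows supplied, the resulting values are not probabilities over the full corpus.

```python
from collections import defaultdict

# A few rows copied from the 4-gram sample above (n-gram, tab, count).
SAMPLE = """\
serve as the independent\t794
serve as the information\t838
serve as the initial\t5331
serve as the input\t1323
serve as the inspiration\t1390
"""


def continuation_counts(lines):
    """Map each context (all tokens but the last) to {last token: count}."""
    table = defaultdict(dict)
    for line in lines:
        ngram, count = line.rsplit("\t", 1)
        *context, word = ngram.split(" ")
        table[tuple(context)][word] = int(count)
    return table


def relative_frequencies(table, context):
    """Normalize the counts for one context over the rows that were supplied."""
    counts = table[tuple(context)]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


table = continuation_counts(SAMPLE.splitlines())
freqs = relative_frequencies(table, ("serve", "as", "the"))
best = max(freqs, key=freqs.get)
print(best, round(freqs[best], 3))
```

A real model would normalize by the count of the context itself (available from the lower-order n-gram files) and apply smoothing, rather than summing over a partial list of continuations.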

Content Copyright

Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.