Writing a Corpus Cookbook

Martin Wynne
Oxford Text Archive
Oxford University Computing Services
Oxford UK - OX2 6NN
martin.wynne@ota.ahds.ac.uk

Abstract

This paper discusses the issues arising from the planning of a guide to good practice in developing linguistic corpora for UK academics, a project currently being undertaken at the Oxford Text Archive (OTA). Following a study of the OTA's subject coverage and Collections Policy, it was decided that additional support should be offered in the area of linguistics, with a workplan to improve the OTA's provision for the subject. One of the key activities to be undertaken is the publication of an AHDS Guide to Good Practice covering the development of linguistic corpora. This paper reports on the experience so far of planning and writing this book. The following questions are addressed:

What existing resources are there in this area, and are they adequate?
What is the guide for?
Who is the guide for?
What is the UK academic linguistics community?
What is good/best practice?
How should the OTA service develop?

Specifically, the question of what constitutes best practice in digitising and/or encoding language corpora is considered in most detail. As an example, the advantages and disadvantages of recommending the use of XML markup are considered. In view of the diversity of resources which are being and will be developed, the advantages of an open, eclectic and non-prescriptive approach are considered.

1 Guidelines for building corpora

Scholars increasingly want real language data, and are turning to corpora. While many general and specialised corpora do exist, scholars often find that they need to build their own data set.
Evidence for this is given by the recent publication of various volumes dealing with the introduction of corpus techniques to different areas of linguistic study (Dodd 2000, Kenny 2001) and with the construction of particular types of corpus (Ghadessy et al. 2001).

Corpora can be expensive to make and so research grants are usually needed to develop them. Research grant applications in the humanities are increasingly assessed on the basis of their technical, as well as general academic, viability. The Arts and Humanities Research Board, a key funding body in the humanities in the UK, now insists on a separate technical evaluation of project proposals in which the creation of an electronic resource is a key part of the proposed workplan. As well as general issues of project management, this evaluation focuses on the questions of whether current best practice is being followed and whether the proposed resource is needed and not already in existence. It is not currently clear how this best practice is identified, assessed and spread.

For experienced corpus linguists, there should not really be a problem, as they are part of the community which originates, discusses and develops the various practices in the field. The problem lies more with the increasing numbers of linguists who have become aware of the methodologies, techniques and technologies associated with corpora and want to adopt some of them. They will not, by definition, normally start off with a thorough grounding in these practices. How do they learn what they need to know in order to design, build and analyse a corpus without repeating mistakes that were made and corrected in the past?

1.1 Existing sources for best practice

If you want to build a language corpus, where can you get help? Certainly, you want to know what resources already exist in your field, and probably, you want to know what is considered best practice as well.
You probably need to apply for funding too, and want to know what practices are considered suitable by the funding bodies. Finding this sort of information is not easy, at least in the United Kingdom (UK) academic arena.

There is a wealth of books, particularly of an introductory nature, on corpus linguistics. Many are aimed at students (Barnbrook 1996, McEnery and Wilson 2001, Biber et al. 1998, Kennedy 1998), although they can be useful as basic introductions to the field. All include useful case studies, and Biber et al. (1998) has the very useful Methodology Boxes, but it contains in total only thirty-five pages on corpus design, building and analysis, so it is useful only as a very brief introduction. Other introductory works are more theoretical (Sinclair 1991 and Stubbs 1996, which also share broadly similar intellectual outlooks). Almost all of the above are dated. Kennedy (1998), for example, although a relatively recent publication, can only report on the announced plans to build the Bank of English in a section on the COBUILD corpora.

Descriptions of corpus building projects are very useful repositories of information. The lessons to be learned are, however, of diminishing value as technology and practice change. Such reports are published in learned journals, where the aim is usually to report on interesting and original findings rather than to describe the methodology in detail. Special issues specifically on the topic of corpus development (e.g. Zampolli and Ostler 1993) are, however, worth tracking down. One article, Atkins, Clear and Ostler (1992), comes closest to fulfilling what is needed by those searching for guidance on best practice in corpus development, but this is now a decade old. The journal Literary and Linguistic Computing is probably the best periodical source for articles about methodology in corpus linguistics.
The COBUILD team published a volume reporting on many aspects of their innovative project (Sinclair 1987), and while this was of great value at the time, it is very old now by corpus linguistics standards and is thus chiefly of historical interest.

Handbooks to accompany the release of corpora are also useful, although inevitably partial in their treatment of the general and theoretical issues (Francis and Kučera 1964; Johansson, Leech and Goodluck 1978; Aston and Burnard 1998). Many are freely available online at the ICAME website (http://khnt.hit.uib.no/icame/manuals/brown/).

On the more practical level, the AHDS Guide to Good Practice Creating and Documenting Electronic Texts is of real value in terms of capturing digital text, documentation and preservation measures. It is, however, concerned rather more with the production of digital surrogates of manuscripts and does not attempt to deal specifically with issues relating to language corpora. The AHDS guides are discussed further below.

There are some online resources of value at http://www.ahds.ac.uk/corpus.htm. The AHDS also provides a couple of case studies on writing the technical appendix of research grant applications, although not specifically for language corpus building. There are other online resources such as the corpus linguistics pages at the University of Essex (http://clwww.essex.ac.uk/w3c/), at UCREL in Lancaster University (http://www.comp.lancs.ac.uk/ucrel/) and Michael Barlow's website (http://www.ruf.rice.edu/~barlow/corpus.html). These have a wealth of useful links to resources of various types.
Another group of publications is the growing number of conference proceedings and the occasional festschrift, but these contain articles selected on the basis of original scholarly content, availability, self-selection by submission to a conference and the identity of the author. These criteria are not intended to produce guides to best practice.

In summary, there is a lot of very useful information available. But it is dispersed across many different types of publication, and you need to be an initiate in the field of corpus linguistics to know how to piece together the puzzle. In terms of identifying current best practice, the current literature suffers in general from being outdated, too specific or too hard to find.

The goals of a guide to good practice are different to the goals of all of the above varieties of publication. No one article or book currently provides precisely all of the following:

a practical guide to how to go about building a corpus;
articles written by leading experts in the field, chosen for their expertise and writing ability;
the state of the art: the application of current best practice;
a range of alternative theoretical and methodological approaches.

It is therefore proposed by the Arts and Humanities Data Service (AHDS) in the UK that the Oxford Text Archive (OTA) commission a guide to good practice for developing linguistic corpora.

A possible objection to a venture of this type is to say that it is spoon-feeding and dumbing-down the academic enterprise. Scholars should be aware of what is happening in the field, be able to do their own literature review and should study and critique existing practices. This is true as regards the expert practitioners of corpus linguistics, but they are not the target audience for the guide. Rather, the authors will be drawn from their ranks.
The target audience here is non-corpus linguists, who, in the real world, do not have time to go away and become experts in another field, or sub-field, in order to fill in a research grant application.

1.2 Finding existing resources

As noted above, it is also necessary, for funding reasons, to know what resources are already in existence and available. This is also desirable for practical and ethical reasons, and for investigating and learning from the practices followed in the construction of corpora in the past. There is unfortunately no way to gain a comprehensive view of what has been made, or to find information about each resource and its availability. There are printed guides, such as Condron, Fraser and Sutherland (2000), but these are partial and inevitably out of date by the time they are published (and there are currently no plans to update this title). Portals such as HUMBUL (http://www.humbul.ac.uk/) aim to keep up with resources on the web, but do not currently cover language resources very thoroughly.

The Open Language Archives Community (OLAC) initiative aims to make this process easier. The OTA holdings will soon be made more easily discoverable, identifiable and accessible through the participation of the OTA in OLAC. OTA resource descriptions are now delivered to the OLAC portal, so that users looking for language resources have only to send a query to the central virtual data provider in order to find some basic information about the resources which are available in a multitude of archives. The aim is to bridge the gap between potential users and the mass of unconnected information about the resources. OLAC will be launched in January 2002.

Various websites also offer lists of links to resources, and email discussion lists help to bring researchers together; these are currently the most useful resources.
No other solution to this problem is proposed here, but it is noted that resources and the associated documentation are themselves repositories of information about best practice and are potentially a very valuable source of information.

2 A Guide to Good Practice

The problem of providing access to best practice for non-expert users has been examined in section 1 above. In section 2, a proposed solution to this problem is outlined. This is the plan for the OTA to produce a guide to best practice in developing linguistic corpora. First, a little background on the OTA and on the AHDS Guides to Good Practice.

2.1 The Oxford Text Archive

The Oxford Text Archive is a long-established facility supported by Oxford University, and based within the Humanities Computing Unit. Founded in 1976 by Lou Burnard, the OTA has over twenty years' experience of serving the research and teaching needs of electronic text users within the scholarly community.

The OTA holds several thousand electronic texts and linguistic corpora, in a variety of languages. Its holdings include electronic editions of works by individual authors, standard reference works such as the Bible and mono- and bilingual dictionaries, and a range of language corpora. The OTA does not produce digital resources, and relies upon deposits from the wider community as the primary source of high-quality materials.

The OTA seeks to address three key areas:

Collect, catalogue, preserve, and redistribute digital resources of interest to those working in literary and linguistic studies within the UK's Higher Education and Further Education communities.
Develop appropriate licensing conditions and technical mechanisms for the effective distribution of such resources.
Promote good practice in the creation and use of such resources in both research and teaching.
Since 1996, the OTA has been working as a Service Provider for the national Arts and Humanities Data Service (AHDS) to support academics working in all areas of literary and linguistic studies in the UK. One aspect of this work is the provision of advice to UK academics on the creation and use of electronic texts and text corpora.

Following a study of the OTA's subject coverage and Collections Policy, it was decided that additional support should be offered in the area of linguistics. Martin Wynne has been appointed to the post of Information Officer (Linguistics) with a brief to improve the OTA's provision for linguistics and to raise the profile of the OTA within this sector of the academic community.

2.2 Guides to Good Practice

The series of Guides to Good Practice from the Arts and Humanities Data Service (AHDS) aims to provide the arts and humanities teaching and research communities with practical instruction in applying recognised standards and good practice to the creation, preservation and use of digital resources. The OTA has already been responsible for one publication, Morrison, Popham and Wikander (2000). It is now proposed that this be complemented by a guide to the more specific problems associated with creating corpora.

On the AHDS website the following text can be found (http://ahds.ac.uk/guides.htm):

Developing Linguistic Corpora: Drawing on the experiences of the British National Corpus Project, this Guide will examine the creation of large-scale electronic corpora for linguistic study. It will also identify the factors which must be taken into account when designing, creating, and distributing such corpora.

The next sub-section discusses progress so far in the planning of this publication.

2.3 Developing Linguistic Corpora

A draft proposal for the title Developing Linguistic Corpora has now been made and is outlined in this subsection. The proposal is for an online publication with chapters written by selected authors.
The intended readership of this title is UK academics planning a project which involves developing a language corpus. Little or no knowledge of corpora and computational techniques is assumed, although readers can be expected to be linguists.

The title will essentially be a web publication, to be hosted at the OTA website (supported by the Oxford University Computing Services) and also linked from the AHDS website. Access will be free and unrestricted. A printed version may also be made available, either as a PDF file available on a print-on-demand basis, or in a limited print run from Oxbow Books, the publishers of the existing titles in the series of AHDS Guides to Good Practice.

The proposed structure of the book is that there will be eight chapters, covering the following topics:

The text and the corpus: some basic principles;
Design and representativeness;
Metadata and textual encoding;
Working with speech;
Working with languages other than English;
Corpora with more than one language;
Linguistic annotation;
Archiving and distribution.

Authors are selected on the basis that they have as many as possible of the following attributes:

essentially UK-based, or part of the UK community;
an academic linguist;
expert and pre-eminent in their field;
extensive experience of practical work in the relevant field;
ability to write clearly and accessibly.

The intended style is non-technical. Extensive bibliographies and appendices will be included to give references and acknowledgements and to link to more detailed, advanced works or alternative approaches, but for the text itself a rather more informal style should be adopted, less academic than a monograph or textbook and more like a user manual. Each chapter will, as far as possible, have the same structure: basic principles, a survey of the relevant different approaches, and a case study.
As this will be an electronic publication, the possibilities of hypertext and multimedia publishing on the web may be exploited. In particular, the absence of the space restrictions of a print publication means that extended appendices, case studies and illustrative material may be used. While, as mentioned above, there may be a printed version of the Guide, this should not be a barrier to innovation in the web edition, and the conversion to print will be undertaken by the OTA without the necessity of additional work by the authors.

It is expected that the publication will have a high worldwide readership and be highly influential on future practice, at least in the UK community.

2.4 Standards, Recommendations, Choices

It is not the intention that the guide should ally itself with any one particular theoretical outlook or standardization initiative. Despite the links that the Oxford University Computing Services has with the TEI Consortium, the OTA maintains an open mind towards the TEI and other standards. The guide will aim to make users aware of the choices open to them. It is not clear that there is a single encoding standard that should be adopted in all cases.

Indeed, some fundamental questions need to be asked about encoding standards. Metadata standards are constructive and useful if they have the practical result that resources are easier to identify. When it comes to text encoding standards, however, there is a danger that attempting to adhere to, or develop versions of, over-elaborate generalized markup schemes becomes the all-consuming task in a project, or even a barrier to getting started.

The questions that need to be asked include the following. Who are the standards for? Do they exist to fulfil a need from resource creators and users, or does the momentum come from the standards development community? And are standards in a sufficiently mature state and sufficiently accepted in the community that it is appropriate to recommend their adoption?
Perhaps for the more general, representative, conventional text corpora, it could be argued that there are solidly accepted practices, though even here there are many important dissenters. For the many new types of corpora that are constantly being developed, there cannot be an already accepted standard. There is also the danger that valuable new and innovative practices may be stifled by the imposition of a standardized procedure. For this reason, an open and eclectic model should be preferred, whereby users are not forced to adopt a standard which may not be appropriate for their needs.

2.5 The Text Encoding Initiative

The TEI Guidelines for the encoding of text corpora (Sperberg-McQueen and Burnard 1999 and http://www.hcu.ox.ac.uk/TEI/Guidelines2/index.html) can be considered a de facto standard and so are worth considering in a little detail.

There is some resistance to uptake of the TEI recommendations because of the single tree annotation problem. This is a feature of SGML (and hence XML), and is not therefore limited to the TEI, but since the TEI recommends the use of SGML and now XML, it is a relevant issue. Let us look at a real example of the problem. In a corpus creation and analysis project, all passages of reported speech and thought were identified, categorised and annotated in a corpus (Wynne, Short and Semino 1998). As would be expected, these passages often overlap with text structural elements, such as paragraphs, as in the following example:

One officer said: 'This is like an episode from Inspector Morse.

"The victim was single but we believe he had several lady friends.

"It is possible that it was something in the background of one of those relationships that caused his death.

"We don't think he was linked with any criminals or involved in any secret wrong doing."

Police have not ruled out the possibility of a contract killing by a hitman.
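
The overlap in this passage can be checked mechanically. The sketch below is illustrative only. The element names p (paragraph) and spcat (speech category) and the values NRS, DS and N are those used in this paper, but the attribute name cat, the wrapping text element and the exact tag positions are assumptions made for the illustration and are not the encoding used in the corpus itself. A standard XML parser rejects the string because the direct speech (DS) element opens in the first paragraph and closes in the fourth.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline markup. The element names <p> and <spcat> come from the
# paper; the attribute name "cat", the wrapping <text> element and the exact
# tag positions are assumptions made for this illustration.
OVERLAPPING = """<text>
<p><spcat cat="NRS">One officer said: </spcat><spcat cat="DS">'This is like an episode from Inspector Morse.</p>
<p>"The victim was single but we believe he had several lady friends.</p>
<p>"It is possible that it was something in the background of one of those relationships that caused his death.</p>
<p>"We don't think he was linked with any criminals or involved in any secret wrong doing."</spcat></p>
<p><spcat cat="N">Police have not ruled out the possibility of a contract killing by a hitman.</spcat></p>
</text>"""


def is_well_formed(xml_string: str) -> bool:
    """Return True if the string parses as a single well-formed XML tree."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # The DS element crosses three paragraph boundaries, so the parser
    # reports a mismatched tag and this reports that the text is not
    # well-formed.
    print(is_well_formed(OVERLAPPING))
```

Any annotation that crosses a paragraph boundary in this way produces the same failure, which is why it has to be either fragmented or moved out of the text if a well-formed document is required.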

The passages of reported speech are tagged as spcat elements, in this text with the values NRS (narrator's report of speech, i.e. a reporting clause), DS (direct speech) and N (simple narrative). However, there is a bracketing paradox here because the spcat element with value DS overlaps with the p (paragraph) elements. This is by no means an unusual problem. Indeed, there is no reason to expect it to be possible to categorise elements of textual structure and elements of linguistic structure in the same hierarchical structure. This is a problem which occurs with many types of annotation at the discourse level, where elements overlap with sentences, paragraphs, speaker turns and other elements.

There are possible technical solutions. One is to break the linguistic elements where a new textual boundary begins, and to use identifier values to connect the two or more discontinuous sections of the broken passage. This, however, is counter-intuitive for the analyst and does not reflect the reality of the linguistic units.

Another proposed alternative is to use stand-off annotation, whereby the linguistic annotation is kept in a separate tree structure and there are pointers to where the elements start and end. This is an elegant solution from the technical point of view, and there is the important advantage that the integrity of the text is not compromised by the linguistic markup. However, from a practical point of view, it causes many problems. Firstly, it becomes a precondition of the encoding that there is a sufficiently fine-grained level of segmentation of the text, so that the relevant positions for the start and end of the linguistic elements can be pointed to. This may therefore require word (or morpheme or phoneme) segmentation, which presupposes a processing and interpretation of the data for which there may be no linguistic or theoretical motivation. It also involves a lot more work.
The annotated text is not readable by humans without a complex transformation (requiring more work and more software). Segments of the text cannot easily be analysed, annotated, viewed or extracted in isolation, as the annotated text is the product of knitting together at least two files. Editing of the text, for example to correct typographic or OCR errors, becomes very complex as the linkages with the annotation need to be preserved. Finally, while XML does in theory offer the possibility of stand-off annotation, the present author is not aware of an application to date which has been successfully reused, nor is he aware of readily available tools to display the annotated data.

It is of course to be hoped that current initiatives to promote the use of XML stand-off annotation are successful and result in the production of tools to enable the acceptance of the technology in the community. The criticism here is intended merely to note that the technology has not yet reached such a situation and that it would therefore not be appropriate to insist on its use at the moment.

Incidentally, the approach adopted in the case of the above example was to encode the linguistic elements in the text (as can be seen in the example), so that the result was not a well-formed SGML document. Since the corpus was relatively small, and the aim of the project was principally to test and refine the categorization scheme, the readability of the corpus was considered more important than any procedures that would require parsing the markup in the corpus. The researchers in question realize that this may be a problem for future reusability of the corpus, but the extra resources of time, money and expertise that would have been involved in developing a more elegant technical solution were not available.
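
The two workarounds discussed in this section can be sketched briefly. This is a minimal illustration under stated assumptions, not the encoding of any existing corpus: the attribute names cat, id and next, the helper names rejoin and extract, and the use of character offsets as pointers are all hypothetical choices made for the example. The first part breaks the direct speech element at the paragraph boundary and links the fragments by identifier; the second keeps the text free of markup and stores the annotation separately.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Workaround 1: break the element at each paragraph boundary and link the
# fragments. The attribute names "id" and "next" are hypothetical.
FRAGMENTED = """<text>
<p><spcat cat="DS" id="ds1a" next="ds1b">'This is like an episode from Inspector Morse.</spcat></p>
<p><spcat cat="DS" id="ds1b">"The victim was single but we believe he had several lady friends.</spcat></p>
</text>"""


def rejoin(xml_string: str) -> list[str]:
    """Follow the "next" links to rebuild each broken passage as one string."""
    root = ET.fromstring(xml_string)
    by_id = {el.get("id"): el for el in root.iter("spcat")}
    continuations = {el.get("next") for el in by_id.values() if el.get("next")}
    passages = []
    for el in by_id.values():
        if el.get("id") in continuations:
            continue  # not the first fragment of a passage
        parts = []
        while el is not None:
            parts.append(el.text or "")
            el = by_id.get(el.get("next"))
        passages.append(" ".join(parts))
    return passages


# Workaround 2: stand-off annotation. The text carries no markup; each
# annotation points into it with character offsets (an assumed pointer type).
PARAGRAPHS = [
    "One officer said: 'This is like an episode from Inspector Morse.",
    '"The victim was single but we believe he had several lady friends.',
]
TEXT = "\n".join(PARAGRAPHS)


@dataclass(frozen=True)
class Span:
    cat: str    # speech category, e.g. NRS or DS
    start: int  # offset of the first character of the passage
    end: int    # offset just past the last character


DS_START = TEXT.index("'This")
ANNOTATIONS = [Span("NRS", 0, DS_START), Span("DS", DS_START, len(TEXT))]


def extract(text: str, span: Span) -> str:
    """Return the stretch of text that a stand-off annotation points to."""
    return text[span.start:span.end]


if __name__ == "__main__":
    print(rejoin(FRAGMENTED))
    print(extract(TEXT, ANNOTATIONS[1]))
    # Correcting the text without updating the offsets breaks the linkage.
    corrected = TEXT.replace("One officer", "One police officer")
    print(extract(corrected, ANNOTATIONS[1]))
```

In the stand-off version the DS span crosses the paragraph break without difficulty, but the last call shows the maintenance problem noted above: once the text has been edited, the stored offsets no longer point at the start of the direct speech.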
It should be noted that the Bank of English has for more than a decade successfully used a (non-XML) type of stand-off, whereby the annotation is stored in a separate file from the text, but it seems to be a software model which is suitable principally for that particular and unique database and analysis tool (lookup). The Bank of English is a very large, industrial-scale resource using technology which is appropriate to a project of that size, and this is probably why that particular model has never been copied in more than a decade.

The motivations behind standardization of text markup are reusability and preservation, and these are of course noble aims. However, as Sinclair (forthcoming) points out, one of the biggest traps in corpus building is to try to second-guess the future, so our decisions should be pragmatic and based on current best practice. Perhaps the most important guidelines are the rather banal and obvious ones of good, clear metadata and good documentation of the procedures.

3 Future Plans

In conclusion, a new guide to planning and executing a project to develop a linguistic corpus is needed. In order to fill the gap, at least in the British context, a Guide to Best Practice is proposed, with chapters written by key, respected, experienced practitioners and writers in a set of core areas, such as design and representativeness, metadata, spoken data, working with multilingual texts, annotation, archiving and distribution. This is only one initiative which aims to address the problem of access to resources and expertise in corpus linguistics for newcomers and non-expert practitioners in the field.

In the forthcoming period, the OTA will make a concerted attempt to develop a new collections policy for language resources. Through consultation with the community, new directions will be explored in terms of the archiving and delivery of resources useful to academics in the subject area of linguistics.
A fresh and forward-looking approach is proposed whereby existing and traditional notions of resources and online services are reappraised and new directions are sought. It is not taken for granted, for example, that the delivery of entire text corpora to individual users is the most useful model. Methods of online extraction of linguistic information from resources held in the archive will be tested. The viability of providing access to tools and language processing services will also be investigated.

There is, however, a constant drive to acquire new resources such as corpora for the OTA, and to make the existing resources more easily available and accessible, through the development of new, improved cataloguing procedures and a rights management strategy. The functionality of the website and the catalogue search mechanisms are also being upgraded.

The present period is seeing the convergence of techniques, technologies and standards in several related fields which have in common the goal of delivering linguistic content through electronic means. These include relatively established technologies such as ebooks, internet search engines, and electronic delivery of language reference and translation services, as well as emerging technologies such as the mobile internet, virtual libraries and the semantic web. The distribution of corpora, or of the linguistic knowledge embedded in corpora, is clearly related to these large-scale industrial developments. The OTA is fortunate that colleagues in the Oxford University Computing Services and in other departments of the University such as the Bodleian Library are involved in developing practices in many of these areas.

References

Guy Aston and Lou Burnard. 1998. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, Edinburgh.
Sue Atkins, Jeremy Clear and Nicholas Ostler. 1992. Corpus Design Criteria. Literary and Linguistic Computing 7(1): 1-16.
Douglas Biber, Susan Conrad and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, Cambridge, UK.
Geoff Barnbrook. 1996. Language and Computers. Edinburgh University Press, Edinburgh.
Francis Condron, Michael Fraser and Stuart Sutherland. 2000. Guide to Digital Resources in the Humanities. CTI Textual Studies, Oxford.
Bill Dodd. 2000. Working with German Corpora. The University of Birmingham Press, Birmingham, UK.
Nelson Francis and Henry Kučera. 1964. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Department of Linguistics, Brown University, Providence, Rhode Island.
Roger Garside, Geoffrey Leech and Anthony McEnery (eds.). 1997. Corpus Annotation. Longman, London.
Mohsen Ghadessy, Alex Henry and Robert L. Roseberry (eds.). 2001. Small Corpus Studies and ELT: Theory and Practice. John Benjamins, Amsterdam.
Stig Johansson and Geoffrey Leech. 1986. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department of English, University of Oslo.
Graeme Kennedy. 1998. An Introduction to Corpus Linguistics. Longman, London.
Dorothy Kenny. 2001. Lexis and Creativity in Translation: A Corpus-based Study. St. Jerome, Manchester.
Anthony McEnery and Andrew Wilson. 2001. Corpus Linguistics (2nd edition). Edinburgh University Press, Edinburgh.
John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
John Sinclair (ed.). 1987. Looking Up. HarperCollins, London.
C. M. Sperberg-McQueen and Lou Burnard (eds.). 1999. Guidelines for Electronic Text Encoding and Interchange. TEI P3 Text Encoding Initiative. Revised reprint: Oxford, May 1999 (http://www.hcu.ox.ac.uk/TEI/Guidelines/index.htm).
Michael Stubbs. 1996. Text and Corpus Analysis. Blackwell, Oxford.
John Sinclair. Forthcoming. Corpora and Lexicography. In Piet van Sterkenburg (ed.), Coursebook on Lexicography. Onderwerp.
Martin Wynne, Mick Short and Elena Semino. 1998. A corpus-based investigation of speech, thought and writing presentation in English narrative texts. In Antoinette Renouf (ed.), Explorations in Corpus Linguistics. Rodopi, Amsterdam.
Martin Wynne. 2001. An Archive for all of Europe. In Workshop Proceedings: Sharing Tools and Resources, from the 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter (copies from http://www.mkp.com).
Antonio Zampolli and Nicholas Ostler (eds.). 1993. Special Section on Corpora. Literary and Linguistic Computing 8(4).

At the time of writing, the authors of the proposed guide have not yet been commissioned and the project is subject to the approval of the series Editorial Board.

The views expressed in this paper are those of the author, not those of the Arts and Humanities Data Service.