TGRREPDOC(1)                  USER COMMANDS                  TGRREPDOC(1)

NAME
tgrepdoc - documentation for tgrep

SYNOPSIS
This man page describes how to use tgrep, how to construct tgrep patterns, how to control printing with tgrep, how to optimize your searches, how to specify regular expressions in tgrep, and how to prepare your own encoded corpus for use with tgrep.

DESCRIPTION

INTRODUCTION:
tgrep is grep for trees. With tgrep you specify a pattern using node names and relationships between nodes. The pattern is then matched against a corpus of tree structures (usually natural language sentences) and those trees (sentences) which match your pattern are selected for printing. tgrep supports many ways to print out what you want from the trees your pattern matches. tgrep is designed to be very fast at the expense of having to pre-encode the text it searches.

There are two ways to use tgrep: directly from the command line, and via a script language called T. T is a powerful language which allows you to manipulate the data in a corpus. This file describes how to use tgrep from the command line. To learn about T please see the file README.T in the doc/ subdirectory of the tgrep release or ask your system manager. Unfortunately, no man page exists for T yet.

USING TGREP:
To use tgrep you need to specify a pattern and an encoded corpus. An encoded corpus is created using the tprep(1) utility. tgrep can also read old encoded corpus files created with the outdated prepare_corpus(1) utility. These old corpora consist of a vocabulary file (which lists all the "words" used in the corpus in alphabetical order) and a corpus file which encodes the structure of the corpus. The vocabulary file (indicated by the suffix ".vocab") is deliberately human readable while the corpus file (indicated by the suffix ".n") is not. The newer corpus format consists of a single file with a ".crp" or ".corpus" suffix.

Whenever you use tgrep you must specify an encoded corpus: that is, you must specify a .crp file or both a .vocab and a .n file. The .crp or .n corpus files may be specified last on the command line or by setting the environment variable TGREP_CORPUS. The .vocab file is specified with the "-v vocabulary-file" option. For example:

     tgrep 'NP < PP' foo.corpus

will use foo.corpus as the encoded corpus file (new format).

     tgrep -v foo.vocab 'NP < PP' foo.n

will use foo.vocab as the vocabulary file and foo.n as the encoded corpus file with the old format corpus files.

*** Free advice: stay away from "old format" corpora, unless you're happy to use incomplete data.

If you frequently work with only one corpus you can avoid having to specify the corpus files on the command line each time you execute tgrep by setting the environment variables TGREP_VOCAB and TGREP_CORPUS to the vocabulary and encoded corpus files respectively (if you are using the old format) or the environment variable TGREP_CORPUS to the .corpus file. For example, if you wanted to use the Wall Street Journal corpus shipped with tgrep you would put these lines in your .cshrc or .tcshrc file (depending on whether you use the c-shell or the tc-shell):

     setenv TGREP_CORPUS "/mnt/unagi/nldb/tgrep/WSJ.n"
     setenv TGREP_VOCAB "/mnt/unagi/nldb/tgrep/WSJ.vocab"

Also, you should add /mnt/unagi/nldb/tgrep/ to your PATH environment variable (don't forget to "rehash" after you change your PATH environment variable).

Sun Release 4.1           Last change: 11 November 1992

PROBLEMS WITH TGREP:
If there are any problems with tgrep please send mail to tgrep-support@linc.cis.upenn.edu and they will be fixed ASAP. Even if you discover a clever workaround, please let us know about the problem (and of course the workaround!).

SPECIFYING PATTERNS FOR TGREP:
tgrep is designed to find patterns in tree structured (bracketed) text.
Therefore, the patterns you specify for tgrep consist of various ways to match nodes in a tree and various ways to relate those nodes to each other.

Specifying Nodes in a Search Pattern:
You specify a node in tgrep by using an explicit name such as "NP", "S", "of", "doctor", etc., by a regular expression such as "/.*ing/", or by a wildcard which matches any node name and is specified by two underscores, i.e. "__".

Regular expressions are indicated by surrounding the node name in slashes (/). Therefore, to search for words containing "ing" one would specify "/.*ing/"; to require that the word end in "ing", anchor the expression as "/ing$/" (anchors are described below). Regular expressions can be specified as in the UNIX line editor ed(1). A regular expression matches a word if any part of the word is matched by the regular expression. For example, the regular expression "/[Cc]hild/" not only matches "child" and "Child" but also "children" and "Lafite-Rothschild".

To match only what you specify, the pattern must be anchored. The caret (^) anchors the regular expression at the beginning of a word. So, for example, "/^[Cc]hild/" will not match "Lafite-Rothschild" but will match "child-care" and "children", etc. To constrain the regular expression even further, specify the dollar-sign ($) as the last character in the regular expression. This has the effect of anchoring the regular expression at the end of a word. For example "/[Cc]hild$/" will not match "children" but will match "Lafite-Rothschild" and "brainchild". Finally, to truly constrain the regular expression, use both the caret and the dollar-sign as in "/^[Cc]hild$/" which will match only "child" and "Child". There is a complete reference at the end of this file on how to construct regular expressions.

The wildcard node name (i.e. "__") matches anything and is particularly useful.
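
The anchoring rules described above can be checked outside tgrep. The following is a minimal sketch in Python, not part of tgrep: it assumes that Python's re.search() behaves like tgrep's "matches if any part of the word is matched" rule for these simple patterns, and the helper name matching_words and the word list are made up for illustration (the words themselves come from the examples above).

```python
import re

# Words taken from the examples in the text; the list itself is illustrative.
WORDS = ["child", "Child", "children", "child-care",
         "brainchild", "Lafite-Rothschild"]


def matching_words(pattern, words=WORDS):
    """Return the words in which `pattern` matches anywhere.

    re.search() succeeds if any part of the word matches, which mirrors
    the unanchored matching rule described for tgrep node names.
    """
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]


if __name__ == "__main__":
    # Unanchored, anchored at the start, at the end, and at both ends.
    for pattern in ("[Cc]hild", "^[Cc]hild", "[Cc]hild$", "^[Cc]hild$"):
        print(f"/{pattern}/ ->", matching_words(pattern))
```

Running the sketch shows the candidate set shrinking as anchors are added: the unanchored pattern accepts every word in the list, while the doubly anchored pattern accepts only "child" and "Child".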
If you are concerned about speed try to avoid the wildcard as much as possible (see the section below on Optimizing Your Searches).

tgrep also supports a way to match one or more different node names to a particular node by use of the vertical bar (i.e. '|'). For example,

     /^[Cc]hild.*$/|kid|youngster

will match several popular ways of identifying young persons including: child, Child, children, Children, kid, and youngster. (Note that this pattern will also match "child-care"). There is no limit to the number of words or regular expressions which can be grouped by '|' into a single node name.

A Fun example: This will print out all terminals in the corpus which consist of two identical sequences (of 2 or more characters) in a row.

     tgrep -a '`/^.*)1$/ #< __ '

Please see the section below on specifying regular expressions.

Specifying Relationships Between Nodes:
There are several basic kinds of relationships between nodes. They are:

     A < B       A immediately dominates B
     A << B      A dominates B
     A >> B      A is dominated by B
     A > B       A is immediately dominated by B
     A <<, B     B is a leftmost descendant of A
     A <<` B     B is a rightmost descendant of A
     A . B       A immediately precedes B
     A .. B      A precedes B
     A $ B       A and B are sisters (note that A $ A is FALSE)
     A $. B      A and B are sisters and A immediately precedes B
     A $.. B     A and B are sisters and A precedes B

Their negations are also available and they are:

     A !< B      A does not immediately dominate B
     A !<< B     A does not dominate B
     A !>> B     A is not dominated by B
     A !> B      A is not immediately dominated by B
     A !<<, B    B is not a leftmost descendant of A
     A !<<` B    B is not a rightmost descendant of A
     A !. B      A does not immediately precede B
     A !.. B     A does not precede B
     A !$ B      A is not a sister of B (note that A !$ A is TRUE)
     A !$. B     it is not true that A $. B
     A !$.. B    it is not true that A $.. B

NOTE that the symbols < > $ have alternatives.

     symbol    i.e.     alternate    i.e.
     ---------------------------------------------
     <         <<`      {            {{`
     <         !<       ^            !^
     >         >>       }            }}
     $         $.       %            %.
     !         !<       @            @<

NOTE that there are no explicit <, and <` relationships to complement the <<, and <<` relationships. However, note the following equivalences:

     this          is equivalent to this available operator
     ------------------------------------------------
     A <, B        A <1 B
     A <` B        A <- B

One day <, and <` will be directly supported.

Constructing Patterns:
tgrep patterns are composed of a node followed by the relationships which that node participates in. For example,

     1)   S < NP << S

will match an S node which immediately dominates an NP *and* which dominates some other S node. Note that the second relationship "<< S" refers to the first S and not to the NP. This syntax has been adopted to avoid using clumsy AND statements. You can use parentheses to group relationships so that,

     2)   S < (NP << S)

will match an S node which immediately dominates an NP node which in turn dominates some S node.

The key building blocks for tgrep patterns are "relationships". A relationship consists of a "master" node followed by a series of relationships to other nodes. In example (1) only the first S is a master node while in the second example both the first S and the NP nodes are master nodes. As you can see, the first node in a pattern (or the first node following a left parenthesis) is a master node which is related to the nodes to its right in the pattern by the relationship "operators". (The use of the term "operator" is rather clumsy since the operators indicate relationships and not operations on their arguments; however, the term relationship is reserved for a master-node/operator/node triple.)
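
The difference between examples (1) and (2) can be made concrete with a small tree model. The following Python sketch is an illustration only and says nothing about how tgrep is implemented: the Node class, the helper functions, and the two sample sentences are all hypothetical, and only the < (immediately dominates) and << (dominates) relationships are modelled.

```python
class Node:
    """A labelled tree node; words are leaf nodes with no children."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

    def descendants(self):
        """Yield every node strictly below this one."""
        for child in self.children:
            yield child
            yield from child.descendants()

    def subtree(self):
        """Yield this node followed by all of its descendants."""
        yield self
        yield from self.descendants()


def children_named(node, label):
    """Nodes B such that `node < B` (immediate dominance)."""
    return [c for c in node.children if c.label == label]


def descendants_named(node, label):
    """Nodes B such that `node << B` (dominance at any depth)."""
    return [d for d in node.descendants() if d.label == label]


def match_flat(root):
    """S < NP << S: both relationships hang off the first (master) S."""
    return [n for n in root.subtree()
            if n.label == "S"
            and children_named(n, "NP")
            and descendants_named(n, "S")]


def match_grouped(root):
    """S < (NP << S): the NP is itself a master node and must dominate an S."""
    return [n for n in root.subtree()
            if n.label == "S"
            and any(descendants_named(np, "S")
                    for np in children_named(n, "NP"))]


# (S (NP the dog) (VP said (S (NP it) (VP swam))))
embedded_in_vp = Node(
    "S",
    Node("NP", Node("the"), Node("dog")),
    Node("VP", Node("said"),
         Node("S", Node("NP", Node("it")), Node("VP", Node("swam")))))

# (S (NP (NP the dog) (S (NP that) (VP barked))) (VP swam))
embedded_in_np = Node(
    "S",
    Node("NP",
         Node("NP", Node("the"), Node("dog")),
         Node("S", Node("NP", Node("that")), Node("VP", Node("barked")))),
    Node("VP", Node("swam")))

if __name__ == "__main__":
    for name, tree in (("embedded_in_vp", embedded_in_vp),
                       ("embedded_in_np", embedded_in_np)):
        print(name,
              "pattern 1 matches:", len(match_flat(tree)),
              "pattern 2 matches:", len(match_grouped(tree)))
```

In the first sample tree the inner S sits under the VP, so only pattern (1) matches the top S. In the second the inner S sits under the subject NP, so both patterns match the top S.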

The master node in a relationship which is enclosed in parentheses represents that relationship to the rest of the pattern. Therefore the following are equivalent:

     S < NP << S
     (S < NP) << S

***Special note on Dominance Relationships***
The order in which dominance relationships are specified is NOT relevant. The following pairs of patterns are equivalent:

     NP < PP < covert          NP < covert < PP
     S << * < SINV             S < SINV << *
     S < (NP << *) << T        S << T < (NP << *)

***Special note on the Nth Child Relationships***
In order to specify the exact child position which a node must (or must not) occupy use the Nth child relationships (> !>> << !<< $.. !$.. $ !$ .. !..

The "not" relationships are more expensive than their positive counterparts.

REGULAR EXPRESSION:
tgrep supports a limited form of regular-expression notation. A regular expression (RE) specifies a set of character strings to match against, such as "any string containing digits 5 through 9" or "only lines containing uppercase letters." A member of this set of strings is said to be matched by the regular expression. Where multiple matches are present in a line, a regular expression matches the longest of the leftmost matching strings.

Regular expressions can be built up from the following "single-character" RE's:

     c    Any ordinary character not listed below. An ordinary character matches itself.

     \    Backslash. When followed by a special character, the RE matches the "quoted" character. A backslash followed by one of <, >, (, ), {, or }, represents an operator in a regular expression, as described below.

     .    Dot. Matches any single character except NEWLINE.

     ^    As the leftmost character, a caret (or circumflex) constrains the RE to match the leftmost portion of a line. A match of this type is called an "anchored match" because it is "anchored" to a specific place in the line.
          The ^ character loses its special meaning if it appears in any position other than the start of the RE.

     $    As the rightmost character, a dollar sign constrains the RE to match the rightmost portion of a line. The $ character loses its special meaning if it appears in any position other than at the end of the RE.

     ^RE$
          The construction ^RE$ constrains the RE to match the entire line.

     [c...]
          A nonempty string of characters, enclosed in square brackets matches any single character in the string. For example, [abcxyz] matches any single character from the set `abcxyz'. When the first character of the string is a caret (^), then the RE matches any character except NEWLINE and those in the remainder of the string. For example, `[^45678]' matches any character except `45678'. A caret in any other position is interpreted as an ordinary character.

     []c...]
          The right square bracket does not terminate the enclosed string if it is the first character (after an initial `^', if any), in the bracketed string. In this position it is treated as an ordinary character.

     [l-r]
          The minus sign, between two characters, indicates a range of consecutive ASCII characters to match. For example, the range `[0-9]' is equivalent to the string `[0123456789]'. Such a bracketed string of characters is known as a character class. The `-' is treated as an ordinary character if it occurs first (or first after an initial ^) or last in the string.

The following rules and special characters allow for constructing RE's from single-character RE's:

     A concatenation of RE's matches a concatenation of text strings, each of which is a match for a successive RE in the search pattern.

     *    A single-character RE, followed by an asterisk (*) matches zero or more occurrences of the single-character RE. Such a pattern is called a closure.
          For example, [a-z][a-z]* matches any string of one or more lower case letters.

     \{m\}
     \{m,\}
     \{m,n\}
          A one-character RE followed by \{m\}, \{m,\}, or \{m,n\} is an RE that matches a range of occurrences of the one-character RE. The values of m and n must be nonnegative integers less than 256; \{m\} matches exactly m occurrences; \{m,\} matches at least m occurrences; \{m,n\} matches any number of occurrences between m and n, inclusively. Whenever a choice exists, the RE matches as many occurrences as possible.

     \(...\)
          An RE enclosed between the character sequences \( and \) matches whatever the unadorned RE matches, but saves the string matched by the enclosed RE in a numbered substring register. There can be up to nine such substrings in an RE, and parenthesis operators can be nested.

     \n   Match the contents of the nth substring register from the current RE. This provides a mechanism for extracting matched substrings. For example, the expression ^\(..*\)\1$ matches a line consisting entirely of two adjacent non-null appearances of the same string. When nested parenthesized substrings are present, n is determined by counting occurrences of \( starting from the left.

PREPARING YOUR OWN CORPUS FOR USE
The user can prepare his own encoded corpus for use with tgrep by using tprep(1). Please refer to that man page.

EXAMPLES
Here are a couple of very simple example patterns:

To look for prepositional phrase (PP) attachment to a preceding verb phrase (VP) as opposed to a preceding noun phrase (NP):

     VP < NP < PP

to the preceding NP instead:

     VP < (NP < PP)

To search for sentence-initial NPs:

     S <1 NP

SEE ALSO
prepare-corpus(1) tgrep(1)

BUGS
It is possible that some of the nodes in the pattern do not get printed when they are not descendants of the node which matches the top-level master node. Consider that this pattern

     NP < (dog . VP)

when applied to this sentence

     (S (NP the dog) (VP swam))

will give this output

     (NP the dog)

Clearly the pattern matches but there will be no output corresponding to the VP in the pattern. It is helpful to keep this pitfall in mind when dealing with the precedence relationships.

If any other bugs are encountered please send electronic mail to tgrep-support@linc.cis.upenn.edu with one of the following as the subject line and a concise description as the body of the message:

     installation           for installation problems.
     bug                    for reporting a bug in tgrep.
     feature request        for requesting a new feature be added to tgrep.
     information request    for requesting other information about tgrep.
     help                   if you are really stuck and need help.
     other                  for other communications.

COPYRIGHT
Copyright 1993, 1994 Richard Pito.

AUTHOR
Richard Pito (pito@unagi.cis.upenn.edu) under grant from the Benjamin Franklin Institute.