This is intended to be a relatively simple intro to tgrep.  See
README.long for a fuller description.


WHY TGREP?

Treebank bracketing is represented by multiple levels of parentheses
that look a lot like Lisp:

Treebank II:
    (TOP (S (NP-SBJ my best friend)
            (VP gave
                (NP me)
                (NP chocolate)
                (NP-TMP yesterday))
            .))

Treebank I (approximate):
    (TOP (S (NP my best friend)
            (VP gave
                (NP me)
                (NP chocolate))
            (NP yesterday))
         .)

You can't effectively search this kind of thing using plain old grep,
because grep can only look at single lines.  So we have tgrep, which
can look for geometric relationships in the "trees" represented above.


BASICS

See the file README for set-up instructions.

Here's a simple tgrep command:

    tgrep TOP

This would find all sentences in the TGREP_CORPUS defined in your
.cshrc.

Here's a more complicated example:

    tgrep -f -w 'S < /^NP-TPC/' /mnt/unagi/nldb/tgrep/wsj_skel.crp

which decomposes like this:

    -f                    print source Filenames
    -w                    print Whole sentence (not just the matching
                          part)
    '...'                 search pattern, enclosed in quotes to keep
                          csh happy
    /...wsj_skel.crp      tgrepable corpus to search, instead of the
                          TGREP_CORPUS defined in your .cshrc

Another possibly useful switch is -a, which prints out all matches
found in a given sentence instead of just the first one.  Not needed
with -w, though.

You can't really search files directly with tgrep.  Instead you must
use pre-compiled "tgrepable" corpus files.  There are several of these
available in /nldb/tgrep; look for files with a .crp or .corpus
extension and/or see the INVENTORY file.


SEARCH PATTERNS

The syntax of search expressions is detailed in README.long.  Here are
a few examples.  All of them happen to match the above example
sentences.

    'NP'                  anything called just NP.  In the Treebank II
                          example above, this only matches the "me"
                          and "chocolate" NPs, since the -SBJ and -TMP
                          make the others different in tgrep's eyes.
    'TOP'                 anything called TOP.  Since every sentence
                          is enclosed in TOP brackets, this matches
                          everything.
    'VP < NP'             VP immediately dominates an NP (VP has a
                          child called "NP")
    'VP <3 NP'            VP has NP as its 3rd child
    'NP > VP'             an NP that has a VP parent
    'VP << chocolate'     VP contains the word "chocolate" (i.e. "VP"
                          dominates "chocolate", directly or
                          indirectly)
    'S \!<< balderdash'   S does not contain the word "balderdash"
                          [the backslash (\) before the exclamation
                          point (!) above keeps csh/tcsh from doing
                          history substitution ... hmm, just take my
                          word for it and put it in.]
    'NP $.. NP'           two NPs have a common parent (first NP
                          precedes a sister also called NP)
    '/^NP/'               // encloses a grep-style regular expression,
                          where "^" and "$" mean beginning and end of
                          _word_ instead of line.  Regexps can be
                          pretty hairy, but this one is worth knowing
                          -- it means "anything starting with NP",
                          such as:
                              NP, NP-SBJ, NP-TMP, NP-ADV-TPC-1, ...

If you use several operators in a row, the results are a bit
unintuitive at first.  For instance,

    'S < VP < NP'

does NOT mean "S has a VP child that has an NP child" but rather "S has
a VP child _and_ an NP child".  To get the first meaning, you need to
use parentheses:

    'S < (VP < NP)'

You can put all this together in fairly complex ways.  For instance:

    'S <1 /^NP/ < (VP < (NP $.. NP))'

means: Get all Ss that start with an NP (not necessarily the subject)
and that dominate a VP that in turn has two NP children -- in other
words, sentences with what might be double-object VPs.  (In Treebank
II, they probably will really be double objects.  In Treebank I, they
could also be things like "I saw yesterday my best friend from
college.")


INFO ON TREEBANK

For extensive information on Treebank II bracketing, see

    ftp.cis.upenn.edu:/pub/treebank/doc/manual

or try

    /mnt/unagi/nldb/manual/current/manual.glommed

for a VERY long mishmash of (1) latex source of the manual and (2)
minutes of Switchboard bracketing decisions.  Searchable, though.

For information on Treebank I (old) bracketing, see

    ftp.cis.upenn.edu:/pub/treebank/doc/old-bktguide.tex.

You can also try sending me mail, but it might take me a while to
respond.

----------------------------------------------------------------
robertm@unagi.cis
3/1/96
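

ADDENDUM: CHAINED OPERATORS, SKETCHED IN PYTHON

The difference between 'S < VP < NP' and 'S < (VP < NP)' described
under SEARCH PATTERNS is easy to get wrong, so here is a rough Python
sketch that spells out the two readings against the two example
sentences.  It is an illustration only and has nothing to do with how
tgrep itself is implemented: the parser and every function name below
(parse, has_child, s_with_vp_and_np, s_with_vp_over_np) are
hypothetical helpers made up for this sketch, and labels are compared
as exact strings, following the note above that NP-SBJ is not "NP" in
tgrep's eyes.

```python
# Illustration only: a tiny bracket parser plus two hand-written checks.
# This is NOT tgrep's implementation; all names here are hypothetical.

def parse(text):
    """Parse one bracketed tree into (label, children); words stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def label_of(child):
    """Label of a subtree, or the word itself for a leaf."""
    return child[0] if isinstance(child, tuple) else child


def subtrees(tree):
    """Yield the tree and every labelled subtree beneath it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def has_child(tree, label):
    """The '<' relation: tree immediately dominates something called label."""
    return any(label_of(c) == label for c in tree[1])


def s_with_vp_and_np(tree):
    """Reading of 'S < VP < NP': an S with a VP child AND an NP child."""
    return [t for t in subtrees(tree)
            if t[0] == "S" and has_child(t, "VP") and has_child(t, "NP")]


def s_with_vp_over_np(tree):
    """Reading of 'S < (VP < NP)': an S with a VP child that has an NP child."""
    return [t for t in subtrees(tree)
            if t[0] == "S"
            and any(isinstance(c, tuple) and c[0] == "VP" and has_child(c, "NP")
                    for c in t[1])]


treebank2 = parse("(TOP (S (NP-SBJ my best friend) (VP gave (NP me) "
                  "(NP chocolate) (NP-TMP yesterday)) .))")
treebank1 = parse("(TOP (S (NP my best friend) (VP gave (NP me) "
                  "(NP chocolate)) (NP yesterday)) .)")

for name, tree in [("Treebank II", treebank2), ("Treebank I", treebank1)]:
    print(name,
          "| S < VP < NP:", len(s_with_vp_and_np(tree)),
          "| S < (VP < NP):", len(s_with_vp_over_np(tree)))
```

Under these assumptions the parenthesized reading finds the S in both
example sentences, while the unparenthesized reading finds it only in
the Treebank I version, where the S has children called plain NP.  In
the Treebank II version the S's only NP-like child is NP-SBJ, so the
"VP child and NP child" reading comes up empty.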