Representing multilingual text in memory and in a relational database
John Thomson
Modern databases can record text in Unicode, but knowing a sequence of Unicode code points is not enough for proper processing of text. It is also necessary to record the language, not only of each string, but often of runs within strings. This allows the same code points to be treated in different ways when they are used to represent different languages. Spell checking is the most obvious example of a task that requires knowledge of more than the sequence of code points, but collating, keyboarding, and sometimes line breaking and rendering can also depend on it.
For word processing and similar purposes, it is necessary for runs to record style information and perhaps explicit formatting information. For linguistic and similar studies it is necessary to associate annotations of various kinds with runs of text. It is sometimes necessary to record multiple translations and other variations of the same information. Not only is it necessary to be able to record all this information, but it is frequently necessary to be able to edit it. The annotation process may well begin before the final form of the text is decided. (Annotations containing notes from the author to himself or to reviewers are one example. A back translation may be created while the original work is in draft form.)
In such a situation, a major challenge is identifying the right-sized chunk of text to represent as a text field in the database and as an object in a programming language. Should it be a run, a paragraph, a section, or something larger? There are many tradeoffs in space, performance, and naturalness of programming. We have settled on an approach in which a string is never larger than a paragraph, but may include a sequence of runs with differing properties. Multiple variations of the same information are recorded as separate fields and objects.
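As an illustration of a paragraph-sized string carrying a sequence of runs, here is a minimal sketch in Python. It is not the representation described in this abstract: the names Run and RunString, the fields, and the language tags are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Run:
    """Hypothetical run record: properties that apply from `start` onward."""
    start: int        # offset of the first character covered by this run
    language: str     # assumed language tag, e.g. "en"
    style: str = ""   # assumed named style; empty means no style


@dataclass(frozen=True)
class RunString:
    """Immutable text (at most one paragraph) plus its ordered runs."""
    text: str
    runs: Tuple[Run, ...]

    def __post_init__(self) -> None:
        starts = [r.start for r in self.runs]
        # Runs must begin at offset 0 and have strictly increasing starts.
        if not starts or starts[0] != 0 or starts != sorted(set(starts)):
            raise ValueError("runs must start at 0 and be strictly increasing")
        if starts[-1] > len(self.text):
            raise ValueError("run starts beyond the end of the text")

    def run_at(self, offset: int) -> Run:
        """Return the run that covers the character at `offset`."""
        starts = [r.start for r in self.runs]
        return self.runs[bisect_right(starts, offset) - 1]

    def segments(self) -> Iterator[Tuple[str, Run]]:
        """Yield (substring, run) pairs in text order."""
        for i, run in enumerate(self.runs):
            if i + 1 < len(self.runs):
                end = self.runs[i + 1].start
            else:
                end = len(self.text)
            yield self.text[run.start:end], run
```

A caller could build RunString("Hello mundo", (Run(0, "en"), Run(6, "es"))) and ask run_at(7) to learn which language governs spell checking or collation at that position. Storing only the start of each run keeps the boundaries easy to shift when text is edited.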
Another important tradeoff is how the finer structure is represented in the database. Representing run information using tables is elegant and may facilitate certain kinds of queries, but it is costly in both space and time. We have found that a binary representation of the run structure works well. Where it is really desirable to perform run-level queries using SQL (for example, to find all text that is in a particular style, or marked with a particular annotation), it is possible to write SQL that can decode the run information, though it is not particularly efficient. For most queries, we can write SQL that will find a superset of the desired strings, load them into memory, and perform more detailed checks there. When we require a tighter or more efficient linkup between runs and other objects in the database--for example, to tag runs of text as relevant to a particular topic--we can add fields or relationships at the string, or paragraph, level. For example, we can make a joiner table to indicate which paragraphs contain material relevant to a topic, and (in the binary run information) indicate exactly which part of the paragraph is relevant. Using the joiner table we can very efficiently find paragraphs containing relevant material, load them into memory, and display them with the specifically relevant material highlighted. It is also possible, though we have not yet made use of this idea, to represent hierarchical structure within the text in this way. One approach is to mark each run with its most specific property, and also indicate a depth. This works well for something like nested strings in different languages, where we mainly want to know the actual language of the text, but knowing how things are nested is sometimes useful for things like bidirectional layout. We can also associate an ordered sequence of properties with each run, so we could indicate something like "sentence 1, noun phrase, noun" for a particular word. 
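The combination of a binary run field and a joiner table can be sketched as follows. This is only an illustration under stated assumptions: the byte layout, the table names paragraph and paragraph_topic, and every function name are hypothetical, and SQLite stands in for whatever relational database is actually used.

```python
import sqlite3
import struct

# Assumed layout: a little-endian uint16 run count, then for each run a
# uint32 start offset and a uint32 topic id (0 means "no topic").


def encode_runs(runs):
    """Pack a list of (start, topic_id) pairs into a binary blob."""
    blob = struct.pack("<H", len(runs))
    for start, topic in runs:
        blob += struct.pack("<II", start, topic)
    return blob


def decode_runs(blob):
    """Unpack a blob produced by encode_runs into (start, topic_id) tuples."""
    (count,) = struct.unpack_from("<H", blob, 0)
    return [struct.unpack_from("<II", blob, 2 + 8 * i) for i in range(count)]


def relevant_spans(text, runs, topic):
    """Return the substrings of text whose run is tagged with topic."""
    spans = []
    for i, (start, run_topic) in enumerate(runs):
        end = runs[i + 1][0] if i + 1 < len(runs) else len(text)
        if run_topic == topic:
            spans.append(text[start:end])
    return spans


def find_relevant(conn, topic):
    """Use the joiner table to find paragraphs, then refine in memory."""
    rows = conn.execute(
        "SELECT p.text, p.runs FROM paragraph p "
        "JOIN paragraph_topic j ON j.paragraph_id = p.id "
        "WHERE j.topic_id = ? ORDER BY p.id",
        (topic,),
    ).fetchall()
    return [(text, relevant_spans(text, decode_runs(blob), topic))
            for text, blob in rows]


def build_demo():
    """Create an in-memory database holding one tagged paragraph."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paragraph (id INTEGER PRIMARY KEY, text TEXT, runs BLOB)")
    conn.execute(
        "CREATE TABLE paragraph_topic (paragraph_id INTEGER, topic_id INTEGER)")
    first = "Rice is planted in spring."
    text = first + " Taxes are due in April."
    runs = [(0, 7), (len(first), 0)]  # topic 7 covers the first sentence only
    conn.execute("INSERT INTO paragraph VALUES (1, ?, ?)",
                 (text, encode_runs(runs)))
    conn.execute("INSERT INTO paragraph_topic VALUES (1, 7)")
    return conn
```

The SQL query touches only the joiner table and the paragraph rows it points to, so the database never has to decode the blob. The exact extent of the relevant material is recovered in memory, which is where highlighting would be applied.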
Once we have such a structure in the database, we need a convenient representation of it in memory, one that can readily be shared by many components. We have developed an implementation of a string as a COM object which can provide access to all the run information. We have both an immutable version of this string, which is convenient for various components to pass around and hold onto without needing to make their own copies, and a string builder object, which can be modified. Next, we need ways of manipulating the text that make use of the information about the language of the runs. We can handle rendering, collating, and keyboarding in language-dependent ways, and expect to add language-dependent line breaking, word tokenizing, spell checking, and other facilities. Finally, we need to be able to edit the text while preserving the run information. This is relatively straightforward except for the need to build our own text editing components. When editing within a run, we just need to adjust all the subsequent run boundaries so that all the run properties, including annotations, stay correctly lined up. When editing at a run boundary, it is a little harder to be sure what the user intends, but we have developed a number of strategies, based on how the selection came to be where it is, for guessing the user's intention. For example, if the user has just used an arrow key to move over a character, new typing takes on the properties of that character rather than the adjacent one.
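The boundary adjustment for insertions can be sketched as below. This is a simplified illustration, not the editing component described here: runs are reduced to (start, property) pairs, and the prefer_following flag is a hypothetical stand-in for the strategies that guess the user's intention from how the selection was reached.

```python
from typing import List, Tuple

Run = Tuple[int, str]  # (start offset, property label); labels are illustrative


def insert_text(text: str, runs: List[Run], offset: int, new_text: str,
                prefer_following: bool = False) -> Tuple[str, List[Run]]:
    """Insert new_text at offset and keep run boundaries lined up.

    Runs that start after the insertion point shift right by the inserted
    length. When offset falls exactly on a run boundary, prefer_following
    decides whether the new text joins the run that begins there (True)
    or the run that ends there (False).
    """
    shift = len(new_text)
    new_runs = []
    for start, prop in runs:
        at_boundary = start == offset and start != 0
        if start > offset or (at_boundary and not prefer_following):
            # This run begins after the new text, so move its start.
            new_runs.append((start + shift, prop))
        else:
            # This run begins at or before the new text and absorbs it.
            new_runs.append((start, prop))
    return text[:offset] + new_text + text[offset:], new_runs
```

Under this sketch, an editor that had just seen the user arrow leftward over the first character of the following run might pass prefer_following=True, so that new typing takes on the properties of the character just crossed. Deletion would need a matching routine that shifts boundaries left and drops runs that become empty.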