
PARSING AS INFORMATION COMPRESSION BY MULTIPLE ALIGNMENT, UNIFICATION AND SEARCH: SP52

J Gerard Wolff

February 1998

School of Electronic Engineering and Computer Systems, University of Wales, Dean Street, Bangor, LL57 1UT, UK. Telephone: +44 1248 382691. E-mail: gerry@sees.bangor.ac.uk. Fax: +44 1248 361429. Web: http://www.sees.bangor.ac.uk/~gerry/.


TABLE OF CONTENTS

Abstract

1. INTRODUCTION

2 MULTIPLE ALIGNMENT PROBLEMS
3 PARSING AS MULTIPLE ALIGNMENT
4 EVALUATION OF AN ALIGNMENT IN TERMS OF COMPRESSION
5 THE SP52 MODEL
6 DECODING BY COMPRESSION: THE PRODUCTION OF LANGUAGE
7 FUTURE DEVELOPMENTS
8 CONCLUSION

A1 APPENDIX: ADJUSTING THE INFORMATION VALUES OF SYMBOLS IN THE LIGHT OF GAPS BETWEEN HITS
A1.1 Overall Approach
A1.2 Spans and Probabilities
A1.3 The Probability of One Or More Hits within a Given Span
A1.4 Probabilities, Information and Scaling Factors

Acknowledgements
References

Abstract

This article introduces the idea that parsing in the sense associated with computational linguistics and natural language processing may be understood as information compression by multiple alignment, unification and search (ICMAUS). In this context, 'multiple alignment' has a meaning which is similar to its meaning in bio-informatics, while 'unification' means a simple merging of matching patterns, a meaning which is related to but simpler than the meaning of that term in logic.

This concept of parsing is embodied in a software model, SP52. The organisation and operation of the model are described with a simple example of what the model can do.

An example is presented showing how the same theoretical framework and the same software model may support the production of sentences, not just analysis of sentences.

The accompanying article (Wolff, 1998) presents some other, more realistic examples showing how syntax may be represented in the proposed formalism and how sentences may be parsed in terms of that syntax.

These proposals are the latest phase in a programme of research developing the 'SP' conjecture that all kinds of computing and formal reasoning may usefully be understood as information compression by pattern matching, unification and search. In this context, it is anticipated that the ICMAUS framework will, in the future, facilitate the integration of syntax with semantics and the integration of parsing with such things as the unsupervised learning of linguistic and non-linguistic cognitive structures and the drawing of deductive and probabilistic inferences. Although these matters are outside the scope of these two articles, they provide an important motivation for developing the ICMAUS framework, with the expectation that, in the future, it may be extended with relatively little modification to become a fully integrated system for the understanding and production of language.

The proposed framework provides a method of representing the syntax of natural languages in a manner which is, arguably, simpler and more direct than with other methods. The examples presented in the accompanying article (Wolff, 1998) show how such things as discontinuous dependencies and cross-serial dependencies in syntax can be accommodated.

By contrast with some of the simpler and less sophisticated methods of parsing, the proposed framework accommodates ambiguities in syntax quite naturally and will normally deliver two or more alternative parsings where that is appropriate.

1. INTRODUCTION

This article introduces the idea that the notion of 'parsing' may be understood as a process of information compression by multiple alignment with 'unification' and 'search' (ICMAUS). In this context, 'multiple alignment' has a meaning which is similar to its meaning in bio-informatics, while 'unification' means a simple merging of matching patterns, a meaning which is related to but simpler than the meaning of that term in logic. The article also describes a software model, SP52, in which the concepts are embodied, with examples of what the model can do.

1.1 Novelty of Proposals

A concept of parsing is already well-established in the literature on data compression (see, for example, Storer (1988)). In that context, it means a process of analysing data into segments, each of which is replaced by a relatively short 'code' associated with the given segment in a 'dictionary' of segments.

But this kind of parsing is simpler than the kinds of parsing associated with theoretical linguistics, computational linguistics or the branch of AI concerned with natural language processing. In the first case, alternatives can be chosen only at one level whereas in 'linguistic' kinds of parsing, which are the focus of interest in this article, there are normally alternatives at arbitrarily many levels in the grammar which is used to guide the parsing.

Research on parsing and related topics within computational linguistics and AI does not normally consider these topics in terms of information compression (IC) (but see, for example, Berger et al. (1996) and Hu et al. (1997)). However, there is a well-developed tradition of parsing and linguistic analysis in terms of probabilities, with associated concepts such as "stochastic grammars", "maximum-likelihood", "Bayesian inference" and "statistical analysis" (see, for example, Abney (1997), Ahadi and Woodland (1997), Dreuth and Rüber (1997), Takahashi and Sagayama (1997), Wu (1997), Sproat et al. (1996), Lucke (1995)) and there is a close connection between probabilities and IC.1

The most novel feature of these proposals appears to be the idea that parsing may be understood as a multiple alignment problem (somewhat different from the concept of multiple alignment in bio-informatics) together with the related idea that syntactic structure may be expressed in a relatively simple and direct manner as 'patterns' without meta-symbols as will be explained below. These two ideas appear to be significantly different from any established concepts and may open the way to new ways of understanding the representation and analysis of language structure.

1.2 Background and Context

The proposals in this article have been developed within a programme of research developing the 'SP' conjecture that all kinds of computing and formal reasoning may usefully be understood as information compression by pattern matching, unification and search, and developing a 'new generation' computing system based on this thinking (Wolff, 1990-1996c).2

The ICMAUS proposals are one realisation of 'pattern matching, unification and search'. Besides parsing, this framework appears to have potential to accommodate several other aspects of computing, including unsupervised learning (Wolff, 1996c), deductive and probabilistic reasoning (Wolff, 1996a, 1996c), information retrieval (Wolff, 1994a) and pattern recognition (Wolff, 1995a). It can be argued (Wolff, 1996b) that the ICMAUS model has at least the generality of the Turing model and may thus be regarded as a general model of computing.3

1.2.1 Related Research.

As an attempt to integrate concepts across several areas of computing, the SP programme naturally has many connections with other research (some connections are described in Wolff (1991, 1993, 1995b)). Perhaps the closest links are with work on Algorithmic Complexity Theory (see, for example, Li and Vitányi (1993)) and Minimum Length Encoding (otherwise known as Minimum Description Length encoding or Minimum Message Length encoding - see, for example, Wallace and Boulton (1968), Rissanen (1978), Belloti and Gammerman (1990), Gammerman (1991)) which is itself closely related to Bayesian inference (see, for example, Cheeseman (1990), Pednault (1991)). However, there are important differences in objectives and orientation between these areas of work and the SP programme (described in Wolff (1995b), Section 3.6, and Wolff (1994b), Sections 7.1 and 7.2).

1.3 Evaluation of these Proposals

Not unreasonably, readers may ask what justification there may be for pursuing an alternative approach to parsing when there are several quite serviceable methods in existence already. The following subsections summarise some of the considerations which are relevant in answering this question.

1.3.1 Speed of Processing.

In terms of speed, the current model (SP52) - to be described below - is no rival for many current methods running on a conventional computer. This is because multiple alignment problems are generally 'expensive' in terms of processing power and are not normally solved quickly on conventional computers. However, it should be borne in mind that the ICMAUS proposals have been developed within a programme of research aiming to develop new forms of computer which are likely to differ quite radically from conventional computers. In particular, it is anticipated that very high levels of parallelism will be applied, resulting in dramatic increases in power, well in excess of what is required to achieve parsing by ICMAUS with adequate speed for most practical purposes.

1.3.2 Representation of Syntax.

An integral part of this approach to parsing is a method of representing the syntax of language with potential advantages over existing methods:
  • Arguably, the method is simpler and more 'direct' than with existing methods. This may be seen most clearly in the way in which discontinuous dependencies in syntax may be represented. This is illustrated and discussed in the accompanying article (Wolff, 1998).

  • For reasons which are largely outside the scope of these two articles, the proposed method of representing syntactic structure may facilitate the automatic learning of syntax within a learning framework which is a generalisation of the parsing framework (see below).

1.3.3 Ambiguity in Parsing.

The system to be described always yields a set of alternative alignments, graded in terms of the measure of IC. Thus, compared with some of the simpler and less sophisticated methods of parsing, the method will normally detect ambiguities which may exist in the material to be parsed and will provide alternative parsings showing the nature of the ambiguity. The system also adapts its parsings in an appropriate manner in response to disambiguating contexts. Examples are presented and discussed in the accompanying article.

1.3.4 Integration of Syntax with Semantics.

In line with the goals of the research programme of which this work is a part, the method of representing syntax has been chosen deliberately with an eye on wider issues in the representation of knowledge than the representation of syntax. Thus, for example, the kinds of hierarchies and heterarchies with inheritance of attributes which are recognised in the representation of non-linguistic knowledge may be represented in a manner which closely parallels the method of representing syntactic structure which is described in this article (Wolff, 1995a, 1995c). If, as anticipated, it proves possible to represent diverse kinds of knowledge in a uniform manner, this should facilitate the smooth integration of syntax with semantics.

1.3.5 Integration of Parsing with Other Functions.

If the scope of "ICMAUS" is as wide as is anticipated (indicated in Section 1.2), the SP52 model should generalise with relatively little modification to embrace such functions as deductive and probabilistic reasoning, (fuzzy) pattern recognition, (fuzzy) information retrieval, unsupervised (automatic) learning of new structures, and others. In short, the ideas described in this and the accompanying article may be seen as a first step towards the development of a system for understanding and production of language, fully integrated with other aspects of 'intelligence'.

1.4 Presentation

The next section (Section 2) introduces multiple alignment problems in general terms. Using a simple example, Section 3 describes how parsing of language may be seen as multiple alignment. Section 4 describes the search metric used in the SP52 model - how 'good' alignments may be distinguished from 'bad' ones in terms of IC. Section 5 describes the SP52 model, a computer simulation of the proposed SP system, running on a conventional computer. Section 6 describes how the production of language may be seen in terms of multiple alignment - between a 'coded' representation of a sentence and rules in a grammar. The article ends with a brief discussion of anticipated future developments and some concluding remarks.

The accompanying article demonstrates some of the capabilities of the SP52 model in parsing examples which are more complicated than the examples considered in this article. Topics covered in the second article include ambiguity in parsing, recursion, discontinuous dependencies (DDs) in syntax (including DDs which are embedded one within another and DDs which overlap each other), the syntax of English auxiliary verbs and the kind of 'cross-serial dependency' which appears in Swiss German and Dutch.

- -

1. As is discussed later, coding by the Shannon-Fano-Elias method (or the Huffman method) provides the key which connects probability with IC.

2. IC may be interpreted as a process of removing unnecessary (redundant) complexity in information - and thus maximising 'simplicity' - whilst preserving as much as possible of its non-redundant descriptive 'power'. Hence the name 'SP'.

3. At the time of writing, it appears that no other authors have proposed ICMAUS as a model of any aspect of computing.

2 MULTIPLE ALIGNMENT PROBLEMS

The term multiple alignment is normally associated with the computational analysis of (symbolic representations of) sequences of DNA bases or sequences of amino acid residues as part of the process of elucidating the structure, functions or evolution of the corresponding molecules. The aim of the computation is to find one or more alignments of matching symbols in two or more sequences which are, in some sense, 'good'. Possible meanings for that term are discussed in Section 4, below. An example of an alignment of DNA sequences is shown in Figure 1.

  G G A     G     C A G G G A G G A     T G     G   G G A
  | | |     |     | | | | | | | | |     | |     |   | | |
  G G | G   G C C C A G G G A G G A     | G G C G   G G A
  | | |     | | | | | | | | | | | |     | |     |   | | |
A | G A C T G C C C A G G G | G G | G C T G     G A | G A
  | | |           | | | | | | | | |   |   |     |   | | |
  G G A A         | A G G G A G G A   | A G     G   G G A
  | |   |         | | | | | | | |     |   |     |   | | |
  G G C A         C A G G G A G G     C   G     G   G G A

Figure 1

A 'good' alignment amongst five DNA sequences (adapted from Fig. 6 in Roytberg (1992), with permission from Oxford University Press).

In this area of research, it is widely recognised that the number of possible alignments of symbols is normally too large to be searched exhaustively and that, to achieve a search which has acceptable computational complexity, 'heuristic' techniques must normally be used. Heuristic techniques include 'hill climbing' (sometimes called 'descent'), 'beam search', 'genetic algorithms', 'simulated annealing', 'dynamic programming' and others. With these techniques, searching is done in stages, with a progressive narrowing of the search in successive stages using some kind of measure of goodness of alignments to guide the search. These techniques may be described generically as 'metrics-guided search'.

With these techniques, ideal solutions cannot normally be guaranteed but acceptably good approximate solutions can normally be found without excessive computational demands.
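As an illustration of this general idea, the following is a minimal, generic sketch of metrics-guided search in Python. It is not any specific published algorithm and not part of the SP52 model; the 'expand' and 'score' functions stand for whatever operations build and evaluate candidate alignments in a given application.

    def metrics_guided_search(start, expand, score, beam_width, max_stages):
        # Generic staged search: at each stage, grow the current candidates
        # with 'expand', score them, and keep only the best 'beam_width'.
        # 'expand' and 'score' are supplied by the application; here they
        # would build and evaluate alignments.
        candidates = [start]
        best = start
        for _ in range(max_stages):
            grown = [c2 for c in candidates for c2 in expand(c)]
            if not grown:
                break
            grown.sort(key=score, reverse=True)
            candidates = grown[:beam_width]
            best = max(best, candidates[0], key=score)
        return best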

There is now a fairly large literature about methods for finding good alignments amongst two or more sequences of symbols. Some of the existing methods are reviewed in Taylor (1988), Barton (1990), Chan, Wong et al. (1992) and Day and McMorris (1992). For reasons which will be explained in the next section, none of the current methods seems to be entirely suitable for incorporation in the proposed SP system.

2.1 Varieties of the Multiple Alignment Problem

In this research, the multiple alignment problem has been generalised in the following way:

One (or more) of the sequences of symbols to be aligned has a special status and is designated as 'New'. In the context of parsing, this would be the sentence (or other sequence of symbols) which is to be parsed.

All other sequences are designated as 'Old'. In the context of parsing, this would be the sequences of symbols which represent grammatical 'rules' (more about this in Section 3, below).

A 'good' alignment is one which, through the unification of symbols in New with symbols in Old, and through unifications amongst the symbols in Old, leads to a relatively large amount of compression of New in terms of the sequences in Old. How this may be done is explained in Section 4, below.

An implication of this way of framing the alignment problem is that any given sequence in Old may appear two or more times in any one alignment and may therefore be aligned with itself (with the obvious restriction that any one instance of a symbol may not be aligned with itself).

It should be clear that the concept of multiple alignment may be generalised to two-dimensional (or even higher-dimensional) patterns. There is likely to be a case, at some stage in the SP programme, for extending the ideas described in this article into the domain of two or more dimensions.

3 PARSING AS MULTIPLE ALIGNMENT

This section describes how the parsing of a sentence in terms of a grammar may be seen in terms of multiple alignment. The following section describes how such a parsing may be evaluated in terms of IC.

3.1 Representing a Grammar with Patterns of Symbols

Figure 2 shows a simple context-free phrase-structure grammar (CF-PSG) describing a fragment of the syntax of English.4 This grammar generates sentences like ' t h i s b o y l o v e s t h a t g i r l', ' t h a t b o y h a t e s t h i s g i r l', and so on. Any of these sentences may be parsed in terms of the grammar giving a labelled bracketing like this:

       (S(NP(D t h i s)(N b o y))(V l o v e s)
            (NP(D t h a t)(N g i r l)))
or an equivalent representation in the form of a tree.
       S -> NP V NP
       NP -> D N
       D -> t h i s
       D -> t h a t
       N -> g i r l
       N -> b o y
       V -> l o v e s
       V -> h a t e s
Figure 2

A CF-PSG describing a fragment of English syntax.

Figure 3 shows the grammar from Figure 2 expressed as a set of strings, sequences or patterns of symbols.5

Each pattern in this 'grammar' is like a re-write rule in the CF-PSG notation except that the rewrite arrow has been removed, some other symbols have been introduced (' 0', ' 1' and symbols with an initial ' #' character) and there is a number to the right of each rule.6

The number to the right of each rule in Figure 3 is a frequency of occurrence of the rule in a parsing of a notional sample of the language. These frequencies of occurrence will be discussed later.

S NP #NP V #V NP #NP #S      500
NP D #D N #N #NP            1000
D 0 t h i s #D               600
D 1 t h a t #D               400
N 0 g i r l #N               300
N 1 b o y #N                 700
V 0 l o v e s #V             650
V 1 h a t e s #V             350
Figure 3

A simple grammar written as patterns of symbols. For each rule, there is a number on the right representing the frequency of occurrence of the rule in a notional sample of the language [basic1, basic2].

The reasons for the symbols which have been added to each rule will become clear but a few words of explanation are in order here. The symbols ' 0' and ' 1' have been introduced to differentiate the two versions of the ' D' patterns, and likewise for the ' N' patterns and ' V' patterns. They enter into matching and unification in exactly the same way as other symbols. Although the symbols are the same as are used in other contexts to represent numbers they do not have the meaning of numbers in this grammar.

The symbols which begin with ' #' (e.g., ' #S', ' #NP') serve as 'termination markers' for patterns in the grammar. Although their informal description as 'termination markers' suggests that these symbols are meta symbols with special meaning, they have no hidden meaning and they enter into matching and unification like every other symbol.7

In general, all the symbols which can be seen in Figure 3 and the examples presented in the accompanying article enter into matching and unification in the same way. Although some of these symbols can be seen to serve a distinctive role, there is no hidden meaning attached to any of them and no formal distinction between upper- and lower-case letters or between digit symbols and alphabetic symbols and so on.8

3.2 Parsing as Alignment of a Sentence and Rules in a Grammar

Figure 4 shows how a parsing of the sentence ' t h i s b o y l o v e s t h a t g i r l' may be seen as an alignment of patterns which includes the sentence and relevant rules from the grammar shown in Figure 3. The similarity between this alignment and the conventional parsing may be seen if the symbols in the alignment are 'projected' on to a single sequence, thus:

S NP D 0 t h i s #D N 1 b o y #N #NP V 0 l o v e s #V

NP D 1 t h a t #D N 0 g i r l #N #NP #S

In this projection, the two instances of ' NP' in the second column of the alignment have been merged or 'unified' and likewise for the two instances of ' D' in the third column and so on wherever there are two or more instances of a symbol in any column.

         t h i s        b o y            l o v e s           t h a t        g i r l           
         | | | |        | | |            | | | | |           | | | |        | | | |           
     D 0 t h i s #D     | | |            | | | | |           | | | |        | | | |           
     |           |      | | |            | | | | |           | | | |        | | | |           
     |           |  N 1 b o y #N         | | | | |           | | | |        | | | |           
     |           |  |         |          | | | | |           | | | |        | | | |           
  NP D           #D N         #N #NP     | | | | |           | | | |        | | | |           
  |                               |      | | | | |           | | | |        | | | |           
  |                               |  V 0 l o v e s #V        | | | |        | | | |           
  |                               |  |             |         | | | |        | | | |           
S NP                             #NP V             #V NP     | | | |        | | | |    #NP #S 
                                                      |      | | | |        | | | |     |     
                                                      |  D 1 t h a t #D     | | | |     |     
                                                      |  |           |      | | | |     |     
                                                      |  |           |  N 0 g i r l #N  |     
                                                      |  |           |  |           |   |     
                                                      NP D           #D N           #N #NP    
Figure 4

Parsing of a sentence as an alignment amongst sequences representing the sentence and relevant rules in the grammar in Figure 3 [basic1]

This projection is the same as the conventional parsing except that ' 0' and ' 1' symbols are included, right bracket symbols (' )') are replaced by 'termination markers' and each of the upper-case symbols is regarded both as a 'label' for a structure and as a left bracket for that structure.

Notice that the pattern ' NP D #D N #N #NP' appears twice in the alignment in Figure 4, in accordance with what was said in Section 2.1. In general, any pattern in the grammar used for parsing may appear two or more times in an alignment. Other examples are shown in this and the accompanying article.

As was noted in Section 2.1, the sentence or other sequence of symbols to be parsed is regarded as New, while the rules in the grammar are regarded as Old. For the sake of readability and ease of interpretation, New is normally placed at the top of each alignment with patterns from Old below it.

For the sake of clarity in Figure 4 and Figure 6 in this article and the alignments shown in the accompanying article, each appearance of a pattern in any alignment is given a line to itself (so the two appearances of ' NP D #D N #N #NP' in Figure 4 are on two different lines). Apart from the convention that New is always at the top, the order in which patterns appear (from top to bottom of the alignment) is entirely arbitrary. An alignment in which the patterns appear in one order is totally equivalent to an alignment in which they appear in any other order, provided all other aspects of the alignment are the same.
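To make the notion of 'projection' concrete, here is a minimal Python sketch (illustrative only, not the SP52 code). Each column of an alignment is represented as the list of symbol instances appearing in that column; the example columns are an excerpt from the left-hand end of Figure 4.

    def project(alignment_columns):
        # An alignment 'projects' into a single sequence only if all the
        # symbols within each column are identical; unification then merges
        # them into one symbol per column.
        projection = []
        for column in alignment_columns:
            if len(set(column)) != 1:
                return None      # mismatched column: no projection is possible
            projection.append(column[0])
        return projection

    # First few columns of the alignment in Figure 4 (excerpt only):
    columns = [["S"], ["NP", "NP"], ["D", "D"], ["0"],
               ["t", "t"], ["h", "h"], ["i", "i"], ["s", "s"], ["#D", "#D"]]
    print(" ".join(project(columns)))    # S NP D 0 t h i s #D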

- - -

4. For the sake of clarity in exposition and saving of space, all the grammars shown in this and the accompanying article are much simpler than would be required in any practical system. Naturally, this means that many subtleties of English must be omitted from these grammars. For similar reasons, all examples of parsings which are presented have been chosen so that they are small enough to fit on one page without resorting to font sizes which are too small to read. However, for the reasons given in Section 5.4, the proposals in these two articles appear to be general and scalable to realistically large grammars and parsings.

5. In this context, a symbol is any string of characters bounded by spaces and a pattern is an ordered set of symbols appearing on one line. (Where a pattern is too long for the width of the page, a continuation of a line may be shown by the use of indentation.) Later in the article it will be convenient to distinguish the concept of a substring from a subsequence. The former is a sequence of symbols within a larger sequence where the symbols as they appear in the larger sequence are contiguous, one with the next. The latter is a sequence of symbols within a larger sequence where the symbols, as they appear in the larger sequence, need not be contiguous, one with the next.

6. For the remainder of this article, quote marks will be dropped when referring to any grammar like that in Figure 3 which is expressed as patterns of symbols. Likewise, the word 'rule' with respect to this kind of grammar will be referred to without quote marks.

7. See qualification of this assertion below.

8. The foregoing assertions are not strictly true of the method of evaluating alignments which is used in the SP52 model (see Section 4.7.4). The principle of "no meta symbols" and thus "no hidden meanings for symbols" is an ideal which this research aims to attain. But, as a temporary solution to the problem of scoring alignments pending something better, a distinction has been recognised between symbols which begin with ' #' and all other symbols.

4 EVALUATION OF AN ALIGNMENT IN TERMS OF COMPRESSION

Intuitively, a good alignment is one which has many hits (positive matches between symbols), few gaps (sequences of one or more symbols which are not part of any hit) and, where there are gaps, they should be as short as possible. It is possible to use measures like these directly in computer programs for finding good multiple alignments and, indeed, they commonly are. However, our confidence in the validity of measures like these may be increased if they can be placed within a broader theoretical framework.

Concepts of information, IC and related concepts of probability may provide a suitable framework. Work on the evaluation of multiple alignments in this tradition includes Reichert, Cohen et al. (1973), Felsenstein (1981), Allison, Wallace et al. (1992), Chan, Wong et al. (1992), Allison and Wallace (1994), Wolff (1994a).

As was indicated in Section 2.1, a good alignment is defined here as one which provides a basis for an economical coding of New in terms of the patterns in Old.

The compression method which is described in Sections 4.1 to 4.4, below, exploits the elementary principle that a (relatively long) sequential pattern which repeats two or more times in a body of information may be replaced by a shorter 'code', 'codeword' or 'tag pattern' associated with that pattern in some kind of 'dictionary' of patterns. In effect, each instance of the pattern in the data is unified with the same pattern as it appears in the repository of patterns. This is the basis of all standard methods for IC (Storer, 1988).

The present proposals differ from standard methods in two ways:

  • Each code serves a dual role: to identify the corresponding pattern uniquely within the grammar, and to mark the left and right ends of the pattern. For present purposes, the second role is required to remove the ambiguity which would otherwise exist about left-to-right sequencing of symbols in alignments.9

  • In the present scheme, the coding principle may be applied through two or more 'levels' so that the symbols which encode a sequence of two or more patterns at one level may themselves be recognised as an instance of a recurrent pattern which has its own code at the next higher level. Examples are presented and discussed below.
A key point in this connection is that a recurrent pattern may be discontinuous in the sense that the symbols in the pattern are not necessarily contiguous as they appear in any or all of its occurrences. In other words, a recurrent pattern may appear as a subsequence within larger patterns. Thus, for example, a sequence of symbols like 'A B C D E F' may be recognised as a recurrent pattern within a set of instances which includes patterns like ' P A B Q C R D E F S', ' A L B C D M N E F O P', ' X A B C D Y E F Z' and so on.
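The distinction between a substring and a subsequence can be made concrete with a small Python sketch (illustrative only, not the SP52 code); the patterns below are the ones used as examples in the preceding paragraph.

    def is_subsequence(pattern, sequence):
        # True if 'pattern' appears in 'sequence' in order but not
        # necessarily contiguously (i.e. as a possibly discontinuous
        # subsequence rather than a coherent substring).
        remaining = iter(sequence)
        return all(symbol in remaining for symbol in pattern)

    pattern = "A B C D E F".split()
    instances = ["P A B Q C R D E F S".split(),
                 "A L B C D M N E F O P".split(),
                 "X A B C D Y E F Z".split()]
    print([is_subsequence(pattern, seq) for seq in instances])   # [True, True, True]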

In what follows, I shall first give an informal explanation of the method of calculating the compression associated with any alignment using the example shown in Figure 4 (Sections 4.1 to 4.4). Then the principles embodied in the method are discussed in Section 4.6 and a formal summary of the method is presented in Section 4.7.

4.1 Encoding Individual Symbols

The simplest way to encode individual symbols in the sentence and the grammar is with a 'block' code using a fixed number of bits for each symbol. In the grammar in Figure 3, there are 24 symbol types so the minimum number of bits required for each symbol is ⌈log2 24⌉ = 5 bits per symbol.

In fact, the SP52 model (described in Section 5) uses variable-length codes for symbols, assigned in accordance with the Shannon-Fano-Elias (S-F-E) coding scheme (described by Cover and Thomas (1991)) so that the shortest codes represent the most frequent symbols and vice versa.10
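By way of illustration only (this is not the SP52 code), the sketch below computes the fixed block-code cost for an alphabet and the Shannon-Fano-Elias code lengths, which for a symbol of probability p are ⌈log2(1/p)⌉ + 1 bits (Cover and Thomas, 1991). The frequencies in the example are made-up values, not those derived from Figure 3.

    import math

    def block_code_bits(alphabet_size):
        # Fixed-length 'block' code: every symbol costs the same number of bits.
        return math.ceil(math.log2(alphabet_size))

    def sfe_code_lengths(frequencies):
        # Shannon-Fano-Elias coding: a symbol with probability p receives a
        # codeword of ceil(log2(1/p)) + 1 bits.
        total = sum(frequencies.values())
        return {sym: math.ceil(math.log2(total / f)) + 1
                for sym, f in frequencies.items()}

    print(block_code_bits(24))                                   # 5 bits per symbol
    print(sfe_code_lengths({'a': 400, 'b': 300, 'c': 200, 'd': 100}))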

Notice that the number of bits required for each symbol is entirely independent of the number of characters in the name of the symbol as it is shown in the examples. Names of symbols are chosen purely for their mnemonic value and to aid comprehension.

There are many variations and refinements that may be made at this level but, in general, the choice of coding system for individual symbols is not critical for the principles to be described below where the focus of interest is the exploitation of redundancy which may be attributed to sequences of two or more symbols rather than any redundancy attributed to unbalanced frequencies of individual symbols.

For reasons connected with the decoding of information, which are given in Section 6, the code for each symbol has two different values (in bits): a 'minimum cost', which is the theoretical minimum number of bits needed to represent that symbol according to the S-F-E calculations, and an 'actual cost', which is the (larger) number of bits needed to allow robust decoding of information (see Section 6).

In the following informal description of the encoding principles, the distinction between the 'minimum cost' and the 'actual cost' of each symbol is not important and will be ignored. For the sake of simplicity in this presentation, it will be assumed that all symbols are encoded with the same number of bits so that 'one symbol' can be treated as the minimum unit of information.

4.2 Encoding Words

A word in the sentence to be parsed may be encoded using some shorter code which is associated with the word in the dictionary and which identifies the word uniquely within the dictionary. In this case, the grammar in Figure 3 plays the part of the dictionary of segments in a standard compression system.

In accordance with the remarks above, the code pattern must show where the word starts and also where it finishes. In some cases, this can be achieved by using letter symbols from the word itself. Thus ' g l' could be used as an abbreviation for ' g i r l' very much in the same way as we use abbreviations like 'B'ham' for 'Birmingham' in ordinary writing. However, because of the haphazard nature of such markers, there are advantages if we use code symbols introduced for the purpose at the beginnings and ends of words.

In the light of these remarks, it should be clear that the word ' t h i s' in the grammar shown in Figure 3 may be encoded as ' D 0 #D', the word ' l o v e s' may be encoded as ' V 0 #V' and likewise for the other words. In all cases except ' b o y' there is a modest saving of one or two symbols for each word. If the grammar contained words with fewer than three letters, then the 'saving' in those cases would be negative!

4.3 Encoding Phrases

Consider the phrase ' t h i s b o y'. If this were encoded with a code pattern for each word, the result would be ' D 0 #D N 1 #N' which is only one symbol smaller than the original. However, we can encode the phrase with fewer symbols by taking advantage of the fact that the sequence ' D 0 #D N 1 #N' has a subsequence, ' D #D N #N', which is a substring within the pattern ' NP D #D N #N #NP' in the grammar. Notice that the sequence ' D #D N #N' is discontinuous within the sequence ' D 0 #D N 1 #N' in the sense described earlier.

Since the 'noun phrase' pattern ' NP D #D N #N #NP' is in the grammar, we may replace the substring ' D #D N #N' by the 'code' sequence ' NP #NP'. But then, to encode the two words within the noun phrase (' t h i s' and ' b o y'), we must add the symbols ' 0' and ' 1' from ' D 0 #D N 1 #N' so that the final coded sequence is ' NP 0 1 #NP'.

Notice how the symbols ' NP' and ' #NP' in the code pattern ' NP 0 1 #NP' serve as a disambiguating context so that the symbol ' 0' identifies the pattern ' D 0 t h i s #D' and the symbol ' 1' identifies the pattern ' N 1 b o y #N'.

The overall cost of the code pattern ' NP 0 1 #NP' is 4 symbols compared with the original 7 symbols in ' t h i s b o y' - a saving of 3 symbols. In a similar way, the phrase ' t h a t g i r l' may be encoded as ' NP 1 0 #NP' which is 4 symbols smaller than the original.

4.4 Encoding the Sentence

Given the two noun phrases in their encoded forms (' NP 0 1 #NP' for ' t h i s b o y' and ' NP 1 0 #NP' for ' t h a t g i r l') and the encoding of ' l o v e s' as ' V 0 #V', the whole sentence may be encoded as ' NP 0 1 #NP V 0 #V NP 1 0 #NP'.

However, this sequence contains the subsequence ' NP #NP V #V NP #NP' and this sequence is a substring within the 'sentence' pattern ' S NP #NP V #V NP #NP #S' and this is in the grammar. So we may replace the sequence ' NP #NP V #V NP #NP' by the 'code' sequence ' S #S'. To discriminate the words in this sentence we must add the symbols ' 0 1 0 1 0' from the sequence ' NP 0 1 #NP V 0 #V NP 1 0 #NP'. The overall result is an encoded representation of the sentence as:

S 0 1 0 1 0 #S.

The 7 symbols in this encoding of the sentence represent a substantial compression compared with the 20 symbols in the unencoded sentence.

4.5 Taking Account of the Sizes of Gaps

The account of pattern matching and coding in Sections 4.3 and 4.4 illustrates the way in which 'matching' in the proposed scheme embraces the matching of subsequences (where the matched symbols need not be contiguous) as well as the more traditional matching of coherent substrings (where the matched symbols are always contiguous, one with the next).

In this connection, most people have a strong intuition that, where there are gaps in matching, small gaps or no gaps are 'better' than large ones: a match in which the matched symbols are contiguous, or nearly so, would normally be regarded as better than a match in which the same symbols are separated by long gaps.

It seems that our intuitions in this area can be justified with an analysis in terms of information theory. An attempt at this analysis is presented in Section A1 together with an account of how the sizes of gaps are allowed for in the SP52 model.

4.6 Discussion

Each pattern expresses sequential redundancy in the data to be encoded and this sequential redundancy can be exploited to reduce the number of symbols which need to be written out explicitly. In the grammar shown in Figure 3, each pattern for an individual word expresses the sequential redundancy of the letters within that word; the pattern for a noun phrase expresses the sequential redundancy of 'determiner' followed by 'noun'; and the pattern for a sentence expresses the sequential redundancy of the pattern: 'noun phrase' followed by 'verb' followed by 'noun phrase'.

Since this principle operates at all levels in the 'hierarchy' of patterns, many of the symbols at intermediate levels may be omitted completely. A sentence may be specified with symbols marking the start and end of the sentence pattern together with interpolated symbols which discriminate amongst alternatives at lower levels.

Notice that these ideas are only applicable to alignments which can 'project' into a single sequence of symbols, as is the case with the alignment shown in Figure 4. Any alignment like this:

a x b               a b x
|   |    or this    | |
a y b               a b y
where there is a 'mismatch' of symbols, cannot be evaluated in this way. For present purposes, any such alignment is excluded from consideration. When the SP model is generalised to other areas such as learning, it is intended that alignments like those just shown will be evaluated alongside those which can project without mismatches.

The method which has been described illustrates the role of context in the encoding of information. Any one symbol like '0' or '1' is ambiguous in terms of the patterns in the grammar in Figure 3. But in the context of the pattern 'S 0 1 0 1 0 #S', it is possible to assign each instance of '0' or '1' unambiguously to one of the words in the grammar, giving the sequence of words in the original sentence. It appears that ICMAUS provides a mechanism for 'decoding' the encoded form of the sentence, as is discussed in Section 6, below.

4.7 Summary of Method for Calculating the Compression Associated with an Alignment

The proposed method of calculating the compression score (CS) associated with an alignment of patterns is summarised in more formal terms here. This is the method embodied in the SP52 model, described in Section 5.

The method is designed to calculate the compression of New information (the sentence to be parsed) which may be achieved by 'encoding' New information in terms of Old information (where Old information is the patterns of symbols representing the grammar used in parsing). This CS is calculated as:

CS = BN - BE
where BN is the number of bits required to represent New information in 'raw' form (without any encoding except S-F-E coding at the level of single symbols), and BE is the number of bits required for the encoding of New information in terms of Old information. How these values are calculated is described below.

As previously noted, this method only applies to alignments which can be 'projected' into a single sequence of symbols. As indicated in Section 7.2.1, it is intended that the method will be generalised in the future to accommodate alignments of any kind.

4.7.1 Information Costs of Symbols.

If a simple block code is used for symbols, then the 'minimum cost', M, for each symbol is

M = ⌈log2 |S|⌉

bits, where |S| is the number of symbol types in the alphabet of symbol types (S) used throughout New and Old.

As previously noted, the value of M for each symbol type (and thus each individual symbol) is calculated in SP52 by the S-F-E method. For any one symbol type, the input for this calculation is the (notional) frequency of occurrence of the symbol type, which is derived in a straightforward manner as

f = Σi=1..P (fi × oi)

where fi is the (notional) frequency of the ith pattern in the grammar (illustrated by the numbers on the right of Figure 3), oi is the number of occurrences of the given symbol in the ith pattern and P is the number of patterns in the grammar.
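A minimal Python sketch of this derivation (illustrative only, not the SP52 code), using the patterns and notional frequencies of Figure 3, might look like this:

    from collections import Counter

    # (pattern, notional frequency) pairs from Figure 3
    grammar = [("S NP #NP V #V NP #NP #S", 500), ("NP D #D N #N #NP", 1000),
               ("D 0 t h i s #D", 600), ("D 1 t h a t #D", 400),
               ("N 0 g i r l #N", 300), ("N 1 b o y #N", 700),
               ("V 0 l o v e s #V", 650), ("V 1 h a t e s #V", 350)]

    def symbol_frequencies(grammar):
        # For each symbol type, sum (pattern frequency x occurrences of the
        # symbol in that pattern) over all patterns in the grammar.
        freq = Counter()
        for pattern, f in grammar:
            for symbol, occurrences in Counter(pattern.split()).items():
                freq[symbol] += f * occurrences
        return freq

    print(symbol_frequencies(grammar)["NP"])    # 500*2 + 1000*1 = 2000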

Whichever way the value of M is calculated, the 'actual cost', A, of each symbol is

A = M × c

where c is a factor whose size is not critical except that it must be bigger than 1.

4.7.2 Calculation of E, the Minimum Number of Bits Required for the Encoding of a Given Pattern in Old.

The calculation of BE for any alignment requires a value for the 'encoding cost', E, for each pattern from Old which appears in the alignment.

Since there is a frequency of occurrence associated with each pattern in any grammar, it is possible to calculate a theoretical minimum for the value of E for each pattern using the S-F-E method. However, there is an alternative method of calculating E which, for present purposes, appears to be more useful and which has been adopted in the SP52 model described in Section 5.

In summary, the alternative method is to calculate E as

E = Σi=1..n Di

where Di is the M value for the ith symbol in a subsequence of n 'discrimination' symbols within the given pattern which identifies the pattern uniquely amongst the patterns in the grammar without over-specifying the pattern.

Ideally, the discrimination symbols for a pattern would be whatever subsequence of the pattern was most distinctive of the pattern, regardless of the position of the symbols within the pattern. However, in the SP52 model, two constraints have been imposed:

  • The simplifying assumption has been made that the discrimination symbols are the smallest substring of one or more symbols starting at the beginning of the pattern which enables the pattern to be identified uniquely within the grammar. For any pattern, it is easy to discover what this substring is by a process of systematic comparison of candidate substrings with corresponding symbols in other patterns in the grammar (a sketch of such a comparison is given below).

    Although a constrained subsequence of symbols is used in calculating the value of E for the pattern, this does not mean that a pattern can only ever be recognised by those symbols and no others. In the SP52 model, a pattern can be fully or partially recognised by any subsequence of its symbols.

  • For each pattern which ends in a symbol containing the hash character ('#'), this 'termination' symbol is added to the set of discrimination symbols for the pattern if it is not otherwise there.
The second constraint is not entirely satisfactory, not least because it conflicts with the principle that all symbols should have the same status and none should have any 'hidden' meaning. The constraint has been adopted to compensate for the fact that 'termination' symbols are largely redundant for the purpose of identifying patterns because each one duplicates the information in the first symbol of the same pattern (the primary function of 'termination' symbols is to mark the ends of patterns). And this redundancy means that, without the constraint, the method of calculating BE (see Section 4.7.4, below) can sometimes fail to discriminate amongst alignments which clearly differ in their potential for economical encoding.
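The following Python sketch illustrates the first constraint above (the smallest leading substring which identifies a pattern uniquely); it is an illustration only, not the SP52 code, and it does not apply the second constraint concerning 'termination' symbols. The grammar is the one from Figure 3.

    def discrimination_symbols(target, patterns):
        # Smallest substring of symbols, starting at the beginning of 'target',
        # with which no other pattern in the grammar also begins.
        for length in range(1, len(target) + 1):
            prefix = target[:length]
            if not any(p != target and p[:length] == prefix for p in patterns):
                return prefix
        return target    # the whole pattern is needed

    patterns = [p.split() for p in
                ["S NP #NP V #V NP #NP #S", "NP D #D N #N #NP",
                 "D 0 t h i s #D", "D 1 t h a t #D",
                 "N 0 g i r l #N", "N 1 b o y #N",
                 "V 0 l o v e s #V", "V 1 h a t e s #V"]]

    print(discrimination_symbols("D 0 t h i s #D".split(), patterns))   # ['D', '0']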

4.7.3 Calculation of BN (the Number of Bits Required to Express Symbols from New in 'Raw' Form, without Encoding).

For any one alignment, BN is calculated as:

BN = Σi=1..h Ai

where Ai is the 'actual cost' of the symbol corresponding to the ith hit in a sequence of hits, H1 .... Hh, with an adjustment to be described in the next paragraph. The hit sequence H1 .... Hh comprises the hits between symbols in New and symbols in patterns in Old. The symbols from New in this hit sequence are a subsequence of the sequence N1 ... Nn, which is the pattern in New.

Before this formula is applied, the value of each Ai is adjusted to take account of any 'gap' which may exist between the given hit and any previous hits in the sequence of hits between New and patterns in Old. For this purpose, the alignment is treated as if it were two sequences of symbols: the sequence of symbols which is New (the sentence being parsed) and the sequence of symbols which is the projection of the alignment into a single sequence. Then, for the symbol corresponding to the ith hit in the sequence H1 .... Hh, the adjusted value of Ai is calculated as:

Ai = ai × Fs

where ai is the actual cost of the symbol corresponding to the ith hit in H1 .... Hh, and Fs is the sth entry in a table of 'scaling factors' whose rationale is described in Appendix A1. The value of F1 is always 1.

For each hit in H1 .... Hh after the first, the variable s (which represents the 'span' between the current hit in H1 .... Hh and the preceding hit) is calculated as:

where Pi is the position in N1 ... Nn of the symbol corresponding to the ith hit in H1 .... Hh, Pi-1 is the position in N1 ... Nn of the symbol corresponding to the (i-1)th hit in H1 .... Hh. Ci and Ci-1 are the analogous positions in the projection of the alignment into a single sequence - which means that Ci and Ci-1 represent columns in the alignment itself.

As indicated above, the rationale for this method of adjusting for gaps between symbols is described and discussed in Appendix A1.
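The structure of the BN calculation may be illustrated with the following Python sketch. It is not the SP52 code: the scaling-factor table shown is a made-up placeholder, since the real factors are derived as described in Appendix A1, and the spans are assumed to have been computed already.

    def compute_BN(hit_costs, spans, scaling_factors):
        # hit_costs[i]       : 'actual cost' (bits) of the symbol at the ith hit
        # spans[i]           : span s between the ith hit and the previous hit
        #                      (the first hit is given a span of 1, i.e. no gap)
        # scaling_factors[s] : table of factors F indexed by span; F[1] == 1.0
        total = 0.0
        for cost, s in zip(hit_costs, spans):
            total += cost * scaling_factors[s]
        return total

    # Made-up numbers, purely to show the shape of the calculation.
    scaling = {1: 1.0, 2: 0.8, 3: 0.65, 4: 0.55}
    print(compute_BN(hit_costs=[6.0, 6.0, 6.0, 6.0],
                     spans=[1, 1, 3, 2],
                     scaling_factors=scaling))     # 6 + 6 + 3.9 + 4.8 = 20.7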

4.7.4 Calculation of BE (the Number of Bits Required to Encode the Alignment).

For each new alignment, the value of BE is:

BE = (Σ Ei) − S

where Ei is the 'encoding cost' of the Old pattern appearing on the ith line of the alignment other than the top line (where New appears), the sum runs over all such lines, and S is the saving in encoding costs arising from the fact that some patterns in the alignment convey information about the sequential arrangement of other patterns in the alignment or the selection of other patterns in the alignment where alternatives are possible in a given context. The 'encoding cost' of any pattern is the value of E for that pattern, calculated as described in Section 4.7.2. Notice that if any pattern appears two or more times in the alignment, its encoding cost is added a corresponding number of times to the sum of encoding costs.

The calculation of BE depends on three main ideas:

  • As previously noted, a pattern may be fully or partially recognised by any subsequence of the pattern. In other words, it is not necessary to use the specific symbols which were used in calculating the value of E for that pattern. As a general rule when the grammar is largely free of redundancy, if the M values of the relevant symbols (adjusted for gaps - see next) add up to the value of E then that subsequence of symbols will identify the pattern uniquely amongst the other patterns in the grammar. Where there is redundancy in the grammar, more bits may be needed to achieve unique identification of a pattern.

  • If there are gaps in a sequence of hits, information values must be reduced in accordance with the rules used in calculating the value of BN (Section 4.7.3), described in Appendix A1.

  • For present purposes, there is nothing to be gained by over-specifying a pattern. If one pattern matches a second pattern by the minimum number of symbols to achieve identification of that second pattern, then the saving in encoding costs from this source is maximal. Any additional hits between the two patterns do not give any additional saving in encoding costs.

BE is calculated in the following way:

  1. For each row (R) in the alignment corresponding to a pattern from Old, create a variable (V) containing the value of E for the pattern in that row.

  2. Traverse the alignment from left to right examining the columns containing two or more symbols (including symbols in New). Any such column is designated a 'hit' column (CH).

  3. For each CH which contains two or more symbols from patterns in Old (which we may designate CHO), examine each row which has a hit symbol from Old in the column (designated RHO). For this symbol, calculate MA, an 'adjusted' value of M for the symbol, taking account of any gap which may exist between the given CHO and any previous CH. The method of making the adjustment is the same as is used for calculating the value of BN (Section 4.7.3 and Appendix A1) except that, for each RHO, the gaps (or spans) are measured as if all the rows in the alignment except the given RHO were a single pattern to which the pattern in the given RHO is aligned. As in the calculation of BN, it is assumed that there is no gap associated with the first CH for any given pattern.

  4. For each CHO, examine each RHO and, amongst these rows, identify the 'leading' row, RHOL, whose pattern starts furthest to the left in the alignment (if there is a tie, make an arbitrary choice amongst the ties). For example, in Figure 4, for either of the two columns which contains a hit between ' D' in 'D 0 t h i s #D' and ' D' in ' NP D #D N #N #NP', the RHOL is the one containing ' NP D #D N #N #NP' (row 4 in the first case and row 9 in the second case); for either of the two columns containing a hit between ' NP' in ' NP D #D N #N #NP' and ' NP' in ' S NP #NP V #V NP #NP #S', the RHOL is the row containing ' S NP #NP V #V NP #NP #S' (row 6 in both cases).

  5. For each CHO, consider, in turn, each RHO, excluding the RHOL. For each row considered, subtract the value of MA from the value of V for that row. If the new value of V is less than 0, V is set to 0 and no further subtraction from that instance of V is allowed.

  6. When all relevant columns have been examined and the values of the V variables have been reduced, calculate

     BE = Σi=2..a Vi

     where a is the number of rows in the alignment and the summation excludes the top line (which contains New).

The rationale for this method of calculating BE is that it gives us the sum of the E values of the patterns from Old corresponding to each row of the alignment after the first, with a reduction for hits between those patterns (with an adjustment for gaps in accordance with the reasoning presented in Appendix A1).

The reason for reducing the value of BE when there are hits between patterns in Old is that any such hit reflects a degree of 'coverage' of one pattern from Old by another such pattern. To the extent that one pattern provides information that also exists in another pattern there is a reduced need for the second pattern to be identified in the encoding. In the extreme case, where two patterns are identical, only one of them need be identified in the encoding.

As indicated above, any saving in encoding costs resulting from the coverage of one or more patterns by another cannot exceed the E value for each pattern - any additional hits are 'wasted'. Hence, the V value for any row cannot be reduced below 0.

In the method described above, the 'leading' row for any one column (RHOL) is regarded as the row with which the other symbols in the column are unified. Hence, for the given column, this is the row where the V value is not reduced by the value of MA. Intuitively, the left-to-right bias in the definition of 'leading row' is less theoretically 'clean' than if all concepts were entirely symmetrical between left and right directions in the alignment. However, the concepts as described are the best to have been found so far and seem to work quite well.
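The bookkeeping in steps 1 to 6 may be illustrated with the following simplified Python sketch. It is not the SP52 code: the gap adjustment is omitted (every scaling factor is treated as 1, so the unadjusted M values are subtracted), and the data structures are chosen purely for the illustration.

    def compute_BE(rows, E_values, M_values, start_columns):
        # rows[r]          : dict mapping column index -> symbol in row r
        #                    (row 0 is New; rows 1.. hold patterns from Old)
        # E_values[r]      : encoding cost E of the pattern in row r (r >= 1)
        # M_values[sym]    : 'minimum cost' M of each symbol type, in bits
        # start_columns[r] : leftmost column occupied by the pattern in row r
        V = dict(E_values)                              # one V per Old row
        for c in sorted({col for row in rows for col in row}):
            old_rows = [r for r in range(1, len(rows)) if c in rows[r]]
            if len(old_rows) < 2:
                continue                                # not a CHO column
            leader = min(old_rows, key=lambda r: start_columns[r])   # the RHOL
            for r in old_rows:
                if r != leader:                         # leading row keeps its V
                    V[r] = max(0.0, V[r] - M_values[rows[r][c]])
        return sum(V.values())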

- - -

9. Although standard methods of data compression use codes without explicit termination markers, there is an unavoidable need for information about the left and right boundaries of each coded pattern and this information is normally provided implicitly by the use of space characters or end-of-line characters or something similar, or by the use of a trie for the encoding and decoding of data.

10. The Huffman coding scheme, described in the same source and also in Storer (1988), is better known than the S-F-E scheme and is a little more efficient in terms of IC. But the S-F-E scheme allows probabilities to be inferred more reliably from the sizes of codes assigned by the scheme and it is anticipated that this will be advantageous when the model is generalised and applied to the drawing of probabilistic inferences.

5 THE SP52 MODEL

Given the example sentence discussed earlier (shown at the top of Figure 4) and the grammar in Figure 3, the SP52 model can find the alignment shown in Figure 4 and, in accordance with our intuitions, identify it as the best alignment amongst the several which it forms for the given sentence and the given grammar. Given relevant sentences and grammars, the model is also capable of finding all the other alignments shown in this article and identifying them as the best in terms of IC amongst alternative alignments for a given sentence and grammar. The model can also find all the alignments shown in the accompanying article, with exceptions which are noted and discussed in that article.

The SP52 model works by building alignments in a pairwise fashion selecting the 'best' in terms of compression at each stage. The method thus constitutes a fairly straightforward application of 'metrics-guided' search: examine large search spaces in stages, narrowing the search progressively at each stage using some kind of 'search metric' to guide the search.

Alignments can be built up in a pairwise manner because, at every stage, new alignments are accepted only if they can 'project' into a one-dimensional pattern as described in Section 3.2. Since any alignment can be treated as a single sequence of symbols it is possible to match it against any of the original patterns in the grammar or any of the alignments formed at earlier stages.

The program starts by searching for 'good' alignments between the sentence to be parsed and patterns in the grammar. For the example in Figure 4, the best alignments found at this stage are between the individual words in the sentence and corresponding patterns in the grammar.

At the next stage, the program looks for 'good' alignments between the best of the alignments previously found and patterns in the grammar. The 'best' alignments at this stage are ones between the alignments corresponding to the words and 'higher level' patterns in the grammar. Thus ' D 0 t h i s #D' and ' N 1 b o y #N' form an alignment with ' NP D #D N #N #NP', giving ' NP D 0 t h i s #D N 1 b o y #N #NP'; likewise, ' V 0 l o v e s #V' forms an alignment with ' S NP #NP V #V NP #NP #S' giving ' S NP #NP V 0 l o v e s #V NP #NP #S'; and ' D 1 t h a t #D' and ' N 0 g i r l #N' form an alignment with ' NP D #D N #N #NP' giving ' NP D 1 t h a t #D N 0 g i r l #N #NP'.

Finally, ' NP D 0 t h i s #D N 1 b o y #N #NP' and ' NP D 1 t h a t #D N 0 g i r l #N #NP' are aligned with ' S NP #NP V 0 l o v e s #V NP #NP #S' giving the result shown in Figure 4. At each stage, many 'worse' alignments are formed which are weeded out by the selection process.

An outline of how the model works is shown in Figure 5. The sections which follow are intended to complement this outline, explaining aspects of the model and filling in detail as required.

main()
{
       1. Read the rules of the grammar, each one with a frequency of occurrence in a
              notional sample of the language, and store the patterns with their 
              frequencies in Old.
       2. Read the sentence to be parsed and store it in New.
       3. From the frequency of each pattern, derive a frequency for each symbol in
              the grammar (as described in Section 4.7.1)
       4. Using the frequencies of the symbols with the S-F-E method, assign to each
              symbol in New and Old a number of bits representing the
              'minimum' information 'cost' of that symbol. Also, calculate
              an 'actual' information cost for each symbol (see Section 4.7.1).
       5. For each pattern in the grammar, calculate E, the minimum number of bits needed
              to encode that pattern (see Section 4.7.2).
       6. Select the sentence to be parsed and add it as the first 'driving pattern' to
              an otherwise empty list of driving patterns.
       7. while (new alignments are being formed)
              compress()
       8. Out of all the new alignments which have been formed, print the ones
              with the best CSs.
}
compress()
{
       1. Clear the 'hit structure' (described in the text).
       2. while (there are driving patterns which have not yet been processed)
       {
              2.1 Select the first or next driving pattern in the set of driving patterns.
              2.2 while (there are more symbols in the current driving pattern)
              {
                     2.2.1 Working left to right through the current driving pattern, select
                            the first or next symbol in the pattern.
                     2.2.2 'Broadcast' this symbol to make a yes/no match with
                            every symbol in the 'target patterns' in Old.
                     2.2.3 Record each positive match (hit) in a 'hit structure' (as 
                            described in the text). As more symbols are broadcast, the
                            hit structure builds up a record of sequences of hits
                            between the driving pattern and the several target patterns
                            in Old. As each hit sequence is extended, the
                            compression score of the corresponding alignment is 
                            estimated using a 'cheap to compute' method of estimation.
                     2.2.4 If the space allocated for the hit structure is filled at any time,
                            the system 'purges' the worst hit sequences from the hit 
                            structure to release more space. The selection uses the
                            estimates of compression scores assigned to each
                             hit sequence in Step 2.2.3.
              }
       }
      
       3 For each hit sequence which has an estimated compression score above
              some threshold value and which will 'project' into a single sequence
              (as described in the text), convert the hit sequence into the corresponding
              alignment. Discard this alignment if it is identical with any alignment
              already in Old. Otherwise, compute the compression score using the method
              described in Section 4, print the new alignment and add it to Old. If no
              new alignments are formed, quit the compress() function.
       4 Excluding the original patterns in Old, examine all the alignments that have
              been added to Old since the beginning of processing and choose a subset of
              these alignments using the method described in the text. Remove from
              Old all the alignments which have not been selected. The original
              patterns are never removed from Old.
       5 Clear the list of driving patterns and then, using the same method as is used
              in 4 but (usually) with a more restrictive parameter, select a subset of
              the alignments remaining in Old and add references to those alignments
              to the list of driving patterns (these patterns are not removed from Old
              and may therefore also be target patterns on the next cycle).
}
Figure 5 Outline of the organisation of the SP52 model.

5.1 Preliminary Processing

5.1.1 Calculation of the Information Cost of Each Symbol.

As was described in Section 3.1, each rule in the grammar has an associated frequency of occurrence in some notional sample of the language. In Step 3 of main() in Figure 5, the model derives the frequency of occurrence of each symbol type as described in Section 4.7.1, above.

These frequencies are then used (in Step 4 of main()) to calculate the minimum number of bits needed to represent each symbol type using the S-F-E coding scheme (see Cover and Thomas (1991)), as was described in Section 4.7.1. The resulting values for each symbol type are then assigned as 'minimum cost' values to corresponding symbols in New and Old. Each symbol in New and Old is also given an 'actual cost' which is the minimum cost increased by an arbitrary factor, rounded up to ensure that the actual cost is at least one bit larger than the minimum cost (see Section 4.7.1).
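
The following sketch (in Python, with illustrative frequencies and an arbitrary inflation factor; it is not the code of SP52) indicates the kind of calculation involved, assuming that the minimum cost of a symbol is taken to be -log2 of its relative frequency, in the spirit of the S-F-E scheme:

import math

def symbol_costs(symbol_freqs, inflation=1.3):
    """For each symbol type, return a 'minimum' cost in bits, derived from its
    relative frequency, and an 'actual' cost: the minimum inflated by an
    arbitrary factor and rounded up so that it exceeds the minimum by at
    least one bit."""
    total = sum(symbol_freqs.values())
    costs = {}
    for sym, freq in symbol_freqs.items():
        minimum = -math.log2(freq / total)
        actual = math.ceil(minimum * inflation)
        if actual < minimum + 1:
            actual = math.ceil(minimum + 1)
        costs[sym] = (minimum, actual)
    return costs

for sym, (mn, ac) in symbol_costs({'NP': 200, '#NP': 200, 't': 50, 'h': 50}).items():
    print(sym, round(mn, 2), ac)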

The use of variable-length codes for symbols implies the use of a trie or other device to divide an unsegmented bit stream into its constituent symbols. Strictly speaking, the information cost of this trie should be entered into the calculation of encoding costs. However, for a given alphabet of symbols, this cost is constant and becomes a vanishingly small proportion of total costs when large volumes of data are processed, as would normally be the case. For this reason, the information cost of identifying variable-length codes for symbols is ignored.

5.1.2 Establishing the Encoding Cost of Each Pattern in Old.

In Step 5 of main() in Figure 5, each pattern in the grammar is assigned a minimum number of bits required to discriminate the pattern from other patterns in the grammar, using the frequencies of the patterns with the S-F-E method, as was outlined in Section 4.7.2.
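
In the same illustrative spirit (and, again, as a sketch rather than the code of SP52), the value E for each pattern can be pictured as a Shannon-style lower bound derived from the frequency of the pattern relative to the other patterns in the grammar:

import math

def pattern_encoding_costs(pattern_freqs):
    """E for each pattern: the minimum number of bits needed to discriminate
    it from the other patterns, given their relative frequencies."""
    total = sum(pattern_freqs.values())
    return {pattern: -math.log2(freq / total)
            for pattern, freq in pattern_freqs.items()}

print(pattern_encoding_costs({'D 0 t h i s #D': 300, 'D 1 t h a t #D': 100}))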

5.2 Building the 'Hit Structure' (Step 2 of the compress() Function in Figure 5)

The compress() function in Figure 5 is the heart of the SP52 model. This section and the one which follows amplify aspects of the description in Figure 5.

The SP52 model is designed to search for alignments which can be 'projected' into a single sequence as described in Section 3.2. At every stage in the search, alignments which will not project in this way are discarded (see Section 5.3.1, below). This is a temporary measure until an appropriate treatment for non-projecting alignments has been developed, as seems to be required when the model is generalised to include learning (see Section 7.2.1, below).

Since every alignment which is retained by the model will project on to a single pattern, this means that every such alignment may be treated as if it were a one-dimensional pattern of symbols. This facilitates the building of new alignments, as was sketched at the beginning of Section 5.

As can be seen from the figure and inferred from the outline description at the beginning of Section 5, the compress() function is applied iteratively. On the first cycle, the 'driving' pattern is simply the sentence to be parsed. On subsequent cycles, the list of driving patterns is a subset of the alignments formed in preceding cycles. Iteration stops when no new alignments can be found which satisfy conditions described below.

5.2.1 Fuzzy Matching of One Pattern with Another.

Step 2 of the compress() function is based on the central process in SP21 (Wolff, 1994a), a process which is related to dynamic programming (DP, Wagner and Fischer (1974)) and is designed to find 'good' fuzzy matches between one 'driving' pattern and one or more 'target' patterns. In this context, a 'fuzzy' match is one where only a subsequence of the symbols in one pattern need match the symbols in the other pattern and vice versa.

The technique is to 'broadcast' each symbol in the driving pattern to make a yes/no match with each symbol in the set of target patterns and to record sequences of positive matches (hits) in a 'hit structure'. Each sequence of hits (termed a hit sequence) represents an alignment between the driving pattern and one of the target patterns.

As is described in Wolff (1994a), the hit structure has the form of a (list-processing) trie with each node representing a hit and each path from the root to a leaf node representing a sequence of hits.
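
The following much-simplified sketch (in Python, for a single driving pattern, with invented structures and none of the refinements described in the subsections which follow) may help to convey the broadcasting idea and the trie-like form of the hit structure:

class HitNode:
    """One node of the hit structure: a hit between a driving-pattern symbol
    and a target-pattern symbol, with children recording later hits."""
    def __init__(self, target_id=None, target_pos=None, driving_pos=None):
        self.target_id = target_id
        self.target_pos = target_pos
        self.driving_pos = driving_pos
        self.children = []

def broadcast(driving, targets):
    """Build a hit structure for one driving pattern. 'targets' maps an
    identifier for each target pattern to its list of symbols. Each path
    from the root to a leaf is a hit sequence between the driving pattern
    and one target pattern. (Purging of the worst hit sequences when space
    runs out, as in SP52, is not shown here.)"""
    root = HitNode()
    nodes = [root]
    for d_pos, d_sym in enumerate(driving):
        new_nodes = []
        for t_id, t_pattern in targets.items():
            for t_pos, t_sym in enumerate(t_pattern):
                if d_sym != t_sym:
                    continue                      # not a hit
                for node in nodes:
                    same_target = node.target_id in (None, t_id)
                    in_order = node.target_pos is None or t_pos > node.target_pos
                    if same_target and in_order:  # extend this hit sequence
                        child = HitNode(t_id, t_pos, d_pos)
                        node.children.append(child)
                        new_nodes.append(child)
        nodes.extend(new_nodes)
    return root

root = broadcast('D 0 t h i s #D'.split(), {'NP-rule': 'NP D #D N #N #NP'.split()})
print(len(root.children))   # 2: hit sequences may begin at the 'D' hit or the '#D' hit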

An important difference between the matching process in SP52 and that in SP21 is that, in the latter program, hit sequences only contain one driving pattern and one target pattern, while in SP52 there can be two or more driving patterns matched to a single target pattern. This allows the model to find hit sequences like that between the two driving patterns, ' D 0 t h i s #D' and ' N 1 b o y #N' and the target pattern ' NP D #D N #N #NP'.

5.2.2 No One Instance of a Symbol Should Ever Be Matched with Itself.

Since driving patterns can also be target patterns, any one pattern may be aligned with itself. That being so, a check is made to ensure that no instance of a symbol is ever matched against itself. Obviously, any such match would be meaningless in terms of the identification of redundancy.

Since any symbol in the driving pattern and any symbol in the target pattern may have been derived by the unification of two or more other symbols, a check is also made to exclude all hits where the set of symbols from which one of the hit symbols was derived has one or more symbols in common with the set of symbols from which the other hit symbol was derived. In short, while any given pattern from the grammar may appear two or more times in one alignment, no symbol in any of the original patterns in Old ever appears in the same column as itself in any alignment.

5.2.3 The Order of Symbols in New must be Preserved.

As the matching process has been described so far, it would be entirely possible for the system to align a pattern like ' NP D 1 t h a t #D N 1 g i r l #N #NP' in the example considered earlier with the first ' NP #NP' in a pattern like ' NP #NP V 0 l o v e s #V NP #NP #S' from the same example and to align 'NP D 0 t h i s #D N 0 b o y #D #NP' with the second ' NP #NP'. To avoid the formation of alignments like this which violate the order of the symbols in New, the system makes checks to ensure, at all stages, that the order of the symbols in New is honoured.

5.2.4 Estimation of Compression Scores.

While the hit structure is being built, the compression score for the alignment corresponding to each hit sequence could, in principle, be calculated at every stage, but only at the cost of a lot of processing which would slow the model down. Consequently, a simple, computationally 'cheap' method of estimating the compression score is used in Step 2.2.3 of Figure 5. Although it gives results which do not correspond exactly with the values calculated using the formulae presented in Section 4, the differences appear not to be critical for the purposes of purging the hit structure (Step 2.2.4 in Figure 5, Section 5.2.5, below) or determining the threshold for converting hit sequences into alignments (Step 3 in Figure 5, Section 5.3, below).

5.2.5 Purging the Hit Structure.

If the space allocated to the hit structure is exhausted at any time, the hit structure is 'purged' or, more literally, 'pruned' to remove branches corresponding to the worst 50% of the hit sequences (where the meaning of 'worst' is determined using the estimates of compression scores calculated in Step 2.2.3 of the compress() function). In this way, space is released in which new sequences of hits can be stored.
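
A trivial sketch of the purging step (in Python, with an invented representation in which each hit sequence carries its estimated compression score):

from collections import namedtuple

HitSequence = namedtuple('HitSequence', ['hits', 'estimated_cs'])

def purge(hit_sequences):
    """Keep the better half of the hit sequences, judged by their estimated
    compression scores; the worst 50% are discarded to release space."""
    ranked = sorted(hit_sequences, key=lambda hs: hs.estimated_cs, reverse=True)
    return ranked[:max(1, len(ranked) // 2)]

survivors = purge([HitSequence(hits=[(0, 1)], estimated_cs=3.5),
                   HitSequence(hits=[(2, 4)], estimated_cs=-1.0)])
print(len(survivors))   # 1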

5.2.6 Distinctive Features of the Technique.

The technique of recording hits in a trie using list processing, coupled with the mechanism for purging the hit structure whenever the available space is filled, is probably the most important difference between the SP21 technique for finding partial matches and the more traditional kinds of DP:
  • Both strings being compared can be arbitrarily long.

  • The 'depth' of searching can be controlled by varying the space available for the hit structure: larger spaces give better results than smaller ones.

  • Unlike standard DP algorithms, the system delivers a set of alternative alignments between two sequences rather than a single 'best' alignment.

5.3 Building, Scoring and Selection of Alignments

5.3.1 Building Alignments and Scoring Them (Step 3 of the compress() Function).

When the hit structure for a set of driving patterns has been built, the best hit sequences are converted into the corresponding alignments, excluding all alignments which will not 'project' onto a single sequence (as described in Section 3.2) and excluding alignments with a CS below a threshold value, T.

The value of T needs to be quite low and may even be negative to allow the search process to explore regions which are superficially unpromising but which can lead to good results.

The process of converting a hit sequence into an alignment achieves two things: it creates a one-dimensional sequence of symbols which is a unification of the driving pattern or patterns with the target pattern and it creates a two-dimensional array representing the alignment itself. For each alignment, the array occupies a portion of memory of exactly the right size, allocated dynamically at the time the alignment is formed.

The one-dimensional sequence may enter into matching and unification in later iterations of the compress() function, while the two-dimensional array allows the full structure of the alignment to be seen and can be used in later checks to ensure that no instance of a symbol is ever matched with itself (Section 5.2.2) and to ensure that the order of symbols in New is not violated (Section 5.2.3).
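
The following sketch (in Python; a simplification covering one driving pattern and one target pattern, and with an arbitrary choice about how unmatched symbols from the two patterns are interleaved) illustrates how a hit sequence may be converted into the unified one-dimensional sequence:

def unify(driving, target, hits):
    """'hits' is a list of (driving_pos, target_pos) pairs, in order. Returns
    the unified one-dimensional sequence. Unmatched target symbols are placed
    before unmatched driving symbols; that ordering is an arbitrary choice
    made for this sketch."""
    unified = []
    d_prev, t_prev = 0, 0
    for d_pos, t_pos in hits:
        unified.extend(target[t_prev:t_pos])    # unmatched target symbols
        unified.extend(driving[d_prev:d_pos])   # unmatched driving symbols
        unified.append(driving[d_pos])          # the hit symbol (same in both)
        d_prev, t_prev = d_pos + 1, t_pos + 1
    unified.extend(target[t_prev:])
    unified.extend(driving[d_prev:])
    return unified

# ' D 0 t h i s #D' unified with ' NP D #D N #N #NP' via hits on 'D' and '#D':
driving = 'D 0 t h i s #D'.split()
target = 'NP D #D N #N #NP'.split()
print(' '.join(unify(driving, target, [(0, 1), (6, 2)])))
# NP D 0 t h i s #D N #N #NP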

From time to time, identical alignments are formed via different routes. The program checks each of the newly-formed alignments against alignments already in Old. Any alignment which duplicates one already in Old is discarded. The process of comparing alignments is indifferent to the order (from top to bottom) in which patterns appear in the alignment (cf. Section 3.2, above).

Every new alignment which survives the check for duplicates is added to Old and its CS is computed using the method and formulae described in Section 4.

5.3.2 Preliminary Selection of Alignments.

The model has two boolean parameters which can be switched on or off at the start of each run and which control whether or not certain alignments are discarded at stage 3 of the compress() function (Figure 5). If the first of these two parameters is switched on, the program discards any alignment where there are one or more gaps in the sequence of 'hit' symbols in New (i.e., those symbols in New which, in the given alignment, are in the same column as one or more symbols in Old). If the second parameter is switched on, the program discards any alignment where there are one or more gaps in the sequence of symbols within any pattern in Old where each symbol is part of a hit with a symbol in New (i.e., lies in the same column as a symbol from New).

The main purpose of these parameters is to save time in testing examples. Unless the 'correct' alignment is one that has gaps in any of its hit sequences as just described, the program will normally give the same final result whether the parameters are switched on or off, but processing is quicker if one or both of them are switched on.

5.3.3 Selection of Alignments.

Apart from purging the hit structure when space is exhausted, the second major way in which the SP52 model narrows its search space is a two-fold selection of alignments at the end of every cycle of the compress() function:

  • Excluding all the original patterns in Old, the program examines the alignments which have been added to Old since the start of processing and selects a subset by a method to be described. All the other alignments are removed from Old and discarded.

  • Using the same method, the program selects a subset of the alignments which remain in Old to be used as driving patterns on the next cycle. These alignments are not removed from Old so they may also function as target patterns.

At first sight it seems natural to select alignments purely on the basis of their compression scores. However, it can easily happen that, at intermediate stages in processing, the best alignments are trivial variations of each other and involve the same subset of the symbols from New. If selection is made by choosing alignments with a CS above a certain threshold, the alignments which are chosen may all involve the same subset of the symbols in New, while other alignments, containing symbols from other parts of New, may be lost. If this happens, the model can never build an alignment which contains all or most of the symbols in New and may thus never find the 'correct' answer.

A solution to this problem which seems to work well is to make selections in relation to the symbols in New which appear in the alignments. This is achieved by setting up a two-dimensional array with one column for each symbol in New and a smallish number of rows (more about this below). The model examines the alignments in Old in order of their CS and, for each alignment, enters a reference to the alignment in every column corresponding to a symbol from New which forms a hit with (one or more) symbols from Old in the alignment, providing that the column is not already full. When all the alignments have been examined (or when every cell in the array is full, if that is sooner), a list of the alignments in the array is compiled, not counting the second or subsequent appearances of any alignment in the array.
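
A sketch of this selection technique (in Python, with an invented representation in which each alignment is reduced to its compression score and the set of positions in New at which it has hits):

def select_alignments(alignments, new_length, rows=3):
    """'alignments' is a list of (cs, hit_columns) pairs, where hit_columns is
    the set of positions in New that form hits in that alignment. Returns the
    selected alignments, best CS first, each counted once."""
    table = [[] for _ in range(new_length)]      # one column per symbol in New
    selected = []
    for cs, hit_columns in sorted(alignments, key=lambda a: a[0], reverse=True):
        entered = False
        for col in hit_columns:
            if len(table[col]) < rows:           # column not yet full
                table[col].append((cs, hit_columns))
                entered = True
        if entered and (cs, hit_columns) not in selected:
            selected.append((cs, hit_columns))
        if all(len(column) >= rows for column in table):
            break                                # every cell in the array is full
    return selected

chosen = select_alignments(
    [(10.0, {0, 1, 2}), (9.5, {0, 1, 2}), (4.0, {3, 4}), (3.5, {5})],
    new_length=6, rows=1)
print(len(chosen))   # 3: weaker alignments covering other parts of New survive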

The merit of this technique is that it can 'protect' any alignment which is the best alignment for a given subsequence of the symbols in New (or is second or third best etc) but which may, nevertheless, have a relatively low CS compared with other alignments in Old.

The number of alignments which are selected depends, inter alia, on the number of rows in the array. For the selection of alignments to be retained in Old, the number which gives 'correct' results is typically 10 or 20, determined by a parameter of the model. For the selection of driving patterns, a typical number is 3 or 4, determined by another parameter of the model.

However many rows there are in the table, the number of alignments selected is normally relatively large towards the beginning of processing (when each alignment contains only a relatively small part of New) and normally decreases to 2 or 3 at later stages when each alignment contains most of the symbols in New.

5.4 Computational Complexity

Given that all the example grammars in this article and the accompanying one are much smaller than would be required in any realistic system, and given the well-known computational demands of multiple alignment problems, readers may reasonably ask whether the proposed framework for parsing would scale up to handle realistically large grammars and longer sentences.

It has been shown elsewhere (Wolff, 1994a, 1995c) that the time complexity of the function (SP21) for finding good partial matches between two patterns, which is embedded in compress(), is O(nm) in a serial processing environment, where n is the sum of the lengths of the driving patterns and m is the sum of the lengths of the 'target' patterns (this big O value is typical of systems which, like SP21, perform dynamic programming or something similar). In a parallel processing environment, the time complexity should approach O(n), depending on how the parallel processing is applied. In all environments, the space complexity is O(m), which is substantially better than the space complexity of any 'standard' dynamic programming algorithm.

In SP52, alignments are built by pairwise alignment of patterns and alignments. The number of such pairings required to build an alignment for a whole sentence appears to be approximately log2 (n / c), where n is the length of the sentence (in symbols) and c is a constant.

The values of n and m for successive pairings will vary, starting relatively small when n is the length of the sentence and m is the sum of the lengths of the patterns in the grammar and becoming larger when new alignments have been added to Old and selected as driving patterns. However, the selection process described in Section 5.3.3 ensures that, after the initial growth, the sizes of n and m remain relatively constant: alignments in the later stages are longer than in the early stages but there are fewer of them. For the purpose of assessing computational complexity, it seems reasonable to assume that variations in the sizes of n and m from the start to finish of processing can be discounted.

Overall, the time complexity of SP52 in a serial processing environment should be approximately O(log2 n · nm), where n is the length of the sentence and m is the sum of the lengths of the patterns in the grammar. In a parallel processing environment, the time complexity may approach O(log2 n · n), depending on how the parallel processing is applied. The space complexity should be O(m).

In summary, there is reason to think that the approach to parsing which is embodied in SP52 will not lead to running times or demands for storage which are hopelessly impractical when this approach is applied to realistically large grammars or to sentences which are longer than any considered in either of these two articles.

6 DECODING BY COMPRESSION: THE PRODUCTION OF LANGUAGE

As described in Section 4, a succinct, coded representation of a sentence may be derived from a 'good' alignment amongst a set of sequences which includes the sentence and rules in an appropriate grammar. This section proposes an idea which at first sight may seem contradictory or paradoxical: that the decoding of a coded representation of a sentence may be achieved by precisely the same process of compression (by multiple alignment, unification and search) as was used to achieve the original encoding! Although this may superficially appear to be nonsense, careful reading of this section should convince readers that the proposal is sound and that no laws of logic or mathematics have been violated.

In this reversal of the original process of encoding, a sentence may be created by finding a 'good' alignment amongst a set of sequences which includes a pattern which encodes the sentence together with rules in the grammar which were used to create the encoding. In both cases, encoding and decoding, alignments may be evaluated in terms of the potential compression of one sequence: the sentence in the first case and the encoded representation of the sentence in the second case.

Figure 6 shows an alignment of this kind. At the top of the figure is the sequence ' S 0 1 0 1 0 #S' which is the encoded version of ' t h i s b o y l o v e s t h a t g i r l', as described previously. The other sequences in the figure are rules from the grammar shown in Figure 3.

As with parsing (Section 3), an alignment may be interpreted by projecting its constituent symbols into a single sequence. In the case of the alignment in Figure 6, the result of this projection is exactly the same as was shown in Section 3.2. Although this sequence contains grammatical symbols other than words, it has the right words in the right order and may thus be regarded as a realisation of the sentence corresponding to the coded sequence ' S 0 1 0 1 0 #S'.

S      0              1                0                   1              0                #S 
|      |              |                |                   |              |                |  
S NP   |              |          #NP V |           #V NP   |              |            #NP #S 
  |    |              |           |  | |           |  |    |              |             |     
  |    |              |           |  V 0 l o v e s #V |    |              |             |     
  |    |              |           |                   |    |              |             |     
  |  D 0 t h i s #D   |           |                   |    |              |             |     
  |  |           |    |           |                   |    |              |             |     
  |  |           |  N 1 b o y #N  |                   |    |              |             |     
  |  |           |  |         |   |                   |    |              |             |     
  NP D           #D N         #N #NP                  |    |              |             |     
                                                      |    |              |             |     
                                                      |  D 1 t h a t #D   |             |     
                                                      |  |           |    |             |     
                                                      |  |           |  N 0 g i r l #N  |     
                                                      |  |           |  |           |   |     
                                                      NP D           #D N           #N #NP    

Figure 6 An alignment amongst sequences including an encoded representation of a sentence (at the top) and rules in the grammar shown in Figure 3 [basic2].

6.1 Compression

The judgement that the alignment in Figure 6 is a good one may be reached in essentially the same way as we did with the alignment in Figure 4. As before, the measure which we use is the extent to which the 'input' sequence (which appears at the top of the alignment) may be encoded in terms of the grammar rules which appear in the alignment.

What is different in this case is that the sequence to be encoded in terms of the grammar is already a compressed representation of the sentence. Unlike the sentence, the sequence to be encoded contains relatively little redundancy. If every symbol is represented by the theoretical minimum number of bits, then the sequence does not contain any redundancy of a kind which can be discovered and extracted by the SP52 model. In this case, the best possible encoding of ' S 0 1 0 1 0 #S' is exactly the same sequence! In short, no net compression can be achieved by lossless unification of ' S 0 1 0 1 0 #S' with matching symbols in the grammar.

In these circumstances, there is no means of distinguishing a 'good' alignment from all the many 'bad' alignments which also achieve zero compression of the input sequence. There is nothing to guide the program through the enormous search space of possible alignments and no prospect of finding the 'correct' alignment, except by chance, and then no means of recognising that it was the 'correct' alignment.

The answer to this problem seems to be the provision of two distinct values for each symbol, as described in Sections 4.1 and 4.7. The minimum value is the theoretical minimum calculated according to the S-F-E method, while the actual value is larger by some constant factor.

In the calculation of the CS for each alignment (described in Section 4.7, above), the actual values of symbols are used to compute BN, the number of bits required to represent the sentence (or other sequence in New) in 'raw' form. But the minimum values of symbols are used to compute BE, the number of bits required to encode the alignment. Thus the CS which is derived from BN and BE represents the maximum compression which is theoretically possible (within this framework). This maximum compression is not usable in practice because that would always lead to the trap described above where the 'correct' decoding of a code sequence could not be distinguished from any of the many 'incorrect' decodings.

Although the CS calculated in this way implies smaller codes than can be used in practice, it serves the very useful purpose of allowing the search process to distinguish between 'good' alignments and 'bad' ones. Adding a little redundancy to each of the symbols in the encoded form of the sentence allows decoding to be achieved by a process of compression which is identical to the process by which the encoding was originally formed.
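
A schematic numerical illustration of the point (with invented costs, and taking CS, purely for this illustration, to be the difference between BN and BE; the exact formula is the one described in Section 4.7):

# 'Minimum' costs are the theoretical S-F-E values; 'actual' costs are the
# same values inflated by a constant factor. All the numbers are invented.
minimum_cost = {'S': 2.0, '#S': 2.0, '0': 1.0, '1': 1.0}
actual_cost = {sym: cost * 1.5 for sym, cost in minimum_cost.items()}

code_sequence = ['S', '0', '1', '0', '1', '0', '#S']
BN = sum(actual_cost[s] for s in code_sequence)    # New in its 'raw' form
BE = sum(minimum_cost[s] for s in code_sequence)   # cost of the encoding
print(BN - BE)   # 4.5: positive, so the 'correct' alignment can be recognised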

The foregoing remarks reflect what appears to be a general truth of IC: if lossless compression of a body of information is required (so that the original form of the information can be reconstituted) then it seems that the encoded form of the information must always contain some residual redundancy. The existence of this residual redundancy may not always be obvious but it seems that decoding is not possible without it.

6.2 SP52 and Decoding by ICMAUS

As previously noted, the SP52 model is capable of finding the 'correct' alignment shown in Figure 6, given the grammar shown in Figure 3 in Old and the sequence ' S 0 1 0 1 0 #S' in New.

Earlier versions of the model, which did not distinguish between the 'minimum cost' and the 'actual cost' of symbols, as described above, were not capable of finding the 'correct' alignment in examples like this or identifying it as the best because of the many alternative alignments, including the correct one, which yield zero compression.

7 FUTURE DEVELOPMENTS

This section discusses briefly some of the ways in which these ideas may be developed in the future.

7.1 Refinement of the Model

The SP52 model is now fairly robust and normally produces results in accordance with or close to our intuitions about what a 'correct' parsing should be. However, there are still some areas of weakness and there is still room for improvement. Here are some of the issues to be explored in the future.

7.1.1 Search Methods.

Given an appropriate grammar incorporated in a conventional recursive-descent parser, it is trivially easy to parse arithmetic expressions containing nested brackets. By contrast, the SP52 model can easily fail to find the 'correct' parsing for this kind of structure, getting lost amongst the many alternative ways in which brackets in the expression may be aligned with brackets in the grammar and getting stuck on 'local peaks' in the search space. To achieve 'correct' results it is necessary to set the parameters of the model so that it can explore large parts of the search space of alternative parsings. Given these adjustments, the model will normally find an alignment corresponding to the 'correct' parsing, assigning it a compression score which is higher than for any other alignment, but the penalty is the relatively long running time which is required.

The too-easy answer to this problem would be to constrain the SP52 model to work like a recursive-descent parser. Any such modification of the model would be an ad hoc response to a specific problem which would sacrifice the generality of the search method which is necessary for recognising discontinuous dependencies in syntax (as described in the accompanying article) or for recognising ambiguities in syntax (as discussed in the accompanying article).

A solution to this problem is needed which does not sacrifice generality in the model. This probably requires some new thinking about the search strategies employed in the model and the way the search strategies adapt to changes in the parameters which control the size of the search space. In general, there is a need to refine the search methods used in the model to achieve a more effective trade-off between the scope of the search and speed of processing.

7.1.2 Parallel Processing and New Hardware.

The SP52 model has been developed as a software system on a conventional computer. This is convenient for the purpose of developing and demonstrating the concepts but it is not a form which is very suitable for practical purposes, mainly because of the heavy computational demands imposed by the large search spaces which the model explores.

When the abstract model is more mature, it is envisaged that it may then be developed for application with high levels of parallel processing, perhaps initially as a software virtual machine on existing high-parallel hardware. At some stage, there may be a case for developing new high-parallel hardware which is tailored to the requirements of the model.

7.2 Generalisation of the Model

This research programme aims to develop a single simple framework which can accommodate parsing, learning, deductive and probabilistic inference, (fuzzy) pattern recognition and other functions, without ad hoc provision for any one function. It seems that the ICMAUS framework which has been described in this article may generalise with relatively small modifications to meet this goal.

The following subsections describe briefly how the model may be generalised in the future. An overview of the range of concepts which may be seen to fit the ICMAUS framework is presented in Wolff (1995a).

7.2.1 Meanings.

A working hypothesis in this research programme is that patterns of symbols in one, two or more dimensions, coupled with IC in the manner which has been described, should be sufficient to represent any kind of knowledge, including 'non-linguistic' knowledge, in an efficient manner. In support of this view, it can be shown (Wolff, 1995a) that the ICMAUS framework accommodates non-linguistic class hierarchies and heterarchies with inheritance of attributes.

If linguistic and non-linguistic structures can both be accommodated naturally within the ICMAUS framework, then grammars of the kind shown previously may be extended seamlessly to include the 'meanings' of syntactic forms. Parsing and production of language as described here should generalise, without radical reorganisation, to a more rounded model of the understanding and production of language based on meanings. Future work will aim to establish whether this is true and, if so, how.

Since the process of 'understanding' language often requires the inference of ideas which are not stated explicitly in the language which is to be understood, it is anticipated that developments in this area will take advantage of the apparent potential of the SP52 model as a device for the drawing of deductive and probabilistic inferences, as outlined in Section 7.2.3, below.

7.2.2 Parsing and Learning.

Parsing is the process of analysing a sentence or other sequence of symbols in terms of a grammar. More generally, in the context of systems that learn, parsing may be regarded as a process by which newly-received information ('New') is analysed in terms of already-known information ('Old') and encoded in terms of that Old information as far as possible.

If New information is received which cannot be unified fully with patterns in Old, then the parts which do not unify with existing patterns may be simply added to Old. By hypothesis in this research programme, the process of adding New knowledge to Old in a manner which minimises redundancy captures the essentials of learning.

Consider, for example, how a New pattern, ' t h a t b o y d a n c e s' would match and unify with an Old pattern ' t h a t g i r l d a n c e s'. The result of partial matching and unification should be a set of patterns like this:

t h a t N #N d a n c e s
N b o y #N
N g i r l #N

Symbols like ' N' and ' #N' are added at appropriate points so that the original patterns can be reconstructed at some other time by matching and unification of those symbols. The result of this compression is the beginnings of a grammar in which ' b o y' and ' g i r l' have been picked out as discrete entities and, at the same time, assigned to a syntactic class defined by the context in which the words may appear.11
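
The flavour of this process may be conveyed by a small sketch (in Python, using the standard difflib module merely to find the matching segments; this is only an illustration of partial matching and unification with added code symbols, not the learning mechanism of the SP system):

from difflib import SequenceMatcher

def unify_with_codes(pattern_a, pattern_b, tag='N'):
    """Matched segments of the two patterns are kept in a shared pattern;
    unmatched segments are split off into separate patterns bracketed by
    code symbols so that the originals can be reconstructed."""
    a, b = pattern_a.split(), pattern_b.split()
    shared, extracted = [], []
    i = j = 0
    for m in SequenceMatcher(a=a, b=b).get_matching_blocks():
        if i < m.a or j < m.b:                       # an unmatched stretch
            shared.extend([tag, '#' + tag])
            extracted.append([tag] + a[i:m.a] + ['#' + tag])
            extracted.append([tag] + b[j:m.b] + ['#' + tag])
        shared.extend(a[m.a:m.a + m.size])           # the matched stretch
        i, j = m.a + m.size, m.b + m.size
    return [shared] + extracted

for p in unify_with_codes('t h a t b o y d a n c e s', 't h a t g i r l d a n c e s'):
    print(' '.join(p))
# t h a t N #N d a n c e s
# N b o y #N
# N g i r l #N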

An extensive account of how grammatical structures may be learned by matching and unification of patterns without external supervision is presented in Wolff (1982, 1988) and in articles referenced therein.

7.2.3 Parsing and Inference.

If the 'input' pattern for a parsing system like SP52 is, in some sense, incomplete then a good alignment for the pattern, with the corresponding unification, can, in effect, make an inference or prediction of what the missing part or parts of the input pattern might be. For example, the input pattern "How to see you", together with an appropriate grammar which contains patterns representing common expressions, should yield the inference that words like "nice" or "wonderful" are probable interpolations between "How" and "to see you".

The SP52 model is totally indifferent to the nature of the patterns in New and Old. Given 'linguistic' patterns, the model will make 'linguistic' kinds of inference, like the example just given. But the system may make interpolations or predictions in any other domain, given only that it has been supplied with appropriate patterns.

One of the attractions of the ICMAUS framework in this context is that it appears to accommodate 'chains' of inference or other kinds of inference where the connection between the 'premises' and the 'conclusion' is indirect. In the same way that a 'sentence' can be recognised indirectly via 'word' and 'phrase' patterns, a conclusion, ' D' may be linked to a premise, ' A', via the alignment and unification of patterns ' A B', ' B C' and ' C D'.

An outline of ideas in this area is presented in Wolff (1996c). The way in which concepts in 'logic' may be understood within the ICMAUS framework is discussed at more length in Wolff (1996a).

It is intended that the potential of the model as a general-purpose probabilistic inference system will be explored in the near future.

- - -

11. In general, 'code' symbols like ' N' in the example should be assigned in accordance with the S-F-E (or Huffman) principles: longer codes (more bits for each symbol or more symbols or both) should be used for rare patterns and shorter codes for common patterns. In an incremental model of the kind which is envisaged, it may easily happen that a pattern which is initially seen to be rare becomes common later or vice versa. For this reason, there would need to be a process which periodically reviews the frequencies of occurrence of 'data' patterns and makes adjustments to the lengths of the corresponding code patterns.

A similar process seems to occur in the way we shorten words if they turn out to be useful: 'electronic television' became shortened to 'television' which was later shortened to 'TV', 'horseless carriage' was shortened to 'car', 'omnibus' gave way to 'bus', and so on. It seems that the converse process does not normally occur. We do not generally lengthen words which fall into disuse, perhaps for that very reason: there is no point in making the effort of 'recoding' any word which we do not intend to use!

8 CONCLUSION

In this article I have tried to show how parsing of language may be understood as IC by multiple alignment, unification and search. Since the concept of a 'correct' parsing currently depends on human judgement and intuition, evaluation of this new approach to parsing will depend in part on whether or not it can deliver parsings which appear to be right. Experience to date suggests that it will but more experience is needed to establish how well this approach to parsing reflects linguistic intuitions.

A novel feature of these proposals is the superficially paradoxical idea that a single process of information compression by multiple alignment, unification and search may achieve both the encoding and the decoding of information, both the analysis and the production of sentences. This is not simply a gimmick: in practical terms it offers the prospect that one search engine may be used for both purposes and it offers a theoretical bonus in extending the explanatory range of the model without the need for any ad hoc additions or modifications.

An apparent benefit of this alternative view of language processing is a way of representing the syntax of natural language which appears to be simpler than existing methods (examples and discussion are presented in the accompanying article). This new way of representing linguistic knowledge may have benefits in the creation of hand-crafted grammars for natural languages. Perhaps more significantly, it may simplify the automatic learning of grammars for natural languages which is envisaged as a future development in this programme of research.

In general, a motivation for further development of these ideas is the potential which they offer for the integration of parsing with other aspects of computing including learning, inference, pattern recognition, information retrieval and others. In the broadest terms, the aim of this research programme and a touchstone for its success or failure is the development of a model which exhibits a favourable combination of conceptual simplicity with explanatory or descriptive power.

A1 APPENDIX: ADJUSTING THE INFORMATION VALUES OF SYMBOLS IN THE LIGHT OF GAPS BETWEEN HITS

In the calculations of BN and BE described in Sections 4.7.3 and 4.7.4, above, adjustments are made to the 'actual' or 'minimum' information costs of symbols in the light of any gaps that may exist in the hit sequences in which they are involved. This Appendix explains the thinking behind these adjustments and describes the method used in SP52 for computing the table of factors which is used to make the adjustments.

A1.1 Overall Approach

The general approach is to assess whether a sequence of hits between two sequences of symbols is or is not likely to have been the product of chance processes. Even with two sequences of symbols which are entirely random in themselves and entirely unrelated to each other, one would expect to find one or more subsequences within one sequence each one of which matches one or more subsequences within the other sequence (unless the sequences are very short). But hit sequences like these may be regarded as 'noise' which lack 'significance' and are without value for the detection of redundancy or compression of information.

In accordance with standard practice in statistics, we may, for any hit sequence, set up a 'null hypothesis' that the hit sequence is the result of chance. If we then find that the probability of obtaining such a hit sequence or a better one is less than, say, 0.05, we may be inclined to reject the null hypothesis and accept the proposition that the hit sequence is not the result of chance - that there is a meaningful and significant congruence of structure between the two patterns.

The same thinking may be applied to each individual hit within a hit sequence. If each hit in a hit sequence (after the first) comes immediately after the preceding hit, we may judge it to be more 'significant' than a hit which follows the previous hit after a gap of one or more unmatched symbols in one or both of the two symbol sequences.

Probabilities translate into values for information: as Shannon argued so persuasively, the occurrence of an event which is probable conveys less information than the occurrence of an event which is improbable. From absolute information values we may calculate relative information values and thus the adjustments which need to be made to the information values of symbols involved in hit sequences.

A1.2 Spans and Probabilities

Consider two sequences of symbols, X and Y. On a null hypothesis that the two sequences are unrelated, each symbol in each sequence may be regarded as the result of rolling an unbiassed A-sided die, where A is the size of the alphabet of symbols used in X and Y. For present purposes, it does not matter if either or both of the two sequences do in fact contain internal structure. Given our null hypothesis that they are unrelated, then the model serves our purpose.

Under the null hypothesis, the number of possible combinations of sides for one roll of the X die and one roll of the Y die is (A × A). Of these, the number of combinations which are the same is A. So, under the null hypothesis of no relationship between the two streams of symbols, the probability (p) of obtaining a hit from one roll of the X die and one roll of the Y die is (A / (A × A)) = (1 / A). Again, under the null hypothesis, the probability (q) of not obtaining a hit from one roll of the X die and one roll of the Y die is (1 - p).

Let us suppose now that the X die is rolled once but the Y die is rolled y times, where y is some number greater than 1. In this case, the chance of obtaining a hit is clearly better. In fact, the probability of obtaining one hit (and only one hit) when the Y die is rolled y times is (1 - q^y).12 Calculation of the probability of obtaining one or more hits in a situation like this is discussed below.

In this calculation, y represents the number of opportunities to obtain a hit between X and Y. In this research, the number of opportunities to obtain a hit between two symbols has been dubbed the 'span'. What happens if X is rolled x times and Y is rolled y times, where x and y are both larger than 1? It is tempting at first sight to suppose that the span in this case is (x + y). But a little reflection shows that the number of opportunities for a hit in a case like this is (x × y).13

If we apply our dice-rolling analogy to symbol sequences and hit sequences, each hit has an associated span, s, which is (x × y), where, reading left-to-right through the sequences, x is (gx + 1), y is (gy + 1) and where, for the first hit in the hit sequence, gx may be taken to be the number of unmatched symbols since the start of the X sequence and, for subsequent hits, gx is the number of unmatched symbols since the previous hit; likewise for gy.14
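
Expressed as a sketch (in Python, with an illustrative alphabet size):

def span(gx, gy):
    """Number of opportunities for a hit, given gx and gy unmatched symbols
    (since the previous hit) in the X and Y sequences respectively."""
    return (gx + 1) * (gy + 1)

A = 30          # size of the alphabet of symbols (illustrative value)
p = 1 / A       # probability of a hit on any one pairing of symbols
q = 1 - p       # probability of a non-hit on any one pairing

print(span(0, 0), span(2, 1))   # 1 for a contiguous hit; 6 after gaps of 2 and 1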

A1.3 The Probability of One Or More Hits within a Given Span

For the sake of simplicity, the focus of attention so far has been the probability of one hit within a given span. But, for most practical purposes including the present one, our interest lies in the probability of one or more hits within a given span. This may be calculated as:

P(1 or more hits out of s) = 1 - q^s,

where s is the given span, p is the probability of a hit on any one pairing of symbols (= (1 / A), as described above) and q is the probability of a non-hit on any one pairing of symbols (= 1 - p).

The method just summarised for making the calculation is based directly on the methods presented in Lowry (1989), Chapter 3.

A1.4 Probabilities, Information and Scaling Factors

For any given span there is a corresponding probability, P(1 or more hits out of s), and for any such probability, a corresponding amount of information is conveyed if one or more hits occurs within the given span. This amount of information may be calculated as:

Is = log2 (1 / P(1 or more hits out of s)).
It should be clear that the probability of obtaining at least one hit is smallest when s is 1 and increases progressively as s increases. Correspondingly, Is is biggest when s is 1 and decreases progressively as s increases. Thus any value for Is may be converted into a corresponding relative value or scaling factor, Fs, by dividing the given value for Is by the value of I1.

In SP52, a table of scaling factors for a range of spans is computed at the start of processing, using a value for the size of the alphabet obtained by scanning the symbols in New and Old. This table is used when the compression scores of alignments are calculated (Sections 4.7.3 and 4.7.4, above) to adjust the information value of a symbol for any hit, depending on the value of s which is associated with that hit. The adjustment is achieved by simply multiplying the (actual or minimum) information cost of the symbol by the relevant scaling factor.

When the value of s for a given hit is 1, the scaling factor is 1, so the information cost of each symbol involved in the hit is not reduced. With progressively larger values of s, scaling factors decrease towards 0 and information costs are correspondingly reduced. In SP52, the largest value of s for which there is an entry in the table of scaling factors is 100. For any value of s which is larger than 100, the scaling factor is taken to be 0.
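
A sketch of how such a table may be computed (in Python, assuming the formulae of Sections A1.3 and A1.4, with P(1 or more hits out of s) taken as 1 - q^s, and with an illustrative alphabet size):

import math

def scaling_factors(alphabet_size, max_span=100):
    """Table of scaling factors Fs = Is / I1 for spans 1..max_span, where
    Is = log2(1 / P(1 or more hits out of s)) and P is computed as 1 - q**s.
    Spans larger than max_span are given a factor of 0 by the caller."""
    p = 1.0 / alphabet_size
    q = 1.0 - p
    i1 = math.log2(1.0 / (1.0 - q))          # the information value when s is 1
    return {s: math.log2(1.0 / (1.0 - q ** s)) / i1
            for s in range(1, max_span + 1)}

factors = scaling_factors(alphabet_size=30)
print(factors[1], round(factors[10], 3), round(factors[100], 4))
# 1.0 for a contiguous hit; the factors shrink towards 0 as the span grows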

- - -

12. For discussion and explanation of how formulae like this are derived, see Lowry (1989).

13. I am indebted to Dr G A Stephen for this observation.

14. At present, the formula used in SP52 assumes that, for the first hit in any hit sequence, the values of x and y are both 1. This assumption has been made because it has seemed unreasonable to use the beginnings of the symbol sequences as reference points - because, for the first hit in a hit sequence but not for later hits, the use of the beginnings of the symbol sequences as reference points underlines the arbitrariness of the left-to-right reading (a right-to-left reading would give a different result). Further thinking is required to resolve these uncertainties.

Acknowledgements

I am grateful to Prof. C. S. Wallace of Monash University for discussion of some of the ideas presented in this article, to Dr. Tim Porter of the Department of Mathematics, University of Wales, Bangor (UWB) and to Mr. John Hornsby of the School of Electronic Engineering and Computer Systems, UWB, for constructive comments on an earlier version of this article.

References

Abney, S. P. (1997) Stochastic attribute-value grammars, Computational Linguistics, 23 (4), 597-618.

Ahadi, S. M. and Woodland, P. C. (1997) Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, 11, 187-206.

Allison, L and Wallace, C. S. (1994) The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments. Journal of Molecular Evolution 39, 418-430.

Allison, L, Wallace, C. S. and Yee, C. N. (1992) Finite-state models in the alignment of macromolecules. Journal of Molecular Evolution, 35, 77-89.

Allison, L. and Yee, C. N. (1990) Minimum Message Length encoding and the comparison of macromolecules. Bulletin of Mathematical Biology, 52 (3), 431-453.

Barton G. J. (1990) Protein Multiple Sequence Alignment and Flexible Pattern Matching. Methods in Enzymology, 183, 403-428.

Belloti, T. and Gammerman, A. (1996) Experiments in solving analogy problems using Minimal Length Encoding. Presented at Applied Decision Technologies '95, Brunel University, April 1995 (Proceedings of Stream 1, Computational Learning and Probabilistic Reasoning, pp. 209-220).

Berger, A. L., Della Pietra, S. A. and Della Pietra, V. J. (1996) A maximum entropy approach to natural language processing, Computational Linguistics, 22 (1), 39-71.

Chan, S. C., Wong, A. K. C. and Chiu, D. K. Y. (1992) A Survey of Multiple Sequence Comparison Methods. Bulletin of Mathematical Biology, 54 (4), 563-598.

Cheeseman, P. (1990) On finding the most probable model. In J. Strager and P. Langley (Eds.) Computational models of scientific discovery and theory formation, Chapter 3, Morgan Kaufmann, San Mateo, California, pp.73-95.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. New York: John Wiley.

Day, W. H. E. and McMorris, F. R. (1992) Critical Comparison of Consensus Methods for Molecular Sequences. Nucleic Acids Research, 20 (5), 1093-1099.

Dreuth, E. W. and Rüber, B. (1997) Context-dependent probability adaptation in speech understanding, Computer Speech and Language, 11, 225-252.

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.

Gammerman, A. J. (1991) The representation and manipulation of the algorithmic probability measure for problem solving. Annals of Mathematics and Artificial Intelligence, 4, 281-300.

Hu, J., Turin, W. and Brown, M. K. (1997) Language modelling using stochastic automata with variable length contexts, Computer Speech and Language, 11, 1-6.

Li, M. and Vitányi, P. (1993) An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York.

Lowry, R. (1989). The Architecture of Chance. Oxford: Oxford University Press.

Lucke, H. (1995) Bayesian belief networks as a tool for stochastic parsing. Speech Communication, 16, 89-118.

Pednault, E. P. D. (1991) Minimal-length encoding and inductive inference. In G. Piatetsky-Shapiro and W. J. Frawley (eds.), Knowledge Discovery in Databases, MIT Press, Cambridge Mass.

Reichert, T. A., Cohen, D. N. and Wong, A. K. C. (1973) An application of information theory to genetic mutations and the matching of polypeptide sequences. Journal of Theoretical Biology, 42, 245-261.

Rissanen, J. (1978) Modelling by the shortest data description. Automatica, 14, 465-471.

Roytberg, M. A. (1992) A search for common patterns in many sequences. CABIOS, 8 (1), 57-64.

Sproat, R., Shih, C., Gale, W. and Chang, N. (1996) A stochastic finite-state word-segmentation algorithm for Chinese, Computational Linguistics, 22 (3), 377-404.

Storer, J. A. (1988) Data Compression: Methods and Theory. Computer Science Press, Rockville, Maryland.

Takahashi, J. and Sagayama, S. (1997) Vector-field smoothed Bayesian learning for fast and incremental speaker/telephone-channel adaptation, Computer Speech and Language, 11, 127-146.

Taylor, W. R. (1988) Pattern matching methods in protein sequence comparison and structure prediction. Protein Engineering, 2 (2), 77-86.

Wagner, R. A. and Fischer, M. J. (1974) The string-to-string correction problem. Journal of the ACM, 21 (1), 168-173.

Wallace, C. S. and Boulton, D. M. (1968) An information measure for classification. Computer Journal, 11 (2), 185-195.

Wolff, J. G. (1998). Parsing as information compression by multiple alignment, unification and search: examples. In this issue.

Wolff, J. G. (1996a) Logic as information compression by multiple alignment, unification and search. SEECS Report, October 1996.

Wolff, J. G. (1996b) Information compression by multiple alignment, unification and search as a general theory of computing. SEECS Report, February 1996.

Wolff, J. G. (1996c) Learning and reasoning as information compression by multiple alignment, unification and search. In: A. Gammerman (ed.), Computational Learning and Probabilistic Reasoning, Wiley, Chichester, pp. 67-83. An earlier version was presented at Applied Decision Technologies '95, Brunel University, April 1995 (Proceedings of Stream 1, Computational Learning and Probabilistic Reasoning, pp. 223-236).

Wolff, J. G. (1995a) Computing as compression: an overview of the SP theory and system. New Generation Computing, 13, 187-214.

Wolff, J. G. (1995b) Computing as compression: SP20. New Generation Computing, 13, 215-241.

Wolff, J. G. (1995c) Multiple alignment and computing. SEECS Report, August 1995.

Wolff, J. G. (1994a) A scaleable technique for best-match retrieval of sequential information using metrics-guided search. Journal of Information Science, 20 (1), 16-28.

Wolff, J. G. (1994) Computing and information compression: a reply. AI Communications, 7 (3/4), 203-219.

Wolff, J. G. (1993) Computing, cognition and information compression. AI Communications, 6 (2), 107-127.

Wolff, J. G. (1994) Towards a new concept of software. Software Engineering Journal, 9 (1), 27-38.

Wolff, J. G. (1991) Towards a Theory of Cognition and Computing. Ellis Horwood, Chichester.

Wolff, J. G. (1990) Simplicity and power - some unifying ideas in computing. Computer Journal, 33 (6), 518-534.

Wolff, J. G. (1988) Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger and M. D. S. Braine (eds.), Categories and Processes in Language Acquisition. Lawrence Erlbaum, Hillsdale, NJ. Reprinted in Chapter 2 of Wolff (1991).

Wolff, J. G. (1982) Language acquisition, data compression and generalization. Language & Communication, 2, 57-89. Reprinted in Chapter 3 of Wolff (1991).

Wu, D. (1997) Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23 (3), 377-403.

  
