LANGUAGE-BASED ENVIRONMENT FOR NATURAL LANGUAGE PARSING

Lehtola, A., Jäppinen, H., Nelimarkka, E.
SITRA Foundation (*) and Helsinki University of Technology, Helsinki, Finland

ABSTRACT

This paper introduces a special programming environment for the definition of grammars and for the implementation of corresponding parsers. In natural language processing systems it is advantageous to have linguistic knowledge and processing mechanisms separated. Our environment accepts grammars consisting of binary dependency relations and grammatical functions. Well-formed expressions of functions and relations provide constituent surroundings for syntactic categories in the form of two-way automata. These relations, functions, and automata are described in a special definition language. By focusing on high-level descriptions a linguist may ignore the computational details of the parsing process. He writes the grammar into a DPL-description, and a compiler translates it into efficient LISP code. The environment also has a tracing facility for the parsing process, grammar-sensitive lexical maintenance programs, and routines for the interactive graphic display of parse trees and grammar definitions. Translator routines are also available for transporting compiled code between various LISP dialects. The environment itself currently exists in INTERLISP and FRANZLISP. This paper focuses on knowledge engineering issues and does not enter into linguistic argumentation.

INTRODUCTION

Our objective has been to build a parser for Finnish to work as a practical tool in real production applications. At the beginning of our work we faced two major problems. First, there was as yet no formal description of Finnish grammar. Second, Finnish differs greatly in structure from the Indo-European languages: it has relatively free word order, and syntactico-semantic knowledge in a sentence is often expressed in the inflections of the words. Therefore existing parsing methods for Indo-European languages (e.g. ATN, DCG, LFG) did not seem to grasp the idiosyncrasies of Finnish.

The parser system we have developed is based on functional dependency. A grammar is specified by a family of two-way finite automata and by dependency function and relation definitions. Each automaton expresses the valid dependency context of one constituent type. In an abstract sense the working storage of the parser consists of two constituent stacks and of a register which holds the current constituent (Figure 1).

[Figure 1: the working storage of DPL-parsers - the register of the current constituent between the left constituent stack (L1, L2, L3, ...) and the right constituent stack (R1, R2, R3, ...).]

(*) SITRA Foundation, P.O. Box 329, SF-00121 Helsinki, Finland

[Figure 2: a two-way automaton for Finnish verbs. On the left of each state transition are its priority, the conditions on the dependent candidate (if not otherwise stated) and the connection function; inside each state node is expressed its manner of operation; double circles denote the entry and exit states of the automaton; the question mark indicates the direction of operation.]
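Since the environment is written in LISP, the working storage of Figure 1 can be pictured with a small Common Lisp sketch. It is an illustration only, not the actual DPL data structures; the structure and operation names are invented for the example.

    ;; A minimal sketch of the working storage of Figure 1 (illustrative names).
    (defstruct (workspace (:conc-name ws-))
      c                 ; register holding the current constituent
      (left  '())       ; left constituent stack  (L1 L2 L3 ...)
      (right '()))      ; right constituent stack (R1 R2 R3 ...)

    (defun defer-to-right (ws)
      "Suspend the current constituent: push it onto the left stack and
    make the topmost right neighbor the new current constituent (the kind
    of move the automata use when they defer a decision, cf. Figure 2)."
      (push (ws-c ws) (ws-left ws))
      (setf (ws-c ws) (pop (ws-right ws)))
      ws)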
The two stacks hold the right and left contexts of the current constituent. The parsing process is always directed by the expectations of the current constituent. Dynamic local control is realized by permitting the automata to activate one another. The basic decision for the automaton associated with the current constituent is to accept or reject a neighbor via a valid syntactico-semantic subordinate relation. Acceptance subordinates the neighbor, which disappears from the stack. The structure an input sentence receives is an annotated tree of such binary relations.

An automaton for verbs is described in Figure 2. When a verb becomes the current constituent for the first time, it enters the automaton through the START node. The automaton expects to find a dependent on the left (?V). If the left neighbor has the constituent feature +Phrase, it is tested first for Subject and then for Object. When a function test succeeds, the neighbor is subordinated and the verb advances to the state indicated by the arc. The double-circle states denote the entry and exit points of the automaton.

If completed constituents do not exist as neighbors, an automaton may defer its decision. In Figure 2 the states labelled "BUILD PHRASE ON RIGHT" and "FIND REGENT ON RIGHT" push the verb onto the left stack and pop the right stack for the current constituent. When the verb is activated later on, control continues from the state expressed in the deactivation command.

There are two distinct search strategies involved. If a single parse is sufficient, the graphs (i.e. the automata) are searched depth first following the priority numbering. A full search is also possible.

The functions, relations and automata are expressed in a special conditional expression formalism, DPL (for Dependency Parser Language). We believe that DPL might find applications in other inflectional languages as well.

DPL-DESCRIPTIONS

The main object in DPL is a constituent. A grammar specification opens with the structural descriptions of constituents and the allowed property names and property values. A user may specify simple properties, features or categories. The structures of the lexical entries are also defined at the beginning. The syntax of these declarations can be seen in Figure 3. All properties of constituents may be referred to in a uniform manner, directly by their values. The system automatically takes into account the computational details associated with each property type. For example, the system automatically observes the inheritance of properties in their hierarchies.

Extensive support for multidimensional analysis has been one of the central objectives in the design of the DPL-formalism. Patterning can be done in multiple dimensions, and the property set associated with constituents can easily be extended.

An example of a constituent structure and its property definitions is given in Figure 4. The description states first that each constituent contains Function, Role, ConstFeat, PropOfLexeme and MorphChar. The next two definitions further specify ConstFeat and PropOfLexeme. In the last part the definition of the category tree SemCat is given. This tree has sets of property values associated with its nodes, and the DPL-system automatically takes care of their inheritance. Thus for a constituent that belongs to the semantic category Human the system automatically associates the feature values +Hum, +Anim, +Countable and +Concr.
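The inheritance mechanism can be pictured with a small Common Lisp sketch. The representation below is illustrative only, not the one used by the DPL-system: each category node keeps its own feature values and a pointer to its parent, and a lookup collects the values of all ancestors.

    ;; A minimal sketch of SemCat inheritance (illustrative representation).
    (defstruct category name features parent)

    (defparameter *semcat*
      (let* ((entity   (make-category :name 'Entity))
             (concrete (make-category :name 'Concrete :features '(+Concr) :parent entity))
             (animate  (make-category :name 'Animate :features '(+Anim +Countable) :parent concrete))
             (human    (make-category :name 'Human :features '(+Hum) :parent animate)))
        (list entity concrete animate human))
      "Part of the SemCat tree of Figure 4.")

    (defun inherited-features (cat)
      "Collect the feature values of CAT and of all its ancestors."
      (loop for node = cat then (category-parent node)
            while node
            append (category-features node)))

    ;; (inherited-features (find 'Human *semcat* :key #'category-name))
    ;; => (+HUM +ANIM +COUNTABLE +CONCR)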
The binary grammatical functions and relations are defined using the syntax in Figure 5. A DPL-function returns as its value the binary construct built from the current constituent (C) and its dependent candidate (D), or it returns NIL. DPL-relations return as their values the pairs of C and D constituents that have passed the associated predicate filter. By choosing operators a user may vary a predication between simple equality (=) and equality with ambiguity elimination (=:=). The operators := and :- denote replacement and insertion, respectively. In predicate expressions angle brackets signal the scope of an implicit OR-operator, and parentheses that of an implicit AND-operator.
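The contract of a DPL-function can be illustrated with the following Common Lisp sketch. It is not compiler output: the constituent representation (property lists with a :features entry) and the crude test standing in for the relation RecSubj of Figure 6 are simplifications made for the example.

    ;; A DPL-function either returns the binary construct built from C and D,
    ;; or NIL.  RECSUBJ-P is a crude stand-in for the relation RecSubj.
    (defun recsubj-p (c d)
      "Very rough stand-in: C an active verb form, D a non-sentential nominal."
      (and (member 'Act (getf c :features))
           (member '+Nominal (getf d :features))
           (not (member '+Sentence (getf d :features)))))

    (defun subject-fn (c d)
      "Sketch of the DPL-function Subject of Figure 6: if RecSubj holds,
    label D as Subject and return the binary construct; otherwise NIL."
      (when (recsubj-p c d)
        (list :regent c :dependent (list* :function 'Subject d))))

    ;; (subject-fn '(:lexeme TULLA :features (Act Ind))
    ;;             '(:lexeme POIKA :features (+Nominal Nom)))
    ;; => (:REGENT (...) :DEPENDENT (:FUNCTION SUBJECT :LEXEME POIKA ...))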
[Figure 3: the BNF syntax of constituent structure and property definitions, covering CONSTITUENT:, LEXICON-ENTRY:, SUBTREE:, PROPERTY:, FEATURE: and CATEGORY: declarations.]

    (CONSTITUENT: (LEXICON-ENTRY: (Function Role ConstFeat PropOfLexeme MorphChar)))
    (SUBTREE:  PropOfLexeme ((SyntCat SyntFeat) (SemCat SemFeat) (FrameCat LexFrame) AKO))
    (SUBTREE:  MorphChar    (Polar Voice Modal Tense Comparison Number Case
                             PersonN PersonP Clit1 Clit2))
    (CATEGORY: SemCat < ( Entity )
                        ( Concrete ( +Concr ) / Entity )
                        ( Animate ( +Anim +Countable ) / Concrete )
                        ( Human ( +Hum ) / Animate )
                        ( Animals / Animate )
                        ( NonAnim / Concrete )
                        ( Matter ( -Countable ) / NonAnim )
                        ( Thing ( +Countable ) / NonAnim ) >)

    Figure 4. An example of a constituent structure specification and the definition of a category tree

An arrow switches defaults on: the elements of expressions to the right of an arrow are in the OR-relation, and those to the left of it are in the AND-relation. Two kinds of arrows are in use. A simple arrow (->) performs all operations on the right, and a double arrow (=>) terminates the execution at the first successful operation.

Figure 6 gives an example of how one may define Subject. If the relation RecSubj holds between the regent and the dependent candidate, the latter is labelled Subject and subordinated to the former. The relational expression RecSubj defines the property patterns the constituents should match.

A grammar definition ends with the context specifications of constituents, expressed as two-way automata. The automata are described using the notation shown, in somewhat simplified form, in Figure 7. An automaton can refer to up to three constituents on the right or left using the indexed names L1, L2, L3, R1, R2 and R3.

[Figure 5: the BNF syntax of DPL-functions and DPL-relations - FUNCTION: and RELATION: definitions, the operators =, :=, :-, =:= and DEL, and the arrows -> and =>.]

    (FUNCTION: Subject ( RecSubj -> (D := Subject)) )

    (RELATION: RecSubj
       ((C = Act < Ind Cond Pot Imper >) (D = -Sentence +Nominal)
        -> ((D = Nom)
            -> (D = PersPron (PersonP C) (PersonN C))
               ((D = Noun) (C = 3P)
                -> ((C = S) (D = SG))
                   ((C = P) (D = PL))))
           ((D = Part) (C = S 3P)
            -> ((C = 'OLLA) => (C :- +Existence))
               ((C = -Transitive +Existence)))))

    Figure 6. A realisation of Subject

[Figure 7: simplified BNF syntax of state specifications - STATE: definitions with direction LEFT or RIGHT, state transitions of the form (C := <state>), and the operations FIND-REG-ON, BUILD-PHRASE-ON, PARSED, DEL and TRANSPOSE.]

    (STATE: V? RIGHT
       ((D = +Phrase)
        -> (Subject   -> (C := VS?))
           (Object    -> (C := VO?))
           (Adverbial -> (C := V?))
           (T => (C := ?VFinal)))
       ((D = -Phrase)
        -> (BUILD-PHRASE-ON RIGHT (C := V?))))

    Figure 8. The expression of V? in Figure 2

The direction of a state (see Figure 2) selects the dependent candidate, normally as L1 or R1. A switch of state takes place by an assignment, in the same way as linguistic properties are assigned. As an example, the node V? of Figure 2 is defined formally in Figure 8.
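To make the control flow concrete, here is a small Common Lisp sketch of how a state such as V? can be read; it is not the code produced by the DPL-compiler. Arcs are tried in priority order on the dependent candidate, the first DPL-function that returns a construct wins, and the arc names the next state. The names subject-fn, object-fn and adverbial-fn in the usage comment refer to DPL-function sketches like the one given earlier and are hypothetical.

    ;; A sketch of reading a state: ARCS is a priority-ordered list of
    ;; (dpl-function . next-state) pairs.  The representation is illustrative.
    (defun run-state (arcs c d &key (default-state '?VFinal))
      "Try the ARCS in priority order on the current constituent C and the
    dependent candidate D.  The first function returning a construct makes
    that construct the new current constituent and selects the next state;
    otherwise C stays as it is and DEFAULT-STATE is entered."
      (loop for (fn . next) in arcs
            for construct = (funcall fn c d)
            when construct
              return (values construct next)
            finally (return (values c default-state))))

    ;; For a +Phrase neighbor, the V? state of Figure 8 would then correspond
    ;; roughly to:
    ;;   (run-state (list (cons #'subject-fn   'VS?)
    ;;                    (cons #'object-fn    'VO?)
    ;;                    (cons #'adverbial-fn 'V?))
    ;;              c d)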
More linguistically oriented argumentation of the DPL-formalism appears elsewhere (Nelimarkka, 1984a, and Nelimarkka, 1984b).

THE ARCHITECTURE OF THE DPL-ENVIRONMENT

The architecture of the DPL-environment is described schematically in Figure 9. The main parts are highlighted by heavy lines. Single arrows represent data transfer; double arrows indicate the production of data structures. All modules have been implemented in LISP. The realisations do not rely on specifics of the underlying LISP environments.

[Figure 9: the architecture of the DPL-environment - the DPL-compiler, the parser, the tracing facility, the information extraction system, the lexicon with its maintenance programs, and the graphic output routines.]

The DPL-compiler

A compilation results in the executable code of a parser. The compiler produces highly optimized code (Lehtola, 1984). Internally, data structures are only partly dynamic for the sake of fast information access; ambiguities are expressed locally to minimize redundant search. The principle of structure sharing is followed whenever new data structures are built.

In the manipulation of constituent structures there exists a special service routine for each combination of property and predication types. These routines take special care of time and memory consumption. For instance, with regard to replacements and insertions, the copying physically includes only the path from the root of the list structure to the changed sublist. The logically shared parts are also shared physically. This stipulation minimizes memory usage.
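The path-copying idea can be sketched in a few lines of Common Lisp; this is an illustration of the principle, not the DPL service routines themselves.

    ;; Illustration only: replace the subtree of TREE addressed by PATH
    ;; (a list of positions) with NEW, consing fresh cells only along the
    ;; path from the root to the changed sublist; everything else is
    ;; shared physically with the original structure.
    (defun replace-nth (n new list)
      (if (zerop n)
          (cons new (cdr list))
          (cons (car list) (replace-nth (1- n) new (cdr list)))))

    (defun replace-subtree (tree path new)
      (if (null path)
          new
          (replace-nth (first path)
                       (replace-subtree (nth (first path) tree) (rest path) new)
                       tree)))

    ;; (replace-subtree '(TULLA (POIKA) (ILTA) (KIEKKO)) '(3 0) 'KIEKKO-Obj)
    ;; => (TULLA (POIKA) (ILTA) (KIEKKO-OBJ)), with the sublists (POIKA)
    ;;    and (ILTA) shared (EQ) with the original list.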
At the state transition network level the search is done depth first. To handle ambiguities, DPL-functions and -relations process all alternative interpretations in parallel. In fact the alternatives are stored in the stacks and in the C-register as trees of alternants.

In the first version of the DPL-compiler the generation rules were intermixed with the compiler code. The maintenance of the compiler grew harder when we experimented with new computational features. We therefore started to develop a metacompiler in which the compilation is defined by rules. At the moment we are testing it, and soon it will be in everyday use. The amount of LISP code has been greatly reduced with the rule-based approach, and we are now planning to install the DPL-environment on the IBM PC.

Our parsers were aimed to be practical tools in real production applications. It was hence important to make the produced programs transferable. As of now we have a rule-based translator which converts parsers between LISP dialects. The translator currently accepts INTERLISP, FranzLISP and Common Lisp.

Lexicon and its Maintenance

The environment has a special maintenance program for lexicons. The program uses video graphics to ease updating, and it performs various checks to guarantee the consistency of the lexical entries. It also co-operates with the information extraction system to help the user in the selection of properties.

The Tracing Facility

The tracing facility is a convenient tool for grammar debugging. For example, Figure 10 shows the trace of the parsing of the sentence "Poikani tuli illalla kentältä heittämästä kiekkoa." (= "My son came back in the evening from the stadium where he had been throwing the discus.").

[Figure 10: a trace of the parsing process for this sentence, listing for each step the constituents found so far, the position of the dependent candidate, and the state about to be entered, together with the consumed CPU time and garbage collection time.]

Each row represents a state of the parser before control enters the state mentioned in the right-hand column. The constituents found thus far are shown by the parentheses. An arrowhead points from a dependent candidate (one which is subjected to dependency tests) towards the current constituent.

The tracing facility also gives the consumed CPU time and two quality indicators: search efficiency and connection efficiency. Search efficiency is 100% if no useless state transitions took place in the search. This figure is meaningless when the system is parameterized for full search, because then all transitions are tried. Connection efficiency is the ratio of the number of connections remaining in a result to the total number of connections attempted for it during the search.

We are currently developing other measuring tools to extract statistical information, e.g. about the frequency distribution of different constructs. Also under development is automatic book-keeping of all sentences input to the system. These will be divided into two groups: parsed and not parsed. The first group constitutes growing test material to ensure monotonic improvement of grammars: after a non-trivial change is made to the grammar, a newly compiled parser runs all test sentences and the results are compared to the previous ones.
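Such book-keeping could look roughly like the following Common Lisp sketch; the function names and the use of a hash table are assumptions made for the example, not a description of the implemented tool.

    ;; Hypothetical sketch of the planned regression book-keeping.
    (defparameter *parsed-sentences* (make-hash-table :test #'equal)
      "Maps each sentence of the parsed group to the result it received.")

    (defun record-parse (sentence result)
      (setf (gethash sentence *parsed-sentences*) result))

    (defun regression-check (parse-fn)
      "Re-parse every stored sentence with PARSE-FN (a newly compiled
    parser) and return the sentences whose result no longer matches the
    stored one."
      (let ((regressions '()))
        (maphash (lambda (sentence old-result)
                   (unless (equal (funcall parse-fn sentence) old-result)
                     (push sentence regressions)))
                 *parsed-sentences*)
        regressions))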
Information Extraction System

In an actual working situation there may be thousands of linguistic symbols in the work space. To make such a complex manageable, we have implemented an information system that, for a given symbol, pretty-prints all the information associated with it. The environment also has routines for the graphic display of parsing results. A user can select information by pointing with the cursor. The example in Figure 11 demonstrates the use of this facility. The command SHOW() inquires the results of the parsing process described in Figure 10. The system replies by first printing the start state and then the found result(s) in compressed form. The cursor has then been moved on top of this parse and CTRL-G typed, whereupon the system draws a picture of the tree structure. Subsequently one of the nodes has been opened: the properties of the node POIKA appear pretty-printed. The user has furthermore asked for information about the property type ConstFeat. All these operations are general; they do not use the special features of any particular terminal.

[Figure 11: an example of the information extraction utilities - the output of SHOW(), the tree structure drawn for the parse of Figure 10 (TULLA as regent, with POIKA as Subject, ILTA, KENTTÄ and HEITTÄÄ as Adverbials, and KIEKKO as Object), the pretty-printed properties of the node POIKA, and the default value, associated values and associated functions of the feature type ConstFeat.]

CONCLUSION

The parsing strategy applied in the DPL-formalism was originally viewed as a cognitive model. It has proved to yield practical and efficient parsers as well. Experiments with a non-trivial set of Finnish sentence structures have been performed both on DEC-2060 and on VAX-11/780 systems. The analysis of an eight-word sentence, for instance, takes between 20 and 600 ms of DEC CPU time in the INTERLISP version, depending on whether one wants only the first parse or, through complete search, all parses of a structurally ambiguous sentence. The MacLISP version of the parser runs about 20% faster on the same computer. The NIL version (Common Lisp compatible) is about 5 times slower on the VAX. The whole environment has also been transferred to FranzLISP on the VAX.

We have not yet focused on optimality issues in grammar descriptions. We believe that efficiency can be improved by rearranging the ordering of the expectations in the automata.

REFERENCES

1. Lehtola, A., Compilation and Implementation of 2-way Tree Automata for the Parsing of Finnish. M.Sc. Thesis, Helsinki University of Technology, Department of Physics, 1984, 120 p. (in Finnish)

2. Nelimarkka, E., Jäppinen, H. and Lehtola, A., Two-way Finite Automata and Dependency Theory: A Parsing Method for Inflectional Free Word Order Languages. Proc. COLING84/ACL, Stanford, 1984a, pp. 389-392.

3. Nelimarkka, E., Jäppinen, H. and Lehtola, A., Parsing an Inflectional Free Word Order Language with Two-way Finite Automata. Proc. of the 6th European Conference on Artificial Intelligence, Pisa, 1984b, pp. 167-176.

4. Winograd, T., Language as a Cognitive Process. Volume I: Syntax. Addison-Wesley, Reading, 1983.