Korean Resource Grammar




About Korean Resource Grammar

The Korean Resource Grammar is a computational grammar for Korean currently under development by Jong-Bok Kim (Kyung Hee Univ., Seoul) and Jaehyung Yang (Kangnam Univ.). Its aim is to provide an open source grammar of Korean. The grammar is developed in the formalism of Head-Driven Phrase Structure Grammar (HPSG). HPSG is an integrated, lexically based approach to syntax and semantics in which complex signs are combined by unification. The Korean grammar has a broad lexical and syntactic coverage, such that it will be possible to use it in products like automatic email response systems. The semantic formalism is Minimal Recursion Semantics (MRS). MRS offers an interface between syntax and semantics using feature structures. The formalism has syntactically flat structures and at the same time makes it possible to handle scope relations. The semantics is developed in close cooperation with the LinGO English Resource Grammar. In addition, reuse of the English Resource Grammar (ERG, see Flickinger, 2000) is facilitated by the LinGO Grammar Matrix (see Bender et al., 2002), an open source tool ("starter-kit") designed for the rapid development of multilingual broad coverage grammars couched in HPSG and MRS and based on the LKB.



Downloadable Grammar

Korean Resource Grammar As of November 2005: Download the grammar


Summary of the project

Background: The advent of the information era in this century has escalated the importance of processing linguistic information more precisely and correctly. Recent developments in artificial intelligence, information sciences and other high technology activities have made it possible to build feasible computational applications for language processing and understanding. Such applications (e.g. message extraction systems, web-based search engines, machine translation and dialogue understanding systems) demand increasing accuracy and robustness of the grammar (or parsers) combined with sophisticated statistical processing methods.

Given that the basic units in understanding language are sentences, we cannot overlook the fact that building a reliable syntactic and semantic parser is a prerequisite for language processing. Although several successful morphological analyzers have been developed for Korean, no serious attempts have been made to build a syntactic or semantic parser for it, partly because of the structural complexity of the language and partly because of the lack of a reliable grammar-building system. As observed by Kang (1998), research on syntactic and semantic processing in Korea is at the beginning stage and at least 10 to 15 years behind the corresponding research on English:

                          English                          Korean
Morphological Analyzer    Application Stage                Application Stage
Corpus                    Application Stage                Research and Application Stage
Syntax/Semantic Parser    Research and Application Stage   Research Stage


As represented in the table, research on the development of English syntactic and semantic parsers has reached a level that even allows real-time applications. For example, in past projects, the ERG (English Resource Grammar), a part of the LinGO project at CSLI (Center for the Study of Language and Information), was used in the Verbmobil machine translation system and in an NSF-funded project on computer-aided speech generation for people who cannot speak because of disability (cf. Copestake and Flickinger 2000, Flickinger 2002). However, in Korea there exist few reliable applications, in particular ones that work from English to Korean or vice versa.

The urgent need to advance the lagging research on Korean syntactic and semantic parsers provides the very motivation for this project.

Objectives: The objective of this project is thus to build a general purpose system for processing the Korean language that will support both research and practical applications. The goal includes building a broad-coverage Korean grammar that can be used both to extract precise meanings from text input and to generate well-formed text output. To achieve this goal, the project will develop a computationally feasible Korean Resource Grammar and implement it in the LKB (Linguistic Knowledge Builder) system developed by the LinGO (Linguistic Grammars Online) Lab researchers at the CSLI (Center for the Study of Language and Information).

Methodology:

Korean Resource Grammar: The Korean Resource Grammar (KRG) to be developed in this project is a computational grammar for Korean that has been under development since October 2002. Its aim is to provide an open source grammar of Korean. The grammatical framework for the KRG is the constraint-based grammar HPSG (Pollard and Sag 1994, Sag, Wasow, and Bender 2003). HPSG (Head-driven Phrase Structure Grammar) is built upon a non-derivational, constraint-based, and surface-oriented grammatical architecture. Though HPSG shares with the so-called P&P (Principles and Parameters) grammar the idea that interaction between lexical entries and a set of parameterized principles determines grammatical well-formedness, it has one fundamental architectural difference: there are no derivational or transformational operations involved. Unlike the P&P framework, where distinct levels of syntactic structure are sequentially derived by means of the transformational operation Move-α (affecting both phrasal categories and heads), HPSG has no notion of deriving one structure from another. It employs a concrete conception of constituent structure, a limited set of universal principles (e.g., the Head Feature Principle, the Valence Principle, etc.), and enriched lexical representations (cf. Pollard and Sag 1994, Kim 2002, Sag, Wasow, and Bender 2003).

In addition, HPSG is a constraint-based, lexicalist approach to grammatical theory that seeks to model human languages as systems of constraints on typed feature structures. In particular, the grammar adopts the mechanism of a type hierarchy, in which every linguistic sign is typed with appropriate constraints and hierarchically organized. This characteristic of typed feature structure formalisms facilitates the extension of a grammar in a systematic and efficient way, resulting in linguistically precise and theoretically motivated descriptions of languages including Korean. HPSG is thus well suited to the task of multilingual development of broad coverage grammars.
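To make the idea of typed feature structures concrete, the following is a minimal Python sketch of unification over a small type hierarchy. It is an illustration only, not the LKB implementation: the type names in PARENT, the feature names, and the dictionary encoding are all assumptions made for the example, and structure sharing (reentrancy) is left out entirely.

```python
# Hypothetical miniature type hierarchy, encoded as child -> parent.
PARENT = {
    "sign": None, "word": "sign", "phrase": "sign",
    "case": None, "nom": "case", "acc": "case",
}

def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

def unify_types(a, b):
    """Return the more specific of two compatible types, else None."""
    if subsumes(a, b):
        return b
    if subsumes(b, a):
        return a
    return None

def unify(fs1, fs2):
    """Unify two feature structures.

    Atomic values are type names (strings); complex values are dicts
    from feature names to values. Returns None when unification fails.
    """
    if isinstance(fs1, str) and isinstance(fs2, str):
        return unify_types(fs1, fs2)
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for feat, val in fs2.items():
            if feat in result:
                merged = unify(result[feat], val)
                if merged is None:
                    return None
                result[feat] = merged
            else:
                result[feat] = val
        return result
    return None

# A head that only asks for *some* case unifies with a nominative NP ...
required = {"HEAD": {"CASE": "case"}}
np_nom = {"HEAD": {"CASE": "nom"}}
print(unify(required, np_nom))

# ... but nominative and accusative are incompatible sister types.
print(unify({"CASE": "nom"}, {"CASE": "acc"}))
```

The point of the sketch is that a constraint stated once on a general type (here "case") is automatically compatible with all of its subtypes, which is what makes extending such a grammar systematic.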

The Korean Resource Grammar, developed as an extension of HPSG, will also have a broad lexical and syntactic coverage, such that it will be possible to use it in application products such as an automatic email response system. In addition, the grammar adopts a flat semantic formalism, Minimal Recursion Semantics (MRS), for representing meaning (Copestake et al. 2001). MRS offers an interface between syntax and semantics using feature structures. The formalism has syntactically flat structures and at the same time makes it possible to handle scope relations. The semantics is being developed in close cooperation with the LinGO English Resource Grammar.
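As a rough illustration of what "flat" means here, the sketch below encodes an MRS-style representation in Python as an unordered list of elementary predications plus a list of handle constraints. It uses an English textbook-style example ("every dog chases some cat"); the predicate names, the class names EP and MRS, and the helper methods are assumptions for illustration and are not taken from the KRG or the ERG.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EP:
    """One elementary predication: a labelled predicate with arguments."""
    label: str
    pred: str
    args: Dict[str, str]

@dataclass
class MRS:
    rels: List[EP]                                    # flat bag of EPs
    hcons: List[Tuple[str, str]] = field(default_factory=list)  # (hole, label) qeq pairs

    def holes(self):
        """Handle-valued arguments that are not the label of any EP."""
        labels = {ep.label for ep in self.rels}
        return sorted(v for ep in self.rels for v in ep.args.values()
                      if v.startswith("h") and v not in labels)

    def unconstrained_holes(self):
        """Holes not fixed by any qeq constraint (scope left open)."""
        constrained = {hole for hole, _ in self.hcons}
        return [h for h in self.holes() if h not in constrained]

# "every dog chases some cat" (simplified, hypothetical predicate names)
mrs = MRS(
    rels=[
        EP("h1", "every_q", {"ARG0": "x", "RSTR": "h2", "BODY": "h3"}),
        EP("h4", "dog_n",   {"ARG0": "x"}),
        EP("h5", "some_q",  {"ARG0": "y", "RSTR": "h6", "BODY": "h7"}),
        EP("h8", "cat_n",   {"ARG0": "y"}),
        EP("h9", "chase_v", {"ARG0": "e", "ARG1": "x", "ARG2": "y"}),
    ],
    hcons=[("h2", "h4"), ("h6", "h8")],
)

print(mrs.holes())                # all scopal argument positions
print(mrs.unconstrained_holes())  # the quantifier bodies, left underspecified
```

Nothing in the list of predications is nested inside anything else; the relative scope of the two quantifiers is simply left open in the BODY holes, which is how a flat structure can still accommodate scope relations.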

Grammar Writing Tool: The basic tool for writing, testing, and processing the Korean Resource Grammar is the LKB (Linguistic Knowledge Builder) system (Copestake 2002). The LKB system is a grammar and lexicon development environment for use with constraint-based linguistic formalisms such as HPSG. The LKB is freely available as open source (http://ling.stanford.edu). It also provides an efficient parser and generator.

Status Quo of the Korean Resource Grammar: The Korean Resource Grammar consists of grammar rules, inflection rules, lexical rules, type definitions, and a lexicon. At this stage it includes a lexicon of some 500 words, whose properties are organized in a hierarchy of about a hundred types, together with a set of 300 sample sentences. The grammar also includes a limited set of phrasal rules and types organized in a type hierarchy, providing coverage of the most familiar phenomena found in ordinary Korean (Kim 2003a, 2003b).

One example will suffice to demonstrate the efficiency of the grammar developed so far. One of the most complicated facts about Korean is that it allows free sentence-internal scrambling. For example, observe the sentence (1):

(1) mayil John-un haksayng-tul-eykey yenge-lul [kaluchi-ess-ta]
    every.day John-Top students-Pl-Dat English-Acc teach-Past-Decl
    John taught English to the students every day.

Sentence (1) has five syntactic elements. With the verb fixed in clause-final position, the four preverbal elements can be ordered in 24 (4!) different ways. A few of these orderings are given here:

(2) a. John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta.
     b. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
     c. yenge-lul John-i mayil haksayng-tul-eykey kaluchiessta.
     d. mayil haksayng-tul-eykey John-i yenge-lul kaluchiessta.
     e. ...
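
The count of 24 can be checked mechanically. The short Python sketch below keeps the verb in final position and permutes the four preverbal elements, using the word forms of (2a); the variable and function names are chosen here for illustration only.

```python
from itertools import permutations

# Preverbal elements of the example sentence, in the spelling of (2a).
PREVERBAL = ["John-i", "mayil", "haksayng-tul-eykey", "yenge-lul"]
VERB = "kaluchiessta"  # the verb stays clause-final

def scrambled_orders(preverbal, verb):
    """Return every ordering of the preverbal elements followed by the verb."""
    return [" ".join(order + (verb,)) for order in permutations(preverbal)]

orders = scrambled_orders(PREVERBAL, VERB)
print(len(orders))          # 4! orderings
for sentence in orders[:3]:
    print(sentence)
```

The first ordering produced is (2a) itself, since permutations starts from the input order.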


The most effective grammar would no doubt be one that can capture all such scrambling possibilities with a minimal processing load. In the KRG at this stage, this flexibility of Korean syntax is captured by the interaction between lexical information and a limited set of well-formedness conditions on phrases. Unlike English (and unlike the Japanese grammar JACY of Siegel and Bender 2002, Siegel 2000), the KRG assumes just the following informally represented phrasal well-formedness conditions:

(3) Korean X' Syntax (simplified):
a. hd-arg-ph: XP => [1], H[ARG-ST < ... [1]...>]
b. hd-mod-ph: XP => [MOD [1]], H[1]
c. hd-filler-ph: XP => [1], H[GAP <[1]>]
d. hd-word-ph: X[LEX +] => [word], H


(3a) means that when a head combines with one of its arguments, the resulting phrase is a well-formed phrase. (3b) allows a head to combine with a phrase that modifies it. (3c) allows a head phrase containing a gap to combine with a filler that corresponds to that gap. (3d) basically generates a word-level syntactic element by the combination of a head and a word. This well-formedness condition, not found in languages like English, forms the various types of complex predicates frequently found in the language. This simple X' syntax generates either unary or binary syntactic structures and thus can capture the major syntactic structures, including scrambling cases, in a straightforward manner.
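
How (3a) and (3b) together license scrambled orders can be sketched with a toy recognizer in Python. This is only a simplified illustration under stated assumptions, not the KRG: the lexicon, the case labels, and the assumed ARG-ST of the verb are hypothetical, (3c) and (3d) are ignored, and arguments are matched by case label rather than by unification.

```python
from itertools import permutations

# Hypothetical toy lexicon: each preverbal word is either an argument
# (identified by its case marker) or a modifier.
LEXICON = {
    "John-i": ("arg", "nom"),
    "haksayng-tul-eykey": ("arg", "dat"),
    "yenge-lul": ("arg", "acc"),
    "mayil": ("mod", None),
}
VERB = "kaluchiessta"
VERB_ARG_ST = frozenset({"nom", "dat", "acc"})  # assumed ARG-ST of the verb

def parse(words):
    """Build a binary, head-final structure, or return None.

    Working leftward from the clause-final verb, each word is attached
    either by hd-arg-ph (it saturates one remaining argument, as in 3a)
    or by hd-mod-ph (it modifies the head projection, as in 3b).
    """
    if not words or words[-1] != VERB:
        return None
    tree, remaining = VERB, set(VERB_ARG_ST)
    for word in reversed(words[:-1]):
        kind, case = LEXICON.get(word, (None, None))
        if kind == "arg" and case in remaining:
            remaining.remove(case)
            tree = ("hd-arg-ph", word, tree)
        elif kind == "mod":
            tree = ("hd-mod-ph", word, tree)
        else:
            return None
    return tree if not remaining else None

preverbal = list(LEXICON)
licensed = [p for p in permutations(preverbal)
            if parse(list(p) + [VERB]) is not None]
print(len(licensed))  # every ordering of the four preverbal words is accepted
print(parse("John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta".split()))
```

Because the head may pick up any of its remaining arguments at each step, no separate rule is needed for each word order; every analysis is a chain of binary head-final phrases.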

To be more formal, in the implementation of the KRG in the LKB, the phrase condition of hd-arg-ph (head-argument-phrase) is written as follows:

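; Head-argument rule: the head (second daughter) discharges the first
; member of its ARG-ST (#1); the mother keeps the remainder (#2).
head-arg-rule-1 := hd-arg-ph &
 [ SYN.VAL.ARG-ST #2,
    ARGS < #1 & [ SYN.HEAD.PRD - ],
                syn-st & [ SYN.VAL.ARG-ST [ FIRST #1,
                                            REST #2 ] ] > ].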


This description, a direct translation of the KRG condition, specifies that there are two elements in the ARGS list. The second element is the head, whose ARG-ST list consists of a first member #1 and a remainder #2. The first element is the non-predicative (PRD -) argument #1 itself. When the head combines with #1, the resulting phrase requires only #2. This eventually allows the arguments to be discharged one by one, generating binary structures. This condition, combined with the hd-mod-ph, can generate all the 24 word order possibilities for the sentence (1). The following are the actual parse tree of the sentence (2a) in the current KRG system and its MRS semantic representation.



As observed here, the LKB is a powerful and efficient tool that allows a hands-on implementation of the Korean Resource Grammar, built upon the typed feature structure formalism of HPSG.
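
The one-by-one discharge of arguments performed by head-arg-rule-1 can also be mimicked in a few lines of Python. The sketch below is a loose analogy rather than a rendering of the LKB machinery: the Sign class, the label field used for matching, and the order of the toy ARG-ST are assumptions made for the example, and say nothing about the actual lexical entries of the KRG.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Sign:
    form: str
    label: str = ""               # hypothetical tag matched against ARG-ST members
    prd: bool = False             # stands in for SYN.HEAD.PRD
    arg_st: Tuple[str, ...] = ()  # stands in for SYN.VAL.ARG-ST

def head_arg_rule_1(non_head: Sign, head: Sign) -> Optional[Sign]:
    """Combine a head with the FIRST member of its ARG-ST.

    The mother keeps only the REST of the list. Returns None if the
    non-head is predicative or does not match the first argument.
    """
    if non_head.prd or not head.arg_st:
        return None
    first, rest = head.arg_st[0], head.arg_st[1:]
    if non_head.label != first:
        return None
    # Assumption: head properties are passed up from the head daughter.
    return Sign(form=f"[{non_head.form} {head.form}]",
                label=head.label, prd=head.prd, arg_st=rest)

# Toy two-argument example; the ARG-ST order is chosen for the sketch only.
verb = Sign("kaluchiessta", arg_st=("obj", "subj"))
yenge = Sign("yenge-lul", label="obj")
john = Sign("John-i", label="subj")

step1 = head_arg_rule_1(yenge, verb)   # still requires "subj"
step2 = head_arg_rule_1(john, step1)   # fully saturated
print(step1.arg_st, step2.arg_st)
print(step2.form)
```

Each application removes exactly one requirement from the head, so a head with n arguments is saturated after n binary combinations.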

Areas to be developed during the proposed project period: The Korean Resource Grammar has been under development since October 2002 in collaboration with a computer scientist at the School of Computer Science, Kangnam University. It currently covers phenomena such as basic clause syntax, free word order (scrambling), case marking, adverb modification, topicalization, relative clauses, auxiliary constructions, light verb constructions, and complex sentences, among others. The results were presented at a LinGO lab weekly meeting in February 2003 and at a linguistics conference here in Korea, and they will also be presented at an international conference this coming October. The two previous reports on the ongoing project have greatly impressed researchers in the relevant field and have received strongly positive responses. In particular, attention has centered on the precision of the grammar and its efficient parsing results: the mean number of edges for the 300 sample sentences is only 1.48, much lower than that of existing systems. The first phase of the current LinGO Korean Resource Grammar has thus achieved impressive coverage of the major constructions of the language, providing a promising direction for the future.

Even though we have reached an unexpectedly high coverage of real data, considering the short period of research, there of course remain many areas asking for further development. These can be summarized as follows:

Significance: The successful completion of this project would bring us the following results:

Evaluation and dissemination: The first step in evaluating our system is the SERI test suite built by researchers at ETRI. Its 600 sample sentences are key sentences that the literature on Korean linguistics has most frequently discussed. The current KRG covers about 60% of the sentences. The project will extend this to full coverage, with the proper semantics as well. In addition to this test suite, the project result will be evaluated against the Syntactic Structure Corpus built by the Sejong 21 Project.

The project results will be presented at international conferences on linguistics as well as on computational linguistics, such as COLING, PACLIC, ACL, etc.

More importantly, at the end of this project (August 2005), all the results of this research will be put on-line (temporary site: web.khu.ac.kr/~jongbok/projects.html). We cannot deny that most previous research results, in particular the source files of Korean parsing systems, have been off-limits to other researchers and have even been kept confidential. The open source policy of this project will surely help researchers and students in the field and encourage further research.

Why Research at CSLI (Center for the Study of Language and Information)? As noted, the grammar platform that this project is adopting is the LKB system developed by the LinGO lab researchers at CSLI. In addition, the LinGO Lab serves as the center of the LinGO project as well as the HPSG project. The CSLI LinGO lab has played a leading role in the development of the English Resource Grammar (ERG) and of the LKB grammar engineering system as well. Needless to say, the best place to do related research is the place where it all started and where the related projects are most actively being pursued. There is no doubt that by conducting research with the people at the lab, I can learn the most up-to-date computational techniques and theories and eventually achieve the project goals that I have described here. In addition, the CSLI LinGO lab, which related researchers visit frequently, will provide me with an ideal environment for exchanging ideas and getting feedback while developing the KRG system. There is no better research place than the CSLI LinGO Lab for this project.


Selected Publications




Related Publications: