Korean Resource Grammar
About Korean Resource Grammar
The Korean Resource Grammar is a computational grammar for Korean
currently under development by Jong-Bok Kim
(Kyung Hee Univ., Seoul) and Jaehyung Yang (Kangnam Univ).
Its aim is to provide an open source grammar of Korean.
The grammar is written in the formalism of
Head-Driven Phrase Structure Grammar (HPSG).
HPSG is an integrated, lexically based approach to syntax and semantics.
Complex signs are combined by unification. The Korean grammar
has broad lexical and syntactic coverage, so that it will be possible to
use it in products such as automatic email response systems.
The semantic formalism is Minimal Recursion Semantics
(MRS).
MRS offers an interface between syntax and semantics using feature structures.
The formalism has syntactically flat structures while still making it
possible to handle scope relations. The semantics was developed
in close cooperation with the LinGO
English Resource Grammar. In addition, reuse of the
English Resource Grammar
(ERG, see Flickinger, 2000) is facilitated by the
LinGO Grammar Matrix
(see Bender et al., 2002), an open source tool ("starter-kit")
designed for the rapid development of multilingual broad coverage grammars couched
in HPSG and MRS and based on LKB.
- The morphological analyzer we use for Korean is MACH.
- The Korean Resource Grammar project cooperates closely with the LinGO Research Laboratory at CSLI, Stanford.
- The Korean Resource Grammar has been under development since January 2003.
Downloadable Grammar
Korean Resource Grammar As of November 2005:
Download the grammar
Summary of the project
Background:
The advent of the information era in this century has escalated the importance of processing
linguistic information more precisely and correctly.
Recent developments in artificial intelligence, information sciences and other
high technology activities have made it possible to build feasible computational applications
for language processing and understanding. Such
applications (e.g. message extraction systems, web-based search engines, machine translation
and dialogue understanding systems) demand increasing accuracy and robustness of the grammar (or parsers)
combined with sophisticated statistical processing methods.
Given that the basic units in understanding language are sentences,
building a reliable syntactic and semantic parser is a prerequisite
for language processing. Although several successful morphological analyzers
have been developed for Korean, no serious attempts have been made to build a syntactic or semantic parser for it,
partly because of its structural complexity and partly because of the lack
of a reliable grammar-building system.
As observed by Kang (1998), research on syntactic and semantic processing in Korea
is at a beginning stage and at least 10 to 15 years behind the corresponding research for English:
| Area | English | Korean |
| Morphological Analyzer | Application Stage | Application Stage |
| Corpus | Application Stage | Research and Application Stage |
| Syntax/Semantic Parser | Research and Application Stage | Research Stage |
As represented in the table, the research for the development of English syntactic
and semantic parsers has reached a significant level
that can even allow real-time applications.
For example, in past projects, the ERG (English Resource Grammar),
a part of the LinGO project at CSLI (Center for the Study of Language and Information),
was used in the Verbmobil machine translation system
and in an NSF-funded project on computer-aided speech generation for people
who cannot speak because of disability (cf. Copestake and Flickinger 2000, Flickinger 2002).
However, few reliable applications exist in Korea, in particular for translation from English
to Korean or vice versa.
The urgent need to advance the lagging research on syntactic and semantic parsing of Korean provides
the motivation for this project.
Objectives: The objective of this project is thus to build a general purpose system for processing the Korean language that will support both research and practical applications. The goal includes building a broad-coverage Korean grammar that can be used both to extract precise meanings from text input and to generate well-formed text output. To achieve this goal, the project will develop a computationally feasible Korean Resource Grammar and implement it in the LKB (Linguistic Knowledge Building) system developed by the LinGO (Linguistic Grammars Online) Lab researchers at the CSLI (Center for the Study of Language and Information).
Methodology:
Korean Resource Grammar: The Korean Resource Grammar (KRG) to be developed
in this project is a computational grammar for Korean that has been under development
since October 2002. Its aim is to provide an open source grammar of Korean.
The grammatical framework for the KRG is the constraint-based grammar,
HPSG (Pollard and Sag 1994, Sag, Wasow, and Bender 2003).
HPSG (Head-driven Phrase Structure Grammar) is built upon a non-derivational,
constraint-based, and surface-oriented grammatical architecture.
Though HPSG shares with the so-called P&P (Principles and Parameters) grammar
the idea that interaction between lexical entries and a set of
parameterized principles determines grammatical well-formedness,
it has one fundamental architectural difference:
there are no derivational or transformational operations involved.
Unlike the P&P framework, where distinct levels of syntactic structure
are sequentially derived by means of the transformational operation
Move-α (affecting both phrasal categories and heads),
HPSG has no notion of deriving one structure from another structure.
It employs a concrete conception of constituent structures,
a limited set of universal principles (e.g., the Head Feature Principle,
the Valence Principle, etc.), and enriched lexical representations
(cf. Pollard and Sag 1994, Kim 2002, Sag, Wasow, and Bender 2003).
In addition, HPSG is a constraint-based, lexicalist approach to grammatical theory
that seeks to model human languages as systems of constraints on typed feature structures.
In particular, the grammar adopts the mechanism of type hierarchy
in which every linguistic sign is typed with appropriate constraints
and hierarchically organized. This property of typed feature
structure formalisms facilitates the extension of the grammar in a systematic and efficient way,
resulting in linguistically precise and theoretically motivated descriptions of languages
including Korean. HPSG is thus well suited to the task of
multilingual development of broad coverage grammars.
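To make the idea of combining constraints on feature structures concrete, here is a minimal Python sketch of unification. It is an illustration only, not the LKB's algorithm: feature structures are modelled as plain nested dictionaries, and the sketch leaves out both the type hierarchy and structure sharing (the #1-style tags used in grammar rules). The function name, the feature values, and the sample entries are assumptions made for this example.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(fs1, fs2):
    """Return the unification of two untyped feature structures.

    Feature structures are nested dicts; atomic values are strings.
    Neither input is modified.
    """
    # An atomic value unifies only with an identical atomic value.
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        if fs1 == fs2:
            return fs1
        raise UnificationFailure(f"{fs1!r} conflicts with {fs2!r}")
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result:
            # Both structures constrain this feature: unify the two values.
            result[feature] = unify(result[feature], value)
        else:
            # Only fs2 constrains this feature: copy the constraint over.
            result[feature] = value
    return result


# Hypothetical entries: a noun, and the constraint a verb places on its object.
word = {"ORTH": "yenge-lul", "SYN": {"HEAD": {"POS": "noun"}}}
object_constraint = {"SYN": {"HEAD": {"POS": "noun", "CASE": "acc"}}}
combined = unify(word, object_constraint)
```

Unifying the two structures merges their information; unifying structures whose values clash (for example, POS noun against POS verb) raises UnificationFailure, which is how an ill-formed combination would be ruled out in this simplified model.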
The Korean Resource Grammar, developed as an extension of HPSG, will also have broad lexical and syntactic coverage, so that it can be used in application products such as an automatic email response system. In addition, the grammar adopts a flat semantic formalism, Minimal Recursion Semantics (MRS), to represent semantics (Copestake et al. 2001). MRS offers an interface between syntax and semantics using feature structures. The formalism has syntactically flat structures while still making it possible to handle scope relations. The semantics is being developed in close cooperation with the LinGO English Resource Grammar.
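The combination of flat structure and scope handling can be pictured with a small Python sketch. It is a simplified illustration, not the KRG's or the ERG's actual representation: the class names, predicate names, and handle names are all hypothetical, and the English sentence "every dog barks" is used only because it contains a quantifier. The point is that the predications sit side by side in one list, and scope is expressed only through handles and constraints between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EP:
    """One elementary predication: a labelled predicate with role/value pairs."""
    label: str
    pred: str
    args: tuple


@dataclass
class MRS:
    top: str
    eps: list   # flat bag of EPs: no EP is nested inside another
    qeq: dict   # handle constraints: hole -> label it must outscope

    def holes(self):
        """Handle-valued arguments, i.e. scopal argument positions."""
        return [value for ep in self.eps
                for _, value in ep.args if value.startswith("h")]

    def underspecified_holes(self):
        """Holes that no constraint fixes; these leave scope open."""
        return [hole for hole in self.holes() if hole not in self.qeq]


# Hypothetical representation of "every dog barks".
mrs = MRS(
    top="h0",
    eps=[
        EP("h1", "every_q", (("ARG0", "x1"), ("RSTR", "h2"), ("BODY", "h3"))),
        EP("h4", "dog_n", (("ARG0", "x1"),)),
        EP("h5", "bark_v", (("ARG0", "e1"), ("ARG1", "x1"))),
    ],
    qeq={"h0": "h5", "h2": "h4"},
)
open_holes = mrs.underspecified_holes()
```

Here the quantifier's restriction is tied to the noun by a constraint, while its body (h3) is left open. With several quantifiers, the different ways of filling such open holes would correspond to the different scope readings, without the flat list itself changing.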
Grammar Writing Tool: The basic tool for writing, testing, and processing the Korean Resource Grammar is the LKB (Linguistic Knowledge Building) system (Copestake 2002). The LKB system is a grammar and lexicon development environment for use with constraint-based linguistic formalisms such as HPSG. Both are freely available as open source (http://ling.stanford.edu). The LKB also provides an efficient parser and generator.
Current Status of the Korean Resource Grammar: The Korean Resource Grammar consists of grammar rules, inflection rules, lexical rules, type definitions, and a lexicon. At this stage it includes a lexicon of some 500 words, whose properties are organized in a hierarchy of about a hundred types, and a set of 300 sample sentences. The grammar also includes a limited set of phrasal rules and types organized in a type hierarchy, providing coverage of the most familiar phenomena found in ordinary Korean (Kim 2003a, 2003b).
One example will suffice to demonstrate the efficiency of the grammar developed so far. One of the most complicated facts in Korean is that it allows sentence internal free scrambling. For example, observe the sentence (1):
(1) mayil John-un haksayng-tul-eykey yenge-lul [kaluchi-ess-ta]
every.day John-Top students-Pl-Dat English-Acc teach-Past-Decl
'John taught English to the students every day.'
Sentence (1) contains five syntactic elements. Since the verb stays in final position, the four preverbal elements can be ordered in 24 (4!) different ways. A few of these orderings are given here:
(2) a. John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta.
b. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
c. yenge-lul John-i mayil haksayng-tul-eykey kaluchiessta.
d. mayil haksayng-tul-eykey John-i yenge-lul kaluchiessta.
e. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
f. ...
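The count of 24 can be checked with a short Python sketch. It assumes, as the examples in (2) suggest, that the verb stays clause-final and only the four preverbal elements are reordered; the variable and function names are chosen for this illustration.

```python
from itertools import permutations

# The four preverbal elements of (2a) and the clause-final verb.
PREVERBAL = ["John-i", "mayil", "haksayng-tul-eykey", "yenge-lul"]
VERB = "kaluchiessta"


def scrambled_orders(preverbal, verb):
    """List every ordering of the preverbal elements, keeping the verb last."""
    return [" ".join(order + (verb,)) for order in permutations(preverbal)]


orders = scrambled_orders(PREVERBAL, VERB)
print(len(orders))  # 24, i.e. 4!
```

Each of the orderings listed in (2) appears among the generated strings.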
The most effective grammar would be one that can capture all such scrambling possibilities with a minimal processing load. In the KRG at this stage, this flexibility of Korean syntax is captured by the interaction between lexical information and a limited set of well-formedness conditions on phrases. Unlike the grammar of English (and unlike the Japanese grammar JACY of Siegel and Bender 2002, Siegel 2000), the KRG assumes just the following informally represented conditions on well-formed phrases:
(3) Korean X' Syntax (simplified):
a. hd-arg-ph: XP => [1], H[ARG-ST < ... [1]...>]
b. hd-mod-ph: XP => [MOD [1]], H[1]
c. hd-filler-ph: XP => [1], H[GAP <[1]>]
d. hd-word-ph: X[LEX +] => [word], H
(3a) means that when a head combines with one of its arguments, the resulting phrase is a well-formed phrase. (3b) allows a head to combine with a phrase that modifies it. (3c) allows a head containing a gap to form a phrase with a filler for that gap. (3d) generates a word-level syntactic element by combining a head with a word. This last condition, not found in languages like English, forms the various types of complex predicates frequently found in the language. This simple X’ syntax generates either unary or binary syntactic structures and thus can capture the major syntactic structures, including scrambling cases, in a straightforward manner.
To be more formal, in the implementation of the KRG in the LKB, the phrase condition hd-arg-ph (head-argument-phrase) is written as follows:
head-arg-rule-1 := hd-arg-ph &
  [ SYN.VAL.ARG-ST #2,
    ARGS < #1 & [ SYN.HEAD.PRD - ],
           syn-st & [ SYN.VAL.ARG-ST [ FIRST #1,
                                       REST #2 ] ] > ].
This description, a direct translation of the KRG condition, specifies that there are two elements in the ARGS list. The second element is the head, whose ARG-ST list has #1 as its first member and #2 as the rest of the list. When this head combines with #1, the resulting phrase requires only #2. This eventually allows the arguments to be discharged one by one, generating binary structures. This condition, combined with the hd-mod-ph, can generate all 24 word order possibilities for sentence (1). The following are the actual parse tree of sentence (2a) in the current KRG system and its MRS semantic representation.
As observed here, the LKB is a powerful and efficient tool that allows a hands-on implementation of the Korean Resource Grammar, built upon the typed feature structure formalism of HPSG.
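The one-by-one discharge of arguments can be sketched in Python. This is a toy model written for illustration, not the KRG or LKB implementation. It follows the informal condition (3a), under which a head may combine with any one of its remaining arguments, together with hd-mod-ph for the adverb. The lexicon, the use of case labels to identify arguments, and the function name are all assumptions, and pro drop is ignored, so every argument must be present.

```python
from itertools import permutations

# Hypothetical toy lexicon: feature names echo the text, values are illustrative.
LEXICON = {
    "mayil": {"MOD": True},
    "John-i": {"CASE": "nom"},
    "haksayng-tul-eykey": {"CASE": "dat"},
    "yenge-lul": {"CASE": "acc"},
    "kaluchiessta": {"ARG-ST": ["nom", "dat", "acc"]},
}


def parse_verb_final(words, lexicon=LEXICON):
    """Return the rule applications for a verb-final clause, or None if rejected."""
    *dependents, verb = words
    remaining = list(lexicon[verb]["ARG-ST"])  # arguments still to be discharged
    steps = []
    # Work leftwards from the verb; each step builds one binary phrase.
    for word in reversed(dependents):
        entry = lexicon[word]
        if entry.get("MOD"):
            # hd-mod-ph: a modifier attaches without touching ARG-ST.
            steps.append(("hd-mod-ph", word))
        elif entry.get("CASE") in remaining:
            # hd-arg-ph: discharge one argument; the mother needs only the rest.
            remaining.remove(entry["CASE"])
            steps.append(("hd-arg-ph", word))
        else:
            return None  # the word is neither a modifier nor a needed argument
    return steps if not remaining else None


preverbal = ["mayil", "John-i", "haksayng-tul-eykey", "yenge-lul"]
accepted = [order for order in permutations(preverbal)
            if parse_verb_final(list(order) + ["kaluchiessta"]) is not None]
print(len(accepted))  # 24
```

Because each step only asks whether the next word is a modifier or one of the still-missing arguments, the same two conditions accept every ordering of the preverbal elements, while a clause with a repeated or missing argument is rejected.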
Areas to be developed during the proposed project period: The Korean Resource Grammar has been under development since October 2002 in collaboration with a computer scientist at the School of Computer Science, Kangnam University. It currently covers phenomena such as basic clause syntax, free word order (scrambling), case marking, adverb modification, topicalization, relative clauses, auxiliary constructions, light verb constructions, and complex sentences, among others. The results were presented at a LinGO lab weekly meeting in February 2003 and at a linguistics conference in Korea, and they will also be presented at an international conference this coming October. The two previous reports on the ongoing project have greatly impressed researchers in the field and have received strongly positive responses. In particular, attention has centered on the precision of the grammar and its efficient parsing results: the mean number of edges for the 300 sample sentences is only 1.48, much lower than in existing systems. The first phase of the current LinGO Korean Resource Grammar has thus achieved impressive coverage of the major constructions of the language, providing a promising direction for future work.
Even though we have reached an unexpectedly high coverage of real data, considering the short period of research, many areas of course remain that require further development. These can be summarized as follows:
- refining the current Korean Resource Grammar (pro drop, case, relativization, light verb constructions, etc.)
- developing fine-grained semantics using MRS that can capture scope, event structures, message types, and linking between syntax and semantics
- incrementally increasing coverage of clause-internal syntax in Korean (e.g., coordination, different types of long-distance dependencies, pro drop, honorification, tense, aspect, etc.)
- incorporating the use of default entries for words unknown to the Korean HPSG lexicon
- testing with real-time corpora and expanding coverage further
- linking the Korean Resource Grammar and the English Resource Grammar for applications such as machine translation
Significance: The successful completion of this project would bring us the following results:
- A precise Korean grammar that can be used for general purposes, including research and computational applications. Considering that research on Korean grammar has been dominated by the Principles and Parameters framework, this new constraint-based grammar would theoretically broaden perspectives on the languages (English and Korean). In addition, it would boost research on computational implementations of Korean grammar.
- A syntactic parser paired with semantic representations. Previous research on building parsing systems has focused only on syntactic aspects, paying little attention to robust semantic representations. This has made it difficult to develop real-time applications such as large-scale written and spoken language translation. The syntax-semantics system this project aims for will narrow this gap.
- A sentence generator that can be used for a speech prosthesis system. One distinctive property of the LinGO system is that it allows generation of sentences, a capability rarely found in existing systems. The generation system could lead to the development of computer-aided speech generation for people who cannot speak because of disability.
Evaluation and dissemination: The first step in evaluating our system is the SERI test suite built by researchers at ETRI. Its 600 sample sentences are key sentences that the literature on Korean linguistics has most frequently discussed. The current KRG covers about 60% of these sentences. The project will extend this toward full coverage, with proper semantics as well. In addition to this test suite, the project results will be evaluated against the Syntactic Structure Corpus built by the Sejong 21 Project.
The project results will be presented in international conferences on linguistics as well as on computational linguistics such as COLING, PACLIC, ACL, etc.
More importantly, at the end of this project (August 2005), all the results of this research will be put on-line (temporary site: web.khu.ac.kr/~jongbok/projects.html). We cannot deny that most of the previous research results -- in particular the source files of Korean parsing systems -- have been off-limits to other researchers and even been confidential. The open source policy of this project will surely help researchers and students in the field and encourage further research.
Why Research at CSLI (Center for the Study of Language and Information)? As noted, the grammar platform that this project is adopting is the LKB system developed by the LinGO lab researchers at CSLI. In addition, the LinGO Lab serves as the center of the LinGO project as well as the HPSG project. The CSLI LinGO lab has played a leading role in the development of the English Resource Grammar (ERG) and of the LKB grammar engineering system. The best place to do related research is the place where it all started and where the related projects are most actively being pursued. By conducting research with the members of the lab, I can learn the most up-to-date computational techniques and theories and eventually achieve the project goals described here. In addition, the CSLI LinGO lab, which related researchers visit frequently, will provide me with an ideal environment to exchange ideas and get feedback in developing the KRG system. There is no better research place than the CSLI LinGO Lab for this project.
Selected Publications
- Kim, Jong-Bok. 2000. The Grammar of Negation: A Constraint-Based Approach. Stanford: CSLI Publications.
- Kim, Jong-Bok and Jaehyung Yang. 2003a. Korean Phrase Structure Grammar in Constraint Based Grammar and Building a Syntactic Parser with the Linguistic Knowledge Building System (in Korean). Korean Linguistics 20: 1-40.
- Kim, Jong-Bok and Jaehyung Yang. 2003b. Korean Phrase Structure Grammar and Its Implementations into the LKB System. Paper presented at the 17th Pacific Asia Conference on Language, Information, and Computation, October 1-3, 2003.
- Jong-Bok Kim and Jaehyung Yang. Projections from Morphology to Syntax in the Korean Resource Grammar: Implementing Typed Feature Structures. Lecture Notes in Computer Science Vol.2945: 13-24, (Paper presented at CICLING 2004 Conference, Seoul, Korea), Springer-Verlag, Feb 2004.
- Jong-Bok Kim, Jaehyung Yang and Incheol Choi. Feature Unification and Constraint Satisfaction in Parsing Korean Case Phenomena. Lecture Notes in Artificial Intelligence Vol.3339, pp.1160-66, (Paper presented at AI 2004: Advances in Artificial Intelligence), Springer, Dec 2004.
- Jong-Bok Kim, Jaehyung Yang, Incheol Choi, "Capturing and Parsing the Mixed Properties of Light Verb Constructions in a Typed Feature Structure Grammar", Proc. of PACLIC 18: The 18th Pacific Asia Conference on Language, Information and Computation, pp.81-92, Waseda Univ, Tokyo, Dec 2004.
- Jong-Bok Kim, Jaehyung Yang, "Parsing Mixed Constructions in a Typed Feature Structure Grammar", Lecture Notes in Artificial Intelligence, Vol.3248, pp.42-51, Springer, Feb 2005.
- Jong-Bok Kim, Jaehyung Yang, "Processing Korean Case Phenomena in a Typed-Feature Structure Grammar", Lecture Notes in Computer Science, Vol.3406, pp.60-72, (Paper presented at CICLING-2005, Mexico City, Mexico), Springer, Feb 2005.
- Jong-Bok Kim, Jaehyung Yang, "Parsing Korean Honorification in a Typed Feature Structure Grammar", CLS (Chicago Linguistic Society) 41, (To appear), 2005.
Related Publications:
- Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars. CSLI Publications.
- Copestake, Ann and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second conference on Language Resources and Evaluation (LREC-2000). Athens, Greece.
- Copestake, Ann, Dan Flickinger, Ivan A. Sag and Carl J. Pollard. 2001. Minimal Recursion Semantics: An Introduction. Ms. Stanford University.
- Flickinger, Dan. 2002. On building a more efficient grammar by exploiting types. In Stephan Oepen, Dan Flickinger, Jun'ichi Tsujii and Hans Uszkoreit (eds.) Collaborative Language Engineering. Stanford: CSLI Publications, pp. 1-17.
- Kang, Sung-Sik. 1998. Problems in Korean Language Processing and Methods. Paper Presented in the 8th Korean Language Processing Conference (in Korean).
- Pollard, Carl and Ivan Sag. 1994. Head-driven Phrase Structure Grammar. University of Chicago Press.
- Sag, Ivan, Tom Wasow, and Emily Bender. 2003. Syntactic Theory: A Formal Introduction (2nd ed). CSLI Publications.
- Siegel, Melanie. 2000. HPSG Analysis of Japanese. In W.Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.
- Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Coling 2002 Post-Conference Workshop. Taipei, Taiwan.