Minimalist grammar optimization

The syntactic literature tends toward a big-picture outlook, abstracting away from algorithmic-level details such as full specifications of lexical items or the syntactic features checked by each application of a structure-building operation. At the same time, the differences between competing analyses of the same phenomenon often lie at exactly this low level of description. Assuming a sufficiently rich formalism compatible with the Minimalist framework, how can we choose between competing analyses on quantitative grounds? Framing this question as a learning problem, I have developed an algorithm capable of transforming a naive minimalist grammar over unsegmented words into a linguistically motivated one over morphemes. The project primarily focuses on learning morphological structure within complex words, extracting linguistically motivated generalizations and instantiating them as new lexical items.

Minimalist grammars and agreement

Stabler’s minimalist grammars provide a useful tool for modeling natural language syntax by defining grammar fragments in a very precise way. As a formalization of Chomsky’s Minimalist Program, they can accommodate linguistic analyses from the field of generative syntax. However, they have no machinery for encoding agreement: while morphology can be simulated by multiplying lexical items, there is no systematic way to state generalizations and implement actual proposals. My goal is to extend minimalist grammars with morphological features and operations on them.
A JavaScript implementation of MGs with agreement can be found on this page.
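
To make the limitation concrete, here is a minimal Python sketch of Merge in a plain minimalist grammar. It is an illustration only, not the implementation mentioned above: the names `Expr` and `merge`, the toy English lexicon, and the category labels `n_sg` and `n_pl` are all assumptions made for this example, and Move is left out entirely. Because a plain MG can only check a selector `=x` against a category `x`, number agreement has to be baked into the category names, which forces two homophonous entries for "the".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """A (simplified) MG expression: a string plus its unchecked features."""
    string: str
    features: Tuple[str, ...]
    lexical: bool = True


def merge(selector: Expr, selectee: Expr) -> Expr:
    """Merge without movement: check a selector =x against a category x."""
    if not selector.features or not selectee.features:
        raise ValueError("nothing left to check")
    sel, cat = selector.features[0], selectee.features[0]
    if not sel.startswith("=") or sel[1:] != cat:
        raise ValueError(f"cannot check {sel} against {cat}")
    if len(selectee.features) != 1:
        raise ValueError("selectee has leftover features; Move is not modelled")
    if selector.lexical:
        # A lexical head takes its complement to the right.
        string = f"{selector.string} {selectee.string}"
    else:
        # A derived expression takes its specifier to the left.
        string = f"{selectee.string} {selector.string}"
    return Expr(string, selector.features[1:], lexical=False)


# Hypothetical toy lexicon: number is encoded in the category name.
DOG = Expr("dog", ("n_sg",))
DOGS = Expr("dogs", ("n_pl",))
THIS = Expr("this", ("=n_sg", "d"))
THESE = Expr("these", ("=n_pl", "d"))
# "the" does not inflect, yet it needs one entry per number value.
THE_SG = Expr("the", ("=n_sg", "d"))
THE_PL = Expr("the", ("=n_pl", "d"))

this_dog = merge(THIS, DOG)      # well-formed: =n_sg checks n_sg
the_dogs = merge(THE_PL, DOGS)   # well-formed, but only via a duplicate entry
```

Calling `merge(THIS, DOGS)` raises a `ValueError`, so the mismatch is ruled out, but only at the cost of duplicated lexical items. Nothing in the grammar states that `THE_SG` and `THE_PL` are the same word, which is the kind of generalization that dedicated morphological features are meant to capture.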

Formalizing Distributed Morphology

Together with Daniel Edmiston, I am working on a mathematically rigorous formalization of the Distributed Morphology framework. We are interested in adapting DM to work over strings. Distributed Morphology is typically depicted as operating on (binary) trees, meaning its strong-generative capacity is above regular. By constraining it to operate on strings, we restrict the strong-generative capacity of the morphological module to that of regular languages, providing an immediate explanation for the regularity of morphological phenomena in natural language.

Automated processing of agglutinative morphology

The majority of existing tools that deal with complex morphology rely on either hand-written rules or large text corpora. I am interested in a third option: extracting (agglutinative) morphology from a small sample of fully analyzed word forms. The main challenge is to reconstruct allomorphs and morphotactic sequences missing from the sample. Hand-glossed texts are a natural output of linguistic fieldwork, readily available even for under-studied languages. The goal of this project is to facilitate tasks such as morphological parsing for agglutinative languages, with a focus on good performance even with very limited language-specific resources.
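
The general idea can be illustrated with a small Python sketch. It is a deliberately naive baseline under stated assumptions, not the method used in this project: the functions `learn` and `parse`, the bigram model of morphotactics, and the three-word Turkish sample are all made up for the example. The sketch collects the morphs and gloss-to-gloss transitions attested in a glossed sample, then analyzes an unseen word by recombining them.

```python
from collections import defaultdict


def learn(sample):
    """Collect attested morphs and gloss-to-gloss transitions."""
    lexicon = defaultdict(set)   # surface form -> set of glosses
    transitions = set()          # (previous gloss, next gloss) pairs
    for morphs in sample:
        prev = "<START>"
        for form, gloss in morphs:
            lexicon[form].add(gloss)
            transitions.add((prev, gloss))
            prev = gloss
        transitions.add((prev, "<END>"))
    return lexicon, transitions


def parse(word, lexicon, transitions, prev="<START>"):
    """Return every segmentation licensed by the lexicon and transitions."""
    if not word:
        return [[]] if (prev, "<END>") in transitions else []
    analyses = []
    for i in range(1, len(word) + 1):
        form = word[:i]
        for gloss in sorted(lexicon.get(form, ())):
            if (prev, gloss) in transitions:
                for rest in parse(word[i:], lexicon, transitions, gloss):
                    analyses.append([(form, gloss)] + rest)
    return analyses


# Hypothetical hand-glossed sample (Turkish): 'houses', 'my books', 'my house'.
SAMPLE = [
    [("ev", "house"), ("ler", "PL")],
    [("kitap", "book"), ("lar", "PL"), ("ım", "POSS.1SG")],
    [("ev", "house"), ("im", "POSS.1SG")],
]

lexicon, transitions = learn(SAMPLE)

# 'evlerim' ('my houses') is absent from the sample but can be analyzed
# by recombining attested morphs and transitions.
unseen = parse("evlerim", lexicon, transitions)

# 'kitabım' ('my book') needs the stem allomorph 'kitab', which was never
# observed, so this baseline finds no analysis.
missing = parse("kitabım", lexicon, transitions)
```

The baseline shows both sides of the problem. It generalizes to the unseen sequence house-PL-POSS.1SG, but it fails on any allomorph missing from the sample, and because it ignores vowel harmony it would also accept ill-formed strings such as "evlarım". Reconstructing the missing allomorphs and the conditions on their distribution is exactly the part that such a simple model cannot do.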

One application of this and related work is Diretra, a tool for computer-aided translation which I am developing in collaboration with Alëna Aksënova. Diretra is designed for and tested on Turkic languages; its primary goal is to provide a word-for-word translation of a given text, reflecting the morphological phenomena of the source language as precisely as possible.

  • Diretra, a customizable direct translation system: first sketches
    (with Alëna Aksënova)
    2nd International TRANSLATA Conference, October 30–November 1, 2014. Innsbruck, Austria
    [slides] [paper]

  • Морфологический анализатор Diretra: больше, чем глосса
    [Diretra, a morphological analyzer: more than a gloss]
    (with Alëna Aksënova)
    201st Meeting of the Workshop on Mathematical Methods Applied to Linguistics, October 25, 2014. Moscow, Russia
    [slides in Russian]

  • An adaptable morphological parser for agglutinative languages
    Italian Conference on Computational Linguistics, December 9–10, 2014. Pisa, Italy
    [poster] [paper]

Turkic converbs

In Turkic languages, converbs — a type of non-finite verb form — are a regular means of constructing complex predications. The -p converb, present in the majority of Turkic languages, exhibits a number of interesting syntactic properties. In particular, -p converbs can correspond to both adjunct and coordinate syntactic structures.