Morphology: Analyzing Words

Morphology is the study of:

The formation of words.
Grammatical forms of the words.
Use of prefixes and suffixes in the formation of words.
How parts-of-speech (PoS) of a language are formed.
The origin of words (etymology).

Natural Language (Morphological) Typology

Languages are divided, according to their morphology, into:

inflective or fusional: a single inflectional morpheme denotes multiple grammatical, syntactic, or semantic features;
agglutinative: words contain multiple morphemes concatenated together, but in such a way that each stem and affix can be isolated and identified as marking a particular inflection or derivation (for example, passive or causative suffixes on verbs; plural, accusative, or dative suffixes on nouns); and
isolating: the morpheme per word ratio is close to one, with no inflectional morphology whatsoever; in the extreme case, each word contains a single morpheme.

Agglutination [is] a grammatical process in which words are composed of a sequence of morphemes (meaningful word elements), each of which represents not more than a single grammatical category. This term is traditionally employed in the typological classification of languages.

Turkish, Finnish, and Japanese are among the languages that form words by agglutination. The Turkish term evler-den (from the houses) is an example of a word containing a stem and two word elements; the stem is ev- “house,” the element -ler- carries the meaning of plural, and -den indicates “from.” In Wishram, a dialect of Chinook (a North American Indian language), the word ačimluda (“He will give it to you”) is composed of the elements a- “future,” -č- “he,” -i- “him,” -m- “thee,” -l- “to,” -ud- “give,” and -a “future.”

Agglutinating languages contrast with inflecting languages, in which one word element may represent several grammatical categories, and also with isolating languages, in which each word consists of only one word element. Most languages are mixtures of all three types.

(From Britannica)

Agglutinative languages generally express one grammatical category per affix, while fusional languages combine several categories into a single affix.

Some well-known constructed languages are agglutinative, such as Black Speech, Esperanto, Klingon, and Quenya.

Although languages were historically divided into three basic types (isolating, inflectional, agglutinative), these traditional morphological types can be characterized by two distinct parameters:

morpheme per word ratio (how many morphemes there are per word)
degree of fusion between morphemes (how separable the inflectional morphemes of words are according to units of meaning represented)

A language is said to be more isolating than another if it has a lower morpheme per word ratio.

To illustrate the relationship between words and morphemes, the English term "rice" is a single word, consisting of only one morpheme (rice). This word has a 1:1 morpheme per word ratio. In contrast, "handshakes" is a single word consisting of three morphemes (hand, shake, -s). This word has a 3:1 morpheme per word ratio. On average, words in English have a morpheme per word ratio substantially greater than one.
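The ratio described above can be computed directly once words have been segmented. The following is a minimal sketch that assumes the segmentation is supplied by hand; the function name morphemes_per_word is hypothetical and chosen only for illustration.

```python
def morphemes_per_word(segmented_words):
    """Average number of morphemes per word for hand-segmented input.

    `segmented_words` is a list of words, each given as a list of morphemes.
    """
    if not segmented_words:
        raise ValueError("need at least one word")
    total = sum(len(morphemes) for morphemes in segmented_words)
    return total / len(segmented_words)


# Hand-segmented examples from the text (segmentation is supplied manually).
rice = ["rice"]
handshakes = ["hand", "shake", "-s"]

print(morphemes_per_word([rice]))              # 1.0
print(morphemes_per_word([handshakes]))        # 3.0
print(morphemes_per_word([rice, handshakes]))  # 2.0
```

Averaged over a large segmented sample, the same calculation gives a rough measure of how isolating a language is.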

Morphological Parsing

Morphological parsing is the problem of recognizing that a word breaks down into smaller meaningful units, called morphemes, and of producing a linguistic structure for it. For example, the word foxes breaks down into two morphemes: the stem fox and the suffix -es.
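As a rough illustration of the foxes example, a parser can strip known suffixes and check the remainder against a stem lexicon. This is a toy sketch, not a full parser: the names STEMS, SUFFIXES, and parse, and the contents of the two tables, are assumptions made for the example.

```python
# Toy lexicon and suffix table; both are illustrative assumptions.
STEMS = {"fox", "cat", "dog"}
SUFFIXES = {"es": "PL", "s": "PL"}


def parse(word):
    """Return (stem, suffix, feature) analyses licensed by the toy lexicon."""
    analyses = []
    if word in STEMS:
        # A bare stem is analysed as singular.
        analyses.append((word, "", "SG"))
    for suffix, feature in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in STEMS:
                analyses.append((stem, "-" + suffix, feature))
    return analyses


print(parse("foxes"))  # [('fox', '-es', 'PL')]
print(parse("cats"))   # [('cat', '-s', 'PL')]
print(parse("fox"))    # [('fox', '', 'SG')]
```

This sketch ignores spelling rules, so it would also accept a form such as "foxs"; handling that requires knowledge of which stems take -es.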

Morphotactics

Morphotactics is the model of morpheme ordering. It specifies which classes of morphemes can follow other classes of morphemes inside a word. For example, morphotactically, the English plural morpheme always follows the noun rather than preceding it.
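One common way to write such ordering constraints down is as a small finite-state machine over morpheme classes. The sketch below encodes only the English noun-plural rule mentioned above; the class labels and the names TRANSITIONS, FINAL, and is_well_formed are hypothetical.

```python
# Allowed transitions between morpheme classes (an assumed toy grammar
# covering only English nouns with an optional plural suffix).
TRANSITIONS = {
    "START": {"NOUN_STEM"},
    "NOUN_STEM": {"PLURAL"},
    "PLURAL": set(),
}
FINAL = {"NOUN_STEM", "PLURAL"}


def is_well_formed(classes):
    """Check whether a sequence of morpheme classes obeys the toy morphotactics."""
    state = "START"
    for cls in classes:
        if cls not in TRANSITIONS[state]:
            return False
        state = cls
    return state in FINAL


print(is_well_formed(["NOUN_STEM", "PLURAL"]))  # True
print(is_well_formed(["PLURAL", "NOUN_STEM"]))  # False
print(is_well_formed(["NOUN_STEM"]))            # True
```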

Nonconcatenative Morphology

Nonconcatenative morphology, also called discontinuous morphology and introflection, is a form of word formation and inflection in which the root is modified and which does not involve stringing morphemes together sequentially.

Types

Apophony (including Ablaut and Umlaut)

Transfixation

Vowel and consonant morphemes are interdigitated. For example, depending on the vowels, the Arabic consonantal root k-t-b can have different but semantically related meanings. Thus, [kataba] 'he wrote' and [kitaːb] 'book' both come from the root k-t-b. Words from k-t-b are formed by filling in the vowels, e.g. kitāb "book", kutub "books", kātib "writer", kuttāb "writers", kataba "he wrote", yaktubu "he writes", etc.
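The interdigitation of the root k-t-b with vowel patterns can be sketched as template filling. In this illustration the digits 1, 2, and 3 in a template stand for the first, second, and third root consonant; that notation and the function name interdigitate are assumptions made for the example, not a standard formalism.

```python
def interdigitate(root, template):
    """Fill the slots 1, 2, 3 in `template` with the consonants of `root`.

    `root` is written with hyphens, e.g. "k-t-b". A repeated digit in the
    template (as in "1u22ā3") doubles the corresponding consonant.
    """
    consonants = root.split("-")
    out = []
    for ch in template:
        if ch in "123":
            out.append(consonants[int(ch) - 1])
        else:
            out.append(ch)
    return "".join(out)


# Templates reconstructed from the word forms listed in the text.
for template in ["1i2ā3", "1u2u3", "1ā2i3", "1u22ā3", "1a2a3a", "ya12u3u"]:
    print(interdigitate("k-t-b", template))
# kitāb, kutub, kātib, kuttāb, kataba, yaktubu
```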

Extensive use of transfixation occurs only in Afro-Asiatic and some Nilo-Saharan languages (such as Lugbara) and is rare or unknown elsewhere.

Reduplication

A process in which all or part of the root is repeated.
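The two cases, copying the whole root or only part of it, can be sketched as string operations. The function names are hypothetical, and the Indonesian and Tagalog forms are illustrations added here rather than examples from the text above. Copying a fixed number of characters is only an approximation, since real partial reduplication usually targets a syllable or similar unit.

```python
def full_reduplication(root, sep="-"):
    """Repeat the whole root, joined by `sep`."""
    return f"{root}{sep}{root}"


def partial_reduplication(root, size):
    """Copy the first `size` characters of the root and prefix them to it."""
    return root[:size] + root


# Indonesian: orang "person" -> orang-orang "people"
print(full_reduplication("orang"))        # orang-orang
# Tagalog: sulat "write" -> susulat "will write"
print(partial_reduplication("sulat", 2))  # susulat
```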

Computational Morphology

Morphology and Regular Expressions

Finite State Transducers (FSTs)

According to one source:

The appropriate tool for morphological analysis of languages with non-trivial morphology is finite state transducers. There are robust implementations that you can track down and use. [...]

FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. They are rather complex, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. (There are machine-learning algorithms that will learn hyphenation, for example.) If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.
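To give a feel for what a transducer does, the following is a hand-built toy in plain Python that maps surface forms such as foxes to a lexical analysis such as fox+N+PL. It is only a sketch under stated assumptions: the arc table, the tag names (+N, +SG, +PL), and the function transduce are invented for illustration, and it does not reflect the interface of any real FST toolkit.

```python
EPS = ""  # epsilon: an arc that consumes no input symbol

# Arcs: (source_state, input_symbol, output_string, target_state).
ARCS = [
    (0, "f", "f", 1), (1, "o", "o", 2), (2, "x", "x", 3),  # stem "fox"
    (0, "c", "c", 4), (4, "a", "a", 5), (5, "t", "t", 6),  # stem "cat"
    (3, EPS, "+N+SG", 9), (6, EPS, "+N+SG", 9),            # bare stem = singular
    (3, "e", "", 7), (7, "s", "+N+PL", 9),                 # plural "-es" after "x"
    (6, "s", "+N+PL", 9),                                  # plural "-s" after "cat"
]
FINAL_STATES = {9}


def transduce(surface):
    """Return all lexical strings the toy transducer maps `surface` to."""
    results = []
    stack = [(0, 0, "")]  # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(surface) and state in FINAL_STATES:
            results.append(out)
        for src, sym, emitted, dst in ARCS:
            if src != state:
                continue
            if sym == EPS:
                stack.append((dst, pos, out + emitted))
            elif pos < len(surface) and surface[pos] == sym:
                stack.append((dst, pos + 1, out + emitted))
    return sorted(results)


print(transduce("foxes"))  # ['fox+N+PL']
print(transduce("cats"))   # ['cat+N+PL']
print(transduce("fox"))    # ['fox+N+SG']
print(transduce("foxs"))   # []
```

Unlike a plain suffix-stripping approach, the arcs here encode both the lexicon and the spelling rule (-es after x), which is why "foxs" is rejected. Real systems compile such networks from lexicon and rule descriptions rather than listing arcs by hand.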