formal grammar

Formal grammars are sets of rules which describe which strings of words are considered grammatical. They can be used to characterize the syntax of both natural language and programming languages.

Fix a finite set $V$ called the vocabulary, also called the *terminal symbols*. A *language* is a subset of the free monoid $V^*$, i.e. strings of words. Let $V^+$ denote the free semigroup, i.e. non-empty strings. $X + Y$ and $X \times Y$ denote the disjoint union and the Cartesian product respectively.

An *unrestricted grammar* is given by a tuple $G = (V, X, R, s)$ where $X$ is a finite set of *nonterminal symbols* with a *start symbol* $s \in X$, and $R \subseteq (V + X)^+ \times (V + X)^*$ is a finite set of *rewrite rules*. The language $L(G)$ generated by the grammar is usually defined as follows:

- Define the “rewrite step” relation $(\to_R) \subseteq (V + X)^* \times (V + X)^*$ where $(\alpha \beta \gamma) \to_R (\alpha \delta \gamma)$ for all $(\beta, \delta) \in R$ and $\alpha, \gamma \in (V + X)^*$.
- Take its reflexive transitive closure $(\to_R^*)$, i.e. $s \to_R^* t$ iff there is some sequence of rewrite steps $s \to_R s_1 \to_R s_2 \to_R ... \to_R t$. This sequence is called a
*grammatical derivation*. - Let $L(G) = \{ w_1 ... w_n \in V^* | s \to_R^* w_1 ... w_n\}$ for $s \in X$ the start symbol.

The construction of $\to_R^*$ can be seen as the free preordered monoid generated by the rewrite rules $R$. Grammatical derivations can be seen as string diagrams in the free monoidal category with the symbols $V + X$ as generating objects and the rewrite rules $R$ as generating arrows.

Context free grammars are defined in the same way as unrestricted grammars, but they only allow rules with a single non-terminal symbol as domain. In that particular case, the grammatical derivations are trees with the start symbol as root and the words as leaves.

In categorial grammars, monoidal categories are replaced by biclosed monoidal category. The derivations are trees with two kinds of rules: 1) a finite language-dependent *dictionary* of rules $w \to t$ with $w \in V$ and $t \in F(X)$, for $F(X)$ the free residuated monoid (see Lambek (1958)) and 2) an infinite number of language-agnostic rules, for the application of each higher order type.

Similarly, pregroup grammars are constructed from a dictionary with types coming from a free pregroup $P(X)$. The only rewrite rules are the counit maps $x^* \otimes x \to 1$ and $x \otimes {}^*x \to 1$, canceling a non-terminal symbol $x \in P(X)$ with its left and right adjoints $x^*$ and ${}^*x$. The derivations are planar string diagrams in a rigid monoidal category, the counit maps are drawn as cup-shaped wires connecting the words in a sentence.

The grammatical derivations of dependency grammars can also be encoded as diagrams in a rigid monoidal category.

The origins of grammar go back to 4th century BC India: the Sanskrit philologist Pānini showed ancient mastery of context-sensitive grammars to analyse the syntax of Sanskrit.

Unrestricted grammars were introduced by Noam Chomsky as the level zero of his hierarchy, they are equivalent to Turing machines in expressive power.

The similar concept of *semi-Thue system*, also known as a string rewriting system, was introduced thirty years before Chomsky’s work on grammar, see Thue (1914). It makes no distinction between terminal and non-terminal symbols, and allows rules with empty left-hand sides.

A third, equivalent, formulation was proved to be undecidable independently by Emil Post? and Andrey Markov: the word problem for semigroups.

- Saroja Bhate and Subhash Kak. Pānini’s Grammar and Computer Science. Annals of the Bhandarkar Oriental Research Institute (1993)
- Axel Thue. Probleme Uber Veränderungen von Zeichenreihen Nach Gegebenen Regeln. (1914)
- Emil L. Post. Recursive Unsolvability of a problem of Thue. Journal of Symbolic Logic (1947)
- A Markov. On certain insoluble problems concerning matrices. In Doklady Akad (1947)
- J. Lambek. The mathematics of sentence structure. American Mathematics Monthly (1958)
- J. Lambek. Type grammar revisited. Logical Aspects of Computational Linguistics (1999)

Last revised on October 28, 2021 at 05:14:43. See the history of this page for a list of all contributions to it.