Skip to content

2620 Notes 4.1 (9/17)

Published: at 09:14 PM

Table of Contents

Open Table of Contents

Regular Expressions

A regular expression(often written as reg-ex or regex) is a high level specification language for expressing regular patterns. An example is $\Sigma^{\ast}ACC\Sigma^{\ast}$. This is a regular expression that represents the strings that contain the string $ACC$. $\Sigma^*$ represents any string, so this expression essentially means “any two strings can surround $ACC$”.

These expressions are sets, so set notation can be used on them. For example, $\Sigma^{\ast}ACC\Sigma^{\ast} \cup \Sigma^{\ast}T$ is the set of all strings that contain $ACC$ or end with $T$.

Practically, these are used for text search, spam filters, passwords, etc. They are supported in many programming languages and text editors in different formats.

The syntax of a regular expression are the set of rules for constructing them.

Base Cases

Operations

Note that usually ”$.$” is omitted, ex. $01$ stands for the reg-ex $0.1$ Also, if $\Sigma = (a, b, c)$, then the reg-ex $(a \cup b \cup c)$ is abbreviated as $\Sigma$.

Notational Convention Review

There’s some notation used above that we should review:

Operator Precedence

In the general case, operators go from left to right. However, the order of operations go $*$ then $.$ then $\cup$. Of course, you can use parentheses to modify the order as necessary.

Regular Expression Examples

Let $\Sigma = \{a, b\}$ be our language.

Q: Write a regular expression for $\{ w$ | last symbol in $w =$ first symbol in $w\}$

In this case, we know that the letters can be either $a$ or $b$, so we want a string that starts and ends with both $a$ or starts and ends with both $b$. If we decide to treat the first and last characters separately, then we should define some base cases first. Both $a$ and $b$ are valid strings. With that taken care of, the rest is straightforward. We can treat both cases individually, where it starts and ends with the same character with any string in between.

A: $a \cup b \cup a \Sigma^* a \cup b \Sigma^* b$

Q: Write a Regular expression for $\{w |count(w, a)$ $modulo$ $3 = 0$$\}$

In this case, we simply want to make sure we are always adding $a$ in multiples of three without obstructing how $b$ can be added. At first, this may look like doing something along the lines of $b^{\ast}ab^{\ast}ab^{\ast}ab^{\ast}$. This, however, only allows for three $a$s. To make sure we can add as many of the triple as possible, we can add more ${\ast}$ carefully. This would look like $ b^{\ast}(b^{\ast}ab^{\ast}ab^{\ast}ab^{\ast})b^{\ast}$. Finally, we can omit some of the redundant $b^*$

A: $b^{\ast}(ab^{\ast}ab^{\ast}ab^{\ast})$

Creating an NFA

Creating an NFA from a RegEx is relatively simple: just employ the strategies used for each operation in the last set of notes. This also implies that you can create a DFA from an equivalent RegEx, and that each DFA can be described as a RegEx.

RegEx to NFA Complexity

How many states does the $\mathcal{E}$-NFA of a regex $r$ have? Each base case NFA for a regex has either one or two states, and each operator adds either zero or one new state. This structure implies that the number of states is $O(r)$.

Determinization using subset construction causes an exponential blow up instead, where the size of the DFA is about $2^k$, where $k$ is the size of $r$.

Unfortunately, this is the best we can hope for. Regular expressions can be described as the desired language much more succinctly compared to DFAs. An easy example to see for this is $A^k$, where the $kth$ symbol from the end must be an $a$. We know a DFA for this must have $2^k$ states, and a regex for this would be $\Sigma*a\Sigma ^{k+1}$.

Span

Now that we have defined regex, we have the operations union, concatenation, and Kleene*. This begs the question: do we have enough operators to span the class of all regular languages? Can every regular language be defined by a regex? Given a DFA $M$, can we construct a regex for it?

The answer is yes. There actually exists a very long and complicated systematic construction that can take any NFA $M$ and build a regex $r$ such that $L(M) = L(r)$.

Regular Languages

Remember that a language $A$ is regular iff: