Table of Contents
Open Table of Contents
Regular Expressions
A regular expression(often written as reg-ex or regex) is a high level specification language for expressing regular patterns. An example is $\Sigma^{\ast}ACC\Sigma^{\ast}$. This is a regular expression that represents the strings that contain the string $ACC$. $\Sigma^*$ represents any string, so this expression essentially means any two strings can surround $ACC$.
These expressions are sets, so set notation can be used on them. For example, $\Sigma^{\ast}ACC\Sigma^{\ast} \cup \Sigma^{\ast}T$ is the set of all strings that contain $ACC$ or end with $T$.
Practically, these are used for text search, spam filters, passwords, etc. They are supported in many programming languages and text editors in different formats.
The syntax of a regular expression are the set of rules for constructing them.
Base Cases
- $L(\mathcal{E}) = \{\mathcal{E}\}$
- $L(\Phi) = \{\}$
- $\rho \in \Sigma \implies \rho \in L(\Sigma)$
- $s \in L(\Sigma) \implies (s) \in L(\Sigma)$
Operations
- If $r$ and $r’$ are regular expressions, then so is $r.r’$
- If $r$ and $r’$ are regular expressions, then so is $r \cup r’$
- If $r$ is a regular expression, then so is $r*$
Note that usually ”$.$” is omitted, ex. $01$ stands for the reg-ex $0.1$ Also, if $\Sigma = (a, b, c)$, then the reg-ex $(a \cup b \cup c)$ is abbreviated as $\Sigma$.
Notational Convention Review
There’s some notation used above that we should review:
- $r*$ means 0 or more repetitions of $r$
- $r+$ means one or more repetitions of $r$
- $r+$ is equivalent to $r.r*$
- Character ranges: $[0-9], [a-p]$
- Counting: $D^4 = D.D.D.D$
- Intersection: &
- Negation/Complement: ~N
- Optional use: $r? = (r \cup \mathcal{E})$
Operator Precedence
In the general case, operators go from left to right. However, the order of operations go $*$ then $.$ then $\cup$. Of course, you can use parentheses to modify the order as necessary.
Regular Expression Examples
Let $\Sigma = \{a, b\}$ be our language.
Q: Write a regular expression for $\{ w$ | last symbol in $w =$ first symbol in $w\}$
In this case, we know that the letters can be either $a$ or $b$, so we want a string that starts and ends with both $a$ or starts and ends with both $b$. If we decide to treat the first and last characters separately, then we should define some base cases first. Both $a$ and $b$ are valid strings. With that taken care of, the rest is straightforward. We can treat both cases individually, where it starts and ends with the same character with any string in between.
A: $a \cup b \cup a \Sigma^* a \cup b \Sigma^* b$
Q: Write a Regular expression for $\{w |count(w, a)$ $modulo$ $3 = 0$$\}$
In this case, we simply want to make sure we are always adding $a$ in multiples of three without obstructing how $b$ can be added. At first, this may look like doing something along the lines of $b^{\ast}ab^{\ast}ab^{\ast}ab^{\ast}$. This, however, only allows for three $a$s. To make sure we can add as many of the triple as possible, we can add more ${\ast}$ carefully. This would look like $ b^{\ast}(b^{\ast}ab^{\ast}ab^{\ast}ab^{\ast})b^{\ast}$. Finally, we can omit some of the redundant $b^*$
A: $b^{\ast}(ab^{\ast}ab^{\ast}ab^{\ast})$
Creating an NFA
Creating an NFA from a RegEx is relatively simple: just employ the strategies used for each operation in the last set of notes. This also implies that you can create a DFA from an equivalent RegEx, and that each DFA can be described as a RegEx.
RegEx to NFA Complexity
How many states does the $\mathcal{E}$-NFA of a regex $r$ have? Each base case NFA for a regex has either one or two states, and each operator adds either zero or one new state. This structure implies that the number of states is $O(r)$.
Determinization using subset construction causes an exponential blow up instead, where the size of the DFA is about $2^k$, where $k$ is the size of $r$.
Unfortunately, this is the best we can hope for. Regular expressions can be described as the desired language much more succinctly compared to DFAs. An easy example to see for this is $A^k$, where the $kth$ symbol from the end must be an $a$. We know a DFA for this must have $2^k$ states, and a regex for this would be $\Sigma*a\Sigma ^{k+1}$.
Span
Now that we have defined regex, we have the operations union, concatenation, and Kleene*. This begs the question: do we have enough operators to span the class of all regular languages? Can every regular language be defined by a regex? Given a DFA $M$, can we construct a regex for it?
The answer is yes. There actually exists a very long and complicated systematic construction that can take any NFA $M$ and build a regex $r$ such that $L(M) = L(r)$.
Regular Languages
Remember that a language $A$ is regular iff:
- There is a DFA $M$ such that $L(M) = A$
- There is an NFA $M$ such that $L(M) = A$
- There is an $\mathcal{E}$-NFA $M$ such that $L(M) = A$
- There is a regular expression $r$ such that $L(r) = A$
There are many more alternative characterizations of regular languages, but note that all of these are inter-implacable.