CONTENTS

**Compiler Construction Using Java, JavaCC, and Yacc**

IEEE

**computer society**

**Press Operating Committee**

**Chair**

Linda Shafer

*former Director, Software Quality Institute
The University of Texas at Austin*

**Editor-in-Chief**

Alan Clements

*Professor
University of Teesside*

**Board Members**

Mark J. Christensen, *Independent Consultant*

James W. Cortada, *IBM Institute for Business Value*

Richard E. (Dick) Fairley, *Founder and Principal Associate, Software Engineering
Management Associates (SEMA)*

Phillip Laplante, Professor of Software Engineering, Penn State University

Evan Butterfield,

Kate Guillemette,

IEEE Computer Society Publications

The world-renowned IEEE Computer Society publishes, promotes, and distributes a wide variety of authoritative computer science and engineering texts. These books are available from most retail outlets. Visit the CS Store at for a list of products.

IEEE Computer Society / Wiley Partnership

The IEEE Computer Society and Wiley partnership allows the CS Press authored book program to produce a number of exciting new titles in areas of computer science, computing and networking with a special focus on software engineering. IEEE Computer Society members continue to receive a 15% discount on these titles when purchased through Wiley or at

To submit questions about the program or send proposals please e-mail or write to Books, IEEE Computer Society, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720-1314. Telephone +1-714-816-2169.

Copyright © 2012 by the IEEE Computer Society, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at . Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at .

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic formats. For more information about Wiley products, visit our web site at .

*Library of Congress Cataloging-in-Publication Data:*

Dos Reis, Anthony J.

Compiler construction using Java, JavaCC, and Yacc / Anthony J. Dos Reis.

p. cm.

ISBN 978-0-470-94959-7 (hardback)

1. Compilers (Computer programs) 2. Java (Computer program language) I. Title.

QA76.76.C65D67 2011

005.4’53—dc23 2011013211

oBook ISBN: 978-1-118-11276-2

ePDF ISBN: 978-1-118-11287-8

ePub ISBN: 978-1-118-11277-9

eMobi ISBN: 978-1-118-11288-5

*To little sister*

PREFACE

My principal goal in writing this book is to provide the reader with a clear exposition of the theory of compiler design and implementation along with plenty of opportunities to put that theory into practice. The theory, of course, is essential, but so is the practice.

- Provides numerous, well-defined projects along with test cases. These projects ensure that students not only know the theory but also know how to apply it. Instructors are relieved of the burden of designing projects that integrate well with the text.
- Project work starts early in the book so that students can apply the theory as they are learning it.
- The compiler tools (JavaCC, Yacc, and Lex) are optional topics.
- The entire book is Java oriented. The implementation language is Java. The principal target language is similar to Java’s bytecode. The compiler tools generate Java code. The form of attributed grammar used has a Java-like syntax.
- The target languages (one is stack oriented like Java’s bytecode; the other is register oriented) are very easy to learn but are sufficiently powerful to support advanced compiler projects.
- The software package is a dream come true for both students and instructors. It automatically evaluates a student’s compiler projects with respect to correctness, run time, and size. It is great for students: they get immediate feedback on their projects. It is great for instructors: they can easily and accurately evaluate a student’s work. With a single command, an instructor can generate a report for an entire class. The software runs on three platforms: Microsoft Windows, Linux, and the Macintosh OS X.
- Demonstrates how compiler technology is not just for compilers. In a capstone project, students design and implement grep using compiler technology.
- Includes a chapter on interpreters that fits in with the rest of the book.
- Includes a chapter on optimization that is just right for an introductory course. Students do not simply read about optimization techniques—they implement a variety of techniques, such as constant folding, peephole optimization, and register allocation.
- The book uses a Java-like form of grammars that students can easily understand and use. This is the same form that JavaCC uses. Thus, students can make transition to JavaCC quickly and easily.
- Provides enough theory that the book can be used for a combined compiler/automata/formal languages course. The book covers most of the topics covered in an automata/formal languages course: finite automata, stack parsers, regular expressions, regular grammars, context-free grammars, context-sensitive grammars, unrestricted grammars, Chomsky’s hierarchy, and the pumping lemmas. Pushdown automata, Turing machines, computability, and complexity are discussed in supplements in the software package. The software package also includes a pushdown automaton simulator and Turing machine simulator.
- Covers every topic that should be in a first course or in an only course on compilers. Students will learn not only the theory and practice of compiler design but also important system concepts.

The software package for the textbook has some unusual features. When students run one of their compiler-generated programs, the software produces a log file. The log file contains a time stamp, the student’s name, the output produced by the compiler-generated program, and an evaluation of the compiler-generated program with respect to correctness, program size, and execution time. If the output is not correct (indicating that the student’s compiler is generating incorrect code), the log file is marked with NOT CORRECT. If the compiled program is too big or the execution time too long, the log file is marked with OVER LIMIT.

The name of a log file contains the student’s name. For example, the log file for the S3 project of a student whose last name is Dos Reis would be S3.dosreis.log. Because each log file name is unique, an instructor can store all the log files for a class in a single directory. A single command will then produce a report for the entire class.

The software supports two instruction sets: the stack instruction set and the register instruction set. The stack instruction set is the default instruction set. To use the register instruction set, a single directive is placed in the assembly language source program. The software then automatically reconfigures itself to use the register instruction set.

The three principal programs in the software package are **a** (the assembler/linker), **e** (the executor), and **I** (the library maker). The software package also includes **p** (a pushdown automaton simulator) and **t** (a Turing machine simulator).

The software package for this book is available from the publisher. The compiler tools are available on the Web. At the time of this writing, JavaCC is at , Byacc/j is at , and Jflex is at .

This textbook specifies many well-defined projects. The source language has six levels of increasing complexity. A student can write a compiler for each level that translates to the stack instruction set. A student can also write a compiler for each level that translates to the register instruction set, or incorporates optimization techniques. For each level, a student can write a pure interpreter or an interpreter that uses an intermediate code. A student can implement several variations of grep using compiler technology. A student can write the code for any of these projects by hand or by using JavaCC or Yacc. Many of the chapter problems provide additional projects. In short, there are plenty of projects.

For each project, the textbook provides substantial support. Moreover, many of the projects are incremental enhancements of a previous project. This incremental approach works well; each project is challenging but not so challenging that students cannot do it.

Most projects can be completed in a week’s time. Thus, students should be able to do ten or even more projects in a single semester.

For background material, the reader is referred to the author’s *An Introductions to Programming Using Java* (Jones & Bartlett, 2010) and *Assembly Language and Computer Architecture Using C ^{++} and Java* (Course Technology, 2004). Also recommended is JFLAP (available at ), an interactive program that permits experimentation with various types of automata and grammars.

I would like to thank Professors Robert McNaughton and Dean Arden who years ago at RPI taught me the beauty of formal language theory, my students who used preliminary versions of the book and provided valuable feedback, Katherine Guillemette for her support of this project, my daughter Laura for her suggestions on the content and structure of the book, and my wife for her encouragement.

ANTHONY J. DOS REIS

*New Paltz, New York*

*October 2011*

Compiler construction is truly an engineering science. With this science, we can methodically—almost routinely—design and implement fast, reliable, and powerful compilers.

You should study compiler construction for several reasons:

- Compiler construction techniques have very broad applicability. The usefulness of these techniques is not limited to compilers.
- To program most effectively, you need to understand the compiling process.
- Language and language translation are at the very heart of computing. You should be familiar with their theory and practice.
- Unlike some areas of computer science, you do not typically pick up compiler construction techniques “on the job.” Thus, the formal study of these techniques is essential.

To be fair, you should also consider reasons for *not* studying compiler construction. Only one comes to mind: Your doctor has ordered you to avoid excitement.

In our study of compiler design theory, we begin with several important definitions. An *alphabet* is the finite set of characters used in the writing of a language. For example, the alphabet of the Java programming language consists of all the characters that can appear in a program: the upper- and lower-case letters, the digits, whitespace (space, tab, newline, and carriage return), and all the special symbols, such as =, +, and {. For most of the examples in this book, we will use very small alphabets, such as {b, c} and {b, c, d}. We will avoid using the letter “a” in our alphabets because of the potential confusion with the English article “a”.

A *string over an alphabet* is a finite sequence of characters selected from that alphabet. For example, suppose our alphabet is {b, c, d}. Then

cbd cbcc c

are examples of strings over our alphabet. Notice that in a string over an alphabet, each character in the alphabet can appear any number of times (including zero times) and in any order. For example, in the string cbcc (a string over the three-letter alphabet {b, c, d}), the character b appears once, c appears three times, and d does not appear.

The *length of a string* is the number of characters the string contains. We will enclose a string with vertical bars to designate its length. For example, |cbcc| designates the length of the string cbcc. Thus, |cbcc| = 4.

A *language* is a set of strings over some alphabet. For example, the set containing just the three strings cbd, cbcc, and c is a language. This set is not a very interesting language, but it is, nevertheless, a language according to our definition.

Let us see how our definitions apply to a “real” language—the programming language Java. Consider a Java program all written on a single line:

class C {public static void main (String[] args) {} }

Clearly, such a program is a single string ove the alphabet of Java. We can also view a multiple-line program as a single string—namely, the string that is formed by connecting successive lines with a line separator, such as a newline character or a carriage return/newline sequence. Indeed, a multiline program stored in a computer file is represented by just such a string. Thus, the multiple-line program

class C { public static void main (String[] args) { } }

is the single string

where represents the line separator. The *Java language* is the set of all strings over the Java alphabet that are valid Java programs.

A language can be either finite or infinite and may or may not have a meaning associated with each string. The Java language is infinite and has a meaning associated with each string. The meaning of each string in the Java language is what it tells the computer to do. In contrast, the language {cbd, cbcc, c} is finite and has no meaning associated with each string. Nevertheless, we still consider it a language. A language is simply a set, finite or infinite, of strings, each of which may or may not have an associated meaning.

*Syntax rules* are rules that define the form of the language, that is, they specify which strings are in a language. *Semantic rules* are rules that associate a meaning to each string in a language, and are optional under our definition of language.

Occasionally, we will want to represent a string with a single symbol very much like *x* is used to represent a number in algebra. For this purpose, we will use the small letters at the end of the English alphabet. For example, we might use *x* to represent the string cbd and *y* to represent the string cbcc.

A *compiler* is a translator. It typically translates a program (the *source program*) written in one language to an equivalent program (the *target program*) written in another language (see ). We call the languages in which the source and target programs are written the *source* and *target languages*, respectively.

Typically, the source language is a high-level language in which humans can program comfortably (such as Java or C++), whereas the target language is the language the computer hardware can directly handle (*machine language*) or a symbolic form of it (*assembly language*).

If the source program violates a syntax rule of the source language, we say it has a *syntax error.* For example, the following Java method has one syntax error (a right brace instead of a left brace on the second line):

public void greetings ( ) } // syntax error System.out.printIn (“hello“); }

A *logic error* is an error that does not violate a syntax rule but results in the computer performing incorrectly when we run the program. For example, suppose we write the following Java method to compute and return the sum of 2 and 3:

public int sum ( ) { return 2 + 30; // logic error }

This method is a valid Java method but it tells the computer to do the wrong thing—to compute 2 + 30 instead 2 + 3. Thus, the error here is a logic error.

A compiler in its simplest form consists of three parts: the token manager, the parser, and the code generator (see ).

The source program that the compiler inputs is a stream of characters. The *token manager* breaks up this stream into meaningful units, called *tokens.* For example, if a token manager reads

int x; // important example x = 55;

it would output the following sequence of tokens:

int x ; x = 55 ;

The token manager does not produce tokens for white space (i.e., space, tab, newline, and carriage return) and comments because the parser does need these components of the source program. A token manager is sometimes called a *lexical analyzer, lexer, scanner*, or *tokenizer.*

A *parser* in its simplest form has three functions:

The *code generator*, the last module of a compiler, outputs the target program based on the information provided by the parser.

In the compilers we will build, the parser acts as the controller. As it executes, it calls the token manager whenever it needs a token, and it calls the code generator at various points during the parse, passing the code generator the information the code generator needs. Thus, the three parts of the compiler operate concurrently. An alternate approach is to organize the compiling process into a sequence of *passes.* Each pass reads an input file and creates an output file that becomes the input file for the next pass. For example, we can organize our simple compiler into three passes. In the first pass, the token manager reads the source program and creates a file containing the tokens corresponding to the source program. In the second pass, the parser reads the file of tokens and outputs a file containing information required by the code generator. In the third pass, the code generator reads this file and outputs a file containing the target program.

Since languages are sets of strings, it is appropriate at this point to review some basic set theory. One method of representing a set is simply to list its elements in any order. Typically, we use the left and right braces, “{” and “}”, to delimit the beginning and end, respectively, of the list of elements. For example, we represent the set consisting of the integers 3 and 421 with

{3, 421} or {421, 3}

Similarly, we represent the set consisting of the two strings b and bc with

{b, bc} or {bc, b}

This approach cannot work for an infinite set because it is, of course, impossible to list all the elements of an infinite set. If, however, the elements of an infinite set follow some obvious pattern, we can represent the set by listing just the first few elements, followed by the ellipsis (…). For example, the set

{b, bb, bbb, …}

represents the infinite set of strings containing one or more b’s and no other characters. Representing infinite sets this way, however, is somewhat imprecise because it requires the reader to figure out the pattern represented by the first few elements.

Another method for representing a set—one that works for both finite and infinite sets—is to give a rule for determining its elements. In this method, a set definition has the form

{*E: defining rule*}

where *E* is an expression containing one or more variables, and the defining rule generally specifies the allowable ranges of the variables in *E*. The colon means “such that.” We call this representation the *set-builder notation.* For example, we can represent the set containing the integers 1 to 100 with

{*x* : *x* is an integer and 1 ≤ *x* ≤ 100}

Read this definition as “the set of all *x* such that *x* is an integer and *x* is greater than or equal to 1 and less than or equal to 100.” A slightly more complicated example is

{*n*^{2} : *n* is an integer and *n* ≥ 1}

Notice that the expression preceding the colon is not a single variable as in the preceding example. The defining rule indicates that *n* can be 1, 2, 3, 4, and so on. The corresponding values of *n*^{2} are the elements of the set—namely, 1, 4, 9, 16, etc. Thus, this is the infinite set of integer squares:

{1, 4, 9, 16, …}

In set notation, the mathematical symbol means “is an element of.” A superimposed slash on a symbol negates the condition represented. Thus, means “is not a element of.” For example, if *P* = {2, 3, 4}, then 3 *P*, but 5 *P*.

The *empty set* [denoted by either { } or ] is the set that contains no elements. The *universal set* (denoted by *U*) is the set of all elements under consideration. For example, if we are working with sets of integers, then the set of all integers is our universe. If we are working with strings over the alphabet {b, c}, then the set of all strings over {b, c} is our universe.

The set operations *union, intersection*, and *complement*, form new sets from given sets. The union operator is most often denoted by the special symbol . We, however, use the vertical bar | to denote the union operator. The advantage of | is that it is available on standard keyboards. We will use and ~ to denote the intersection and complement operators, respectively. , the standard symbol for set intersection, unfortunately is not available on keyboards. However, we will use set intersection so infrequently that it will not be necessary to substitute a keyboard character for .

Set union, intersection, and complement are defined as follows:

Union of P and Q: |
P | Q= {x: x P or x Q} |

Intersection of P and Q: |
P Q = {x: x P and x Q} |

Complement of P: |
~P = {x: x U and x P} |

Here are the definitions in words of these operators:

*P* | *Q* is the set of all elements that are in either *P* or *Q* or both.

*P* *Q* is the set of elements that are in both *P* and *Q*.

~*P* is the set of all elements in the universe *U* that are not in *P*.

For example, if *P*= {b, bb}, *Q*= {bb, bbb}, and our universe *U* = {b, bb, bbb, …}, then

P | Q |
= {b, bb, bbb} |

P Q |
={bb} |

~P |
= {bbb, bbbb, bbbbb, …} |

~Q |
= {b, bbbb, bbbbb, …} |

A collection of sets is *disjoint* if the intersection of every pair of sets from the collection is the empty set (i.e., they have no elements in common). For example, the sets {b}, {bb, bbb}, and {bbbb} are disjoint since no two have any elements in common.

The set *P* is a subset of *Q* (denoted *P* *Q*) if every element of *P* is also in *Q*. The set *P* is a proper subset of the set *Q* (denoted *P* *Q*) if *P* is a subset of *Q*, and *Q* has at least one element not in *P*. For example, if *P* = {b, bb}, *Q*= {b, bb, bbb}, and *R* = {b, bb}, then *P* is proper subset of *Q*, but *P* is not a proper subset of *R*. However, *P* is a subset of *R*. Two sets are equal if each is the subset of the other. With *P* and *R* given as above, *P* *R* and *R* *P*. So we can conclude that *P* = *R*. Note that the empty set is a subset of any set; that is, { } *S* for any set *S*.

We can apply the set operations union, intersection, and complement to any sets. We will soon see some additional set operations specifically for sets of strings.

When prehistoric humans started using numbers, they used the natural numbers 1, 2, 3, …. It was easy to grasp the idea of oneness, twoness, threeness, and so on. Therefore, it was natural to have symbols designating these concepts. In contrast, the number 0 is hardly a natural concept. After all, how could something (the symbol 0) designate nothing? Today, of course, we are all quite comfortable with the number 0 and put it to good use every day. A similar situation applies to strings. It is natural to think of a string as a sequence of one or more characters. But, just as the concept zero is useful to arithmetic, so is the concept of a *null string*—the string whose length is zero—useful to language theory. The null string is the string that does not contain any characters.

How do we designate the null string? Normally, we designate strings by writing them down on a piece of paper. For example, to designate a string consisting of the first three small letters of the English alphabet, we write abc. A null string, however, does not have any characters, so there is nothing to write down. We need some symbol, preferably one that does not appear in the alphabets we use, to represent the null string. Some writers of compiler books use the Greek letter for the null string. However, since is easily confused with the symbol for set membership, we will use the small Greek letter λ (lambda) to represent the null string.

One common misconception about the null string is that a string consisting of a single space is the null string. A space is a character whose length is one; the null string has length zero. They are not the same. Another misconception has to do with the empty set. The null string is a string. Thus, the set {λ} contains exactly one string—namely the null string. The empty set { }, on the other hand, does not contain any string.

We call the operation of taking one string and placing it next to another string in the order given to form a new string *concatenation*. For example, if we concatenate bcd and efg, we get the string bcdefg. Note that the concatenation of any string *x* with the null string λ yields *x*. That is,

xλ = λx = x

A nonnegative exponent applied to a character or a sequence of characters in a string specifies the replication of that character or sequence of characters. For example b^{4} is a shorthand representation of bbbb. We use parentheses if the scope of the replication is more than one character. Hence, b(cd)^{2}e represents bcdcde. A string replicated zero times is by definition the null string; that is, for any string *x, x*^{0} = λ.

We can use exponent notation along with set-builder notation to define sets of strings. For example, the set

{b^{i}: 1 ≤ *i* ≤ 3}

is the set

{b^{1}, b^{2}, b^{3}} = {b, bb, bbb}

The exponent in exponent notation can never be less than zero. If we do not specify its lower bound in a set definition, assume it is zero. For example, the set

{b^{i}: *i* ≤ 3}

should be interpreted as

{b^{i}: 0 ≤ *i* ≤ 3} = {b^{0}, b^{1}, b^{2}, b^{3}} = {λ, b, bb, bbb}

Describe in English the language defined by {b^{i}c^{2i}: *i* ≥ 0}.

*Answer:*

The set of all strings consisting of b’s followed by c’s in which the number of c’s is twice the number of b’s. This set is {λ, bcc, bbcccc, bbbcccccc, …}.

We have just seen that an exponent following a character represents a single string (for example, b^{3} represents bbb). In contrast, the star operator, *, following a character (for example, b*) represents a set of strings. The set contains every possible replication (including zero replications) of the starred character. For example,

b* = {b^{0}, b^{1}, b^{2}, b^{3}, …} = {b^{n}: *n* ≥ 0 } = {λ, b, bb, bbb, …}

Think of the star operator as meaning “zero or more.”

The star operator always applies to the item immediately preceding it. If a parenthesized expression precedes the star operator, then the star applies to whatever is inside the parentheses. For example, in (bcd)*, the parentheses indicate that the star operation applies to the entire string bcd. That is,

(bcd)* = {λ, bcd, bcdbcd, bcdbcdbcd, … }

The star operator can also be applied to sets of strings. If *A* is a set of strings, then *A** is the set of strings that can be formed from the strings of *A* using concatenation, allowing any string in *A* to be replicated any number of times (including zero times) and used in any order. By definition, the null string is always in *A**.

Here are several examples of starred sets:

{b}* = {λ, b, bb, bbb, …} = b* {b, c}* = {λ, b, c, bb, bc, cb, cc, bbb, …} {λ}* = {λ} {}*={λ} {bb, cc}* = {λ, bb, cc, bbbb, bbcc, ccbb, cccc, …} {b, cc}* = {λ, b, bb, cc, bbb, bcc, ccb, bbbb …}

Notice that {b} * = b*. That is, starring a set that contains just one string yields the same set as starring just that string.

Here is how to determine if a given string is in *A**, where *A* is an arbitrary set of strings: If the given string is the null string, then it is in *A** by definition. If the given string is nonnull, and it can be divided into substrings such that each substring is in *A*, then the given string is in *A**. Otherwise, the string is not in *A**. For example, suppose *A* = {b, cc}. We can divide the string bccbb into four parts: b, cc, b, and b, each of which is in *A*. Therefore, bccbb *A**. On the other hand, for the string bccc the required subdivision is impossible. If we divide bccc into b, cc, and c, the first two strings are in *A* but the last is not. All other subdivisions of bccc similarly fail. Therefore, bccc *A**.

We call the set that results from the application of the star operator to a string or set of strings the *Kleene closure*, in honor of Stephen C. Kleene, a pioneer in theoretical computer science.

Let us now use the star operator to restate two important definitions that we gave earlier. Let the capital Greek letter Σ (sigma) represent an arbitrary alphabet. A *string over the alphabet* Σ is any string in Σ*. For example, suppose Σ = {b, c}. Then

Σ* = {λ, b, c, bb, bc, cb, cc, bbb, …}

Thus, λ, b, c, bb, bc, cb, cc, bbb,… are strings over Σ. It may appear strange to view λ as a string over the alphabet Σ = {b, c}. Actually, this view is quite reasonable since λ has no characters *not* in {b, c}. λ is always a string over Σ regardless of the content of Σ because, by definition, λ is always in Σ*. A *language over the alphabet* Σ is any subset of Σ*. For example {λ}, {b}, and {b, cc} are each languages over Σ = {b, c}. Even the empty set is a language over Σ because it is a subset of Σ*.

*Answer:*

Concatenation can be applied to sets of strings as well as individual strings. If we let *A* and *B* be two sets of strings, then *AB*, the *concatenation of the sets A* and *B*, is

That is, *AB* is the set of all strings that can be formed by concatenating a string *A* with a string *B*. For example, if *A* = {b, cc} and *B* = {d, dd}, then

*AB* = {bd, bdd, ccd, ccdd}

*BA* = {db, dcc, ddb, ddcc}

As an example of concatenation, consider the set b*c*, the concatenation of the sets b* and c*. Each string in b*c* consists of some string from b* concatenated to some string in c*. That is, each string consists of zero or more b’s followed by zero or more c’s. The number of b’s does not have to equal the number of c’s, but all b’s must precede all c’s. Thus, b*c* = {λ, b, c, bb, bc, cc, bbb, bbc, bcc, ccc,…}. In exponent notation, b*c* = {b^{i}c^{j}: *i* ≥ 0 and *j* ≥ 0}.

A string can also be concatenated with a set. If *x* is a string and *A* is a set of strings, then *xA*, the concatenation of *x* with *A* is

{xy : y A}

Similarly, *Ax* is

{yx : y A}

For example, bbc*, the concatenation of the string bb and the set c*, is the set of all strings consisting of bb followed by a string in c*. Thus,

bbc* = {bbλ = bb, bbc, bbcc, bbccc, …}

Notice that it follows from our definitions that *xA* = {*x*}*A*, where *x* is an arbitrary string and *A* is a set of strings. That is, we get the same result whether we concatenate *x* (the string) or {*x*} (the set containing just *x*) to a set *A*.

*Answer:*

The union operator implies a choice with respect to the makeup of the strings in the language specified. For example, we can interpret

{b} ({c} | {d})

as the set of strings consisting of a b followed by a choice of c or d. That is, the set consists of the strings bc and bd.

Describe in English the set defined by b*({c} | {d})e*.

*Answer:*

The set of all strings consisting of zero or more b’s, followed by either c or d, followed by zero or more e’s.

The *plus operator* is like the star operator, except that the former means “one or more” instead of “zero or more.” We can apply it either to an individual string or a set of strings. It appears as a + following the item to which it applies. For example,

b+ = {b^{1}, b^{2}, b^{3}, …} = {b^{i} : *i* ≥ 1}

{b, c} + = {b, c, bb, bc, cb, cc, bbb, … }

*A*+, where *A* is a set of strings, contains the null string only if the set *A* itself contains the null string. *A*^{*}, on the other hand, always contains the null string for any set *A*.

Consider the set bb*. Each string in bb* consists of a single b followed a string in b*. Because the shortest string in b* is λ, the shortest string in bb* is bλ = b. Thus, every string in bb* contains one or more b’s. That is, bb* = b+. In general, for a string *x* and a set of strings *A*,

*xx** = *x*x* = *x*+

and

*AA** = *A***A* = *A*+

We call the set that results from the application of the plus operator to a string or a set of strings the *positive closure.*

Show that {λ} | b+ = b*.

*Answer:*

{λ} | b+ = {λ} | {b, bb, bbb, …} = {λ, b, bb, bbb, …} = b*.

The question mark operator specifies an optional item. We can apply it to either an individual string or a set of strings. It appears as a ? following the item to which it applies. For example, bc? specifies a b followed by an optional c—that is, a b followed by zero or one c. Thus bc? is the set {b, bc}. Think of c? as representing the set {λ} | {c} = {λ, c}. Thus, bc? = b{λ, c} = {b, bc}.

If *A* is a set of strings, then b*A*? specifies a b optionally followed by any single string in *A*, which is the set {b} |b*A*. For example, b{b, c}? is the set {b, bb, bc}.

Show that (b+)? =b*.

*Answer:*

(b+)? = {λ} | b+ = {λ} | {b, bb, bbb, …} = {λ, b, bb, bbb, …} = b*.

{b} and b are not the same. The former is a set containing one string; the latter is a string— not a set containing a string. In spite of this distinction, it is common practice to represent the former (the set) by writing the latter (the string). We follow this practice only when the context clearly implies the correct interpretation, or where it does not make a difference. For example, instead of writing {c} | bd*, we can write c | bd* (recall we are using | to represent the set union). The union operator clearly implies that the c to its left must represent the set {c} and not the string c. In some expressions, it does not matter which interpretation we use. For example, whether we interpret b as a string or as a set makes no difference in the Kleene closure b*. Similarly, in b+ and b*A*, our interpretation makes no difference.

If an expression contains more than one kind of operator, then the operations are performed in an order determined by their *precedence*. Specifically, operations with higher precedence are performed before operations with lower precedence. Our string operations, ordered from highest to lowest precedence, with equal precedence operations listed on the same line, are

Complementation

Star, Plus, Question Mark

Concatenation

Intersection

Union

For example, in c | bd*, we first apply the star to the d, then we concatenate b and d*, and last, we take the union of c and bd*. We can override this order by using parentheses. For example, in ((c | b)d)*, we perform the union first, then the concatenation, and the star operation last.

If the star, plus, or question mark operators appear consecutively, we perform their corresponding operations left to right. For example, in (bb)?+, we perform the question mark operation first, then the plus operation. Thus, (bb)?+ = {λ, bb}+ = (bb)*.

Write an expression without using ~ that defines the same set as ~(b*). Assume complementation is with respect to the set Σ*, where Σ = {b, c}.

*Answer:*

b* is the set of all strings over the alphabet with no c’s. Therefore, its complement is the set of strings that have at least one c. Thus, every string in ~(b*) must be of the form *xcy*, where *x* and *y* are arbitrary strings over {b, c}. Thus, ~(b*) = (b | c)*c(b | c)*. Another expression that defines the same set is b*c(b | c)*.

Let us use our convention of designating a set containing a single string by writing just the string. For example, let us write b to represent the set {b}. Then each of these following expressions designates a set of strings:

λ

b

c

λ|b

bbc

b*c*

b|(cc)*

(b|c)*

For example, , λ, b, c, and λ|b designate, respectively, the sets { }, {λ}, {b}, {c}, and {λ, b}. In every expression above, the only operations that appear, if any, are union, concatenation, and star. We call such expressions *regular expressions*. We can use regular expressions to define languages—that is, to define sets of strings.

Let us look at a precise definition of a regular expression: A *regular expression over the alphabet* Σ is any of the following:

λ

any single symbol in Σ

These expressions are the *base regular expressions.* In addition, we can construct additional regular expressions using the following *construction rule:*

If *r* and *s* are arbitrary regular expressions, then the following expressions are also regular:

(*r*)

Our construction rule allows us to construct new regular expressions, using union, concatenation, star, and parenthesis, from our base regular expressions or expressions previously constructed using the construction rule. For example, since b and c are regular expressions, so are (b), b | c, bc, and b* by our construction rule. We can continue applying our construction rule, producing ever more complex regular expressions. For example, using our previously constructed regular expressions bc and b*, we can now, in turn, construct bc | b* with our construction rule. We are not allowed to apply our construction rule an infinite number of times when building a regular expression. This restriction implies that any regular expression must be of finite length.

Every regular expression defines (i.e., represents) a language. For example, b | c defines the language {b, c}. We call any language that can be defined with a regular expression a *regular language.*

Every regular language has more than one regular expression that defines it. For example, bb*, λbb*, and b*b all define the language consisting of one or more b’s.

An assumption we have made about regular expressions over an alphabet Σ is that Σ does not contain |, *, (, or). Thus, when we see b | c, we know the vertical bar is the union operator. However, if the vertical bar were in Σ, then b | c would be ambiguous. It could represent either the set {b, c} (if we regard | as the union operator) or the set containing the single string “b | c” (if we regard | as a symbol from Σ). A simple way to disambiguate an expression like b | c is to quote the symbols in a regular expression that come from Σ. Accordingly, we would write “b | c” or “b” “|” “c” to represent the single string “b | c”. Here, the vertical bar is a symbol from Σ. We know this because the vertical bar is in quotes. But we would write “b” | “c” to represent the set {b, c}. Here, the vertical bar is the union operator. We know this because here the vertical bar is not in quotes.

Give the strings in the set specified by “b” | “c” “|” “d”.

*Answer:*

Two strings: “b” and “c | d”.

Let us look at some expressions that are not regular expressions:

bb | bbb | bbbb | … | (the ellipsis “…” is not allowed) |

(cc)+ | (the plus operator is not allowed) |

(cc)? | (the question mark operator is not allowed) |

{b^{i}: i ≥ 0} |
(exponent and set-builder notation is not allowed) |

Because we do not allow the ellipsis, the plus operator, the exponent notation, or the set builder notation in regular expressions, the expressions above are not regular expressions. The languages they represent, however, are regular because we can represent them with regular expressions, namely bbb*, cc(cc)*, λ|cc, and b*, respectively.

When we analyze regular expressions, it is often helpful to think of the union operator as indicating a choice. For example, we can think of the regular expression (b | c) d as representing the set of strings consisting of the choice b or c followed by d—that is, as the set consisting of bd and cd.

If we allow our regular expressions to include the ~, +, and ? operators in addition to |, *, and concatenation, we get the class of expressions called *extended regular expressions.* Any language defined by an extended regular language can also be defined by a nonextended regular language. In other words, extended regular expressions are no more powerful than nonextended regular expressions in defining languages. Thus, every extended regular expression necessarily has a nonextended equivalent. For example, the extended regular expression (b | c)? has the nonextended equivalent λ|b | c.

Although extended regular expressions are no more powerful than nonextended regular expressions, they are, nevertheless, useful because they are often easier to use and understand than their nonextended equivalents.

Regular expressions have limitations. They cannot represent every language. For example, consider the following language that we call *PAIRED:*

*PAIRED* = {b^{i}c^{i}: *i* ≥ 0} = {λ, bc, bbcc, bbbccc, bbbbcccc, …}

Each string in *PAIRED* consists of some number of b’s followed by the *same* number of c’s. With a regular expression, it is possible to capture the condition that all b’s precede all c’s (as in b*c*). But there is no way to capture the condition that the number of b’s equals the number of c’s unless we limit the length of the string (we give a proof for this in Chapter 17). The language represented by b*c* includes strings in which the number of b’s is equal to the number of c’s (for example, bbcc). But it also includes strings in which the number of b’s is not equal to the number of c’s (for example, bcc). *PAIRED*, on the other hand, contains only strings in which the number of b’s is equal to the number of c’s. *PAIRED* is not equal to b*c* but is, in fact, a proper subset of it.

Another attempt at a regular expression for *PAIRED* is the infinite-length expression

λ|bc | bbcc | bbbccc | bbbbccc|…

Although this expression does represent *PAIRED*, it is not a regular expression because regular expressions cannot be of infinite length.

Let us place an upper bound on the exponent *i* in the preceding definition of *PAIRED*. We then get a new language that is, in fact, regular. For example,

{b^{i}c^{i}:0 ≤ *i* ≤ 2}

is a regular language represented by the regular expression

λ|bc|bbcc

That a regular expression cannot represent the language *PAIRED* is a serious limitation since similar constructs frequently appear in programming languages. For example, in Java, arithmetic expressions may be nested with parentheses to an arbitrary depth:

(((…)))

Similarly, blocks of code may be nested to an arbitrary depth with braces:

{{{ … }}}

We cannot describe either of these constructs with regular expressions unless we place an upper limit on the depth of nesting.

When we assert that the two regular expressions are equal, we are asserting that the languages defined by those regular expressions are equal. For example, when we write

λ|bb* = b*

we are asserting that the language defined by λ|bb* is equal to the language defined by b*.

Although regular expressions are too limited to fully describe the typical programming language, they are still quite useful to the compiler designer. Their usefulness appears in the design of the token manager. In particular, we can use regular expressions to describe the various tokens that the token manager provides to the parser. For example, if we let *D* represent any digit 0 through 9, then the regular expression *DD** represents an unsigned integer token. We will see in Chapter 13 how regular expressions in conjunction with the software tool JavaCC can automate the implementation of the token manager.

One of the jobs of a compiler is to determine if the source program is indeed one of the strings that make up the source language. If it is not, we say that the source program has a *syntax error.* For example, in the following Java program,

public class Bug { public static void main (String[ ] args) { System.out.println (“hello”) // missing semicolon } }

println