
Parametric Shape Analysis via
3-Valued Logic
MOOLY SAGIV, THOMAS REPS, and REINHARD WILHELM
Shape analysis concerns the problem of determining “shape invariants” for programs that perform destructive updating on dynamically allocated storage. This article presents a parametric
framework for shape analysis that can be instantiated in different ways to create different shape-analysis algorithms that provide varying degrees of efficiency and precision. A key innovation of
the work is that the stores that can possibly arise during execution are represented (conservatively)
using 3-valued logical structures. The framework is instantiated in different ways by varying the
predicates used in the 3-valued logic. The class of programs to which a given instantiation of the
framework can be applied is not limited a priori (i.e., as in some work on shape analysis, to programs that manipulate only lists, trees, DAGS, etc.); each instantiation of the framework can be
applied to any program, but may produce imprecise results (albeit conservative ones) due to the
set of predicates employed.
Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging—
symbolic execution; D.3.3 [Programming Languages]: Language Constructs and Features—
data types and structures; dynamic storage management; E.1 [Data]: Data Structures—graphs;
lists; trees; E.2 [Data]: Data Storage Representations—composite structures; linked representations; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about
Programs—assertions; invariants
General Terms: Algorithms, Languages, Theory, Verification
Additional Key Words and Phrases: Abstract interpretation, alias analysis, constraint solving,
destructive updating, pointer analysis, shape analysis, static analysis, 3-valued logic
A preliminary version of this paper appeared in the Proceedings of the 1999 ACM Symposium on
Principles of Programming Languages. Part of this research was carried out while M. Sagiv was at
the University of Chicago. M. Sagiv was supported in part by the National Science Foundation under
grant CCR-9619219 and by the U.S.–Israel Binational Science Foundation under grant 96-00337.
T. Reps was supported in part by the National Science Foundation under grants CCR-9625667
and CCR-9619219, by the U.S.–Israel Binational Science Foundation under grant 96-00337, by a
Vilas Associate Award from the University of Wisconsin, by the Office of Naval Research under
contract N00014-00-1-0607, by the Alexander von Humboldt Foundation, and by the John Simon
Guggenheim Memorial Foundation. R. Wilhelm was supported in part by a DAAD-NSF Collaborative Research Grant.
Authors’ addresses: M. Sagiv, Department of Computer Science, School of Mathematical Science, Tel-Aviv University, Tel-Aviv 69978, Israel; email: [email protected]; T. Reps, Computer
Sciences Department, University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706;
email: [email protected]; R. Wilhelm, Fachrichtung Inf., Univ. des Saarlandes, 66123 Saarbrücken,
Germany; email: [email protected].
Permission to make digital/hard copy of all or part of this material without fee for personal or
classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice
is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on
servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2002 ACM 0164-0925/02/0500–0217 $5.00
ACM Transactions on Programming Languages and Systems, Vol. 24, No. 3, May 2002, Pages 217–298.
1. INTRODUCTION
In the past two decades, many “shape-analysis” algorithms have been developed that can automatically create different classes of “shape descriptors” for
programs that perform destructive updating on dynamically allocated storage
[Jones and Muchnick 1981, 1982; Larus and Hilfinger 1988; Horwitz et al. 1989;
Chase et al. 1990; Stransky 1992; Assmann and Weinhardt 1993; Plevyak et al.
1993; Wang 1994; Sagiv et al. 1998]. A common feature of these algorithms is
that they represent the set of possible memory states (“stores”) that arise at a
given point in the program by shape graphs, in which heap cells are represented
by shape-graph nodes and, in particular, sets of “indistinguishable” heap cells
are represented by a single shape-graph node (often called a summary-node
[Chase et al. 1990]).
This article presents a parametric framework for shape analysis. The framework can be instantiated in different ways to create shape-analysis algorithms
that provide different degrees of precision. The essence of a number of previous shape-analysis algorithms, including Jones and Muchnick [1981, 1982],
Horwitz et al. [1989], Chase et al. [1990], Stransky [1992], Plevyak et al. [1993],
Wang [1994], and Sagiv et al. [1998], can be viewed as instances of this framework. Other instantiations of the framework yield new shape-analysis algorithms that obtain more precise information than previous work.
A parametric framework must address the following issues:
(i) What is the language for specifying (a) the properties of stores that are to
be tracked, and (b) how such properties are affected by the execution of the
different kinds of statements in the programming language?
(ii) How is a shape-analysis algorithm generated from such a specification?
Issue (i) concerns the specification language of the framework. A key innovation
of our work is the way in which it makes use of 2-valued and 3-valued logic:
2-valued and 3-valued logical structures are used to represent concrete and
abstract stores, respectively (i.e., interpretations of unary and binary predicates encode the contents of variables and pointer-valued structure fields);
first-order formulae with transitive closure are used to specify properties such
as sharing, cyclicity, reachability, and the like. Formulae are also used to specify
how the store is affected by the execution of the different kinds of statements
in the programming language. The analysis framework can be instantiated in
different ways by varying the predicates that are used. The specified set of
predicates determines the set of data-structure properties that can be tracked,
and consequently what properties of stores can be “discovered” to hold at the
different points in the program by the corresponding instance of the analysis. Issue (ii) concerns how to create an actual analyzer from the specification.
In our work, the analysis algorithm is an abstract interpretation; it finds the
least fixed point of a set of equations that are generated from the analysis
specification.
The ideal is to have a fully automatic parametric framework, a yacc for shape
analysis, so to speak. The designer of a shape-analysis algorithm would supply only the specification, and the shape-analysis algorithm would be created
automatically from this specification. A prototype version of such a system,
based on the methods presented in this article, has been implemented by
T. Lev-Ami [Lev-Ami 2000; Lev-Ami and Sagiv 2000]. (See also Section 7.4.1.)
The class of programs to which a given instantiation of the framework can
be applied is not limited a priori (i.e., as in some work on shape analysis, to
programs that manipulate only lists, trees, DAGS, etc.). Each instantiation of
the framework can be applied to any program, but may produce conservative
results due to the set of predicates employed; that is, the attempt to analyze a
particular program with an instantiation created using an inappropriate set of
predicates may produce imprecise, albeit conservative, results. Thus, depending
on the kinds of linked data structures used in a program, and on the link-rearrangement operations performed by the program’s statements, a different
instantiation—using a different set of predicates—may be needed in order to
obtain more useful results.
The framework allows one to create algorithms that are more precise than
the shape-analysis algorithms cited earlier. In particular, by tracking which
heap cells are reachable from which program variables, it is often possible
to determine precise shape information for programs that manipulate several
(possibly cyclic) data structures (see Sections 2.6 and 5.3). Other static-analysis
techniques yield very imprecise information on these programs. So that reachability properties can be specified, the specification language of the framework
includes a transitive-closure operator.
The key features of the approach described in this article are as follows.
—The use of 2-valued logical structures to represent concrete stores. Interpretations of unary and binary predicates encode the contents of variables and
pointer-valued structure fields (see Section 2.2).
—The use of a 2-valued first-order logic with transitive closure to specify properties of stores such as sharing, cyclicity, reachability, and so on (see Sections 2,
3, and 5).
—The use of Kleene’s 3-valued logic [Kleene 1987] to relate the concrete
(2-valued) world and the abstract (3-valued) world. Kleene’s logic has a
third truth value that signifies “unknown,” which is useful for shape analysis because we only have partial information about summary nodes. For
these nodes, predicates may have the value unknown (see Sections 2
and 4).
—The development of a systematic way to construct abstract domains that
are suitable for shape analysis. This is based on a general notion of “truth-blurring” embeddings that map from a 2-valued world to a corresponding
3-valued one (see Sections 2.5, 4.2, and 4.3).
—The use of the Embedding Theorem (Theorem 4.9) to ensure that the meaning
of a formula in the “blurred” (3-valued) world is compatible with the formula’s
meaning in the original (2-valued) world. The consequence of the Embedding
Theorem is that it allows us to extract information from either the concrete
world or the abstract world via a single formula: the same syntactic expression can be interpreted either in the 2-valued world or the 3-valued world.
The Embedding Theorem ensures that the information obtained from the
3-valued world is compatible (i.e., safe) with that obtained from the 2-valued
world. This eases soundness proofs considerably.
—New insight into the issue of “materialization.” This is known to be very
important for maintaining accuracy in the analysis of loops that advance
pointers through linked data structures [Chase et al. 1990; Plevyak et al.
1993; Sagiv et al. 1998]. (Materialization involves the splitting of a summary node into two separate nodes by the abstract transfer function that expresses
the semantics of a statement of the form x = y->n.) This article develops a
new approach to materialization:
—The essence of materialization involves a step (called focus in Section 6.3)
that forces the values of certain formulae from unknown to true or false.
This has the effect of converting an abstract store into several abstract
stores, each of which is more precise than the original one.
—Materialization is complicated because various properties of a store
are interdependent. We introduce a mechanism based on a constraint-satisfaction system to capture the effects of such dependences (see
Section 6.4).
In this article, we address the problem of shape analysis for a single procedure. This has allowed us to concentrate on foundational aspects of shape-analysis methods. The application of our techniques to the problem of interprocedural shape analysis, including shape analysis for programs with recursive
procedures, is addressed in Rinetzky and Sagiv [2001] (see also Section 7.4.3).
The remainder of the article is organized as follows. Section 2 provides an
overview of the shape-analysis framework. Section 3 shows how 2-valued logic
can be used as a metalanguage for expressing the concrete operational semantics of programs (and programming languages). Section 4 provides the technical details about the representation of stores using 3-valued logic. Section 5
defines the notion of instrumentation predicates, which are used to specify the
abstract domain that a specific instantiation of the shape-analysis framework
will use. Section 6 formulates the abstract semantics for program statements
and conditions, and defines the iterative algorithm for computing a (safe) set of
3-valued structures for each program point. Section 7 discusses related work.
Section 8 makes some final observations.
In Appendix A, we sketch how the framework can be used to analyze programs that manipulate doubly linked lists, which has been posed as a challenging problem for program analysis [Aiken 1996]. The proof of the Embedding Theorem and other technical proofs are presented in Appendices B
and C.
2. AN OVERVIEW OF THE PARAMETRIC FRAMEWORK
This section provides an overview of the main ideas used in the article. The
presentation is at a semi-technical level; a more detailed treatment of this
material, as well as several elaborations on the ideas covered here, is presented
in the later sections of the paper.
2.1 Shape Invariants and Data Structures
Constituents of shape invariants that can be used to characterize a data structure include
(i) anchor pointer variables, that is, information about which pointer variables
point into the data structure;
(ii) the types of the data-structure elements, and in particular, which fields
hold pointers;
(iii) connectivity properties, such as
—whether all elements of the data structure are reachable from a root
pointer variable,
—whether any data-structure elements are shared,
—whether there are cycles in the data structure, and
—whether an element v pointed to by a “forward” pointer of another element v′ has its “backward” pointer pointing to v′; and
(iv) other properties, for instance, whether an element of an ordered list is in
the correct position.
Each data structure can be characterized by a certain set of such properties.
Most semantics track the values of pointer variables and pointer-valued
fields using a pair of functions, often called the environment and the store.
Constituents (i) and (ii) above are parts of any such semantics; consequently,
we refer to them as core properties.
Connectivity and other properties, such as those mentioned in (iii) and (iv),
are usually not explicitly part of the semantics of pointers in a language, but
instead are properties derived from this core semantics. They are essential
ingredients in program verification, however, as well as in our approach to shape
analysis of programs. Noncore properties are called instrumentation properties
(for reasons that become clear shortly).
Let us start by taking a Platonic view, namely, that ideas exist without regard
to their physical realization. Concepts such as “is shared,” “lies on a cycle,” and
“is reachable” can be defined either in graph-theoretic terms, using properties
of paths, or in terms of the programming-language concept of pointers. The
definitions of these concepts can be stated in a way that is independent of any
particular data structure.
Example 2.1. A heap cell is heap-shared if it is the target of two pointers,
either from two different heap cells, or from two different pointer components
of the same heap cell.
Data structures can now be characterized using sets of such properties, where
“data structure” here is still independent of a particular implementation.
Example 2.2. An acyclic singly linked list is a set of objects, each with one
pointer component. The objects are reachable from a root pointer variable either
directly or by following pointer components. No object lies on a cycle, that is, is
reachable from itself by following pointer components.
Fig. 1. (a) Declaration of a linked-list data type in C; (b) a C function that searches a list pointed
to by parameter x and splices in a new element.
To address the problem of verifying or analyzing a particular program that
uses a certain data structure, we have to leave the Platonic realm, and formulate
shape invariants in terms of the pointer variables and data type declarations
from that program.
Example 2.3. Figure 1(a) shows the declaration of a linked-list data type
in C, and Figure 1(b) shows a C program that searches a list and splices a new
element into the list. The characterization of an acyclic singly linked list in
terms of the properties “is reachable from a root pointer variable” and “lies on a
cycle” can now be specialized for that data type declaration and that program:
—“is reachable from a root pointer variable” means “is reachable from x, or is
reachable from y, or is reachable from t, or is reachable from e.”
—“lies on a cycle” means “is reachable from itself following one or more n fields.”
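Figure 1 itself is not reproduced in this extraction. The following C sketch is only an assumed reconstruction of its flavor: a List declaration whose pointer field is named n, and an insert routine that uses the pointer variables x, y, t, and e mentioned in the text. The exact code of Figure 1(b) may differ in its details.

```c
/* A hedged sketch in the spirit of Figure 1 (not the paper's exact code):
 * a List declaration with an n field, and an insert routine that searches
 * a list pointed to by x and splices in a new element.                    */
#include <stdlib.h>

typedef struct node {
    struct node *n;      /* the field tracked by the binary predicate n */
    int data;
} *List;

void insert(List x, int d) {
    List y = x, t = NULL, e = NULL;
    while (y->n != NULL && y->data < d)   /* traversal; the exact test is assumed */
        y = y->n;
    t = malloc(sizeof(*t));               /* x = malloc() in Table III            */
    t->data = d;
    e = y->n;                             /* e = y->n                             */
    t->n = e;                             /* t->n = e                             */
    y->n = t;                             /* destructive update: y->n = t         */
}
```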
To be able to carry out shape analysis, a number of additional concepts need
to be formalized:
—an encoding (or representation) of stores, so that we can talk precisely about
store elements and the relationships among them;
—a language in which to state properties that store elements may or may not
possess;
—a way to extract the properties of stores and store elements;
—a definition of the concrete semantics of the programming language, and, in
particular, one that makes it possible to track how properties change as the
execution of a program statement changes the store; and
—a technique for creating abstractions of stores so that abstract interpretation
can be applied.
In our approach, the formalization of each of these concepts is based on predicate
logic.
Table I. Predicates Used for Representing the Stores Manipulated by
Programs that use the List Data Type Declaration from Figure 1(a)
  Predicate      Intended Meaning
  q(v)           Does pointer variable q point to element v?
  n(v1, v2)      Does the n field of v1 point to v2?
2.2 Representing Stores via 2-Valued and 3-Valued Logical Structures
To represent stores, we work with what logicians call logical structures. A logical structure is associated with a vocabulary of predicate symbols (with given
arities); each logical structure S, denoted by hU S , ι S i, has a universe of individuals U S . In a 2-valued logical structure, ι S maps each arity-k predicate
symbol p and possible k-tuple of individuals (u1 , . . . , uk ), where ui ∈ U S , to the
value 0 or 1 (i.e., false and true, respectively). In a 3-valued logical structure, ι S
maps p and (u1 , . . . , uk ) to the value 0, 1, or 1/2 (i.e., false, true, and unknown,
respectively).
In other words, 2-valued logical structures are used to encode concrete stores;
3-valued logical structures are used to encode abstract stores; and members of
these two families of structures are related by “truth-blurring embeddings”
(which are explained in Section 2.5).
2-valued logical structures are used to encode concrete stores as follows: Individuals represent memory locations in the heap; pointers from the stack into
the heap are represented by unary “pointed-to-by-variable-q” predicates; and
pointer-valued fields of data structures are represented by binary predicates.
Example 2.4. Table I lists the predicates used for representing the stores
manipulated by programs that use the List data type declaration from
Figure 1(a). In the case of insert, the unary predicates x, y, t, and e correspond to the program variables x, y, t, and e, respectively. The binary predicate
n corresponds to the n fields of List elements.
Figure 2 illustrates the 2-valued logical structures that represent lists of
length ≤4 that are pointed to by program variable x. (We generally superscript
the names of 2-valued logical structures with the “natural” symbol (\).) In column 3 of Figure 2, the following graphical notation is used for depicting 2-valued
logical structures.
—Individuals of the universe are represented by circles with names inside.
—A unary predicate p is represented in the graph by having a solid arrow from the predicate name p to node u for each individual u for which ι( p)(u) = 1, and no arrow from predicate name p to node u′ for each individual u′ for which ι( p)(u′) = 0. (If ι( p) is 0 for all individuals, the predicate name p is not shown.)
—A binary predicate q is represented in the graph by having a solid arrow labeled q between each pair of individuals ui and uj for which ι(q)(ui, uj) = 1, and no arrow between pairs u′i and u′j for which ι(q)(u′i, u′j) = 0.
Thus, in structure S2\, pointer variable x points to individual u1, whose n field
points to individual u2 .
Fig. 2. The 2-valued logical structures that represent lists of length ≤4.
The n field of u2 does not point to any individual (i.e., u2 represents a heap
cell whose n field has the value NULL).
Throughout Section 2, all examples of structures show both the tables of
unary and binary predicates, as well as the corresponding graphical representation. In all other sections of the article, the tables are omitted and just the
graphical representation is shown.
2.3 Extraction of Store Properties
2-valued structures offer a systematic way to answer questions about properties
of the concrete stores they encode. As an example, consider the formula
    ϕis(v) =def ∃v1, v2 : n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2,        (1)

which expresses the “is-shared” property: “Do two or more different heap cells point to heap cell v via their n fields?” For instance, ϕis(v) evaluates to 0 in S2\ for v ↦ u2, because there is no assignment v1 ↦ ui and v2 ↦ uj such that ι S2\ (n)(ui, u2), ι S2\ (n)(uj, u2), and ui ≠ uj all hold. As a second example, consider the formula

    ϕcn(v) =def n+(v, v),        (2)

which expresses the property of whether a heap cell v appears on a directed n-cycle. Here n+ denotes the transitive closure of the n-relation. Formula ϕcn(v) evaluates to 0 in S2\ for v ↦ u2, because the transitive closure of the relation ι S2\ (n) does not contain the pair (u2, u2).
The preceding discussion can be summarized as the following principle.
Fig. 3. The given predicate-update formulae express a transformation on logical structures that
corresponds to the semantics of y = y->n.
OBSERVATION 2.5 (Property-Extraction Principle). By encoding stores as logical structures, questions about properties of stores can be answered by evaluating formulae. The property holds or does not hold, depending on whether the
formula evaluates to 1 or 0, respectively, in the logical structure.
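As a concrete illustration of this principle, the C sketch below (ours, not the paper's implementation) encodes the structure S2\ from Figure 2 as arrays and evaluates formula (1) by brute force: the existential quantifiers become loops and conjunction becomes &&.

```c
/* Illustrative sketch only: the 2-valued structure S2 (a list of length 2)
 * encoded as arrays, and a direct evaluation of the "is-shared" formula
 *   phi_is(v) = EXISTS v1,v2 : n(v1,v) AND n(v2,v) AND v1 != v2.          */
#include <stdio.h>

#define N 2                       /* universe {u1, u2}, indexed 0 and 1 */

static int n[N][N] = {            /* binary predicate n: n(u1, u2) = 1  */
    {0, 1},
    {0, 0}
};

static int is_shared(int v) {     /* [[phi_is(v)]] under v -> individual v */
    for (int v1 = 0; v1 < N; v1++)
        for (int v2 = 0; v2 < N; v2++)
            if (v1 != v2 && n[v1][v] && n[v2][v])
                return 1;
    return 0;
}

int main(void) {
    for (int v = 0; v < N; v++)
        printf("phi_is(u%d) = %d\n", v + 1, is_shared(v));
    return 0;                     /* 0 for both u1 and u2, as in the text */
}
```

Only u1 points to u2 via n, so the formula evaluates to 0 for every individual, matching the discussion of S2\ above.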
2.4 Expressing the Semantics of Program Statements
Our tool for expressing the semantics of program statements is also based on
evaluating formulae.
OBSERVATION 2.6 (Expressing the Semantics of Statements via Logical
Formulae). Suppose that σ is a store that arises before statement st, that σ′ is the store that arises after st is evaluated on σ, and that S is the logical structure that encodes σ. A collection of predicate-update formulae—one for each predicate p in the vocabulary of S—allows one to obtain the structure S′ that encodes σ′. When evaluated in structure S, the predicate-update formula for a predicate p indicates what the value of p should be in S′.
In other words, the set of predicate-update formulae captures the concrete
semantics of st.
This process is illustrated in Figure 3 for the statement y = y->n, where the
initial structure Sa\ represents a list of length 4 that is pointed to by both x and y.
Fig. 4. The abstraction of 2-valued structure Sa\ to 3-valued structure Sa when we use {x, y, t, e}-abstraction. The boxes in the tables of unary predicates indicate how individuals are grouped into
equivalence classes; the boxes in the tables for predicate n indicate how the quotient of n with
respect to these equivalence classes is performed.
Figure 3 shows the predicate-update formulae for the five predicates of the
vocabulary used in conjunction with insert: x, y, t, e, and n; the symbols x′, y′, t′, e′, and n′ denote the values of the corresponding predicates in the structure that arises after execution of y = y->n. Predicates x′, t′, e′, and n′ are unchanged in value by y = y->n. The predicate-update formula y′(v) = ∃v1 : y(v1) ∧ n(v1, v)
expresses the advancement of program variable y down the list.
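A minimal sketch of this update in 2-valued logic is shown below (it is our illustration, not the paper's implementation); the structure is the list of length 4 of Figure 3, with y initially on u1.

```c
/* Applying the predicate-update formula for the statement  y = y->n,
 *   y'(v) = EXISTS v1 : y(v1) AND n(v1, v),
 * to a 2-valued structure like Sa of Figure 3 (a list of length 4).
 * x, t, e, and n are unchanged by the statement.                     */
#include <stdio.h>

#define N 4                         /* individuals u1..u4 */

static int y[N]    = {1, 0, 0, 0};  /* y points to u1 */
static int n[N][N] = {              /* n(u1,u2), n(u2,u3), n(u3,u4) */
    {0, 1, 0, 0},
    {0, 0, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 0}
};

int main(void) {
    int y_new[N];
    for (int v = 0; v < N; v++) {
        y_new[v] = 0;
        for (int v1 = 0; v1 < N; v1++)        /* EXISTS v1 ... */
            if (y[v1] && n[v1][v]) { y_new[v] = 1; break; }
    }
    for (int v = 0; v < N; v++)
        printf("y'(u%d) = %d\n", v + 1, y_new[v]);
    return 0;                                  /* y now points to u2 */
}
```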
2.5 Abstraction via Truth-Blurring Embeddings
The abstract stores used for shape-analysis are 3-valued logical structures that,
by the construction discussed below, are a priori of bounded size. In general, each
3-valued logical structure corresponds to a (possibly infinite) set of 2-valued
logical structures. Members of these two families of structures are related by
“truth-blurring embeddings.”
The principle behind truth-blurring embedding is illustrated in Figure 4,
which shows how 2-valued structure Sa\ is abstracted to 3-valued structure
Sa . The abstraction function of a particular shape analysis is determined by a
subset A of the unary predicates. The predicates in A are called the abstraction predicates.1 Given A, the corresponding abstraction function is called the
A-abstraction function (and the act of applying it is called A-abstraction). If
there are instrumentation predicates that are not used as abstraction predicates, we call the abstraction A-abstraction with W, where W is the set I − A.
The abstraction illustrated in Figure 4 is {x, y, t, e}-abstraction.
Abstraction is driven by the values of the “vector” of abstraction predicates
for each individual u—that is, for Sa\ , by the values ι(x)(u), ι( y)(u), ι(t)(u), and
ι(e)(u)—and, in particular, by the equivalence classes formed from the individuals that have the same vector of values for their abstraction predicates. In Sa\ ,
there are two such equivalence classes: {u1}, for which x, y, t, and e are 1, 1, 0, and 0, respectively; and {u2, u3, u4}, for which x, y, t, and e are all 0. (The boxes in the table of unary predicates for Sa\ show how individuals of Sa\ are grouped into two equivalence classes.)

[Footnote 1: Later on, for simplicity, we use all of the unary predicates as abstraction predicates. In Section 2.6, however, we illustrate the ability to define different abstraction functions by varying which unary predicates are used as abstraction predicates.]
All members of such equivalence classes are mapped to the same individual
of the 3-valued structure. Thus, all members of {u2 , u3 , u4 } from Sa\ are mapped
to the same individual in Sa , called u234 ;2 similarly, all members of {u1 } from
Sa\ are mapped to the same individual in Sa , called u1 .
For each nonabstraction predicate of the 2-valued structure, the corresponding predicate in the 3-valued structure is formed by a “truth-blurring quotient.”
— In Sa\, ι Sa\ (n) evaluates to 0 for the only pair of individuals in {u1} × {u1}. Therefore, in Sa the value of ι Sa (n)(u1, u1) is 0.
— In Sa\, ι Sa\ (n) evaluates to 0 for all pairs from {u2, u3, u4} × {u1}. Therefore, in Sa the value of ι Sa (n)(u234, u1) is 0.
— In Sa\, ι Sa\ (n) evaluates to 0 for two of the pairs from {u1} × {u2, u3, u4} (i.e., ι Sa\ (n)(u1, u3) = 0 and ι Sa\ (n)(u1, u4) = 0), whereas ι Sa\ (n) evaluates to 1 for the other pair (i.e., ι Sa\ (n)(u1, u2) = 1); therefore, in Sa the value of ι Sa (n)(u1, u234) is 1/2.
— In Sa\, ι Sa\ (n) evaluates to 0 for some pairs from {u2, u3, u4} × {u2, u3, u4} (e.g., ι Sa\ (n)(u2, u4) = 0), whereas ι Sa\ (n) evaluates to 1 for other pairs (e.g., ι Sa\ (n)(u2, u3) = 1); therefore, in Sa the value of ι Sa (n)(u234, u234) is 1/2.
In Figure 4, the boxes in the tables for predicate n indicate these four groupings of values.
An additional unary predicate, called sm (standing for “summary”), is added
to the 3-valued structure to capture whether individuals of the 3-valued structure represent more than one concrete individual. For instance, ι Sa (sm)(u1 ) = 0
because u1 in Sa represents a single individual of Sa\ . On the other hand, u234
represents three individuals of Sa\ . For technical reasons, sm can be 0 or 1/2,
but never 1; therefore, ι Sa (sm)(u234 ) = 1/2.
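The grouping and the truth-blurring quotient can also be phrased operationally. The C sketch below is our illustration, not the paper's code: it groups the individuals of Sa\ by their vector of abstraction-predicate values, then blurs n and computes sm, with truth values encoded as 0, 0.5, and 1.

```c
/* Illustrative sketch of {x,y,t,e}-abstraction (truth-blurring quotient)
 * written from the description in Section 2.5.  Truth values are encoded
 * as doubles: 0 = false, 0.5 = unknown, 1 = true.                         */
#include <stdio.h>

#define N  4                        /* concrete individuals u1..u4        */
#define NA 4                        /* abstraction predicates x, y, t, e  */

static double A[NA][N] = {          /* abstraction-predicate values       */
    {1, 0, 0, 0},                   /* x */
    {1, 0, 0, 0},                   /* y */
    {0, 0, 0, 0},                   /* t */
    {0, 0, 0, 0}                    /* e */
};
static double n[N][N] = {           /* n(u1,u2), n(u2,u3), n(u3,u4)       */
    {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {0, 0, 0, 0}
};

static int cls[N], num_cls;         /* equivalence class of each individual */

static int same_vector(int u, int w) {
    for (int p = 0; p < NA; p++)
        if (A[p][u] != A[p][w]) return 0;
    return 1;
}

/* blur a set of 2-valued answers into one 3-valued answer */
static double blur(double acc, double val) {
    return (acc < 0) ? val : (acc == val ? acc : 0.5);
}

int main(void) {
    /* 1. canonical grouping by the vector of abstraction-predicate values */
    for (int u = 0; u < N; u++) {
        cls[u] = -1;
        for (int w = 0; w < u; w++)
            if (same_vector(u, w)) { cls[u] = cls[w]; break; }
        if (cls[u] < 0) cls[u] = num_cls++;
    }
    /* 2. truth-blurring quotient of n, plus the summary predicate sm */
    double nq[N][N], sm[N], size[N] = {0};
    for (int a = 0; a < num_cls; a++)
        for (int b = 0; b < num_cls; b++) {
            nq[a][b] = -1;
            for (int u = 0; u < N; u++)
                for (int w = 0; w < N; w++)
                    if (cls[u] == a && cls[w] == b)
                        nq[a][b] = blur(nq[a][b], n[u][w]);
        }
    for (int u = 0; u < N; u++) size[cls[u]]++;
    for (int a = 0; a < num_cls; a++) {
        sm[a] = (size[a] > 1) ? 0.5 : 0;
        printf("class %d: sm = %.1f\n", a, sm[a]);
        for (int b = 0; b < num_cls; b++)
            printf("  n(%d,%d) = %.1f\n", a, b, nq[a][b]);
    }
    return 0;   /* class 0 = {u1}, class 1 = {u2,u3,u4};
                   n(0,0)=0, n(0,1)=0.5, n(1,0)=0, n(1,1)=0.5, as in Fig. 4 */
}
```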
The graphical notation for 3-valued logical structures (cf. structure Sa of
Figure 4) is derived from the one for 2-valued structures, with the following
additions:
—summary nodes (i.e., those for which sm = 1/2) are represented by double
circles;
—unary and binary predicates with value 1/2 are represented by dotted arrows.
Thus, in structure Sa of Figure 4, pointer variables x and y definitely point to
the concrete element represented by u1 , whose n field may point to a concrete
element represented by element u234;2 u234 is a summary node (i.e., it may represent more than one concrete element). Possibly there is an n field in one or more of these concrete elements that points to another of the concrete elements represented by u234, but there cannot be an n field in any of these concrete elements that points to the concrete element represented by u1.

[Footnote 2: The reader should bear in mind that the names of individuals are completely arbitrary: u234 could have been called u17 or u99, etc.; in particular, the subscript “234” is used here only to remind the reader that, in this example, u234 of Sa is the individual that represents {u2, u3, u4} of Sa\. (In many subsequent examples, u234 is named u.)]

Table II. Kleene’s 3-Valued Interpretation of the Propositional Operators

      ∧ |  0    1   1/2         ∨ |  0    1   1/2         ¬ |
     ---+----------------      ---+----------------      ---+-----
      0 |  0    0    0          0 |  0    1   1/2         0 |  1
      1 |  0    1   1/2         1 |  1    1    1          1 |  0
    1/2 |  0   1/2  1/2       1/2 | 1/2   1   1/2       1/2 | 1/2

Fig. 5. The 3-valued logical structures that are obtained by applying truth-blurring embedding to the 2-valued structures that appear in Figure 2, using {x, y, t, e}-abstraction.
Figure 5 shows the 3-valued structures that are obtained by applying truth-blurring embedding to the 2-valued structures that appear in Figure 2, using {x, y, t, e}-abstraction. In addition to the lists of lengths 3 and 4 from Figure 2 (i.e., S3\ and S4\), the 3-valued structure S3 also represents
—the acyclic lists of lengths 5, 6, and so on that are pointed to by x;
—the cyclic lists of length 3 or more that are pointed to by x, such that the
backpointer is not to the head of the list, but to the second, third, or later
element.
Thus, S3 is a finite abstract structure that captures an infinite set of (possibly
cyclic) concrete lists.
The structures S0 , S1 , and S2 represent the cases of acyclic lists of lengths
zero, one, and two, respectively.
2.6 Conservative Extraction of Store Properties
Kleene’s 3-valued interpretation of the propositional operators is given in
Table II. In Section 4.2, we give the Embedding Theorem (Theorem 4.9),
which states that the 3-valued Kleene interpretation in S of every formula
is consistent with the formula’s 2-valued interpretation in every concrete store
that S represents. Thus, questions about properties of stores can be answered
by evaluating formulae using Kleene’s semantics of 3-valued logic.
—If a formula evaluates to 1, then the formula holds in every store represented
by the 3-valued structure.
—If a formula evaluates to 0, then the formula does not hold in any store
represented by the 3-valued structure.
—If a formula evaluates to 1/2, then we do not know if this formula holds in all
stores, does not hold in any store, or holds in some stores and does not hold
in some other stores represented by the 3-valued structure.
Consider the formula ϕcn (v) defined in Equation (2). (“Does heap cell v appear
on a directed cycle of n fields?”) Formula ϕcn(v) evaluates to 0 in S3 for v ↦ u1,
because n+ (u1 , u1 ) evaluates to 0 in Kleene’s semantics.
Formula ϕcn(v) evaluates to 1/2 in S3 for v ↦ u: n+(u, u) evaluates to 1/2
because (i) ι S3 (n)(u, u) = 1/2 and (ii) there is no path of length one or more from
u to u in which all edges have the value 1. Because of this, the evaluation of the
formula does not tell us whether the elements that u represents lie on a cycle:
some may and some may not. This uncertainty implies that (the tail of) the list
pointed to by x might be cyclic.
In many situations, however, we are interested in analyzing the behavior
of a program under the assumption, for example, that the program’s input
is an acyclic list. If an abstraction is not capable of expressing the distinction
between cyclic and acyclic lists, an analysis algorithm based on that abstraction
will usually be able to recover only very imprecise information about the actions
of the program.
For this reason, we are interested in having our parametric framework support abstractions in which, for instance, the acyclic lists are distinguished from
the cyclic lists. Our framework supports such distinctions by allowing the introduction of instrumentation predicates: the vocabulary can be extended with
additional predicate symbols, and the corresponding predicate values are defined by means of formulae.
Example 2.7. Figure 6 illustrates {x, y, t, e}-abstraction with {cn }, where
cn is the unary instrumentation predicate defined by
    cn(v) =def n+(v, v).
Figure 6 shows two shape graphs: Sacyclic , the result of applying this abstraction
to acyclic lists, and Scyclic , the result of applying it to cyclic lists.
—Although Sacyclic , which is obtained by {x, y, t, e}-abstraction with {cn }, looks
like S3 in Figure 5, which is obtained just by {x, y, t, e}-abstraction (without
{cn }), it describes a smaller set of lists, namely, only acyclic lists of length at
least three. The absence of a cn -arrow to u1 expresses the fact that none of
the heap cells summarized by u1 lie on a cycle; the absence of a cn -arrow to
u expresses the fact that none of the heap cells summarized by u lie on a
cycle. In Sacyclic, we have ι Sacyclic (cn)(u1) = 0 and ι Sacyclic (cn)(u) = 0. This implies that Sacyclic can only represent acyclic lists, even though the formula n+(v, v) evaluates to 1/2 on u.

Fig. 6. The 3-valued logical structures that are obtained by applying truth-blurring embedding to the 2-valued structures that represent acyclic and cyclic lists of length 3 or more, using {x, y, t, e}-abstraction with {cn}.
—On the other hand, Scyclic describes lists in which the heap cells represented
by u1 and u all lie on a cycle. These are lists in which the last list element
has a back pointer to the first element of the list. In Scyclic , the fact that
the value of ι Scyclic (cn )(u1 ) is 1 indicates that u1 definitely lies on a cycle, even
though the formula n+ (v, v) evaluates to 1/2 on u1 . In addition, ι Scyclic (cn )(u) =
1, even though the formula n+ (v, v) evaluates to 1/2 on u, which indicates
that all elements of the tails of the lists that Scyclic represents lie on a cycle
as well.
The preceding discussion illustrates the following principle.
OBSERVATION 2.8 (Instrumentation Principle). Suppose that S is a 3-valued
structure that represents the 2-valued structure S \ . By explicitly “storing” in S
the values that a formula ϕ has in S \ , it is sometimes possible to extract more
precise information from S than can be obtained just by evaluating ϕ in S.
Example 2.7 also illustrated how the introduction of an instrumentation
predicate alters the abstraction in use (cf. Figures 5 and 6). A second means
for altering an abstraction is to change which unary predicates are used as
abstraction predicates. Figure 7 illustrates the effect of making the unary instrumentation predicate cn into an abstraction predicate. The two 3-valued
structures shown in Figure 7 result from applying {x, y, t, e}-abstraction with
{cn } and {x, y, t, e, cn }-abstraction to a cyclic list of length at least five in which
the backpointer points somewhere into the middle of the list. When cn is an
additional abstraction predicate, there are two separate summary nodes, one
for which cn is 0 and one for which cn is 1.
In Section 5, several other instrumentation predicates are introduced that
are useful both for analyzing data structures other than singly linked lists, as
well as for increasing the precision of shape-analysis algorithms. By using the
right collection of instrumentation predicates, shape-analysis algorithms can be
created that, in many cases, determine precise shape information for programs
that manipulate several (possibly cyclic) data structures simultaneously. The information obtained is more precise than that obtained from previous work on shape analysis.

Fig. 7. 3-valued structures that illustrate {x, y, t, e}-abstraction with {cn} and {x, y, t, e, cn}-abstraction. The two abstractions have been applied to a cyclic list of length at least 5 in which the backpointer points somewhere into the middle of the list.

Fig. 8. A 2-valued structure for a list pointed to by x, where y points into the middle of the list.
As discussed further in Section 5.3, instrumentation predicates that track information about reachability from pointer variables are particularly important
for avoiding a loss of precision, because they permit the abstract representations of data structures—and different parts of the same data structure—that
are disjoint in the concrete world to be kept separate [Sagiv et al. 1998, p. 38].
A reachability instrumentation predicate rq,n (v) captures whether v is (transitively) reachable from pointer variable q along n fields. This is illustrated in
Figures 8 and 9, which show how a concrete list in which x points to the head
and y points into the middle is mapped to two different 3-valued structures, depending on whether the instrumentation predicates rx,n, ry,n, rt,n, and re,n are used. Note that the situation depicted in Figure 8 is one that occurs in insert as y is advanced down the list. The reachability instrumentation predicates play a crucial role in developing a shape-analysis algorithm that is capable of obtaining precise shape information for insert.

Fig. 9. The 3-valued logical structures that are obtained by applying truth-blurring embedding to the list S6\ from Figure 8, using {x, y, t, e, rx,n, ry,n, rt,n, re,n, cn}-abstraction and {x, y, t, e, cn}-abstraction, respectively.
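The following sketch shows how such reachability predicates could be computed on a concrete store; the six-element layout with x on u1 and y on u4 is our assumption for illustration, not a literal transcription of Figure 8.

```c
/* Hedged sketch (assumed layout): computing the reachability
 * instrumentation predicates r_{x,n} and r_{y,n} on a concrete acyclic
 * list u1 -> ... -> u6, assuming x points to u1 and y points to u4.
 *   r_{q,n}(v) = q(v) OR EXISTS v1 : q(v1) AND n+(v1, v).             */
#include <stdio.h>

#define N 6

static int n[N][N];                 /* n(u_i, u_{i+1}) = 1              */
static int x[N] = {1, 0, 0, 0, 0, 0};
static int y[N] = {0, 0, 0, 1, 0, 0};

/* mark every individual reachable from a cell on which q holds */
static void reach(const int q[N], int r[N]) {
    for (int v = 0; v < N; v++) r[v] = q[v];
    for (int changed = 1; changed; ) {          /* simple fixed point   */
        changed = 0;
        for (int u = 0; u < N; u++)
            for (int v = 0; v < N; v++)
                if (r[u] && n[u][v] && !r[v]) { r[v] = 1; changed = 1; }
    }
}

int main(void) {
    for (int i = 0; i + 1 < N; i++) n[i][i + 1] = 1;
    int rx[N], ry[N];
    reach(x, rx);
    reach(y, ry);
    for (int v = 0; v < N; v++)
        printf("u%d: r_x,n = %d   r_y,n = %d\n", v + 1, rx[v], ry[v]);
    return 0;   /* r_x,n holds everywhere; r_y,n only from u4 onward,
                   so the two parts of the list land in different
                   summary nodes, as in Figure 9                       */
}
```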
2.7 Abstract Interpretation of Program Statements
The most complex issue that we face is the definition of the abstract semantics
of program statements. This abstract semantics has to be (i) conservative (i.e.,
must represent every possible run-time situation), and (ii) should not yield too
many “unknown” values.
The fact that the semantics of statements can be expressed via logical formulae (Observation 2.6), together with the fact that the evaluation of a formula
ϕ in a 3-valued structure S is guaranteed to be safe with respect to the evaluation of ϕ in any 2-valued structure that S represents (the Embedding Theorem)
means that one abstract semantics falls out automatically from the concrete
semantics: one merely has to evaluate the predicate-update formulae of the
concrete semantics on 3-valued structures.
OBSERVATION 2.9 (Reinterpretation Principle). Evaluation of the predicate-update formulae for a statement st in 2-valued logic captures the transfer function for st of the concrete semantics. Evaluation of the same formulae in 3-valued
logic captures the transfer function for st of the abstract semantics.
Fig. 10. Commutative diagram that illustrates the relationship between (i) the transformation on 2-valued structures (defined by predicate-update formulae) that represents the concrete semantics for y = y->n, (ii) abstraction, and (iii) the transformation on 3-valued structures (defined by the same predicate-update formulae) that represents the simple abstract semantics for y = y->n obtained via the Reinterpretation Principle (Observation 2.9). (In this example, {x, y, t, e}-abstraction is used.)

Figure 10 combines Figures 3 and 4 (see column 2 and row 1 of Figure 10, respectively). Column 4 of Figure 10 illustrates how the predicate-update formulae that express the concrete semantics for y = y->n also express a transformation on 3-valued logical structures—that is, an abstract semantics—that is safe with respect to the concrete semantics (cf. Sa\ → Sb\ versus Sa → Sb).3
To keep things simple, the issue of how to update the values of instrumentation
predicates is not addressed here (see Section 5).
As we show, this approach has a number of good properties.
—Because the number of elements in the 3-valued structures that we work
with is bounded, the abstract-interpretation process always terminates.
—The Embedding Theorem implies that the results obtained are conservative.
—By defining appropriate instrumentation predicates, it is possible to emulate
some previous shape-analysis algorithms (e.g., Chase et al. [1990], Jones and
Muchnick [1981], Larus and Hilfinger [1988], and Horwitz et al. [1989]).
Unfortunately, there is also bad news: the method described above and illustrated in Figure 10 can be very imprecise. For instance, the statement y =
y->n illustrated in Figure 10 sets y to the value of y->n; that is, it makes y
point to the next element in the list. In the abstract semantics, the evaluation
in structure Sa of the predicate-update formula y′(v) = ∃v1 : y(v1) ∧ n(v1, v)
causes ι Sb ( y)(u234 ) to be set to 1/2: when ∃v1 : y(v1 ) ∧ n(v1 , v) is evaluated in Sa ,
we have ι Sa ( y)(u1 ) ∧ ι Sa (n)(u1 , u234 ) = 1 ∧ 1/2 = 1/2. Consequently, all we can
surmise after the execution of y = y->n is that y may point to one of the heap
cells that summary node u234 represents (see Sb). (This provides insight into
where the algorithm of Chase et al. [1990] loses precision.)
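The computation just described can be replayed mechanically. The sketch below (ours, not the paper's analyzer) evaluates the same update formula over Sa with Kleene's min/max connectives and reproduces the 1 ∧ 1/2 = 1/2 outcome.

```c
/* A small sketch of the Reinterpretation Principle (Observation 2.9):
 * the predicate-update formula  y'(v) = EXISTS v1 : y(v1) AND n(v1, v)
 * evaluated over the 3-valued structure Sa of Figure 10, with
 * 0 = false, 0.5 = unknown, 1 = true, EXISTS = max, AND = min.        */
#include <stdio.h>

#define N 2                          /* abstract individuals u1, u234  */

static double y[N]    = {1.0, 0.0};  /* y definitely points to u1      */
static double n[N][N] = {
    {0.0, 0.5},                      /* n(u1, u234) is unknown (1/2)   */
    {0.0, 0.5}                       /* n(u234, u234) is unknown (1/2) */
};

static double and3(double a, double b) { return a < b ? a : b; }
static double or3 (double a, double b) { return a > b ? a : b; }

int main(void) {
    const char *name[N] = {"u1", "u234"};
    for (int v = 0; v < N; v++) {
        double yv = 0.0;                       /* identity of max       */
        for (int v1 = 0; v1 < N; v1++)
            yv = or3(yv, and3(y[v1], n[v1][v]));
        printf("y'(%s) = %.1f\n", name[v], yv);
    }
    return 0;   /* y'(u1) = 0.0, y'(u234) = 0.5: after y = y->n the
                   analysis only knows that y may point to a cell that
                   u234 represents (structure Sb in Figure 10).        */
}
```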
In contrast, the truth-blurring embedding of Sb\ is Sc; thus, column 4 and
row 4 of Figure 10 show that the abstract semantics obtained via Observation 2.9 can lead to a structure that is not as precise as the abstract domain is
capable of representing (cf. structures Sc and Sb). This observation motivates
the mechanisms that are introduced in Section 6, where we define an improved
abstract semantics. In particular, the mechanisms introduced in Section 6 are
able to “materialize” new nonsummary nodes from summary nodes as data
structures are traversed. (Thus, Section 6 generalizes the algorithms of Plevyak
et al. [1993] and Sagiv et al. [1998].) As we show, this allows us to determine
more precise shape descriptors for the data structures that arise, for example,
in the insert program. In general, these techniques are important for retaining precision during the analysis of programs that, like insert, traverse linked
data structures and perform destructive updating.
Because the mechanisms described in Section 6 are semantic reductions
[Cousot and Cousot 1979], and because the Reinterpretation Principle falls
out directly from the Embedding Theorem (Theorem 4.9), the correctness argument for the shape-analysis framework is surprisingly simple. (The reader is invited to compare the proof of Theorem 6.29 to that of Theorem 5.3.6 from Sagiv et al. [1998].)

[Footnote 3: The abstraction of Sb\, as described in Section 2.5, is Sc. Figure 10 illustrates that in the abstract semantics we also work with structures that are even further “blurred.” We say that Sc embeds into Sb: u1 in Sc maps to u1 in Sb; u2 and u34 in Sc both map to u234 in Sb; the n predicate of Sb is the “truth-blurring quotient” of n in Sc under this mapping. Our notion of the 2-valued structures that a 3-valued structure represents is actually based on the more general notion of embedding, rather than on the “truth-blurring quotient” (cf. Definition 4.8). Note that in Figure 5, S2 can be embedded into S3; thus, structure S3 also represents the acyclic lists of length 2 that are pointed to by x.]
3. EXPRESSING THE CONCRETE SEMANTICS USING LOGIC
In this section, we define a metalanguage for expressing the concrete operational semantics of programs (and programming languages), and use it to define
a concrete collecting semantics for a simple programming language. The metalanguage is based on first-order logic: each observable property is expressed via
a predicate; the effect of every statement on every predicate’s interpretation is
given by means of a formula. The shape-analysis algorithm is generated from
such a specification.
The rest of this section is organized as follows. Section 3.1 introduces the
syntax of formulae for a first-order logic with transitive closure; the semantics
of this logic is defined in Section 3.2. In Section 3.3, the logic is used to define a concrete operational semantics for statements and conditions of a C-like
language (in particular, with heap-allocated storage and destructive updating
through pointers). Finally, Section 3.4 presents a concrete collecting semantics,
which associates a (potentially infinite) set of logical structures with every program point. (In subsequent sections of the article, algorithms are developed
that compute safe approximations to the collecting semantics.)
3.1 Syntax of First-Order Formulae with Transitive Closure
Let P = { p1 , . . . , pn } be a finite set of predicate symbols. Without loss of generality we exclude constant and function symbols from the logic.4 We write
first-order formulae over P using the logical connectives ∧, ∨, ¬, and the quantifiers ∀ and ∃. The symbol “ = ” denotes the equality predicate. The operator
“TC” denotes transitive closure on formulae.
Formally, the syntax of first-order formulae with equality and transitive closure is defined as follows.
Definition 3.1. A formula over the vocabulary P = { p1 , . . . , pn } is defined
inductively, as follows.
Atomic Formulae. The logical literals 0 and 1 are atomic formulae with no
free variables.
For every predicate symbol p ∈ P of arity k, p(v1 , . . . , vk ) is an atomic formula
with free variables {v1 , . . . , vk }.
The formula (v1 = v2 ) is an atomic formula with free variables {v1 , v2 }.
Logical Connectives. If ϕ1 and ϕ2 are formulae whose sets of free variables
are V1 and V2 , respectively, then (ϕ1 ∧ ϕ2 ), (ϕ1 ∨ ϕ2 ), and (¬ϕ1 ) are formulae with
free variables V1 ∪ V2 , V1 ∪ V2 , and V1 , respectively.
Quantifiers. If ϕ1 is a formula with free variables {v1 , v2 , . . . , vk }, then (∃v1 :
ϕ1 ) and (∀v1 : ϕ1 ) are both formulae with free variables {v2 , v3 , . . . , vk }.
[Footnote 4: Constant symbols can be encoded via unary predicates, and n-ary functions via (n + 1)-ary predicates.]
Transitive Closure. If ϕ1 is a formula with free variables V such that v3, v4 ∉ V, then (TC v1, v2 : ϕ1)(v3, v4) is a formula with free variables (V − {v1, v2}) ∪ {v3, v4}.
A formula is closed when it has no free variables.
We also use several shorthand notations: ϕ1 ⇒ ϕ2 is a shorthand for (¬ϕ1 ∨ ϕ2); ϕ1 ⇔ ϕ2 is a shorthand for (ϕ1 ⇒ ϕ2) ∧ (ϕ2 ⇒ ϕ1); and v1 ≠ v2 is a shorthand for ¬(v1 = v2). For a binary predicate p, p+(v3, v4) is a shorthand for (TC v1, v2 : p(v1, v2))(v3, v4). Finally, we make use of conditional expressions: the expression

    ϕ2 if ϕ1
    ϕ3 otherwise

is a shorthand for (ϕ1 ∧ ϕ2) ∨ (¬ϕ1 ∧ ϕ3).
Table I lists the predicates used for representing the stores manipulated by
programs that use the List data type declaration from Figure 1(a). In the general case, a program may use a number of different struct types. The vocabulary
is then defined as
    P =def {x | x ∈ PVar} ∪ {sel | sel ∈ Sel},        (3)
where PVar is the set of pointer variables in the program, and Sel is the set of
pointer-valued fields in the struct types declared in the program.
3.2 Semantics of First-Order Logic
In this section, we define the (2-valued) semantics for first-order logic with
transitive closure in the standard way.
Definition 3.2. A 2-valued interpretation of the language of formulae over
P is a 2-valued logical structure S = ⟨U^S, ι^S⟩, where U^S is a set of individuals and ι^S maps each predicate symbol p of arity k to a truth-valued function:

    ι^S(p) : (U^S)^k → {0, 1}.

An assignment Z is a function that maps free variables to individuals (i.e., an assignment has the functionality Z : {v1, v2, . . .} → U^S). An assignment that is defined on all free variables of a formula ϕ is called complete for ϕ. (In the remainder of the article, we generally assume that every assignment Z that arises in connection with the discussion of some formula ϕ is complete for ϕ.)
The (2-valued) meaning of a formula ϕ, denoted by [[ϕ]]_2^S(Z), yields a truth value in {0, 1}. The meaning of ϕ is defined inductively as follows.

Atomic Formulae. For an atomic formula consisting of a logical literal l ∈ {0, 1},

    [[l]]_2^S(Z) = l (where l ∈ {0, 1}).

For an atomic formula of the form p(v1, . . . , vk),

    [[p(v1, . . . , vk)]]_2^S(Z) = ι^S(p)(Z(v1), . . . , Z(vk)).

For an atomic formula of the form (v1 = v2),5

    [[v1 = v2]]_2^S(Z) = 1 if Z(v1) = Z(v2), and 0 if Z(v1) ≠ Z(v2).

Logical Connectives. When ϕ is a formula built from subformulae ϕ1 and ϕ2,

    [[ϕ1 ∧ ϕ2]]_2^S(Z) = min([[ϕ1]]_2^S(Z), [[ϕ2]]_2^S(Z))
    [[ϕ1 ∨ ϕ2]]_2^S(Z) = max([[ϕ1]]_2^S(Z), [[ϕ2]]_2^S(Z))
    [[¬ϕ1]]_2^S(Z) = 1 − [[ϕ1]]_2^S(Z).

Quantifiers. When ϕ is a formula that has a quantifier as the outermost operator,

    [[∀v1 : ϕ1]]_2^S(Z) = min_{u ∈ U^S} [[ϕ1]]_2^S(Z[v1 ↦ u])
    [[∃v1 : ϕ1]]_2^S(Z) = max_{u ∈ U^S} [[ϕ1]]_2^S(Z[v1 ↦ u]).

Transitive Closure. When ϕ is a formula of the form (TC v1, v2 : ϕ1)(v3, v4),

    [[(TC v1, v2 : ϕ1)(v3, v4)]]_2^S(Z) =
        max_{n ≥ 1; u1, . . . , un+1 ∈ U^S; Z(v3) = u1, Z(v4) = un+1}  min_{1 ≤ i ≤ n} [[ϕ1]]_2^S(Z[v1 ↦ ui, v2 ↦ ui+1]).

We say that S and Z satisfy ϕ (denoted by S, Z |= ϕ) if [[ϕ]]_2^S(Z) = 1. We write S |= ϕ if for every Z we have S, Z |= ϕ.
We denote the set of 2-valued structures by 2-STRUCT[P].
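For the common shorthand p+(v3, v4), the transitive-closure clause reduces in the 2-valued case to ordinary reachability: the max over all paths of the min over their edges. The sketch below is illustrative only and reuses the 4-element list relation n from earlier examples, computing the closure Warshall-style.

```c
/* A sketch of the Transitive Closure clause of Definition 3.2 for the
 * special case n+(v3, v4): a Warshall-style closure on the 2-valued
 * relation n of the 4-element list used earlier in the article.       */
#include <stdio.h>

#define N 4

static int n[N][N] = {        /* n(u1,u2), n(u2,u3), n(u3,u4) */
    {0, 1, 0, 0},
    {0, 0, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 0}
};

int main(void) {
    int tc[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            tc[i][j] = n[i][j];                 /* paths of length 1     */
    for (int k = 0; k < N; k++)                 /* allow u_k as midpoint */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (tc[i][k] && tc[k][j]) tc[i][j] = 1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            printf("n+(u%d,u%d) = %d%s", i + 1, j + 1, tc[i][j],
                   j == N - 1 ? "\n" : "  ");
    return 0;   /* e.g., n+(u1,u4) = 1 but n+(u2,u2) = 0, so no cycle */
}
```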
As already discussed in Section 2.2, logical structures are used to encode
stores as follows. Individuals represent memory locations in the heap; pointers
from the stack into the heap are represented by unary “pointed-to-by-variable-q” predicates; and pointer-valued fields of data structures are represented by
binary predicates.
Notice that Definitions 3.1 and 3.2 could be generalized to allow many-sorted
sets of individuals. This would be useful for modeling heap cells of different
types; however, to simplify the presentation, we have chosen not to follow this
route.
[Footnote 5: Note that there is a typographical distinction between the syntactic symbol for equality, namely =, and the symbol for the “identically-equal” relation on individuals, namely =. In any case, it should always be clear from the context which symbol is intended.]

3.3 The Meaning of Program Statements

For every statement st, the new values of every predicate p are defined via a predicate-update formula ϕ^st_p.
Table III. Predicate-Update Formulae that Define the Semantics of Statements that Manipulate Pointers and Pointer-Valued Fields

  st                                 ϕ^st_p
  x = NULL                           ϕ^st_x(v) =def 0
  x = t                              ϕ^st_x(v) =def t(v)
  x = t->sel                         ϕ^st_x(v) =def ∃v1 : t(v1) ∧ sel(v1, v)
  x->sel = NULL                      ϕ^st_sel(v1, v2) =def sel(v1, v2) ∧ ¬x(v1)
  x->sel = t                         ϕ^st_sel(v1, v2) =def sel(v1, v2) ∨ (x(v1) ∧ t(v2))
    (assuming that x->sel == NULL)
  x = malloc()                       ϕ^st_x(v) =def isNew(v)
                                     ϕ^st_z(v) =def z(v) ∧ ¬isNew(v), for each z ∈ (PVar − {x})
                                     ϕ^st_sel(v1, v2) =def sel(v1, v2) ∧ ¬isNew(v1) ∧ ¬isNew(v2), for each sel ∈ PSel
Definition 3.3. Let st be a program statement, and for every arity-k predicate p in vocabulary P, let ϕ^st_p be the formula over free variables v1, . . . , vk that defines the new value of p after st. Then the P transformer associated with st, denoted by [[st]] : 2-STRUCT[P] → 2-STRUCT[P], is defined as follows.

    [[st]](S) = ⟨U^S, λp.λu1, . . . , uk . [[ϕ^st_p]]_2^S([v1 ↦ u1, . . . , vk ↦ uk])⟩.

In the remainder of the article, we avoid cluttering the definition of statement transformers by omitting predicate-update formulae for predicates whose value is not changed by the statement, that is, for predicates whose predicate-update formula ϕ^st_p is merely p(v1, v2, . . . , vk).
Example 3.4. Table III lists the predicate-update formulae that define
the operational semantics of the five kinds of statements that manipulate C
structures.
In Table III, and also in later tables, we simplify the presentation of the
semantics by breaking the statement x->sel = t into two parts: (i) x->sel =
NULL, and (ii) x->sel = t, assuming that x->sel == NULL.
Definition 3.3 does not handle statements of the form x = malloc() because
the universe of the structure produced by [[st]](S) is the same as the universe
of S. Instead, for storage-allocation statements we need to use the modified
definition of [[st]](S) given in Definition 3.5, which first allocates a new individual unew , and then invokes predicate-update formulae in a manner similar to
Definition 3.3.
Table IV. Formulae for Four Kinds of Primitive Conditions Involving Pointer Variables

  w              cond(w)
  x == y         ∀v : x(v) ⇔ y(v)
  x != y         ∃v : ¬(x(v) ⇔ y(v))
  x == NULL      ∀v : ¬x(v)
  x != NULL      ∃v : x(v)

Definition 3.5. Let st ≡ x = malloc() and let isNew ∉ P be a unary predicate. For every p ∈ P, let ϕ^st_p be a predicate-update formula over the vocabulary P ∪ {isNew}. Then the P transformer associated with st ≡ x = malloc(), denoted
by [[x = malloc()]], is defined as follows.
    [[x = malloc()]](S) =
      let U′ = U^S ∪ {unew}, where unew is an individual not in U^S,
      and ι′ = λp ∈ (P ∪ {isNew}).λu1, . . . , uk .
                1                          if p = isNew and u1 = unew
                0                          if p = isNew and u1 ≠ unew
                0                          if p ≠ isNew and there exists i, 1 ≤ i ≤ k, such that ui = unew
                ι^S(p)(u1, . . . , uk)     otherwise
      in ⟨U′, λp ∈ P.λu1, . . . , uk . [[ϕ^st_p]]_2^⟨U′,ι′⟩([v1 ↦ u1, . . . , vk ↦ uk])⟩.
In Definition 3.5, ι′ is created from ι as follows: (i) isNew(unew) is set to 1, (ii) isNew(u1) is set to 0 for all other individuals u1 ≠ unew, and (iii) all predicates are set to 0 when one or more arguments is unew. The predicate-update operation in Definition 3.5 is very similar to the one in Definition 3.3 after ι′ has been set. (Note that the p in “ι′ = λp. . . .” ranges over P ∪ {isNew}, whereas the p in “λp. . . .” appearing in the last line of Definition 3.5 ranges over P.)
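The following C sketch spells out the two phases of Definition 3.5 for a tiny assumed vocabulary {x, y, n}; it is an illustration of the definition, not the paper's implementation.

```c
/* Hedged sketch of the x = malloc() transformer from Definition 3.5 and
 * Table III.  Phase 1: extend the universe by u_new and set isNew.
 * Phase 2: apply the update formulae of the malloc row of Table III.    */
#include <stdio.h>
#include <string.h>

#define MAXN 8

static int N = 2;                     /* current universe: u1, u2         */
static int x[MAXN]       = {0, 0};
static int y[MAXN]       = {1, 0};    /* y points to u1                   */
static int n[MAXN][MAXN] = {{0, 1}};  /* n(u1, u2)                        */
static int isNew[MAXN];

static void do_malloc_x(void) {
    int u_new = N++;                          /* let U' = U u {u_new}     */
    memset(isNew, 0, sizeof isNew);
    isNew[u_new] = 1;                         /* iota'(isNew)             */
    for (int v = 0; v < N; v++) {
        x[v] = isNew[v];                      /* x'(v) = isNew(v)         */
        y[v] = y[v] && !isNew[v];             /* y'(v) = y(v) & !isNew(v) */
    }
    for (int v1 = 0; v1 < N; v1++)
        for (int v2 = 0; v2 < N; v2++)        /* n'(v1,v2)                */
            n[v1][v2] = n[v1][v2] && !isNew[v1] && !isNew[v2];
}

int main(void) {
    do_malloc_x();
    for (int v = 0; v < N; v++)
        printf("u%d: x=%d y=%d isNew=%d\n", v + 1, x[v], y[v], isNew[v]);
    return 0;   /* x points to the fresh cell u3; y and n are unchanged on
                   the old individuals, and nothing points to or from u3  */
}
```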
2-valued formulae also provide a way to define the meaning of program conditions. 2-valued (closed) formulae for four kinds of primitive program conditions
that involve pointer variables are shown in Table IV; the formula to express the
meaning of a compound program condition involving pointer variables would
have the formulae from Table IV as constituents. To keep things simple, we
do not use examples in which program conditions have side effects; however, it
should be noted that it is possible to handle side effects in program conditions in
the same way that it is done for statements, namely, by providing appropriate
predicate-update formulae.
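As a small sanity check of the formulae in Table IV, the sketch below evaluates two of the conditions over a concrete store. The encoding of the unary pointer-variable predicates as sets of individuals is an illustrative assumption, not the paper's representation.

```python
# 2-valued meaning of two pointer conditions from Table IV, evaluated over a
# concrete structure in which each pointer variable x is a unary predicate x(v).

def cond_eq_null(x, points_to, universe):
    # x == NULL  corresponds to  forall v : not x(v)
    return all(u not in points_to.get(x, set()) for u in universe)

def cond_eq(x, y, points_to, universe):
    # x == y  corresponds to  forall v : x(v) <=> y(v)
    return all((u in points_to.get(x, set())) == (u in points_to.get(y, set()))
               for u in universe)

universe = {"u1", "u2"}
points_to = {"x": {"u1"}, "y": {"u1"}, "t": set()}
assert cond_eq("x", "y", points_to, universe)        # x == y holds
assert cond_eq_null("t", points_to, universe)        # t == NULL holds
assert not cond_eq_null("x", points_to, universe)    # x != NULL holds
```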
Finally, it should also be noted that the concrete semantics that has been defined is already somewhat abstract.6 By design, the concrete semantics ignores
a number of details.
—The only parts of the store that the concrete semantics keeps track of are the
pointer variables and the cells of heap-allocated storage.
—The concrete semantics does not track changes to stores caused by assignment statements that perform actions other than pointer manipulations (e.g.,
arithmetic, etc.).
6 This is an approach that has also been used in previous work on shape-analysis [Sagiv et al. 1998].
—The concrete semantics is assumed to “go both ways” at a branch point in the
control-flow graph where the program condition involves something other
than pointer-valued quantities (see Section 3.4).
Such assumptions build a small amount of abstraction into the “concrete”
semantics. The consequence of these assumptions is that the collecting semantics defined in Section 3.4 may associate a control-flow-graph vertex with more
concrete stores (i.e., 2-valued structures) than would be the case had we started
with a conventional concrete semantics.
3.4 Collecting Semantics
We now turn to the collecting semantics. For each vertex v of control-flow graph
G, the set ConcStructSet[v] is a (potentially infinite) set of structures that may
arise on entry to v for some potential input. For our purposes, it is convenient
to define ConcStructSet[v] as the least fixed point (in terms of set inclusion) of
the following system of equations (over the variables ConcStructSet[v]).
ConcStructSet[v] =

    {⟨∅, ∅⟩}                                                                          if v = start,

        ⋃_{w→v ∈ E(G), w ∈ As(G)}  { [[st(w)]](S) | S ∈ ConcStructSet[w] }
      ∪ ⋃_{w→v ∈ E(G), w ∈ Id(G)}  { S | S ∈ ConcStructSet[w] }
      ∪ ⋃_{w→v ∈ Tb(G)}            { S | S ∈ ConcStructSet[w] and S |= cond(w) }
      ∪ ⋃_{w→v ∈ Fb(G)}            { S | S ∈ ConcStructSet[w] and S |= ¬cond(w) }     otherwise.    (4)
In Equation (4), As(G) denotes the set of assignment statements that manipulate pointers; Id(G) denotes the set of assignment statements that perform actions other than pointer manipulations, plus the branch points for which the program condition involves something other than just pointer-valued quantities (in both cases, these control-flow-graph vertices are uninterpreted); Tb(G) ⊆ E(G) and Fb(G) ⊆ E(G) are the subsets of G's edges that represent the true and false branches, respectively, from branch points that involve only pointer-valued quantities (cond(w) denotes the formula for the program condition at w). (An edge whose source is an assert statement that involves only pointer-valued quantities would be handled as a true-branch edge.)
4. REPRESENTING SETS OF STORES USING 3-VALUED LOGIC
In this section, we show how 3-valued logical structures can be used to conservatively represent sets of concrete stores. Section 4.1 defines 3-valued logic.
Section 4.2 introduces the concept of embedding, which is used to relate concrete
(2-valued) and abstract (3-valued) structures. In particular, Section 4.2 contains
the Embedding Theorem, which is the main tool for conservative extraction of
Fig. 11. The semi-bilattice of 3-valued logic. (The symbol ∗ attached to 1/2 and 1 indicates that
these are the “designated values,” which correspond to “potential truth.”)
store properties (cf. Section 2.6). The lattice of static information that is used
in Section 6 has as its elements sets of 3-valued structures (ordered by set inclusion). To guarantee that the analysis terminates when applied to a program
that contains a loop, we need a way to ensure that the number of 3-valued
structures that can arise is finite. For this reason, in Section 4.3 we introduce
the set of bounded structures, and show how every 3-valued structure can be
mapped into a bounded structure.
(Section 5 introduces an additional mechanism for refining the abstractions
discussed in the present section.)
4.1 Kleene’s 3-Valued Semantics
In this section, we define Kleene’s 3-valued semantics for first-order logic with transitive closure. We say that the values 0 and 1 are definite values and that 1/2 is an indefinite value, and we define a partial order ⊑ on truth values to reflect information content: l1 ⊑ l2 denotes that l1 has more definite information than l2:

Definition 4.1 (Information Order). For l1, l2 ∈ {0, 1/2, 1}, we define the information order on truth values as follows: l1 ⊑ l2 if l1 = l2 or l2 = 1/2. The symbol ⊔ denotes the least-upper-bound operation with respect to ⊑.

Kleene’s 3-valued semantics of logic is monotonic in the information order (see Table II and Definition 4.2).
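The following minimal sketch (an illustrative encoding in which 1/2 is the float 0.5) spells out both structures on truth values: the Kleene connectives of Table II, which are meet and join in the logical order 0 < 1/2 < 1, and the information order ⊑ and join ⊔ of Definition 4.1.

```python
HALF = 0.5

def k_and(a, b):   # meet in the logical order 0 < 1/2 < 1
    return min(a, b)

def k_or(a, b):    # join in the logical order
    return max(a, b)

def k_not(a):
    return 1 - a

def info_leq(l1, l2):
    # l1 is below l2 in the information order iff l1 = l2 or l2 = 1/2
    return l1 == l2 or l2 == HALF

def info_join(l1, l2):
    # least upper bound in the information order
    return l1 if l1 == l2 else HALF

assert k_and(1, HALF) == HALF and k_or(1, HALF) == 1
assert k_and(HALF, 0) == 0 and k_or(HALF, 0) == HALF
assert info_join(0, 1) == HALF and info_leq(1, HALF)
```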
As shown in Figure 11, the values 0, 1, and 1/2 form a mathematical structure known as a semi-bilattice (see Ginsberg [1988]). A semi-bilattice has two
orderings: the information order and the logical order.
—The information order is the one defined in Definition 4.1, which captures
“(un)certainty.”
—The logical order is the one used in Table II: that is, ∧ and ∨ are meet and join
in the logical order (e.g., 1 ∧ 1/2 = 1/2, 1 ∨ 1/2 = 1, 1/2 ∧ 0 = 0, 1/2 ∨ 0 = 1/2,
etc.).
A value that is “far enough up” in the logical order indicates “potential truth,”
and is called a designated value. In Figure 11, 1/2 and 1 are the designated
values. This means that a structure S potentially satisfies a formula when the
formula’s interpretation with respect to S is either 1/2 or 1 (see Definition 4.2).
We now generalize Definition 3.2 to define the meaning of a formula with
respect to a 3-valued structure. The generalized definition assumes that every 3-valued structure includes a unary predicate sm, which is used to define
the meaning of the syntactic equality symbol (=). As explained earlier, sm
formalizes the notion of “summary nodes” (i.e., individuals of a 3-valued structure that may represent more than one individual from corresponding 2-valued
structures).
Definition 4.2. A 3-valued interpretation of the language of formulae over P is a 3-valued logical structure S = ⟨U^S, ι^S⟩, where U^S is a set of individuals and ι^S maps each predicate symbol p of arity k to a truth-valued function:

    ι^S(p) : (U^S)^k → {0, 1, 1/2}.

For an assignment Z, the (3-valued) meaning of a formula ϕ, denoted by [[ϕ]]^S_3(Z), now yields a truth value in {0, 1, 1/2}. The meaning of ϕ is defined inductively as in Definition 3.2, with the following changes.

Atomic Formulae. For an atomic formula of the form (v1 = v2),

    [[v1 = v2]]^S_3(Z) =   0     if Z(v1) ≠ Z(v2)
                           1     if Z(v1) = Z(v2) and ι^S(sm)(Z(v1)) = 0
                           1/2   otherwise.                                       (5)

We say that S and Z potentially satisfy ϕ, denoted by S, Z |=_3 ϕ, if [[ϕ]]^S_3(Z) = 1/2 or [[ϕ]]^S_3(Z) = 1. We write S |=_3 ϕ if for every Z we have S, Z |=_3 ϕ.

In the following, we denote the set of 3-valued structures by 3-STRUCT[P ∪ {sm}].
In Definition 4.2, the meaning of a formula of the form v1 = v2 is defined in terms of the sm predicate and the “identically-equal” relation on individuals (denoted by the symbol =):
—Nonidentical individuals u1 and u2 are unequal (i.e., if u1 ≠ u2, then [[v1 = v2]]^S_3([v1 ↦ u1, v2 ↦ u2]) evaluates to 0).
—A nonsummary individual must be equal to itself (i.e., if sm(u) = 0, then [[v1 = v2]]^S_3([v1 ↦ u, v2 ↦ u]) evaluates to 1).
—In all other cases, we throw up our hands and return 1/2.
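A direct transcription of Equation (5), under the same illustrative encoding (sm stored as a map from individuals to 0, 1, or 0.5):

```python
HALF = 0.5

def eval_eq(sm, u1, u2):
    if u1 != u2:                 # non-identical individuals are unequal
        return 0
    if sm.get(u1, 0) == 0:       # a non-summary individual equals itself
        return 1
    return HALF                  # a summary node may stand for several cells

sm = {"u1": 0, "u": HALF}        # u is a summary node
assert eval_eq(sm, "u1", "u") == 0
assert eval_eq(sm, "u1", "u1") == 1
assert eval_eq(sm, "u", "u") == HALF
```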
Example 4.3. Consider the structure S3 from Figure 5 and formula (1),

    ϕ_is(v) ≝ ∃v1, v2 : n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2,

which expresses the “is-shared” property. For the assignment Z1 = [v ↦ u], we have

    [[ϕ_is]]^{S3}_3(Z1) = max_{u′,u″ ∈ {u1,u}} [[n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2]]^{S3}_3([v ↦ u, v1 ↦ u′, v2 ↦ u″]) = 1/2,    (6)

and thus S3, Z1 |=_3 ϕ_is. In contrast, for the assignment Z2 = [v ↦ u1], we have

    [[ϕ_is]]^{S3}_3(Z2) = max_{u′,u″ ∈ {u1,u}} [[n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2]]^{S3}_3([v ↦ u1, v1 ↦ u′, v2 ↦ u″]) = 0,

and thus S3, Z2 ⊭_3 ϕ_is.
3-valued logic retains a number of properties that are familiar from 2-valued
logic, such as De Morgan’s laws, associativity of ∧ and ∨, and distributivity of ∧
over ∨ (and vice versa).
Kleene’s semantics is monotonic in the information order:

LEMMA 4.4. Let ϕ be a formula, and let S and S′ be two structures such that U^S = U^{S′} and ι^S ⊑ ι^{S′}. (That is, for each predicate symbol p of arity k, ι^S(p)(u1, ..., uk) ⊑ ι^{S′}(p)(u1, ..., uk).) Then, for every complete assignment Z,

    [[ϕ]]^S_3(Z) ⊑ [[ϕ]]^{S′}_3(Z).    (7)
4.2 Embedding into 3-Valued Structures
In this section, we introduce the concept of embedding, which provides a way to relate 2-valued and 3-valued structures, and we formulate the Embedding Theorem, which relates 2-valued and 3-valued interpretations of a given formula.
Convention. To avoid the need to work with different vocabularies at the
concrete and abstract levels, we assume that the sm predicate is defined in
every concrete 2-valued structure, where it has the trivial fixed meaning of 0 for
all individuals. In the concrete operational semantics, we assume that sm is set
to 0 for the individual allocated by x = malloc(), and is never changed by any
of the other kinds of statements; thus, in the concrete operational semantics, for each non-malloc statement st, the predicate-update formula for sm is always the trivial one: ϕ^st_sm(v) = sm(v).
4.2.1 Embedding Order. We define the embedding ordering on structures as follows.

Definition 4.5. Let S = ⟨U^S, ι^S⟩ and S′ = ⟨U^{S′}, ι^{S′}⟩ be two structures, and let f : U^S → U^{S′} be a surjective function. We say that f embeds S in S′ (denoted by S ⊑_f S′) if (i) for every predicate symbol p ∈ P ∪ {sm} of arity k and all u1, ..., uk ∈ U^S,

    ι^S(p)(u1, ..., uk) ⊑ ι^{S′}(p)(f(u1), ..., f(uk)),    (8)

and (ii) for all u′ ∈ U^{S′},

    (|{u | f(u) = u′}| > 1) ⊑ ι^{S′}(sm)(u′).    (9)

We say that S can be embedded in S′ (denoted by S ⊑ S′) if there exists a function f such that S ⊑_f S′.
Note that inequality (8) applies to sm, as well; therefore, ι^{S′}(sm)(u′) can never be 1.
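The two conditions of Definition 4.5 can be checked mechanically. The sketch below is an illustrative transcription (the structure encoding, and the convention that argument tuples missing from a predicate's map are read as 0, are our own assumptions).

```python
HALF = 0.5

def info_leq(l1, l2):                    # Definition 4.1
    return l1 == l2 or l2 == HALF

def embeds(S, S_prime, f):
    """Does the map f embed S in S'?  Structures are (universe, iota), where
    iota: pred -> {arg-tuple: 0, 0.5, or 1}, with sm stored like any other
    predicate."""
    (U, iota), (U_p, iota_p) = S, S_prime
    if set(f) != set(U) or set(f.values()) != set(U_p):
        return False                     # f must be total and surjective
    # Inequality (8): predicate values may only lose information under f.
    for p, vals in iota.items():
        for args, l in vals.items():
            mapped = tuple(f[a] for a in args)
            if not info_leq(l, iota_p.get(p, {}).get(mapped, 0)):
                return False
    # Inequality (9): if several individuals map to u', sm(u') must cover it.
    for u_p in U_p:
        merged = 1 if sum(1 for u in U if f[u] == u_p) > 1 else 0
        if not info_leq(merged, iota_p["sm"].get((u_p,), 0)):
            return False
    return True
```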
4.2.2 Tight Embedding. A tight embedding is a special kind of embedding, one in which information loss is minimized when multiple individuals of S are mapped to the same individual in S′.

Definition 4.6. A structure S′ = ⟨U^{S′}, ι^{S′}⟩ is a tight embedding of S = ⟨U^S, ι^S⟩ if there exists a surjective function t_embed : U^S → U^{S′} such that, for every p ∈ P of arity k,

    ι^{S′}(p)(u′1, ..., u′k) = ⊔_{(u1, ..., uk) ∈ (U^S)^k s.t. t_embed(ui) = u′i ∈ U^{S′}, 1 ≤ i ≤ k} ι^S(p)(u1, ..., uk),    (10)

and for every u′ ∈ U^{S′},

    ι^{S′}(sm)(u′) = (|{u | t_embed(u) = u′}| > 1) ⊔ ⊔_{u ∈ U^S s.t. t_embed(u) = u′ ∈ U^{S′}} ι^S(sm)(u).    (11)

When a surjective function t_embed possesses both properties (10) and (11), we say that S′ = t_embed(S).

It is immediately apparent from Definition 4.6 that the tight embedding of a structure S by a function t_embed embeds S in t_embed(S) (i.e., S ⊑_{t_embed} t_embed(S)).

It is also apparent from Definition 4.6 how several individuals from U^S can “lose their identity” by being mapped to the same individual in U^{S′}:
Example 4.7. Let u1, u2 ∈ U^S, where u1 ≠ u2, be individuals such that ι^S(sm)(u1) = 0 and ι^S(sm)(u2) = 0 both hold, and where t_embed(u1) = t_embed(u2) = u′. Therefore, ι^{S′}(sm)(u′) = 1/2, and consequently, by (5),

    [[v1 = v2]]^{S′}_3([v1 ↦ u′, v2 ↦ u′]) = 1/2.
In addition to defining what it means for a 2-valued structure to be embedded in a 3-valued structure, Definitions 4.5 and 4.6 also define what it means for a 3-valued structure to be embedded in a 3-valued structure. Equations (9) and (11) have the form given above so that ⊑ is transitive and so that tight embeddings compose properly (i.e., so that t_embed2(t_embed1(S)) = (t_embed2 ∘ t_embed1)(S) holds).
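The following sketch constructs t_embed(S) from a given surjective map, computing Equation (10) for the ordinary predicates and Equation (11) for sm; the encoding (total predicate maps, 1/2 as 0.5) is an illustrative assumption. The example at the end reproduces the situation of Example 4.7.

```python
HALF = 0.5

def info_join(l1, l2):                       # the join of Definition 4.1
    return l1 if l1 == l2 else HALF

def tight_embed(universe, iota, t_embed):
    """Tight embedding induced by the surjective map t_embed (Definition 4.6).
    iota is assumed total: every argument tuple has an explicit entry."""
    U_p = set(t_embed.values())
    iota_p = {}
    # Equation (10): join (in the information order) the values of all tuples
    # that are mapped to the same tuple of S'.
    for p, vals in iota.items():
        if p == "sm":
            continue
        iota_p[p] = {}
        for args, l in vals.items():
            mapped = tuple(t_embed[a] for a in args)
            iota_p[p][mapped] = (info_join(iota_p[p][mapped], l)
                                 if mapped in iota_p[p] else l)
    # Equation (11): sm(u') records the merging itself, joined with any
    # summary status the merged individuals already had.
    iota_p["sm"] = {}
    for u_p in U_p:
        preimage = [u for u in universe if t_embed[u] == u_p]
        val = 1 if len(preimage) > 1 else 0
        for u in preimage:
            val = info_join(val, iota["sm"][(u,)])
        iota_p["sm"][(u_p,)] = val
    return U_p, iota_p

# Example 4.7: merging two distinct non-summary individuals yields sm = 1/2.
U = {"u1", "u2"}
iota = {"sm": {("u1",): 0, ("u2",): 0}, "x": {("u1",): 1, ("u2",): 0}}
U_p, iota_p = tight_embed(U, iota, {"u1": "u'", "u2": "u'"})
assert iota_p["sm"][("u'",)] == HALF and iota_p["x"][("u'",)] == HALF
```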
4.2.3 Concretization of 3-Valued Structures. Embedding also allows us to
define the (potentially infinite) set of concrete structures that a single 3-valued
structure represents.
Definition 4.8 (Concretization of 3-Valued Structures). For a structure S ∈ 3-STRUCT[P], we denote by γ(S) the set of 2-valued structures that S represents, that is,

    γ(S) = {S♮ ∈ 2-STRUCT[P] | S♮ ⊑ S}.    (12)
4.2.4 The Embedding Theorem. Informally, the Embedding Theorem says: If S ⊑_f S′, then every piece of information extracted from S′ via a formula ϕ is a conservative approximation of the information extracted from S via ϕ.

To formalize this, we extend mappings on individuals to operate on assignments. If f : U^S → U^{S′} is a function and Z : Var → U^S is an assignment, f ∘ Z denotes the assignment f ∘ Z : Var → U^{S′} such that (f ∘ Z)(v) = f(Z(v)).
The formal statement of the Embedding Theorem is as follows.

THEOREM 4.9 (Embedding Theorem). Let S = ⟨U^S, ι^S⟩ and S′ = ⟨U^{S′}, ι^{S′}⟩ be two structures, and let f : U^S → U^{S′} be a function such that S ⊑_f S′. Then, for every formula ϕ and complete assignment Z for ϕ, [[ϕ]]^S_3(Z) ⊑ [[ϕ]]^{S′}_3(f ∘ Z).

PROOF. Appears in Appendix B. □

Note that if S is a 2-valued structure, then we have [[ϕ]]^S_2(Z) ⊑ [[ϕ]]^{S′}_3(f ∘ Z).
Example 4.10. Continuing Example 4.7, we can illustrate the Embedding Theorem on the formula ϕ ≡ v1 = v2 and the embedding f ≡ t_embed, as follows.

    0 = [[v1 = v2]]^S_3([v1 ↦ u1, v2 ↦ u2])
      ⊑ [[v1 = v2]]^{S′}_3(t_embed ∘ [v1 ↦ u1, v2 ↦ u2])
      = [[v1 = v2]]^{S′}_3([v1 ↦ t_embed(u1), v2 ↦ t_embed(u2)])
      = [[v1 = v2]]^{S′}_3([v1 ↦ u′, v2 ↦ u′])
      = 1/2

    1 = [[v1 = v2]]^S_3([v1 ↦ u1, v2 ↦ u1])
      ⊑ [[v1 = v2]]^{S′}_3(t_embed ∘ [v1 ↦ u1, v2 ↦ u1])
      = [[v1 = v2]]^{S′}_3([v1 ↦ t_embed(u1), v2 ↦ t_embed(u1)])
      = [[v1 = v2]]^{S′}_3([v1 ↦ u′, v2 ↦ u′])
      = 1/2.
The Embedding Theorem requires that f be surjective in order to guarantee that a quantified formula, such as ∃v : ϕ, has consistent values in S and S′. For example, if f were not surjective, then there could exist an individual u′ ∈ U^{S′}, not in the range of f, such that [[ϕ]]^{S′}_3([v ↦ u′]) = 1. This would permit there to be structures S and S′ for which [[∃v : ϕ]]^S_3(Z) = 0 but [[∃v : ϕ]]^{S′}_3(f ∘ Z) = 1.

Apart from surjectivity, the Embedding Theorem depends on the fact that the 3-valued meaning function is monotonic in its “interpretation” argument (cf. Lemma 4.4).
The use of this machinery provides several advantages for program analysis.
—The Embedding Theorem provides a systematic way to use an abstract (3-valued) structure S to answer questions about properties of the concrete (2-valued) structures that S represents. It ensures that it is safe to evaluate a formula ϕ on a single 3-valued structure S, instead of evaluating ϕ in all structures S♮ that are members of the (potentially infinite) set γ(S). In particular, a definite value for ϕ in S means that ϕ yields the same definite value in all S♮ ∈ γ(S).
—The Embedding Theorem allows us to extract information from either the concrete world or the abstract world via the same formula: the same syntactic expression can be interpreted either in the 2-valued world or the 3-valued world; the consistency of the information obtained is ensured by the Embedding Theorem.
4.2.5 “Summary Nodes” and Equality. Because predicate sm receives special treatment in Definitions 4.5 and 4.6, the definitions of embedding and tight
embedding look a bit awkward. It would be possible to sidestep this by assuming that every structure—2-valued or 3-valued—includes a binary predicate eq
(rather than a unary predicate sm); eq is then used to define the meaning of
the syntactic equality symbol (=). In 2-valued structures, eq merely represents
the “identically-equal” relation on individuals:
    ι^S(eq)(u1, u2) = (u1 = u2).

In embeddings, the status of eq is no different from the other predicates: its value must abide by Equation (8) (or Equation (10), in the case of a tight embedding). In both 2-valued and 3-valued structures, the meaning of the syntactic equality symbol (=) is defined by

    [[v1 = v2]]^S(Z) = ι^S(eq)(Z(v1), Z(v2)).

With this approach, sm can be defined as an instrumentation predicate:

    sm(v) ≝ (v ≠ v).
It is then a consequence of Definition 3.2 that sm always evaluates to 0 in a
2-valued structure. In 3-valued structures created via embedding, Equations (5)
and (9) (as well as Equation (11), in the case of a tight embedding) follow from
the surjectivity of the embedding function and Equation (8) (Equation (10), in
the case of a tight embedding).
One motivation for introducing sm explicitly was to resemble more closely
the shape-analysis algorithms presented in earlier work, which have an explicit
notion of “summary nodes” [Jones and Muchnick 1981; Chase et al. 1990; Sagiv
et al. 1998; Wang 1994].
A second motivation was provided by the fact that the presence of an explicit
sm predicate reduces the amount of space needed to represent 3-valued structures. In 3-valued structures, semantic equality no longer coincides with the
“identically-equal” relation on individuals (cf. Examples 4.7 and 4.10); hence,
some storage must be devoted to representing semantic equality. The unary
predicate sm can be represented in space linear in the number of individuals,
as opposed to the binary predicate eq, which takes quadratic space.
4.3 Bounded Structures
We use the symbol P1 to denote the set of unary predicate symbols of vocabulary
P, and A ⊆ P1 to denote a designated set of abstraction predicate symbols.
To guarantee that shape analysis terminates for a program that contains a
loop, we require that the number of potential structures for a given program be
finite.7 Toward this end, we make the following definition.
Definition 4.11. A bounded structure over vocabulary P ∪ {sm} is a structure S = ⟨U^S, ι^S⟩ such that for every u1, u2 ∈ U^S, where u1 ≠ u2, there exists an abstraction predicate symbol p ∈ A such that ι^S(p)(u1) ≠ ι^S(p)(u2).

In the following, B-STRUCT[P ∪ {sm}] denotes the set of such structures.

The consequence of Definition 4.11 is that there is an upper bound on the size of structures S ∈ B-STRUCT[P ∪ {sm}]; that is, |U^S| ≤ 3^|A|.
Example 4.12. Consider the class of bounded structures associated with the List data type declaration from Figure 1(a). Here the predicate symbols are P = {n} ∪ {x | x ∈ PVar}. For the insert program from Figure 1(b), the program variables are x, y, t, and e, yielding unary predicates x, y, t, and e. Therefore, the maximal number of individuals in a structure is 3^4 = 81. (However, this is a worst-case bound; an application of the analysis does not necessarily create structures that have this many individuals. For instance, at most 6 individuals arise in any structure in the complete analysis of insert.)
4.3.1 Canonical Abstraction. One way to obtain a bounded structure is to map individuals into abstract individuals named by the definite values of the unary predicate symbols. This is formalized in the following definition.

Definition 4.13. The canonical abstraction of a structure S, denoted by t_embed_c(S), is the tight embedding induced by the following mapping:

    t_embed_c(u) = u_{{p ∈ A | ι^S(p)(u) = 1}, {p ∈ A | ι^S(p)(u) = 0}}.

The name u_{{p ∈ A | ι^S(p)(u) = 1}, {p ∈ A | ι^S(p)(u) = 0}} is known as the canonical name of individual u. The subscript on the canonical name of u involves two sets of unary predicate symbols: those that are true at u, and those that are false at u.

Henceforth, we assume in our examples that A = P1 is the set {x, y, t, e, is, c_n, r_{x,n}, r_{y,n}, r_{t,n}, r_{e,n}}; that is, we work with {x, y, t, e, is, c_n, r_{x,n}, r_{y,n}, r_{t,n}, r_{e,n}}-abstraction.

Example 4.14. In structure Sreach from Figure 9, the canonical names of the four individuals are as follows.

    Individual    Canonical Name
    u1            u_{{x, r_{x,n}}, {y, t, e, is, c_n, r_{y,n}, r_{t,n}, r_{e,n}}}
    u             u_{{r_{x,n}}, {x, y, t, e, is, c_n, r_{y,n}, r_{t,n}, r_{e,n}}}
    u2            u_{{y, r_{x,n}, r_{y,n}}, {x, t, e, is, c_n, r_{t,n}, r_{e,n}}}
    u0            u_{{r_{x,n}, r_{y,n}}, {x, y, t, e, is, c_n, r_{t,n}, r_{e,n}}}
7 An alternative would be to define widening operators that guarantee termination [Cousot and Cousot 1979].
Note that t_embed_c can be applied to any 3-valued structure, not just 2-valued structures, and that t_embed_c is idempotent (i.e., t_embed_c(t_embed_c(S)) = t_embed_c(S)).

For any two bounded structures S, S′ ∈ B-STRUCT[P ∪ {sm}], it is possible to check whether S is isomorphic to S′, in time linear in the (explicit) sizes of S and S′, using the following two-phase procedure:
(1) Rename the individuals in U^S and U^{S′} according to their canonical names.
(2) For each predicate symbol p ∈ P ∪ {sm}, check that the predicates ι^S(p) and ι^{S′}(p) are equal.
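Computing canonical names is straightforward; the sketch below (illustrative encoding) builds the map t_embed_c of Definition 4.13, which can then be fed to the tight-embedding construction of Definition 4.6.

```python
def canonical_name(u, iota, A):
    # Predicates of A that are definitely 1 at u, and those definitely 0 at u.
    ones = frozenset(p for p in A if iota[p][(u,)] == 1)
    zeros = frozenset(p for p in A if iota[p][(u,)] == 0)
    return (ones, zeros)

def canonical_map(universe, iota, A):
    """The map t_embed_c of Definition 4.13: individual -> canonical name."""
    return {u: canonical_name(u, iota, A) for u in universe}

# A tiny 2-valued list x -> u1 -> u2 -> u3 with A = {x, y}: u2 and u3 receive
# the same canonical name and would be merged into one summary individual.
universe = {"u1", "u2", "u3"}
iota = {"x": {("u1",): 1, ("u2",): 0, ("u3",): 0},
        "y": {("u1",): 0, ("u2",): 0, ("u3",): 0}}
cmap = canonical_map(universe, iota, ["x", "y"])
assert cmap["u2"] == cmap["u3"] != cmap["u1"]
```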
4.3.2 Relationship of Canonical Abstraction to Previous Work. Canonical abstraction is a generalization of the abstraction functions that have been used in some of the previous work on shape analysis [Jones and Muchnick 1981; Chase et al. 1990; Wang 1994; Sagiv et al. 1998].8 For instance, the abstraction predicates used in Sagiv et al. [1998] are the “pointed-to-by-variable-x” predicates, and thus correspond to the instantiation of canonical abstraction discussed in Example 4.14. Earlier, Jones and Muchnick [1981] proposed making even finer distinctions by keeping exact information on elements within a distance k from a variable. In Wang [1994], in addition to “pointed-to-by-variable-x” predicates, there are predicates of the form “was-pointed-to-by-variable-x-at-program-point-p.”

Definition 4.13 generalizes these ideas to define a set of bounded structures in terms of any fixed set of unary “abstraction properties” on individuals.
OBSERVATION 4.15 (Abstraction Principle). Individuals are partitioned into equivalence classes according to their sets of unary abstraction-property values. Every structure S♮ is then represented (conservatively) by a condensed structure S in which each individual of S represents an equivalence class of individuals from S♮. This method of collapsing structures always yields bounded structures.
Compared to previous work, however, the present article uses canonical abstraction in somewhat different ways.
—Because the concrete and abstract worlds are defined in terms of a single unified concept of logical structures, it is possible to apply t embedc to abstract
(3-valued) structures as well as to concrete (2-valued) ones.
—The present work is not so tightly tied to canonical abstractions. There is
nothing special about a bounded structure that uses canonical names; whenever necessary, canonical names can be recovered from the values of a structure’s unary predicates.
—At various stages, we work with nonbounded structures, and return to
bounded structures by applying t embedc (see Section 6). The Embedding
Theorem ensures that the operations we apply to 3-valued structures are
safe, even when we are working with nonbounded structures.
8 The shape-analysis algorithms presented in Jones and Muchnick [1981], Chase et al. [1990], Wang
[1994], and Sagiv et al. [1998] are described in terms of various kinds of Storage Shape Graphs
(SSGs), not bounded structures. Our comparison is couched in the terminology of the present article.
Table V. Examples of Instrumentation Predicates

    is(v) — Do two or more fields of heap elements point to v?
        Purpose: lists and trees.  Ref.: [Chase et al. 1990], [Sagiv et al. 1998]
    r_{x,n}(v) — Is v (transitively) reachable from pointer variable x along n fields?
        Purpose: separating disjoint data structures.  Ref.: [Sagiv et al. 1998]
    r_n(v) — Is v reachable from some pointer variable along n fields (i.e., is v a non-garbage element)?
        Purpose: compile-time garbage collection.  Ref.: [Jones and Muchnick 1981]
    c_n(v) — Is v on a directed cycle of n fields?
        Purpose: reference counting.
    c_{f.b}(v) — Does a field-f deref. from v, followed by a field-b deref., yield v?
        Purpose: doubly linked lists.  Ref.: [Hendren et al. 1992], [Plevyak et al. 1993]
    c_{b.f}(v) — Does a field-b deref. from v, followed by a field-f deref., yield v?
        Purpose: doubly linked lists.  Ref.: [Hendren et al. 1992], [Plevyak et al. 1993]
5. INSTRUMENTATION PREDICATES
It is possible to improve the precision of the analysis by using an abstract domain that makes finer distinctions among the concrete structures. As discussed
in Section 2.6, the instrumentation principle is the main tool for achieving this;
that is, the notion of which finer distinctions to make is defined using instrumentation predicates, which record information derived from other predicates.9
Formally, we assume that the set of predicates P is partitioned into two disjoint sets: the core predicates, denoted by C, and the instrumentation predicates,
denoted by I. Furthermore, the meaning of every instrumentation predicate is
defined in terms of a formula over the core predicates.10
Example 5.1. Table V lists some examples of instrumentation predicates,
and Table VI gives their defining formulae.
Instrumentation predicates can increase the precision of a program-analysis
algorithm in at least the following ways:
(1) The value stored for an instrumentation predicate in a given 3-valued structure S may be more precise than that obtained by evaluating the instrumentation predicate’s defining formula (Observation 2.8). For example, in
structure Sacyclic from Figure 6, ι Sacyclic (cn )(u) = 0 despite the fact that ϕcn
evaluates to 1/2 on u.
(2) The set of concrete structures that a given 3-valued structure S represents
(as defined by Equation (12)) is, in general, decreased if we augment S
with additional instrumentation predicates that have definite values for at
least some combinations of individuals. For example, structure Sacyclic from
Figure 6 is structure S3 from Figure 5 augmented with instrumentation
predicate cn , for which cn (u1 ) = 0 and cn (u) = 0. Thus, Sacyclic cannot possibly
9 In the literature on logic, these predicates are sometimes called derived predicates.
10 For the sake of abbreviation, it is sometimes convenient to allow an instrumentation predicate's defining formula to be defined in terms of other instrumentation predicates; however, such defining formulae should not be mutually recursive.
Table VI. Formulae for the Instrumentation Predicates Listed in Table V

    ϕ_is(v)        ≝ ∃v1, v2 : n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2               (13)
    ϕ_{r_{x,n}}(v) ≝ x(v) ∨ ∃v1 : x(v1) ∧ n⁺(v1, v)                         (14)
    ϕ_{r_n}(v)     ≝ ⋁_{x ∈ PVar} (x(v) ∨ ∃v1 : x(v1) ∧ n⁺(v1, v))          (15)
    ϕ_{c_n}(v)     ≝ n⁺(v, v)                                               (16)
    ϕ_{c_{f.b}}(v) ≝ ∀v1 : f(v, v1) ⇒ b(v1, v)                              (17)
    ϕ_{c_{b.f}}(v) ≝ ∀v1 : b(v, v1) ⇒ f(v1, v)                              (18)
represent 2-valued structures with cyclic nodes; in contrast, among the
structures that S3 represents is the following concrete cyclic list:
[Figure: a concrete cyclic list with elements u1, u2, and u3, in which x points to u1 and the n-edges form a cycle.]
(3) Unary instrumentation predicates that are used as abstraction predicates
refine the set of bounded structures. For example, using the “is-shared”
predicate is as an abstraction predicate leads to an algorithm that is more
precise than the one given in Sagiv et al. [1998]. The latter does not distinguish between shared and unshared individuals, and thus loses accuracy
for stores that contain a shared heap cell that is not directly pointed to by
a program variable. Adopting is as an additional abstraction predicate improves the accuracy of shape analysis because concrete-store elements that
are shared and concrete-store elements that are not shared are represented
by abstract individuals that have different canonical names.
It is important to note that the instrumentation predicates do not have to
be unary. Furthermore, not all of the unary instrumentation predicates need
necessarily be used as abstraction predicates.
Adding more unary instrumentation predicates and using them as abstraction predicates increases the worst-case cost of the analysis, since the number of individuals in bounded structures is proportional to 3^|A|. However, our initial experience indicates that the opposite happens in practice; by using the “right” unary instrumentation predicates, the cost of the analysis can be significantly decreased (see Section 7.4).
5.1 Updating Instrumentation Predicates
Because each instrumentation predicate is defined by means of a formula over
the core predicates (cf. Table VI), for the concrete semantics there is no need to
specify formulae for updating the instrumentation predicates. However, for the
abstract semantics, the Instrumentation Principle implies that it may be more
precise for a statement transformer to update the values of the instrumentation predicates explicitly. In particular, this is often the case for an instrumentation predicate’s value for a summary node.
Table VII. Predicate-Update Formulae for the Instrumentation Predicate is

    st                              ϕ^st_is(v)
    x->n = NULL                     ϕ^st_is(v) ≝ is(v) ∧ ϕ_is[n ↦ ϕ^st_n]   if ∃v0 : x(v0) ∧ n(v0, v)
                                                 is(v)                       otherwise
    x->n = t                        ϕ^st_is(v) ≝ is(v) ∨ ϕ_is[n ↦ ϕ^st_n]   if ∃v1 : t(v) ∧ n(v1, v)
    (assuming that x->n == NULL)                 is(v)                       otherwise
    x = malloc()                    ϕ^st_is(v) ≝ is(v) ∧ ¬new(v)

In order to update the values of the instrumentation predicates based on the stored values of the instrumentation predicates, as part of instantiating the parametric framework, the designer of a shape analysis must provide, for every predicate p ∈ I and statement st, a predicate-update formula ϕ^st_p that identifies the new value of p after st. It is always possible to define ϕ^st_p to be the formula ϕ_p[c ↦ ϕ^st_c | c ∈ C] (i.e., the formula obtained from ϕ_p by replacing each occurrence of a predicate c ∈ C by ϕ^st_c). This substitution captures the value for c after st has been executed. We refer to ϕ_p[c ↦ ϕ^st_c | c ∈ C] as the trivial update formula for predicate p, since it is equivalent to merely reevaluating p's defining formula in the structure obtained after st has been executed. As demonstrated in Section 2, reevaluation may yield many indefinite values, and hence the trivial update formula is often unsatisfactory. It is preferable, therefore, to devise predicate-update formulae that minimize reevaluations of ϕ_p.
We now state the requirements on predicate-update formulae that the user of our framework needs to show in order to make sure that the analysis is conservative.

Definition 5.2. We say that a predicate-update formula for p maintains the correct instrumentation for statement st if, for all S♮ ∈ 2-STRUCT[P] and all Z,

    [[ϕ^st_p]]_2^{S♮}(Z) = [[ϕ_p]]_2^{[[st]](S♮)}(Z).    (19)

In the above definition, [[st]](S♮) denotes a version of the operation defined in Definitions 3.3 and 3.5 in which P is restricted to C. We make the assumption that the predicate-update formula for an instrumentation predicate is defined solely in terms of core predicates. An instrumentation predicate's formula can always be put in this form by repeated substitution until only core predicates remain.

Henceforth, we assume that for all the instrumentation predicates and all the statements, the predicate-update formulae maintain correct instrumentation. Note that the trivial update formulae do maintain correct instrumentation; however, they may yield very imprecise answers when applied to 3-valued structures.
5.2 Updating Sharing
Table VII gives the predicate-update formulae for the instrumentation predicate is. (It lists formulae only for the statements that may affect the value of is.) The assignment to x->n = NULL can only change the value of is to 0, and only for an element pointed to by x->n. Therefore, in Table VII, ϕ_is[n ↦ ϕ^st_n] is evaluated only for elements pointed to by x->n. Similarly, the assignment x->n = t (assuming that x->n == NULL) can only change the value of is to 1, and only for an element that is pointed to by t and already has at least one incoming edge. Therefore, ϕ_is[n ↦ ϕ^st_n] is evaluated only for elements that are pointed to by t and already have at least one incoming edge.

Table VIII. Predicate-Update Formulae for Instrumentation Predicate r_{z,n}, for Programs Using the List Data Type Declaration from Figure 1(a)

    st               cond.    ϕ^st_{r_{z,n}}(v)
    x = NULL         z ≡ x    0
                     z ≢ x    r_{z,n}(v)
    x = t            z ≡ x    r_{t,n}(v)
                     z ≢ x    r_{z,n}(v)
    x = t->n         z ≡ x    r_{t,n}(v) ∧ (c_n(v) ∨ ¬t(v))
                     z ≢ x    r_{z,n}(v)
    x->n = NULL      z ≡ x    x(v)
                     z ≢ x    ϕ_{r_{z,n}}[n ↦ ϕ^st_n]                                           if c_n(v) ∧ r_{x,n}(v)
                              r_{z,n}(v) ∧ ¬(∃v0 : r_{z,n}(v0) ∧ x(v0) ∧ r_{x,n}(v) ∧ ¬x(v))    otherwise
    x->n = t                  r_{z,n}(v) ∨ (∃v0 : r_{z,n}(v0) ∧ x(v0) ∧ r_{t,n}(v))
    (assuming that x->n == NULL)
    x = malloc()     z ≡ x    new(v)
                     z ≢ x    r_{z,n}(v) ∧ ¬new(v)

Table IX. Predicate-Update Formulae for Instrumentation Predicate c_n, for Programs Using the List Data Type Declaration from Figure 1(a)

    st                              ϕ^st_{c_n}(v)
    x = NULL                        c_n(v)
    x = t                           c_n(v)
    x = t->n                        c_n(v)
    x->n = NULL                     c_n(v) ∧ ¬(∃v0 : x(v0) ∧ c_n(v0) ∧ r_{x,n}(v))
    x->n = t                        c_n(v) ∨ ∃v0 : x(v0) ∧ r_{t,n}(v0) ∧ r_{t,n}(v)
    (assuming that x->n == NULL)
    x = malloc()                    c_n(v) ∧ ¬new(v)
5.3 Updating Reachability and Cyclicity
In this section, we discuss how to define predicate-update formulae ϕ^st_{r_{x,n}}(v) that maintain the correct instrumentation for the r_{x,n} predicates. First, note that in a 3-valued structure S, ϕ_{r_{x,n}}(v) is likely to evaluate to 0 or 1/2 for most individuals: for [[ϕ_{r_{x,n}}]]^S_3([v ↦ u]) to evaluate to 1, there would have to be a path of n-edges that all have the value 1, from the individual pointed to by x to u. However, the Instrumentation Principle comes into play. As we show below, in many cases, by maintaining information about cyclicity in addition to reachability, information about the absence of a cycle can be used to update r_{x,n} directly, without reevaluating ϕ_{r_{x,n}}(v).
For programs that use the List data type declaration from Figure 1(a), predicate-update formulae for r_{x,n}(v) and c_n(v) are given in Tables VIII and IX, respectively. Some of these formulae update the value of r_{x,n}(v) in terms of the value of c_n(v), and vice versa. For example, the predicate-update formula in the x->n = NULL case in Table VIII, when z ≢ x, is based on the observation that it is unnecessary to reevaluate ϕ_{r_{z,n}}(v) whenever v does not occur on a directed cycle or is not reachable from x.

Fig. 12. For the statement x->n = NULL, the above graph illustrates the chief obstacle for updating reachability information. After execution of x->n = NULL, elements u4 and u5 are no longer reachable from z, whereas u2 (and u3) are still reachable from z. Note that beforehand the value of r_{z,n} is the same for u2, u4, and u5. For such individuals, Table VIII obtains the value of r_{z,n} (in the new structure) by evaluating the formula ϕ_{r_{z,n}}[n ↦ ϕ^st_n] (in the above structure).
Let us now consider the predicate-update formulae for rx,n (v) that appear in
Table VIII.
—The statement x = NULL resets rx,n to 0.
—The statement x = t sets rx,n to rt,n .
—The statement x = t->n sets rx,n for all individuals for which rt,n holds, except
for the individual u pointed to by t, unless u appears on a cycle, in which case
rx,n (u) also holds.
—The statement x->n = NULL not only resets the x-reachability property rx,n ,
it may also change rz,n when the element directly pointed to by x is reached
by variable z. Furthermore, as illustrated in Figure 12, in the presence of
cycles it is not always obvious how to determine the exact elements whose
rz,n properties change. Therefore, the predicate-update formula breaks into
two subcases.
— v appears on a directed cycle and is reachable from the individual pointed
to by x. In this case, ϕrz,n (v) is reevaluated (in the structure after the
destructive update). For 3-valued structures, this may lead to a loss of
precision.
— v does not appear on a directed cycle or is not reachable from the individual
pointed to by x. In this case, v fails to be reachable from z only if the edge
being removed is used on the path from z to v.
—After the statement x = malloc(), the only element reachable from x is the
newly allocated element.
Let us now examine the predicate-update formulae for cn (v) that appear in
Table IX.
—The statements x = NULL, x = t, and x = t->n do not change the n-predicates, and thus have no effect on cyclicity.
— If v0 , the node pointed to by x, appears on a cycle, then the statement
x->n = NULL breaks the cycle involving all the nodes reachable from x. (The
latter cycle is unique, if it exists.) If the node pointed to by x does not appear
on a cycle, this statement has no effect on cyclicity.
—If the node pointed to by x is reachable from t, then the statement x->n = t
creates a cycle involving all the nodes reachable from t. In other cases, no
new cycles are created.
—The statement x = malloc() sets the cyclicity property of the newly allocated
element to 0.
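As a concrete-semantics sanity check of the x->n = t case, the sketch below compares the Table IX update formula, evaluated before the statement, with ϕ_{c_n} reevaluated after it on a single store; this is the flavor of check that Definition 5.2 requires for all stores. The store encoding and helper names are illustrative assumptions.

```python
def n_plus(n_edges, src, dst):
    """Transitive closure n+(src, dst) over a set of directed n-edges."""
    seen, frontier = set(), {src}
    while frontier:
        nxt = {b for (a, b) in n_edges if a in frontier} - seen
        if dst in nxt:
            return True
        seen |= nxt
        frontier = nxt
    return False

def phi_cn(n_edges, v):                       # phi_cn(v) = n+(v, v)
    return n_plus(n_edges, v, v)

def update_cn_assign_n_t(n_edges, x_node, t_node, v):
    # Table IX, x->n = t (assuming x->n == NULL):
    #   c_n(v) or exists v0 : x(v0) and r_{t,n}(v0) and r_{t,n}(v)
    def r_t(u):                               # t(u) or n+(t_node, u)
        return u == t_node or n_plus(n_edges, t_node, u)
    return phi_cn(n_edges, v) or (r_t(x_node) and r_t(v))

# Concrete acyclic store: x -> u1, t -> u2, and n-edges u2 -> u3 -> u1.
n_edges = {("u2", "u3"), ("u3", "u1")}
x_node, t_node = "u1", "u2"
after = n_edges | {(x_node, t_node)}          # effect of x->n = t on the store
for v in ("u1", "u2", "u3"):
    assert update_cn_assign_n_t(n_edges, x_node, t_node, v) == phi_cn(after, v)
```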
6. ABSTRACT SEMANTICS
In this section, we formulate the abstract semantics for the shape-analysis
framework. As an intermediate step, Section 6.1 first describes a simple abstract semantics based on the Reinterpretation Principle (Observation 2.9).
This version shows how the machinery developed thus far fits together, but
serves mainly as a strawman: this first approach yields very imprecise information about programs that perform list traversals and destructive updates
(such as insert), and this failing motivates the development of two new pieces
of machinery (focus and coerce) to refine the strawman approach. The main
ideas behind the more refined approach are sketched in Section 6.2, and formally defined in Sections 6.3 and 6.4. Finally, Section 6.5 defines the more refined semantics, and also illustrates how it is capable of obtaining very precise
shape-analysis information when applied to the analysis of insert.
6.1 A Strawman Shape-Analysis Algorithm
We now define a simple abstract semantics for the shape-analysis framework based on the Reinterpretation Principle (Observation 2.9). The goal is
to associate with each vertex v of control-flow graph G, a finite set of 3-valued structures StructSet[v] that “describes” all of the 2-valued structures in
ConcStructSet[v] (and possibly more). The abstract semantics can be expressed
as the least fixed point (in terms of set inclusion) of the following system of
equations over the variables StructSet[v].
StructSet[v] =

    {⟨∅, ∅⟩}                                                                            if v = start,

        ⋃_{w→v ∈ E(G), w ∈ As(G)}  { t_embed_c([[st(w)]]_3(S)) | S ∈ StructSet[w] }
      ∪ ⋃_{w→v ∈ E(G), w ∈ Id(G)}  { S | S ∈ StructSet[w] }
      ∪ ⋃_{w→v ∈ Tb(G)}            { S | S ∈ StructSet[w] and S |=_3 cond(w) }
      ∪ ⋃_{w→v ∈ Fb(G)}            { S | S ∈ StructSet[w] and S |=_3 ¬cond(w) }         otherwise.    (20)
This equation system closely resembles equation system (4), but has the following differences.
—The notation [[st(w)]]3 denotes the abstract meaning function for statement w; it is identical to the operation [[st(w)]] defined in Definitions 3.3
and 3.5, except that the predicate-update formulae are evaluated in 3-valued
logic.
—Instead of working with concrete 2-valued structures, equation system (20)
operates on bounded structures and applies t embedc after every statement.
—When a formula cond(w) evaluates to 1/2, equation system (20) conservatively propagates information along both the true and the false branches.
The shape-analysis algorithm takes the form of an iterative procedure that
finds the least fixed point of equation system (20). The iteration starts from the
initial assignment StructSet[v] = ∅ for each control-flow-graph vertex v. Because of the t embedc operation, it is possible to check efficiently if two 3-valued
structures are isomorphic.
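A minimal sketch of that iterative procedure is given below; the abstract transformer, t_embed_c, and the isomorphism test are passed in as black boxes, and their interfaces here are our own assumptions rather than the paper's.

```python
def shape_analysis(cfg, start, transformer, t_embed_c, isomorphic, empty_struct):
    """Least-fixed-point computation for equation system (20) by chaotic
    iteration.  cfg maps each vertex to a list of (predecessor, edge_kind);
    `transformer` applies the abstract meaning of an edge and returns a list
    of structures (empty when a branch filters the structure out)."""
    struct_set = {v: [] for v in cfg}
    struct_set[start] = [empty_struct]       # StructSet[start] = { <empty,empty> }

    def add(v, S):
        if any(isomorphic(S, T) for T in struct_set[v]):
            return False
        struct_set[v].append(S)
        return True

    changed = True
    while changed:
        changed = False
        for v in cfg:
            if v == start:
                continue
            for (w, kind) in cfg[v]:
                for S in list(struct_set[w]):
                    for S_out in transformer(kind, S):
                        changed |= add(v, t_embed_c(S_out))
    return struct_set
```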
Termination and safety of the shape-analysis algorithm are argued in the standard manner [Cousot and Cousot 1977]. The algorithm terminates because, for any instantiation of the analysis framework, the set of bounded structures is finite, so the lattice of sets of bounded structures has finite height, and because equation system (20) is monotonic (with respect to set inclusion).
The heart of the safety argument involves showing that the abstract transformer that is applied along each edge of the control-flow graph is conservative
with respect to the corresponding transformer of the concrete semantics.
THEOREM 6.1 (Local Safety Theorem). For all S ∈ 3-STRUCT[P ∪ {sm}]:

If vertex w is a condition, then
(i) If S♮ ∈ γ(S) and S♮ |= cond(w), then S |=_3 cond(w).
(ii) If S♮ ∈ γ(S) and S♮ |= ¬cond(w), then S |=_3 ¬cond(w).

If vertex w is a statement, then
(iii) If S♮ ∈ γ(S), then [[st(w)]](S♮) ∈ γ(t_embed_c([[st(w)]]_3(S))).

PROOF. By Definition 4.8, S♮ ∈ γ(S) means that there is a mapping f such that S♮ ⊑_f S. Thus, properties (i) and (ii) follow immediately from the Embedding Theorem.

If vertex w is a statement, then it follows from the Embedding Theorem and the definitions of [[st(w)]] and [[st(w)]]_3 (Definitions 3.3 and 3.5) that, for any structure S,
—If S♮ ∈ γ(S), then [[st(w)]](S♮) ∈ γ([[st(w)]]_3(S)).
In particular, this means that there is a mapping f such that [[st(w)]](S♮) ⊑_f [[st(w)]]_3(S). However, t_embed_c simply folds together and renames individuals from [[st(w)]]_3(S), so we know that the composed mapping (t_embed_c ∘ f) embeds [[st(w)]](S♮) into t_embed_c([[st(w)]]_3(S)):

    [[st(w)]](S♮) ⊑_{t_embed_c ∘ f} t_embed_c([[st(w)]]_3(S)).

By the definition of γ (Definition 4.8), this implies property (iii). □
Fig. 13. An application of the strawman abstract transformer [[st0 ]]3 for statement st0 : y = y->n
in insert.
THEOREM 6.2 (Global Safety Theorem). Let ConcStructSet[·] and StructSet[·] be the least-fixed-point solutions of equation systems (4) and (20), respectively. Then, for each vertex v of the control-flow graph,

    ConcStructSet[v] ⊆ ⋃_{S ∈ StructSet[v]} γ(S).

PROOF. Immediate from monotonicity and Theorem 6.1 (Local Safety), using Theorem T2 of Cousot and Cousot [1977, p. 252]. □
Other variations on equation system (20) are possible. In particular, the
function t embedc need not be applied after every statement; instead, it could
be applied (i) at every merge point in the control-flow graph, or (ii) only in
loops (e.g., at the target of each backedge in the control-flow graph). It is also
possible to define a Hoare ordering on sets of structures that is induced by the
embedding order.
Definition 6.3. For sets of structures XS1, XS2 ⊆ 3-STRUCT[P], XS1 ⊑ XS2 if and only if ∀S1 ∈ XS1 : ∃S2 ∈ XS2 : S1 ⊑ S2.

In principle, this could be used to remove nonmaximal structures during the course of the analysis; the shape-analysis algorithm would then be an iterative procedure to compute a least fixed point with respect to the Hoare order. However, this would require being able to test the property S1 ⊑ S2 (i.e., whether there exists a function f that embeds S1 into S2).
Unfortunately, the use of the Reinterpretation Principle alone leads to shape-analysis algorithms that return very imprecise information. Figure 13 shows the result of applying [[st0]]_3 to structure Sa, one of the 3-valued structures that arises in the analysis of insert just before y is advanced down the list by the statement st0: y = y->n. Similar to what was illustrated in Figure 10,
the structure Sb that results is not as precise as what the abstract domain
of canonical abstractions is capable of representing: for instance, Sb does not
contain a node that is definitely pointed to by y.
This imprecision leads to problems when a destructive update is performed.
In particular, the first column in Table X shows what happens when the abstract transformers for the five statements that follow the search loop in insert are applied to Sb: because y(v) evaluates to 1/2 for the summary node, we eventually reach the situation shown in the fifth row of structures, in which y, e, r_x, r_y, r_e, r_t, and is are all 1/2 for the summary node. As a result, under the strawman approach, the abstract transformer for y->n = t sets the value of c_n for the summary node to 1/2. Consequently, the strawman analysis fails to determine that the structure returned by insert is an acyclic list.

Table X. Selective applications of the abstract transformers using the strawman and refined approaches, for statements in insert that come after the search loop (for brevity, r_z is used in place of r_{z,n} for all variables z, and node names are not shown).

Fig. 14. One-stage vs. multistage abstract semantics for statement st0: y = y->n. The focus and the coerce operations are introduced in Sections 6.3 and 6.4, respectively. (This example is discussed in further detail in Sections 6.3 and 6.4.)
In contrast, the refined analysis described in the following subsections is able
to determine that at the end of insert the following properties always hold: (i)
x points to an acyclic list that has no shared elements, (ii) y points into the tail
of the x-list, and (iii) the values of e and y->n are equal.
It is worthwhile to note that the precision problem becomes even more acute
for shape-analysis algorithms that, like Chase et al. [1990], do not explicitly
track reachability properties. The reason is that, without reachability, Sb represents situations in which y points to an element that is not even part of the
x-list.
6.2 An Overview of a More Precise Abstract Semantics
In formulating an improved approach, our goal is to retain an important property of the strawman approach, namely, that the transformer for a program
statement falls out automatically from the predicate-update formulae of the
concrete semantics and the predicate-update formulae supplied for the instrumentation predicates. Thus, the main idea behind the more refined approach is to decompose the transformer for st into a composition of several functions, as depicted in Figure 14 and explained below, each of which
falls out automatically from the predicate-update formulae of the concrete semantics and the predicate-update formulae supplied for the instrumentation
predicates.
(1) The operation focus, defined in Section 6.3, refines 3-valued structures so
that the formulae that define the meaning of st evaluate to definite values.
The focus operation thus brings these formulae “into focus.”
(2) The simple abstract meaning function for statement st, [[st]]3 , is then
applied.
(3) The operation coerce, defined in Section 6.4, converts a 3-valued structure into a more precise 3-valued structure by removing certain kinds of
inconsistencies.
It is worthwhile noting that both focus and coerce are semantic-reduction
operations (a concept originally introduced in Cousot and Cousot [1979]). That
is, they convert a set of 3-valued structures into a more precise set of 3-valued
structures that describe the same set of stores. This property, together with
the correctness of the structure transformer [[st]]3 , guarantees that the overall multistage semantics is correct. In the context of a parametric framework
for abstract interpretation, semantic reductions are valuable because they allow the transformers of the abstract semantics to be defined in a modular
fashion.
It is also interesting to compare this approach to the best abstract transformer defined in Cousot and Cousot [1979], which is obtained by applying the
concrete transformer to every concrete store that the input 3-valued structure
S represents (i.e., to all of the 2-valued structures in γ (S)). The best abstract
transformer is conceptually simpler and potentially more precise, but cannot
be directly computed (since γ (S) is potentially infinite). In contrast, focus yields
a finite set of 3-valued structures that represent the same concrete stores as S,
and thus serves as a “partial concretization function.” Since [[st]]3 is conservative
by the Embedding Theorem, the overall result is guaranteed to be conservative. Our experience has been that having focus ensure that the “important”
formulae have definite values is sufficient to ensure that the overall result is
precise enough.
In contrast to focus, coerce does not depend on the particular statement st;
it can be applied at any step, and may improve the precision of the analysis.
In practice, it is often beneficial to also perform an application of coerce just
after the application of focus and before [[st]]3 , so that the order of operations
becomes focus, coerce, [[st]]3 , coerce—followed by t embedc .
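Schematically, the composed transformer can be written as follows (a sketch under the assumption that focus and coerce are supplied as callables returning finite lists of structures, with coerce returning an empty list when it discards an inconsistent structure):

```python
def refined_transformer(focus_st, coerce, apply_st3, t_embed_c):
    """Sketch of the multistage transformer of Figure 14 with the extra coerce
    application: focus, coerce, [[st]]_3, coerce, and finally t_embed_c."""
    def transform(S):
        out = []
        for S_f in focus_st(S):              # bring the focus formulae into focus
            for S_c in coerce(S_f):          # discard or sharpen inconsistent structures
                for S_after in coerce(apply_st3(S_c)):
                    out.append(t_embed_c(S_after))
        return out
    return transform
```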
6.3 Bringing Formulae into Focus
In this section, we define an operation, called focus, that generates a set of
structures on which a given set of formulae F have definite values for all assignments. Unfortunately, in general focus may yield an infinite set of structures. Therefore, in Section 6.3.1, we give a declarative specification of the
desired properties of the focus operation, and in Section 6.3.2, we give an algorithm that implements focus for a certain specific class of formulae that are
needed for shape analysis. The latter algorithm always yields a finite set of
structures.
6.3.1 The Focus Operation. We extend operations on structures to operations on sets of structures in the natural way. For an operation op that returns a set (such as γ, [[st]]_3, etc.),

    ôp(XS) ≝ ⋃_{S ∈ XS} op(S).    (21)

Definition 6.4. Given a set of formulae F, a function op : 3-STRUCT[P] → 2^{3-STRUCT[P]} is a focus operation for F if for every S ∈ 3-STRUCT[P], op(S) satisfies the following requirements.
—op(S) and S represent the same concrete structures; that is, γ(S) = γ̂(op(S)).
—In each of the structures in op(S), every formula ϕ ∈ F has a definite value for every assignment; that is, for every S′ ∈ op(S), ϕ ∈ F, and assignment Z, we have [[ϕ]]^{S′}_3(Z) ≠ 1/2.
In the above definition, Z maps the free variables of a formula ϕ ∈ F to
individuals in structures S′ ∈ op(S). In particular, when ϕ has one designated
free variable v, Z maps v to an individual. As usual, when ϕ is a closed formula,
the quantification over Z is superfluous.
Henceforth, we use the notation focus_F, or simply focus when F is clear from the context, when referring to a focus operation for F in the generic sense. We consider a specific algorithm for focusing shortly.
The first obstacle to developing a general algorithm for focusing is that the
number of resulting structures may be infinite. In many cases (including the
ones used below for shape analysis), this can be overcome by only generating
structures that are maximal (in terms of the embedding order). However, in
some cases the set of maximal structures is infinite, as well. This phenomenon
is illustrated by the following example:
Example 6.5. Consider the following formula,

    ϕ_last(v) ≝ ∀v1 : ¬n(v, v1),

which is true for the last heap cell of an acyclic singly linked list. Focusing on ϕ_last with the structure Sacyclic shown in Figure 6 will lead to an infinite set of maximal structures (lists of length 1, 2, 3, etc.).
To sidestep this obstacle, the focus formulae ϕ used in shape analysis are
determined by the left-hand side L-values and right-hand side R-values of each
kind of statement in the programming language. These are formally defined in
Section 6.3.2 and illustrated in the following example.
Example 6.6. For the statement st0: y = y->n in procedure insert, we focus on the formula

    ϕ0(v) ≝ ∃v1 : y(v1) ∧ n(v1, v),    (22)

which corresponds to the right-hand side R-value of st0 (the heap cell pointed to by y->n). The upper part of Figure 15 illustrates the application of focus_{{ϕ0}}(Sa), where Sa is the structure shown in Figure 13 that occurs in insert just before the first application of statement st0: y = y->n. This results in three structures: Sa,f,0, Sa,f,1, and Sa,f,2.

—In Sa,f,0, [[ϕ0]]^{Sa,f,0}_3([v ↦ u]) equals 0. This structure represents a situation in which the concrete list that x and y point to has only one element, but the store also contains garbage cells, represented by summary node u. (As we show later, this structure is actually inconsistent because of the values of the r_{x,n} and r_{y,n} instrumentation predicates, and will be eliminated from consideration by coerce.)
Fig. 15. The first application of the improved transformer for statement st0: y = y->n in insert.

—In Sa,f,1, [[ϕ0]]^{Sa,f,1}_3([v ↦ u]) equals 1. This covers the case where the list that x and y point to has exactly two elements. For all of the concrete cells that summary node u represents, ϕ0 must evaluate to 1, and so u must represent just a single list node.
—In Sa,f,2, [[ϕ0]]^{Sa,f,2}_3([v ↦ u.0]) equals 0 and [[ϕ0]]^{Sa,f,2}_3([v ↦ u.1]) equals 1. This covers the case where the list that x and y point to is a list of three or more elements. For all of the concrete cells that u.0 represents, ϕ0 must evaluate to 0, and for all of the cells that u.1 represents, ϕ0 must evaluate to 1. This case captures the essence of node materialization as described in Sagiv et al. [1998]: individual u is bifurcated into two individuals.

Notice how focus_{{ϕ0}}(Sa) can be effectively constructed from Sa by considering the reasons why [[ϕ0]]^{Sa}_3(Z) evaluates to 1/2 for various assignments Z. In some cases, [[ϕ0]]^{Sa}_3(Z) already has a definite value; for instance, [[ϕ0]]^{Sa}_3([v ↦ u1]) equals 0, and therefore ϕ0 is already in focus at u1. In contrast, [[ϕ0]]^{Sa}_3([v ↦ u]) equals 1/2. There are three (maximal) structures S that we can construct from Sa in which [[ϕ0]]^S_3([v ↦ u]) has a definite value:
—Sa,f,0, in which ι^{Sa,f,0}(n)(u1, u) is set to 0, and thus [[ϕ0]]^{Sa,f,0}_3([v ↦ u]) equals 0;
—Sa,f,1, in which ι^{Sa,f,1}(n)(u1, u) is set to 1, and thus [[ϕ0]]^{Sa,f,1}_3([v ↦ u]) equals 1;
—Sa,f,2, in which u has been bifurcated into two different individuals, u.0 and u.1. In Sa,f,2, ι^{Sa,f,2}(n)(u1, u.0) is set to 0, and thus [[ϕ0]]^{Sa,f,2}_3([v ↦ u.0]) equals 0, whereas ι^{Sa,f,2}(n)(u1, u.1) is set to 1, and thus [[ϕ0]]^{Sa,f,2}_3([v ↦ u.1]) equals 1.

Of course, there are other structures that can be embedded into Sa that would assign a definite value to ϕ0, but these are not maximal because each of them can be embedded into one of Sa,f,0, Sa,f,1, or Sa,f,2.

Table XI. The Target Formulae for Focus, for Statements and Conditions of a Program that Uses Type List

    st                        Focus Formulae
    x = NULL                  ∅
    x = t                     {t(v)}
    x = t->n                  {∃v1 : t(v1) ∧ n(v1, v)}
    x->n = t                  {x(v), t(v)}
    x = malloc()              ∅
    x == NULL                 {x(v)}
    x != NULL                 {x(v)}
    x == t                    {x(v), t(v)}
    x != t                    {x(v), t(v)}
    Uninterpreted condition   ∅
6.3.2 Selecting the Set of Focus Formulae for Shape Analysis. The greater
the number of formulae on which we focus, the greater the number of distinctions that the shape-analysis algorithm can make, leading to improved
precision. However, using a larger number of focus formulae can increase the
number of structures that arise, thereby increasing the cost of analysis. Our
preliminary experience indicates that in shape analysis there is a simple way
to define the formulae on which to focus that guarantees that the number of
structures generated grows only by a constant factor. The main idea is that in
a statement of the form lhs = rhs, we only focus on formulae that define the
heap cells for the L-value of lhs and the R-value of rhs. Focusing on left-hand
side L-values and right-hand side R-values ensures that the application of the
abstract transformer does not set to 1/2 the entries of core predicates that
correspond to pointer variables and fields that are updated by the statement.
This approach extends naturally to program conditions and to statements that
manipulate multiple left-hand side L-values and right-hand side R-values.
For our simplified language and type List, the target formulae on which to
focus can be defined as shown in Table XI. Let us examine a few of the cases
from Table XI.
—For the statement x = NULL, the set of target formulae is the empty set because neither the left-hand side L-value nor the right-hand side R-value is a
heap cell.
—For the statement x = t->n, the set of target formulae is the singleton set
{∃v1 : t(v1 ) ∧ n(v1 , v)} because the left-hand side L-value cannot be a heap
cell, and the right-hand side R-value is the cell pointed to by t->n.
—For the statement x->n = t, the set of target formulae is the set {x(v), t(v)}
because the left-hand side L-value is the heap cell pointed to by x and the
right-hand side R-value is the heap cell pointed to by t.
—For the condition x == t, the set of target formulae is the set {x(v), t(v)}; the
R-values of the two sides of the conditional expression are the heap cells
pointed to by x and t.
It is not hard to extend Table XI for statements that manipulate more complicated data structures involving chains of selectors. For example, the set of
target formulae for the statement x->a->b = y->c->d->e is
{∃v1 : x(v1 ) ∧ a(v1 , v), ∃v1 , v2 , v3 : y(v1 ) ∧ c(v1 , v2 ) ∧ d (v2 , v3 ) ∧ e(v3 , v)},
because the left-hand side L-value is the heap cell pointed to by x->a, and the
right-hand side R-value is the heap cell pointed to by y->c->d->e.
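As a concrete illustration of this construction, the following small Python sketch builds such a target formula from a pointer variable and a chain of selectors. The formula is produced as a plain string, and the helper name focus_formula_for_chain is ours, not the paper's.

```python
def focus_formula_for_chain(var, selectors):
    """Build the target formula for the heap cell reached from pointer
    variable `var` through a chain of selectors, in the style of Table XI.
    An empty chain gives var(v); otherwise one intermediate variable is
    existentially quantified per selector, and the last edge ends at v."""
    if not selectors:
        return f"{var}(v)"
    vs = [f"v{i}" for i in range(1, len(selectors) + 1)]
    conjuncts = [f"{var}({vs[0]})"]
    for i, sel in enumerate(selectors):
        dst = vs[i + 1] if i + 1 < len(selectors) else "v"
        conjuncts.append(f"{sel}({vs[i]}, {dst})")
    return "∃" + ", ".join(vs) + " : " + " ∧ ".join(conjuncts)

# ∃v1 : t(v1) ∧ n(v1, v)              -- the focus formula for x = t->n
print(focus_formula_for_chain("t", ["n"]))
# ∃v1, v2, v3 : y(v1) ∧ c(v1, v2) ∧ d(v2, v3) ∧ e(v3, v)
print(focus_formula_for_chain("y", ["c", "d", "e"]))
```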
Figure 16 contains an algorithm that implements focus for the type of formulae that arise in Table XI.11 Here we observe that for every set of formulae
F1 ∪ F2 , it is possible to focus on F1 ∪ F2 by first focusing on F1 , and then on F2 .
Thus, it is sufficient to provide an algorithm that focuses on an individual formula ϕ. As shown in Table XI, in shape analysis there are two types of formulae
that must be considered:
— ϕ ≡ x(v), for x ∈ PVar. In this case, FocusVar(S, x) is applied;
— ϕ ≡ ∃v1 : x(v1 ) ∧ n(v1 , v), for x ∈ PVar. In this case, FocusVarDeref(S, x, n) is
applied.
FocusVar repeatedly eliminates more and more indefinite values for x(v) by
creating more and more structures. For every individual u for which ι S (x)(u)
is an indefinite value, two or three structures are created. The function Expand creates a structure in which individual u is bifurcated into two individuals; this captures the essence of shape-node materialization (cf. Sagiv et al.
[1998]).
FocusVarDeref first brings x(v) into focus (by invoking FocusVar), and then
proceeds to eliminate indefinite ι S (n) values.
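The paper's focus algorithm appears in Figure 16, which is not reproduced here; the following Python sketch only mirrors the behavior just described for FocusVar, using a simple dictionary encoding of 3-valued structures (a 'U' list of individuals and an 'iota' table mapping predicate names to tuple-indexed values, with missing tuples read as 0, and the tables for x and sm assumed present). All names and representation choices are ours.

```python
import copy

HALF = 0.5   # the indefinite value 1/2; 0.0 and 1.0 are the definite values

def focus_var(S, x):
    """Sketch of FocusVar(S, x), following the textual description (not
    Figure 16): for each individual at which x is indefinite, two or three
    structures are created, until x(v) is definite everywhere."""
    worklist, answers = [S], []
    while worklist:
        T = worklist.pop()
        u = next((w for w in T['U'] if T['iota'][x].get((w,), 0.0) == HALF), None)
        if u is None:                                   # x is already in focus
            answers.append(T)
            continue
        worklist.append(with_value(T, x, u, 0.0))
        worklist.append(with_value(T, x, u, 1.0))
        if T['iota']['sm'].get((u,), 0.0) == HALF:      # u is a summary node:
            worklist.append(expand(T, x, u))            # bifurcate u into u.0, u.1
    return answers

def with_value(S, p, u, val):
    """Copy of S in which p(u) has the given definite value."""
    T = copy.deepcopy(S)
    T['iota'][p][(u,)] = val
    return T

def expand(S, x, u):
    """Copy of S in which u is bifurcated into u.0 (where x is 0) and u.1
    (where x is 1); every predicate tuple mentioning u is duplicated for both."""
    u0, u1 = u + '.0', u + '.1'
    T = {'U': [w for w in S['U'] if w != u] + [u0, u1], 'iota': {}}
    for p, rows in S['iota'].items():
        T['iota'][p] = {}
        for tup, val in rows.items():
            variants = [()]
            for w in tup:
                choices = [u0, u1] if w == u else [w]
                variants = [v + (c,) for v in variants for c in choices]
            for v in variants:
                T['iota'][p][v] = val
    T['iota'][x][(u0,)], T['iota'][x][(u1,)] = 0.0, 1.0
    return T
```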
Example 6.7. Consider the application of FocusVarDeref(Sa , y, n) for structure Sa from Figure 15. In this case, FocusVar(Sa , y) = {Sa }. When Sa is selected
from WorkSet, structures Sa, f ,0 , Sa, f ,1 , and Sa, f ,2 are created. In the next three
iterations, these structures are moved to AnswerSet.
The following two lemmas guarantee that the algorithm for focus shown in
Figure 16 is correct.
LEMMA 6.8. For ϕ ≡ x(v), and for every structure S ∈ 3-STRUCT[P],
FocusVar(S, x) is a focus operation for {ϕ}.
11. An enhanced focus algorithm, which generalizes the methods described here, is described in Lev-Ami [2000]. This algorithm can be applied to an arbitrary formula, but may not always succeed.
Fig. 16. An algorithm for focus for the two types of formulae that arise in Table XI.
LEMMA 6.9. For ϕ ≡ ∃v1 : x(v1 ) ∧ n(v1 , v), and for every structure S ∈
3-STRUCT[P], FocusVarDeref(S, x, n) is a focus operation for {ϕ}.
It is not hard to see that both FocusVar and FocusVarDeref always return a
finite (although not necessarily maximal) set of structures.
We use Focus to denote the operation that invokes FocusVar or
FocusVarDeref, as appropriate.
6.4 Coercing into More Precise Structures
In this section, we define the operation coerce, which converts a 3-valued structure into a more precise 3-valued structure by removing certain kinds of inconsistencies.
Example 6.10. After focus, the simple transformer [[st]]3 is applied to each
of the structures produced. In the example discussed in Examples 6.6 and 6.7,
[[st0 ]]3 is applied to structures Sa, f ,0 , Sa, f ,1 , and Sa, f ,2 to obtain structures Sa,o,0 ,
Sa,o,1 , and Sa,o,2 , respectively (see Figure 15).
However, this process can produce structures that are not as precise as
we would like. The intuitive reason for this state of affairs is that there can
be interdependences between different properties stored in a structure, and
these interdependences are not necessarily incorporated in the definitions of
the predicate-update formulae. In particular, consider structure Sa,o,2 . In this
structure, the n field of u.0 can point to u.1, which suggests that y may be
pointing to a heap-shared cell. However, this is incompatible with the fact that
ι(is)(u.1) = 0—that is, u.1 cannot represent a heap-shared cell—and the fact
that ι(n)(u1 , u.1) = 1—that is, it is known that u.1 definitely has an incoming n
edge from a cell other than u.0.
Also, the structure Sa,o,0 describes an impossible situation: ι(r y,n )(u) = 1 and
yet u is not reachable (or even potentially reachable) from a heap cell that is
pointed to by y.
In this section, we develop a systematic way to capture interdependences
among the properties stored in 3-valued structures. The mechanism that we
present removes indefinite values that violate certain consistency rules, thereby
“sharpening” the structures that arise during shape analysis. This allows us
to remedy the imprecision illustrated in Example 6.10. In particular, when the
sharpening process is applied to structure Sa,o,2 from Figure 15, the structure
that results is Sb,2 . In this case, the sharpening process discovers that (i) two
of the n-edges with value 1/2 can be removed from Sa,o,2 , and (ii) individual
u.1 can only ever represent a single individual in each of the structures that
Sa,o,2 represents, and hence u.1 should not be labeled as a summary node. These
facts are not something that the mechanisms that have been described in earlier
sections are capable of discovering. Also, the structure Sa,o,0 is discarded by the
sharpening process.
The sharpening mechanism is crucial to the success of the improved shape-analysis framework because it allows a more accurate job of materialization
to be performed than would otherwise be possible. For instance, note how the
sharpened structure, Sb,2 , clearly represents an unshared list of length 3 or
more that is pointed to by x and whose second element is pointed to by y. In
fact, in the abstract domain of canonical abstractions that is being used in
our examples, Sb,2 is the most precise representation possible for the family of
unshared lists of length 3 or more that are pointed to by x and whose second
element is pointed to by y. Without the sharpening mechanism, instantiations
of the framework would rarely be able to determine such things as “The data
structure being manipulated by a certain list-manipulation program is actually
a list.”
This subsection is organized as follows: Section 6.4.1 discusses how structures should obey certain consistency rules. Section 6.4.2 discusses how we can
obtain a system of “compatibility constraints” that formalize such consistency
rules; the constraint system is obtained automatically from formulae that express certain global invariants on concrete stores. Compatibility constraints
are used in Section 6.4.3 to define an operation, called coerce, that “coerces”
a structure into a more precise structure. Finally, in Section 6.4.4, we give an
algorithm for coerce.
6.4.1 Compatibility Constraints. We can, in many cases, sharpen some of
the stored predicate values of 3-valued structures:
Example 6.11. Consider a 2-valued structure S \ that can be embedded in a
3-valued structure S, and suppose that the formula ϕis for “inferring” whether
an individual u is shared evaluates to 1 in S (i.e., [[ϕis (v)]]3S ([v 7→ u]) = 1). By
the Embedding Theorem (Theorem 4.9), ι S \ (is)(u\ ) must be 1 for any individual
u\ ∈ U S \ that the embedding function maps to u.
Now consider a structure S 0 that is equal to S except that ι S 0 (is)(u) is 1/2.
S \ can also be embedded in S 0 . However, the embedding of S \ in S is a "better"
embedding; it is a "tighter embedding" in the sense of Definition 4.6. This has
operational significance: it is needlessly imprecise to work with structure S 0
in which ι S 0 (is)(u) has the value 1/2; instead, we should discard S 0 and work
with S. In general, the “stored predicate” is should be at least as precise as its
inferred value; consequently, if it happens that ϕis evaluates to a definite value
(1 or 0) in a 3-valued structure, we can sharpen the stored predicate is.
Similar reasoning allows us to determine, in some cases, that a structure
is inconsistent. In Sa,o,0 , for instance, ϕr y,n (u) = 0, whereas ι Sa,o,0 (r y,n )(u) is 1;
consequently, Sa,o,0 is a 3-valued structure that does not represent any concrete structures at all! This structure can therefore be eliminated from further
consideration by the abstract-interpretation algorithm.
This reasoning applies to all instrumentation predicates, not just is and rx,n ,
and to both of the definite values, 0 and 1.
The reasoning used in Example 6.11 can be summarized as the following
principle.
OBSERVATION 6.12 (The Sharpening Principle). In any structure S, the value
stored for ι S ( p)(u1 , . . . , uk ) should be at least as precise as the value of p’s
defining formula ϕ p , evaluated at u1 , . . . , uk (i.e., [[ϕ p ]]3S ([v1 7→ u1 , . . . , vk 7→ uk ])).
Furthermore, if ι S ( p)(u1 , . . . , uk ) has a definite value and ϕ p evaluates to an incomparable definite value, then S is a 3-valued structure that does not represent
any concrete structures at all.
This observation motivates the subject of the remainder of this subsection: an investigation of compatibility constraints expressed in terms of a new
connective x .
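A minimal sketch of the Sharpening Principle applied to a single stored predicate entry is shown below; the numeric encoding of Kleene's values (0.0, 0.5, 1.0) is an assumption of the sketch, not the paper's notation.

```python
HALF = 0.5   # Kleene's indefinite value; 0.0 and 1.0 are definite

def sharpen_entry(stored, inferred):
    """One application of the Sharpening Principle: `stored` is the value of
    iota(p)(u1, ..., uk), `inferred` is the 3-valued value of p's defining
    formula at the same tuple.  Returns the sharpened stored value, or None
    if the two are incomparable definite values, in which case the structure
    represents no concrete structure and can be discarded."""
    if stored == HALF:
        return inferred                 # the stored value may be sharpened
    if inferred != HALF and inferred != stored:
        return None                     # incomparable definite values
    return stored                       # already at least as precise

# For S_{a,o,0}: the stored value of r_{y,n}(u) is 1 while the inferred value
# is 0, so sharpen_entry(1.0, 0.0) returns None and the structure is discarded.
```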
Definition 6.13. A compatibility constraint is a term of the form ϕ1 x ϕ2 ,
where ϕ1 is an arbitrary 3-valued formula, and ϕ2 is either an atomic formula
or the negation of an atomic formula over distinct logical variables.
We say that a 3-valued structure S and an assignment Z satisfy ϕ1 x ϕ2 , denoted by S, Z |= ϕ1 x ϕ2 , if whenever Z is an assignment such that [[ϕ1 ]]3S (Z ) =
1, we also have [[ϕ2 ]]3S (Z ) = 1. (Note that if [[ϕ1 ]]3S (Z ) equals 0 or 1/2, S and Z
satisfy ϕ1 x ϕ2 , regardless of the value of [[ϕ2 ]]3S (Z ).)
We say that S satisfies ϕ1 x ϕ2 , denoted by S |= ϕ1 x ϕ2 , if for every Z we
have S, Z |= ϕ1 x ϕ2 . If 6 is a finite set of compatibility constraints, we write
S |= 6 if S satisfies every constraint in 6.
The compatibility constraint that captures the reasoning used in
Example 6.11 is ϕis (v) x is(v). That is, when ϕis evaluates to 1 at u, then is
must evaluate to 1 at u in order to satisfy the constraint. The compatibility
constraint used to capture the similar case of sharpening ι(is)(u) from 1/2 to 0
is ¬ϕis (v) x ¬is(v).
Section 6.4.4 presents a constraint-satisfaction algorithm that repeatedly
searches for assignments Z such that S, Z 6|= ϕ1 x ϕ2 (i.e., [[ϕ1 ]]3S (Z ) = 1, but
[[ϕ2 ]]3S (Z ) 6= 1). This algorithm is used to improve the precision of shape analysis
by (i) sharpening the values of predicates stored in S (when the constraint
violation is repairable), and (ii) eliminating S from further consideration when
the constraint violation is irreparable.
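The following sketch shows the kind of search just described: it enumerates assignments over a universe U and reports those for which ϕ1 x ϕ2 is violated. The two formulae are passed as evaluation callables standing in for [[·]]3 ; this interface is ours, not the paper's.

```python
from itertools import product

def violations(U, free_vars, phi1, phi2):
    """Enumerate the assignments Z for which the compatibility constraint
    phi1 x phi2 is violated, i.e. phi1 evaluates to 1 under Z while phi2 does
    not.  phi1 and phi2 are callables mapping an assignment (a dict from
    variable names to individuals in U) to a value in {0.0, 0.5, 1.0}."""
    for values in product(U, repeat=len(free_vars)):
        Z = dict(zip(free_vars, values))
        if phi1(Z) == 1.0 and phi2(Z) != 1.0:
            yield Z
```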
6.4.2 From Formulae to Constraints. Compatibility constraints provide a
way to express certain properties that are a consequence of the tight-embedding
process, but that would not be expressible with formulae alone. For a 2-valued
structure, x has the same meaning as implication. (That is, if S is a 2-valued
structure, S, Z |= ϕ1 x ϕ2 if and only if S, Z |= ϕ1 ⇒ ϕ2 .) However, for a
3-valued structure, x is stronger than implication: if ϕ1 evaluates to 1 and ϕ2
evaluates to 1/2, the constraint ϕ1 x ϕ2 is not satisfied. More precisely, suppose
that [[ϕ1 ]]3S (Z ) = 1 and [[ϕ2 ]]3S (Z ) = 1/2; the implication ϕ1 ⇒ ϕ2 is satisfied (i.e.,
S, Z |= ϕ1 ⇒ ϕ2 ), but the constraint ϕ1 x ϕ2 is not satisfied (i.e., S, Z 6|= ϕ1 x ϕ2 ).
In general, compatibility constraints are not expressible in Kleene’s logic
(i.e., by means of a formula that simulates the connective x ). The reason is that
formulae are monotonic in the information order (see Lemma 4.4), whereas x is
nonmonotonic in its right-hand side argument. For instance, the constraint
1 x p is satisfied in the structure S = h∅, [ p 7→ 1]i; however, it is not satisfied
in S 0 = h∅, [ p 7→ 1/2]i w S.
Thus, in 3-valued logic, compatibility constraints are in some sense “better”
than formulae. Fortunately, compatibility constraints can be generated automatically from formulae that express certain global invariants on concrete
stores. We call such formulae compatibility formulae. There are two sources
of compatibility formulae:
—the formulae that define the instrumentation predicates; and
—additional formulae (“hygiene conditions”) that formalize the properties of
stores that are compatible with the semantics of C (i.e., with our encoding of
C stores as 2-valued logical structures).
In the remainder of the article, 2-CSTRUCT[P, F ] denotes the set of compatible
2-valued structures that satisfy a given set of compatibility formulae F .12
The following definition supplies a way to convert formulae into constraints.
Definition 6.14. Let ϕ be a closed formula, and (where applicable below)
let a be an atomic formula such that (i) a contains no repetitions of logical
variables, and (ii) a 6≡ sm(v). Then the constraint generated from ϕ, denoted by
r(ϕ), is defined as follows.
r(ϕ) = ϕ1 x a        if ϕ ≡ ∀v1 , . . . vk : (ϕ1 ⇒ a)        (23)
r(ϕ) = ϕ1 x ¬a       if ϕ ≡ ∀v1 , . . . vk : (ϕ1 ⇒ ¬a)       (24)
r(ϕ) = ¬ϕ x 0        otherwise.                              (25)
For a set of formulae F , we define r̂(F ) to be the set of constraints generated
from the formulae in F (i.e., {r(ϕ) | ϕ ∈ F }).
The intuition behind (23) and (24) is that for an atomic predicate, a tight
embedding yields 1/2 only in cases in which a evaluates to 1 on one tuple of
values for v1 , . . . , vk , but evaluates to 0 on a different tuple of values. In this case,
the left-hand side will evaluate to 1/2 as well (see Lemma 6.15 below). Rule (25)
is included to enable an arbitrary formula to be converted to a constraint.
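A small Python sketch of Definition 6.14 is given below. The tiny formula AST (Forall, Implies, Not, Atom) and the Constraint pair are our own encoding; only the pattern matching corresponds to rules (23)–(25).

```python
from collections import namedtuple

# A tiny formula AST of our own; only the shapes needed by rules (23)-(25).
Forall  = namedtuple('Forall', 'vars body')     # ∀ vars : body
Implies = namedtuple('Implies', 'lhs rhs')      # lhs ⇒ rhs
Not     = namedtuple('Not', 'arg')              # ¬ arg
Atom    = namedtuple('Atom', 'pred args')       # p(v1, ..., vk), args a tuple

Constraint = namedtuple('Constraint', 'lhs rhs')   # read as: lhs x rhs

def r(phi):
    """Definition 6.14: the constraint generated from a closed formula phi."""
    if isinstance(phi, Forall) and isinstance(phi.body, Implies):
        rhs = phi.body.rhs
        atom = rhs.arg if isinstance(rhs, Not) else rhs
        if (isinstance(atom, Atom) and atom.pred != 'sm'
                and len(set(atom.args)) == len(atom.args)):    # distinct variables
            return Constraint(phi.body.lhs, rhs)               # rules (23), (24)
    return Constraint(Not(phi), '0')                           # rule (25): lhs x 0

def r_hat(F):
    """The constraints generated from a set of formulae."""
    return [r(phi) for phi in F]
```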
The following lemma guarantees that tight embedding preserves satisfaction
of r̂(F ).
LEMMA 6.15. For every pair of structures S \ ∈ 2-CSTRUCT[P, F ] and S ∈
3-STRUCT[P] such that S is a tight embedding of S \ , S |= r̂(F ).
PROOF. See Appendix C. □
Example 6.16. It is worthwhile to point out that tight embedding need not
preserve implications when the right-hand side is an arbitrary formula. In
particular, it does not hold for disjunctions. Consider the implication formula
∀v : 1 ⇒ p1 (v) ∨ p2 (v)
and the structure S \ = h{u1 , u2 }, ι\ i with two individuals, u1 and u2 , such that
ι\ = [sm 7→ [u1 7→ 0, u2 7→ 0], p1 7→ [u1 7→ 1, u2 7→ 0], p2 7→ [u1 7→ 0, u2 7→ 1]].
Let S be the tight embedding of S \ obtained by mapping both u1 and u2 into
the same individual u1,2 ; that is, S = h{u1,2 }, ιi, where
ι = [sm 7→ [u1,2 7→ 1/2], p1 7→ [u1,2 7→ 1/2], p2 7→ [u1,2 7→ 1/2]].
We see that S \ |= 1 ⇒ p1 (v) ∨ p2 (v) but S 6|= 1 x p1 (v) ∨ p2 (v) since
[[ p1 (v) ∨ p2 (v)]]3S ([v 7→ u1,2 ]) = 1/2, whereas [[1]]3S ([v 7→ u1,2 ]) = 1.
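The sketch below reproduces this example with a naive implementation of tight embedding over a dictionary encoding of structures; the special handling of sm (set to 1/2 whenever several individuals collapse onto one abstract individual) is an assumption that suffices for this example but glosses over the details of Definition 4.6.

```python
HALF = 0.5

def kleene_join(a, b):
    """The join in the information order on {0, 1/2, 1}: equal values are
    kept, differing values become 1/2."""
    return a if a == b else HALF

def tight_embed(S, f):
    """Naive tight embedding of the 2-valued structure S under the surjection
    f: tuples that collapse onto the same abstract tuple have their values
    joined, and sm of an abstract individual is set to 1/2 whenever more than
    one concrete individual maps to it."""
    abstract = sorted(set(f.values()))
    T = {'U': abstract, 'iota': {}}
    for p, rows in S['iota'].items():
        out = {}
        for tup, val in rows.items():
            key = tuple(f[w] for w in tup)
            out[key] = val if key not in out else kleene_join(out[key], val)
        T['iota'][p] = out
    for u in abstract:
        if sum(1 for w in f if f[w] == u) > 1:
            T['iota']['sm'][(u,)] = HALF
    return T

# Example 6.16: collapsing u1 and u2 onto u12 turns the complementary
# definite values of p1 and p2 (and of sm) into 1/2 at u12.
S = {'U': ['u1', 'u2'],
     'iota': {'sm': {('u1',): 0.0, ('u2',): 0.0},
              'p1': {('u1',): 1.0, ('u2',): 0.0},
              'p2': {('u1',): 0.0, ('u2',): 1.0}}}
print(tight_embed(S, {'u1': 'u12', 'u2': 'u12'}))
```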
12. We also use a weaker notion of when a predicate-update formula for p maintains the correct instrumentation for statement st; in particular, in Definition 5.2, the occurrence of 2-STRUCT[P] is replaced by 2-CSTRUCT[P, F ]. The notion of concretization (Definition 4.8) is adjusted in the same manner.
Table XII. The set of formulae listed above the line are the compatibility formulae generated for the instrumentation predicates is, cn , and rx,n . The corresponding compatibility constraints are listed below the line.

∀v : (∃v1 , v2 : n(v1 , v) ∧ n(v2 , v) ∧ v1 6= v2 ) ⇒ is(v)                      (27)
∀v : ¬(∃v1 , v2 : n(v1 , v) ∧ n(v2 , v) ∧ v1 6= v2 ) ⇒ ¬is(v)                    (28)
∀v : n+ (v, v) ⇒ cn (v)                                                          (29)
∀v : ¬n+ (v, v) ⇒ ¬cn (v)                                                        (30)
for each x ∈ PVar, ∀v : (x(v) ∨ ∃v1 : x(v1 ) ∧ n+ (v1 , v)) ⇒ rx,n (v)           (31)
for each x ∈ PVar, ∀v : ¬(x(v) ∨ ∃v1 : x(v1 ) ∧ n+ (v1 , v)) ⇒ ¬rx,n (v)         (32)
--------------------------------------------------------------------------------
(∃v1 , v2 : n(v1 , v) ∧ n(v2 , v) ∧ v1 6= v2 ) x is(v)                           (33)
¬(∃v1 , v2 : n(v1 , v) ∧ n(v2 , v) ∧ v1 6= v2 ) x ¬is(v)                         (34)
n+ (v, v) x cn (v)                                                               (35)
¬n+ (v, v) x ¬cn (v)                                                             (36)
for each x ∈ PVar : (x(v) ∨ ∃v1 : x(v1 ) ∧ n+ (v1 , v)) x rx,n (v)               (37)
for each x ∈ PVar : ¬(x(v) ∨ ∃v1 : x(v1 ) ∧ n+ (v1 , v)) x ¬rx,n (v)             (38)
Compatibility Constraints from Instrumentation Predicates. Our first
source of compatibility formulae is the set of formulae that define the instrumentation predicates. For every instrumentation predicate p ∈ I defined by a
formula ϕ p (v1 , . . . , vk ), we generate a compatibility formula of the form
∀v1 , . . . , vk : ϕ p (v1 , . . . , vk ) ⇔ p(v1 , . . . , vk ).      (26)
So that we can apply Definition 6.14, this is then broken into two implications:
∀v1 , . . . , vk : ϕ p (v1 , . . . , vk ) ⇒ p(v1 , . . . , vk )
∀v1 , . . . , vk : ¬ϕ p (v1 , . . . , vk ) ⇒ ¬ p(v1 , . . . , vk ).
For instance, for the instrumentation predicate is, we use formula (13) for ϕis
to generate compatibility formulae (27) and (28), which lead to compatibility
constraints (33) and (34) (see Table XII).
Compatibility Constraints from Hygiene Conditions. Our second source
of compatibility formulae stems from the fact that not all structures S \ ∈
2-STRUCT[P] represent stores that are compatible with the semantics of C.
For example, stores have the property that each pointer variable points to at
most one element in heap-allocated storage.
Example 6.17. The set of formulae FList , listed above the line in Table XIII,
is a set of compatibility formulae that must be satisfied for a structure to represent a store of a C program that operates on values of the type List defined in
Figure 1(a). Formula (39) captures the condition that concrete stores never contain any summary nodes. Formula (40) captures the fact that every program
variable points to at most one list element. Formula (41) captures a similar
property of the n fields of List structures: Whenever the n field of a list element
is non-NULL, it points to at most one list element. The corresponding compatibility constraints generated according to Definition 6.14 are listed below the
line.
Compatibility Constraints from "Extended Horn Clauses." The constraint-generation rules defined in Definition 6.14 generate interesting constraints
Table XIII. The set of formulae listed above the line, denoted by FList , are compatibility formulae for structures that represent a store of a C program that operates on values of the type List defined in Figure 1(a). The corresponding compatibility constraints r̂(FList ) are listed below the line.

¬∃v : sm(v)                                                    (39)
for each x ∈ PVar, ∀v1 , v2 : x(v1 ) ∧ x(v2 ) ⇒ v1 = v2        (40)
∀v1 , v2 : (∃v3 : n(v3 , v1 ) ∧ n(v3 , v2 )) ⇒ v1 = v2         (41)
----------------------------------------------------------------
(∃v : sm(v)) x 0                                               (42)
for each x ∈ PVar, x(v1 ) ∧ x(v2 ) x v1 = v2                   (43)
(∃v3 : n(v3 , v1 ) ∧ n(v3 , v2 )) x v1 = v2                    (44)
only for certain specific syntactic forms, namely, implications with exactly one
(possibly negated) predicate symbol on the right-hand side. Thus, when we
generate compatibility constraints from compatibility formulae written as implications (cf. Tables XII and XIII), the set of constraints generated depends on
the form in which the compatibility formulae are written. In particular, not all
of the many equivalent forms possible for a given compatibility formula lead
to useful constraints. For instance, r(∀v1 , . . . , vk : (ϕ ⇒ a)) yields the (useful)
constraint ϕ x a, but r(∀v1 , . . . , vk : (¬ϕ ∨ a)) yields the (not useful) constraint
¬(¬ϕ ∨ a) x 0.
This phenomenon can prevent an instantiation of the shape-analysis framework from having a suitable compatibility constraint at its disposal that would
otherwise allow it to sharpen or discard a structure that arises during the
analysis, and hence can lead to a shape-analysis algorithm that is more conservative than we would like. However, when compatibility formulae are written
as “extended Horn clauses” (see Definition 6.18 below), the way around this difficulty is to augment the constraint-generation process to generate constraints
for some of the logical consequences of each compatibility formula. The process
of “generating some of the logical consequences for extended Horn clauses” is
formalized as follows.
Definition 6.18. For a formula ϕ, we define ϕ 1 ≡ ϕ and ϕ 0 ≡ ¬ϕ. We say
that a formula ϕ of the form

∀ . . . : ⋁_{i=1}^{m} (ϕi )^{Bi} ,

where m > 1 and Bi ∈ {0, 1}, is an extended Horn clause. We define the closure
of ϕ, denoted by closure(ϕ), to be the following set of formulae.

closure(ϕ) = { ∀ . . . , ∃v1 , v2 , . . . , vn : (⋀_{i=1, i6= j}^{m} ϕi^{1−Bi} ) ⇒ ϕ j^{Bj}  |  1 ≤ j ≤ m, vk ∈ freeVars(ϕ), vk 6∈ freeVars(ϕ j ) }.    (45)
For a formula ϕ that is not an extended Horn clause, closure(ϕ) = {ϕ}. Finally, for a set of formulae F , we write closure(F ) to denote the application of closure to every formula in F .
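The following sketch generates the closure of an extended Horn clause in the sense of Definition 6.18. Literals are passed as (formula string, Bi, free-variable set) triples, which is our own encoding; running it on clause (51) reproduces formulae (41), (48), and (49), modulo writing ¬(v1 = v2) for v1 6= v2.

```python
def closure(literals, all_vars):
    """Sketch of Definition 6.18 for an extended Horn clause
        ∀ ... : (phi_1)^B_1 ∨ ... ∨ (phi_m)^B_m.
    `literals` is a list of triples (phi_i, B_i, vars_i), where phi_i is a
    formula rendered as a string, B_i in {0, 1}, and vars_i is the set of its
    free variables."""
    def signed(phi, b):
        return phi if b == 1 else '¬(' + phi + ')'
    out = []
    for j, (phi_j, b_j, vars_j) in enumerate(literals):
        hidden = sorted(all_vars - vars_j)       # variables not free in phi_j
        body = ' ∧ '.join(signed(phi_i, 1 - b_i)
                          for i, (phi_i, b_i, _) in enumerate(literals) if i != j)
        antecedent = (('∃' + ', '.join(hidden) + ' : ') if hidden else '') + body
        out.append('∀' + ', '.join(sorted(vars_j)) + ' : ('
                   + antecedent + ') ⇒ ' + signed(phi_j, b_j))
    return out

# Clause (51): ∀v1, v2, v3 : ¬n(v3, v1) ∨ ¬n(v3, v2) ∨ v1 = v2
lits = [('n(v3, v1)', 0, {'v1', 'v3'}),
        ('n(v3, v2)', 0, {'v2', 'v3'}),
        ('v1 = v2',   1, {'v1', 'v2'})]
for formula in closure(lits, {'v1', 'v2', 'v3'}):
    print(formula)   # reproduces (49), (48), and (41), in that order
```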
Table XIV. The compatibility formulae closure(FList ) generated via Definition 6.18 when the two implication formulae in FList , (40) and (41), are expressed as the extended Horn clauses (50) and (51), respectively. (Note that the systematic application of Definition 6.18 leads, in this case, to two pairs of formulae that differ only in the names of their bound variables: (46)/(47) and (48)/(49).)

¬∃v : sm(v)                                                             (39)
for each x ∈ PVar, ∀v1 , v2 : x(v1 ) ∧ x(v2 ) ⇒ v1 = v2                 (40)
for each x ∈ PVar, ∀v2 : (∃v1 : x(v1 ) ∧ v1 6= v2 ) ⇒ ¬x(v2 )           (46)
for each x ∈ PVar, ∀v1 : (∃v2 : x(v2 ) ∧ v1 6= v2 ) ⇒ ¬x(v1 )           (47)
∀v1 , v2 : (∃v3 : n(v3 , v1 ) ∧ n(v3 , v2 )) ⇒ v1 = v2                  (41)
∀v2 , v3 : (∃v1 : n(v3 , v1 ) ∧ v1 6= v2 ) ⇒ ¬n(v3 , v2 )               (48)
∀v1 , v3 : (∃v2 : n(v3 , v2 ) ∧ v1 6= v2 ) ⇒ ¬n(v3 , v1 )               (49)
It is easy to see that the formulae in closure(ϕ) are implied by ϕ.
Example 6.19. The set of formulae listed in Table XIV are the compatibility formulae closure(FList ) generated via Definition 6.18 when the two implication formulae in FList , (40) and (41), are expressed as the following extended Horn clauses (i.e., by rewriting the implications as disjunctions, and then applying De Morgan's laws).

for each x ∈ PVar, ∀v1 , v2 : ¬x(v1 ) ∨ ¬x(v2 ) ∨ v1 = v2 .        (50)
∀v1 , v2 , v3 : ¬n(v3 , v1 ) ∨ ¬n(v3 , v2 ) ∨ v1 = v2 .            (51)
From (50) and (51), Definition 6.18 generates the final six compatibility formulae shown in Table XIV. By Definition 6.14 these yield the following compatibility constraints.
for each x ∈ PVar, x(v1 ) ∧ x(v2 ) x v1 = v2                (43)
for each x ∈ PVar, (∃v1 : x(v1 ) ∧ v1 6= v2 ) x ¬x(v2 )     (52)
for each x ∈ PVar, (∃v2 : x(v2 ) ∧ v1 6= v2 ) x ¬x(v1 )     (53)
(∃v3 : n(v3 , v1 ) ∧ n(v3 , v2 )) x v1 = v2                 (44)
(∃v1 : n(v3 , v1 ) ∧ v1 6= v2 ) x ¬n(v3 , v2 )              (54)
(∃v2 : n(v3 , v2 ) ∧ v1 6= v2 ) x ¬n(v3 , v1 ).             (55)
Similarly, after (27) is rewritten as the following extended Horn clause
∀v, v1 , v2 : ¬n(v1 , v) ∨ ¬n(v2 , v) ∨ v1 = v2 ∨ is(v),
we obtain the following compatibility constraints.
(∃v1 , v2 : n(v1 , v) ∧ n(v2 , v) ∧ v1 6= v2 ) x is(v)      (33)
(∃v1 : n(v1 , v) ∧ v1 6= v2 ∧ ¬is(v)) x ¬n(v2 , v)          (56)
(∃v2 : n(v2 , v) ∧ v1 6= v2 ∧ ¬is(v)) x ¬n(v1 , v)          (57)
(∃v : n(v1 , v) ∧ n(v2 , v) ∧ ¬is(v)) x v1 = v2 .           (58)
As we show in Section 6.4.4, the use of constraints—and, in particular, the
ones created from formulae generated by closure—plays a crucial role in the
shape-analysis framework. In particular, constraint (56) (or the equivalent constraint (57)) allows a more accurate job of materialization to be performed than
would otherwise be possible: When is(u) is 0 and one incoming n edge to u is 1,
to satisfy constraint (56) a second incoming n edge to u cannot have the value
1/2; it must have the value 0. That is, the latter edge cannot exist (cf. Examples 6.10 and 6.26). This allows edges to be removed (safely) that a more naive
materialization process would retain (cf. structures Sa,o,2 and Sb,2 in Figure 15),
and permits the improved shape-analysis algorithm to generate more precise
structures for insert than the ones generated by the simple shape-analysis
algorithm described in Sections 2 and 6.1.
Henceforth, we assume that closure has been applied to all sets of compatibility formulae.
Definition 6.20 (Compatible 3-Valued Structures). Given a set of compatibility formulae F , the set of compatible 3-valued structures 3-CSTRUCT[P, r̂(F )] ⊆ 3-STRUCT[P] is defined by S ∈ 3-CSTRUCT[P, r̂(F )] if and only if S |= r̂(F ).
The following lemma ensures that we can always replace a structure by a compatible one that satisfies constraint-set r̂(F ) without losing
information.
LEMMA 6.21. For every structure S ∈ 3-STRUCT[P] and concrete structure S \ ∈ γ (S), there exists a structure S 0 ∈ 3-CSTRUCT[P, r̂(F )] such that (i) U S 0 = U S , (ii) S 0 v S, and (iii) S \ ∈ γ (S 0 ).
PROOF. Let S \ ∈ γ (S); then by Definition 4.8 (as modified by footnote 12), S \ |= F and there exists a function f : U S \ → U S such that S \ v f S. Define S 0 = f (S \ ) (i.e., S 0 is the tight embedding of S \ under f ). By Lemma 6.15, S 0 satisfies the necessary requirements. □
In Section 6.4.4, we give an algorithm that constructs from S and r̂(F ) a
maximal S 0 meeting the conditions of Lemma 6.21 (without investigating the
possibly infinite set of actual concrete structures S \ ∈ γ (S)).
6.4.3 The Coerce Operation. We are now ready to show how the coerce
operation works.
Example 6.22. Consider structure Sa,o,2 from Figure 15. This structure violates constraint (56) for the assignment [v 7→ u.1, v2 7→ u.0] when the variable v1 of the existential quantifier is bound to u1 : because ι(n)(u1 , u.1) = 1,
u1 6= u.0, and ι(is)(u.1) = 0, but ι(n)(u.0, u.1) = 1/2, constraint (56) is not satisfied; the left-hand side evaluates to 1, whereas the right-hand side evaluates
to 1/2.
This example motivates the following definition.
Definition 6.23. The operation

coercer̂(F ) : 3-STRUCT[P] → 3-CSTRUCT[P, r̂(F )] ∪ {⊥}

is defined as follows: coercer̂(F ) (S) = the maximal S 0 such that S 0 v S, U S 0 = U S , and S 0 ∈ 3-CSTRUCT[P, r̂(F )], or ⊥ if no such S 0 exists. (We simply write coerce when r̂(F ) is clear from the context.)
It is a fact that the maximal such structure S 0 is unique (if it exists), which
follows from the observation that compatible structures with the same universe
of individuals are closed under the following join operation:
Definition 6.24. For every pair of structures S1 , S2 ∈ 3-CSTRUCT[P, r̂(F )]
such that U S1 = U S2 = U , the join of S1 and S2 , denoted by S1 t S2 , is defined
as follows.
S1 t S2 = hU, λp.λu1 , u2 , . . . , um . ι S1 ( p)(u1 , u2 , . . . , um ) t ι S2 ( p)(u1 , u2 , . . . , um )i.
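A direct transcription of this definition into Python is shown below, under the same dictionary encoding of structures used in the earlier sketches (missing tuples are read as 0, which is an encoding convention of the sketch).

```python
HALF = 0.5

def info_join(a, b):
    """The join t in the information order on {0, 1/2, 1}: equal values are
    kept, differing values become 1/2."""
    return a if a == b else HALF

def join_structures(S1, S2):
    """Definition 6.24, transcribed: pointwise join of two structures over
    the same universe."""
    assert S1['U'] == S2['U']
    T = {'U': S1['U'], 'iota': {}}
    for p in set(S1['iota']) | set(S2['iota']):
        rows1, rows2 = S1['iota'].get(p, {}), S2['iota'].get(p, {})
        T['iota'][p] = {tup: info_join(rows1.get(tup, 0.0), rows2.get(tup, 0.0))
                        for tup in set(rows1) | set(rows2)}
    return T
```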
LEMMA 6.25. For every pair of structures S1 , S2 ∈ 3-CSTRUCT[P, r̂(F )]
such that U S1 = U S2 = U , the structure S1 t S2 is also in 3-CSTRUCT[P, r̂(F )].
PROOF. See Appendix C. □
Because coerce can result in at most one structure, its definition does not
involve a set former, in contrast to focus, which can return a nonsingleton set.
The significance of this is that only focus can increase the number of structures
that arise during shape analysis, whereas coerce cannot.
Example 6.26. The application of coerce to the structures Sa,o,0 , Sa,o,1 , and
Sa,o,2 yields Sb,1 and Sb,2 , as shown in the bottom block of Figure 15.
—The structure Sa,o,0 is discarded because there exists no structure that can
be embedded into it that satisfies constraint (38).
—The structure Sb,1 was obtained from Sa,o,1 by removing incompatibilities as
follows:
(1) Consider the assignment [v 7→ u, v1 7→ u1 , v2 7→ u]. Because ι(n)(u1 , u) =
1, u1 6= u, and ι(is)(u) = 0, constraint (56) implies that ι(n)(u, u) must
equal 0. Thus, in Sb,1 the (indefinite) n edge from u to u has been removed.
(2) Consider the assignment [v1 7→ u, v2 7→ u]. Because ι( y)(u) = 1, constraint (43) implies that [[v1 = v2 ]]3Sb,1 ([v1 7→ u, v2 7→ u]) must equal 1. By
Definition 4.2, this means that ι Sb,1 (sm)(u) must equal 0. Thus, in Sb,1 u
is no longer a summary node.
—The structure Sb,2 was obtained from Sa,o,2 by removing incompatibilities as
follows:
(1) Consider the assignment [v 7→ u.1, v1 7→ u1 , v2 7→ u.0]. Because
ι(n)(u1 , u.1) = 1, u1 6= u.0, and ι(is)(u.1) = 0, constraint (56) implies that
ι Sb,2 (n)(u.0, u.1) must equal 0. Thus, in Sb,2 the (indefinite) n edge from
u.0 to u.1 has been removed.
(2) Consider the assignment [v 7→ u.1, v1 7→ u1 , v2 7→ u.1]. Because
ι(n)(u1 , u.1) = 1, u1 6= u.1, and ι(is)(u.1) = 0, constraint (56) implies that
ι Sb,2 (n)(u.1, u.1) must equal 0. Thus, in Sb,2 the (indefinite) n edge from
u.1 to u.1 has been removed.
(3) Consider the assignment [v1 7→ u.1, v2 7→ u.1]. Because ι( y)(u.1) = 1, constraint (43) implies that [[v1 = v2 ]]3Sb,2 ([v1 7→ u.1, v2 7→ u.1]) must equal
1. By Definition 4.2, this means that ι Sb,2 (sm)(u.1) must equal 0. Thus, in
Sb,2 u.1 is no longer a summary node.
There are important differences between the structures Sb,1 and Sb,2 that
result from applying the refined abstract transformer for statement st0 : y =
y->n, compared with the structure Sb that results from applying the strawman
abstract transformer (see Figure 13). For instance, y points to a summary node
in Sb, whereas y does not point to a summary node in either Sb,1 or Sb,2 ; as noted
earlier, in the abstract domain of canonical abstractions that is being used in
our examples, Sb,2 is the most precise representation possible for the family of
unshared lists of length 3 or more that are pointed to by x and whose second
element is pointed to by y.
6.4.4 The Coerce Algorithm. In this subsection, we describe an algorithm,
called Coerce, that implements the operation coerce. This algorithm actually
finds a maximal solution to a system of constraints of the form defined in
Definition 6.13. It is convenient to partition these constraints into the following
types:
ϕ(v1 , v2 , . . . , vk ) x 0                            (59)
ϕ(v1 , v2 , . . . , vk ) x (v1 = v2 )^b                 (60)
ϕ(v1 , v2 , . . . , vk ) x p^b (v1 , v2 , . . . , vk ), (61)
where p 6= sm, and the superscript notation used is the same as in Definition 6.18: ϕ 1 ≡ ϕ and ϕ 0 ≡ ¬ϕ. We say that constraints in the forms (59), (60),
and (61) are Type I, Type II, and Type III constraints, respectively.
The Coerce algorithm is shown in Figure 17. The input is a 3-valued structure S ∈ 3-STRUCT[P] and a set of constraints r̂(F ). It initializes S 0 to the
input structure S and then repeatedly refines S 0 by lowering predicate values
in ι S 0 from 1/2 to a definite value, until either: (i) a constraint is irreparably
violated (i.e., the left-hand and right-hand sides have different definite values),
in which case the algorithm fails and returns ⊥; or (ii) no constraint is violated,
in which case the algorithm succeeds and returns S 0 . The main loop is a case
switch on the type of the constraint considered.
—A violation of a Type I constraint is irreparable since the right-hand side is
the literal 0.
—A violation of a Type II constraint when the right-hand side is a negated
equality cannot be fixed. When v1 6= v2 does not evaluate to 1, we have Z (v1 ) =
Z (v2 ); therefore, it is impossible to lower predicate values to force the formula
v1 6= v2 to evaluate to 1 for assignment Z .
—A violation of a Type II constraint when the right-hand side is an equality
that evaluates to 1/2 can be fixed. This type of violation occurs when there
is an individual u that is a summary node:

[[v1 = v2 ]]3S 0 ([v1 7→ u, v2 7→ u]) = ι S 0 (sm)(u) = 1/2.

In this case, ι S 0 (sm)(u) is set to 0.
—A violation of a Type III constraint can be fixed when the right-hand side
value is 1/2.

Fig. 17. An iterative algorithm for solving 3-valued constraints.
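The sketch below follows this case analysis; it is not the literal algorithm of Figure 17, and the constraint encoding (a triple of free variables, a left-hand-side evaluator, and a tagged right-hand-side descriptor) is ours. Type II constraints are assumed to have free variables named v1 and v2, and the dictionary encoding of structures is the one used in the earlier sketches.

```python
from itertools import product
import copy

HALF = 0.5

def coerce(S, constraints):
    """Iterative constraint solver, following the case analysis above.
    Each constraint is (free_vars, lhs, rhs): lhs is a callable
    (S, Z) -> {0.0, 0.5, 1.0}, and rhs is one of
        ('zero',)            -- Type I:   lhs x 0
        ('eq', b)            -- Type II:  lhs x (v1 = v2)^b
        ('atom', p, vs, b)   -- Type III: lhs x p^b(vs...)
    Returns the sharpened structure, or None in place of the paper's bottom."""
    S = copy.deepcopy(S)
    changed = True
    while changed:
        changed = False
        for free_vars, lhs, rhs in constraints:
            for vals in product(S['U'], repeat=len(free_vars)):
                Z = dict(zip(free_vars, vals))
                if lhs(S, Z) != 1.0:
                    continue                       # constraint not triggered here
                if rhs[0] == 'zero':               # Type I: always irreparable
                    return None
                if rhs[0] == 'eq':                 # Type II
                    b, same = rhs[1], Z['v1'] == Z['v2']
                    if b == 0:                     # rhs is v1 != v2
                        if same:
                            return None            # cannot be fixed
                        continue
                    if not same:
                        return None                # v1 = v2 can never hold
                    u, sm = Z['v1'], S['iota']['sm']
                    if sm.get((u,), 0.0) == HALF:
                        sm[(u,)] = 0.0             # u is no longer a summary node
                        changed = True
                    elif sm.get((u,), 0.0) == 1.0:
                        return None
                else:                              # Type III
                    _, p, vs, b = rhs
                    tup = tuple(Z[v] for v in vs)
                    cur = S['iota'][p].get(tup, 0.0)
                    want = 1.0 if b == 1 else 0.0
                    if cur == HALF:
                        S['iota'][p][tup] = want   # sharpen the stored value
                        changed = True
                    elif cur != want:
                        return None                # irreparable violation
    return S
```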
Example 6.27. When the Coerce algorithm is applied to Sa,o,0 , the Type
III constraint (38) for program variable y is irreparably violated, and ⊥ is
returned.
Coerce must terminate after at most n steps, where n is the number of definite values in S 0 , which is bounded by Σp∈P |U |arity( p) . Correctness is established by the following theorem.

THEOREM 6.28. For every S ∈ 3-STRUCT[P], coercer̂(F ) (S) = Coerce(S, r̂(F )).

PROOF. See Appendix C. □
6.5 The Shape-Analysis Algorithm
In this section, we define the more refined abstract semantics, which includes
applications of focus and coerce. (The actual algorithm uses Focus and Coerce.)
The main idea is to refine equation system (20) by performing focus at the
beginning of each abstract transformer that is applied along an edge of the
control-flow graph, and coerce at the end. Formally, this is defined as follows.
StructSet[v] =

   {h∅, ∅i}                                                                                  if v = start

   ⋃_{w→v∈E(G), w∈As(G)}  t embedc (coerce([[st(w)]]3 (focusF (w) (StructSet[w]))))

   ∪  ⋃_{w→v∈E(G), w∈Id(G)}  { S | S ∈ StructSet[w] }

   ∪  ⋃_{w→v∈Tb(G)}  { t embedc (S) | S ∈ coerce(focusF (w) (StructSet[w])) and S |=3 cond(w) }

   ∪  ⋃_{w→v∈Fb(G)}  { t embedc (S) | S ∈ coerce(focusF (w) (StructSet[w])) and S |=3 ¬cond(w) }
                                                                                             otherwise    (62)

(In (62), coerce, focusF (w) , [[st(w)]]3 , and t embedc are applied elementwise to the sets of structures involved, and structures for which coerce yields ⊥ are dropped.)
Here F (w) is the set of focus formulae for w (see Table XI).
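A chaotic-iteration solver for this equation system can be sketched as follows; the cfg object and everything it supplies (vertices, out_edges, empty_structure, and the operations focus, apply_st, coerce, t_embed_c, satisfies) are placeholders for the ingredients of (62) and are not part of the paper.

```python
def shape_analysis(cfg, start):
    """Chaotic-iteration sketch of equation system (62).  out_edges(w) is
    assumed to yield tuples (v, kind, st, cond) with kind in
    {'assign', 'id', 'true-branch', 'false-branch'}."""
    StructSet = {v: [] for v in cfg.vertices}
    StructSet[start] = [cfg.empty_structure]
    worklist = [start]
    while worklist:
        w = worklist.pop()
        for v, kind, st, cond in cfg.out_edges(w):
            produced = []
            for S in StructSet[w]:
                if kind == 'assign':
                    for Sf in cfg.focus(S, st):
                        Sc = cfg.coerce(cfg.apply_st(st, Sf))
                        if Sc is not None:
                            produced.append(cfg.t_embed_c(Sc))
                elif kind == 'id':
                    produced.append(S)
                else:                              # a branch of a condition
                    want = (kind == 'true-branch')
                    for Sf in cfg.focus(S, cond):
                        Sc = cfg.coerce(Sf)
                        if Sc is not None and cfg.satisfies(Sc, cond, want):
                            produced.append(cfg.t_embed_c(Sc))
            grew = False
            for T in produced:
                if T not in StructSet[v]:
                    StructSet[v].append(T)
                    grew = True
            if grew:
                worklist.append(v)
    return StructSet
```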
As with the strawman semantics, the safety argument involves showing that
the abstract transformer that is applied along each edge of the control-flow
graph is conservative with respect to the corresponding transformer of the concrete semantics. Because focus and coerce are defined as semantic reductions,
their presence in equation system (62) merely serves to increase the precision
of the final answer. As before, the safety of the uses of [[·]]3 and |=3 follows from
the Embedding Theorem. Formally, we show the following local safety theorem:
THEOREM 6.29 (Local Safety Theorem). If vertex w is a condition, then for all S ∈ 3-STRUCT[P ∪ {sm}]:

(i) If S \ ∈ γ (S) and S \ |= cond(w), then there exists S 0 ∈ coerce(focusF (w) (S)) such that S 0 |=3 cond(w) and S \ ∈ γ (t embedc (S 0 )).
(ii) If S \ ∈ γ (S) and S \ |= ¬cond(w), then there exists S 0 ∈ coerce(focusF (w) (S)) such that S 0 |=3 ¬cond(w) and S \ ∈ γ (t embedc (S 0 )).

If vertex w is a statement, then

(iii) If S \ ∈ γ (S), then [[st(w)]](S \ ) ∈ γ̂ (t embedc (coerce([[st(w)]]3 (focusF (w) (S))))).

PROOF. See Appendix C. □
The global safety property is argued in the same way as in the strawman
semantics.
Example 6.30. Table XV shows the 3-valued structures that occur before
and after applications of the abstract transformer for the statement y = y->n
during the abstract interpretation of insert. (Because we are analyzing a single
procedure, we allow an arbitrary set of 3-valued structures to hold at the entry
of the procedure, as opposed to equation system (62), which assumes a single,
empty initial structure. The global safety theorem still holds as long as all of
the initial concrete structures are represented by the 3-valued structures that
are provided for the entry point.)
Table XV.
The structures that occur before and after successive applications of the abstract transformer
for the statement y = y->n during abstract interpretation of insert (for brevity, node names
are not shown).
The material in Table X that appears under the heading “Refined Analysis”
shows the application of the abstract transformers for the five statements that
follow the search loop in insert to Sb,1 and Sb,2 . For space reasons, we do not
show the abstract execution of these statements on the other structures shown
in Table XV; however, the analysis is able to determine that at the end of insert
the following properties always hold: (i) x points to an acyclic list that has no
shared elements, (ii) y points into the tail of the x-list, and (iii) the values of e and
y->n are equal. The identification of the latter condition is rather remarkable:
the analysis is capable of showing that e and y->n are must-aliases at the end
of insert.
7. RELATED WORK
This article presents results from an effort to clarify and extend our previous
work on shape analysis [Sagiv et al. 1998]. Compared with Sagiv et al. [1998],
the major differences are as follows.
—A single specific shape-analysis algorithm was presented in Sagiv et al.
[1998]. The present article presents a parametric framework for shape analysis. It provides the basis for generating different shape-analysis algorithms
by varying the instrumentation predicates used.
—This article uses different instantiations of the parametric framework to
show how shape analysis can be performed for a variety of different kinds of
linked data structures.
—The shape-analysis algorithm in Sagiv et al. [1998] was cast as an abstract
interpretation in which the abstract transfer functions transformed shape
graphs to shape graphs. The present work is based on logic, and shape graphs
are replaced by 3-valued logical structures.
The use of logic has many advantages. The most important of these is that it
relieves the designer of a particular shape analysis from many of the burdensome tasks that the methodology of abstract interpretation ordinarily imposes.
In particular, (i) the abstract semantics falls out automatically from the concrete
semantics, and (ii) there is no need for a proof that a particular instantiation
of the shape-analysis framework is correct: the soundness of all instantiations
of the framework follows from a single metatheorem, the Embedding Theorem,
which shows that information extracted from a 3-valued structure is sound with
respect to information extracted from a corresponding 2-valued structure.
Of course, a trade-off is involved: with our approach it is necessary to define the instrumentation predicates that are appropriate for a given analysis.
It is also usually necessary to provide predicate-update formulae that specify
how the values of instrumentation predicates are affected by the execution of
each kind of statement in the programming language, and to prove that these
formulae maintain the correct instrumentation (in the sense of Definition 5.2
and footnote 12). It is open to debate whether these are more or less burdensome tasks than those one faces with more standard approaches to abstract
interpretation.
A substantial amount of material covering previous work on pointer analysis,
alias analysis, and shape analysis is presented in Sagiv et al. [1998]. In the
remainder of this section, we confine ourselves to the work most relevant to the
present article.
7.1 Previous Work on Shape Analysis
The following previous shape-analysis algorithms, which all make use of some
kind of shape-graph formalism, can be viewed as instances of the framework
presented in this article.
—The algorithms of Wang [1994] and Sagiv et al. [1998] map unbounded-size
stores into bounded-size abstractions by collapsing concrete cells that are not
directly pointed to by program variables into one abstract cell, whereas concrete cells that are pointed to by different sets of variables are kept apart in
different abstract cells. As discussed in Section 4.3, these algorithms are captured in the framework by using abstraction predicates of the form pointed-to-by-variable-x (for all x ∈ PVar).
—The algorithm of Jones and Muchnick [1981], which collapses individuals
that are not reachable from a pointer variable in k or fewer steps, for some
fixed k, can be captured in our framework by using instrumentation predicates of the form “reachable-from-x-via-access-path-α,” for |α| ≤ k.
—The algorithms of Jones and Muchnick [1982] and Chase et al. [1990] can be
captured in the framework by introducing unary core predicates that record
the allocation sites of heap cells.
—The algorithm of Plevyak et al. [1993] can be captured in the framework
using the predicates c f .b(v) and cb. f (v) (see Tables V, VI, and Appendix A).
Throughout this article, we have focused on precision and ignored efficiency.
Some of the above-cited algorithms are more efficient than instantiations of the
framework presented here because they keep only a single abstract structure
at each program point. However, this issue has been addressed in the TVLA
system, which implements the 3-valued logic framework (see Section 7.4.1).
In addition, the techniques presented in this article may also provide a new
basis for improving the efficiency of shape-analysis algorithms. In particular,
the machinery we have introduced provides a way both to collapse individuals
of 3-valued structures, via embedding, as well as to materialize them when
necessary, via focus (and coerce).
7.2 The Use of Logic for Pointer Analysis
Jensen et al. [1997] defined a decidable logic for describing properties of linked
data structures, and showed how it could be used to verify properties of programs written in a subset of Pascal. In Elgaard et al. [2000], this method was
extended to handle programs that use tree data structures. The method is complete for loop-free code, but for loops and recursive functions it relies on Hoarestyle invariants. In contrast, the shape-analysis framework described in the
present article can handle some programs that manipulate shared structures
(as well as some circular structures, such as doubly linked lists). Because a
set of 3-valued structures produced by shape analysis can also be viewed as a
representation of a data-structure invariant, our approach can be thought of as
providing a way to synthesize loop invariants. For certain instantiations of the
framework, the invariants obtained for programs that manipulate linked lists
and doubly linked lists are rather precise.
Benedikt et al. [1999] defined a decidable logic, called Lr , for describing properties of linked data structures. They showed how a generalization of Hendren’s
path-matrix descriptors [Hendren 1990; Hendren and Nicolau 1990] can be represented by Lr formulae, as well as how the variant of shape graphs defined
by Sagiv et al. [1998] can be represented by Lr formulae. This correspondence
provides insight into the expressive power of path matrices and shape graphs.
It also has interesting consequences for extracting information from the results
of program analyses, in that it provides a way to amplify the results obtained
from known analyses.
—By translating the structure descriptors obtained from the techniques given
in Hendren [1990], Hendren and Nicolau [1990], and Sagiv et al. [1998] to
Lr formulae, it is possible to determine if there is any store at all that corresponds to a given structure descriptor. This makes it possible to determine
whether a given structure descriptor contains any useful information.
—Decidability provides a mechanism for reading out information obtained by
existing shape-analysis algorithms, without any additional loss of precision
over that inherent in the shape-analysis algorithm itself.
The 3-valued structures used in this article are more general than Lr ; that is,
not all properties that we are able to capture using 3-valued structures can be
expressed in Lr . Thus, it is not clear to us whether Lr (or a decidable extension
of Lr ) can be used to amplify the results obtained via the techniques described
in the present work.
Morris [1982] studied the use of a reachability predicate “x → v | K ” for
establishing properties of programs that manipulate linked lists and trees. The
predicate x → v | K means “v is a node reachable from variable x via a path
that avoids nodes pointed to by variables in set K .” Morris discussed techniques
that, given a statement and a postcondition, generate a formula that captures
the weakest precondition. It is not clear to us how this relates to our predicate-update formulae, which update the values of predicates after the execution of
a pointer-manipulation statement.
7.3 Embedding and Canonical Abstraction
Despite the naturalness and simplicity of the Embedding Theorem, this theorem appears not to have been known previously [Kunen 1998; Lifschitz 1998].
The closest concept that we found in the literature is the notion of embedding
discussed in Bell and Machover [1977, p. 165]. For Bell and Machover, an embedding of one 2-valued structure into another is a one-to-one, truth-preserving
mapping. However, this notion is unsuitable for abstract interpretation of programs that manipulate heap-allocated storage: in abstract interpretation, it is
necessary to have a way to associate the structures that arise in the concrete
semantics, which are of arbitrary size, with abstract structures of some fixed
size.
In Section 4, there were two steps involved in defining a suitable family of
fixed-size abstract structures.
—Section 4.2 introduced “truth-blurring” onto mappings, for which the Embedding Theorem ensures that the meaning of a formula in the “blurred”
(abstract) world is consistent with the formula’s meaning in the original
(concrete) world. In particular, a tight embedding is one that minimizes the
information lost in mapping concrete individuals to abstract individuals.
—Section 4.3 introduced canonical abstractions, which were defined as the
tight embeddings induced by t embedc . The use of t embedc ensures that the
result of embedding is a bounded structure, and hence of a priori finite size.
Canonical abstraction is related to the notion of predicate abstraction introduced in [Graf and Saidi 1997], and used subsequently by others [Das et al.
1999; Clarke et al. 2000]. However, canonical abstraction yields 3-valued predicates, whereas predicate abstraction yields 2-valued predicates. Moreover, the
use of 3-valued predicates provides some additional flexibility; in particular, it
permits canonical abstraction to mesh with the more general notion of embedding. This ability was important for the material presented in Section 6, which
discussed a number of improvements to the abstract semantics; the methods
developed in Section 6 take advantage of the ability to work, at times, with
structures that are not “bounded” in the sense of Definition 4.11.
7.4 Follow-On Work Using 3-Valued Logic
The work presented in the article was originally motivated by the problem of
shape analysis: how to determine shape invariants of programs that perform
destructive updating on dynamically allocated storage. The article explains how
the various ingredients that are part of the analysis framework can be used to
specify (intraprocedural) shape-analysis algorithms, as well as how to fine-tune
the precision of such algorithms.
It is important to understand, however, that the material presented here actually has a much broader range of applicability to program-analysis problems
in general: as has been shown in work that has been carried out subsequent
to the original presentation of the approach in Sagiv et al. [1999], the use of
3-valued logic is not restricted just to shape analysis. In fact, the machinery
of 3-valued logic—embedding, tight embedding, bounded structures, canonical
abstraction, focus, and coerce—provides a framework in which a wide variety of
program-analysis problems can be addressed. Below, we summarize the results
that have been obtained to date in several pieces of follow-on work.
7.4.1 TVLA. The approach presented in this article has been implemented
by T. Lev-Ami in a system called TVLA, for Three-Valued Logic Analysis (see
Lev-Ami [2000] and Lev-Ami and Sagiv [2000]). TVLA provides a language
in which the user can specify (i) an operational semantics (via predicates and
predicate-update formulae), (ii) a control-flow graph for a program, and (iii) a
set of 3-valued structures that describe the program’s input. From this specification, TVLA builds the corresponding equation system, and finds its least
fixed point (cf. Equation (20)).
TVLA was used to test out the ideas described in the present article. The experience gained from this effort led to a number of improvements to, and extensions of, the methods that have been described here. Some of the enhancements
that TVLA incorporates include
—the ability to declare that certain binary predicates specify functional properties;
—the ability to specify that structures should be stored only at nodes of the
control-flow graph that are targets of backedges;
—an enhanced version of Coerce that exploits dependences among the set of
constraints to speed up the constraint-satisfaction process;
—an enhanced focus algorithm that generalizes the methods of Section 6.3 to
handle focusing on arbitrary formulae.13 In addition, this version of focus
also takes advantage of the properties of predicates that are specified to be
functions; and
—the ability to specify criteria for merging structures associated with a program point. This feature is motivated by the idea that when the number of
structures that arise at a given program point is too large, it may be better
to create a smaller number of structures that represent at least the same set
of 2-valued structures.
In particular, nullary predicates (i.e., predicates of 0-arity) are used to specify
which structures are to be merged. For example, for linked lists, the "x-is-not-null" predicate, defined by the formula nn[x]() = ∃v : x(v), discriminates
between structures in which x actually points to a list element, and structures
in which it does not. By using nn[x]() as the criterion for whether to merge
structures, the structures in which x is NULL are kept separate from those in
which x points to an allocated memory cell.
Further details about these features can be found in Lev-Ami [2000].
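A sketch of the merge criterion based on nullary predicates might look as follows; the grouping key and the dictionary encoding are ours, and the actual merging of the structures within a group (into one structure that represents at least the same set of 2-valued structures) is left abstract.

```python
from collections import defaultdict

def partition_for_merging(structures, nullary_preds):
    """Group the structures arising at a program point by the values of the
    chosen nullary predicates (e.g. nn[x]()); structures in different groups
    are kept apart, and each group is subsequently merged by a helper that is
    not shown here.  Nullary values are stored under the empty tuple in the
    dictionary encoding used by the other sketches."""
    groups = defaultdict(list)
    for S in structures:
        key = tuple(S['iota'].get(p, {}).get((), 0.0) for p in nullary_preds)
        groups[key].append(S)
    return groups
```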
7.4.2 Verification of Sorting Implementations. In Lev-Ami et al. [2000],
3-valued logic is applied to the problems of
—automatically proving the partial correctness of correct programs; and
—discovering, locating, and diagnosing bugs in incorrect programs.
An algorithm is presented that analyzes sorting programs that manipulate
linked lists. The main idea is to refine the list abstraction that was introduced
in Section 2.6 by adding two more predicates:
—core predicate dle(v1 , v2 ) records the fact that the value in the data field of
element v1 is less than or equal to the value in the data field of element v2 ;
and
13. The enhanced focus algorithm may not always succeed.
—instrumentation predicate inOrder(v) holds for memory cells that are connected in increasing order in a linked list. (To reduce the cost of the analysis,
this instrumentation predicate was not used as an abstraction predicate.)
The TVLA implementation of the algorithm was found to be sufficiently precise
to discover that (correct versions) of bubble-sort and insertion-sort procedures
do, in fact, produce correctly sorted lists as outputs, and that the invariant
“is-sorted” is maintained by list-manipulation operations, such as elementinsertion, element-deletion, and even destructive list reversal and merging of
two sorted lists. When the algorithm was run on erroneous versions of bubblesort and insertion-sort procedures, it was capable of discovering and, in some
cases, locating and diagnosing the error.
7.4.3 Interprocedural Analysis. In Rinetskey and Sagiv [2001], the problem of interprocedural shape analysis for programs with recursive procedures
is addressed. The main idea is to expose the run-time stack as an explicit “data
structure” of the operational semantics; that is, activation records are individuals, and suitable core predicates are introduced to capture how activation
records are linked to form a stack. Instrumentation predicates are used to record
information about the calling context and the “invisible” copies of variables in
pending activation records on the stack. The resulting algorithm is expensive,
but quite precise. For example, it can show the absence of memory leaks in
a recursive implementation of a list-reversal procedure that, in turn, uses a
recursive version of a list-append procedure.
7.4.4 Analyzing Mobile Ambients. In Nielson et al. [2000], 3-valued logic
is applied to the problem of analyzing mobile ambients [Cardelli and Gordon
1998]. The challenge here is that the number of 3-valued structures arising
in the analysis is quite large. In this case, the ability to specify a criterion for
merging structures was crucial to the success of the analysis. Using this feature, the implementation of the analysis in TVLA was able to verify nontrivial
properties of a routing program, including mutual exclusion.
7.4.5 Checking Multithreaded Systems. In Yahav [2001], it is shown how
to apply 3-valued logic to the problem of checking properties of multithreaded
systems. In particular, Yahav [2001] addresses the problem of state-space exploration for languages, such as Java, that allow (i) dynamic creation and destruction of an unbounded number of threads, (ii) dynamic allocation and freeing
of an unbounded number of storage cells from the heap, and (iii) destructive
updating of structure fields. This combination of features creates considerable
difficulties for any method that tries to check program properties.
In the present article, the problem of program analysis is expressed as a
problem of annotating a control-flow graph with sets of 3-valued structures; in
contrast, the analysis algorithm given in Yahav [2001] is one that builds and
explores a 3-valued transition system on-the-fly.
In Yahav [2001], problems (ii) and (iii) are handled essentially via the techniques developed in the present article; problem (i) is addressed by reducing it
to problem (ii): threads are modeled by individuals, which are abstracted using
canonical names, in this case, the collection of unary thread properties that
hold for a given thread. The use of this naming scheme automatically discovers commonalities in the state space, but without relying on explicitly supplied
symmetry properties, as in, for example, Emerson and Sistla [1993] and Clarke
and Jha [1995].
Unary core predicates are used to represent the program counter of each
thread object; focus implements the interleaving of threads. The analysis described in Yahav [2001] is capable of proving the absence of deadlock in a
dining-philosophers program that permits there to be an unbounded number
of philosophers.
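For illustration only (the predicate vocabulary here is assumed, and the one actually used in Yahav [2001] may differ), the program counter of a thread can be represented by a family of unary core predicates at[lab](t), one per program label lab, with at[lab](t) = 1 when thread t is at label lab, and a binary core predicate such as held_by(l, t) can record that lock l is held by thread t. A derived unary instrumentation predicate such as

    ϕ_{holds_a_lock}(t) ≝ ∃l: held_by(l, t)

then contributes to the canonical name of each thread individual, which is what allows threads in equivalent local states to be folded together.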
In Yahav et al. [2001], this approach is extended to provide a method for
verifying LTL properties of multithreaded systems.
8. SOME FINAL OBSERVATIONS
We conclude with a few general observations about the material that has been
developed in the article.
8.1 Propagation of Formulae Versus Propagation of Structures
It is interesting to compare the machinery developed in this article with the
approach taken in methodologies for program development based on weakest
preconditions [Dijkstra 1976; Gries 1981], and also in systems for automatic
program verification [King 1969; Deutsch 1973; Constable et al. 1982], where
assertions (formulae) are pushed backwards through statements. The justification for propagating information in the backwards direction is that it avoids the
existential quantifiers that arise when assertions are pushed in the forwards
direction to generate strongest postconditions. Ordinarily, strongest postconditions present difficulties because quantifiers accumulate, forcing one to work
with larger and larger formulae.
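As a concrete reminder of this standard phenomenon (a textbook formula, not specific to this article), the strongest postcondition of an assignment introduces an existential quantifier over the variable's previous value,

    sp(x := e, P) = ∃x′: P[x′/x] ∧ x = e[x′/x],

so pushing an assertion forward through a sequence of k assignments can leave k nested existential quantifiers, whereas the weakest precondition wp(x := e, Q) = Q[e/x] is quantifier-free.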
In the shape-analysis framework developed in this article, an abstract shape
transformer can be viewed as computing a safe approximation to a statement’s
strongest postcondition: The application of an abstract statement transformer
to a 3-valued logical structure describing a set of stores S that arise before
a given statement st creates a set of 3-valued logical structures that covers
all of the stores that could arise from applying st to members of S. However,
the shape-analysis framework works at the semantic level; that is, it operates
directly on explicit representations of logical structures, rather than on an implicit representation, such as a logical formula.¹⁴ It is true that new abstract heap-cells are materialized when necessary via the Focus operation; however, because the fixed-point-finding algorithm keeps performing abstraction (via t_embed_c), 3-valued logical structures cannot grow to be of unbounded size.

¹⁴ However, see Benedikt et al. [1999] for a discussion of how a class of shape graphs can be converted into logical formulae.

The conventional approach to verifying programs that use data structures built using pointers is to characterize the structures in terms of invariants that describe their shape at stable points, that is, except for the procedures
that may be applied to them [Hoare 1975]. Because data-structure invariants
are usually temporarily violated within such procedures, it is challenging to
prove that invariants are reestablished at the end of these procedures. As mentioned in Section 7.4.2, the analysis approach developed in this article has also
been applied to a program-verification problem [Lev-Ami et al. 2000]. From the
perspective of someone interested in analyzing or verifying a program via our
approach, the need to adopt a local element-wise view of a data structure gives
the approach a markedly different flavor from the conventional approach to
program verification. In particular, the notion of an instrumentation predicate
can be contrasted with that of an invariant.
—A data-structure invariant states a global property of the instances of a data
structure that holds on entry to and exit from the operations that can be
performed on the data structure.
—An instrumentation predicate captures a local property that can be used to
distinguish among some of a data structure’s components.
As noted earlier, a set of 3-valued structures produced by the shape-analysis
algorithm can also be viewed as a representation of a data-structure invariant;
consequently, our approach can be thought of as providing a way to synthesize
global invariants from local properties.
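The doubly linked lists of Appendix A illustrate the contrast. One natural global invariant, namely that forward and backward pointers are inverses,

    ∀v, v1: f(v, v1) ⇔ b(v1, v),

is a property of the structure as a whole, whereas the instrumentation predicate c_{f.b}(v), defined by ∀v1: f(v, v1) ⇒ b(v1, v), is a property of the individual cell v. The analysis tracks the latter cell by cell; the global invariant is recovered exactly when c_{f.b} and c_{b.f} evaluate to 1 for every element of the structures that the analysis produces.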
8.2 Biased Versus Unbiased Static Program Analysis
Many of the classical dataflow-analysis algorithms use bit vectors to represent
the characteristic functions of set-valued dataflow values. This corresponds to
a logical interpretation (in the abstract semantics) that uses two values. It is
definite on one of the bit values and conservative on the other. That is, either
“false” means “false” and “true” means “may be true/may be false,” or “true” means “true” and “false” means “may be true/may be false.” Many other static-analysis algorithms have a similar character.
Conventional wisdom holds that static analysis must inherently have such
a one-sided bias. However, the material developed in this article shows that
while indefiniteness is inherent (i.e., a static analysis is unable, in general, to
give a definite answer), one-sidedness is not. By basing the abstract semantics
on 3-valued logic, definite truth and definite falseness can both be tracked, with
the third value, 1/2, capturing indefiniteness.
This outlook provides some insight into the true nature of the values that
arise in other work on static analysis.
—A one-sided analysis that is precise with respect to “false” and conservative
with respect to “true” is really a 3-valued analysis over 0, 1, and 1/2 that
conflates 1 and 1/2 (and uses “true” in place of 1/2).
—Likewise, an analysis that is precise with respect to “true” and conservative
with respect to “false” is really a 3-valued analysis over 0, 1, and 1/2 that
conflates 0 and 1/2 (and uses “false” in place of 1/2).
In contrast, the analyses developed in this article are unbiased. They are precise
with respect to both 0 and 1, and use 1/2 to capture indefiniteness.
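To make the conflation concrete, the following small C sketch (ours, not part of TVLA) implements Kleene's 3-valued conjunction and negation together with the join in the information order, and shows how a one-sided “may” interpretation collapses 1 and 1/2 into a single value:

    #include <stdio.h>

    /* Kleene's 3-valued truth values: 0 = false, 1 = true, HALF = unknown. */
    typedef enum { V0 = 0, V1 = 1, HALF = 2 } Val3;

    /* Kleene conjunction: a definite 0 wins, a definite 1 is neutral, else 1/2. */
    static Val3 and3(Val3 a, Val3 b) {
        if (a == V0 || b == V0) return V0;
        if (a == V1 && b == V1) return V1;
        return HALF;
    }

    /* Kleene negation: 1/2 stays 1/2. */
    static Val3 not3(Val3 a) {
        if (a == HALF) return HALF;
        return (a == V0) ? V1 : V0;
    }

    /* Join in the information order: equal values stay, unequal ones become 1/2. */
    static Val3 join3(Val3 a, Val3 b) {
        return (a == b) ? a : HALF;
    }

    /* A one-sided "may" interpretation conflates 1 and 1/2 into "true". */
    static int may(Val3 a) { return a != V0; }

    static const char *show(Val3 a) {
        return a == V0 ? "0" : a == V1 ? "1" : "1/2";
    }

    int main(void) {
        printf("and3(1, 1/2) = %s\n", show(and3(V1, HALF)));   /* 1/2 */
        printf("not3(1/2)    = %s\n", show(not3(HALF)));       /* 1/2 */
        printf("join3(0, 1)  = %s\n", show(join3(V0, V1)));    /* 1/2 */
        /* The biased view cannot tell 1 from 1/2: */
        printf("may(1) = %d, may(1/2) = %d\n", may(V1), may(HALF));
        return 0;
    }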
Fig. 18. (a) Declaration of a doubly linked list data type in C; (b) a program that splices an element
with a data value d into a doubly linked list just after an element pointed to by p. We assume that
the variable that points to the head of the doubly linked list is named l.
Table XVI. Predicate-Update Formulae for the Instrumentation Predicate c_{f.b}

    st                    ϕ^st_{c_{f.b}}(v)
    --------------------------------------------------------------------------
    x = NULL              c_{f.b}(v)
    x = t                 c_{f.b}(v)
    x = t->f              c_{f.b}(v)
    x->f = NULL           c_{f.b}(v) ∨ x(v)
    x->b = NULL           { ϕ_{c_{f.b}}[b ↦ ϕ^st_b]     if ∃v1: x(v1) ∧ b(v1, v)
                          { c_{f.b}(v)                   otherwise
    x->f = t              { ∀v1: t(v1) ⇒ b(v1, v)        if x(v)
     (assuming that       { c_{f.b}(v)                   otherwise
      x->f == NULL)
    x->b = t              { ϕ_{c_{f.b}}[b ↦ ϕ^st_b]     if ∃v1: x(v1) ∧ f(v, v1)
     (assuming that       { c_{f.b}(v)                   otherwise
      x->b == NULL)
    x = malloc()          c_{f.b}(v) ∨ new(v)
APPENDIX A. HANDLING DOUBLY LINKED LISTS

We now briefly sketch the treatment of doubly linked lists. A C declaration of a doubly linked list element is given in Figure 18(a). The defining formulae for the predicates c_{f.b} and c_{b.f}, which track when forward and backward dereferences "cancel" each other, were given in Table VI as Equations (17) and (18):

    ϕ_{c_{f.b}}(v) ≝ ∀v1: f(v, v1) ⇒ b(v1, v)
    ϕ_{c_{b.f}}(v) ≝ ∀v1: b(v, v1) ⇒ f(v1, v).

The predicate-update formulae for c_{f.b} are given in Table XVI. (The predicate-update formulae for c_{b.f} are not shown because they are dual to those for c_{f.b}.)
In addition to c_{f.b} and c_{b.f}, we use two different reachability predicates for every variable z: (i) r_{z,f}(v), which holds for elements v that are reachable from z via 0 or more applications of the field-selector f, and (ii) r_{z,b}(v), which holds for elements v that are reachable from z via 0 or more applications of the field-selector b. Similarly, we use two cyclicity predicates, c_f and c_b. The predicate-update formulae for these four predicates are essentially the ones given in Tables VIII and IX (with n replaced by f and b). (One way in which Table VIII should be adjusted is in the case of updating the reachability predicate with respect to one field, say b, when the f-field is traversed, i.e., via x = t->f. In this case, the predicates c_{f.b} and c_{b.f} can be used to avoid an overly conservative solution [Lev-Ami 2000].)
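For concreteness, the f-versions of these predicates can be defined by formulae patterned on the r_{x,n} and c_n definitions for singly linked lists (a sketch; the b-versions are obtained by replacing f with b):

    ϕ_{r_{z,f}}(v) ≝ z(v) ∨ ∃v1: z(v1) ∧ (TC v2, v3: f(v2, v3))(v1, v)
    ϕ_{c_f}(v) ≝ (TC v1, v2: f(v1, v2))(v, v).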
We have already demonstrated how the shape-analysis algorithm works as a
pointer is advanced along a singly linked list, as in the body of insert (see Table
XV). The shape-analysis algorithm works in a similar fashion when a pointer
is advanced along a doubly linked list. Therefore, in this section, we consider
the operation splice, shown in Figure 18(b), which splices an element with a
data value d into a doubly linked list just after an element pointed to by p. (We
assume that this operation occurs after a search down the list has been carried
out, and that the variable that points to the head of the list is named l.)
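Because Figure 18(b) itself is not reproduced here, the following C fragment is a reconstruction of splice from the step-by-step description below; the type and field names (List, f, b, data) are assumed from Figure 18(a), and the actual figure may differ in detail. The two-stage decomposition of the field assignments mentioned in the walkthrough (first nullify, then assign) is how the analysis treats them, not something the C code performs explicitly.

    #include <stdlib.h>

    /* Element type assumed from Figure 18(a): forward pointer f,
     * backward pointer b, and a data field.                        */
    typedef struct node {
        struct node *f;
        struct node *b;
        int data;
    } List;

    /* Splice a new element carrying data value d into the list just
     * after the element pointed to by p.  Here p and p->f are assumed
     * to be non-NULL, matching the case analyzed in Table XVII.      */
    void splice(List *p, int d) {
        List *e = (List *) malloc(sizeof(List));
        List *t;
        if (e == NULL) return;  /* allocation failure: nothing to splice */
        e->f = NULL;            /* the text assumes the new cell's f and b */
        e->b = NULL;            /* components point to no element          */
        e->data = d;            /* assignment to the data field of e       */
        t = p->f;               /* materializes the element just after p   */
        e->f = t;               /* analyzed as e->f = NULL; e->f = t       */
        t->b = e;               /* analyzed as t->b = NULL; t->b = e       */
        p->f = e;               /* analyzed as p->f = NULL; p->f = e       */
        e->b = p;
    }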
Table XVII illustrates the abstract interpretation of splice under the following conditions: p points to some element in the list beyond the second element,
and the tail of p is not NULL. (This is the most interesting case since it exhibits
all of the possible indefinite edges arising in a call on splice.) Proceeding row by row in Table XVII, we observe the following changes.
—In the initial structure, the values of c_{f.b} and c_{b.f} are 1 for all elements, since in all of the list elements forward and backward dereferences cancel.
—Immediately after a new heap-cell is allocated and its address assigned to e, c_{f.b} and c_{b.f} are both trivially true for the new element since this element's f and b components do not point to any element. Note that by the last row of Table XVI, the value of c_{f.b} for the newly allocated element is set to 1.
—The assignment to the data field of e does not change the structure. The assignment t = p->f materializes a new element whose c_{f.b} and c_{b.f} predicate values are 1.
—The assignment e->f = t is performed in two stages: (i) e->f = NULL and then (ii) e->f = t, assuming that e->f == NULL. The first stage has no effect because the value of e->f is already NULL. The second stage changes the value of c_{f.b} to 0 for the element pointed to by e, and changes the values of r_{e,f} to 1 for the elements transitively pointed to by t.
—The assignment t->b = e is performed in two stages: (i) t->b = NULL and then (ii) t->b = e, assuming that t->b == NULL. In the first stage, the assignment t->b = NULL changes the value of c_{f.b} to 0 for the element pointed to by p. Also, the elements reachable in the backward direction from t are no longer reachable from t. In the second stage, the assignment t->b = e changes the value of c_{f.b} to 1 for the element pointed to by e. Also, the nodes reachable in the backward direction from e are now reachable from t.
Table XVII. The abstract interpretation of the splice procedure applied to a doubly linked list whose head is pointed to by l. Variable p points to some element in the list beyond the second element, and the tail of p is assumed to be non-NULL. For brevity, node names are not shown, and r_{z,f}(v) and r_{z,b}(v) are omitted when v is directly pointed to by variable z.
—The assignment p->f = e is performed in two stages: (i) p->f = NULL and then (ii) p->f = e, assuming that p->f == NULL. In the first stage, the assignment p->f = NULL causes the elements reachable from p->f to no longer be reachable from p and l. However, the assignment p->f = e restores the reachability properties r_{p,f} and r_{l,f} to all the elements reachable from e along the forward direction.
Table XVII—(Continued.)
—Finally, the assignment e->b = p causes all elements reachable from p along the backward direction to have the reachability properties r_{e,b} and r_{t,b}, and causes the element pointed to by p to have the property c_{f.b}.
APPENDIX B. PROOF OF THE EMBEDDING THEOREM

THEOREM 4.9. Let S = ⟨U^S, ι^S⟩ and S' = ⟨U^{S'}, ι^{S'}⟩ be two structures, and let f : U^S → U^{S'} be a function such that S ⊑^f S'. Then, for every formula ϕ and complete assignment Z for ϕ, [[ϕ]]_3^S(Z) ⊑ [[ϕ]]_3^{S'}(f ∘ Z).

PROOF. By De Morgan's laws, it is sufficient to show the theorem for formulae involving ∧, ¬, ∃, and TC. The proof is by structural induction on ϕ.
Basis: For an atomic formula p(v1, v2, ..., vk), u1, u2, ..., uk ∈ U^S, and Z = [v1 ↦ u1, v2 ↦ u2, ..., vk ↦ uk], we have

    [[p(v1, v2, ..., vk)]]_3^S(Z)
      = ι^S(p)(u1, u2, ..., uk)                              (Definition 4.2)
      ⊑ ι^{S'}(p)(f(u1), f(u2), ..., f(uk))                  (Definition 4.5)
      = [[p(v1, v2, ..., vk)]]_3^{S'}(f ∘ Z).                (Definition 4.2)

Also, for l ∈ {0, 1}, we have

    [[l]]_3^S(Z) = l                                          (Definition 4.2)
                 ⊑ l                                          (Definition 4.1)
                 = [[l]]_3^{S'}(f ∘ Z).                       (Definition 4.2)

Let us now show that [[v1 = v2]]_3^S(Z) ⊑ [[v1 = v2]]_3^{S'}(f ∘ Z). First, if [[v1 = v2]]_3^{S'}(f ∘ Z) = 1/2, then the theorem holds for v1 = v2 trivially. Second, if [[v1 = v2]]_3^{S'}(f ∘ Z) = 1, then by Definition 4.2, (i) f(Z(v1)) = f(Z(v2)) and (ii) ι^{S'}(sm)(f(Z(v1))) = 0. Therefore, by Definition 4.5, Z(v1) = Z(v2) and ι^S(sm)(Z(v1)) = 0 both hold. Hence, by Definition 4.2, [[v1 = v2]]_3^S(Z) = 1. Finally, suppose that [[v1 = v2]]_3^{S'}(f ∘ Z) = 0 holds. In this case, by Definition 4.2, f(Z(v1)) ≠ f(Z(v2)). Therefore, Z(v1) ≠ Z(v2), and by Definition 4.2, [[v1 = v2]]_3^S(Z) = 0.
Induction Step. Suppose that ϕ is a formula with free variables v1, v2, ..., vk. Let Z be a complete assignment for ϕ. If [[ϕ]]_3^{S'}(f ∘ Z) = 1/2, then the theorem holds trivially. Therefore assume that [[ϕ]]_3^{S'}(f ∘ Z) ∈ {0, 1}. We must consider four cases, according to the outermost operator of ϕ.
Logical-and. ϕ ≡ ϕ1 ∧ ϕ2. The proof splits into the following subcases.

Case 1: [[ϕ1 ∧ ϕ2]]_3^{S'}(f ∘ Z) = 0.
In this case, either [[ϕ1]]_3^{S'}(f ∘ Z) = 0 or [[ϕ2]]_3^{S'}(f ∘ Z) = 0. Without loss of generality, assume that [[ϕ1]]_3^{S'}(f ∘ Z) = 0. Then, by the induction hypothesis for ϕ1, we conclude that [[ϕ1]]_3^S(Z) = 0. Therefore, by Definition 4.2, [[ϕ1 ∧ ϕ2]]_3^S(Z) = 0.

Case 2: [[ϕ1 ∧ ϕ2]]_3^{S'}(f ∘ Z) = 1.
In this case, both [[ϕ1]]_3^{S'}(f ∘ Z) = 1 and [[ϕ2]]_3^{S'}(f ∘ Z) = 1. Then, by the induction hypothesis for ϕ1 and ϕ2, we conclude that [[ϕ1]]_3^S(Z) = 1 and [[ϕ2]]_3^S(Z) = 1. Therefore, by Definition 4.2, [[ϕ1 ∧ ϕ2]]_3^S(Z) = 1.
Logical-negation. ϕ ≡ ¬ϕ1. The proof splits into the following subcases.

Case 1: [[¬ϕ1]]_3^{S'}(f ∘ Z) = 0.
In this case, [[ϕ1]]_3^{S'}(f ∘ Z) = 1. Then, by the induction hypothesis for ϕ1, we conclude that [[ϕ1]]_3^S(Z) = 1. Therefore, by Definition 4.2, [[¬ϕ1]]_3^S(Z) = 0.

Case 2: [[¬ϕ1]]_3^{S'}(f ∘ Z) = 1.
In this case, [[ϕ1]]_3^{S'}(f ∘ Z) = 0. Then, by the induction hypothesis for ϕ1, we conclude that [[ϕ1]]_3^S(Z) = 0. Therefore, by Definition 4.2, [[¬ϕ1]]_3^S(Z) = 1.
Existential-Quantification. ϕ ≡ ∃v1: ϕ1. The proof splits into the following subcases.

Case 1: [[∃v1: ϕ1]]_3^{S'}(f ∘ Z) = 0.
In this case, for all u ∈ U^S, [[ϕ1]]_3^{S'}((f ∘ Z)[v1 ↦ f(u)]) = 0. Then, by the induction hypothesis for ϕ1, we conclude that for all u ∈ U^S, [[ϕ1]]_3^S(Z[v1 ↦ u]) = 0. Therefore, by Definition 4.2, [[∃v1: ϕ1]]_3^S(Z) = 0.

Case 2: [[∃v1: ϕ1]]_3^{S'}(f ∘ Z) = 1.
In this case, there exists a u' ∈ U^{S'} such that [[ϕ1]]_3^{S'}((f ∘ Z)[v1 ↦ u']) = 1. Because f is surjective, there exists a u ∈ U^S such that f(u) = u' and [[ϕ1]]_3^{S'}((f ∘ Z)[v1 ↦ f(u)]) = 1. Then, by the induction hypothesis for ϕ1, we conclude that [[ϕ1]]_3^S(Z[v1 ↦ u]) = 1. Therefore, by Definition 4.2, [[∃v1: ϕ1]]_3^S(Z) = 1.
Transitive Closure. ϕ ≡ (TC v1, v2: ϕ1)(v3, v4). The proof splits into the following subcases.

Case 1: [[(TC v1, v2: ϕ1)(v3, v4)]]_3^{S'}(f ∘ Z) = 1.
By Definition 4.2, there exist u'_1, u'_2, ..., u'_{n+1} ∈ U^{S'} such that for all 1 ≤ i ≤ n, [[ϕ1]]_3^{S'}((f ∘ Z)[v1 ↦ u'_i, v2 ↦ u'_{i+1}]) = 1, (f ∘ Z)(v3) = u'_1, and (f ∘ Z)(v4) = u'_{n+1}. Because f is surjective, there exist u_1, u_2, ..., u_{n+1} ∈ U^S such that for all 1 ≤ i ≤ n + 1, f(u_i) = u'_i; in particular, we may choose u_1 = Z(v3) and u_{n+1} = Z(v4). Therefore, Z(v3) = u_1, Z(v4) = u_{n+1}, and by the induction hypothesis, for all 1 ≤ i ≤ n, [[ϕ1]]_3^S(Z[v1 ↦ u_i, v2 ↦ u_{i+1}]) = 1. Hence, by Definition 4.2, [[(TC v1, v2: ϕ1)(v3, v4)]]_3^S(Z) = 1.

Case 2: [[(TC v1, v2: ϕ1)(v3, v4)]]_3^{S'}(f ∘ Z) = 0.
We need to show that [[(TC v1, v2: ϕ1)(v3, v4)]]_3^S(Z) = 0. Assume on the contrary that [[(TC v1, v2: ϕ1)(v3, v4)]]_3^{S'}(f ∘ Z) = 0, but [[(TC v1, v2: ϕ1)(v3, v4)]]_3^S(Z) ≠ 0. Because [[(TC v1, v2: ϕ1)(v3, v4)]]_3^S(Z) ≠ 0, by Definition 4.2 there exist u_1, u_2, ..., u_{n+1} ∈ U^S such that Z(v3) = u_1, Z(v4) = u_{n+1}, and for all 1 ≤ i ≤ n, [[ϕ1]]_3^S(Z[v1 ↦ u_i, v2 ↦ u_{i+1}]) ≠ 0. Hence, by the induction hypothesis there exist u'_1, u'_2, ..., u'_{n+1} ∈ U^{S'} such that (f ∘ Z)(v3) = u'_1, (f ∘ Z)(v4) = u'_{n+1}, and for all 1 ≤ i ≤ n, [[ϕ1]]_3^{S'}((f ∘ Z)[v1 ↦ u'_i, v2 ↦ u'_{i+1}]) ≠ 0. Therefore, by Definition 4.2, [[(TC v1, v2: ϕ1)(v3, v4)]]_3^{S'}(f ∘ Z) ≠ 0, which is a contradiction. □
APPENDIX C. OTHER PROOFS

C.1 Properties of the Generated 3-Valued Constraints

LEMMA 6.15. For every pair of structures S♮ ∈ 2-CSTRUCT[P, F] and S ∈ 3-STRUCT[P] such that S is a tight embedding of S♮, S |= r̂(F).

PROOF. Let S♮ ∈ 2-CSTRUCT[P, F] and S ∈ 3-STRUCT[P] be a pair of structures such that S is a tight embedding of S♮ via a function f : U^{S♮} → U^S. We need to show that S |= r̂(F).
Let ϕ' ∈ F and let us show that S |= r(ϕ'). If ϕ' ≡ ∀v1, v2, ..., vk: ϕ, then, since S♮ |= ϕ', for all assignments Z♮ for v1, v2, ..., vk drawn from U^{S♮}, [[ϕ]]_3^{S♮}(Z♮) = 1. Therefore, by the Embedding Theorem, [[ϕ]]_3^S(f ∘ Z♮) ≠ 0. But since f is surjective, we conclude that for all assignments Z for v1, v2, ..., vk drawn from U^S, [[ϕ]]_3^S(Z) ≠ 0, and therefore S |= r(ϕ').

Let us now show that S |= r(ϕ') for ϕ' ≡ ∀v1, v2, ..., vk: ϕ ⇒ a^b, where a is an atomic formula that contains no repetitions of logical variables, a ≢ sm(v), and b ∈ {0, 1}. Let Z be an assignment for v1, v2, ..., vk drawn from U^S. If [[ϕ]]_3^S(Z) ≠ 1, then by definition S, Z |= ϕ ▷ a^b. Therefore, assume that [[ϕ]]_3^S(Z) = 1 and let us show that [[a^b]]_3^S(Z) = 1. Note that for every assignment Z♮ such that f ∘ Z♮ = Z, [[ϕ]]_3^S(Z) = 1 implies, by the Embedding Theorem, that [[ϕ]]_3^{S♮}(Z♮) = 1. Therefore, because S♮ |= ϕ', we have

    [[a^b]]_3^{S♮}(Z♮) = 1.                                  (63)

The remainder of the proof splits into the following cases.
Case 1: b = 1 and a ≡ p(v1, v2, ..., vl), where l ≤ k, p ∈ P − {sm}.
Note that by Definition 6.13, i ≠ j ⇒ vi ≠ vj. We have:

    [[p(v1, v2, ..., vl)]]_3^S(Z)
      = ι^S(p)(Z(v1), Z(v2), ..., Z(vl))                          (Definition 4.2)
      = ⊔_{f(u♮_i)=Z(v_i)} ι^{S♮}(p)(u♮_1, u♮_2, ..., u♮_l)        (Definition 4.6)
      = ⊔_{f∘Z♮=Z} ι^{S♮}(p)(Z♮(v1), Z♮(v2), ..., Z♮(vl))          (since i ≠ j ⇒ vi ≠ vj)
      = ⊔_{f∘Z♮=Z} [[p(v1, v2, ..., vl)]]_3^{S♮}(Z♮)               (Definition 4.2)
      = 1.                                                         (Equation (63))

Notice that we use the fact that p ≢ sm because the step from the third line to the fourth line may not hold for sm (cf. Definition 4.6).
Case 2: b = 0 and a ≡ p(v1, v2, ..., vl), where l ≤ k, p ∈ P − {sm}.
Again, by Definition 6.13, i ≠ j ⇒ vi ≠ vj. We have:

    [[¬p(v1, v2, ..., vl)]]_3^S(Z)
      = 1 − ι^S(p)(Z(v1), Z(v2), ..., Z(vl))                       (Definition 4.2)
      = 1 − ⊔_{f(u♮_i)=Z(v_i)} ι^{S♮}(p)(u♮_1, u♮_2, ..., u♮_l)     (Definition 4.6)
      = 1 − ⊔_{f∘Z♮=Z} ι^{S♮}(p)(Z♮(v1), Z♮(v2), ..., Z♮(vl))       (since i ≠ j ⇒ vi ≠ vj)
      = 1 − ⊔_{f∘Z♮=Z} [[p(v1, v2, ..., vl)]]_3^{S♮}(Z♮)            (Definition 4.2)
      = ⊔_{f∘Z♮=Z} [[¬p(v1, v2, ..., vl)]]_3^{S♮}(Z♮)               (Definition 4.2)
      = 1.                                                          (Equation (63))
Case 3: b = 1 and a ≡ v1 = v2, for v1 ≢ v2.
We need to show that Z(v1) = Z(v2) and ι^S(sm)(Z(v1)) = 0. If Z(v1) ≠ Z(v2), then there exists an assignment Z♮ such that f ∘ Z♮ = Z and Z♮(v1) ≠ Z♮(v2), contradicting (63). Now assume that ι^S(sm)(Z(v1)) = 1/2; thus, by Definition 4.6 there exist u1, u2 ∈ U^{S♮} such that u1 ≠ u2 and f(u1) = f(u2) = Z(v1). Therefore, for Z♮(v1) = u1 and Z♮(v2) = u2, we get a contradiction to (63).

Case 4: b = 0 and a ≡ v1 = v2.
We need to show that Z(v1) ≠ Z(v2). If Z(v1) = Z(v2), then there exists an assignment Z♮ such that f ∘ Z♮ = Z and Z♮(v1) = Z♮(v2), contradicting (63). □
LEMMA 6.25. For every pair of structures S1, S2 ∈ 3-CSTRUCT[P, r̂(F)] such that U^{S1} = U^{S2} = U, the structure S1 ⊔ S2 is also in 3-CSTRUCT[P, r̂(F)].

PROOF. By contradiction. Assume that constraint ϕ1 ▷ ϕ2 in r̂(F) is violated. By definition, this happens when for some Z, [[ϕ1]]_3^{S1⊔S2}(Z) = 1 and [[ϕ2]]_3^{S1⊔S2}(Z) ≠ 1. Because Kleene's semantics is monotonic in the information order (Lemma 4.4), [[ϕ1]]_3^{S1}(Z) = 1 and [[ϕ1]]_3^{S2}(Z) = 1. Therefore, because S1 and S2 both satisfy the constraint ϕ1 ▷ ϕ2, we have [[ϕ2]]_3^{S1}(Z) = 1 and [[ϕ2]]_3^{S2}(Z) = 1. But because ϕ2 is an atomic formula or the negation of an atomic formula, [[ϕ2]]_3^{S1⊔S2}(Z) = 1, which is a contradiction. □
C.2 Correctness of the Coerce Algorithm
The correctness of algorithm Coerce stems from the following two lemmas.
LEMMA C.1. For every S ∈ 3-STRUCT[P] and structure S' before each iteration of the loop in Coerce(S, r̂(F)), the following conditions hold: (i) S' ⊑ S; (ii) if coerce_{r̂(F)}(S) ≠ ⊥, then coerce_{r̂(F)}(S) ⊑ S'.
PROOF. By induction on the number of iterations.

Basis: When the number of iterations is zero, the claim holds because (i) S' = S and thus S' ⊑ S, and (ii) if coerce(S) ≠ ⊥ then coerce(S) ⊑ S = S'.

Induction hypothesis: Assume that the lemma holds for i ≥ 0 iterations.

Induction step: Let S' be the structure before the ith iteration of the loop in Coerce, and let ϕ1 ▷ ϕ2 and Z be the constraint and the assignment selected for the ith iteration of the loop. We show that the induction hypothesis still holds after the ith iteration.

Part I. It is easy to see that in the cases in which Coerce does not return ⊥, Coerce only lowers predicate values, and therefore (i) holds after the ith iteration.

Part II. Let us show that (ii) holds. Assume that S'' = coerce(S) ≠ ⊥. By part (ii) of the induction hypothesis, S'' ⊑ S'. Because [[ϕ1]]_3^{S'}(Z) = 1, by Lemma 4.4 we have [[ϕ1]]_3^{S''}(Z) = 1. Hence, because S'' |= r̂(F), it must be that [[ϕ2]]_3^{S''}(Z) = 1. The proof splits into the following cases.

Case 1: ϕ2 ≡ v1 = v2 and Coerce lowers ι^{S'}(sm)(Z(v1)) from 1/2 to 0. Because [[v1 = v2]]_3^{S''}(Z) = 1, by Definition 4.2 we know that Z(v1) = Z(v2) and ι^{S''}(sm)(Z(v1)) = 0. Therefore, S'' ⊑ S' holds after the ith iteration.

Case 2: ϕ2 ≡ p^b(v1, v2, ..., vk) and Coerce lowers ι^{S'}(p)(Z(v1), Z(v2), ..., Z(vk)) from 1/2 to b. Because [[p^b(v1, v2, ..., vk)]]_3^{S''}(Z) = 1, by Definition 4.2 we know that ι^{S''}(p)(Z(v1), Z(v2), ..., Z(vk)) = b. Therefore, S'' ⊑ S' holds after the ith iteration. □
LEMMA C.2. If Coerce returns ⊥, then coerce_{r̂(F)}(S) = ⊥.

PROOF. Let us assume that Coerce returns ⊥, and yet S'' = coerce(S) ≠ ⊥; we show that this assumption leads to a contradiction.

Let S' be the structure at the beginning of the iteration of the loop in Coerce on which ⊥ is returned, and let ϕ1 ▷ ϕ2 and Z be the constraint and the assignment selected for that iteration. By Lemma C.1(ii), S'' ⊑ S'. Because [[ϕ1]]_3^{S'}(Z) = 1, by Lemma 4.4 we have [[ϕ1]]_3^{S''}(Z) = 1. Hence, because S'' |= r̂(F), it must be that [[ϕ2]]_3^{S''}(Z) = 1. The proof splits into the following cases, according to what causes Coerce to return ⊥.

Case 1: Coerce returns ⊥ when a Type I constraint is violated. Immediate.

Case 2: Coerce returns ⊥ when a Type II constraint is violated. There are two subcases to consider.

Case 2.1: ϕ2 ≡ v1 = v2. Because [[v1 = v2]]_3^{S'}(Z) ≠ 1 and the constraint is irreparably violated, it must be the case that Z(v1) ≠ Z(v2). Therefore, by Definition 4.2, [[v1 = v2]]_3^{S''}(Z) ≠ 1, a contradiction.

Case 2.2: ϕ2 ≡ ¬(v1 = v2). Because [[¬(v1 = v2)]]_3^{S'}(Z) ≠ 1, we conclude from Definition 4.2 that Z(v1) = Z(v2). Therefore, by Definition 4.2, [[¬(v1 = v2)]]_3^{S''}(Z) ≠ 1, a contradiction.

Case 3: Coerce returns ⊥ when a Type III constraint is violated. This happens when ι^{S'}(p)(Z(v1), Z(v2), ..., Z(vk)) has the definite value 1 − b. By Lemma C.1(ii), S'' ⊑ S', and therefore, by Lemma 4.4, ι^{S''}(p)(Z(v1), Z(v2), ..., Z(vk)) must also have the value 1 − b. Therefore, by Definition 4.2, [[p^b(v1, v2, ..., vk)]]_3^{S''}(Z) = 0, a contradiction. □
THEOREM 6.28. For every S ∈ 3-STRUCT[P], coerce_{r̂(F)}(S) = Coerce(S, r̂(F)).

PROOF. Let T be the return value of Coerce(S, r̂(F)). There are two cases to consider, according to the value of T:

—Suppose T = ⊥. By Lemma C.2, coerce_{r̂(F)}(S) = ⊥ = T.
—If T ≠ ⊥, then by Lemma C.1(i), T ⊑ S. By the definition of the Coerce algorithm, T |= r̂(F), and therefore, by Definition 6.23, coerce_{r̂(F)}(S) ≠ ⊥. Consequently, by Lemma C.1(ii), coerce_{r̂(F)}(S) ⊑ T. By Definition 6.23, coerce_{r̂(F)}(S) is the maximal structure that models r̂(F); therefore, it must be that coerce_{r̂(F)}(S) = T. □
C.3 Local Safety of Shape Analysis
THEOREM 6.29 (Local Safety Theorem). If vertex w is a condition, then for all S ∈ 3-STRUCT[P ∪ {sm}]:

(i) If S♮ ∈ γ(S) and S♮ |= cond(w), then there exists S' ∈ coerce(focus_{F(w)}(S)) such that S' |=_3 cond(w) and S♮ ∈ γ(t_embed_c(S')).
(ii) If S♮ ∈ γ(S) and S♮ |= ¬cond(w), then there exists S' ∈ coerce(focus_{F(w)}(S)) such that S' |=_3 ¬cond(w) and S♮ ∈ γ(t_embed_c(S')).

If vertex w is a statement, then

(iii) If S♮ ∈ γ(S), then [[st(w)]](S♮) ∈ γ̂(t_embed_c(coerce([[st(w)]]_3(focus_{F(w)}(S))))).

PROOF. Let w be either a statement or a condition of the control-flow graph, let S be a structure in 3-STRUCT[P ∪ {sm}], and let S♮ ∈ γ(S). By Definition 6.4, there exists S1 ∈ focus_{F(w)}(S) such that S♮ ∈ γ(S1). The proof proceeds as follows.
Let us show that (i) holds. Assume that S♮ |= cond(w). By Definition 4.8 (as modified by footnote 12), S♮ ∈ γ(S1) means that there is a mapping f such that S♮ ⊑^f S1 and S♮ satisfies the compatibility formulae F. Let S2 be the tight embedding of S♮ with respect to f. Thus, S2 ⊑ S1. By Lemma 6.15, S2 |= r̂(F). Therefore, by Definition 6.23, S2 ⊑ coerce_{r̂(F)}(S1). In particular, this means that there is a mapping f' such that S2 ⊑^{f'} coerce_{r̂(F)}(S1). By the transitivity of embedding, we have S♮ ⊑^{f'∘f} coerce_{r̂(F)}(S1). Because S♮ |= cond(w), the Embedding Theorem implies that coerce_{r̂(F)}(S1) |=_3 cond(w). Finally, t_embed_c simply folds together and renames individuals from coerce_{r̂(F)}(S1), so we know that the composed mapping (t_embed_c ∘ f' ∘ f) embeds S♮ into t_embed_c(coerce_{r̂(F)}(S1)):

    S♮ ⊑^{(t_embed_c ∘ f' ∘ f)} t_embed_c(coerce_{r̂(F)}(S1)).

By the definition of γ (Definition 4.8), this implies property (i) for S' = coerce_{r̂(F)}(S1).
The proof of (ii) is identical to the proof of (i), with cond replaced by ¬cond.
Let us show that (iii) holds for a statement w. It follows from the Embedding Theorem and the definitions of [[st(w)]] and [[st(w)]]_3 (Definitions 3.3 and 3.5) that

    if S♮ ∈ γ(S1), then [[st(w)]](S♮) ∈ γ([[st(w)]]_3(S1)).

In particular, this means that there is a mapping f such that [[st(w)]](S♮) ⊑^f [[st(w)]]_3(S1). Let S2 be the tight embedding of [[st(w)]](S♮) with respect to f. Thus, S2 ⊑ [[st(w)]]_3(S1). Because [[st(w)]](S♮) |= F, Lemma 6.15 implies that S2 |= r̂(F). Therefore, by Definition 6.23, S2 ⊑ coerce_{r̂(F)}([[st(w)]]_3(S1)). In particular, this means that there is a mapping f' such that S2 ⊑^{f'} coerce_{r̂(F)}([[st(w)]]_3(S1)). By the transitivity of embedding, we have [[st(w)]](S♮) ⊑^{f'∘f} coerce_{r̂(F)}([[st(w)]]_3(S1)). However, t_embed_c simply folds together and renames individuals from coerce_{r̂(F)}([[st(w)]]_3(S1)), so we know that the composed mapping (t_embed_c ∘ f' ∘ f) embeds [[st(w)]](S♮) into t_embed_c(coerce_{r̂(F)}([[st(w)]]_3(S1))):

    [[st(w)]](S♮) ⊑^{(t_embed_c ∘ f' ∘ f)} t_embed_c(coerce_{r̂(F)}([[st(w)]]_3(S1))).

By the definition of γ (Definition 4.8), this implies property (iii). □
ACKNOWLEDGMENTS
We are grateful to T. Lev-Ami for his implementation of TVLA, and for his
suggestions about improving the precision of the algorithms presented in the
article. We are also grateful for the helpful comments of A. Avron, T. Ball,
M. Benedikt, N. Dor, M. Fähndrich, J. Field, M. Gitik, K. Kunen, V. Lifschitz,
A. Loginov, A. Mulhern, F. Nielson, H. R. Nielson, M. O’Donnell, A. Rabinovich,
N. Rinetsky, R. Shaham, K. Sieber, and E. Yahav. The suggestions from the
anonymous referees about how to improve the organization of the article are
also greatly appreciated. We thank K. H. Rose for the XY-pic LaTeX package.
REFERENCES
AIKEN, A. 1996. Position statement. Prepared for the Programming Languages Working Group
of the ACM Workshop on Strategic Directions in Computing Research (Cambridge, Mass, June
14–15, 1996).
ASSMANN, U. AND WEINHARDT, M. 1993. Interprocedural heap analysis for parallelizing imperative
programs. In Programming Models For Massively Parallel Computers, W. K. Giloi, S. Jähnichen,
and B. D. Shriver, Eds., IEEE Press, Washington, DC, 74–82.
BELL, J. AND MACHOVER, M. 1977. A Course in Mathematical Logic. North-Holland, Amsterdam.
BENEDIKT, M., REPS, T., AND SAGIV, M. 1999. A decidable logic for describing linked data structures.
In Proceedings of the 1999 European Symposium on Programming, 2–19.
CARDELLI, L. AND GORDON, A. 1998. Mobile ambients. In Proceedings of Foundations of Software
Science and Computation Structures (FoSSaCS), M. Nivat, Ed., Lecture Notes in Computer
Science, vol. 1378, Springer-Verlag, New York, 140–155.
CHASE, D., WEGMAN, M., AND ZADECK, F. 1990. Analysis of pointers and structures. In SIGPLAN
Conference on Programming Language Design and Implementation, ACM Press, New York, 296–
310.
CLARKE, E. AND JHA, S. 1995. Symmetry and induction in model checking. In Computer Science
Today: Recent Trends and Developments, J. V. Leeuwen, Ed., Lecture Notes in Computer Science,
vol. 1000, Springer-Verlag, New York, 455–470.
CLARKE, E., GRUMBERG, O., JHA, S., LU, Y., AND VEITH, H. 2000. Counterexample-guided abstraction
refinement. In Proceedings of Computer-Aided Verification, 154–169.
CONSTABLE, R., JOHNSON, S., AND EICHENLAUB, C. 1982. Introduction to the PL/CV2 Programming
Logic. Lecture Notes in Computer Science, vol. 135. Springer-Verlag, New York.
COUSOT, P. AND COUSOT, R. 1977. Abstract interpretation: A unified lattice model for static analysis
of programs by construction of approximation of fixed points. In Symposium on Principles of
Programming Languages, ACM, New York, 238–252.
COUSOT, P. AND COUSOT, R. 1979. Systematic design of program analysis frameworks. In Symposium on Principles of Programming Languages, ACM, New York, 269–282.
DAS, S., DILL, D., AND PARK, S. 1999. Experience with predicate abstraction. In Proceedings of
Computer-Aided Verification, 160–171.
DEUTSCH, L. 1973. An interactive program verifier. Ph.D. Thesis, University of California,
Berkeley, Calif.
DIJKSTRA, E. 1976. A Discipline of Programming. Prentice-Hall, Englewood Cliffs, N.J.
ELGAARD, J., MøLLER, A., AND SCHWARTZBACH, M. 2000. Compile-time debugging of C programs working on trees. In Proceedings of the 2000 European Symposium on Programming,
119–134.
EMERSON, E. AND SISTLA, A. 1993. Symmetry and model checking. In Proceedings of Computer-Aided Verification, C. Courcoubetis, Ed., 463–478.
GINSBERG, M. 1988. Multivalued logics: A uniform approach to inference in artificial intelligence.
Comp. Intell. 4, 265–316.
GRAF, S. AND SAIDI, H. 1997. Construction of abstract state graphs with PVS. In Proceedings of
Computer-Aided Verification, 72–83.
GRIES, D. 1981. The Science of Programming. Springer-Verlag, New York.
HENDREN, L. 1990. Parallelizing programs with recursive data structures. Ph.D. Thesis, Cornell
Univ., Ithaca, N.Y.
HENDREN, L. AND NICOLAU, A. 1990. Parallelizing programs with recursive data structures. IEEE
Trans. Par. Dist. Syst. 1, 1 (January), 35–47.
HENDREN, L., HUMMEL, J., AND NICOLAU, A. 1992. Abstractions for recursive pointer data structures:
Improving the analysis and the transformation of imperative programs. In SIGPLAN Conference
on Programming Language Design and Implementation, ACM, New York, 249–260.
HOARE, C. 1975. Recursive data structures. Int. J. Comput. Inf. Sci. 4, 2, 105–132.
HORWITZ, S., PFEIFFER, P., AND REPS, T. 1989. Dependence analysis for pointer variables. In
SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York,
28–40.
JENSEN, J., JøRGENSEN, M., KLARLUND, N., AND SCHWARTZBACH, M. 1997. Automatic verification of
pointer programs using monadic second-order logic. In SIGPLAN Conference on Programming
Language Design and Implementation, 226–236.
JONES, N. AND MUCHNICK, S. 1981. Flow analysis and optimization of Lisp-like structures. In
Program Flow Analysis: Theory and Applications, S. Muchnick and N. Jones, Eds., Prentice-Hall, Englewood Cliffs, N.J., Chapter 4, 102–131.
JONES, N. AND MUCHNICK, S. 1982. A flexible approach to interprocedural data flow analysis and
programs with recursive data structures. In Symposium on Principles of Programming Languages, ACM, New York, 66–74.
KING, J. 1969. A program verifier. Ph.D. Thesis, Carnegie-Mellon Univ., Pittsburgh, PA.
KLEENE, S. 1987. Introduction to Metamathematics, Second Ed. North-Holland, Amsterdam.
KUNEN, K. 1998. Personal communication.
LARUS, J. AND HILFINGER, P. 1988.
Detecting conflicts between structure accesses. In
SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York,
21–34.
LEV-AMI, T. 2000. TVLA: A framework for Kleene based static analysis. M.S. Thesis, Tel-Aviv
University, Tel-Aviv, Israel.
LEV-AMI, T. AND SAGIV, M. 2000. TVLA: A system for implementing static analyses. In Static
Analysis Symposium, 280–301.
LEV-AMI, T., REPS, T., SAGIV, M., AND WILHELM, R. 2000. Putting static analysis to work for verification: A case study. In ISSTA 2000: Proceedings of the International Symposium on Software
Testing and Analysis, 26–38.
LIFSCHITZ, V. 1998. Personal communication.
MORRIS, J. 1982. Assignment and linked data structures. In Theoretical Foundations of Programming Methodology, M. Broy and G. Schmidt, Eds., D. Reidel, Boston, 35–41.
NIELSON, F., NIELSON, H., AND SAGIV, M. 2000. A Kleene analysis of mobile ambients. In Proceedings
of ESOP 2000, G. Smolka, Ed., Lecture Notes in Computer Science, vol. 1782, Springer, New York,
305–319.
PLEVYAK, J., CHIEN, A., AND KARAMCHETI, V. 1993. Analysis of dynamic structures for efficient parallel execution. In Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter,
A. Nicolau, and D. Padua, Eds., Lecture Notes in Computer Science, vol. 768, Springer-Verlag,
Portland, Ore, 37–57.
RINETSKEY, N. AND SAGIV, M. 2001. Interprocedural shape analysis for recursive programs. In
Proceedings of CC 2001, R. Wilhelm, Ed., Lecture Notes in Computer Science, vol. 2027, Springer,
New York, 133–149.
SAGIV, M., REPS, T., AND WILHELM, R. 1998. Solving shape-analysis problems in languages with
destructive updating. Trans. Prog. Lang. Syst. 20, 1 (January), 1–50.
SAGIV, M., REPS, T., AND WILHELM, R. 1999. Parametric shape analysis via 3-valued logic. In
Symposium on Principles of Programming Languages, ACM, New York, 105–118.
STRANSKY, J. 1992. A lattice for abstract interpretation of dynamic (Lisp-like) structures. Inf.
Comp. 101, 1 (Nov.), 70–102.
WANG, E. Y.-B. 1994. Analysis of recursive types in an imperative language. Ph.D. Thesis, Univ.
of Calif., Berkeley, Calif.
YAHAV, E. 2001. Verifying safety properties of concurrent Java programs using 3-valued logic.
In Proceedings of the Symposium on Principles of Programming Languages, ACM, New York,
27–40.
YAHAV, E., REPS, T., AND SAGIV, M. 2001. LTL model checking for systems with unbounded number of dynamically created threads and objects. Tech. Rep. TR-1424, Comp. Sci. Dept., Univ. of
Wisconsin, Madison, Wisc. March.
Received March 2000; revised October 2001; accepted March 2002