New paper: Union types for semistructured data

We are pleased to announce a new technical report, available through 


     Peter Buneman
     Benjamin Pierce


                  Peter Buneman and Benjamin Pierce


Semistructured databases are treated as dynamically typed: they come
equipped with no independent schema or type system to constrain the
data.  Query languages that are designed for semistructured data, even
when used with structured data, typically ignore any type information
that may be present.  The consequences of this are what one would
expect from using a dynamic type system with complex data: fewer
guarantees on the correctness of applications. For example, a query
that would cause a type error in a statically typed query language
will silently return the empty set when applied to a semistructured
representation of the same data.

Much semistructured data originates in structured data.  A
semistructured representation is useful when one wants to add data
that does not conform to the original type or when one wants to
combine sources of different types.  However, the deviations from the
prescribed types are often minor, and we believe that a better
strategy than throwing away all type information is to preserve as
much of it as possible.  We describe a system of untagged UNION TYPES
that can accommodate variations in structure while still allowing a
degree of static type checking.

A novelty of this system is that it involves non-trivial equivalences
among types, arising from a law of distributivity for records and
unions: a value may be introduced with one type (e.g., a record
containing a union) and used at another type (a union of records).  We
describe programming and query language constructs for dealing with
such types, prove the soundness of the type system, and develop
algorithms for subtyping and typechecking.