First Normal Form: Repeating Groups vs Relation Valued Attributes

A repeating group is a multi-valued construction in a database table: an attribute that has no single value but consists of multiple values that are usually accessed by positional index rather than by name. Codd's original First Normal Form (1NF) eschewed tables with such repeating groups in favour of relations consisting of single-valued attributes only. The advantages of relations over index-based arrays are pretty well understood by database developers. Many SQL DBMSs won't even support arrays of values within attributes so the 1NF prohibition against them is not a serious restriction.

Codd later revised/rephrased his definition of 1NF, stating that 1NF also excluded relation-valued attributes. (Note that he did not ever call relation-valued attributes “repeating groups” - relations are a different thing altogether) The ban on relation-valued attributes (RVAs) is a bit more controversial than the original definition of 1NF. A relation in principle is a value and it can be accessed just like any other single value. For “obvious” reasons of clarity and simplicity RVAs are usually avoided. But it doesn't necessarily follow that they should always be avoided. Since a RVA is a value and not a repeating group, relations that contain RVAs can be subject to relational operations just like other relations. To say that a relation ought to consist of any kind of single-valued attribute except for a relation seems unnecessarily restrictive and a bit arbitrary.

Take the example of a collection of points making up a simple graph. A “graph” for the purposes of this example just means a collection of points represented as X and Y co-ordinates. Here's a relvar, G, with attributes, X and Y, and some values that specify a graph:

G is a perfectly good way to represent a single graph. Now suppose we need to model an arbitrary collection of such graphs in a relational database. Clearly one graph must be distinguishable from another because it consists of a different set of points, but we can also assume different graphs might have some points in common. Isn't the most natural representation (perhaps the only possible representation) a new relvar consisting of graph values – i.e. relation values of the same type as G? Let's call it G-nested and populate it with three different graphs:

The G-nested relvar has one attribute, G, which is a relation-valued attribute with attributes X, Y. The key of the G-nested relvar is {G}, the relation-valued attribute, which means that the graphs contained within G-nested must be unique. The key of each nested relation within G is {X,Y}.

If we wished to avoid the nested relation at all costs then a possible alternative would be to extend our original G relvar by adding a new attribute to identify each graph by something other than its points. For example we could add a name attribute. Here is such a relvar, called G-named:

The key of G-named is {Name,X,Y}. The Name attribute performs the function of identifying the three distinct graphs in G-named. In G-nested the graphs were separated in distinct tuples and so they didn't need their own names. G-named only contains tuples for points, not graphs, so we have to identify the graphs by some other means and that is why the Name attribute is required.

G-named may well be the most practical solution for many applications but why must it be the solution under 1NF? It might be highly inconvenient to invent a new name attribute just for the sake of database representation if such an attribute doesn't exist in the reality we are attempting to model. In his classic paper, A Relational Model of Data for Large Shared Data Banks, Dr Codd wrote that the relational model “provides a means of describing data with its natural structure only – that is, without superimposing any additional structure for machine representation purposes.” Thus the spirit of the relational model is precisely that it enables a natural representation of data without artificial technical constraints.

If G-nested is not a “normalized” relation then how should we fix it? The act of normalizing a database does not usually require new attributes to be invented to satisfy any normal form; the normalization procedure simply redistributes attributes among relations such that certain conditions are satisfied. The transformation of G-nested into G-named is really not any form of “normalization”; it is a different kind of transformation altogether. If RVAs are understood to be completely excluded by 1NF then it appears that normalization as a design procedure may be broken in this case - our G-nested relation really cannot be normalized because there is no simple rearrangement of its attributes that satisfies Normal Form. Only an alternative relation (one with a name or other attribute to identify graphs) is fit to be the subject of normalization. If we accept Codd's modified version of 1NF that prohibits RVAs then it seems that we also have to accept limits on what the normalization procedure can do for us.

In conclusion, while I agree that RVAs are probably unwanted in most circumstances, I'm less convinced that it's a good idea to interpret 1NF as excluding them altogether.