Constraints for BETYdb
Author
Scott Rohde and David LeBauer
Last Updated
10 years ago
License
Creative Commons CC BY 4.0
Abstract
Constraints on the Biofuel Ecophysiological Traits and Yields Database to ensure its logical integrity.
\documentclass{article}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{amsfonts,amsmath,amssymb}
\usepackage{url}
\usepackage[utf8]{inputenc}
\usepackage{fancyref}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0},}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{multirow,booktabs}
\usepackage{verbatim}
\usepackage{color}
\usepackage{soul}
\begin{document}
\title{Constraints for BETYdb}
\author{David LeBauer\\ University of Illinois at Urbana-Champaign \and Scott Rohde\\ University of Illinois at Urbana-Champaign}
\date{\today}
\bibliographystyle{plain}
\maketitle
%%%%%%%%%%%%%%%%% 1 %%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{introduction}
We propose implementing database-level constraints, thereby consciously
departing from Ruby's ``Active Record'' approach. The
\href{http://guides.rubyonrails.org/migrations.html\#active-record-and-referential-integrity}{Rails
Guide on Active Record (database) Migrations} suggests
\begin{quote}
The Active Record way claims that intelligence belongs in your models,
not in the database. As such, features such as triggers or foreign key
constraints, which push some of that intelligence back into the
database, are not heavily used.
\end{quote}
\begin{comment}
The
\href{http://guides.rubyonrails.org/migrations.html\#schema-dumping-and-you}{guide}
states that ActiveRecord will not parse sql code, which is why, for
example, views implemented in SQL are not encoded in Ruby's schema dump.
Thus, to implement constraints, we can either a) find gems (such as
\href{https://github.com/matthuhiggins/foreigner}{foreigner} for foreign
key constraints) to manage the constraints the ``Ruby way'' or b) move
from using \texttt{db/schema.rb} to \texttt{db/production\_structure.sql}, so that
the schema is stored in SQL rather than in Ruby.
The \texttt{db/structure.sql} approach sounds simpler to me.
\end{comment}
We think, however, that the following quote (see
\href{http://ewout.name/2009/12/rails-models-with-teeth-and-database-constraints/}{http://ewout.name/2009/12/rails-models-with-teeth-and-database-constraints}) expresses compelling reasons for bucking the ``Active Record way'' and enforcing database integrity at the database level:
\begin{quote}
Data tends to outlive its applications. A tight data model is a good
foundation for an application and can save you a lot of trouble when
migrating the data to a different system (years later). Database
constraints can make your models even tighter, and enforce integrity
rules that are hard to enforce in a multi-process application
environment.
\end{quote}
Given that the Ruby Web application is only one of the ways in which we
use the database,
%(e.g.~we don't want to have to use the API with all of
%our R code \ldots{} or do we?),
it seems reasonable to go with the SQL
database-level constraints.
There are certain costs, however, to stepping outside of Rails to manage database structure. First, while we may continue to use Rails migrations to manage updates to database structure, in many cases we will have to write the code of those migrations directly in SQL (using the \texttt{execute \%\{ <SQL statements> \}} construct). This increases the chance that our migrations will no longer be DBMS-agnostic.\footnote{We can at least partially get around the limitations of built-in Rails constructs for expressing database structures by employing gems that extend these constructs. For example, the \emph{foreigner} gem allows one to write foreign-key constraints within migrations in Ruby rather than having to switch to native SQL code. Given that we think it far more likely that we will abandon using Rails as a front end to our database before we will abandon using PostgreSQL as the management system for that database, we see little benefit to making use of these extensions.}
Second, we can no longer completely define the structure of the database in the file \texttt{db/schema.rb}---the language Rails has for expressing database structure is simply not expressive enough. We must switch (and have in fact done so) to using \texttt{db/production\_structure.sql} as the repository of the complete and definitive description of our database schema.
Third, we must handle the errors that arise from attempting a change to the database that would violate one of our constraints. In some cases (non-NULL constraints, for example) there are parallel Rails model validations that can enforce constraints (at the cost of some duplication) on the Rails side. In other cases, we will have to catch and handle exceptions generated by the database adapter when a database constraint is violated. This code is likely to be highly DBMS-dependent, but as remarked in the footnote, we are not overly concerned about this.
\begin{comment}
\textbf{Note (SR):} To clarify these points somewhat, the foreigner gem
\emph{does} in fact add constraints at the database level. What it does
is allow the programmer to express those foreign key constraints in Ruby
code rather than in the SQL language. It also plays nicely with
schema.rb so that if adding foreign key constraints were the only
consideration in choosing between db/schema.rb and
db/production\_structure.rb, there would be no compelling reason to opt
for the latter.
Conversely, foreigner \emph{does not} do anything to enforce foreign key
constraints at the client application level. It does not, for example,
add any model validation code that would prevent violation of
foreign-key constraints at the Ruby level and generate a user-friendly
error message when a user attempts a change that would cause a violation
of those constraints.
So whether we use foreigner or not and whether we switch to storing the
database structure in the SQL-based db/production\_structure.rb file are
largely unrelated issues. The main reason for not using foreigner is
that it requires learning (an admittedly minimal) new Ruby-based API and
doesn't give us that much in return. It would potentially allow us to
continue using schema.rb to store database structure, foreign key
constraints and all, but since we already have needed to switch to
db/production\_structure.rb to store trigger functions, this
consideration is moot.
What \emph{is} clear, however, is that we want to enforce database
constraints at the database level whether this be through gems like
foreigner that would allow us to do this in the language of Rails or
whether it be directly in the SQL language.
\end{comment}
%%%%%%%%%%%%%%%%%%%%%%%% 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Categories of Constraints}\label{categories-of-constraints}
The kinds of constraints we wish to impose may be roughly classified as follows:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\itemsep1pt\parskip0pt\parsep0pt
\item
value constraints
\begin{enumerate}
\itemsep1pt\parskip0pt\parsep0pt
\item
range constraints on continuous variables
\item
``enum'' constraints on, for example, state or country designations;
this is a form of normalization (``US'', ``USA'', and ``United States'', for example, should be folded
into a common designation)
%; forms utilized by SELECT controls should perhaps be favored
\item
consistency constraints; for example (year, month, day) can't be
(2001, 2, 29); or city-(state)-country vs.~latitude-longitude (this
may be hard, but some level of checking may not be too difficult; for
example, \begin{quote}\texttt{select * from sites where country = `United
States' and lon \textgreater{} 0;}\end{quote} shouldn't return any rows)
\end{enumerate}
\item
foreign key constraints
We also include here certain consistency constraints involving foreign keys. For example, if \texttt{traits.cultivar\_id} is non-null, the value of \texttt{cultivars.specie\_id} in the referred-to row should equal \texttt{traits.specie\_id}.
\item
non-NULL constraints
\item
uniqueness constraints--above all, designation of natural key
In keeping with Rails conventions, we grudgingly continue to use \texttt{id} as the primary key for all or nearly all non-join tables, but wherever possible, we shall try to designate some column or set of columns as the ``true, natural'' key for the table and will back up this designation with a \texttt{UNIQUE} constraint on the constituent columns together with non-NULL constraints on those columns. This will help prevent what has been a persistent problem with duplicate data---for example, two rows in the species table that are essentially the same, differing only in the value contained in the \texttt{id} column.
\end{enumerate}
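As an illustrative sketch of the natural-key approach (the choice of \texttt{scientificname} as the natural key for the species table is hypothetical, pending a final decision on which columns constitute each table's natural key):
\begin{verbatim}
-- Sketch only: designating a hypothetical natural key for species.
ALTER TABLE species ALTER COLUMN scientificname SET NOT NULL;
ALTER TABLE species ADD CONSTRAINT unique_scientificname
    UNIQUE (scientificname);
\end{verbatim}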
In the sections that follow, constraints that have already been implemented are marked with a checkmark (\checkmark).
%%%%% 2.1 %%%%%
\subsection{Reference Documentation} \textcolor{red}{to do: update these}
\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
\href{https://docs.google.com/spreadsheets/d/1fJgaOSR0egq5azYPCP0VRIWw1AazND0OCduyjONH9Wk/edit\#gid=956483089}{Google
Doc Spreadsheet}
\item
\href{https://gist.github.com/dlebauer/5522cfd6629cfa2a2610}{SQL dump
with constraints} {[}Note: The foreign-key constraints listed here
have been superseded by the draft ``add\_foreign\_key\_contraints''
migration in BETYdb git branch ForeignKeyConstraints.{]}
\item
\href{https://ebi-forecast.igb.illinois.edu/redmine/issues/1915}{redmine
issue 1915 and related issues / subtasks}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%% 3 %%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Value Constraints}\label{value-constraints}
Value constraints serve several useful functions:
\begin{itemize}
\item They provide a sanity check on data. For example, we can ensure we don't have negative values for yields or have temperature values of $-500$ degrees Celsius.
\item They help prevent duplicate data. For example, even if we impose a uniqueness constraint on the ``name'' column of the variables table, this alone won't prevent having one row with name = `leafC' and another row with name = ` \ \ leafC \ \ '.
\item They help standardize data. For example, without a standardization constraint on country names, a user searching for sites using the string `United States' may get different results from a user searching with the string `U.S.'.
\end{itemize}
Most value constraints will be implemented with a CHECK constraint. More complicated constraints involving multiple tables may require defining a PL/pgSQL function. Even in simpler cases, defining a function to implement a CHECK constraint may make sense if the same sort of constraint will be used repeatedly. Alternatively, we may wish to use SQL's CREATE DOMAIN statement to define a type that has the constraints we need built in, and then alter the columns we wish to constrain to be of the new type.
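To make the two approaches concrete, here is a sketch of each; the constraint name and the use of \texttt{sites.som} as the example column are illustrative only:
\begin{verbatim}
-- Direct CHECK constraint on one column:
ALTER TABLE sites ADD CONSTRAINT valid_som
    CHECK (som BETWEEN 0 AND 100);

-- Or define a constrained type once and reuse it:
CREATE DOMAIN percentage AS numeric
    CHECK (VALUE BETWEEN 0 AND 100);
ALTER TABLE sites ALTER COLUMN som TYPE percentage;
\end{verbatim}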
Since not-NULL constraints may be viewed as a sort of value constraint and are in any case quite bound up with them, they will be dealt with here as well. The set of not-NULL constraints we wish to use is summarized in a separate section below, together with some general remarks about the use of NULLs.
When a default value other than NULL should be set, these are mentioned here as well.
Each table is given its own section below (excepting certain join tables). In most cases, all columns are mentioned except (1) the `id' surrogate key columns; (2) foreign-key columns; (3) the \verb"created_at" and \verb"updated_at" time\-stamp columns (which are dealt with all together in the following ``General constraints'' section).
After each column name (sometimes with its data type, in parentheses), the constraint that should apply immediately follows, if one has been decided upon. After this, a paragraph (beginning with ``Discussion:'') commenting on the chosen constraint or discussing considerations for adding additional future constraints may follow.
A checkmark (\checkmark) after a constraint indicates it has been implemented.
%%%% 3.1 %%%%
\subsection{General constraints applying to multiple tables}\label{general}
\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
Text column values should not have leading or trailing white spaces.
To make this easier, we define some PL/pgSQL functions:
\begin{itemize}
\item A function \verb"normalize_whitespace(string)" is defined to return \newline \verb"TRIM(REGEXP_REPLACE(string, '\s+', ' '))".
\item A function to test for normalization, \verb"is_whitespace_normalized(string)", returns the result of the test \verb"string = normalize_whitespace(string)".
\end{itemize}
Then check constraints of the form
\begin{quote}
\texttt{CHECK(is\_whitespace\_normalized(\textless{}columnname\textgreater{}))}\end{quote}
can then be added for each column that should be whitespace-normalized.
These constraints, in conjunction with uniqueness constraints, will go a long way toward ensuring that rows that are essentially duplicates, differing only in the value of the \texttt{id} column and in the white space that occurs in their textual columns, do not occur.
\item All \verb"created_at" and \verb"updated_at" columns should have default value \verb"NOW()".
If feasible, a trigger function should be defined to set the \verb"updated_at" column to \verb"NOW()" upon UPDATE if no explicit value is given in the update.
\end{itemize}
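A sketch of these functions and the timestamp trigger in PL/pgSQL follows; the example table and constraint names are illustrative, and the trigger shown unconditionally sets \verb"updated_at" (honoring an explicitly supplied value would require comparing \texttt{OLD} and \texttt{NEW}):
\begin{verbatim}
CREATE OR REPLACE FUNCTION normalize_whitespace(string text)
  RETURNS text AS $$
BEGIN
  RETURN TRIM(REGEXP_REPLACE(string, '\s+', ' '));
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE OR REPLACE FUNCTION is_whitespace_normalized(string text)
  RETURNS boolean AS $$
BEGIN
  RETURN string = normalize_whitespace(string);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Example per-column check constraint:
ALTER TABLE citations ADD CONSTRAINT normalized_author
    CHECK (is_whitespace_normalized(author));

-- Keep updated_at current on every UPDATE:
CREATE OR REPLACE FUNCTION set_updated_at()
  RETURNS trigger AS $$
BEGIN
  NEW.updated_at := NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_timestamp BEFORE UPDATE ON citations
    FOR EACH ROW EXECUTE PROCEDURE set_updated_at();
\end{verbatim}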
%%%% 3.2 %%%%
\subsection{citations}\label{citations}
\begin{description}
\item[author] not NULL, whitespace-normalized
Discussion: This \emph{should} always be just a last name, but there are a few cases where full names or a list of names is given.
\item[year (integer)] not NULL
Discussion: Consider adding a range restriction---both a lower bound (say 1800---are we ever going to want to cite Aristotle or even Leeuwenhoek?) and some upper bound (say, 2200), or better, a check constraint such as \verb"CHECK(year <= EXTRACT(YEAR FROM NOW()) + 1)", assuming we would never have a citation year more than one year in the future.
\item[title] not NULL, whitespace-normalized
Discussion: The Data Entry guide recommends using `NA' to denote an unknown title, journal, volume (\verb"vol") or page (\verb"pg"). Note, however, that `NA' currently never appears in either the \verb"journal" column or the \verb"pg" column and that it can't be used for \verb"vol" since it has datatype \texttt{integer}. Moreover, since `NA' is commonly used to designate both ``not applicable'' and ``not available'', which mean very different things, we feel it is better to write out which one is meant. (A checkbox or radio button on the Rails app form could both make this easier and help standardize these special values.)
\item[journal] not NULL, whitespace-normalized
\item[vol (integer)] must be $> 0$.
Discussion: We'll allow this to be NULL for now until we decide how to deal with missing values.
\item[pg] not NULL; should match the regular expression \newline \verb"'^([1-9]\d*(\u2013[1-9]\d*)?)?$'" (tentative)
Discussion: This should be either a single positive integer or two integers separated by a dash. Currently, nearly all values match the regular expression \verb"'^[1-9]\d*([-\u2013][1-9]\d*)?$'". Ignoring the one case of the value `test', the values that don't match either have leading or trailing spaces, a doubled hyphen, or leading zeros in the numbers, look like dates (e.g.~10/21/10), or are the empty string. Values should probably be normalized upon entry to strip spaces and replace hyphen(s) or the word ``to'' with an en dash, the conventional typographic symbol for representing a range of numbers.
\item[url] should match a regular expression for URLs or the empty string (tentative)
Discussion: At a minimum this should be whitespace-normalized. Unless we want to continue allowing values like `paper copy available in blue folder', `NA', and `not found', we could require entries to either be the empty string or look like an actual URL. (We could relax this somewhat to allow some finite set of prescribed values like `unknown' or `not yet available' in addition to bona fide URLs.)
\item[pdf] same as for url
Discussion: Many, but by no means all, have a `pdf' extension in the filename portion of the URL.
\item[doi] should match the regular expression \verb"'^(|10\.\d+(\.\d+)?/.+)$'"
Discussion: All but 13 of the existing values match \newline \verb"'^(|10\.\d+(\.\d+)?/.+)$'". Most of the non-matching values are either `NA' or they prefix the doi with `doi:' or with `http://dx.doi.org/'. These prefixes should be stripped for succinctness and uniformity since they don't add information not already contained in the column name. Unless we wish to distinguish between unknown DOIs and unregistered citations, we can just use the empty string (rather than `NA') for both.
\end{description}
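The year and doi checks discussed above might be expressed as follows (constraint names are illustrative):
\begin{verbatim}
-- Allow citation years at most one year in the future:
ALTER TABLE citations ADD CONSTRAINT valid_year
    CHECK (year <= EXTRACT(YEAR FROM NOW()) + 1);

-- doi: empty, or of the form "10.<digits>[.<digits>]/<suffix>":
ALTER TABLE citations ADD CONSTRAINT valid_doi
    CHECK (doi ~ '^(|10\.\d+(\.\d+)?/.+)$');
\end{verbatim}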
%%%%%%%%%% 3.3 %%%%%%%%%%%%
\subsection{covariates}\label{covariates}
\begin{description}
\itemsep1pt\parskip0pt\parsep0pt
\item[level] should be in the range corresponding to variable
referenced by \texttt{variable\_id} \checkmark
\item[n] should be positive (tentative)
Discussion: For now, this is allowed to be NULL. Decide if 1 is a permissible value.
\item[statname and stat] These are interdependent. Check that \texttt{statname} is one of ``SD'', ``SE'', ``MSE'',\normalmarginpar
``95\%CI'', ``LSD'', ``MSD'' or the empty string. Create a domain type for this since it is also used elsewhere.
\verb"stat" should be NULL if statname is the empty string and should be non-null otherwise. Tentative: \verb"stat" should also be NULL (and \verb"statname" equal to the empty string) if \verb"n" is NULL or equal to 1 and non-null otherwise.
Discussion: Do any of these statistics have any meaning if \verb"n" = 1? Should these values be \emph{required} if \verb"n" is greater than 1?
\end{description}
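A sketch of the proposed domain and the interdependence check (the domain and constraint names are illustrative):
\begin{verbatim}
CREATE DOMAIN statnames AS text
    CHECK (VALUE IN ('SD', 'SE', 'MSE', '95%CI', 'LSD', 'MSD', ''));

ALTER TABLE covariates ALTER COLUMN statname TYPE statnames;

-- statname and stat must be supplied (or omitted) together:
ALTER TABLE covariates ADD CONSTRAINT stat_statname_consistency
    CHECK ((statname = '' AND stat IS NULL) OR
           (statname <> '' AND stat IS NOT NULL));
\end{verbatim}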
%%%%%% 3.4 %%%%%
\subsection{cultivars}\label{cultivars}
\begin{description}
\item[name] should be whitespace-normalized
\item[ecotype] should be one of a small number of finite values (including the empty string) (tentative)
Discussion: Note that in 30 out of 90 cases, \verb"ecotype" has the same value as \verb"name"! This may be an error! For the cases where ecotype does \emph{not} equal name, there are only 5 distinct values: `Lowland', `Upland', `Boreal Forest', the string `NULL', and the empty string. This suggests that possibly ecotype values can be restricted to a small finite set.
\item[notes] not NULL
\end{description}
%%%%%% 3.5 %%%%%
\subsection{dbfiles}\label{dbfiles}
\begin{description}
\item[file\_name] not null, no whitespace (tentative)
Discussion: This could possibly be more restrictive: for example, currently all values match \verb"`^[\w.-]*$'".
\item[file\_path] no whitespace, non-empty, and non-null (tentative)
Discussion: This could possibly be more restrictive: for example, currently all values match \verb"`^[\w.:/-]*$'".
\item[container\_type] should be in the set ('Model', 'Posterior', 'Input')
\item[container\_id] This looks like a foreign key, but it isn't exactly, so we include it here. It refers to the \verb'id' column of one of the tables \verb"models", \verb"posteriors", or \verb"inputs"; which one depends on the value of \verb"container_type". We should check that the referred-to row exists.
Discussion: This probably requires triggers on this table and on each of the referred-to tables to implement. This may be too complicated to be worthwhile.
\item[md5] should match \verb"`^([\da-z]{32})?$'"
Discussion: In other words, it's either empty or a 32-digit hexadecimal number represented as text using lowercase letters. All current values comply. The datatype for this column is varchar(255) but could be varchar(32).
\end{description}
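If we do decide to enforce the \verb"container_id" reference, a trigger along the following lines would handle inserts and updates on \texttt{dbfiles}; this is a sketch only, and a complete solution would also need triggers on the three referred-to tables to catch deletions:
\begin{verbatim}
CREATE OR REPLACE FUNCTION check_dbfile_container()
  RETURNS trigger AS $$
DECLARE
  found boolean;
BEGIN
  CASE NEW.container_type
    WHEN 'Model' THEN
      SELECT EXISTS(SELECT 1 FROM models
                    WHERE id = NEW.container_id) INTO found;
    WHEN 'Posterior' THEN
      SELECT EXISTS(SELECT 1 FROM posteriors
                    WHERE id = NEW.container_id) INTO found;
    WHEN 'Input' THEN
      SELECT EXISTS(SELECT 1 FROM inputs
                    WHERE id = NEW.container_id) INTO found;
    ELSE
      RAISE EXCEPTION 'invalid container_type %', NEW.container_type;
  END CASE;
  IF NOT found THEN
    RAISE EXCEPTION 'no % row with id %',
        NEW.container_type, NEW.container_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_container BEFORE INSERT OR UPDATE ON dbfiles
    FOR EACH ROW EXECUTE PROCEDURE check_dbfile_container();
\end{verbatim}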
%%%%%%%%%%%%%%%%%%
\subsection{ensembles}
\begin{description}
\item[notes] not NULL
\item[runtype] not NULL (tentative)
Discussion: Perhaps it should also match a regular expression.
\end{description}
\subsection{entities}\label{entities}
\begin{description}
\item[name] should be whitespace-normalized
\item[notes] not NULL
\end{description}
%%%%%%%%
\subsection{formats}
\begin{description}
\item[dataformat] --- to be determined
\item[notes] not NULL
\item[name] not NULL, whitespace-normalized
\item[header] --- to be determined
\item[skip] --- to be determined
\end{description}
%%%%%%%%
\subsection{formats\_variables}
\begin{description}
\item[name] --- to be determined
\item[unit] --- to be determined
\item[storage\_type] --- to be determined
\item[column\_number] --- to be determined
\end{description}
%%%%%%%%%
\subsection{inputs}
\begin{description}
\item[notes] not NULL
\item[start\_date (timestamp)] --- to be determined
\item[end\_date (timestamp)] --- to be determined
\item[name] not NULL, whitespace-normalized (tentative)
\item[access\_level] not NULL, in range 1--4 (tentative)
\item[raw] not NULL (tentative)
\end{description}
%%%%%%%%%%
\subsection{likelihoods}
\begin{description}
\item[loglikelihoods] --- to be determined
\item[n\_eff] --- to be determined
\item[weight] --- to be determined
\item[residual] --- to be determined
\end{description}
%%%%%%%%%%%%
\subsection{machines}
\begin{description}
\item[hostname] no whitespace (tentative)
\end{description}
%%%%%%%%
\subsection{managements}\label{managements}
\begin{description}
\item[citation\_id] --- to be determined
Discussion: There should be some kind of consistency constraint above and beyond the foreign key constraint. Perhaps: The
management should be associated with (at least) one of the treatments
associated with the citation specified by \texttt{citation\_id}.
\item[date] not NULL (tentative)
Discussion: As a kind of consistency check, we could require certain values for the year, month, or day portions of the date for certain values of dateloc. For example, if dateloc = 8 (``year''), we could require the date to be of the form `YYYY-01-01'. (PostgreSQL, unlike MySQL, doesn't allow values of the form `YYYY-00-00'.) NULLs can be eliminated by setting dateloc to 9 (``no data''), which would effectively mean ``ignore the value stored in the date column''. Nevertheless, as a consistency check, we could choose some value (e.g. `1000-01-01') that should always be used when dateloc = 9.
\item[dateloc] use a ``dateloc'' domain to constrain to specific values; not NULL (tentative)
Discussion: If this is required to be non-null, we must decide how to handle the few values where dateloc is NULL and date is not null.
\item[mgmttype] not NULL; must be one of a defined set of values specified in the Rails model file for the Management class
Discussion: Consider storing these in the variables
table, or in a separate lookup table. If we used this table to record units and range
restrictions, this would provide additional useful consistency checks between mgmttype, level, and units.
\item[level] should be non-negative or the special value -999 (tentative)
Discussion: This should always be non-negative\marginpar{\em Please comment.} (except in the case that we
want to use the special value -999 for mgmttypes where a level has no
meaning; if so, we should also constrain level to be non-NULL).
\item[units] should be non-null if level is not null; should be constrained to a prescribed set of values (tentative)
Discussion: Ideally, the value should be constrained to a known set of values on a\marginpar{\textcolor{red}{to do: research what standard set of values to use}}
per mgmttype basis; currently there are several varying designations
for the same unit in a number of cases; for example, \verb"kg ha-1" vs.{} \verb"kg ha^-1". See discussion under \verb"mgmttype".
\item[notes] not NULL
\end{description}
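A sketch of the dateloc domain and one of the date-consistency checks discussed above; the value list shown is only an illustrative subset of the recognized dateloc codes, not the definitive set:
\begin{verbatim}
CREATE DOMAIN datelocs AS numeric
    CHECK (VALUE IN (8, 9, 91, 92, 93, 94, 95, 96, 97));

ALTER TABLE managements ALTER COLUMN dateloc TYPE datelocs;

-- dateloc 8 ("year") implies a date of the form YYYY-01-01:
ALTER TABLE managements ADD CONSTRAINT year_precision_dates
    CHECK (dateloc <> 8 OR
           (EXTRACT(MONTH FROM date) = 1 AND
            EXTRACT(DAY FROM date) = 1));
\end{verbatim}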
%%%%%%
\subsection{methods}
\begin{description}
\item[name] not NULL, whitespace-normalized
\item[description] not NULL
\end{description}
%%%%%%%
\subsection{mimetypes}
\begin{description}
\item[type\_string] not NULL; use a regular expression check
Discussion: A fairly tight check against a regular expression is mentioned in GH \#194.
\end{description}
%%%%%%
\subsection{models}
\begin{description}
\item[model\_name] not NULL, no whitespace (tentative)
\item[revision] not NULL, no whitespace (tentative)
\end{description}
%%%%%%
\subsection{modeltypes}
\begin{description}
\item[name] not NULL, no whitespace (tentative)
\end{description}
%%%%%%
\subsection{modeltypes\_formats}
\begin{description}
\item[tag] not NULL, no whitespace
Discussion: All existing tags are strings of lowercase letters. Should this be required? Should tags be unique?
\item[required] not NULL
\item[input] not NULL
\end{description}
%%%%
\subsection{pfts}
\begin{description}
\item[definition] not NULL
\item[name] not NULL, should match \verb"'^[-\w]+(\.[-\w]+)*$'" (tentative)
Discussion: Existing entries seem to follow a naming pattern so that they all match the given regular expression. We could relax this to ``no white\-space'' or ``whitespace-normalized''.
\item[pft\_type] --- to be determined
Discussion: Currently this is always `plant'.
\end{description}
%%%%%%%
\subsection{priors}
\begin{description}
\item[phylogeny] not NULL, whitespace-normalized
\item[distn] not NULL; should be one of `unif',
`gamma',
`beta',
`exp',
`lnorm',
`weibull', and
`norm'
\item[parama, paramb, paramc] Some sanity check based on the value of distn should be possible. (For example, should some distributions allow only positive numbers?)
\item[n] --- to be determined
Discussion: This should probably always be $\geq 2$ unless we want to allow certain values (1, 0, or negative numbers) as special values---say, to indicate an unknown value for n. (1 is not a good choice here, however, since it would easily be misinterpreted as a legitimate sample size.) Note that several existing values are 0! Also, several are NULL.
\item[notes] not NULL
\end{description}
%%%%%%
\subsection{projects}
\begin{description}
\item[name] not NULL, whitespace-normalized
\item[outdir] --- to be determined
Discussion: At least non-null.
\item[description] not NULL
\end{description}
%%%%%%
\subsection{runs}
\begin{description}
\item[start\_time (timestamp)] --- to be determined
\item[finish\_time (timestamp)] --- to be determined
\item[outdir] --- to be determined
\item[outprefix] --- to be determined
\item[setting] --- to be determined
\item[parameter\_list] --- to be determined
\item[started\_at (timestamp)] --- to be determined
\item[finished\_at (timestamp)] --- to be determined
\end{description}
%%%%%%
\subsection{sites:}\label{sites}
\begin{description}
\itemsep1pt\parskip7pt\parsep0pt
\item[city, state, country] --- to be determined
Discussion: Standardize geographic names (city, country, state) using TIGER /
OpenStreetMap. Note that \texttt{state} is currently used not only for
U.S. states, but also for states, regions, or provinces in other countries.
This may be harder to standardize. (Question: Does TIGER only deal
with U.S. geographic names?)
Use geocoding / reverse geocoding to enforce consistency between lat,
lon and city, state, country.
Country names should be standardized, and probably this standardization should be enforced with a constraint.
\item[som] should be in range 0--100 (this is a percentage)
\item[mat] should be in range -25--40
Discussion: This should be more than adequate: The highest annual mean temperature recorded is 34.4 degrees Celsius and the lowest is -19.4 degrees Celsius.
\item[masl] --- replaced by geometry
Discussion: Although this has been replaced, an altitude restriction could be placed on \verb"geometry".
\item[map] should be in range 0--12000
Discussion: According to Weather Underground, the wettest place in the world has an average annual precipitation of 11,871 mm.
\item[soil] --- to be determined
Discussion: Should at least be whitespace-normalized and non-null, but constraining to some finite list of values may be possible (anomalous information could go into the \verb"soilnotes" column).
Right now, there are 31 distinct descriptors, but many of these are the same if variations in capitalization and whitespace are ignored.
\item[soilnotes] not NULL
\item[sitename] not NULL, whitespace-normalized
\item[greenhouse (boolean)] --- to be determined
Discussion: Ideally, a not-NULL constraint should be enforced. But there are 272 rows where this is NULL.
\item[local\_time (int4)] --- to be determined
Discussion: A comment should clarify the meaning; I assume it
should mean something like ``the number of hours local standard time
is ahead of GMT'': this column should probably be called \verb"timezone" or better, \verb"utc_offset". Moreover, integer is a poor choice of datatype since certain locales---Iran and Newfoundland, for example---have time zones on the half hour, and some locales even use quarter-hour offsets.
This column can be confined to a finite set of values: see the list of all UTC offsets at \url{http://en.wikipedia.org/wiki/Time_zone#List_of_UTC_offsets}.
Some kind of check is certainly possible to ensure
consistency with the longitude extracted from the geometry. The offset is approximately equal to the longitude divided by 15, but in areas like China and Greenland where one timezone spans a wide longitude range, the difference can be as much as 3 or 4 hours.
\item[sand\_pct, clay\_pct] should be in range 0--100, and \verb"sand_pct" $+$
\verb"clay_pct" should be $\leq 100$; more succinctly, each should be non-negative and their sum should be at most 100.
\item[geometry] --- to be determined
Discussion: This replaces lat, lon, and masl. It is not clear to me what
constraints (if any) can or should be placed on geometry.
\end{description}
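Two of the sites checks above, sketched in SQL; the use of PostGIS's \texttt{ST\_X} to extract longitude from \texttt{geometry} is an assumption, as is the 4-hour tolerance mentioned in the discussion:
\begin{verbatim}
-- Soil texture percentages: each non-negative, sum at most 100:
ALTER TABLE sites ADD CONSTRAINT valid_soil_texture
    CHECK (sand_pct >= 0 AND clay_pct >= 0 AND
           sand_pct + clay_pct <= 100);

-- Rough agreement between UTC offset and longitude:
ALTER TABLE sites ADD CONSTRAINT utc_offset_matches_longitude
    CHECK (abs(local_time - ST_X(geometry) / 15) <= 4);
\end{verbatim}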
\subsection{species:}\label{species}
\begin{description}
\itemsep1pt\parskip7pt\parsep0pt
\item[genus] capitalized, not NULL, no whitespace
\item[species] not NULL, whitespace-normalized
\item[scientificname] not NULL, whitespace-normalized; also ensure \verb"scientificname" matches \verb"FORMAT('^%s %s', genus, species)" (tentative)
Discussion: There are 205 cases where the above match fails. For most of these, either the genus name or the species name is not contained in the value of \verb"scientificname". It's not clear if these are due to data entry errors or the use of synonyms. Some are cases where the genus is abbreviated in the \verb"scientificname" column. Often, the variety or subspecies name appears instead of the species name in the \verb"species" column.
A more comprehensive match restriction might be possible---something like
%\begin{quote}
\begin{verbatim}
CHECK(scientificname ~
FORMAT('^%s %s( (ssp\.|var\.?|\u00d7) \w+)?$', genus, species))
\end{verbatim}
%\end{quote}
though this doesn't account for authority designations, e.g. ``Hyacinthoides italica (L.) Rothm.'' In particular, it would be desirable to standardize the hybrid designator to the ``times'' symbol with a space on either side and no longer use the letter ``x''.
\item[commonname] not NULL, whitespace-normalized
\item[notes] not NULL
\end{description}
The remaining columns come from the USDA plant database, and we won't be overly concerned with them. Nevertheless, here are some notes:
\begin{itemize}
\item
Except for species, most taxonomic divisions (``Family'', ``Class'', ``Division'', ``Kingdom'', etc.) should probably be constrained to a single capitalized word (no spaces) or the empty string (if the information is not given).
\item Symbols: These should contain no whitespace and should consist of digits and uppercase Latin letters. Note that ``Symbol'' and ``AcceptedSymbol'' are almost always identical.
\item Duration: This is always `Annual', `Biennial', or `Perennial', or some combination of these. The combinations should be standardized. For example, both `Annual, Perennial' and `Perennial, Annual' occur. The empty string (or some other special value) should be allowed for unspecified information.
\item GrowthHabit: Of the 40-some thousand non-NULL, non-empty values given in this column there are only 66 distinct ones. This number could be reduced by standardizing: Most values are comma-separated combinations of `Tree', `Shrub', `Vine', etc. but the order varies.
\item Many columns are essentially boolean but use varchar(255) with values `Yes', `No', or NULL instead. These should be converted to bona fide boolean types or to a subdomain of varchar with values `Yes', `No', and some special values to indicate unknown, not-yet-entered, or inapplicable information.
\item Many columns use only the values `High', `Medium', `Low', and `None', (and perhaps also NULL and the empty string). A domain should be created for this column type.
\item pH\_Minimum, pH\_Maximum: These obviously should be confined to the range 0--14, with pH\_Minimum $<$ pH\_Maximum. Over 40,000 rows have 0.00 in both of these columns, which is nonsensical! A single column of range type could be used in place of these two columns.
\item Several columns refer to seasons of the year. A domain should be created.
\item Several quantitative columns should be constrained to be non-negative.
\item MinFrostFreeDays: Obviously should be non-negative and at most 365 (or 366 for leap years). Similarly for other ``Days'' columns.
\item A few other columns have values that are confined to a small finite set of string values.
\end{itemize}
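To illustrate the domain suggestions above, declarations along the following lines might be used (the domain names here are purely illustrative):
\begin{verbatim}
-- A domain for the High/Medium/Low columns; the empty string
-- stands for unspecified information.
CREATE DOMAIN rating AS varchar(255)
  CHECK (VALUE IN ('High', 'Medium', 'Low', 'None', ''));

-- A domain for the Yes/No columns that are not true booleans.
CREATE DOMAIN yes_no AS varchar(255)
  CHECK (VALUE IN ('Yes', 'No', ''));
\end{verbatim}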
%%%%%
\subsection{trait\_covariate\_associations}
\begin{description}
\item[required] not NULL
\end{description}
%%%%%%
\subsection{traits}\label{traits}
\begin{description}
\itemsep1pt\parskip0pt\parsep0pt
\item[date, dateloc, time, timeloc, date\_year, date\_month, date\_day,
time\_hour, time\_minute] --- to be determined: but at least constrain dateloc and timeloc to be in the set of recognized values; define a domain for this (see discussion of \verb"managements.dateloc")
Discussion: Check date and time fields consistency: For
example, if dateloc is 91--97, date and date\_year should both be
NULL (but maybe old data doesn't adhere to this?). If date\_year,
date\_month, or date\_day is NULL, date should be NULL as well. Also,
dateloc and timeloc should be constrained to certain meaningful
values. (See comment above on managements.dateloc.)
\item[mean] check mean is in the range corresponding to the variable
referenced by \verb"variable_id" \checkmark
\item[n, stat, statname] See comments above on covariates.
\item[specie\_id and cultivar\_id] these need to be consistent with one another
\item[notes] not NULL
\item[checked] not NULL, equal to 1, 0, or -1
\item[access\_level] not NULL; range is 1--4
\end{description}
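Since PostgreSQL CHECK constraints may not contain subqueries, the mean-range check against the variables table would have to be implemented as a trigger. A sketch follows (the function and trigger names are illustrative, and we assume here that \verb'variables.min' and \verb'variables.max' hold strings castable to numeric):
\begin{verbatim}
CREATE FUNCTION check_trait_mean() RETURNS trigger AS $$
BEGIN
  -- Reject the row if the referenced variable declares a range
  -- and the mean falls outside it.  A NULL min or max imposes
  -- no restriction on that side.
  IF EXISTS (
    SELECT 1 FROM variables v
      WHERE v.id = NEW.variable_id
        AND (NEW.mean < v.min::numeric
             OR NEW.mean > v.max::numeric)
  ) THEN
    RAISE EXCEPTION 'mean % is out of range for variable %',
      NEW.mean, NEW.variable_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trait_mean_in_range
  BEFORE INSERT OR UPDATE ON traits
  FOR EACH ROW EXECUTE PROCEDURE check_trait_mean();
\end{verbatim}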
\subsection{treatments:}\label{treatments}
\begin{description}
\itemsep1pt\parskip0pt\parsep0pt
\item[name] not NULL, whitespace-normalized
Discussion: Ideally, two names that differ only in capitalization should be treated as identical, but this would be hard to enforce.
\item[definition] not NULL, whitespace-normalized
\item[control] not NULL (tentative)
Discussion: The value is constrained in another way by the requirement stated in the Uniqueness portion of the constraints spreadsheet: ``there must be a control for each (citation\_id, site\_id) tuple''. ``A control'' means a treatment for which \verb"control = true". The meaning of this is not clear, however, since \verb"site_id" and \verb"citation_id" are not columns of this table.
\end{description}
\subsection{users:}\label{users}
\begin{description}
\item[login] --- to be determined: at least not NULL
Discussion: Enforce any constraints required by the Rails interface.
\item[name] not NULL, whitespace-normalized
\item[email] not NULL; constrain to (potentially) valid email addresses
\item[city] not NULL, whitespace-normalized
\item[country] not NULL, whitespace-normalized
\item[area] to be determined---at least not NULL
This currently isn't very meaningful. We have a mixture of values like ``tundra'' with values like ``Agriculture'' and ``Industry''.
\item[crypted\_password] not NULL
\item[salt] not NULL
\item[remember\_token] to be determined
\item[remember\_token\_expires\_at (timestamp)] to be determined
\item[access\_level] not NULL; range is 1--4.
\item[page\_access\_level] not NULL; range is 1--4.
\item[state\_prov] to be determined---at least not NULL
Discussion: For U.S. users, this could be constrained to valid state or territory names.
\item[postal\_code] to be determined---at least not NULL
Discussion: Ideally, this should be constrained according to the
country. Since most users are (currently) from the U.S., we could at
least constrain U.S. postal codes to ``NNNNN'' or ``NNNNN-NNNN''.
\end{description}
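The U.S. postal-code restriction mentioned above might look something like this (a sketch; it assumes the country column spells out ``United States'' exactly as shown):
\begin{verbatim}
ALTER TABLE users
  ADD CONSTRAINT valid_us_postal_code
    CHECK (country <> 'United States'
           OR postal_code ~ '^\d{5}(-\d{4})?$');
\end{verbatim}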
\subsection{variables}
\begin{description}
\item[description] not NULL, whitespace-normalized
\item[units] to be determined
Discussion: These should be standardized. See also discussion of \verb"managements.units".
\item[notes] not NULL
\item[name] not NULL, whitespace-normalized
\item[min, max] require \verb'min' $<$ \verb'max' if both are non-null
Discussion: Note that these are both of type varchar. Ideally, a single column of type \verb'numrange' could be used. If we stay with type varchar for these columns, there is no reason to allow them to be null. But in this case we should also require that the strings look like a number (e.g. \verb"~ '^-?\d+(\.\d*)?$'") possibly allowing in addition certain prescribed values such as \verb'infinity', \verb'-infinity', \verb'N/A', \verb'unspecified', etc.
\item[standard\_name, standard\_units, label, type] --- to be determined
Discussion: None of these columns is currently used: every row has either NULL or the empty string in each of these columns. Do we really want to keep them?
\end{description}
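If we keep min and max as varchar columns, the look-like-a-number requirement sketched above could be enforced with CHECK constraints such as these (the constraint names and the list of special values are illustrative; note that a CHECK constraint is satisfied vacuously when the column is NULL, so the not-NULL requirement must be imposed separately):
\begin{verbatim}
ALTER TABLE variables
  ADD CONSTRAINT min_looks_numeric
    CHECK (min ~ '^-?\d+(\.\d*)?$'
           OR min IN ('-infinity', 'N/A', 'unspecified')),
  ADD CONSTRAINT max_looks_numeric
    CHECK (max ~ '^-?\d+(\.\d*)?$'
           OR max IN ('infinity', 'N/A', 'unspecified'));
\end{verbatim}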
\subsection{workflows}
\begin{description}
\item[folder] --- to be determined: at least not NULL and no whitespace
\item[started\_at (timestamp)] --- to be determined
\item[finished\_at (timestamp)] --- to be determined
\item[hostname] --- to be determined: at least not NULL and no whitespace
\item[params] --- to be determined: at least not NULL and whitespace-normalized
\item[advanced\_edit (bool)] not NULL
\item[start\_date (timestamp)] --- to be determined
\item[end\_date] --- to be determined
\end{description}
\subsection{yields:}
\begin{description}
\itemsep1pt\parskip0pt\parsep0pt
\item[date, dateloc, date\_year, date\_month, date\_day] --- to be determined
Discussion: Check date fields consistency: For
example, if dateloc is 91--97, date and date\_year should both be
NULL (but maybe old data doesn't adhere to this?). If date\_year,
date\_month, or date\_day is NULL, date should be NULL as well. Also,
dateloc should be constrained to certain meaningful
values. (See comment above on \verb"managements.dateloc".)
\item[mean] not NULL, at least 0 and at most 1000 (tentative)
Discussion: The current maximum occurring in this column is 777.0.
\item[n, stat, statname] See comments above on covariates.
\item[specie\_id and cultivar\_id] these need to be consistent with one another
\item[notes] not NULL
\item[checked] not NULL, equal to 1, 0, or -1
\item[access\_level] not NULL; range is 1--4
\end{description}
\section{Foreign Key Constraints}\label{foreign-key-constraints}
All foreign key constraints follow the form
\texttt{table\_id references tables}, following Ruby style conventions.
A \href{https://gist.github.com/dlebauer/12d8d9ed1b2965301d64}{Github
Gist} contains a list of foreign key constraints to be placed on BETYdb.
The foreign keys are named using the form
\texttt{fk\_foreigntable\_lookuptable\_1} where the foreigntable has the
foreign key. Often, however, we will use more meaningful names instead of these auto-generated ones.
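As an illustration, one such constraint, written with a meaningful name, might be added as follows:
\begin{verbatim}
ALTER TABLE traits
  ADD CONSTRAINT fk_traits_variables
    FOREIGN KEY (variable_id) REFERENCES variables (id);
\end{verbatim}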
\section{Not-Null Constraints}\label{non-null-constraints}
\subsection{Reasons to Avoid Nulls}
\begin{enumerate}
\item The interpretation of a NULL is almost never defined for the database user. A NULL may be used for any of the following reasons, and it is generally unclear which reason applies in any particular case:
\begin{enumerate}
\item The data entry operator has to look up the information and hasn't yet done so. (And maybe they will forget to ever do so!)
\item The attribute is not applicable for the row in which the null appears. For example, \verb'city' might be null for a site in the middle of the desert not near any city. Or \verb'cultivar_id' may not apply for a trait measurement carried out on a non-domesticated species.
\item The information is pending. For example, \verb'citation.doi' might be null until a DOI has been assigned.
\item The information was never collected and is irretrievably missing. For example, perhaps a long-ago trait measurement was carried out and recorded on a farm crop, but the citation author failed to specify the cultivar of the species measured.
\item The information is missing, is retrievable, but it is not considered worth the effort to retrieve and add it. For example, a site with NULL in the \verb'city' column may be in or near a city, but we may not consider it important to fill in this information.
\end{enumerate}
\item The logic for using nulls defies common sense and is therefore exceptionally prone to yielding erroneous results. Here are some examples:
\begin{enumerate}
\item Suppose we have a table \verb'stats' with integer columns \verb'a' and \verb'b'. Then, normally, we should expect the queries
\begin{quote}
\verb"SELECT sum(a) + sum(b) FROM stats;"
\end{quote}
and
\begin{quote}
\verb"SELECT sum(a + b) FROM stats;"
\end{quote}
to produce the same result, but they probably won't if either column is allowed to contain a NULL.
\item It is commonly said that NULL stands for a value---one we simply don't happen to know. But consider a column \verb'a' of some numeric type. Then even in cases where we don't know the value of \verb'a', we \emph{do} know that \verb"a - a = 0". Most SQL products, however, evaluate \verb'a - a' as NULL when \verb'a' is NULL. For example, compare these query results:
\begin{quote}\begin{verbatim}
bety=# select count(*) from traits;
count
-------
13064
(1 row)
bety=# select count(*) from traits where mean - mean = 0;
count
-------
13058
(1 row)
\end{verbatim}
\end{quote}
The six missing rows in the second query result were rows in which \verb'mean' was NULL. In such a case PostgreSQL considers \verb'mean - mean' to be NULL, not zero, and in the context of a WHERE clause (but not in other contexts!) NULL is treated as if it were false.
\item Nulls are grouped together by the \verb'GROUP BY' clause but don't compare equal if we use the $=$ operator. (\verb'NULL = NULL' is evaluated to \verb'NULL', not \verb'true'.) Consider for example the difference between the results of the following two queries:
\begin{quote}\begin{verbatim}
SELECT
COUNT (*),
sitename
FROM
sites
GROUP BY
sitename
HAVING
COUNT (*) > 1
ORDER BY
sitename;
SELECT
COUNT (*),
sitename
FROM
sites s1
WHERE
EXISTS (
SELECT
1
FROM
sites s2
WHERE
s1.id != s2.id
AND s1.sitename = s2.sitename
)
GROUP BY
sitename
ORDER BY
sitename;
\end{verbatim}
\end{quote}
The first query groups together the rows with NULL in the sitename column. The second query ignores these rows completely and thus has one fewer group in the result. To get back the missing group, we have to replace \verb"AND s1.sitename = s2.sitename" with \begin{verbatim}
AND (s1.sitename = s2.sitename
     OR
     s1.sitename IS NULL
     AND
     s2.sitename IS NULL)
\end{verbatim}
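In PostgreSQL the same NULL-tolerant comparison can be written more compactly with \verb'IS NOT DISTINCT FROM', which, unlike \verb'=', treats two NULLs as equal:
\begin{verbatim}
AND s1.sitename IS NOT DISTINCT FROM s2.sitename
\end{verbatim}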
\end{enumerate}
These are just a few of the ways in which the 3-valued logic SQL uses to deal with NULLs defies common sense.
\end{enumerate}
In summary, the inherent ambiguity of NULLs makes it nearly impossible to come up with a well-defined predicate that defines the relation associated with a table and tells us exactly what fact the presence of a given row in the table is supposed to represent. And even if we manage to do so, the slipperiness of the logic SQL uses to manipulate NULLs makes it highly likely that the queries we write against that data won't mean what we intend them to.
\subsection{Summary of Not-NULL constraints}
Admittedly, some nulls are worse than others. We probably don't care much that over half the rows in the treatments table have NULL in the \verb'user_id' column, since it is unlikely we will write any queries using this column.
Given the overwhelming task of eliminating the use of nulls from a database that has allowed them for so long, we have to make some priorities. The following is a list of columns from which we wish to eliminate (and prevent future) nulls in the near term.
This is a list of fields that should not be allowed to be null. In all
cases, columns making up part of a candidate key (shown below in parenthesized groups) should not be null. For many-to-many relationship
tables, the foreign keys should be non-null. In general, nulls can be avoided in all textual columns, since the empty string can easily be used instead.
\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
citations: (author, year, title), journal, pg
\item
citations\_sites: (citation\_id, site\_id)
\item
citations\_treatments: (citation\_id, treatment\_id)
\item
covariates: (trait\_id, variable\_id)
\item
cultivars: (specie\_id, name), ecotype, notes
\item
dbfiles: (file\_name, file\_path,
machine\_id), container\_type, container\_id
\item
ensembles: workflow\_id
\item
entities: name, notes (missing values should be the empty string)
\item
formats: (dataformat)
\item
formats\_variables: (format\_id, variable\_id)
\item
inputs: name, access\_level, (site\_id, start\_date, end\_date, format\_id)
\item
inputs\_runs: (input\_id, run\_id)
\item
inputs\_variables: (input\_id, variable\_id)
\item
likelihoods: (run\_id, variable\_id, input\_id)
\item
machines: (hostname)
\item
managements: (date, management\_type)
\item
managements\_treatments: (treatment\_id, management\_id)
\item
methods: (name, citation\_id), description
\item
mimetypes: (type\_string)
\item
models: (model\_name, revision)\footnote{Using this as a key is still at the proposal stage.}, model\_path, model\_type
\item
pfts: definition, (name, modeltype\_id)
\item
pfts\_priors: (pft\_id, prior\_id)
\item
posteriors: (pft\_id, format\_id)
\item
priors: (phylogeny, variable\_id, notes), (phylogeny, variable\_id, citation\_id), distn, parama, paramb [we've repeated column names that are part of two keys]; n: If we require n to be non-NULL, we have to decide how to handle missing information.
\item
runs: (model\_id, site\_id, start\_time, finish\_time, parameter\_list, ensemble\_id), outdir,
outprefix, setting, started\_at (note:
finished\_at will not be available when record is created)
\item
sites: (geometry, sitename), greenhouse
\item
species: genus, species, (scientificname)
\item
traits: (site\_id, specie\_id, citation\_id, cultivar\_id, treatment\_id, date, time, variable\_id, entity\_id, method\_id, date\_year, date\_month, date\_day, time\_hour, time\_minute),
mean, checked, access\_level\footnote{The key is still at the proposal stage. Moreover, \emph{many} values in some of these columns are currently NULL.}
\item
treatments: name, control, definition (name should probably also be constrained to be non-empty; but definition may be empty.)
\item
users: (login), name, email, crypted\_password, salt, access\_level,
page\_access\_level, (apikey)\footnote{It is yet to be decided with certainty which columns should be keys.}
\item
variables: (name), units
\item
workflows: folder, started\_at, (site\_id, model\_id, params,
advanced\_edit, start\_date, end\_date), hostname
\item
yields: (citation\_id, site\_id, specie\_id, treatment\_id, cultivar\_id, method\_id, entity\_id, date\_year, date\_month, date\_day), checked, access\_level, mean\footnote{The key here is yet to be finalized.}
\end{itemize}
\section{Uniqueness constraints}\label{uniqueness-constraints}
These are ``natural keys'', that is, combinations of columns that provide a natural way to identify, select, and distinguish members of the set of entities the table is meant to represent. More generally, each is a candidate key, that is, a combination of non-NULL columns guaranteed to be unique within a table.\footnote{Note that SQL's UNIQUE constraint does not prevent duplicate rows if even one column of the constraint is allowed to be NULL. It only guarantees a row is unique in the case where all of its columns that are involved in the uniqueness constraint are non-NULL.} Ideally, each table would have a natural key, but
a table may have more than one candidate key. Each table should \emph{always} have at least one candidate key, and ideally this will be something other than the auto-numbered \texttt{id} column that Rails expects each table to have by default.
For many-to-many relationship tables, the foreign key pairs should be
unique; these should be implemented but are not listed here for brevity except where the table contains columns other than the foreign key and timestamp columns.
\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
citations: author, year, title
\item
covariates: trait\_id, variable\_id
\item
cultivars: specie\_id, name
\item
dbfiles: file\_name, file\_path, machine\_id
\item
formats: site\_id, start\_date, end\_date, format\_id
\item
formats\_variables: format\_id, variable\_id
\item
likelihoods: run\_id, variable\_id, input\_id
\item
machines: hostname
\item
managements: date, management\_type
\item
methods: name, citation\_id
\item
models: model\_name, revision (tentative)
\item
pfts: name, modeltype\_id
\item
posteriors: pft\_id, format\_id
\item
priors: phylogeny, variable\_id, \st{distn, parama, paramb}, \ul{citation\_id}
\item
priors: phylogeny, variable\_id, notes
\item
runs: (?) model\_id, site\_id, start\_time, finish\_time,
parameter\_list, ensemble\_id
\item
sites: geometry, sitename
\item
species: scientificname (not genus, species because there may be
multiple varieties)
\item
traits: site\_id, specie\_id, citation\_id, cultivar\_id,
treatment\_id, date, time, variable\_id, entity\_id, method\_id,
date\_year, date\_month, date\_day, time\_hour, time\_minute
\item
treatments:
\begin{itemize}
\item
For a given citation, name should be unique. (Note that there is no citation\_id column in the treatments table. The association of treatments with citations is a many-to-many one via the citations\_treatments table. So the constraint, in words, is something like this: ``Given two rows of the treatments table with distinct values for \verb'name', no citation should be associated with both rows.'' Is this really the restriction we want?)
\item
For a given citation and site, there should be only one control. \marginpar{It is not at all clear how treatment is associated with a site!}
\end{itemize}
\item
users: (each of the following fields should be independently unique
from other records)
\begin{itemize}
\item
login
\item
email [disputed]
\item
crypted\_password [disputed]
\item
salt [disputed]
\item
apikey
\end{itemize}
\item
variables: name
\item
workflows: site\_id, model\_id, params, advanced\_edit, start\_date,
end\_date
\item
yields: site\_id, specie\_id, citation\_id, cultivar\_id,
treatment\_id, date, entity\_id, method\_id, date\_year, date\_month,
date\_day
\end{itemize}
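As an illustration, the citations key above might be declared as follows (a sketch; the columns must also be made non-null, or---as noted earlier---the UNIQUE constraint will not prevent effective duplicates):
\begin{verbatim}
ALTER TABLE citations
  ALTER COLUMN author SET NOT NULL,
  ALTER COLUMN year   SET NOT NULL,
  ALTER COLUMN title  SET NOT NULL,
  ADD CONSTRAINT unique_citation UNIQUE (author, year, title);
\end{verbatim}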
\end{document}