Comments on VOTable 0.4
Clive Davenhall, David Giaretta, Bob Mann, Clive Page, Guy Rixon
23 Jan 2002
Here are some combined comments from Clive Page, Clive Davenhall, Guy Rixon, Bob Mann and David Giaretta.
The comments are divided into three types:
- overall comments/thoughts about the general approach
- technical suggestions/corrections
- comments that are more editorial in nature
It was felt that there was a lack of explanation of the motivation for yet another format, and in particular an explanation of the name VOTable - what are its advantages for VO work?
The importance of XML is clear, but it is not clear what the use of XML brings in this case. For example, is the use of XSLT planned? Should we be looking at the use of Schema, which would allow greater use of XML tools?
It is also unclear whether VOTable is regarded as an interchange format or an on-line format for use in day-to-day analysis. It is also unclear whether it is supposed to be adequate for large tables or only for small tables. The combination of pure-XML with non-XML data has advantages for large tables but does cause difficulties.
Is it a design goal that the transformation FITS Table ==> VOTable ==> FITS Table should be guaranteed not to lose information?
One of the additional functionalities noted is the use of the VOTable as a query. Yet it did seem that there may be some omissions from its capabilities in these regards.
The technical comments below expand on several of these points.
There is a concern about the omission from VOTable of metadata which applies to the table as a whole. This kind of parameter is very valuable. For example information such as that to do with provenance is likely to be very significant. Parameters themselves could have associated units, data type and display format. More generally FITS can have a variety of additional records not allowed for in VOTable, for example the HISTORY records in a FITS file. This is in itself a serious omission because for example things like sky coverage, wavelength range or sensitivity could easily be part of a query, if a VOTable is used in that way. It also prevents reversibility of the transformations into and out of VOTable such as FITS table ==> VOTable ==> FITS table.
How is the mapping of VOTable columns to FITS table columns defined? Is it by name or by position?
Section 2.1 says that the VOTable is completely compatible with the FITS
Binary Table. I guess that strictly that is true, but in the wild one often
encounters FITS tables which use extensions to the strict Standard, especially the
variable length, multi-dimensional array, and substring array conventions,
described in Appendix B.1 through B.3 of the Standard. I think it is highly
desirable to support all these. I noted the B.3 problem above. The variable
length B.1 facility is used extensively in X-ray astronomy: I think it is
covered by the variable-sized array facility in VOTable, but I'm not an expert
so would welcome confirmation. I don't think the multi-dimensional array
facility of B.2 is covered, though it would not be difficult.
It may be useful to
Section 1.1 Example line 13 has datatype="A" arraysize="10"
This notation has perhaps been copied directly from the FITS Binary Table spec, in which the only way of specifying a string of characters is to have an array of them. Although the C programming language has the same limitation, as far as I know all other programming languages used by astronomers, including Perl, Python, Java, C++, and Fortran (at least all versions later than Fortran66), have a well-defined concept of a character string, and most of these languages can handle an array of character strings.
The need for arrays of strings in FITS tables has been recognised, and
Appendix B.3 of the FITS Standard describes what one can only call a fudge to
achieve this. I think it would be much better for VOTable to have a concept of a
character string, and instead use an attribute such as length="10"
which would allow the definition of an array of strings in a manner exactly as
for all other data types. It would also permit the translation of any FITS file
which uses the "Substring Array" convention (as defined in Appendix
B.3); since this notation has been quite extensively used in FITS files written
in various observatories, it surely needs to be supported.
I don't see any difficulty in translating the "nA" type in a FITS binary table to a string, rather than to an array of characters.
I am also confused by section 3 para 4 which says that "character strings will be padded with null characters if they are shorter than the specified length". It is not clear whether this specifies what is supposed to happen when a FITS table is converted to XML, or when the XML is parsed. There is an obvious need to handle strings of variable length (such as in SQL VARCHAR fields) but FITS has only an uneasy compromise between C pseudo-strings (fixed maximum length, null-terminated if short), and Fortran ones (fixed length, space padded in all practical situations when lengths do not match). I think the VOTable specification needs to decide whether its strings are truly fixed (as in Fortran) or fully dynamic (as in SQL, Perl, Python, etc.) in length, and formulate its conversions from FITS to XML accordingly.
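To make the difference between the two fixed-length conventions concrete, here is a minimal Python sketch. The helper names are hypothetical and nothing here is taken from the VOTable draft; it only illustrates why the choice of padding matters when strings are converted and read back.

```python
def pad_c_style(value: str, length: int) -> bytes:
    """Fixed maximum length, NUL-padded if short (C pseudo-string)."""
    raw = value.encode("ascii")[:length]
    return raw.ljust(length, b"\x00")


def pad_fortran_style(value: str, length: int) -> bytes:
    """Fixed length, space-padded if short (Fortran CHARACTER)."""
    raw = value.encode("ascii")[:length]
    return raw.ljust(length, b" ")


def unpad_c_style(raw: bytes) -> str:
    """Keep everything before the first NUL; trailing spaces survive."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


def unpad_fortran_style(raw: bytes) -> str:
    """Strip trailing spaces; genuine trailing spaces cannot be recovered."""
    return raw.decode("ascii").rstrip(" ")
```

With NUL padding a value such as "M31 " (with a significant trailing space) survives a round trip; with space padding it comes back as "M31".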
Section 1.1 Example line 15 has unit="degrees"
The IAU has in its style manual "Recommendations concerning Units", a copy of which can be found at http://www.iau.org/IAU/Activities/nomenclature/units.html. This has radians as the standard unit of planar angle. Obviously angles measured in radians are not easy for humans to read, but then neither are XML files. If I understand it, VOTable is an XML format to facilitate data interchange, and is therefore really intended only to be machine readable.
The fact that XML is based on ASCII text is a convenience because that avoids problems with endian-ness which make binary formats less portable. For machine reading, radians seem much more sensible as these are the units in which just about all programming languages do their trigonometry, so this avoids unnecessary conversions to/from degrees. Indeed for maximum convenience of humans, at least the astronomical sub-species, even degrees are not optimum, as sexagesimals are so widely used. The OGIP Memo 93-001, reachable from http://legacy.gsfc.nasa.gov/docs/heasarc/ofwg/ofwg_recomm.html, sets out some standards for units in FITS files which are fairly widely followed: this allows degrees as an alternative, but specifies that it should be denoted by the string "deg", not "degree" and certainly not "degrees" (all units ought to be singular).
By the way, what exactly is the datatype attribute of a FIELD (table on p3) defining:
If (b) (which is the obvious interpretation) then precisely what do the
datatype options (in the table on p3) mean when the table is stored as a
TABLEDATA or CSV? (think particularly of options X, B or I, though conceptually
the issue arises with any of them).
I'm happy with the proposed options for datatype and with the TABLEDATA and CSV table representations. However, together they do beg the above question, and I think that the slightly non-intuitive answer is (a).
The table in section 2 has both "D" and "F" as a marker
for double type. The FITS Standard has only "D"; it is not clear to me
why a synonym is required.
Section 3 para 4 says that the arraysize attribute specifies the number of 8-bit bytes, but the equivalent FITS binary table specifier, "rX", has "r" giving the number of bits. The number of bytes has to be derived as int((r+7)/8). If you specify only the number of bytes, you don't know exactly how many bits are in use. I suggest using the FITS notation here; the alternative would be to have a special attribute for this data type.
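The information loss can be seen with a little arithmetic. The sketch below assumes the usual rounding-up rule, i.e. that a field of r bits occupies the smallest whole number of 8-bit bytes able to hold it; the function name is hypothetical.

```python
def bytes_for_bits(nbits: int) -> int:
    """Smallest number of 8-bit bytes that can hold nbits bits."""
    if nbits < 0:
        raise ValueError("number of bits must be non-negative")
    return (nbits + 7) // 8


# Every bit count from 9 to 16 maps onto the same byte count, so the
# byte count alone cannot tell a reader how many bits are in use.
same_size = [bytes_for_bits(r) for r in range(9, 17)]
```

Going from bits to bytes is well defined, but the reverse is not, which is why the bit count seems the better quantity to record.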
How are Boolean values denoted in TABLEDATA? Are upper as well as lower case "T" and "F" allowed?
If a cell contains an array of complex numbers then there are, in principle, several ways in which the values could be ordered. For example, for a 3 element complex array:
Technically, there are also 2 additional options in which the imaginary part comes first. The standard should specify which of these orders the values should occur in - the first seems the most likely. This is footling pedantry of the worst sort, but if it is written into the standard then there is no scope for ambiguity.
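As an illustration of why the order needs pinning down, the following Python sketch assumes the two real-first orderings are (i) pair by pair and (ii) all real parts followed by all imaginary parts. The labels "interleaved" and "planar" and the function names are ours, not the draft's.

```python
def interleaved(cells):
    """re1, im1, re2, im2, ... (each complex value kept together)."""
    out = []
    for z in cells:
        out.extend((z.real, z.imag))
    return out


def planar(cells):
    """re1, re2, ..., im1, im2, ... (all real parts first)."""
    return [z.real for z in cells] + [z.imag for z in cells]


def read_interleaved(numbers):
    """Rebuild complex values assuming the interleaved order."""
    return [complex(numbers[i], numbers[i + 1])
            for i in range(0, len(numbers), 2)]


cell = [complex(1, 2), complex(3, 4), complex(5, 6)]
```

A reader that assumes the interleaved order but is handed planar data reconstructs different numbers without any error being raised, which is exactly the ambiguity the standard should remove.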
It is not clear how COOSYS is tied up to the columns in the table which define the coordinate system; the ID attribute could be used for this.
If the VOTable is used outside Astronomy then additional coordinate systems would be needed. Even between co-operating institutes there may be specialised coordinate systems. Defining "system" as an enumerated attribute may be too restrictive.
On a related topic, catalogues created by detecting objects in CCD images and digitised photographic plates usually contain both the positions measured for the objects in the CCD frame or plate and the celestial coordinates derived from them. It may be useful to store the coefficients used to make these transformations in a standard way.
A standard name for TIME may also be necessary, for example for any time-series work on variable stars, as well as STP and Solar work. CDF uses EPOCH. Rather than this, something like TIMESYS may be better to avoid confusion.
By the way, section 4.3 specifies a date in the form
"2002-01-31T12:00:00:00" (though I think the last colon may be a mis-print).
The reference is to the FITS Standard, but derived from ISO8601. I think it
would be better to refer to the primary source. Unfortunately it costs money to
get ISO8601 from the International Standards Organisation, but a useful summary
exists here: http://www.cl.cam.ac.uk/~mgk25/iso-time.html
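For what it is worth, a strict reader of the YYYY-MM-DDThh:mm:ss form would reject the date as printed in the draft. A minimal Python sketch, using only the standard library (the function name is hypothetical):

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_timestamp(text):
    """Return a datetime, or None if text is not YYYY-MM-DDThh:mm:ss."""
    try:
        return datetime.strptime(text, ISO_FORMAT)
    except ValueError:
        return None
```

Here parse_timestamp("2002-01-31T12:00:00") succeeds, while the string with the extra ":00" is refused, which supports the suspicion that the last colon is a misprint.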
Sections 3.4 and 4.1.1 cover NULLs, in part. It is quite important to get this right, since missing values are common in many astronomical tables. The FITS Standard specifies the use of NaN for null values in floating-point fields, but obviously for integer types there is no equivalent, as all bit-patterns may be valid values, so there is an alternative notation, allowing the grabbing of some unlikely value (such as -99) to represent missing information. This has always seemed like a kludge to me, and gets difficult with 8-bit integers, when it can be hard to give one of just 256 values up for this purpose. I would have thought that XML would have a standard way of expressing nulls, but I haven't been able to find it. Since the XML stream is representing integers by strings of characters, there is no need to reserve a particular integer such as -99; it could just as well be a string such as "NaN". It is not clear to me why the "invalid" attribute is needed, as distinct from merely "null".
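A sketch of what a textual null marker would mean for a reader of integer cells, in Python. The token "NaN" is simply the string suggested above and the function name is hypothetical; neither is part of VOTable 0.4.

```python
from typing import Optional

NULL_TOKEN = "NaN"  # assumption: the marker suggested above, not in the draft


def parse_int_cell(text: str) -> Optional[int]:
    """Return None for the null marker, otherwise the integer value."""
    text = text.strip()
    if text == NULL_TOKEN:
        return None
    return int(text)
```

With this approach -99 remains an ordinary value, and no integer has to be sacrificed to represent missing data.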
Is <CELL></CELL> allowed?
For character strings it is proposed to use the FITS representation, using an ASCII NUL value (zero) as the first character. I can see pragmatic reasons for this, but feel that an out-of-band mechanism would be better, especially as null bytes have a habit of causing problems in data transfers. It also avoids forcing the XML parser to read the contents of each string to see whether it is null or not. If, as I proposed above, the VOTable allows arrays of strings, a better null representation is also needed, since one might want to declare missing just some strings in an array of strings.
Section 2 (p3) The rows in a VOTable are not necessarily unordered, and in
the case where they are ordered it would be useful to have a mechanism to
indicate this. Note that the ordering of tables is not just, or even primarily,
a presentation issue. Rather, knowing that a catalogue is sorted on some column
allows a program reading the catalogue to make fast `range' selections on this
column (binary chops etc).
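The kind of fast range selection meant here can be sketched in Python with the standard bisect module. The function name and the sample values are made up for illustration, and the sketch assumes the column is already known to be sorted in ascending order.

```python
from bisect import bisect_left, bisect_right


def range_select(sorted_values, low, high):
    """Return (start, stop) such that rows start..stop-1 have
    low <= value <= high; needs only O(log n) comparisons."""
    start = bisect_left(sorted_values, low)
    stop = bisect_right(sorted_values, high)
    return start, stop


mags = [10.2, 11.5, 12.0, 12.0, 13.7, 15.1]  # sorted column (made-up values)
start, stop = range_select(mags, 12.0, 14.0)
```

Without the knowledge that the column is sorted, the same selection requires a scan of every row.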
One thrust of the VOTable document seems to be that the VOTable standard is a general mechanism for storing tabular and catalogue data in astronomy, not just (for example) for representing small tables extracted from a remote archive and transmitted across the internet prior to display. Thus, the VOTable standard should be suitable for storing large catalogues, where preserving information about sort order in order to facilitate fast `range' selections is important.
Obviously the default, where no information about sort order is included in the catalogue metadata, should be that the catalogue is unordered.
Similarly, there appears to be no provision for storing indices on any of the columns, which again would allow a program reading the catalogue to make fast range selections on indexed columns.
The RESOURCE element can contain several TABLEs, so I can't see any bar to including additional tables containing simple indices: lists of row numbers in the original table arranged in an order corresponding to a sort on some column (or to a selection of a subset, for that matter). An additional bit of syntax might be required to relate the indices to the original table.
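Such a simple index is cheap to construct. The Python sketch below builds the list of row numbers for one column; the function name is hypothetical and row numbers are assumed to start at zero.

```python
def build_index(column):
    """Row numbers of the original table, arranged so that the
    column values they point at are in ascending order."""
    return sorted(range(len(column)), key=column.__getitem__)


ra = [150.1, 10.4, 275.0, 83.6]  # unsorted column (made-up values)
index = build_index(ra)
```

Reading the column through the index yields the values in sorted order, so a binary search can be applied even though the table itself is left untouched.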
Perhaps more complex (2D) indexing schemes can be deferred to version 2.
Section 2 of the proposal says "a VOTable document may be used to
express a question as well as an answer...the specification of class as an
implicit request for instance."
I'll comment on using VOTable as a query format below, but here I suggest that VOTable is useful for describing data resources in detail. That is, one could use a collection of VOTable documents (or one such document with a large number of RESOURCE elements) in a resource directory to say exactly what queries are possible on various tables.
This usage is attractive because VOTable describes columns of tables in a generic way, and if we have software to handle that kind of metadata we might as well re-use the software for all cases where columns need to be described. It's not clear that VOTable is the best arrangement for a resource directory.
The VOTable material in the directory has to describe the external view of a table in a data-service. That is, the columns in the VOTable header may not be exactly the columns held in the database; there may well be translations going on as queries are accepted and results are returned. The translations need not be symmetric, i.e. the set of columns used in a query need not be the same as the set of columns that can be in the output. Therefore, to make VOTable into a generic representation of a tabular-data resource, the FIELD elements need some annotation saying in which cases they can be used. I suggested adding values to the set allowed for the type attribute: "in" for fields that can be used in a query; "out" for fields that can be used in results. For fields that can be used in both cases, the FIELD element is duplicated with different types on the two instances.
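A rough sketch of the duplication rule, in Python with the standard xml.etree module. The "in" and "out" values are the suggestion made here and are not part of VOTable 0.4; the function name and the example field are hypothetical.

```python
import xml.etree.ElementTree as ET


def field_elements(name, datatype, in_query, in_results):
    """Build one FIELD per permitted use; a field usable both ways
    appears twice, once with type="in" and once with type="out"."""
    fields = []
    if in_query:
        fields.append(ET.Element(
            "FIELD", {"name": name, "datatype": datatype, "type": "in"}))
    if in_results:
        fields.append(ET.Element(
            "FIELD", {"name": name, "datatype": datatype, "type": "out"}))
    return fields


both = field_elements("r_mag", "D", in_query=True, in_results=True)
output_only = field_elements("z", "D", in_query=False, in_results=True)
```

A directory reader can then list the fields with type "in" to find out which constraints a service will accept.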
We would also need some agreement on what the units mean in a resource catalogue. Are the stated units what you get, with no choice in the matter? Or are they what you get unless you specify some conversion?
r_mag > 18 OR z > 0.6
(assuming that "r_mag" and "z" are the names of fields).
The WHERE clause could use just the where-clause syntax of XQuery. The advantages are (a) that that syntax has been well-vetted by specialists and is less likely to give subtle problems; and (b) that we might be able to scavenge some parsing code from XQuery implementations.
The WHERE clause could use the syntax defined for constraints in ASU. That syntax is rich in operators, but doesn't have specialised functions.
How would the query express a join between tables?
In general, having a small set of options for the language is good future proofing. The WHERE clause should have a single child-element which identifies the constraint language (like the SQL element in the example above).
As written the example and explanation are a bit confusing. I think that what is intended (and if not it should be) is that:
These rules apply throughout the CSV, including any header lines (the number
of which is indicated by headlines), which are skipped over.
(The present text implies that any header lines to be skipped over must end in \n, irrespective of the row separator used for the rest of the table, which seems perverse. Of course, in most tables \n will be the row separator.)
The above comments for TABLEDATA about unparsable values and empty cells
(here adjacent occurrences of the separator character) also apply.
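To show the reading of the CSV rules that I have in mind, here is a small Python sketch. It assumes that the same row separator applies to header lines as to data rows, that the first "headlines" rows are skipped, and that an empty cell means a missing value; the parameter names colsep and rowsep are hypothetical.

```python
def parse_csv(text, colsep=",", rowsep="\n", headlines=0):
    """Split text into rows and cells; empty cells become None."""
    rows = text.split(rowsep)
    if rows and rows[-1] == "":
        rows.pop()  # a trailing row separator does not start a new row
    table = []
    for row in rows[headlines:]:
        table.append([cell if cell != "" else None
                      for cell in row.split(colsep)])
    return table


sample = "name;mag|M31;3.4|M33;|"
parsed = parse_csv(sample, colsep=";", rowsep="|", headlines=1)
```

Note that the header line is delimited by "|" like every other row, rather than by \n.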
- It is not entirely clear that enclosing quotation marks should automatically be ignored and removed (if they're just going to be ignored then why are they there?). Which quotation marks: single quotes, double quotes or both?
In section 4.3 (and 6) the syntax href=file://mydata.dat/ is described. As an aside: it is not clear to me in what circumstances quotes are needed around the argument of the href attribute. More importantly, I think it will be important to have a notation both for absolute and relative paths: the example shown looks like an absolute path, but I'm not sure. The use of relative paths is certainly convenient in web pages, as it means that links to dependent files, such as GIF images, in the same directory retain their validity even if the whole collection of files is copied elsewhere. But there may be circumstances when an absolute file path is needed.
We assume that MAX/MIN refer to the maximum/minimum allowed values rather than the actual max/min values occurring in the table, but this is not clear from the text. This functionality is available in XML Schema.
A tree-diagram (for example the usual XMLSpy-type diagram) would be useful to show the structure of the table.
Data should be plural.
Use of a phrase like "will cause an exception to be thrown" (section 3.4) is rather out of place in the description of a format. The VOTable standard should not prescribe how a program reading a VOTable should behave when it encounters an invalid table; its behaviour will depend, in part, on its function and circumstances. It might indeed throw an exception and abort, or it might issue a warning, attempt a guess at the missing datatype and carry on, or even, in some circumstances, attempt to carry on without issuing a warning. It is for the program's designers, not the VOTable standard, to decide what behaviour is appropriate.
The version of the encoding such as gzip should perhaps be specified.