Assigning metadata to your datasets (UCDs, units, utypes, characterization)
Abstract
Data can only be properly interpreted and used when proper metadata
are associated. Image pixel values without a FITS header (with astrometric
metadata, instrument information, epoch...) or catalogue values without
parameter descriptions (column types, units, ...) are useless.
An important step when publishing data to the VO is to ensure that
relevant metadata are provided, allowing wide usage of the corresponding
data. Several metadata standards have been and are being developed in
the IVOA context to ensure the use of homogeneous metadata across the
VO, and allow good interoperability.
This session will demonstrate how to assign standardized metadata
prior to publishing data to the VO: Unified Content Descriptors (UCDs)
for table columns, and units.
The IVOA has developed a few datamodels (DM) for interoperability:
STC (for space time wavelength coordinates metadata), Spectrum DM (for
spectra or time series), Characterisation (for Observation descriptions
in data parameter space), Line DM (for accurate description of Atomic
and Molecular Lines) and Theory data model (for description of outputs
of simulations). After recalling the content of some of these datamodels,
we will show how data producers can gather metadata information and organize
them consistently with the IVOA DM and how they can publish them.
We will also demonstrate a few use cases, showing how these metadata
can be used by existing VO tools to perform advanced actions, and how
client developers and end users can make use of the DMs, using various
tools and various formats (XML, VOTable with utypes, FITS).
External References
Advisors (CDS)
- Sébastien Derrière
- Thomas Boch
- François Bonnarel
Software Requirements
In addition to the common workshop software requirements, this session requires:
- Perl 5.x
  - Linux: should be preinstalled
  - Windows
  - MacOSX: preinstalled
Download
http://www.euro-vo.org/dcaworkshop2008/HandsOn/metadata/AC1.tar
Hands-on session
Goals of this session:
- show where metadata are used in the VO
- show which formats/standards define VO metadata
- show practical methods to add them to existing data
- use these metadata in VO tools
The various exercises of this session are independent, and can be
addressed in any order. Some very simplistic test data are provided,
but you are encouraged to try and test applying the demonstrated
paradigms to your own datasets.
The general problem of publishing data to the VO: you have some
dataset, with its original description. You need to identify what has
to be done to publish it to the VO:
- find the relevant data access protocol (ConeSearch, SIAP, SSAP, SNAP...)
- identify which metadata will be needed in the VO format
- convert original description to VO standards
- convert the original data to the VO exchange format (e.g. database to VOTable)
  - translation layer
  - use existing libraries/tools
- advertise your service by publishing it in a VO registry
  - fill-in VOResource metadata
Assigning metadata
Assigning UCDs to a dataset
UCDs (Unified Content Descriptors) provide the semantic meaning of
quantities (what the quantity is).
They are mainly used for describing the contents of columns in
VOTable documents, with a ucd="" attribute in the FIELD element.
But they can also be used to describe individual parameters, or
tabular data in the registry.
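As an illustration of where the attribute sits, here is a minimal Python sketch that builds a single FIELD element with the standard library. The column name `RAdeg` and the helper `make_field` are hypothetical and chosen only for the example; a real VOTable also needs the surrounding VOTABLE, RESOURCE and TABLE elements.

```python
import xml.etree.ElementTree as ET

def make_field(name, datatype, ucd=None, unit=None):
    """Build a VOTable FIELD element with optional ucd and unit attributes."""
    attrs = {"name": name, "datatype": datatype}
    if ucd:
        attrs["ucd"] = ucd
    if unit:
        attrs["unit"] = unit
    return ET.Element("FIELD", attrs)

# Hypothetical right-ascension column, in decimal degrees.
field = make_field("RAdeg", "double", ucd="pos.eq.ra", unit="deg")
field_xml = ET.tostring(field, encoding="unicode")
print(field_xml)
```

In practice TopCat writes these attributes for you when you save the table as a VOTable, so the sketch is mainly useful for seeing what ends up in the file.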
UCDs are standardized and described in two reference documents (IVOA recommendations): one for
the syntax rules and the other for the list of valid words.
Briefly put, a UCD consists of at least one word, or several separated by semicolons (;).
The first word carries most of the meaning. To describe a magnitude measured in the
V band, we can use the word phot.mag (describing a magnitude), and combine it with the
word em.opt.V (describing the V band in the optical): the complete UCD will be
phot.mag;em.opt.V
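The composition rule can be sketched in a few lines of Python. The helper names `build_ucd` and `split_ucd` are hypothetical, and the sketch only handles the semicolon syntax: it does not check the words against the list of valid words.

```python
def build_ucd(primary, *secondary):
    """Join UCD words with semicolons; the first word carries most of the meaning."""
    words = [primary, *secondary]
    for word in words:
        if not word or ";" in word or " " in word:
            raise ValueError(f"not a single UCD word: {word!r}")
    return ";".join(words)

def split_ucd(ucd):
    """Return the primary word and the list of remaining words."""
    words = [w.strip() for w in ucd.split(";")]
    return words[0], words[1:]

v_mag = build_ucd("phot.mag", "em.opt.V")
print(v_mag)
print(split_ucd(v_mag))
```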
UCD-related documentation and tools can be found online:
http://cdsweb.u-strasbg.fr/UCD/
A set of on-line tools is also available:
http://vizier.u-strasbg.fr/UCD/tools.htx
The first step for data providers is to identify the relevant UCDs describing the data
they want to publish to the VO.
We will use as a test dataset a catalogue of planetary nebulae in M33 (Ciardullo et al., 2004).
You can download a
CSV file of the data, and a file containing the
description of the 11 selected columns. The goal is to find the
relevant UCDs to describe these columns.
We will open the CSV file with TopCat: File, Load Table, format=CSV.
Then use the button "Display column metadata", and make sure to check "UCD"
in the Display menu.
Once you have assigned metadata (either manually or automatically), you can save your work
in a VOTable; TopCat will do the conversion.
Manual search
Try to find some UCDs, using the UCD builder (http://cdsweb.u-strasbg.fr/UCD/cgi-bin/descr2ucd).
You can copy and paste the UCDs in the column metadata.
Automatic search
For large collections, it is desirable to automate the process of finding UCDs corresponding to
descriptions. We can use the "assign" method of the UCD SOAP Web Services
(http://cdsweb.u-strasbg.fr/cdsws/ucdClient.gml). We will pass each column description to
the service, and it will give back the corresponding (best guess) UCD.
This SOAP Web Service, like the other available methods for UCD manipulation, can be
consumed in a number of ways (Perl, Python, Java). We propose a simple Perl example
for our problem. An example in Java is also available.
Edit the Perl script assign.pl (save and change extension to .pl) to find UCDs corresponding
to the descriptions in the file apj_614.desc.
- You can use hints from http://cdsweb.u-strasbg.fr/cdsws/tucdClient2.gml
- You need to make three changes where CHANGE_ME is written in the source
  - Provide the proper path to the file with the description of the columns
  - Give the path to the WSDL
  - Invoke the assign method of the service
Run your script to see the result, and copy/paste the UCDs in TopCat. A solution is available
here.
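For readers more comfortable with Python, the overall loop of this exercise (read each description, ask for a UCD, collect the answers) can be sketched as follows. This is not the workshop's assign.pl. The function `toy_assign` is a hypothetical offline stand-in based on a tiny keyword table and is not the CDS service; the sample descriptions are invented rather than taken from apj_614.desc, and the sketch assumes one description per line. To use the real service, you would pass a function wrapping its assign method in place of `toy_assign`.

```python
# Stand-in lookup for offline use only; this is NOT the CDS web service,
# which handles free-text descriptions far more generally.
TOY_KEYWORDS = {
    "right ascension": "pos.eq.ra",
    "declination": "pos.eq.dec",
    "magnitude": "phot.mag",
    "velocity": "spect.dopplerVeloc",
}

def toy_assign(description):
    """Return a UCD guess from a keyword match, or an empty string."""
    text = description.lower()
    for keyword, ucd in TOY_KEYWORDS.items():
        if keyword in text:
            return ucd
    return ""

def assign_all(descriptions, assign):
    """Apply `assign` to each non-empty description line."""
    results = []
    for line in descriptions:
        description = line.strip()
        if description:
            results.append((description, assign(description)))
    return results

# Hypothetical descriptions; with a real file, pass open(path) instead.
sample = ["Right ascension (J2000)", "Declination (J2000)", "", "Radial velocity"]
pairs = assign_all(sample, toy_assign)
for description, ucd in pairs:
    print(f"{ucd or '?':<20} {description}")
```

Keeping the lookup behind a plain function argument means the loop stays the same whether the answer comes from a local table or from a remote call.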
Once you have assigned the UCDs, you can save the table as VOTable (XML file). The VOTable will
contain the data and the metadata (UCDs). We provide a
solution VOTable.
Note that some toolkits will assist you in the process of assigning UCDs to a dataset.
Finding proper units
We will use as a test dataset a catalogue of planetary nebulae in M33 (Ciardullo et al., 2004).
You can download a
CSV file of the data, and a file containing the
description of the 11 selected columns. The goal is to find the
relevant units to describe these columns.
In fact, most columns of this catalogue don't have units. We just know that:
- Right ascension and declination are in decimal degrees
- The OIII magnitude is in magnitudes
- The H{alpha}+[NII] flux is in erg per cm2 per second. The catalogue gives the log of the flux; use [ ] around the unit symbols to represent the log
- Velocities are in km/s
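The flux unit is the only one in this list that needs some arithmetic. The short Python sketch below converts erg per cm2 per second to SI, using 1 erg = 1e-7 J and 1 cm2 = 1e-4 m2, and wraps a unit string in [ ] to mark a log quantity. The function names are hypothetical, and the spelling `mW.m-2` is only one plausible way to write the result in the dotted notation; check the accepted symbols against the on-line resources before using it.

```python
import math

ERG_IN_JOULE = 1e-7   # 1 erg = 1e-7 J
CM2_IN_M2 = 1e-4      # 1 cm2 = 1e-4 m2

def erg_per_cm2_per_s_to_w_per_m2(flux_cgs):
    """Convert a flux from erg/cm2/s to W/m2."""
    return flux_cgs * ERG_IN_JOULE / CM2_IN_M2

def log_unit(unit):
    """Wrap a unit string in [ ] to mark the log of that quantity."""
    return f"[{unit}]"

factor = erg_per_cm2_per_s_to_w_per_m2(1.0)
print(f"1 erg/cm2/s = {factor:g} W/m2")

# 1 erg/cm2/s is one thousandth of a W/m2, i.e. one mW/m2.
assert math.isclose(factor, 1e-3)
print(log_unit("mW.m-2"))
```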
We will open the CSV file with TopCat: File, Load Table, format=CSV.
Then use the button "Display column metadata", and make sure to check "Units"
in the Display menu.
Use the on-line resources to find the proper expression for the needed units. The
unit attribute of the column metadata must contain a string of symbols, e.g.
W.m-2.sr-1
Once you have assigned the units, you can save the table as VOTable (XML file). The VOTable will
contain the data and the metadata (units). We provide a
solution VOTable.
Metadata in the Registry
The final step in publishing a service or dataset to the VO is to register it
in a VO Registry. This means providing some metadata for Curation, Coverage,
etc... The metadata elements are defined by the various schemas:
http://www.ivoa.net/xml/
In practice, registries provide simplified forms so that you do not have to write raw XML.
You fill in some elements, and the corresponding XML is stored in the registry.
You can explore some of the resources at:
http://vops1.hq.eso.org:8080/registry/browse.jsp
Note that both human-readable versions and the corresponding XML are available.
When you register a resource, you must provide an AuthorityId.
For this workshop, this metadata element is ivo://org.euro-vo
and points to a specific resource.
Characterization and utypes
This exercise will be introduced at the beginning of the afternoon session by F. Bonnarel.
Slides in pdf
Launch CAMEA (JNLP)
How metadata are used
These simple exercises will demonstrate how various metadata are used in VO tools.
UCDs
We will show two possible uses of UCDs:
- Automated detection of columns
- Use in Aladin filters
Launch Aladin, and load an image of M33 (File, Open, Aladin Images, choose Lw-POSSI.E for example).
Then query SIMBAD (or some VizieR survey), and load the VOTable where you assigned the UCDs as a local
file (or use the solution VOTable).
Now select "Cross-match objects" from the Catalog menu: you can notice that the relevant columns for the
coordinates are automatically selected, even if they have different names.
This is because the UCDs indicate unambiguously the nature of the columns.
Select "Create a filter" from the Catalog menu, and select "Draw circles proportional to object luminosity".
Switching to "Advanced mode" reveals that the filter uses regular expressions on UCDs to indicate which
column has to be used to interpret the expression. One generic filter can then operate on many different
data sources, if UCDs are present.
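The idea behind both uses can be sketched in Python: select a column by matching a pattern against its UCD instead of relying on its name. This is only an illustration of the principle, not Aladin's implementation; the table contents, column names and helper names are hypothetical.

```python
import re

def ucd_pattern_to_regex(pattern):
    """Translate a simple '*' wildcard pattern into an anchored regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts) + r"\Z")

def find_column(columns, pattern):
    """Return the name of the first column whose UCD matches, else None."""
    regex = ucd_pattern_to_regex(pattern)
    for name, ucd in columns.items():
        if regex.match(ucd):
            return name
    return None

# Two hypothetical tables: different column names, same kind of content.
table_a = {
    "RAJ2000": "pos.eq.ra;meta.main",
    "DEJ2000": "pos.eq.dec;meta.main",
    "Vmag": "phot.mag;em.opt.V",
}
table_b = {"alpha": "pos.eq.ra", "delta": "pos.eq.dec"}

for table in (table_a, table_b):
    print(find_column(table, "pos.eq.ra*"), find_column(table, "pos.eq.dec*"))
```

The same pattern finds the coordinate columns in both tables, which is what allows one generic filter or cross-match dialog to work on many data sources.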
Units
We will show the use of units, again using Aladin filters:
Launch Aladin, and load an image of M33 (File, Open, Aladin Images, choose Lw-POSSI.E for example).
Then load the VOTable where you assigned the UCDs as a local
file (or use the solution VOTable).
Select "Create a filter" from the Catalog menu, and switch to advanced mode. Copy and paste
the following before applying (or load this as a local file):
$[spect.dopplerVeloc*]<-1.7e5m/s {draw blue square}
{draw red rhomb}
The filters use a conversion library that is able to interpret units, and here performs
the conversion from m/s to km/s on the fly.
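To make the conversion step concrete, here is a minimal Python sketch of what such a filter has to do: bring the threshold (given in m/s) into the unit of the column (km/s in our catalogue) before comparing. It is an assumption-laden illustration, not Aladin's conversion library; the conversion table, function names and sample velocities are hypothetical.

```python
# Conversion factors to m/s for the two units involved in this exercise.
TO_M_PER_S = {"m/s": 1.0, "km/s": 1000.0}

def convert(value, from_unit, to_unit):
    """Convert a velocity between the units listed in TO_M_PER_S."""
    return value * TO_M_PER_S[from_unit] / TO_M_PER_S[to_unit]

def pick_shape(velocity, column_unit, threshold=-1.7e5, threshold_unit="m/s"):
    """Mimic the two filter rules: blue square below the threshold, red rhomb otherwise."""
    limit = convert(threshold, threshold_unit, column_unit)
    return "blue square" if velocity < limit else "red rhomb"

limit_km_s = convert(-1.7e5, "m/s", "km/s")
print(limit_km_s)

# Hypothetical velocities, expressed in km/s as in the catalogue column.
for velocity in (-200.0, -150.0):
    print(velocity, pick_shape(velocity, "km/s"))
```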
Utypes
Utypes are used in the description of footprints. These can then be provided by SIA servers.
Launch Aladin, and load one image of M33. Then go to the All VO tab in the launch panel.
Open the detailed list, and unselect all. Then simply check the image resource #53 (SIAP Service HST preview images)
and submit.
For each group of results, you can preview the coverage by hovering the mouse over the metadata tree.