Data Extraction Tool - Howto

Introduction

The Data Extraction Tool is the next step once you have found relevant resources with the Registry Query Tool. It helps you extract tabular data from those resources and transform it into a uniform schema (same units, same column names...). In addition, you can filter the sources to keep in the output, generate new columns by combining input columns, and define rules to generate unique astronomical source identifiers. You can also choose the coordinate system (B1950 or J2000), provided equatorial coordinates are available in the input resource.

First step

The first step is to select the resources that must be processed and to define a uniform schema for the output (column names, units...).

Snapshot

Resources selection

Start by choosing which resources must be processed. There are two ways to do so:

Load a workspace (it is shared by the two tools; the resources that have been marked as relevant will be automatically loaded in the processing list)
Load from a file containing the VO resource identifiers, one per line
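
As an illustration of the second option, here is a minimal Python sketch of reading such a file. The function name and the choice to skip blank lines are assumptions made for the example; the tool's own loader may behave differently.

```python
from pathlib import Path


def load_resource_identifiers(path):
    """Return the identifiers found in the file, one per line (hypothetical helper)."""
    identifiers = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        identifier = line.strip()
        if identifier:  # assumption: blank lines are ignored
            identifiers.append(identifier)
    return identifiers
```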

Features of the resource list

The features for manipulating the list of resources are the same as in the Registry Query Tool. See the corresponding section of its documentation for more details.

Uniform output schema definition

An ordered list of columns must be defined for the output. This is called the "output schema". A column is defined by four parameters:

The name: this parameter is mandatory and describes the name of the output column
The unit: this parameter describes the unit for the data of the output column. This means that the tool will always try to convert the input values for this column into this unit.
The UCD: this parameter describes the UCD of the output column. The tool will highlight all columns of the input table having this UCD to help the user choose the right one.
The format: this parameter describes the decimal format of the output column. The tool will always try to convert the output value for this column to this format.
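
To make the four parameters concrete, the sketch below models an output schema as an ordered list of column definitions. The class name, field names and sample values are illustrative assumptions, not the tool's internal representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputColumn:
    """One column of the output schema (hypothetical representation)."""
    name: str                   # mandatory
    unit: Optional[str] = None  # target unit for input values
    ucd: Optional[str] = None   # UCD used to highlight matching input columns
    fmt: Optional[str] = None   # decimal format, for example "%10.5"


# The output schema is an ordered list of columns (sample values only).
schema = [
    OutputColumn(name="RA", unit="deg", ucd="pos.eq.ra", fmt="%10.5"),
    OutputColumn(name="Flux", unit="mJy"),
]
```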

The "Output schema" section of the tool allows the user to create and update it:

Create: the Name field and, optionally, the UCD, Unit and Format fields of the output column must be filled in. The [...] button must be clicked to add this output column to the schema.
Update: if the user selects one of the output columns, its information will be displayed in the fields. Updating them and clicking on the [...] button will update the selected column.

Construction of the uniformisation form

Clicking on the "Generate the uniformisation form" button at the bottom of the window makes the tool construct the uniformisation form for the resources selected in the list, according to the defined output schema.

Second step

During the second step you interact with a form to customize the uniformisation of the resources. Some items will be filled in automatically, but the user can provide additional constraints and customizations (drop constraints, column generation...). The following features are available:

Snapshot

Columns selection

For each resource and each column of the output schema, you can select which column of the input resource is mapped to the target column. The selection is made in a scrolling list; to help with the mapping, the input columns whose UCD matches are colored in red. If no column of the input resource can be mapped, you can generate a column with the arithmetic form, which is opened with a button.

A button provides access to information about the selected input column (name, UCD, unit and description).

Arithmetic expression

The user can combine input column values to generate output values. This can be done with the arithmetic form above. Here is the general syntax:

condition_1  {aritExpression_1}
condition_2  {aritExpression_2}
condition_3  {aritExpression_3}
...
condition_n  {aritExpression_n}
{default_aritExpression}

Algorithm:

The conditions are tested in order. If condition_i is satisfied, the output value is calculated from aritExpression_i and no other condition is tested.
If no condition is satisfied, the output value is calculated from default_aritExpression.

Specification of the condition:

can contain input column names (Flux must be written ${Flux} for example)
can contain the usual logical and comparison operators:

||, &&, =, !=, <, >, <=, >=

Specification of the arithmetic expression:

can contain input column names (Flux must be written ${Flux} for example)
can contain the classical arithmetic operators: +, -, *, /, ^, (, )
some mathematical functions are supported: cos, sin, tan, acos, asin, atan, ln, log, abs, deg2rad, rad2deg, sqrt, exp

Note about the input columns: the values are taken "as is" with their original unit, there is no unit management for the moment.

Here is a complete example:

${Flux}>10 || ${Flux}<3 {${FluxDen}*300}
${flux}<5 &&  ${flux}<6 {${FluxDen}*500}
{1}
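
The first-match behaviour can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's parser: it translates each rule into a Python expression and evaluates it with eval (unsuitable for untrusted input), it treats column names as case-sensitive, and it takes "ln" as the natural logarithm and "log" as base 10.

```python
import math
import re

# Functions from the specification mapped to Python equivalents.
# Assumption: "ln" is the natural logarithm and "log" is base 10.
FUNCTIONS = {
    "cos": math.cos, "sin": math.sin, "tan": math.tan,
    "acos": math.acos, "asin": math.asin, "atan": math.atan,
    "ln": math.log, "log": math.log10, "abs": abs,
    "deg2rad": math.radians, "rad2deg": math.degrees,
    "sqrt": math.sqrt, "exp": math.exp,
}

# "condition {expression}"; the opening brace of "${name}" is not a delimiter.
RULE = re.compile(r"^(?P<cond>.*?)(?<!\$)\{(?P<expr>.*)\}\s*$")


def to_python(text, row):
    """Translate a condition or an expression into a Python expression string."""
    text = re.sub(r"\$\{(\w+)\}",
                  lambda m: f"({float(row[m.group(1)])!r})", text)
    text = text.replace("||", " or ").replace("&&", " and ").replace("^", "**")
    # A lone "=" means equality; "!=", "<=" and ">=" are left untouched.
    return re.sub(r"(?<![!<>=])=(?!=)", "==", text)


def evaluate(rules, row):
    """Return the value of the first rule whose condition holds for the row."""
    namespace = {"__builtins__": {}, **FUNCTIONS}
    for line in rules.strip().splitlines():
        match = RULE.match(line.strip())
        if match is None:
            raise ValueError(f"cannot parse rule: {line!r}")
        cond, expr = match.group("cond").strip(), match.group("expr")
        # An empty condition marks the default expression.
        if not cond or eval(to_python(cond, row), namespace):
            return eval(to_python(expr, row), namespace)
    return None  # no condition held and no default was given


RULES = """
${Flux}>10 || ${Flux}<3 {${FluxDen}*300}
${flux}<5 &&  ${flux}<6 {${FluxDen}*500}
{1}
"""

row = {"Flux": 12.0, "flux": 12.0, "FluxDen": 2.0}
print(evaluate(RULES, row))  # first condition holds: 2.0 * 300
```

With this row the first condition holds, so the second rule and the default are never evaluated.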

Decimal format

The format attached to the output schema column defines the decimal format for these values. The general syntax is:

%nb_digits1.nb_digits2

nb_digits1: the total number of characters, including the decimal separator
nb_digits2: the number of digits after the decimal separator

If the formatted number is shorter than the total width, spaces are added at the beginning.
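
As a rough illustration, Python's fixed-point formatting behaves in a similar way. The helper below is hypothetical; note that Python rounds the last digit, which is an assumption about the tool's behaviour that this page does not confirm.

```python
def apply_decimal_format(value, fmt):
    """Format value with a "%width.decimals" pattern such as "%8.3"."""
    width, decimals = fmt.lstrip("%").split(".")
    # Python pads on the left with spaces when the text is shorter than width.
    return f"{value:{int(width)}.{int(decimals)}f}"


print(repr(apply_decimal_format(3.14159, "%8.3")))  # '   3.142'
```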

Astronomical identifiers management

The output columns corresponding to astronomical identifiers can be generated with a special pattern. It can be useful if they have not been defined in the input table or if there are many duplicates. The general syntax is:

acronym [B|J]RA_pattern[+|-]DEC_pattern

acronym is the acronym for the catalog
B|J is the equinox for output RA and DEC coordinates
RA_pattern and DEC_pattern are the patterns for the RA and DEC coordinates

Here is an example of such a pattern:

B3 JHHMMSS.SS+DDMMSS

The RA and DEC output values are built from the input coordinate values. They are first converted to the requested equinox and then truncated to fit the pattern.

For example, the previous pattern can generate a value such as:

B3 J223614.05+684502

If the identifiers are already present in the input resource without acronym one can use this syntax:

acronym *

This means that "acronym" and the value of the selected column are concatenated to generate the output identifier values (don't forget to select an identifier column!).
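
The sketch below shows how an identifier could be built for the specific pattern HHMMSS.SS+DDMMSS, by truncation. It assumes the coordinates are given in decimal degrees and are already expressed in the chosen equinox; the function names, the hard-coded pattern and the space used in the "acronym *" form are assumptions made for illustration.

```python
def make_identifier(acronym, equinox, ra_deg, dec_deg):
    """Build an identifier for the pattern "acronym [B|J]HHMMSS.SS+DDMMSS".

    Coordinates are decimal degrees, already in the chosen equinox.
    Values are truncated, not rounded.
    """
    # Right ascension: degrees -> hundredths of a second of time.
    ra_cs = int(ra_deg / 15.0 * 3600.0 * 100.0)
    ra_s, cs = divmod(ra_cs, 100)
    ra_m, s = divmod(ra_s, 60)
    h, m = divmod(ra_m, 60)

    # Declination: degrees -> whole arcseconds, with the sign kept apart.
    sign = "-" if dec_deg < 0 else "+"
    dec_s = int(abs(dec_deg) * 3600.0)
    dec_m, ds = divmod(dec_s, 60)
    dd, dm = divmod(dec_m, 60)

    return (f"{acronym} {equinox}{h:02d}{m:02d}{s:02d}.{cs:02d}"
            f"{sign}{dd:02d}{dm:02d}{ds:02d}")


def prefix_identifier(acronym, existing_id):
    """Handle the "acronym *" form by prepending the acronym (space assumed)."""
    return f"{acronym} {existing_id}"


print(make_identifier("B3", "J", 339.0586, 68.7507))
```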

Unit conversion

A unit can be attached to each column of the output schema. The tool will always try to convert input values into the output unit. If the conversion is not possible, an alert is displayed at the end of the processing so that the user can react (by changing the unit, for example). Note that in this case the original values are taken as is, without unit conversion.
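
The fallback behaviour described above can be sketched as follows. The conversion table, the function name and the shape of the alert are hypothetical; only the logic (convert when possible, otherwise keep the values and report) comes from the description.

```python
# Hypothetical conversion table: (from_unit, to_unit) -> multiplicative factor.
FACTORS = {
    ("mJy", "Jy"): 1e-3,
    ("Jy", "mJy"): 1e3,
    ("arcmin", "deg"): 1.0 / 60.0,
}


def convert_column(values, from_unit, to_unit, alerts):
    """Convert values when a factor is known; otherwise keep them and record an alert."""
    if from_unit == to_unit:
        return list(values)
    factor = FACTORS.get((from_unit, to_unit))
    if factor is None:
        alerts.append((from_unit, to_unit))  # to be reported after processing
        return list(values)                  # original values taken as is
    return [value * factor for value in values]
```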

Coordinate equinox selection

The output coordinates can be expressed in the J2000 or B1950 equinox. The tool will always try to convert the input coordinates into the chosen equinox. This option is found in the preferences window, so it applies to all the resources in the uniformisation form.

Sources filter

For each resource the user can define a logical condition for filtering the output rows. Each row that satisfies the condition is dropped, that is, it won't be written to the output. For example:

${flux}>500 || ${flux}<200

means: "if the flux column value is greater than 500 or lower than 200 for one row in the input table, this row won't be written in the output table". The columns used in the expression are the input columns and so are expressed in their original units.
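
Because the condition selects the rows to drop rather than the rows to keep, it is easy to get it the wrong way round. A minimal Python sketch of the same logic, with hypothetical names and sample rows:

```python
def keep_rows(rows, drop_condition):
    """Return the rows for which drop_condition is False."""
    return [row for row in rows if not drop_condition(row)]


rows = [{"flux": 150.0}, {"flux": 350.0}, {"flux": 600.0}]
# Python equivalent of: ${flux}>500 || ${flux}<200
kept = keep_rows(rows, lambda row: row["flux"] > 500 or row["flux"] < 200)
print(kept)  # [{'flux': 350.0}]
```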

Resource selection

The user can choose the resources to process by selecting or de-selecting them with the checkbox on the left of the form.

Report window

At the end of the processing, a report is shown. It contains three kinds of information:

miscellaneous errors
unit conversion errors
duplicate identifiers

Miscellaneous errors

Miscellaneous errors, such as a missing column definition or coordinate columns that were not found, are reported in this section. For each error the following information is available:

resource identifier
a small text describing the error.

Note that if such an error occurs for a resource, the resource has most likely not been processed.

Unit conversion errors

If a unit conversion could not be done it will be written in this part of the report. For each failure the report contains:

the catalogue
the column
the original unit
the target unit

Duplicate identifiers

For each resource that has been processed, the report shows information about the duplicate identifiers and lets the user act on them. The following information and interactions are available:

The number of duplicate identifiers found
show details button: to see the list of duplicate identifiers
resolve button: to resolve the duplicate identifiers (this is done in memory only and is not yet applied to the output resource)
write button: to write back the resolved identifiers to the output resource

To resolve the duplicate identifiers, a letter is appended to each occurrence of a duplicated identifier, as in this example:

B3 J223103+120532 -> B3 J223103+120532A
B3 J223103+120532 -> B3 J223103+120532B
B3 J223103+120532 -> B3 J223103+120532C
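
The suffixing shown above can be sketched in Python as follows. The function name is hypothetical, and two points are assumptions because this page does not document them: identifiers that occur only once are left unchanged, and more than 26 occurrences of the same identifier raise an error.

```python
from collections import Counter
from string import ascii_uppercase


def resolve_duplicates(identifiers):
    """Append A, B, C... to every occurrence of a duplicated identifier."""
    counts = Counter(identifiers)
    seen = Counter()
    resolved = []
    for identifier in identifiers:
        if counts[identifier] == 1:
            resolved.append(identifier)  # assumption: unique identifiers unchanged
            continue
        index = seen[identifier]
        if index >= len(ascii_uppercase):
            raise ValueError("more than 26 duplicates: behaviour not documented")
        resolved.append(identifier + ascii_uppercase[index])
        seen[identifier] += 1
    return resolved
```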

User preferences

Some preferences for the tool can be set in the preferences window. To open it, click on the "Preferences" item of the option menu. The preferences are divided into four parts:

Registry: some options about the registry where the tool searches for VO resources metadata can be set here
Output data: some options about how the output data is generated can be set here
Characterization: some options about the registry where the tool searches for characterization data can be set here
Misc: some misc options can be set here

Registry

URI: the URI of the registry must be written here
Collection: the collection where to find resources metadata must be written here
Login: a valid login for the registry must be written here
Password: a valid password for the previous login must be written here

Output data

Empty values: the tool can automatically replace empty input values by a string or value that can be set here
Coordinate system: the equinox of the output coordinates can be set here (choice between 1950 and 2000)
Formats: the output formats can be set here. The tool will create one resource per selected format (ASCII and VOTable are supported)
ASCII header: it is possible to define a header for each output ASCII table by setting specific parameters to be written to it. Only the number of sources in the output resource is available for the moment.

Characterization

URI: the URI of the characterization registry must be written here
Collection: the collection where to find characterization resources must be written here
Login: a valid login for the characterization registry must be written here
Password: a valid password for the previous login must be written here

Misc

XMLDB driver: the path to the Java XMLDB driver must be written here. In most cases the provided default value won't need any modification. The tool provides the Java drivers for the eXist and Xindice XMLDB databases.
Verbose database: if checked, the tool will write all actions concerning the database to the standard output

Technical requirements

A Java virtual machine (tests have been done with versions 1.4 and 1.5)
A running registry to get VO resources metadata
- The registry must be compatible with the XMLDB API
- The user must have access rights to the registry

-- BriceGassmann - 24 Oct 2006


Topic revision: r29 - 2007-03-01 - BriceGassmann

Centre de Données astronomiques de Strasbourg
