Data Extraction Tool - Howto

Introduction

The Data Extraction Tool is the next step after you have found relevant resources with the Registry Query Tool. It extracts tabular data from those resources and transforms it into a uniform schema: same units, same column names, and so on. In addition, you can filter the sources you want to keep in the output, generate new columns by combining input columns, and define rules to generate unique astronomical source identifiers. You can also choose the coordinate system you want (B1950 or J2000), provided equatorial coordinates are available in the input resource.

First step

The first step is to select the resources that must be processed and to define a uniform schema for the output (column names, units...).

Snapshot

snapshot4.jpg

Resources selection

The first step is to choose which resources must be processed. There are two ways to do so:
  • Load a workspace (it is shared by the two tools; the resources that have been marked as relevant will be automatically loaded into the processing list)
  • Load a file containing the VO resource identifiers, one per line

Features of the resource list

The features for manipulating the list of resources are the same as in the Registry Query Tool. See the corresponding section for more details.

Uniform output schema definition

An ordered list of columns must be defined for the output. This is called the "output schema". A column is defined by four parameters:

  • The name: this parameter is mandatory and describes the name of the output column
  • The unit: this parameter describes the unit for the data of the output column. This means that the tool will always try to convert the input values for this column into this unit.
  • The UCD: this parameter describes the UCD of the output column. The tool will highlight all columns of the input table having this UCD to help the user choose the right one.
  • The format: this parameter describes the decimal format of the output column. The tool will always try to convert the output value for this column to this format.
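
As an illustration only, the four parameters above can be modelled as a small record, with the schema being an ordered list of such records. This is a sketch and not the tool's internal representation: the class name `OutputColumn`, its field names, and the sample units, UCDs and formats are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputColumn:
    """One column of the output schema (hypothetical model, not the tool's own class)."""
    name: str                    # mandatory
    unit: Optional[str] = None   # target unit for the output values
    ucd: Optional[str] = None    # UCD used to highlight matching input columns
    fmt: Optional[str] = None    # decimal format, written as %width.precision

    def __post_init__(self):
        # Only the name is mandatory; the other three fields may stay empty.
        if not self.name.strip():
            raise ValueError("an output column needs a name")


# The output schema is an ordered list of columns (sample values are invented).
schema = [
    OutputColumn("RA", unit="deg", ucd="pos.eq.ra", fmt="%10.5"),
    OutputColumn("DEC", unit="deg", ucd="pos.eq.dec", fmt="%10.5"),
    OutputColumn("Flux", unit="mJy"),
]
```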

The "Output schema" section of the tool allows the user to create and update it:

  • Create: the Name and optionally the UCD, Unit and Format fields of the output column must be filled. The [...] button must be clicked to add this output column to the schema.
  • Update: if the user selects one of the output columns, its information is displayed in the fields. Updating them and clicking on the [...] button updates the selected column.

Construction of the uniformisation form

Clicking on the "Generate the uniformisation form" button at the bottom of the window makes the tool construct the uniformisation form for the resources selected in the list, according to the defined output schema.

Second step

During the second step, one can interact with a form to customize the uniformisation of the resources. Some items are filled in automatically, but the user can provide additional constraints and customizations, such as drop constraints and column generation. The following features are available:

Snapshot

snapshot2.jpg

Columns selection

For each resource and for each column of the output schema, you can select which original column from the input resource you want to map to the target column. This selection is presented as a scrolling list; to help you with the mapping, the original columns that match the UCD are colored in red. If no column of the original resource can be mapped, you can generate a column with the arithmetic form (button Edit16.gif).

A button About16.gif provides access to information about the selected input column (name, UCD, unit and description).

Arithmetic expression

DataExtractionToolSnapshotWiki3.jpg

The user can combine input column values to generate output values. This can be done with the arithmetic form above. Here is the general syntax:

condition_1  {aritExpression_1}
condition_2  {aritExpression_2}
condition_3  {aritExpression_3}
...
condition_n  {aritExpression_n}
{default_aritExpression}

Algorithm:

  • If condition_i is satisfied, the output value is calculated from aritExpression_i and no other condition is tested.
  • If no condition is satisfied, the output value is calculated from default_aritExpression.

Specification of the condition:

  • can contain input column names (Flux must be written ${Flux} for example)
  • can contain the classical logical operators: ||, &&, =, !=, <, >, <=, >=

Specification of the arithmetic expression:

  • can contain input column names (Flux must be written ${Flux} for example)
  • can contain the classical arithmetic operators: +, -, *, /, ^, (, )
  • some mathematical functions are supported: cos, sin, tan, acos, asin, atan, ln, log, abs, deg2rad, rad2deg, sqrt, exp

Note about the input columns: the values are taken "as is" with their original unit, there is no unit management for the moment.

Here is a complete example:

${Flux}>10 || ${Flux}<3 {${FluxDen}*300}
${flux}<5 &&  ${flux}<6 {${FluxDen}*500}
{1}
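
The first-match behaviour of the example above can be sketched in Python. This is only an illustration of the algorithm, not the tool's parser: the function `evaluate`, the use of lambdas for conditions and expressions, and the assumption that `${Flux}` and `${flux}` refer to the same input column are all hypothetical.

```python
def evaluate(rules, default, row):
    """Return the value of the first rule whose condition holds, else the default."""
    for condition, expression in rules:
        if condition(row):
            return expression(row)  # later conditions are not tested
    return default(row)


# The complete example above, transcribed rule by rule.
# Assumption: ${Flux} and ${flux} name the same input column, here "Flux".
rules = [
    (lambda r: r["Flux"] > 10 or r["Flux"] < 3, lambda r: r["FluxDen"] * 300),
    (lambda r: r["Flux"] < 5 and r["Flux"] < 6, lambda r: r["FluxDen"] * 500),
]
default = lambda r: 1

print(evaluate(rules, default, {"Flux": 12, "FluxDen": 2}))  # first rule applies
print(evaluate(rules, default, {"Flux": 4, "FluxDen": 2}))   # second rule applies
print(evaluate(rules, default, {"Flux": 7, "FluxDen": 2}))   # default applies
```

Note that a row with Flux equal to 2 satisfies both conditions, but only the first expression is used because evaluation stops at the first match.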

Decimal format

The format attached to the output schema column defines the decimal format for these values. The general syntax is:

%nb_digits1.nb_digits2

  • nb_digits1: the total number of characters, including the decimal separator
  • nb_digits2: the number of digits after the decimal separator

If the formatted number is shorter than nb_digits1, spaces are added at the beginning.
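
A minimal Python sketch of this padding behaviour is given below. It assumes that the format behaves like a printf-style fixed-point width and precision; the helper `apply_format` is hypothetical, and Python rounds the last digit, which may differ from what the tool does.

```python
def apply_format(value, fmt):
    """Format value with a '%width.precision' pattern, padding on the left with spaces."""
    width_text, precision_text = fmt.lstrip("%").split(".")
    width, precision = int(width_text), int(precision_text)
    # Right-align in `width` characters; the decimal separator counts in the width.
    return f"{value:>{width}.{precision}f}"


print(repr(apply_format(3.14159, "%8.3")))  # '   3.142'
print(repr(apply_format(12.5, "%6.2")))     # ' 12.50'
```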

Astronomical identifiers management

The output columns corresponding to astronomical identifiers can be generated with a special pattern. It can be useful if they have not been defined in the input table or if there are many duplicates. The general syntax is:

acronym [B|J]RA_pattern[+|-]DEC_pattern

  • acronym is the acronym for the catalog
  • B|J is the equinox for output RA and DEC coordinates
  • RA_pattern and DEC_pattern are the patterns for the RA and DEC coordinates

Here is an example of such a pattern:

B3 JHHMMSS.SS+DDMMSS

The RA and DEC output values are built from the input coordinate values. They are first converted to the correct equinox and then truncated to fit the pattern.

For example, the previous pattern can generate the following value:

B3 J223614.05+684502
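
The truncation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the functions `truncated_sexagesimal` and `make_identifier` are hypothetical, the coordinates are assumed to be in decimal degrees and already converted to the target equinox, and only the fixed pattern HHMMSS.SS+DDMMSS is handled (there is no general pattern parser).

```python
import math


def truncated_sexagesimal(value, decimals):
    """Split a non-negative value into (whole, minutes, seconds),
    truncating (not rounding) the seconds to `decimals` places."""
    scale = 10 ** decimals
    ticks = math.floor(value * 3600 * scale)  # count of the smallest kept unit
    whole, ticks = divmod(ticks, 3600 * scale)
    minutes, ticks = divmod(ticks, 60 * scale)
    return whole, minutes, ticks / scale


def make_identifier(acronym, equinox, ra_deg, dec_deg):
    """Build an identifier following the fixed pattern HHMMSS.SS+DDMMSS."""
    h, m, s = truncated_sexagesimal(ra_deg / 15.0, 2)    # RA: degrees to hours
    d, dm, ds = truncated_sexagesimal(abs(dec_deg), 0)   # DEC: whole arcseconds
    sign = "-" if dec_deg < 0 else "+"
    return (f"{acronym} {equinox}{h:02d}{m:02d}{s:05.2f}"
            f"{sign}{d:02d}{dm:02d}{int(ds):02d}")


# Invented input coordinates chosen to reproduce the sample value above.
print(make_identifier("B3", "J", 339.058571, 68.750639))  # B3 J223614.05+684502
```

Working in integer multiples of the smallest kept digit avoids a carry problem such as 59.999 seconds being printed as 60.00.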

If the identifiers are already present in the input resource without an acronym, one can use this syntax:

acronym *

This means that "acronym" and the value of the selected column are concatenated to generate the output identifier values (do not forget to select an identifier column).

Unit conversion

A unit can be attached to each column of the output schema. The tool will always try to convert input values into the output unit. If the conversion is not possible, an alert is displayed at the end of the processing so that the user can react (for example, by changing the unit). Note that in this case the original values are taken as is, without unit conversion.
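
The fallback behaviour described above can be sketched in Python. This is only an illustration: the `FACTORS` table, the unit names and the function `convert_column` are hypothetical, and the tool's actual conversion rules are not documented here.

```python
# Hypothetical multiplicative factors; the tool's real conversion table is not known.
FACTORS = {
    ("Jy", "mJy"): 1000.0,
    ("mJy", "Jy"): 0.001,
    ("arcmin", "deg"): 1.0 / 60.0,
}


def convert_column(values, source_unit, target_unit, alerts):
    """Convert values to target_unit, or keep them as is and record an alert."""
    if source_unit == target_unit:
        return list(values)
    factor = FACTORS.get((source_unit, target_unit))
    if factor is None:
        # Conversion not possible: report it and keep the original values.
        alerts.append((source_unit, target_unit))
        return list(values)
    return [v * factor for v in values]


alerts = []
print(convert_column([1.5, 2.0], "Jy", "mJy", alerts))  # converted values
print(convert_column([3.0], "mag", "mJy", alerts))      # unchanged values
print(alerts)                                           # one failed conversion
```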

Coordinate equinox selection

One can choose to express the output coordinates in the J2000 or B1950 equinox. The tool will always try to convert the input coordinates into the chosen equinox. This option is found in the preferences window, so it applies to all the resources in the uniformisation form.

Sources filter

DataExtractionToolSnapshotWiki4.jpg

For each resource, the user can define a logical condition for filtering the output rows. Each row that satisfies the condition is not written to the output. For example:

${flux}>500 || ${flux}<200

means: "if the flux column value is greater than 500 or lower than 200 for one row in the input table, this row won't be written in the output table". The columns used in the expression are the input columns and so are expressed in their original units.
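
The same filter can be sketched in Python to make the "drop on match" semantics explicit. The function `keep_rows` and the sample rows are hypothetical; the point is only that a row is removed when the condition is true, not kept.

```python
def keep_rows(rows, drop_condition):
    """Return the rows for which the drop condition is NOT satisfied."""
    return [row for row in rows if not drop_condition(row)]


# Invented sample rows with a single "flux" input column.
rows = [{"flux": 150}, {"flux": 350}, {"flux": 600}]

# Equivalent of: ${flux}>500 || ${flux}<200
kept = keep_rows(rows, lambda r: r["flux"] > 500 or r["flux"] < 200)
print(kept)  # only the row with flux 350 remains
```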

Resource selection

The user can choose which resources to process by selecting or deselecting them with the checkbox on the left of the form.

Report window

At the end of the processing, a report is shown. It contains three kinds of information:

  • miscellaneous errors
  • unit conversion errors
  • duplicate identifiers

snapshot3.jpg

Miscellaneous errors

Miscellaneous errors, such as a missing column definition or coordinate columns that were not found, are reported in this section. For each error, the following information is available:
  • resource identifier
  • a small text describing the error.

Note that if such an error occurs for a resource, that resource has most likely not been processed.

Unit conversion errors

If a unit conversion could not be done it will be written in this part of the report. For each failure the report contains:

  • the catalogue
  • the column
  • the original unit
  • the target unit

Duplicate identifiers

For each resource that has been processed, one can see information about the duplicate identifiers and act on them. The following information and interactions are available:

  • The number of duplicate identifiers found
  • show details button: to see the list of duplicate identifiers
  • resolve button: to resolve the duplicate identifiers (this is done in memory only and is not applied to the output resource)
  • write button: to write back the resolved identifiers to the output resource

To resolve the duplicate identifiers, a distinct letter is appended to each occurrence of the same identifier:

B3 J223103+120532 -> B3 J223103+120532A
B3 J223103+120532 -> B3 J223103+120532B
B3 J223103+120532 -> B3 J223103+120532C
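
The suffixing shown above can be sketched in Python. This is an illustration, not the tool's code: the function `resolve_duplicates` is hypothetical, identifiers that occur only once are assumed to stay unchanged, and the source does not say what happens beyond 26 duplicates, so this sketch refuses that case.

```python
import string
from collections import Counter


def resolve_duplicates(identifiers):
    """Append A, B, C... to each occurrence of an identifier that appears more than once."""
    counts = Counter(identifiers)
    seen = Counter()
    resolved = []
    for ident in identifiers:
        if counts[ident] == 1:
            resolved.append(ident)  # assumption: unique identifiers are left as is
            continue
        if seen[ident] >= len(string.ascii_uppercase):
            raise ValueError(f"more than 26 duplicates of {ident!r}")
        resolved.append(ident + string.ascii_uppercase[seen[ident]])
        seen[ident] += 1
    return resolved


ids = ["B3 J223103+120532"] * 3 + ["B3 J223614+684502"]
for before, after in zip(ids, resolve_duplicates(ids)):
    print(before, "->", after)
```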

User preferences

Some preferences for the tool can be set in the preferences window. To open it, click on the "Preferences" item of the option menu. The preferences are divided into 4 parts:

  • Registry: some options about the registry where the tool searches for VO resources metadata can be set here
  • Output data: some options about how the output data is generated can be set here
  • Characterization: some options about the registry where the tool searches for characterization data can be set here
  • Misc: some misc options can be set here

Registry

toolSS4.jpg

  • URI: the URI of the registry must be written here
  • Collection: the collection where to find resources metadata must be written here
  • Login: a valid login for the registry must be written here
  • Password: a valid password for the previous login must be written here

Output data

toolSS5.jpg

  • Empty values: the tool can automatically replace empty input values by a string or value that can be set here
  • Coordinates system: the equinox of the output coordinates can be set here (choice between B1950 and J2000)
  • Formats: the output formats can be set here. The tool will create one resource per selected format (ASCII and VOTable are supported)
  • ASCII header: it is possible to define a header for each output ASCII table by setting specific parameters to be written to it. Only the number of sources in the output resource is available for the moment.

Characterization

toolSS6.jpg

  • URI: the URI of the characterization registry must be written here
  • Collection: the collection where to find characterization resources must be written here
  • Login: a valid login for the characterization registry must be written here
  • Password: a valid password for the previous login must be written here

Misc

toolSS7.jpg

  • XMLDB driver: a path to the Java XMLDB driver must be written here. In most cases the provided default value won't need any modification. The tool provides the Java drivers for the eXist and Xindice XMLDB databases.
  • Verbose database: if checked, the tool will write all actions concerning the database to the standard output

Technical requirements

  • A Java virtual machine (tests have been done with the 1.4 and 1.5 versions)
  • A running registry to get VO resources metadata
    • The registry must be compatible with the XMLDB API
    • The user must have access rights to the registry

-- BriceGassmann - 24 Oct 2006

Topic revision: r29 - 2007-03-01 - BriceGassmann