Difference: ESOhelpIngestion (1 vs. 2)

Revision 2, 2017-11-24 - EmmanuellePerret

 

ESO public survey remarks

I) ESO@vizier

I-A ) why?

  • For VizieR: ESO datasets are reference datasets (at least the large cats), and the CDS mission is to gather reference datasets in one place to facilitate user access. So storing and serving ESO cats in VizieR belongs to the CDS mission.

  • What’s in it for ESO? => Dissemination of their datasets: fine, but if ESO can do it themselves, then what is the point for ESO of working with CDS?

Added value by CDS:

  • Accurate, homogeneous metadata, associated data
=> allows complex vizier-wide requests
=> improved discoverability of datasets
=> improved impact of data

  • Linking to relevant data sets:
=> Origin data sets
=> Data sets with complementary measurements
=> Simbad links

  • Visualisation tools (e.g. interactive graphs, SED viewer, Aladin)
=> i.e. rich meta-content with fluid navigation (improved user experience)

  • Compatible with many protocols/tools/services (VO, topcat, TAP…)
=> improved discoverability / reusability of data.
=> impact

=> bottom line: authors put their data at CDS to increase the data’s impact. It’s the same for ESO data. It’s all about IMPACT IMPACT IMPACT!

I-B ) How?

Vizier has several channels through which data is selected and ingested:

  • A&A: editor asks author to send data to vizier + CDS selection
  • AAS: AAS staff produce MRTs (machine-readable tables), which we ingest + CDS selection
  • Other journals: data selected by CDS from paper analysis
  • CDS-selected refs: if a cat has not arrived through these pipelines but is deemed important for the discipline or for CDS processing, it can be added at the request of CDS astronomers.
=> Some phase 3 cats already arrive at VizieR through these channels, but not all. So any phase 3 product that has not arrived through these channels is additional work. In the case of large catalogs (>30 million rows), we estimate it takes between 7 and 32 hours of CDS staff time, not counting processing (computer hours). We are willing to put in the work needed to acquire and distribute these reference data sets. BUT NOT AT ALL COSTS: we are operating at constant manpower.

Right now our main concern is that we are operating with conflicting constraints:

  • 100% fidelity constraint: the data come from ESO and are therefore considered trustworthy. => we should store/serve the data exactly as they are.
BUT
  • Our experience at CDS is that data produced by even the most skilled and well-meaning authors require some level of metadata editing, and sometimes data editing/transforming/filtering/correcting.
=> So the goal of this document is to illustrate this with problems encountered with ESO products and the actions taken by CDS to remedy them, in the hope that we find common ground where ESO is satisfied with our level of fidelity to the original data AND VizieR staff have enough freedom to act on the data so as not to disrupt the VizieR process (i.e. ESO cats should not take us more time to process than journal cats).

II) Problems with small cats

Basically the same as for large cats: curation of values if needed, description of observations (which instruments, dates of observations, filters?), flags, measurements, columns to display by default...

+ Where do we get associated data such as spectra (see cat. in prep.: J/A+A/542/A48 => >~18,000 spectra; J/ApJS/223/29)?

=> Linking to the spectra in the ESO archive is technically not possible...

=> + limited access? Would it be possible, and would we have the right to do that?

+ How do we know when there are updates for those catalogs? (see the list of ESO catalogs, painfully updated a few days ago)...

III) Problems with large cats

Problems with ingestion of ESO phase 3 cats

Examples:

1. VMC (II/351) DR4: missing explanations for [JHKs]errb (ERRBITS) values between 0 and 27...

4,171,555 = the largest value for errors on Ksap6, which was kept; but in this catalog magnitudes are not larger than 38 (that's not the case in VHS, where magnitudes have off-the-chart values...)

=> No cut. Values left as they are.

2. VHS (II/352) DR3: very large magnitude values lead to errors when inserted into VizieR

  • No description for the VHS release on www.eso.org/qi ... Those descriptions help!
  • "3.4028235E38" value for Jpmag, e_Jpmag, Ksmag, e_Ksmag, e_Ksap6, and other very large values: see F.-X. diagrams for each column
  • "nan" in SOURCENAME ... ?
  • "0" in [YJHKs]pperrbits columns even when there is no info for the respective bands... Normally "0" should mean: no quality problem...
  • And "-99999999" for [YJHKs]errb and [YJHKs]seq and '-9999' for [YJHKs]MERGEDCLASS, coexisting with null values...
=> Can we transform values? / => Do we have to keep the data as they are? / => To which point can we modify a catalog?
  • As in VMC, columns that always hold the same value (e.g. CUEVENTID: ID of curation for one catalog?) => ignored in VizieR but present in the binary file on Axel.
  • Columns that are always empty or =0 are not ingested. => Are they there for future versions?
  • There are objects classified as noise (MERGEDCLASS=0). => Should we keep everything, or could we filter the list to keep only the MERGEDCLASS!=0 objects? (only 36,890 occurrences in the VMC vs 13,857,646 sources).
=> We could choose to display by default only rows with pperrbits<256, as they said, but there is one such flag per band: can we choose one band in particular? Can we do that over all the datasets? Even for relatively bright stuff? mag 20 or so?
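
To make the "can we transform values?" question concrete, here is a minimal sketch of what mapping the placeholder values listed above to nulls could look like. It assumes that 3.4028235E38 (the largest 32-bit float), "nan", -99999999 and -9999 all mean "no data"; that interpretation is an assumption for illustration, not an agreed ESO/CDS convention, and `to_null` is a hypothetical helper name.

```python
import math

# Placeholder values quoted in the notes above. Treating them as "no data"
# is an assumption made for this illustration only.
FLOAT32_MAX = 3.4028235e38
INT_SENTINELS = {-99999999, -9999}


def to_null(value):
    """Return None for NaN, float32-max-like or sentinel values, else the value itself."""
    if value is None:
        return None
    if isinstance(value, str):
        # Empty strings and the literal text "nan" are treated as missing.
        return None if value.strip().lower() in ("", "nan") else value
    if isinstance(value, float):
        if math.isnan(value) or abs(value) >= FLOAT32_MAX * 0.999:
            return None
    if value in INT_SENTINELS:
        return None
    return value


# Hypothetical row, using column names mentioned in the notes.
row = {"SOURCENAME": "nan", "Jpmag": 3.4028235e38, "Ksmag": 17.25, "Jerrb": -99999999}
clean_row = {name: to_null(value) for name, value in row.items()}
```

Whether such a transformation is acceptable is exactly the open question for ESO; the sketch only shows that it is cheap to apply once the convention is agreed.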

Columns displayed by default: ap3 or pmag? It seems to me it depends on whether the object is a star or a galaxy! - for stars: we want ap3, and ap3 ~ pmag, except that pmag is noisier - for gals: we want pmag, and ap3 differs from pmag by significant amounts… what to do?

=> Both are displayed by default when we have the two... Note: some catalogs (like VVV) have no pmag, which is normal as they concern variables of the Milky Way.

=> Do we have to keep and display all columns? Could ESO choose which ones we have to publish/display by default?

3. KiDS DR2:

- IDs had different formats: the number of decimal digits for the seconds is not always the same. They can be KIDS JHHMMSS.ss+DDMMSS.ss, KIDS JHHMMSS.ss+DDMMSS.s and sometimes KIDS JHHMMSS.s+DDMMSS.ss. Therefore the ID does not always have the same length, and the ID is not aligned throughout the table. While we can in principle deal with this, it can be cumbersome for users, in particular for users doing nomenclature or cross-IDs or finding objects by name. => not the case in DR3...
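
As an illustration of how the mixed ID layouts can be detected before ingestion, the sketch below counts how many names use each combination of decimal digits. The regular expression is written from the three layouts listed above; the function names and the sample names are hypothetical.

```python
import re

# "KIDS J" + HHMMSS with 1-2 decimals + sign + DDMMSS with 1-2 decimals,
# following the three layouts described in the notes.
KIDS_ID = re.compile(r"^KIDS J(\d{6})\.(\d{1,2})([+-])(\d{6})\.(\d{1,2})$")


def id_layout(name):
    """Return (ra_decimals, dec_decimals) for a KiDS name, or None if it does not match."""
    match = KIDS_ID.match(name)
    if match is None:
        return None
    return len(match.group(2)), len(match.group(5))


def layout_counts(names):
    """Count how many names use each (ra_decimals, dec_decimals) layout."""
    counts = {}
    for name in names:
        key = id_layout(name)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up names, one per layout seen in DR2.
sample = [
    "KIDS J000012.34-300012.34",
    "KIDS J000012.34-300012.3",
    "KIDS J000012.3-300012.34",
]
counts = layout_counts(sample)
```

A catalogue with a single layout yields a single key; more than one key signals the alignment problem described above. The sketch deliberately only reports the layouts and does not rewrite the names, since changing designations is a nomenclature decision.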

For both: see the questions in orange on the page listing the ESO catalogues (~10 large cats, not counting the different releases...)

Garching visit (26-27 November 2017)

- Problem of catalogs without publications in peer-reviewed journals (e.g. VHS)
- Formats: formats are often much larger than required (plenty of examples, the latest in VST ATLAS)
- Quality control: what is an error of -350 in magnitudes?
- What to do with phase 3 products that are only an image, or 1 spectrum, or 1 data cube? Such as the K band image of the UDS field… there is no ref…
- What about the fragmented stuff: for example VMC, is it OK to put it all under 1 single cat, even when several refs are given?
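
On the point about formats being larger than required, a small sketch of how the width actually needed by a column can be measured from its values. It assumes fixed-point output with a chosen number of decimals and `None` for nulls; `needed_width` is a hypothetical helper, not part of the VizieR pipeline.

```python
def needed_width(values, decimals):
    """Smallest fixed-point width holding every non-null value with `decimals` decimals."""
    widths = [len(f"{value:.{decimals}f}") for value in values if value is not None]
    return max(widths, default=0)


# Made-up magnitudes: the widest entries need 6 characters with 3 decimals,
# so a much wider declared format would waste space.
width = needed_width([12.345, -0.5, None], 3)
```

Comparing this number with the declared format width shows at a glance which columns are oversized.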

Example:
The big cat has the ref 2011A&A…527A.116C, and many other ESO cats have the same ref, for instance the small RR Lyrae cat from VISTA VMC, so we group them. However, other VISTA VMC cats may have different references, such as Cepheids and eclipsing binaries (MNRAS/424/1807 and J/MNRAS/443/432). It is possible to put several refs in one cat/ReadMe, and that decision should remain in CDS hands; otherwise it disrupts the pipeline too much and creates too much additional work, inefficiently.

storing != displaying on the web != file

Example of a very well-made catalogue: VPHAS+: the formats were already provided, with lowercase column names closer to the VizieR column names, given in the FITS headers.

Visit:

I) presentation vizier:

  • Which verifications we do: e.g. a file with the proportion of null values and min/max, histograms for problematic columns on the WHOLE catalog; description of the survey/release with observations, each parameter and flags; unit/UCD verifications; definition of formats, columns displayed, possible links...
  • Show that it is really not a matter of taking the data and putting them automatically as they are into VizieR via a magic button.
  • Take some examples of problems (VHS, KiDS...) vs. a catalog that is easy to ingest (VPHAS+, with formats given...)
N.B.: Gilles reminds us that for large catalogues, VizieR does not commit to keeping the original data (this is specified in the DSA).

II) Examples of large cats VMC/VHS/VST

(cost estimate in hours, min-max => D: 4-14; E: 2-14; A: 1-4; Total: 7-32)

stats: min, max, %null,
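
A minimal sketch of the per-column statistics mentioned here (min, max, percentage of nulls), assuming a column is available as a plain list with `None` marking nulls. `column_stats` is a hypothetical name; the real checks run on the whole catalogue with dedicated tools.

```python
def column_stats(values):
    """Return min, max and percentage of nulls for one column (None marks a null)."""
    present = [value for value in values if value is not None]
    total = len(values)
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "pct_null": 100.0 * (total - len(present)) / total if total else 0.0,
    }


# Made-up magnitude column with two nulls out of four rows.
stats = column_stats([17.2, None, 19.8, None])
```

An off-the-chart `max` or an unexpectedly high `pct_null` is what flags a column for a closer look with a histogram.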

Files produced by hand: - formats - ReadMe …

III) Problems with large cats

Find an example object with the SED viewer where the plot blows up because the errors are too large. Are the statistics really better if pperrbit < 256? Would it be reasonable to display only pperrbit < 256 in VizieR?
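
One way to answer the pperrbits question with numbers is to compare the typical magnitude error on each side of the threshold, band by band. The sketch below assumes rows are dictionaries with keys of the form `<band>pperrbits` and `e_<band>mag` (modelled on the column names quoted earlier in these notes, but an assumption all the same); the threshold of 256 is the one discussed above.

```python
from statistics import median


def median_error_by_flag(rows, band, threshold=256):
    """Return (median error with flag < threshold, median error with flag >= threshold)."""
    flag_key = f"{band}pperrbits"
    err_key = f"e_{band}mag"
    clean, flagged = [], []
    for row in rows:
        flag, err = row.get(flag_key), row.get(err_key)
        if flag is None or err is None:
            continue  # no measurement in this band
        (clean if flag < threshold else flagged).append(err)
    return (
        median(clean) if clean else None,
        median(flagged) if flagged else None,
    )


# Made-up rows for the Ks band.
rows = [
    {"Kspperrbits": 0, "e_Ksmag": 0.02},
    {"Kspperrbits": 16, "e_Ksmag": 0.03},
    {"Kspperrbits": 64, "e_Ksmag": 0.04},
    {"Kspperrbits": 4096, "e_Ksmag": 0.5},
    {"Kspperrbits": None, "e_Ksmag": None},
]
clean_median, flagged_median = median_error_by_flag(rows, "Ks")
```

Running this per band on a real catalogue would show whether the cut is worth applying by default, and whether a single band can stand in for the others.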

IV) Version problems: e.g. VIDEO - see the list of ESO catalogues: page difficult to maintain

V) Associated data: AMBRE. Interface for retrieving the files (spectra): J/A+A/542/A48, 2016APJS..223…29V. For associated data, we cannot create links to the ESO archive.

misc: web page listing all problems found for each cat

Concerning whether or not they are interested in what we do:

  • If they are interested in having the data at CDS in their entirety, then they need to help and also apply stricter quality control.
  • If they are not ready to help, then CDS needs to retain control over what we ingest and decide what is worth taking. Also, in that case, CDS decides how to process/modify the data if required.
-- EmmanuellePerret - 2017-11-24