15 Septembre 2016

Utilisation d’un programme en Scala pour faire le crossmatch

Script utilisé

    #!/bin/bash

    jar_path=target
    jar_name=Xmatch-2.0.jar

    hdfs_namenode=hdfs://hdfs-namenode:8020
    hdfs_input=/data/xmatch
    hdfs_output=/output
    executor_config="--executor-cores 4 --executor-memory 13g"
    execution_options="--conf spark.sql.shuffle.partitions=12"

    hdfs dfs -fs ${hdfs_namenode} -rm -r -f -skipTrash ${hdfs_output}/xmatchRes_3.rdd
    time spark-submit --class cds.xmatch.spark.CrossMatch3 \
        --driver-memory 2G \
        ${execution_options} \
        ${executor_config} \
        ${jar_path}/${jar_name} \
        2MASS ${hdfs_namenode}/${hdfs_input}/tmass.l2.i0.csv \
        SDSS9 ${hdfs_namenode}/${hdfs_input}/sdss9.l2.i0.csv \
        4096 5 \
        ${hdfs_namenode}/${hdfs_output}/xmatchRes_3.rdd
    hdfs dfs -fs ${hdfs_namenode} -du -s -h ${hdfs_output}/xmatchRes_3.rdd

Modification des profils dans le fichier CrossMatch3.scala

    val profiles = Map(
        "2MASS" -> Profile(
        "RAJ2000",
        "DEJ2000",
        "NULL",
        StructType(Array(
            StructField("2MASS", StringType, true),
            StructField("RAJ2000", DoubleType, true),
            StructField("DEJ2000", DoubleType, true),
            StructField("errHalfMaj", FloatType, true),
            StructField("errHalfMin", FloatType, true),
            StructField("errPosAng", DoubleType, true),
            StructField("Jmag", FloatType, true),
            StructField("Hmag", FloatType, true),
            StructField("Kmag", FloatType, true),
            StructField("e_Jmag", FloatType, true),
            StructField("e_Hmag", FloatType, true),
            StructField("e_Kmag", FloatType, true),
            StructField("Qfl", StringType, true),
            StructField("Rfl", StringType, true),
            StructField("X", ByteType, true),
            StructField("MeasureJD", DoubleType, true)))
        ),
        "SDSS9" -> Profile(
        "RAdeg",
        "DEdeg",
        "NULL",
        StructType(Array(
            StructField("SDSS9", StringType, true),
            StructField("RAdeg", DoubleType, true),
            StructField("DEdeg", DoubleType, true),
            StructField("errHalfMaj9", FloatType, true),
            StructField("errHalfMin9", FloatType, true),
            StructField("errPosAng9", DoubleType, true),
            StructField("umag", FloatType, true),
            StructField("e_umag", FloatType, true),
            StructField("gmag", FloatType, true),
            StructField("e_gmag", FloatType, true),
            StructField("rmag", FloatType, true),
            StructField("e_rmag", FloatType, true),
            StructField("imag", FloatType, true),
            StructField("e_imag", FloatType, true),
            StructField("zmag", FloatType, true),
            StructField("e_zmag", FloatType, true),
            StructField("objID", LongType, true),
            StructField("cl", FloatType, true),
            StructField("q_mode", FloatType, true),
            StructField("flags", FloatType, true),
            StructField("Q", IntegerType, true),
            StructField("Obs_Date", DoubleType, true)))
        )
    )

Temps d’execution

Programme Scala (parsing intégré)

10 partitions
real    3m58.033s
user    3m42.808s
sys     0m13.468s
12 partitions
real    3m56.072s
user    3m33.980s
sys     0m13.728s

Programme Java (parsing fait avant)

real    2m9.896s
user    0m10.916s
sys     0m0.568s