Error stitching large sample #43

@FangmingXie

Bug report

Description of the problem

I was trying to stitch a large sample of 20 tiles, each about [1920, 1920, ~2800] pixels. I kept getting a Spark session timeout error at different stages of the stitching pipeline.

For example, below is a case where the error came from the run_retile stage. For the same data, the pipeline would sometimes get through this stage and then fail with the same session timeout error at a later stage, run_stitching.

This error occurs only with the large sample. I have no problem running a sample that is about 10x smaller.
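
For a sense of scale, the raw data volume implied by these numbers can be estimated with a few lines of shell arithmetic. This is only a rough sketch: the 16-bit voxel size is an assumption (the report does not state the bit depth), and the channel count of 3 is inferred from the three inputs (c0, c2, c3) passed to the retile command in the log.

```shell
# Rough size estimate for the sample described in this report.
tiles=20
x=1920; y=1920; z=2800     # approximate tile dimensions from the report
bytes_per_voxel=2          # ASSUMPTION: 16-bit voxels (not stated in the report)
channels=3                 # ASSUMPTION: c0, c2, c3 as passed to the retile command

# bash arithmetic is 64-bit, so these products do not overflow
voxels_per_channel=$(( tiles * x * y * z ))
total_bytes=$(( voxels_per_channel * bytes_per_voxel * channels ))
total_gib=$(( total_bytes / 1024 / 1024 / 1024 ))

echo "voxels per channel: ${voxels_per_channel}"
echo "approx. raw size:   ${total_gib} GiB"
```

Under these assumptions the input is on the order of a terabyte, which may help when judging whether the Spark worker count and memory settings are adequate.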

Log file(s)

Jun-28 10:26:01.924 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'stitching:stitch:run_retile:spark_start_app (1)'

Caused by:
  Process `stitching:stitch:run_retile:spark_start_app (1)` terminated with an error exit status (1)

Command executed:

  echo "Starting the spark driver"

  SESSION_FILE="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId"
  echo "Checking for $SESSION_FILE"
  SLEEP_SECS=10
  MAX_WAIT_SECS=7200
  SECONDS=0

  while ! test -e "$SESSION_FILE"; do
      sleep ${SLEEP_SECS}
      if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
          echo "Waiting for $SESSION_FILE"
          SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
      else
          echo "-------------------------------------------------------------------------------"
          echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE    "
          echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
          echo "-------------------------------------------------------------------------------"
          exit 1
      fi
  done
  
   if ! grep -F -x -q "dcfcb7c0-01b8-4119-90ec-8b3f63ab2c0e" $SESSION_FILE
  then
      echo "------------------------------------------------------------------------------"
      echo "ERROR: session id in $SESSION_FILE does not match current session            "
      echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
      echo "and that you are not running multiple pipelines with the same --spark_work_dir"
      echo "------------------------------------------------------------------------------"
      exit 1
  fi



  export SPARK_ENV_LOADED=
  export SPARK_HOME=/spark
  export PYSPARK_PYTHONPATH_SET=
  export PYTHONPATH="/spark/python"
  export SPARK_LOG_DIR="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1"

  . "/spark/sbin/spark-config.sh"
  . "/spark/bin/load-spark-env.sh"



  SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
  echo "Use Spark IP: $SPARK_LOCAL_IP"

  echo "    /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64     "

  /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=${SPARK_LOCAL_IP}     --conf spark.driver.bindAddress=${SPARK_LOCAL_IP}     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 &> /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log

Command exit status:
  1

Command output:
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64

Command error:
  INFO:    Could not find any nv files on this host!
  INFO:    Converting SIF file to temporary sandbox...
  Starting the spark driver
  Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
  Use Spark IP: 172.16.129.70
      /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit     --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf          --conf spark.driver.host=172.16.129.70     --conf spark.driver.bindAddress=172.16.129.70     --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar  -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
  INFO:    Cleaning up image..
  
Work dir:
  /u/home/f/f7xiesnm/try_multifish/multifish/work/b3/b86ba5188c95fab8d05a827c510a56
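
The executed command redirects both stdout and stderr of spark-submit to retileImages.log (the `&>` at the end), so the Nextflow report above does not contain the Spark error itself. A minimal sketch for pulling the relevant lines out of that log follows; `show_spark_errors` is a hypothetical helper name, and the search pattern is only a guess at what the failure might look like.

```shell
# Sketch: show the tail and any error-like lines of a Spark driver log.
# `show_spark_errors` is an illustrative helper, not part of the pipeline.
show_spark_errors() {
    local log_file="$1"
    if [[ ! -r "$log_file" ]]; then
        echo "Cannot read $log_file" >&2
        return 1
    fi
    echo "== Last 50 lines of $log_file =="
    tail -n 50 "$log_file"
    echo "== Lines mentioning an error, exception, or timeout =="
    # -i: case-insensitive, -n: print line numbers, -E: extended regex
    grep -i -n -E 'error|exception|timed? ?out' "$log_file" || echo "(none found)"
}
```

For this run it could be called as `show_spark_errors /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log`.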

Environment

  • EASI-FISH Pipeline version: latest
  • Nextflow version: 22.10.7
  • Container runtime: Singularity
  • Platform: Local cluster
  • Operating system: Linux

Additional context

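
One detail in the logged wait loop may be worth checking, assuming the task script runs under bash. `SECONDS` is a special bash variable that already counts elapsed seconds on its own, and assigning to it only resets its base. Because the loop also adds `SLEEP_SECS` to it on every iteration, the counter would advance roughly twice as fast as wall-clock time, so the 7200-second limit could be reached after about an hour. In the run shown above the session file was found without waiting, so this is probably not the cause of that particular failure. The sketch below uses a plain counter instead; `wait_for_file` and `elapsed` are illustrative names, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Sketch: wait for a file using an explicit elapsed-time counter.
# `wait_for_file` and `elapsed` are illustrative names, not part of the pipeline.
wait_for_file() {
    local file="$1" max_wait_secs="$2" sleep_secs="$3"
    local elapsed=0   # plain variable; avoids bash's self-incrementing SECONDS
    while ! test -e "$file"; do
        if (( elapsed >= max_wait_secs )); then
            echo "ERROR: Timed out after ${elapsed} seconds while waiting for $file" >&2
            return 1
        fi
        echo "Waiting for $file"
        sleep "$sleep_secs"
        elapsed=$(( elapsed + sleep_secs ))
    done
    return 0
}
```

With the values from the log, the call would look like `wait_for_file "$SESSION_FILE" 7200 10`. The counter tracks only the time spent sleeping, which is close enough to wall-clock time for a timeout of this size.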
