Evaluation and Ranking

Two-phase Challenge Submission

In the first phase (validation phase), participants must submit the output of their algorithms to the organizing team as a single compressed zip file, formatted as shown below. The results in the zip file must match the validation images one-to-one and contain the expected number of classes (2 classes for Task01 and 6 classes for Task02). Otherwise, the submission is considered invalid and no score will be generated.

    team_name/
      Task0*_results/
        SegRap_0001.nii.gz
        ......
        SegRap_xxxx.nii.gz

Note: Participants may submit results at most once per week, and a maximum of 5 submissions is allowed during the validation phase, to ensure fair evaluation and prevent overfitting.
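Because an incomplete archive produces no score, it may be worth checking the zip file locally before uploading. The sketch below is one possible check, not an official tool: the function name `check_submission` and the list of expected case IDs are assumptions for illustration. It only compares file names and does not inspect the label values inside each `.nii.gz` file.

```python
import zipfile
from pathlib import PurePosixPath


def check_submission(zip_path, expected_cases):
    """Compare the .nii.gz files in a zip against the expected case IDs.

    Returns (missing, unexpected) as two sorted lists of file names.
    `expected_cases` is a hypothetical list such as ["SegRap_0001", ...].
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Keep only file entries and drop the directory part of each path.
        names = [PurePosixPath(n).name for n in zf.namelist() if not n.endswith("/")]
    found = {n for n in names if n.endswith(".nii.gz")}
    expected = {f"{case}.nii.gz" for case in expected_cases}
    return sorted(expected - found), sorted(found - expected)
```

If both returned lists are empty, every expected case has exactly one matching file name in the archive.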


In the second phase (test phase), participants must prepare and submit a Docker container along with a short paper describing their method. The Docker container must satisfy the memory constraints (less than 24 GB of GPU memory and less than 64 GB of CPU memory) and the execution time constraint (no more than 3 minutes per case).

Evaluation metrics

Two classical medical image segmentation metrics, the Dice Similarity Coefficient (DSC) and the Normalized Surface Dice (NSD), will be used to assess different aspects of the performance of the segmentation methods.
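For reference, the two metrics are commonly defined as below. Here $G$ is the ground-truth segmentation and $P$ the prediction for one class, $S_G$ and $S_P$ are their surfaces, and $B_G^{(\tau)}$ and $B_P^{(\tau)}$ are the border regions within a tolerance $\tau$ of each surface. This notation is an assumption made here for illustration; the text above does not state the tolerance value or the exact implementation used by the organizers.

```latex
\mathrm{DSC}(G, P) = \frac{2\,|G \cap P|}{|G| + |P|}
\qquad
\mathrm{NSD}(G, P) = \frac{|S_G \cap B_P^{(\tau)}| + |S_P \cap B_G^{(\tau)}|}{|S_G| + |S_P|}
```

DSC measures volumetric overlap, while NSD measures how much of the two surfaces agree within the tolerance, so the two metrics capture region and boundary quality respectively.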

Ranking

First, for each cohort (set), we calculate the average DSC and NSD across all patients for each class. Second, each team is ranked on each class-level average DSC and NSD, which gives 2 × 2 rankings for Task01 or 6 × 2 rankings for Task02 (classes × metrics). Third, the rankings for all classes are averaged within each cohort (set). Finally, the rankings for all cohorts (sets) are averaged and then normalized by the number of teams. In addition, we will compute a statistical ranking, in which teams are allowed to tie if there is no significant difference between them.
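The averaging steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the organizers' evaluation code: the nested layout `scores[team][cohort][cls][metric]`, the function names, and giving tied teams their shared average rank are all choices made here. The statistical ranking is not covered.

```python
from statistics import mean


def rank_descending(values):
    """Rank teams by value (higher is better); tied teams share the average rank."""
    ordered = sorted(values, key=values.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of teams tied with the team at position i.
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of the 1-based positions i+1..j+1
        for team in ordered[i:j + 1]:
            ranks[team] = shared
        i = j + 1
    return ranks


def final_scores(scores):
    """scores[team][cohort][cls][metric] is a list of per-patient values."""
    teams = list(scores)
    cohorts = list(scores[teams[0]])
    per_cohort = {t: [] for t in teams}
    for cohort in cohorts:
        collected = {t: [] for t in teams}
        for cls in scores[teams[0]][cohort]:
            for metric in ("DSC", "NSD"):
                # Step 1: average over patients. Step 2: rank teams on that average.
                means = {t: mean(scores[t][cohort][cls][metric]) for t in teams}
                for t, r in rank_descending(means).items():
                    collected[t].append(r)
        # Step 3: average the class-by-metric rankings within the cohort.
        for t in teams:
            per_cohort[t].append(mean(collected[t]))
    # Step 4: average over cohorts, then normalize by the number of teams.
    return {t: mean(per_cohort[t]) / len(teams) for t in teams}
```

With this convention a lower final value is better, and a team that ranks first in every class, metric, and cohort ends up with a value of 1 divided by the number of teams.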


In addition, if a submission is missing results for some test cases, the DSC and NSD of the corresponding classes will both be set to 0 for ranking. For example, if a class is missing from a test case, the average DSC and NSD of that class decrease, and the team's ranking for that class degrades accordingly.
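The effect of a missing result on a class average can be illustrated with a small sketch. The helper name `class_mean_with_missing` and the case IDs in the usage note are hypothetical; the only rule taken from the text above is that an absent result counts as 0.

```python
from statistics import mean


def class_mean_with_missing(case_scores, expected_cases):
    """Average one metric for one class over all expected cases.

    `case_scores` maps case ID -> score; cases absent from it are scored as 0.
    """
    return mean([case_scores.get(case, 0.0) for case in expected_cases])
```

For instance, with scores of 1.0 and 0.5 on two cases and no result for a third, the class average over the three expected cases is 0.5 rather than 0.75.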