Empirical Evaluation of the IFRP-T2P Model Using the KITTI360Pose Dataset

15 Jul 2025

Abstract and 1. Introduction

  2. Related Work

  3. Method

    3.1 Overview of Our Method

    3.2 Coarse Text-cell Retrieval

    3.3 Fine Position Estimation

    3.4 Training Objectives

  4. Experiments

    4.1 Dataset Description and 4.2 Implementation Details

    4.3 Evaluation Criteria and 4.4 Results

  5. Performance Analysis

    5.1 Ablation Study

    5.2 Qualitative Analysis

    5.3 Text Embedding Analysis

  6. Conclusion and References

Supplementary Material

  1. Details of KITTI360Pose Dataset
  2. More Experiments on the Instance Query Extractor
  3. Text-Cell Embedding Space Analysis
  4. More Visualization Results
  5. Point Cloud Robustness Analysis


4 EXPERIMENTS

4.1 Dataset Description

We evaluate our IFRP-T2P model on the KITTI360Pose dataset [21]. This dataset encompasses 3D point cloud scenes from nine urban areas, spanning a city-scale space of 15.51 square kilometers and consisting of 43,381 paired descriptions and positions. We use five areas for training, one for validation, and the remaining three for testing. Each 3D cell in this dataset is a cube measuring 30 meters on each side, sampled with a 10-meter stride between adjacent cells. More details are provided in the supplementary material.
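The overlapping-cell layout above (30 m cubes sampled every 10 m) can be illustrated with a minimal sketch; the helper name, the one-axis simplification, and the 100 m example extent are illustrative assumptions, not details from the dataset:

```python
# Hypothetical sketch: enumerating overlapping 30 m cells sampled with a
# 10 m stride along a single axis of an urban area.
CELL_SIZE = 30.0  # metres per cube side
STRIDE = 10.0     # metres between consecutive cell origins

def cell_origins(extent: float) -> list[float]:
    """Return 1-D origins of every cell that fits fully within `extent` metres."""
    origins = []
    x = 0.0
    while x + CELL_SIZE <= extent:
        origins.append(x)
        x += STRIDE
    return origins

# A 100 m stretch yields 8 overlapping cells with origins 0, 10, ..., 70.
print(cell_origins(100.0))
```

Because the stride (10 m) is a third of the cell size (30 m), each point is covered by up to three cells per axis, which is what makes the coarse retrieval stage robust to positions near cell boundaries.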

4.2 Implementation Details

Our approach begins with pre-training the instance query extractor via the instance segmentation objective outlined in Section 3.2. The feature backbone is a sparse convolution U-Net, specifically a Minkowski Res16UNet34C [10] with a 0.15-meter voxel size. To match the KITTI360Pose configuration, the input consists of cell point clouds, each covering a 30-meter cubic space. Pre-training runs for 300 epochs with the AdamW optimizer. In the coarse stage, we train the text-cell retrieval model with the AdamW optimizer at a learning rate of 1e-3 for 24 epochs, decaying the learning rate by a factor of 10 at the 12th epoch. In the fine stage, we train the regression model with the Adam optimizer at a learning rate of 3e-4 for 12 epochs.

Table 1: Performance comparison on the KITTI360Pose dataset. The first three methods utilize ground-truth instances as input; the other methods take raw point cloud data directly.

Table 2: Performance comparison for coarse text-cell retrieval on the KITTI360Pose dataset.

Table 3: Performance comparison of the fine-stage models on the KITTI360Pose dataset. Normalized Euclidean distance is adopted as the metric. All methods directly take raw point clouds as input.
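The coarse-stage step schedule described in Section 4.2 (base learning rate 1e-3, decayed by a factor of 10 at the 12th of 24 epochs) can be sketched as follows; the function name and 0-indexed epoch convention are illustrative assumptions, not from the paper:

```python
# Hypothetical sketch of the coarse-stage learning-rate schedule:
# base lr 1e-3, divided by 10 from epoch 12 onward, over 24 epochs total.
def coarse_stage_lr(epoch: int, base_lr: float = 1e-3,
                    decay_epoch: int = 12, decay_factor: float = 10.0) -> float:
    """Learning rate for a given (0-indexed) training epoch."""
    return base_lr if epoch < decay_epoch else base_lr / decay_factor

schedule = [coarse_stage_lr(e) for e in range(24)]
print(schedule[0], schedule[11], schedule[12])  # 0.001 0.001 0.0001
```

In a PyTorch training loop, the same effect is conventionally obtained with a step scheduler (e.g. `torch.optim.lr_scheduler.StepLR` with `step_size=12, gamma=0.1`) attached to the AdamW optimizer.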

Authors:

(1) Lichao Wang, FNii, CUHKSZ ([email protected]);

(2) Zhihao Yuan, FNii and SSE, CUHKSZ ([email protected]);

(3) Jinke Ren, FNii and SSE, CUHKSZ ([email protected]);

(4) Shuguang Cui, SSE and FNii, CUHKSZ ([email protected]);

(5) Zhen Li, SSE and FNii, CUHKSZ (corresponding author, [email protected]).


This paper is available on arxiv under CC BY-NC-ND 4.0 Deed (Attribution-Noncommercial-Noderivs 4.0 International) license.