【Robotics】Reproducing 3D-Mem: Embodied Exploration and Reasoning | 3D Scene Memory, CVPR 2025
3D-Mem is a 3D scene memory for embodied exploration and reasoning, presented at CVPR 2025.
This post walks through reproducing 3D-Mem and running model inference.
Below is an example inference and selection result:
Here is the robot exploration in action:
Below is a demo run by the authors in a real-world environment. Because 3D-Mem is training-free by design, it adapts seamlessly to real robots and can be deployed in the real world.
Project page: https://umass-embodied-agi.github.io/3D-Mem/
1. Create a Conda environment
First, create a Conda environment named 3dmem with Python 3.9,
then activate the 3dmem environment:
conda create -n 3dmem python=3.9 -y
conda activate 3dmem
Next, clone the code and enter the project directory: https://github.com/UMass-Embodied-AGI/3D-Mem
git clone https://github.com/UMass-Embodied-AGI/3D-Mem.git
cd 3D-Mem
2. Install the Habitat simulator
We need to install habitat-sim==0.2.5, headless, and faiss-cpu:
conda install -c conda-forge -c aihabitat habitat-sim=0.2.5 headless faiss-cpu=1.7.4 -y
Wait for the installation to finish.
3. Install torch and pytorch3d
Run the following command to install torch:
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
Then install pytorch3d:
conda install https://anaconda.org/pytorch3d/pytorch3d/0.7.4/download/linux-64/pytorch3d-0.7.4-py39_cu118_pyt201.tar.bz2 -y
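Once both installs finish, it is worth a quick sanity check that this torch build sees the GPU and that pytorch3d imports against it. A minimal sketch (the expected version strings simply reflect the commands above):

# sanity check for the torch / pytorch3d install
import torch
import pytorch3d

print(torch.__version__)          # expected: 2.0.1+cu118
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11.8 drivers
print(pytorch3d.__version__)      # expected: 0.7.4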
4. Install dependencies
Run the following command to install them:
pip install omegaconf==2.3.0 open-clip-torch==2.26.1 ultralytics==8.2.31 supervision==0.21.0 opencv-python-headless==4.10.* \
    scikit-learn==1.4 scikit-image==0.22 open3d==0.18.0 hipart==1.0.4 openai==1.35.3 httpx==0.27.2
Wait for the installation to finish.
5. Install CLIP
Run the following command to install it:
pip install git+https://github.com/openai/CLIP.git
The output looks like this:
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-imrsh3kf
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-imrsh3kf
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
.....
Successfully built clip
Installing collected packages: clip
Successfully installed clip-1.0
A quick look at CLIP's main pipeline:
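As a rough illustration (not part of the 3D-Mem codebase), the openai/clip package just installed embeds an image and a set of text prompts into a shared space and ranks the prompts by similarity; "photo.jpg" below is just a placeholder path:

# minimal CLIP usage sketch: embed an image and a few text prompts,
# then rank the prompts by similarity (illustrative only)
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(["a couch", "a teddy bear", "a coffee table"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability that the image matches each text prompt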
6. Switch to a Hugging Face mirror
The code automatically downloads model weights from Hugging Face, so first configure a domestic (China) mirror.
Edit the user config file ~/.bashrc
and add export HF_ENDPOINT=https://hf-mirror.com.
Run the following commands:
echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc  # take effect immediately
Verify that the environment variable was set correctly:
echo $HF_ENDPOINT
It should print https://hf-mirror.com, which means the mirror is configured.
7. Prepare the HM3D dataset
We need to download hm3d_v0.2.
Download page: GitHub - matterport/habitat-matterport-3dresearch
File to download: hm3d-val-habitat-v0.2.tar
Then place it under the data directory, as sketched below:
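A minimal sketch of unpacking the archive into the layout the default config expects (scene_data_path: "data/hm3d_v0.2/" with val/<scene_id> subfolders, as seen later in the run log); the archive's internal folder structure may differ slightly, so adjust the paths if needed:

# unpack hm3d-val-habitat-v0.2.tar into data/hm3d_v0.2/
import tarfile
from pathlib import Path

archive = Path("hm3d-val-habitat-v0.2.tar")   # the downloaded file
target = Path("data/hm3d_v0.2")
target.mkdir(parents=True, exist_ok=True)

with tarfile.open(archive) as tar:
    tar.extractall(target)

# list a few scene folders to confirm the layout looks like data/hm3d_v0.2/val/<scene_id>/
print(sorted(p.name for p in (target / "val").iterdir())[:5])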
8. Prepare a GPT-4o API key
A domestic (China) provider is recommended for stability: https://ai.nengyongai.cn/register?aff=RQt3
First click "Add Token", set a quota (e.g. 5 CNY), then click "View" to see the key.
Then fill it into src/const.py:
# about habitat scene
INVALID_SCENE_ID = []
# about chatgpt api
END_POINT = "https://ai.nengyongai.cn/v1"
OPENAI_KEY = "xxxxxxxxxxxxxxxxxxxxx"
Click the model list to see which models are supported:
And check the usage page:
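Before launching the full evaluation, it can save time to confirm the key and endpoint actually work with a one-off request. A minimal check using the openai==1.35.3 client installed earlier, run from the repo root (the model name and prompt are just examples; pick any model the provider lists):

# quick check that END_POINT / OPENAI_KEY from src/const.py are usable
from openai import OpenAI
from src.const import END_POINT, OPENAI_KEY

client = OpenAI(base_url=END_POINT, api_key=OPENAI_KEY)
resp = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(resp.choices[0].message.content)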
9. Run model inference
Take a look at the config file cfg/eval_aeqa.yaml:
# general settings
seed: 77  # random seed
exp_name: "exp_eval_aeqa"  # experiment name
output_parent_dir: "results"  # parent directory of the output folder
scene_dataset_config_path: "data/hm3d_annotated_basis.scene_dataset_config.json"  # path to the scene dataset config file
scene_data_path: "data/hm3d_v0.2/"  # path to the scene data
questions_list_path: 'data/aeqa_questions-41.json'  # path to the question list file
concept_graph_config_path: "cfg/concept_graph_default.yaml"  # path to the concept graph config file

# main settings
choose_every_step: true  # whether to query the VLM at every step, or only after reaching a navigation target
egocentric_views: true  # whether to add egocentric views when prompting the VLM
prefiltering: true  # whether to use prefiltering (in practice it cannot be disabled, otherwise the context length limit is exceeded)
top_k_categories: 10  # keep the top-k categories most relevant to the target during prefiltering

# detection models
yolo_model_name: yolov8x-world.pt  # YOLO model name
sam_model_name: sam_l.pt  # SAM model name
class_set: scannet200  # use the 200-class ScanNet label set for the YOLO-World detector

# snapshot clustering
min_detection: 1  # minimum number of detections

# camera and image settings
camera_height: 1.5  # camera height (meters)
camera_tilt_deg: -30  # camera tilt (degrees)
img_width: 1280  # image width (pixels)
img_height: 1280  # image height (pixels)
hfov: 120  # horizontal field of view (degrees)

# whether to save visualizations (this is slow)
save_visualization: true

# image size used when prompting GPT-4o
prompt_h: 360  # prompt image height (pixels)
prompt_w: 360  # prompt image width (pixels)

# navigation settings
num_step: 50  # maximum number of navigation steps
init_clearance: 0.3  # initial clearance (meters)
extra_view_phase_1: 2  # number of extra views in phase 1
extra_view_angle_deg_phase_1: 60  # angle between extra views in phase 1 (degrees)
extra_view_phase_2: 6  # number of extra views in phase 2
extra_view_angle_deg_phase_2: 40  # angle between extra views in phase 2 (degrees)

# TSDF, depth map, and frontier updates
explored_depth: 1.7  # explored depth (meters)
tsdf_grid_size: 0.1  # TSDF grid size (meters)
margin_w_ratio: 0.25  # margin ratio along the width
margin_h_ratio: 0.6  # margin ratio along the height

planner:  # planner settings
  eps: 1  # planner precision
  max_dist_from_cur_phase_1: 1  # step size for exploring frontiers in phase 1, before the target is found (meters)
  max_dist_from_cur_phase_2: 1  # step size for approaching the target in phase 2, after the target is found (meters)
  final_observe_distance: 0.75  # in phase 2, observe from a spot at this distance from the target object (meters)
  surrounding_explored_radius: 0.7  # radius of the surrounding area marked as explored (meters)
  # frontier selection
  frontier_edge_area_min: 4  # minimum frontier edge area
  frontier_edge_area_max: 6  # maximum frontier edge area
  frontier_area_min: 8  # minimum frontier area
  frontier_area_max: 9  # maximum frontier area
  min_frontier_area: 20  # minimum number of pixels a frontier must contain
  max_frontier_angle_range_deg: 150  # maximum angular range spanned by a frontier's pixels (degrees)
  region_equal_threshold: 0.95  # threshold for considering two regions equal

# scene graph construction
scene_graph:
  confidence: 0.003  # confidence threshold
  nms_threshold: 0.1  # non-maximum suppression threshold
  iou_threshold: 0.5  # IoU threshold
  obj_include_dist: 3.5  # distance within which objects are included (meters)
  target_obj_iou_threshold: 0.6  # IoU threshold for the target object
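If you only want to tweak a few fields (for example, a shorter step budget or turning off the slow visualization) without editing the YAML by hand, the config is plain OmegaConf, which is already installed. A small sketch; the copy-and-edit workflow and the file name eval_aeqa_quick.yaml are my own suggestion, not part of the repo:

# load cfg/eval_aeqa.yaml, tweak a couple of fields, and save a new config to pass via -cf
from omegaconf import OmegaConf

cfg = OmegaConf.load("cfg/eval_aeqa.yaml")
cfg.num_step = 30                # shorter episodes for a quick smoke test
cfg.save_visualization = False   # skip the slow visualization output

OmegaConf.save(cfg, "cfg/eval_aeqa_quick.yaml")
print(OmegaConf.to_yaml(cfg))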
Run the following command to generate predictions on the A-EQA dataset:
python run_aeqa_evaluation.py -cf cfg/eval_aeqa.yaml
After launching, the program downloads some model weights over the network,
including yolov8x-world.pt, sam_l.pt, open_clip_pytorch_model.bin, ViT-B-32.pt, and so on.
Here is the runtime log:
00:00:00 - ***** Running exp_eval_aeqa *****
00:00:00 - Total number of questions: 41
00:00:00 - number of questions after splitting: 41
00:00:00 - question path: data/aeqa_questions-41.json
Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8x-world.pt to 'yolov8x-world.pt'...
100%|██████████| 141M/141M [04:04<00:00, 605kB/s]
00:04:09 - Load YOLO model yolov8x-world.pt successful!
Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/sam_l.pt to 'sam_l.pt'...
100%|██████████| 1.16G/1.16G [11:56<00:00, 1.74MB/s]
00:16:12 - Load SAM model sam_l.pt successful!
00:16:12 - Loaded ViT-B-32 model config.
open_clip_pytorch_model.bin:  70%|███████   | 440M/626M [03:17<01:11, 2.58MB/s]
....
Once everything has been downloaded and loaded successfully, the log looks like this:
00:00:00 - ***** Running exp_eval_aeqa *****
00:00:00 - Total number of questions: 41
00:00:00 - number of questions after splitting: 41
00:00:00 - question path: data/aeqa_questions-41.json
00:00:00 - Load YOLO model yolov8x-world.pt successful!
00:00:02 - Load SAM model sam_l.pt successful!
00:00:02 - Loaded ViT-B-32 model config.
00:00:04 - Loading pretrained ViT-B-32 weights (laion2b_s34b_b79k).
00:00:05 - Load CLIP model successful!
00:00:05 - Question 00c2be2a-1377-4fae-a889-30936b7890c3 already processed
00:00:05 - Question 013bb857-f47d-4b50-add4-023cc4ff414c already processed
00:00:05 - ========Index: 2 Scene: 00848-ziup5kvtCCR
00:00:05 - semantic_texture_path: data/hm3d_v0.2/val/00848-ziup5kvtCCR/ziup5kvtCCR.semantic.glb or scene_semantic_annotation_path: data/hm3d_v0.2/val/00848-ziup5kvtCCR/ziup5kvtCCR.semantic.txt does not exist
00:00:06 - Loaded 192 classes from scannet 200: data/scannet200_classes.txt!!!
00:00:06 - Load scene 00848-ziup5kvtCCR successfully without semantic texture
00:00:10 - Question id 01fcc568-f51e-4e12-b976-5dc8d554135a initialization successful!
00:00:10 - == step: 0
00:00:11 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.12 seconds
00:00:13 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.09 seconds
00:00:15 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.08 seconds
00:00:16 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.05 seconds
00:00:17 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.04 seconds
00:00:18 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.05 seconds
00:00:19 - Done! Execution time of detections_to_obj_pcd_and_bbox function: 0.07 seconds
00:00:20 - Step 0, update snapshots, 25 objects, 6 snapshots
00:00:23 - HTTP Request: POST https://ai.nengyongai.cn/v1/chat/completions "HTTP/1.1 200 OK"
00:00:23 - Prefiltering selected classes: ['sofa chair', 'couch', 'pillow', 'coffee table', 'cabinet']
00:00:23 - Prefiltering snapshot: 6 -> 3
00:00:23 - Input prompt:
00:00:23 - Task: You are an agent in an indoor scene tasked with answering questions by observing the surroundings and exploring the environment. To answer the question, you are required to choose either a Snapshot as the answer or a Frontier to further explore.
Definitions:
Snapshot: A focused observation of several objects. Choosing a Snapshot means that this snapshot image contains enough information for you to answer the question. If you choose a Snapshot, you need to directly give an answer to the question. If you don't have enough information to give an answer, then don't choose a Snapshot.
Frontier: An observation of an unexplored region that could potentially lead to new information for answering the question. Selecting a frontier means that you will further explore that direction. If you choose a Frontier, you need to explain why you would like to choose that direction to explore.
Question: Where is the teddy bear?
Select the Frontier/Snapshot that would help find the answer of the question.
The following is the egocentric view of the agent in forward direction: [iVBORw0KGg...]
The followings are all the snapshots that you can choose (followed with contained object classes)
Please note that the contained classes may not be accurate (wrong classes/missing classes) due to the limitation of the object detection model. So you still need to utilize the images to make decisions.
Snapshot 0 [iVBORw0KGg...]
coffee table, couch, pillow
Snapshot 1 [iVBORw0KGg...]
coffee table, pillow, sofa chair
Snapshot 2 [iVBORw0KGg...]
cabinet, couch
The followings are all the Frontiers that you can explore:
Frontier 0 [iVBORw0KGg...]
Frontier 1 [iVBORw0KGg...]
Please provide your answer in the following format: 'Snapshot i[Answer]' or 'Frontier i[Reason]', where i is the index of the snapshot or frontier you choose. For example, if you choose the first snapshot, you can return 'Snapshot 0The fruit bowl is on the kitchen counter.'. If you choose the second frontier, you can return 'Frontier 1I see a door that may lead to the living room.'.
Note that if you choose a snapshot to answer the question, (1) you should give a direct answer that can be understood by others. Don't mention words like 'snapshot', 'on the left of the image', etc; (2) you can also utilize other snapshots, frontiers and egocentric views to gather more information, but you should always choose one most relevant snapshot to answer the question.
00:00:32 - HTTP Request: POST https://ai.nengyongai.cn/v1/chat/completions "HTTP/1.1 200 OK"
00:00:32 - Response: [frontier 0]
Reason: [I would like to explore the hallway further as it may lead to other rooms where the teddy bear might be located.]
00:00:32 - Prediction: frontier, 0
00:00:32 - Next choice: Frontier at [79 33]
UserWarning: *c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
00:00:33 - Current position: [ 0.11692 0.021223 6.1057], 1.005
00:00:34 - == step: 1
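As the prompt above shows, the VLM is asked to reply in the form 'Snapshot i ...' or 'Frontier i ...', and the log then records the parsed result as 'Prediction: frontier, 0'. As a rough illustration of how such a reply can be turned into a decision (my own sketch; the repo has its own parsing logic):

# toy parser for the "Snapshot i ..." / "Frontier i ..." reply format shown above
import re

def parse_choice(reply: str):
    """Parse a VLM reply like '[frontier 0] Reason: [...]' into (type, index)."""
    match = re.search(r"(snapshot|frontier)\s*(\d+)", reply, re.IGNORECASE)
    if match is None:
        return None
    choice_type, index = match.groups()
    return choice_type.lower(), int(index)

reply = "[frontier 0] Reason: [I would like to explore the hallway further.]"
print(parse_choice(reply))  # -> ('frontier', 0), matching the "Prediction: frontier, 0" log line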
The visualization results are saved under results/exp_eval_aeqa.
A look at the visualization of the explored region:
Model inference example 2
The corresponding config file is cfg/eval_goatbench.yaml.
Run the following command to generate predictions on the GOAT-Bench dataset:
python run_goatbench_evaluation.py -cf cfg/eval_goatbench.yaml
GOAT-Bench provides 10 exploration episodes per scene; due to time and resource constraints, only the first episode is tested by default.
You can also specify which episode of each scene to evaluate by setting --split.
That wraps up this walkthrough!
Related articles:
UniGoal Embodied Navigation | Universal Zero-Shot Goal Navigation, CVPR 2025 (CSDN Blog)
【Robotics】Reproducing UniGoal Embodied Navigation | Universal Zero-Shot Goal Navigation, CVPR 2025 (CSDN Blog)
【Robotics】Reproducing ECoT Embodied Chain-of-Thought Reasoning (CSDN Blog)
【Robotics】Reproducing SG-Nav Embodied Navigation | Online 3D Scene Graph Prompting for Zero-Shot Object Navigation (CSDN Blog)
【Robotics】Reproducing WMNav Embodied Navigation | Integrating VLMs into World Models (CSDN Blog)