VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model

아인샴 2024. 6. 25. 19:21

~~Making this public as a promise to myself that I will read it later~~

Looks like I am just going to have to read it now. Sigh.

https://openaccess.thecvf.com/content/CVPR2024W/CLVISION/papers/Kim_VLM-PL_Advanced_Pseudo_Labeling_Approach_for_Class_Incremental_Object_Detection_CVPRW_2024_paper.pdf

Figure 1. Method flow: the process starts with pseudo-labeling by the pretrained model M_old, and the pseudo labels are then refined through a Vision-Language Model. Custom-generated prompts are built for each pseudo ground truth (GT). To produce reliable pseudo GTs, this refining process filters out incorrect ones, and the resulting annotations are used to train the detector M_new, integrating previous knowledge with the updated dataset.

Abstract

In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially powerful, struggle with multi-scenario incremental learning due to their tendency to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training. VLM-PL starts by deriving pseudo GTs from a pre-trained detector. Then, we generate custom queries for each pseudo GT using carefully designed prompt templates that combine image and text features. This allows the VLM to classify the correctness through its responses. Furthermore, VLM-PL integrates refined pseudo and real GTs from upcoming training, effectively combining new and old knowledge.

Extensive experiments conducted on the Pascal VOC and MS COCO datasets not only highlight VLM-PL’s exceptional performance in multi-scenario but also illuminate its effectiveness in dual-scenario by achieving state-of-the-art results in both.

Introduction

Approaches to catastrophic forgetting:
- Regularization: preserves earlier learning by penalizing changes to critical parameters.
- Knowledge distillation: focuses on transferring the previous model's knowledge to the updated model, so that the updated model still performs well on earlier tasks.
- Replay
  - Partial experience replay: keeps a subset of the previous tasks' data in a memory buffer while learning a new task.
  - Deep generative replay: uses generative models to re-experience data from past tasks.
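
As a rough illustration of the partial experience replay idea above, here is a minimal sketch of a fixed-size memory buffer. It is not from the paper (VLM-PL itself does not use replay); the class name `ReplayBuffer`, the method names, and the choice of reservoir sampling to decide which old samples to keep are all assumptions made for this example.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of samples from earlier tasks (hypothetical helper).

    Uses reservoir sampling so that every sample seen so far has the same
    chance of being in the buffer. This policy is an assumption for the
    sketch, not something the paper prescribes.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered to the buffer
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one old-task sample to the buffer."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def mix(self, new_batch, k):
        """Return the new-task batch followed by up to k replayed samples."""
        k = min(k, len(self.items))
        return list(new_batch) + self.rng.sample(self.items, k)
```

While training on a new task, each batch would be passed through `mix` so that the model keeps seeing a few stored samples from earlier tasks.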
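CIL is hard enough, and CIOD (class-incremental object detection) is hard too. It needs advanced methodologies that improve detection across a wide range of labels without hurting accuracy on previously learned classes.
Methods for CIOD
- From CNN-based to transformer-based: one line of research transitioned the detection framework to improve model generalization. Along with this shift, pseudo-labeling was used across several forgetting-mitigation strategies to improve model performance.
- There are many techniques, but fundamentally they all depend on the performance of the pretrained model.
- This dependency severely limits multi-incremental scenarios. As scenario complexity grows, knowledge from earlier tasks weakens, which leads to performance degradation. The cause is that the pseudo labels generated from the prior model's knowledge were inaccurate.
- So the paper introduces VLM-PL, which uses a VLM to refine incorrect GTs. This ensures that accurate pseudo ground truths (GTs) are used consistently across scenarios.
- It draws heavily on the many pre-trained foundation models that have appeared recently.
- In particular, it uses VLM prompt tuning to verify GTs, so no retraining for the classification task is needed. This strategy effectively reduced error accumulation and performed strongly in both multi- and dual-scenario settings.
- It achieves state-of-the-art results on Pascal VOC and MS COCO without a replay strategy; the flow of the proposed approach is shown in Fig. 1.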
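Summary
- To the best of our knowledge, we are the first to apply a VLM to CIOD, addressing a problem that earlier work did not cover.
- We introduce effective VLM prompt tuning that also covers scenarios with multiple incremental classes, which prevents performance degradation even in challenging settings.
- Extensive experiments show that our approach sets a new state of the art in multi-incremental scenarios, and that VLM assistance has a huge impact on object detection!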

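Figure 2. Overview of VLM-assisted pseudo-labeling: the pipeline starts from the detector M_old, which applies pseudo-labeling to identify potential objects (e.g., a boat, car, or cat in the input image) together with their bounding-box locations. Each identified object and its location are encapsulated in a prompt template. The template includes placeholders for <image feature> and <region feature>; the former represents features of the whole image and the latter represents features of the RoI. The VLM's yes/no response to each prompt is used to classify whether the pseudo GT is reliable. The refined pseudo GTs are then combined with the new GTs of the new task to train the detector M_new.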
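
The flow in Figure 2 can be sketched roughly as follows. This is only an illustration under assumptions, not the authors' code: the `Annotation` type, the function names, and the exact wording of `PROMPT_TEMPLATE` are made up for this example, and `ask_vlm` is a hypothetical stand-in for whatever VLM is used (a real one would also receive the image and region features, not just text).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Annotation:
    label: str
    box: Box
    is_pseudo: bool = False  # True if produced by M_old, False if a real GT


# Hypothetical wording; the placeholders mirror the ones named in Figure 2.
PROMPT_TEMPLATE = (
    "<image feature> Is the object in <region feature> at {box} "
    "a {label}? Answer yes or no."
)


def build_prompt(gt: Annotation) -> str:
    """Encapsulate one pseudo GT (label + location) in the prompt template."""
    x1, y1, x2, y2 = gt.box
    box_text = f"[{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]"
    return PROMPT_TEMPLATE.format(box=box_text, label=gt.label)


def refine_pseudo_gts(
    pseudo_gts: List[Annotation], ask_vlm: Callable[[str], str]
) -> List[Annotation]:
    """Keep only the pseudo GTs that the VLM confirms with a 'yes'."""
    kept = []
    for gt in pseudo_gts:
        answer = ask_vlm(build_prompt(gt)).strip().lower()
        if answer.startswith("yes"):
            kept.append(gt)
    return kept


def build_training_targets(
    pseudo_gts: List[Annotation],
    new_gts: List[Annotation],
    ask_vlm: Callable[[str], str],
) -> List[Annotation]:
    """Combine refined pseudo GTs (old classes) with real GTs (new classes)."""
    return refine_pseudo_gts(pseudo_gts, ask_vlm) + list(new_gts)
```

The returned list is what M_new would be trained on for that image. Passing the VLM in as a plain callable keeps the sketch independent of any particular model or library.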