Revisiting CLS Token and Patch Token Interaction

Separating the processing paths of the [CLS] token and the patch tokens in ViT improves dense prediction performance.

WIP core ViT Created: 2025-06-01 | Updated: 2025-06-01

In a standard ViT, the [CLS] token and the patch tokens share the same self-attention in every Transformer layer. This paper observes that the two token types play fundamentally different roles:

| Token | Role | Optimized for |
|---|---|---|
| [CLS] | Global representation of the whole image | Classification |
| Patch | Local feature at each position | Dense prediction |

Problem with the Standard Design

In a standard ViT, the [CLS] token takes part in attention at every layer and, in doing so, "contaminates" the local features of the patch tokens. This degrades performance on dense prediction tasks such as segmentation and detection.
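The coupling described above can be illustrated with a toy example. This is a minimal sketch, assuming a single attention head with random projection weights rather than a trained ViT; `joint_self_attention` and all variable names are hypothetical. It only shows the mechanism (patch outputs depend on the [CLS] token when both share one attention), not the effect on task performance.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over [CLS, P1, ..., PN] with no masking."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, dim = 4, 8
# Random stand-ins for learned projections (assumption, not trained weights).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
patches = rng.normal(size=(n_patches, dim))
cls_a = rng.normal(size=(1, dim))
cls_b = rng.normal(size=(1, dim))

out_a, weights_a = joint_self_attention(np.vstack([cls_a, patches]), w_q, w_k, w_v)
out_b, _ = joint_self_attention(np.vstack([cls_b, patches]), w_q, w_k, w_v)

# Attention mass each patch query places on the CLS key (column 0).
cls_mass = weights_a[1:, 0]
# How far the patch outputs move when only the CLS token changes.
patch_shift = np.abs(out_a[1:] - out_b[1:]).max()
print("attention on CLS per patch:", cls_mass)
print("max change in patch outputs:", patch_shift)
```

Because every patch query places non-zero softmax weight on the [CLS] key, swapping the [CLS] token alone changes the patch outputs.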

Proposed Method: Decoupled Processing

Standard: [CLS, P1, P2, ..., PN] -> Transformer Layer x L -> [CLS, P1, P2, ..., PN]

Proposed:
[CLS] -> CLS Branch (lightweight) -> Global Feature
[P1..PN] -> Patch Branch (main) -> Dense Features

Equations

$$\text{Attn}_{\text{patch}} = \text{softmax}\left(\frac{Q_P K_P^T}{\sqrt{d_k}}\right) V_P$$

$$\text{Attn}_{\text{cls}} = \text{softmax}\left(\frac{q_{\text{cls}} K^T}{\sqrt{d_k}}\right) V$$
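The two-branch diagram and the two attention formulas above can be sketched in NumPy as follows. This is an illustration built only from this note, not the paper's implementation. It assumes a single head, omits LayerNorm and the MLP, and uses random weights. The note does not define the unsubscripted K and V in the CLS formula, so the sketch assumes they are built from [CLS] plus the patch-branch output. The names `patch_branch`, `cls_branch`, and `rand_proj` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v for a single head."""
    d_k = k.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def patch_branch(patches, layers):
    """Main branch: patch tokens attend only to patch tokens (Attn_patch)."""
    x = patches
    for w_q, w_k, w_v in layers:
        x = x + attention(x @ w_q, x @ w_k, x @ w_v)  # residual connection
    return x

def cls_branch(cls_token, patches, w_q, w_k, w_v):
    """Lightweight branch: a single CLS query reads the tokens (Attn_cls).

    Assumption: K and V come from [CLS] plus the patch-branch output.
    """
    tokens = np.vstack([cls_token, patches])
    q_cls = cls_token @ w_q
    return cls_token + attention(q_cls, tokens @ w_k, tokens @ w_v)

rng = np.random.default_rng(0)
n_patches, dim, n_layers = 6, 8, 2

def rand_proj():
    # Random stand-in for a learned projection matrix.
    return rng.normal(size=(dim, dim)) / np.sqrt(dim)

layers = [(rand_proj(), rand_proj(), rand_proj()) for _ in range(n_layers)]
cls_proj = (rand_proj(), rand_proj(), rand_proj())
patches = rng.normal(size=(n_patches, dim))
cls_token = rng.normal(size=(1, dim))

dense_features = patch_branch(patches, layers)                    # (N, dim)
global_feature = cls_branch(cls_token, dense_features, *cls_proj)  # (1, dim)
print(dense_features.shape, global_feature.shape)
```

In this sketch `patch_branch` never receives the [CLS] token, so the dense features cannot depend on it. The CLS branch costs one query against N + 1 keys instead of a full L-layer pass.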
  • Paper: Revisiting [CLS] and Patch Token Interaction in Vision Transformers (ICLR 2026)
Personal Notes

This approach also has implications for the noisy label correction project. Computing CLS-token-based similarity and patch-token-based similarity separately during feature extraction may change clustering quality.

In manufacturing-domain images in particular, the defect location matters, so similarity computed from dense features may be more meaningful.
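One way to keep the two similarities separate is sketched below. It assumes features have already been extracted as a (D,) [CLS] vector and an (N, D) patch grid per image, and that both images share the same patch grid so positions line up. The function names and the synthetic "defect" are hypothetical, and the random arrays stand in for real ViT features.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity along the last axis."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / (den + eps)

def cls_similarity(cls_a, cls_b):
    """One global score per image pair from [CLS] features of shape (D,)."""
    return float(cosine(cls_a, cls_b))

def patch_similarity(patches_a, patches_b):
    """Per-position cosine between aligned patch grids of shape (N, D).

    Returns the per-patch scores and their mean.
    """
    per_patch = cosine(patches_a, patches_b)
    return per_patch, float(per_patch.mean())

rng = np.random.default_rng(0)
n_patches, dim = 16, 32
reference = rng.normal(size=(n_patches, dim))
defective = reference.copy()
defective[5] = -reference[5]  # hypothetical defect: one patch feature flipped

per_patch, mean_score = patch_similarity(reference, defective)
worst_patch = int(per_patch.argmin())  # position where the images differ most

cls_score = cls_similarity(rng.normal(size=dim), rng.normal(size=dim))
print("least similar patch index:", worst_patch)
print("mean patch similarity:", mean_score)
```

The per-patch scores keep the location of the difference, which a single [CLS] score cannot express. Whether this helps clustering on real defect images is the open question in this note.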

Questions

  • What happens if the CLS branch is removed from the decoupled architecture entirely and only global average pooling is used?
  • Can decoupling be applied to an existing pretrained ViT through fine-tuning?
  • Experiment: compare K-means on patch-feature similarity with K-means on CLS features.
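A possible harness for the K-means comparison in the last question, which also covers the global-average-pooling variant from the first. This is a minimal sketch: `kmeans` is a plain Lloyd's iteration written here for illustration, `rand_index` is the standard pairwise-agreement measure, and the random arrays are placeholders for real [CLS] and patch features.

```python
import numpy as np

def kmeans(features, k, n_iters=50, seed=0):
    """Plain Lloyd's K-means on L2-normalized rows; returns labels of shape (M,)."""
    rng = np.random.default_rng(seed)
    x = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # fancy indexing copies
    for _ in range(n_iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree (same vs. different)."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)
    return float((same_a[upper] == same_b[upper]).mean())

rng = np.random.default_rng(0)
n_images, n_patches, dim, k = 40, 16, 8, 3
cls_feats = rng.normal(size=(n_images, dim))               # placeholder [CLS] features
patch_feats = rng.normal(size=(n_images, n_patches, dim))  # placeholder patch features

labels_cls = kmeans(cls_feats, k)
labels_gap = kmeans(patch_feats.mean(axis=1), k)            # global average pooling
labels_flat = kmeans(patch_feats.reshape(n_images, -1), k)  # keeps patch positions

agreement = {
    "cls_vs_gap": rand_index(labels_cls, labels_gap),
    "cls_vs_flat": rand_index(labels_cls, labels_flat),
}
print(agreement)
```

With real features, the same comparison could be scored against known clean labels instead of against each other. Flattening keeps patch positions at the cost of a much higher-dimensional input, so it may need dimensionality reduction first.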

halbielee-wiki