Paper Review: Pixel Aligned Language Models

satojkovic (https://blog.hatena.ne.jp/satojkovic/), stMind (https://stmind.hatenablog.com/), 2024-07-15 22:49:10
https://stmind.hatenablog.com/entry/2024/07/15/224910

arxiv.org

Most prior work on grounding language in vision has taken the whole image as input. This paper instead proposes PixelLLM, a vision-language model that aligns each output word with pixels in the image, enabling fine-grained localization. Figure 1 shows the tasks PixelLLM supports: given image + text, it outputs the bounding box corresponding to the text prompt; given image + location, it generates a caption for the specified box region. PixelLLM architecture for pixel-aligned captioning…