
BUILDER'S SANDBOX

Core Pattern

AI-generated implementation pattern based on this paper's core methodology.


Estimated build cost: $9K-$13K over 6-10 weeks.


Founder's Pitch

"Exploring adversarial robustness in discrete image tokenizers to enhance multimodal system security."

Adversarial Robustness (Score: 2)

Commercial Viability Breakdown (0-10 scale)

High Potential: 0/4 signals, score 0
Quick Build: 1/4 signals, score 2.5
Series A Potential: 1/4 signals, score 2.5

