
BUILDER'S SANDBOX

Build This Paper

Use an AI coding agent to implement this research.

OpenAI Codex (AI Agent): Lightweight coding agent in your terminal.

Claude Code (AI Agent): Agentic coding tool for terminal workflows.

AntiGravity IDE (Scaffolding): AI agent mindset installer and workflow scaffolder.

Cursor (IDE): AI-first code editor built on VS Code.

VS Code (IDE): Free, open-source editor by Microsoft.

Estimated $9K - $13K over 6-10 weeks.



Founder's Pitch

"Develop enhanced de-jailbreaking tools for language models using optimal transport to improve attack success rates while preserving model capabilities."

AI Safety · Score: 6
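
The pitch names optimal transport as the core technique but gives no detail on how it would be applied. As a purely illustrative sketch, assuming the common activation-steering setup from the jailbreaking literature: fit Gaussians to a model's hidden activations on refused versus complied prompts, then transport one distribution onto the other with the closed-form Gaussian optimal-transport (Bures) map. Every name below (gaussian_ot_map, the synthetic activations) is a hypothetical illustration, not the paper's actual method or API.

```python
# Purely hypothetical sketch: steer activations from a "refuses" distribution
# toward a "complies" distribution with the closed-form Gaussian OT (Bures) map.
# Names and data here are illustrative assumptions, not the paper's method.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(X_src, X_tgt, eps=1e-6):
    """Affine map T(x) transporting N(mu_s, S_s) onto N(mu_t, S_t):
    T(x) = mu_t + A (x - mu_s),
    A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2}.
    """
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    d = X_src.shape[1]
    S_s = np.cov(X_src, rowvar=False) + eps * np.eye(d)  # regularize for stability
    S_t = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
    S_s_half = np.real(sqrtm(S_s))
    S_s_half_inv = np.linalg.inv(S_s_half)
    A = S_s_half_inv @ np.real(sqrtm(S_s_half @ S_t @ S_s_half)) @ S_s_half_inv
    return lambda x: mu_t + (x - mu_s) @ A.T

# Synthetic stand-ins for activations captured at one transformer layer.
rng = np.random.default_rng(0)
refuse_acts = rng.normal(0.0, 1.0, size=(256, 64))  # prompts the model refused
comply_acts = rng.normal(1.0, 1.0, size=(256, 64))  # prompts the model answered
T = gaussian_ot_map(refuse_acts, comply_acts)
steered = T(refuse_acts)  # transported activations, same shape as the input
```

Relative to a plain mean-difference steering vector, a map like this also matches second moments, which is one way a transport-based approach could trade attack strength against capability preservation.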

Commercial Viability Breakdown (0-10 scale)

High Potential: 5 (2/4 signals)
Quick Build: 7.5 (3/4 signals)
Series A Potential: 0 (0/4 signals)
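
The page never states a scoring formula, but the three listed scores are exactly consistent with a linear mapping of signals onto the 0-10 scale (2/4 gives 5, 3/4 gives 7.5, 0/4 gives 0). A one-line sketch under that assumption:

```python
# Assumption inferred from the three listed scores, not documented by the page:
# score = 10 * signals_hit / signals_total.
def viability_score(signals_hit: int, signals_total: int = 4) -> float:
    return 10 * signals_hit / signals_total

assert viability_score(2) == 5.0   # High Potential
assert viability_score(3) == 7.5   # Quick Build
assert viability_score(0) == 0.0   # Series A Potential
```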

Sources used for this analysis

arXiv Paper: Full-text PDF analysis of the research paper.
GitHub Repository: Code availability, stars, and contributor activity.
Citation Network: Semantic Scholar citations and co-citation patterns.
Community Predictions: Crowd-sourced unicorn probability assessments.

Analysis model: GPT-4o · Last scored: 3/4/2026
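
Of these sources, the citation network has a well-known public interface. Below is a minimal sketch of pulling citing papers from the Semantic Scholar Graph API; the endpoint is real, but whether this analysis queries it this way is an assumption, and the paper ID is a placeholder since the page does not identify the paper.

```python
# Sketch only: fetch citing papers from the Semantic Scholar Graph API.
# PAPER_ID is a placeholder; the page does not give the paper's identifier.
import requests

PAPER_ID = "arXiv:0000.00000"  # placeholder
resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}/citations",
    params={"fields": "title,year", "limit": 100},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("data", []):
    citing = item["citingPaper"]
    print(citing.get("year"), citing.get("title"))
```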
