Papers
1–3 of 3Research Paper·Jan 16, 2026
Building Production-Ready Probes For Gemini
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may...
8.0 viability
Research Paper·Mar 10, 2026
Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models
Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often...
8.0 viability
Research Paper·Jan 8, 2026
RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remai...
7.0 viability