BUILDER'S SANDBOX

Build This Paper

Use an AI coding agent to implement this research.

OpenAI Codex (AI Agent)

Lightweight coding agent in your terminal.

Claude Code (AI Agent)

Agentic coding tool for terminal workflows.

AntiGravity IDE (Scaffolding)

AI agent mindset installer and workflow scaffolder.

Cursor (IDE)

AI-first code editor built on VS Code.

VS Code (IDE)

Free, open-source editor by Microsoft.

Estimated build cost: $10K-$14K over 6-10 weeks.



Founder's Pitch

"Develop a robustness testing tool for evaluating LLMs against lexical and syntactic perturbations."

LLM Evaluation · Score: 4
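The pitch above is not specified beyond a single sentence, so as an illustration only, here is a minimal sketch of what a lexical/syntactic perturbation harness might look like. All function names, the typo-injection and clause-reordering strategies, and the scoring rule are assumptions of this sketch, not the paper's actual method:

```python
import random
import string

def lexical_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Lexical noise: swap one interior letter in roughly `rate` of the
    words longer than 3 characters. Deterministic for a fixed seed."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(w) - 1)
            # pick a replacement guaranteed to differ from the original letter
            repl = rng.choice([c for c in string.ascii_lowercase if c != w[i].lower()])
            w = w[:i] + repl + w[i + 1:]
        out.append(w)
    return " ".join(out)

def syntactic_perturb(text: str) -> str:
    """Syntactic rewrite: move a trailing ', if ...' clause to the front,
    preserving meaning while changing surface structure."""
    if ", if " in text:
        main, cond = text.split(", if ", 1)
        return "If " + cond.rstrip(".") + ", " + main[0].lower() + main[1:] + "."
    return text

def robustness_score(model, question: str, perturbations) -> float:
    """Fraction of perturbed questions on which the model's answer
    matches its answer to the clean question."""
    baseline = model(question)
    hits = sum(model(p(question)) == baseline for p in perturbations)
    return hits / len(perturbations)

q = "What is twelve plus thirty, if you can compute it."
perts = [lambda t: lexical_perturb(t, rate=1.0, seed=1), syntactic_perturb]

robust = lambda prompt: "42"      # answer never changes with the prompt
brittle = lambda prompt: prompt   # answer echoes the exact input text
print(robustness_score(robust, q, perts))   # 1.0
print(robustness_score(brittle, q, perts))  # 0.0
```

A real tool would replace the lambda stubs with calls to an LLM API and use a semantic-similarity comparison rather than exact string equality, since paraphrased answers should count as consistent.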

Commercial Viability Breakdown (0-10 scale)

High Potential: 2.5 (1/4 signals)
Quick Build: 2.5 (1/4 signals)
Series A Potential: 2.5 (1/4 signals)

Sources used for this analysis:

arXiv Paper: Full-text PDF analysis of the research paper
GitHub Repository: Code availability, stars, and contributor activity
Citation Network: Semantic Scholar citations and co-citation patterns
Community Predictions: Crowd-sourced unicorn probability assessments

Analysis model: GPT-4o · Last scored: 2/19/2026
