Estimated build cost: $9K - $13K over 6-10 weeks.

Founder's Pitch

"Create calibrated stress tests to evaluate and improve multi-agent workflow metrics with WorkflowPerturb."

AI Workflow Evaluation
Score: 5

Commercial Viability Breakdown (0-10 scale)

High Potential: 2/4 signals, score 5
Quick Build: 3/4 signals, score 7.5
Series A Potential: 1/4 signals, score 2.5
