Research — Public Abstract

Empirical Evaluation of Multi-Agent Orchestration Parameters: A 243,000-Trial Combinatorial Simulation Study

Author: Chris Jones
Date: March 2026
Status: Internal study, abstract published

Abstract

We present a 243,000-trial combinatorial simulation study evaluating six independent variables affecting multi-agent orchestration performance across seven dependent variables. Using a full factorial design (486 cells, 500 trials per cell), we identify empirically optimal configurations for federated autonomous agent fleets.

Key findings include: coordination intensity has a medium effect on task quality, with a counterintuitive ordering where minimal coordination performs worse than no coordination; election strategy has a large effect on leader selection accuracy but a negligible effect on downstream task quality, revealing a quality buffer mechanism; the largest interaction effect is election strategy by fleet size, demonstrating that orchestration configuration must be scale-aware; and consistency review reduces measured consistency scores—a measurement artifact where absence of detection is mistaken for absence of defects.

Main effects are tested with non-parametric methods and interaction effects with two-way ANOVA. All tests use Benjamini-Hochberg FDR correction at q = 0.001 and achieve statistical power of 1.0.

Hypotheses and Outcomes

ID	Hypothesis	Outcome
H1	Coordination intensity affects task quality, with full coordination producing the highest quality	Supported
H2	Election strategy has a larger effect on election accuracy than on task quality	Supported
H3	Election strategy and fleet size interact to affect election accuracy	Supported
H4	Consistency review improves consistency scores	Rejected
H5	Task quality and failure recovery are positively correlated	Supported

Methodology

We employed a full factorial combinatorial design with six independent variables (3×3×3×2×3×3 = 486 experimental cells) and seven dependent variables. Each cell was replicated 500 times with deterministic seeding for bit-for-bit reproducibility, yielding 243,000 total trials.

Experimental Design

Design: Full factorial, 6 IVs × 7 DVs
Cells: 486 unique configurations
Replications: 500 per cell (deterministic seeding)
Total trials: 243,000
Main effect tests: 42
Interaction tests: 105 (15 factor pairs × 7 DVs)
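The counts in this list follow directly from the factor structure, and a short sketch can reproduce them. The abstract names only four of the six factors, so `factor_e` and `factor_f` are hypothetical placeholders, and the seeding scheme in `trial_rng` is an assumption made for illustration rather than the study's actual scheme.

```python
from itertools import product
from math import comb
import random

# Level counts per independent variable (3x3x3x2x3x3).
# "factor_e" and "factor_f" are placeholder names: the abstract does not name them.
LEVELS = {
    "coordination": 3,
    "election_strategy": 3,
    "fleet_size": 3,
    "consistency_review": 2,
    "factor_e": 3,
    "factor_f": 3,
}
N_DVS = 7
REPS_PER_CELL = 500

# Every combination of factor levels is one experimental cell.
cells = list(product(*(range(n) for n in LEVELS.values())))
total_trials = len(cells) * REPS_PER_CELL

# One omnibus test per (factor, DV) and one per (factor pair, DV).
main_effect_tests = len(LEVELS) * N_DVS
interaction_tests = comb(len(LEVELS), 2) * N_DVS


def trial_rng(cell_index, rep):
    """Return a reproducible generator for one trial (assumed seeding scheme)."""
    return random.Random(cell_index * REPS_PER_CELL + rep)


print(len(cells), total_trials, main_effect_tests, interaction_tests)
```

Deriving each seed from the cell index and replication number means any single trial can be re-run in isolation and yield the same random stream.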

Statistical Methods

Normality assessment: Shapiro-Wilk test (normality rejected for all DVs at N = 243,000)
Omnibus tests: Kruskal-Wallis H (non-parametric)
Pairwise comparisons: Mann-Whitney U
Interaction effects: Two-way ANOVA F-test (parametric; treated as robust to non-normality at this N)
Multiple comparison correction: Benjamini-Hochberg FDR at q = 0.001
Effect sizes: Omega-squared (bias-corrected), Cohen’s d with Hedges’ g correction
Post-hoc power: Achieved power = 1.0 across all tests
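As an illustration of the multiple comparison step, the following is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment. The function name and the example p-values are made up for demonstration; this is not the study's analysis code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


Q = 0.001  # FDR level stated in the abstract

# Made-up p-values, for demonstration only.
example = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(example)
rejected = [p <= Q for p in adjusted]
print(adjusted, rejected)
```

A test counts as significant when its adjusted p-value is at or below q, which is equivalent to the usual step-up rule on the sorted raw p-values.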

Key Findings

1. The Coordination Paradox

Coordination intensity has a medium effect on task quality (ω² = 0.072). The ordering is full > none > minimal: minimal coordination performs worse than no coordination. Half-measures introduce overhead without the structured feedback loops that make full coordination effective.

Kruskal-Wallis H(2) = 49,805.69, p < .001, N = 243,000, power = 1.0
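For readers unfamiliar with the statistic, here is a minimal sketch of how a Kruskal-Wallis H value is computed from ranked data, including the standard tie correction. The toy groups are invented. With three groups the statistic has 2 degrees of freedom, which matches the H(2) notation used for the three-level coordination factor.

```python
def kruskal_wallis_h(groups):
    """Compute the tie-corrected Kruskal-Wallis H statistic."""
    # Pool all observations, remembering which group each came from.
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    # Sum the ranks within each group.
    rank_sums = [0.0] * len(groups)
    for (_, gi), rank in zip(pooled, ranks):
        rank_sums[gi] += rank
    h = 12 / (n * (n + 1)) * sum(
        rs**2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    # If every value is identical the statistic is undefined; report 0.
    return h / correction if correction > 0 else 0.0


# Toy data with fully separated groups (invented for illustration).
toy_h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(toy_h)
```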

2. The Quality Buffer

Election strategy has a large effect on leader selection accuracy (ω² = 0.352) but a negligible effect on downstream task quality (ω² < 0.001). The system exhibits a quality buffer—downstream mechanisms compensate for suboptimal leader selection. Accuracy and quality are dissociated.

H(2) = 85,456.62, p < .001 (accuracy); H(2) = 415.93, p < .001, ω² = 0.0005 (quality)

3. Scale-Dependent Configuration

The largest interaction effect in the study is election strategy × fleet size on election accuracy (ω² = 0.163, large). Competence-based election maintains high accuracy across fleet sizes while simpler strategies degrade sharply. Configuration must be fleet-size-aware—no single strategy generalizes across scales.

F(4, 242,970) = 11,858.83, p < .001, ω² = 0.163

4. The Consistency Review Paradox

Consistency review reduces the consistency metric (ω² = 0.264, large). Without review, no inconsistencies are detected, producing a perfect score. With review, real deviations are measured. Absence of detection is not absence of defects. Naive optimization against this metric leads to the wrong conclusion.

H(1) = 107,199.80, p < .001, N = 243,000, power = 1.0
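The measurement artifact can be illustrated with a toy scoring function. This is a hypothetical sketch: the abstract does not give the actual consistency metric, so `measured_consistency` and its inputs are assumptions chosen only to show why disabling review yields a perfect score.

```python
def measured_consistency(n_outputs, n_true_deviations, review_enabled):
    """Toy metric: share of outputs with no *detected* deviation.

    Hypothetical formula for illustration; not the study's metric.
    """
    # Without review nothing is inspected, so nothing is ever detected.
    detected = n_true_deviations if review_enabled else 0
    return 1.0 - detected / n_outputs


# Same underlying outputs, same true deviations, different measurement.
without_review = measured_consistency(100, 12, review_enabled=False)
with_review = measured_consistency(100, 12, review_enabled=True)
print(without_review, with_review)
```

Both calls describe the same 12 true deviations. Only the reviewed configuration reports them, so its score is lower even though the underlying output is no worse.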

5. Quality–Resilience Coupling

Task quality and failure recovery are moderately correlated (r = 0.485, p < .001). Configurations that invest in quality also provide the redundancy needed for failure recovery. Quality and resilience emerge from the same underlying mechanisms—they are not independent investment targets.

Pearson r = 0.485, 99.9% CI [0.479, 0.490], N = 243,000
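A confidence interval of this kind can be approximated with the Fisher z-transformation. The abstract does not say which interval method was used, so the sketch below is an assumption. Applied to the rounded r = 0.485 it gives roughly [0.480, 0.490], which agrees with the reported interval to within rounding of r.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence):
    """Approximate CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                      # transform r to the z scale
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then map back to the r scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


lower, upper = pearson_ci(r=0.485, n=243_000, confidence=0.999)
print(round(lower, 3), round(upper, 3))
```

The interval is narrow because the standard error shrinks with the square root of N, not because the correlation itself is strong.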

Top Interaction Effects

Interaction	Outcome Measure	ω²	Effect
Election strategy × fleet size	Election accuracy	0.163	Large
Coordination × election strategy	Delegation efficiency	0.108	Medium
Coordination × fleet size	Delegation efficiency	0.077	Medium
Election strategy × fleet size	Delegation efficiency	0.015	Small
Consistency review × coordination	Consistency score	0.014	Small

All p < .001, all p_adj < .001 after Benjamini-Hochberg FDR correction. 105 total interaction tests conducted.

Cross-Variable Correlations

Variable A	Variable B	r	Effect
Task quality	Failure recovery	0.485	Medium
Delegation efficiency	Resource cost	0.303	Medium
Task quality	Delegation efficiency	0.273	Small
Election accuracy	All other DVs	< 0.02	Null

FDR-corrected across 21 pairwise tests. The near-zero correlations between election accuracy and all other DVs are consistent with the quality buffer mechanism.

Effect Size Interpretation

Following Cohen (1988):

Measure	Small	Medium	Large
Omega-squared (ω²)	0.01	0.06	0.14
Cohen’s d	0.20	0.50	0.80
Pearson r	0.10	0.30	0.50
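The thresholds in this table can be applied mechanically. The sketch below encodes them and labels a few effect sizes reported in this abstract. Treating each threshold as an inclusive lower bound and calling anything below "small" "negligible" are assumptions, as is the function name.

```python
# Thresholds from the table above: (small, medium, large).
THRESHOLDS = {
    "omega_squared": (0.01, 0.06, 0.14),
    "cohens_d": (0.20, 0.50, 0.80),
    "pearson_r": (0.10, 0.30, 0.50),
}


def effect_label(measure, value):
    """Map an effect size to a label using inclusive lower bounds (assumed)."""
    small, medium, large = THRESHOLDS[measure]
    magnitude = abs(value)  # sign does not affect the size of an effect
    if magnitude >= large:
        return "large"
    if magnitude >= medium:
        return "medium"
    if magnitude >= small:
        return "small"
    return "negligible"


# Values reported elsewhere in this abstract.
for measure, value in [
    ("omega_squared", 0.072),
    ("omega_squared", 0.352),
    ("omega_squared", 0.0005),
    ("pearson_r", 0.485),
]:
    print(measure, value, effect_label(measure, value))
```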

Limitations

Acknowledged Constraints

Synthetic quality model: Task quality is computed mathematically rather than generated by real LLM calls. The quality function captures agent-task match, coordination effects, and stochastic noise, but may miss emergent properties of real language generation.
Fixed task complexity: All trials used moderate complexity. The coordination-quality tradeoff may differ for simple or highly complex tasks.
Constant brain maturity: All agents used a developing maturity level. Mature knowledge graphs may alter the election accuracy findings.
No real LLM heterogeneity: The model includes LLM capability matching but does not capture qualitative differences between real model outputs.

Ongoing & Planned Research

This study is the first in a series. Each acknowledged constraint above represents a planned follow-up study. All validation work is conducted in-house to maintain data integrity and protect proprietary methodology. Findings are applied directly to the product—subscribers benefit from every study through configuration improvements and capability updates.

Research Roadmap

Real-world validation against live LLM fleets (Claude, GPT-4, Gemini)	Planned
Task complexity as a varied factor (trivial through complex)	Planned
Longitudinal brain maturity study across 1,000+ task sessions	Planned
Adversarial robustness testing for federated environments	Planned
Cross-constraint interaction mapping at production scale	Planned

Each study follows the same publication-grade methodology: full factorial design, deterministic seeding, non-parametric tests, FDR correction, and raw trial-level data preservation.

Opt-In Federated Insights

Subscribers who opt in to anonymized performance telemetry help accelerate this research—and benefit directly from the results. Aggregated patterns across real-world deployments inform configuration refinements that ship back to every participant. Your brain data stays private and encrypted on your machine. Only anonymized orchestration metrics (timing, accuracy, recovery rates) are shared, and only with explicit consent.

This is how Rebis improves: real-world signal from opted-in users, validated through the same rigorous methodology, applied as product updates for everyone.

References

Alberts, D. S. (2011). The agility advantage: A survival guide for complex enterprises and endeavors. DoD Command and Control Research Program.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Grassé, P.-P. (1959). La reconstruction du nid et les coordinations interindividuelles chez Bellicositermes natalensis et Cubitermes sp. Insectes Sociaux, 6(1), 41–80.

Ongaro, D., & Ousterhout, J. (2014). In search of an understandable consensus algorithm. USENIX Annual Technical Conference, 305–319.

Parker, L. E. (1998). ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation, 14(2), 220–240.

Smith, R. G. (1980). The Contract Net Protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12), 1104–1113.

This is how Rebis makes decisions

Every orchestration parameter is empirically optimized—not guessed, not copied from a framework default. The full study drives the product.