Research — Public Abstract

Empirical Evaluation of Multi-Agent Orchestration Parameters: A 243,000-Trial Combinatorial Simulation Study

Author: Chris Jones
Date: March 2026
Status: Internal study, abstract published
  • 243K simulation trials
  • 486 unique configurations
  • 147 hypothesis tests
  • 1.0 achieved power (all tests)

Abstract

We present a 243,000-trial combinatorial simulation study evaluating six independent variables affecting multi-agent orchestration performance across seven dependent variables. Using a full factorial design (486 cells, 500 trials per cell), we identify empirically optimal configurations for federated autonomous agent fleets.

Key findings include: coordination intensity has a medium effect on task quality, with a counterintuitive ordering where minimal coordination performs worse than no coordination; election strategy has a large effect on leader selection accuracy but a negligible effect on downstream task quality, revealing a quality buffer mechanism; the largest interaction effect is election strategy by fleet size, demonstrating that orchestration configuration must be scale-aware; and consistency review reduces measured consistency scores—a measurement artifact where absence of detection is mistaken for absence of defects.

All analyses use non-parametric methods with Benjamini-Hochberg FDR correction at q = 0.001, achieving statistical power of 1.0 across all tests.

Hypotheses and Outcomes

| ID | Hypothesis | Outcome |
|----|------------|---------|
| H1 | Coordination intensity affects task quality, with full coordination producing the highest quality | Supported |
| H2 | Election strategy has a larger effect on election accuracy than on task quality | Supported |
| H3 | Election strategy and fleet size interact to affect election accuracy | Supported |
| H4 | Consistency review improves consistency scores | Rejected |
| H5 | Task quality and failure recovery are positively correlated | Supported |

Methodology

We employed a full factorial combinatorial design with six independent variables (3×3×3×2×3×3 = 486 experimental cells) and seven dependent variables. Each cell was replicated 500 times with deterministic seeding for bit-for-bit reproducibility, yielding 243,000 total trials.

Experimental Design

  • Design: Full factorial, 6 IVs × 7 DVs
  • Cells: 486 unique configurations
  • Replications: 500 per cell (deterministic seeding)
  • Total trials: 243,000
  • Main effect tests: 42
  • Interaction tests: 105 (15 factor pairs × 7 DVs)
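
The counts in this design can be reproduced with a few lines of Python. The sketch below is illustrative only: the factor names are placeholders (the abstract does not map level counts to named factors), and `trial_seed` is a hypothetical seeding scheme, not necessarily the one used in the study.

```python
from itertools import combinations, product

# Level counts per independent variable, as given in the methodology
# (3x3x3x2x3x3). Factor names are placeholders, not the study's own.
LEVELS = {"iv1": 3, "iv2": 3, "iv3": 3, "iv4": 2, "iv5": 3, "iv6": 3}
N_DVS = 7
REPLICATIONS = 500


def enumerate_cells(levels):
    """Return every combination of factor levels, one tuple per cell."""
    return list(product(*(range(n) for n in levels.values())))


def trial_seed(cell_index, replication, replications=REPLICATIONS):
    """Hypothetical scheme: a unique, reproducible integer seed per trial."""
    return cell_index * replications + replication


cells = enumerate_cells(LEVELS)
n_trials = len(cells) * REPLICATIONS
main_effect_tests = len(LEVELS) * N_DVS
factor_pairs = list(combinations(LEVELS, 2))
interaction_tests = len(factor_pairs) * N_DVS

print(len(cells), n_trials, main_effect_tests, interaction_tests)
# prints: 486 243000 42 105
```

The 42 main effect tests and 105 interaction tests together account for the 147 hypothesis tests reported above.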

Statistical Methods

  • Normality assessment: Shapiro-Wilk test (violated for all DVs at N = 243K)
  • Omnibus tests: Kruskal-Wallis H (non-parametric)
  • Pairwise comparisons: Mann-Whitney U
  • Interaction effects: Two-way ANOVA F-test (robust at this N)
  • Multiple comparison correction: Benjamini-Hochberg FDR at q = 0.001
  • Effect sizes: Omega-squared (bias-corrected), Cohen’s d with Hedges’ g correction
  • Post-hoc power: Achieved power = 1.0 across all tests
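
For readers unfamiliar with the Benjamini-Hochberg step-up procedure, the following is a minimal standard-library sketch. It is not the study's analysis code, and the p-values in the example are invented for illustration.

```python
def benjamini_hochberg(p_values, q=0.001):
    """Return (adjusted p-values, reject flags) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    reject = [p <= q for p in adjusted]
    return adjusted, reject


# Invented p-values, for illustration only.
p_vals = [0.0001, 0.0004, 0.03, 0.0000002, 0.2]
adj, rej = benjamini_hochberg(p_vals, q=0.001)
print(rej)  # prints: [True, True, False, True, False]
```

A test is declared significant when its adjusted p-value is at or below q, which controls the expected proportion of false discoveries among the rejected hypotheses.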

Key Findings

1. The Coordination Paradox

Coordination intensity has a medium effect on task quality (ω² = 0.072). The ordering is full > none > minimal: minimal coordination performs worse than no coordination at all. Half-measures introduce overhead without the structured feedback loops that make full coordination effective.

Kruskal-Wallis H(2) = 49,805.69, p < .001, N = 243,000, power = 1.0
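
The H statistic reported here compares rank sums across groups, with degrees of freedom equal to the number of groups minus one. A minimal sketch of the textbook formula, with tie correction, is shown below. The three groups are toy data, not study data, and a production analysis would more likely use an established statistics library.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H with tie correction; df = len(groups) - 1."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Toy data: three fully separated groups.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 4))  # prints: 7.2
```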

2. The Quality Buffer

Election strategy has a large effect on leader selection accuracy (ω² = 0.352) but a negligible effect on downstream task quality (ω² < 0.001). The system exhibits a quality buffer—downstream mechanisms compensate for suboptimal leader selection. Accuracy and quality are dissociated.

H(2) = 85,456.62, p < .001 (accuracy); H(2) = 415.93, p < .001, ω² = 0.0005 (quality)

3. Scale-Dependent Configuration

The largest interaction effect in the study is election strategy × fleet size on election accuracy (ω² = 0.163, large). Competence-based election maintains high accuracy across fleet sizes while simpler strategies degrade sharply. Configuration must be fleet-size-aware—no single strategy generalizes across scales.

F(4, 242,970) = 11,858.83, p < .001, ω² = 0.163

4. The Consistency Review Paradox

Consistency review reduces the consistency metric (ω² = 0.264, large). Without review, no inconsistencies are detected, producing a perfect score. With review, real deviations are measured. Absence of detection is not absence of defects. Naive optimization against this metric leads to the wrong conclusion.
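
The artifact can be illustrated with a toy model. The scoring function, defect rate, and detection probability below are hypothetical and chosen only to show the mechanism; they are not the study's consistency metric.

```python
import random


def consistency_score(n_outputs, defect_rate, review_enabled, detect_prob, rng):
    """Toy metric: 1 - (detected defects / outputs). Not the study's metric."""
    true_defects = sum(rng.random() < defect_rate for _ in range(n_outputs))
    if review_enabled:
        detected = sum(rng.random() < detect_prob for _ in range(true_defects))
    else:
        detected = 0  # nothing looks for defects, so none are recorded
    return 1 - detected / n_outputs, true_defects


rng = random.Random(42)
score_no_review, defects_no_review = consistency_score(1000, 0.2, False, 0.9, rng)
score_review, defects_review = consistency_score(1000, 0.2, True, 0.9, rng)

print(score_no_review, defects_no_review)
print(score_review, defects_review)
```

In this sketch both runs contain real defects at the same underlying rate, yet only the reviewed run scores below 1.0, because only it measures them.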

H(1) = 107,199.80, p < .001, N = 243,000, power = 1.0

5. Quality–Resilience Coupling

Task quality and failure recovery are moderately correlated (r = 0.485, p < .001). Configurations that invest in quality also provide the redundancy needed for failure recovery. Quality and resilience emerge from the same underlying mechanisms—they are not independent investment targets.

Pearson r = 0.485, 99.9% CI [0.479, 0.490], N = 243,000
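
An interval of this kind is commonly obtained with the Fisher z-transformation. The abstract does not state which method was used, so the sketch below is an assumption about the approach, using only the rounded values reported above.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.999):
    """Confidence interval for Pearson r via the Fisher z-transformation."""
    z = atanh(r)                      # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)              # standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


low, high = pearson_ci(0.485, 243_000)
print(f"{low:.4f} {high:.4f}")
```

Because r is entered here rounded to three decimals, the computed bounds can differ from the reported interval in the third decimal place.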

Top Interaction Effects

| Interaction | Outcome measure | ω² | Effect |
|-------------|-----------------|----|--------|
| Election strategy × fleet size | Election accuracy | 0.163 | Large |
| Coordination × election strategy | Delegation efficiency | 0.108 | Medium |
| Coordination × fleet size | Delegation efficiency | 0.077 | Medium |
| Consistency review × coordination | Consistency score | 0.014 | Small |
| Election strategy × fleet size | Delegation efficiency | 0.015 | Small |

All p < .001, and all adjusted p-values (p_adj) < .001 after Benjamini-Hochberg FDR correction. 105 total interaction tests conducted.

Cross-Variable Correlations

| Variable A | Variable B | r | Effect |
|------------|------------|---|--------|
| Task quality | Failure recovery | 0.485 | Medium |
| Delegation efficiency | Resource cost | 0.303 | Medium |
| Task quality | Delegation efficiency | 0.273 | Small |
| Election accuracy | All other DVs | < 0.02 | Null |

FDR-corrected across 21 pairwise tests (all pairs of the seven DVs). The near-zero correlation between election accuracy and every other DV is consistent with the quality buffer mechanism described in Finding 2.

Effect Size Interpretation

Following Cohen (1988):

| Measure | Small | Medium | Large |
|---------|-------|--------|-------|
| Omega-squared (ω²) | 0.01 | 0.06 | 0.14 |
| Cohen’s d | 0.20 | 0.50 | 0.80 |
| Pearson r | 0.10 | 0.30 | 0.50 |
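
These thresholds can be applied mechanically. The helper below is a small illustrative sketch; the function name and the "negligible" label for values under the small threshold are assumptions, chosen to match how the findings above are described.

```python
# Lower bounds for (small, medium, large), following Cohen (1988).
THRESHOLDS = {
    "omega_squared": (0.01, 0.06, 0.14),
    "cohens_d": (0.20, 0.50, 0.80),
    "pearson_r": (0.10, 0.30, 0.50),
}


def effect_label(measure, value):
    """Map an effect size to a qualitative label using the table above."""
    small, medium, large = THRESHOLDS[measure]
    magnitude = abs(value)
    if magnitude >= large:
        return "large"
    if magnitude >= medium:
        return "medium"
    if magnitude >= small:
        return "small"
    return "negligible"


print(effect_label("omega_squared", 0.072))   # prints: medium
print(effect_label("omega_squared", 0.352))   # prints: large
print(effect_label("omega_squared", 0.0005))  # prints: negligible
print(effect_label("pearson_r", 0.485))       # prints: medium
```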

Limitations

Acknowledged Constraints

  1. Synthetic quality model: Task quality is computed mathematically rather than generated by real LLM calls. The quality function captures agent-task match, coordination effects, and stochastic noise, but may miss emergent properties of real language generation.
  2. Fixed task complexity: All trials used moderate complexity. The coordination-quality tradeoff may differ for simple or highly complex tasks.
  3. Constant brain maturity: All agents used a developing maturity level. Mature knowledge graphs may alter the election accuracy findings.
  4. No real LLM heterogeneity: The model includes LLM capability matching but does not capture qualitative differences between real model outputs.

Ongoing & Planned Research

This study is the first in a series. Each acknowledged constraint above represents a planned follow-up study. All validation work is conducted in-house to maintain data integrity and protect proprietary methodology. Findings are applied directly to the product—subscribers benefit from every study through configuration improvements and capability updates.

Research Roadmap

  • Real-world validation against live LLM fleets (Claude, GPT-4, Gemini). Status: Planned
  • Task complexity as a varied factor (trivial through complex). Status: Planned
  • Longitudinal brain maturity study across 1,000+ task sessions. Status: Planned
  • Adversarial robustness testing for federated environments. Status: Planned
  • Cross-constraint interaction mapping at production scale. Status: Planned

Each study follows the same publication-grade methodology: full factorial design, deterministic seeding, non-parametric tests, FDR correction, and raw trial-level data preservation.

Opt-In Federated Insights

Subscribers who opt in to anonymized performance telemetry help accelerate this research—and benefit directly from the results. Aggregated patterns across real-world deployments inform configuration refinements that ship back to every participant. Your brain data stays private and encrypted on your machine. Only anonymized orchestration metrics (timing, accuracy, recovery rates) are shared, and only with explicit consent.

This is how Rebis improves: real-world signal from opted-in users, validated through the same rigorous methodology, applied as product updates for everyone.

References

Alberts, D. S. (2011). The agility advantage: A survival guide for complex enterprises and endeavors. DoD Command and Control Research Program.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Grassé, P.-P. (1959). La reconstruction du nid et les coordinations interindividuelles chez Bellicositermes natalensis et Cubitermes sp. Insectes Sociaux, 6(1), 41–80.

Ongaro, D., & Ousterhout, J. (2014). In search of an understandable consensus algorithm. USENIX Annual Technical Conference, 305–319.

Parker, L. E. (1998). ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation, 14(2), 220–240.

Smith, R. G. (1980). The Contract Net Protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12), 1104–1113.
