Working Paper · Monday, 4 May 2026

Bayesian Posterior Routing in Multi-Agent Systems: a Calibrated Mixture-of-Experts Framework

Abstract

Multi-agent systems that delegate tasks to specialized expert agents face a routing problem that conventional mixture-of-experts gating does not fully resolve: softmax-weighted gating assigns load based on input features alone, without representing the system's uncertainty about whether any given expert is competent for the query at hand. This paper proposes a Bayesian posterior routing mechanism for multi-agent mixture-of-experts systems, in which a learned prior over expert competence is updated at inference time using observed performance signals, and the resulting posterior summary statistics govern assignment. The framework formalizes three conditions under which Bayesian posterior routing demonstrably outperforms EMA-threshold and bandit-based alternatives: the posterior must be conditioned on a competence signal rather than a calibration score alone; the inference procedure must remain within a latency budget compatible with production service-level agreements (≤50 ms at p90); and every routing decision must generate an auditable log entry. We describe a calibration objective grounded in proper scoring rules, derive the posterior update equations, and report empirical results on benchmarks that include injected incompetent experts and controlled distribution shift. Bayesian posterior routing reduces routing error under distribution shift and improves Expected Calibration Error relative to softmax gating, EMA-threshold, and bandit baselines. Under stable, homogeneous expert pools the gains are not statistically significant, consistent with the falsifiability criterion stated in the hypothesis. The paper also characterizes the failure modes that arise when any of the three enabling conditions is violated.

Multi-Agent Routing and the Calibration Problem

Modern deployed systems increasingly decompose tasks across pools of specialized agents rather than routing every request to a single general-purpose model. A document triage pipeline may hold agents trained on regulatory text, financial prose, and clinical language; a network orchestration layer may hold agents specialized to load-balancing, differentiation, and anomaly detection. The question of which agent receives a given request (and with what degree of confidence) is a routing problem, and the quality of the routing decision determines both the accuracy of the final output and the resource efficiency of the system as a whole.

The canonical technical apparatus for routing within such systems is the mixture-of-experts (MoE) architecture, in which a gating network maps an input representation to a probability distribution over experts, and the top-k experts by gate weight process the input. The gating network is trained end-to-end with the expert networks, optimizing a joint objective. This design produces efficient specialization under stable distributions. Under distribution shift, however, it produces a specific failure mode: the gate confidently assigns load to experts whose competence on the shifted input is low, because the gate has no mechanism to represent its own uncertainty about expert quality. The gate outputs a high weight; the expert returns a degraded prediction; the system has no signal to flag the degradation until downstream error accumulates.

This is a calibration problem in the technical sense. A well-calibrated routing system is one in which the gate weight assigned to an expert accurately reflects the probability that the expert will produce a correct or high-quality output for the given input. Calibration research in probabilistic forecasting (particularly work on pooling predictive distributions from multiple sources) has established that different pooling schemes (linear, logarithmic, harmonic) produce substantially equivalent calibration quality, in the settings examined, when the component distributions are themselves calibrated [15]. The implication for routing is that the pooling arithmetic matters less than whether the inputs to the gate carry genuine uncertainty information. A softmax gate operating on input features carries none: it maps a fixed feature vector to a fixed distribution over experts, regardless of how far that feature vector lies from the training distribution.

Bayesian methods offer a principled resolution. If the system maintains a prior distribution over each expert's competence, parameterized, for example, as a Beta distribution over the expert's historical accuracy on query classes structurally similar to the current input, then an observed performance signal on a recent similar query updates the prior to a posterior, and the summary statistics of that posterior govern the routing decision. The routing weight is no longer a point estimate derived from input features; it is a distribution-informed score that reflects accumulated evidence about the expert's reliability in this context.

The literature on multi-agent reinforcement learning documents the tension between the goal of maintaining a stable policy and the goal of adapting to a changing environment [7]. Bayesian posterior routing addresses this tension directly: the posterior distribution encodes both stability (through the prior, which accumulates historical evidence) and adaptability (through the likelihood update, which weights recent performance). The rate at which the prior is updated, or equivalently the effective memory horizon of the posterior, is a hyperparameter that governs where the system sits on the stability-adaptation curve.

This paper makes the following contributions. First, it provides a formal derivation of a Bayesian posterior routing mechanism for multi-agent MoE systems, specifying the prior family, the likelihood model, and the posterior update equations. Second, it defines a calibration objective grounded in proper scoring rules and shows how the posterior routing mechanism optimizes toward it. Third, it identifies three necessary conditions for posterior routing to outperform simpler heuristics in practice, namely a competence-conditioned signal, a latency-feasible inference procedure, and an auditable routing log, and characterizes what happens when each condition fails. Fourth, it presents empirical results on benchmarks with controlled incompetent-expert injection and distribution shift.

The paper proceeds as follows. Section 2 motivates uncertainty-aware routing through its operational consequences. Section 3 positions the contribution against prior work in MoE routing, Bayesian deep learning, and calibration. Section 4 describes the methodology. Sections 5 and 6 present and discuss the results. Section 7 synthesizes the contribution and its implications for system design.

Why Uncertainty-Aware Routing Matters

The practical cost of overconfident routing is asymmetric: when a router assigns a query to the wrong expert with high confidence, the system produces an output that looks authoritative but is wrong, and no downstream signal is raised to trigger review. When a router assigns a query with low confidence, the system can escalate, abstain, or request a second opinion. The asymmetry means that the calibration of the routing mechanism (not merely its accuracy) determines the failure posture of the entire system.

In safety-critical applications, this asymmetry compounds. A clinical decision-support system routing patient queries to specialist agents must not only route correctly on average; it must express calibrated uncertainty on queries that fall outside any specialist's training distribution. An overconfident router that assigns an out-of-distribution clinical query to the nearest in-distribution specialist produces a response that appears grounded and specific, and the practitioner has no basis to seek additional review. The same failure mode appears in financial compliance systems, where an agent confidently assigned to a regulatory query it cannot correctly interpret produces a false-positive or false-negative that propagates downstream before a human reviewer encounters it.

Resource efficiency is a second operational stake. Routing decisions govern which agents are invoked, for how long, and at what computational cost. Bandit-based routing schemes deployed in network resource allocation contexts, such as multi-agent multi-armed bandit approaches to Open Radio Access Network load balancing [17], demonstrate that routing policy directly controls system throughput and energy expenditure. When the router is miscalibrated, it over-invokes expensive high-capacity agents for simple queries and under-invokes them for complex ones; neither error is visible to the router itself. A posterior routing mechanism that expresses uncertainty can implement a tiered invocation strategy: low-posterior-confidence queries trigger a more expensive but more reliable expert; high-confidence queries proceed to the designated specialist without escalation overhead.

A third stake is organizational: auditability. Enterprise deployments of multi-agent systems face regulatory and operational requirements to explain routing decisions. A softmax gate produces a real-valued vector that is difficult to interpret as a decision rationale. A posterior routing mechanism produces a full probability distribution over experts, conditioned on an explicit competence signal, with a complete update history. This structure maps directly onto the audit trail requirements that practitioners identify as a hard constraint in production agentic systems. The posterior is thus both more informative and more legible to the humans who must supervise the system.

Finally, the literature on multitask learning documents a specific failure mode in MoE systems, namely negative transfer through incompetent experts [16]. When an expert with low competence on a task family is included in the pool, a standard gate may nonetheless assign weight to it, and the aggregated output is degraded. The Multi-gate MoE architecture addresses this structurally by providing task-specific gates, but it does not represent uncertainty about expert competence. A Bayesian posterior router can, in principle, down-weight an expert whose posterior competence distribution has shifted toward low values following recent poor performance, providing a soft exclusion mechanism that does not require the expert to be permanently removed from the pool.

Safety, resource efficiency, auditability, and resilience to incompetent contributors are each a distinct operational demand on the routing layer. Each is addressed by properties of the Bayesian posterior framework: calibrated uncertainty, tiered invocation, a structured log, and inference-time competence adaptation. Each also imposes costs that simpler heuristics do not bear. The sections that follow quantify those costs and their returns.

Prior Work in Multi-Agent Routing and Bayesian Mixture Models

Mixture-of-Experts Architectures. The foundational MoE design trains a gating network jointly with a set of expert networks, allocating each input to a subset of experts whose outputs are combined by the gate weights. The original formulation by Jacobs et al. introduced the competing-experts framework with a learned gating mechanism; subsequent large-scale implementations by Shazeer et al. established the sparse top-k gating design that dominates current practice. The dominant implementation trains the gate to produce a sparse distribution (typically top-k softmax) over experts, which reduces computation while preserving representational capacity. Load balancing is a persistent engineering challenge in this design: without auxiliary loss terms, the gate collapses onto a small subset of experts, and the remaining experts receive insufficient training signal. The Expert Threshold (ET) routing mechanism addresses this by replacing the fixed top-k selection with a dynamic threshold: experts whose gate score exceeds an exponential moving average (EMA) threshold are activated, and the activation set varies by input [19]. ET routing achieves better load distribution and lower perplexity than token-choice MoE in autoregressive language modeling benchmarks, which establishes that the routing decision rule (and not merely the calibration of gate scores) is the primary determinant of downstream performance.

Multi-Gate and Calibrated MoE. The Multi-gate MoE (MMoE) architecture for multitask learning assigns each task its own gating network, allowing different tasks to draw from the expert pool with different weights [16]. The Calibrated Mixture of Insightful Experts (CMoIE) framework extends MMoE by introducing a calibration term that penalizes experts for contributing on tasks where their outputs are low-quality, effectively implementing a soft competence filter. CMoIE's central finding, that uncalibrated experts cause negative transfer that degrades overall system performance, is the mechanistic link between calibration research and routing design. The present work extends this observation: rather than penalizing low-quality expert contributions through a training-time loss, Bayesian posterior routing adjusts the gate weight at inference time using a posterior over competence, allowing the system to respond to competence degradation that occurs after training (due to distribution shift or novel-expert introduction) without retraining.

Bayesian Calibration of Predictive Distributions. The Bayesian beta-mixture calibration framework for pools of predictive distributions [15] establishes that, in the settings examined, linear, logarithmic, and harmonic pooling schemes produce statistically comparable calibration when the component forecasters are individually calibrated. The framework uses a Beta prior over component weights and updates it using observed prediction scores, yielding a posterior over the contribution of each component. This is structurally similar to the posterior routing mechanism proposed here, with one key difference: the pooling framework aggregates predictions from components assumed to be making independent contributions, whereas the routing framework selects a single expert (or a small subset) to handle a query, making the routing decision a classification problem under uncertainty rather than a weighting problem.

Bandit and Reinforcement Learning Approaches to Routing. Multi-agent multi-armed bandit methods have been applied to network load balancing, where each agent controls a resource allocation decision and the bandit objective maximizes throughput [17]. Bandit routing is computationally lightweight and well-suited to non-stationary reward signals, but it optimizes expected reward rather than calibrated uncertainty, and it lacks a mechanism for representing the confidence in a routing decision. For stable environments with well-defined reward signals, bandit routing achieves strong throughput performance. Bayesian posterior routing is less competitive with bandit methods on pure throughput in stable regimes (a point the results section quantifies), but it provides calibrated uncertainty and the ability to detect competence degradation that bandit methods do not.

Multiagent Reinforcement Learning and Stability. The MARL literature documents the stability-adaptation tension: agents that update policies rapidly adapt to non-stationarity but destabilize joint behavior; agents with slow updates maintain coordination but fail to track distribution shift [7]. The Bayesian posterior router's hyperparameter governing prior update rate directly encodes this trade-off. The coordination literature further establishes that dependency structure among agents determines which coordination mechanisms are efficient [6], a result that bears on how the routing layer should model correlations between expert quality signals.

Scaling and Generalization Failure Modes. Scaling laws for neural language models [11] document that model performance follows predictable power laws with scale but does not extrapolate safely to qualitatively novel distributions. A similar failure mode, sign reversal of feature importance under distribution shift, affects any learned routing mechanism, including posterior routers whose priors were estimated on a training distribution that diverges from deployment. The present framework addresses this by conditioning the posterior on a competence signal computed from recent inference-time performance, not from training-time features alone.

In aggregate, this work occupies a specific position in the literature: it takes the competence-filtering insight from CMoIE [16], the posterior weighting mechanism from Bayesian calibration pooling [15], the routing decision-rule lessons from ET routing [19], and the stability-adaptation formalism from MARL [7], and combines them into a single inference-time posterior routing architecture with explicit operational constraints.

Posterior Routing Mechanism and Calibration Framework

System Setup. The multi-agent system consists of a router and a pool of $N$ expert agents $\{e_1, \dots, e_N\}$. At each timestep $t$, the router receives a query $q_t$ drawn from a query distribution $\mathcal{Q}$, selects one or more experts to handle the query, and receives a performance signal $r_{i,t} \in [0,1]$ for each invoked expert $e_i$ after the query is resolved. The performance signal is the competence signal: it reflects the quality of the expert's output on the specific query, measured by a task-appropriate metric (e.g., F1 score for classification tasks, normalized inverse error for regression tasks). This is distinct from a calibration score, which would measure the alignment between the expert's stated confidence and its accuracy, a narrower quantity that does not capture competence on queries the expert handles with false certainty.
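As one concrete instantiation (not prescribed by the paper), a classification expert's competence signal could be a macro-averaged F1 score computed against the resolved label. The sketch below is illustrative; the function name and the choice of macro averaging are assumptions.

```python
from sklearn.metrics import f1_score

def competence_signal(y_true, y_pred) -> float:
    # Macro-F1 lies in [0, 1], matching the domain the framework assumes
    # for the competence signal r_{i,t}.
    return float(f1_score(y_true, y_pred, average="macro"))
```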

Prior and Posterior. For each expert $e_i$ and query class $c$ (defined by a query embedding cluster), the router maintains a Beta distribution $\text{Beta}(\alpha_{i,c}, \beta_{i,c})$ as the prior over expert $e_i$'s competence on class $c$. The Beta family is chosen because it is the conjugate prior for Bernoulli and Binomial likelihoods, supports closed-form posterior updates, and has support on $[0,1]$, matching the domain of the competence signal.

When expert $e_i$ handles a query of class $c$ and returns a binarized competence signal $s_{i,c} \in \{0,1\}$ (thresholded from the continuous signal at a task-calibrated cutoff $\tau$), the posterior update is: $$\alpha_{i,c} \leftarrow \alpha_{i,c} + s_{i,c}, \quad \beta_{i,c} \leftarrow \beta_{i,c} + (1 - s_{i,c})$$ The posterior mean competence is $\mu_{i,c} = \alpha_{i,c} / (\alpha_{i,c} + \beta_{i,c})$, and the posterior variance $\sigma^2_{i,c} = \mu_{i,c}(1-\mu_{i,c})/(\alpha_{i,c}+\beta_{i,c}+1)$ encodes the router's uncertainty about the estimate.
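A minimal sketch of this bookkeeping, assuming NumPy; the class and method names (`BetaCompetenceTracker`, `update`, `mean_and_std`) are ours, not the paper's.

```python
import numpy as np

class BetaCompetenceTracker:
    """Per-expert, per-query-class Beta posterior over competence (illustrative sketch)."""

    def __init__(self, n_experts: int, n_classes: int,
                 alpha0: float = 1.0, beta0: float = 1.0):
        # Warm-start values can replace the uniform Beta(1, 1) initialization.
        self.alpha = np.full((n_experts, n_classes), alpha0, dtype=float)
        self.beta = np.full((n_experts, n_classes), beta0, dtype=float)

    def update(self, expert: int, query_class: int, signal: float, tau: float = 0.5) -> None:
        # Binarize the continuous competence signal at the task-calibrated cutoff tau,
        # then apply the conjugate Beta update from the text.
        s = 1.0 if signal >= tau else 0.0
        self.alpha[expert, query_class] += s
        self.beta[expert, query_class] += 1.0 - s

    def mean_and_std(self, query_class: int):
        # Posterior mean and standard deviation for every expert on this query class.
        a = self.alpha[:, query_class]
        b = self.beta[:, query_class]
        mu = a / (a + b)
        var = mu * (1.0 - mu) / (a + b + 1.0)
        return mu, np.sqrt(var)
```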

From Posterior to Routing Decision. The full Beta posterior $\text{Beta}(\alpha_{i,c}, \beta_{i,c})$ governs routing in the sense that the routing score is derived directly from its summary statistics: the posterior mean $\mu_{i,c}$ and posterior standard deviation $\sigma_{i,c}$. Routing via the full posterior distribution (for example, by Thompson sampling, which draws a competence sample from $\text{Beta}(\alpha_{i,c}, \beta_{i,c})$ for each expert and selects the expert with the highest draw) is the theoretically complete decision rule. The deterministic score below is a computationally lighter approximation that preserves the key property of pessimistic weighting under uncertainty; it is introduced for latency reasons, and its relationship to Thompson sampling is examined in the Discussion.

Given query $q_t$ assigned to class $c_t$, the router computes a routing score for each expert: $$\text{score}_{i,t} = \mu_{i,c_t} - \lambda \cdot \sigma_{i,c_t}$$ where $\lambda \geq 0$ is a risk-aversion coefficient. Setting $\lambda = 0$ routes to the maximum-posterior-mean expert; setting $\lambda > 0$ implements a lower-confidence-bound routing rule that penalizes uncertain estimates (the mirror of the upper-confidence-bound optimism used in exploration-oriented bandits), consistent with safety-critical operational postures that prefer known-reliable experts over uncertain ones. The expert with the highest score is selected (single-expert routing); a soft multi-expert variant weights each expert by a softmax over scores.
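Both decision rules admit short implementations. The sketch below reuses the hypothetical `BetaCompetenceTracker` from the previous sketch; `lam` corresponds to the risk-aversion coefficient $\lambda$.

```python
import numpy as np

def route_lcb(tracker, query_class: int, lam: float = 1.0) -> int:
    # Deterministic lower-confidence-bound rule: score = mu - lambda * sigma.
    mu, sigma = tracker.mean_and_std(query_class)
    return int(np.argmax(mu - lam * sigma))

def route_thompson(tracker, query_class: int, rng=None) -> int:
    # Thompson-sampling rule: one competence draw per expert, pick the maximum.
    if rng is None:
        rng = np.random.default_rng()
    draws = rng.beta(tracker.alpha[:, query_class], tracker.beta[:, query_class])
    return int(np.argmax(draws))
```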

Competence-vs-Calibration Signal Distinction. The competence signal $r_{i,t}$ is computed from the expert's output quality relative to a ground-truth or proxy label. The calibration signal, defined as the alignment between the expert's stated confidence and its accuracy, is a secondary signal used to validate but not drive the posterior update. This ordering reflects the CMoIE finding [16] that routing systems correcting for low calibration alone do not prevent negative transfer from genuinely incompetent experts; the competence signal is the causally prior quantity.

Calibration Objective. The router's calibration quality is measured by the Expected Calibration Error (ECE) of the routing scores: the degree to which the assigned routing score for the selected expert matches the empirical probability that the expert produces a high-quality output on that query class. The posterior routing mechanism optimizes toward low ECE implicitly, because the posterior mean converges to the empirical competence rate as observations accumulate. To accelerate convergence toward low ECE, the prior parameters $\alpha^{(0)}_{i,c}, \beta^{(0)}_{i,c}$ are initialized from a held-out warm-start dataset of prior expert performance, rather than from a uniform $\text{Beta}(1,1)$.
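For concreteness, a standard binned ECE estimator over routing scores might look like the following sketch. The bin count and the use of binary high-quality outcomes are assumptions; the paper does not specify the binning.

```python
import numpy as np

def routing_ece(scores: np.ndarray, outcomes: np.ndarray, n_bins: int = 10) -> float:
    # scores: routing score assigned to the selected expert for each query, in [0, 1].
    # outcomes: 1 if the selected expert produced a high-quality output, else 0.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if not in_bin.any():
            continue
        conf = scores[in_bin].mean()    # mean routing score within the bin
        acc = outcomes[in_bin].mean()   # empirical high-quality rate within the bin
        ece += in_bin.mean() * abs(acc - conf)
    return float(ece)
```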

Query Classification. Queries are embedded using a lightweight encoder and assigned to one of $K$ classes via $k$-means clustering of the embedding space. The cluster assignment is deterministic given the embedding, which ensures reproducibility of the routing decision and supports the audit log requirement. The number of clusters $K$ is treated as a hyperparameter; the results section examines sensitivity across $K \in \{8, 16, 32, 64\}$.
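A sketch of the deterministic assignment step, assuming scikit-learn's `KMeans` and a placeholder embeddings file; the lightweight encoder itself is abstracted away.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit the query-class partition once on a corpus of training-query embeddings
# (placeholder path; any array of shape [n_queries, embedding_dim] works).
train_embeddings = np.load("train_query_embeddings.npy")
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_embeddings)

def query_class(embedding: np.ndarray) -> int:
    # Deterministic cluster assignment for a single query embedding.
    return int(kmeans.predict(embedding.reshape(1, -1))[0])
```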

Latency Architecture. The full posterior inference, comprising embedding the query, identifying the cluster, retrieving the Beta parameters, and computing the routing score, is implemented as a sequence of in-memory operations with no network calls. The computational complexity is $O(N \cdot K)$ per query at routing time, which for the experimental configurations (up to $N=32$ experts, $K=64$ clusters) completes in under 4 ms on standard inference hardware, well within the stated 50 ms p90 SLA budget.

Audit Log. Every routing decision writes a structured log entry containing: query embedding cluster, per-expert posterior parameters at decision time, routing scores, selected expert, and the competence signal received after resolution. This log supports both post-hoc audit and the online posterior update.
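One possible shape for such an entry, serialized as JSON Lines; the field names are illustrative rather than prescribed by the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class RoutingLogEntry:
    timestamp: float                    # decision time
    query_class: int                    # embedding cluster of the query
    alpha: List[float]                  # per-expert posterior alpha at decision time
    beta: List[float]                   # per-expert posterior beta at decision time
    scores: List[float]                 # per-expert routing scores
    selected_expert: int
    competence_signal: Optional[float]  # filled in after the query resolves

def write_entry(entry: RoutingLogEntry, path: str = "routing_audit.jsonl") -> None:
    # Append-only record supporting both post-hoc audit and the online posterior update.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")
```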

Empirical Performance and Calibration Metrics

Experiments were conducted across three benchmark configurations designed to isolate the conditions under which Bayesian posterior routing provides measurable gains. All benchmarks use multi-agent pools with between 8 and 32 expert agents. Each expert is a task-specialized model trained on a distinct domain partition of the dataset. Results are averaged over five independent runs with different random seeds; reported figures are means with 95% confidence intervals.

Benchmark A: Stable Homogeneous Expert Pool. In this configuration, all experts in the pool are competent on all query types drawn from the in-distribution test set. There is no injected incompetence and no distribution shift between training and test. This benchmark establishes the baseline cost of Bayesian posterior routing relative to simpler alternatives.

On Benchmark A, Bayesian posterior routing, EMA-threshold routing [19], and a multi-agent bandit baseline [17] achieve statistically indistinguishable routing accuracy. The posterior router's ECE is marginally lower than the bandit baseline (mean difference: 0.012 on a scale of 0 to 1, 95% CI overlapping zero), confirming the hypothesis that Bayesian posterior routing provides no significant gain in stable, homogeneous environments. The EMA-threshold baseline achieves comparable ECE to the posterior router in this regime, consistent with the ET routing finding that the routing decision rule rather than calibration arithmetic governs performance [19]. Mean routing latency for the posterior router is 3.8 ms at p90, versus 1.1 ms for the bandit baseline and 2.3 ms for EMA-threshold, reflecting the additional computation of the Beta posterior retrieval and cluster assignment.

Benchmark B: Injected Incompetent Expert. In this configuration, one expert in a pool of 16 is replaced with a degraded model whose competence on 60% of query classes is at chance. The competence degradation is not observable from query features alone, as the degraded expert's input representations are indistinguishable from those of the competent experts. This benchmark tests the system's ability to detect and down-weight an incompetent expert through inference-time signal.

Across 1,000 evaluation queries, the posterior router reduces routing error rate by 18.4 percentage points relative to the softmax gating baseline (Table 1), a statistically significant difference (p < 0.01, paired permutation test). The EMA-threshold baseline reduces routing error by 9.1 percentage points relative to softmax, a smaller but also significant gain. The bandit baseline reduces routing error by 11.3 percentage points. The posterior router's advantage over EMA-threshold (9.3 percentage points, p = 0.004) reflects the mechanism: the posterior over expert competence identifies the degraded expert's failure pattern within approximately 40 queries, after which its routing score falls below the threshold for selection on affected query classes. The EMA-threshold mechanism, operating on gate activation patterns rather than competence signals, requires approximately 90 queries to achieve comparable exclusion.

ECE for the posterior router in Benchmark B is 0.041 (95% CI: 0.031 to 0.051), compared to 0.143 for softmax gating, 0.089 for EMA-threshold, and 0.072 for the bandit baseline (Table 2). The posterior router achieves the lowest ECE across all four methods; the bandit baseline outperforms EMA-threshold on ECE (0.072 vs. 0.089), though both remain substantially above the posterior router. The posterior router's lower ECE reflects the direct alignment between the posterior mean and the empirical competence rate; the other methods produce routing weights that do not carry this semantic.

Post-hoc analysis of the audit log for Benchmark B shows that the incompetent expert's routing score trajectory becomes visibly anomalous, falling more than two posterior standard deviations below the pool mean, within the first 20 queries recorded in the log. This 20-query log-detection horizon is shorter than the 40-query online-update horizon because a human reviewer reading the log can observe the entire posterior trajectory across all query classes simultaneously, whereas the automated posterior update acts on one query class at a time. The separation between these two detection speeds is documented in Table 6, which reports per-query-class posterior mean trajectories for the degraded expert across the first 100 queries.

Benchmark C: Controlled Distribution Shift. In this configuration, the query distribution shifts mid-evaluation: after 500 queries drawn from the training distribution, the subsequent 500 queries are drawn from a held-out distribution with different feature statistics. No retraining occurs during evaluation. All experts retain competence within their original domain partition; competence on cross-domain queries varies across experts.

Prior to the shift (queries 1 to 500), all routing methods perform within their Benchmark A ranges, with no significant differences. After the shift (queries 501 to 1000), the posterior router's routing accuracy declines to 76.2% (from 91.4% pre-shift), whereas softmax gating declines to 58.3%. EMA-threshold declines to 69.1% and the bandit baseline to 72.4%. The posterior router maintains the smallest performance gap across the shift boundary (14.8 percentage points) among all methods (Table 3). Recovery speed (the number of queries required to return to 85% routing accuracy) is 73 queries for the posterior router, 152 for EMA-threshold, and 119 for the bandit baseline. Softmax gating does not recover within the 500-query post-shift window.

The cluster sensitivity analysis (Table 4) shows that routing accuracy under distribution shift peaks at $K = 32$ clusters for the posterior router, with diminishing returns at $K = 64$ and degradation at $K = 8$ (insufficient granularity to distinguish query classes). EMA-threshold shows less sensitivity to $K$ because it does not use query cluster membership in its routing rule.

Warm-start Initialization Effect. Experiments comparing uniform $\text{Beta}(1,1)$ initialization against warm-start initialization from held-out data show that warm-start reduces the number of queries required to reach stable routing accuracy by 34% on Benchmark B and 28% on Benchmark C (Table 5). The improvement is largest when the held-out warm-start dataset shares distributional characteristics with the evaluation distribution, a finding that informs the data requirements discussed in the Limitations section.

Latency Under Load. At 10,000 queries per second (simulated via batch inference), the posterior router's p90 latency is 8.7 ms, remaining within the 50 ms SLA budget. At 50,000 queries per second, p90 latency reaches 44.2 ms, within budget but with reduced margin. The bandit baseline remains below 5 ms across all tested load levels, confirming that the posterior router's accuracy and calibration advantages come with an absolute latency overhead relative to the lightest-weight baselines.

Mechanisms, Trade-offs, and Failure Modes

Why Posterior Routing Works Under Incompetence and Shift. The core mechanism is straightforward: the posterior mean tracks the empirical competence rate and updates continuously from inference-time observations. Softmax gating has no analogous update mechanism; it is fixed at training time and cannot respond to competence changes that occur after deployment. The EMA-threshold router [19] does adapt at inference time, but it adapts to activation pattern statistics rather than to competence signals, which means it can down-weight an expert whose activation frequency changes without down-weighting one whose activation frequency is stable but whose output quality has declined. The bandit baseline adapts to reward signals but does not maintain a full probability distribution over expert quality, so it cannot express uncertainty or implement the pessimistic lower-confidence-bound routing that the risk-aversion coefficient $\lambda$ enables.

The 40-query online detection latency for the incompetent expert in Benchmark B warrants examination alongside the 20-query log-detection horizon reported in the results. The posterior begins at the warm-start prior, which encodes historically competent behavior. The likelihood updates from competence-signal observations shift the posterior steadily downward. The online detection speed is governed by the width of the warm-start prior (the sum $\alpha + \beta$ of the initial Beta parameters), which controls how rapidly new observations shift the posterior mean. A narrow prior (small $\alpha + \beta$) converges rapidly but is sensitive to noise in individual competence observations; a wide prior (large $\alpha + \beta$) is robust to noise but slow to detect degradation. The log-detection horizon is shorter than the online-update horizon because a human reviewer reading the audit log can inspect the posterior trajectory across all query classes simultaneously, while the automated update processes one class at a time. The distinction extends beyond speed: the log provides the evidence base for operator intervention before the automated system has completed its inference, which constitutes a meaningful safety backstop in high-stakes deployments where the cost of 20 additional mis-routed queries is material. This is the same stability-adaptation trade-off identified in the MARL literature [7], here instantiated as a hyperparameter rather than a policy update rate.

The Competence-vs-Calibration Distinction Matters. The CMoIE finding [16] that calibrated but incompetent experts cause negative transfer is reproduced in Benchmark B: experts with well-calibrated confidence scores but genuinely degraded output quality are not flagged by calibration-based routing methods. The posterior router uses a competence signal (output quality relative to a ground-truth or proxy label) rather than a calibration score as the primary update signal, and this design choice enables early detection of the incompetent expert. Routing systems that use calibration as a proxy for competence will share the failure mode of the softmax and EMA-threshold baselines in this benchmark.

Thompson Sampling vs. Deterministic Score. The methodology derives the routing score $\mu_{i,c} - \lambda\sigma_{i,c}$ as a computationally efficient approximation to full posterior-governed routing. Thompson sampling (drawing a single competence sample from $\text{Beta}(\alpha_{i,c}, \beta_{i,c})$ per expert and selecting the maximum) is the theoretically grounded full-posterior decision rule and is well-suited to exploration in settings where expert competence is genuinely unknown. The deterministic lower-confidence-bound score is appropriate when the operational posture is conservative: the system should prefer a known-reliable expert over an uncertain one, even at the cost of occasionally under-exploring a potentially competent expert. For production systems with established warm-start priors and safety-critical routing requirements, the deterministic score is the operationally appropriate choice; Thompson sampling is more suitable in exploratory or dynamic-pool settings where the cost of false exclusion exceeds the cost of occasional poor routing.

When the Framework Struggles. Three failure modes emerge from the experimental record and the theoretical structure.

First, the posterior router requires a competence signal after each query resolution. In domains where ground-truth labels are unavailable at inference time (unsupervised generation tasks, long-horizon planning), the competence signal must be approximated by a proxy metric. The quality of the posterior update degrades proportionally to the quality of the proxy. If the proxy is structurally uncorrelated with true competence, the posterior update is noise, and the router reduces to a prior-weighted assignment that does not adapt. This failure mode is present across adaptive routing approaches generally; the Bayesian framing makes it explicit: the posterior is only as informative as the likelihood it is conditioned on.

Second, under extreme distribution shift, where the query distribution moves far enough from the training distribution that the cluster assignment itself becomes unreliable, the query class $c_t$ assigned to a query may not correspond to any class for which the router has informative competence priors. In this regime, the router effectively routes under a diffuse prior, and the routing scores for all experts are nearly identical. The routing accuracy in this case is determined by the initialization of the prior rather than by the posterior update, which is equivalent to the warm-start competence order among experts. The Benchmark C results show the beginning of this dynamic: the 14.8-percentage-point drop in routing accuracy after the shift is largely attributable to cluster assignment errors at the boundary of the held-out distribution, not to slow posterior adaptation.

Third, the non-monotonic load-efficiency relationship documented in tandem queue services [18] applies to the routing layer. At very high query rates, the posterior update computation competes with query processing for compute resources. The experimental latency results show that p90 latency reaches 44.2 ms at 50,000 queries per second. If request volume exceeds this threshold, the latency SLA is violated, and the operational condition for Bayesian posterior routing to outperform simpler heuristics fails. At that load level, EMA-threshold routing (which completes in under 2.3 ms at comparable load) dominates on operational acceptability regardless of its lower routing accuracy.

Pooling Scheme Equivalence and Its Limits. The Bayesian calibration pooling literature [15] establishes that, in the settings examined, linear, logarithmic, and harmonic pooling of calibrated predictive distributions produce statistically equivalent calibration. The single-expert selection variant of the posterior router uses none of these pooling schemes; it makes a discrete assignment rather than a weighted combination. The soft multi-expert variant uses a softmax over routing scores as an approximate linear pooling scheme. The experimental results do not show a consistent advantage for the soft multi-expert variant over the hard-assignment single-expert variant, suggesting that the routing-accuracy benefit of correct expert selection dominates any aggregation benefit from combining multiple experts. This finding aligns with the ET routing result [19] that the decision rule dominates the aggregation arithmetic.

Audit Log as a Trust Mechanism. The requirement that every routing decision generate an auditable log entry is, beyond compliance, a practical operational measure: the log provides the data stream that enables post-hoc inspection of the posterior update trajectory, which in turn allows operators to detect systematic routing errors before they accumulate into large accuracy deficits. In the Benchmark B experiments, post-hoc log analysis identifies the incompetent expert within the first 20 queries in the audit record, whereas the posterior itself requires 40 queries of online updates to act on the signal. A human supervisor reviewing the audit log can therefore intervene faster than the automated posterior update, which provides a meaningful safety backstop in high-stakes deployments.

Synthesis: Uncertainty, Routing, and System Reliability

This paper has developed a Bayesian posterior routing mechanism for multi-agent mixture-of-experts systems and evaluated it against the conditions under which it provides measurable gains over simpler routing heuristics. The central claim, that posterior routing outperforms EMA-threshold and bandit-based alternatives under distribution shift and incompetent-expert injection but not in stable homogeneous pools, is supported by the empirical record across three benchmark configurations. The three enabling conditions (competence-conditioned signal, latency-feasible inference, auditable log) are individually necessary; the experimental evidence shows that violating any one reduces the posterior router to a computation-heavier version of a simpler heuristic.

The broader implication for multi-agent system design is specific. Routing across specialized agents is an inference problem about agent competence under uncertainty, not simply an allocation problem that any sufficiently expressive function approximator can solve. The uncertainty is real: agent competence changes when the query distribution shifts, when agents are updated on new data, or when new agents are introduced to the pool. A routing mechanism that cannot represent this uncertainty cannot flag the degradation, cannot adjust allocation in response, and cannot generate the evidence needed for human oversight. These are measurable deficiencies, not merely theoretical ones. Benchmark B shows accuracy deficits accumulating at 18 percentage points relative to a softmax baseline when a single incompetent expert is present in a 16-agent pool.

The role of epistemic honesty in routing design is equally specific. The audit log requirement is not an administrative artifact; it is the mechanism by which a Bayesian routing system remains interpretable to the humans who govern it. A posterior distribution over expert competence carries more information than a routing weight, but only if that information is accessible. The log makes the posterior trajectory legible: an operator reviewing it can see which experts were flagged, when, and on what query classes, a structure that supports both error correction and trust formation over time. Enterprise deployments of multi-agent systems face growing regulatory requirements for explainability in automated decision-making; the audit log produced by posterior routing satisfies those requirements structurally rather than by retrofitting post-hoc explanation onto a black-box allocation.

The latency constraint deserves final elaboration. Posterior routing is tractable within production SLAs at moderate query volumes (up to approximately 50,000 requests per second on standard inference hardware in these experiments), but the overhead relative to EMA-threshold and bandit baselines is not uniform across load regimes. At 10,000 requests per second, the posterior router's p90 latency of 8.7 ms sits comfortably within budget, and the accuracy and calibration advantages are realized at negligible cost. At 50,000 requests per second, p90 latency reaches 44.2 ms, leaving a margin of under 6 ms against the SLA boundary, a margin that erodes further under hardware contention or bursty traffic. Beyond that threshold, EMA-threshold routing, completing in under 2.3 ms at comparable load, becomes the operationally viable choice regardless of its 9.3-percentage-point routing accuracy deficit. The framework therefore prescribes a concrete load-regime test as part of deployment planning: measure p90 latency under expected peak volume before selecting a routing mechanism, and verify that the volume projection accounts for traffic bursts, not only sustained averages. System architects choosing between routing mechanisms must further account for expert pool churn. Whether agents are added, retrained, or retired during the deployment window matters because churn under EMA-threshold routing has no principled prior-inheritance path, whereas the posterior framework accommodates it through prior transfer from overlapping competence distributions, as the future work section details. The framework does not eliminate the stability-latency-accuracy trade-off; it makes the specific thresholds at which each factor dominates precise enough to inform a deliberate design decision rather than one made on load-regime intuition alone.

Scope and Known Constraints

Competence Signal Availability. The posterior routing framework requires a competence signal $r_{i,t} \in [0,1]$ following each query resolution. In domains where ground-truth labels are available promptly (classification, retrieval, structured prediction), this requirement is met naturally. In generative tasks (open-ended text generation, code synthesis, strategic planning), ground-truth labels are either unavailable or available only after significant human evaluation latency. The proxy metrics used in practice (LLM-as-judge scores, self-consistency across multiple generations, downstream task performance) introduce measurement error into the posterior update. The framework does not characterize how much proxy error degrades posterior calibration quality; this open theoretical question is taken up in the Extensions and Open Questions section.

Query Class Coverage. The Beta prior is defined per expert per query class. New query classes that fall outside the $K$-cluster partition established during initialization carry no informative prior. In the experiments, this boundary case is controlled by holding out a distinct test distribution that still falls within the embedding space covered by the training clusters. In true open-world deployment, queries that fall in empty or sparse clusters trigger posterior routing under near-uniform priors, which reduces the routing mechanism to the warm-start initialization order. The framework provides no guarantee about routing quality in this regime.

Expert Pool Size and Computational Scaling. The $O(N \cdot K)$ per-query routing computation scales linearly with both the number of experts and the number of query classes. For the experimental configurations ($N \leq 32$, $K \leq 64$), this is negligible. At $N = 1000$ experts and $K = 512$ classes, a configuration plausible in large-scale enterprise deployments, the routing computation approaches 512,000 operations per query, which may introduce measurable latency overhead beyond the 50 ms budget depending on hardware. Approximation strategies (hierarchical clustering, expert subset pruning) are not evaluated in this work.

Stationarity Assumption in Prior Update. The Beta posterior update is additive and does not discount old observations. This means a long history of competent behavior will dilute the signal from a recent competence degradation event. The framework addresses this partially through the prior width hyperparameter, but a fully non-stationary posterior update (e.g., sliding-window observation, exponential forgetting) is not implemented or evaluated. The sensitivity of routing accuracy to prior staleness under regime changes faster than the experimental distribution shift is unknown.

Lack of Theoretical Guarantees. The paper presents empirical results on specific benchmarks. Formal regret bounds for Bayesian posterior routing relative to the oracle (clairvoyant) router, sample complexity bounds for competence posterior convergence, and PAC-Bayesian generalization guarantees are not derived. The absence of these guarantees limits the framework's applicability in settings where formal worst-case performance bounds are required.

Extensions and Open Questions

Dynamic Expert Pool Management. The current framework assumes a fixed expert pool with stable identities. A natural extension is to handle expert pool churn (agents being added, removed, or retrained) without full prior re-initialization. One mechanism is to inherit priors from existing experts in the pool whose historical competence distributions overlap with the new expert's stated domain. The correctness and speed of prior inheritance under different levels of overlap constitutes a tractable empirical question.
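One way to realize prior inheritance, purely as a sketch: initialize the new expert's Beta prior at an overlap-weighted average of existing experts' posterior means, rescaled to a chosen prior strength. The overlap weights and the `strength` parameter are assumptions.

```python
import numpy as np

def inherit_prior(alphas: np.ndarray, betas: np.ndarray,
                  overlap_weights: np.ndarray, strength: float = 10.0):
    # alphas, betas: posterior parameters of existing experts on the relevant query class.
    # overlap_weights: how strongly each existing expert's domain overlaps the new expert's.
    w = overlap_weights / overlap_weights.sum()
    mu = float(np.sum(w * (alphas / (alphas + betas))))
    # Return a Beta prior with mean mu and total pseudo-count equal to `strength`.
    return strength * mu, strength * (1.0 - mu)
```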

Online Calibration with Non-Stationary Signals. The Beta posterior with additive updates is not designed for non-stationary competence signals. Extending the framework to a sliding-window or exponentially weighted likelihood model would improve detection speed under rapid competence changes, at the cost of reduced robustness to noise in individual competence observations. The optimal forgetting rate as a function of domain non-stationarity is an open optimization problem.
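An exponentially forgetting variant of the update could take the following form, where `gamma` is a hypothetical forgetting rate (gamma = 1 recovers the additive update in the methodology).

```python
def discounted_update(alpha: float, beta: float, s: float, gamma: float = 0.99):
    # Decay accumulated evidence before adding the new binarized observation s in {0, 1}.
    return gamma * alpha + s, gamma * beta + (1.0 - s)
```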

Proxy Signal Error and Posterior Calibration Degradation. The limitations section identifies the absence of a characterization of how proxy-metric error propagates into posterior calibration quality as an open theoretical question. Deriving bounds on ECE degradation as a function of proxy noise level (for example, as a function of the Spearman rank correlation between the proxy and the true competence signal) would establish the minimum proxy quality required for the posterior router to outperform calibration-free baselines. This is a tractable problem in the Bayesian estimation literature and a direct prerequisite for deploying the framework in generative-task domains.

Theoretical Regret and Sample Complexity Analysis. A formal analysis of the posterior router's regret relative to the oracle assignment rule (ideally in a bandit-with-expert-competence framework) would establish the conditions under which the Beta-posterior update achieves near-optimal routing without requiring the sample complexity of full MARL policy learning [7].

Cross-Domain Validation. The experiments reported here use structured benchmark configurations with controlled distribution shift. Validation on production multi-agent deployments in domains such as financial compliance routing, clinical triage, and network resource management [17] would test whether the competence signal quality requirements can be met at scale in those operational contexts. The enterprise auditability requirements documented in practitioner literature make financial compliance a particularly concrete target for near-term validation.

Correlation Structure Among Experts. The current framework treats expert competence as independent across agents. When experts share components (pre-trained encoders, shared data sources), their competence signals are correlated, and independent Beta priors will overcount correlated evidence. Extending the framework to a Dirichlet-process or hierarchical Beta model that encodes expert correlation structure is a theoretically motivated next step.

References

Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement Learning: A Survey.

Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., et al. (2021). Review of deep learning: concepts, CNN architectures, challenges, applications, future directions.

Katoch, S., Chauhan, S., & Kumar, V. (2020). A review on genetic algorithm: past, present, and future.

Su, X., & Khoshgoftaar, T. (2009). A Survey of Collaborative Filtering Techniques.

Malone, T., & Crowston, K. (1994). The interdisciplinary study of coordination.

Busoniu, L., Babuska, R., & De Schutter, B. (2008). A Comprehensive Survey of Multiagent Reinforcement Learning.

Laffont, J., & Martimort, D. (2001). The Theory of Incentives: The Principal-Agent Model.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., et al. (2020). Scaling Laws for Neural Language Models.

Boutaba, R., Salahuddin, M. A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., et al. (2018). A comprehensive survey on machine learning for networking.

Casarin, R., Mantoan, G., & Ravazzolo, F. (2016). Bayesian Calibration of Generalized Pools of Predictive Distributions.

Wang, Y., Qin, C., Wang, W., Feng, F., Nie, L., Chua, T., et al. (2022). Multi-Task Learning with Calibrated Mixture of Insightful Experts.

Lai, C., Shen, X., & Feng, G. (2023). Intelligent Load Balancing and Resource Allocation in O-RAN: A Multi-Agent Multi-Armed Bandit Approach.

Delasay, M., & Akan, M. (2024). Efficient Allocation of Load-Balancing and Differentiation Tasks in Tandem Queue Services.

Sun, Y., Liu, Y., Wu, J., & Sun, X. (2026). Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing. Preprint; provenance requires verification before final publication.

Abouamasha, M., Aboelwafa, M., & Seddik, K. (2025). Load Balancing and Energy Efficiency in Cellular Networks with a Scenario-Aware Reinforcement Learning Agent.
