I literally just read the paper over the weekend, and I don't remember anything about a single expert
I'm talking about section 2.1, which says the following:
Shazeer et al. (2017) conjectured that routing to k > 1 experts was necessary in order to have non-trivial gradients to the routing functions. The authors intuited that learning to route would not work without the ability to compare at least two experts. Ramachandran & Le (2018) went further to study the top-k decision and found that higher k-values lower in the model were important for models with many routing layers. Contrary to these ideas, we instead use a simplified strategy where we route to only a single expert. We show this simplification preserves model quality, reduces routing computation and performs better. This k = 1 routing strategy is later referred to as a Switch layer.
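
To make the k = 1 vs. k > 1 distinction concrete, here is a rough sketch of how I read it, not code from the paper. The names `route_top_k` and `switch_layer` and the toy experts are made up for illustration, and I'm assuming the chosen expert's output gets scaled by its router probability, which is what keeps the router trainable even with a single expert.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=1):
    """Return [(expert_index, gate_probability)] for the k most probable experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


def switch_layer(token, router_logits, experts):
    """k = 1 routing: send the token to one expert, scale its output by the gate."""
    (idx, gate), = route_top_k(router_logits, k=1)
    return idx, [gate * y for y in experts[idx](token)]


# Hypothetical toy experts; in a real model these would be feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

chosen, output = switch_layer([1.0, 2.0], [0.1, 2.0, -1.0], experts)
print(chosen, output)
```

With `k=2` the same `route_top_k` gives the Shazeer-style setup where at least two experts are compared. The Switch variant just keeps the top one.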