The design divides the accelerator’s memory into two
8MB banks: the D-RAM and the W-RAM. Each RAM can
supply a 4,096-byte vector on every cycle, producing 20TB/s
of total bandwidth at 2.5GHz. Only one RAM can be written
on each cycle, matching the output rate of the compute
pipeline. Writes from the ring interrupt this sequence, but
since it takes 64 bus cycles to load enough data for a single 4,096-byte write, these interruptions are rare. For high-reliability applications, both RAMs implement 64-bit ECC
across the entire 4,096-byte output value.
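The bandwidth figure follows directly from the numbers quoted above. As a quick check, the following sketch multiplies them out; the constant and function names are illustrative only, and the per-bus-cycle transfer size is an inference from the "64 bus cycles per 4,096-byte write" statement rather than a published specification.

```python
# Figures quoted in the text; all names here are hypothetical.
VECTOR_BYTES = 4096         # bytes each RAM supplies per cycle
NUM_RAMS = 2                # the D-RAM and the W-RAM
CLOCK_HZ = 2.5e9            # 2.5GHz
BUS_CYCLES_PER_WRITE = 64   # bus cycles needed to gather one 4,096-byte write


def peak_read_bandwidth(vector_bytes=VECTOR_BYTES, num_rams=NUM_RAMS,
                        clock_hz=CLOCK_HZ):
    """Peak bytes per second if every RAM supplies one vector on every cycle."""
    return vector_bytes * num_rams * clock_hz


def bytes_per_bus_cycle(vector_bytes=VECTOR_BYTES,
                        bus_cycles=BUS_CYCLES_PER_WRITE):
    """Implied transfer size per bus cycle (an inference, not a stated spec)."""
    return vector_bytes // bus_cycles


if __name__ == "__main__":
    print(f"{peak_read_bandwidth() / 1e12:.2f} TB/s")   # 20.48 TB/s
    print(f"{bytes_per_bus_cycle()} bytes per bus cycle")  # 64 bytes per bus cycle
```

The exact product is 20.48TB/s, which the text rounds to 20TB/s. Because a ring write can occur at most once every 64 bus cycles, it interrupts the compute pipeline's write sequence only occasionally.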
Data from the RAMs first flows into the data unit,
which performs various shift and permute functions. Specifically, it can perform up to three functions in a single
2.5GHz clock cycle, such as rotating an entire 4,096-byte
vector by up to 64 bytes, broadcasting a single INT8 value
(e.g., a weight) to fill a vector, compressing blocks (for pooling), and swapping bytes.
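To make two of these functions concrete, here is a minimal software sketch of the rotate and broadcast operations on a 4,096-byte vector. It models behavior only, not the hardware: the function names are hypothetical, and the rotation direction (toward lower byte indices) is an assumption, since the text does not specify it.

```python
VECTOR_BYTES = 4096   # vector width quoted in the text
MAX_ROTATE = 64       # maximum rotation distance quoted in the text


def rotate(vec: bytes, n: int) -> bytes:
    """Rotate a full vector by n bytes (direction is an assumption)."""
    if len(vec) != VECTOR_BYTES:
        raise ValueError("expected a 4,096-byte vector")
    if not 0 <= n <= MAX_ROTATE:
        raise ValueError("rotation is limited to 0..64 bytes")
    # Bytes shifted off the front wrap around to the end.
    return vec[n:] + vec[:n]


def broadcast_int8(value: int) -> bytes:
    """Fill a vector with one signed 8-bit value, such as a weight."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in INT8")
    # Store the two's-complement byte in every position.
    return bytes([value & 0xFF]) * VECTOR_BYTES


if __name__ == "__main__":
    vec = bytes(i % 256 for i in range(VECTOR_BYTES))
    rotated = rotate(vec, 3)
    print(rotated[0], rotated[-1])       # 3 2
    print(broadcast_int8(-1)[:4].hex())  # ffffffff
```

In the hardware, up to three such functions complete within one clock cycle; the sketch applies them one at a time purely for readability.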
Although such wide vectors require sizable die area for
a single register, the data unit contains four such registers. It
can read or write any of these registers on each clock cycle.
For example, it can merge a RAM value with a register value
using one of the other registers as a byte mask. Thus, one or
both RAMs can be powered down on many
Good Performance at Low Cost
Centaur’s goal is to deliver the best neural-network performance per dollar in its class. Via will ultimately determine
the price of CHA-based products, but if they sell for about
the same price as a Xeon Silver, customers will essentially get
the DLA for free. Even though external DLAs based on the
NNP-I or the T4 deliver considerably better performance,
they’re far from free; in fact, they cost more than the processor. Thus, for essentially no cost, Ncore customers could get
a 5x speedup on neural networks relative to a similarly
priced system with no external accelerator. Centaur is still
optimizing its software (it released MLPerf numbers only a
month after receiving working silicon), so its scores could
improve further by the time the product reaches the market.