In an earlier post I took one AI-generated 5G LDPC decoder from 221 to 463 MHz, past a paid commercial IP on the same FPGA. That was a single configuration. The 5G LDPC standard defines two base graphs and 51 lifting factors - 102 base-graph and lifting-factor pairs before a code rate is even chosen. To show that the timing method generalizes, I needed many more decoders, generated and tuned the same way.

Tuning one decoder by hand is not a design method. The real value of AI here is not tuning one design but automating the process: generating a batch of configurations, closing timing on them quickly, and proving every one correct.

Tuning one is the start; the tool is the goal

Every code point in the 5G LDPC standard maps to a different FPGA circuit. Change the configuration and the resource and timing work starts over. So instead of babysitting each one, I distilled the timing method into an automated, parameter-driven flow: it generates twenty decoders across configurations, then closes timing and verifies correctness on each one. The twenty cover both base graphs, with two code rates per base graph and five lifting factors (2 x 2 x 5 = 20), spanning information lengths from a few hundred to several thousand bits.
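To make the size of that grid concrete, here is a minimal sketch of how twenty configurations could be enumerated. The lifting-size rule (Z = a * 2^j, capped at 384, for eight base values of a) follows the construction in 3GPP TS 38.212; the specific code rates and the five lifting factors picked below are placeholders of my own, as are the names `DecoderConfig` and `build_grid`, since this post does not list the actual twenty points.

```python
from dataclasses import dataclass
from itertools import product

Z_MAX = 384
BASE_VALUES = (2, 3, 5, 7, 9, 11, 13, 15)


def standard_lifting_sizes():
    """All Z = a * 2**j <= 384, i.e. the lifting sizes of 3GPP TS 38.212."""
    sizes = set()
    for a in BASE_VALUES:
        z = a
        while z <= Z_MAX:
            sizes.add(z)
            z *= 2
    return sorted(sizes)


@dataclass(frozen=True)
class DecoderConfig:
    base_graph: int
    code_rate: str
    lifting_factor: int


# Hypothetical picks for illustration only; not the post's actual grid.
RATES_PER_BASE_GRAPH = {1: ("1/3", "8/9"), 2: ("1/5", "2/3")}
LIFTING_FACTORS = (32, 64, 128, 256, 384)


def build_grid():
    """Cross both base graphs with two rates each and five lifting factors."""
    valid = set(standard_lifting_sizes())
    grid = []
    for base_graph, rates in RATES_PER_BASE_GRAPH.items():
        for rate, z in product(rates, LIFTING_FACTORS):
            if z not in valid:
                raise ValueError(f"{z} is not a standard lifting size")
            grid.append(DecoderConfig(base_graph, rate, z))
    return grid


grid = build_grid()
print(len(standard_lifting_sizes()), "lifting sizes,", len(grid), "configurations")
```

Each `DecoderConfig` is the kind of parameter set a generator would consume; nothing else about a decoder is specified by hand.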

One skeleton, generated by parameters

This is the crux: the twenty decoders are not written twenty times by hand. They come from one generator, run with different parameters. The skeleton is the same folded layered decoder from the first post - read estimate, run checks, accumulate and write back - where “folded” means processing a layer’s hundreds of lanes in a few passes rather than all at once, saving a large amount of hardware while staying single-engine and zero-DSP.

One configuration in (base graph, lifting factor, code rate); one dedicated decoder out. The skeleton is fixed; only parameters and the schedule change.

Given a configuration, the generator derives the fold count, the fixed-point widths, the schedule, and the stall cycles needed between layers, and emits a dedicated decoder. A larger lifting factor folds in more passes; at smaller sizes the datapath itself narrows. None of this is set by hand. Twenty configurations, twenty parameter sets - that is all.
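As a rough illustration of that derivation, the sketch below computes a fold count and a datapath width from the lifting factor alone. The lane cap `MAX_LANES = 128` and the balancing rule are assumptions made for the example; the real generator also derives fixed-point widths, the schedule, and stall cycles, none of which are modeled here.

```python
import math
from dataclasses import dataclass

MAX_LANES = 128  # hypothetical cap on parallel lanes, not taken from the post


@dataclass(frozen=True)
class DerivedParams:
    lanes: int   # width of the datapath actually built
    folds: int   # passes needed to cover one layer


def derive(lifting_factor, max_lanes=MAX_LANES):
    """Fold a layer of `lifting_factor` lanes into passes of at most `max_lanes`."""
    folds = math.ceil(lifting_factor / max_lanes)
    # Spread the lanes evenly over the passes, so small sizes get a narrow datapath.
    lanes = math.ceil(lifting_factor / folds)
    return DerivedParams(lanes=lanes, folds=folds)


for z in (16, 96, 192, 384):
    print(z, derive(z))
```

Under these assumptions a lifting factor of 384 folds into three passes of 128 lanes, while a lifting factor of 16 builds a 16-lane datapath with no folding, which matches the behavior described above: larger sizes fold more, smaller sizes narrow.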

The risk of batch generation is batch-produced errors

Generating fast is one thing; generating correct is another. A tool that produces quickly but produces errors is worse than nothing, so every decoder has to be proven correct first. Verification uses three models, each closer to hardware: a mathematical reference of the standard algorithm, a cycle-accurate model, and finally the circuit synthesized into the FPGA. The rule is strict: any change goes into the cycle-accurate model first, is verified bit for bit, and only then is the circuit regenerated. The cycle-accurate model is the hardware ground truth - if the model is wrong, faster hardware just computes the error faster.

Three models, each closer to hardware, checked layer by layer and bit by bit. Every decoder is reconciled against the mathematical reference and cross-checked against the 3GPP codec.

Across all twenty configurations, every generated decoder matched the 3GPP reference in the MATLAB 5G Toolbox, bit for bit.
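The gate between the three models can be sketched as a chain of bit-exact comparisons. This is an illustration rather than the flow's actual harness: the function names are hypothetical, and the decoded outputs are assumed to be available as plain sequences of bits from each model.

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing bit, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # Common prefix matches but one output is shorter.
        return min(len(expected), len(actual))
    return None


def verify_chain(reference_bits, cycle_model_bits, circuit_bits):
    """Check reference -> cycle-accurate model -> circuit, stopping at the first break."""
    idx = first_mismatch(reference_bits, cycle_model_bits)
    if idx is not None:
        return False, f"cycle-accurate model differs from reference at bit {idx}"
    idx = first_mismatch(cycle_model_bits, circuit_bits)
    if idx is not None:
        return False, f"circuit differs from cycle-accurate model at bit {idx}"
    return True, "bit-exact"


# Toy usage with made-up bit vectors.
print(verify_chain([1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 1]))
print(verify_chain([1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1]))
```

The ordering matters: a mismatch in the cycle-accurate model is reported before the circuit is even considered, which mirrors the rule that changes land in the model first.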

Closure as a loop that rolls back

After a design is generated and verified, timing still has to be closed. In the first post the cuts were found by hand, one at a time. Here that skill is rolled into a closed loop and pointed at all twenty configurations. For each design the flow first binary-searches how tight the clock target should be, then iterates. It measures across several placement strategies and takes the median, because a single run is noisy. It locates the slowest path and selects a fix from a library according to the path’s type. It applies the fix (a constraint, a parameter, or a register in the middle of a long wire) and re-runs the bit-exact and throughput checks. Finally it accepts the change if the design got faster, or rolls back and records the dead end so it is not tried again.

Automated closure: measure, locate, select, apply, verify, decide - roll back on failure.
The closure loop. What matters most is not that it pushes forward, but that it fails honestly and rolls back rather than committing an unverified result.

The most important property of this loop is that it rolls back. It would rather report “this point did not close” than hand over an unverified result as a success.
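A compact sketch of that accept-or-roll-back structure follows. Everything tool-specific is passed in as a callable, and the toy design, fixes, and clock numbers at the bottom are invented purely so the loop can run; the binary search for the clock target is left out. Treat it as one plausible shape for such a loop, not the flow's real code.

```python
from statistics import median


def close_timing(design, measure, slowest_path_type, fix_library, verify,
                 strategies, max_rounds=10):
    """Try fixes one at a time; keep only verified speedups, remember dead ends."""
    dead_ends = set()
    best = median(measure(design, s) for s in strategies)
    for _ in range(max_rounds):
        path_type = slowest_path_type(design)
        candidates = [f for f in fix_library.get(path_type, [])
                      if (path_type, f.__name__) not in dead_ends]
        if not candidates:
            break  # nothing left to try: report the current best honestly
        fix = candidates[0]
        trial = fix(design)  # fixes return a new design, so rollback is free
        if verify(trial):
            fmax = median(measure(trial, s) for s in strategies)
            if fmax > best:
                design, best = trial, fmax
                continue
        dead_ends.add((path_type, fix.__name__))  # roll back and record
    return design, best, dead_ends


# --- Toy stand-ins (all hypothetical) ---
def add_register(d):
    return {**d, "regs": d["regs"] + 1}


def bad_retime(d):
    return {**d, "regs": d["regs"] + 5, "bit_exact": False}


def toy_measure(d, strategy):
    # Registers help up to a point; `strategy` adds run-to-run spread.
    return 400 + 20 * min(d["regs"], 3) + strategy


final, fmax, dead = close_timing(
    design={"regs": 0, "bit_exact": True},
    measure=toy_measure,
    slowest_path_type=lambda d: "long_wire",
    fix_library={"long_wire": [bad_retime, add_register]},
    verify=lambda d: d["bit_exact"],
    strategies=(0, 1, 2),
)
print(final, fmax, sorted(dead))
```

In the toy run, `bad_retime` would have produced the same headline clock but fails verification, so it is discarded before its speed is ever measured.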

Results: clock matched, throughput close

The flow generated, verified, and closed all twenty decoders on the same FPGA the commercial IP names, at the same speed grade. Nineteen of the twenty matched or beat the IP’s 459 MHz clock; the median sits slightly above that figure, and the fastest is well past it. The only one to miss was the densest configuration, held back by routing congestion; the tool reported it plainly, with nothing hidden.

Measured clock across all twenty configurations against the commercial IP's 459 MHz. Nineteen meet or beat it; the densest one, reported honestly, falls short.

Throughput, honestly

Throughput needs to be reported separately. The architecture is the same, so it should be close; the measured gap comes down to the throughput techniques each side uses. Using the measured clock plus early stopping, most of the twenty beat the IP - but that lead comes from a higher clock and from terminating iterations early at good signal-to-noise, not from the architecture. Hold the clock and iteration count equal and compare codeword for codeword, and the IP leads.

Throughput shown two ways. Read honestly, the clean win is the clock; the throughput lead is bought with a higher clock and early stopping, and labeled as such.

The difference is two throughput techniques this version does not use: block interleaving, where the IP decodes several codewords interleaved to fill one codeword’s wait with another’s compute; and multiple check-node cores at small sizes, where the IP packs in extra cores to fill a pipeline that is otherwise underused. Both are addable; this version simply spends its budget on early stopping and a higher clock instead.
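To see how clock and early stopping each move the number, here is a simplified throughput model. It assumes one codeword is decoded at a time and ignores load and unload overhead, so throughput is information bits times clock divided by cycles per codeword. All values below are round placeholders chosen for easy arithmetic, not measurements from either decoder.

```python
def throughput_mbps(info_bits, clock_mhz, cycles_per_iteration, iterations):
    """Information bits decoded per microsecond, i.e. Mbit/s.

    `iterations` may be fractional to represent an average under early stopping.
    """
    cycles_per_codeword = cycles_per_iteration * iterations
    return info_bits * clock_mhz / cycles_per_codeword


# Placeholder values (assumptions for illustration only).
INFO_BITS = 1000
CYCLES_PER_ITERATION = 100
MAX_ITERATIONS = 10

baseline = throughput_mbps(INFO_BITS, 400.0, CYCLES_PER_ITERATION, MAX_ITERATIONS)
higher_clock = throughput_mbps(INFO_BITS, 500.0, CYCLES_PER_ITERATION, MAX_ITERATIONS)
early_stop = throughput_mbps(INFO_BITS, 500.0, CYCLES_PER_ITERATION, 5)

print(baseline, higher_clock, early_stop)
```

In this model the clock scales throughput linearly and halving the average iteration count doubles it, while techniques such as interleaving would show up as fewer effective cycles per codeword. That is why a fair comparison has to state which of the three knobs is being held equal.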

The leverage is in the tool, not the single point

The first post used AI to push one design past a commercial IP. This one used AI to build a flow that generates, closes, and verifies twenty designs bit for bit. The second is where AI compounds in hardware design. Accelerating one design is linear - one tuned, one earned. A tool is multiplicative: the same generator, the same verification net, and the same closure loop, spread over more configurations, lower the marginal cost of each.

Same chip: clock matched across the batch, throughput from the same architecture.
Same chip across the batch: the clean win is the clock; where the throughput edge comes from is labeled plainly.

Twenty 5G LDPC decoders, one flow, nineteen matching or beating a paid IP’s clock, every one matching the 3GPP reference bit for bit. That is not the win of any single design; it is the win of the tool that produced them all.