A2HCoder

An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation

Jie Lei¹*, Ruofan Jia²*, J. Andrew Zhang¹†, Hao Zhang¹
¹Global Big Data Technologies Centre, University of Technology Sydney, Sydney, Australia
²State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China
*These authors contributed equally. †Corresponding author

Abstract

In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. A2HCoder addresses the persistent gap between algorithm design and hardware implementation by introducing a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code.

Framework Architecture

A2HCoder operates under a hierarchical transformation strategy that incorporates both horizontal algorithm decomposition and vertical refinement flow. The framework bridges the semantic and structural gap between high-level algorithm design and hardware synthesis through a structured, multi-stage transformation pipeline.

Figure 1: The architecture of A2HCoder. The framework begins with modular decomposition of high-level MATLAB algorithms, breaking them into smaller submodules. Each submodule then undergoes a three-stage processing flow: code adaptation, code translation, and refinement.

Horizontal Decomposition

Disassembles complex communication algorithms into modular, loosely coupled components, simplifying the translation process and improving robustness.

Vertical Refinement

Performs step-by-step refinement from MATLAB source code to synthesizable C code, ultimately targeting HDL synthesis via existing HLS toolchains.

Stream-based Adaptation

Reconciles the frame-based global memory paradigm of MATLAB with the stream-based, dataflow-oriented execution preferred in hardware.

Feedback Loops

Agent-style feedback loops allow the LLM to iteratively revise outputs based on synthesis feedback and task constraints.
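
A feedback loop of this kind can be sketched as a bounded generate-and-check cycle. The code below is an illustrative sketch only, not the authors' implementation: `ToolReport`, `RefineResult`, `refineWithFeedback`, and the two callbacks are hypothetical names, with `generate` standing in for the LLM call and `check` standing in for simulation or synthesis.

```cpp
#include <functional>
#include <string>

// Outcome of one tool run (e.g. C simulation or HLS synthesis); hypothetical structure.
struct ToolReport {
    bool passed;             // true when the candidate met all checks
    std::string diagnostics; // error or warning text fed back to the model
};

struct RefineResult {
    std::string code; // last candidate produced
    bool accepted;    // whether that candidate passed
    int rounds;       // number of generate-and-check rounds used
};

// generate(previousCode, feedback) returns a revised candidate;
// check(code) runs the toolchain and reports the outcome.
RefineResult refineWithFeedback(
    const std::string &initialCode,
    const std::function<std::string(const std::string &, const std::string &)> &generate,
    const std::function<ToolReport(const std::string &)> &check,
    int maxRounds)
{
    std::string code = initialCode;
    std::string feedback; // empty on the first round
    for (int round = 1; round <= maxRounds; ++round) {
        code = generate(code, feedback);
        ToolReport report = check(code);
        if (report.passed) {
            return {code, true, round};
        }
        feedback = report.diagnostics; // carry diagnostics into the next round
    }
    return {code, false, maxRounds}; // budget exhausted without a passing candidate
}
```

Bounding the number of rounds keeps the loop from running indefinitely when the tool feedback never converges; the caller can then fall back to manual review of the last candidate.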

Three-Stage Processing Pipeline

1. Code Adaptation within MATLAB Domain

The first stage operates entirely within the MATLAB domain, adapting high-level, CPU-oriented code to meet FPGA architecture constraints. Global memory accesses are restructured into sequential patterns, and batch-based operations are rewritten into sample-wise streaming pipelines.

Figure 2: Workflow of Code Adaptation stage, transforming MATLAB algorithms for hardware compatibility.

2. Code Translation from MATLAB to HLS C++

The second stage transforms optimized MATLAB code into synthesizable HLS C++ code, focusing on restructuring function interfaces and internal operations for hardware-oriented execution with stream-oriented structures.

Figure 3: Workflow of Code Translation from MATLAB to HLS C++.

3. Optimization and Refinement

The final stage focuses on optimizing validated HLS C++ code to reduce resource utilization and computation latency through buffer management mechanisms and design space exploration techniques.

Figure 4: Workflow of Refinement stage for optimization and performance enhancement.

4. System-Level Integration

After submodule-level processing, individually refined modules are composed into a unified, executable hardware design using stream-based dataflow architecture.
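
The idea of composing modules through streams can be illustrated on the host with ordinary queues. This is a minimal sketch under stated assumptions: `Fifo`, `magnitudeSquared`, `compareLevel`, and `detectTop` are hypothetical names, `std::queue` stands in for a hardware stream type, and the two stages are simplified placeholders rather than the actual SSB submodules. Here the stages run one after another, whereas an HLS dataflow design would typically let them run concurrently.

```cpp
#include <complex>
#include <queue>
#include <vector>

// Host-side stand-in for a hardware FIFO; an HLS design would use a stream type instead.
template <typename T>
using Fifo = std::queue<T>;

// Hypothetical stage 1: squared magnitude of each complex sample.
void magnitudeSquared(Fifo<std::complex<float>> &in, Fifo<float> &out) {
    while (!in.empty()) {
        std::complex<float> s = in.front();
        in.pop();
        out.push(s.real() * s.real() + s.imag() * s.imag());
    }
}

// Hypothetical stage 2: flag samples whose energy exceeds a fixed level.
void compareLevel(Fifo<float> &in, Fifo<bool> &out, float level) {
    while (!in.empty()) {
        float e = in.front();
        in.pop();
        out.push(e > level);
    }
}

// Top level: stages communicate only through FIFOs, never through shared arrays.
std::vector<bool> detectTop(const std::vector<std::complex<float>> &samples, float level) {
    Fifo<std::complex<float>> rx;
    Fifo<float> energy;
    Fifo<bool> flags;
    for (const auto &s : samples) rx.push(s);

    magnitudeSquared(rx, energy);
    compareLevel(energy, flags, level);

    std::vector<bool> result;
    while (!flags.empty()) {
        result.push_back(flags.front());
        flags.pop();
    }
    return result;
}
```

Because each stage only reads from its input FIFO and writes to its output FIFO, a stage can be replaced or refined independently as long as its stream interface stays the same.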

Figure 5: Workflow of System Integration combining processed submodules.

Experimental Results

We validated A2HCoder through a real-world deployment case in the 5G wireless communication domain, implementing a complete 5G New Radio (NR) Synchronization Signal Block (SSB) detection system with five core submodules.

Ablation Study Results

| Method | LUT | FF | DSP | BRAM | Latency | Clock (MHz) |
|---|---|---|---|---|---|---|
| calcThreshold - Direct | 36,500 | 80,434 | 38 | 16 | 6,385 | Failed |
| calcThreshold - Adaptation | 685 | 1,176 | 24 | 4 | 6,301 | 277.09 |
| calcThreshold - Refinement | 173 | 274 | 3 | 1 | 6,013 | 322.27 |
| extractSSBsig - Direct | 4,468 | 7,071 | 0 | 24 | 24,890 | 265.11 |
| extractSSBsig - Adaptation | 275 | 353 | 0 | 4 | 12,441 | 253.29 |
| extractSSBsig - Refinement | 155 | 148 | 0 | 4 | 6,730 | 269.11 |
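
The gains in the ablation table can be expressed as relative reductions. The helper below is a minimal sketch; `reductionPercent` is a hypothetical name, and the figures used in the note that follows are read directly from the table.

```cpp
// Relative reduction, in percent, when a metric drops from `before` to `after`.
// Assumes before > 0.
double reductionPercent(double before, double after) {
    return 100.0 * (before - after) / before;
}
```

For example, `reductionPercent(36500, 173)` gives roughly 99.5 for the calcThreshold LUT count from Direct to Refinement, and `reductionPercent(24890, 6730)` gives roughly 73 for the extractSSBsig latency over the same two rows.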

Interactive Demo

Explore the three-stage transformation process of A2HCoder with the calcThreshold function. See how the original MATLAB code evolves through the adaptation, translation, and optimization stages to become an efficient hardware implementation.

Stage 1: Code Adaptation within MATLAB Domain

Transform the original MATLAB algorithm to be hardware-friendly while staying in MATLAB. The filter-based approach is converted to a circular buffer with running sum for streaming compatibility.

Key Changes: Filter function → Circular buffer, Batch processing → Sample-wise streaming, Global memory → Sequential access
Original MATLAB (calcThreshold.m)
% This function calculates threshold for correlation values from signal energy.
% Author: Jie Lei
% Date: 03/19/2025
% Revision: 0.1
%--------------------------------------------------------------------------
function threshold = calcThreshold(RxSignal)
    % Calculate signal energy using a moving average filter
    Nfft = 256; % Number of FFT points
    % Create a moving average filter with length Nfft
    % The filter coefficients are all equal to 1
    % The filter is applied to the squared magnitude of the input signal
    % to calculate the signal energy
    % The filter is implemented using the filter function in MATLAB
    energyFilt  = ones(Nfft,1);
    sig_mag_sq  = abs(RxSignal).^2;
    sigEnergy   = filter(energyFilt,1,sig_mag_sq);
    minimumThreshold  = Nfft*((2^-12)^2);          % Set minimumThreshold to avoid triggering on noise
    PSSThreshold_dB  = -6;                       % Strength threshold in dBs (0 is least sensitive).
    thresholdScaling = 10^(PSSThreshold_dB/10);  % Threshold scaling factor
    % Calculate threshold for correlation values from signal energy.
    threshold = sigEnergy.*thresholdScaling;     % Calculate the threshold using the signal power and threshold scaling factor
    threshold(threshold<minimumThreshold) = minimumThreshold; % Apply minimum threshold value where calculated threshold is lower than the minimum
end
Hardware-Optimized MATLAB (calcThreshold_hls.m)
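function threshold = calcThreshold_hls(RxSignal)
    % Streaming form of calcThreshold: one input sample in, one threshold out.
    Nfft = 256;                                   % Moving-sum window length
    minimumThreshold = Nfft*(2^-12)^2;            % Floor that avoids triggering on noise
    PSSThreshold_dB = -6;                         % Strength threshold in dB
    thresholdScaling = 10^(PSSThreshold_dB/10);   % Linear scaling factor

    inputLength = length(RxSignal);
    threshold = zeros(inputLength, 1);

    % Circular buffer holding the last Nfft squared magnitudes, plus their running sum
    circularBuffer = zeros(Nfft, 1);
    writePtr = 0;                                 % Zero-based write index
    bufferSum = 0;

    for i = 0:inputLength-1
        currentSample = RxSignal(i+1);
        sig_mag_sq = abs(currentSample)^2;

        % Overwrite the oldest entry with the newest sample energy
        oldValue = circularBuffer(writePtr + 1);
        circularBuffer(writePtr + 1) = sig_mag_sq;

        % Update the running sum: subtract the evicted value, add the new one
        bufferSum = bufferSum - oldValue + sig_mag_sq;

        sigEnergy = bufferSum;

        % Scale the energy and clamp to the minimum threshold
        thresholdValue = sigEnergy * thresholdScaling;
        if thresholdValue < minimumThreshold
            thresholdValue = minimumThreshold;
        end

        threshold(i+1) = thresholdValue;

        % Advance the write pointer, wrapping at Nfft
        writePtr = writePtr + 1;
        if writePtr >= Nfft
            writePtr = 0;
        end
    end
end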
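
The equivalence between the filter-based version and the circular-buffer version can be checked on the host. The sketch below is a hypothetical test harness, not part of A2HCoder: `windowSumDirect` mirrors what `filter(ones(Nfft,1),1,x)` computes with zero initial conditions, and `windowSumStreaming` mirrors the running-sum loop. Integer samples are used so the comparison is exact; with floating-point data the running sum can accumulate small rounding differences.

```cpp
#include <cstddef>
#include <vector>

// Reference: for each n, sum the most recent `window` inputs (fewer at the start).
// Assumes window >= 1.
std::vector<long long> windowSumDirect(const std::vector<long long> &x, std::size_t window) {
    std::vector<long long> y(x.size(), 0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        std::size_t start = (n + 1 >= window) ? n + 1 - window : 0;
        for (std::size_t k = start; k <= n; ++k) y[n] += x[k];
    }
    return y;
}

// Streaming: one sample in, one sum out, using a circular buffer and a running sum.
std::vector<long long> windowSumStreaming(const std::vector<long long> &x, std::size_t window) {
    std::vector<long long> y(x.size(), 0);
    std::vector<long long> buffer(window, 0); // zero-filled, like zeros(Nfft, 1)
    std::size_t writePtr = 0;
    long long runningSum = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        runningSum += x[n] - buffer[writePtr]; // add newest, drop oldest
        buffer[writePtr] = x[n];
        writePtr = (writePtr + 1 == window) ? 0 : writePtr + 1;
        y[n] = runningSum;
    }
    return y;
}
```

The direct form costs on the order of `window` additions per output, while the streaming form needs one addition, one subtraction, and a single buffer read and write per sample, which is what makes it suitable for a pipelined hardware loop.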

Stage 2: Code Translation from MATLAB to HLS C++

Translate the hardware-optimized MATLAB code to HLS C++ with streaming interfaces and hardware-specific optimizations.

Key Changes: MATLAB syntax → C++, Arrays → Streams, Function calls → Hardware interfaces, Data types → Fixed-point
HLS C++ Translation
void calcThreshold(
    ComplexStream &rxSignal,
    ThresholdStream &threshold,
    int numSamples
) {
    // HLS interface pragmas for AXI4-Stream and AXI4-Lite
    #pragma HLS INTERFACE axis port=rxSignal
    #pragma HLS INTERFACE axis port=threshold
    #pragma HLS INTERFACE s_axilite port=numSamples
    #pragma HLS INTERFACE s_axilite port=return
    
    // Circular buffer for storing squared magnitude values
    FixedPoint circularBuffer[NFFT];
    #pragma HLS ARRAY_PARTITION variable=circularBuffer type=cyclic factor=2
    #pragma HLS RESOURCE variable=circularBuffer core=RAM_2P_BRAM
    
    // State variables
    int writePtr = 0;
    EnergyType bufferSum = 0;
    
    // Constants for threshold calculation
    const ThresholdType thresholdScaling = THRESHOLD_SCALING;
    const ThresholdType minimumThreshold = MINIMUM_THRESHOLD;
    
    // Initialize circular buffer to zero
    INIT_BUFFER: for (int i = 0; i < NFFT; i++) {
        #pragma HLS PIPELINE II=1
        circularBuffer[i] = 0;
    }
    
    // Main processing loop
    PROCESS_SAMPLES: for (int sampleIdx = 0; sampleIdx < numSamples; sampleIdx++) {
        #pragma HLS PIPELINE II=1
        #pragma HLS LOOP_TRIPCOUNT min=1024 max=8192 avg=4096
        
        // Read input sample from stream
        ComplexType currentSample = rxSignal.read();
        
        // Calculate squared magnitude
        FixedPoint realPart = currentSample.real();
        FixedPoint imagPart = currentSample.imag();
        FixedPoint sigMagSq = realPart * realPart + imagPart * imagPart;
        
        // Circular buffer operation
        FixedPoint oldValue = circularBuffer[writePtr];
        circularBuffer[writePtr] = sigMagSq;
        bufferSum = bufferSum - oldValue + sigMagSq;
        
        // Calculate and apply threshold
        EnergyType sigEnergy = bufferSum;
        ThresholdType thresholdValue = sigEnergy * thresholdScaling;
        
        if (thresholdValue < minimumThreshold) {
            thresholdValue = minimumThreshold;
        }
        
        threshold.write(thresholdValue);
        writePtr = (writePtr + 1) % NFFT;
    }
}
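
The listing relies on `NFFT`, `THRESHOLD_SCALING`, `MINIMUM_THRESHOLD`, and the fixed-point types `FixedPoint`, `EnergyType`, and `ThresholdType`, none of which are defined on this page. The sketch below derives the constants from the MATLAB source and illustrates one consideration when choosing word lengths: the minimum threshold equals 2^-16, so a threshold type with fewer than 16 fractional bits would truncate it to zero. The names `kNfft`, `kMinimumThreshold`, `kThresholdScaling`, and `truncateToFracBits` are hypothetical, and truncation is an assumed quantization mode.

```cpp
#include <cmath>

// Values taken from the MATLAB source: Nfft = 256, minimumThreshold = Nfft*(2^-12)^2,
// thresholdScaling = 10^(PSSThreshold_dB/10) with PSSThreshold_dB = -6.
constexpr int kNfft = 256;
const double kMinimumThreshold = kNfft * std::ldexp(1.0, -24); // 256 * 2^-24 = 2^-16
const double kThresholdScaling = std::pow(10.0, -6.0 / 10.0);  // about 0.2512

// Truncate a non-negative value to `fracBits` fractional bits, mimicking a
// fixed-point type that discards lower-order bits (assumed quantization mode).
double truncateToFracBits(double value, int fracBits) {
    double scale = std::ldexp(1.0, fracBits); // 2^fracBits
    return std::floor(value * scale) / scale;
}
```

With this helper, `truncateToFracBits(kMinimumThreshold, 12)` evaluates to zero, while 16 fractional bits represent the value exactly. The actual widths used by A2HCoder are not stated here, so these numbers only bound what a workable choice would need.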

Stage 3: Optimization and Refinement

Apply advanced optimizations to reduce latency and resource utilization. Smart buffer management eliminates initialization overhead.

Key Optimizations: Remove initialization loop (saves 256 cycles), Smart buffer management, Conditional logic for uninitialized values, Enhanced HLS pragmas
Optimized HLS C++
void calcThreshold(
    ComplexStream &rxSignal,
    ThresholdStream &threshold,
    int numSamples
) {
    // HLS interface pragmas for AXI4-Stream and AXI4-Lite
    // #pragma HLS INTERFACE axis port=rxSignal
    // #pragma HLS INTERFACE axis port=threshold
    // #pragma HLS INTERFACE s_axilite port=numSamples
    // #pragma HLS INTERFACE s_axilite port=return
    
    // Circular buffer for storing squared magnitude values
    EnergyType circularBuffer[NFFT];
    // #pragma HLS ARRAY_PARTITION variable=circularBuffer type=cyclic factor=2
    #pragma HLS RESOURCE variable=circularBuffer core=RAM_2P_BRAM
    
    // State variables
    int writePtr = 0;
    EnergyType bufferSum = 0;
    
    // Constants for threshold calculation
    const ThresholdType thresholdScaling = THRESHOLD_SCALING;
    const ThresholdType minimumThreshold = MINIMUM_THRESHOLD;
    
    // LATENCY OPTIMIZATION: Removed INIT_BUFFER loop (saves NFFT=256 cycles)
    // Smart buffer management: treat uninitialized values as zero for first NFFT samples
    
    // Main processing loop - optimized for reduced latency
    PROCESS_SAMPLES: for (int sampleIdx = 0; sampleIdx < numSamples; sampleIdx++) {
        #pragma HLS PIPELINE II=1
        #pragma HLS LOOP_TRIPCOUNT min=1024 max=8192 avg=4096
        
        // Read input sample from stream
        ComplexType currentSample = rxSignal.read();
        
        // Calculate squared magnitude
        FixedPoint realPart = currentSample.real();
        FixedPoint imagPart = currentSample.imag();
        EnergyType sigMagSq = realPart * realPart + imagPart * imagPart;
        
        // Smart circular buffer operation - no initialization required
        EnergyType oldValue;
        if (sampleIdx < NFFT) {
            oldValue = 0;  // Treat uninitialized as zero for first NFFT iterations
        } else {
            oldValue = circularBuffer[writePtr];  // Normal circular buffer operation
        }
        
        // Store new value in circular buffer
        circularBuffer[writePtr] = sigMagSq;
        
        // Update running sum: subtract old, add new
        bufferSum = bufferSum - oldValue + sigMagSq;
        
        // Calculate signal energy and apply threshold
        EnergyType sigEnergy = bufferSum;
        ThresholdType thresholdValue = sigEnergy * thresholdScaling;
        
        // Apply minimum threshold constraint
        if (thresholdValue < minimumThreshold) {
            thresholdValue = minimumThreshold;
        }
        
        // Write threshold to output stream
        threshold.write(thresholdValue);
        
        // Update circular buffer pointer
        writePtr = (writePtr + 1) % NFFT;
    }
}
Result: a 256-cycle latency reduction with identical functionality
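
The claim of identical functionality can be checked with a small host-side comparison. This is a sketch under stated assumptions, not the authors' test bench: `sumWithInit` and `sumWithoutInit` are hypothetical names, the window is shortened to 4 (`kWindow`) in place of NFFT = 256, and a sentinel value stands in for the unknown contents of uninitialized memory.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kWindow = 4; // small stand-in for NFFT = 256

// Variant with an explicit zeroing pass over the buffer.
std::vector<long long> sumWithInit(const std::vector<long long> &x) {
    long long buffer[kWindow];
    for (std::size_t i = 0; i < kWindow; ++i) buffer[i] = 0;
    std::vector<long long> y;
    std::size_t writePtr = 0;
    long long runningSum = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        long long oldValue = buffer[writePtr];
        buffer[writePtr] = x[n];
        runningSum = runningSum - oldValue + x[n];
        y.push_back(runningSum);
        writePtr = (writePtr + 1) % kWindow;
    }
    return y;
}

// Variant without the zeroing pass: the buffer starts with arbitrary contents
// and reads are masked to zero until every slot has been written once.
std::vector<long long> sumWithoutInit(const std::vector<long long> &x) {
    long long buffer[kWindow];
    for (std::size_t i = 0; i < kWindow; ++i) buffer[i] = -987654321; // garbage stand-in
    std::vector<long long> y;
    std::size_t writePtr = 0;
    long long runningSum = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        long long oldValue = (n < kWindow) ? 0 : buffer[writePtr];
        buffer[writePtr] = x[n];
        runningSum = runningSum - oldValue + x[n];
        y.push_back(runningSum);
        writePtr = (writePtr + 1) % kWindow;
    }
    return y;
}
```

The masking works because slot `writePtr` is first read at iteration `n = writePtr` and is not read again until iteration `n + kWindow`, by which point it holds a real sample.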

Real-World Deployment

The complete 5G SSB detection system was successfully deployed on a USRP X310 platform equipped with a Xilinx Kintex-7 FPGA, demonstrating A2HCoder's capability to generate modular, high-performance, and synthesizable hardware directly from high-level MATLAB specifications.

Figure 6: End-to-end deployment of the A2HCoder-generated 5G SSB detection pipeline on a USRP X310 platform. The system receives wireless signals, processes them on the Kintex-7 FPGA, and outputs detection results.

System-Level Performance

| Module | LUTs | FFs | DSP | BRAMs | Latency | Clock (MHz) |
|---|---|---|---|---|---|---|
| pssCorrelator | 6,329 | 21,088 | 276 | 0 | 54,060 | 254.00 |
| calcThreshold | 173 | 274 | 3 | 1 | 6,013 | 322.27 |
| peakFinder | 1,061 | 1,439 | 0 | 0 | 6,007 | 279.02 |
| collectLocations | 85 | 211 | 0 | 0 | 6,004 | 332.78 |
| extractSSBsig | 155 | 148 | 0 | 4 | 6,730 | 269.11 |
| detectSSB (Complete System) | 8,669 | 24,216 | 279 | 7 | 53,872 | 292.23 |
