A2HCoder: LLM-Driven Algorithm-to-HDL Translation

Framework Architecture

A2HCoder follows a hierarchical transformation strategy that combines horizontal algorithm decomposition with a vertical refinement flow. The framework bridges the semantic and structural gap between high-level algorithm design and hardware synthesis through a structured, multi-stage transformation pipeline.

Figure 1: The architecture of A2HCoder. The framework begins with modular decomposition of high-level MATLAB algorithms, breaking them into smaller submodules. Each submodule then undergoes a three-stage processing flow: code adaptation, code translation, and refinement.

Horizontal Decomposition

Disassembles complex communication algorithms into modular, loosely coupled components, simplifying the translation process and improving robustness.

Vertical Refinement

Performs step-by-step refinement from MATLAB source code to synthesizable C code, ultimately targeting HDL synthesis via existing HLS toolchains.

Stream-based Adaptation

Reconciles the frame-based global memory paradigm of MATLAB with the stream-based, dataflow-oriented execution preferred in hardware.

Feedback Loops

Agent-style feedback loops allow the LLM to iteratively revise outputs based on synthesis feedback and task constraints.

Three-Stage Processing Pipeline

1. Code Adaptation within MATLAB Domain

The first stage operates entirely within the MATLAB domain, adapting high-level, CPU-oriented code to meet FPGA architecture constraints. Global memory accesses are restructured into sequential patterns, and batch-based operations are rewritten into sample-wise streaming pipelines.
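The restructuring described above can be illustrated with a small sketch. It is written in C++ rather than MATLAB only so that it can be compiled and checked, and it is not taken from A2HCoder: the delayed-difference operation and the names `delayedDiffFrame` and `DelayedDiffStream` are hypothetical. The frame-based function indexes freely into the whole input, while the streaming class sees one sample per call and keeps only a short delay line.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Frame-based form: the whole input frame sits in memory and is indexed freely.
std::vector<float> delayedDiffFrame(const std::vector<float>& x, std::size_t delay) {
    std::vector<float> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        // Look back into the frame; samples before the start count as zero.
        const float past = (n >= delay) ? x[n - delay] : 0.0f;
        y[n] = x[n] - past;
    }
    return y;
}

// Streaming form: one sample in, one sample out. Only DELAY past samples are
// stored, and they are accessed strictly in arrival order.
template <std::size_t DELAY>
class DelayedDiffStream {
    static_assert(DELAY > 0, "delay line needs at least one element");

public:
    float step(float sample) {
        const float past = line_[pos_];  // sample that arrived DELAY calls ago
        line_[pos_] = sample;            // overwrite it with the newest sample
        pos_ = (pos_ + 1) % DELAY;
        return sample - past;
    }

private:
    std::array<float, DELAY> line_{};    // zero-initialised delay line
    std::size_t pos_ = 0;
};
```

Calling `step` once per element of a frame should reproduce the output of `delayedDiffFrame` with the same delay, which is the property a streaming rewrite has to preserve.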

Figure 2: Workflow of Code Adaptation stage, transforming MATLAB algorithms for hardware compatibility.

2. Code Translation from MATLAB to HLS C++

The second stage translates the adapted MATLAB code into synthesizable HLS C++ code, restructuring function interfaces and internal operations around stream-oriented structures for hardware execution.
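As a rough illustration of what restructuring a function interface around streams means, the sketch below replaces a vector-in, vector-out operation (`abs(x).^2` in MATLAB) with a function that reads from one FIFO and writes to another. The `Stream` class is a hypothetical software stand-in offering only `read`, `write`, and `empty`; a real design would use the stream type supplied by the HLS toolchain, and `magSquared` is an invented example, not a module of A2HCoder.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <queue>

// Hypothetical software stand-in for a hardware FIFO.
template <typename T>
class Stream {
public:
    void write(const T& value) { fifo_.push(value); }

    T read() {
        assert(!fifo_.empty());  // hardware would block here; software just checks
        T value = fifo_.front();
        fifo_.pop();
        return value;
    }

    bool empty() const { return fifo_.empty(); }

private:
    std::queue<T> fifo_;
};

using Sample = std::complex<float>;

// Streamed counterpart of the MATLAB expression abs(x).^2:
// consumes numSamples inputs and produces numSamples outputs, one per iteration.
void magSquared(Stream<Sample>& in, Stream<float>& out, int numSamples) {
    for (int i = 0; i < numSamples; ++i) {
        const Sample s = in.read();
        out.write(s.real() * s.real() + s.imag() * s.imag());
    }
}
```

The function no longer returns an array. Its inputs and outputs are ports, and the sample count is passed explicitly because a stream carries no length of its own.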

Figure 3: Workflow of Code Translation from MATLAB to HLS C++.

3. Optimization and Refinement

The final stage focuses on optimizing validated HLS C++ code to reduce resource utilization and computation latency through buffer management mechanisms and design space exploration techniques.

Figure 4: Workflow of Refinement stage for optimization and performance enhancement.

4. System-Level Integration

After submodule-level processing, individually refined modules are composed into a unified, executable hardware design using stream-based dataflow architecture.
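A minimal sketch of stream-based composition, assuming two invented stages (`scaleStage` and `floorStage`) connected by an internal FIFO, is shown below. It does not reproduce how A2HCoder wires the SSB detection modules together. In software the stages run one after the other; in an HLS flow, a top-level function of this shape is typically marked with a dataflow directive so that the stages run concurrently.

```cpp
#include <cassert>
#include <queue>

using Fifo = std::queue<float>;

// Hypothetical stage 1: multiply every sample by a gain.
void scaleStage(Fifo& in, Fifo& out, float gain) {
    while (!in.empty()) {
        out.push(in.front() * gain);
        in.pop();
    }
}

// Hypothetical stage 2: clamp every sample from below.
void floorStage(Fifo& in, Fifo& out, float minimum) {
    while (!in.empty()) {
        const float v = in.front();
        in.pop();
        out.push(v < minimum ? minimum : v);
    }
}

// Top level: the stages share no arrays, only the FIFO that links them,
// so each one can be developed and verified on its own.
void topLevel(Fifo& in, Fifo& out, float gain, float minimum) {
    Fifo link;  // internal channel between the two stages
    scaleStage(in, link, gain);
    floorStage(link, out, minimum);
}
```

Because each stage touches only its input and output channels, a stage can be swapped for a refined version without changing its neighbours.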

Figure 5: Workflow of System Integration combining processed submodules.

Experimental Results

We validated A2HCoder through a real-world deployment case in the 5G wireless communication domain, implementing a complete 5G New Radio (NR) Synchronization Signal Block (SSB) detection system with five core submodules.

Ablation Study Results

Method	LUT	FF	DSP	BRAM	Latency	Clock (MHz)
calcThreshold - Direct	36,500	80,434	38	16	6,385	Failed
calcThreshold - Adaptation	685	1,176	24	4	6,301	277.09
calcThreshold - Refinement	173	274	3	1	6,013	322.27
extractSSBsig - Direct	4,468	7,071	0	24	24,890	265.11
extractSSBsig - Adaptation	275	353	0	4	12,441	253.29
extractSSBsig - Refinement	155	148	0	4	6,730	269.11

Interactive Demo

Explore the three-stage transformation process of A2HCoder with the calcThreshold function. See how the original MATLAB code evolves through the adaptation, translation, and optimization stages to become an efficient hardware implementation.

Stage 1: Code Adaptation within MATLAB Domain

Transform the original MATLAB algorithm to be hardware-friendly while staying in MATLAB. The filter-based approach is converted to a circular buffer with running sum for streaming compatibility.

Key Changes: Filter function → Circular buffer, Batch processing → Sample-wise streaming, Global memory → Sequential access
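The filter-to-circular-buffer change relies on the fact that `filter(ones(N,1), 1, x)` is a moving sum over the last N samples, so it can be updated by adding the newest sample and subtracting the one that leaves the window. The C++ sketch below checks that equivalence. It is illustrative only, and the function names `movingSumDirect` and `movingSumRunning` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct form of filter(ones(N,1), 1, x):
// y[n] = x[n] + x[n-1] + ... + x[n-N+1], with samples before the start taken as zero.
std::vector<double> movingSumDirect(const std::vector<double>& x, std::size_t N) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < N && k <= n; ++k) {
            y[n] += x[n - k];
        }
    }
    return y;
}

// Running-sum form: one subtraction and one addition per sample, plus an
// N-element circular buffer that remembers which value leaves the window.
std::vector<double> movingSumRunning(const std::vector<double>& x, std::size_t N) {
    assert(N > 0);
    std::vector<double> buffer(N, 0.0);
    std::vector<double> y(x.size());
    std::size_t writePtr = 0;
    double sum = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        sum += x[n] - buffer[writePtr];  // add newest, drop oldest
        buffer[writePtr] = x[n];
        writePtr = (writePtr + 1 == N) ? 0 : writePtr + 1;
        y[n] = sum;
    }
    return y;
}
```

The direct form needs up to N additions per output and access to the whole input, whereas the running-sum form does constant work per sample and reads its input strictly in order. With floating-point data the two forms can differ by rounding error, since the running sum accumulates it over time.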

Original MATLAB (calcThreshold.m)

% This function calculates threshold for correlation values from signal energy.
% Author: Jie Lei
% Date: 03/19/2025
% Revision: 0.1
%--------------------------------------------------------------------------
function threshold = calcThreshold(RxSignal)
    % Calculate signal energy using a moving average filter
    Nfft = 256; % Number of FFT points
    % Create a moving average filter with length Nfft
    % The filter coefficients are all equal to 1
    % The filter is applied to the squared magnitude of the input signal
    % to calculate the signal energy
    % The filter is implemented using the filter function in MATLAB
    energyFilt  = ones(Nfft,1);
    sig_mag_sq  = abs(RxSignal).^2;
    sigEnergy   = filter(energyFilt,1,sig_mag_sq);
    minimumThreshold  = Nfft*((2^-12)^2);          % Set minimumThreshold to avoid triggering on noise
    PSSThreshold_dB  = -6;                       % Strength threshold in dBs (0 is least sensitive).
    thresholdScaling = 10^(PSSThreshold_dB/10);  % Threshold scaling factor
    % Calculate threshold for correlation values from signal energy.
    threshold = sigEnergy.*thresholdScaling;     % Calculate the threshold using the signal power and threshold scaling factor
    threshold(threshold<minimumThreshold) = minimumThreshold; % Apply minimum threshold value where calculated threshold is lower than the minimum
end
                                

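Hardware-Optimized MATLAB (calcThreshold_hls.m)

function threshold = calcThreshold_hls(RxSignal)
    % Streaming-friendly version of calcThreshold: one sample is processed per
    % loop iteration using a circular buffer and a running sum.
    Nfft = 256;                                   % Moving-sum window length
    minimumThreshold = Nfft*(2^-12)^2;            % Floor that avoids triggering on noise
    PSSThreshold_dB = -6;                         % Strength threshold in dB
    thresholdScaling = 10^(PSSThreshold_dB/10);   % Linear threshold scaling factor

    inputLength = length(RxSignal);
    threshold = zeros(inputLength, 1);

    circularBuffer = zeros(Nfft, 1);   % Last Nfft squared magnitudes
    writePtr = 0;                      % Zero-based write index into circularBuffer
    bufferSum = 0;                     % Running sum of the buffer contents

    for i = 0:inputLength-1
        currentSample = RxSignal(i+1);
        sig_mag_sq = abs(currentSample)^2;

        % Replace the oldest stored value with the newest one
        oldValue = circularBuffer(writePtr + 1);
        circularBuffer(writePtr + 1) = sig_mag_sq;

        % Update the running sum: subtract the old value, add the new one
        bufferSum = bufferSum - oldValue + sig_mag_sq;

        sigEnergy = bufferSum;

        % Scale the energy and apply the minimum threshold
        thresholdValue = sigEnergy * thresholdScaling;
        if thresholdValue < minimumThreshold
            thresholdValue = minimumThreshold;
        end

        threshold(i+1) = thresholdValue;

        % Advance the write pointer and wrap at the end of the buffer
        writePtr = writePtr + 1;
        if writePtr >= Nfft
            writePtr = 0;
        end
    end
end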
                                

Stage 2: Code Translation from MATLAB to HLS C++

Translate the hardware-optimized MATLAB code to HLS C++ with streaming interfaces and hardware-specific optimizations.

Key Changes: MATLAB syntax → C++, Arrays → Streams, Function calls → Hardware interfaces, Data types → Fixed-point
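The move to fixed-point data types can be made concrete by working out the two constants of calcThreshold. From the MATLAB source, the minimum threshold is 256 · (2^-12)^2 = 2^-16 and the scaling factor is 10^(-6/10) ≈ 0.2512. The sketch below evaluates both and quantises them to a fixed-point format. The 16 fractional bits and the helpers `toFixed` and `fromFixed` are assumptions made for illustration; the document does not state the bit widths behind `FixedPoint`, `EnergyType`, or `ThresholdType`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kNfft = 256;                // window length from the MATLAB source
constexpr double kPssThresholdDb = -6.0;  // strength threshold in dB

// minimumThreshold = Nfft * (2^-12)^2 = 2^8 * 2^-24 = 2^-16
double minimumThreshold() {
    return kNfft * std::pow(2.0, -12) * std::pow(2.0, -12);
}

// thresholdScaling = 10^(PSSThreshold_dB / 10), roughly 0.2512 for -6 dB
double thresholdScaling() {
    return std::pow(10.0, kPssThresholdDb / 10.0);
}

// Hypothetical helper: quantise to fixed point with fracBits fractional bits,
// rounding to the nearest representable value.
std::int64_t toFixed(double value, int fracBits) {
    return std::llround(std::ldexp(value, fracBits));
}

// Hypothetical helper: convert the raw fixed-point integer back to a real value.
double fromFixed(std::int64_t raw, int fracBits) {
    return std::ldexp(static_cast<double>(raw), -fracBits);
}
```

Under this assumption the minimum threshold is exactly one least significant bit, so any format with fewer than 16 fractional bits would round it to zero. The scaling factor is not exactly representable and carries a small quantisation error.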

HLS C++ Translation

void calcThreshold(
    ComplexStream &rxSignal,
    ThresholdStream &threshold,
    int numSamples
) {
    // HLS interface pragmas for AXI4-Stream and AXI4-Lite
    #pragma HLS INTERFACE axis port=rxSignal
    #pragma HLS INTERFACE axis port=threshold
    #pragma HLS INTERFACE s_axilite port=numSamples
    #pragma HLS INTERFACE s_axilite port=return
    
    // Circular buffer for storing squared magnitude values
    FixedPoint circularBuffer[NFFT];
    #pragma HLS ARRAY_PARTITION variable=circularBuffer type=cyclic factor=2
    #pragma HLS RESOURCE variable=circularBuffer core=RAM_2P_BRAM
    
    // State variables
    int writePtr = 0;
    EnergyType bufferSum = 0;
    
    // Constants for threshold calculation
    const ThresholdType thresholdScaling = THRESHOLD_SCALING;
    const ThresholdType minimumThreshold = MINIMUM_THRESHOLD;
    
    // Initialize circular buffer to zero
    INIT_BUFFER: for (int i = 0; i < NFFT; i++) {
        #pragma HLS PIPELINE II=1
        circularBuffer[i] = 0;
    }
    
    // Main processing loop
    PROCESS_SAMPLES: for (int sampleIdx = 0; sampleIdx < numSamples; sampleIdx++) {
        #pragma HLS PIPELINE II=1
        #pragma HLS LOOP_TRIPCOUNT min=1024 max=8192 avg=4096
        
        // Read input sample from stream
        ComplexType currentSample = rxSignal.read();
        
        // Calculate squared magnitude
        FixedPoint realPart = currentSample.real();
        FixedPoint imagPart = currentSample.imag();
        FixedPoint sigMagSq = realPart * realPart + imagPart * imagPart;
        
        // Circular buffer operation
        FixedPoint oldValue = circularBuffer[writePtr];
        circularBuffer[writePtr] = sigMagSq;
        bufferSum = bufferSum - oldValue + sigMagSq;
        
        // Calculate and apply threshold
        EnergyType sigEnergy = bufferSum;
        ThresholdType thresholdValue = sigEnergy * thresholdScaling;
        
        if (thresholdValue < minimumThreshold) {
            thresholdValue = minimumThreshold;
        }
        
        threshold.write(thresholdValue);
        writePtr = (writePtr + 1) % NFFT;
    }
}
                                

Stage 3: Optimization and Refinement

Apply advanced optimizations to reduce latency and resource utilization. Smart buffer management eliminates initialization overhead.

Key Optimizations: Remove initialization loop (saves 256 cycles), Smart buffer management, Conditional logic for uninitialized values, Enhanced HLS pragmas

Optimized HLS C++

void calcThreshold(
    ComplexStream &rxSignal,
    ThresholdStream &threshold,
    int numSamples
) {
    // HLS interface pragmas for AXI4-Stream and AXI4-Lite
    // #pragma HLS INTERFACE axis port=rxSignal
    // #pragma HLS INTERFACE axis port=threshold
    // #pragma HLS INTERFACE s_axilite port=numSamples
    // #pragma HLS INTERFACE s_axilite port=return
    
    // Circular buffer for storing squared magnitude values
    EnergyType circularBuffer[NFFT];
    // #pragma HLS ARRAY_PARTITION variable=circularBuffer type=cyclic factor=2
    #pragma HLS RESOURCE variable=circularBuffer core=RAM_2P_BRAM
    
    // State variables
    int writePtr = 0;
    EnergyType bufferSum = 0;
    
    // Constants for threshold calculation
    const ThresholdType thresholdScaling = THRESHOLD_SCALING;
    const ThresholdType minimumThreshold = MINIMUM_THRESHOLD;
    
    // LATENCY OPTIMIZATION: Removed INIT_BUFFER loop (saves NFFT=256 cycles)
    // Smart buffer management: treat uninitialized values as zero for first NFFT samples
    
    // Main processing loop - optimized for reduced latency
    PROCESS_SAMPLES: for (int sampleIdx = 0; sampleIdx < numSamples; sampleIdx++) {
        #pragma HLS PIPELINE II=1
        #pragma HLS LOOP_TRIPCOUNT min=1024 max=8192 avg=4096
        
        // Read input sample from stream
        ComplexType currentSample = rxSignal.read();
        
        // Calculate squared magnitude
        FixedPoint realPart = currentSample.real();
        FixedPoint imagPart = currentSample.imag();
        EnergyType sigMagSq = realPart * realPart + imagPart * imagPart;
        
        // Smart circular buffer operation - no initialization required
        EnergyType oldValue;
        if (sampleIdx < NFFT) {
            oldValue = 0;  // Treat uninitialized as zero for first NFFT iterations
        } else {
            oldValue = circularBuffer[writePtr];  // Normal circular buffer operation
        }
        
        // Store new value in circular buffer
        circularBuffer[writePtr] = sigMagSq;
        
        // Update running sum: subtract old, add new
        bufferSum = bufferSum - oldValue + sigMagSq;
        
        // Calculate signal energy and apply threshold
        EnergyType sigEnergy = bufferSum;
        ThresholdType thresholdValue = sigEnergy * thresholdScaling;
        
        // Apply minimum threshold constraint
        if (thresholdValue < minimumThreshold) {
            thresholdValue = minimumThreshold;
        }
        
        // Write threshold to output stream
        threshold.write(thresholdValue);
        
        // Update circular buffer pointer
        writePtr = (writePtr + 1) % NFFT;
    }
}
                                

Result: a 256-cycle latency reduction with identical functionality
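The functional equivalence can be checked in software with a small harness such as the sketch below. It is a simplified model, not the HLS code: integer samples keep the comparison exact, the window is shortened to 8 entries (the source uses NFFT = 256), and the function names are hypothetical. Because reading truly uninitialised memory is undefined behaviour in C++, the second variant fills its buffer with an arbitrary pattern to stand in for unknown memory contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWindow = 8;  // illustration only; the source uses NFFT = 256

// Reference: zero the circular buffer before processing (the INIT_BUFFER loop).
std::vector<std::int64_t> runningSumWithInit(const std::vector<std::int64_t>& x) {
    std::int64_t buffer[kWindow];
    for (std::size_t i = 0; i < kWindow; ++i) buffer[i] = 0;

    std::vector<std::int64_t> y(x.size());
    std::size_t writePtr = 0;
    std::int64_t sum = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        const std::int64_t oldValue = buffer[writePtr];
        buffer[writePtr] = x[n];
        sum = sum - oldValue + x[n];
        y[n] = sum;
        writePtr = (writePtr + 1) % kWindow;
    }
    return y;
}

// Optimised: no initialisation pass. For the first kWindow samples the old
// value is forced to zero, so the unknown buffer contents are never used.
std::vector<std::int64_t> runningSumNoInit(const std::vector<std::int64_t>& x) {
    std::int64_t buffer[kWindow];
    for (std::size_t i = 0; i < kWindow; ++i) buffer[i] = 0x5A5A5A5A;  // emulated garbage

    std::vector<std::int64_t> y(x.size());
    std::size_t writePtr = 0;
    std::int64_t sum = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        const std::int64_t oldValue = (n < kWindow) ? 0 : buffer[writePtr];
        buffer[writePtr] = x[n];
        sum = sum - oldValue + x[n];
        y[n] = sum;
        writePtr = (writePtr + 1) % kWindow;
    }
    return y;
}
```

Every buffer entry is written once during the first `kWindow` samples, so by the time the conditional starts reading the buffer, each value it reads is a real sample.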

Real-World Deployment

The complete 5G SSB detection system was successfully deployed on a USRP X310 platform equipped with Xilinx Kintex-7 FPGA, demonstrating A2HCoder's capability to generate modular, high-performance, and synthesizable hardware directly from high-level MATLAB specifications.

Figure 6: End-to-end deployment of the A2HCoder-generated 5G SSB detection pipeline on a USRP X310 platform. The system receives wireless signals, processes them on the Kintex-7 FPGA, and outputs detection results.

System-Level Performance

Module	LUTs	FFs	DSP	BRAMs	Latency	Clock (MHz)
pssCorrelator	6,329	21,088	276	0	54,060	254.00
calcThreshold	173	274	3	1	6,013	322.27
peakFinder	1,061	1,439	0	0	6,007	279.02
collectLocations	85	211	0	0	6,004	332.78
extractSSBsig	155	148	0	4	6,730	269.11
detectSSB (Complete System)	8,669	24,216	279	7	53,872	292.23
