Data Offload#

The Data Offload Engine is, in essence, a clock-domain crossing store-and-forward buffer (or FIFO) with some extra features useful in bursty RF applications. More specifically, it was designed to sit between the DMA and the DAC for the TX and between the ADC and the DMA for the RX path of a digital RF chain. This is reflected in the synthesis settings of the device, that also enable or disable certain other settings of features where appropriate. For example, in the receive path cyclic operation isn’t supported.

Features#

  • Configurable storage unit with support for Block-RAM and External DRAM (Up to 16 GiB) or external High Bandwidth Memory (HBM)

  • Configurable interface width and rates

  • External timing synchronization for precisely timed buffers (For example, in combination with the Timing-Division Duplexing Controller)

  • Cyclic and oneshot store and forward operation

  • Bypass mode to completely bypass all features and act as a pure CDC FIFO

  • Many settings configurable at runtime via MM AXI4-Lite bus

Note

The data offload does NOT support, in its current state, continuous streaming except when in bypass mode (thus disabling all other functionality).

Utilization#

Device Family

LUTs

FFs

Xilinx Zynq UltraScale+

750

2000

Files#

Name

Description

library/data_offload/data_offload.v

Verilog source for the peripheral.

drivers/misc/adi-axi-data-offload.c

Linux Driver.

arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev10-ad9081-m8-l4-do.dts

Example device tree using the data offload.

Block Diagram#

Data Offload block diagram

Configuration Parameters#

Name

Description

Default Value

Choices/Range

ID

Instance identification number.

0

MEM_TYPE

Define the used storage type: 0: BlockRAM; 1: external DDR.

0

Internal memory (0), External memory (1)

MEM_SIZE_LOG2

Define the log2 size of the storage element in bytes.

10

1kB (10), 2kB (11), 4kB (12), 8kB (13), 16kB (14), 32kB (15), 64kB (16), 128kB (17), 256kB (18), 512kB (19), 1MB (20), 2MB (21), 4MB (22), 8MB (23), 16MB (24), 32MB (25), 64MB (26), 128MB (27), 256MB (28), 512MB (29), 1GB (30), 2GB (31), 4GB (32), 8GB (33), 16GB (34)

TX_OR_RXN_PATH

If set TX path enabled, otherwise RX.

0

RX path (0), TX path (1)

SRC_DATA_WIDTH

The data width of the source interface.

64

DST_DATA_WIDTH

The data width of the destination interface.

128

DST_CYCLIC_EN

Enables CYCLIC mode for destinations like DAC.

True

AUTO_BRINGUP

If enabled the IP runs automatically after bootup.

1

SYNC_EXT_ADD_INTERNAL_CDC

If enabled the CDC circuitry for the external sync signal is added.

True

HAS_BYPASS

If enabled the bypass circuitry is added.

True

Interface#

Physical Port

Logical Port

Direction

Dependency

s_axi_awaddr AWADDR

in [15:0]

s_axi_awprot AWPROT

in [2:0]

s_axi_awvalid AWVALID

in

s_axi_awready AWREADY

out

s_axi_wdata WDATA

in [31:0]

s_axi_wstrb WSTRB

in [3:0]

s_axi_wvalid WVALID

in

s_axi_wready WREADY

out

s_axi_bresp BRESP

out [1:0]

s_axi_bvalid BVALID

out

s_axi_bready BREADY

in

s_axi_araddr ARADDR

in [15:0]

s_axi_arprot ARPROT

in [2:0]

s_axi_arvalid ARVALID

in

s_axi_arready ARREADY

out

s_axi_rdata RDATA

out [31:0]

s_axi_rresp RRESP

out [1:0]

s_axi_rvalid RVALID

out

s_axi_rready RREADY

in

Physical Port

Logical Port

Direction

Dependency

s_axi_aclk CLK

in

Physical Port

Logical Port

Direction

Dependency

s_axi_aresetn RST

in

Physical Port

Logical Port

Direction

Dependency

m_axis_ready TREADY

in

m_axis_valid TVALID

out

m_axis_data TDATA

out [127:0]

m_axis_last TLAST

out

m_axis_tkeep TKEEP

out [15:0]

Physical Port

Logical Port

Direction

Dependency

s_axis_ready TREADY

out

s_axis_valid TVALID

in

s_axis_data TDATA

in [63:0]

s_axis_last TLAST

in

s_axis_tkeep TKEEP

in [7:0]

Physical Port

Logical Port

Direction

Dependency

wr_request_enable request_enable

out

wr_request_valid request_valid

out

wr_request_ready request_ready

in

wr_request_length request_length

out [9:0]

wr_response_measured_length response_measured_length

in [9:0]

wr_response_eot response_eot

in

wr_overflow status_overflow

in

Physical Port

Logical Port

Direction

Dependency

rd_request_enable request_enable

out

rd_request_valid request_valid

out

rd_request_ready request_ready

in

rd_request_length request_length

out [9:0]

rd_response_eot response_eot

in

rd_underflow status_underflow

in

Physical Port

Logical Port

Direction

Dependency

s_storage_axis_ready TREADY

out

s_storage_axis_valid TVALID

in

s_storage_axis_data TDATA

in [127:0]

s_storage_axis_tkeep TKEEP

in [15:0]

s_storage_axis_last TLAST

in

Physical Port

Logical Port

Direction

Dependency

m_storage_axis_ready TREADY

in

m_storage_axis_valid TVALID

out

m_storage_axis_data TDATA

out [63:0]

m_storage_axis_tkeep TKEEP

out [7:0]

m_storage_axis_last TLAST

out

Physical Port

Direction

Dependency

Description

s_axis_aclk

in

Source Domain Clock Signal Input. Bus m_storage_axis_s_axis is synchronous to this clock domain.

s_axis_aresetn

in

Bus m_storage_axis_s_axis is synchronous to this reset signal.

m_axis_aclk

in

Destination Domain Clock Signal Input. Bus s_storage_axis_m_axis is synchronous to this clock domain.

m_axis_aresetn

in

Bus s_storage_axis_m_axis is synchronous to this reset signal.

init_req

in

Indicator that the signal source (e.g. DMA) intends to provide new data soon.

sync_ext

in

External synchronization signal, with or without internal clock-domain crossing logic. Can be used to couple certain state machine transitions to external processes.

ddr_calib_done

in

MEM_TYPE == 1

Allows the user to read back status information about the DDR calibration status from software.

Register Map#

DWORD

BYTE

Reg Name

Description

BITS

Field Name

Type

Default Value

Description

0x0 0x0 VERSION

Version of the peripheral. Follows semantic versioning. Current version 1.00.61.

[31:16] VERSION_MAJOR RO 0x0001

[15:8] VERSION_MINOR RO 0x00

[7:0] VERSION_PATCH RO 0x61

0x1 0x4 PERIPHERAL_ID

[31:0] PERIPHERAL_ID RO ID

Value of the ID configuration parameter.

0x2 0x8 SCRATCH

[31:0] SCRATCH RW 0x00000000

Scratch register useful for debug.

0x3 0xc IDENTIFICATION

[31:0] IDENTIFICATION RO 0x44414f46

Peripheral identification (‘D’, ‘A’, ‘O’, ‘F’).

0x4 0x10 SYNTHESIS_CONFIG_1

[2:2] HAS_BYPASS RO HAS_​BYPASS

If set the bypass logic is implemented.

[1:1] TX_OR_RXN_PATH RO TX_​OR_​RXN_​PATH

If this device was configured for the TX path, this bit will be set to 1. Conversely, the bit will be 0 for the RX path.

[0:0] MEMORY_TYPE RO MEM_​TYPE

This bit identifies the type of memory that was chosen during synthesis. A value of 1 identifies external memory, while a value of zero indicates that block ram was used.

0x5 0x14 SYNTHESIS_CONFIG_2

[31:0] MEM_SIZE_LSB RO MEM_​SIZE_​LSB

32 bits (LSB) of the storage unit size. MEM_SIZE_LSB = 1<<MEM_SIZE_LOG2

0x6 0x18 SYNTHESIS_CONFIG_3

[1:0] MEM_SIZE_MSB RO MEM_​SIZE_​MSB

2 bits (MSB) of the storage unit size. MEM_SIZE_MSB = (1<<MEM_SIZE_LOG2)>>32

0x7 0x1c TRANSFER_LENGTH

[31:0] TRANSFER_LENGTH RW TRANSFER_​LENGTH

The transfer length register can be used to override the transfer length in RX mode in increments of 64 bytes. TRANSFER_LENGTH = (2**MEM_SIZE_LOG2-1)>>6

0x20 0x80 MEM_PHY_STATE

[5:5] UNDERFLOW RW1C 0x0

Indicates that storage could not handle data rate during play. Available when core is in TX mode.

[4:4] OVERFLOW RW1C 0x0

Indicates that storage could not handle data rate during capture. Available when core is in RX mode.

[0:0] CALIB_COMPLETE RO 0x0

Indicates that the memory initialization and calibration have completed successfully.

0x21 0x84 RESET_OFFLOAD

[0:0] RESETN RW AUTO_​BRINGUP

“Software Reset”: Resets all the internal address registers and state machines.

0x22 0x88 CONTROL

[1:1] ONESHOT_EN RW ONESHOT_​EN

Enables oneshot mode. This means that the data offload will only play a received buffer once, and then stop. This mode is useful when you want to use the data offload for its synchronization features, but don’t need the repeating output. ONESHOT_EN = ~TX_OR_RXN_PATH

[0:0] OFFLOAD_BYPASS RW 0x0

Enables bypass mode. In this mode pretty much all functionality of the data offload is bypassed, and the data offload will simply act as an asynchronous dual-port FIFO and forward your data stream.

0x40 0x100 SYNC_TRIGGER

[0:0] SYNC_TRIGGER RW1C 0x0

Software trigger for software sync mode.

0x41 0x104 SYNC_CONFIG

[1:0] SYNC_CONFIG RW 0x0

Synchronization mode: 0: Auto, 1: Hardware trigger, 2: Software trigger, 3: Reserved.

0x80 0x200 FSM_BDG

[11:8] FSM_STATE_READ RO

It force the Rx side offload state machine into the required state.

[4:0] FSM_STATE_WRITE RO

The current state of the offload state machine.

Access Type

Name

Description

RO

Read-only

Reads will return the current register value. Writes have no effect.

RW

Read-write

Reads will return the current register value. Writes will change the current register value.

RW1C

Read,write-1-to-clear

Reads will return the current register value. Writing the register will clear those bits of the register which were set to 1 in the value written. Bits are set by hardware.

Detailed Description#

General Use Cases#

Note

This IP will always have a storage unit (internal or external to the FPGA) and is designed to handle high data rates. If your data paths will run in a lower data rate, and your intention is just to transfer the data to another clock domain or to adjust the bus width of the data path, you may want to check out the util_axis_fifo or util_axis_fifo_asym IPs.

The initialization and data transfer looks as follows:

  • in case of DAC, the DMA initializes the storage unit, after that the controller will push the data to the DAC interface in one-shot or cyclic way.

  • in case of ADC, the DMA requests a transfer, the controller saves the data into the storage unit, after that it will push it to the DMA.

  • BYPASS mode: simple streaming FIFO in case of clock rate or data width differences between source and sink interfaces (data rate MUST match in order to work); the BYPASS mode is used when an initially high rate path is downgraded to lower rates.

Generic Architecture#

The main role of our data paths is to stream data from point A to point B in a particular system. There are always a SOURCE and a DESTINATION point, which can be a device (ADC or DAC), a DMA (for system memory) or any other data processing IP.

In the context of Data Offload IP, we don’t need to know who is the source and who is the destination. Both interfaces are AXI4 Stream interfaces, which can be supported in both Xilinx’s an Intel’s architecture, and can be connected to any device core or DMA.

The storage unit is connected to the Data Offload controller via two AXIS interfaces. This way the same controller can be used for various storage solutions. (BRAM, URAM, external memory etc.)

Interfaces and Signals#

Register Map Configuration Interface#

AXI4 Lite Memory Mapped Subordinate (S_AXI4_LITE)#

This interface is used to access the register map.

// interface clock -- system clock -- 100 MHz
input                   s_axi_aclk
// interface resetn -- synchronous reset active low
input                   s_axi_aresetn

/* write address channel */

// validates the address on the bus
input                   s_axi_awvalid
// write address
input       [15:0]      s_axi_awaddr
// protection type -- not used in the core
input       [ 2:0]      s_axi_awprot
// write ready, indicates that the subordinate can accept the address
output                  s_axi_awready

/* write data channel */

// validate the data on the bus
input                   s_axi_wvalid
// write data
input       [31:0]      s_axi_wdata
// write strobe, indicates which byte lanes to update
input       [ 3:0]      s_axi_wstrb
// write ready, indicates that the subordinate can accept the data
output                  s_axi_wready

/* write response channel */

// validates the write response of the subordinate
output                  s_axi_bvalid
// write response, indicates the status of the transfer
output      [ 1:0]      s_axi_bresp
// response ready, indicates that the manager can accept the data
input                   s_axi_bready

/* read address channel */

// validates the address on the bus
input                   s_axi_arvalid
// read address
input       [15:0]      s_axi_araddr
// protection type -- not used in the core
input       [ 2:0]      s_axi_arprot
// read ready, indicates that the subordinate can accept the address
output                  s_axi_arready

/* read data channel */

// validates the data on the bus
output                  s_axi_rvalid
// read response, indicates the status of the transfer
output      [ 1:0]      s_axi_rresp
// read data driven by the subordinate
output      [31:0]      s_axi_rdata
// read ready, indicates that the manager can accept the data
input                   s_axi_rready

Supported Data Interfaces#

AXI4 Stream Interface (S_AXIS | M_AXIS)#
  • The AXI Stream Subordinate (S_AXIS) interface is used to receive AXI stream from the transmit DMA or ADC device.

  • The AXI Stream Manager (M_AXIS) interface is used to transmit AXI stream to receive DMA or DAC device.

// NOTE: this reference is a manager interface

// interface clock -- can be device/core clock or DMA clock
input                        m_axis_aclk
// interface resetn -- synchronous reset with the system clock
input                        m_axis_resetn
// indicates that the subordinate can accept a transfer in the current cycle (in case of an ADC core, this will control the stream)
input                        m_axis_ready
// indicates that the manager is driving a valid transfer
output                       m_axis_valid
// primary payload
output [DATA_WIDTH-1:0]      m_axis_data
// indicates the boundary of a packet
output                       m_axis_last
// byte qualifier, we need this so we can have different DMA and device data widths
output [(DATA_WIDTH/8)-1:0]  m_axis_tkeep

Note

A packet will always be a full buffer. All the data beats are going to be full beats (all the bytes of the bus are valid), except for the last one. axis_last and axis_tkeep will be used to indicate a partial last beat. This information should be transferred from the source domain to the sink domain, so we can read back the data from memory correctly.

AXIS Source and Destination Interface to the Storage Unit#

This is a blocking (back-pressure) interface for the storage unit, with similar behavior of main AXIS data interfaces.

Initialization Request Interface#

Defines a simple request interface to initialize the memory:

  • The request will come from the system and will put the data offload FSM into a standby/ready state.

Synchronization Modes#
  • AUTOMATIC

    • ADC: The IP will start to fill up the buffer with samples as soon as possible.

    • DAC: As the DMA will send a valid last, the FSM will start to send the stored data to the device.

  • HARDWARE

    • ADC and DAC: An external signal will trigger the write or read into or from the memory.

  • SOFTWARE

    • The software writes a RW1C register which will trigger the reads or writes into or from the memory.

Note

In case of DAC, if the DMA does not send all the data into the buffer, before a hardware sync event, then the unsent data will be ignored. It’s the user/software responsibility to sync up these events accordingly.

Clock Tree#

In general there are at least two different clocks in the data offload module:

  • DMA or system clock : on this clock will run all the front end interfaces

  • Memory Controller user clock : user interface clock of the DDRx controller (optional)

  • Device clock : the digital interface clock of the converter

Clock Domains diagram

A general frequency relationship of the above clocks are:

CLKdma <= CLKddr <= CLKconverter

The clock domain crossing should be handled by the util_axis_fifo module.

All the back end paths (device side) are time critical. The module must read or write from or into the storage at the speed of the device.

DDR data rate >= Device data rate
DDR data rate >= ADC data rate + DAC data rate

Data Path#

Data Path diagram
  • The data path should be designed to support any difference between the source, memory and sink data width.

  • The data width adjustments will be made by the CDC FIFO.

  • In both paths (ADC and DAC) the data stream at the front-end side is packetized, meaning there is a valid TLAST/TKEEP in the stream. While in the back-end side the stream is continuous (no TLAST/TKEEP).

    • The DAC path has to have a depacketizer to get rid of the last partial beat of the stream.

    • Because the ADC path already arrives in a packed form, and we always will fill up the whole storage, we don’t need to treat special use-cases.

Used Storage Elements#

ZC706

ZCU102

A10SOC

FPGA

XC7Z045 FFG900 – 2

XCZU9EG-2FFVB1156

10AS066N3F40E2SG

External Memory Type

DDR3 SODIMM

DDR4

DDR4 HILO

External Memory Size

1 GB

512 MB

2 GB

Embedded Memory Type

BRAM

BRAM

M20K

Embedded Memory Size

19.1 Mb

32.1 Mb

41 Mb

Data Width Manipulation#

  • Data width differences should be treated by the CDC FIFO.

  • The smallest granularity should be 8 bits. This constraint will mainly generate additional logic just in the TX path, taking the fact that the data from the ADC will come packed.

  • The main role of the gearbox is to improve the DDR’s bandwidth, stripping the padding bits of each samples, so the raw data could be stored into the memory.

Xilinx’s MIG vs. Intel’s EMIF#

  • Incrementing burst support for 1 to 256 beats, the length of the burst should be defined by the internal controller.

  • Concurrent read/write access, the external memory to be shared between an ADC and DAC.

  • Dynamic burst length tuning: an FSM reads and writes dummy data until both ADC’s overflow and DAC’s underflow lines are de-asserted. Pre-requisites: both devices’ interfaces should be up and running.

  • Optional gearbox to congest the samples in order to increase the maximum data rate.

  • In general all samples are packed into 16 bits. This can add a significant overhead to the maximum real data rate on the memory interface. The gearbox’s main role is to pack and unpack the device’s samples into the required data width (in general 512 or 1024 bits).

Boards with FPGA side DDR3/4 SODIMMs/HILO: ZC706, ZCU102, A10SOC.

ZC706

ZCU102

A10SOC

Max data throughputs (MT/s)

1600

2400

2133

DDRx reference clocks

200 MHz

300 MHz

133 MHz

DDRx Data bus width

64

16

64

Memory to FPGA clock ratio

4:1

4:1

4:1

UI type & burst length

AXI4-256

AXI4-256

Avalon Memory Map

UI data width

512

128

512

Internal Cyclic Buffer Support for the TX Path#

Data Path with external storage diagram
  • On the front end side of the TX path, a special buffer will handle the data width up/down conversions and run in cyclic mode if the length of the data set is smaller than 4/8 AXI/Avalon bursts. This way, we can avoid to overload the memory interface with small bursts.

  • On the back end side, because the smallest granularity can be 8 bytes, we need a dynamic ‘depacketizer’ or re-aligner, which will filter out the invalid data bytes from the data stream (this module will use the tlast and tkeep signal of the AXI stream interface).

Control Path - Offload FSM#

../../_images/tx_bram_fsm.svg

TX Control FSM for Internal RAM Mode#

../../_images/rx_bram_fsm.svg

RX Control FSM for Internal RAM Mode#

Linux Driver#

The linux driver has two responsibilities:

  • Initializes the data offload on startup.

  • Integrates with cf_axi_dds to allow IIO to utilize the data offload for cyclic operation.

The former of those two is covered by the device tree, which implements five options:

  • adi,bringup will automatically enable the data offload on startup. Note that this option isn’t always necessary, because the HDL itself may have been synthesized with auto-bringup.

  • adi,oneshot configures the default mode of operation for TX data offloads. This will usually be overridden by the IIO buffer integration and thus doesn’t have an effect in most situations.

  • adi,bypass enables bypass mode, i.e. disables all functionality and makes the data offload act like a small asynchronous FIFO.

  • adi,sync-config determines how the synchronization mechanism should operate. More information about this value can be found in the register map.

  • adi,transfer-length is useful for RX instances, where the size of the receive buffer can be reduced from the default (All available storage).

The latter is addressed by the integration into cf_axi_dds.c and cf_axi_dds_buffer_stream.c, which allow the drivers to control the oneshot functionality of the data offload based on what was requested with the current IIO buffer, assuming that bypass was disabled.

References#