Vivado HLS in a Nutshell

Z时代
2024-01-10
Category: Technical sharing

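This article walks through configuring, getting started with, and optimizing Vivado HLS, covering the various pragmas and library functions, and points to a large number of references along the way.

C-based HLS can be thought of as an extension of C/C++: it adds hardware-oriented compiler directives so that a high-level specification can be mapped to an RTL-level circuit description.

Quick Start

C/C++ constructs correspond to hardware constructs as follows.

C/C++	Hardware
Function	Module
Argument	Input/output port
Operator	Functional unit
Scalar	Wire or register
Array	Memory
Control flow	Control logic

In general, the hierarchy of the generated RTL code (hardware modules) matches the hierarchy of the original C/C++ code.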

March 11, 2020 - Vivado HLS in a Nutshell

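The example below is matrix multiplication (taken from Zynq Book Tutorials, Exercise 3). It requires the following four files.

matrix_mult.h: the header file, containing the basic macro definitions, type definitions, and the function prototype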

#ifndef __MATRIXMUL_H__
#define __MATRIXMUL_H__
#include <cmath>
using namespace std;
// Compare TB vs HW C-model and/or RTL
#define HW_COSIM
#define IN_A_ROWS 5
#define IN_A_COLS 5
#define IN_B_ROWS 5
#define IN_B_COLS 5
typedef char mat_a;
typedef char mat_b;
typedef short mat_prod;
// Prototype of top level function for C-synthesis
void matrix_mult(
    mat_a a[IN_A_ROWS][IN_A_COLS],
    mat_b b[IN_B_ROWS][IN_B_COLS],
    mat_prod prod[IN_A_ROWS][IN_B_COLS]);
#endif // __MATRIXMUL_H__ not defined

matrix_mult.cpp: implementation of the core function

#include "matrix_mult.h"
void matrix_mult(
    mat_a a[IN_A_ROWS][IN_A_COLS],
    mat_b b[IN_B_ROWS][IN_B_COLS],
    mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
    // Iterate over the rows of the A matrix
    Row:
    for (int i = 0; i < IN_A_ROWS; i++)
    {
        // Iterate over the columns of the B matrix
        Col:
        for (int j = 0; j < IN_B_COLS; j++)
        {
            prod[i][j] = 0;
            // Do the inner product of a row of A and col of B
            Product:
            for (int k = 0; k < IN_B_ROWS; k++)
            {
                prod[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

matrix_mult_test.cpp: the test bench, used for C simulation and for C/RTL co-simulation

#include <iostream>
#include "matrix_mult.h"
using namespace std;
int main(int argc, char **argv)
{
    mat_a in_mat_a[5][5] = {
        {0, 0, 0, 0, 1},
        {0, 0, 0, 1, 0},
        {0, 0, 1, 0, 0},
        {0, 1, 0, 0, 0},
        {1, 0, 0, 0, 0}};
    mat_b in_mat_b[5][5] = {
        {1, 1, 1, 1, 1},
        {0, 1, 1, 1, 1},
        {0, 0, 1, 1, 1},
        {0, 0, 0, 1, 1},
        {0, 0, 0, 0, 1}};
    mat_prod hw_result[5][5], sw_result[5][5];
    int error_count = 0;
    // Generate the expected result
    // Iterate over the rows of the A matrix
    for (int i = 0; i < IN_A_ROWS; i++)
    {
        for (int j = 0; j < IN_B_COLS; j++)
        {
            // Iterate over the columns of the B matrix
            sw_result[i][j] = 0;
            // Do the inner product of a row of A and col of B
            for (int k = 0; k < IN_B_ROWS; k++)
            {
                sw_result[i][j] += in_mat_a[i][k] * in_mat_b[k][j];
            }
        }
    }
#ifdef HW_COSIM
    // Run the Vivado HLS matrix multiplier
    matrix_mult(in_mat_a, in_mat_b, hw_result);
#endif
    // Print product matrix
    for (int i = 0; i < IN_A_ROWS; i++)
    {
        for (int j = 0; j < IN_B_COLS; j++)
        {
#ifdef HW_COSIM
            // Check result of HLS vs. expected
            if (hw_result[i][j] != sw_result[i][j])
            {
                error_count++;
            }
#else
            cout << sw_result[i][j];
#endif
        }
    }
#ifdef HW_COSIM
    if (error_count)
        cout << "TEST FAIL: " << error_count << " results do not match!" << endl;
    else
        cout << "Test passed!" << endl;
#endif
    return error_count;
}

run_hls.tcl: script that automates compilation and execution

# run_hls.tcl
# open the HLS project matrix_mult_prj
set src_dir "."
open_project -reset matrix_mult_prj
# set the top-level function of the design
set_top matrix_mult
# add design and testbench files
add_files $src_dir/matrix_mult.h
add_files $src_dir/matrix_mult.cpp
add_files -tb $src_dir/matrix_mult_test.cpp
open_solution "solution"
# use Zynq device
set_part {xc7z020clg484-1}
# target clock period is 10 ns
create_clock -period 10 -name default
# do a c simulation
csim_design -clean
# synthesize the design
csynth_design
# do a co-simulation
#cosim_design
# close project and quit
close_project
exit

Run the script with vivado_hls -f run_hls.tcl.

The command-line output is shown below.

Command line execution results

****** Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC v2018.1 (64-bit)
**** SW Build 2188600 on Wed Apr  4 18:40:38 MDT 2018
**** IP Build 2185939 on Wed Apr  4 20:55:05 MDT 2018
** Copyright 1986-2018 Xilinx, Inc. All Rights Reserved.
#######
# set up projects
#######
# c simulation
INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
Compiling ../../../../matrix_mult_test.cpp in debug mode
Compiling ../../../../matrix_mult.cpp in debug mode
Generating csim.exe
Test passed!
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
# synthesis
INFO: [HLS 200-10] Analyzing design file './matrix_mult.cpp' ...
INFO: [HLS 200-10] Validating synthesis directives ...
INFO: [HLS 200-111] Finished Checking Pragmas Time (s): cpu = 00:00:01 ; elapsed = 00:00:13 . Memory (MB): peak = 101.551 ; gain = 44.590
INFO: [HLS 200-111] Finished Linking Time (s): cpu = 00:00:01 ; elapsed = 00:00:14 . Memory (MB): peak = 101.563 ; gain = 44.602
INFO: [HLS 200-10] Starting code transformations ...
INFO: [HLS 200-111] Finished Standard Transforms Time (s): cpu = 00:00:02 ; elapsed = 00:00:15 . Memory (MB): peak = 102.961 ; gain = 46.000
INFO: [HLS 200-10] Checking synthesizability ...
INFO: [HLS 200-111] Finished Checking Synthesizability Time (s): cpu = 00:00:02 ; elapsed = 00:00:15 . Memory (MB): peak = 103.191 ; gain = 46.230
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:02 ; elapsed = 00:00:16 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:02 ; elapsed = 00:00:17 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'matrix_mult' ...
INFO: [HLS 200-10]
----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-11] Finished scheduling.
INFO: [HLS 200-111]  Elapsed time: 17.173 seconds; current allocated memory: 75.025 MB.
INFO: [BIND 205-100] Starting micro-architecture generation ...
INFO: [BIND 205-101] Performing variable lifetime analysis.
INFO: [BIND 205-101] Exploring resource sharing.
INFO: [BIND 205-101] Binding ...
INFO: [BIND 205-100] Finished micro-architecture generation.
INFO: [HLS 200-111]  Elapsed time: 0.281 seconds; current allocated memory: 75.202 MB.
INFO: [HLS 200-10]
----------------------------------------------------------------
INFO: [HLS 200-10] -- Generating RTL for module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/a' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/b' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/prod' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on function 'matrix_mult' to 'ap_ctrl_hs'.
INFO: [SYN 201-210] Renamed object name 'matrix_mult_mac_muladd_8s_8s_16ns_16_1_1' to 'matrix_mult_mac_mbkb' due to the length limit 20
INFO: [RTGEN 206-100] Generating core module 'matrix_mult_mac_mbkb': 1 instance(s).
INFO: [RTGEN 206-100] Finished creating RTL model for 'matrix_mult'.
INFO: [HLS 200-111]  Elapsed time: 0.229 seconds; current allocated memory: 75.575 MB.
INFO: [HLS 200-111] Finished generating all RTL models Time (s): cpu = 00:00:03 ; elapsed = 00:00:19 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [SYSC 207-301] Generating SystemC RTL for matrix_mult.
INFO: [VHDL 208-304] Generating VHDL RTL for matrix_mult.
INFO: [VLOG 209-307] Generating Verilog RTL for matrix_mult.
INFO: [HLS 200-112] Total elapsed time: 18.761 seconds; peak allocated memory: 75.575 MB.
INFO: [Common 17-206] Exiting vivado_hls at Thu Mar 12 16:38:28 2020...

This produces the following results (see the generated report file matrix_mult_prj\solution\syn\report\matrix_mult_csynth.rpt).

Performance estimates

================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
|  Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |  10.00|      8.70|        1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
|  Latency  |  Interval | Pipeline|
| min | max | min | max |   Type  |
+-----+-----+-----+-----+---------+
|  311|  311|  311|  311|   none  |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+--------------+-----+-----+----------+-----------+-----------+------+----------+
|              |  Latency  | Iteration|  Initiation Interval  | Trip |          |
|   Loop Name  | min | max |  Latency |  achieved |   target  | Count| Pipelined|
+--------------+-----+-----+----------+-----------+-----------+------+----------+
|- Row         |  310|  310|        62|          -|          -|     5|    no    |
| + Col        |   60|   60|        12|          -|          -|     5|    no    |
|  ++ Product  |   10|   10|         2|          -|          -|     5|    no    |
+--------------+-----+-----+----------+-----------+-----------+------+----------+

Pipelining lowers the initiation interval (II), which increases parallelism and improves throughput.

void matrix_mult(
    mat_a a[IN_A_ROWS][IN_A_COLS],
    mat_b b[IN_B_ROWS][IN_B_COLS],
    mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
    // Iterate over the rows of the A matrix
    Row:
    for (int i = 0; i < IN_A_ROWS; i++)
    {
        // Iterate over the columns of the B matrix
        Col:
        for (int j = 0; j < IN_B_COLS; j++)
        {
#pragma HLS PIPELINE II=1
            prod[i][j] = 0;
            // Do the inner product of a row of A and col of B
            Product:
            for (int k = 0; k < IN_B_ROWS; k++)
            {
                prod[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

Recompiling and rerunning gives:

Command line execution results (add pipelining)

INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'Row_Col'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('a_load_1', ./matrix_mult.cpp:22) on array 'a' due to limited memory ports. Please consider using a memory core with more ports or partitioning the array 'a'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 3, Depth = 5.
WARNING: [SCHED 204-21] Estimated clock period (10.779ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path consists of the following:
'mul' operation ('tmp_7_2', ./matrix_mult.cpp:22) (3.36 ns)
'add' operation ('tmp2', ./matrix_mult.cpp:22) (3.02 ns)
'add' operation ('tmp_8_4', ./matrix_mult.cpp:22) (2.08 ns)
'store' operation (./matrix_mult.cpp:22) of variable 'tmp_8_4', ./matrix_mult.cpp:22 on array 'prod' (2.32 ns)
INFO: [SCHED 204-11] Finished scheduling.

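The performance report below shows that the Row and Col loops have been merged into a single Row_Col loop and that latency drops sharply, from 311 to 78 cycles, a speedup of almost 4x. (In fact, on larger data sets, pipelining alone can give a speedup of more than 10x.)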

Performance estimates (add pipelining)

================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
|  Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |  10.00|     10.78|        1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
|  Latency  |  Interval | Pipeline|
| min | max | min | max |   Type  |
+-----+-----+-----+-----+---------+
|   78|   78|   78|   78|   none  |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|           |  Latency  | Iteration|  Initiation Interval  | Trip |          |
| Loop Name | min | max |  Latency |  achieved |   target  | Count| Pipelined|
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|- Row_Col  |   76|   76|         5|          3|          1|    25|    yes   |
+-----------+-----+-----+----------+-----------+-----------+------+----------+

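The total number of clock cycles needed to complete a loop is

\[N_{loop}=(J\times N_{body})+N_{control}\]

where $J$ is the number of loop iterations, $N_{body}$ is the number of cycles needed for one iteration of the loop body, and $N_{control}$ is the loop-control overhead in cycles.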
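As a quick sanity check, the formula can be applied to the unpipelined report above. The sketch below is plain C++ arithmetic, not tool output; the overhead values (2, 2, and 1 cycles) are assumptions inferred from the numbers in that report, and the helper name loop_cycles is hypothetical.

```cpp
#include <cassert>

// Cycles for a rolled (non-pipelined) loop: N_loop = J * N_body + N_control.
constexpr int loop_cycles(int trip_count, int n_body, int n_control) {
    return trip_count * n_body + n_control;
}

// Overheads below are inferred from the report, not documented by the tool.
// One Col iteration = Product loop (5 iterations x 2 cycles) + 2 cycles.
constexpr int col_iteration = loop_cycles(5, 2, 2);
// One Row iteration = Col loop (5 iterations x 12 cycles) + 2 cycles.
constexpr int row_iteration = loop_cycles(5, col_iteration, 2);
// Whole function = Row loop (5 iterations x 62 cycles) + 1 cycle.
constexpr int total_latency = loop_cycles(5, row_iteration, 1);

static_assert(col_iteration == 12, "matches the Col iteration latency");
static_assert(row_iteration == 62, "matches the Row iteration latency");
static_assert(total_latency == 311, "matches the reported total latency");
```

The three values line up with the iteration latencies (12 and 62) and the total latency (311) in the report, which suggests the formula is applied level by level through the loop nest.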

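Note that the scheduling log above says that although the target II is 1, the best achievable II is 3, because the number of memory ports is the limiting factor. To improve performance further, the arrays have to be partitioned (or reshaped) so that more elements can be accessed per cycle.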

void matrix_mult(
    mat_a a[IN_A_ROWS][IN_A_COLS],
    mat_b b[IN_B_ROWS][IN_B_COLS],
    mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
    // Iterate over the rows of the A matrix
    Row:
    for (int i = 0; i < IN_A_ROWS; i++)
    {
        // Iterate over the columns of the B matrix
        Col:
        for (int j = 0; j < IN_B_COLS; j++)
        {
#pragma HLS PIPELINE II=1
            prod[i][j] = 0;
            // Do the inner product of a row of A and col of B
            Product:
            for (int k = 0; k < IN_B_ROWS; k++)
            {
                prod[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

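The resulting report is shown below. Latency drops to 29 cycles, which is more than 10x faster than the original naive matrix multiplication, and all it took was inserting 3 lines (one PIPELINE pragma and two ARRAY_RESHAPE pragmas) into the original C++ code.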

Execution results & performance estimates (add pipelining & array partition)

INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'Row_Col'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 4.
WARNING: [SCHED 204-21] Estimated clock period (11.477ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path consists of the following:
'mul' operation ('tmp_7_4', ./matrix_mult.cpp:25) (3.36 ns)
'add' operation ('tmp3', ./matrix_mult.cpp:25) (3.02 ns)
'add' operation ('tmp2', ./matrix_mult.cpp:25) (3.02 ns)
'add' operation ('tmp_8_4', ./matrix_mult.cpp:25) (2.08 ns)
INFO: [SCHED 204-11] Finished scheduling.
================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
|  Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |  10.00|     11.48|        1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
|  Latency  |  Interval | Pipeline|
| min | max | min | max |   Type  |
+-----+-----+-----+-----+---------+
|   29|   29|   29|   29|   none  |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|           |  Latency  | Iteration|  Initiation Interval  | Trip |          |
| Loop Name | min | max |  Latency |  achieved |   target  | Count| Pipelined|
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|- Row_Col  |   27|   27|         4|          1|          1|    25|    yes   |
+-----------+-----+-----+----------+-----------+-----------+------+----------+

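The report also contains more detailed memory and resource (LUT, FF, BRAM, DSP) usage information, which is not reproduced here. Be aware, though, that the HLS resource estimates are quite imprecise and can differ greatly from the results after back-end synthesis.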

C HLS pragma

#pragma HLS pipeline II=<int>

#pragma HLS array_partition variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
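- For the effect of partitioning, see the video from Politecnico di Milano
- Several partitions can be applied to different dimensions of the same array

#pragma HLS array_reshape variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
- For the difference from array_partition, see this article

#pragma HLS dataflow

#pragma HLS unroll factor=<N>

#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
- A helper pragma: for dynamic loops (whose trip count is not known at compile time), it sets bounds on the trip count so that the HLS report can show an estimated latency; otherwise the latency of these loops is shown as ?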
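The following is a minimal sketch of how several of these pragmas might be combined on a loop with a run-time trip count. The function name dot, the bound MAX_N, and the chosen factors are illustrative assumptions rather than values from the text. A regular C++ compiler ignores the HLS pragmas, so the function can be checked in software first.

```cpp
const int MAX_N = 64;  // hypothetical upper bound on the vector length

// Dot product over the first n elements; n <= MAX_N is assumed.
int dot(const int a[MAX_N], const int b[MAX_N], int n) {
// Split each array into two banks so two elements can be read per cycle.
#pragma HLS array_partition variable=a cyclic factor=2 dim=1
#pragma HLS array_partition variable=b cyclic factor=2 dim=1
    int acc = 0;
    for (int i = 0; i < n; i++) {
// n is only known at run time, so give the report bounds to work with.
#pragma HLS loop_tripcount min=1 max=64 avg=32
#pragma HLS unroll factor=2
#pragma HLS pipeline II=1
        acc += a[i] * b[i];
    }
    return acc;
}
```

Whether the requested II is actually reached depends on the target device and tool version, so the scheduling log should be checked after synthesis.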

pipeline

See the UG, p. 331, for details.

dout_t loop_pipeline(din_t A[N]) {
    int i, j;
    static dout_t acc;
    LOOP_I: for (i = 0; i < 20; i++) {
        LOOP_J: for (j = 0; j < 20; j++) {
            acc += A[i] * j;
        }
    }
    return acc;
}

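Without pipelining, all of the code executes serially.
- Latency is $20\times 20\times T_{mac}$

If the inner loop is pipelined, LOOP_J has only one copy in hardware (a single multiplier).
- Latency is $(20\times 20-1)\times II + T_{mac}$. With an initiation interval (II) of 1, the total latency is roughly 400 cycles (the multiplier latency is effectively hidden), and the design needs fewer than 100 LUTs and registers.

If the outer loop is pipelined, LOOP_J is unrolled into 20 copies, so 20 multipliers and 20 array accesses have to be scheduled.
- Latency is only $(20-1)\times II + T_{mac}$ cycles (provided the multipliers can all operate at the same time)

If the whole function is pipelined, several thousand LUTs and registers are generated.
- Latency is only 10 cycles (20 dual-port accesses), at the cost of a large amount of hardware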
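To get a feel for the numbers, the three loop-level formulas above can be evaluated directly. This is only arithmetic on the expressions in the text; the function names are hypothetical, and t_mac (the latency of one multiply-accumulate) and ii are left as parameters because the text does not fix their values.

```cpp
// Latency estimates for the 20x20 loop nest, using the formulas in the text.
// t_mac: cycles for one multiply-accumulate; ii: initiation interval.
constexpr int latency_no_pipeline(int t_mac) {
    return 20 * 20 * t_mac;
}
constexpr int latency_inner_pipelined(int ii, int t_mac) {
    return (20 * 20 - 1) * ii + t_mac;
}
constexpr int latency_outer_pipelined(int ii, int t_mac) {
    return (20 - 1) * ii + t_mac;
}

// Example with an assumed t_mac of 3 cycles and ii of 1.
static_assert(latency_no_pipeline(3) == 1200, "fully serial");
static_assert(latency_inner_pipelined(1, 3) == 402, "roughly 400 cycles");
static_assert(latency_outer_pipelined(1, 3) == 22, "inner loop unrolled");
```

With these assumed values, pipelining the inner loop is about 3x faster than the serial version, and pipelining the outer loop is faster again at the price of 20 multipliers.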

HLS Stream Library

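Vivado HLS provides the hls::stream<> template class (include the header <hls_stream.h>). It behaves like a FIFO of unbounded length (no size needs to be declared; in the hardware implementation the depth is 1), each value can be read from the queue only once, and at the top-level interface it is implemented as ap_fifo.

When passed to a C++ function, a stream can only be passed by reference, for example &my_stream.

If hls::stream is used to pass data between tasks, consider implementing those tasks inside a DATAFLOW region.

Outside a dataflow region, tasks run serially one after another, so the FIFO must be large enough to hold all intermediate results; otherwise the error below is reported. (Note that if the tasks are not written as explicit functions, then after loop unrolling Vivado HLS may still fail to recognize the dataflow region, so it is best to declare them as functions.)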

ERROR: [XFORM 203-733] An internal stream xxxx.xxxx.V.user.V' with default size is used in a non-dataflow region, which may result in deadlock. Please consider to resize the stream using the directive 'set_directive_stream' or the 'HLS stream' pragma.

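By default, hls::stream reads and writes are blocking: a read from an empty FIFO or a write to a full FIFO stalls. The non-blocking variants instead return a boolean indicating whether the operation succeeded.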

#include <hls_stream.h>
// Usage of void write(const T & wdata)
hls::stream<int> my_stream;
int src_var = 42;
my_stream.write(src_var);
// my_stream << src_var;
// Usage of void read(T &rdata), on the same my_stream
int dst_var;
my_stream.read(dst_var);
// int dst_var = my_stream.read();
// my_stream >> dst_var;
// Non-blocking read/write: both return true on success
bool read_ok = my_stream.read_nb(dst_var);
bool write_ok = my_stream.write_nb(src_var);
// full/empty status
bool is_full = my_stream.full();
bool is_empty = my_stream.empty();

The following pragma can be used to customize how a stream is implemented:

#pragma HLS stream variable=<variable> depth=<int> dim=<int>

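Make sure the number of reads and the number of writes on a FIFO match. With hls::stream such a mismatch already shows up in csim, whereas with a plain array there is no way to detect it, which later leads to problems such as an endless loop in cosim or a deadlock in hardware.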
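The read/write balance rule can be illustrated with a toy software FIFO. The sketch below is a simplified model built on std::queue, not the hls::stream implementation; SimStream and balanced are hypothetical names introduced only for this illustration.

```cpp
#include <queue>

// Toy FIFO model; NOT the hls::stream implementation.
template <typename T>
class SimStream {
public:
    void write(const T& value) { fifo_.push(value); }
    // Non-blocking read: returns false and leaves `out` untouched when empty.
    bool read_nb(T& out) {
        if (fifo_.empty()) return false;
        out = fifo_.front();
        fifo_.pop();
        return true;
    }
    bool empty() const { return fifo_.empty(); }
private:
    std::queue<T> fifo_;
};

// The producer writes n_write values, then the consumer reads n_read values.
// Returns true only if every read succeeded and no data is left over.
bool balanced(int n_write, int n_read) {
    SimStream<int> s;
    for (int i = 0; i < n_write; i++) s.write(i);
    bool ok = true;
    for (int i = 0; i < n_read; i++) {
        int value = 0;
        if (!s.read_nb(value)) ok = false;  // read from an empty stream
    }
    return ok && s.empty();  // leftover data also counts as a mismatch
}
```

Both kinds of mismatch are caught here: reading too often hits an empty FIFO, and reading too rarely leaves data behind at the end of the run.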

Reference

hls::stream Class

SDAccel pragma HLS stream

HLS Study Notes

Issues

Achieving II=1 for streaming an array, https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Achieving-II-1-for-streaming-an-array/m-p/1072414#M19669

The entries are not accessed in sequential order, https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Cycle-synthesis-error-in-Vivado-HLS-2018-2-amp-3/m-p/951573

Estimating stream depth, https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Estimating-HLS-Stream-Depth/td-p/658115

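If INTERFACE is used (with an AXI4 master port), the bit width must be a multiple of 8. The following declaration triggers the error shown below it.

void test(ap_uint<1> A[10][10]) {
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=A bundle=control
    // ...
}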

  ERROR: [v++ 203-801] Interface parameter bitwidth 'A.V' (/home/hc2238/heterocl-demo/s1-project/kernel.cpp:15:1)
must be a multiple of 8 for AXI4 master port.

HLS Video Library

Include the header <hls_video.h>. The most useful classes in it are LineBuffer and Window (called WindowBuffer below).

LineBuffer

// hls::LineBuffer<rows, columns, type> variable;
hls::LineBuffer<3, 5, char> Buff_A;
Buff_A.shift_pixels_down(2);
Buff_A.insert_top_row(100, 2);
Value = Buff_A.getval(1, 3); // 9

Row	Column 0	Column 1	Column 2	Column 3	Column 4
Row 0	1	2	3	4	5
Row 1	6	7	8	9	10
Row 2	11	12	13	14	15

Starting from the contents above, the operations in the snippet turn the buffer into

Row	Column 0	Column 1	Column 2	Column 3	Column 4
Row 0	1	2	100	4	5
Row 1	6	7	3	9	10
Row 2	11	12	8	14	15
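The two tables can be reproduced with a small software model. The sketch below models only the calls used in the snippet above and is not the hls::LineBuffer source; ToyLineBuffer and make_example are hypothetical names.

```cpp
// Toy 3x5 line buffer modelling only the calls used above;
// this is NOT the hls::LineBuffer implementation.
struct ToyLineBuffer {
    char val[3][5];
    // Move every pixel of one column down by one row.
    void shift_pixels_down(int col) {
        for (int r = 2; r > 0; r--) val[r][col] = val[r - 1][col];
    }
    // Overwrite the top pixel of one column.
    void insert_top_row(char value, int col) { val[0][col] = value; }
    char getval(int row, int col) const { return val[row][col]; }
};

// Fill the buffer with 1..15 as in the first table, then apply the two calls.
ToyLineBuffer make_example() {
    ToyLineBuffer b;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 5; c++)
            b.val[r][c] = static_cast<char>(r * 5 + c + 1);
    b.shift_pixels_down(2);
    b.insert_top_row(100, 2);
    return b;
}
```

After make_example(), column 2 holds 100, 3, 8 from top to bottom, and getval(1, 3) still returns 9 because column 3 was never touched.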

Other API functions include

shift_pixels_up()

shift_pixels_down()

insert_bottom_row()

insert_top_row()

getval(row,column)

WindowBuffer

// hls::Window<row, column, type> variable;
hls::Window<3, 3, char> Buff_B;

shift_pixels_up()

shift_pixels_down()

shift_pixels_left()

shift_pixels_right()

insert_pixel(value,row,column): overwrites the pixel directly

insert_row()

insert_bottom_row()

insert_top_row()

insert_col()

insert_left_col()

insert_right_col()

getval(row, column)

Column 0	Column 1	Column 2	Row
1	2	3	Row 0
6	7	8	Row 1
11	12	13	Row 2

Calling Buff_B.shift_pixels_up() gives

Column 0	Column 1	Column 2	Row
6	7	8	Row 0
11	12	13	Row 1
New	New	New	Row 2

As you can see, this pattern makes implementing convolution very efficient.

Running char C[3] = {50, 50, 50}; Buff_B.insert_row(C,1); gives

Column 0	Column 1	Column 2	Row
1	2	3	Row 0
50	50	50	Row 1
11	12	13	Row 2

As you can see, WindowBuffer usually operates on a whole block of memory, whereas LineBuffer mostly targets individual elements.
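The remark about convolution can be made concrete with the same shift-and-insert pattern on a plain 3x3 array. This is a simplified sketch that does not use hls::Window; the names box_sums, IMG_H, and IMG_W are assumptions, and a box sum stands in for a real convolution kernel.

```cpp
const int IMG_H = 3;  // image height equals the window height in this sketch
const int IMG_W = 5;

// Writes the sum of every 3x3 window, sliding left to right, into out.
void box_sums(const int img[IMG_H][IMG_W], int out[IMG_W - 2]) {
    int win[3][3] = {};
    for (int c = 0; c < IMG_W; c++) {
        for (int r = 0; r < 3; r++) {
            // Shift the window one column to the left...
            win[r][0] = win[r][1];
            win[r][1] = win[r][2];
            // ...and insert the newly arrived pixel as the right column.
            win[r][2] = img[r][c];
        }
        if (c >= 2) {  // the window is full once three columns have arrived
            int sum = 0;
            for (int r = 0; r < 3; r++)
                for (int k = 0; k < 3; k++) sum += win[r][k];
            out[c - 2] = sum;
        }
    }
}
```

Each image pixel is read exactly once, and the nine window values live in registers, which is why this access pattern maps well to hardware.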

Interface


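AXI_Lite: s_axilite, receives data passively (slave), scalar data. A lightweight version of AXI for memory-mapped, single-transfer communication. Its advantage is that the simplified interface takes up less logic area. It does not support bursts, so each transaction transfers a single data item.

AXI_Master: m_axi, actively transfers data (master). The highest-performance interface, suited to memory-mapped communication; it supports burst transfers of up to 256 data beats per address phase.

AXI_Stream: axis, FIFO, not addressable

Optional attributes of HLS INTERFACE:

depth=<int>: defines the FIFO size

bundle=<string>: groups function arguments into the same port; by default only one port is used for data transfer

offset=<string>: controls the address offset for m_axi and s_axilite
- s_axilite: <string> specifies the address in the register map
- m_axi: <string> can take one of the following values: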
  - direct: Generate a scalar input offset port
  - slave: Generate an offset port and automatically map it to an AXI4-Lite slave interface
  - off: Do not generate an offset port
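As an illustration, the sketch below shows how these attributes are commonly combined on a kernel whose arrays live in device memory. The function name vadd and the bundle names gmem0, gmem1, and control are assumptions chosen for this example. A regular C++ compiler ignores the pragmas, so the function can be tested in software.

```cpp
// Hypothetical kernel: element-wise sum of two arrays held in device memory.
extern "C" void vadd(const int* a, const int* b, int* out, int n) {
// Array data travels over AXI master ports; a and out share one port here.
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
// Base addresses, scalars, and control signals go through one AXI4-Lite port.
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=out bundle=control
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

Putting a and b on different bundles lets the two reads proceed in parallel, while arguments that share a bundle are served by the same physical port.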

Reference

Pynq Workshop

HLS INTERFACE

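Data types

Arbitrary-precision (AP) integers; for the implementation see HLS Arbitrary Precision Types. They are all simply C++ template classes.

Integers

#include "ap_int.h"

ap_int is signed, ap_uint is unsigned

They are declared as template classes; for example, ap_uint<24> is a 24-bit unsigned integer

Fixed-point numbers

#include "ap_fixed.h"

ap_fixed and ap_ufixed

ap_fixed<W,I,Q,O>
- W: total word length
- I: integer word length
- Q: quantization mode
- O: overflow mode

For example, ap_ufixed<11,8,AP_TRN,AP_WRAP> means

an 11-bit fixed-point number with 8 integer bits and 3 fractional bits

AP_TRN: quantization by truncation

AP_WRAP: overflow is handled by wrapping (the most significant bits are simply dropped, so values wrap around); the alternative is the saturation mode AP_SAT, in which anything above the maximum becomes the maximum and anything below the minimum becomes the minimum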
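The difference between wrapping and saturation can be seen in a small software model of the 11-bit, 3-fractional-bit unsigned format from the example. This sketch works on the raw bit pattern with ordinary integers and is not the ap_ufixed implementation; quantize_wrap and quantize_sat are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Toy model of an unsigned fixed-point format with W=11 and I=8
// (3 fractional bits). NOT the ap_ufixed implementation.
const int FRAC_BITS = 3;
const std::int64_t RAW_MAX = (1 << 11) - 1;  // largest 11-bit pattern

// Truncation (as with AP_TRN) followed by wrapping (as with AP_WRAP):
// only the low 11 bits of the scaled value are kept.
double quantize_wrap(double x) {
    std::int64_t raw =
        static_cast<std::int64_t>(std::floor(x * (1 << FRAC_BITS)));
    raw &= RAW_MAX;
    return static_cast<double>(raw) / (1 << FRAC_BITS);
}

// Truncation followed by saturation (as with AP_SAT):
// out-of-range values are clamped to the nearest representable value.
double quantize_sat(double x) {
    std::int64_t raw =
        static_cast<std::int64_t>(std::floor(x * (1 << FRAC_BITS)));
    if (raw < 0) raw = 0;
    if (raw > RAW_MAX) raw = RAW_MAX;
    return static_cast<double>(raw) / (1 << FRAC_BITS);
}
```

In this model 1.3 is truncated to 1.25 in both modes, while 257.0 wraps around to 1.0 under wrapping and clamps to 255.875 under saturation.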

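Class member operations

The relevant member operations are listed below (see the C++ Arbitrary Precision Types section of the UG).

The basic operators are all overloaded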

length

to_int, to_double, to_string

Concat

ap_concat_ref ap_(u)int::concat (ap_(u)int low)

ap_concat_ref ap_(u)int::operator , (ap_(u)int high, ap_(u)int low)

ap_uint<10> Rslt;
ap_int<3> Val1 = -3;
ap_int<7> Val2 = 54;
Rslt = (Val2, Val1);       // Yields: 0x1B5
Rslt = Val1.concat(Val2);  // Yields: 0x2B6
(Val1, Val2) = 0xAB;       // Yields: Val1 == 1, Val2 == 43

Bit selection
- ap_bit_ref ap_(u)int::operator [] (int bit)
- Note that it returns a reference, so the selected bit can be assigned to directly

Range selection

ap_range_ref ap_(u)int::range (unsigned Hi, unsigned Lo)

ap_range_ref ap_(u)int::operator () (unsigned Hi, unsigned Lo)

ap_uint<4> Rslt;
ap_uint<8> Val1 = 0x5f;
ap_uint<8> Val2 = 0xaa;
Rslt = Val1.range(3, 0);   // Yields: 0xF
Val1(3, 0) = Val2(3, 0);   // Yields: 0x5A
Val1(3, 0) = Val2(4, 1);   // Yields: 0x55
Rslt = Val1.range(4, 7);   // Yields: 0xA; bit-reversed!
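The concatenation and range results can be checked with ordinary shifts and masks. The helpers below are a plain C++ model of the bit manipulation, not the ap_int API; low_bits, concat_bits, and bit_range are hypothetical names.

```cpp
#include <cstdint>

// Keep only the low `bits` bits of v (its two's-complement bit pattern).
std::uint32_t low_bits(std::int32_t v, int bits) {
    return static_cast<std::uint32_t>(v) & ((1u << bits) - 1u);
}

// Concatenate: `hi` occupies the upper bits, `lo` (lo_bits wide) the lower.
std::uint32_t concat_bits(std::uint32_t hi, std::uint32_t lo, int lo_bits) {
    return (hi << lo_bits) | lo;
}

// Extract bits hi..lo (inclusive, hi >= lo) of v.
std::uint32_t bit_range(std::uint32_t v, int hi, int lo) {
    return (v >> lo) & ((1u << (hi - lo + 1)) - 1u);
}
```

With val1 = low_bits(-3, 3) (binary 101) and val2 = low_bits(54, 7), concat_bits(val2, val1, 3) gives 0x1B5 and concat_bits(val1, val2, 7) gives 0x2B6, matching the concat example; splitting 0xAB with bit_range(0xAB, 9, 7) and bit_range(0xAB, 6, 0) gives 1 and 43.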

Reduce
- bool ap_(u)int::and_reduce ()
- and, or, xor, nand, nor, xnor

set, clear, invert

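Things to watch out for

HLS does not support recursion, system calls (such as file reads), or dynamic memory allocation

By default, loops are kept rolled (not unrolled)

When an outer loop is pipelined or unrolled, its inner loops are unrolled by default

A single-level loop that is unrolled no longer appears as a loop in the HLS report

HLS optimizes for area by default, i.e. it implements the design with the fewest resources (a serial architecture), so latency can be very high and throughput low

Constant arrays (placed in ROM, by default equivalent to a complete array_partition) should be declared as global variables; as local variables they are very slow

Compilation and synthesis modes

csim: simulation at the C level

csyn: C synthesis, generating RTL code

cosim: simulation at the RTL level, with outputs compared against the C results

impl: packaging the RTL into an IP core

The corresponding Tcl commands are: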

# Simulate the C++ design
csim_design
# Synthesize the design
csynth_design
# Co-simulate the design
cosim_design
# Implement the design
export_design -flow impl

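Tool installation

Download the latest Vivado Design Suite - HLx Edition from the Xilinx download page. (The installer for v2020, the latest version at the time of writing, had already reached 35.5 GB, and it has to be downloaded and installed in full; Xilinx does not offer a way to install HLS on its own.)

Vitis

Vitis is a newer, higher-level programming framework from Xilinx that bundles Vivado HLS together with the back-end runtime. Compiling and running from the command line is much more convenient than with plain Vivado.

After downloading the full installer above, double-click xsetup to launch the installer; note that this requires a graphical environment and Java support. Then choose Vitis among the install options; by default it is installed in /tools/Xilinx/Vitis/2020.1. After installation, source settings64.sh to set up the environment variables automatically.

If you need the runtime (XRT), it can be downloaded from this page. It includes the scripts for back-end compilation, so you can build a bitstream directly and then deploy it to the board through the OpenCL programming interface.

Programming for Vitis differs from the earlier Vivado HLS flow in that it involves host-device data transfers, so #pragma HLS INTERFACE directives must be added; otherwise the design does not pass synthesis.

For information on Alveo accelerator cards, see the official website as well as Nimbix's FPGA cloud service.

Reference code repositories:

HLx_Examples: complete test cases with Tcl scripts to run them

FlexCNN: contains HLS implementations of the individual CNN layers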

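Execution issues

csyn

Vivado HLS has no option for parallel compilation, so synthesizing large code bases takes a long time. If the hardware design is done well, however (pipelines added sensibly, arrays partitioned correctly), even large code bases can be synthesized within 10 minutes. In other words, if a design takes too long to synthesize, the most likely cause is that it has not been optimized well.

Solutions to some problems that come up during synthesis may be found in this blog.

cosim

Before compiling and synthesizing a large design, always run csim and cosim first to verify correctness, and only then generate the bitstream for the board.

The following problem may occur when running cosim in vivado_hls.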

/home/jfrye/sw/Xilinx/Vivado/2018.2/include/mpfr.h:244:2: error: ‘__gmp_const’ does not name a type
__MPFR_DECLSPEC __gmp_const char * mpfr_get_version _MPFR_PROTO ((void));
^~~~~~~~~~~
/home/jfrye/sw/Xilinx/Vivado/2018.2/include/mpfr.h:245:2: error: ‘__gmp_const’ does not name a type
__MPFR_DECLSPEC __gmp_const char * mpfr_get_patches _MPFR_PROTO ((void));

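This is caused by a conflict between a header used internally by Xilinx and the system header, and it can be fixed by modifying the file in the include directory.

Edit /tools/Xilinx/Vivado/2020.1/include/mpfr.h and change the system include <gmp.h> to the local-directory include "gmp.h".

/* Check if GMP is included, and try to include it (Works with local GMP) */
#ifndef __GMP_H__
# include "gmp.h"
#endif

If that still does not work, add the following lines at the top of every source file (main.cpp and test.cpp); see the solution to this issue.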

#include <gmp.h>
#define __gmp_const const

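You also need to add suitable compiler flags in the .tcl file; see this answer.

In general, if cosim takes a very long time to produce a result, or the progress percentage goes above 100%, something is wrong with the design and the run can be terminated manually; see this answer.

Calling from Windows

Using vivado_hls inside WSL is actually quite a hassle, and it took a lot of experimentation to arrive at a reasonably good solution. Because installing Vivado on Linux requires a graphical interface, I tried installing a graphical desktop inside WSL and then running xsetup. However, apparently because of the heavy interpretation overhead of the installer's Java virtual machine, the installer never reached its normal screen, so I eventually gave up on this approach.

The approach that finally worked is to install the Vivado suite on Windows and then call it from WSL through two layers of wrappers.

First, make a copy of Xilinx\Vivado\2020.1\bin\vivado_hls.bat (named, say, my_vivado_hls.bat) and edit it: remove the %COMSPEC% command that pops up a new window and replace it with a direct call to vivado_hls. The complete bat file is shown below; command-line arguments are passed through with %1 %2.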

@echo off
set PATH=%~dp0;%PATH%;%~dp0..\tps\win64\msys64\usr\bin;%~dp0..\tps\win64\msys64\mingw64\bin
set AUTOESL_HOME=%~dp0..
set VIVADO_HLS_HOME=%~dp0..
echo ===============================
echo == Vivado HLS Command Prompt
echo == Available commands:
echo == vivado_hls,apcc,gcc,g++,make
echo ===============================
set RDI_OS_ARCH=32
if [%PROCESSOR_ARCHITECTURE%]==[x86] (
  if defined PROCESSOR_ARCHITEW6432 (
    set RDI_OS_ARCH=64
  )
) else (
  if defined PROCESSOR_ARCHITECTURE (
    set RDI_OS_ARCH=64
  )
)
if not "%RDI_OS_ARCH%"=="64" goto _NotX64
set COMSPEC=%WINDIR%\SysWOW64\cmd.exe
rem %COMSPEC%
vivado_hls %1 %2
goto EOF
:_NotX64
set COMSPEC=%WINDIR%\System32\cmd.exe
rem %COMSPEC% /c %0 %1 %2 %3 %4 %5 %6 %7 %8 %9
vivado_hls %1 %2
:EOF

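This is the Windows-side wrapper. With it in place, synthesis can be run directly from PowerShell with .\my_vivado_hls -f run.tcl, without a new window popping up.

The WSL-side wrapper is a bash file that runs the command above through cmd.exe, as follows.

#!/bin/bash
cmd.exe /c <path_to_xilinx>/Xilinx/Vivado/2020.1/bin/my_vivado_hls $1 $2

Arguments are again passed through with $1 $2. Save this script as vivado_hls (note that <path_to_xilinx> should be a Windows path, and that the script needs execute permission via chmod +x) so that it poses as an executable, and put it somewhere on the Linux PATH. This way which vivado_hls also works as expected (something an alias cannot do).

With that, vivado_hls -f run.tcl can finally be called from inside WSL to run synthesis!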