^{1}

^{*}

^{1}

^{1}

^{1}

^{1}

Convolutional neural network (CNN) is an essential model to achieve high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse by exploiting the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables the input data reusing through multiple computation periods. This paper presents the accelerator design performance analysis of the proposed architecture. We examine the structure of memory subsystems, as well as the architecture of the computing array, to supply required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in earlier layers of many CNNs) while the input data locality can be utilized maximally when the number of output channel is larger (in later layers).

Convolutional neural networks (CNNs) are attracting much attention by ac- hieving high accuracy in various applications such as image recognition, natural language processing, object detection. CNN is a type of neural networks which employs convolution operation as a feature extraction method, where the weights of the convolution are also self-obtained through the training of the network. This “trainable” feature extraction algorithm is the key of high reco- gnition accuracy of the CNNs.

CNNs are computationally intensive, many kinds of hardware acceleration such as GPGPU computation or ASIC/FPGA-based implementations [

The difference of computational structures between convolutional and fully- connected layers prevents the accelerators to process the entire CNN efficiently. In addition, the convolutional layers in a single CNN may have quite different memory (data transfer) requirements.

In this paper, we analyze the data access patterns of CNN layers and find out a reconfigurable architecture that has the ability of changing data access/operation patterns layer by layer in order to utilize the locality of the layers. We propose a novel CGRA (Coarse Grained Reconfigurable Array) accelerator with time- domain multithreading, which exploits the locality of the input data and con- ceals the latency of the external memory access. Finally, we evaluate the benefits of the multithreading by estimating the data reusability and the hardware re- quirements for our experimental architecture. As a result, the efficiency and usability of the multithreaded processing in CGRA are proven, and the practical and quantitative methods for building/using multithreaded CGRA neural net- work accelerators are shown.

This paper is based on our previous work [

The purpose of this paper is to propose a CNN accelerator architectue and design for embedded applications. This paper is organized as follows: in Section 2, we analyze the computation pattern of CNNs. In Section 3, we discuss the localities of the CNN layers and propose an ideal multithreaded architecture to exploit them. In Section 4, we consider two feasible architectures with multith- readed accumulators. And then in Section 5 we evaluate the performances, arithmetic intensities, and hardware requirements of the architectures on several situations. In Section 6, we present some related studies. Then in Section 7, and we finally conclude this paper.

A CNN includes the layers operate 2-dimensional convolution between the inputs and the weights (kernels), called “convolutional” (CONV) layers, which behave as feature extraction that the pixels having the similar pattern to the weight pattern are amplified. The weights of the CNN are trained to have the reasonable patterns for the task, and this self-tuned feature extraction is the secret the CNNs succeed in the natural data recognition/classification. Typically a CNN might have the “fully-connected” (FC) layers after the CONV layers, and these layers operate vector transformation and work as classifier like the classical multi-layer perceptron.

In this section, we discuss the difference and the similarity of the com- putational structures and the data paths of the two types of layers.

First, we take a look at the computation of the CONV layer with single input/output channel.

An output pixel of the CONV layer is calculated using the weights shaped H × H and the input partial area which has the same shape, as shown in

The computation of the CONV layer consists of the same number of the multiplication and summation. Since all the output elements share the weights, each element of the output uses all of the weights. Then, the corresponding input partial area is uniquely determined by the location of the output element and the size of the weight matrix(H). Input pixels used by each output element are determined when we pick one weight element. Therefore, the convolution can be

computed by selecting a weight element one by one and multiplying the corresponding input area for all output elements. This one operation requires only one weight value and as many the input pixels as the output elements, as shown

From another view, as shown as black solid lines in

The computation of an FC layer is denoted as the multiplication of the input vector and the weight matrix. The name “fully-connected” layer comes from that an output element is calculated using all of the input elements via the weights (

Similar to the CONV layers, the dominant operation of the FC layers is the multiplication and the addition. An output element is connected to all of the input element via their weight, as we discussed above, therefore an input ele- ment has the influence to all output elements. The computation of the FC layer can be done by picking an input element and the weights corresponding to the input and all output elements. One input element and as many weight values as output elements are needed in one operation cycle (

Here, we look at the data reusability of an FC layer (

The discussion of the previous section shows that the dominant operation of the CONV and FC computation is commonly the multiply-add operation but data sizes of the operand are reversed. In this section, we propose the basic idea of the flexible architecture which does not need any difference of the hardware str- ucture through the entire CNN computation.

For the multiply-add (MULADD) operation, the main calculation of the CONV and FC layers, we construct simple multiply-accumulator-based array processor. This is an output-parallel system where a processor engine (PE) handles an output element. According to the above discussion, we need to distribute a common value to all PEs and individual data to each of PEs per one clock cycle.

In the CONV processing, we feed multiple input elements and a shared weight value into the PE array (

The basic processing flow of the single-channel convolution is shown in

space for the input image or features for the hidden layers), convolution be- comes multi-channel (that is, although the CONV layers have 2-d topology in the plane the channels of the input and the output are “fully-connected” so as to express the combination of the features), but let us think about the case the input and the output have only one channel here. Let the parallelism of the array P x × P y , the size of the weight plane H × H , the stride of the convolution s (in this example, P x = P y = 4 , H = 3 , s = 1 .) The operation for one clock cycle is feeding one weight value into the shared input, cropping P x × P y partial area from the input and feeding it through the individual input. After H × H cycles (one operation per one weight), then the convolution is done. In this example, the processor array covers whole the output plane, but in the case the number of PEs is not sufficient the output plane divided into blocks and the array executes the time-division processing.

On the other hand, in the FC layer an input, not a weight, is used by multiple elements. An input element of the FC layer and the weight values for each PE are provided at one time (

The basic computation for the FC layer is shown in

Lower data transfer per operation is the better in point of the computational time and efficiency. There are two opportunities for reducing the data accesses.

First is the sequential locality of the input partial area of CONV layers. A pixel in an input channel is referenced H × H times, the operations of the H weights in each row are consecutive. As shown in

Moreover, the temporal locality of the input of the multiple channel con- volution is also observed. Let the number of input channels C i , the number of the output channels C o . One partial input area is used C o times because all output channels (features) are computed from the combination of the all input channels. According to this, once the input elements are fetched for the com- putation of the first output channel, we can skip loading the same inputs for following output channels, as shown in

We have explained the computation procedure of CONV and FC layers using an ideal processor array with individual and shared inputs, and the data reusability of CNNs on the array system. In this section, we build a feasible arrayed architecture with the multithreaded accumulator and the row-wise data delivery feature to exploit the locality the layers originally have.

Although a square partial input area is needed in every clock cycle in the CONV processing, the square area is not memory-friendly because it requires dis- continuous memory access. In the ideal architecture we indicated in the previous

section, we assumed that the input partial area is fetched within a cycle to the PE array. This requires a very wide-band internal buffer. For a practical imple- mentation, we propose a shift-based row-wise memory system, aiming simpler access pattern and a compact architecture design.

Since memory address of the CONV operations with output-parallel spatial array is continuous in the weight row, row-buffers where each buffer retains an input row can be appropriate (

The row-wise buffers are not sufficient because the CONV operations require H input rows per PE and ( P y + H − 1 ) rows in the entire PE array ( P y rows (colored input values in

_{y}-W P_{y}-R multi-port SRAMs, where the inputs can be read from any row. Compared to the basic idea (

written to the row buffers again from the forwarding output of the most right column.

When we employ the row-wise buffers and the shift-based data delivery instead of the ideal fully-parallel memory system, the overhead for filling all PE columns occurs prior to the multiplication of the first element of the weight row. To overcome this, the time-domain multithreaded partial sum accumulation could be effective. Generally, many CNN implementations contain many-channel and

small-plane CONV layers in deeper part of them, the overhead for sequential transfer will be relatively small, because an input connected to more output channels could be used more times and the data load is reduced by multith- reading.

Now, we try to extend the multi-channel accumulator from a register to an SRAM buffer. An SRAM can have more channels than a register array; this helps us to gain the number of operations per row (i.e. arithmetic intensity). Though SRAM accumulators require more area, it is feasible because the most layers of a CNN have smaller plane and more channels, so the number of PEs and threads (accumulators) could be actually less and more respectively.

We evaluate the performance, the arithmetic intensity of the proposed arch- itecture. Then the complexity of SRAM types for row selection is discussed.

In this section, the input plane size is N i x × N i y × C i , the output plane is N o x × N o y × C o , the kernel size is H × H ( H 2 C i C o elements for all input/out- put channel), the number of PEs is P x × P y , and the number of thread is T. The system operates at clock frequency f and the data bit precision is b.

We evaluated the processing time and data rate of the architecture (

In the CONV processing, we pick all weights from all input/output channels

CONV | FC | |
---|---|---|

Pre-requirements | Out | Out |

Processing Time | ||

Input Rate | ||

Output Rate |

one by one with each PE processing an output element ( ( P x + H − 1 ) H C o C i cycles; it takes P x cycles per row for filling the array and ( H − 1 ) cycles for loading all the weight columns). Time-division processing is taken if the output

size exceeds the number of PEs ( ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ cycles). The processing time is:

( P x + H − 1 ) H C o C i ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ [ Cycles ] (1)

The data transfer is buffered so as to fetch all pixels required by a row-wise processing block ( ( P y + H − 1 ) N i x ) , the input data size is:

( P y + H − 1 ) N i x ⌈ N o y P y ⌉ ⌈ C o T ⌉ C i [ Data ] (2)

The input data rate is denoted as:

b f ( P y + H − 1 ) N i x ⌈ C o T ⌉ ( P x + H − 1 ) H C o ⌈ N o x P x ⌉ [ bps ] (3)

The output data size and rates are:

N o x N o y C o [ Data ] (4)

b f N o x N o y ( P x + H − 1 ) H C i ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ [ bps ] (5)

In the FC processing, it takes

N i P x ⌈ N o P x P y ⌉ [ Cycles ] (6)

(It takes P x cycles per an input element for weight preloading, and ⌈ N o P x P y ⌉

is the number of time-division processing blocks), requiring

N i ( P x P y + 1 ) ⌈ N o P x P y ⌉ [ Data ] (7)

inputs (while ( P x P y ) weights are needed for each of N i input elements) and

P x P y ⌈ N o P x P y ⌉ [ Data ]

outputs. The input and output data rates are:

b f ( 1 + P x P y ) P x [ bps ] (8)

b f P y / N i [ bps ] (9)

We evaluated the theoretical processing time and data rates of the proposed (“multi-port” and “multi-bank”) architecture with AlexNet [

According to these results, the required data rate in CONV layers is 16 MB/s at most. For the embedded systems this is an acceptable result.

“Util.” means “PE Utilization”, the ratio of the averaged number of operating PEs to the number of PEs through the layer processing. Total PE utilization is calculated as:

( SpaceRatio ) × ( TimeRatio ) = P x P y P x P y ⌊ N o x P x ⌋ ⌊ N o y P y ⌋ ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ + R x P y P x P y ( ⌈ N o x P x ⌉ − ⌊ N o x P x ⌋ ) ⌊ N o y P y ⌋ ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ + R x R y P x P y ( ⌈ N o x P x ⌉ − ⌊ N o y P y ⌋ ) ( ⌈ N o y P y ⌉ − ⌊ N o y P y ⌋ ) ⌈ N o x P x ⌉ ⌈ N o y P y ⌉ (10)

# | Type | Weight Shape | Output Shape | Cycles | Rate [GB/s] | Util. [%] |
---|---|---|---|---|---|---|

1 | CONV | 16,144,128 | 0.013 | 100.0 | ||

2 | CONV | 19,660,800 | 0.009 | 73.9 | ||

3 | CONV | 21,233,664 | 0.005 | 71.2 | ||

4 | CONV | 3,981,312 | 0.010 | 66.0 | ||

5 | CONV | 2,654,208 | 0.016 | 66.0 | ||

6 | FC | 4096 | 11,075,584 | 6.425 | 100.0 | |

7 | FC | 4096 | 1,048,576 | 6.425 | 100.0 | |

8 | FC | 1000 | 262,144 | 6.425 | 97.7 | |

Total | 76,060,416 |

where the remainder pixels of the time-division blocks are R x = N o x − P x ⌊ N o x / P x ⌋ and R y = N o y − P y ⌊ N o y / P y ⌋ , as illustrated in

but required resources such as data rate are independent, the availability is retained over the problem size.

Arithmetic intensities of some situations are indicated in

line is for 1-thread (i.e. non-multithreaded register) accumulator, the dotted line for 8-thread, and the solid lines are for 512-thread memory. The red line is in C i = C o = 512 case, assuming the deeper layer of a CNN, while the black ones are C i = 3 , C o = 32 assuming the input layer. This tells us that, highly-multith- readed system works better especially in the layers with more output channels (i.e. the deeper layers) of CNNs, since the multithreaded computation reduces the data load from the external RAMs by exploiting the input data reusability among the output channels. If the problem size is bigger than the spatial array size, the time-divided processing is needed, but the arithmetic intensity is saturated since the required computation and data transfer per output element do not dramatically vary.

We introduced two types of the row-wise SRAM buffers with “multi-port” and “multi-bank” data selection in Sections 4.1, and estimated the performance of the row-wise shift architectures in Section 5.3. Let us discuss the complexity of the ways of data selecting, “multi-port” and “multi-bank” SRAMs.

As described in Section 4.1, a PE row requires H input rows. To select which row buffer to connected to a PE row, two types of the buffer system have been proposed; “multi-port” SRAM that any PE row can have connection to any row buffer, and “multi-bank” one that row selection can be done by moving data internally.

For “multi-port” SRAM, we utilize a ( P y + H − 1 ) -read/write SRAM as the row buffer. The input data for an input row are stored only once and are read by at most H PE rows (i.e. H output rows), so the number of buffer write accesses is minimized (once per input element). However, an n-read SRAM costs hardware resource by O ( n 2 ) , generally. For row buffers, the capacity (costs linearly) and the number of ports (quadratically) are both in O ( P y ) , so this costs O ( P y 3 ) area.

In the “multi-bank” system, on the other hand, we use a ( P + H ) sets (or “banks”) of simple 1-write 1-read SRAM. This costs O ( P y ) resource for O ( P y ) rows. After finishing loading input data for a CONV process, every clock cycle the data are fed into each row via the read port of SRAM, while the write port of SRAM acquires the data overflowed from the forwarding output of the right PE on the previous row. Therefore, although the input data are transferred from the external memory to the system only once, just like the “multi-port” imple- mentation, the internal memory write is taken for at most H times per input element as the loaded data flows over the array.

^{2}] and energy [nJ] specification of the “multi- port” and “multi-bank” 16-bit SRAMs (512 words per row) calculated using CACTI simulation [

Multi-port | Multi-bank | |||||
---|---|---|---|---|---|---|

#Ports | Area | Energy | #Banks | Area | Energy | |

2 | 4 | 0.694 | 0.706 | 4 | 0.085 | 0.103 |

4 | 8 | 4.726 | 5.041 | 6 | 0.127 | 0.279 |

6 | 8 | 4.726 | 9.410 | 8 | 0.169 | 0.535 |

14 | 16 | 41.332 | 105.697 | 16 | 0.338 | 2.387 |

external memory system, and the target application (mainly H). Seeing the result of this calculation, although the “multi-bank” memories require more write accesses, the energy and area are still lower than “multi-port” memories.

In CONV processing, the array requires P y input data at the first output channel in the multithreaded accumulators, therefore the input data rate requirement of the array (internal memory bandwidth) is P y / T [ Words / Cycle ] . Note that the data rate may be P y / C o [ Words / Cycle ] if the number of output channels is less than that of accumulator threads. And more, if the read interval T or C o is greater than P x P y , the number of input values of the whole array, even a 1W1R buffer with b-bit width and the shift registers for whole the PE array can be used.

The P x P y T output values of the CONV processing become valid every

H 2 C i T cycles, the required output data rate is P x P y H 2 C i [ Words / Cycle ] . For the

weights, every clock cycle the array multiply-accumulates the inputs with one weight value, so weight data rate is 1 [ Words / Cycle ] constantly. This means that higher input bandwidth is required when the output channels or the threads are fewer, and that the output rate requirement gets severe if the input channels are fewer or the filter kernels are smaller.

Then we discuss the optimal configuration of the internal SRAMs. To get over the latency of loading input data from the external memories, the processing time must exceed the memory access time, we need row SRAMs of enough depth that the solid partial input area can be read from the external memory behind the multithreaded computation. At this point, if the number of threads is large, the SRAM buffer size should be at least 2 P x words per row (needs “2” for independent read/write accesses). The data rate of the internal read access is P y / T [ Words / Cycle ] , as described above, required data rate of the external bus is also P y / T [ Words / Cycle ] . When this data rate requirement is met, the time for the data acquirement can be covered by the processing time of the array.

The area of the PE array is in proportion to the number of multithreading accumulators T, while the external bandwidth or the area of the row buffers is in proportion to the array height P y and in inverse proportion to T. The pro- cessing time is in inverse ratio to P x P y and direct to T. For planning the practical CGRA system, these observation should be in consideration on de- mand from the applications.

AlexNet [

Various kinds of neural network accelerators have been developed. Zhang et al. analysed the data reusability and the arithmetic intensity of input- and output-parallel architectures built on an FPGA, and provided a method to optimize it when a target application is given [

For an example of purpose-built architecture, MIT Eyeriss [

Some researches are trying to reduce the data transference and the multiply operations. Recently “binarized” neural networks were proposed [

Commercial systems for neural network processing have been working. Google [

We proposed a CGRA architecture with time-domain multithreading for ex- ploiting input data locality. This architecture employs the multithreaded PE accumulators for obtaining multiple in-flight operations and the row-wise data feeding for reusing the input of the neighboring PEs. Our evaluation proved that the proposed architecture is suitable for both deeper and shallower layers in general CNNs. The multithreading in the PEs improves the arithmetic intensity by an order of magnitude. We also showed the quantitative way to determine the specification of the architecture on demand from the applications. It indicated that the architecture requires only a little bandwidth of the external memory, this feature is useful for embedded machine learning applications. Though we showed the advantage of the architecture through the evaluation of the per- formance, the energy efficiency evaluation is desired to develop a more efficient architecture. In order to realize more useful accelerators, we should consider the co-design of the CNN algorithms and the architectures.

This work is supported in part by the JST ACCEL program. We thank the editor and the anonymous reviewers for their useful comments.

Ando, K., Takamaeda-Yamazaki, S., Ikebe, M., Asai, T. and Motomura, M. (2017) A Multithreaded CGRA for Convolutional Neural Network Processing. Circuits and Systems, 8, 149- 170. https://doi.org/10.4236/cs.2017.86010