Introduction
The goal of this project was to design the convolution and max-pool layers of a single-layer CNN.
I watched a really nice YouTube video to understand what exactly convolution means. In essence, convolution is an operation that takes two grids of values and combines them to produce another grid of values. The smaller grid is called a kernel, and the larger grid could be an image.
A kernel is convolved with an input image by sliding it along the image. At each position, the values in the kernel are multiplied with the corresponding values in the input and accumulated to produce a single value in the output image.
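The sliding multiply-accumulate can be sketched in a few lines of Python. This is only an illustration of the operation, not the hardware design. It assumes no padding and a stride of 1, and it does not flip the kernel (the usual CNN convention). The function name `convolve2d` and the example values are made up for this sketch.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and
    multiply-accumulate at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            # Multiply each kernel value with the input value under it.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            output[r][c] = acc
    return output


# Hypothetical 4x4 input and 3x3 kernel, chosen only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = convolve2d(image, kernel)
print(result)  # [[-6, -6], [-6, -6]]
```

A 4x4 input and a 3x3 kernel give a 2x2 output, one value per kernel position.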
CNNs are widely used in pattern-recognition applications such as OCR, so accelerating them with an ASIC is a worthwhile investment.
🧠 Design
We were provided with three SRAMs for the inputs, weights, and outputs, and one scratchpad SRAM.
My approach relied on a pattern in the way input matrices are stored in the SRAM to perform four matrix multiplications simultaneously. That is, the first elements of the four input matrices are all multiplied with the first kernel element at the same time.
I picked values from the input matrix based on this pattern and stored them in a temporary buffer. The blue, green, orange, and red elements correspond to the first elements of each 3x3 matrix among a set of four 3x3 matrices.
The idea of the kernel sliding over the image is emulated by shifting this temporary buffer every clock cycle. This also ensures that the elements of interest are always at the same location.
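One way to picture the shifting buffer is the software model below. It is a sketch under stated assumptions, not the actual RTL. It assumes the 16 picked values form a 4x4 tile held row-major in the buffer, and that the four multipliers always read slots 0, 1, 4, and 5, which are the top-left corners of the four overlapping 3x3 windows. Under those assumptions, walking the kernel row by row needs shifts of 1, 1, 2, 1, 1, 2, 1, 1. The names `TAPS`, `SHIFTS`, and `four_windows` are hypothetical.

```python
# Assumption: fixed buffer slots read by the four multipliers.
TAPS = (0, 1, 4, 5)
# Assumption: shift applied after each of the first eight kernel elements.
# A shift of 2 skips the fourth column of the tile at the end of a kernel row.
SHIFTS = (1, 1, 2, 1, 1, 2, 1, 1)


def four_windows(tile_flat, kernel_flat):
    """Accumulate four 3x3 window results from one 4x4 tile.

    tile_flat:   16 input values, row-major.
    kernel_flat: 9 weights, row-major.
    Returns the results for windows at (0,0), (0,1), (1,0), (1,1).
    """
    buffer = list(tile_flat)
    acc = [0, 0, 0, 0]
    for k, weight in enumerate(kernel_flat):
        # The same weight is applied to all four fixed taps.
        for n, tap in enumerate(TAPS):
            acc[n] += buffer[tap] * weight
        # Shift so the next four inputs land on the same taps.
        if k < len(SHIFTS):
            s = SHIFTS[k]
            buffer = buffer[s:] + [0] * s
    return acc


tile = list(range(1, 17))            # hypothetical 4x4 tile: 1..16
weights = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical 3x3 kernel
print(four_windows(tile, weights))   # [348, 393, 528, 573]
```

Because the taps never move, the multipliers need no input multiplexing in this model; only the shift amount changes from step to step.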
🔨 Implementation
FSMs
I had three major FSMs in my design:
- The first was to calculate the next read_address for the input matrix.
  - The 4x4 values that I picked from the input matrix aren’t stored sequentially in the SRAM. The read address has to jump around and can’t simply be incremented.
- The second FSM was to keep track of the state of the temporary buffer.
  - The four (out of 16) inputs in this buffer that are multiplied with the weights are always in the same location.
  - This is enforced by shifting the next set of inputs (to be multiplied with the weights) into this fixed location.
  - The inputs can either be shifted by 1 or 2 locations.
  - This FSM keeps track of whether to shift the buffer by 1 or 2 entries.
  - It also keeps track of the empty and full states of the buffer and indicates when multiplication is allowed (multiplication is only allowed when the buffer is full).
- The third FSM was to keep track of when the convolution for one matrix had completed.
  - The input SRAM could store more than one input matrix.
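To show why the read address can't simply be incremented, here is a small Python sketch of the address sequence for one 4x4 block. It assumes the input matrix is stored row-major with one element per SRAM address, which the actual SRAM layout may not match. The function `tile_read_addresses` and its parameters are hypothetical.

```python
def tile_read_addresses(base, row, col, width, tile=4):
    """Yield the addresses of a tile x tile block whose top-left element
    is at (row, col) in a `width`-wide matrix stored row-major at `base`."""
    for i in range(tile):
        for j in range(tile):
            yield base + (row + i) * width + (col + j)


# Hypothetical example: an 8-wide matrix, block starting at row 0, column 2.
addresses = list(tile_read_addresses(base=0, row=0, col=2, width=8))
jumps = [b - a for a, b in zip(addresses, addresses[1:])]
print(addresses)
print(jumps)
```

In this model the address steps by 1 within a row of the block and then jumps by `width - 3` to reach the next row, which is the kind of non-sequential pattern an address FSM has to track.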
Microarchitecture
Datapath
Synthesis
I synthesized my design using Synopsys Design Compiler. Synthesis helped me uncover pesky bugs in my design, such as unintentional latches, wired-or logic (a signal driven by more than one piece of logic), and combinational loops, that may not have been caught by the Verilog compiler.
Results
My design achieved a clock period of 4.8 ns and had an area of 6612.2280 .