Implementing a Single-Layer CNN using Verilog

ASIC Design ECE 564
GitHub: SantoshSrivatsan24/ece564_project1 (ECE 564 Project 1: Verilog description of the convolution layer of a CNN)

Introduction

The goal of this project was to design the convolution and max-pool layers of a single-layer CNN.

I watched a really nice YouTube video to understand what exactly convolution means. In essence, convolution is an operation that takes two grids of values and combines them to produce a third grid. The smaller grid is called a kernel; the larger grid could be, for example, an image.

A kernel is convolved with an input image by sliding it along the image; at each position, the values in the kernel are multiplied with the corresponding values in the input and accumulated to produce a single value in the output image.

Fig. 1: Convolution between an image I and a kernel K
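In symbols, for an image $I$ and a $k \times k$ kernel $K$, each output value is a sum of elementwise products (as in most CNN literature, this is really cross-correlation, i.e. the kernel is not flipped):

$$O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m,\, j+n)\, K(m, n)$$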

CNNs are widely used in applications such as pattern recognition and optical character recognition (OCR), so accelerating them with an ASIC is a worthwhile investment.

🧠 Design

We were provided with three SRAMs for the inputs, weights, and outputs, and one scratchpad SRAM.

My approach relied on a pattern in the way input matrices are stored in the SRAM to perform four matrix multiplications simultaneously; that is, in each cycle the first element of each of the four input matrices is multiplied with the first kernel element.
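A minimal sketch of this idea in Verilog (module and signal names, and the 16-bit widths, are my assumptions, not taken from the repository): four MAC units advance in lockstep, each multiplying its own input stream by the shared kernel weight.

```verilog
// Sketch (assumed names/widths): four accumulators share one kernel weight,
// so four convolutions make progress every cycle.
module quad_mac (
    input  wire               clk,
    input  wire               rst_n,
    input  wire               mac_en,
    input  wire signed [15:0] weight,              // current kernel element
    input  wire signed [15:0] in0, in1, in2, in3,  // one element per input matrix
    output reg  signed [31:0] acc0, acc1, acc2, acc3
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            acc0 <= 0; acc1 <= 0; acc2 <= 0; acc3 <= 0;
        end else if (mac_en) begin
            acc0 <= acc0 + in0 * weight;
            acc1 <= acc1 + in1 * weight;
            acc2 <= acc2 + in2 * weight;
            acc3 <= acc3 + in3 * weight;
        end
    end
endmodule
```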

I picked values from the input matrix based on this pattern and stored them in a temporary buffer. In Fig. 2a, the blue, green, orange, and red elements correspond to the first elements of the four 3x3 matrices.

The idea of the kernel sliding over the image is emulated by shifting this temporary buffer every clock cycle. This also ensures that the elements of interest are always at the same locations.

Fig. 2a: Picking a 4x4 set of values from an image to store into a buffer
Fig. 2b: Shifting the input buffer
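The shifting in Fig. 2b can be sketched roughly as follows (names and widths are assumptions; the tap positions 0, 1, 4, and 5 are my guess at the four first-element locations in a row-major 4x4 buffer):

```verilog
// Sketch (assumed names/widths) of the 16-entry shifting buffer. Shifting
// the buffer emulates the kernel sliding over the image, so the four
// elements feeding the multipliers stay at fixed indices.
module input_buffer (
    input  wire        clk,
    input  wire        shift_en,
    input  wire        shift_two,              // shift by 2 instead of 1
    input  wire [15:0] in_word0, in_word1,     // new data entering the buffer
    output wire [15:0] tap0, tap1, tap2, tap3  // fixed positions read by the MACs
);
    reg [15:0] buffer [0:15];
    integer i;

    always @(posedge clk) begin
        if (shift_en) begin
            if (!shift_two) begin              // shift by 1 entry
                for (i = 0; i < 15; i = i + 1)
                    buffer[i] <= buffer[i+1];
                buffer[15] <= in_word0;
            end else begin                     // shift by 2 entries
                for (i = 0; i < 14; i = i + 1)
                    buffer[i] <= buffer[i+2];
                buffer[14] <= in_word0;
                buffer[15] <= in_word1;
            end
        end
    end

    // The "elements of interest" always sit at the same locations.
    assign tap0 = buffer[0];
    assign tap1 = buffer[1];
    assign tap2 = buffer[4];
    assign tap3 = buffer[5];
endmodule
```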

🔨 Implementation

FSMs

I had three major FSMs in my design:

  1. The first was to calculate the next read_address for the input matrix.
    1. The 4x4 block of values that I picked from the input matrix isn’t stored sequentially in the SRAM, so the read address has to jump around and can’t simply be incremented.
  2. The second FSM was to keep track of the state of the temporary buffer (a counter-based sketch follows this list).
    1. The four (out of 16) inputs in this buffer that are multiplied with the weights are always in the same locations.
    2. This is enforced by shifting the next set of inputs (to be multiplied with the weights) into these fixed locations.
    3. Depending on the kernel’s position, the buffer is shifted by either 1 or 2 entries; the FSM keeps track of which shift amount to apply.
    4. It also keeps track of the empty and full states of the buffer and indicates when multiplication is allowed (multiplication is only allowed when the buffer is full).
  3. The third FSM was to keep track of when the convolution for one matrix had completed.
    1. The input SRAM could store more than one input matrix.
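Here is a counter-based sketch of the second FSM’s bookkeeping (signal names and the exact handshake are my assumptions, not taken from the repository):

```verilog
// Sketch (assumed names): track the buffer fill level, apply the chosen
// shift amount, and allow multiplication only when the buffer is full.
module buffer_ctrl (
    input  wire clk,
    input  wire rst_n,
    input  wire in_valid,   // a new input word is shifted in this cycle
    input  wire shift_two,  // consume 2 entries instead of 1
    output wire full,       // all 16 entries are valid
    output wire mac_en      // multiplication allowed only when full
);
    reg [4:0] count;  // number of valid entries, 0..16

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            count <= 5'd0;
        else if (!full)
            count <= count + (in_valid ? 5'd1 : 5'd0);   // filling
        else
            count <= count - (shift_two ? 5'd2 : 5'd1)   // consuming
                           + (in_valid  ? 5'd1 : 5'd0);  // ...while refilling
    end

    assign full   = (count == 5'd16);
    assign mac_en = full;
endmodule
```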

Microarchitecture

Fig. 3: High-level architecture and description of data flow

Datapath

Fig. 4: Datapath

Synthesis

I synthesized my design using Synopsys Design Compiler. Synthesis helped me uncover pesky bugs such as unintentional latches, wired-OR logic (a signal driven by more than one piece of logic), and combinational loops in my design that may not have been caught by the Verilog compiler.
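For example, an unintentional latch typically comes from an incomplete assignment in a combinational block (an illustrative snippet, not code from this project):

```verilog
module latch_demo (
    input  wire       sel,
    input  wire [7:0] a, b,
    output reg  [7:0] y
);
    // An incomplete if in a combinational block infers a latch:
    //   always @(*) if (sel) y = a;   // no else: y must hold its value
    // Assigning a default keeps the logic purely combinational:
    always @(*) begin
        y = b;          // default assignment covers every path
        if (sel)
            y = a;
    end
endmodule
```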

Results

My design achieved a clock period of 4.8 ns and an area of 6612.2280 μm².