An interconnect is the backbone of any system: processor cores, DMA engines, graphics engines, memories and other I/O devices all connect to it. Performance requirements have climbed steeply as electronic chips have spread everywhere, including consumer appliances, healthcare, industrial controls and automobiles. Whatever the field, consumers expect top-notch performance without visible lag or a mediocre user experience. Hence, in recent years another field of verification has sprung up in addition to functional verification: performance verification.
I recently had a chance to do some work on performance verification of an interconnect, and have come across some common strategies. The switch fabric is the piece of hardware that connects a set of masters to a set of slaves. It typically contains logic to multiplex, demultiplex, arbitrate and route transactions from any master to any slave based on the address mapping.
For example, a master could issue a read transaction to address 0x3000_0000, which is mapped to the SRAM, and the interconnect would route that transaction to the SRAM slave. The slave would then return data that gets routed back to the same master. In a similar way, other masters can perform read and write transactions to any given slave. As you can imagine, the fabric is responsible for routing thousands of such transactions through its internal elements, creating the possibility for bottlenecks.
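The address-based routing described above can be sketched as a simple decode function. This is a minimal illustration, not a real fabric model; the address ranges and slave names below are hypothetical.

```python
# Hypothetical address map: (start, end, slave) tuples. The ranges are
# illustrative only and do not reflect any particular SoC.
ADDRESS_MAP = [
    (0x3000_0000, 0x3FFF_FFFF, "SRAM"),
    (0x8000_0000, 0xFFFF_FFFF, "DRAM"),
]

def route(address):
    """Return the slave whose address range contains `address`."""
    for start, end, slave in ADDRESS_MAP:
        if start <= address <= end:
            return slave
    return "DECODE_ERROR"  # unmapped address: fabric returns an error response

print(route(0x3000_0000))  # SRAM
print(route(0x8000_1000))  # DRAM
```

A real fabric performs this decode in hardware, in parallel with arbitration, but the lookup logic is conceptually the same.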
Let's assume that a DMA engine in the system is responsible for reading data from the SRAM and, after buffering some of it, writing it to a location in the DRAM. Now, if the fabric takes a long time to provide the necessary data to the DMA engine, the average performance of the DMA block will turn out to be low and could create frequent FIFO underrun (empty) conditions.
Two main metrics are used in performance verification: latency and bandwidth. Latency is measured between the time a master issues a transaction and the time it receives a response from the slave. For example, suppose an AXI master issues a write transaction at time 10 ns to address 0x5000_0000 with 4 beats of 8-byte (64-bit) data. The slave responds with an OKAY response on BRESP if the data was successfully received. If the BRESP was received by the master at 50 ns, then the total round-trip latency for this write transaction is 40 ns.
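A latency-measuring scoreboard in a testbench boils down to timestamping each transaction at issue and at response, keyed by transaction ID. The sketch below assumes the monitor calls `on_issue` and `on_response` with a transaction ID and a timestamp; the function names are my own, not from any verification library.

```python
# Minimal latency scoreboard sketch. Times are in nanoseconds.
issue_times = {}  # txn_id -> issue timestamp
latencies = []    # completed round-trip latencies

def on_issue(txn_id, time_ns):
    """Record when the master issued transaction `txn_id`."""
    issue_times[txn_id] = time_ns

def on_response(txn_id, time_ns):
    """Record the round-trip latency when the response arrives."""
    latencies.append(time_ns - issue_times.pop(txn_id))

# The write transaction from the example: issued at 10 ns, OKAY on BRESP at 50 ns.
on_issue(txn_id=7, time_ns=10)
on_response(txn_id=7, time_ns=50)
print(latencies[0])  # 40
```

From the collected list one can then report average, minimum and maximum latency per master, which is typically what a performance verification plan asks for.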
Bandwidth is the rate of data transfer, or in simple terms, an indication of how many bytes of data are transferred per second. It is usually expressed in MBytes/s or GBytes/s; sometimes Mbits/s and Gbits/s are used instead. Remember that 1 byte is 8 bits. Consider a master running on a 100 MHz clock with an AXI interface whose data bus is 64 bits (8 bytes) wide. If the master is capable of transferring 64 bits of data every clock cycle, the bandwidth would be 8 bytes * 100 MHz, which equals 800 MBytes/s. Of course, this is the theoretical value, and an actual IP may well achieve less.
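Both the theoretical ceiling and the measured bandwidth are straightforward calculations; a sketch (with 1 MByte/s taken as 10^6 bytes/s, which is the usual convention in this context):

```python
def theoretical_bandwidth_MBps(bus_width_bytes, clock_mhz):
    """Peak bandwidth if one full beat transfers every clock cycle."""
    return bus_width_bytes * clock_mhz  # bytes/cycle * 10^6 cycles/s

def measured_bandwidth_MBps(total_bytes, elapsed_ns):
    """Observed bandwidth over a measurement window."""
    return total_bytes / (elapsed_ns * 1e-9) / 1e6

# 64-bit bus at 100 MHz: the theoretical value from the text.
print(theoretical_bandwidth_MBps(8, 100))      # 800

# The earlier write example: 4 beats * 8 bytes = 32 bytes in 40 ns.
print(measured_bandwidth_MBps(4 * 8, 40))      # 800.0
```

Comparing the measured number against the theoretical ceiling over many transactions is how utilization (and hence a bottleneck) shows up in the results.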
Moreover, many interconnect IPs also feature quality-of-service (QoS) schemes where certain masters can be assigned a higher priority than others, thereby enjoying better latency and bandwidth. Sometimes such schemes are software configurable, which introduces an element of dynamism into the transaction flow. Based on such register configurations, each master can have different priorities at different times, which again increases the chances of bottlenecks.
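A fixed-priority arbiter driven by software-programmable QoS values can be sketched as below. The master names and QoS register values are hypothetical; real interconnects often combine priority with anti-starvation mechanisms, which this sketch deliberately omits.

```python
# Hypothetical software-configurable QoS registers: higher value = higher
# priority (analogous in spirit to the AxQOS signalling on AXI4).
qos = {"cpu": 2, "gpu": 1, "dma": 0}

def arbitrate(requesters):
    """Grant the requesting master with the highest QoS value."""
    return max(requesters, key=lambda master: qos[master])

print(arbitrate(["gpu", "dma"]))  # gpu
qos["gpu"] = 3                    # software reprograms the GPU's priority
print(arbitrate(["cpu", "gpu"]))  # gpu now wins over the cpu
```

Because `qos` can change at any time, the same request pattern can produce different grant orders in different windows, which is exactly the dynamism that makes performance verification of such schemes tricky.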
Another question to ponder is how fast the interconnect responds to such software configuration changes, if it supports any. Consider an example where a graphics engine wants to read data from memory, perform quick computations and store the results back. If the CPU occupies the bandwidth to the SRAM because of a higher priority, the graphics engine will need to be configured via software to have a higher priority than the CPU. This will not have the intended effect if the time the interconnect takes to respond to this change exceeds what the graphics engine would normally have taken to perform its original transaction.
With multiple such connection points possible inside the fabric, many such bottlenecks can be uncovered by an appropriate performance verification plan along with the normal functional verification test suites.