Selecting a compute resource for inference

Mark Patrick, Mouser Electronics

There are two distinct phases in using a machine learning neural network: training and inference. An artificial intelligence (AI) neural network model learns from a labeled training dataset, which, in the case of an object classification model, includes hundreds or potentially thousands of pictures of different objects. The model’s algorithm applies various filters to isolate salient visual features that, for example, differentiate a dog from a cat. Inference is the next stage: it uses the pre-trained model to infer the probability that a picture the algorithm views is a cat, even though it may not have seen that specific image before. Inference is where the prediction is made, based on what the model has learned.
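To make the inference step concrete, the short Python sketch below turns a model's raw output scores into class probabilities with the softmax function, a common final step in classification networks. The scores and the two labels are made-up values for illustration only; they are not output from any real model.

```python
import math

def softmax(scores):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical raw scores a trained model might produce for one picture.
labels = ["cat", "dog"]
label, prob = classify([2.0, 0.5], labels)
print(f"{label}: {prob:.2f}")
```

The point is that inference returns a probability rather than a certainty, so an application normally applies a confidence threshold before acting on the result.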

The training of a model usually takes place on a high-performance PC or workstation due to the significant amount of processing involved. Depending on the performance of the computer, this can also be a lengthy process. Inference, however, is less computationally demanding than training. Being able to use the model away from a high-performance workstation also opens up opportunities for inference to occur in many different physical locations. The internet of things (IoT) is one such application that can benefit from machine learning techniques, although reliance on a low latency cloud connection may become a limiting factor. Conducting inference at ‘the edge’ can significantly challenge the compute and communication capabilities of an IoT device. An IoT device may transfer data to a cloud-based inference service to overcome device hardware limitations, but this usually requires large amounts of data to be transferred continuously and adds end-to-end link latency. Either way, the IoT device carries a heavy load, whether it processes the model locally or streams data to the cloud, which in turn increases its power consumption.
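The trade-off between cloud and edge inference can be put into rough numbers. The Python sketch below is a back-of-envelope comparison only: the frame size, uplink speed, round-trip time, and inference times are all illustrative assumptions, not measurements of any real device or service.

```python
def cloud_latency_ms(payload_bytes, uplink_mbps, round_trip_ms, server_infer_ms):
    """Estimate end-to-end latency when the device sends data to a cloud service."""
    # Time to push the payload over the uplink, in milliseconds.
    transfer_ms = payload_bytes * 8 / (uplink_mbps * 1_000_000) * 1000
    return transfer_ms + round_trip_ms + server_infer_ms

def edge_latency_ms(device_infer_ms):
    """On-device inference has no network component."""
    return device_infer_ms

# All figures below are assumptions chosen for illustration.
image_bytes = 150_000  # one compressed camera frame
cloud = cloud_latency_ms(image_bytes, uplink_mbps=2.0,
                         round_trip_ms=80.0, server_infer_ms=10.0)
edge = edge_latency_ms(device_infer_ms=120.0)
print(f"cloud: {cloud:.0f} ms, edge: {edge:.0f} ms")
```

With these assumed values the uplink transfer dominates the cloud path, which is why a slower on-device model can still respond sooner than a fast cloud service.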

One way of solving the inference at the edge challenge is to use an AI accelerator.


What is an AI accelerator?

An AI machine learning accelerator can be hardware- or software-based. Its purpose is to speed up the large number of routine computational tasks that a trained neural network needs to perform to yield a probability. Hardware-based accelerators use heavily optimized processing functions, and their power consumption is typically much lower than that of a general-purpose processor performing the same task. The processing throughput of an accelerator is also much higher.
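The routine computation in question is mostly the multiply-accumulate (MAC) operation. The Python sketch below writes out one fully connected layer as explicit loops and counts the MACs it performs; the inputs, weights, and biases are arbitrary example values, and a real network would run millions of these operations per inference.

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer and count its multiply-accumulates."""
    outputs = []
    mac_count = 0
    for neuron_weights, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, neuron_weights):
            acc += x * w  # one multiply-accumulate (MAC)
            mac_count += 1
        outputs.append(max(0.0, acc))  # ReLU activation
    return outputs, mac_count

# Arbitrary example values: 3 inputs feeding 2 neurons.
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.25],
           [1.0, 1.0, 1.0]]
biases = [0.0, -2.0]
outputs, macs = dense_layer(inputs, weights, biases)
print(outputs, macs)
```

Because each MAC in the inner loop is independent of those for other neurons, the work maps naturally onto hardware that can execute many of them at once.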

Accelerators can use different processor architectures, including central processing units (CPUs), vision processing units (VPUs), graphics processing units (GPUs), systems on chip (SoCs), and field-programmable gate arrays (FPGAs). From the software perspective, machine learning frameworks and task-specific software development toolkits optimize the neural network for throughput and simplify the task of AI software development.

Using CPUs and GPUs for acceleration

CPUs have always been general-purpose processing units, suitable for a wide variety of computing tasks. Early AI and machine learning projects used CPUs, but their architectural limitations make them unsuitable for the high computation workload of today’s model training and inference tasks. Semiconductor vendors have responded to these challenges with CPUs that are configured and architected to meet the specific computational jobs found within neural networks. Examples include the DynamIQ multi-core energy-efficient processor series from Arm and the Exynos 9 series from Samsung.

Other semiconductor vendors believe that the GPU architecture is better suited to machine learning inference tasks. Initially designed for complex image rendering tasks in gaming consoles, GPUs employ more processing units than a CPU and are capable of parallel processing. Many neural network tasks can utilize this parallel processing capability, significantly reducing computation time and increasing throughput. GPUs are heavily integrated devices, making them smaller and lighter than multiple CPUs and well suited to inclusion in space-constrained IoT devices. Companies manufacturing GPUs for machine learning applications include Qualcomm (Adreno), NEC (SX-Aurora), NVIDIA (DGX), and AMD (Radeon Instinct).

Programmable logic for computational flexibility

Most CPUs and GPUs are general-purpose by nature: they trade task-specific efficiency for the flexibility to accommodate a large number of different use cases. Unfortunately, this trade-off means that they are continually moving data between the processing units and memory storage. This approach reduces throughput and also accounts for considerable energy consumption. In addition, the highly integrated and dense architecture of CPUs and GPUs can result in significant heat generation that lowers device reliability and potentially causes premature failure. For a compact IoT device at the edge, the high energy profile and thermal management challenges within a relatively small device require a different approach. Investigating how to offload the more demanding computational tasks to a power-efficient architecture is worthy of consideration.

Field-programmable gate arrays use a fabric of programmable logic cells to achieve in hard-wired logic what a set of software instructions would perform. FPGAs are capable of being reconfigured by designers to accommodate specific tasks and offer a fast and deterministic way of undertaking complex calculations in hardware. FPGAs provide more flexibility than a CPU or GPU but are typically larger, occupying more board space. Squeezing them into a compact IoT sensor might present a challenge; however, for other less-constrained use-cases, FPGAs offer a viable choice.

As with CPUs and GPUs, FPGA vendors are continually adapting their line-ups to include devices optimized for AI and machine learning applications. Vendors of FPGAs suited to inference applications include Xilinx and Intel; NVIDIA’s TensorRT, by contrast, is an inference optimization toolkit for NVIDIA GPUs rather than an FPGA. Microsoft has announced that it has been collaborating with Intel on FPGA-based inference acceleration projects. Intel continues to innovate across its different processor architectures, including FPGAs.


SoCs offer high levels of functional integration

SoCs bring high levels of functional integration into a single IC package. They are increasingly popular as a way to incorporate many different functional blocks to suit the demands of specific applications while achieving a relatively compact package size. Functions can include microcontroller cores, power management, wireless transceivers, analog-to-digital converters, and digital signal processors. Semiconductor vendors have been adept at constructing SoCs with a building block approach to suit popular high-volume applications. Initially adopted to integrate video, image, and voice processing tasks into a single IC within a smartphone, SoCs are now available tailored towards machine learning tasks. Arm has seized this opportunity with the launch of its Machine Learning Processor and Object Detection Processor IP blocks. Another vendor, HiSilicon, a Huawei company, is using the Arm IP blocks to create a family of SoCs for use in future smartphones and tablets. HiSilicon is also manufacturing the Ascend ICs for Huawei.

Another major vendor in the SoC space is Arteris. The company develops FlexNoC, a network-on-chip interconnect fabric used in SoCs by many mobile and wireless companies.

Other semiconductor vendors in the throes of bringing machine learning/AI optimized SoCs to market include STMicroelectronics, Texas Instruments, NXP, and Renesas.


Application-Specific ICs, Vision Processors and Neuromorphic Chips

The basic concept of ASICs has been around for a very long time. Rather than taking the general-purpose building block approach of an SoC, an ASIC contains only the circuitry that the task requires. Machine learning ASICs are already being produced, with examples including Intel’s Nervana and Google’s Edge TPU.

A VPU is designed specifically to accelerate vision and image processing tasks. A convolutional neural network is typically the machine learning model used for inferencing images, and a VPU is optimized and architected for this purpose. Unlike a GPU, which tends to accommodate a broader range of capabilities, a VPU is designed for task-specific vision processing. Vendor examples include Google’s Pixel Visual Core, Microsoft’s HoloLens, and Inutive’s NU series.

As the semiconductor industry grapples with meeting the increasing demands for low latency, low power, high bandwidth processing, several companies are now investigating creative ways to architect a new technology approach.

To some, a more hybrid approach that combines the best of high-performance digital circuitry with the low power, low latency attributes of analog ICs promises to deliver results. Using an approach similar to artificial neural networks that are digital interpretations of the human brain, IBM has developed a neuromorphic processor called TrueNorth. Research into replicating the human brain within silicon is also the goal of projects at several leading universities, including Stanford, Columbia, and the University of Michigan’s IC lab.


Machine Learning Frameworks Simplify Development

As mentioned earlier, software frameworks also play an important role in simplifying the development of machine learning algorithms and inference deployment. Frameworks, which typically comprise a collection of libraries, example models, and pre-trained data sets, greatly help development teams structure and plan their projects through the use of optimized code examples and open-source libraries. Popular frameworks include Caffe, Keras, and TensorFlow. Intel, with its OpenVINO toolkit, has taken the process further by offering heterogeneous support across all Intel processor architectures, with a specific focus on vision processing.


Getting Started with AI and Machine Learning

Mouser stocks a wide variety of hardware designed for machine learning applications. Examples include the compact AAEON UP AI Core processing module, which is based on the mini-PCIe format and suits use in an industrial environment. It features the Intel Movidius VPU, a processing powerhouse that integrates 12 very long instruction word (VLIW) cores and 512 MBytes of DDR memory together with dedicated vision accelerators. Another Intel Movidius-based machine learning device is the plug-and-play Neural Compute Stick 2 (NCS2). The NCS2 uses a Movidius Myriad X VPU that achieves an impressive performance of 4 trillion operations per second (TOPS) but consumes only 3 Watts. The Gumstix AeroCore 2 board for NVIDIA Jetson hosts a Jetson TX1 or TX2 module, whose CUDA cores give it strong parallel processing capabilities that make it well suited to industrial machine vision use cases such as object recognition and production line inspection.
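Headline figures such as TOPS and Watts can be turned into a rough per-inference estimate. The Python sketch below uses the 4 TOPS and 3 Watts quoted above for the NCS2; the model size (8 billion operations per inference) and the 25% sustained utilization are assumptions chosen for illustration, since peak throughput is rarely sustained in practice.

```python
def inference_estimate(model_gops, peak_tops, power_watts, utilization=1.0):
    """Rough latency and energy per inference from headline accelerator figures."""
    # Effective throughput in operations per second after derating the peak.
    effective_ops_per_s = peak_tops * 1e12 * utilization
    latency_s = model_gops * 1e9 / effective_ops_per_s
    energy_j = latency_s * power_watts
    return latency_s, energy_j

# 4 TOPS and 3 W are the figures quoted in the text; the 8 GOP model size
# and 25% utilization are assumptions, not measured values.
latency, energy = inference_estimate(model_gops=8.0, peak_tops=4.0,
                                     power_watts=3.0, utilization=0.25)
print(f"latency: {latency * 1000:.1f} ms, energy: {energy * 1000:.1f} mJ")
print(f"efficiency: {4.0 / 3.0:.2f} TOPS/W")
```

An estimate like this is only a starting point for comparing devices; memory bandwidth, model structure, and the host interface all affect what a given accelerator actually delivers.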

What will the future bring?

The last five years have seen new AI/ML semiconductor entrants target the expanding IoT and inference at the edge space. Many, such as GreenWaves, Reduced Energy Microsystems, Syntiant, and Mythic, are emphasizing their low power credentials. Syntiant and Mythic, in particular, are targeting battery-powered applications. Another new player, Wiliot, has designed a battery-less Bluetooth IC that is powered by energy harvested from ambient radio signals.

At the other end of the machine learning scale, new entrants include Vathys, Graphcore, and Cerebras. For specific deep learning applications, BrainChip has developed what is believed to be the first spiking neural processor.

The pace of AI and machine learning processor development is gathering significant momentum. Achieving low latency, low power inference in a host of edge devices is a priority for the semiconductor industry as industrial IoT and Industry 4.0 deployments become a reality.
