Forming a Benchmark for ML at the Edge

By Sally Ward-Foxton

EEMBC pins down variables to compare edge inference chips

A new benchmark for machine learning inference chips aims to ease comparisons between processing architectures for embedded edge devices.

Developed by EEMBC (the Embedded Microprocessor Benchmark Consortium), MLMark uses three of the most common object detection and image classification models: ResNet-50, MobileNet and SSDMobileNet. The first hardware to be supported comes from Intel, Nvidia and Arm, with scores already available for Nvidia and HiSilicon (Arm-based) parts.
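
For a sense of the workload class being measured, the short Python sketch below runs a single MobileNet classification with the stock Keras model. It illustrates the kind of inference MLMark times, not the benchmark harness itself; the image filename is a placeholder.

```python
import numpy as np
import tensorflow as tf

# Load the stock ImageNet-trained MobileNet, one of the three MLMark models.
model = tf.keras.applications.MobileNet(weights="imagenet")

# "example.jpg" is a placeholder path; MobileNet expects 224x224 RGB input.
img = tf.keras.utils.load_img("example.jpg", target_size=(224, 224))
x = tf.keras.applications.mobilenet.preprocess_input(
    tf.keras.utils.img_to_array(img)[np.newaxis, ...])

# A single inference: the operation whose speed and accuracy MLMark scores.
preds = model.predict(x)
print(tf.keras.applications.mobilenet.decode_predictions(preds, top=3)[0])
```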

The market for machine learning inference compute in edge devices, while still emerging, has huge growth potential. EEMBC has seen strong demand for a new benchmark to help counter industry hype about performance, said Peter Torelli, the group’s president.

“Organising and taxonomising these different architectures begins with reporting, and having a database of scores to see how they are performing on what are considered the standard models today,” he said.

Moving Target
Developing a benchmark for such a rapidly evolving space is challenging, to say the least. Part of the problem is that for any machine learning (ML) model, there are many variables that can be optimised, as well as different model formats and training datasets.

“This isn’t like a traditional benchmark where you download some C code and push it to your compiler and then measure a number,” Torelli said. “It’s all about graphs, and everyone has their own SDK with an optimiser. We are trying to put a stake in the ground and compare how [different architectures] are performing.”

EEMBC has effectively chosen a baseline by fixing certain variables in each model and locking down SDKs (software development kits), the target hardware supported, and other parameters. In most cases, the most common options, or the most likely to run on edge hardware, were selected.

“We need to try to bring some order to all these different parts and understand what the gains really are. Someone has to actually lay something out and say, we’re freezing these particular parameters,” he said.
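
To make the idea concrete, a frozen-parameter set might look like the sketch below. The field names and values are illustrative, not MLMark’s actual configuration schema; the point is that every submission runs against the same fixed choices.

```python
# Hypothetical frozen benchmark parameters; names and values are
# illustrative, not MLMark's real schema.
FROZEN_PARAMS = {
    "model": "mobilenet-v1",            # one fixed model graph
    "input_resolution": (224, 224),     # fixed rather than vendor-tuned
    "precision": "fp32",                # one numeric format per score
    "batch_size": 1,                    # single-image edge inference
    "dataset": "imagenet-val-subset",   # same images for every submission
}
```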

Peter Torelli, President, EEMBC (Image: EEMBC)

Existing benchmarks for machine learning inference, such as MLPerf and DAWNBench, have largely arisen from the world of academia and therefore take a slightly different approach from EEMBC, which serves the embedded computing industry.

Comparing MLMark with existing benchmarks, Torelli highlighted EEMBC’s tightly defined approach and its requirement for implementation details to be made public. Benchmarks driven by academia often allow a lot more flexibility, so vendors can create the highest-performance scenarios and keep the fine details to themselves.

“EEMBC’s closer to the vendors, manufacturers and integrators,” Torelli said. “We’ve made [MLMark] more accessible to embedded developers, and made it reflect the kinds of optimisations you’re going to see in use, by making the source code available and locking down what SDKs are used. The aim is to make it more of a development tool than a database of requests to vendors.”

Benchmark Reticence
Existing machine learning benchmarks have met with a certain amount of reticence from chip vendors; so far, very few manufacturers have submitted scores. Torelli speculated that companies’ marketing departments may be to blame: nobody wants to appear at the bottom of the list.

“Some [EEMBC members] are saying it’s too soon [to submit scores]; they are silicon vendors who don’t know what the competition looks like yet,” he said. “That’s part of the reason we made MLMark open source… because we’ve defined the guardrails in the benchmark, anyone can submit scores.”

While it’s true that third parties can download the MLMark code and upload scores, the caveat is that only three types of hardware are currently supported: Intel CPUs, GPUs and neural compute sticks using OpenVINO; Nvidia GPUs using TensorRT; and Arm Cortex-A CPUs and Mali GPUs using Neon and OpenCL, respectively.
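
That lock-down can be pictured as a simple dispatch table pairing each supported target with its mandated SDK. The target names and function below are illustrative, not MLMark’s real API.

```python
# Hypothetical target-to-SDK dispatch mirroring the pairings listed above;
# the names are illustrative, not MLMark's actual interface.
SUPPORTED_TARGETS = {
    "intel-cpu": "openvino",
    "intel-gpu": "openvino",
    "intel-ncs": "openvino",      # neural compute stick
    "nvidia-gpu": "tensorrt",
    "arm-cortex-a": "neon",
    "arm-mali": "opencl",
}

def select_backend(target: str) -> str:
    """Return the locked-down SDK for a supported target, failing loudly:
    scores are only comparable when run on one of these pairings."""
    try:
        return SUPPORTED_TARGETS[target]
    except KeyError:
        raise ValueError(f"unsupported target: {target!r}") from None
```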

Torelli said he expects some of the scores to be effectively crowdsourced from third parties such as silicon integrators. Some of these integrators have a different attitude from the silicon vendors, he said, thinking of benchmarks as a tool, or a capability that enables analysis.

“They understand the importance of putting a stake in the ground today, to enable visibility into future trends,” said Torelli. “[They understand that] somebody has to start collecting this data and archiving it and understanding what it means.”

Power Consumption
MLMark uses three metrics: throughput (inferences per second), latency (processing time for one inference) and accuracy (percentage of correct predictions).
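
How the three metrics relate can be sketched in a few lines of Python. Here `infer` stands in for whatever vendor SDK call actually runs the model, and the loop is a simplification of what a real harness does.

```python
import time

def measure(infer, inputs, labels):
    """Compute throughput, median latency and accuracy for a
    single-input inference callable (a stand-in for an SDK call)."""
    latencies, correct = [], 0
    start = time.perf_counter()
    for x, y in zip(inputs, labels):
        t0 = time.perf_counter()
        correct += int(infer(x) == y)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    throughput = len(inputs) / elapsed            # inferences per second
    latency_ms = 1000 * sorted(latencies)[len(latencies) // 2]  # one inference
    accuracy = correct / len(labels)              # fraction of correct predictions
    return throughput, latency_ms, accuracy
```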

A power consumption metric is notably absent from MLMark Version 1.0, given that the target hardware is edge devices, which may include battery-operated systems. According to Torelli, this is down to the high cost of entry: measuring power consumption with the right accuracy requires expensive, cutting-edge test equipment.

“For EEMBC’s other benchmarks that focus specifically on power [ULPMark, IoTMark], STMicroelectronics created a power meter board that can measure down to the nanojoule, but it’s limited to 50 mA of current, which isn’t going to support any of these [machine learning] boards,” he said.
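
A back-of-the-envelope calculation shows the mismatch. The figures below are illustrative, not EEMBC measurements: a small ML edge board can draw amps, far above the meter’s 50 mA ceiling.

```python
# Illustrative numbers only, not EEMBC measurements.
voltage_v = 5.0      # typical single-board-computer supply
current_a = 2.0      # roughly 40x the 50 mA power-board limit
latency_s = 0.020    # a hypothetical 20 ms inference

power_w = voltage_v * current_a    # 10 W average draw
energy_j = power_w * latency_s     # 0.2 J per inference
print(f"{energy_j * 1e3:.0f} mJ per inference")
```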

There isn’t an easy solution to the power metric problem, Torelli said.

“We couldn’t provide an EEMBC-quality power number without a significant test harness and demands to manufacturers to provide evaluation boards with isolated power planes,” he said. “We want to make sure it’s very clear what’s being measured and how it’s being measured… there are a lot of details around it that have to be hammered out, but we do want to do that.”

Future Versions
The next two versions of the benchmark are currently being developed in parallel.

The next release, expected early next year, will keep the same models from Version 1.0 but add support for Google’s TPU and some other dedicated ML accelerators.

Version 2.0 will expand to applications beyond vision, moving into models for language translation, text-to-speech, keyword searching and other forms of decision making. The plan for Version 2.0 also includes support for ultra-low-power and low-cost devices.
