A Shift in Computer Vision Is Coming – EE Times Europe

Is computer vision about to reinvent itself, again?
Ryad Benosman, professor of ophthalmology at the University of Pittsburgh and an adjunct professor at the CMU Robotics Institute, believes that it is. As one of the founding fathers of event-based vision technologies, Benosman expects that neuromorphic vision — computer vision based on event-based cameras — will be the next direction computer vision will take.
“Computer vision has been reinvented many, many times,” Benosman said. “I’ve seen it reinvented twice at least, from scratch, from zero.”
Benosman cited the shift in the 1990s from image processing with a bit of photogrammetry to a geometry-based approach and then to today’s rapid advance toward machine learning. Despite those changes, modern computer-vision technologies are still predominantly based on image sensors — cameras that produce an image similar to what the human eye sees.
According to Benosman, as long as the image-sensing paradigm remains good enough, it holds back innovation in alternative technologies. The development of high-performance processors such as GPUs delays the need to look for alternative solutions and has thus prolonged this effect.
“Why are we using images for computer vision? That’s the million-dollar question to start with,” he said. “We have no reasons to use images — it’s just because there’s the momentum from history. Before even having cameras, images had momentum.”
Image cameras have been around since the pinhole camera emerged in the fifth century B.C.E. By the 1500s, artists were using room-sized devices to trace the image of a person or a landscape outside the room onto canvas. Over the years, the paintings were replaced with film to record the images. Innovations such as digital photography eventually made it easy for image cameras to become the basis for modern computer-vision techniques.
Benosman argues, however, that image-camera–based techniques for computer vision are hugely inefficient. His analogy is the defense system of a medieval castle: Guards positioned around the ramparts look in every direction for approaching enemies. A drummer plays a steady beat, and on each drumbeat, every guard shouts out what they see. Amid all the shouting, how easy is it to hear the one guard who spots an enemy at the edge of a distant forest?
The 21st-century hardware equivalent of the drumbeat is the electronic clock signal, and the guards are the pixels. A huge batch of data is created and must be examined on every clock cycle, which means a lot of redundant information and a lot of unnecessary computation.
“People are burning so much energy, it’s occupying the entire computation power of the castle to defend itself,” Benosman said. If an interesting event is spotted — represented by the enemy in this analogy — “you’d have to go around and collect useless information, with people screaming all over the place, so the bandwidth is huge … and now imagine you have a complicated castle. All those people have to be heard.”
Enter neuromorphic vision. The basic idea is inspired by the way biological systems work, detecting changes in the scene dynamics rather than analyzing the entire scene continuously. In our castle analogy, this would mean having guards keep quiet until they see something of interest, then shout their location to sound the alarm. In the electronic version, this means having individual pixels determine whether they see something relevant.
“Pixels can decide on their own what information they should send,” said Benosman.
“Instead of acquiring systematic information, they can look for meaningful information — features. That’s what makes the difference.”
This event-based approach can save a huge amount of power and reduce latency compared with systematic acquisition at a fixed frequency.
“You want something more adaptive, and that’s what that relative change [in event-based vision] gives you — an adaptive acquisition frequency,” he said. “When you look at the amplitude change, if something moves really fast, we get lots of samples. If something doesn’t change, you’ll get almost zero, so you’re adapting your frequency of acquisition based on the dynamics of the scene. That’s what it brings to the table. That’s why it’s a good design.”
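The behavior Benosman describes matches the standard event-camera pixel model (not spelled out in the article): each pixel tracks log intensity and emits a timestamped ON/OFF event only when the change since its last event crosses a threshold. A minimal single-pixel simulation, with illustrative parameter values, might look like this:

```python
import math

def dvs_events(samples, threshold=0.2):
    """Simulate one event-camera pixel: emit a (timestamp, +1/-1) event
    whenever log intensity drifts by more than `threshold` from the
    level recorded at the last event. A static input produces no events."""
    events = []
    ref = math.log(samples[0][1])           # log intensity at last event
    for t, intensity in samples[1:]:
        diff = math.log(intensity) - ref
        while abs(diff) >= threshold:       # fast changes -> many events
            polarity = 1 if diff > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold     # advance the reference level
            diff = math.log(intensity) - ref
    return events

# A static scene yields zero events; a brightening scene yields
# a burst of ON events, i.e. the sample rate adapts to the dynamics.
static = [(t, 100.0) for t in range(5)]
moving = [(0, 100.0), (1, 150.0), (2, 300.0)]
print(dvs_events(static))   # -> []
print(dvs_events(moving))   # -> [(1, 1), (1, 1), (2, 1), (2, 1), (2, 1)]
```

The key design point is visible in the output: acquisition frequency is driven by the amplitude of change in the scene, not by a fixed clock.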
Benosman entered the field of neuromorphic vision in 2000, convinced that advanced computer vision could never work with images, because they are not the right way to do it.
“The big shift was to say that we can do vision without gray levels and without images, which was heresy at the end of 2000 — total heresy,” he said.
The techniques Benosman proposed — the basis for today’s event-based sensing — were so different that papers presented to the foremost IEEE computer-vision journal at the time were rejected without review. Indeed, not until the development of the dynamic vision sensor (DVS) in 2008 did the technology start gaining momentum.
Neuromorphic technologies are those inspired by biological systems, including the ultimate computer: the brain and its neurons, or compute elements. The problem is that no one fully understands exactly how neurons work. While we know that neurons act on incoming electrical signals called spikes, until relatively recently, researchers characterized neurons as rather sloppy, thinking only the number of spikes mattered. This hypothesis persisted for decades, but more recent work has proved that the timing of these spikes is absolutely critical and that the architecture of the brain creates delays in these spikes to encode information.
Today’s spiking neural networks, which emulate the spike signals seen in the brain, are simplified versions of the real thing — often binary representations of spikes. “I receive a 1, I wake up, I compute, I sleep,” Benosman explained. The reality is much more complex. When a spike arrives, the neuron starts integrating the value of the spike over time; there is also leakage from the neuron, meaning the result is dynamic. Furthermore, there are roughly 50 different types of neurons with 50 different integration profiles.
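The "integrate with leakage" behavior Benosman describes corresponds to the textbook leaky integrate-and-fire model. A minimal discrete-time sketch (parameter values are illustrative, not from the article) shows why spike timing, not just spike count, carries information:

```python
def lif_neuron(spike_times, weight=0.6, leak=0.9, threshold=1.0, t_end=10):
    """Discrete-time leaky integrate-and-fire neuron.

    The membrane potential decays by a factor `leak` each step and jumps
    by `weight` on each incoming spike; crossing `threshold` fires an
    output spike and resets the potential. Because of the leak, *when*
    spikes arrive matters: the same number of input spikes can make the
    neuron fire or leave it silent, depending on their timing.
    """
    v, out = 0.0, []
    inputs = set(spike_times)
    for t in range(t_end):
        v *= leak                      # leakage: the state is dynamic
        if t in inputs:
            v += weight                # integrate the incoming spike
        if v >= threshold:
            out.append(t)              # fire...
            v = 0.0                    # ...and reset
    return out

# Two spikes close together fire; the same two spikes far apart do not.
print(lif_neuron([1, 2]))   # -> [2]
print(lif_neuron([1, 8]))   # -> []
```

Real neurons are far richer than this (the article notes roughly 50 types with 50 integration profiles), but even this toy model captures the timing sensitivity that the spike-count view missed.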
The current electronic versions are missing the dynamic path of integration, the connectivity between neurons, and the different weights and delays. “The problem is that to make an effective product, you cannot [imitate] all the complexity, because we don’t understand it,” he said. “If we had good brain theory, we would solve it. The problem is, we just don’t know [enough].”
Benosman runs a unique laboratory dedicated to understanding the mathematics behind cortical computation, with the aim of creating new mathematical models and replicating them as silicon devices. This includes directly monitoring spikes from pieces of real retina.
For the time being, Benosman is against trying to faithfully copy the biological neuron, describing that approach as old-fashioned.
“The idea of replicating neurons in silicon came about because people looked into the transistor and saw a regime that looked like a real neuron, so there was some thinking behind it at the beginning,” he said. “We don’t have cells; we have silicon. You need to adapt to your computing substrate, not the other way around … If I know what I’m computing and I have silicon, I can optimize that equation and run it at the lowest cost, lowest power, lowest latency.”
The realization that it’s unnecessary to replicate neurons exactly and the development of the DVS camera are the drivers behind today’s vision systems. While systems are already on the market, there is progress to be made before fully humanlike vision becomes available for commercial use.
Initial DVS cameras had “big, chunky pixels,” as the components around the photodiode itself reduced the fill factor substantially, said Benosman. While investment in the development of these cameras accelerated the technology, Benosman made it clear that the event cameras of today are simply an improvement of the original research devices developed as far back as 2000. State-of-the-art DVS cameras from Sony, Samsung, and Omnivision have tiny pixels, incorporate advanced technology such as 3D stacking, and reduce noise. Benosman’s worry is whether the types of sensors used today can successfully be scaled up.
“The problem is, once you increase the number of pixels, you get a deluge of data, because you’re still going super-fast,” he said. “You can probably still process it in real time, but you’re getting too much relative change from too many pixels. That’s killing everybody right now, because they see the potential, but they don’t have the right processor to put behind it.”
General-purpose neuromorphic processors are lagging behind their DVS camera counterparts. Efforts from some of the industry's biggest players (IBM's TrueNorth, Intel's Loihi) are still works in progress. Benosman said that the right processor with the right sensor would be an unbeatable combination.
“[Today’s DVS] sensors are extremely fast, are super-low–bandwidth, and have a high dynamic range so you can see indoors and outdoors,” Benosman said. “It’s the future. Will it take off? Absolutely.
“Whoever can put the processor out there and offer the full stack will win, because it’ll be unbeatable,” he added.
This article originally ran on sister site EE Times.
Read also:
Embedded AI Processors: The Cambrian Explosion
Neuromorphic Vision in Space
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master's degree in Electrical and Electronic Engineering from the University of Cambridge.