How Neural Networks Power Starship Robots by Tanel Pärnamaa

Starship builds a fleet of robots to deliver packages on site upon request. To achieve this successfully, robots must be safe, polite and fast. But how to get there with low computing resources and without expensive sensors like LIDARs? This is the engineering reality you have to deal with, unless you live in a universe where customers are happy to pay $ 100 for shipping.

For starters, robots begin by sensing the world with radar, multiple cameras and ultrasound.

The challenge, however, is that most of this knowledge is low-level and non-semantic. For example, a robot can sense that an object is ten meters away, but without knowing the category of the object, it is difficult to make decisions about safe driving.

Machine learning via neural networks is surprisingly useful in converting this unstructured low-level data into higher-level information.

Robot starships move mainly on sidewalks and cross streets when necessary. This poses a different set of challenges compared to self-driving cars. Road traffic for cars is more structured and predictable. Cars move in lanes and do not change direction too often, while people often stop abruptly, meander, may be accompanied by a dog on a leash and do not signal their intentions with turn signals.

To understand the real-time environment, a central component of the robot is an object detection module, a program that enters images and returns a list of object boxes.

This is very good, but how to write such a program?

The image is a large three-dimensional array consisting of countless numbers representing the intensity of the pixels. These values change significantly when the image is taken at night and not during the day; when the color, scale, or position of the object changes, or when the object itself is cut or blocked.

On the left – what the man sees. Right – what the computer sees.

For some complex problems, teaching is more natural than programming.

In the robot software we have a set of learning units, mostly neural networks, where the code is written by the model itself. The program is represented by a set of weights.

Initially, these numbers are initialized at random, and the output of the program is also random. Engineers present model examples of what they would like to predict and ask the network to improve the next time they see such an input. By iteratively changing the weights, the optimization algorithm looks for programs that provide increasingly constraining fields.

Development of programs discovered through the optimization procedure.

However, we need to think deeply about the examples that are used to teach the model.

Should the model be sanctioned or rewarded when he finds a car reflected in the window?
What will he do when he finds a picture of a person on a poster?
Should a car trailer full of cars be annotated as a whole or should each car be annotated separately?

These are all examples that have occurred during the construction of the module for detecting objects in our robots.

The neural network detects objects in reflections and posters. Error or function?

When you teach a machine, big data just isn’t enough. The data collected should be rich and varied. For example, just by using evenly extracted images and then annotating them, many pedestrians and cars will appear, but there will be no examples of motorcycles or skaters in the model that can reliably detect these categories.

The team must dig especially for difficult examples and rare cases, otherwise the model will not progress. Starship operates in several different countries and different weather conditions enrich the set of examples. Many people were surprised when the Starship delivery robots worked during the snowstorm “Emma” in the United Kingdom,, however, airports and schools remained closed.

The robots deliver packages in different weather conditions.

At the same time, annotating data takes time and resources. Ideally, it is best to train and improve models with less data. This is where architectural engineering comes into play. We encode prior knowledge in architecture and optimization processes to reduce search space to programs that are more likely in the real world.

We incorporate prior knowledge into neural network architectures to obtain better models.

In some computer vision applications, such as pixel segmentation, it is useful for the model to know if the robot is on a sidewalk or crossing a road. To suggest, we encode clues globally of the image in neural network architecture; then the model determines whether to use it or not without having to learn it from scratch.

After data engineering and architecture, the model can work well. However, deep learning models require a significant amount of computing power and this is a great challenge for the team, as we cannot take advantage of the most powerful graphics cards of robotic batteries to deliver batteries.

Starship wants our supplies to be cheap, which means our hardware has to be cheap. This is the same reason Starship does not use LIDARs (a radar-based detection system that uses laser light), which would make understanding the world much easier – but we don’t want our customers to pay more than they need for delivery.

State-of-the-art object detection systems published in academic papers are around 5 frames per second. [MaskRCNN]and real-time object detection documents do not report speeds above 100 FPS [Light-Head R-CNN, tiny-YOLO, tiny-DSOD]. Moreover, these figures are reported on one image; however, we need a 360-degree understanding (the equivalent of processing approximately 5 single images).

To provide perspective, Starship models run above 2,000 FPS when measured on a custom GPU, and process a full 360-degree panoramic image with a single forward transition. This is equivalent to 10,000 FPS when processing 5 single images with a batch size of 1.

Neural networks are better than humans at many visual problems, although they can still contain errors. For example, a restrictive box may be too wide, trust too low, or an object hallucinations in a place that is actually empty.

Potential problems in the object detection module. How to solve them?

Correcting these mistakes is a challenge.

Neural networks are considered black boxes that are difficult to analyze and understand. However, to improve the model, engineers need to understand failure cases and dive deep into the specifics of what the model has learned.

The model is represented by a set of weights and can visualize what each particular neuron is trying to detect. For example, the first layers of the Starship network are activated to standard patterns such as horizontal and vertical edges. The next block of layers detects more complex textures, while the higher layers reveal automotive parts and entire objects.

The way the neural network we use in robots builds the understanding of images.

Technical duty takes on a different meaning with machine learning models. Engineers are constantly improving architectures, optimization processes, and datasets. As a result, the model becomes more accurate. However, changing the detection model for the better does not necessarily guarantee success in the overall behavior of the robot.

There are dozens of components that use the model output to detect objects, each of which requires a different accuracy and call level, which are set based on the existing model. However, the new model may act differently in different ways. For example, the probability distribution of the output may be shifted to larger values or be wider. Although the average performance is better, it may be worse for a certain group such as large cars. To avoid these obstacles, the team calibrates the probabilities and checks for regressions on multiple stratified data sets.

The average performance does not tell you the whole story of the model.

Monitoring trainee software components poses a different set of challenges than monitoring standard software. Little concern is given about the output time or memory usage, as they are mostly constant.

However, moving a data set becomes a major concern – the distribution of data used to train the model is different from what the model is currently deployed in.

For example, there may suddenly be electric scooters moving on the sidewalks. If the model does not take this class into account, the model will find it difficult to classify it correctly. The information received from the object detection module will not be in agreement with other sensory information, which will lead to a request for assistance from human operators and thus delay delivery.

Main concern in practical machine learning – training and test data come from different distributions.

Neural networks enable Starship robots to be safe at road crossings by avoiding obstacles such as cars, and on sidewalks by understanding all the different directions in which humans and other obstacles can choose.

The robots of the stars achieve this by using cheap hardware that poses many engineering challenges but makes the supply of robots a reality today. Starship robots make real deliveries seven days a week in many cities around the world, and it’s rewarding to see how our technology is constantly bringing people more comfort in their lives.

Source link