Site Reliability Engineering at Starship | by Martin Pihlak | Starship Technologies


Photo by Ben Davis, Instagram slovaceck_

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: things like remote control, path finding, matching robots to customers, fleet health management, and also interactions with customers and merchants. All of this has to run 24x7, without interruption, and scale dynamically to match the load.

SRE at Starship is responsible for providing the cloud infrastructure and platform services that these backend services are built on. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging Kafka is the platform of choice and we use it for almost everything apart from shipping video streams from the robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
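To give a flavour of the async messaging side, here is a minimal sketch of a Go service publishing an event to Kafka. The broker address, topic name and the segmentio/kafka-go client are illustrative assumptions, not a description of our actual services:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Hypothetical topic and broker address, used here only for illustration.
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka-0.kafka:9092"),
		Topic:    "robot-status-events",
		Balancer: &kafka.LeastBytes{},
	}
	defer w.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Publish a single status event keyed by robot id so that events for
	// one robot always land in the same partition and stay ordered.
	err := w.WriteMessages(ctx, kafka.Message{
		Key:   []byte("robot-1234"),
		Value: []byte(`{"robot_id":"robot-1234","state":"delivering"}`),
	})
	if err != nil {
		log.Fatalf("failed to publish: %v", err)
	}
}
```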

Much of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption policies or optimizing the use of Spot instances. Sometimes it is like bricklaying: simply installing a Helm chart to provide a particular piece of functionality. Often, however, the "bricks" must be carefully picked and evaluated (is Loki good enough for log management, is a Service Mesh worth adopting and, if so, which one) and occasionally the functionality does not exist in the world and has to be written from scratch. When that happens we usually turn to Python and Golang, but also Rust and C when needed.
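As an example of the kind of small building block this involves, here is a sketch of creating a Pod disruption budget with client-go. The service name, namespace and threshold are made up for illustration; in practice this would more likely live in a Helm chart or plain YAML:

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; use clientcmd instead when running locally.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "route-service-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			// Keep at least two replicas available during voluntary
			// disruptions such as node drains.
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "route-service"},
			},
		},
	}

	_, err = clientset.PolicyV1().PodDisruptionBudgets("default").
		Create(context.Background(), pdb, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```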

Another big part of the infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDb, a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting thousands of robots. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the existing database infrastructure. Examples: add MongoDb observability with a custom sidecar proxy to analyze database traffic, enable PITR support for databases, automate regular failover and recovery tests, collect Kafka partition reassignment metrics, enforce Kafka data retention.
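As an illustration of the automated failover testing idea, a sketch like the following could trigger a controlled failover against a test replica set using the official Go driver. The connection string, timings and logging are hypothetical, not our actual tooling:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Placeholder URI for a test replica set.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongo-test:27017/?replicaSet=rs0"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Ask the current primary to step down for 60 seconds, forcing an election.
	res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "replSetStepDown", Value: 60}})
	if err := res.Err(); err != nil {
		// The primary closes connections when stepping down, so an error here is expected.
		log.Printf("stepDown returned: %v", err)
	}

	// Give the replica set time to elect a new primary, then verify it serves traffic again.
	time.Sleep(15 * time.Second)
	if err := client.Ping(ctx, readpref.Primary()); err != nil {
		log.Fatalf("no primary elected after stepDown: %v", err)
	}
	log.Println("failover completed, new primary is serving traffic")
}
```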

Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship's production. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!

A day in the life of an SRE

Arrive at work, some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.

Find that MongoDb connection latencies have spiked during the night. Digging into the Prometheus metrics with Grafana, find that this happens while backups are running. Why is this suddenly a problem, we have been running these backups for ages? Turns out that we are compressing the backups very aggressively to save on network and storage costs, and this consumes all the available CPU. The load on the database seems to have grown a bit, making this noticeable. This happens on a standby node and does not affect production, but it is still a problem should the primary fail. Add a Jira item to fix this.

As a side note, change the MongoDb prober code (Golang) to add more histogram buckets and get a better understanding of the latency distribution. Run the Jenkins pipeline to put the new probe into production.
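A rough sketch of what such a prober looks like, assuming it periodically pings MongoDb and records the latency in a Prometheus histogram. The metric name, bucket layout and addresses are illustrative, not the actual prober code:

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

// Finer-grained exponential buckets (1ms up to roughly 4s) to see the
// shape of the latency distribution rather than just an average.
var pingLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "mongodb_probe_ping_seconds",
	Help:    "Latency of MongoDb ping probes.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 13),
})

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongodb:27017"))
	if err != nil {
		panic(err)
	}

	// Probe loop: ping the primary every 10 seconds and record the latency.
	go func() {
		for range time.Tick(10 * time.Second) {
			start := time.Now()
			if err := client.Ping(ctx, readpref.Primary()); err == nil {
				pingLatency.Observe(time.Since(start).Seconds())
			}
		}
	}()

	// Expose the histogram for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9216", nil)
}
```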

At 10 am there is the standup meeting; share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, piloting canary deployments with Flagger.

After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We are running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available by now? No, not going there: too much magic, I want more explicit control over my state. Plain YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating the application credentials required a small script to set up the accounts in Zookeeper. One small bit that was left dangling was setting up Kafka Connect to capture database change log events: it turns out the test databases are not running in ReplicaSet mode and Debezium cannot get an oplog from them. Backlog this and move on.
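Once a cluster like this is up, a quick smoke test is handy. The sketch below uses the Sarama client to list the brokers and topics the new cluster reports; the bootstrap address and broker version are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_8_0_0 // assumption: broker version of the test cluster

	// Bootstrap address of the hypothetical test cluster.
	client, err := sarama.NewClient([]string{"kafka-test-0.kafka-test:9092"}, cfg)
	if err != nil {
		log.Fatalf("cannot reach new cluster: %v", err)
	}
	defer client.Close()

	// All brokers registered in DNS by the init containers should show up here.
	for _, b := range client.Brokers() {
		fmt.Printf("broker %d at %s\n", b.ID(), b.Addr())
	}

	topics, err := client.Topics()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d topics visible\n", len(topics))
}
```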

Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of the systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person attempt to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice for route calculations. Deploy this as a Kubernetes job called "haymaker" and hide it well enough so that it does not immediately show up in the Linkerd service mesh (yes, evil 😈). Later run the "Wheel" exercise and take note of any gaps we have in playbooks, metrics, alerts and so on.
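hey does the heavy lifting here, but the same kind of load can be generated with a few lines of Go, which is handy when the scenario needs custom behaviour. A minimal sketch with a made-up target URL:

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	// Target, concurrency and duration are placeholders for the exercise setup.
	url := flag.String("url", "http://route-service.default.svc:8080/calculate", "target URL")
	workers := flag.Int("c", 50, "concurrent workers")
	duration := flag.Duration("d", 5*time.Minute, "test duration")
	flag.Parse()

	var ok, failed int64
	deadline := time.Now().Add(*duration)

	// Hammer the endpoint from a pool of goroutines until the deadline passes.
	var wg sync.WaitGroup
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Get(*url)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```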

During the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as an asynchronous stream (Rust + Tokio) and want to figure out how well this works with real data. Turns out there is a bug somewhere in the parser guts and I need to add some in-depth logging to track it down. Find a wonderful tracing library for Tokio and get carried away with it ...

Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been edited out. We are hiring.


