Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: things like remote control, path finding, matching robots to customers, fleet health management, and interactions with customers and merchants. All of this needs to run 24×7, without interruptions, and scale dynamically to match the workload.
SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is the platform of choice, and we use it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus, Grafana, Loki, Linkerd, and Jaeger. CI/CD is handled by Jenkins.
A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform, and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption budgets, or optimizing Spot instance usage. Sometimes it is like laying bricks: simply installing a Helm chart to provide a particular piece of functionality. But often the "bricks" must be carefully picked and evaluated (is Loki the right choice for log management, is a service mesh worth it and if so which one), and sometimes the functionality does not exist in the world and has to be written from scratch. When that happens we usually turn to Python and Golang, but also Rust and C when needed.
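As one small example of the kind of "brick" this involves, a PodDisruptionBudget keeps voluntary evictions (node drains, graceful Spot instance handover) from taking down too many replicas at once. A minimal sketch; the name, label, and replica count below are illustrative, not our actual configuration:

```yaml
# Hypothetical PDB: keep at least 2 replicas of a stateful service
# running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mongodb-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mongodb
```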
Another big area that SRE manages is data and databases. Starship started out with a single monolithic MongoDB, an approach that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousand. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering, and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the existing database infrastructure. Examples: adding MongoDB observability with a sidecar proxy that analyzes database traffic, enabling PITR support for databases, automating regular failover and recovery tests, collecting Kafka metrics for capacity planning, enabling data retention.
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship's production systems. While SRE is occasionally called in to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!
A day in the life of an SRE
Arrive at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.
Notice that MongoDB connection latencies spiked during the night. Digging into the Prometheus metrics with Grafana, discover that this happens while backups are running. Why is this suddenly a problem, when we have been running these backups for years? It turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all available CPU. It looks like the load on the database has grown just enough to make this noticeable. This is happening on a standby node with no impact on production, but it is still a problem should the primary fail. Add a Jira item to fix this.
In passing, change the MongoDB prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run the Jenkins pipeline to roll the new probe out to production.
At 10 am there is a standup meeting; share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity issues, and piloting canary deployments with Flagger.
After the meeting, resume the work planned for the day. One of the things I planned to do today was to set up an additional Kafka cluster in the staging environment. We are running Kafka on Kubernetes, so it should be straightforward to take the existing cluster's YAML files and adapt them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available by now? No, don't go there: too much magic, I want explicit control over my manifests. Raw YAML it is. An hour and a half later, the new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating credentials for the applications required a small bash script to set up the accounts on Zookeeper. The one piece left dangling was setting up Kafka Connect to capture database change-log events; it turns out the test databases are not running in ReplicaSet mode, so Debezium cannot find an oplog to tail. Backlog this and move on.
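For context on that last snag: Debezium's MongoDB connector tails the oplog (or change stream), which only exists when mongod runs as a replica set, even a single-node one. Fixing the test databases would mean something like the following mongod.conf fragment (a sketch; rs0 is an arbitrary name), plus a one-time rs.initiate() from the mongo shell:

```yaml
# mongod.conf fragment: run even a single-node test instance as a
# replica set so the oplog exists for Debezium to tail.
replication:
  replSetName: rs0
```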
Next it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of our systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice that does route calculations. Deploy this as a Kubernetes Job called "haymaker" and hide it well enough that it does not immediately show up in the Linkerd service mesh (yes, evil 😈). Later, run the "Wheel" exercise and take note of any gaps we have in playbooks, metrics, alerts, and so on.
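A load-test Job of this sort might look roughly like the sketch below. The image, target URL, and numbers are placeholders, not our real setup; hey's -z, -c, and -q flags set test duration, concurrency, and per-worker request rate:

```yaml
# Hypothetical "haymaker" Job hammering a route-calculation service.
apiVersion: batch/v1
kind: Job
metadata:
  name: haymaker
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hey
          image: example-registry/hey:latest # placeholder image with hey installed
          args: ["-z", "30m", "-c", "50", "-q", "10",
                 "http://route-service:8080/calc"]
```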
In the last few hours of the day, block all interrupts and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as an asynchronous stream (Rust + Tokio) and want to figure out how well it works with real data. It turns out there is a bug somewhere in the parser's guts, and I need to add deep logging to track it down. Find a wonderful tracing library for Tokio and get carried away with it…
Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been fictionalized. We are hiring.