Avoiding Data Difficulties: Starship's Data Mesh, by Taavi Pungas (Starship Technologies)


Taavi Pungas

A gigabyte of data for one bag of food. That is what a single robotic delivery generates. It is a lot of data, especially when you repeat it more than a million times, as we have.

But the rabbit hole goes deeper. The data is also extremely diverse: robot sensor and image data, user interactions with our apps, order transaction data and much more. And the uses are equally diverse, ranging from training deep neural networks to creating polished visualizations for our merchant partners, and everything in between.

Until now, we have been able to handle all this complexity with a centralized data team. However, our continued exponential growth has led us to look for new ways of working to keep pace.

We have found that the data mesh paradigm is the best way forward. I will describe Starship's take on data mesh below, but first, let's briefly summarize the approach and why we chose it.

What is a data mesh?

The data mesh framework was first described by Zhamak Dehghani. The paradigm rests on four core concepts: data products, data domains, data platform and data governance.

The main purpose of the data mesh framework is to help large organizations overcome data engineering difficulties and deal with complexity. It therefore addresses many details that matter in an enterprise setting, from data quality, architecture and security to governance and organizational structure. So far, only a few companies have publicly announced that they follow the data mesh paradigm, all of them large, multi-billion-dollar enterprises. However, we believe it can be applied successfully in smaller companies too.

Starship's data mesh

Do data work close to the people who produce or consume the information

To run hyperlocal robotic delivery marketplaces around the world, we need to turn a wide variety of data into valuable products. The data comes from robots (e.g. telemetry, routing decisions, ETAs), from merchants and customers (through their apps, orders, offerings, etc.) and from all operational aspects of the business (from brief remote operator tasks to the global logistics of spare parts and robots).

The variety of use cases is the key reason that drew us to the data mesh approach: we want to do data work very close to the people who produce or consume the information. By following data mesh principles, we hope to meet the diverse data needs of our teams while keeping central oversight reasonably light.

Because Starship is not yet at enterprise scale, it is not practical for us to implement every aspect of a data mesh. Instead, we have opted for a simplified approach that makes sense for us now and puts us on the right path for the future.

Data products

Determine what your data products are – each with an owner, interface and users

Applying product thinking to our data is the foundation of the whole approach. We think of anything that exposes data to other users or processes as a data product. It can expose its data in any form: as a BI dashboard, a Kafka topic, a data warehouse view, a response from a predictive microservice, and so on.

A simple example of a data product at Starship might be a BI dashboard that lets site managers track the business volume of their site. A more complex example would be a self-service pipeline that lets robot software engineers send any kind of driving information from the robots to our data lake.

In any case, we do not treat our data warehouse (actually a Databricks lakehouse) as a single product, but as a platform that supports a number of interconnected products. Such granular products are usually owned by the data scientists or engineers who build and maintain them, not by dedicated product managers.

The product owner is expected to know who their users are and what needs the product addresses for them and, based on that, to define and justify the product's quality expectations. Perhaps as a result, we have started paying more attention to interfaces, the components that are crucial to usability but time-consuming to change.

Most importantly, understanding the consumers and the value each product creates for them makes it much easier to prioritize ideas. This is critical in a startup context, where you have to move fast and do not have time to do everything perfectly.
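As an illustration only, the owner, interface and users of a data product can be written down as a small record and checked for gaps. This is a minimal sketch, not Starship's actual tooling: the class and function names, the example owner and the "refreshed daily" expectation are all hypothetical. Only the four interface forms come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class InterfaceKind(Enum):
    """Interface forms named in the text; the enum itself is hypothetical."""
    BI_DASHBOARD = "bi_dashboard"
    KAFKA_TOPIC = "kafka_topic"
    WAREHOUSE_VIEW = "warehouse_view"
    MICROSERVICE_RESPONSE = "microservice_response"


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor: who owns a product, how it exposes data, who uses it."""
    name: str
    owner: str
    interface: InterfaceKind
    consumers: tuple[str, ...] = ()
    quality_expectations: tuple[str, ...] = ()


def review(product: DataProduct) -> list[str]:
    """Return the list of gaps an owner would still need to fill in."""
    problems = []
    if not product.owner.strip():
        problems.append("no owner")
    if not product.consumers:
        problems.append("no known users")
    if not product.quality_expectations:
        problems.append("no quality expectations")
    return problems


# Example values are invented for illustration.
site_dashboard = DataProduct(
    name="site_volume_dashboard",
    owner="data scientist A",
    interface=InterfaceKind.BI_DASHBOARD,
    consumers=("site managers",),
    quality_expectations=("refreshed daily",),
)

draft_topic = DataProduct(
    name="robot_driving_events",
    owner="robot software engineer B",
    interface=InterfaceKind.KAFKA_TOPIC,
)

print(review(site_dashboard))  # no gaps
print(review(draft_topic))     # users and quality expectations still undefined
```

Keeping the record this small is deliberate: it only captures the three things the owner is expected to know, and leaves everything else to the product itself.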

Data domains

Group your data products into domains that reflect the company’s organizational structure

Before we learned about the data mesh model, we had successfully used a format of lightly embedded data scientists at Starship for a while. In practice, some key teams had a data team member working with them part-time, whatever that meant in each particular team.

We went on to define data domains in line with our organizational structure, this time taking care to cover every part of the company. After mapping data products to domains, we appointed a data team member to oversee each domain. This person is responsible for looking after the entire set of data products in the domain. Some of these they own themselves, others are owned by other engineers on the domain team, or even by other data team members (e.g. for resourcing reasons).

There are a few things we like about our domain setup. First, every area of the company now has a person who looks after its data architecture. Given the subtleties inherent in each domain, this is only possible because we have divided the work.

Creating structure in our data products and interfaces has also helped us make better sense of our data world. For example, in a situation with more domains than data team members (currently 19 vs. 7), we now do a better job of making sure that each of us works on an interrelated set of topics. And we now understand that, to ease growing pains, we should minimize the number of interfaces that are used across domain boundaries.

Finally, a subtler bonus of using data domains: we now feel we have a recipe for tackling any new situation. Whenever a new initiative emerges, it is much clearer to everyone where it belongs and who should run with it.

There are also some open questions. While some domains naturally lean towards mostly exposing source data and others towards consuming and transforming it, some do a fair amount of both. Should we split them when they grow too big? Or should we have subdomains within the larger ones? We will have to make these decisions along the way.
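To make the point about interfaces used outside their own domain concrete, here is a minimal sketch of how such interfaces could be listed once products are mapped to domains. The domain names, product names and usage pairs are invented for illustration, and the function is a hypothetical helper, not part of Starship's platform.

```python
from collections import defaultdict

# Hypothetical mapping of data products to the domain that owns them.
PRODUCT_DOMAIN = {
    "robot_telemetry_table": "robotics",
    "order_transactions_view": "orders",
    "site_volume_dashboard": "operations",
}

# Hypothetical (consuming domain, product) pairs describing who uses what.
USAGE = [
    ("operations", "order_transactions_view"),
    ("operations", "site_volume_dashboard"),
    ("orders", "order_transactions_view"),
    ("operations", "robot_telemetry_table"),
    ("orders", "robot_telemetry_table"),
]


def cross_domain_interfaces(product_domain, usage):
    """Map each product to the set of other domains that consume it."""
    external = defaultdict(set)
    for consumer_domain, product in usage:
        if product_domain[product] != consumer_domain:
            external[product].add(consumer_domain)
    return dict(external)


external = cross_domain_interfaces(PRODUCT_DOMAIN, USAGE)
# Products with the most outside consumers come first: these are the
# interfaces where a change touches the most other domains.
for product, domains in sorted(external.items(), key=lambda kv: -len(kv[1])):
    print(product, sorted(domains))
```

A product that is only consumed inside its own domain does not appear in the result at all, which is the state the text suggests aiming for wherever possible.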

Data platform

Enable the people who build your data products through standardization, without centralization

The purpose of the Starship data platform is clear: to enable a single data person (usually a data scientist) to look after a domain end to end, i.e. to keep the central data platform team out of the day-to-day work. This requires providing domain engineers and data scientists with good tools and standard building blocks for their data products.

Does this mean that you need a full data platform team for the data mesh approach? Not really. Our data platform team consists of one data platform engineer, who in parallel spends half of their time embedded in a domain. The main reason we can be so lean in data platform engineering is our choice of Spark + Databricks as the core of our data platform. Our previous, more traditional data warehouse architecture placed a significant data engineering overhead on us because of the diversity of our data domains.

We have found it useful to draw a clear line in the data stack between the components that are part of the platform and everything else. Some examples of what we provide to domain teams as part of our data platform:

  • Databricks + Spark as a work environment and universal computing platform;

As a general approach, our goal is to standardize as much as makes sense in our current context, even parts that we know will not remain standardized forever. As long as it helps productivity right now and does not centralize any part of the process, we are happy. And of course, some elements are entirely missing from the platform at the moment. For example, tooling for data quality assurance, data discovery and data lineage is something we have left for the future.

Data governance

Strong personal ownership supported by feedback loops

Having fewer people and teams is actually an advantage in some aspects of governance; for example, it is much easier to make decisions. On the other hand, our key governance question is also a direct consequence of our size. If there is one data person per domain, they cannot be expected to be an expert in every potential technical aspect. However, they are the only person with a detailed understanding of their domain. How do we maximize the chances that they make good choices within their domain?

Our answer: through a culture of ownership, discussion and feedback in the team. We have borrowed extensively from Netflix’s management philosophy and cultivated the following:

  • personal responsibility for the result (for one's own products and domains);

We also made several specific agreements on how to approach quality, documented our best practices (including naming conventions), and so on. But we believe that good feedback loops are the key ingredient in turning guidelines into reality.

These principles also apply beyond the “building” work of our data team, which was the focus of this blog post. Obviously, there is much more to how our data scientists create value in the company than providing data products.

One last thought on governance: we will keep iterating on our ways of working. There will never be a single “best” way to do things, and we know we have to adapt over time.

Concluding remarks

That's it! These were the four core data mesh concepts as applied at Starship. As you can see, we have found an approach to data mesh that suits us as an agile, growth-stage company. If it sounds appealing in your context, I hope reading about our experience has been helpful.

If you want to get involved in our work, see our careers page for a list of open positions. Or check out our YouTube channel to learn more about our world-leading robotic delivery service.

Contact me if you have questions or thoughts and let’s learn from each other!


