March 29, 2021
Your background
Tell us about yourself and your background
I am an environmental biologist. I decided to study biology because I wanted to help find a cure for cancer, so biochemistry was my logical specialization. But I ended up going down the ecology path because I found it was the way to connect everything I had learned, and it gave me a broad picture of how nature works. In a way, it is very similar to software and data engineering. You have a backend made of all the ecological networks and flows, and then you have the landscape, the frontend of nature. Data is being transferred all over the place. I haven’t worked as a biologist yet, but I think my time at university structured my brain and influenced my way of thinking a great deal.
After taking a master’s degree in urban sustainability (I thought that was the solution to the “construction bubble” in Spain), I struggled to get a job in the environmental sector. I worked in retail a lot and even volunteered in Costa Rica, until the summer of 2015, when I broke my foot hiking in the Pyrenees. Thanks to this unfortunate episode, I learned to code during my recovery, specifically SQL, JavaScript and Python. Some months later, I joined the Solutions Engineering team at CARTO. I will always be thankful to Miguel Arias and Daniel Carrion for giving me that opportunity. During my time at CARTO, I had the chance to interact with the development team and learn how the platform works, but also with clients, partners and users, which also gave me a sense of the pains and problems we needed to solve.
What’s your position at Planet? What does your day look like?
I work as a Geospatial Data Engineer. I provide data and metrics support to several teams within the company, from customer and tech support to collection planning. On the one hand, my team maintains pipelines and databases of satellite imagery metadata and logs. On the other, we build and develop scripts, Geographic Information System plugins, dashboards and web mapping tools to analyze and visualize imagery collection and order fulfillment.
Data at Planet
What is the company about? What are you building?
Planet is a geospatial company providing daily global satellite imagery to a number of commercial and humanitarian organizations. We design, build, and operate a fleet of roughly 180 cubesats called Doves that image Earth’s landmass on a near-daily basis. We also operate 21 high-resolution satellites called SkySats. Planet also develops and provides the online software, tools and analytics that enable users to simply and effectively derive value from satellite imagery.
How is your data team[s] organized? People, roles…
You have to understand that, dealing with satellites, there is data moving all over the world! Instructions are sent to the satellites via antennas, and the health of the fleet (what is called telemetry), raw data and metadata are downlinked to the 48 different ground stations Planet operates around the world. Pipelines process the raw bytes (from the different bands) and combine them into our satellite imagery products. These can be processed further to generate seamless mosaics for our Basemaps or to extract objects such as boats or planes with our analytics feeds.
So it is not very surprising that there are many engineers and teams working with data across Planet. In our case, our team works with SkySat data. Specifically, one colleague focuses on helping automate and optimize the ingestion of SkySat orders. I work with the metadata of SkySat orders and imagery, and I engineer, analyze and visualize the data behind these processes. We’re looking forward to continuing to grow the team with more geospatial data engineers.
Can you describe the high level data architecture?
Let’s reduce the scope and talk only about the data architecture of the system I am currently working with: SkySat. This constellation can be tasked internally or by Planet’s end customers using our Tasking Dashboard or API. These orders are stored in a relational database. When an image for a particular order is collected and published, its metadata is also stored in the same database. If the image (we call it a “capture”) is valid, then the order is fulfilled.
The system I have built and maintain consists of a data warehouse project fed by ETL pipelines that pull updates from several internal endpoints on top of this relational database. The main reason for this separation was that my team needed access to the data so we could run heavy operations on it and adapt it to our needs. At that time, having two separate databases, one for production and another for metrics, made sense. For instance, there is an increasingly long list of views generated to be consumed by different stakeholders as data sources in internal dashboards and applications. But why a data warehouse? First, because it was easy to use thanks to its many SDKs, APIs and SQL interface, and second, because of its computing and spatial capabilities.
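To make that concrete, here is a minimal sketch of the kind of derived view this produces. The dataset, table and column names (metrics.skysat_captures, published_time, valid) are hypothetical, and warehouse_client stands in for whatever SDK or API the warehouse exposes, so treat it as an illustration of the pattern rather than the real code.

```python
# Minimal sketch: recreate a derived view consumed by an internal dashboard.
# Dataset, table and column names are hypothetical placeholders.
DAILY_CAPTURES_VIEW = """
CREATE OR REPLACE VIEW metrics.daily_captures AS
SELECT
  DATE(published_time) AS capture_date,
  COUNT(*) AS captures,
  SUM(CASE WHEN valid THEN 1 ELSE 0 END) AS valid_captures
FROM metrics.skysat_captures
GROUP BY DATE(published_time)
"""

def refresh_views(warehouse_client):
    """Re-create the views that dashboards and internal apps read from."""
    warehouse_client.query(DAILY_CAPTURES_VIEW)  # hypothetical SDK call
```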
Besides my current efforts, we are also working on an event model that will log every (subscribed-to) change in Planet’s different systems into a data warehouse project using Pub/Sub messages with a predefined schema. In doing so, Planet can understand the events related to a particular client, how our SkySat tasking satellites behave in terms of scheduling, imaging and downlinking, or how an order progresses over time. Basically, we are building a database of database events.
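As a rough sketch of what publishing one of those events might look like, assuming Google Cloud Pub/Sub; the topic, event type and fields below are illustrative, not Planet’s actual schema.

```python
import json
from datetime import datetime, timezone

from google.cloud import pubsub_v1  # assuming Google Cloud Pub/Sub

# Hypothetical event following a predefined schema: one subscribed-to change
# in an upstream system becomes one message, later loaded into the warehouse.
event = {
    "event_type": "order.fulfilled",      # illustrative event name
    "source_system": "tasking-api",       # illustrative source
    "order_id": "1234-abcd",
    "occurred_at": datetime.now(timezone.utc).isoformat(),
}

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "system-events")  # placeholders

# publish() returns a future; result() blocks until the broker accepts it.
publisher.publish(topic_path, data=json.dumps(event).encode("utf-8")).result()
```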
What are some of the hardest data problems you’ve had to solve so far?
Extracting and combining data from different systems has probably been the main difficulty I have come across. Some metadata is stored in microservices, some is still stored in a legacy system, some is scraped from the replacement API, while some events have started to be sent to the data warehouse…
The last few years have been a master class in the importance of table schemas and data types. Most of the problems I have encountered when debugging pipeline issues have been around these topics. Your code has to be very adaptable when you maintain a pipeline fed by an API that is still under active development.
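In practice that means a lot of defensive normalization before anything reaches a table. A tiny sketch of the pattern, with made-up field names:

```python
# Sketch of defensive normalization for an API that is still evolving:
# never assume a field exists or has kept its type. Field names are made up.
def normalize_capture(raw: dict) -> dict:
    """Coerce one API record into the schema the warehouse table expects."""
    return {
        "capture_id": str(raw.get("id", "")),
        # Newer API versions may rename fields; check the alternatives explicitly.
        "published_time": raw.get("published") or raw.get("published_time"),
        # Guard against nulls and strings sneaking into numeric columns.
        "cloud_cover": float(raw.get("cloud_cover") or 0.0),
    }
```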
Now that we have the data, it is time to put more eyes on the metrics. Making sense of the data requires both domain and in-house knowledge.
What are the hard ones still unsolved?
Working with a data warehouse is not the same as working with a relational database. I miss a lot of functionality, such as UPSERT or *all* the geospatial functions available in other tools. The UI is not very user friendly. I am sure I am missing lots of useful features in the ecosystem that could help me in my day-to-day data engineering tasks, but its interface and documentation are expansive and can be overwhelming.
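A common workaround when there is no native UPSERT is to emulate it in two steps, as in this generic sketch; the table and column names are hypothetical, and warehouse_client again stands in for the SDK.

```python
# Two-step emulation of UPSERT when the warehouse has no native support:
# delete the rows about to be replaced, then insert the fresh versions.
# Table and column names are hypothetical.
REPLACE_ORDERS = [
    """
    DELETE FROM metrics.orders
    WHERE order_id IN (SELECT order_id FROM staging.orders)
    """,
    """
    INSERT INTO metrics.orders (order_id, status, updated_time)
    SELECT order_id, status, updated_time FROM staging.orders
    """,
]

def upsert_orders(warehouse_client):
    for statement in REPLACE_ORDERS:
        warehouse_client.query(statement)  # hypothetical SDK call
```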
Orders are active for a period of time (from the start date to the expiration, cancellation or fulfillment date). What you have in your database is just a snapshot of the present moment. The same row (order, capture) can have up to 10 different timestamps, such as created time, updated time, start time, end time, fulfilled time… My colleagues and I struggle to come up with queries that use all of these timestamps to unnest and aggregate the data in a meaningful way. Similarly, a single row can have more than one geometry. For example, a capture can have the original geometry of the order (or area of interest), a portion of the latter if the order is too big, and the actual footprint.
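One pattern that helps with the timestamp problem is unpivoting those wide rows into a long, event-log-like shape before aggregating. A sketch of the idea, with illustrative column names rather than the real schema:

```python
# Sketch: unpivot the per-order timestamps into (order_id, event, event_time)
# rows so they can be aggregated like an event log. Column names are
# illustrative, not the real schema.
ORDER_TIMELINE = """
SELECT order_id, 'created'   AS event, created_time   AS event_time FROM metrics.orders
UNION ALL
SELECT order_id, 'started'   AS event, start_time     AS event_time FROM metrics.orders
UNION ALL
SELECT order_id, 'fulfilled' AS event, fulfilled_time AS event_time FROM metrics.orders
"""
```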
Tell us some interesting numbers (traffic, rows, gigabytes you store, process and so on).
Planet currently operates roughly 180 Doves and 21 SkySats circling the Earth, taking 3 million images every day. The satellites collect imagery of 350 million km² of landmass area, downlinking 25 TB of data to the 48 ground stations each day.
But the numbers I deal with are less breathtaking. My system is not interested in the image quality itself but in the metadata. The biggest tables are around 1 GB or 1.5 GB and they grow by 8-10k rows per day. I have created more than 60 views and a similar number of custom queries to be used in different analysis and visualization tools. There is a deployment job used as an ETL process that runs every 2 hours; I chose to go this route because of in-house support. The jobs just execute a Python script that makes API requests and loads the results with some data warehouse SDK code.
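In spirit, that job looks something like the sketch below; the endpoint URL, table name and warehouse_client are placeholders, not the real internals.

```python
import requests

# Minimal sketch of the 2-hourly ETL job: call an internal metadata endpoint
# and append the new rows to the warehouse through its SDK. The URL, table
# name and warehouse_client are hypothetical placeholders.
METADATA_URL = "https://internal.example.com/api/skysat/captures"

def run_etl(warehouse_client, updated_since: str) -> int:
    response = requests.get(METADATA_URL, params={"updated_since": updated_since})
    response.raise_for_status()
    rows = response.json().get("results", [])
    if rows:
        warehouse_client.insert_rows("metrics.skysat_captures", rows)  # hypothetical
    return len(rows)
```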
What’s the most interesting use case solved with data?
SkySat is being used for disaster monitoring and response, illegal mining prevention and deforestation monitoring. By the way, we were able to capture the snowfall that hit Madrid last January!
Internally, one of the most interesting use cases is still ongoing. SkySat orders usually have very demanding specifications in terms of cloud coverage and related issues. At the moment, we have a combination of processes to optimize the capture of cloud-free areas, quality-check pipelines, and manual checks done by operators. By extracting the assessments made by the machine, the pipeline and the operator, we can fine-tune the first two so that we rely less on the last.
Another very interesting problem we are solving is contention for capacity in a given region. As I said, not all areas of the Earth are equally interesting. Having the data and being able to plot it on a map in a meaningful way is helping the work of planners and customer managers a lot. I hope this will keep improving as we enrich our visualizations with events and satellite logs.
What are the most interesting, unexpected, or challenging lessons that you have learned at Planet (or any previous company you can talk about)?
I love it when technology meets geography. This happened at CARTO (my previous company) and it is happening quite a lot at Planet. For instance, you have a well-distributed fleet of satellites circling the Earth. First, you don’t take pictures of the open seas. Why? Because you need points of reference to anchor your images. Second, different parts of the globe are interesting to different customers. All in all, there are tech limitations and geography constraints.
In terms of growth and development, I am getting better at creating a network of peers. Sometimes you don’t have the resources at hand and need to learn something quickly to build what you have been asked for. Figuring out who is willing to share their knowledge is key in these situations. At Planet, I was the most technical person on my team (they know much more about satellites, imagery and mission operations), and I had to rely on external teams such as Software Platform and Professional Services to get started with new technologies.
Data engineering
What are the trends you see in data engineering?
Nowadays data is on the front page of every news outlet. And I am not talking about Big Data; I am referring to a very limited number of rows that are updated daily. My experience at Planet has taught me how difficult it is to maintain a database that depends on feeds from different sources that are in constant development. We need data products and solutions that can help with these kinds of small-data problems.
What are your favorite data tools?
A SQL console. SQL allows you to ask the right questions in a very straightforward way. Because my job involves working with geospatial data, visualizing the results of a query on a map is always magic.
Some blogs, books, podcasts, people related to data engineering (or tech in general) you read/follow?
I really enjoy listening to Command Line Heroes from Red Hat. I am completely fascinated by the beginnings of our industry. Another great podcast is 13 Minutes to the Moon from the BBC. It is great to hear about failures and errors in space and how the teams were able to understand them, keep working, and improve based on what they learned.
What other companies are you curious about, in terms of how they manage their data?
Google Maps. I would like to learn about their pipeline for updating their vector tiles and the continuous development they are doing in vector rendering.