Q&A With Priceline CTO Marty Brodbeck
Jun 03, 2021
In this week’s SlimDevOps Twitch show, Slim.AI CEO John Amaral interviewed Priceline Chief Technology Officer Marty Brodbeck. Marty is a visionary technologist and innovator who built high-performing technology teams at Shutterstock, Pearson, Pfizer, and Merrill Lynch. He focuses on building IT foundations that boost developer productivity and efficiency while also driving business value.
In this interview, John and Marty discuss how Priceline is building a modern tech platform in the travel industry. Marty dives deeper into topics around developer productivity, how containers are changing the way developers work, and steps his teams have taken to create smooth transitions to the latest technologies while maintaining their commitment to customer and developer experiences.
John: Will you tell us a little more about your background, your role, and the kinds of big things you have been working on?
Marty: The funny thing is that I was never going to get into technology. But here I am, 25 years later. I was going to be a lobbyist; my mom still scratches her head when I talk about where I landed.
I’ve had the chance to work in financial services, pharmaceuticals, and consumer products… I worked at Diageo for a long time, in addition to Pearson, Shutterstock, and now Priceline. Each one of those companies has been an interesting challenge for me.
I worked on my very first cloud project back in 2007 when I was at Pfizer. We were trying to figure out how to store clinical data in a much more efficient and cost-effective way. A small “start up” (at the time) called Amazon Web Services contacted us. They had a novel idea to store our data in a much more effective way. So, go figure, here we are in 2021 almost 14 years later, and the world has completely changed with AWS and other cloud providers. Still, some things have stayed the same.
I’ve been fortunate to work at a company as fantastic as Priceline. From a technology perspective, there are a few key things that we have been working on.
The first is: How do we drive future strategic business growth by making our applications and platforms more resilient and scalable? In thinking through that strategy, we started developing a plan to modernize all of our applications into a twelve-factor architecture running on containers and GKE. By the end of this year, we will have 80 percent of our entire product platform running in Google Cloud on containers and Kubernetes.
The tricky part about this is that it has not been a “lift and shift.” I have been through that experience and the problem is you are just moving garbage from one pile to another. I learned the hard way that in order to do these cloud-native modernization projects the right way, you have to start with the application and go from there.
Second, we’re thinking a ton about developer productivity. How do you move daily activities into the developer’s hands to make them more productive? The economics suggest that you will never have enough DevOps and SRE people to keep up with what developers are doing. So how do you shift those responsibilities left? This typically falls into the hands of infrastructure folks, who become enablers for developers.
Third, we’re transforming our data into a customer-first platform. This will be where we can stream data in real time, understand what our customers are doing on our platform, and, in response, personalize and recommend our products to them.
Those are the big macro things that we have been working on at Priceline.
John: How has containerization shifted your thinking about how developers work together?
Marty: Developer productivity is really easy in a startup; you can have one of your employees be responsible for making it happen within a small group of 10 people. When you are in organizations of 500-1000+ engineers, it gets more difficult.
Containers are basically leading the world at this point, at least within the last two companies I have been at. What I mean by that is, the whole notion of containers is to drive more productivity for developers. For example, I can develop a feature that is completely isolated and has no dependencies on other pieces of my infrastructure to design, develop, and deploy. In doing that, you basically have to buy in to the fact that you are going to shift a lot of the major responsibilities for the design and development of software into the hands of the developers, thereby embracing this “developer first” mentality. So what do I need to do to ensure that my developers have a toolbox that makes them as effective as they can possibly be?
In addition, COVID magnifies this topic: when you have a distributed workforce across five to six offices globally, how do you ensure developers are as efficient and effective as possible?
John: What key challenges do you face while working with developers?
Marty: I begin with the architecture around twelve-factor. What does twelve-factor really mean in practical usage? It is one thing to read a book, and then plan to build all your applications to be twelve-factor. But how do you actually embed it into your organization so that every developer understands what it means to develop and design twelve-factor applications? It took a lot of developer training to actually implement this.
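As one concrete illustration of what twelve-factor means in day-to-day code, the third factor says configuration should live in the environment rather than in the codebase. The following is a minimal Python sketch of that idea, not Priceline's actual code; the `Settings` class, the `load_settings` helper, and the variable names `DATABASE_URL`, `PORT`, and `DEBUG` are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Configuration for a hypothetical service, read once at startup."""
    database_url: str
    port: int
    debug: bool


def load_settings(env=None):
    """Build Settings from environment variables (twelve-factor, factor III).

    `env` defaults to the real process environment; passing a plain dict
    makes the function easy to exercise in tests.
    """
    env = os.environ if env is None else env
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if missing
        port=int(env.get("PORT", "8080")),  # optional, with a default
        debug=env.get("DEBUG", "false").lower() == "true",
    )


if __name__ == "__main__":
    # Example values only; a real deployment would inject these per environment.
    print(load_settings({"DATABASE_URL": "postgres://localhost/app", "PORT": "9000"}))
```

Because nothing environment-specific is baked into the image, the same container can run unchanged in development, pre-production, and production.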
First, we dipped our toe in the water by targeting one or two groups to move to this twelve-factor mentality and then applying all of the great principles of containers to that. Stitching those two things together made us look at how we should be modernizing our CI/CD pipeline to embrace those concepts. That meant not only new architecture practices, but also the notion of having a monorepo at the center of our developer experience and using that to foster teamwork.
After this, we introduced open source technologies like Rush to understand the interdependence between different projects. So, if I am developing this new microservice, which other projects within my monorepo depend on it? When you get into the build cycle, you have to think about how to package up this business function into a container that can then be deployed seamlessly into production. You start off with one team around that, and then you multiply it by ten, which entails a different class of problems.
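The "who depends on my project?" question can be sketched as a reverse walk over a dependency graph. The Python below is an illustrative sketch only: it does not use Rush or its configuration format, and the `dependents_of` function and the project names in `MONOREPO` are hypothetical.

```python
from collections import defaultdict, deque


def dependents_of(project, dependencies):
    """Return every project that directly or transitively depends on `project`.

    `dependencies` maps each project name to the list of projects it depends on.
    """
    # Invert the "depends on" edges so we can walk from a project to its consumers.
    consumers = defaultdict(set)
    for name, deps in dependencies.items():
        for dep in deps:
            consumers[dep].add(name)

    # Breadth-first walk over the inverted graph.
    affected, queue = set(), deque([project])
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return sorted(affected)


# Hypothetical monorepo: each project lists the projects it depends on.
MONOREPO = {
    "pricing-service": [],
    "cart-service": ["pricing-service"],
    "checkout-web": ["cart-service", "ui-components"],
    "ui-components": [],
    "search-web": ["ui-components"],
}

if __name__ == "__main__":
    print(dependents_of("pricing-service", MONOREPO))
```

In this made-up graph, a change to `pricing-service` affects `cart-service` directly and `checkout-web` transitively, which is the set of projects a build would need to rebuild and retest.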
John: How do you manage that change to cloud-native without overwhelming the engineering organization?
Marty: We tackle it step by step, to be honest with you. It would be complete chaos to try to boil the ocean and say we are going to take the entire CI/CD pipeline and modernize it at the same time.
We’ve been pretty deliberate about the changes that we’ve made. We started off knowing that we needed to have a monorepo architecture to facilitate sharing code, which was very important. We then understood the need for a very modern build architecture, so we transitioned to GitHub Actions. Following this, there was a whole piece around automating build infrastructure and our GKE environment. We also had to decide how to script those, so we moved a lot of our scripting to Terraform.
Finally, we wanted to create a very declarative way to facilitate continuous deployment. How do we do a canary deploy and how do we roll that back? So that is where we have introduced technologies like Harness to facilitate the continuous deployment model that we want in our applications.
We did not do all of those things at the same time; we divided them up step by step. Parallel to these steps, we were educating and training the organization on twelve-factor development.
John: How do you know these changes are effective in helping your teams create great software?
Marty: I think about things like financial processes that the organization views as the most mission critical; there is a lot of investment that goes into optimizing those. Well, today, for many companies, the software development process is the single most important business process that they have. So to optimize it, you almost have to take a Lean Six Sigma approach to it. In other words, what are the bottlenecks that are causing my developers to not deliver features and functions in a timely manner? That is the way we looked at it.
When I worked at Pfizer, we were big on Lean Six Sigma. We looked at our manufacturing process, and we found out what the bottlenecks were. We applied the same thing to software development. Where are developers getting tied up in our process? Where can we make them most efficient? For instance, we would deploy our code to 100 percent of the environment and find out ten minutes later that we had an issue that was impacting customers, which had a real impact. We focused on introducing canary deployments to allow our developers to go faster, while minimizing the error rate that our customers would see. That was one Lean Six Sigma practice that we applied to our software development process.
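The canary idea can be sketched as a simple gate: send a small share of traffic to the new version, compare its error rate with the current version, and only then promote or roll back. The Python below is a minimal sketch under stated assumptions, not how Priceline or any particular deployment tool implements it; the `canary_decision` function and the `max_ratio` and `min_requests` thresholds are hypothetical.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=1.5, min_requests=500):
    """Return 'wait', 'promote', or 'rollback' for a canary release.

    max_ratio and min_requests are illustrative thresholds, not recommendations.
    """
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Roll back when the canary errors noticeably more often than the baseline.
    # Note: with a zero baseline error rate, any canary error triggers a rollback.
    if canary_rate > baseline_rate * max_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    # Baseline: 50 errors in 10,000 requests; canary: 30 errors in 600 requests.
    print(canary_decision(50, 10_000, 30, 600))
```

The point of the gate is that a bad release is caught while it serves only a fraction of customers, instead of after a deploy to the whole environment.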
At Priceline, we also want to make sure that we enable sharing more effectively. If I’m developing a React component, why not make sure everyone is sharing and using that component? That is why we developed a monorepo. We looked at it in a manufacturing way and asked ourselves what were the best things that we could do to make developers more productive.
From an engineering perspective, we measure ourselves on how well features drive higher conversion rates on our website. We have an A/B testing culture, which is important. The more features we develop, the more tests we can do, and our conversion rates improve as we find wins for the customer. Developing new features is very important for us.
John: What is the next innovation that you think is going to help individuals like you grow to be faster and stronger?
Marty: Containers are leading the world. By that I mean, you have all of these autonomous units of work running in your environment. You are able to iterate and develop on those.
From a monitoring and infrastructure perspective, you don't know what is running inside of those things. It is kind of like a “black box of goodness” running in your environment, yet you do not have a strong sense of what is actually being run inside.
Then, you compound that with development teams all over the world — how do you standardize a version of a container for a developer? Our front-end team would have a standard container that they would operate with, with the appropriate software and versioning. Our back-end engineers would have the same thing. Container management, understanding how those things are operating in your environment and the interdependencies between them, is a problem that will start to emerge once we get our entire product platform into Kubernetes.
After this, it is crucial to effectively monitor and maintain those containers as software becomes more of a remote practice. Forcing container standardization at the design level will become even more important for your development teams to be productive.
John: How do you keep track of the interdependencies in a microservices environment? What problems arise?
Marty: Traditionally, people would take their APM tool and view how those things interact together or create an artifact that would define the architecture. But as you know, things change and people can leave the organization. The APM will trigger an event when there are issues with CPU or memory. Unfortunately, those tools don’t lay out the interdependencies at a physical level and how containers are dependent upon one another.
A great example is a check-out flow, where a customer puts an item into a cart, the cart accounts for all its products, those products are then priced, and finally you trigger a payment event using a credit card or an alternative payment. That could consist of eight to 15 containers making up that flow. But, today, there are no tools to illustrate the interdependence between Kubernetes clusters and how the containers should be versioned, updated, and managed. All of the physical monitoring happens today at the Kubernetes level, not physically at the container level.
It is the age-old problem where the bug gets into production and then you have to go all the way back to the design in order to figure out where the issue was. It would be nice if you understood that problem at the design level and could iterate on testing it before it reaches production.
John: What is a long-standing challenge in software development that you’ve seen throughout your career?
Marty: The one thing that I've seen teams really struggle with as we move to cloud is figuring out effective auto-scaling and HPA (Horizontal Pod Autoscaler) at scale. Being able to make that a design-time decision and figure out how these things should auto-scale together would be wonderful. We used Helm at previous companies, but the feedback was that it can be difficult for some developers to work with. If there were a declarative way to figure out auto-scaling for containers at design time, and iterate on it until you get into production, I believe that would be a huge productivity gain. Often, we try to figure these things out and auto-scale them in the run cycle. When building out these applications, it is better to think through these scale problems at the design phase.
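For readers who want to reason about auto-scaling before anything is deployed, the core rule documented for the Kubernetes Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a small tolerance of 1 (0.1 by default). The Python below is a simplified sketch of that rule only; the `desired_replicas` function and its parameter names are hypothetical, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the documented HPA scaling rule."""
    ratio = current_metric / target_metric

    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

Running a few expected load levels through a function like this at design time gives a rough sense of how far a service would scale, and whether the chosen minimum and maximum bounds are plausible, before it ever reaches a cluster.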
John: Do you find there to be a gray area between the team building your infrastructure and the team building the apps on the infrastructure?
Marty: All the time. I think that the hardest thing is scaling Kubernetes. The things inside Kubernetes are containers, so you run into memory leaks or even trouble just running them effectively. We still need to figure out the interaction between Kubernetes and containers at a design level to simulate how they should be running. It is to the point where you design your container, build your code, and then put it all in Kubernetes, where you then find out all of the issues that you are having with it. It would be great to deal with those problems during design. When you are in production and it is running already, that is when you are trying to diagnose the problem and it’s too late. It’s very difficult to go back to the repos. You have to re-engineer each step all over again.
John: How are you facilitating testing at the developer level?
Marty: We have a function within Priceline called SDETs, who are software developers in test. They work very closely with the developers on designing functional tests for their code within a particular area. Those tests then run through the build cycle. In our pipeline, we have automated testing for smoke, integration, quality, security, et cetera. We just recently introduced this notion of an approved pre-prod Kubernetes environment that allows us to simulate a lot of what we were doing in K8s. We saw all the issues that can happen when you don't appropriately test these things in a pre-prod environment. We have made a significant investment in setting up an automated pre-prod environment so that developers can more seamlessly test their containers in a live environment before they go into production, allowing them to catch some of these issues.
John: If you were to give one piece of advice to a tech innovator, what would it be?
Marty: In doing these things, I have had my fair share of failures. I think that failing fast is one of the greatest lessons that I have learned. In a lot of these common practices, it is often trial-and-error. You don’t know how it is going to work, but you know it is the right thing to do. You may get frustrated by your culture or organization because it may or may not be ready for something new.
The other big thing in my job is to listen to what our developers are telling us as an organization and make sure that we are removing the roadblocks to make them more productive. Consistently taking a customer-centered approach to developer productivity is something that we have focused on. I have learned in the past that losing focus on a customer-centered approach or not listening to developers can basically disenfranchise an entire engineering organization, which is not a good place to be.
Audience Question: Priceline must be doing a ton with data and machine learning. What AI/ML tools are you using within your infrastructure?
Marty: For our data streaming architecture, we capture all of our customer data through Kafka. We are introducing a new framework from a company called RudderStack that allows us to collect user patterns on a website. All of our data is streamed into Kafka, and from Kafka we put it into our data lake, which runs on Cloud SQL from Google. We then run TensorFlow on top of that to operationalize a lot of our AI/ML models.
We are also introducing new technology from a company called Valohai, which gives you a lot of machine learning operations capabilities, similar to DevOps. We basically roll out and run our AI/ML models through that infrastructure.
John: Thanks, Marty. Developers are driving the modern economy and we need to support them every way we can. It is how the world goes around today. Thanks again for all of your insight and for being open to sharing tons of good knowledge.
Thank you for reading our interview with Marty Brodbeck! If you would like to watch the full interview, head over to our Twitch channel. Stay tuned for more episodes on our Twitch!
We would love your feedback and thoughts! Let us know what you think on Twitter @SlimDevOps.