The Great Data Debate

Lakes v. warehouses, analytics v. AI/ML, SQL v. everything else… As the technical capabilities of data lakes and data warehouses converge, are the separate tools and teams that run AI/ML and analytics converging as well?

In this podcast, originally recorded as part of Fivetran’s Modern Data Stack conference, five leaders in data infrastructure debate that question: a16z general partner and pioneer of software-defined networking Martin Casado; former CEO of Snowflake Bob Muglia; Michelle Ufford, founder and CEO of Noteable; Tristan Handy, founder of Fishtown Analytics and leader of the open source project dbt; and Fivetran founder George Fraser.

Their conversation covers the future of data lakes, the new use cases for the modern data stack, data mesh and whether decentralization of teams and tools is the future, and how low we actually need to go with latency. And while the topic of debate is the modern data stack, the themes and differing perspectives hit on an even more longstanding question: how does technology evolve in complex enterprise environments?

Show Notes

The future of data lakes [1:07] and specific operations that may impact their usefulness [6:01], including AI/ML [8:55]
The evolution of two-stack architecture [9:35] and Arrow as a potential solution [11:32]
The pros and cons of a data mesh [16:18], future use cases for the modern data stack [20:07], and data apps [22:05]
Discussion of latency and ways to reduce it [22:46], and predictions for a future data platform [25:41]

Transcript

The future of the data lake

George: I’m going to kick this off with a spicy topic, at least spicy in this crowd, which is data lakes. Data lake is a blurry term used by different people to mean different things, but for the purposes of this discussion, let’s define data lakes as tabular data – so tables, rows and columns – stored in an open source file format, like Parquet or ORC, in public cloud object storage, like S3 or Google Cloud Storage.

In a world where we have data warehouses that use object storage to store their data and give you some of the advantages of data lakes, do data lakes still have a place? Let’s start with you, Martin, does the data lake have a future?

Martin: One of the biggest fallacies that we do as an industry is we look at an architecture, and we’re like, oh, that can do all of these things, therefore it will be pushed into service to do all of these things. And that’s just not how technology evolves. We make decisions in the design space based on the primary use cases that technology is being used for.

If you look at the use cases that data warehouses are being used for, they’re largely driven by analytics, which is a certain workflow, it’s a certain query pattern. And if you look at data lakes, it’s actually quite different. They tend to have more unstructured data, focused on operational AI, compute intensive. If you look at the respective technologies, they’re just being optimized in this massive design space for different use cases.

Architecturally, sure, they can both do what the other one does, but in the end, you’ve got products and companies optimized around use cases. And I think the operational AI use case is the larger one, and it’s growing faster. So I actually think over time you can argue that it’s the data lake that ends up consuming everything, not the data warehouse.

George: You’re just trying to provoke Bob there, Martin.

Bob: You succeeded.

Martin: I’m watching Bob’s face.

George: All right, Bob. Let’s hear from you. The data lake, does it have a future?

Bob: No, I see these things very largely converging onto a relational SQL-based model. Five years from now data is going to sit behind a SQL prompt, and SQL data warehouses will replace data lakes.

From the perspective of storing structured and semi-structured data, the cloud SQL data warehouses already do everything that is necessary, and there really is no reason for people to have a separate data lake except for historical precedent. A lot of companies come from environments where they had a lot of semi-structured data in a Hadoop environment, and having a data lake is a natural transition. And in a sense, the data lake, which is really S3 storage together with any tools you want to put on top of it, is a very generalized platform.

But, over time, infrastructure evolves to take on more and more of the use cases. SQL relational data warehouses have evolved to the point that for structured and semi-structured data, storage and query, they subsume pretty much all of what needs to be done today. What remains is images, video, documents, PDFs.

Now I don’t call that unstructured data. I think that’s a misnomer. There is no such thing as unstructured data. All data has structure of some kind. Structured data is tables, rows and columns. Semi-structured data is like JSON. It’s hierarchical in its nature. And I think there’s a third category of data, which is what I call complex data: images, documents, videos. Most things that are streaming fall into this category, and more and more machine learning can be applied to the content of those data sources that turn it into semi-structured data that can be used for building complex data applications and for doing predictive analytics.

So what’s missing in the case of the data warehouse today is the support for complex data. But that’s going to come. That’s called a feature. Can you imagine if you could transact, fully transact all of these types of images, videos, and things together with any source of semi-structured data in a data warehouse? The applications that open up are remarkable, and that’s going to come in the next two to three years.
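The distinction Bob draws between structured data (tables, rows and columns) and semi-structured, hierarchical data like JSON can be illustrated with a small sketch. This is not from the conversation: the `flatten` helper and the sample event are hypothetical, and real warehouses handle nested data in their own ways. It only shows how a hierarchical record maps onto flat, column-like keys.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects so each leaf becomes its own column.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars and lists are kept as-is; lists would need their own rows
            # or a child table in a fully relational model.
            flat[column] = value
    return flat


# Hypothetical semi-structured event, as it might arrive from an application.
raw = '{"id": 7, "user": {"name": "Ada", "geo": {"country": "UK"}}, "tags": ["a", "b"]}'
row = flatten(json.loads(raw))
print(row)
```

The nested `user` object becomes ordinary columns such as `user.name`, while the `tags` list is left untouched, which hints at why semi-structured data needs more than a plain table to represent faithfully.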

Michelle: I could see images being easily retrieved from the database. But do you actually see all of the image processing or the video processing taking place in the database as well?

Bob: Not with SQL. SQL can’t do that. You’ll use procedural logic and Python, or something else to do that, at least for now. In the long run, relational will win, too, but that’s probably more like 8 to 10 years away.

Martin: I think we’ve been waiting for that for 40 years, Bob.

Bob: We have, but look what’s happened. Over time, navigational and hierarchical in the 1980s was replaced with SQL. OLAP was replaced with SQL over the last 10 years or so. We’ve replaced MapReduce with relational. All of these things, relational always wins.

Michelle: Well relational wins for the actual retrieval, but what about the processing? The technology that you need to process images is fundamentally different than you do to retrieve data records.

George: Tristan, what are your thoughts on this?

Tristan: So, I completely agree that SQL is going to dominate data processing, at least a very large chunk of data processing, but there’s different APIs that the data lake and the data warehouse expose. There’s the file storage layer, and for a lot of reasons I believe that an organization will store their files one time. You will not have a data warehouse copy of the file and a data lake copy of the file, which, in some architectures today, that’s what you see. And that requires you to have an open source file format that is shared between your data warehouse use cases and your other use cases.

Above that you have indexing and metadata that is a core part of the data warehouse, but it’s also a core part of the data lake. I think those have to also start to converge so that different use cases can take advantage of the same stuff. And then you have the SQL prompt, and maybe, at the SQL prompt layer, the data warehouse dominates, but I think you need to allow different access patterns as well because one closed source firm is never going to accomplish literally all data processing use cases in the world.

Bob: All of these things should interoperate in an open source and an open format way. But the issues of format have kind of gone away because you can input and output any kind of format and export into any kind of format very easily.

The question is: what are the operations that actually need to be performed against data that sits in a data lake? Today anything associated with complex data, the data warehouse can’t help you, and so there’s a huge reason to have a data lake today. In 2025, I don’t think so.

I think that we really have five platforms being created globally: Snowflake, Databricks, and then the three clouds. Both Snowflake and Databricks, while they will come from very different places – Snowflake will always be SQL and declarative in its approach, and Databricks certainly historically has been procedural and code-based, so it’s a version of SQL versus code in some sense – you’ll see both companies and pretty much everybody else in the industry offering both within their platforms.

Martin: So, you’ve got two technologies that start with different use cases, somewhat different architectures, but they’re clearly going to a converged point, which is you have some declarative something, and you have some procedural something. Whether one is on top of the other at the end of the day, they can both do both. But, in the meantime, you have this decade-long journey, and in that decade-long journey, there is an architecture that’s optimized around use cases. The amount of tradeoffs and decisions you make when building one of these systems is…

Tristan: Yeah, like TimescaleDB has very different characteristics than Snowflake, and they are characteristics that are optimized for workflow.

Martin: Yeah, entire companies focusing on different points in the design space with different optimization parameters. It’s the use case that drives the technology because of all of the gravity around it. And so, again, if it turns out that AI/ML and an operational use is growing quicker, which it seems to be, that is going to dictate the technology from an architectural standpoint.

Tristan: Martin, you’ve said a couple times now that the AI/ML space is appearing to grow faster. I’ve actually not heard that assertion before.

Martin: Let me clarify. So broadly, there are two use cases. There’s the analytics use case, which is driven by queries and dashboarding. The other one is creating a complex model from a data scientist and then serving that in production. That does things like wait time prediction. That does things like fraud detection. That does things like dynamic pricing. These were folks in R building complex models on existing data and then coming up with a bespoke way of serving that. That is very clearly now turning into a pattern that’s being served by a data lake.

Now it’s on a much smaller base, but if you actually look in the industry, it’s a very rapidly growing use case.

George: Michelle, you’ve spent time in both the data science community and the analytics community, and notebooks in many ways are the place where these things sometimes come together. I’m curious to hear your thoughts about how the two stacks have evolved. Maybe they’re converging. Maybe they’re building each other’s features and getting more similar, but where does that take us? Do we still have two stacks five years hence?

Michelle: I think we’re going to continue to see greater and greater specialization because we’re not going to have the ability or the budget to hire enough data scientists. Those stacks are going to continue to evolve, and it’s going to be specialized based upon what it is that they’re trying to do.

The data lake will have a place. Your images, your blob storage, all of those things are probably going to remain in the data lake and have a home there for a long time to come. I just think it’s not going to look like how it looks today. Today, it’s just been a lack of understanding around what data do we really need to collect? We went from one extreme to the other. We weren’t collecting any data. Now we’re collecting everything because we don’t know what’s valuable. And the reality is that’s not necessarily a good idea either.

The movement of data, I think we’re going to see that stop, but format is going to be really important. We need that interoperability because reprocessing data at scale is just cost prohibitive. It’s time prohibitive. It’s not something we want to do if we can avoid it.

And I think you’re going to see decentralization here, at the lower levels, where you’ve got either the business units embedded, or you’ve got your new product teams, you’ve got your data science teams embedded in those product teams. You’re going to need a unifying layer at the very top in the form of technologies that make it easier for everybody to query or be able to serve information.

I think that the notebook is probably the best suited for that because it does have the language agnostic approach. It gives you the ability to look at both data and code and have all of that context, that rich business context, the visualizations. We’re going to see that evolve as this modern data document, and we can use that as part of our unifying layer because your data scientists can then work with R, your data analysts can work with SQL, but we can, at the end of the day, really hide all of the code and really get to: what is the business implication of these things that we’re doing?

Will two stacks become one?

George: This really brings us to the second major topic that I wanted to discuss, which is: how do we bring the machine learning, Python, Scala world, and the analytics, SQL, BI tool world together? There really are two stacks and two communities who sync the exact same data sources to Delta Lake and to Snowflake simply for operational reasons. There’s not a fundamental technological reason, but it’s just the way the tooling has evolved. It’s too inconvenient to cross that boundary.

And there’s essentially three visions of that world. One is that you’re going to put machine learning into SQL, and probably BigQuery is the furthest along in pursuing this. You basically create a bunch of UDFs that do your linear algebra stuff. The other is more the Databricks vision where you put SQL into Python or SQL into Scala and you use data frames to do that. And then there’s maybe a third vision where you use Arrow, the interchange format, and everything can just talk to each other, and you can arrange it any way you want.

Which of these visions do you think is going to win?
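The first vision George describes, machine learning exposed through SQL user-defined functions, can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. This is an illustration only and not how BigQuery or any other product implements it. The `predict_fraud` function, its coefficients, and the `payments` table are all hypothetical.

```python
import math
import sqlite3

# Hypothetical, hand-picked coefficients for a one-feature logistic model.
INTERCEPT = -3.0
WEIGHT = 0.05


def predict_fraud(amount):
    """Logistic regression score: sigmoid(intercept + weight * amount)."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + WEIGHT * amount)))


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function (a UDF).
conn.create_function("predict_fraud", 1, predict_fraud)

conn.execute("CREATE TABLE payments (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 10.0), (2, 60.0), (3, 200.0)],
)

# The model is now callable from plain SQL, next to the data.
rows = conn.execute(
    "SELECT id, predict_fraud(amount) AS score FROM payments ORDER BY id"
).fetchall()
for payment_id, score in rows:
    print(payment_id, round(score, 3))
```

The point of the sketch is the calling convention: whoever writes the query only sees a SQL function, while the numerical work lives in procedural code behind it.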

Michelle: What I would like to see win is something like Arrow, so that you have the interoperability. You’re going to see machine learning moving into SQL because you’re going to have data engineers who are perfectly capable and have the need to do some anomaly detection or some logistic regression, and it’s within their ability to do that. Feature engineering is just another data transformation for them. But they don’t have the same background in stats, and so they can only take it so far.

And then you’re going to see, on the other side of the spectrum, your data scientists where they have all of this really great math background, and they understand how to do more advanced deep learning, but they don’t have the technology skills. SQL is the most successful language for working with data, so you’re really going to see both of them really become capable of supporting both use cases. Ultimately, you’ll continue to see specialization where the things that you want to do if you’re trying to do deep learning are just fundamentally different than the types of things if you’re just trying to do predictive models.

Tristan: I think a lot about the Arrow vision of the world, and I think that will end up in the fullness of time dominating for the same reason that Martin has been talking about: tools end up evolving to the personas that they serve and the use cases they serve.

I want to do all the data prep and feature engineering. And then I want machine learning models to be trained on top of that. People do that, certainly. But the fact that the infrastructures to do those two different things are generally separate creates this big slowness. It’s purely a technical slowness. Arrow doesn’t solve all of that. Arrow certainly helps, but there’s dumb things like the servers that do those things are in different clouds. And the interchange fees, what do you, do you call them interchange fees?

George: Egress fees.

Tristan: Egress fees are expensive.

George: They’re criminal. They’re not just expensive. They’re ridiculous.

Tristan: As more people do this, it’s going to become smoother. They’re going to become more localized.

Martin: There’s a reason why you’ve got multiple languages, and it’s not because one is Turing complete and the other isn’t. The reason is because people build their entire workflow around languages and all of the tools, and so you’re going to have a heterogenous, fragmented system. Therefore you do need to have open interfaces.

George: Bob?

Bob: I’m a big believer, at this time, in the approach of having multiple systems that interact with common formats.

Arrow is a huge step forward for that, not just because it’s an efficient format, but because it provides a consistent in-memory layout for people to do advanced analytics in their Spark environments. It’s the way the world is working right now because most customers actually have a data warehouse and an analytics platform separately, and they are connecting them together.

Now, I’m going to continue to be the ultimate radical, however, and declare that the approach that we’re taking today in terms of machine learning is still roughly the approach of the internal combustion engine in the automobile. The approach that’s happening where Arrow ties together those predictive systems with declarative databases, that’s really the creation of the hybrid, or the Prius era.

Hybrid will dominate for the next, say, three to five years. You will see hybrid systems being built by every major vendor, and all of them will have a full predictive stack and a full declarative, relational, SQL stack built in using some kind of interface like that. But that’s only until relational actually solves the broader set of problems.

George: Does that mean that you’ll be using SQL functions, PredictX, or…?

Bob: No. Ironically, I think that while SQL will dominate well into the 2030s for doing data modelling and data transformation, there’s another step beyond that which is business modelling, and that needs to be represented in a knowledge graph. Knowledge graphs are how we’ll do predictive analytics in the 2030s. And what needs to happen is a whole new generation of data system that’s based on relational knowledge graphs to create that.

Data mesh: decentralized teams, unified architecture?

George: Michelle, you brought up a term earlier that I wanted to follow up on, which is data mesh. And I wonder if you could define that briefly for everyone because similar to data lakes versus data warehouses, there’s a question whether going forward that’s more of a historical phenomenon or an actual, good architecture that we want to continue.

Michelle: Data mesh is really a concept of decentralizing the data processing and the ETL and the analytics into each individual business unit and then having some sort of unifying solution at the top. To do this well requires having specialized data teams, having specialized roles, having infrastructure as a service available to them for data processing, and then having some overarching standards board, almost like a federation, of your data engineers to ensure that all of your ETL looks consistent so that as you are trying to do data retrieval on some common, query tool, you’ll have that familiarity that you need.

We are going to see things like Arrow really come to the forefront sooner rather than later. I think customers are going to demand it because of all the challenges that we’re currently having. You’ve got all of the cost of the storage and the processing. Your teams that are trying to do the processing don’t have the business context that they need. As a result, you have this back and forth and a lot of wasted time. You’ve got a lot of data quality errors. You have data multiple times. Ultimately, we want to take that body of knowledge and put the technology where that body of knowledge lives. The data mesh is an attempt to do that.

Bob: One part of what the data mesh folks are talking about is how to organize and how to structure a team to manage data across a large enterprise with very disparate and important data sources. That’s very, very important, and there’s some good ideas in data mesh for that.

Architecturally, data mesh has this sort of odd idea that data is basically streaming, and you can use facilities, like Kafka, to do transforms as the data is in flight. And I don’t believe that.

While there is streaming data, and you can do quite a bit with data that’s simply streaming — in other words, append-only data — to me, another critical source of data is transactional data coming out of business systems. The streaming solutions have no answer for that, and they just pretend that data consistency is unimportant. I don’t understand that because I put data consistency at the top of the issues that I think about when I think about managing data.

Martin: Mesh has historically been one of these terms that conflate architecture with administrative domains, and distant service mesh, and distant Wi-Fi mesh, and mesh networking, etc. I think actually Bob is exactly right, which is there is a very real issue with separate administration domains, separate processing domains, separate access to tool sets. That’s very, very different than building a fully distributed architecture, which just tends to be hard and inefficient. And it’s often not the people that promote the mesh idea, but when people hear the term mesh, they default to full distribution, which tends to be just a bad way to build systems.

George: Said like a networking guy.

Martin: Having seen this exact same thing happen in other domains for a couple of decades.

Tristan: All of us are very technology-focused human beings, so when we think about data mesh, we tend to think about the architecture part of it. Bob, I’m glad you pointed out the distributed teams and the people aspect of this. My constant question for data mesh is: why can’t you enable the distributed nature of what you’re talking about with a unified architecture?

Michelle: My preference is always to have one data set that is very clean and well understood that we do not have to move anywhere, that is performant alongside our large batch analytical processing, which is also working with our data science. That’s the nirvana. That’s the goal is to just have one data storage and then having something that sits over top of it, and each of those different things are specialized for each of the different use cases but you have one data store.

Next use case for the modern data stack

George: The modern data stack keeps swallowing up more and more use cases. It killed cubes a while ago. It’s mostly killed Hadoop at this point. It keeps pulling more use cases into its orbit because it’s fundamentally so flexible and so capable of doing many different things well enough that you don’t really want to buy another system, build another system for one use case. What are some of the most interesting, surprising, significant use cases that may start to get pulled into the orbit of the modern data stack in the next couple years?

Bob: Complex data. We now have all this very, very interesting stuff that’s happening in predictive analytics. And to me we’ve gone from semi-structured data as being the most interesting data sources to now having a wide variety of data sources. I was talking to a company involved in the medical field yesterday, and just the rich amount of data that exists, and the images, and the doctors’ notes, all of that is opaque to our systems today. It will not be in five years. That will all become part of the modern data stack, and to me that’s a gigantic transformation for the types of applications that will be created in the years to come.

Tristan: My last job was I ran marketing for a company, and I was deep into growth marketing. The problem that you run into there is that you’re constantly writing code to push data back and forth between systems because the different operational systems do different things, and you need the same data in all of them.

No one has yet rearchitected the systems to, in the modern data stack, just take all of the work that you’ve ingested and now push it back out to your operating systems or your operational systems. But I think we’re at the beginning of that.

Bob: What you’re really talking about Tristan is the advent of the modern data app, which basically is an operational application that autonomously can make decisions for the business. We’ve seen very few of those and very trivial examples, but boy will they be significant in the future.

George: There’s really two visions of the data app that I’ve seen. One of them is the data app is a separate system, and you take the important data from your data warehouse, and you push it. Then the other vision is the data app is just natively built to run on top of the data warehouse. I’m curious whether people have opinions about those two models and where they see that going.

Bob: It’s really the same conversation we’ve been having about how these things are built. A data app is predictive analytics that actually takes autonomous action. It takes the data that would otherwise be presented to a person and instead leverages that to actually take actions within the business. They’re being built every which way today because there are few good tools to build data apps. That will not be true in a few years.

Latency: How low do we need to go?

George: One of the things that you run into when you try to build data applications and take action automatically is latency becomes incredibly important. Everybody in the ecosystem is battling this right now. I think there’s a lot of different visions of how we’re going to crush the latency problem and how low we need it to get. How low does the latency need to be? At what point do we have most of the interesting use cases?

Bob: People have dozens to hundreds or even thousands of operational systems. More and more, they’re SaaS operations. They’re outside of your organization. They’re always a source of truth now. They are the present, and a data warehouse or a data lake is about historical or the past.

What does that latency need to be? Does it need to be zero seconds? I don’t think so. There are applications where zero seconds or instant is required, mostly having to do with eventing and alerting of some sort. Most of the time, if you can get it in a minute or two, you can leverage that data inside your historical system with predictive analytics to begin to perform actions on it.

Martin: This is a very complicated topic that I think is very use case specific. But there tend to be serious tradeoffs that systems designers make between latency and throughput. If you want higher throughput, you batch. And the reason that you batch is that you don’t have as many domain crossings.

However, if you look at most systems, you can make the tradeoff. Meaning you could do low latency in a data lake, and you could do high throughput in a data warehouse, or vice versa. These are not architectural limitations. They just tend to be the tradeoffs that were made as a result of serving whatever the primary use case is. I’ve heard a number of these latency-throughput tradeoff discussions, and when you actually get down to a machine level, they are just a result of the tradeoffs that were made on the system going into it.
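Martin's point that batching raises throughput by reducing domain crossings, at the price of latency, can be made concrete with a toy cost model. This is a sketch under stated assumptions, not a measurement of any real system: every batch pays one fixed crossing cost, every item pays a fixed processing cost, and the millisecond figures are invented for illustration.

```python
def total_time(n_items, batch_size, crossing_cost, per_item_cost):
    """Total time to process n_items when each batch pays one fixed crossing cost."""
    n_batches = -(-n_items // batch_size)  # ceiling division
    return n_batches * crossing_cost + n_items * per_item_cost


def first_result_latency(batch_size, crossing_cost, per_item_cost):
    """Time until the first batch completes, ignoring time spent waiting for it to fill."""
    return crossing_cost + batch_size * per_item_cost


# Hypothetical costs in milliseconds: 5 ms per crossing, 0.01 ms per item.
N_ITEMS = 10_000
for batch_size in (1, 100, 10_000):
    total = total_time(N_ITEMS, batch_size, crossing_cost=5.0, per_item_cost=0.01)
    throughput = N_ITEMS / total  # items per millisecond
    latency = first_result_latency(batch_size, crossing_cost=5.0, per_item_cost=0.01)
    print(f"batch={batch_size:>6}  throughput={throughput:8.2f}/ms  latency={latency:7.2f} ms")
```

In this model, larger batches amortize the crossing cost over more items, so throughput climbs, while the time to the first result grows with the batch size. The same system can sit at either end of that curve depending on which parameter its designers tune for.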

George: One of the interesting things that we see is that the point at which you start to have to spend a lot more to get the latency lower is actually lower than people think. I suspect you can get down into the 10 second range with the throughput optimized architecture. Basically, the throughput optimized architecture I suspect will go lower than we expect.

Michelle: What do you imagine will happen with the serving layer? Your website still needs to operate over that data. Are you imagining that there’s just going to continue to be a caching layer? Or is that going to be a separate system?

Bob: It depends on what the characteristics of the system need to be. If something needs to be really low latency, today’s data warehouses are not always the right solution for it. It just depends on the application. Latencies will go down in these products, but to Martin’s point, some of the architectural choices make the latency characteristics of a Snowflake somewhat different than, for example, the latency characteristics of a MemSQL.

Tristan: One of the things that I would like to see more of in the future is Lambda architectures, but with off-the-shelf tools. So my data is flowing into a more streaming-like system and a more batch-like system so that I can get the best of both worlds. You’re making tradeoffs when you build these systems. As a user, I want to be able to choose and have both of them.
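The Lambda architecture Tristan mentions pairs a batch path with a streaming path and merges them when a query is served. The following is a minimal in-memory sketch of that idea, assuming a hypothetical page-view counter. The function names, the event shape, and the `CUTOFF` timestamp are illustrative and do not correspond to any off-the-shelf tool.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: precompute counts for events at or before the last batch cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only the events that arrived after the batch cutoff."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def query(page, batch, speed):
    """Serving layer: merge the batch and speed views at read time."""
    return batch[page] + speed[page]


# Hypothetical event log; ts is a logical timestamp.
events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "pricing"},
    {"ts": 3, "page": "home"},
    {"ts": 4, "page": "home"},
]
CUTOFF = 2  # the last batch run covered events up to ts=2

batch = batch_view(events, CUTOFF)
speed = speed_view(events, CUTOFF)
print(query("home", batch, speed))
```

The batch view can be recomputed slowly and thoroughly, while the speed view covers only the recent gap, which is the "best of both worlds" tradeoff described above.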

George: Well, we have one minute left. I’d like to ask a yes or no question for everyone: will there emerge another major data platform alongside Snowflake, Databricks, Google, AWS, and Azure? We’ll start with you, Michelle. Yes or no?

Michelle: Yes.

George: Bob?

Bob: What’s your timescale?

George: In the next five years.

Bob: Yes.

Michelle: Yes.

Bob: But the new one may be relatively small relative to those guys.

George: Well I said major. That sounds like an in-between…

Bob: Snowflake was small five years ago.

George: Tristan?

Tristan: I think no.

George: Martin?

Martin: Yes.

George: All right. Thank you very much, everyone, for joining. This has been a really fun conversation. I really appreciate all of you being here. I know our audience does as well.

Posted November 12, 2020