Try as they may, critics of SQL (structured query language) have never really been able to dent its popularity. Decades after its creation, the majority of the world’s databases still run on SQL, and the majority of data analysis still happens via SQL queries. It’s not too big a stretch to say that the digital world runs on SQL.
Despite its popularity, however, SQL does have shortcomings that limit its utility – even for power users. In this interview with Fivetran co-founder and CEO George Fraser, we discuss one of them: The fact that SQL doesn’t have an open source ecosystem of software libraries to tackle certain common use cases, and that work across popular SQL systems. As a result, skills learned on one SQL database might not transfer to another, and there are far too many complex queries being written.
It’s a tricky problem to solve because of how the SQL ecosystem operates, but doing so could catalyze a whole new era of innovation in data analysis. And Fraser thinks one possible solution is right under our nose.
FUTURE: So everyone can keep up, can you briefly explain what SQL is?
GEORGE FRASER: SQL is a programming language that is used solely for interacting with databases. It’s ubiquitous, present under the covers in almost every software application. If you load your Facebook feed, all of that data about who commented what, what your uncle just posted, it’s all stored in a bunch of SQL databases. And when you load the page, a whole bunch of SQL queries fire off and go fetch all that data.
And that holds true if, say, you’re getting your car repaired – the data about what has been done to your car is probably stored in a SQL database somewhere. This Zoom call we’re on right now, I’m sure there’s a bunch of entries in a SQL database somewhere representing this call. Really, the world runs on SQL.
Structured information – in the case of a social media post, that might be ‘name,’ ‘post content,’ ‘time of post,’ whether it includes an image, stuff like that – is typically stored in SQL databases. There are other kinds of databases, but their use is just tiny compared to SQL databases.
And yet, people are always complaining about things SQL can’t do, or that SQL is not. Why is that?
I forget what the original saying is, but it’s something like, ‘There are two kinds of technology: technology people complain about, and technology that doesn’t matter.’ So database management systems and SQL, the language that is used to interact with them 99 percent of the time, were one of the very first applications of computers. When people invented computers, one of the very first things they did was invent databases, because one of the most useful things that you can do with computers is store a bunch of data, update it, retrieve it, and summarize it. So, it goes way back.
SQL itself was created in the ’70s, and it has a lot of great characteristics. And it was incredibly successful and became widely adopted. It’s sort of like the air we breathe. At this point, to technologists, it’s like asking a fish about water.
People complain about it because it isn’t perfect. Nothing is. But it’s hard to change for a bunch of reasons. And some of its imperfections have really stuck around for a long time.
Code that performs specific and well-defined operations, often in addition to or on top of the native capabilities of an application or language. Examples of popular libraries include pandas for data analysis in Python, MLlib for machine learning with Apache Spark, and PostGIS for handling geographic data in SQL.
One of those imperfections, which you’ve written about, is that SQL isn’t a library language – you can’t easily use software libraries with it. Why is that something worth addressing?
If I rewind by one step, there are lots of problems with SQL. And some of these are small problems that are probably not worth fixing. People like to point out – and this is kind of programmer inside-baseball – that the order of the clauses is arguably wrong and it would have been better if it had a different order. But at this point, it’s not a huge problem and it’s just way too late to change. I see that as a small problem. There’s a reason why that hasn’t proven to be the fatal flaw of SQL. And there are other things like this that are small problems.
But I think there’s one really big problem with SQL, which is that it isn’t a library language. The open source software revolution, which has changed how we build every other kind of software, has not come to SQL.
This particularly matters when you’re trying to use SQL for analytical workloads. There’s no need for libraries when you’re using SQL in the way that it gets used in my examples of your data on Facebook, or the data about your car repair, or the data about this call that we’re on right now. All you’re doing is pulling records and updating them one at a time. It’s not a library language, but who cares?
However, because 99 percent of the world’s important data is stored in SQL databases, people use SQL as not just a way of retrieving data, but of actually summarizing it and analyzing it. And when analysts write SQL, they write big SQL queries that have lots of complexity and that do fancy things like rolling averages and just about everything you can imagine. And there, the fact that there is no open source SQL ecosystem, that it isn’t a good library language, is a big problem. Because every analyst who uses SQL to analyze data has to start from zero.
That’s the world we’re still living in with SQL, but I think there’s a way out.
Analytical databases are used to inform business decision-making via dashboards, reports, and other methods of data analysis. This is in contrast to transactional databases, which typically read, write, and fetch application data in response to events (such as someone making a car-repair appointment online.)
Why doesn’t SQL have an ecosystem of software libraries?
I think there are two reasons. One is that SQL isn’t actually one language. Every database management system that implements SQL – and there are a lot of them – implements a slightly different SQL. If we’re focused on analytics, we’re talking about maybe 10 databases. That just makes it harder because whatever you do, you’re going to have to do it 10 times.
And the other reason is there’s just no way to distribute SQL. Even if you write an awesome open source SQL library for a particular database, how the heck do you get it to other people? There is no package manager for SQL. Golang has a built-in package manager. Rust has Cargo. Java has Maven. Every programming language has some package manager that either was created with the programming language in the first place, or achieved community adoption or escape velocity, and it became the de facto package manager. That’s how you share code, and until recently SQL had no package manager.
There are actually a couple of open source libraries for SQL. My favorite example is PostGIS, which is the exception to what I’m talking about. It’s a library for Postgres for dealing with geographic data, which is something a lot of people wanted to do. Despite all these obstacles, it was so useful to a small set of people – and that’s what you really need when you’re doing new things, you need to appeal a lot to a few people and not a little bit to everyone – that they would install binaries on their database by following instructions from websites. And then the big cloud vendors, because PostGIS was popular, just pre-packaged it with their databases. So by heroic efforts, people were able to adopt this one example of a library for SQL, for a particular flavor of SQL Postgres.
But if you look at that exception, you can see the problem: It’s so hard to get distribution for an open source SQL library.
How do you solve this problem, given the way that the SQL ecosystem functions?
I think the solution is dbt, a build tool and a package manager for SQL. Package managers are a way for programmers to share code. I can write some code, I can publish it to a package repository, and then you can use that code.
Build tools are a little bit harder to explain. If you’re a programmer and you write a bunch of code, your job is not done. Something always has to be done to turn that code into something that actually does something. You need to deploy that code into a database, in the example of dbt. Or you need to compile that code into a program or God knows what. There’s always some kind of build step where you take this code, which is basically a bunch of text written by a human being, and you turn it into something that’s actually useful. In the case of dbt, what it does is deploy that code into the database so that it actually starts doing things in the real world, as opposed to just sitting on your screen looking at you.
Now, if dbt were to be the platform that enables this, it would need to be nearly ubiquitous, which I think could happen. SQL analysts had never really had good developer tools. They only had these proprietary things that were made by companies trying to sell them their database, or whatever it was. I often like to joke that dbt was analysts’ first good relationship, and so they’re all intensely loyal to dbt.
And we’ve started to see this happen to some extent already, although it’s not a perfect example of what we’re talking about. But Fivetran, for example, created a library that gets reused across our dozens of dbt models, so users don’t need to reinstall and relearn the same things for each model they want to use. The fragmentation problem (that each database management system implements a slightly different SQL) should be manageable at the database level because if you’re targeting analysts who are using SQL to do data analysis, you basically are just targeting Snowflake, Databricks, BigQuery, Redshift, and SQL Server. Maybe somebody else will break through and then there will be one more, but it’s a reasonable number, it’s not a thousand.
“The open source software revolution, which has changed how we build every other kind of software, has not come to SQL.”
SQL databases have been around forever, so what happened over the past several years that we need more libraries and a better overall developer experience now?
I think it’s the performance of SQL databases for analytical workloads. Years ago, there were no SQL databases – except really, really expensive ones – that were fast enough for complex analytical workloads. So when you wanted to do a complex analytical workload on a large set of data, you would take the data out of the database and put it into a special-purpose tool, often an OLAP cube that was optimized to do very specific kinds of queries very quickly. These OLAP cubes all have their own languages, tools, GUIs, and whatever. It was very much a commercial ecosystem that was not centered around a single language like SQL is.
In the last 10 years, though, SQL databases got so fast and cheap at analytical workloads that a lot of this has just moved down into the database. A lot of these special-purpose data-analysis tools that you would connect to just disappeared. The database was fast enough by itself, and better in certain ways because it was more flexible, so people started doing a lot of their analysis right on the database. That led to more SQL code and more complex code, which created the need for a better way to organize it and build it.
Before, everyone was just using ad hoc SQL build processes. It could be as simple as copying and pasting some code from one place to another and pushing a button to run it. That would work fine if your code was not that complicated and there were only, like, two people in the whole company who worked on this part of it. But then as people started doing their analysis from soup to nuts inside the database, they needed something like dbt.
Assuming your theory bears fruit, what do you think would make for useful or popular SQL libraries?
Time-series analysis might be it. Doing rolling averages and things like that is very awkward in SQL using the built-in capabilities, like window functions.
Another really useful open-source library that I would love to see is approximate aggregation. It’s a thing that exists in all these different databases, but it’s typically not very user-friendly and so they hardly use it. Or it’s just different for different systems, so nobody ever bothers to learn; they just learn the regular average. And, boy, it would be nice if there was just a uniform way of doing approximate aggregate cases. It would be great if someone would write a friendly wrapper around the built-in approximate aggregation capabilities of popular systems, and then as a user you could just use that.
“Open source code was an absolute revolution in software development, so the same thing could happen for SQL developers – it could be a catalyst.”
What’s the net effect on the SQL ecosystem if this idea catches on and becomes super popular?
Well, open source code was an absolute revolution in software development, so the same thing could happen for SQL developers – it could be a catalyst. You could see the emergence of these widely used libraries that everyone learns and lists on their LinkedIn profiles and uses every day in their work. And it allows analysts to be more productive: One analyst, twice as much done because they’re leveraging this open source code that they’ve been using for years.
It could also lead to fewer mistakes, because every line of code you write is an opportunity to make a mistake. The more you can leverage widely tested things, the fewer mistakes you make. These are all things that have happened in Java and C++ and other languages, and it’s just kind of waiting to happen in SQL.
You mentioned LinkedIn. A standard set of tools and skills would seem valuable for employees changing jobs and companies trying to hire, too.
It’s huge. Everyone wants to think of these things in terms of the tech dimensions, but one of the most important elements of open source code is that you can take it with you. You learn once how to use that library and, if it’s popular enough, there’s a good chance that your next job will use it, too. So it creates more of an incentive for individuals to learn these things because they don’t have to worry that this knowledge is going to become useless in a couple of years if they change jobs.