
The Crucial Role of Database Optimization with Tyler Benfield

Tyler Benfield, a staff software engineer at Prisma, joins Kent to explore the world of databases.

They begin with a discussion on the efficiency of databases. According to Tyler, it's important to make databases efficient, particularly through optimizing queries.

The discussion shifts to concrete techniques for improving database performance, from adding your own indexes to enterprise-level strategies. Tyler shares specific examples from his work, such as how adding a single index resulted in a 20% reduction in CPU usage and a 30% decrease in active connections.

Both speakers underscore the importance of backend optimization as a means to improve frontend performance. This approach minimizes the need for elaborate frontend loading indicators, resulting in a smoother user experience.

Tyler elaborates on the pros and cons of using Prisma and other ORM tools. While these tools are beneficial for handling simple queries, they often fall short when tasked with more complex queries and joins.

The viability of SQLite as a production database is also discussed, especially in light of advancements in replication and distribution technologies.

The conversation wraps up with an emphasis on the importance of having skills that can be applied across different projects and platforms. Both Tyler and Kent agree that knowledge of SQL is one such skill that is universally applicable.


Transcript

00:00:00 Kent: Hello, everybody. This is a great day. I'm going to be talking with my friend, Tyler Benfield. Say hi, Tyler.

00:00:08 Tyler: Hi, everyone.

00:00:09 Kent: So yeah, Tyler and I, I kind of feel like we've met before this last RemixConf, but I can't remember. Was this RemixConf the first time we met?

00:00:21 Tyler: No, so funny fact, it was at the last React Conf back in Henderson. We briefly met at the little welcome party there.

00:00:33 Kent: Oh, okay, yeah.

00:00:34 Tyler: That was when like, you were super into all the testing stuff and I was very much aligned with that. We were pushing that pretty hard where I was at the time. And yeah, I got the pleasure of meeting you there.

00:00:43 Kent: Yeah, that's awesome. That feels like ages ago. That's what we call BC, before COVID. So cool. Yeah, so we go way back. But most recently you've gotten, and actually I'd love to get an intro to you, but you seem to

00:01:03 have gotten way deep into databases, and now you're at Prisma. And we had a really awesome conversation about databases and full-stack stuff at RemixConf this year. So yeah, I'd love to chat with you about that. But before we do, I wanna get to know you a little bit. So why don't you give us an intro?

00:01:22 Tyler: Yeah, hey everyone, I'm Tyler Benfield. I am a software engineer. I've been doing this professionally for over 10 years now. I love it. I'm really passionate about software engineering. I've worked at startups big and small. I've worked at consulting companies, brief stints at very large companies, even a race team at one point doing software there. So lots

00:01:42 of variety of stuff. And I'm currently a staff software engineer at Prisma. I live in North Carolina. I love it here. I think it's the best place to live until I visit Utah, and then I'm jealous. The weather there is fantastic, and you guys get to snowboard all the time.

00:01:55 Kent: Yeah, it's awesome.

00:01:56 Tyler: Yeah, it's great. Yeah, and I really enjoy what I do. I enjoy environments where you have to sort of get in and learn a lot of things, and that's sort of what drove me into being really heavy on databases. I consider myself a full-stack engineer, but I kind of gravitated towards databases because I've seen sort of a lack of, maybe, knowledge and experience there and education around

00:02:16 it. So I really like gravitating towards that space.

00:02:21 Kent: Yeah, I'll give you a little context on my background too with databases. So when I was going through school, I decided pretty early on that programming sounded like the most boring thing ever. And so I somehow ended up in information systems as a degree, and that includes all

sorts of things. Like you get product management and databases and statistics, just all over the place. Business, computers, wherever those two things meet, that's what information systems was. And I decided that I really enjoyed working

00:03:01 in Excel and I just thought it was really interesting. And so the field that sounded most interesting to me was actually business intelligence. And so my first internship was doing business intelligence for the Church of Jesus Christ of Latter-day Saints, actually, here in Utah.

00:03:22 And I actually, I really enjoyed it. It was very interesting. We learned about star schema and denormalized databases to optimize for query speed and answering the questions of the business. Very interesting, ETLs, all this stuff. But I ended up finding myself spending more time automating processes and stuff and

like wiring systems together. And so I was just a software engineer at the end of the day. So that's where I went from, okay, I wanna do databases, over to software engineering. And it's only in the last couple of years now, since I've been using Remix, that I started getting interested in databases again.

00:04:03 And so with all of that, you know, context behind us now, why do you think that full stack engineers should care about databases?

00:04:14 Tyler: Yeah, I think one thing that I've learned in my career is that databases are one of the most critical parts of any application. Yet it's interesting that, whether it's formal education or other means of getting into software engineering, there's not as much attention put on how to really optimize the database. And what I've seen so many times is

00:04:34 companies or teams just start throwing money at the problem, and they buy bigger databases, buy more resources, and it doesn't really fix the underlying root problem. And it's also so important because when you're building a really fast and responsive app, when you wanna have a really good user experience, you'll never be faster than your slowest database query.

00:04:53 Kent: Oh, that's such a good, yeah, quote. Snip that, yeah, you'll never be faster than your slowest database query. That is so good.

00:05:01 Tyler: Yeah, so if you start there, you can only get better from there, right? Like it doesn't matter how many times a React app is re-rendering. Network latency doesn't matter as much. What really matters is make your database fast, and then look at the alternative things. And it's interesting too that oftentimes the database, I wouldn't say always, but oftentimes it's one of the easier things, in my

00:05:21 opinion, to actually save a lot of that on. Like sometimes it's just throwing the right index in place, right? So yeah, I guess that's a long-winded answer. That's why I think it's important. It's fundamental. Maybe there's some apps out there that don't rely entirely on a database, but most everything being built does in some fashion.

00:05:38 Kent: Absolutely. I think there are a lot of services that will do your database as a service stuff, and you better believe that they know about databases, that's like what they do. But at some point, I think those things work really well for like

speccing something out, or your hobby project or something. And there certainly are some use cases that they have for, like, a big project using their service. They definitely shout those from the rooftops, like, we're for real stuff too. But for most stuff that you're building

00:06:18 at work at scale, you are definitely going to want to understand how to use databases and optimize those. So.

00:06:26 Tyler: Yeah. And even with most of the hosted vendors that I've looked at, actually, I have a lot of interesting thoughts about some of those modern databases coming out there, right? If we take a little tangent for a minute, you see a lot of modern ones coming out, a lot of them bringing a better developer experience, approaching the application developer, which I think is a surprisingly overlooked audience for a lot

00:06:47 of cloud vendors providing databases. But a lot of them that have been successful are fundamentally building on top of one of the big open source databases anyway. Like they're either Postgres compatible or MySQL compatible, or modern stuff moving to SQLite, which is really interesting. Taking the

00:07:07 properties of SQLite and making them work in a distributed environment is super cool.

00:07:11 Kent: Yeah.

00:07:11 Tyler: So, but what I found interesting is, a lot of the ones that I've seen that don't take one of those approaches, that try to do something entirely unique, they don't seem to get the same traction that the others do. And I think there's maybe something to read into there. But also an interesting observation that kind of stems from that is, all those same things that you learn about how to do proper indexing or how to tune

00:07:32 SQL queries, that still applies to these modern database providers with the exception of some of those more fancy ones that don't gravitate towards SQL.

00:07:41 Kent: Yeah, you know, that's 100% correct, because if these hosting providers could figure out which things to index automatically for you, then the database would do that too. And so the fact that the database doesn't do it kind of gives us some

00:08:01 idea that like, yeah, maybe that's just a thing you need to be able to do yourself.

00:08:06 Tyler: Yeah. Yeah.

00:08:08 Kent: Well, cool. So you were mentioning earlier that enterprises will just throw money at the problem. They'll say, okay, let's set up a caching instance so that we can cache stuff, which introduces an enormous collection of problems in addition to the cost, or let's just horizontally scale this out,

00:08:28 you know, outrageously or whatever, or just give this a monstrous machine so that we don't have these performance problems. And you told me a story about that that I'd love to hear.

00:08:41 Tyler: Yeah, yeah, I've got a number of stories, but one of my favorite ones is from a company I worked for at one point. I won't name any names, because everybody learns at some point. And yeah, at this company we had, I think it was around 250,000 users signed up in six to eight months. This was like an e-commerce type of platform. And

00:09:01 this might sound kind of small in terms of data size, but again, in my experience, a lot of databases that were built for software-as-a-service stuff or e-commerce stuff, they don't get super big. So this was about 4 gigs of data stored in that six to eight months of being live. It was running on a 4 vCPU instance, 15 gigs of memory, with

00:09:21 a read replica. Now keep in mind too, fun fact there, if you have more memory than you have actual data in your database, your entire database, in most database engines, is going to be sitting in memory. You're never going to touch disk for a read operation. You will for writes, but not for reads.

00:09:37 Kent: That's just like a built-in thing for the database? Like it'll check how much memory do I got? Wow. That's awesome.

00:09:43 Tyler: Yeah, so if you can afford to just give your database enough memory to hold your actual full database, do it. And that applies to Postgres and MySQL. SQLite is a little bit special because it's file-system based, but it still does some in-memory stuff too, right? Yeah, so 15 gigs to 4 gigs, that's a ratio of extreme overpaying, in my opinion. Yeah.

00:10:05 So yeah, the database, you can see it in the stats too, that it was always reading from memory. But anyway, the worst query that we had was every time an API call ran or a page loaded, it had to look up some stats about who the user was for the session. Very common thing to have in any application, right? This query ran 14,000 to 20,000

00:10:26 times every hour. And its average was 140 to 200 milliseconds, which is quite a range for an average. That's another sort

00:10:38 of symptom we could get into. But yeah, 140 to 200 milliseconds, almost all of that was CPU because we were reading from memory. I did some analysis and I quickly saw this database had no indexes at all. Added one index to that table and it took it down to 0.04 milliseconds average. That reduced the

00:10:58 CPU by about 20% overall over the whole system and the active connections reduced by 30%. Wow. That's like a 5,000 times improvement on performance.

00:11:10 Kent: Yeah, that's amazing. Yeah.

00:11:13 Tyler: And then after that, we scaled the database down too. Like immediately, like a week or two later, I was like, yeah, we're way overpaying on this database. We can halve it and drop the read replica. And we did. And everything continued to operate smoothly.

00:11:24 Kent: That's amazing. Like a single index, was it just like an easy primary key or unique field? Or what was the index?

00:11:33 Tyler: Yeah, I think it was on like a session ID. So we had a table of some type of user session stuff. If I remember right, it was probably joining from the session ID to the users table. Which, another principle that we've talked about before: if you don't know any other indexes to apply, start with your foreign keys. 'Cause that's going to have the most bang

00:11:53 for your buck. A DBA won't tell you that. They'll probably tell you to analyze everything about your database and see what happens. I'm going to go on record saying, if you don't do anything else, start with your foreign keys. That'll take you from zero to your first users and beyond on just indexing foreign keys alone. So yeah, I think that one was a foreign

00:12:13 key somewhere in there. And I did a little bit of analysis just to see if we should put two or three columns in that index, like we've seen before from analyzing, sometimes you gotta compound them a little bit. But overall, there was just one table that needed that index, and it was done.
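
To make that concrete, here's a minimal sketch of the kind of fix Tyler describes. The table and column names are invented, since the real schema wasn't shared:

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// The per-request lookup: joining from a session ID to the users table.
// With no index on user_sessions.session_id, every call scans the table.
async function getSessionUser(sessionId: string) {
  return prisma.$queryRaw`
    SELECT users.id, users.name
    FROM user_sessions
    JOIN users ON users.id = user_sessions.user_id
    WHERE user_sessions.session_id = ${sessionId}
  `
}

// The one-line fix, as it might appear in a migration. Indexing the
// lookup column (and foreign keys generally) lets the database seek
// straight to the matching rows instead of scanning:
//
//   CREATE INDEX user_sessions_session_id_idx
//     ON user_sessions (session_id);
```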

00:12:28 Kent: Amazing, yeah, indexes. So, like, this is the thing, we talk a lot about making our pending states look really good, and I think that is important. There was this tweet or post on X the other day where somebody was sharing a cool new,

00:12:48 I don't know if it was like a new skeleton UI for GitHub or if it was just a proof of concept or something, but it looks cool, you know. And somebody else posted like, I don't know, I never really liked the way these things looked, and I replied with, better than pending UI is just a faster

00:13:08 app. GitHub should try it sometime. And honestly, I don't know GitHub's constraints. I'm sure that what they're doing is very complicated, especially on some of those large diffs and stuff. Like even just rendering the DOM for some of that stuff is gonna be expensive, even if you have no JavaScript at all.

00:13:28 But I don't understand why the profile page takes at least a second every time. And there's got to be an index hiding in there somewhere that could have just sped the whole thing up.

00:13:37 Tyler: Yeah, absolutely. And that's something I also like tying into Remix. I feel like with Remix, I write so many fewer loading spinners. And it kind of nudges me towards this, like, just make your backend faster, do the things to get your backend faster. And then you don't have to deal with this complexity that comes from loading spinners. And granted, deferred data in Remix makes that so much better

00:13:57 than it used to be, right? But even now I find myself, like, I only grab that as a last resort. And so oftentimes I can just look at what's happening on the backend and say, well, maybe I can parallelize these two things. Maybe I can throw an index on this query. And now the page loads fast enough that the loading spinner just flickered anyway. It wasn't

00:14:17 all that useful.
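
For readers who haven't used it, here's roughly what those two moves look like in a Remix loader. The data functions here are hypothetical; the point is which promises you await:

```ts
import { defer } from '@remix-run/node'
// Hypothetical data functions standing in for your own queries:
import { getUser, getTeam, getRecommendations } from '#app/db.server'

export async function loader() {
  // Two independent queries? Run them in parallel instead of serially.
  const [user, team] = await Promise.all([getUser(), getTeam()])

  // A genuinely slow query can be left un-awaited and streamed in later
  // behind a <Suspense>/<Await> fallback, as a last resort.
  const recommendations = getRecommendations()

  return defer({ user, team, recommendations })
}
```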

00:14:19 Kent: Yeah, yeah. Like you've got to have those pending states because you don't control the network. So no matter how fast you make everything, the network is still going to be an issue. And you can even throw it on the edge, put it on all of Fly's regions or put it on Cloudflare everywhere, and even still, users' devices can have a

00:14:39 poor connection. So those are important, but I see people overcomplicating a combobox experience where they have a bunch of debounce on it and a whole bunch of caching on the client and really complicated stuff, just because their database queries are slow.

00:14:59 And so by optimizing the backend, you end up not having to complicate the frontend. Okay, so one of the things that we're doing in the database modeling workshop is analyzing SQL statements. And this is definitely not something I'd originally

00:15:20 planned on including, but thanks to our conversations, thank you so much, Tyler, for your feedback on stuff. I realized that there was a query that I was doing in the Epic stack, which is what we're building as part of Epic Web, that was really slow once you added a bunch of data into it. So I

00:15:40 was doing some profiling and trying to figure out, okay, what is the problem here? Trying to figure out a good index. And in the process of figuring out what is a good index, Tyler taught me how to do the query explain stuff. So can you talk a little, actually, I'm gonna take another step back.

00:16:01 When you were talking about having 15 gigabytes of memory with a 4-gigabyte database, one thing that I found that's kind of interesting about that is even if your database is only 4 gigabytes, you need that 15 gigabytes for a really bad query, because of all the temporary data structures that the database

00:16:21 is creating, right? So I'd like to talk, and that's why you needed 15 gigs instead of just 4. So I'd like to talk about some of the crazy things that databases have to do with suboptimal queries and how you can identify good ways to optimize. And also going beyond just indexes too.

00:16:41 So love to talk about all of that in general.

00:16:45 Tyler: Yeah, yeah. This is something I also love analogies for, right? We talked about this before, sort of my contact list analogy, which is my favorite. I think you have a variant of that too that you talk about. But I think about indexes this way. So if you're keeping a contact list on pen and paper, right? You meet somebody, you write their name down and their phone number, you know, some

00:17:05 details about them. You're probably doing that, you know, just line by line, page by page in order. Well, then if I say, well, I want to call Kent up because I want to chat with him, what page was Kent on? I have to go back and flip through every page to see like, well, where was he? Like, where's his info? So that's what we would call a scan

00:17:25 in the database world. I'm having to scan through every record. You know, what's the optimal way to do it? You could start at the back, you could start at the front, but either way, you have no idea where this record lives. So that ends up eating a lot of CPU time, and it ends up using a lot of disk if you don't have the whole database in memory. And essentially, it's just doing a for loop over every record

that matches, right? Even if you filter it by a specific condition, it still has to do that, because it has no idea where that record is, or what records might match. Yeah. So what would we do, you know, as humans, if we were to do that? Databases behave a lot more like humans than you might expect. And what we're doing as humans is we might say, well, I'm gonna keep this organized by last

00:18:05 name. So maybe I've got all these like allocated tabs, you know, by last name. And I, you know, capture it that way. Maybe you organize it by like, I don't know, location where you met someone or something like that. Whatever way you want to organize it, you might change how you're recording the data to match that. That would be what an index essentially is. It's recording the data sort of

a second time. It might be a partial recording, because, well, let's say that initial contact list was numbered, you know, a number for every record that you recorded. Then you might only need to record the number where Kent was in my original contact list, or you might wanna repeat more of the information in your secondary index, right?

00:18:46 To avoid that extra page flip. But either way, you keep that second copy, and then you can keep as many of those copies as you need that refer back to the one true record of origin. And that's how we can go really fast at that point, because we can just flip to the specific page. Maybe we have to scan through a couple records remaining, but for the most part, everything's already narrowed down. Exactly how databases work.

00:19:06 So, your question about memory consumption and things: there's a lot of, it depends on the situation, but when you have a query that performs really poorly, especially when it gets into lots of joins and complex conditions, it has to load all of the stuff into memory. Sometimes it can sort of stream it, where it's not all loaded at the same time. Sometimes it has to build up a lot more to operate

00:19:26 on at one time, especially if you do things like, I want to pull a bunch of stuff into memory and do sort of an aggregation in memory. That's particularly memory-problematic, right? Because, without diving into too much detail, there's streaming aggregates and there's in-memory aggregation, there's all these things that can happen, right? So having optimal queries,

00:19:47 usually with indexing and other solutions, also kind of nudges the database in the right direction. So if you can get to a situation, like for aggregates, where it can stream the aggregate rather than have to buffer it all into memory at one time, all these things are ways that you can keep the memory footprint on the database sort of equivalent to what your data footprint

00:20:07 is, and make your CPU go so much farther, right? Like your database CPU should ideally be almost idle. There's exceptions, right? But if you're doing CPU computation in the database, it means you're probably doing a lot of scans and memory operations.
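
Kent mentioned the "query explain" workflow a few minutes earlier; as a rough illustration of what the database tells you, here's the SQLite flavor with invented names (Postgres and MySQL use EXPLAIN / EXPLAIN ANALYZE instead):

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// Ask SQLite how it plans to execute a query.
const plan = await prisma.$queryRawUnsafe(
  `EXPLAIN QUERY PLAN
   SELECT id FROM contacts WHERE last_name = 'Dodds'`,
)
console.log(plan)
// Without an index, the plan reports something like:
//   SCAN contacts
// After CREATE INDEX contacts_last_name_idx ON contacts (last_name):
//   SEARCH contacts USING INDEX contacts_last_name_idx (last_name=?)
```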

00:20:23 Kent: Yeah, yeah. And that's a good place to identify that you have some suboptimal query. Like, are you seeing CPU and memory spikes? This is something you told me. If you're seeing those spikes, then something is not optimal.

00:20:37 Tyler: Exactly. Like memory, personally, doesn't always concern me too much, because I know the database is going to take so much anyway. Right, like it wants to take memory. SQL Server is a really fun one because it'll just consume memory and not release it back. If you just look at Task Manager, it's like, how much is this thing using? It'll just consume as much as it can unless you tell it

00:20:57 not to, because that's what would make it most optimal, right?

00:21:00 Like if it can pre-allocate all of this, then it doesn't have to go back to the OS to get the memory every time. So, looking at memory.

00:21:08 Kent: Node.js and V8 do the same thing. It's just more optimal, even though it looks like your app is using way more memory than it actually needs at the time. It'll eat what you give it. So maybe occasionally it'd be a good idea to just test it out and see if your app can run on half as much memory as you're giving

00:21:28 it.

00:21:29 Tyler: Exactly. And that's all by design and intentional. Really cool stuff, like really cool technical solutions to those problems. But that's why I usually look at CPU as a clear symptom a lot of times on databases. Like you said, if you've got high CPU usage, you'll probably have some suboptimal queries. And also fluctuations,

00:21:49 too, I've found are often a symptom of something. Like I said, that one query was averaging 140 to 200. That's a big swing for an average. That usually means there's some kind of pressure happening, that it's not performing consistently. That could be external pressure, that could be other queries running that are causing this one to get delayed,

00:22:10 or it could be something happening within that query, causing it to run slower. You can see that a lot of times too with, like, if you put a limit on something, right? If I say just give me the first record, well, if the first one was on the first page, that's gonna be really fast, whether it had to scan or not. But if it was on the last page, well, you just had to go through everything

00:22:30 to get there. So you'll see a lot of like fluctuation in the numbers on queries like that.
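
A tiny sketch of that first-record case, with an invented model; the same query swings between fast and slow depending on where the match happens to live, until an index pins it down:

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// findFirst is roughly SELECT ... ORDER BY created_at LIMIT 1. Without
// an index on (status, createdAt), the database scans until it finds a
// match, so timings depend on where that match sits in the table.
const firstPending = await prisma.order.findFirst({
  where: { status: 'PENDING' },
  orderBy: { createdAt: 'asc' },
})
```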

00:22:36 Kent: Yeah, that makes a lot of sense. So with your contact list analogy, something that I think is important, or useful, to call out is what the index is going to have. You mentioned that you might add more than just the number

00:22:56 of the contact, you might add their name too. And so that way you don't have to go back and do that extra page flip. So would that be like having two indexes, or like a dual-column index?

00:23:10 Tyler: Yeah, yeah. Most databases I think would call this a compound index or a composite index. So an index can be on one column, or it can be on really any number of columns you want on the same table. And most database engines, in most cases, are only gonna use one index per table in a given query. I say most, because there are certain

00:23:30 types of query exceptions, right? But for the most part, if one index is found that matches what's needed, or is a close match to what's needed for the query being run, the database isn't going to try to join two indexes together.

00:23:43 Kent: Yeah.

00:23:44 Tyler: Going back to our human analogy, what would you do if you had one contact list by last name and another by location? Like how efficient is it for you to flip to where's Dodds in my last-name list, okay, now where's the location that I met him, and now I have to match these things up? That's just extra cognitive load. Yeah. It's faster just to go back

00:24:04 to the origin and do the rest of the filtering from there, at the table level. That's exactly how databases work. So you want to do compound indexes instead, where you stack those columns together. And that will make it so a single index hit has everything it needs to get there. What you might want to go back to the table for is any additional

00:24:24 data you need to look up, and this is where being explicit about the selection in your query is really helpful, because if your index included first name and last name, and that was all you needed, then it doesn't have to go back to the table to get anything else. But if you just did the equivalent of a select star, well, maybe you need a date of birth too. The

00:24:44 database doesn't know that. So now it has to do an extra lookup back to the original table to get that extra field that wasn't in the index, just for you to throw it away on the application side.
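
In SQL terms, the analogy maps to something like this (names invented): one two-column index, rather than two single-column indexes the database won't combine, plus a query whose whole SELECT list the index covers:

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// One compound index covering both columns:
//
//   CREATE INDEX contacts_last_first_idx
//     ON contacts (last_name, first_name);
//
// This query filters on the leading column and selects only indexed
// columns, so the database can answer it from the index alone, with no
// extra "page flip" back to the table.
const matches = await prisma.$queryRaw`
  SELECT last_name, first_name
  FROM contacts
  WHERE last_name = 'Dodds'
`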

00:24:53 Kent: Yeah, I actually hadn't thought about that. But so in lots of Epic Web, we're querying the user table and all we need is the user ID. And so I always select the ID in those cases. I tell people, always put a select. And I hadn't considered the fact that when you do that, if you're only selecting

indexed columns, or columns that are in the same index at least, then it doesn't even need to touch the actual data.

00:25:22 Tyler: Which is pretty cool. Yeah, there's another side effect of that, what I like to talk about is, if you happen to go back into that table and you add a very expensive column, like let's say you add an avatar image and you just say, I'm gonna stick the binary right on this user. How many queries in your app are now gonna start returning that big binary data

00:25:43 accidentally, because they were the equivalent of a select star? And now you just see the performance suffer all the way through the app. It's so much easier, in my opinion, to just say as you go, this is exactly what I need from this query. And now if I have to do that in the future, I don't have any worries about, well, what's this gonna do to everything else, because nothing else

00:26:03 needed it.
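
With Prisma, that explicitness is just the select option. A small sketch of the pattern Kent describes, with field names assumed in the spirit of the Epic Stack's user model:

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// Only the id is requested. If the lookup column is indexed, the query
// can be answered from the index alone, and a heavyweight column added
// to the user table later (say, that avatar binary) won't bloat it.
const user = await prisma.user.findUnique({
  where: { email: 'kody@example.com' },
  select: { id: true },
})
```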

00:26:04 Kent: Yeah, yeah. So listen to us here, kids. You gotta put your selects, select what you want. Yeah, that's very interesting. Now, another thing that people talk about, and I'd like to get beyond indexes, but there's still so much to talk about.

00:26:22 Tyler: There's so much depth.

00:26:23 Kent: But another thing about indexes that people will talk about is the write speed and the space, the disk space, because these aren't free. It doesn't come out of nowhere. Like in your contacts analogy, you've got a separate book. You've got your main contact book over here, but you've got this separate book that has, you know, pointers over to the regular table

00:26:44 just to make it easier to find things. If I'm a person in this contacts analogy, I don't want to have 30 other books that I'm carrying around. So yeah, there is a balance here maybe, or trade-offs. What are your thoughts around that?

00:27:00 Tyler: Yeah, I would say for us humans, it's a lot of work to carry a lot of those kinds of lists around. For databases, it's, I would say, nearly negligible. And the reason for that is the applications I've worked on, typical applications, are going to be way more read-heavy than write-heavy. I would

00:27:20 say that's pretty common to see. And also, we've talked about this before too, that users are way more forgiving of a slow save button than they are a slow page load. It's just something inherent in our brains. We're used to, this is an expensive thing, versus, I'm just trying to see some stuff. So I

00:27:40 would much rather take the write overhead trade-off, and I'll take the extra disk space too, because disk is cheap nowadays, right? Compute is expensive, time is expensive, disk is cheap. So I'll take that extra disk space and I'll take the very small write overhead in most cases,

with the trade-off that my users get to see data really fast, and that's what they're doing most of the time. There's exceptions to that, right? Like if you're building a really heavy analytics database, most of this advice probably doesn't apply the same way, right? But for most software-as-a-service apps, e-commerce apps, the things that most developers I think are probably

00:28:21 building, the most common applications, that would hold true, I think.

00:28:26 Kent: Absolutely, yeah. And another thing that you kind of tuned me in on was, in the contacts analogy, if I add a new contact, I not only have to add it to the contacts list, but I also have to go to every index that is affected and update it in there,

00:28:46 which takes longer to write it down. And that's the same thing in databases. And so a lot of people are concerned, well, okay, so the writes are gonna take longer, but as you mentioned, users are okay with that typically. And at a certain scale, it's not that much longer anyway.

00:29:07 And then the index could also potentially speed up the write. Can you explain why?

00:29:12 Tyler: Yeah. Yeah. So there are certain times where the index can help, especially if you're doing bulk updates. So going back to that system of record, you have one table that's really the source of truth for all the data. The indexes are really kind of like a replica of that. Well, say you wanted to update a first name in the table based on, let's

00:29:32 say, date of birth. Well, if you want to run that update just based on date of birth, you have to scan through, just like you would on a query, all the records for the date of birth that matches, and then make the update. So that has to scan the entire table to update a handful of records. Whereas if you had an index on date of birth in that case, it would first

00:29:52 use the index to find, these are the five records or so that I actually care about, and then go back to the table, perform the update, and then push that out to all the indexes, you know, whatever order the database does all those operations in, it's all a transaction. But essentially, your updates do scans just like a read

00:30:12 does, and if you put the right index in place, you can make the update so much faster that it's a net win to have the index, even with the cost of the additional index update.
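
A sketch of that bulk update with invented columns; the index pays for itself by turning the scan into a seek:

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// Without an index on dateOfBirth, this updateMany scans the whole
// table to find the handful of matching rows. With one in place, e.g.
//
//   CREATE INDEX users_date_of_birth_idx ON users (date_of_birth);
//
// the database seeks straight to them first, so the write as a whole
// comes out faster even counting the extra index maintenance.
await prisma.user.updateMany({
  where: { dateOfBirth: new Date('1990-01-01') },
  data: { firstName: 'Kent' },
})
```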

00:30:23 Kent: Yeah, there it is right there, folks. So add your indexes, unless you're building a time-series, crazy-amount-of-data-input sort of thing, in which case you are an outlier and you should know that. I would like to talk about a concern that a lot of people bring up with Prisma, and that is joins.

00:30:45 People talk about this a lot. In the workshop, we have one exercise where we're doing raw SQL because we need to do a nested order-by query so that we can sort users a certain way. Yeah, there's no chance. I can't even imagine what an API for that would look like

00:31:05 with the Prisma API currently. I actually do mention in the video that I can't see a reason why Prisma couldn't add a Drizzle-like API in the future. That seems like a reasonable thing it could do, which would be sick. But anyway, Prisma does get a fair bit of criticism

00:31:25 for what a lot of people call suboptimal queries. So I'd love for you to talk about that.

00:31:32 Tyler: Yeah, happy to share my opinion of it. If you talk to others on the team, you'll probably get different takes on some of the history behind it and some of the outcomes. And we're definitely embracing the feedback too, and we're looking for opportunities. Like the last release had a couple of improvements

00:31:52 already to leverage some of this better. So there's improvements coming, but we have to be really careful about them, because with ORMs in general, when you're doing a Drizzle-like API or a query-builder-type API, you get so much control over all the specific things that you wanna do that it almost translates

00:32:12 one-for-one with the SQL. Not always, but it gets really close. An API like Prisma has to run through almost like a processor to see, okay, how do I turn this JavaScript representation into the actual SQL queries that need to do the exchange? And databases historically, relational databases, are not very good

00:32:32 at handling nested data with complex filtering nested deeply, right? Like you can do joins and stuff. So I come from a C# background too, early in my career, and I used ORMs there that would do these complex join scenarios. And what happened

is every time I got a reasonably complex query, the whole thing would just fall apart. And I would end up splitting it down into multiple individual queries anyway, because it would always try to build a single-table result set coming back from the query. Whatever I gave it as an expression, like you would in Prisma, it would try to build one monolithic SQL query for that.

00:33:13 And what you end up having is, if I have all these joins, I have data that's repeated multiple times across rows. 'Cause it's like, this thing has 10 sub-records, so how do I model that in a single table? If you apply filters on those nested things, then depending on how it tries to generate the SQL, where do I put this condition? Do I put it in an optimal

00:33:33 place? And especially whenever you have a forked sort of relationship in a single query. So you've got a parent that now needs two different child entities, and those might go multiple layers deep. Well, now the query is taking two different join paths. And if you were to write that by hand, it's not really clear what exactly

00:33:53 you would write, in my opinion. To do that, I would probably write two separate queries. So Prisma takes a pretty simple stance on that, and it says, well, every layer is more or less a SQL query. And what I've found personally in my experience is that that is super optimal for most workloads, because it's very clear

00:34:13 to index. If I log my Prisma queries, which I highly suggest you do with any ORM, I would suggest you log the SQL, especially in development, just to see what it's doing. I can take that, drop it into EXPLAIN ANALYZE, and it says, here's where you're missing an index, more or less. And it's very easy for me to say, okay, I put an index here, that query is now faster.

00:34:35 This overall Prisma query, that might have had multiple other queries in it, is now faster. Whereas with these really complex ones, sometimes it's not clear which specific direction I need to take the index. And sometimes the optimizer doesn't even use it. In my experience with those other C# engines that I've run,

00:34:55 the optimizer in SQL Server would pick an index I totally didn't expect it to, but it was because the query was so complex that the optimizer just chose what it thought was the best path, but what I knew maybe wasn't the best path. So yeah, I'm not saying that it's not going to be beneficial when we get there. I think it will be tremendously beneficial,

00:35:15 but right now it's covering a breadth of workloads pretty well. And the penalty that you could pay on a really massive, nasty query is so much worse than the penalty that's paid on little, small queries running, in my experience.
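
Turning on that query logging is a one-liner with Prisma; a minimal sketch:

```ts
import { PrismaClient } from '@prisma/client'

// Log the generated SQL in development so you can paste slow statements
// into EXPLAIN ANALYZE (or EXPLAIN QUERY PLAN on SQLite) and hunt for
// missing indexes.
const prisma = new PrismaClient({
  log: process.env.NODE_ENV === 'development' ? ['query'] : ['error'],
})
```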

00:35:34 Kent: That is so good. That makes so much sense. And from my perspective, because I ship SQLite to production, I don't care how many queries are going, because it's on disk. So you send tons of queries, it doesn't matter. It's fine because

00:35:54 there's no latency. It's amazing. So the N+1 problem that they always talk about with ORMs, it's not really a problem with SQLite either. Yeah, that is very interesting, that exploding those things out into a giant query, it feels more, it feels

00:36:14 better. But I actually hadn't considered the fact that when you're building out this query, the result is flat, right? Like that's how SQL works. And so you will absolutely have repeat data in there.

00:36:28 Tyler: Yeah. And that's why I encourage, if you ever get a curiosity of, oh, wouldn't this be so much better with the joins, try to write it by hand one time. If you have a very linear sort of relationship, it's not so bad. You do have repeat stuff sometimes, but it's not terrible. If you get into those branching scenarios, that's where for me it's like, okay, I don't even know

00:36:48 which way I want to take this. Like I could go this way, and then it's a suboptimal query, but I'm probably going to lean towards multiple queries every time anyway. There are some newer tools in the databases, in some engines, to help with some of this. That's, I think, where the ORM team might be exploring some of those things. I don't know exactly which direction it's

00:37:08 going. Yeah, I just know it's a big topic for the team. But yeah, there's ways you can do JSON sort of data to get subgroups and stuff back in certain formats. Back when I was doing this in the .NET world, that didn't exist, right? Yeah. So we were just exploding this single-table result set. I ended up building

00:37:28 something that behaved very similar to Prisma, like, I don't know, seven, eight years ago because of that. I was hitting that wall of, I'm trying to do, this was just before GraphQL came out, and I was trying to build something similar to GraphQL. I wanted to have this thing I could talk to from my frontend and ask for specific fields from my API,

00:37:49 and it could do all the filtering and stuff for me. And I ended up landing on, well I just need to do this layer by layer because otherwise I'm hitting this wall of, this is a really suboptimal query and I can't optimize it with indexes because the optimizer just discards them.

00:38:05 Kent: Yeah, yeah, because it can't make sense of the complexity. Yeah, that's very interesting. So I'm sold. It's not so simple as, the feature exists, use it. There's a time and a place for that. And I am satisfied with knowing that

the team is aware, of course, that occasionally, in some instances, a join is more efficient than what is being done. I really appreciate tools that say, the default is great in like 90% of cases, and in 10% of cases it's not

00:38:45 great. And we can optimize those cases rather than optimizing every case.

00:38:51 Tyler: Yeah, absolutely. And that's what I like about the queryRaw example you mentioned earlier. I love that escape hatch, because using Prisma prior to joining the team, I always had this option of, if I can't model this the way that I want to in Prisma, I can get back into SQL world and I can write something that does exactly what I want it to, right? And

00:39:11 I do that so rarely that the benefit of the ORM is still very much there. And when I've looked at other options too in the past that are more query-builder-type tools, I personally don't lean towards that type of API, because I find that I'm just repeating what I already know in SQL. Like I'm having to go

00:39:32 reference this API spec to see, how do I write what columns I want to select? How do I write this filter? Where at that point, I could have just written the SQL by hand. I don't think that's true for everyone, right? And I lose type safety, you know, when I do that. But I do it so infrequently that I'm happy with it.

00:39:52 Yeah. And I love your option too, the type safety using Zod afterwards. Like if you don't have 10,000 records or whatever that you have to like churn through, just throw it through a Zod parser and see what it is or just do it in development mode to see like, make sure it matches up right. That's like a nice sweet spot for me.
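
Here's roughly what that escape-hatch-plus-Zod pattern looks like; the query and row shape are invented for illustration:

```ts
import { PrismaClient } from '@prisma/client'
import { z } from 'zod'

const prisma = new PrismaClient()

// The raw query gives up compile-time types, so validate at runtime.
const UserRow = z.object({
  id: z.string(),
  name: z.string().nullable(),
})

const rows = await prisma.$queryRaw`
  SELECT id, name
  FROM users
  ORDER BY name
  LIMIT 10
`
// Throws if the actual shape drifts from what the code expects.
const users = z.array(UserRow).parse(rows)
```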

00:40:09 Kent: Yeah, yeah. You know, I actually hadn't thought about that. The big idea behind these query builders is that we're totally type safe, and of course they're not running a runtime type checker. So they're development-time type safe, which is what we're looking for most of the time when we're talking type safety. You can do the same thing with Zod. But

like I hadn't considered the fact that it is an API, it's not exactly SQL. And so there is a little bit of difference there. And I'm not a super fan of writing SQL. In fact, in the videos where we're doing SQL, I have one video where

I have GitHub Copilot do it. I give it the instructions for the exercise and I say, okay, do this, and it did, and it was perfect, it was awesome. And that's 100%, well, it wasn't perfect, I fixed like one or two things, but that is how I recommend people write languages they're unfamiliar with. And you know, it's not like you just ship that and

00:41:10 pretend and not even test it. So you test it and you feel good about it and then you move on. So I like SQL and I think it's a really, really powerful language. I don't mind having to dip in and write it manually occasionally because of all the other things that you get from Prisma outside of that.

00:41:27 Tyler: Yeah, absolutely. And I think that goes to sort of a fundamental principle I have, just the knowledge transfer, the transferability of knowledge across things that I work on, right? That's something I love about Remix, I feel like all the things I knew from working in the browser and on web platforms prior to that just work the way I want them to in Remix. And I feel like what

00:41:47 I've learned in Remix also translates to whatever I might use in the future because it's just platform APIs. And that's sort of where SQL fits in some ways to me of like, I learned this once and it more or less translates across Postgres and MySQL and SQLite and all these like databases that use SQL as a language, all these like concepts

and everything mostly work across them. Whereas if I learn a specific API for something, what happens when the next thing comes out and this thing is no longer relevant? I have to relearn all this stuff again. Not to mention reviewing the query after it's generated, like, always review the query afterwards regardless,

00:42:28 just to see like what it did. It's good to have some knowledge of like SQL to be able to review that and see that it did more or less the optimal thing for your use case.

00:42:38 Kent: Yeah, yeah. I think it's probably fair to mention also though, that Prisma is not exactly transferable knowledge either. It's very much

00:42:47 Tyler: like, yeah.

00:42:48 Kent: But when it comes down to, okay, I have to step outside of Prisma. I do prefer to step into standards, which I think is pretty cool.

00:42:58 Tyler: Yeah, it's absolutely true.

00:43:00 Kent: So, you know, we're already over the time I was anticipating, but I really would like to touch on this, because I think a lot of people are gonna be thinking about this. So if you've got the time, then I'd like to ask you about your opinion on the database that we're using in Epic Web and in the Epic Stack. So I think a lot of people, when they first

00:43:20 see that we're using SQLite, they're thinking, oh, well that's just to make the workshop easier, right? Like I can literally commit the database to the workshop so that your data looks exactly the same as my data, and it's really nice that way, without needing seeding and stuff. Even the IDs are exactly the same, which is cool. So that is a side

00:43:40 effect. That's not the reason we're using SQLite. The reason I use SQLite is because I think it's a great production database, and I've been using it in production for years now and it's awesome. But there are other databases, I've used others myself, so I'd love to hear your thoughts around using SQLite and cases where you think it might not be suited.

00:44:01 Tyler: Yeah, I think if you had asked me a few years ago, I'd have been shocked, like, SQLite in production? What are you talking about? But that to me was not so much about the actual SQLite technology. It was the fact that the underlying engine didn't have these tools available, like replication,

00:44:23 or the distribution problem, where I'm essentially saying this database is running on a disk somewhere on a single node and I've got a point of failure now. Yeah. But that's all changed, and it's actively changing right now, which I think is really fascinating. I've been following a lot of these companies that are doing this. And I'm seeing it more

00:44:43 and more as being, I'm glad that that's where the Epic Stack is taking it, 'cause I think that's going to be a next wave of things. These other databases are really, really good. Most of my experience in my career is with Postgres and SQL Server, and I've gotten really far on them. Like that example I shared earlier, that was with Postgres. And I still like those databases,

00:45:03 but I think that SQLite's sort of been untapped, because it didn't have these things that you needed to run a really high-scale system. And now it's getting that, but it's bringing with it the benefits that it still had in that sort of embedded-systems environment, which is where it's really cool to me. Like these different vendors are taking

00:45:23 it. Like I think Fly's is a virtual file system approach to the replication. Super fascinating to see how that works. There's also, I think it's Turso, that has done a fork of SQLite into libSQL, adding things like a write-ahead log, which is essentially how replication happens. Like it's a change

00:45:43 stream of all the events happening in the database that can then broadcast to every other node you have throughout the world. That brings all those replication concerns forward. I think it's gonna be a really good choice. And like I said too, if you ever think, I wanna use Postgres or MySQL in my production environment, well, the things that you learn

00:46:04 working on SQLite, a lot of it can transfer over. Prisma is not always transferable knowledge, but one nice thing is that you can mostly switch from SQLite to Postgres or MySQL, and a lot of things do work that way. Yeah. So yeah, I think it's a great choice. I think we'll still see improvements on some of the tooling, the

00:46:24 advanced tooling. Like when I do performance insights, like we've looked at the explain plan before, my complaint has always been I get so much more out of SQL Server or Postgres than I do out of SQLite. But I think that will change too. I think that's been a symptom of SQLite not being as used

00:46:44 in these types of environments. And now that it is, we're gonna see more tooling come for that too.

00:46:51 Kent: And I think that the technology speaks for itself as something that is worth investing that type of effort into.

00:47:00 Tyler: Yeah,

00:47:00 Kent: I completely agree. And I migrated from Postgres to SQLite, which is kind of backwards for a lot of people, but maybe not for long, I don't know, we'll see. But yeah, Prisma made that way easier than it should have been. So yeah,

00:47:20 that's, you know, a win for an ORM. And actually, so you mentioned Fly and Turso. There's also Cloudflare, Durable Objects are built on top of SQLite too, which is, I mean, that's pretty hardcore. So.

00:47:36 Tyler: Yeah, I've been looking at D1, which I think is in beta right now. It's their SQLite offering and it is very cool. I believe there's gonna be a lot of exciting stuff coming for that when it gets out of the beta phase. Yeah,

00:47:54 Kent: a lot of benefits to SQLite, and solving, like you said, the distribution and single-point-of-failure problems. Now that those are being, you know, I wouldn't say that they're solved, there's still some challenges with those, but there is a path forward for all of that.

00:48:13 Tyler: And it's so cool how, like I said earlier, it brings those benefits with it too. Running the queries against SQLite on Fly is so fast that you don't even notice some of the penalties you get. Indexes are still a help, but some of the other things that we've been concerned with in the past, like latency between

00:48:33 the database and the application, those things just disappear. Even, you know, I was talking about the CPU and memory of that database instance I had. That wasn't the application, that was the database provisioning. With SQLite, you don't even really have that concern. It's just part of your application in many ways.

00:48:52 Kent: Yes, and that can't be overstated. That was one of the main reasons that I switched, or started thinking about switching, from Postgres to SQLite: to eliminate the extra service that I had running. Because I had my application, it's still distributed all over the world, but that's not very useful if your data isn't also distributed all over

00:49:12 the world. And so I had Postgres and read replicas in all those places, and so I had twice the number of services. I also had Redis actually in all of those regions, so I had three times the number of services. And then I switched over to SQLite and used SQLite for my cache as well. And

00:49:33 I cut my services to a third. That was nice.

00:49:38 Tyler: It's cheaper. Fewer things to fail.

00:49:40 Kent: Yeah, fewer things to fail. Yeah, actually, that's funny, because SQLite was, like, the single point of failure, and now I'm like, I don't have that problem because it's distributed, but also I have fewer points of failure, which is pretty sweet.

00:49:53 Tyler: And on points of failure too, I haven't really thought about this before, but if your application is down, how useful is your database? Yeah, yeah, sure. So there's arguments to be made either way, I guess. Like I said earlier, your database is your most critical thing, but if you move it to be part of your application, you only have one thing that's either up or down now. Yeah.

00:50:14 Something to think about.

00:50:16 Kent: Yeah, it's very interesting. Yeah, SQLite is phenomenal. I feel like there was something else I wanted to mention about it. But yeah, I think that's all that I got. This conversation has just been really fun for me, Tyler. Thank you for giving us some of your time to chat about these things. Is there anything we didn't talk about that you wanted to make sure to

00:50:36 bring up?

00:50:36 Tyler: Oh, no, I think we're good. I always appreciate talking with you, Kent. It's a pleasure to be here.

00:50:41 Kent: Yeah, thank you, likewise. And what is the best place for people to get in contact with you or the Prisma team or if they have any follow-up questions?

00:50:51 Tyler: Yeah, you can find me on Twitter. rtbenfield's my name on pretty much everything, Twitter, X, whatever you prefer to call it. Discord as well, rtbenfield, I'm pretty responsive on there. For Prisma, we have our Prisma handle on Twitter. We're very active on there. And we also have our Discord and Slack. I actively look at the Discord quite a bit

00:51:11 and enjoy working there. So yeah, check out the Prisma Discord, hit me up, ask me any database questions. I love tuning databases. Throw me your best tuning scenarios and I'll see what I can crack.

00:51:23 Kent: Awesome, that sounds great. Okay, very good. Actually, you know what? There was something that I wanted to mention, I remembered the thing. Earlier you mentioned that in SQLite, some of the indexes are less important, still important, but maybe a little less so, just because of the nature of it.

00:51:43 And I just wanted to double-click on that, which is a phrase that I don't use very often, and just say, like, in my database on my website, I have over half a million rows. I should look again, but, you know, it's not a small amount of data. For lots of the apps that people watching are building, unless you're working

at Amazon, you're probably not at half a million rows. There are just tons of apps that aren't all that huge. So anyway, I feel like I have a pretty sizable set of data. And on every page load, I execute this query, or this set of queries, to determine the

team rankings. So on my site, you can log in and you choose a team, and based on which blog posts you've read, you score a point for your team. So on every page load I wanna calculate those team rankings, and just by nature of the way that calculation has to work, because I look at how recently active, or how

00:52:43 many recently active members there are. Well, to know how many recently active members there are, you need to know how many post reads they've had in the last six months. And, like, there's a bunch of stuff. They have to have read something in the last year to be counted as recently active. So it's more complicated than you might think. At the end of it all, it's nine queries that

00:53:03 I have to execute to determine the rankings. And before I had your help on indexing the tables, these queries just took way too long, especially since they run on every single page load. So I stuck a cache in front of it and it's fine. It would just update every 60 seconds, no big deal. So

00:53:24 yeah, what's interesting is once you helped me with the index, that went from like a couple seconds for all of those, I think it was maybe just like two seconds, but certainly way too long for every page load. So it went from like a couple seconds down to a couple milliseconds, just from adding indexes. I mean, it's a lot of data, so

00:53:44 it kind of makes sense, but indexes, they're real. And I still have the cache there because it takes more effort to remove it than to just leave it. So, yeah.
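
As a hedged sketch of the kind of query that sits inside that ranking calculation (the real schema is Kent's; these names are invented):

```ts
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// Count members of a team who have read a post in the last six months.
// At half a million rows in the reads table, an index like
// (userId, createdAt) is the sort of change that takes the nine-query
// ranking computation from a couple seconds to a couple milliseconds.
const sixMonthsAgo = new Date(Date.now() - 1000 * 60 * 60 * 24 * 182)

const recentlyActive = await prisma.user.count({
  where: {
    team: 'BLUE',
    postReads: { some: { createdAt: { gte: sixMonthsAgo } } },
  },
})
```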

00:53:54 Tyler: Yeah, indexes are huge. I think most index performance advice still applies to SQLite. It's a lot of the other things that we were optimizing, like the network stuff, that mostly go away, right? Geographic location was a big one, right? Like how close can you put your database to your application, and then, like you were talking about earlier, how do you replicate it to

00:54:14 be everywhere your application is? Those are things that with SQLite we don't really have to think about as much, if at all. But for the most part, your indexes are still gonna help. Where they're maybe, if anything, less beneficial is if you have a table with a very low number of records; scanning it with SQLite is gonna be super fast. Yeah. If it's a hundred records,

00:54:34 you're probably not going to notice an index benefit, but at half a million, it's going to be very obvious. Yeah. Yeah. All

00:54:44 Kent: right. Hey, thanks so much again, Tyler. And yeah, we'll chat with you all later.

00:54:50 Tyler: Thanks. Bye, everyone!