The Problem the DataLoader Solves
By James Moore
Interested in learning more about GraphQL and the DataLoader?
Source Code for this tutorial: https://github.com/knowthen/dataloader_tutorial
"When you first start learning GraphQL, there’s a pretty good chance, that you’re first attempt at creating a GraphQL API, will have one big problem, that you really shouldn’t live with. Now, I bet many of the seasoned developers, who start learning GraphQL, will recognize the problem I’m referring to, but there’s a whole lot of developers, who probably won’t even notice the problem, because… well, the problem isn’t obvious. This problem, that I’m about to talk about, won’t give you an error, at least not during development, and honestly you’ll probably end up with this problem, because its a result of an intuitive solution, but unfortunately the intuitive solution is a naive solution.
So what is this problem, I’m referring to?
Well, instead of just telling you, how about I show you.
I want you to imagine that you and I are working on a book review app, that looks something like this, and we’ve already created the beginnings of a GraphQL API for this app.
Here’s a pretty simple GraphQL query, that you can use with our server, to fetch a set of books, that includes each books unique id and title, and it also fetches an a set of reviews, for each book, which includes the: star rating, and the title left by the person who left the review.
Now, the GraphQL Server that we’re about to look at, has the problem, I’ve been hinting at, but the problem isn’t obvious, I mean watch, I’ll run this query, and guess what, it worked, the server didn’t blow up. We got an array of books, and each individual book includes some book reviews.
Let’s go take a look at how this GraphQL server is setup, but before I show you, I’ve got a challenge for you, I want you to pay close attention over the next few minutes, and try to figure out, what the problem is with the code, I’m about to show you.
Ok, to start, I’m going to simplify this graphql query, by commenting out the reviews field and it’s subselection, right here.
Now, when this query get’s sent to the server, a resolver function named books get’s called, which you see right here. Resolver functions should generally be thin, in other words, as much as possible, you should avoid putting logic in them. So is this resolver function “thin”?
Well, yeah it’s pretty thin, it’s simply returning the value you get, when you call the allBooks function.
Let’s go take a look at the “allBooks” function, which is defined in the book dot js file.
Ok, there’s not a lot going on in the allBook function. Right here, you see a sql statement, that’s querying all the columns, from the hb schema’s, book table. Then on this line, the query function is called, which will send the sql statement to the database, and return the response, from the database.
Well, actually, the query function does one other thing, just for this tutorial. It’s logging the sql query, to the console, which you’ll see in a bit.
Now, in a moment, we’ll run this GraphQL query, but before we do that, I’m going to clear the terminal session, which is running the graphql server. We’ll come back to this in a moment, to see what sql queries get’s logged, when we execute our GraphQL query.
Now, back in the browser, I’ll run this query… and as you can see it worked, we see an array of around 10 books.
Let’s head back to the console, and see what SQL commands, were sent to the server. Ok, well, it looks like one sql query, which is the query we just looked at, was sent to the server, which is probably what you expected.
Alright, now let’s expand this query to include the reviews.
Now, the book id and title fields, you see here, are stored in the book table, but the book reviews, are stored in a different related table, named review, so to fetch these fields, another sql query, needs to be sent to the database server.
Before we run this query, let’s take a closer look at what will happen on the server, when we include these new fields.
Now, when the server receives this modified query, the book resolver get’s called again, just like it did last time, but then after the list of 10 books, are returned from the database, a new resolver function is called for the reviews field, and of course the resolver has the same name as the field, which is reviews.
So, how many times do you think this reviews resolver function, will get called?
Now keep in mind, the reviews field is a child of the book field, so it’s going to get called once, for each book, so in our example, it’ll get called 10 times, and as you can see here, the parent book object, get’s passed into this resolver.
Now this reviews resolver isn’t doing much, it’s just calling this reviewsByBookId function, passing it the books id. If we peak at the reviewsByBookId function, you see that it’s querying the hb schema’s review table, where the book id is some value, represented by the dollar sign one, you see here. Then on this line, you see a params constant, which is set to be an array, that contains the passed bookId. When the database is sent this sql command, and this array of parameters, it safely replaces the dollarsign one with the first element of the array, which is the book id parameter.
Ok, this probably seems pretty straight forward to you, but do you see a problem with what’s happening here?
To demonstrate the problem, let me clear the terminal again, then I’ll execute the modified GraphQL query, and as you can see, it seemed to work. We see the 10 books, and each book has around 3 or 4 reviews.
So let me ask you this, how many sql queries do you think were sent to the database server?
Let’s take a look in the terminal.
Well, at the top of the logs, we see the query of the book table, just like we saw last time, then we see a query for the review table, for the book with an id of 1, then we see a nearly identical query, for the book with and id of 2 and then another identical query with an id of 3. If we continued to look at every log, you’d see that 10, nearly identical queries, were sent to the database server.
So, in total we’re sending 11 queries.
So is sending 11 queries bad?
Well, it can be a real performance killer, for a few different reasons.
First of all, the 10 nearly identical queries, consumed 10 separate database connections.
So should you care that 10 database connections were used, just for this one part of our graphql query?
Yep, it really should be a concern. Database connections are a pretty expensive resource, and generally speaking, most smaller and medium sized database servers, should only use a small number of connections usually around 10 to 20… Technically, you could allow a lot more connections, but it’s expensive, and it typically it’s leads to worse database performance, than allowing a smaller number of connections, and just queuing requests, if and when there are no available connections to use.
Now, additionally, every time a query get’s sent to the server, there’s a bit of overhead, for example, the database query planner has to execute once for every query, and guess what, often the amount of time it takes to run the query planner takes longer than actually executing the query.
By the way, this problem we’re seeing here, has a name, that you might have heard of, it’s called the N + 1 problem.
In other words, we’re have to execute 1 query to get the list of books, then we have to execute N additional queries, where N is the number of books returned from the first query, which in our case is 10.
Now, imagine if we tweaked this query, to include the user field, which, by the way, would need to query the user table.
How many additional queries, would our database need to execute, in this scenario, assuming we’re following the same naive approach?
We’ll for discussions sake, let’s say we run the book query, which returns 10 books, then we have to run 10 more review queries, because there were 10 books, now let’s say there are around 4 review per book, that means we’ll need to run 4 additional user queries, for each review query, which would give us 40 user queries. So we’d end up running a total of 51 sql queries, for this simple graphQL query.
Now it’s important to keep in mind, most real graphql queries, will be more complicated than this one, so, as you can probably imagine, the number of database queries your server might end up attempting could quickly get out of hand.
So, how do we avoid the N + 1 problem, that we’re experiencing?
Well, what if instead of sending these 10, nearly identical queries, we instead sent 1 query, that looks like this?
This single database query, will fetch the same review data, as these 10 queries, but it’s only going to use one database connection, which as you can imagine, is much more efficient.
Now the question is, how do we get from our current solution of 10 queries, to this greatly improved solution with one query?
This is where the Dataloader library, from facebook, comes in handy. The dataloader is a small but brilliant library, that allows you to perform per request, batching and caching of queries. Well, that definition probably seems a little bit vague, so let me explain with a visual.
Right now, the reviews resolver, is directly and immediately calling the reviewsByBookId function, passing along each book id, but what if we added something in between these two functions, that could be sent the book ids, but it didn’t immediately act on the book id, and instead, it waited some small period of time, collecting all the book ids, then after all the book ids have been batched together, an array of the ids could get sent to another function, which could create the more efficient database query we’re looking to use?
So what is this thing in them middle? Well, you probably guessed… it’s the dataloader.
Alright, let’s try and use the dataloader, in other words, let’s put the dataloader in between the resolver, and the data access layer function, that will actually look up the book reviews.
Now I moment ago, I said that the Dataloader does per request batching and caching, and I want to draw your attention to the first part of that description, the per request part. What this means is, the dataloader only lives and does it’s job, for a single server request. In other words, when the server receives a new graphql query, we’ll need to create an instance of the dataloader, and use the dataloader for the duration of the request, but once the response is sent from the server, the dataloader instance will fall out of scope, and be garbage collected.
I bet some of you are thinking, wait a minute, how can the dataloader do caching, if it’s so short lived. This is a common misunderstanding. It’s important to not confuse the caching the dataloader does, with the caching done with something like redis or memcache. The dataloader library isn’t a replacement for these types of caches, the dataloader would be used in addition to these types of caches.
Ok, so how do we use the dataloader? Well, first of all, I’ll install it by keying npm install dataloader.
Next, we’ll open up the review dot js file, and I’ll import the dataloader from the dataloader package.
So, how do you use the dataloader?
Well, you’ll need to create instances of the dataloader for each new server request, by keying new DataLoader, and passing it a callback function.
Now, to add an id to the dataloader instance, you’ll call it’s load method, passing in the id.
Ok, let’s do this, in the review dot js file, I’ll create a new, exported function named reviewsDataLoader, which will simply return a new instance of a dataloader, and we’ll pass it a new callback function named reviewsByBookIds, which doesn’t exist, but we’ll create it right now.
Now, I’ll create the function named reviewsByBookIds, which get’s passed in an array of book ids. Next, I’ll just copy the body of the existing reviewByBookId function, and I’ll paste it into our new reviewsByBookIds function.
Now, I’ll modify the sql command, adding a call to any, around the dollar sign one, then I’ll tweak the params array, to include the bookIds, not just an individual book id.
So, do you think we need to do anything else with this function? I mean, can we just return the array of reviews that we get from the database, from this sql query?
Well, not exactly. Let’s say for discussion sake, the bookIds parameter contained an array of three book ids: 1, 2 and 3.
Now, let’s say the server returned 4 review rows, that look like this. If we just returned these 4 review objects to the dataloader, how would the dataloader know which books these 4 reviews are associated with?
I mean, all 4 reviews could be for one book, or there could be some other distribution of the reviews between the 3 books, but currently, there’s know way for the dataloader to know which review goes with which book, so obviously we’ll need to do something, so that the dataloader can know which reviews, belong to which books.
Here’s what we’ll need to do, to make this work, we need to map the reviews returned from the database to the input array. Here’s what I mean, if the first element of the bookIds array is 1, then we should return an array that contains the reviews for the book with an id of 1, in the first element, and we should do the same thing for the book ids 2 and 3 as well. Now, this new array, is sequenced so that the dataloader can figure out which reviews, go with which books.
Ok, so obviously we’ll need to do a little bit of data manipulation, in this function, to properly sequence the array we return.
So what does this groupBy function do? Well, let’s say you’ve got an array of 4 book reviews, that look something like this. I’d like to transform this array, which includes the bookId and the review data, into an object, where the properties, are the book id’s, and the values are an array of reviews. Why do I want to do this? You’ll see in a moment.
Ramda’s groupBy function, will do this exact data transformation for you.
To get from this array to this object, all you need to do is call groupBy, passing it a function that returns the value you’d like to have the resulting object grouped by, which in our case is each reviews bookId, and then you’ll just pass in the array to act on, which in our case is the array of reviews.
Let’s go use the groupBy function. Instead of returning the promise that you get by calling the query function, I’ll create a new constant named reviews, and I’ll set it equal to the value you get by awaiting on the promise returned by calling query, and of course I’ll need to make this function an asynchronous function.
Next, I’ll create a constant named groupedById and I’ll set it to the value you get by calling ramda’s groupBy function. I’ll skip the first parameter for a moment, then I’ll pass in the reviews array as the second parameter.
The first parameter will be a function that takes an individual review from the reviews array, and I’ll have it return the reviews bookId.
Next, we’ll do the mapping of the input bookIds, to their corresponding reviews, as we discussed a few moments ago.
todo this, I’ll return the value you get by calling ramda’s map function , I’ll skip the first parameter for a moment, and I’ll pass in the array of bookIds, to map over, as the second parameter. Now the first parameter, should be a function that receives an individual book Id, from the bookIds array, then what’s returned should be the reviews for the passed bookId, which we can get by keying the groupedById constant, we created a moment ago, then using bracket notation, we’ll access the reviews for the passed bookId. This expression right here, is why we used the groupBy function.
Ok, that’s all we need to do in the new reviewsByBookIds function.
Alright, we’re ready to create and use the dataloader, but the next question is, where do we use it? In other words, where in our codebase should we call the reviewsDataLoader function.
Let’s head over to the server dot js file, and I’ll show you.
In this simple nodejs server, I’m using the Apollo graphql server. We’re going to add a new property to this config object, named context, which I’ll set to be a function. This function will get called every time the server receives a request, and whatever we return from this function, will get passed into every resolver function, on this server. In other words, this context function will allow us to create a new dataloader for every request, and it’ll send the dataloader into every resolver function.
So, I’ll return a new object, with a loaders property, which I’ll set to be an object with a reviewsLoader property, which I’ll set to be a new dataloader instance, but first, I’ll import the reviewsDataLoader function, from the review file, then I’ll call the reviewsDataLoader function, to create a new dataloader.
Ok, cool, now all we need to do is head over to our resolver function, and use this dataloader, that’s stored in this context object.
So, how do we access the context object in our resolver function?
Well, it’s one of the provided parameters. The second parameter, contains the fields arguments, which we don’t actually need in this case, but the 3rd parameter is the context object we need.
Next, I’ll just unpack the loaders from the context, then I’ll unpack the reviewsLoader from the loader object.
Now, I’ll get rid of the call to the reviewsByBookId function, and instead we’ll return a call to reviewsLoader dot load, passing in the book’s id.
Ok, that’s all we need to do, to use the dataloader.
Let’s see how many queries get sent to the database server, now. So, first I’ll clear the terminal, then I’ll execute our graphql query, and we’ll take a look at what was logged in the terminal.
Alright, last time we ran this graphql query, 11 sql commands were sent to the server, but this time we’re only seeing 2 queries, nice.
What if the books query here, returned 100 books instead of 10 books, how many sql queries would be sent to the server in this scenario?
Well, we’d still only see 2 queries get sent to the database. The only thing that would change, is this list of ids would be much bigger, as it would contain 100 book ids.
I want to show you one more thing, by tweaking this graphql query. First copying the entire books field, and it’s subselection, then just below the books field, I create an alias named books 2, and I paste what I just copied, the books field and it’s subselection
How many queries do you think will get sent to the database server, in this case?
Let’s try and see. First I’ll clear the terminal, then I’ll run the query, and we’ll checkout what was logged.
Ok, we see 3 queries, the book query was sent twice, but the reviews query, which is using the dataloader, was only sent once. The reason the reviews query was only sent once, is because the dataloader performs per request caching, so there was no need to query the reviews table more than once, nice.