Django QuerySets

QuerySets


A QuerySet, in essence, is a list of objects of a given model. I say ‘list’ and not ‘group’ or the more formal ‘set’ because it is ordered. In fact, you’re probably already familiar with how to get QuerySets because that’s what you get when you call various Book.objects.XXX() methods. For example, consider the following statement:
Book.objects.all()
What all() returns is a QuerySet of Book instances which happens to include all Book instances that exist. There are other calls which you probably already know:
# Return all books published after 1990
Book.objects.filter(year_published__gt=1990)

# Return all books *not* written by Richard Dawkins
Book.objects.exclude(author='Richard Dawkins')

# Return all books, ordered by author name, then
# chronologically, with the newer ones first.
Book.objects.order_by('author', '-year_published')
The cool thing about QuerySets is that, since each of these functions both operates on and returns a QuerySet, you can chain them:
# Return all books published after 1990, except for
# ones written by Richard Dawkins. Order them by
# author name, then chronologically, with the newer
# ones first.
Book.objects.filter(year_published__gt=1990) \
            .exclude(author='Richard Dawkins') \
            .order_by('author', '-year_published')
And that’s not all! It’s also fast:
Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.
So we’ve established that QuerySets are cool. Now what?

Return QuerySets Wherever Possible

I recently worked on a Django app where I had a model that represented a tree (the data structure, not the Christmas decoration). This meant that every instance had a link to its parent in the tree. It looked something like this:
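from django.db import models

class Node(models.Model):
    # on_delete is required since Django 2.0; CASCADE matches the old default.
    parent = models.ForeignKey('self', null=True, blank=True, on_delete=models.CASCADE)
    value = models.IntegerField()

    def __str__(self):
        return 'Node #{}'.format(self.id)

    def get_ancestors(self):
        # Walk up the tree recursively, collecting parents into a list.
        if self.parent is None:
            return []
        return [self.parent] + self.parent.get_ancestors()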
This worked pretty well. Trouble was, I had to add another method, get_larger_ancestors, which should return all the ancestors whose value is larger than the value of the current node. This is how I could have implemented it:
    def get_larger_ancestors(self):
        ancestors = self.get_ancestors()
        return [node for node in ancestors if node.value > self.value]
The problem with this is that I’m essentially going over the list twice – once by Django and once by me. It got me thinking – what if get_ancestors returned a QuerySet instead of a list? I could have done this:
    def get_larger_ancestors(self):
        return self.get_ancestors().filter(value__gt=self.value)
Pretty straightforward. The important thing here is that I’m not looping over the objects. I could apply however many filters I want to what get_larger_ancestors returns and feel safe that I’m not re-running over a list of objects of unknown size. The key advantage here is that I keep using the same interface for querying. When the user gets a bunch of objects, we don’t know how they’ll want to slice and dice them. When we return QuerySet objects, we guarantee that the user will know how to handle them.
But how do I implement get_ancestors to return a QuerySet? That’s a little bit trickier. It’s not possible to collect the data we want with a single query, nor is it possible with any pre-determined number of queries. The nature of what we’re looking for is dynamic and the alternative implementation will look pretty similar to what it is now. Here’s the alternative, better implementation:
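from django.db import models

class Node(models.Model):
    # on_delete is required since Django 2.0; CASCADE matches the old default.
    parent = models.ForeignKey('self', null=True, blank=True, on_delete=models.CASCADE)
    value = models.IntegerField()

    def __str__(self):
        return 'Node #{}'.format(self.id)

    def get_ancestors(self):
        # Base case: the root has no ancestors, so return an empty QuerySet.
        if self.parent is None:
            return Node.objects.none()
        # Union of the parent itself and all of the parent's ancestors.
        return Node.objects.filter(pk=self.parent.pk) | self.parent.get_ancestors()

    def get_larger_ancestors(self):
        return self.get_ancestors().filter(value__gt=self.value)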
Take a while, soak it in. I’ll go over the specifics in just a minute.
The point I’m trying to make here is that whenever you return a bunch of objects – you should always try to return a QuerySet instead. Doing so will allow the user to freely filter, slice and order the result in a way that’s easy, familiar and provides better performance.
(On a side note – I am hitting the database in get_ancestors, since I’m using self.parent recursively. There is an extra hit on the database here – once when executing the function and again later, when actually inspecting the results. We do get the performance upside when we apply further filters to the results, which would otherwise have meant more hits on the database or heavy in-memory operations. The example here is meant to show how to turn non-trivial operations into QuerySets.)

Common QuerySet Manipulations

So, returning a QuerySet where we perform a simple query is easy. When we want to implement something with a little more zazz, we need to perform relational operations (and some helpers, too). Here’s a handy cheat sheet (as an exercise, try to understand my implementation of get_ancestors).
  • Union – The union operator for QuerySets is |, the pipe symbol. qs1 | qs2 returns a QuerySet with all the items from qs1 and all the items in qs2 while handling duplicates (items that are in both QuerySets will only appear once in the result).
  • Intersection – QuerySets do support the & operator (qs1 & qs2), but you rarely need it, because you already know how to intersect! Chaining functions like filter and exclude in fact performs an intersection between the original QuerySet and the new filter.
  • Difference – a difference (mathematically written as qs1 \ qs2) is all the items in qs1 that do not exist in qs2. Note that this operation is asymmetrical (as opposed to the previous operations). I’m afraid there is no operator for this, but you can do this: qs1.exclude(pk__in=qs2)
  • Nothing – seems useless, but it actually isn’t, as the above example shows. A lot of the time, when you’re dynamically building a QuerySet with unions, you need to start off with what would have been an empty list. This is how to get it: MyModel.objects.none().
