Note: At the time of writing, Django is at v1.4.1
So what actually happens when you have a Django query like the following?
cities = City.objects.all()
venues = Venue.objects.filter(city__in=cities)QuerySet as the input for the __in clause, and in turn, generates a subquery. The raw query generated is something like this:SELECT ...
FROM   "venue" 
WHERE  "venue"."city_id"
IN (
    SELECT U0."id" 
    FROM "city" U0
)What's wrong?
In this scenario, nothing at all. We've saved a round trip to the database, all is good. Let's throw caching into the mix.cities = cache.get('cities')
if cities is None:
  cities = City.objects.all()
  cache.set('cities', cities)
venues = Venue.objects.filter(city__in=cities)City QuerySet gets cached for us like we expect, but when Django gets to the filter(), it sees that you're passing it a QuerySet and generates a subquery again. We've effective cached nothing because the exact same query is being run.This also applies when you want to use the same
__in across multiple filters.cities = City.objects.all()
venues = Venue.objects.filter(city__in=cities)
users = UserAccount.objects.filter(city__in=cities)Well, how can we fix this?
The reason why Django is doing this is solely because the argument passed intofilter() is a QuerySet. If we just pass a list, Django will use an array of ids inside the IN SQL.Let's look at an improved version with caching.
cities = cache.get('cities')
if cities is None:
  cities = list(City.objects.all())  # Force query evaluation
  cache.set('cities', cities)
venues = Venue.objects.filter(city__in=cities)QuerySet to be evaluated by converting it to a list. Now, cities is a list, so Django won't try to perform a subquery anymore. This could be simplified to just caching the primary keys as a list as well, depending on what you're trying to achieve.The resulting SQL will be something like:
SELECT ...
FROM   "venue" 
WHERE  "venue"."city_id"
IN (
    1, 2, 3
)One more gotcha!
We all know that Django'sQuerySets are lazy, right? No query is actually performed until the QuerySet is iterated over. Once a QuerySet has been iterated over, it won't query again for the results. The results get cached into an internal _results_cache object.At the moment, Django is not smart enough to detect that a
QuerySet
 has already been evaluated before generating a subquery, effectively 
causing the query to be run again when we already know the results.Picture this scenario:
cities = City.objects.all()
names = [city.name for city in cities]  # QuerySet cache is filled
# ... do something awesome ...
venues = Venue.objects.filter(city__in=cities)QuerySet before passing it to the __in
 clause, it'd be smart to just use the ids that we've already calculated
 from before, but it doesn't. So pay attention and be careful.I've submitted a patch to Django core to try and get that behavior fixed:
 
No comments:
Post a Comment