Streaming Output
Since the DSO-API offers huge datasets (e.g. all buildings in Amsterdam), the output is written using a streaming response. This greatly improves performance and reduces memory consumption.
Standard REST Framework
A standard REST framework project follows these steps:
1. The serializer reads all data in memory, and generates the total JSON dictionary.
2. Then the paginator wraps it, and the renderer calls json.dumps() to convert this complete structure to a JSON message.
3. Finally, the total output is written to the client.
For huge datasets, this is problematic, as it uses a lot of memory. The complete table data is read into memory, the QuerySet caches all results, and the whole JSON string is stored in memory before any output is written. For a single request, this can create a peak of 1 GB in memory usage to handle that amount of data.
Streaming Design
Ideally, individual records are written to the client while the table is being read. The memory usage then stays low, because records are processed one by one in the pipeline.
Python offers a generator pattern to read data while it's being retrieved. Django also has QuerySet.iterator() to process data one record at a time, and a StreamingHttpResponse class that allows the WSGI server to read the response body from a generator/iterator.
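As a minimal sketch, these pieces combine as follows in a plain Django view (the Building model and view name are placeholders for illustration, not the project's actual code):

from django.http import StreamingHttpResponse

from myapp.models import Building  # hypothetical model, for illustration only

def building_list(request):
    def generate():
        # QuerySet.iterator() fetches rows lazily instead of caching them all.
        for obj in Building.objects.iterator():
            yield f"{obj.pk}\n"  # one small chunk per record

    # The WSGI server pulls the response body from the generator.
    return StreamingHttpResponse(generate(), content_type="text/plain")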
Unfortunately, each step inside the Django REST Framework pipeline breaks the streaming behavior. Whenever code reads all incoming data upfront, the whole chain of streaming records breaks. Hence, all steps were altered to work with generators:
- Serializers read using QuerySet.iterator() whenever possible.
- Serializers return a ReturnGenerator instead of a ReturnList.
- The paginator delegates most rendering to the output format; it only adds the basic structure. The next/previous links are determined after rendering all main objects.
- Our custom HALJSONRenderer and GeoJSONRenderer classes support generators.
- The rendering classes perform json.dumps() calls on single records.
- The Response class is replaced by a StreamingResponse class.
Chunked JSON Rendering
The JSON response is written by selectively applying json.dumps(). The idea is to write some of the object layout manually, and leverage json.dumps() where possible. It basically looks like:
import json

def render_body(generator, footer_links):
    yield '{"_embedded": ['

    # write main listing:
    first = True
    for record in generator:
        if not first:
            yield ",\n"
        first = False
        yield json.dumps(record)

    yield "],\n"

    # write "_links" and "page" parts without the opening/closing braces:
    yield json.dumps(footer_links)[1:-1]

    yield "}\n"
Additional Optimizations
Output Buffering
To avoid too many back-and-forth calls between the response generator and the WSGI server, the produced output is submitted in chunks of 4096 bytes. Otherwise, an OS write() call might happen for a simple yield "}" statement.
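A sketch of such a buffering wrapper, assuming string chunks (the project's actual implementation may differ):

def buffered(chunks, buffer_size=4096):
    # Collect many small yields into fewer, larger writes.
    buffer = []
    size = 0
    for chunk in chunks:
        buffer.append(chunk)
        size += len(chunk)
        if size >= buffer_size:
            yield "".join(buffer)
            buffer.clear()
            size = 0
    if buffer:
        yield "".join(buffer)  # flush the remainder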
Next Link Optimization
A standard paginator would do an expensive COUNT(*) on the table, and use that to tell whether there are additional pages. We've optimized this by requesting one extra record from the database at the end of a page. This sentinel record is not rendered; its existence indicates that there is another page available.
This does mean the next/previous links have to be written at the end of the response, after all main objects have been seen. This is handled by writing the main JSON object in chunks. Fields like a "total page count" and "total results" are no longer available, but they are rarely needed either. Typically, clients only need a link to the next page. The ?_count=true query parameter can be provided when a client does need a result count.
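A minimal sketch of the sentinel trick, shown eagerly for brevity (names are illustrative; the real paginator integrates this with the chunked link rendering):

def fetch_page(queryset, offset, page_size):
    # Fetch one extra record as a sentinel instead of running COUNT(*).
    records = list(queryset[offset:offset + page_size + 1])
    has_next = len(records) > page_size
    return records[:page_size], has_next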
Error Handling
When an error occurs during streaming, it's not possible to send out an HTTP 500 "Internal Server Error" response. After all, the header (with HTTP 200 OK) is already submitted, and possibly even the content of the first few records. All the WSGI server can do is stop writing and close the connection. This leaves a confusing situation: either an unparsable JSON document, or an incomplete CSV export that might look finished. This is solved in two ways.
First, the peek_iterable() function takes a look at the first record in the generator. This triggers the database query execution, and any on-demand parsing (needed for the Content-Crs header). At this point, any raised exceptions still trigger an HTTP 500 error. Only then does the streaming response start.
Second, the response generator is wrapped inside a try..except block. When an error happens during the streaming phase, a proper message like /* aborted by exception ... */ can be written to the client.
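Combined, the approach looks roughly like this sketch (the actual peek_iterable() helper has a different shape; names here are illustrative):

def guarded_stream(chunks):
    iterator = iter(chunks)
    try:
        first = next(iterator)  # peek: errors here still become an HTTP 500
    except StopIteration:
        return iter(())  # nothing to stream

    def generate():
        try:
            yield first
            yield from iterator
        except Exception as exc:
            # Headers are already sent; leave a marker and stop writing.
            yield f"/* aborted by exception: {exc!r} */\n"

    return generate()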
Embedding Solution
While records are streamed one at a time, the embedding still needs to track all records to find out which related records should be fetched. Fortunately, those results only have to be written to the client after the first section of _embedded is written.
To solve this, an ObservableIterator wraps the QuerySet.iterator() and monitors which objects are written to the client. Meanwhile, it tracks all related object IDs in a list. Once the main objects are written to the client, all related identifiers are known, and can be queried at once.
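In essence, the wrapper behaves like this sketch (simplified; the project's real class offers more):

class ObservableIterator:
    def __init__(self, iterable, observers=()):
        self._iterator = iter(iterable)
        self._observers = list(observers)

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)
        for observer in self._observers:
            observer(item)  # e.g. append item.related_id to a list
        return item

An observer callback can append each record's foreign-key value to a list; once the main listing has streamed out, that list holds every identifier to fetch in a single query.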
For nested embedding, this isn't possible. Those objects have to be included in a nested _embedded section within the current section that is written to the client. Hence, those relations are queried directly, with some prefetching optimizations on the embedded section to avoid many repeated queries.
Prefetching Optimization
One problem with QuerySet.iterator() is that it's incompatible with QuerySet.prefetch_related(). This happens because prefetch_related() reads over the internal results to collect all identifiers that need to be "prefetched" with a single query.
To have the best of both worlds, the ChunkedQuerySetIterator avoids this problem by reading the table in chunks of 1000 records. For every batch, the records are prefetched and handed to the next generator. It also keeps the most recently retrieved prefetches, so the next batch likely doesn't need an extra prefetch query. But even when it does, this is still better than having no prefetching at all.
Also note that internally, Django's QuerySet.iterator() may still request 1000 records from the database cursor at once. Hence, the ChunkedQuerySetIterator also follows this pattern, and requests the exact same number of records.
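A simplified sketch of this idea, without the cache of earlier prefetches that the real class keeps:

from itertools import islice

from django.db.models import prefetch_related_objects

def chunked_prefetch_iterator(queryset, lookups, chunk_size=1000):
    # Fetch from the cursor in the same chunk size we prefetch with.
    iterator = queryset.iterator(chunk_size=chunk_size)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # One extra query per lookup per chunk, instead of per record.
        prefetch_related_objects(chunk, *lookups)
        yield from chunk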