Batch Processing: Job Queue Design and Dispatch

Hey All,

I’m trying to work out a good pattern for a Job Queue for processing and dispatch-style operations.

Scenario:

  • Every x interval (probably once a day)
  • Trigger a job which queues up a data set for processing
    • Data is too large to process in one function, so it needs to be split up.
  • Process each slice of data
  • Store some metric or result

From this scenario, these are the questions I’m trying to solve, mainly around how to ensure scalability:

  • How to trigger the queue spinup? This would queue up the processing of “slices”
    • My guess is that cron / scheduled functions would work well here, but:
  • How to represent the queued data?
  • Two options occur to me here:
    • Dump the data into Kinesis from a query in the trigger function?
      • Querying the data might be slow, and dumping the whole set might also be slow.
    • Dump a reference/pointer to the data set into SNS (the query params / an S3 location)
      • This could be generated quickly; instead of actually querying, just send “process page: 0…1…2”, etc.
      • A listening function would pick up the “page request” and process it
  • Should the result of the processing be written by the “page processor”?
    • Or do I dump the processed data back into Kinesis to be picked up by a “Writer” function?

Essentially, I’m looking to model something like this:

You’re on the right track.

A scheduled trigger function is a pattern that’s worked well for me. Just make sure you set up your DLQ (dead-letter queue).
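To make the DLQ bit concrete, here’s a minimal sketch of the setting itself using the Node.js AWS SDK. Normally you’d declare this in your deployment template rather than calling the API directly, and the function name and queue ARN below are just placeholders:

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Attach an existing SQS queue as the function's dead-letter queue.
// Function name and queue ARN are placeholders.
lambda.updateFunctionConfiguration({
    FunctionName: 'slice-processor',
    DeadLetterConfig: { TargetArn: 'arn:aws:sqs:us-east-1:123456789012:slice-dlq' }
}, (err) => {
    if (err) console.error('failed to set DLQ', err);
});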

Where are you querying from? Why are you worried it will be slow? Dumping stuff into Kinesis is not a bad idea, but I’d probably only do it after I’ve proved the alternative (i.e. no stream) is actually too slow.

How long is your processing expected to take? Keep in mind that you can increase the memory allocated to your functions to improve their performance even if they are not memory bound. The amount of memory you allocate also determines your function’s access to CPU, IO, and network capacity.
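Memory is just another knob on the function configuration. A quick sketch of bumping it with the Node.js SDK (again, the function name is a placeholder, and in practice you’d set this in your deployment config):

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// More memory also means proportionally more CPU and network for the function.
lambda.updateFunctionConfiguration({
    FunctionName: 'slice-processor',   // placeholder
    MemorySize: 1024
}, (err) => {
    if (err) console.error('failed to update memory', err);
});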

Where are you writing to? If it’s just S3, then I think a separate writer would be overkill; you’d have to worry about Kinesis message size, etc.

Thanks for the reply, it’s great to hear I’m on the right track!

I’ll need to look into DLQs.

As for the source, it’s external to my network and therefore out of my control, and it can sometimes be slow. I’ll have to keep an eye on it and adjust for any problems I see.

Processing needs some more evaluation but you’re right that a separate writer might be overkill.

OK. If there’s a speed mismatch (e.g. reading vs processing) then putting a stream in between is a good idea, but just watch out for message sizes. An alternative would be to use S3 to stage the data, e.g. the Reader saves to S3 and then sends an SNS message with the location to the Process function.
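A rough sketch of that staging pattern from the Reader side, assuming a staging bucket and an SNS topic you’ve already created (the bucket name, topic ARN, and the sliceId/sliceData variables are placeholders for whatever the Reader just fetched):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const sns = new AWS.SNS();

// Stage the fetched slice in S3, then tell the Process function where to find it.
const key = `staged/slice-${sliceId}.json`;
s3.putObject({ Bucket: 'my-staging-bucket', Key: key, Body: JSON.stringify(sliceData) }).promise()
    .then(() => sns.publish({
        // the Process function subscribes to this topic and reads the object at bucket/key
        TopicArn: 'arn:aws:sns:us-east-1:123456789012:process-slice',
        Message: JSON.stringify({ bucket: 'my-staging-bucket', key })
    }).promise());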

I would definitely go with a scheduled task to determine the number of pages of data to process and queue up the events. I would avoid Kinesis unless there’s a real need, simply because it adds complexity.

However, I don’t think SNS alone will give you what you want. If your entry-point function (the one called by the scheduled task) simply publishes an SNS notification for every slice, this will trigger all of the slices to be processed (practically) simultaneously, which isn’t a problem for Lambda, but it might be for your external source…

Instead, I would do something like:

// after you've determined numSlices by querying the external source
// (queueUrl, topicArn and the handler's callback are assumed to be in scope)
const AWS = require('aws-sdk');
const async = require('async');
const _ = require('lodash');
const sqs = new AWS.SQS();
const sns = new AWS.SNS();
async.each(
    _.range(0, numSlices),
    // one SQS message per slice; the iteratee must call done so async.each knows it has finished
    (slice, done) => sqs.sendMessage({ QueueUrl: queueUrl, MessageBody: JSON.stringify({ slice }) }, done),
    (err) => {
        if (err) return callback(err);
        // a single SNS notification kicks off the first processor invocation
        sns.publish({ TopicArn: topicArn, Message: 'start' }, callback);
    }
);

This way, you are only publishing one event on the SNS topic, so the processor function is only invoked once. That processor function would receiveMessage() from the SQS queue, which gives it the slice to process. Then, when it has finished its processing, it publishes another message to the SNS topic, which triggers the next invocation. If the SQS queue is empty, the function quits without publishing to SNS (this stops it from continuing to process indefinitely).
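Something like this is roughly what I mean by the processor, using async/await for brevity; queueUrl, topicArn and processSlice() are placeholders for your own wiring and processing logic:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const sns = new AWS.SNS();

exports.handler = async () => {
    // pull one slice from the queue; if the queue is drained, stop the chain here
    const { Messages } = await sqs.receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 1 }).promise();
    if (!Messages || Messages.length === 0) return;
    const msg = Messages[0];
    await processSlice(JSON.parse(msg.Body));   // your actual slice processing
    await sqs.deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle }).promise();
    // publish the "next" event so the processor gets invoked again for the next slice
    await sns.publish({ TopicArn: topicArn, Message: 'next' }).promise();
};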

The other problem you might have, which you haven’t mentioned, is how to collate the results of all of the slices into a single result set at the end. It might be that, given the way you are storing those results, it actually doesn’t matter. But if not, you can store the partial results from each slice in a known location (DynamoDB or S3), and then, once the SQS queue is empty, do the collation there (probably in a separate function that is triggered by publishing to a different SNS topic).
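If you do need that collation step and the partial results are JSON objects in S3, the collator could look roughly like this (the bucket name, prefix, and the merge itself are placeholders, and pagination of the S3 listing is ignored for brevity):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async () => {
    // list the partial results written by the slice processors
    const { Contents = [] } = await s3.listObjectsV2({ Bucket: 'my-results-bucket', Prefix: 'partial/' }).promise();
    const parts = await Promise.all(Contents.map(obj =>
        s3.getObject({ Bucket: 'my-results-bucket', Key: obj.Key }).promise()
            .then(res => JSON.parse(res.Body.toString()))
    ));
    const combined = parts.flat();              // stand-in for your real merge/aggregation
    await s3.putObject({
        Bucket: 'my-results-bucket',
        Key: 'final/result.json',
        Body: JSON.stringify(combined)
    }).promise();
};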