A scheduled trigger function is a pattern that’s worked well for me. Just make sure you set up your DLQ (dead-letter queue).
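For reference, the DLQ catches invocations that still fail after Lambda’s retries, so you don’t silently lose events. A minimal sketch of wiring one up with the Node AWS SDK, assuming you already have an SQS queue to use as the target (the function name and queue ARN below are placeholders):

// Attach a dead-letter queue to an existing Lambda function.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.updateFunctionConfiguration({
  FunctionName: 'my-scheduled-trigger',         // placeholder function name
  DeadLetterConfig: { TargetArn: dlqQueueArn }, // ARN of an SQS queue (or SNS topic) you own
}, (err) => { if (err) throw err; });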
Where are you querying from? Why are you worried it will be slow? Dumping stuff into Kinesis is not a bad idea, but I’d probably only do it after I’ve proved the alternative (i.e. no stream) is actually too slow.
How long is your processing expected to take? Keep in mind that you can increase the memory allocated to your functions to improve their performance even if they are not memory bound. The amount of memory you allocate also determines your function’s access to CPU, IO, and network capacity.
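If it turns out you do need more headroom, raising the memory setting is a single configuration change. You’d normally do it in your deployment config, but as a sketch with the Node AWS SDK (function name is again a placeholder):

// Raise the memory allocation; CPU, I/O, and network share scale with it.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.updateFunctionConfiguration({
  FunctionName: 'my-processor', // placeholder function name
  MemorySize: 1024,             // in MB
}, (err) => { if (err) throw err; });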
Where are you writing to? If it’s just S3 then I think a separate writer would be overkill - you’d have to worry about Kinesis message size limits, etc.
Thanks for the reply, it’s great to hear I’m on the right track!
I’ll need to look into DLQs.
The source is external to my network and therefore out of my control, and it can sometimes be slow. I’ll have to keep an eye on it and adjust for any problems I see.
Processing needs some more evaluation but you’re right that a separate writer might be overkill.
OK. If there’s a speed mismatch (e.g. reading vs processing) then putting a stream in between is a good idea, but just watch out for message sizes. An alternative would be to use S3 to stage the data, e.g. the Reader saves to S3 and then sends an SNS message with the object’s location to the Process function.
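Roughly, that staging pattern could look like this (Node AWS SDK; the bucket name, key scheme, and topic ARN are placeholders):

// Reader: stage the fetched data in S3, then tell the Process function where to find it.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const sns = new AWS.SNS();

const key = `staged/${Date.now()}.json`; // placeholder key scheme
s3.putObject({ Bucket: stagingBucket, Key: key, Body: JSON.stringify(data) }, (err) => {
  if (err) throw err;
  // The Process function is subscribed to this topic and reads the object back from S3.
  sns.publish({
    TopicArn: processTopicArn,
    Message: JSON.stringify({ bucket: stagingBucket, key }),
  }, (err) => { if (err) throw err; });
});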
I would definitely go with using a scheduled task to determine the number of pages of data to process and queue up the events. I would avoid Kinesis unless there is a clear need, simply because it adds complexity.
However, I don’t think SNS alone will give you what you want. If your entry point function (the one called by the scheduled task) simply publishes an SNS notification for every slice, this will trigger all of the slices to be processed (practically) simultaneously, which isn’t a problem for Lambda, but it might be for your external source…
Instead, I would do something like:
// after you've determined the number of slices by querying the external source.
// (sliceQueueUrl and processTopicArn are placeholders for your SQS queue URL and SNS topic ARN)
const AWS = require('aws-sdk');
const async = require('async');
const _ = require('lodash');
const sqs = new AWS.SQS();
const sns = new AWS.SNS();

async.each(
  _.range(0, numSlices),
  // queue one SQS message per slice
  (slice, done) => sqs.sendMessage({ QueueUrl: sliceQueueUrl, MessageBody: JSON.stringify({ slice }) }, done),
  // once every slice is queued, publish a single SNS event to kick off the first processor invocation
  () => sns.publish({ TopicArn: processTopicArn, Message: 'start' }, () => {})
);
This way, you are only publishing one event on the SNS topic, so the processor function is only invoked once. That processor function calls receiveMessage() on the SQS queue, which gives it the slice to process. When it has finished processing, it publishes another message to the SNS topic, which triggers the next invocation. If the SQS queue is empty, the function quits without publishing to SNS (this stops it from continuing to process indefinitely).
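As a rough sketch, that processor could look something like this (the queue URL and topic ARN are placeholders, and processSlice is a stand-in for your actual per-slice work):

// Processor: pull one slice from SQS, process it, then re-trigger itself via SNS.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const sns = new AWS.SNS();

exports.handler = (event, context, callback) => {
  sqs.receiveMessage({ QueueUrl: sliceQueueUrl, MaxNumberOfMessages: 1 }, (err, data) => {
    if (err) return callback(err);
    if (!data.Messages || data.Messages.length === 0) return callback(); // queue empty: stop the chain

    const msg = data.Messages[0];
    processSlice(JSON.parse(msg.Body), (err) => { // placeholder for the real per-slice work
      if (err) return callback(err);
      sqs.deleteMessage({ QueueUrl: sliceQueueUrl, ReceiptHandle: msg.ReceiptHandle }, (err) => {
        if (err) return callback(err);
        // Publish to the same topic to trigger the next invocation for the next slice.
        sns.publish({ TopicArn: processTopicArn, Message: 'next' }, callback);
      });
    });
  });
};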
The other problem you might have, which you haven’t mentioned, is how to collate the results of all of the slices into a single result set at the end. Depending on how you are storing those results, it might not actually matter. But if it does, you can store the partial results from each slice in a known location (DynamoDB or S3) and then, when the SQS queue is empty, do the collation there (probably in a separate function that is triggered by publishing to a different SNS topic).
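For the collation step, a minimal sketch assuming the partial results were written to S3 under a common prefix (the bucket, prefix, output key, and the assumption that each partial is an array of records are all placeholders):

// Collator: list the partial-result objects and merge them into one result set.
const AWS = require('aws-sdk');
const async = require('async');
const s3 = new AWS.S3();

exports.handler = (event, context, callback) => {
  s3.listObjectsV2({ Bucket: resultsBucket, Prefix: 'partials/' }, (err, listing) => {
    if (err) return callback(err);
    async.map(
      listing.Contents,
      // fetch and parse each partial result
      (obj, done) => s3.getObject({ Bucket: resultsBucket, Key: obj.Key }, (err, data) =>
        done(err, err ? null : JSON.parse(data.Body.toString()))),
      (err, partials) => {
        if (err) return callback(err);
        const combined = [].concat(...partials); // assumes each partial is an array of records
        s3.putObject({ Bucket: resultsBucket, Key: 'combined.json', Body: JSON.stringify(combined) }, callback);
      }
    );
  });
};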