rowanu, matt, thanks for your responses and insights! In particular, I didn't yet know about AWS Batch.
We can definitely split execution into small units; our current engine already works like that (but on a single machine). We'll have to find the right batch/fragment sizes, of course. One of the challenges is that the amount of data a single process handles varies widely, from a few hundred kilobytes to multiple gigabytes. Our users expect the system to return first results very fast (it's an interactive transformation environment) and then, later, to receive the results for the entire transformed data set.
Based on what we've tried out so far, we're looking at a combination of divide-and-conquer and iterative processing. Using a merge (combine many objects into one) as the example use case, it would go roughly like this:
- Client triggers type transformation via request to AWS API Gateway (Trigger Event)
- API Gateway triggers a Lambda function to partition the request. The function uses indexes (e.g. Elasticsearch) to determine the affected features and partitions them, then sends a request to API Gateway's transform endpoint with a partitionMap parameter (this could also be based on an update trigger, as suggested by Matt)
- 1..n transform functions execute, each retrieving fragments from S3 and merging its partition of features. If a function doesn't manage to complete in time, it writes a partial result and invokes API Gateway's transform endpoint with a new, smaller partitionMap parameter that includes only the features not yet processed. For merges specifically, it might make sense to also pass along the list of unique values identified so far.
- When all partitions have been processed (monitoring could happen via AWS SQS, I assume), the individual results are merged in a final call to transform (using an iterative approach in case the execution time is not sufficient).
- When the transform function finishes and there are no further features left, the function triggers a callback (several options for that) and delivers the location of the final result.
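
To make the flow above a bit more concrete, here's a rough local sketch in Python of the partition / transform / re-invoke loop. It is only an illustration under assumptions, not our actual implementation: all names (`partition`, `transform`, `run`, `make_budget`, `load_value`) are made up, an in-memory queue stands in for the round trips through API Gateway, and a simple call counter stands in for the Lambda time limit (on Lambda itself, the check could probably be based on `context.get_remaining_time_in_millis()`). The final merge is simplified to a plain set union instead of another iterative transform call.

```python
from collections import deque


def partition(feature_ids, size):
    """Split the affected feature ids into partitions of at most `size`."""
    ids = list(feature_ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def make_budget(n):
    """Hypothetical stand-in for the Lambda timeout: allows n units of work."""
    ticks = iter(range(n))
    return lambda: next(ticks, None) is not None


def transform(partition_map, seen_values, load_value, has_time_left):
    """Merge as many features as the time budget allows.

    Returns (unique values identified so far, features not processed yet).
    """
    seen = set(seen_values)
    for i, feature_id in enumerate(partition_map):
        if not has_time_left():
            # Out of time: hand back the partial result and the remainder.
            return seen, partition_map[i:]
        seen.add(load_value(feature_id))
    return seen, []


def run(feature_ids, size, load_value, budget):
    """Local simulation of the whole flow; returns (merged values, invocations)."""
    if budget < 1:
        raise ValueError("budget must allow at least one feature per invocation")
    # Each queue entry plays the role of one request to the transform endpoint.
    queue = deque((p, set()) for p in partition(feature_ids, size))
    partials = []
    invocations = 0
    while queue:
        partition_map, seen = queue.popleft()
        invocations += 1
        seen, rest = transform(partition_map, seen, load_value, make_budget(budget))
        if rest:
            # Re-invoke with a smaller partitionMap plus the values seen so far.
            queue.append((rest, seen))
        else:
            partials.append(seen)
    # Final merge of the individual partition results (simplified).
    return set().union(*partials), invocations


if __name__ == "__main__":
    merged, invocations = run(range(10), 4, lambda fid: fid % 3, 3)
    print(sorted(merged), invocations)
```

Passing the already-identified unique values along with the smaller partitionMap means a re-invoked function doesn't have to re-read its own partial result, at the cost of a larger request payload; whether that trade-off holds up for multi-gigabyte inputs is something we'd still have to test.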