Processing large data sets stored on S3 - Architecture?

Hi all,

We’re currently evaluating whether to implement our cloud data processing architecture on containers or on functions. Our design objectives are:

  1. Fast processing of small data sets (i.e. response time must be well under 5 s, which means spinning up normal VMs doesn’t work well)
  2. Processing of large data sets (currently up to 50 GB), which are stored in S3 as fragments and indexed via Elasticsearch, needs to be possible
  3. Parallel Processing to bring down total processing time is highly desirable
  4. Support custom transformation functions (users of our system can write their own Groovy scripts or Java functions)

Our use cases include simple conversion and restructuring of data, but also join and merge operations, potentially executed over very large numbers of objects. In other words, executing the entire process within a single function and its limits (time, memory) doesn’t make sense. We currently have processes that run for 30 minutes or longer on normal 8 vCPU, 32 GB RAM servers.

I saw the data processing example for the serverless framework. Would that be a reasonable starting point given the criteria above? Do you have any additional hints or suggestions on the architecture for our scenario?

Thanks in advance,
Thorsten

Yep, that definitely sounds like a good use-case for a Lambda-based application, as long as you can divvy up your work into less-than-5-minute chunks of processing. Assuming you can, it would also get you the parallel processing benefits you are after.
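For illustration, fanning the chunks out is just a matter of invoking a worker function asynchronously, once per chunk. A rough sketch in TypeScript with the AWS SDK (the function name and the payload shape are placeholders, not anything you have to use):

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// Fan a batch of work out to parallel, asynchronous Lambda invocations.
// "process-chunk" and the payload shape are hypothetical placeholders.
export async function fanOut(chunks: string[][]): Promise<void> {
  await Promise.all(
    chunks.map((chunk) =>
      lambda.send(
        new InvokeCommand({
          FunctionName: "process-chunk", // hypothetical worker function
          InvocationType: "Event",       // async: fire and forget
          Payload: Buffer.from(JSON.stringify({ keys: chunk })),
        })
      )
    )
  );
}
```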

Not sure about your point 4, as the scope of that could be quite large (or not), depending on your hard requirements.

I haven’t used the data processing example myself, but technically the things you’re after are achievable.
Another option might be the recently released AWS Batch.
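For comparison, once a job queue and job definition exist, handing a long-running job to AWS Batch is a single API call. A rough sketch in TypeScript with the AWS SDK (all of the names here are hypothetical):

```typescript
import { BatchClient, SubmitJobCommand } from "@aws-sdk/client-batch";

const batch = new BatchClient({});

// Submit one long-running processing job to a pre-created queue and
// job definition; Batch schedules it onto managed compute.
export async function submitMergeJob(inputPrefix: string): Promise<void> {
  await batch.send(
    new SubmitJobCommand({
      jobName: "merge-large-dataset",      // hypothetical
      jobQueue: "data-processing-queue",   // hypothetical, created beforehand
      jobDefinition: "transform-job:1",    // hypothetical, registered beforehand
      containerOverrides: {
        environment: [{ name: "INPUT_PREFIX", value: inputPrefix }],
      },
    })
  );
}
```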

Rowanu is right: you have to be able to divvy up your work for it to be worthwhile. This is the important part, because the kind of solution you will build for S3 bulk processing is inherently different from a standard batch process. The tools and scale available to you are different, and the kind of performance improvements you might make in one would be an anti-pattern in the other.

I recommend you consider making heavy use of S3 event triggers (fired on object creation/removal), specifically using prefix rules to determine when specific Lambda functions should fire. This lets you send a file through a lifecycle: process it, then save it back to S3 under a different prefix, causing another event to fire and begin the next processing step. With the incredibly low-latency access between Lambda and S3, breaking your file up into chunks (which Lambda can do very quickly as well) and having each chunk worked in parallel will be limited only by your Lambda restrictions.
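A rough sketch of one step in that lifecycle, in TypeScript with the AWS SDK. It assumes the function is subscribed to object-created events with an "incoming/" prefix rule; the prefixes and the trivial transform are placeholders:

```typescript
import { S3Event } from "aws-lambda";
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// One step of the lifecycle: fires when an object lands under "incoming/",
// processes it, and writes the result under "step2/", which in turn fires
// whatever function is subscribed to that prefix.
export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // S3 event keys are URL-encoded.
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const body = await obj.Body!.transformToString();

    const transformed = body.toUpperCase(); // placeholder for real processing

    await s3.send(new PutObjectCommand({
      Bucket: bucket,
      Key: key.replace(/^incoming\//, "step2/"), // new prefix = next trigger
      Body: transformed,
    }));
  }
};
```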

Last note: many of your Lambda restrictions are soft limits that can be raised by requesting an increase from AWS, though a few (including the 5 minute execution time) are hard limits: http://docs.aws.amazon.com/lambda/latest/dg/limits.html

rowanu, matt, thanks for your responses and insights! In particular, I didn’t yet know about AWS Batch.

We can definitely split up execution into small units; our current engine already works like that (but on a single machine). We’ll have to find the right batch/fragment sizes, of course. One of the challenges is that the amount of data processed by any given process can vary widely, from a few hundred kilobytes to multiple gigabytes. Our users expect the system to return first results very quickly (it’s an interactive transformation environment), and then later to receive the results of the entire transformed data set.

Based on what we’ve tried out so far, we’re looking at a combination of divide-and-conquer and iterative processing, along the lines of the following, using a merge use case (combining many objects into one):

  1. Client triggers type transformation via request to AWS API Gateway (Trigger Event)
  2. API Gateway triggers a Lambda function to partition the request; the function uses indexes (e.g. Elasticsearch) to determine the affected features and partitions them, then triggers a request to API Gateway’s transform endpoint with a partitionMap parameter (could also be based on an S3 event trigger as suggested by Matt)
  3. 1…n transform functions execute, each one retrieving fragments from S3 and merging its partition of features. If a function doesn’t manage to complete in time, it writes a partial result and invokes the API Gateway’s transform endpoint with a new, smaller partitionMap parameter that only includes the features not yet processed (see the sketch after this list). In the case of a merge, it might make sense to also pass a list of unique values identified beforehand (specific to merge cases).
  4. When all partitions have been processed (monitoring could happen via AWS SQS, I assume), the individual results are merged in a final call to transform (using an iterative approach in case the execution time is not sufficient).
  5. When the transform function finishes and there are no further features left, the function triggers a callback (several options for that) and delivers the location of the final result.
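To make step 3 concrete, here is a rough sketch of the continuation idea in TypeScript with the AWS SDK. The payload shape, bucket name, 30-second safety margin, and the shortcut of re-invoking the function directly (instead of going back through API Gateway) are all assumptions for illustration:

```typescript
import { Context } from "aws-lambda";
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const lambda = new LambdaClient({});
const s3 = new S3Client({});

interface TransformRequest {
  jobId: string;
  partitionMap: string[]; // hypothetical: ids of features still to process
}

// Hypothetical transform worker: process features until the time budget
// runs low, write a (partial) result, then hand the remainder to a fresh
// invocation of the same function.
export const handler = async (event: TransformRequest, context: Context) => {
  const remaining = [...event.partitionMap];
  const results: string[] = [];

  // Keep a safety margin before the hard timeout (30 s here, arbitrarily).
  while (remaining.length > 0 && context.getRemainingTimeInMillis() > 30_000) {
    const featureId = remaining.shift()!;
    results.push(await transformFeature(featureId));
  }

  // Persist whatever was finished, partial or not, for the final merge step.
  await s3.send(new PutObjectCommand({
    Bucket: "results-bucket", // hypothetical
    Key: `partial/${event.jobId}/${Date.now()}.json`,
    Body: JSON.stringify(results),
  }));

  if (remaining.length > 0) {
    // Continue with a smaller partitionMap in a new invocation.
    await lambda.send(new InvokeCommand({
      FunctionName: context.functionName,
      InvocationType: "Event",
      Payload: Buffer.from(JSON.stringify({ ...event, partitionMap: remaining })),
    }));
  }
};

// Stand-in for the real merge/transform work.
async function transformFeature(id: string): Promise<string> {
  return id;
}
```

The exact margin doesn’t matter; the point is to stop early enough to write the partial result and hand the rest to a fresh invocation.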

I admit I completely forgot about AWS Batch. I’ve been a bit too deep in the serverless/Lambda world and I may have fallen victim to the old saying, “to a hammer, every problem looks like a nail.”

  1. I imagine this is a two-step process: first they send you a file (direct to S3), then they trigger the process via an API call? Or do you already have the data and are just starting the processing?
  2. I would definitely drop the additional API call if possible. It’s extra overhead, both maintenance-wise and execution-time-wise.
  3. Without knowing your business problem, it sounds like there may be an opportunity to break up your problem even further, possibly by making the first process break the file into smaller chunks as a single operation. Files can be written to S3 fairly quickly, and a scheme could be created to break up files and drop them into the same bucket, so if a file was enormous, it would get broken into manageable sizes before proceeding to the next step. If this really took a long time, I would recommend requesting a limit increase from AWS where possible, so you don’t end up having to get too clever around this kind of overhead. Each transformation step should, ideally, also be isolated to a single function, so you really take advantage of what AWS Lambda brings to the table. Just my 2c though…
  4. Alternatively, you could start the merge process as files are dropped into the ‘final’ bucket. I.e., each time a file is dropped, you rename the files so no other process grabs them, then read all the files in the folder, merge them together, and write the larger chunk back into the folder. As long as the process only does work when there is more than one file, this should work (see the sketch after this list).
  5. This is harder to detect, because it’s kind of just a pipe with files flowing through it; you don’t know that the water was ‘turned off’ on the other side. One way around that is to have some sort of trace record flow through the process, keeping track of the total number of chunks to process and ignored by everything except the last step, which records it in the final file. Then your last process that merges the files together can send the file on to the next step.
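A rough sketch of the rolling merge from point 4, in TypeScript with the AWS SDK. The bucket layout and ‘final/’ prefix are made up, and it skips the rename-for-locking step described above, so treat it as a starting point rather than a concurrency-safe implementation:

```typescript
import { S3Event } from "aws-lambda";
import {
  S3Client,
  ListObjectsV2Command,
  GetObjectCommand,
  PutObjectCommand,
  DeleteObjectCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Rolling merge: each time a chunk lands under "final/", merge everything
// currently there into one larger object. Only does work when there is
// more than one file, so the stream of events eventually converges.
export const handler = async (event: S3Event): Promise<void> => {
  const bucket = event.Records[0].s3.bucket.name;

  const listing = await s3.send(
    new ListObjectsV2Command({ Bucket: bucket, Prefix: "final/" })
  );
  const keys = (listing.Contents ?? []).map((o) => o.Key!);
  if (keys.length < 2) return; // nothing to merge yet

  const parts: string[] = [];
  for (const key of keys) {
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    parts.push(await obj.Body!.transformToString());
  }

  // Write the merged chunk back under the same prefix (firing this again)...
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: `final/merged-${Date.now()}.json`,
    Body: parts.join("\n"),
  }));

  // ...and delete the inputs so they are not merged twice.
  for (const key of keys) {
    await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
  }
};
```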

This is all armchair quarterbacking though; you know the problem better, so take my suggestions with a grain of salt. A couple of things I do want to share, having done some batch processing with Lambda previously: first, make sure that you keep your steps and functions to a SINGLE purpose. This isn’t just for maintainability, but also for performance. Second, when working with files, Node.js streams are the way to go (or the language equivalent); I’ve read and processed enormous amounts of data with them. :slight_smile:
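As an example of that last point, a minimal sketch of streaming a large object out of S3 line by line, in TypeScript with the AWS SDK (the per-line processing is just a placeholder):

```typescript
import { createInterface } from "node:readline";
import { Readable } from "node:stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Stream a large S3 object line by line instead of buffering it all in
// memory, so even multi-gigabyte files stay within Lambda's memory limit.
export async function processLargeObject(bucket: string, key: string): Promise<number> {
  const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const lines = createInterface({ input: obj.Body as Readable });

  let count = 0;
  for await (const line of lines) {
    // ...transform the line here; this placeholder just counts them...
    if (line.length > 0) count += 1;
  }
  return count;
}
```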

Matt