Processing a Large Dataset Stored in S3/Redshift - Any Preferred Architecture?

Hi All,

We have a system where we receive new data every day, process it on Amazon EMR, and write the processed output to S3 (one file per day) and to Redshift as well. Over time this has grown to TBs of data.

Now we want to build a system that takes the historical data (from the earliest date to today, say 5 years' worth) from either S3 or Redshift, does further processing, and pushes the enriched/processed data to S3 (with a limitation of 50M entries per file) and to Redshift as updates to the old data (each record has 30 columns, one of which is a state column, e.g. "UPDATED"). Before sending a record as an update, we have to make sure the field values actually differ between the old entry and the new one; only if there is a difference do we send the update, otherwise we ignore that record and the old entry stays as it is, in the "NEW" state.
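To make the "only update when something actually changed" check concrete, here is a rough PySpark sketch of what I have in mind (the paths and column names are placeholders, and I assume every non-key column should be compared):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("detect-changes").getOrCreate()

# Placeholder paths: current state vs. freshly enriched records.
old_df = spark.read.parquet("s3://my-bucket/processed/current/")
new_df = spark.read.parquet("s3://my-bucket/processed/incoming/")

compare_cols = [c for c in new_df.columns if c not in ("pkid", "state")]

joined = old_df.alias("o").join(new_df.alias("n"), on="pkid", how="inner")

# A record counts as changed if any compared column differs (null-safe comparison).
diff_cond = None
for c in compare_cols:
    cond = ~F.col("o." + c).eqNullSafe(F.col("n." + c))
    diff_cond = cond if diff_cond is None else (diff_cond | cond)

updates = (joined
           .where(diff_cond)
           .select("pkid", *[F.col("n." + c).alias(c) for c in compare_cols])
           .withColumn("state", F.lit("UPDATED")))

# Unchanged records are dropped here, so the old row keeps its "NEW" state.
updates.write.mode("overwrite").parquet("s3://my-bucket/processed/updates/")
```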

Processing the whole historical dataset will be very expensive, so we want to build a layer between the source and the processing layer to collapse similar entries. Say the processing component takes a pkid and a description: we build an intermediate component that produces the pkid, a simplified description, and a unique id (md5 of the simplified description), and then we pass only the MD5 hash and the simplified description to the processing component.

Example:
(note: we have two different entries, but they produce the same simplified description after running the intermediate module)

  1. PKID1^$^Krish purchased KFC on 2017-08-06 - 0000451
  2. PKID2^$^Krish purchased KFC on 2018-08-13 - 0000678

My intermediate module produces

  1. PKID1^&^Krish purchased KFC on DATE - NUMBER^$^MD5_Hash1
  2. PKID2^&^Krish purchased KFC on DATE - NUMBER^$^MD5_Hash1

Here we have two different entries in the input, but the output says both have the same simplified description. So we can give only one entry (Md5_Hash^$^Simplified Description) to the processing component instead of two; this reduces the computation time, but it requires a reverse mapping later from the MD5 hash back to the PKIDs.
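Roughly, the intermediate module would look like this in PySpark (the normalization rules, paths, and column names are placeholders for illustration; the real simplification logic would be whatever makes two descriptions "similar"):

```python
import re

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("intermediate-module").getOrCreate()

def simplify(description):
    """Mask the variable parts (dates, numbers) so that
    similar descriptions collapse to the same string."""
    if description is None:
        return None
    simplified = re.sub(r"\d{4}-\d{2}-\d{2}", "DATE", description)
    simplified = re.sub(r"\b\d+\b", "NUMBER", simplified)
    return simplified

simplify_udf = F.udf(simplify, T.StringType())

raw = spark.read.parquet("s3://my-bucket/daily/")  # assumed columns: pkid, description

with_hash = (raw
             .withColumn("simplified_description", simplify_udf("description"))
             .withColumn("unique_id", F.md5("simplified_description")))

# Keep the pkid -> unique_id mapping for the reverse step later.
with_hash.select("pkid", "unique_id") \
         .write.mode("overwrite").parquet("s3://my-bucket/mapping/")

# Only one row per unique_id goes to the expensive processing component.
with_hash.select("unique_id", "simplified_description") \
         .dropDuplicates(["unique_id"]) \
         .write.mode("overwrite").parquet("s3://my-bucket/to_process/")
```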

This way we reduce the data going to the processing module by giving it only the unique id and the simplified description. Once processing completes (producing the unique id plus a set of new fields), we do the reverse mapping: we produce the pkid-to-unique-id mapping, take the base data from S3 and the new fields from the processed set, and create a record that is treated as an update to the old entry in S3 or Redshift. We will also try to maintain a lookup here of unique id to processed data.
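The reverse mapping would then be a couple of joins, something like this (again, paths and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reverse-mapping").getOrCreate()

mapping   = spark.read.parquet("s3://my-bucket/mapping/")        # pkid, unique_id
processed = spark.read.parquet("s3://my-bucket/processed_out/")  # unique_id + new fields
base      = spark.read.parquet("s3://my-bucket/daily/")          # original records keyed by pkid

# Fan the processed fields back out to every pkid that shares the hash,
# then attach the base data to build the full update record.
enriched = (mapping
            .join(processed, on="unique_id", how="inner")
            .join(base, on="pkid", how="inner")
            .withColumn("state", F.lit("UPDATED")))

enriched.write.mode("overwrite").parquet("s3://my-bucket/enriched/")
```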

We want to process the whole historical dataset in batches (our thinking is that this is the only way we can get through such a huge dataset and complete the whole run in 10-20 days).

Every time a new batch comes in, we first run the intermediate component that produces the pkid, simplified description, and md5 hash, before giving the data to the processing module. We then check the lookup: if a matching md5 hash is already present (meaning it was already processed), there is no need to send that entry to the processing module; we just take the processed values from the lookup for that hash, take the base data from the batch, and wait for the entries that are not in the lookup to finish processing. Once processing completes, we refresh the lookup with the new entries as well, and repeat the same steps for every batch. This further reduces the computation time, but I am not sure what will happen as the lookup data grows - will this approach scale?
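What I have in mind for the per-batch lookup is a left anti join against a lookup table keyed by the hash, kept as Parquet on S3 rather than in memory so it can keep growing. A sketch, with assumed paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-with-lookup").getOrCreate()

# Assumed paths: one batch of hashed input, plus the growing lookup table.
batch  = spark.read.parquet("s3://my-bucket/to_process/batch_0042/")  # unique_id, simplified_description
lookup = spark.read.parquet("s3://my-bucket/lookup/")                 # unique_id + processed fields

# Hashes already in the lookup: reuse their processed values directly
# (these get merged with the batch's base data, no reprocessing needed).
already_done = batch.join(lookup, on="unique_id", how="inner")

# Hashes never seen before: only these go to the expensive processing module.
to_process = batch.join(lookup, on="unique_id", how="left_anti")
to_process.write.mode("overwrite").parquet("s3://my-bucket/pending/batch_0042/")

# ... run the processing module on that output, producing new (unique_id, fields) rows ...

# Refresh the lookup so the next batch can skip these hashes too.
# (Write to a new path: Spark cannot overwrite a dataset it is still reading from.)
newly_processed = spark.read.parquet("s3://my-bucket/processed_out/batch_0042/")
lookup.unionByName(newly_processed) \
      .dropDuplicates(["unique_id"]) \
      .write.mode("overwrite").parquet("s3://my-bucket/lookup_v2/")
```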

Once the whole run completes, all the historical data has been processed and the results are in S3 and in Redshift as well.

The next time we want to do another historical run with new models, we need to consume all the historical data again, but only the latest version of each record for further processing (i.e. the record with the latest state, which can be either "NEW" or "UPDATED").
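Picking only the latest version of each record could be a window over pkid, ordered by whatever column marks recency (here I assume an updated_at timestamp exists):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("latest-records").getOrCreate()

# Assumed path holding every version of every record (states NEW and UPDATED).
history = spark.read.parquet("s3://my-bucket/enriched_history/")

# Assumes an updated_at column (or similar) that orders versions of the same pkid.
w = Window.partitionBy("pkid").orderBy(F.col("updated_at").desc())

latest = (history
          .withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") == 1)
          .drop("rn"))

latest.write.mode("overwrite").parquet("s3://my-bucket/latest_snapshot/")
```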

Looking for some architecture guidance here.

Thanks in Advance,
Krish