Mass curling URLs and saving content to S3


#1

I have to build an architecture that curls all the HTML pages on my prelive environment and saves the static content to S3.

I need to limit the number of parallel curl requests running against my prelive servers, to avoid effectively DDoSing my prelive environment.

As it might take a while (thousands of URLs), I want to give the user feedback on the progress of the curl requests.

My idea was to post the URLs to a master Lambda function that fans out the curl requests to worker Lambdas. Unfortunately this gives the user no feedback, as a Lambda invocation does not return a response until it has finished.

Any ideas on how I might do this in a serverless environment, or how it might fit into a serverless/hybrid architecture?


#2

I would start with a DynamoDB table and add one record for each URL you want downloaded. Next, attach a Lambda to the table's stream that writes each new URL to SQS, and add a second Lambda that reads from SQS but has a cap on its number of parallel invocations. The Lambda reading from SQS handles fetching the URL, and if it discovers a new URL in the page it can add it to the DynamoDB table.
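A minimal sketch of that SQS-consuming fetcher Lambda might look like the following. The bucket name, message shape (`{"url": ...}`), and key-derivation scheme are all assumptions, not anything prescribed by AWS:

```python
import json
import urllib.parse


def s3_key_for(url):
    """Derive a deterministic S3 object key from a page URL."""
    parts = urllib.parse.urlparse(url)
    path = parts.path.lstrip("/") or "index.html"  # map "/" to index.html
    return f"{parts.netloc}/{path}"


def handler(event, context):
    # boto3 is available in the Lambda runtime; imported here so the
    # key-derivation helper above can be tested without AWS access.
    import boto3
    import urllib.request

    s3 = boto3.client("s3")
    for record in event["Records"]:            # one SQS message per URL
        body = json.loads(record["body"])
        html = urllib.request.urlopen(body["url"], timeout=10).read()
        s3.put_object(
            Bucket="prelive-snapshots",        # bucket name is an assumption
            Key=s3_key_for(body["url"]),
            Body=html,
            ContentType="text/html",
        )
```

To enforce the parallelism cap, you can set reserved concurrency on this function (or, on the SQS event source mapping, a maximum concurrency) so that no more than, say, 10 copies hit your prelive servers at once.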

I would also add a job number to each URL record in the DynamoDB table. You now have a few options:

  1. You could query the table for all of the URLs in a particular job and look at the status of each to figure out how many have been completed. The Lambda fetching the pages would need to update the status appropriately.
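Option 1 could be sketched like this, assuming the table has `job_id` as a partition key and a `status` attribute (both names are assumptions):

```python
def progress(items):
    """Compute (completed, total) from a job's URL records."""
    done = sum(1 for item in items if item.get("status") == "COMPLETE")
    return done, len(items)


def job_progress(table_name, job_id):
    # Query all URL records for one job and count the completed ones.
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table(table_name)
    resp = table.query(KeyConditionExpression=Key("job_id").eq(job_id))
    return progress(resp["Items"])
```

Note that a query scales with the number of URLs in the job, so for very large jobs option 2 below is cheaper to poll.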

  2. You could create a job record in DynamoDB and have the two Lambdas increment counters on it: one for the total number of URLs in the job and one for the number of completed URLs.
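For option 2, DynamoDB's `UpdateItem` with an `ADD` expression gives you atomic counters, so the two Lambdas can increment safely without reading first. A sketch (table and attribute names are assumptions):

```python
def counter_update(job_id, attr):
    """Build UpdateItem arguments for an atomic +1 on a counter attribute."""
    return {
        "Key": {"job_id": job_id},
        "UpdateExpression": "ADD #c :one",
        "ExpressionAttributeNames": {"#c": attr},
        "ExpressionAttributeValues": {":one": 1},
    }


def bump(table_name, job_id, attr):
    # Stream Lambda calls this with "urls_total"; fetcher Lambda with
    # "urls_complete" after a successful upload.
    import boto3

    table = boto3.resource("dynamodb").Table(table_name)
    table.update_item(**counter_update(job_id, attr))
```

Your progress endpoint then reads a single job record and reports `urls_complete / urls_total`, which stays cheap no matter how many URLs the job contains.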