I need to build an architecture that curls all the HTML pages on my prelive environment and saves the static content to S3.
I need to limit the number of parallel curl requests running against my prelive servers to avoid DDoSing my prelive environment.
As it might take a while (thousands of URLs), I want to give the user feedback on the progress of the curl requests.
My idea was to post the URLs to a master Lambda function that fans out the curl requests to worker Lambdas. Unfortunately, this gives the user no feedback, because a Lambda invocation does not return a response until it has finished.
Any ideas on how I might do this in a serverless environment, or how it might fit within a serverless/hybrid architecture?
I would start with a DynamoDB table and add one record for each URL you want downloaded. Next, attach a Lambda to the table's stream that writes each new URL to an SQS queue, and a second Lambda that reads from SQS but has a limit on the number of parallel invocations (e.g. reserved concurrency). The Lambda reading from SQS handles fetching the URL; if it finds a new URL while crawling, it can add it to the DynamoDB table, which feeds it back through the same pipeline.
I would also add a job ID to each URL record in the DynamoDB table. You now have a few options:
You could query the table for all of the URLs in a particular job and look at their status to figure out how many have been completed. The Lambda fetching the pages would need to update each record's status appropriately.
You could create a job record in DynamoDB and have the two Lambdas atomically increment counters for the total number of URLs in the job and the number of completed URLs.