Mass curling URLs and saving content to S3


#1

I have to build an architecture that curls all the HTML pages on my prelive environment and saves the static content to S3.

I need to limit the number of parallel curl requests running against my prelive servers, to avoid effectively DDoSing my prelive environment.

As it might take a while (thousands of URLs), I want to give the user feedback on the progress of the curl requests.

My idea was to post the URLs to a master Lambda function that fans out the curl requests to worker Lambdas. Unfortunately this gives the user no feedback, as a Lambda invocation does not return a response until it has finished.

Any ideas on how I might do this in a serverless environment, or how it might fit into a serverless/hybrid architecture?


#2

I would start with a DynamoDB table and add one record for each URL you want downloaded. Next, attach a Lambda to the table's stream that writes each new URL to SQS, and add a second Lambda that reads from SQS but has a cap on its number of parallel invocations. The Lambda reading from SQS handles fetching the URL, and if it discovers a new URL in the page it can add it to the DynamoDB table.
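A minimal sketch of that SQS-consuming fetcher Lambda might look like the following. The bucket name, message shape (`{"url": ...}`), and key-derivation scheme are all assumptions, not anything prescribed by AWS:

```python
import json
import urllib.parse


def s3_key_for(url):
    """Derive a deterministic S3 object key from a page URL."""
    parts = urllib.parse.urlparse(url)
    path = parts.path.lstrip("/") or "index.html"  # map "/" to index.html
    return f"{parts.netloc}/{path}"


def handler(event, context):
    # boto3 is available in the Lambda runtime; imported here so the
    # key-derivation helper above can be tested without AWS access.
    import boto3
    import urllib.request

    s3 = boto3.client("s3")
    for record in event["Records"]:            # one SQS message per URL
        body = json.loads(record["body"])
        html = urllib.request.urlopen(body["url"], timeout=10).read()
        s3.put_object(
            Bucket="prelive-snapshots",        # bucket name is an assumption
            Key=s3_key_for(body["url"]),
            Body=html,
            ContentType="text/html",
        )
```

To enforce the parallelism cap, you can set reserved concurrency on this function (or, on the SQS event source mapping, a maximum concurrency) so that no more than, say, 10 copies hit your prelive servers at once.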

I would also add a job number to each URL record in the DynamoDB table. You now have a few options:

  1. You could query the table for all of the URLs in a particular job and look at the status of each to figure out how many have been completed. The Lambda fetching the pages would need to update the status appropriately.
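Option 1 could be sketched like this, assuming the table has `job_id` as a partition key and a `status` attribute (both names are assumptions):

```python
def progress(items):
    """Compute (completed, total) from a job's URL records."""
    done = sum(1 for item in items if item.get("status") == "COMPLETE")
    return done, len(items)


def job_progress(table_name, job_id):
    # Query all URL records for one job and count the completed ones.
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table(table_name)
    resp = table.query(KeyConditionExpression=Key("job_id").eq(job_id))
    return progress(resp["Items"])
```

Note that a query scales with the number of URLs in the job, so for very large jobs option 2 below is cheaper to poll.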

  2. You could create a job record in DynamoDB and have the two Lambdas increment counters on it: one for the total number of URLs in the job and one for the number of completed URLs.
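For option 2, DynamoDB's `UpdateItem` with an `ADD` expression gives you atomic counters, so the two Lambdas can increment safely without reading first. A sketch (table and attribute names are assumptions):

```python
def counter_update(job_id, attr):
    """Build UpdateItem arguments for an atomic +1 on a counter attribute."""
    return {
        "Key": {"job_id": job_id},
        "UpdateExpression": "ADD #c :one",
        "ExpressionAttributeNames": {"#c": attr},
        "ExpressionAttributeValues": {":one": 1},
    }


def bump(table_name, job_id, attr):
    # Stream Lambda calls this with "urls_total"; fetcher Lambda with
    # "urls_complete" after a successful upload.
    import boto3

    table = boto3.resource("dynamodb").Table(table_name)
    table.update_item(**counter_update(job_id, attr))
```

Your progress endpoint then reads a single job record and reports `urls_complete / urls_total`, which stays cheap no matter how many URLs the job contains.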