I have a lot (millions) of JSON records stored as NDJSON in a file, which I need to get into DynamoDB. I'm doing this by automatically splitting the original file into chunks and uploading them to S3, which then triggers Lambdas that parse the data and insert it into my DynamoDB table.

The problem I have is that every record contains a URL to another JSON file containing additional data about that record. I would like to fetch that data and save it together with the record, so that I can search it quickly with OpenSearch later.

My dilemma: should I fetch this data in the same Lambda that does the parsing, or should I make a separate Lambda that triggers on every DynamoDB insert? As I understand it, insert-triggered Lambdas are invoked after the record is inserted (I couldn't find anything in the documentation about this), so I would do one insert and one update for each record.

Notes:

  • The HTTPS server I'm fetching the additional data from limits the number of requests to ~100 per minute, and eventually I would like these Lambdas to have different IPs so that I can go beyond that limit
  • I'm an AWS newbie, so feel free to correct anything you think I'm doing wrong

EDIT - Editing in light of additional info provided.

You rightly pointed out some of the considerations for choosing between the 2 options. For example, the insert-then-update option is more costly, because you perform 2 write operations per record instead of 1.
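To make the single-write option concrete, here is a minimal sketch of enriching each record inside the parsing step, so the item is complete before it is written. It is only an illustration under assumptions: the field name `extra_url`, the `extra` key, and the injected `fetch_extra` callable are all hypothetical, and the actual HTTP request and DynamoDB write are deliberately left out.

```python
import json
from typing import Callable, Iterable, Iterator


def enrich_records(
    ndjson_lines: Iterable[str],
    fetch_extra: Callable[[str], dict],
    url_field: str = "extra_url",  # hypothetical field name, not from the post
) -> Iterator[dict]:
    """Yield one merged item per NDJSON line, ready for a single insert."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Fetch the linked JSON before writing, so no later update is needed.
        record["extra"] = fetch_extra(record[url_field])
        yield record


def count_writes(num_records: int, enrich_in_parser: bool) -> int:
    """Writes needed: 1 per record if enriched up front, else insert + update."""
    return num_records if enrich_in_parser else 2 * num_records
```

Under these assumptions, `count_writes(1_000_000, True)` gives 1,000,000 writes per million records, while the insert-then-update route gives 2,000,000.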

Another concern is the rate limit of ~100 API calls per 5 minutes. Since you have millions of calls to make, each million takes 1,000,000 / 100 * 5 minutes = 50,000 minutes, or ~35 days, from a single IP address. However, I don't think it is possible to control the Lambda IP address without some advanced setup (along the lines of running Lambda in a VPC and using a custom NAT instance that changes IP every time). Even then, the throughput is not going to be great.
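The estimate above is simple enough to check in a few lines. The helper below is a hypothetical sketch (the function name and parameters are mine); it assumes a single IP address that always runs at exactly the allowed rate. Note that the question states the limit as ~100 requests per minute, so the second call shows what that figure would give instead.

```python
def days_to_fetch(total_calls: int, calls_per_window: int, window_minutes: float) -> float:
    """Days needed to make total_calls at a fixed rate from one IP address."""
    total_minutes = total_calls / calls_per_window * window_minutes
    return total_minutes / (60 * 24)  # 1440 minutes per day


# Limit as used in this answer: ~100 calls per 5 minutes.
print(round(days_to_fetch(1_000_000, 100, 5), 1))  # 34.7

# Limit as stated in the question: ~100 calls per minute.
print(round(days_to_fetch(1_000_000, 100, 1), 1))  # 6.9
```

Both figures are per million records, so they scale linearly with the real record count.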

So if 35 days is an issue, I think it is better to run a fleet of EC2 instances, each with a different IP address, to process the records in parallel. EC2 gives greater control over networking than Lambda, as Lambda is meant to be serverless.
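As a rough sketch of how the work could be divided across such a fleet, the snippet below splits records round-robin into one shard per worker and computes the pause each worker would need between requests. The function names are hypothetical, and it assumes the server applies its limit per IP address, which the post does not confirm.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(records: Sequence[T], num_workers: int) -> List[List[T]]:
    """Split records round-robin into one list per worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [list(records[i::num_workers]) for i in range(num_workers)]


def seconds_between_calls(calls_per_window: int, window_minutes: float) -> float:
    """Pause each worker needs between requests to stay under the limit."""
    return window_minutes * 60 / calls_per_window


print(shard(list(range(10)), 3))      # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(seconds_between_calls(100, 5))  # 3.0
```

If the per-IP assumption holds, N workers would cut the total time to roughly 1/N of the single-IP estimate.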

If 35 days is not an issue, I don't see anything wrong with either of the 2 options you proposed. Just choose the one you are more comfortable with, as the 2 entail different implementations.