Step function to iterate S3

In this post I’m going to show you how to use a step function to iterate over the objects in an S3 bucket

AWS Step Functions is a service that helps you coordinate multiple AWS services into serverless workflows

You can start an execution of a step function from Amazon CloudWatch Events rules, AWS Lambda or Amazon API Gateway, for example

If you want to know more about Step Functions, check the AWS documentation, because this post only covers this use case

Step function to iterate S3 use case

Imagine that you have an S3 bucket and you want to run a job with Step Functions that iterates over all the objects in the bucket. There is no iterate function in a step function workflow so you have to build the loop on your own

Populating the S3 bucket

We are going to create 20 empty files and upload them to S3. If you’re using a Unix-like operating system you can use touch obj{1..20} to create the files

Now create an S3 bucket and upload all the files
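If you prefer to script both steps, here is a minimal Node.js sketch. The function names createEmptyFiles and uploadFiles are made up for this example, and the s3 argument is assumed to behave like an S3 client from the AWS SDK for JavaScript v2 (a putObject method whose result has a promise() function). The client is passed in so you can try the code without touching AWS

```javascript
const fs = require('fs')
const path = require('path')

// Create obj1..objN as empty files inside dir and return their names
function createEmptyFiles(dir, count) {
    fs.mkdirSync(dir, { recursive: true })
    const names = []
    for (let i = 1; i <= count; i++) {
        const name = `obj${i}`
        fs.writeFileSync(path.join(dir, name), '')
        names.push(name)
    }
    return names
}

// Upload every file to the bucket, using the file name as the object key.
// Assumption: s3 behaves like an aws-sdk v2 S3 client.
async function uploadFiles(s3, bucket, dir, names) {
    for (const name of names) {
        await s3.putObject({
            Bucket: bucket,
            Key: name,
            Body: fs.readFileSync(path.join(dir, name))
        }).promise()
    }
}
```

With the real SDK you would call createEmptyFiles('./files', 20) and then uploadFiles(new AWS.S3(), 'your-bucket-name', './files', names). The bucket itself still has to exist before the upload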

Creating the State Machine

It’s time to create the State Machine in AWS Step Functions. Give it a name like IterateS3 for example. For now, we are going to keep the default definition

Now create a new role for this State Machine. Give it a name, IterateS3Permissions for example, and it will be created with the default NoPermissionsAccessPolicy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

Your State machine is created. You can start an execution if you want but it won’t do anything

Creating the Lambda Function

We have to create the Lambda function that does the real job, so let’s get started

Create a Lambda function from scratch. Give it the name IterateS3Function for example and choose the Node.js 10.x runtime. Create a new execution role with basic Lambda permissions for the function. This is the function that the step function is going to use to iterate S3

We need to change the Lambda function role to give it permission to list the bucket, so we will add the following policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::your-bucket-name"
        }
    ]
}

Now we can create the code that iterates the S3 bucket in the Lambda function

const AWS = require('aws-sdk')
const s3 = new AWS.S3()

exports.handler = async (event) => {
    const response = await s3.listObjectsV2({
        Bucket: process.env.S3_BUCKET,
        MaxKeys: Number(process.env.MAX_KEYS), // environment variables are strings
        ContinuationToken: event.NextContinuationToken || undefined // empty on the first call
    }).promise()
    // Return this page of objects and the token for the next page ("" when there are no more)
    return {
        Files: response.Contents,
        NextContinuationToken: response.NextContinuationToken || ""
    }
};

The listObjectsV2 call invokes the GET Bucket (List Objects) Version 2 method of the S3 API. Each call returns at most MAX_KEYS objects (S3 itself caps a single response at 1,000 keys), so listing a whole bucket takes several calls

You must set an environment variable called S3_BUCKET with your actual bucket name and another environment variable called MAX_KEYS with the maximum number of keys returned by each API call
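A missing or malformed variable is easier to spot if you validate it up front. The helper below is only a sketch, and readConfig is a name invented for this example. It takes the environment as an argument so it can be exercised without changing process.env

```javascript
// Read and validate the two settings the handler depends on
function readConfig(env) {
    const bucket = env.S3_BUCKET
    if (!bucket) {
        throw new Error('S3_BUCKET is not set')
    }
    // Environment variables are always strings, so convert explicitly
    const maxKeys = Number.parseInt(env.MAX_KEYS, 10)
    if (!Number.isInteger(maxKeys) || maxKeys < 1) {
        throw new Error('MAX_KEYS must be a positive integer')
    }
    return { bucket, maxKeys }
}
```

Inside the Lambda function you would call readConfig(process.env) once, outside the handler, so a bad configuration fails on the first invocation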

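As you can see, we are returning the Files and the NextContinuationToken. The NextContinuationToken is an opaque token that S3 returns when there are more objects to list. Passing it back as ContinuationToken makes the next call continue where the previous one stopped. When there is nothing left, the function returns an empty string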
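To see how the token drives the iteration, here is a small simulation that runs without AWS. createFakeS3 is an invented in-memory stand-in for the S3 client, and its token format (a stringified index) is made up, because real tokens are opaque. createHandler repeats the handler logic with the client and page size injected

```javascript
// In-memory stand-in for S3 that pages its results like listObjectsV2
function createFakeS3(keys) {
    return {
        listObjectsV2: ({ MaxKeys, ContinuationToken }) => ({
            promise: async () => {
                const start = ContinuationToken ? Number(ContinuationToken) : 0
                const end = Math.min(start + MaxKeys, keys.length)
                const truncated = end < keys.length
                return {
                    Contents: keys.slice(start, end).map((Key) => ({ Key })),
                    IsTruncated: truncated,
                    NextContinuationToken: truncated ? String(end) : undefined
                }
            }
        })
    }
}

// Same logic as the Lambda handler, with the client and page size injected
function createHandler(s3, bucket, maxKeys) {
    return async (event) => {
        const response = await s3.listObjectsV2({
            Bucket: bucket,
            MaxKeys: maxKeys,
            ContinuationToken: event.NextContinuationToken || undefined
        }).promise()
        return {
            Files: response.Contents,
            NextContinuationToken: response.NextContinuationToken || ""
        }
    }
}

// Walk 20 keys, 5 per call, feeding each token back into the next call
async function listAll() {
    const keys = Array.from({ length: 20 }, (_, i) => `obj${i + 1}`)
    const handler = createHandler(createFakeS3(keys), 'your-bucket-name', 5)
    const pages = []
    let token = ""
    do {
        const result = await handler({ NextContinuationToken: token })
        pages.push(result.Files.map((file) => file.Key))
        token = result.NextContinuationToken
    } while (token !== "")
    return pages
}
```

With 20 keys and a page size of 5, listAll resolves to four pages of five keys each, and the last call returns an empty token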

Updating the State Machine

We have the Lambda function but we need the Step Function. Let’s change the State machine definition as follows

{
    "StartAt": "Configure",
    "States": {
        "Configure": {
            "Type": "Pass",
            "Result": {
                "NextContinuationToken": ""
            },
            "ResultPath": "$.iterator",
            "Next": "Iterator"
        },
        "Iterator": {
            "Type": "Task",
            "Resource": "your-iterator-lambda-function-arn",
            "Parameters": {
                "NextContinuationToken.$": "$.iterator.NextContinuationToken"
            },
            "ResultPath": "$.iterator",
            "Next": "DoYourThing"
        },
        "DoYourThing": {
            "Type": "Task",
            "Resource": "your-lambda-function-to-get-your-thing-with-files-arn",
            "Next": "HasNextElements"
        },
        "HasNextElements": {
            "Type": "Choice",
            "Choices": [
                {
                    "Not": {
                        "Variable": "$.iterator.NextContinuationToken",
                        "StringEquals": ""
                    },
                    "Next": "Iterator"
                }
            ],
            "Default": "Done"
        },
        "Done": {
            "Type": "Pass",
            "End": true
        }
    }
}

In the State machine definition, we have the Configure state, a Pass state where we set the initial variables. Then we have the Iterator task, which is responsible for invoking the Lambda function that iterates the S3 bucket

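The DoYourThing task is where your actual job is done with the files. This state has no ResultPath, so whatever its Lambda function returns replaces the whole state data. Don’t forget to return the iterator object after you do your job, or at least its NextContinuationToken property, or the next state won’t work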
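A minimal sketch of such a function could look like this. The processFile helper is hypothetical and only logs each key, so replace it with your own job. The important part is the return value, which keeps the token under the iterator key where HasNextElements looks for it

```javascript
// Hypothetical job: it only logs each key, replace it with your own logic
async function processFile(file) {
    console.log(`Processing ${file.Key}`)
}

async function doYourThing(event) {
    // The Iterator task stored its result at $.iterator
    const files = event.iterator.Files || []
    for (const file of files) {
        await processFile(file)
    }
    // This return value becomes the whole state output, so the token
    // must be handed over for the Choice state that comes next
    return {
        iterator: {
            NextContinuationToken: event.iterator.NextContinuationToken
        }
    }
}

exports.handler = doYourThing
```

Files is dropped from the output on purpose here, because the next states only need the token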

The HasNextElements state is a Choice state that evaluates the NextContinuationToken property. If it has a value, the execution goes back to the Iterator task. Otherwise, it finishes
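If the data flow between the states is hard to picture, the following local walk-through may help. It is not how Step Functions runs internally, only a hand-written imitation of the definition above. runStateMachine is an invented name, and iteratorFn and doYourThingFn stand in for the two Lambda functions

```javascript
// Local walk-through of the state machine, returns the visited state names
async function runStateMachine(iteratorFn, doYourThingFn, input = {}) {
    const visited = ['Configure']
    // Configure: a Pass state whose Result is written to $.iterator
    let state = { ...input, iterator: { NextContinuationToken: "" } }
    let next = 'Iterator'
    while (next !== 'Done') {
        visited.push(next)
        if (next === 'Iterator') {
            // Parameters builds the task input, ResultPath stores the result at $.iterator
            const result = await iteratorFn({
                NextContinuationToken: state.iterator.NextContinuationToken
            })
            state = { ...state, iterator: result }
            next = 'DoYourThing'
        } else if (next === 'DoYourThing') {
            // No ResultPath: the task result replaces the whole state
            state = await doYourThingFn(state)
            next = 'HasNextElements'
        } else {
            // HasNextElements: loop back while the token is not empty
            next = state.iterator.NextContinuationToken !== "" ? 'Iterator' : 'Done'
        }
    }
    visited.push('Done')
    return visited
}
```

Each pass through Iterator, DoYourThing and HasNextElements handles one page of objects, and the loop ends on the first page that comes back with an empty token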

We need to change the IAM Role of the State machine to give it permission to invoke the Lambda functions. So, let’s do it by removing the NoPermissionsAccessPolicy and adding a new one

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "your-lambda-function-arn"
        }
    ]
}

Now you can execute your State machine and iterate through your S3 bucket
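You can start the execution from the console, or from code. The sketch below assumes the AWS SDK for JavaScript v2, where the Step Functions client has a startExecution method. startIteration is a name invented for this example, and the client is passed in so the function can be tried with a stand-in

```javascript
// Start one execution of the state machine and return its ARN.
// Assumption: stepFunctions behaves like an aws-sdk v2 AWS.StepFunctions client.
async function startIteration(stepFunctions, stateMachineArn) {
    const response = await stepFunctions.startExecution({
        stateMachineArn,
        // The Configure state fills in $.iterator, so an empty input is enough
        input: JSON.stringify({})
    }).promise()
    return response.executionArn
}
```

With the real SDK the call would be startIteration(new AWS.StepFunctions(), 'your-state-machine-arn'). Whoever runs it needs permission to start executions of that State machine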

Conclusion

In conclusion, we can use this code to create loops inside a state machine. Anyway, AWS is always updating its services so let’s hope that they create a Loop Task or something like that

You can see another example of a Loop iteration in the official documentation