AWS Step Functions Best Practices: All in One Place

Perumal Babu
6 min read · Dec 28, 2022

A curated list of best practices for building applications with AWS Step Functions.

Let me start with a quick introduction to AWS Step Functions. It’s a service that enables you to build and run distributed applications using visual workflows. You can use Step Functions to coordinate the execution of various AWS services, such as Amazon EC2, AWS Lambda, and Amazon SNS, in a reliable and scalable manner.

Some common use cases for AWS Step Functions include:

  1. Microservices: Step Functions can be used to build and orchestrate microservices architectures, allowing you to break down complex applications into smaller, independent services that can be developed, tested, and deployed independently.
  2. Batch processing: Step Functions can be used to process large volumes of data or perform tasks that need to be run periodically. For example, you can use Step Functions to process a large number of records in a database or to perform nightly data backups.
  3. Data pipelines: Step Functions can be used to build and manage data pipelines, allowing you to move data between different sources and destinations in a reliable and scalable manner.
  4. Event-driven architectures: Step Functions can be used to build event-driven architectures, allowing you to respond to specific events (such as the completion of a task or the arrival of new data) and trigger appropriate actions.

If you have been following me, you know I have been blogging about best practices for cloud services across the pillars of the AWS Well-Architected Framework (WAF). This post does the same for Step Functions.

Performance :

  • Use the right workflow type: AWS Step Functions offers two workflow types, Standard and Express. Standard Workflows suit long-running, durable, auditable workflows: an execution can run for up to one year and follows exactly-once semantics. Express Workflows suit high-volume, short-lived workloads such as event processing: an execution can run for up to five minutes, and asynchronous Express executions follow at-least-once semantics. To improve performance, choose the type that matches the duration and volume of your workflow.
  • Use the right service integration: AWS Step Functions integrates with various AWS services, such as Amazon EC2, AWS Lambda, and Amazon SNS. To improve performance, you should use the service integration that is most suitable for your workload. For example, if you have a large number of tasks that can be processed concurrently, you might want to use AWS Lambda or Amazon EC2 to execute the tasks.
  • Use batch processing: If you have a large number of items that need to be processed, you can group them into batches so that each task invocation handles several items at once. For example, the Distributed Map state has an ItemBatcher field whose MaxItemsPerBatch setting specifies how many items are passed to each child workflow execution.
  • Use retries and error handling: If you are executing tasks that may fail due to temporary issues (such as network errors), you can use retries and error handling to improve overall performance. For example, you can use the Retry feature in AWS Step Functions to retry tasks that fail due to transient errors.
  • Monitor and optimize your usage: Use Amazon CloudWatch to monitor the performance of your Step Functions workflows. This will help you to identify any bottlenecks or issues that may be impacting performance and allow you to take corrective action as needed.
  • Avoid latency when polling for activity tasks: implement the pollers as separate threads in your activity worker, and keep at least 100 open polls per activity ARN in production.
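
The batching advice above can be sketched as a Map state definition built in Python. This is only a minimal sketch: it assumes the Distributed Map state and its ItemBatcher field, and the input path `$.records`, the state name `ProcessBatch`, and the Lambda ARN are hypothetical placeholders.

```python
import json


def build_batched_map_state(worker_arn, max_items_per_batch=100, max_concurrency=50):
    """Return an Amazon States Language Map state that processes items in batches."""
    return {
        "Type": "Map",
        # Hypothetical input shape: the state input holds an array under "records".
        "ItemsPath": "$.records",
        # Group items so each child execution receives several items at once.
        "ItemBatcher": {"MaxItemsPerBatch": max_items_per_batch},
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ProcessBatch",
            "States": {
                "ProcessBatch": {"Type": "Task", "Resource": worker_arn, "End": True}
            },
        },
        "End": True,
    }


# Hypothetical ARN, for illustration only.
batched_state = build_batched_map_state(
    "arn:aws:lambda:us-east-1:123456789012:function:process-batch"
)
print(json.dumps(batched_state, indent=2))
```

With batching enabled, the worker receives a list of items per invocation rather than one item, so it has to loop over the batch itself.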

Security :

  • Use IAM roles for tasks: Instead of using hardcoded AWS access keys in your Step Functions tasks, use IAM roles to grant the necessary permissions to your tasks. This ensures that the tasks have the minimum necessary permissions to perform their actions, and also allows you to easily rotate or revoke the permissions if needed.
  • Encrypt sensitive data: If you need to pass sensitive data (such as passwords or API keys) as inputs to your Step Functions tasks, make sure to encrypt the data using AWS Key Management Service (KMS). This will help to protect the data from unauthorized access.
  • Use CloudTrail to monitor Step Functions: Enable CloudTrail logging for your Step Functions execution to keep track of any changes or updates made to your Step Functions resources. This will help you to detect and respond to any unauthorized or unexpected activity.
  • Use resource-level permissions: Use resource-level permissions in your IAM policies to grant access to specific Step Functions resources, rather than granting global permissions to all Step Functions resources. This helps to limit the scope of permissions and reduce the risk of accidental or unauthorized access to your resources.
  • Enable CloudWatch logging: Enable CloudWatch logging for your Step Functions execution to keep track of the execution history and troubleshoot any issues that may arise. You can also set up alarms in CloudWatch to notify you of any unusual activity or errors.
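
To illustrate resource-level permissions, here is a small sketch that builds an IAM policy document scoped to a single state machine instead of `"Resource": "*"`. The ARN is a hypothetical placeholder, and the two actions are just an example of what a caller might need.

```python
import json


def start_execution_policy(state_machine_arn):
    """Return an IAM policy document limited to one state machine."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the actions this caller needs, not states:*.
                "Action": ["states:StartExecution", "states:DescribeStateMachine"],
                # A specific state machine ARN rather than a wildcard.
                "Resource": state_machine_arn,
            }
        ],
    }


# Hypothetical ARN, for illustration only.
policy = start_execution_policy(
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-pipeline"
)
print(json.dumps(policy, indent=2))
```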

Operational Excellence :

  • If you pass large payloads between states, the execution may be terminated once a payload exceeds the 256 KiB limit. Instead, store the data in Amazon S3 and pass the ARN of the object between states.

Use CloudWatch metrics to monitor the health of your Step Functions workflows. Here are some of the metrics that come in handy.
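
One way to apply the large-payload advice is to check the serialized size before handing data to a state machine and fall back to an S3 reference when it is too big. The sketch below assumes the 256 KiB payload quota; the `upload` callable and the `payloadRef` field name are hypothetical (in practice `upload` might wrap an S3 `put_object` call).

```python
import json

MAX_INLINE_BYTES = 256 * 1024  # assumed Step Functions payload quota (256 KiB)


def prepare_state_input(payload, bucket, key, upload):
    """Return the payload itself if it is small, else a reference to an S3 object.

    `upload(bucket, key, body)` is a caller-supplied function that stores the bytes.
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        return payload
    upload(bucket, key, body)
    # Pass only a pointer between states; the next task fetches the object itself.
    return {
        "payloadRef": {
            "bucket": bucket,
            "key": key,
            "arn": f"arn:aws:s3:::{bucket}/{key}",
        }
    }


# Usage with an in-memory stand-in for S3.
stored = {}
big_input = prepare_state_input(
    {"blob": "x" * (300 * 1024)},
    "my-bucket",
    "inputs/1.json",
    lambda b, k, body: stored.__setitem__((b, k), body),
)
print(big_input)
```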

  • State transitions: This metric represents the number of times your workflow moves from one state to another. You can use this metric to monitor the overall activity of your workflows and identify any bottlenecks or issues that may be impacting performance.
  • Throttled state transitions: This metric represents the number of times a state transition was throttled (i.e., rejected due to excessive load). If this metric is consistently high, it may indicate that your workflows are experiencing high levels of activity or that you have reached the maximum capacity of your workflow.
  • Execution duration: This metric represents the time it takes for a workflow execution to complete. You can use this metric to monitor the performance of your workflows and identify any issues that may be causing delays.
  • Throttled execution starts: This metric represents the number of times an execution start was throttled (i.e., rejected due to excessive load). If this metric is consistently high, it may indicate that you are reaching the maximum capacity of your workflow or that you need to scale your resources to handle the workload.
  • Task failures: This metric represents the number of tasks that have failed during workflow execution. You can use this metric to monitor the reliability of your workflows and identify any issues that may be causing tasks to fail.
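
As a rough sketch of how such a metric can drive an alert, the function below builds the parameters for a CloudWatch alarm on failed executions. It assumes the `AWS/States` namespace with the `ExecutionsFailed` metric and the `StateMachineArn` dimension; the alarm name, threshold, and ARN are hypothetical. The resulting dictionary could be passed to boto3 as `cloudwatch.put_metric_alarm(**params)`.

```python
def failed_executions_alarm(state_machine_arn, threshold=1, period_seconds=300):
    """Return keyword arguments for a CloudWatch alarm on failed executions."""
    return {
        "AlarmName": "stepfunctions-executions-failed",  # hypothetical name
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No data points means no failures, so do not raise the alarm.
        "TreatMissingData": "notBreaching",
    }


# Hypothetical ARN, for illustration only.
params = failed_executions_alarm(
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-pipeline"
)
print(params)
```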

Reliability :

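  • By default, the Amazon States Language does not set a timeout on state machine definitions, so a task that never receives a response can hang indefinitely. Set TimeoutSeconds on your Task states to avoid executions that get stuck.
  • Long-running executions can hit the hard quota of 25,000 entries in the execution event history. To avoid this, start a new state machine execution from a Task state of the running execution so that the work is split across several executions.
  • Handle transient Lambda service errors (HTTP 5xx responses, surfaced as errors such as Lambda.ServiceException and Lambda.SdkClientException) with a Retry so that the state machine is more reliable.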
  • Use retries and error handling: If you are executing tasks that may fail due to temporary issues (such as network errors), you can use retries and error handling to improve reliability. For example, you can use the Retry feature in AWS Step Functions to retry tasks that fail due to transient errors.
  • Use idempotent tasks: If your tasks are idempotent (i.e., they can be safely retried without causing unintended side effects), you can use retries to improve reliability. For example, if you are using a task to write data to a database, you can design the task to be idempotent by using an “upsert” operation that updates the data if it already exists, or inserts it if it does not.
  • Use CloudWatch alarms: Use CloudWatch alarms to monitor the status of your Step Functions workflows and tasks. If an alarm is triggered, you can use it to automatically take corrective action (such as scaling up a service or retrying a task) to improve reliability.
  • Use CloudTrail logging: Enable CloudTrail logging for your Step Functions execution to keep track of any changes or updates made to your Step Functions resources. This will help you to detect and respond to any unauthorized or unexpected activity, and improve the reliability of your workflows.
  • Test your workflows: Regularly test your Step Functions workflows to ensure that they are functioning as expected. You can use a tool such as AWS Step Functions Local to automate the testing of your workflows and improve reliability.
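
The timeout, retry, and error-handling points above can be combined in one state machine definition. This is a minimal sketch, not a complete workflow: the state names, timeout values, retry settings, and Lambda ARN are hypothetical, and the list of retryable errors is an assumption about which Lambda errors are transient.

```python
import json

# Errors assumed to be transient and therefore safe to retry.
LAMBDA_TRANSIENT_ERRORS = [
    "Lambda.ServiceException",
    "Lambda.AWSLambdaException",
    "Lambda.SdkClientException",
    "Lambda.TooManyRequestsException",
]


def build_resilient_definition(function_arn):
    """Return an Amazon States Language definition with timeouts, Retry and Catch."""
    return {
        "StartAt": "DoWork",
        "TimeoutSeconds": 3600,  # upper bound for the whole execution
        "States": {
            "DoWork": {
                "Type": "Task",
                "Resource": function_arn,
                "TimeoutSeconds": 60,  # fail the task instead of hanging
                "Retry": [
                    {
                        "ErrorEquals": LAMBDA_TRANSIENT_ERRORS,
                        "IntervalSeconds": 2,
                        "MaxAttempts": 6,
                        "BackoffRate": 2.0,  # wait 2s, 4s, 8s, ... between attempts
                    }
                ],
                # Anything still failing after the retries goes to a failure state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "WorkFailed"}],
                "End": True,
            },
            "WorkFailed": {
                "Type": "Fail",
                "Error": "WorkFailed",
                "Cause": "DoWork failed after retries",
            },
        },
    }


# Hypothetical ARN, for illustration only.
definition = build_resilient_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:do-work"
)
print(json.dumps(definition, indent=2))
```

Retries like these are only safe when the task is idempotent, which is why the retry and idempotency recommendations go together.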

Cost Optimization :

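  • Choose between Standard and Express Workflows with cost in mind: Standard Workflows are billed per state transition, while Express Workflows are billed by the number of requests plus the duration and memory consumed by each execution. High-volume, short-lived workloads are often cheaper on Express.
  • Monitor and optimize your usage: Use AWS Cost Explorer and other tools to monitor your Step Functions usage and costs. This will help you to identify areas where you can optimize your usage to reduce costs. You can also set up budget alerts in the AWS Billing and Cost Management console to notify you when your usage or costs exceed a certain threshold.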
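
A back-of-the-envelope comparison can help when choosing between Standard and Express on cost. The sketch below assumes Standard is billed per state transition and Express per request plus GB-seconds of duration; it ignores free tiers and billing granularity, and the rates in the example are illustrative assumptions only, so check the current pricing page for your region.

```python
def standard_cost(executions, transitions_per_execution, price_per_transition):
    """Estimated Standard Workflow cost: every state transition is billed."""
    return executions * transitions_per_execution * price_per_transition


def express_cost(executions, avg_duration_seconds, memory_gb,
                 price_per_request, price_per_gb_second):
    """Estimated Express Workflow cost: requests plus duration times memory."""
    request_charge = executions * price_per_request
    duration_charge = executions * avg_duration_seconds * memory_gb * price_per_gb_second
    return request_charge + duration_charge


# Illustrative rates (assumptions, not quoted prices).
monthly_executions = 1_000_000
std = standard_cost(monthly_executions, 10, 0.000025)
exp = express_cost(monthly_executions, 1.0, 0.0625, 0.000001, 0.00001667)
print(f"Standard: ${std:.2f}  Express: ${exp:.2f}")
```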

Here, again, is a post that covers the overall approach and process for successfully implementing Cloud FinOps.

Some of these best practices are taken from AWS.
