Lifecycle of a Message in Amazon SQS – A Detailed Coverage

In case you haven’t been following this blog series on Amazon SQS: we have already looked at the mechanics of consuming messages from Amazon SQS in an earlier blog post, and at the concepts and implementation details of Delivery Delays with Amazon SQS in another earlier blog post.

The most important part of messaging is the messages themselves and their reliable delivery and processing: when a message is sent, it should reach the appropriate consumer in time; the consumer should be able to read and process it within the allotted time as necessary; and the queuing system should always remain healthy while acting as the broker, continuously facilitating the sending and receiving of messages. Obviously, the messaging system (broker/provider) carries a lot of responsibility in providing reliable infrastructure to move a message from its source to its destination. While this happens, the broker has to manage and maintain a lot of state about the message, and it must also provide appropriate mechanisms to triage messages that cannot be processed. A number of factors come into play, and much of this is taken care of automatically by the messaging provider, in this case Amazon SQS; however, it still needs to be configured appropriately to suit the needs of the business. Obviously, we do not want our messages to get lost or go unprocessed.

Thus, gaining a thorough understanding of the stages/states a message goes through (or can go through) from the time it is sent to the time it is consumed and/or discarded, and of the factors that influence the state of a message, is crucial. Through this blog, I hope to provide comprehensive coverage of the complete lifecycle of a message when using Amazon SQS as a queuing system. The following sections describe the various stages in the lifecycle of a message, followed by a graphical representation of the complete flow in the form of a flowchart.

A New Message

Obviously, it all starts when a message is sent. The sender sends the message to an Amazon SQS queue, where it gets queued up for potential consumers. Amazon SQS queues are modeled on the point-to-point messaging model, so a message can be consumed by only a single consumer at any given point in time (the ending clause ‘at any given point in time’ is important – more on this later when we talk about Message Visibility Timeout).

Delivery of Messages can be delayed…!

Alright, so a message is sent, but we do not want it to be consumed by any of its consumers right away. This may be because the message makes more sense when consumed with a delay, for example, to avoid a race condition or to cater to another dependency.

One way to achieve this is to ask the SQS queue consumers not to pick up the message(s) for some time, but that may be impractical and not in our control, and hence error prone. So, how about controlling this behavior at the message broker (Amazon SQS) level itself?

Yes, that is possible, and there are a couple of options to do it in Amazon SQS. In fact, the behavior of Delivery Delay varies by queue type – standard queues vs. FIFO queues – with differential handling for existing messages, multiple mechanisms with subtle differences, and customization options available to introduce a Delivery Delay for messages. The topic warranted a complete blog post of its own and cannot be repeated here, so please feel free to read through the Delivery Delay blog post to understand the details.
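
As a quick taste before you head over to that post: on a standard queue, a per-message delay can be set via DelaySeconds when sending. Below is a minimal sketch using the AWS SDK for Java (v1); the queue URL and message body are placeholders.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class DelayedSendExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // DelaySeconds hides this particular message from consumers for 60 seconds
        sqs.sendMessage(new SendMessageRequest()
                .withQueueUrl(queueUrl)
                .withMessageBody("order-created-event")
                .withDelaySeconds(60));
    }
}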

Messages have an expiry too…!

As with everything in life, messages in Amazon SQS have a lifetime too, and it is enforced (you can’t get away from it). This makes sense considering that Amazon SQS is a cloud-based, high-throughput messaging system: it cannot let messages simply pile up in queues without being consumed, or, to be more precise, without being drained from the queue. This is also important given that, with Amazon SQS standard queues, strict message ordering is not guaranteed.

Key points to note:

  • A queue can be configured so that all messages sent to it carry a certain expiry period. This can be done while creating the queue, or later by changing the queue configuration through the various mechanisms supported by Amazon SQS, including the Amazon console, CLI, CloudFormation and SDKs.
  • The queue attribute for this setting is called MessageRetentionPeriod, its value is specified in seconds, and can range from 60 seconds (1 minute) to 1209600 seconds (14 days).
  • The default value for a newly created queue is 345,600 seconds (4 days), unless modified.
  • There is no way to set the expiry period at a message level.

Let us take a look at an example of setting this value to 14 days for a queue using the Java SDK.
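
Here is a minimal sketch using the AWS SDK for Java (v1); the queue URL is a placeholder, and 1209600 seconds corresponds to the 14-day maximum.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;

public class RetentionPeriodExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // MessageRetentionPeriod is expressed in seconds; 1209600 = 14 days (the maximum)
        sqs.setQueueAttributes(new SetQueueAttributesRequest()
                .withQueueUrl(queueUrl)
                .addAttributesEntry("MessageRetentionPeriod", "1209600"));
    }
}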

 

I can’t process a message, what do I do…!

There could be situations where a queue contains a message that none of the queue’s consumers can process, maybe because they do not recognize it at all, or because the message format has evolved over time and there is a mismatch between what is in the queue and what can be processed. There could be other reasons as well: an application logic error, a race condition, infrastructure issues, etc.

Obviously, you cannot stop processing the queue and bring the whole system down for one such poison message. The queue may have thousands of other messages that can be processed and are waiting to be processed.

In such a case, the standard messaging pattern is to move the message to a Dead Letter Queue after a certain number of retries. The Dead Letter Queue is a separate queue that is monitored separately for such messages and can be used for alerting and notifications about such events.

Amazon SQS supports configuring a Dead Letter Queue for every SQS queue you create, and lets you configure a threshold for the number of times a message is retried for processing, beyond which the message is moved to the Dead Letter Queue. This configuration is known as the Redrive policy, and the threshold setting is known as maxReceiveCount within Amazon SQS. For example, if you have configured maxReceiveCount as 100, Amazon SQS moves a message to the configured Dead Letter Queue only after the message has been received 100 times but still not deleted from the original queue, indicating that the consumer(s) are unable to process it.
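
To make this concrete, here is a minimal sketch of attaching a Redrive policy to an existing queue using the AWS SDK for Java (v1); the queue URL, the Dead Letter Queue ARN and the maxReceiveCount of 5 are placeholders for illustration.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;

public class RedrivePolicyExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder
        String dlqArn = "arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue";     // placeholder

        // The Redrive policy is a JSON document pointing at the DLQ by ARN;
        // after 5 unsuccessful receives, the message is moved to the DLQ
        String redrivePolicy = "{\"maxReceiveCount\":\"5\",\"deadLetterTargetArn\":\"" + dlqArn + "\"}";

        sqs.setQueueAttributes(new SetQueueAttributesRequest()
                .withQueueUrl(queueUrl)
                .addAttributesEntry("RedrivePolicy", redrivePolicy));
    }
}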

I need isolation while processing my messages…!

Finally, the message has passed the initial hurdles and is now getting ready to be available to consumers for processing. But what if there are multiple consumers looking to consume the message? In most cases, multiple consumers take the form of multiple load-balanced instances of the same consumer application, so that a large number of messages can be processed. Obviously, we would not want the same message to be processed multiple times by more than one instance of the same application, each unaware that another is processing the same message. To overcome this problem, JMS-based messaging brokers let the client/consuming application use CLIENT_ACKNOWLEDGE as the acknowledgement mode. There is no concept of acknowledgement modes within Amazon SQS; however, SQS provides the similar concept of a message visibility timeout.

Message Visibility Timeout

So, imagine the case where multiple consumers are trying to read the queue to consume messages. Whenever a message is returned in response to a read request made by a particular consumer, that message is hidden from other consumers for a specific period of time. This time window is called the Visibility Timeout. The basic idea is that once that particular consumer is done processing the message, it makes a delete call to remove the message. Yes, deletion is not automatic: Amazon SQS requires you to delete the message. The mechanism itself is very similar to CLIENT_ACKNOWLEDGE in JMS, except that in the JMS world the consumer acknowledges that it is done processing and the JMS broker takes care of removing the message from the queue, while in Amazon SQS the consumer directly makes a delete call to SQS to indicate that it is done processing.

One key point to note is that the delete should happen within the message visibility timeout period, because it is only during this period that the message is unavailable to other consumers. Once the message becomes visible again, it may be read by other consumers and hence reprocessed. A message is said to be in flight while it has been read by a consumer and is within the message visibility timeout period.
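
To make the receive-process-delete cycle concrete, here is a minimal sketch using the AWS SDK for Java (v1); the queue URL is a placeholder.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class ReceiveAndDeleteExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
            // ... process the message here ...

            // Delete within the visibility timeout, otherwise the message
            // becomes visible again and may be reprocessed by another consumer
            sqs.deleteMessage(queueUrl, message.getReceiptHandle());
        }
    }
}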

The following diagram illustrates the concept of Message Visibility Timeout.

In terms of implementation, the visibility timeout can be set at two levels – at the queue level as well as while submitting a read request.

Key points to note:

  • All queues have a default visibility timeout of 30 seconds, which can be configured at the queue level. This can be done at the time of queue creation or later by changing the queue attributes using the various mechanisms supported by Amazon SQS.
  • The maximum period supported is 12 hours.
  • Visibility timeout can also be specified by consumers in the receive message request. A visibility timeout specified as part of a receive message request does not affect the queue-level visibility timeout setting; it applies only to the messages returned in response to that particular request.
  • In fact, the visibility timeout of a specific message (or set of messages) can be extended using the ChangeMessageVisibility action/API to get more time to process a message when the visibility timeout of an already-read message turns out to be insufficient (see the sketch right after this list).
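
For instance, here is a minimal sketch of extending the visibility timeout of an in-flight message via the ChangeMessageVisibility API, using the AWS SDK for Java (v1); the queue URL and the 300-second extension are placeholders.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class ExtendVisibilityExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
            // Processing will take longer than expected: keep the message
            // hidden from other consumers for another 300 seconds
            sqs.changeMessageVisibility(queueUrl, message.getReceiptHandle(), 300);
        }
    }
}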

An example of setting the Message Visibility Timeout at the queue level for an existing queue is given below. The visibility timeout at the queue level applies to all messages.
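
A minimal sketch using the AWS SDK for Java (v1); the queue URL and the 60-second value are placeholders.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;

public class QueueVisibilityTimeoutExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // VisibilityTimeout is expressed in seconds and applies to every
        // message subsequently read from this queue (unless overridden per request)
        sqs.setQueueAttributes(new SetQueueAttributesRequest()
                .withQueueUrl(queueUrl)
                .addAttributesEntry("VisibilityTimeout", "60"));
    }
}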

An example of setting the Message Visibility Timeout at the message level, for messages read in a particular receive request, is given below.
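
A minimal sketch using the AWS SDK for Java (v1); the queue URL and the 120-second timeout are placeholders.

import java.util.List;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class PerRequestVisibilityExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // The 120-second visibility timeout applies only to the messages
        // returned by this particular call; queue-level settings are untouched
        ReceiveMessageRequest request = new ReceiveMessageRequest()
                .withQueueUrl(queueUrl)
                .withMaxNumberOfMessages(10)
                .withVisibilityTimeout(120);
        List<Message> messages = sqs.receiveMessage(request).getMessages();
        System.out.println("Received " + messages.size() + " message(s)");
    }
}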

A picture is worth a thousand words

Alright, so we have looked at many aspects of the states/stages that a message goes through from the time it is sent to the time it is consumed or discarded, and in the process we have learnt concepts, implementation details, workarounds and much more. The following diagram depicts all of the above information as a flowchart, for easier understanding and reference.

In this diagram, the green boxes denote external actions taken by producers or consumers of the message, the blue structures denote message states, and the amber ones denote internal SQS processes.

We have covered a lot through this blog, and hopefully it has helped you visualize the complete picture in terms of what states a message goes through (or can go through) when you are using Amazon SQS for messaging.

Feedback and suggestions are most welcome, as always.

Happy learning, Happy sharing!!!

– Amit



23 Comments

  1. Hey Amit,

    Thanks for the detailed explanation. I would like to understand if we can retrieve messages based on a certain key/attribute at the consumers.

    For example: consumer 1 receives only the messages with key = ‘a’ and consumer 2 receives only the messages with key = ‘b’ etc

    Any solution would be highly appreciated.

    Kr,
    Ravi

    1. Hi Ravi,

      Thanks for your feedback.
      In terms of fetching messages selectively based on filters, the SQS service itself does not support this feature inherently. One can build such a filtering logic within consumers. Or one can use frameworks based on Enterprise Application Integration like Spring Integration, Apache Camel to do this, although they would be doing the same (implementing filtering over and above the SQS API), but by using such a framework, the consumers are somewhat free of implementing this filtering on their own.

      Based on the use case and the nature of filters (fixed/limited or completely dynamic/uncontrolled), one could also use SNS-SQS combination to achieve out of the box message filtering. Essentially, setup a SNS topic and multiple SQS queues subscribe to that SNS topic with subscription filters. The producer of the message sends the message to SNS topic, and the messages automatically get routed to the appropriate SQS queue(s) based on subscription filter definition. The consumers continue to consume from respective SQS queues, without having to worry about applying filters while consuming.

      Hope this helps!
      Amit

  2. Hi Amit,
    Thank you for such a piece of fruitful information. I am using AWS FIFO SQS with JMS in a Spring Boot application, and I need to add some delay if there is an exception in the @JmsListener, meaning some delay in each retry. If the retries are exhausted, the message should move to the Dead Letter Queue.

    Thanks

      1. Hello Amit,

        Currently we are using an SQS FIFO queue. I have a question: how will we recover the SQS queue from a disaster? For example, if there are some messages in the queue which are not processed yet and a disaster happens, what will happen at that time; do we lose the messages?

        Thanks.

        1. Hello Meghna,

          Apologies for a delayed response.
          SQS queues (standard as well as FIFO) are reliable, persistent queues. Once a message is in SQS, it will remain there until it is deleted by the processing application or it runs out of its retention period, whichever is earlier. This is true even if there is a disaster and your application is not able to process the messages in the queue for some time.

          Hope this helps!

          Regards
          Amit

  3. Hi Amit,

    What would be the answers to this question?

    When designing an Amazon SQS message-processing solution, messages in the queue must be processed before the maximum retention time has elapsed.
    Which actions will meet this requirement? (Choose two.)

    A. Use AWS STS to process the messages
    B. Use Amazon EBS-optimized Amazon EC2 instances to process the messages
    C. Use Amazon EC2 instances in an Auto Scaling group with scaling triggered based on the queue length
    D. Increase the SQS queue attribute for the message retention period
    E. Convert the SQS queue to a first-in first-out (FIFO) queue

    1. I think it depends on how you are processing your messages; however, based on the options you have provided, option C (EC2 instances in an Auto Scaling group with scaling triggered based on queue length) appears to be the best one.

      Hope this helps!!

  4. Hi, we are working on the API Gateway integration with Amazon SQS. The API Gateway endpoint is using a GET method to read messages off the queue, at which point we see the messages in the in-flight state, but they are returning to the Available state again. Kindly suggest.

    1. Hi Tejdeep,

      Not sure if I fully understand your question, but it looks like your visibility timeout may be too short. Can you elaborate on the problem and share some of your configuration?

      Regards
      Amit

  5. Hi Amit,

    I want to send over 300 failed bulk messages to SQS from a script; a receiver will process them. But I’m afraid there will be an issue if we send such a huge chunk of messages at a time.

    Can you suggest the best way to send bulk messages to SQS?

    1. Hi Sai,

      You should use the SQS batch API to send messages in bulk. Also, sending 300 messages should not be a problem.

      Regards
      Amit

  6. Hi Amit,

    Great series of articles, really got me up to speed on sqs!

    Question:

    I have multiple consumers that need to process every message, so all consumers need to receive all messages.
    I would like to use a single queue, because the consumers will come online as needed via k8s, so it would be difficult to create/delete queues as needed.

    Would it work to use a low message ttl (like 10 seconds) and never acknowledge any messages?

    The producers will use SNS and publish messages with a topic that consumers will subscribe to.

    Thanks,
    Roy

    1. Thanks Roy, glad you found them useful.

      On the scenario you mentioned: if you are looking for every consumer to process every message, then I am afraid using a single queue won’t work, because SQS is used for point-to-point messaging. You may use SNS for broadcasting and fanning out your messages, but that would mean you need multiple queues as subscribers.

      Having said that, I am a bit curious why you would want every consumer to process every message. Or did you mean every consumer should be capable of receiving (and processing) every message, and whoever gets to it first will process it and delete it from the queue? Are your consumers homogeneous (multiple instances of the same process) or heterogeneous (different types of consumers, each processing different types of messages)?

      1. Hi Amit,

        Actually, the consumers in this case are servers that client devices connect to via winsocket. Messages are proxied to the clients when received by the servers. Client connections are load balanced and can connect to any server at any time. Also need to support the same client login connection on multiple servers. Messages are addressed by client login id and when received by the server, are forwarded to the client if a connection to the client exists.

        Thus, the requirement for all servers to receive all messages.

        Thanks,
        Roy

        1. Hi Amit,

          Wanted to add, the consumers are homogeneous. They need to receive all messages and will forward only messages to clients connected to the consumer.

          -Roy

          1. Hi Roy,

            If I understand your question right, then in my opinion, SQS is probably not the right solution, at least as described in your question.
            SNS fanning to multiple SQS queues is a pattern that can be used for broadcasting of messages, but you mentioned you would like to use a single SQS queue being consumed by all your homogeneous consumers.

            Since all your consumers are homogeneous, I am assuming they all run the same code, so why would you want a message to be processed by all running consumers rather than letting just one of them process (and delete) it? In this particular case, multiple homogeneous consumers would let you process the queue faster. You need a way to delete the message once it is processed, and typically you would do that once a particular consumer has successfully processed it. If you wait for all consumers to receive and process the message, you do not have a way to reliably delete the message and ensure it is processed exactly once by each consumer.

            Hope this helps!

            Regards
            Amit

  7. Hi Amit, I’ve faced a problem with maxReceiveCount for the dead-letter queue. I’ve used the configuration below:

    -MaxReceiveCount is 5
    -Visibility Timeout is 2 times my Lambda timeout (330 seconds), so it’ll be 335 seconds. It’s required to set the visibility timeout greater than the Lambda timeout.

    But the problem is that I tried to simulate a failure case every time from the Lambda function, and I’ve not seen any messages in the dead-letter queue. After I reduced maxReceiveCount to 1-3, I suddenly saw messages in the DLQ. I changed it back to 5 again and waited 8-9 hours; it doesn’t work and the issue remains the same.
    I’m really concerned about whether I’ll lose messages or not with a maxReceiveCount of 5.

    1. I found something. I’m consuming messages from the queue and it contains 200K messages, which is quite a lot. My assumption is that when maxReceiveCount is increased, the SQS queue seems to treat those messages as low priority, and since messages are not ordered as in FIFO, the Lambda always gets newer messages with a lower receive count instead of those with a higher receive count. I guess AWS might want to reduce the number of messages in the DLQ and give those messages a chance to be retried from Lambda as much as possible. Anyway, I’ve tested it with a new SQS queue and DLQ and it works fine, so there is no concern about this issue in the Production environment. I just want someone who can confirm whether my assumption is correct.

      1. Hi Phonlawat,

        Apologies for the delayed response, it’s been a crazy time.

        With SQS and Lambda, the chances of processing failures are a bit higher for various reasons, especially with the kind of volume in your scenario.
        Hence, with a lower DLQ threshold, you have a higher chance of messages landing in the DLQ.
        With your DLQ threshold set to 5, you have a better chance of messages getting processed because of the larger number of retries.

        Hope this helps!

        Regards,
        Amit

  8. Hi Amit,

    Detailed explanation. Can you list all the situations in which messages fall into the DLQ?
    1. As you mentioned, when the receive count limit is exceeded?
    2. When the retention period expires, i.e. max 14 days?

    Any other possibility? My understanding is that if a message doesn’t get consumed/read, then it will always remain in the main queue and will not move to the DLQ.

    Thanks,
    – Ahmed

    1. Hi Ahmed,

      Messages can get deleted from the queue (and not moved to the DLQ) under the following conditions:

    1. The consumer process deletes it.
    2. The message expires, i.e. exceeds the MessageRetentionPeriod. Note that in this case it does not move to the DLQ; it gets deleted.
    3. Messages get moved to the DLQ only when a DLQ is configured, and configuring a DLQ requires specifying a maxReceiveCount. The message, if not deleted (and not expired), is delivered to consumers up to maxReceiveCount times, after which it is moved to the DLQ.

      Hope this helps!
      Amit
