Working with Asynchronous Celery Tasks – lessons learned

Jakub Trzaskoma · Published in Daftcode Blog
7 min read · Aug 8, 2018


Illustration by Magdalena Tomczyk

The main goal of this post is to show the most important issues I came across while working with Celery tasks, using a simple case to illustrate the most common problems you may encounter.

Prerequisites:

  • Python web framework basics (e.g., Django/Flask/Pyramid)
  • Celery library basics (worker, broker, delays, retries, task acknowledgment)
  • Database knowledge (ORM, transactions, locking reads)
  • Familiarity with using Redis as a Celery broker

The case

Imagine that we are implementing a web store application. Let’s focus on a component responsible for registering new users and sending a welcome email after successful registration.

I will use Django-like code examples because they are short and make it easy to show the general idea, which can be applied to other web frameworks. Simplified code for the create_user request handler could look like this:
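
A minimal sketch of such a handler is shown below. It is illustrative only: send_email and render_html are assumed project helpers, the User fields are made up, and transaction.commit_on_success belongs to the old Django transaction API the article refers to.

from django.db import transaction
from django.http import HttpResponse

from .models import User                    # assumed User model with name/email fields
from .utils import render_html, send_email  # assumed project helpers


@transaction.commit_on_success
def create_user(request):
    # Create the user from the submitted form data.
    user = User.objects.create(
        name=request.POST['name'],
        email=request.POST['email'],
    )
    # Send the welcome email synchronously via SMTP (the slow, fragile part).
    send_email(
        recipient=user.email,
        subject='Welcome!',
        body='Thanks for registering in our store.',
    )
    # Render the confirmation page and return it to the user.
    return HttpResponse(render_html('welcome.html', {'user': user}))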

First, we use the ORM to create a new User object containing all the important information (name, email). Then, we call the send_email method, which sends a welcome message to the user via an SMTP server. Finally, we render an HTML page and send it to the user. We also use the transaction.commit_on_success decorator to automatically commit the User object to the database if the create_user function returns without raising an exception.

Carefully choose task function arguments

As you probably know, sending an email via SMTP can cause several problems associated with long server response times or temporary unavailability of the server. Let’s extract the send_email logic into a Celery task, delegating the sending of the email to a separate process. create_user will then return faster regardless of potential problems with sending the email. After this extraction, our code should look as follows:
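
A sketch of this first (and, as we will see, problematic) version; app is an assumed Celery application instance, and the helpers are the same as in the previous snippet.

from django.db import transaction
from django.http import HttpResponse

from .celery import app                     # assumed Celery application instance
from .models import User
from .utils import render_html, send_email


@app.task
def send_welcome_email_task(user):
    # Note: passing the whole ORM object is exactly the problem discussed below.
    send_email(
        recipient=user.email,
        subject='Welcome!',
        body='Thanks for registering in our store.',
    )


@transaction.commit_on_success
def create_user(request):
    user = User.objects.create(
        name=request.POST['name'],
        email=request.POST['email'],
    )
    # Delegate sending the email to a Celery worker.
    send_welcome_email_task.delay(user)
    return HttpResponse(render_html('welcome.html', {'user': user}))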

Everything seems neat and simple, but there is a little problem. We are passing the User object to send_welcome_email_task. This is an ORM object bound to the current database connection. The task will be run by the worker in a different process, using a different database connection, so it’s not a good idea to pass the whole User object between processes.

Furthermore, the User object can be modified in the meantime, and the Celery worker would then be operating on an outdated version. Your object can also contain more complex field types like datetime, which cannot be serialized to JSON (the most commonly used serializer in Celery). But don’t worry, we can fix this issue by passing user.id instead of user as the task argument:
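
A possible version of the fix, continuing the sketch above (same assumed helpers and app instance):

@app.task
def send_welcome_email_task(user_id):
    # Fetch the freshest version of the user inside the task itself.
    user = User.objects.get(id=user_id)
    send_email(
        recipient=user.email,
        subject='Welcome!',
        body='Thanks for registering in our store.',
    )


# ...and in create_user we now pass only the id:
send_welcome_email_task.delay(user.id)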

There is no problem with serializing user.id because it is a simple int. The user is fetched from the database by id at the beginning of the task function, so we are sure that the Celery task operates on the latest version of the user object.

Lesson learned: Use only simple types as task function arguments.

Remember that the state of your object may change after the task has been sent to the broker. This is especially important when two tasks operate on the same object: the second one should see any changes made by the first one.

Lesson learned: Always get the newest version of the object at task runtime.

Sending task to the broker and database transactions

Let’s move on. Think about the case when the render_html function is very slow (e.g., badly implemented). If so, a race condition may occur: the Celery worker could start running the task before create_user finishes and commits the user object to the database. In that case, send_welcome_email_task will raise an exception like “User object not found in the database for the given id”. This scenario may also occur whenever some long-running operation is executed after sending a task to the Celery broker.

How do we solve this problem? The first idea could be to delay sending the task to the broker by, e.g., 2 seconds, to make sure the function returns before the Celery worker picks the task up from the broker:

send_welcome_email_task.apply_async(args=(user.id,), countdown=2)

This fairly easy fix will obviously fail when render_html runs for more than 2 seconds, and it also causes an unnecessary delay in sending the email.

We must find a better solution. We have to be sure that the Celery task is sent to the broker no earlier than the database transaction is committed. To accomplish this, we will use the transaction.commit_manually decorator to gain full control over the database transaction.
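
A sketch of this version, still using the old manual transaction API and the helpers assumed in the earlier snippets:

@transaction.commit_manually
def create_user(request):
    try:
        user = User.objects.create(
            name=request.POST['name'],
            email=request.POST['email'],
        )
        response = HttpResponse(render_html('welcome.html', {'user': user}))
    except Exception:
        # Roll back if any part of the business logic fails; no email will be sent.
        transaction.rollback()
        raise
    else:
        # Commit first, then send the task, so the worker is guaranteed to find the user.
        transaction.commit()
        send_welcome_email_task.delay(user.id)
        return response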

With these adjustments, the task is sent to Celery after the transaction is committed, so there is no room for a race condition. This implementation also has one more advantage: the task is sent to the broker only if the transaction is committed successfully and no exception is raised in the create_user function. Previously, if an exception happened (e.g., in the render_html function or other business logic), the email would still be sent to the user, which is not what we expected.

Tip: For large projects, a better way to solve this problem is to extend the base Celery Task class and use transaction events to send the task just after commit, for example as sketched below. This automates the process for all kinds of tasks that operate on the database and makes the code much cleaner (see this blog post for a Django example).
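
One possible sketch of that idea, using Django’s transaction.on_commit hook (available in newer Django versions, unlike the old-style decorators used above); TransactionAwareTask is a hypothetical name:

from celery import Task
from django.db import transaction


class TransactionAwareTask(Task):
    # Delay publishing until the surrounding transaction commits. Note that the
    # caller no longer gets an AsyncResult back from apply_async/delay.
    def apply_async(self, *args, **kwargs):
        transaction.on_commit(
            lambda: super(TransactionAwareTask, self).apply_async(*args, **kwargs)
        )


@app.task(base=TransactionAwareTask)
def send_welcome_email_task(user_id):
    ...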

Lesson learned: Send the task to the broker only when the function does not raise any exception and after the transaction is committed.

Retrying tasks can be tricky

Let’s complicate things a little bit. If there is a temporary problem with the SMTP server, it would be nice to retry sending the email after a while. We can use Celery’s retry feature to implement this. It requires adding two parameters to the Celery task decorator and adding a try-except block:
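
A sketch of the task with retries enabled, assuming (for illustration) that the send_email helper raises smtplib.SMTPException when the SMTP server misbehaves:

import smtplib


@app.task(bind=True, default_retry_delay=60, max_retries=120)
def send_welcome_email_task(self, user_id):
    user = User.objects.get(id=user_id)
    try:
        send_email(
            recipient=user.email,
            subject='Welcome!',
            body='Thanks for registering in our store.',
        )
    except smtplib.SMTPException as exc:
        # Retry in 60 seconds, up to 120 times.
        raise self.retry(exc=exc)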

Now, if any problem occurs while sending the email, the task will be retried every 60 seconds until it finally succeeds or fails (a maximum of 120 times). Theoretically, this should work without any problems… but it will not if we are using Redis as the broker. Why?

Redis uses the visibility_timeout parameter, which defines the time limit within which the worker should acknowledge that the task succeeded or failed. If it doesn’t, the task is considered lost and is resent to a worker (not necessarily the same one, if you have a few running in parallel). The default value of visibility_timeout is 1 hour. In our case, this means that if an email has not been sent for 1 hour, the task will be resent to a worker, even though the original task is not lost but waiting inside the worker for another retry. As a result, we end up with two identical tasks. If both finally succeed, the user will get two emails, which is not the desired behavior.

Tip: If you want to avoid problems associated with visibility_timeout, you can also consider using RabbitMQ as the broker.

For more information about visibility_timeout, read the Celery documentation and the related GitHub issue.
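
For reference, with the Redis transport the visibility timeout is configured through the broker transport options; raising it to, say, 2 hours would look roughly like this:

# Example only: raise the Redis visibility timeout to 2 hours.
app.conf.broker_transport_options = {'visibility_timeout': 7200}  # seconds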

The first solution that comes to mind is to increase the visibility timeout to, e.g., 2 hours. But what if we have other tasks that need to retry for more than 2 hours? And what if we do not want to wait more than 1 hour to resend a task that really is lost (e.g., due to worker network unavailability)?

The better solution is to write the task code in such a way that no matter how many times we run the task, the essential logic (sending an email in our case) is executed no more than once. This is called task idempotence. To make our task idempotent, we have to follow two steps.

First, at the beginning of the task, we must check whether the email has already been sent. To achieve this, we will add an is_welcome_email_sent boolean field to the User ORM model class. If it is True, the task function should return without doing anything. Otherwise, it should send the email and set is_welcome_email_sent to True.

Second, we have to protect the code against the race condition that could happen when two tasks check the is_welcome_email_sent value at the same time. If that happens, the email will be sent twice. We will use the locking-reads feature, select_for_update(), to make sure that a given user object is processed by at most one worker at a time. The other task will have to wait on select_for_update() until the first one releases the lock (commits its transaction). The solution should look like this:
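
A sketch of the idempotent version, assuming the is_welcome_email_sent field has been added to User and using the old-style transaction decorator so the row lock is held for the duration of the task (app, User, send_email and smtplib as in the earlier snippets):

@app.task(bind=True, default_retry_delay=60, max_retries=120)
@transaction.commit_on_success
def send_welcome_email_task(self, user_id):
    # select_for_update() locks the row until the transaction ends, so at most
    # one worker processes this user at a time; others wait on the lock.
    user = User.objects.select_for_update().get(id=user_id)
    if user.is_welcome_email_sent:
        # Another run of the task has already done the job, so there is nothing to do.
        return
    try:
        send_email(
            recipient=user.email,
            subject='Welcome!',
            body='Thanks for registering in our store.',
        )
    except smtplib.SMTPException as exc:
        # The exception rolls back the transaction and releases the lock.
        raise self.retry(exc=exc)
    user.is_welcome_email_sent = True
    user.save()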

Even if a task gets duplicated across a few workers, only the first one will do the job. The rest will simply see that the job is already done and finish without causing any unwanted effects.

Lesson learned: Make the task idempotent to be sure that the job is done only once.

Late acknowledgment

When all our tasks are idempotent, we gain one more bonus. We can enable “late acknowledgment”:

@app.task(bind=True, default_retry_delay=60, max_retries=120, acks_late=True)

By default, the worker acknowledges a task just after it’s retrieved from the broker (before the task is executed). This is called “early acknowledgment”. It means that if the worker crashes in the middle of execution, the task will not be run again because it is already considered done. As a result, the task will never be finished.

After enabling “late acknowledgment”, the task is acknowledged by the worker just after execution. If the worker crashes during execution, the task will be run once again. It’s a good solution because we can be sure that the entire task code runs to completion at least once. There is also one disadvantage: part of the task code could be executed twice. Fortunately, our task is idempotent, so we are protected against executing the task’s “business” logic multiple times.

Lesson learned: Use “late acknowledgment” for idempotent tasks to protect them against incomplete execution.

Conclusion

As you can see, even a fairly simple Celery task is not so easy to write, especially when you want to be 100% sure that the job gets done under any circumstances. I hope this article helps you make your tasks more reliable.

If you enjoyed this post, please hit the clap button below 👏👏👏

You can also follow us on Facebook, Twitter and LinkedIn.
