Tasks marked as "UP_FOR_RESCHEDULE" get stuck in Executor.running and never reschedule
Apache Airflow version
2.3.3
What happened
Upon upgrading from Airflow 2.1.3 to Airflow 2.3.3 we have an issue with our sensors that have mode='reschedule'. Using TimeSensor as an example:
- It executes as normal on the first run
- It detects it is not the correct time yet and marks itself "UP_FOR_RESCHEDULE" (usually to be rescheduled 5 minutes in the future)
- When the time comes to be rescheduled it just gets marked as "QUEUED" and is never actually run again; the error in the log is:
[2022-08-15 00:01:11,027] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='TestDAG', task_id='testTASK', run_id='scheduled__2022-08-12T04:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)
Looking at the relevant code (https://github.com/apache/airflow/blob/2.3.3/airflow/executors/base_executor.py#L215) it seems that the task key was never removed from self.running after the sensor initially rescheduled itself, so the executor keeps refusing to queue it (a simplified model of this failure mode is sketched below).
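Here is a minimal, self-contained model of that failure mode (illustrative Python only, not the actual Airflow source; the ToyExecutor class and its attributes are invented for illustration): if a rescheduled task's key is never removed from the executor's running set, every later attempt to queue that key is retried a few times and then rejected.

```python
# Toy model of the failure mode described above. NOT Airflow source code:
# ToyExecutor and its attributes are invented purely for illustration.
class ToyExecutor:
    def __init__(self, max_attempts=4):
        self.running = set()      # keys the executor believes are still executing
        self.queued_tasks = {}    # key -> command waiting to be sent
        self.attempts = {}        # key -> number of times queuing was retried
        self.max_attempts = max_attempts

    def queue(self, key, command):
        self.queued_tasks[key] = command

    def trigger_tasks(self):
        for key, command in list(self.queued_tasks.items()):
            if key in self.running:
                # The executor still thinks this task is running, so it refuses
                # to send it again; after a few loops it gives up, mirroring
                # "could not queue task ... (still running after 4 attempts)".
                self.attempts[key] = self.attempts.get(key, 0) + 1
                if self.attempts[key] < self.max_attempts:
                    continue
                print(f"ERROR - could not queue task {key} "
                      f"(still running after {self.attempts[key]} attempts)")
                del self.queued_tasks[key]
                del self.attempts[key]
                continue
            self.queued_tasks.pop(key)
            self.running.add(key)
            print(f"executing {key}")


executor = ToyExecutor()
executor.queue("testTASK", "airflow tasks run ...")
executor.trigger_tasks()  # first poke runs normally

# The sensor reschedules itself, but the key is (incorrectly) never removed
# from executor.running, so re-queuing it fails after 4 scheduler loops:
executor.queue("testTASK", "airflow tasks run ...")
for _ in range(4):
    executor.trigger_tasks()
```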
What you think should happen instead
Tasks marked UP_FOR_RESCHEDULE should be removed from the executor's running set when they reschedule themselves, and should actually run again when their reschedule time arrives.
How to reproduce
- Airflow 2.3.3 from Docker
- Celery 5.2.7 with Redis backend
- MySQL 8
- Airflow Timezone set to America/New_York
- Have a normal (non-async) sensor with mode='reschedule' that needs to reschedule itself (a minimal example DAG is sketched below)
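A minimal DAG exercising this path might look like the following (the dag_id, task_id, and timing values are illustrative, chosen only to match the names in the log above):

```python
# Minimal reproduction sketch. The dag_id/task_id and times are illustrative;
# any reschedule-mode sensor that has to wait should hit the same code path.
from datetime import datetime, time

from airflow import DAG
from airflow.sensors.time_sensor import TimeSensor

with DAG(
    dag_id="TestDAG",
    start_date=datetime(2022, 8, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    TimeSensor(
        task_id="testTASK",
        target_time=time(23, 59),  # later than "now" so the sensor must wait
        mode="reschedule",         # frees the worker slot, sets UP_FOR_RESCHEDULE
        poke_interval=300,         # re-check every 5 minutes, as in the report
    )
```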
Operating System
Fedora 29
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
No response
Anything else
The symptoms in this discussion sound the same, but no one has replied to it yet: https://github.com/apache/airflow/discussions/25651
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
It appears this was never an issue before 2.3.0 because the CeleryExecutor implemented its own trigger_tasks logic, until this PR landed: https://github.com/apache/airflow/pull/23016
Looks like it was our fault!
It seems the issue was that our scheduler's Celery result backend was pointing to a different database than our workers' Celery result backend 🤦♂️.
Thanks for responding earlier, sorry it was on our side.
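For anyone hitting the same symptom: a quick sanity check is to confirm that the scheduler and every Celery worker resolve the same result backend. A small snippet, assuming a standard Airflow install, run in each container/host:

```python
# Print the Celery result backend this environment will use. Run this in the
# scheduler container and in each worker container; the values must match,
# otherwise task completions recorded by workers are never seen by the
# scheduler and the task keys stay stuck in Executor.running.
from airflow.configuration import conf

print(conf.get("celery", "result_backend"))
```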