[core] ray.kill pending actor doesn't cancel the actor creation task
What is the problem?
Currently, ray.kill silently fails if the actor has not started yet. This appears to be because we try to kill the actor directly (via the direct actor transport), but the GCS is now responsible for scheduling/creating actors, so the actor's owner can't easily cancel the pending lease request.
Here's a simple reproduction which shows that the lease request remains infeasible in the raylet even after the actor is killed:
import time

import ray
from ray._raylet import GlobalStateAccessor

cluster = ray.init()
global_state_accessor = GlobalStateAccessor(
    cluster["redis_address"], ray.ray_constants.REDIS_DEFAULT_PASSWORD)
global_state_accessor.connect()


@ray.remote(resources={"WORKER": 1.0})
class ActorA:
    pass


a = ActorA.remote()
ray.kill(a)  # Do not wait until the actor starts.

while True:
    message = global_state_accessor.get_all_resource_usage()
    if message is not None:
        resource_usage = ray.gcs_utils.ResourceUsageBatchData.FromString(
            message)
        print(resource_usage)
    else:
        print(message)
    time.sleep(1)
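For reference, the check that the loop above performs manually could be wrapped in a small helper along these lines. This is only a sketch: pending_infeasible_requests is a hypothetical name, and the field names (resource_load_by_shape, resource_demands, num_infeasible_requests_queued) are taken from the ResourceUsageBatchData output pasted further below and may differ across Ray versions.

def pending_infeasible_requests(global_state_accessor):
    # Poll the GCS for the latest resource usage snapshot.
    message = global_state_accessor.get_all_resource_usage()
    if message is None:
        return 0
    usage = ray.gcs_utils.ResourceUsageBatchData.FromString(message)
    # Count queued infeasible requests; a non-zero value after ray.kill(a)
    # means the pending actor creation task was not cancelled.
    return sum(
        demand.num_infeasible_requests_queued
        for demand in usage.resource_load_by_shape.resource_demands)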
cc @ericl
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Yes, I think the version with lease_client->CancelWorkerLease looks right.

I call lease_client->CancelWorkerLease in the GCS actor scheduler and it cancels the actor creation request.

Running Alex's script with no_restart=True:

batch {
  node_id: "\206\363\344\377<3^\350y's\250\264x\214+\035\010<Us\177\255\323\362\"\035H"
  resources_available { key: "CPU" value: 8.0 }
  resources_available { key: "memory" value: 99.0 }
  resources_available { key: "node:10.15.246.254" value: 1.0 }
  resources_available { key: "object_store_memory" value: 34.0 }
  resources_available_changed: true
  resources_total { key: "CPU" value: 8.0 }
  resources_total { key: "memory" value: 99.0 }
  resources_total { key: "node:10.15.246.254" value: 1.0 }
  resources_total { key: "object_store_memory" value: 34.0 }
  resource_load_changed: true
  resource_load_by_shape { }
}
placement_group_load { }

Running Alex's script with no_restart=False:

batch {
  node_id: "\273\315\313K\234\272X\264\014xE0\020\006Y\333\367\313^\013\264\324\303#\343\030z\342"
  resources_available { key: "CPU" value: 8.0 }
  resources_available { key: "memory" value: 95.0 }
  resources_available { key: "node:10.15.246.254" value: 1.0 }
  resources_available { key: "object_store_memory" value: 33.0 }
  resources_available_changed: true
  resources_total { key: "CPU" value: 8.0 }
  resources_total { key: "memory" value: 95.0 }
  resources_total { key: "node:10.15.246.254" value: 1.0 }
  resources_total { key: "object_store_memory" value: 33.0 }
  resource_load { key: "CPU" value: 1.0 }
  resource_load { key: "WORKER" value: 1.0 }
  resource_load_changed: true
  resource_load_by_shape {
    resource_demands {
      shape { key: "CPU" value: 1.0 }
      shape { key: "WORKER" value: 1.0 }
      num_infeasible_requests_queued: 1
    }
  }
}
resource_load_by_shape {
  resource_demands {
    shape { key: "CPU" value: 1.0 }
    shape { key: "WORKER" value: 1.0 }
    num_infeasible_requests_queued: 1
  }
}
placement_group_load { }
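To tie these outputs back to the reproduction, here is a rough sketch of how the fix could be verified from Python. It assumes the ActorA definition and global_state_accessor from the script above, plus the hypothetical pending_infeasible_requests helper sketched earlier; with lease_client->CancelWorkerLease wired into the GCS actor scheduler, the infeasible WORKER demand should disappear shortly after ray.kill, matching the no_restart=True output.

import time

a = ActorA.remote()
ray.kill(a, no_restart=True)  # The actor has not been scheduled yet.

# Poll the GCS resource usage; once the creation task is cancelled, no
# infeasible requests should remain queued (as in the no_restart=True output).
for _ in range(30):
    if pending_infeasible_requests(global_state_accessor) == 0:
        print("actor creation request was cancelled")
        break
    time.sleep(1)
else:
    print("infeasible demand still queued; creation task was not cancelled")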