[core] ray.kill pending actor doesn't cancel the actor creation task
What is the problem?
Currently, ray.kill silently fails if the actor has not started yet. This appears to be because we try to kill the actor directly (via the direct actor transport), but the GCS is now responsible for scheduling/creating actors, so the actor's owner can't easily cancel the pending lease request.
Here's a simple reproduction which shows that the lease request remains infeasible in the raylet even after the actor is killed:
import time

import ray
from ray._raylet import GlobalStateAccessor

cluster = ray.init()
global_state_accessor = GlobalStateAccessor(
    cluster["redis_address"], ray.ray_constants.REDIS_DEFAULT_PASSWORD)
global_state_accessor.connect()


@ray.remote(resources={"WORKER": 1.0})
class ActorA:
    pass


a = ActorA.remote()
ray.kill(a)  # Do not wait until the actor starts.

while True:
    message = global_state_accessor.get_all_resource_usage()
    if message is not None:
        resource_usage = ray.gcs_utils.ResourceUsageBatchData.FromString(
            message)
        print(resource_usage)
    else:
        print(message)
    time.sleep(1)
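For reference, the check that the loop above performs manually could be wrapped in a small helper along these lines. This is only a sketch: pending_infeasible_requests is a hypothetical name, and the field names (resource_load_by_shape, resource_demands, num_infeasible_requests_queued) are taken from the ResourceUsageBatchData output pasted further below and may differ across Ray versions.

def pending_infeasible_requests(global_state_accessor):
    # Poll the GCS for the latest resource usage snapshot.
    message = global_state_accessor.get_all_resource_usage()
    if message is None:
        return 0
    usage = ray.gcs_utils.ResourceUsageBatchData.FromString(message)
    # Count queued infeasible requests; a non-zero value after ray.kill(a)
    # means the pending actor creation task was not cancelled.
    return sum(
        demand.num_infeasible_requests_queued
        for demand in usage.resource_load_by_shape.resource_demands)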
cc @ericl
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Yes, I think the version with lease_client->CancelWorkerLease looks right.

I call lease_client->CancelWorkerLease in the GCS actor scheduler and it cancels the actor creation request.

Running Alex's script with no_restart=True:

batch {
  node_id: "\206\363\344\377<3^\350y's\250\264x\214+\035\010<Us\177\255\323\362\"\035H"
  resources_available { key: "CPU" value: 8.0 }
  resources_available { key: "memory" value: 99.0 }
  resources_available { key: "node:10.15.246.254" value: 1.0 }
  resources_available { key: "object_store_memory" value: 34.0 }
  resources_available_changed: true
  resources_total { key: "CPU" value: 8.0 }
  resources_total { key: "memory" value: 99.0 }
  resources_total { key: "node:10.15.246.254" value: 1.0 }
  resources_total { key: "object_store_memory" value: 34.0 }
  resource_load_changed: true
  resource_load_by_shape { }
}
placement_group_load { }

Running Alex's script with no_restart=False:

batch {
  node_id: "\273\315\313K\234\272X\264\014xE0\020\006Y\333\367\313^\013\264\324\303#\343\030z\342"
  resources_available { key: "CPU" value: 8.0 }
  resources_available { key: "memory" value: 95.0 }
  resources_available { key: "node:10.15.246.254" value: 1.0 }
  resources_available { key: "object_store_memory" value: 33.0 }
  resources_available_changed: true
  resources_total { key: "CPU" value: 8.0 }
  resources_total { key: "memory" value: 95.0 }
  resources_total { key: "node:10.15.246.254" value: 1.0 }
  resources_total { key: "object_store_memory" value: 33.0 }
  resource_load { key: "CPU" value: 1.0 }
  resource_load { key: "WORKER" value: 1.0 }
  resource_load_changed: true
  resource_load_by_shape {
    resource_demands {
      shape { key: "CPU" value: 1.0 }
      shape { key: "WORKER" value: 1.0 }
      num_infeasible_requests_queued: 1
    }
  }
}
resource_load_by_shape {
  resource_demands {
    shape { key: "CPU" value: 1.0 }
    shape { key: "WORKER" value: 1.0 }
    num_infeasible_requests_queued: 1
  }
}
placement_group_load { }
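To tie these outputs back to the reproduction, here is a rough sketch of how the fix could be verified from Python. It assumes the ActorA definition and global_state_accessor from the script above, plus the hypothetical pending_infeasible_requests helper sketched earlier; with lease_client->CancelWorkerLease wired into the GCS actor scheduler, the infeasible WORKER demand should disappear shortly after ray.kill, matching the no_restart=True output.

import time

a = ActorA.remote()
ray.kill(a, no_restart=True)  # The actor has not been scheduled yet.

# Poll the GCS resource usage; once the creation task is cancelled, no
# infeasible requests should remain queued (as in the no_restart=True output).
for _ in range(30):
    if pending_infeasible_requests(global_state_accessor) == 0:
        print("actor creation request was cancelled")
        break
    time.sleep(1)
else:
    print("infeasible demand still queued; creation task was not cancelled")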