Inference tensors cannot be saved for backward.
I don't know if this is a bug. I'm trying to follow the official tutorial using the PyTorch engine.
Here is my code, followed by the exception.
String modelUrl = "https://resources.djl.ai/test-models/traced_distilbert_wikipedia_uncased.zip";
Criteria<NDList, NDList> criteria = Criteria.builder()
        .optApplication(Application.NLP.WORD_EMBEDDING)
        .setTypes(NDList.class, NDList.class)
        .optModelUrls(modelUrl)
        .optProgress(new ProgressBar())
        .build();
ZooModel<NDList, NDList> embedding = criteria.loadModel();
Predictor<NDList, NDList> embedder = embedding.newPredictor();

SequentialBlock classifier = new SequentialBlock()
        .add(
            ndList -> {
                NDArray data = ndList.singletonOrThrow();
                long batchSize = data.getShape().get(0);
                long maxLen = data.getShape().get(1);
                NDList inputs = new NDList();
                inputs.add(data.toType(DataType.INT64, false));
                inputs.add(data.getManager().full(data.getShape(), 1, DataType.INT64));
                inputs.add(data.getManager().arange(maxLen).toType(DataType.INT64, false).broadcast(data.getShape()));
                try {
                    return embedder.predict(inputs);
                } catch (TranslateException e) {
                    throw new RuntimeException(e);
                }
            }
        )
        .add(Linear.builder().setUnits(768).build())
        .add(Activation::relu)
        .add(Dropout.builder().optRate(0.2f).build())
        .add(Linear.builder().setUnits(5).build())
        .addSingleton(nd -> nd.get(":,0"));

Model model = Model.newInstance("review_classification");
model.setBlock(classifier);

DefaultVocabulary vocabulary = DefaultVocabulary.builder()
        .addFromTextFile(embedding.getArtifact("vocab.txt"))
        .optUnknownToken("[UNK]")
        .build();

int maxTokenLen = 64;
int batchSize = 8;
int limit = Integer.MAX_VALUE;
BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, true);
CsvDataset awsDataset = getDataset(batchSize, tokenizer, maxTokenLen, limit);
RandomAccessDataset[] datasets = awsDataset.randomSplit(7, 3);
RandomAccessDataset trainDataset = datasets[0];
RandomAccessDataset evalDataset = datasets[1];

SaveModelTrainingListener listener = new SaveModelTrainingListener("build/model");
listener.setSaveModelCallback(
        trainer -> {
            TrainingResult result = trainer.getTrainingResult();
            Model trainerModel = trainer.getModel();
            float acc = result.getValidateEvaluation("Accuracy");
            trainerModel.setProperty("Accuracy", String.format("%.5f", acc));
            trainerModel.setProperty("Loss", String.format("%.5f", result.getValidateLoss()));
        }
);

DefaultTrainingConfig trainingConfig = new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
        .addEvaluator(new Accuracy())
        .addTrainingListeners(TrainingListener.Defaults.logging("build/model"))
        .addTrainingListeners(listener);

int epoch = 2;
Trainer trainer = model.newTrainer(trainingConfig);
trainer.setMetrics(new Metrics());
Shape shape = new Shape(batchSize, maxTokenLen);
trainer.initialize(shape);
EasyTrain.fit(trainer, epoch, trainDataset, evalDataset);
System.out.println(trainer.getTrainingResult());
model.save(Paths.get("build/model"), "aws-review-rank");
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 6
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 6
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Training on: cpu().
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Load PyTorch Engine Version 1.12.1 in 0.079 ms.
Exception in thread "main" ai.djl.engine.EngineException: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.
at ai.djl.pytorch.jni.PyTorchLibrary.torchNNLinear(Native Method)
at ai.djl.pytorch.jni.JniUtils.linear(JniUtils.java:1189)
at ai.djl.pytorch.engine.PtNDArrayEx.linear(PtNDArrayEx.java:390)
at ai.djl.nn.core.Linear.linear(Linear.java:183)
at ai.djl.nn.core.Linear.forwardInternal(Linear.java:88)
at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:126)
at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:91)
at ai.djl.nn.SequentialBlock.forwardInternal(SequentialBlock.java:209)
at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:91)
at ai.djl.training.Trainer.forward(Trainer.java:175)
at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
at cn.amberdata.misc.djl.rankcls.Main.main(Main.java:114)
And here are my dependencies.
<!-- https://mvnrepository.com/artifact/ai.djl/api -->
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>api</artifactId>
    <version>0.19.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.36</version>
</dependency>
<!-- https://mvnrepository.com/artifact/ai.djl.pytorch/pytorch-engine -->
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-engine</artifactId>
    <version>0.19.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/ai.djl/basicdataset -->
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>basicdataset</artifactId>
    <version>0.19.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/ai.djl/model-zoo -->
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>model-zoo</artifactId>
    <version>0.19.0</version>
</dependency>
Thanks for pointing out this bug to us - this is an issue in the NoopTranslator when using the PyTorch engine. We will take a look and determine the best fix.
For now, you can adjust your code as follows to get around the exception:
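A minimal sketch of one possible workaround, based on the hint in the exception message (clone the inference tensor to get a normal tensor that autograd can use): duplicate the embedder output before returning it from the lambda. This is an assumption rather than necessarily the exact fix the maintainers had in mind; embedder and inputs are the variables from the code above.

// Inside the SequentialBlock lambda, replacing "return embedder.predict(inputs);"
try {
    NDList output = embedder.predict(inputs);
    // The predictor runs the traced model in inference mode, so its outputs are
    // inference tensors. duplicate() copies each array into a normal tensor
    // that can be saved for backward.
    NDList cloned = new NDList();
    for (NDArray array : output) {
        cloned.add(array.duplicate());
    }
    return cloned;
} catch (TranslateException e) {
    throw new RuntimeException(e);
}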
I’m glad to know that the answers helped!
By the way, the fix is in the newest snapshot version and is not in the 0.20.0 released version. It can be accessed by using the snapshot repository; currently, <UPCOMING VERSION> = 0.21.0. See https://github.com/deepjavalibrary/djl/blob/master/docs/get.md#using-built-from-source-version-in-another-project
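For reference, the linked page describes pulling snapshot builds from the Sonatype snapshots repository. A sketch of the pom.xml changes, assuming the repository id is arbitrary and the DJL artifacts are switched to the <UPCOMING VERSION>-SNAPSHOT version:

<!-- Add the snapshots repository so Maven can resolve -SNAPSHOT builds -->
<repositories>
    <repository>
        <id>djl.snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
</repositories>

<!-- Then bump each DJL dependency, e.g. -->
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>api</artifactId>
    <version>0.21.0-SNAPSHOT</version>
</dependency>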