_batch_encode_plus() got an unexpected keyword argument 'is_pretokenized' using BertTokenizerFast


System Info

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# training_set is a custom Dataset (defined elsewhere in test.py) whose
# __getitem__ tokenizes one sentence and returns input_ids plus word-level labels
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  print('{0:10}  {1}'.format(token, label))

The error I am getting is:
Traceback (most recent call last):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 108, in <module>
    for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 66, in __getitem__
    encoding = self.tokenizer(sentence,
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2477, in __call__
    return self.batch_encode_plus(
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2668, in batch_encode_plus
    return self._batch_encode_plus(
TypeError: _batch_encode_plus() got an unexpected keyword argument 'is_pretokenized'
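
For context: fast tokenizers renamed the is_pretokenized argument to is_split_into_words during the 3.x releases, and the old spelling was later removed entirely, which is exactly the TypeError reported here (the same script works on transformers 3.0.2 but fails on 4.19.2, per the comments below). The Dataset class itself is not shown in the issue, but judging from the traceback the fix is to rename that keyword in the self.tokenizer(...) call at line 66 of test.py. A minimal sketch, assuming the usual truncation/padding arguments (not visible in the issue):

# before: raises TypeError on transformers 4.x
encoding = self.tokenizer(sentence, is_pretokenized=True, truncation=True, padding='max_length')

# after: the renamed keyword accepted by transformers 4.x
encoding = self.tokenizer(sentence, is_split_into_words=True, truncation=True, padding='max_length')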

Who can help?

@SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Download the NER dataset from Kaggle (https://www.kaggle.com/datasets/namanj27/ner-dataset).
  2. Run the script below with the transformers and tokenizers versions listed under "Expected behavior" (a reconstruction of the Dataset class it relies on is sketched after this list):

     tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
     for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
       print('{0:10}  {1}'.format(token, label))
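
The issue does not include the Dataset class that the traceback points at, so the following is a hypothetical reconstruction (the class name, fields, and truncation/padding arguments are all assumptions), written with the renamed is_split_into_words keyword that current transformers versions expect:

import torch
from torch.utils.data import Dataset

class NERDataset(Dataset):  # hypothetical stand-in for the user's class
    def __init__(self, sentences, word_labels, tokenizer, max_len=128):
        self.sentences = sentences        # each sentence is a list of words
        self.word_labels = word_labels    # one label id per word
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        encoding = self.tokenizer(sentence,
                                  is_split_into_words=True,  # formerly is_pretokenized
                                  truncation=True,
                                  padding='max_length',
                                  max_length=self.max_len)
        # align the word-level labels with the subword tokens; -100 marks
        # special tokens and padding so they are ignored by the loss
        labels = [self.word_labels[idx][w] if w is not None else -100
                  for w in encoding.word_ids()]
        item = {key: torch.tensor(val) for key, val in encoding.items()}
        item["labels"] = torch.tensor(labels)
        return item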

Expected behavior

I expect the script above to print each token together with its label.

Python version: 3.9
tokenizers: 0.12.1
transformers: 4.19.2

Can anyone shed some light on this, please?
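
Once the keyword is renamed, a self-contained sketch of that expected printout looks like this (the sentence and labels are made up for illustration; word_ids() is the fast-tokenizer API for mapping subword tokens back to words):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
words = ['Harry', 'Potter', 'lives', 'in', 'London']   # made-up example
word_labels = ['B-per', 'I-per', 'O', 'O', 'B-geo']

encoding = tokenizer(words, is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
for token, word_id in zip(tokens, encoding.word_ids()):
    label = word_labels[word_id] if word_id is not None else 'O'  # [CLS]/[SEP]
    print('{0:10}  {1}'.format(token, label))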

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
naarkhoo commented, Dec 17, 2022

I am having the same problem.

Here is the output of transformers-cli env:

- `transformers` version: 4.25.1
- Platform: Linux-5.10.133+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Tensorflow version (GPU?): 2.9.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

You can also find the Colab notebook here.

0 reactions
berkekavak commented, Dec 20, 2022

I'm experiencing the same issue; I think it comes down to version compatibility in PyTorch or Transformers. This notebook is different from the others since the predictions are made sentence-wise.

It works fine with Python 3.7 and Transformers 3.0.2. @SaulLu, I would appreciate your help.
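
This version sensitivity is consistent with the keyword rename described above: 3.0.x accepts is_pretokenized, while recent 4.x releases accept only is_split_into_words. If a script has to run under both lines, one defensive option is to pick the keyword at runtime; a sketch, not part of any official API:

import inspect
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# check which spelling this transformers version actually supports
params = inspect.signature(tokenizer.__call__).parameters
split_kwarg = ('is_split_into_words' if 'is_split_into_words' in params
               else 'is_pretokenized')

encoding = tokenizer(['Hello', 'world'], **{split_kwarg: True})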
