AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'


Environment info

  • transformers version: 4.0.0-rc-1
  • Platform: Linux-4.9.0-14-amd64-x86_64-with-debian-9.13
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.8.0a0+4ed7f36 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes, 8-core TPU training

Who can help

albert, bert, GPT2, XLM: @LysandreJik

Information

Model I am using (Bert, XLNet …): bert and roberta

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: mlm
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Two examples of failing commands:

python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=bert \
  --model_name_or_path bert-base-cased \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=roberta \
  --tokenizer=roberta-base \
  --num_train_epochs 5 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'RobertaTokenizerFast' object has no attribute 'max_len'
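
Both commands fail on the same attribute access, so the failure does not depend on xla_spawn.py or the TPU setup at all. A minimal sketch that reproduces it (assuming transformers 4.0.0-rc-1, as in the environment above):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer.max_len  # AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'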

The timing of this issue lines up with https://github.com/huggingface/transformers/pull/8586: tests started failing on the evening of Nov 17, a few hours after that PR was submitted.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

12 reactions
LysandreJik commented, Nov 23, 2020

It is actually due to https://github.com/huggingface/transformers/pull/8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Is it possible for you to switch to one of these newer scripts? If not, the fix is to change max_len to model_max_length. We welcome PRs to fix it, but we won’t be maintaining that script ourselves as there exist better alternatives now (which run on TPU too 🙂)
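
For reference, the attribute that remains on fast tokenizers in v4 is model_max_length; a quick check, not taken from the issue itself (the 512 value is what bert-base-cased reports):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
print(tokenizer.model_max_length)  # 512 for bert-base-cased; this replaces the removed max_len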

1 reaction
mzhadigerov commented, Sep 15, 2022

It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Is it possible for you to switch to one of these newer scripts? If not, the fix is to change max_len to model_max_length. We welcome PRs to fix it, but we won’t be maintaining that script ourselves as there exist better alternatives now (which run on TPU too 🙂)

Change max_len to model_max_length where?
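
Per the tracebacks above, the attribute is read in examples/contrib/legacy/run_language_modeling.py, inside main() (line 279 in that checkout). A hedged sketch of the one-line change the maintainers describe:

# examples/contrib/legacy/run_language_modeling.py, main() — line 279 per the traceback
# Old (attribute removed in transformers v4):
#     data_args.block_size = tokenizer.max_len
# Fixed:
data_args.block_size = tokenizer.model_max_length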
