AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
Environment info

- transformers version: 4.0.0-rc-1
- Platform: Linux-4.9.0-14-amd64-x86_64-with-debian-9.13
- Python version: 3.6.10
- PyTorch version (GPU?): 1.8.0a0+4ed7f36 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: yes, 8-core TPU training
- Using TPU: yes
Who can help
albert, bert, GPT2, XLM: @LysandreJik
Information
Model I am using (Bert, XLNet …): bert and roberta
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: mlm
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Two examples of failing commands:
```bash
python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=bert \
  --model_name_or_path bert-base-cased \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16
```
```
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
```
```bash
python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=roberta \
  --tokenizer=roberta-base \
  --num_train_epochs 5 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8
```
```
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'RobertaTokenizerFast' object has no attribute 'max_len'
```
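For reference, a minimal snippet (my own, not part of the example script) that reproduces the same error outside the training pipeline, assuming transformers 4.0.0 is installed:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Removed in transformers v4: raises
# AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
print(tokenizer.max_len)

# The attribute that replaced it still works:
print(tokenizer.model_max_length)  # 512 for bert-base-cased
```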
The timing of this issue lines up with https://github.com/huggingface/transformers/pull/8586: tests started failing on the evening of Nov 17, a few hours after that PR was submitted.
It is actually due to https://github.com/huggingface/transformers/pull/8604, where we removed several deprecated arguments. The `run_language_modeling.py` script is deprecated in favor of `language-modeling/run_{clm,plm,mlm}.py`. Is it possible for you to switch to one of these newer scripts? If not, the fix is to change `max_len` to `model_max_length`. We welcome PRs to fix it, but we won't be maintaining that script ourselves, as better alternatives now exist (which run on TPU too 🙂).

Change `max_len` to `model_max_length` where?
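For anyone still on the legacy script, a minimal sketch of the change at the line the tracebacks point to (`run_language_modeling.py`, line 279); the surrounding code may differ slightly between versions:

```python
# examples/contrib/legacy/run_language_modeling.py, around line 279 (per the tracebacks above)

# Before -- fails on transformers >= 4.0.0, where `max_len` was removed from the tokenizers:
data_args.block_size = tokenizer.max_len

# After -- uses the attribute that replaced it:
data_args.block_size = tokenizer.model_max_length
```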