Tokenization in C++


Is there any general strategy for tokenizing text in C++ in a way that’s compatible with the existing pretrained BertTokenizer implementation? I’m looking to use a fine-tuned BERT model in C++ for inference, and currently the only way seems to be to reproduce the BertTokenizer code manually (or modify it to be compatible with TorchScript). Has anyone come up with a better solution than this?
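For context on what "reproducing the BertTokenizer code manually" involves: the WordPiece stage of BERT tokenization is a greedy longest-match-first lookup against the model's vocabulary, with non-initial pieces prefixed by `##`. Below is a minimal C++ sketch of that stage only. The function name `wordpiece` and its parameters are hypothetical, it assumes the input word has already been lower-cased and split on whitespace and punctuation, and it measures length in bytes rather than Unicode characters, so it is not a drop-in replacement for the Python implementation.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: greedy longest-match-first WordPiece split of one word.
// Assumes the word is already normalized (lower-cased, punctuation split off).
// Length is counted in bytes here, which differs from the Python tokenizer
// for non-ASCII input.
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab,
                                   const std::string& unk = "[UNK]",
                                   std::size_t max_chars = 100) {
    if (word.size() > max_chars) return {unk};

    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        // Try the longest remaining substring first, then shrink from the right.
        while (start < end) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // continuation piece
            if (vocab.count(candidate) > 0) {
                match = candidate;
                break;
            }
            --end;
        }
        // If no prefix of the remainder is in the vocabulary, the whole word
        // maps to the unknown token.
        if (match.empty()) return {unk};
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

With a vocabulary containing `un`, `##aff`, and `##able`, calling `wordpiece("unaffable", vocab)` yields the three pieces `un`, `##aff`, `##able`. A full port would still need the basic tokenization step (Unicode cleanup, lower-casing, accent stripping, punctuation splitting) and the vocabulary file loaded from the pretrained model, neither of which is shown here.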

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

8 reactions
thomwolf commented, Dec 11, 2019

You should wait a few days if you can because @n1t0 is working on something that will very likely solve your problem and it should be ready for a first release before the end of the year.

6 reactions
MarkJGx commented, Mar 29, 2021

Why was this closed? https://github.com/huggingface/tokenizers offers no C++ solution other than developing a Rust -> C++ interop wrapper yourself, which wouldn’t work in my case.

