TfidfVectorizer handles multiple text columns
It’s really nice that transformers such as sklearn.preprocessing.OneHotEncoder and sklearn.preprocessing.StandardScaler can operate on multiple data columns simultaneously.
sklearn.feature_extraction.text.TfidfVectorizer, on the other hand, can only process one column at a time, so you need to make a new transformer for each text column in your dataset. This gets tedious and, in particular, makes pipelines more verbose.
It’d be nice if TfidfVectorizer could also operate on multiple text columns, using the same settings for each column, perhaps with an option to make one vocabulary per column, or use a shared vocabulary across all the columns.
It might be easiest to implement this as a new class that wraps TfidfVectorizer; sagemaker-scikit-learn-extension takes this approach.
If this seems like a good idea, I’d be happy to make a PR.
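To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like. The class name MultiColumnTfidf and the shared_vocabulary flag are hypothetical, chosen only for illustration; this is not the sagemaker-scikit-learn-extension implementation or an actual scikit-learn API. It assumes every column holds plain strings with no missing values, and it interprets "shared vocabulary" as fitting a single vectorizer on the text of all columns stacked together, which also shares the IDF weights.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class MultiColumnTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that applies TfidfVectorizer to every column of X."""

    def __init__(self, vectorizer=None, shared_vocabulary=False):
        self.vectorizer = vectorizer
        self.shared_vocabulary = shared_vocabulary

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        base = self.vectorizer if self.vectorizer is not None else TfidfVectorizer()
        n_cols = X.shape[1]
        if self.shared_vocabulary:
            # One vectorizer fitted on the documents of all columns stacked together.
            all_docs = pd.concat([X.iloc[:, i] for i in range(n_cols)])
            shared = clone(base).fit(all_docs)
            self.vectorizers_ = [shared] * n_cols
        else:
            # One independently fitted vectorizer (and vocabulary) per column.
            self.vectorizers_ = [clone(base).fit(X.iloc[:, i]) for i in range(n_cols)]
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        blocks = [v.transform(X.iloc[:, i]) for i, v in enumerate(self.vectorizers_)]
        # Concatenate the per-column TF-IDF blocks side by side.
        return sparse.hstack(blocks).tocsr()


df = pd.DataFrame({
    "title": ["red apple", "green pear", "red pear"],
    "body": ["sweet fruit", "sour fruit", "sweet snack"],
})
per_column = MultiColumnTfidf().fit(df)
shared = MultiColumnTfidf(shared_vocabulary=True).fit(df)
print(per_column.transform(df).shape)
print(shared.transform(df).shape)
```

Passing a configured TfidfVectorizer as the vectorizer argument keeps "the same settings for each column" without the wrapper having to mirror every TfidfVectorizer parameter.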
Issue Analytics
- Created 4 years ago
- Reactions:5
- Comments:20 (20 by maintainers)
@jnothman @amueller Here’s an example pipeline I was debugging today. It follows a common failure pattern I’ve seen a few times now:
[Pipeline code not preserved; only the fragments "handles multiple columns" and "oops" survive.]
I fixed this bug by replacing TfidfVectorizer with MultiColumnTfidfVectorizer (from sagemaker_sklearn_extension.feature_extraction.text import MultiColumnTfidfVectorizer). The fact that AWS added this to sagemaker_sklearn_extension indicates that their users frequently run into this problem too.
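The original pipeline is not available, so the following is only a sketch of the kind of failure described above, using made-up column names and data. It assumes the usual behaviour that iterating over a pandas DataFrame yields its column labels, which means a TfidfVectorizer handed a two-column DataFrame ends up vectorizing the labels rather than the rows. The workaround shown uses plain scikit-learn (one vectorizer per column) instead of the SageMaker class.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example data with two text columns.
df = pd.DataFrame({
    "title": ["red apple", "green pear", "red pear"],
    "body": ["sweet fruit", "sour fruit", "sweet snack"],
})

# Iterating a DataFrame yields its column labels, so the vectorizer
# treats "title" and "body" as the documents instead of the three rows.
bad = TfidfVectorizer().fit_transform(df[["title", "body"]])
print(bad.shape)  # the row count matches the number of columns, not the number of samples

# Workaround: one vectorizer per column, each selected by a scalar column
# name so that it receives a 1-D Series of documents.
ct = make_column_transformer(
    (TfidfVectorizer(), "title"),
    (TfidfVectorizer(), "body"),
)
good = ct.fit_transform(df)
print(good.shape)  # one row per sample
```

Inside a ColumnTransformer the same mistake (selecting the text columns with a list) typically surfaces later as a row-count mismatch when the outputs are stacked, which is part of why it is easy to miss.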
I’ve used SageMaker’s MultiColumnTfidfVectorizer and I like it. It’s nice and simple. It doesn’t support all the parameters of sklearn’s TfidfVectorizer, though, which is annoying.