TfidfVectorizer handles multiple text columns


It’s really nice that transformers such as sklearn.preprocessing.OneHotEncoder and sklearn.preprocessing.StandardScaler can operate on multiple data columns simultaneously.

sklearn.feature_extraction.text.TfidfVectorizer, on the other hand, can only process one column at a time, so you need a separate transformer for each text column in your dataset. This gets tedious and makes pipelines more verbose.
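For reference, the one-transformer-per-column workaround described above can be sketched as follows. The DataFrame and its column names (title, body) are made up for illustration; the only thing taken from scikit-learn is that a column passed to a ColumnTransformer as a bare string (rather than a list) reaches the transformer as a 1-D sequence, which is what TfidfVectorizer expects.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with two text columns.
df = pd.DataFrame(
    {
        "title": ["red apple", "green pear", "red pear"],
        "body": ["sweet and crisp", "soft and sweet", "crisp and tart"],
    }
)

# One vectorizer per text column. Each column is named with a bare string,
# not a list, so the vectorizer receives a 1-D sequence of documents.
preprocessing = make_column_transformer(
    (TfidfVectorizer(), "title"),
    (TfidfVectorizer(), "body"),
)

features = preprocessing.fit_transform(df)
print(features.shape)
```

Every additional text column needs another (TfidfVectorizer(), column) entry, which is the verbosity this issue is about.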

It’d be nice if TfidfVectorizer could also operate on multiple text columns, using the same settings for each column, perhaps with an option to build either one vocabulary per column or a single vocabulary shared across all the columns.

It might be easiest to implement this as a new class that wraps TfidfVectorizer; sagemaker-scikit-learn-extension takes this approach.
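A minimal sketch of what such a wrapper could look like is below. The class name MultiColumnTfidf and its shared_vocabulary parameter are hypothetical; this is neither the sagemaker-scikit-learn-extension implementation nor a proposed scikit-learn API. It also assumes one possible reading of "shared vocabulary": the vocabulary is learned from all columns pooled together, while IDF weights are still fitted per column.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class MultiColumnTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: one TfidfVectorizer configuration, many text columns."""

    def __init__(self, vectorizer=None, shared_vocabulary=False):
        self.vectorizer = vectorizer
        self.shared_vocabulary = shared_vocabulary

    def _columns(self, X):
        # Accept a DataFrame or 2-D array; return one list of str per column.
        # Missing values are simply stringified here (an assumption of this sketch).
        X = np.asarray(X, dtype=object)
        return [[str(v) for v in X[:, i]] for i in range(X.shape[1])]

    def fit(self, X, y=None):
        base = self.vectorizer if self.vectorizer is not None else TfidfVectorizer()
        columns = self._columns(X)
        if self.shared_vocabulary:
            # Learn one vocabulary from all columns pooled together,
            # then fit per-column IDF weights over that fixed vocabulary.
            pooled = [doc for col in columns for doc in col]
            vocab = clone(base).fit(pooled).vocabulary_
            self.vectorizers_ = [
                clone(base).set_params(vocabulary=vocab).fit(col) for col in columns
            ]
        else:
            # Independent vocabulary and IDF weights for each column.
            self.vectorizers_ = [clone(base).fit(col) for col in columns]
        return self

    def transform(self, X):
        columns = self._columns(X)
        blocks = [vec.transform(col) for vec, col in zip(self.vectorizers_, columns)]
        # Concatenate the per-column TF-IDF matrices side by side.
        return sparse.hstack(blocks, format="csr")


# Hypothetical toy data with two text columns.
df = pd.DataFrame(
    {
        "title": ["red apple", "green pear", "red pear"],
        "body": ["sweet and crisp", "soft and sweet", "crisp and tart"],
    }
)
per_column = MultiColumnTfidf().fit_transform(df)
shared = MultiColumnTfidf(shared_vocabulary=True).fit_transform(df)
print(per_column.shape, shared.shape)
```

Because the wrapper accepts a 2-D input, it can be given a list of text columns in a ColumnTransformer, the same way OneHotEncoder or StandardScaler are given lists of columns.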

If this seems like a good idea, I’d be happy to make a PR.

Issue Analytics

State:
Created 4 years ago
Reactions: 5
Comments: 20 (20 by maintainers)

Top GitHub Comments

4 reactions

zachmayer commented, Jun 15, 2020

@jnothman @amueller Here’s an example pipeline I was debugging today. It follows a common failure pattern I’ve seen a few times now:

Categorical preprocessing handles multiple columns
Numeric preprocessing handles multiple columns
Text preprocessing ~~handles multiple columns~~ oops

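from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features, categorical_features and text_features (lists of column
# names) and x_train (a DataFrame) are assumed to be defined elsewhere.

# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler()),
    ]
)
# Text pipeline: TfidfVectorizer expects a 1-D sequence of documents,
# so this is the step that breaks when text_features is a list of columns
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer()),
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_preprocessing, numeric_features),
    (categorical_preprocessing, categorical_features),
    (text_preprocessing, text_features),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)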

I fixed this bug by replacing TfidfVectorizer with MultiColumnTfidfVectorizer from sagemaker_sklearn_extension.feature_extraction.text.

The fact that AWS added this to sagemaker_sklearn_extension indicates that their users frequently run into this problem too.

0 reactions

zachmayer commented, Feb 19, 2022

I’ve used SageMaker’s TfidfVectorizer and I like it. It’s nice and simple. It doesn’t support all the parameters of TfidfVectorizer in sklearn though, which is annoying.
