UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
I am trying to run model.py but I am getting the following error:
```
D:\imad_web\kaggle-quora-dup_24_position>python model.py
C:\ProgramData\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Creating the vocabulary of words occurred more than 100
Traceback (most recent call last):
  File "model.py", line 122, in <module>
    embeddings_index = get_embedding()
  File "model.py", line 55, in get_embedding
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
```
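The byte named in the error can be checked in isolation: 0x90 has no assigned character in cp1252 (the Windows default codec being used here), while in UTF-8 the same byte is valid as part of a multi-byte sequence. A small demo (the sample bytes are illustrative, not taken from the actual embedding file):

```python
# 0x90 is unassigned in cp1252, so decoding fails exactly as in the
# traceback; the same byte is fine in UTF-8 as the second byte of a
# multi-byte sequence (0xCE 0x90 encodes U+0390, 'ΐ').
data = b"\xce\x90"

try:
    data.decode("cp1252")
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x90 ...

print(data.decode("utf-8"))  # ΐ
```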
```python
def get_embedding():
    embeddings_index = {}
    f = open(EMBEDDING_FILE)
    for line in f:  # line 55
        values = line.split()
        word = values[0]
        if len(values) == EMBEDDING_DIM + 1 and word in top_words:
            coefs = np.asarray(values[1:], dtype="float32")
            embeddings_index[word] = coefs
    f.close()
    return embeddings_index
```
```python
vectorizer = CountVectorizer(lowercase=False, token_pattern="\S+",
                             min_df=MIN_WORD_OCCURRENCE)
vectorizer.fit(all_questions)
top_words = set(vectorizer.vocabulary_.keys())
top_words.add(REPLACE_WORD)

embeddings_index = get_embedding()  # line 122
print("Words are not found in the embedding:", top_words - embeddings_index.keys())
top_words = embeddings_index.keys()
```
No, it's not a bug. Anyway, I've found the solution: I opened the file with `f = open(EMBEDDING_FILE, encoding="utf-8")` and now it's working 😃
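For reference, here is a minimal, self-contained sketch of `get_embedding` with that fix applied; `EMBEDDING_DIM`, the demo vocabulary, and the temporary file are placeholders for the real GloVe file and settings:

```python
import os
import tempfile

import numpy as np

EMBEDDING_DIM = 3  # placeholder; the real model uses the GloVe dimension


def get_embedding(path, top_words):
    """Parse a GloVe-style text file: one word plus its vector per line."""
    embeddings_index = {}
    # explicit encoding avoids falling back to cp1252 on Windows
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            if len(values) == EMBEDDING_DIM + 1 and word in top_words:
                embeddings_index[word] = np.asarray(values[1:], dtype="float32")
    return embeddings_index


# tiny demo file with a non-ASCII token to show why the encoding matters
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("naïve 0.1 0.2 0.3\nhello 0.4 0.5 0.6\n")
    path = tmp.name

index = get_embedding(path, {"naïve", "hello"})
print(sorted(index))  # ['hello', 'naïve']
os.remove(path)
```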
f = open(EMBEDDING_FILE, "rb", buffering=0)
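Note that opening in binary mode, as in that last suggestion, avoids the codec error but yields `bytes` objects, so each line must be decoded before calling string methods on it; a quick illustration (the sample line is made up):

```python
raw = b"na\xc3\xafve 0.1 0.2\n"  # one line as read from an "rb" file handle
line = raw.decode("utf-8")       # bytes -> str before using .split()
print(line.split())              # ['naïve', '0.1', '0.2']
```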