docx created with Word Online

https://github.com/ankushshah89/python-docx2txt/blob/f71c423c3562b7c3f5bfcec822e384273d0034f2/docx2txt/docx2txt.py#L87

If I create a docx in SharePoint it takes me to Word Online. I add some text and it saves automatically. Then I download the file.

Now I do the following:

import zipfile
zip = zipfile.ZipFile('path/to/file.docx')
xml = zip.read('word/document.xml')

This fails with KeyError: "There is no item named 'word/document.xml' in the archive"

There is, however, a ‘word/document2.xml’ which contains (at least for my one trial case) the same as ‘word/document.xml’. I discovered this by opening ‘path/to/file.docx’ in actual Microsoft Word on my local machine and then saving the file. NOW when I do zip.read('word/document.xml') the xml file is there as expected.
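A quick way to see which document part an archive actually contains is to list its members. The sketch below is only illustrative: the helper name `list_document_parts` is made up for this example, and the in-memory archive is a stand-in that mimics the layout described above, not a valid Word file.

```python
import io
import zipfile


def list_document_parts(docx):
    """Return archive members whose names start with 'word/document'."""
    with zipfile.ZipFile(docx) as archive:
        return [name for name in archive.namelist()
                if name.startswith("word/document")]


# Stand-in archive (hypothetical contents) mimicking the Word Online layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("word/document2.xml", "<w:document/>")
    archive.writestr("word/styles.xml", "<w:styles/>")

buf.seek(0)
print(list_document_parts(buf))
```

With a real file, pass the path instead of the buffer, e.g. `list_document_parts('path/to/file.docx')`.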

I really don’t know much about this stuff, or why creating a file with Word Online appears to produce something different than local Word does, so I don’t know what the best solution is. It seems hack-ish to just add a line to the code that says "if you can’t find ‘word/document.xml’, look for ‘word/document2.xml’", but maybe that’s all we need. Let me know.
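One possible alternative to guessing file names, assuming the file follows the Open Packaging Conventions, is to read the package's root relationships file `_rels/.rels`, which names the main document part. This is a sketch, not what docx2txt does; `find_main_document` is a hypothetical helper and the archive below is a minimal stand-in built for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_main_document(archive):
    """Return the path of the main document part declared in _rels/.rels."""
    root = ET.fromstring(archive.read("_rels/.rels"))
    for rel in root.iter(RELS_NS + "Relationship"):
        # The relationship type for the main part ends in '/officeDocument'.
        if rel.get("Type", "").endswith("/officeDocument"):
            # Targets are relative to the package root; drop a leading slash.
            return rel.get("Target").lstrip("/")
    raise KeyError("no officeDocument relationship found in _rels/.rels")


# Minimal stand-in package (hypothetical contents) for demonstration.
rels = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document2.xml"/>'
    '</Relationships>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("_rels/.rels", rels)
    zf.writestr("word/document2.xml", "<doc/>")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    main_part = find_main_document(zf)
    xml = zf.read(main_part)
print(main_part)
```

The appeal of this approach is that it does not depend on how a particular editor happens to name the part, though it has not been tested here against real Word Online output.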

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
SamMorrowDrums commented, Mar 19, 2018

Yep, @fjouault I’ve updated my PR to reflect this. As far as I can tell this solution is robust. I’d appreciate any help with manually testing it, but hopefully this is ready to merge (or at least very close).

0 reactions
wendywangwwt commented, Sep 4, 2020

I’m getting this issue and a temporary fix I created is as follows:

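# Path to the installed docx2txt module (specific to this environment).
path = '/opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages/docx2txt/docx2txt.py'
with open(path, 'r') as f:
    script = f.readlines()

# Rewrite the file, swapping out the line that hard-codes 'word/document.xml'.
with open(path, 'w') as f:
    for i, line in enumerate(script):
        # Index 86 is line 87 of docx2txt.py.
        if i == 86:
            line = "    doc_xml = [re.findall('(word/document.*)',fn)[0] for fn in filelist if len(re.findall('(word/document.*)',fn)) > 0][0]\n"
        f.write(line)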

Basically, I replace the hard-coded XML path (line 87) with a regular expression search, as shown above. We do it this way because our environment is containerized, so we have to reinstall the package and patch this line every time. If you run this in your own long-lived environment, simply replace line 87 with the following:

doc_xml = [re.findall('(word/document.*)',fn)[0] for fn in filelist if len(re.findall('(word/document.*)',fn)) > 0][0]

It’s definitely not perfect and can be improved… but anyway it solves my problem 😃
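For readers who would rather not patch the installed file, the same idea can be sketched as a small standalone function. The name `find_doc_xml` is hypothetical, and the pattern here is deliberately stricter than the one in the comment above: it only accepts `word/document` followed by optional digits and `.xml`, and it raises a descriptive `KeyError` instead of an `IndexError` when nothing matches.

```python
import re

# Matches 'word/document.xml', 'word/document2.xml', and so on.
DOC_XML_PATTERN = re.compile(r"word/document\d*\.xml")


def find_doc_xml(filelist):
    """Return the first archive name that looks like the main document part."""
    for name in filelist:
        if DOC_XML_PATTERN.fullmatch(name):
            return name
    raise KeyError("no word/document*.xml part in the archive")


# Example with a hand-written name list; in practice this would be
# the result of zipfile.ZipFile(...).namelist().
names = ["[Content_Types].xml", "word/document2.xml", "word/styles.xml"]
print(find_doc_xml(names))  # prints: word/document2.xml
```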
