
Error when trying to extract text from a Word document using Python?


I'm currently trying to write a Python function that extracts text from .docx files, using the python-docx library. The function works as intended, at least when I create a .docx file in Python and then run my function on that file: it returns the text to me.

However, for .docx files (Word documents) that I have created or modified myself, it fails to open the file and raises PackageNotFoundError. I read online that I should check whether my file is a zip file. I did this with zipfile, and in fact my saved Word documents are not zip files. What's going on? Here is my Python code for verification:

from zipfile import is_zipfile
import docx

test_path = "test.docx"  # placeholder; use your own file location

doc = docx.Document()
doc.add_paragraph("Hello")
doc.save(test_path)

print(is_zipfile(test_path))
# output: True

If I then open the file at test_path, type a number, and save it:

print(is_zipfile(test_path))
# output: False

Are modern .docx documents no longer zip files? Or what is wrong here?

Everything I find when googling says that Word documents (.docx files) are zip files. I think this is why the library raises the error and cannot open the file.
I appreciate everyone trying to help. Thanks
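
One way to narrow this down might be to look at the first bytes of the saved file. The sketch below is only a diagnostic, not a fix: the helper name describe_container is hypothetical, and it checks just two well-known signatures. A ZIP container normally starts with the bytes PK\x03\x04, while an OLE2 compound file (the container used by legacy .doc files and by password-protected Office documents) starts with D0 CF 11 E0 A1 B1 1A E1. An "ole2" result would suggest the editor saved the file in a different container despite the .docx extension, but that is an assumption to verify, not a confirmed diagnosis.

```python
from zipfile import is_zipfile

# Well-known file signatures (magic bytes).
ZIP_MAGIC = b"PK\x03\x04"                         # ZIP container, used by .docx
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file


def describe_container(path):
    """Return a rough guess at the container format of the file at `path`.

    Hypothetical helper for illustration; it only recognizes two signatures.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    # Require both the ZIP signature and a readable ZIP structure.
    if header.startswith(ZIP_MAGIC) and is_zipfile(path):
        return "zip"
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"
```

Calling describe_container(test_path) before and after editing the file should show whether the container format actually changed when it was saved.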


