You will see that I have posted about this before, asking for suggestions on which software I can use to convert PDF to docx/odt.
I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text and modify it or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft of mine was stored as a PDF. Converting them to text has become the bane of my life.
I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc does not convert PDF to any other format. In pandoc, PDF can only be the output format.
Is there a magic open source solution that I have missed?
https://pdf2docx.readthedocs.io/ seems to fit the bill. I can’t vouch for it.
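Going by its documentation, basic usage looks roughly like the sketch below. I have not run it against real files, so treat it as a starting point only. The helper names docx_path_for and convert_pdf are made up for illustration, and as far as I can tell the library targets PDFs that contain real text, so scanned pages would still need OCR first.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the output path: same folder and file name, .docx extension."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path):
    """Convert one PDF to a .docx file next to it and return the new path."""
    # Imported here so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(out_path))  # converts all pages by default
    finally:
        cv.close()  # always release the file handle
    return out_path
```

Calling convert_pdf on something like "drafts/grant.pdf" (a hypothetical file) should leave "drafts/grant.docx" beside it. Expect to clean up the layout by hand afterwards.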
PDF is such a curse. I say this as a person currently tasked with deploying new mysteriously complex enterprise PDF conversion software for technical documents. The rabbit hole is so deep.
It’s a curse because it’s used for things other than what it’s intended for. It does a good job of representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.
This is probably my first time ever using it for an appropriate purpose, as this team’s technical docs are destined for the press (and digital distribution). They just have no idea how to software, so I was brought in to build bridges between their tools and ultimately simplify them.
It is not a curse. It does exactly what it is intended to do: create an archive of a document that is universally reproducible.
It is a very well designed cul-de-sac for exactly this purpose. Using it for anything else is calling for trouble.
Speaking as a dev, the reason PDF is so strange is that it’s a compound format. It can be just images strung together. It can also be pure text with fonts, etc.
If you open the file as a text file, you can see this. It’s many different formats in a trenchcoat.
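If you want to see the trenchcoat for yourself, here is a minimal sketch that reads a PDF as raw bytes and counts a few of the dictionary markers that show up in plain sight. The function names and the choice of markers are mine, purely for illustration. It is not a parser: newer PDFs often pack their objects into compressed streams, so the counts can come out far too low.

```python
import re

# Byte patterns for a few object kinds that are often visible in a raw PDF.
MARKERS = {
    "image": rb"/Subtype\s*/Image",
    "font": rb"/Type\s*/Font\b",   # \b keeps /FontDescriptor from matching
    "stream": rb"\bendstream\b",   # one per embedded data stream
}


def count_markers(data):
    """Count how often each marker pattern occurs in the raw bytes."""
    return {name: len(re.findall(pattern, data))
            for name, pattern in MARKERS.items()}


def describe_pdf(path):
    """Read a file as bytes and report the marker counts."""
    with open(path, "rb") as f:
        return count_markers(f.read())
```

A scanned document will tend to show images and few or no fonts, while a file exported from a word processor will tend to show the opposite. That difference is a big part of why no single converter handles every PDF.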
Yeah, also a dev here. I’d be so happy if they’d parted ways with the 90s legacy bits at some point. Just glad there are enough parsing libraries that I’ll never need to care (right? Please tell me I’m right!).
I hope you’re right too lol.