Tuesday, April 24, 2007

How to convert PDF to Word?

I got this email from my friend a few days back:
"Hi, do you have any idea of a software that I could use to convert pdf to Word. I'm working on a movie but I've got the script in PDF... I want to convert it to Word. I tried but i am not able to copy it... any idea???"

Most probably the PDF stores the text as images (likely because it's a scanned page saved as PDF). It is usually difficult to copy text out of such PDF documents unless the tool can use optical character recognition (OCR) to recover the text.
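One rough way to check whether a PDF really is image-only is to try extracting its text and see whether anything comes out. The sketch below assumes the `pdftotext` command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH. The function names and the 20-character threshold are illustrative assumptions, not part of any tool.

```python
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """Return True if the extracted text has enough non-whitespace characters.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Run pdftotext and return what it prints; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example use (requires pdftotext and a real file):
#   if not has_text_layer(extract_text("script.pdf")):
#       print("Probably a scanned PDF; OCR would be needed.")
```

If almost nothing comes back, the pages are most likely scanned images and plain copy and paste will not work.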

But the question is: can we convert a PDF document to MS Word?
I searched and found many programs that do this, but all of them turned out to be closed source, costly, and at times not good enough. I was looking for an open source option to do the same.

So finally I advised him to use PDFtoHTML to convert the PDF document to HTML, then open the HTML file in MS Word and copy the text out from there.
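For anyone who wants to script that first step, here is a minimal sketch. It assumes the `pdftohtml` command-line tool is installed and on the PATH, and it uses the `-noframes` option so the result is a single HTML page rather than a frameset. The function names and the file names `script.pdf` and `script` are made up for illustration.

```python
import subprocess
from typing import List


def build_pdftohtml_command(pdf_path: str, out_base: str) -> List[str]:
    """Build the pdftohtml command line: input PDF, then the output base name."""
    # -noframes asks for one HTML document instead of a frameset.
    return ["pdftohtml", "-noframes", pdf_path, out_base]


def convert_pdf_to_html(pdf_path: str, out_base: str) -> None:
    """Run pdftohtml; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdftohtml_command(pdf_path, out_base), check=True)


# Example use (requires pdftohtml and a real file):
#   convert_pdf_to_html("script.pdf", "script")
# The resulting HTML file can then be opened in MS Word and saved as .doc.
```

This only helps when the PDF contains real text. For a scanned script the HTML will hold page images and little else.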

I know this is a cross-country trip, but if someone has a better way to do the same using only open source software, please let us know.

7 comments:

KB Jørgensen said...

I've had the exact same problem. Scribus claims to be able to open PDF files for editing, but each time I try, it says "File C:/path/to/file.pdf is not an acceptable format". This happens with different PDF files from different sources. I tried Scribus versions 1.3.3.8 and 1.3.3.9cvs, both on Windows.

Techknight said...

Hi KB,
I installed Scribus today and did an initial round of testing. I had tried to use Scribus some time back but was not able to install it then for some reason.

Well, I am not sure whether Scribus can import PDF. Check this link

What I was able to do was first print the PDF to a .ps file (a PostScript printer is required) and then import the PS file into Scribus to edit it. This seems to be working.

I was impressed with Scribus but haven't played around with it much, so I will post about it sometime later.

Hope this helps you KB.

jay said...

Use SimpleOCR

Techknight said...

But SimpleOCR is NOT open source! We definitely do not have access to all of its features, nor is the app's source code available to us. Any other open source suggestions?

Anonymous said...

Can I suggest

pdf2doc

PDF2DOC software converts PDF document into MS Word document format (RTF or Word), so you can edit and reuse your PDF content. PDF2Doc preserves the original PDF text, layout and bitmap images in the generated Word document.






Mike said...

I've been doing exactly this during the past week. The PDF was generated from Final Draft. I have something which lets me edit PDFs, but I chose 'continuous' in 'page layout', selected all, and copied and pasted into a word processor.

It loses the formatting, but the great thing is I can track changes and send the word processor file back, and he can choose whether to incorporate the changes. It's working very well - a few pages a day, and he keeps control of his document.

Hikari said...

Nice tips!

I'm looking for a Word plugin to generate PDF while keeping track of references. For example, I have a summary listing all chapters, and in the PDF I want to click on a list item and be sent to that chapter.

I found some apps, but all are paid, so I'd rather stick with Adobe Pro.


And what's the best way to extract images from a PDF?
For text, if I can copy it, I prefer to manually copy and paste it into a new doc. But what's the best way to extract images? Print Screen?