
Nota Bene

I'm here at work but I don't know if I'll make it through the day.

I had an interesting experience yesterday. It started with problems editing a PDF file and ended with hours of formatting a word processing document.

Many of you know that PDF files are not intended to be edited. That is, you create the original in another program, such as a word processor, and when done, and only when done, you convert the document into a PDF. Once it is in PDF form, you really don't want to be making substantial changes to it. This is because a PDF is, in essence, a fixed picture of a document's pages, much as a jpg file is a picture of something you photographed or created in a paint program. Trying to edit a picture as if it were a word processing document is not going to get you very far.

Indeed, if you do need to make substantial changes, you go back to your original word processing document and make the changes there - then convert again to PDF.

A problem occurs when you have to make substantial changes and you have the PDF file but not the original word processing file. If the PDF came from a word processing document and the fonts were saved into the PDF, you may still be able to make substantial changes to it. But if the PDF came from a scan of the hard copy, you're pretty much toast, because all you can do is rescan the document and run it through your Optical Character Recognition (OCR) software.

This is where things get hairy. Said software is far from perfect, even though it is getting better. You would think that, with systems available to recognize handwriting, software would be able to read printed documents. But you would be wrong, because much of how we understand a written document comes from its context.

For example, a numbered list gives order and is intended to be seen as a whole. To OCR software, the numbers are just characters and have no attachment to the words that follow. So even if the OCR reads every character correctly, your word processing software will not recognize the output as a numbered list, and you spend much time formatting the document to recreate the context.
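To make the numbered-list problem concrete, here is a minimal Python sketch of the kind of cleanup involved. It is not something my OCR or word processing software does; the function name `group_numbered_items` and the sample lines are made up for illustration, and it assumes the OCR output is plain text with one list item per line, marked like "1." or "2)".

```python
import re

# A leading list marker such as "1." or "12)" followed by the item text.
ITEM_RE = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")


def group_numbered_items(lines):
    """Group plain OCR lines into ("para", text) and ("list", [items]) blocks.

    Hypothetical helper: a line only joins a list if its number is the next
    one expected, so a stray "2004. ..." is not mistaken for a list item.
    """
    blocks = []
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no structure here
        match = ITEM_RE.match(line)
        if match:
            number, text = int(match.group(1)), match.group(2)
            last = blocks[-1] if blocks else None
            if last and last[0] == "list" and number == len(last[1]) + 1:
                last[1].append(text)  # continues the current list
                continue
            if number == 1:
                blocks.append(("list", [text]))  # starts a new list
                continue
        blocks.append(("para", line.strip()))
    return blocks


ocr_lines = [
    "Terms of agreement",
    "1. The parties agree.",
    "2) Payment is due.",
    "",
    "3. Notices in writing.",
    "Signed today.",
]
blocks = group_numbered_items(ocr_lines)
```

Even this toy version shows where the time goes: it has to guess at context the characters alone don't carry, and it would still stumble on an item that wraps onto a second line or a number the OCR misreads.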

I don't know if anyone has done a study of the point at which it becomes more efficient to type in a document than to correct and format an OCR-read document. But with the 33-page document (a memorandum of agreement) in question, all I can do is cut and paste parts of the OCR output into a clean word processing document rather than waste time making corrections.

About

This page contains a single entry from the blog posted on June 25, 2004 9:49 AM.


This weblog is licensed under a Creative Commons License.