Reading PDF file

Codeblack

#1
Does anyone know how to read a PDF file and search for text within it?
Any input will be greatly appreciated.
 


Al Dunbar

#3
"Codeblack" <Codeblack@xxxxxx> wrote in message
news:64301878-37E2-4857-9AE8-F5812DE672DA@xxxxxx

> Anyone in this forum can help me.
Judging from the responses to date, apparently, many of us cannot help you.

VBScript's FileSystemObject has a difficult time with anything other than
text files. You will either need to work out the details of the format and
write your own interface, or find a document object model for PDFs.
Unfortunately, googling ["document object model" "portable document format"]
mostly finds information about document object models for HTML, DHTML,
Word, etc., all presented in PDF format. I checked the Adobe site and
could not find anything helpful there, other than Adobe Acrobat itself. It
could be that the full Acrobat package provides what you need, but possibly
not.

/Al
 


David Kerber

#4
In article <195FFDF9-7A3E-4FCD-8D56-BC2F454975D4@xxxxxx>,
Codeblack@xxxxxx says...

> Does any one know how to read a pdf file and search for text within the pdf.
> Any inputs will be greatly appreciated.
A .pdf is just a text file with some mark-up elements, so you can search
for contained text just like you would a .html or .txt file.

--
/~\ The ASCII
\ / Ribbon Campaign
X Against HTML
/ \ Email!

Remove the ns_ from if replying by e-mail (but keep posts in the
newsgroups if possible).
 


gimme_this_gimme_that

#5
Long shot ....

In Excel:

Go into the VBA IDE (Alt-F11)
Go into Tools->References
Check all the Adobe Libraries

I have:
Adobe Acrobat 7.0 Browser Control Type Library 1.0
Adobe Acrobat 7.0 Type Library


Go into Object Browser

See if you can get a VBA Sub going that looks like this:

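Sub SearchPDF()
    Dim a As AcroAVDoc
    Dim ln As Long
    Dim b As Boolean
    Set a = New AcroAVDoc
    a.Open "C:\mypdf.pdf", ""   'Open takes a path and a window title
    ln = 1
    'Guess at the arguments: case-sensitive, whole words only, reset to start
    b = a.FindText("SearchTextString", ln, ln, ln) 'b is a boolean
    MsgBox CStr(b)
End Sub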

*IF* you ever get that to work - the arguments to FindText are
undocumented - the next step is to translate this into VBScript -

Someone might be able to help you here with another post.
You'd need to convert this VBA:

Set a = New AcroAVDoc

'into VBScript that might look like this:

Set a = CreateObject("AcroExch.AVDoc") 'ProgID registered by full Acrobat, not by Reader

YMMV
 


Al Dunbar

#6
"David Kerber" <ns_dkerber@xxxxxx_WarrenRogersAssociates.com> wrote in message
news:MPG.23f665d27256871989ce2@xxxxxx

> A .pdf is just a text file with some mark-up elements
Not the one I just renamed as .txt and opened in notepad...

/Al

gimme_this_gimme_that

#7
Oops. I forgot to tell you what to do once you get to the Object Browser.

You probably figured that out...

In the VBA IDE select View->Object Browser

In the drop down in the middle of the page where it says <All
Libraries> select Acrobat

Peruse the objects.

For example, click AcroAVDoc - and you see the method FindText.
 


Tom Lavedas

#8
On Feb 9, 1:25 pm, "Al Dunbar" <aland...@xxxxxx> wrote:

> "David Kerber" <ns_dkerber@xxxxxx_WarrenRogersAssociates.com> wrote in message
>
> > A .pdf is just a text file with some mark-up elements
>
> Not the one I just renamed as .txt and opened in notepad...
>
Later versions of PDF seem to be encoded to keep a plain-text search from
working, but I think that's still at the discretion of the creator. That is,
some are searchable and some aren't. Clearly, scanned documents
in PDF format are unsearchable, since they are image based.

Tom Lavedas
***********
http://there.is.no.more/tglbatch/
 


Tom Lavedas

#9
On Feb 9, 1:19 pm, "gimme_this_gimme_t...@xxxxxx"
<gimme_this_gimme_t...@xxxxxx> wrote:

> *IF* you ever get that to work - the arguments to FindText are
> undocumented - the next step is to translate this into VBScript -
>
The Acrobat controls do not provide a shell of their own, but must be
hosted by an application, like IE. Gunter Born wrote about this years
ago. His web site, WSH Bazaar, is no longer maintained, but is still
out there. See: http://freenet-homepage.de/gborn/WSHBazaar/WSHBazaar.htm.
In the Newsletter #5, he presents the basics of hosting the Acrobat
Reader ActiveX in IE and does a lot of manipulations. Unfortunately,
he does not cover the method you discuss and some of the supporting
files are missing. Further, if the input arguments must be typed as
Long, they cannot be implemented in script, since all variables in
script are of type Variant.

I looked at the methods that are exposed in all of the Acrobat ActiveX
libraries on my machine and I cannot find a reference to a FindText
method. I did this with show hidden objects selected. Where did you
find a reference to this method?

Tom Lavedas
***********
http://there.is.no.more/tglbatch/
 


Paul Randall

#10
"Tom Lavedas" <tglbatch@xxxxxx> wrote in message
news:429bac15-3b35-46c4-b82b-15d2b0cc23ef@xxxxxx
> Later versions of pdf seem to be encoded to keep that from happening,
> but I think that's still at the discretion of the creator. That is,
> some are and some aren't searchable. Clearly, the scanner documents
> in pdf format are unsearchable, since they are image based.

---------------------------------------------
I think it is way more complex than that.
Try downloading http://www.sfmta.com/cms/mmaps/documents/47.pdf.

Looking at the file with Notepad, you will find almost no text that looks
like street names or Muni route numbers.
Look at it with Acrobat Reader. The text zooms beautifully (like a font
size change, not like zooming a bitmap).
Use Acrobat's binocular icon and search for some text, like 9x. It finds 3
occurrences that are readable on the map. Notepad finds two occurrences,
but I think these have nothing to do with the text '9x'.

-Paul Randall
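
One visible reason Notepad finds so little: PDF producers usually compress the page content streams, and each compressed stream says so with a /FlateDecode filter entry in its dictionary. A rough sketch that counts those entries follows. The path is a placeholder, and the script assumes a non-empty file on a system with a single-byte ANSI code page, because FileSystemObject is reading binary data as text here. Even after decompression, the text may be stored as font-specific codes rather than readable characters, so this only explains part of the problem.

```vbscript
Option Explicit

' Hypothetical path; point it at any PDF you want to inspect.
Const PDF_PATH = "C:\mypdf.pdf"
Const FOR_READING = 1

Dim fso, ts, raw, hits, pos

Set fso = CreateObject("Scripting.FileSystemObject")

' Read the whole file as ANSI text (assumption: single-byte code page,
' non-empty file; ReadAll raises an error on an empty file).
Set ts = fso.OpenTextFile(PDF_PATH, FOR_READING)
raw = ts.ReadAll
ts.Close

' Count occurrences of the /FlateDecode filter name, which marks a
' zlib-compressed stream.
hits = 0
pos = InStr(1, raw, "/FlateDecode", vbBinaryCompare)
Do While pos > 0
    hits = hits + 1
    pos = InStr(pos + 1, raw, "/FlateDecode", vbBinaryCompare)
Loop

If Left(raw, 5) <> "%PDF-" Then
    WScript.Echo "No %PDF- header found; this may not be a PDF file."
ElseIf hits > 0 Then
    WScript.Echo hits & " stream(s) declare /FlateDecode."
Else
    WScript.Echo "No /FlateDecode filter found in the raw bytes."
End If
```

Run it with cscript.exe so the result goes to the console instead of a message box.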
 


gimme_this_gimme_that

#11
Hi Tom,

I checked the "Adobe Acrobat 7.0 Type Library" to get the Acrobat
library.
I clicked on the AcroAVDoc class and the FindText Function appears
there.

Yes. Like I said it's a long shot. And you make a good point that a
Variant should be used in the VBA - not a Long.

Incidentally, I also have an "AcroIEHelper 1.0 Type Library" - but I'm
not using that.

If you're really interested I could spend some time actually trying to
get the code to work.

What I posted was a better-than-nothing thing that may or may not work.
I figured Codeblack would post again once he looked at it more.


Al Dunbar

#12
"Paul Randall" <paulr90@xxxxxx> wrote in message
news:uVl4ZJxiJHA.1172@xxxxxx

> I think it is way more complex than that.
> Try downloading http://www.sfmta.com/cms/mmaps/documents/47.pdf.
>
> Looking at the file with NotePad, you will find almost no text that looks
> like street names or Muni route numbers.
> Look at it with Acrobat Reader. The text zooms beautifully (like text
> font size changes, not zooming a bit map).
> Use Acrobat's binocular icon and search for some text, like 9x. It finds
> 3 occurrences that are readable on the map. NotePad finds two
> occurrences, but I think these have nothing to do with text '9x'.
In fact, isn't one of the reasons for pdf documents the fact that, relative
to word documents, at least, they are not easily modified? If the contained
text appeared in more-or-less plain text, well...

/Al
 


Tom Lavedas

#13
On Feb 9, 8:47 pm, "gimme_this_gimme_t...@xxxxxx"
<gimme_this_gimme_t...@xxxxxx> wrote:

> I clicked on the AcroAVDoc class and the FindText Function appears
> there.
>
Yes, I later found the AcroAVDoc class in the dll you cite. However,
I could not instantiate that class in a VBS script. It is not
registered as a class, so CreateObject cannot instantiate it. My
attempts to use GetObject to do it from a reference to the type
library dll at "C:\Program Files\Adobe\Reader 9.0\Reader\AcroRd32.dll"
also failed, though that is the library that is linked in an Office
object browser, like the one in Excel, to expose the class.

The 'if all else fails' approach to linking to such a class is to use
an <object> tag in a WSH, HTA or HTML document, but that requires a
ClassID, which I failed to track down in the registry after 30 minutes
(and gave up).

I suppose an Excel.Application could be used to instantiate it from
script, but it is clear that Adobe does not really want people to
access their code except through their Reader. As Al Dunbar suggested
in another post in this thread, "In fact, isn't one of the reasons for
pdf documents the fact that, relative to word documents, at least,
they are not easily modified?" I would add that they are also not too
easy to co-opt - at least not with automation.

[As an aside, isn't it interesting that we seem to be taking more
interest in this subject than the original poster? I've noticed that
in many threads: the poster gets what they need, or decides it's more
trouble than it's worth and bails, while the diehards beat the subject
to death. ;-)]

Tom Lavedas
***********
http://there.is.no.more/tglbatch/
 


Alex K. Angelopoulos

#14
Sorry for the late post, but I thought it was worth mentioning that although
it isn't possible to use an object model to search PDF data from script,
there actually _are_ tools that can extract the text in a PDF automatically.
If you use pdftotext from the xpdf tools (see the Win32 download link at
http://www.foolabs.com/xpdf/download.html ), you can get the raw text from a
PDF file extracted and placed in a text document.

Besides the caveats people have mentioned about trying to read PDF files
(primarily issues with protected files) there are other problems you may
encounter. Many PDF files have text that is stored as images instead of real
text, and actual page layout affects how data is interpreted. However,
within those limits I've found pdftotext to work quite well for getting data
out of a PDF file as raw text.
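
For anyone who wants to try this from VBScript, here is a minimal sketch of the pdftotext approach. It assumes pdftotext.exe is on the PATH (Run raises an error if it cannot be found); the file paths and the search string are placeholders, not anything taken from the original question.

```vbscript
Option Explicit

' Placeholders: adjust the paths and the search text for your own files.
Const PDF_PATH = "C:\mypdf.pdf"
Const TXT_PATH = "C:\mypdf.txt"
Const SEARCH_TEXT = "SearchTextString"
Const FOR_READING = 1

Dim shell, fso, exitCode, content

Set shell = CreateObject("WScript.Shell")
Set fso = CreateObject("Scripting.FileSystemObject")

' Run pdftotext with a hidden window (0) and wait for it to finish (True).
' Usage is: pdftotext <PDF-file> <text-file>
exitCode = shell.Run("pdftotext """ & PDF_PATH & """ """ & TXT_PATH & """", 0, True)

If exitCode <> 0 Or Not fso.FileExists(TXT_PATH) Then
    WScript.Echo "pdftotext failed, exit code " & exitCode
    WScript.Quit 1
End If

' Read the extracted text. ReadAll raises an error on an empty file,
' so treat a zero-byte result as "no text extracted".
If fso.GetFile(TXT_PATH).Size = 0 Then
    content = ""
Else
    With fso.OpenTextFile(TXT_PATH, FOR_READING)
        content = .ReadAll
        .Close
    End With
End If

' vbTextCompare makes the search case-insensitive.
If InStr(1, content, SEARCH_TEXT, vbTextCompare) > 0 Then
    WScript.Echo "Found: " & SEARCH_TEXT
Else
    WScript.Echo "Not found: " & SEARCH_TEXT
End If
```

A scanned, image-only PDF will typically produce an empty or nearly empty text file, so "Not found" from this script does not prove the words are absent from the page as a reader sees it.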
