Does anyone know how to read a PDF file and search for text within it?
Any input will be greatly appreciated.
"Codeblack" <Codeblack@xxxxxx> wrote in message
news:64301878-37E2-4857-9AE8-F5812DE672DA@xxxxxx
> Anyone in this forum can help me.
Judging from the responses to date, apparently, many of us cannot help you.
VBScript's file system object has a difficult time with anything other than
text files. You will either need to determine the details of the format and
write your own interface, or find a document object model for PDFs.
Unfortunately, googling ["document object model" "portable document format"]
seems to find information about document object models for HTML, DHTML,
Word, etc., all presented in PDF format. I checked the Adobe site, and
could not find anything helpful there, other than Adobe Acrobat itself. It
could be that the full Acrobat package provides what you need, but possibly
not.
/Al
In article <195FFDF9-7A3E-4FCD-8D56-BC2F454975D4@xxxxxx>,
Codeblack@xxxxxx says...
> Does any one know how to read a pdf file and search for text within the pdf.
> Any inputs will be greatly appreciated.
A .pdf is just a text file with some mark-up elements, so you can search
for contained text just like you would a .html or .txt file.
--
/~\ The ASCII
\ / Ribbon Campaign
X Against HTML
/ \ Email!
Remove the ns_ from if replying by e-mail (but keep posts in the
newsgroups if possible).
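The plain-text search suggested in the post above can be sketched in a few lines. This is a minimal illustration in Python (chosen only so the sketch runs anywhere, not because the thread uses it), and `raw_contains` is a hypothetical helper name. It treats the file as raw bytes, so it can only succeed when the PDF happens to store its text uncompressed and as literal characters.

```python
from pathlib import Path


def raw_contains(path, needle, encoding="latin-1"):
    """Return True if `needle` occurs literally in the file's raw bytes.

    Hypothetical helper: no PDF parsing is done, so text that sits inside
    compressed or specially encoded streams will not be found.
    """
    data = Path(path).read_bytes()
    return needle.encode(encoding) in data
```

A False result from this function therefore does not mean the text is absent from the document, only that it is not visible in the raw bytes.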
Long shot ....
In Excel:
Go into the VBA IDE (Alt-F11)
Go into Tools->References
Check all the Adobe Libraries
I have:
Adobe Acrobat 7.0 Browser Control Type Library 1.0
Adobe Acrobat 7.0 Type Library
Go into Object Browser
See if you can get a VBA Sub going that looks like this:
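Sub SearchPDF()
    ' Assumes the full Acrobat product is installed and its type library is referenced.
    Dim a As AcroAVDoc
    Dim b As Boolean
    Set a = New AcroAVDoc
    If a.Open("C:\mypdf.pdf", "") Then
        ' Assumed argument order: text, case-sensitive, whole words only, reset search
        b = a.FindText("SearchTextString", False, False, True) 'b is a boolean
        MsgBox CStr(b)
        a.Close True
    End If
End Sub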
*IF* you ever get that to work (I could not find documentation for the
FindText arguments), the next step is to translate this into VBScript.
Someone might be able to help you here with another post.
You'd need to convert this VBA:
Set a = New AcroAVDoc
'into VBScript that might look like this:
Set a = CreateObject("AcroExch.AVDoc")
YMMV
"David Kerber" <ns_dkerber@xxxxxx_WarrenRogersAssociates.com> wrote in message
news:MPG.23f665d27256871989ce2@xxxxxx
> In article <195FFDF9-7A3E-4FCD-8D56-BC2F454975D4@xxxxxx>,
> Codeblack@xxxxxx says...
>> Does any one know how to read a pdf file and search for text within the
>> pdf.
>> Any inputs will be greatly appreciated.
> A .pdf is just a text file with some mark-up elements
Not the one I just renamed as .txt and opened in notepad...
/Al
> so you can search
> for contained text just like you would a .html or .txt file.
Oops. I forgot to tell you what to do once you get to the Object Browser.
You probably figured that out...
In the VBA IDE select View->Object Browser
In the drop down in the middle of the page where it says <All
Libraries> select Acrobat
Peruse the objects.
For example, click AcroAVDoc - and you see the method FindText.
On Feb 9, 1:25 pm, "Al Dunbar" <aland...@xxxxxx> wrote:
> "David Kerber" <ns_dkerber@xxxxxx_WarrenRogersAssociates.com> wrote in message
> news:MPG.23f665d27256871989ce2@xxxxxx
> > A .pdf is just a text file with some mark-up elements
> Not the one I just renamed as .txt and opened in notepad...
Later versions of pdf seem to be encoded to keep that from happening,
but I think that's still at the discretion of the creator. That is,
some are and some aren't searchable. Clearly, scanned documents
in pdf format are unsearchable, since they are image based.
Tom Lavedas
***********
http://there.is.no.more/tglbatch/
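The remark above that some PDFs are searchable as raw text and some are not can be checked roughly by looking for a few dictionary keys in the file's bytes. The sketch below is a heuristic in Python under stated assumptions: `describe_pdf_bytes` is a hypothetical name, it only scans for the standard keys `/FlateDecode`, `/Subtype /Image` and `/Encrypt`, and it does not parse the file. Newer PDFs can hide these keys inside compressed object streams, in which case the counts will be too low.

```python
import re


def describe_pdf_bytes(data):
    """Report markers in raw PDF bytes that explain why a plain-text search may fail.

    Heuristic only: it counts dictionary keys and does not parse the file.
    """
    return {
        "is_pdf": data.startswith(b"%PDF-"),
        # Streams declared with the Flate (zlib) compression filter.
        "flate_streams": len(re.findall(rb"/FlateDecode\b", data)),
        # Image objects, typical of scanned pages.
        "image_objects": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        # An /Encrypt entry signals an encrypted document.
        "encrypted": re.search(rb"/Encrypt\b", data) is not None,
    }
```

Calling it on the result of `open(path, "rb").read()` gives a quick hint: many Flate streams suggest the text is compressed, and a file that is mostly image objects is probably a scan.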
On Feb 9, 1:19 pm, "gimme_this_gimme_t...@xxxxxx"
<gimme_this_gimme_t...@xxxxxx> wrote:
> Long shot ....
> [...]
> *IF* you ever get that to work - the arguments to FindText are
> undocumented - the next step is to translate this into VBScript -
The Acrobat controls do not provide a shell of their own, but must be
hosted by an application, like IE. Gunter Born wrote about this years
ago. His web site, WSH Bazaar, is no longer maintained, but is still
out there. See: http://freenet-homepage.de/gborn/WSH.../WSHBazaar.htm.
In the Newsletter #5, he presents the basics of hosting the Acrobat
Reader ActiveX in IE and does a lot of manipulations. Unfortunately,
he does not cover the method you discuss and some of the supporting
files are missing. Further, if the input arguments must be typed as
Long, they cannot be implemented in script, since all variables in
script are of type Variant.
I looked at the methods that are exposed in all of the Acrobat ActiveX
libraries on my machine and I cannot find a reference to a FindText
method. I did this with show hidden objects selected. Where did you
find a reference to this method?
Tom Lavedas
***********
http://there.is.no.more/tglbatch/
"Tom Lavedas" <tglbatch@xxxxxx> wrote in message
news:429bac15-3b35-46c4-b82b-15d2b0cc23ef@xxxxxx
> Later versions of pdf seem to be encoded to keep that from happening,
> but I think that's still at the discretion of the creator. That is,
> some are and some aren't searchable. Clearly, scanned documents
> in pdf format are unsearchable, since they are image based.
---------------------------------------------
I think it is way more complex than that.
Try downloading http://www.sfmta.com/cms/mmaps/documents/47.pdf.
Looking at the file with NotePad, you will find almost no text that looks
like street names or Muni route numbers.
Look at it with Acrobat Reader. The text zooms beautifully (like text font
size changes, not zooming a bit map).
Use Acrobat's binocular icon and search for some text, like 9x. It finds 3
occurrences that are readable on the map. NotePad finds two occurrences,
but I think these have nothing to do with text '9x'.
-Paul Randall
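One likely reason Notepad shows almost no readable text is that PDF page content usually lives in streams compressed with the Flate (zlib) filter. The sketch below, in Python, inflates every stream it can and then looks for the search string in the result. It is an illustration under stated assumptions, not a PDF parser: `inflated_streams` and `find_literal_text` are hypothetical names, streams are located with a simple regular expression, and only Flate compression is handled. Even after inflating, a search can still miss text that the PDF stores as glyph codes or splits into separately positioned fragments.

```python
import re
import zlib

# Naive stream locator; assumes an end-of-line before "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflated_streams(data):
    """Yield the contents of every stream, inflating those that are zlib-compressed."""
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            yield zlib.decompress(raw)
        except zlib.error:
            yield raw  # not Flate-compressed (or another filter); keep as-is


def find_literal_text(data, needle):
    """Return True if `needle` appears as literal bytes in any (inflated) stream."""
    target = needle.encode("latin-1")
    return any(target in chunk for chunk in inflated_streams(data))
```

For example, `find_literal_text(open("47.pdf", "rb").read(), "9x")` would test the downloaded map file, with the caveat that a False result may only mean the text is encoded in a way this sketch does not decode.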