Windows Vista Forums

Extract data from web page

05-25-2007   #1
cmyers

Extract data from web page

We have internal web pages that query databases and produce tables
based on certain selections, then give an option to view a
comma-delimited text file with the data that's just been created. I'd
like to extract that text file (as it's displayed in the web browser).
Net.WebClient can't generate it, because you can't make it go through
the gyrations of choosing the fields etc. that you would go through
manually to produce it. I can get IE to display the appropriate text
with the following script, but for some reason I can't extract the
data that's in it. Any suggestions?

Here's a simplified example of what I want...

$ie = New-Object -ComObject InternetExplorer.Application
$ie.Navigate("http://webserver/inventory/displaydb.php")
while ($ie.Busy) {
    Start-Sleep -Milliseconds 50
}
$ie.Navigate("http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%")
while ($ie.Busy) {
    Start-Sleep -Milliseconds 50
}
$ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
while ($ie.Busy) {
    Start-Sleep -Milliseconds 50
}
$ie.Visible = $true

Up to this point, everything works fine and the browser will be
displaying the contents of a text file. So how do I get that text
into a variable in PowerShell so that I can manipulate it?

Thanks,
Charlie

05-25-2007   #2
Michael Harris (MVP)

Re: Extract data from web page

---snip---

> $ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
> while ($ie.busy) {
> sleep -milliseconds 50
> }
> $ie.visible = $true
>
> Up to this point, everything works fine and the browser will be
> displaying the contents of a text file...so how do I get that text
> into a variable in PowerShell so that I can manipulate it?
>


$text = $ie.document.body.innertext

--
Michael Harris
Microsoft.MVP.Scripting
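
Once the text is in a variable, it can be split into rows and fields. The following is a minimal sketch, assuming the page body is comma-delimited with a header row. The sample string and the column names (Name, Site, OS) are made up for illustration, and both the -split operator and ConvertFrom-Csv need PowerShell 2.0 or later, not the 1.0 release used in this thread.

```powershell
# Hypothetical sample standing in for $ie.Document.body.innerText.
$text = "Name,Site,OS`r`npc01,xx1,Windows XP`r`npc02,xx2,Windows Vista"

# Split on line breaks, drop empty lines, and let the first line act as the header.
$rows = $text -split "`r?`n" | Where-Object { $_ } | ConvertFrom-Csv

# Each row is now an object with Name, Site and OS properties.
$rows | Where-Object { $_.OS -like '*Vista*' } | ForEach-Object { $_.Name }
```

With real data, the property names would come from whatever header line the server emits.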


05-25-2007   #3
cmyers

Re: Extract data from web page


>
> $text = $ie.document.body.innertext


Michael,

Thanks for the response, but although I've found that documented
*everywhere*, I've tried it and it doesn't work either. Even if I do
this, it displays the contents of the text file in an IE browser, but
$ie.document.body.innertext doesn't seem to contain anything:

$ie = New-Object -ComObject InternetExplorer.Application
$ie.Navigate("file://C:\temp\textfile.txt")
$ie.visible = $true
$text=$ie.document.body.innertext
$text

$text will contain nothing. Piping it into get-member produces this
error:

PS C:\Scripting\PowerShell> $text | get-member
Get-Member : No object has been specified to get-member.
At line:1 char:18
+ $text | get-member <<<<

I'm using PowerShell 1.0 on Windows XP Pro SP2 with IE7, if that makes
a difference.

-Charlie

05-25-2007   #4
Keith Hill [MVP]

Re: Extract data from web page

"cmyers" <cmyers@nrao.edu> wrote in message
news:1180118690.913064.142410@u30g2000hsc.googlegroups.com...
> We have internal web pages that query databases & produce tables based
> upon certain selections, and then gives an option to view a comma
> delimited text file with the data that's just been created. I'd like
> to be able to extract that text file (as it's displayed in the web
> browser), but the net.webclient can't generate it (because you can't
> make it go through the gyrations of choosing the fields etc. that you
> would manually go through to produce it)...and I can get IE to display
> the appropriate text with the following script, but for some reason I
> can't seem to extract the data that's in it...any suggestions?
>
> Here's a simplified example of what I want...
>
> $ie = New-Object -ComObject InternetExplorer.Application
> $ie.Navigate("http://webserver/inventory/displaydb.php")
> while ($ie.busy) {
> sleep -milliseconds 50
> }
> $ie.Navigate("http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%")
> while ($ie.busy) {
> sleep -milliseconds 50
> }
> $ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
> while ($ie.busy) {
> sleep -milliseconds 50
> }
> $ie.visible = $true


I don't think you need to fire up IE to do this. .NET has a class that
works really well for screen scraping, e.g.:

$url = 'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%'
$content = (new-object System.Net.WebClient).DownloadString($url)

--
Keith

05-25-2007   #5
cmyers

Re: Extract data from web page


> I don't think you need to fire up IE to do this. .NET has a class that
> works real well for screen scraping e.g.:
>
> $url =
> 'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filt...
> $content = (new-object System.Net.WebClient).DownloadString($url)



Keith,

That's the problem: in this instance it isn't really "screen
scraping". I'm not a web admin, but our web admin basically says that
as you browse and choose options, information is gathered in a cookie,
which is then used to display the results of the final text file that
I'm trying to gather. So if you enter the URL to go directly to the
text file (which is dynamically generated), you get errors (associated
with the database calls, I think) if you haven't already built that
information up in the cookie.

I've tried Net.WebClient and it produces the aforementioned database
call errors. Going through the gyrations that I've mentioned
previously, I can at least produce an IE window that contains the text
that I want, but I can't seem to harvest that information into
PowerShell.

Charlie

PS...The errors that I get when using the webclient and the direct url
are (thousands of them because I'm expecting to get about 6,500 lines
returned):

<b>Warning</b>: Invalid argument supplied for foreach() in <b>/opt/services/httpd/htdocs/inventory/displaydb.php</b> on line <b>379</b><br />



05-25-2007   #6
Keith Hill [MVP]

Re: Extract data from web page


"cmyers" <cmyers@nrao.edu> wrote in message
news:1180123508.433223.243350@w5g2000hsg.googlegroups.com...
> Keith,
>
> That's the problem...in this instance it isn't really "screen
> scraping". I'm not a web admin, but our web admin basically says that
> as you browse and choose options, information is gathered in a cookie
> which is then used to display the results of the final text file that
> I'm trying to gather. So, if you enter the url to try to go directly
> to the text file (which is dynamically generated), you get errors
> (associated to the database calls I think) if you haven't already
> built that information up in the cookie.
>
> I've tried the net.webclient and it produces the aforementioned
> database call errors. Going through the gyrations that I've mentioned
> previously, I can at least produce an IE window that contains the text
> that I want, but I can't seem to harvest that information into
> Powershell.
>
> Charlie
>
> PS...The errors that I get when using the webclient and the direct url
> are (thousands of them because I'm expecting to get about 6,500 lines
> returned):
>
> <b>Warning</b>: Invalid argument supplied for foreach() in <b>/opt/services/httpd/htdocs/inventory/displaydb.php</b> on line <b>379</b><br />


What happens if you download the three URLs in the proper order? Does that
not create the appropriate cookies? If not, and you know what the cookie
content is, you could use HttpWebRequest (created by the static method
WebRequest.Create(string url)). This type allows you to set the cookie
content before calling GetResponse().

--
Keith
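
The suggestion above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for this server: it assumes the site keeps its state in ordinary HTTP cookies set through response headers, reuses the three URLs from the original post, and shares one System.Net.CookieContainer across the requests so that cookies set by an earlier page are sent with the later ones. The cookie name and value in the commented-out line are hypothetical placeholders.

```powershell
# URLs copied from the original post, requested in the same order.
$urls = @(
    'http://webserver/inventory/displaydb.php',
    'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%',
    'http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%'
)

# One container shared by every request carries the cookies forward.
$cookies = New-Object System.Net.CookieContainer

# If the cookie content is known, it could be added by hand (hypothetical values):
# $cookies.Add((New-Object System.Net.Cookie('name', 'value', '/', 'webserver')))

$content = $null
foreach ($url in $urls) {
    $request = [System.Net.WebRequest]::Create($url)
    $request.CookieContainer = $cookies
    $response = $request.GetResponse()
    $reader = New-Object System.IO.StreamReader($response.GetResponseStream())
    $content = $reader.ReadToEnd()   # after the loop this holds the last page
    $reader.Close()
    $response.Close()
}

$content
```

If the server builds the cookie in client-side script rather than in response headers, this will not reproduce it, and driving IE remains the fallback.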

05-25-2007   #7
Michael Harris (MVP)

Re: Extract data from web page

> Thanks for the response, but although I've found that documented
> *everywhere*, I've tried it and it doesn't work either. Even if I do
> this, it displays the contents of the text file in an IE browser, but
> $ie.document.body.innertext doesn't seem to contain anything:
>
> $ie = New-Object -ComObject InternetExplorer.Application
> $ie.Navigate("file://C:\temp\textfile.txt")
> $ie.visible = $true
> $text=$ie.document.body.innertext
> $text
>
> $text will contain nothing. Piping it into get-member produces this
> error:
>


It does work for me...

I'm also using Powershell 1.0 on Windows XP Pro SP-2 & IE7

--
Michael Harris
Microsoft.MVP.Scripting


05-29-2007   #8
cmyers

Re: Extract data from web page


> What happens if you Download the three URLs in the proper order? Does that
> not create the appropriate cookies? If not and you knew what the cookie
> content is, you could use HttpWebRequest (created by the static method
> WebRequest.Create(string url)). This type allows you to set the cookie
> content before calling GetResponse().
>
> --
> Keith


Keith,

If I navigate to the three URLs in the proper order, the cookie gets
built and the text file that I'm after displays correctly in IE. This
works using the script that I posted originally. The problem is that
once I have that text displayed in the browser, for some reason
".Document.Body.InnerText" doesn't grab the content of the IE window.
In fact, I guess my main problem is that .Document.Body.InnerText
doesn't seem to grab the content of an IE window opened either by a
.PS1 script or from the PS command line.

It appears to me that I had two ways of trying to skin this cat, but I
couldn't use one of them (Net.WebClient) because I can't construct a
URL to navigate directly to the text. The other method (New-Object
-ComObject InternetExplorer.Application and browsing the URLs in
sequence) gives me a view of the data that I want to grab, but for
some reason I just can't grab it. I'm still open to suggestions as to
why .Document.Body.InnerText appears to be null when I try to access
it.

Thanks for all the help so far.

-Charlie

05-29-2007   #9
cmyers

Re: Extract data from web page

Actually, now that I look at it, the properties of my $ie.Document
include no "Body" property. That sounds like the cause of my problems,
but why is the property missing? Where's my body? :)

PS C:\Scripting\PowerShell> $ie = New-Object -ComObject InternetExplorer.Application
PS C:\Scripting\PowerShell> $ie.Navigate("file://C:\temp\textfile.txt")
PS C:\Scripting\PowerShell> $ie.Document | get-member


TypeName: System.__ComObject#{3050f55f-98b5-11cf-bb82-00aa00bdce0b}

Name                   MemberType Definition
----                   ---------- ----------
appendChild            Method     IHTMLDOMNode appendChild (IHTMLDOMNode)
attachEvent            Method     bool attachEvent (string, IDispatch)
clear                  Method     void clear ()
cloneNode              Method     IHTMLDOMNode cloneNode (bool)
close                  Method     void close ()
createAttribute        Method     IHTMLDOMAttribute createAttribute (string)
createComment          Method     IHTMLDOMNode createComment (string)
createDocumentFragment Method     IHTMLDocument2 createDocumentFragment ()
createDocumentFromUrl  Method     IHTMLDocument2 createDocumentFromUrl (stri...
createElement          Method     IHTMLElement createElement (string)
CreateEventObject      Method     IHTMLEventObj CreateEventObject (Variant)
createRenderStyle      Method     IHTMLRenderStyle createRenderStyle (string)
createStyleSheet       Method     IHTMLStyleSheet createStyleSheet (string, ...
createTextNode         Method     IHTMLDOMNode createTextNode (string)
detachEvent            Method     void detachEvent (string, IDispatch)
elementFromPoint       Method     IHTMLElement elementFromPoint (int, int)
execCommand            Method     bool execCommand (string, bool, Variant)
execCommandShowHelp    Method     bool execCommandShowHelp (string)
FireEvent              Method     bool FireEvent (string, Variant)
focus                  Method     void focus ()
getElementById         Method     IHTMLElement getElementById (string)
getElementsByName      Method     IHTMLElementCollection getElementsByName (...
getElementsByTagName   Method     IHTMLElementCollection getElementsByTagNam...
hasChildNodes          Method     bool hasChildNodes ()
hasFocus               Method     bool hasFocus ()
insertBefore           Method     IHTMLDOMNode insertBefore (IHTMLDOMNode, V...
open                   Method     IDispatch open (string, Variant, Variant, ...
queryCommandEnabled    Method     bool queryCommandEnabled (string)
queryCommandIndeterm   Method     bool queryCommandIndeterm (string)
queryCommandState      Method     bool queryCommandState (string)
queryCommandSupported  Method     bool queryCommandSupported (string)
queryCommandText       Method     string queryCommandText (string)
queryCommandValue      Method     Variant queryCommandValue (string)
recalc                 Method     void recalc (bool)
releaseCapture         Method     void releaseCapture ()
removeChild            Method     IHTMLDOMNode removeChild (IHTMLDOMNode)
removeNode             Method     IHTMLDOMNode removeNode (bool)
replaceChild           Method     IHTMLDOMNode replaceChild (IHTMLDOMNode, I...
replaceNode            Method     IHTMLDOMNode replaceNode (IHTMLDOMNode)
swapNode               Method     IHTMLDOMNode swapNode (IHTMLDOMNode)
toString               Method     string toString ()
write                  Method     void write (SAFEARRAY(Variant))
writeln                Method     void writeln (SAFEARRAY(Variant))
alinkColor             Property   Variant alinkColor () {get} {set}
anchors                Property   IHTMLElementCollection anchors () {get}
attributes             Property   IDispatch attributes () {get}
baseUrl                Property   string baseUrl () {get} {set}
bgColor                Property   Variant bgColor () {get} {set}
charset                Property   string charset () {get} {set}
childNodes             Property   IDispatch childNodes () {get}
compatMode             Property   string compatMode () {get}
cookie                 Property   string cookie () {get} {set}
defaultCharset         Property   string defaultCharset () {get} {set}
designMode             Property   string designMode () {get} {set}
dir                    Property   string dir () {get} {set}
doctype                Property   IHTMLDOMNode doctype () {get}
documentElement        Property   IHTMLElement documentElement () {get}
domain                 Property   string domain () {get} {set}
embeds                 Property   IHTMLElementCollection embeds () {get}
enableDownload         Property   bool enableDownload () {get} {set}
expando                Property   bool expando () {get} {set}
fgColor                Property   Variant fgColor () {get} {set}
fileCreatedDate        Property   string fileCreatedDate () {get}
fileModifiedDate       Property   string fileModifiedDate () {get}
fileSize               Property   string fileSize () {get}
fileUpdatedDate        Property   string fileUpdatedDate () {get}
firstChild             Property   IHTMLDOMNode firstChild () {get}
forms                  Property   IHTMLElementCollection forms () {get}
frames                 Property   IHTMLFramesCollection2 frames () {get}
implementation         Property   IHTMLDOMImplementation implementation () {...
inheritStyleSheets     Property   bool inheritStyleSheets () {get} {set}
lastChild              Property   IHTMLDOMNode lastChild () {get}
lastModified           Property   string lastModified () {get}
linkColor              Property   Variant linkColor () {get} {set}
location               Property   IHTMLLocation location () {get}
media                  Property   string media () {get} {set}
mimeType               Property   string mimeType () {get}
nameProp               Property   string nameProp () {get}
namespaces             Property   IDispatch namespaces () {get}
nextSibling            Property   IHTMLDOMNode nextSibling () {get}
nodeName               Property   string nodeName () {get}
nodeType               Property   int nodeType () {get}
nodeValue              Property   Variant nodeValue () {get} {set}
onactivate             Property   Variant onactivate () {get} {set}
onafterupdate          Property   Variant onafterupdate () {get} {set}
onbeforeactivate       Property   Variant onbeforeactivate () {get} {set}
onbeforedeactivate     Property   Variant onbeforedeactivate () {get} {set}
onbeforeeditfocus      Property   Variant onbeforeeditfocus () {get} {set}
onbeforeupdate         Property   Variant onbeforeupdate () {get} {set}
oncellchange           Property   Variant oncellchange () {get} {set}
onclick                Property   Variant onclick () {get} {set}
oncontextmenu          Property   Variant oncontextmenu () {get} {set}
oncontrolselect        Property   Variant oncontrolselect () {get} {set}
ondataavailable        Property   Variant ondataavailable () {get} {set}
ondatasetchanged       Property   Variant ondatasetchanged () {get} {set}
ondatasetcomplete      Property   Variant ondatasetcomplete () {get} {set}
ondblclick             Property   Variant ondblclick () {get} {set}
ondeactivate           Property   Variant ondeactivate () {get} {set}
ondragstart            Property   Variant ondragstart () {get} {set}
onerrorupdate          Property   Variant onerrorupdate () {get} {set}
onfocusin              Property   Variant onfocusin () {get} {set}
onfocusout             Property   Variant onfocusout () {get} {set}
onhelp                 Property   Variant onhelp () {get} {set}
onkeydown              Property   Variant onkeydown () {get} {set}
onkeypress             Property   Variant onkeypress () {get} {set}
onkeyup                Property   Variant onkeyup () {get} {set}
onmousedown            Property   Variant onmousedown () {get} {set}
onmousemove            Property   Variant onmousemove () {get} {set}
onmouseout             Property   Variant onmouseout () {get} {set}
onmouseover            Property   Variant onmouseover () {get} {set}
onmouseup              Property   Variant onmouseup () {get} {set}
onmousewheel           Property   Variant onmousewheel () {get} {set}
onpropertychange       Property   Variant onpropertychange () {get} {set}
onreadystatechange     Property   Variant onreadystatechange () {get} {set}
onrowenter             Property   Variant onrowenter () {get} {set}
onrowexit              Property   Variant onrowexit () {get} {set}
onrowsdelete           Property   Variant onrowsdelete () {get} {set}
onrowsinserted         Property   Variant onrowsinserted () {get} {set}
onselectionchange      Property   Variant onselectionchange () {get} {set}
onselectstart          Property   Variant onselectstart () {get} {set}
onstop                 Property   Variant onstop () {get} {set}
ownerDocument          Property   IDispatch ownerDocument () {get}
parentDocument         Property   IHTMLDocument2 parentDocument () {get}
parentNode             Property   IHTMLDOMNode parentNode () {get}
parentWindow           Property   IHTMLWindow2 parentWindow () {get}
plugins                Property   IHTMLElementCollection plugins () {get}
previousSibling        Property   IHTMLDOMNode previousSibling () {get}
protocol               Property   string protocol () {get}
readyState             Property   string readyState () {get}
referrer               Property   string referrer () {get}
scripts                Property   IHTMLElementCollection scripts () {get}
security               Property   string security () {get}
selection              Property   IHTMLSelectionObject selection () {get}
styleSheets            Property   IHTMLStyleSheetsCollection styleSheets () ...
title                  Property   string title () {get} {set}
uniqueID               Property   string uniqueID () {get}
url                    Property   string url () {get} {set}
URLUnencoded           Property   string URLUnencoded () {get}
vlinkColor             Property   Variant vlinkColor () {get} {set}

02-12-2008   #10
Darth Mainer

Re: Extract data from web page

cmyers's problem was almost certainly due to the fact that WebBrowser.Navigate() is asynchronous: when you call it, it returns immediately in the thread where it was called, and the actual loading then proceeds in a different thread. So when cmyers looked at Document.Body, there was nothing there yet, because the control hadn't yet gotten that far in parsing and rendering the HTML.

The answer is to call Navigate() and do nothing further right there. Instead, wait, and handle the WebBrowser instance's DocumentCompleted event, which fires when the document is parsed and ready to go. At that point, you can do whatever you like in your event handler. If the HTML was valid, WebBrowser.Document.Body will be valid.

I've no clue if you can do that in PowerShell. But that's what the problem is, and that's how to solve it if your language supports it.
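
For the InternetExplorer.Application COM object used earlier in this thread, one way to approximate "wait until the document is complete" without an event handler is to poll the ReadyState property. The following is a minimal sketch under that assumption: it swaps event handling for polling, treats 4 (READYSTATE_COMPLETE) as "loaded", and uses an arbitrary 30-second limit. Whether it also cures the missing Body member shown above is not confirmed anywhere in the thread.

```powershell
$ie = New-Object -ComObject InternetExplorer.Application
$ie.Navigate('file://C:\temp\textfile.txt')

# Navigate() returns at once, so wait until IE reports the document as complete.
# 4 is READYSTATE_COMPLETE; the 30-second limit is an arbitrary choice.
$deadline = (Get-Date).AddSeconds(30)
while (($ie.Busy -or $ie.ReadyState -ne 4) -and (Get-Date) -lt $deadline) {
    Start-Sleep -Milliseconds 50
}

if ($ie.ReadyState -eq 4) {
    # Only touch the document once loading has finished.
    $text = $ie.Document.body.innerText
    $text
} else {
    Write-Warning 'Page did not finish loading in time.'
}

$ie.Quit()
```

The same wait loop would go after each Navigate() call when several pages have to be visited in sequence.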