Welcome to Vista Forums, your forum for discussing Windows Vista x64 and x86 systems. Whether you need help or just want to post an idea you have on Vista, this is the forum for you.
| | #1 (permalink) |
| Guest | Extract data from web page

We have internal web pages that query databases & produce tables based upon certain selections, and then give an option to view a comma-delimited text file with the data that's just been created. I'd like to be able to extract that text file (as it's displayed in the web browser), but Net.WebClient can't generate it (because you can't make it go through the gyrations of choosing the fields etc. that you would manually go through to produce it)...and I can get IE to display the appropriate text with the following script, but for some reason I can't seem to extract the data that's in it...any suggestions?

Here's a simplified example of what I want...

    $ie = New-Object -ComObject InternetExplorer.Application
    $ie.Navigate("http://webserver/inventory/displaydb.php")
    while ($ie.busy) {
        sleep -milliseconds 50
    }
    $ie.Navigate("http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%")
    while ($ie.busy) {
        sleep -milliseconds 50
    }
    $ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
    while ($ie.busy) {
        sleep -milliseconds 50
    }
    $ie.visible = $true

Up to this point, everything works fine and the browser will be displaying the contents of a text file...so how do I get that text into a variable in PowerShell so that I can manipulate it?

Thanks,
Charlie |
| | #2 (permalink) |
| Guest | Re: Extract data from web page

---snip---
> $ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
> while ($ie.busy) {
>     sleep -milliseconds 50
> }
> $ie.visible = $true
>
> Up to this point, everything works fine and the browser will be
> displaying the contents of a text file...so how to I get that text
> into a variable in PowerShell so that I can manipulate it?

    $text = $ie.document.body.innertext

--
Michael Harris
Microsoft.MVP.Scripting |
| | #3 (permalink) |
| Guest | Re: Extract data from web page

> $text = $ie.document.body.innertext

Michael,

Thanks for the response, but although I've found that documented *everywhere*, I've tried it and it doesn't work either. Even if I do this, it displays the contents of the text file in an IE browser, but $ie.document.body.innertext doesn't seem to contain anything:

    $ie = New-Object -ComObject InternetExplorer.Application
    $ie.Navigate("file://C:\temp\textfile.txt")
    $ie.visible = $true
    $text=$ie.document.body.innertext
    $text

$text will contain nothing. Piping it into get-member produces this error:

    PS C:\Scripting\PowerShell\> $text | get-member
    Get-Member : No object has been specified to get-member.
    At line:1 char:18
    + $text | get-member <<<<

I'm using PowerShell 1.0 on Windows XP Pro SP-2 & IE7 if that makes a difference.

-Charlie |
| | #4 (permalink) |
| Guest | Re: Extract data from web page

"cmyers" <cmyers@nrao.edu> wrote in message news:1180118690.913064.142410@u30g2000hsc.googlegroups.com...
> We have internal web pages that query databases & produce tables based
> upon certain selections, and then gives an option to view a comma
> delimited text file with the data that's just been created. I'd like
> to be able to extract that text file (as it's displayed in the web
> browser), but the net.webclient can't generate it (because you can't
> make it go through the gyrations of choosing the fields etc. that you
> would manually go through to produce it)...and I can get IE to display
> the appropriate text with the following script, but for some reason I
> can't seem to extract the data that's in it...any suggestions?
>
> Here's a simplified example of what I want...
>
> $ie = New-Object -ComObject InternetExplorer.Application
> $ie.Navigate("http://webserver/inventory/displaydb.php")
> while ($ie.busy) {
>     sleep -milliseconds 50
> }
> $ie.Navigate("http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%")
> while ($ie.busy) {
>     sleep -milliseconds 50
> }
> $ie.Navigate("http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%")
> while ($ie.busy) {
>     sleep -milliseconds 50
> }
> $ie.visible = $true

I don't think you need to fire up IE to do this. .NET has a class that works real well for screen scraping, e.g.:

    $url = 'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%'
    $content = (new-object System.Net.WebClient).DownloadString($url)

--
Keith |
| | #5 (permalink) |
| Guest | Re: Extract data from web page

> I don't think you need to fire up IE to do this. .NET has a class that
> works real well for screen scraping e.g.:
>
> $url =
> 'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filt...
> $content = (new-object System.Net.WebClient).DownloadString($url)

Keith,

That's the problem...in this instance it isn't really "screen scraping". I'm not a web admin, but our web admin basically says that as you browse and choose options, information is gathered in a cookie which is then used to display the results of the final text file that I'm trying to gather. So, if you enter the URL to try to go directly to the text file (which is dynamically generated), you get errors (associated with the database calls, I think) if you haven't already built that information up in the cookie.

I've tried Net.WebClient and it produces the aforementioned database call errors. Going through the gyrations that I've mentioned previously, I can at least produce an IE window that contains the text that I want, but I can't seem to harvest that information into PowerShell.

Charlie

PS...The errors that I get when using the WebClient and the direct URL are (thousands of them because I'm expecting to get about 6,500 lines returned):

    <b>Warning</b>: Invalid argument supplied for foreach() in <b>/opt/services/httpd/htdocs/inventory/displaydb.php</b> on line <b>379</b><br /> |
| | #6 (permalink) |
| Guest | Re: Extract data from web page

"cmyers" <cmyers@nrao.edu> wrote in message news:1180123508.433223.243350@w5g2000hsg.googlegroups.com...
> Keith,
>
> That's the problem...in this instance it isn't really "screen
> scraping". I'm not a web admin, but our web admin basically says that
> as you browse and choose options, information is gathered in a cookie
> which is then used to display the results of the final text file that
> I'm trying to gather. So, if you enter the url to try to go directly
> to the text file (which is dynamicaly generated), you get errors
> (associated to the database calls I think) if you haven't already
> built that information up in the cookie.
>
> I've tried the net.webclient and it produces the aforementioned
> database call errors. Going through the gyrations that I've mentioned
> previously, I can at least produce an IE window that contains the text
> that I want, but I can't seem to harvest that information into
> Powershell.
>
> Charlie
>
> PS...The errors that I get when using the webclient and the direct url
> are (thousands of them because I'm expecting to get about 6,500 lines
> returned):
>
> <b>Warning</b>: Invalid argument supplied for foreach() in <b>/opt/services/httpd/htdocs/inventory/displaydb.php</b> on line <b>379</b><br />

What happens if you download the three URLs in the proper order? Does that not create the appropriate cookies? If not, and you know what the cookie content is, you could use HttpWebRequest (created by the static method WebRequest.Create(string url)). This type allows you to set the cookie content before calling GetResponse().

--
Keith |
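Keith's suggestion of carrying cookies from one request to the next could be sketched roughly as below. This is only a sketch under assumptions: it presumes the site keeps its state in ordinary HTTP cookies, and the helper name Get-PageWithCookies is made up for illustration (it is not from the thread). One System.Net.CookieContainer is shared across the requests, so cookies set by earlier pages are sent along with later ones.

```powershell
# Hypothetical helper (not from the thread): fetch a URL while sharing one
# CookieContainer, so cookies set by earlier responses go out with later requests.
function Get-PageWithCookies {
    param(
        [string]$Url,
        [System.Net.CookieContainer]$Cookies
    )
    # For http:// URLs, WebRequest.Create returns an HttpWebRequest.
    $request = [System.Net.WebRequest]::Create($Url)
    $request.CookieContainer = $Cookies
    $response = $request.GetResponse()
    $reader = New-Object System.IO.StreamReader($response.GetResponseStream())
    $body = $reader.ReadToEnd()
    $reader.Close()
    $response.Close()
    return $body
}

# Usage sketch, commented out because it needs the internal server from the
# original post. The URLs are the ones Charlie posted, visited in the same order.
# $jar  = New-Object System.Net.CookieContainer
# $null = Get-PageWithCookies 'http://webserver/inventory/displaydb.php' $jar
# $null = Get-PageWithCookies 'http://webserver/displaydb.php?table=software&site_filter=xx%&os_filter=%Windows%' $jar
# $text = Get-PageWithCookies 'http://webserver/inventory/text.php?table=software&site_filter=xx%&os_filter=%indows%' $jar
```

If the selections are stored server-side against a session cookie, this may be enough; if the pages rely on script running in the browser to build the cookie, it likely will not be.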
| | #7 (permalink) |
| Guest | Re: Extract data from web page

> Thanks for the response, but although I've found that documented
> *everywhere*, I've tried it and it doesn't work either. Even if I do
> this, it displays the contents of the text file in an IE browser, but
> $ie.document.body.innertext doesn't seem to contain anything:
>
> $ie = New-Object -ComObject InternetExplorer.Application
> $ie.Navigate("file://C:\temp\textfile.txt")
> $ie.visible = $true
> $text=$ie.document.body.innertext
> $text
>
> $text will contain nothing. Piping it into get-member produces this
> error:
>

It does work for me... I'm also using PowerShell 1.0 on Windows XP Pro SP-2 & IE7.

--
Michael Harris
Microsoft.MVP.Scripting |
| | #8 (permalink) |
| Guest | Re: Extract data from web page

> What happens if you Download the three URLs in the proper order? Does that
> not create the appropriate cookies? If not and you knew what the cookie
> content is, you could use HttpWebRequest (created by the static method
> WebRequest.Create(string url)). This type allows you to set the cookie
> content before calling GetResponse().
>
> --
> Keith

Keith,

If I navigate to the three URLs in the proper order, the cookie gets built and the text file that I'm after displays correctly in IE...this works using the script that I posted originally. The problem is that once I have that text displayed in the browser, for some reason using ".Document.Body.InnerText" doesn't grab the content of the IE window.

In fact, I guess that my main problem is that .Document.Body.InnerText doesn't seem to grab the content of an IE window that's been opened either by a .PS1 script or from the PS command line.

It appears to me that I had 2 ways of trying to skin this cat, but I couldn't use 1 of them (Net.WebClient) because I can't construct a URL to navigate directly to the text. Using the other method (New-Object -ComObject InternetExplorer.Application & browsing the URLs in sequence) gives me a view of the data that I want to grab, but for some reason I just can't seem to grab that data.

I'm still open to suggestions as to why .Document.Body.InnerText appears to be null when I try to access it.

Thanks for all the help so far.

-Charlie |
| | #9 (permalink) |
| Guest | Re: Extract data from web page

Actually, now that I look at it, if I look at the properties of my $ie.Document, there is no "Body" property...sounds like this is what's causing my problems, but why is this property missing? Where's my body?

    PS C:\Scripting\PowerShell> $ie = New-Object -ComObject InternetExplorer.Application
    PS C:\Scripting\PowerShell> $ie.Navigate("file://C:\temp\textfile.txt")
    PS C:\Scripting\PowerShell> $ie.Document | get-member

       TypeName: System.__ComObject#{3050f55f-98b5-11cf-bb82-00aa00bdce0b}

    Name                    MemberType Definition
    ----                    ---------- ----------
    appendChild             Method     IHTMLDOMNode appendChild (IHTMLDOMNode)
    attachEvent             Method     bool attachEvent (string, IDispatch)
    clear                   Method     void clear ()
    cloneNode               Method     IHTMLDOMNode cloneNode (bool)
    close                   Method     void close ()
    createAttribute         Method     IHTMLDOMAttribute createAttribute (string)
    createComment           Method     IHTMLDOMNode createComment (string)
    createDocumentFragment  Method     IHTMLDocument2 createDocumentFragment ()
    createDocumentFromUrl   Method     IHTMLDocument2 createDocumentFromUrl (stri...
    createElement           Method     IHTMLElement createElement (string)
    CreateEventObject       Method     IHTMLEventObj CreateEventObject (Variant)
    createRenderStyle       Method     IHTMLRenderStyle createRenderStyle (string)
    createStyleSheet        Method     IHTMLStyleSheet createStyleSheet (string, ...
    createTextNode          Method     IHTMLDOMNode createTextNode (string)
    detachEvent             Method     void detachEvent (string, IDispatch)
    elementFromPoint        Method     IHTMLElement elementFromPoint (int, int)
    execCommand             Method     bool execCommand (string, bool, Variant)
    execCommandShowHelp     Method     bool execCommandShowHelp (string)
    FireEvent               Method     bool FireEvent (string, Variant)
    focus                   Method     void focus ()
    getElementById          Method     IHTMLElement getElementById (string)
    getElementsByName       Method     IHTMLElementCollection getElementsByName (...
    getElementsByTagName    Method     IHTMLElementCollection getElementsByTagNam...
    hasChildNodes           Method     bool hasChildNodes ()
    hasFocus                Method     bool hasFocus ()
    insertBefore            Method     IHTMLDOMNode insertBefore (IHTMLDOMNode, V...
    open                    Method     IDispatch open (string, Variant, Variant, ...
    queryCommandEnabled     Method     bool queryCommandEnabled (string)
    queryCommandIndeterm    Method     bool queryCommandIndeterm (string)
    queryCommandState       Method     bool queryCommandState (string)
    queryCommandSupported   Method     bool queryCommandSupported (string)
    queryCommandText        Method     string queryCommandText (string)
    queryCommandValue       Method     Variant queryCommandValue (string)
    recalc                  Method     void recalc (bool)
    releaseCapture          Method     void releaseCapture ()
    removeChild             Method     IHTMLDOMNode removeChild (IHTMLDOMNode)
    removeNode              Method     IHTMLDOMNode removeNode (bool)
    replaceChild            Method     IHTMLDOMNode replaceChild (IHTMLDOMNode, I...
    replaceNode             Method     IHTMLDOMNode replaceNode (IHTMLDOMNode)
    swapNode                Method     IHTMLDOMNode swapNode (IHTMLDOMNode)
    toString                Method     string toString ()
    write                   Method     void write (SAFEARRAY(Variant))
    writeln                 Method     void writeln (SAFEARRAY(Variant))
    alinkColor              Property   Variant alinkColor () {get} {set}
    anchors                 Property   IHTMLElementCollection anchors () {get}
    attributes              Property   IDispatch attributes () {get}
    baseUrl                 Property   string baseUrl () {get} {set}
    bgColor                 Property   Variant bgColor () {get} {set}
    charset                 Property   string charset () {get} {set}
    childNodes              Property   IDispatch childNodes () {get}
    compatMode              Property   string compatMode () {get}
    cookie                  Property   string cookie () {get} {set}
    defaultCharset          Property   string defaultCharset () {get} {set}
    designMode              Property   string designMode () {get} {set}
    dir                     Property   string dir () {get} {set}
    doctype                 Property   IHTMLDOMNode doctype () {get}
    documentElement         Property   IHTMLElement documentElement () {get}
    domain                  Property   string domain () {get} {set}
    embeds                  Property   IHTMLElementCollection embeds () {get}
    enableDownload          Property   bool enableDownload () {get} {set}
    expando                 Property   bool expando () {get} {set}
    fgColor                 Property   Variant fgColor () {get} {set}
    fileCreatedDate         Property   string fileCreatedDate () {get}
    fileModifiedDate        Property   string fileModifiedDate () {get}
    fileSize                Property   string fileSize () {get}
    fileUpdatedDate         Property   string fileUpdatedDate () {get}
    firstChild              Property   IHTMLDOMNode firstChild () {get}
    forms                   Property   IHTMLElementCollection forms () {get}
    frames                  Property   IHTMLFramesCollection2 frames () {get}
    implementation          Property   IHTMLDOMImplementation implementation () {...
    inheritStyleSheets      Property   bool inheritStyleSheets () {get} {set}
    lastChild               Property   IHTMLDOMNode lastChild () {get}
    lastModified            Property   string lastModified () {get}
    linkColor               Property   Variant linkColor () {get} {set}
    location                Property   IHTMLLocation location () {get}
    media                   Property   string media () {get} {set}
    mimeType                Property   string mimeType () {get}
    nameProp                Property   string nameProp () {get}
    namespaces              Property   IDispatch namespaces () {get}
    nextSibling             Property   IHTMLDOMNode nextSibling () {get}
    nodeName                Property   string nodeName () {get}
    nodeType                Property   int nodeType () {get}
    nodeValue               Property   Variant nodeValue () {get} {set}
    onactivate              Property   Variant onactivate () {get} {set}
    onafterupdate           Property   Variant onafterupdate () {get} {set}
    onbeforeactivate        Property   Variant onbeforeactivate () {get} {set}
    onbeforedeactivate      Property   Variant onbeforedeactivate () {get} {set}
    onbeforeeditfocus       Property   Variant onbeforeeditfocus () {get} {set}
    onbeforeupdate          Property   Variant onbeforeupdate () {get} {set}
    oncellchange            Property   Variant oncellchange () {get} {set}
    onclick                 Property   Variant onclick () {get} {set}
    oncontextmenu           Property   Variant oncontextmenu () {get} {set}
    oncontrolselect         Property   Variant oncontrolselect () {get} {set}
    ondataavailable         Property   Variant ondataavailable () {get} {set}
    ondatasetchanged        Property   Variant ondatasetchanged () {get} {set}
    ondatasetcomplete       Property   Variant ondatasetcomplete () {get} {set}
    ondblclick              Property   Variant ondblclick () {get} {set}
    ondeactivate            Property   Variant ondeactivate () {get} {set}
    ondragstart             Property   Variant ondragstart () {get} {set}
    onerrorupdate           Property   Variant onerrorupdate () {get} {set}
    onfocusin               Property   Variant onfocusin () {get} {set}
    onfocusout              Property   Variant onfocusout () {get} {set}
    onhelp                  Property   Variant onhelp () {get} {set}
    onkeydown               Property   Variant onkeydown () {get} {set}
    onkeypress              Property   Variant onkeypress () {get} {set}
    onkeyup                 Property   Variant onkeyup () {get} {set}
    onmousedown             Property   Variant onmousedown () {get} {set}
    onmousemove             Property   Variant onmousemove () {get} {set}
    onmouseout              Property   Variant onmouseout () {get} {set}
    onmouseover             Property   Variant onmouseover () {get} {set}
    onmouseup               Property   Variant onmouseup () {get} {set}
    onmousewheel            Property   Variant onmousewheel () {get} {set}
    onpropertychange        Property   Variant onpropertychange () {get} {set}
    onreadystatechange      Property   Variant onreadystatechange () {get} {set}
    onrowenter              Property   Variant onrowenter () {get} {set}
    onrowexit               Property   Variant onrowexit () {get} {set}
    onrowsdelete            Property   Variant onrowsdelete () {get} {set}
    onrowsinserted          Property   Variant onrowsinserted () {get} {set}
    onselectionchange       Property   Variant onselectionchange () {get} {set}
    onselectstart           Property   Variant onselectstart () {get} {set}
    onstop                  Property   Variant onstop () {get} {set}
    ownerDocument           Property   IDispatch ownerDocument () {get}
    parentDocument          Property   IHTMLDocument2 parentDocument () {get}
    parentNode              Property   IHTMLDOMNode parentNode () {get}
    parentWindow            Property   IHTMLWindow2 parentWindow () {get}
    plugins                 Property   IHTMLElementCollection plugins () {get}
    previousSibling         Property   IHTMLDOMNode previousSibling () {get}
    protocol                Property   string protocol () {get}
    readyState              Property   string readyState () {get}
    referrer                Property   string referrer () {get}
    scripts                 Property   IHTMLElementCollection scripts () {get}
    security                Property   string security () {get}
    selection               Property   IHTMLSelectionObject selection () {get}
    styleSheets             Property   IHTMLStyleSheetsCollection styleSheets () ...
    title                   Property   string title () {get} {set}
    uniqueID                Property   string uniqueID () {get}
    url                     Property   string url () {get} {set}
    URLUnencoded            Property   string URLUnencoded () {get}
    vlinkColor              Property   Variant vlinkColor () {get} {set} |
| | #10 (permalink) |
| Newbie | Re: Extract data from web page

cmyers' problem was almost certainly due to the fact that WebBrowser.Navigate() is asynchronous: when you call it, it returns immediately in the thread where it was called. The actual loading then proceeds in a different thread. So when cmyers looked at Document.Body, there was nothing there yet, because the control hadn't yet gotten that far in parsing and rendering the HTML.

The answer is to call Navigate() and do nothing further right there. Instead, wait, and handle the WebBrowser instance's DocumentCompleted event, which fires when the document is parsed and ready to go. At that point, you can do whatever you like in your event handler. If the HTML was valid, WebBrowser.Document.Body will be valid.

I've no clue if you can do that in PowerShell. But that's what the problem is, and that's how to solve it if your language supports it. |
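For the InternetExplorer.Application COM object used earlier in the thread, a simple alternative to an event handler is to poll until the browser reports that loading has finished. The following is a minimal sketch, not a confirmed fix for the missing Body property reported above: the helper name Wait-IEReady and the 30-second default timeout are assumptions made for illustration. It relies only on the Busy and ReadyState properties of the browser object, where a ReadyState of 4 means READYSTATE_COMPLETE.

```powershell
# Hypothetical helper (not from the thread): poll until the browser object
# reports the page as fully loaded, or give up after a timeout.
# Returns $true when the page is ready and $false on timeout.
function Wait-IEReady {
    param(
        $Browser,
        [int]$TimeoutMs = 30000   # assumed default; adjust as needed
    )
    $waited = 0
    # ReadyState 4 is READYSTATE_COMPLETE.
    while ($Browser.Busy -or $Browser.ReadyState -ne 4) {
        if ($waited -ge $TimeoutMs) { return $false }
        Start-Sleep -Milliseconds 50
        $waited += 50
    }
    return $true
}

# Usage sketch, commented out because it needs Internet Explorer.
# The file path is the one from Charlie's test earlier in the thread.
# $ie = New-Object -ComObject InternetExplorer.Application
# $ie.Navigate('file://C:\temp\textfile.txt')
# if (Wait-IEReady $ie) { $text = $ie.Document.body.innerText }
```

Note that the test script in post #3 read $ie.document.body.innertext immediately after Navigate() with no wait at all, which is the situation this helper is meant to avoid.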