#1

Adobe parsing project

hello list

i am in waaay over my head, and hope you can straighten me out while i am trying to learn powershell.

i get a big PDF file from a state office that needs to be parsed and graphed, but its format is very gnarly because the PDF doesn't leave any commas or quotes, and has no fixed positions. but i think i should be able to do what i need by testing each character in the pipeline to see if it's a number, a letter or a space. if it's a letter followed by a space and another letter, iow, a string, then enclose it or delimit it, and then collect the rest of the doubles and ints on the line.

what i have been doing is opening the PDF and saving it as text, but that leaves me with multi column titles bunched up on the first two lines, and indeterminate columns or locations on the 'data' lines, like this;

    New Pre Projected New Budget CasesComp Ratio Ratio
    Example One 11.9 12.3 14 6 9
    Example Line Two 3 4.4 6 7 8
    Example Line Number Three 2.4 4.4 1.6 7.2 1
    Example Four 2 4 6 8 1.5

i have been fumfering with things along the lines of get-childitem report.txt | get-content | $_.Split(" "), but that complained that Split didn't belong there, and a foray into filter showpart{$_.Split(" ") -match "\D"} also bombed.

Basically i need to wind up with something like

    "Example Line Number Three", 2.4, 4.4, 1.6, 7.2, 1
    "Example Four",2, 4, 6, 8, 1.5

Please accept my thanks and forgive my inarticulate and incorrect attempts in my struggle to get to dry ground.
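A note on the first attempt in this post: `$_` only has a value inside a script block, which is the likely reason `get-content | $_.Split(" ")` was rejected. The sketch below wraps the call in `ForEach-Object` and then sorts the tokens into name words and numbers. It is only an illustration of the idea described above, not a finished parser: the sample line is hard-coded instead of read from report.txt, and it assumes no word of the name is itself a bare number.

```powershell
# One data line copied from the post, hard-coded so the snippet runs without report.txt.
$line = 'Example Line Number Three 2.4 4.4 1.6 7.2 1'

# $_ is only defined inside a script block, so Split is called within ForEach-Object.
$tokens = $line | ForEach-Object { $_.Split(' ') }

# Tokens that look like an integer or a decimal are data; everything else is part of the name.
$numberPattern = '^\d+(\.\d+)?$'
$nameParts = @($tokens | Where-Object { $_ -notmatch $numberPattern })
$values    = @($tokens | Where-Object { $_ -match $numberPattern })

# Quote the name and join the numbers with commas.
$result = '"{0}",{1}' -f ($nameParts -join ' '), ($values -join ',')
$result   # "Example Line Number Three",2.4,4.4,1.6,7.2,1
```

To run this over a whole file, the same body could sit inside `Get-Content report.txt | ForEach-Object { ... }`.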
#2

Re: Adobe parsing project

Hi Drew,

I'm sure you will get much better answers than this, but anyway...

On the basis that you are grabbing all the text out of the pdf (even though there is stuff in there alongside your "Example" lines), try this:-

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grab the "example" lines into array
cls ; $examples = @() ; $product = @()
foreach ( $example in ( gc pdftext.txt ) ) {
    switch -regex ( $example ) {
        "^Example" { $examples += $example }
    }
}
foreach ( $example in $examples ) {
    $example = [regex]::Replace("$example","^","`"")
    $example = [regex]::Replace("$example","(?<=\D)\b\s(?=\d)","`"`,")
    $example = [regex]::Replace("$example","(?<=\d)\s",",")
    $product += "$example"
}
$product
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Obviously pdftext.txt is the file containing the pdf textual content. Please note that this is not industrial strength code by any means and is actually quite fragile, but it might be a possible starting point for something stronger.

Hope it helps a bit.

Stuart
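To make the three replacements in the script above easier to follow, here is a sketch that applies them one at a time to a single line from the first post and keeps each intermediate value. The patterns are the ones from the script; the variable names `$step1` to `$step3` and the single-quoted replacement strings are just for this illustration.

```powershell
# One data line copied from the first post.
$line = 'Example Line Two 3 4.4 6 7 8'

# Step 1: insert an opening double quote at the start of the line.
$step1 = [regex]::Replace($line, '^', '"')
# "Example Line Two 3 4.4 6 7 8

# Step 2: the single space that sits between a non-digit and a digit marks the end
# of the name, so it becomes a closing quote plus a comma.
$step2 = [regex]::Replace($step1, '(?<=\D)\b\s(?=\d)', '",')
# "Example Line Two",3 4.4 6 7 8

# Step 3: every remaining whitespace character that follows a digit becomes a comma.
$step3 = [regex]::Replace($step2, '(?<=\d)\s', ',')
# "Example Line Two",3,4.4,6,7,8

$step1, $step2, $step3
```

One consequence of step 3 is that a trailing space after the last number also turns into a comma, which is one way a line can end up with a comma at the end.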
#3

Re: Adobe parsing project

thanks a ton Stuart;

In stepping through the code, it seems that this bit is not getting executed

    switch -regex ( $example ) {
        "^Example" { $examples += $example }
    }

because i can see the value in $example, but $examples never gets a value, so execution never gets to the backticked double quotes in the subsequent foreach... did i miss something?

thanks again for your help, it is really instructive.

drew
#4

Re: Adobe parsing project

Hi Drew,

It's hard to know why it's not working for you without seeing it.

First, I created a .txt file called pdftext.txt which contains:-

    New Pre Projected New Budget CasesComp Ratio Ratio
    Example One 11.9 12.3 14 6 9
    Example Line Two 3 4.4 6 7 8
    Example Line Number Three 2.4 4.4 1.6 7.2 1
    Example Four 2 4 6 8 1.5

Which is exactly what your original post said your data looked like after being scraped out of the .pdf

On that data, the script should (and does for me) return :-

    "Example One",11.9,12.3,14,6,9
    "Example Line Two",3,4.4,6,7,8,
    "Example Line Number Three",2.4,4.4,1.6,7.2,1
    "Example Four",2,4,6,8,1.5

Secondly, when I modify the script to include an additional action embedded in the switch -regex, I can see the value of $examples being updated with each new value. Here's that modified script, you might like to try? :-

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cls ; $examples = @() ; $product = @()
foreach ( $example in ( gc pdftext.txt ) ) {
    switch -regex ( $example ) {
        "^Example" {
            $examples += $example
            "from SWITCH: $examples"
        }
    }
}
foreach ( $example in $examples ) {
    $example = [regex]::Replace("$example","^","`"")
    $example = [regex]::Replace("$example","(?<=\D)\b\s(?=\d)","`"`,")
    $example = [regex]::Replace("$example","(?<=\d)\s",",")
    $product += "$example"
}
$product
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thirdly, I am using CTP3 throughout all of this, though I'm not sure without checking whether that would be a big problem. Also, when I run this with PowerShell+ and PowerGUI I can "see" the value of $examples being incremented as the script runs.

Uuumm, if none of that's done the trick the only thing I can think of left is that there is something wrong with the source .txt file, that I referred to as the pdftext.txt. Could you try

    $a = get-content -path "path to this file"
    $a -match "^Example"

and see what is returned.

When I do this I get back :-

    Example One 11.9 12.3 14 6 9
    Example Line Two 3 4.4 6 7 8
    Example Line Number Three 2.4 4.4 1.6 7.2 1
    Example Four 2 4 6 8 1.5

So should you. Are there any specific error messages being returned?

After that I'm kinda stuck I'm afraid. Sorry.

Good luck,
Stuart
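The `$a -match "^Example"` check works as a filter because `-match` applied to an array returns the matching elements rather than True or False. The sketch below shows that on an in-memory array instead of a file. The first two lines come from the thread; the `Plan Alpha` line is a made-up name, added only to show that a line starting with anything other than "Example" is dropped without an error.

```powershell
# In-memory stand-in for the file contents; 'Plan Alpha' is a hypothetical plan name.
$lines = @(
    'New Pre Projected New Budget CasesComp Ratio Ratio',
    'Example One 11.9 12.3 14 6 9',
    'Plan Alpha 2 4 6 8 1.5'
)

# Against an array, -match returns only the elements that match the pattern.
$hits = @($lines -match '^Example')
$hits          # Example One 11.9 12.3 14 6 9
$hits.Count    # 1

# Against a single string, -match returns a Boolean instead.
$single = 'Plan Alpha 2 4 6 8 1.5' -match '^Example'
$single        # False
```

An empty result from this check therefore points at the pattern and the data not agreeing, rather than at the later replacement steps.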
#5

Re: Adobe parsing project

Well, maybe not totally stuck but you would probably need to email me the .txt file you are working with, or at least a chunk of it.

Cheers,
Stuart
#6

Re: Adobe parsing project

    gc adobe.txt |
     # filter lines that don't start with a whitespace
     ? {$_ -notmatch '^\s'} | % {
      # split the line using a RegEx
      $items = [regex]::split($_,'\s(?=\d)')
      # enclose the first item in double-quotes
      $items[0] = "`"$($items[0])`""
      # change the Object Field Separator ($ofs) to a comma
      # and expand the $items array in a child scope
      &{$ofs = ','; "$items"}
     }

--
Kiron
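A small sketch of the two ideas in the snippet above, run on one line copied from the first post: splitting on a space that is followed by a digit, and joining the pieces through `$ofs`. The `-join` comparison is an addition for illustration and assumes a PowerShell version that has the `-join` operator (2.0 or later); the variable names `$viaOfs` and `$viaJoin` are hypothetical.

```powershell
# One data line copied from the first post.
$line = 'Example Line Number Three 2.4 4.4 1.6 7.2 1'

# Split only where whitespace is followed by a digit, so the spaces inside
# the name are left alone and the name stays in one piece.
$items = [regex]::Split($line, '\s(?=\d)')
$items.Count   # 6: the name plus five numbers

# Wrap the name in double quotes.
$items[0] = '"' + $items[0] + '"'

# $ofs controls the separator used when an array is expanded inside a string.
# Setting it inside & { } keeps the change local to that child scope.
$viaOfs = & { $ofs = ','; "$items" }

# -join produces the same string without touching $ofs.
$viaJoin = $items -join ','

$viaOfs    # "Example Line Number Three",2.4,4.4,1.6,7.2,1
```

Because the split keys on the first digit after a space, a plan name that contains a number of its own would be cut early.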
#7

Re: Adobe parsing project

as George Costanza made immortal, it isn't you, it's me. it didn't work for me because I AM AN IDIOT!

my 'real' text file has different Plan names in it, not "Example"; i meant to illustrate that the plan name could have up to five embedded spaces in it. So when the regex says 'start the line with "Example"', and no line starts with that, naturally, the examples array doesn't get filled and neither does product! DOH!

thank you so much for your excellent illustration, i really appreciate it, and yes, i have managed to get it to go against the real files.

for my next trick i want to get ps to send Adobe the keystrokes to open the pdfs and save them as text... that would automate the conversion, then send the files with commas and quotes to Excel to graph them.

thanks again for your help and examples
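Since the real plan names are not known in advance, one possible way around the hard-coded `^Example` is to describe the shape of a data line instead: some text, then a tail made only of numbers. The sketch below is one such pattern, offered as a starting point rather than a tested solution for the real files. The name `Plan 9 Alpha` is made up, and the pattern assumes names do not begin with a digit and that the numeric columns contain only digits and dots.

```powershell
# Sample lines: the first two come from the thread, 'Plan 9 Alpha' is a made-up name.
$lines = @(
    'New Pre Projected New Budget CasesComp Ratio Ratio',
    'Example Line Number Three 2.4 4.4 1.6 7.2 1',
    'Plan 9 Alpha 2 4 6 8 1.5'
)

# name = the shortest leading text (starting with a non-digit) that still leaves
#        a purely numeric tail
# nums = that tail: digits, dots and whitespace only, up to the end of the line
$pattern = '^(?<name>\D.*?)\s+(?<nums>\d[\d.\s]*)$'

$csv = foreach ($line in $lines) {
    # Trim first so a trailing space does not produce an empty last column.
    if ($line.Trim() -match $pattern) {
        $numbers = $Matches['nums'] -split '\s+'
        '"{0}",{1}' -f $Matches['name'], ($numbers -join ',')
    }
}
$csv
# "Example Line Number Three",2.4,4.4,1.6,7.2,1
# "Plan 9 Alpha",2,4,6,8,1.5
```

The title line has no numeric tail, so it falls through the `if` and is skipped without needing a separate filter.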