Windows Vista Forums

Vista - Adobe parsing project

Old 12-30-2008   #1 (permalink)
drew


 
 

Adobe parsing project

hello list
i am in waaay over my head, and hope you can straighten me out while i am
trying to learn powershell.
i get a big PDF file from a state office that needs to be parsed and
graphed, but its format is very gnarly because the PDF doesn't leave any
commas or quotes, and has no fixed positions. i think i should be able to
do what i need by testing each character in the pipeline to see if it's a
number, a letter or a space. If it's a letter followed by a space and another
letter, iow, a string, then enclose it or delimit it, and then collect the
rest of the doubles and ints on the line.
what i have been doing is opening the PDF and saving it as text, but that
leaves me with multi column titles bunched up on the first two lines, and
indeterminate columns or locations on the 'data' lines, like this;

New Pre Projected New Budget
CasesComp Ratio Ratio
Example One 11.9 12.3 14 6 9
Example Line Two 3 4.4 6 7 8
Example Line Number Three 2.4 4.4 1.6 7.2 1
Example Four 2 4 6 8 1.5

i have been fumfering with things along the lines of get-childitem
report.txt | get-content | $_.Split(" "), but that complained that Split
didn't belong there, and a foray into
filter showpart{$_.Split(" ") -match "\D"} also bombed.

Basically i need to wind up with something like
"Example Line Number Three", 2.4, 4.4, 1.6, 7.2, 1
"Example Four",2, 4, 6, 8, 1.5

Please accept my thanks and forgive my inarticulate and incorrect attempts
in my struggle to get to dry ground.
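
A likely reason for the complaint about Split in the post above: `$_` only has a value inside a script block that the pipeline runs once per item, such as the one passed to ForEach-Object, so `| $_.Split(" ")` is not a valid pipeline stage on its own. A minimal sketch, using two inline sample lines in place of report.txt (the variable names are made up for illustration):

```powershell
# Sample lines standing in for the contents of report.txt (assumed data)
$lines = 'Example One 11.9 12.3 14 6 9',
         'Example Line Two 3 4.4 6 7 8'

# $_ holds the current line, but only inside the script block
$tokenCounts = $lines | ForEach-Object {
    $tokens = $_.Split(' ')   # String.Split on a single space
    $tokens.Count
}
$tokenCounts
```

Note that a plain split on spaces also breaks the plan name into separate tokens, so the two lines yield different token counts even though both carry five numbers. That is why splitting on "a space followed by a digit" is a better fit for this data.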




Old 12-30-2008   #2 (permalink)
Kryten


 
 

Re: Adobe parsing project

Hi Drew,
I'm sure you will get much better answers than this, but anyway...

On the basis that you are grabbing all the text out of the pdf (even
though there is other stuff in there alongside your "Example" lines),
try this:-

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grab the "example" lines into array
cls ; $examples = @() ; $product = @()
foreach ( $example in ( gc pdftext.txt ) ) {
    switch -regex ( $example ) {
        "^Example" { $examples += $example }
    }
}
foreach ( $example in $examples ) {
    # put an opening double quote at the start of the line
    $example = [regex]::Replace("$example","^","`"")
    # turn the space between the name and the first number into ",
    $example = [regex]::Replace("$example","(?<=\D)\b\s(?=\d)","`"`,")
    # turn each remaining space that follows a digit into a comma
    $example = [regex]::Replace("$example","(?<=\d)\s",",")
    $product += "$example"
}
$product
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Obviously pdftext.txt is the file containing the pdf textual
content. Please note that this is not industrial strength code
by any means and is actually quite fragile, but it might be a
possible starting point for something stronger.

Hope it helps a bit.

Stuart
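
On the fragility mentioned in the post above: one way it might be reduced is to stop keying on the literal word "Example" and instead accept any line that ends in a run of numbers. The sketch below assumes PowerShell 2.0 or later (for `-split` and `-join`), assumes plan names contain no digits, and uses inline sample lines rather than pdftext.txt. The pattern and variable names are illustrative, not part of the original script.

```powershell
# Sample lines standing in for pdftext.txt (assumed data)
$lines = 'New Pre Projected New Budget',
         'CasesComp Ratio Ratio',
         'Example One 11.9 12.3 14 6 9',
         'Example Line Number Three 2.4 4.4 1.6 7.2 1'

# name = non-digit text up to the first number
# nums = whitespace-separated numbers running to the end of the line
$pattern = '^(?<name>\D+?)\s+(?<nums>\d+(\.\d+)?(\s+\d+(\.\d+)?)*)\s*$'

$csv = foreach ($line in $lines) {
    if ($line -match $pattern) {
        $numbers = $Matches['nums'] -split '\s+'
        '"{0}",{1}' -f $Matches['name'], ($numbers -join ',')
    }
}
$csv
```

The two title lines contain no numbers, so they fail the match and are skipped without needing to know what any plan is called.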


Old 12-30-2008   #3 (permalink)
drew


 
 

Re: Adobe parsing project

thanks a ton Stuart;
In stepping through the code, it seems that this bit is not getting executed

switch -regex ( $example )
{
"^Example" { $examples += $example }
}

because i can see the value in $example, but $examples never gets a value,
so execution never gets to the backticked double quotes in the subsequent
foreach...did i miss something?
thanks again for your help, it is really instructive.
drew
Old 12-30-2008   #4 (permalink)
Kryten


 
 

Re: Adobe parsing project

Hi Drew,

It's hard to know why it's not working for you without seeing it.

First, I created a .txt file called pdftext.txt which contains:-


New Pre Projected New Budget
CasesComp Ratio Ratio
Example One 11.9 12.3 14 6 9
Example Line Two 3 4.4 6 7 8
Example Line Number Three 2.4 4.4 1.6 7.2 1
Example Four 2 4 6 8 1.5

Which is exactly what your original post said your data looked like
after being scraped out of the .pdf

On that data, the script should (and does for me) return :-

"Example One",11.9,12.3,14,6,9
"Example Line Two",3,4.4,6,7,8,
"Example Line Number Three",2.4,4.4,1.6,7.2,1
"Example Four",2,4,6,8,1.5

Secondly, when I modify the script to include an additional action
embedded in the switch -regex, I can see the value of $examples being
updated with each new value. Here's that modified script, you might
like to try? :-

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

cls ; $examples = @() ; $product = @()
foreach ( $example in ( gc pdftext.txt ) ) {
    switch -regex ( $example ) {
        "^Example" { $examples += $example
                     "from SWITCH: $examples" }
    }
}

foreach ( $example in $examples ) {
    $example = [regex]::Replace("$example","^","`"")
    $example = [regex]::Replace("$example","(?<=\D)\b\s(?=\d)","`"`,")
    $example = [regex]::Replace("$example","(?<=\d)\s",",")
    $product += "$example"
}

$product

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thirdly, I am using CTP3 throughout all of this, though I'm not sure
without checking whether that would be a big problem.

Also, when I run this with Powershell+ and PowerGUI I can "see" the
value of $examples being incremented as the script runs.

Uuumm, if none of that's done the trick, the only thing I can think of
that's left is that there is something wrong with the source .txt file,
which I referred to as pdftext.txt. Could you try
$a = get-content -path "path to this file"
$a -match "^Example"
and see what is returned? When I do this I get back :-

Example One 11.9 12.3 14 6 9
Example Line Two 3 4.4 6 7 8
Example Line Number Three 2.4 4.4 1.6 7.2 1
Example Four 2 4 6 8 1.5

So should you.

Are there any specific error messages being returned?

After that I'm kinda stuck I'm afraid. Sorry.

Good luck,

Stuart









Old 12-30-2008   #5 (permalink)
Kryten


 
 

Re: Adobe parsing project

Well, maybe not totally stuck but you would
probably need to email me the .txt file you are
working with, or at least a chunk of it.

Cheers,
Stuart

Old 12-30-2008   #6 (permalink)
Kiron


 
 

Re: Adobe parsing project

gc adobe.txt |
    # keep only the lines that don't start with whitespace
    ? {$_ -notmatch '^\s'} |
    % {
        # split the line on each whitespace character that is followed by a digit
        $items = [regex]::split($_,'\s(?=\d)')
        # enclose the first item in double-quotes
        $items[0] = "`"$($items[0])`""
        # change the Object Field Separator ($ofs) to a comma
        # and expand the $items array in a child scope
        &{$ofs = ','; "$items"}
    }

--
Kiron
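
The `$ofs` step in the post above may be easier to follow in isolation. The sketch below joins an already-split line two ways: with `$ofs` set inside a child scope, as in the post, and with the `-join` operator, which assumes PowerShell 2.0 or later. The sample fields and variable names are made up for illustration.

```powershell
# Sample fields, already split, with the name already quoted (assumed data)
$items = '"Example Four"', '2', '4', '6', '8', '1.5'

# $ofs is set inside a child scope, so the caller's value is left alone;
# expanding an array inside a double-quoted string joins it with $ofs
$viaOfs = & { $ofs = ','; "$items" }

# -join builds the same string without touching $ofs
$viaJoin = $items -join ','

$viaOfs
$viaJoin
```

Both lines of output read `"Example Four",2,4,6,8,1.5`.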
Old 12-30-2008   #7 (permalink)
drew


 
 

Re: Adobe parsing project

as George Costanza made immortal, it isn't you, it's me.
it didn't work for me because I AM AN IDIOT!
my 'real' text file has different Plan names in it, not "Example"; i meant
to illustrate that the plan name could have up to five embedded spaces in it.
So when the regex says 'start the line with "Example"', and no line starts
with that, naturally, the examples array doesn't get filled and neither does
product!
DOH!
thank you so much for your excellent illustration, i really appreciate it,
and yes, i have managed to get it to go against the real files.

for my next trick i want to get ps to send Adobe the keystrokes to open the
pdfs and save them as text...that would automate the conversion, then send
the files with commas and quotes to Excel to graph them.
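
For the last step, getting the comma-and-quote lines somewhere Excel can read them, a rough sketch is to write them to a .csv file and read them back to check the columns. The file name plans.csv, the temp folder location, and the header names Plan and C1 to C5 are all placeholders, since the real column titles are bunched up in the source text. It does not attempt the Adobe keystroke part.

```powershell
# Lines as produced by the parsing step (assumed sample data)
$csvLines = '"Example One",11.9,12.3,14,6,9',
            '"Example Four",2,4,6,8,1.5'

# Hypothetical output location: plans.csv in the temp folder
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'plans.csv'
Set-Content -Path $path -Value $csvLines

# Read the file back with placeholder column names to check the layout
$rows = Import-Csv -Path $path -Header Plan, C1, C2, C3, C4, C5
$rows | Format-Table -AutoSize
```

Running `Invoke-Item $path` afterwards would open the file in whatever application is registered for .csv files, which on many machines is Excel.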

thanks again for your help and examples