Windows Vista Forums

Vista Tutorial - Searching large text files

06-14-2007   #1
Chris Harris

Searching large text files

I have very large text files (100MB-500MB+) that I need to process in order
to extract useful pieces of information. Unfortunately, I can't find an
efficient way of doing this with powershell as get-content tries to pull the
entire contents of the file into memory and doesn't seem to store it there
very efficiently.

While I have 2GB of memory on my machine, I keep getting
System.OutOfMemoryException errors from get-content. When I look at the
powershell.exe process in task manager I see it using over 1.5GB of memory.

Is there a more efficient way to do this with Powershell?

Thanks,
Chris

06-14-2007   #2
Marco Shaw

Re: Searching large text files

Chris Harris wrote:
> I have very large text files (100MB-500MB+) that I need to process in order
> to extract useful pieces of information. Unfortunately, I can't find an
> efficient way of doing this with powershell as get-content tries to pull the
> entire contents of the file into memory and doesn't seem to store it there
> very efficiently.
>
> While I have 2GB of memory on my machine, I keep getting
> System.OutOfMemoryException errors from get-content. When I look at the
> powershell.exe process in task manager I see it using over 1.5GB of memory.
>
> Is there a more efficient way to do this with Powershell?
>
> Thanks,
> Chris


Can you be more specific about what you are trying to accomplish?
PowerShell may not be the best tool for this (until v.Next).

Marco

06-14-2007   #3
Kiron

Re: Searching large text files

Use Get-Content's -ReadCount parameter; set it to 1 to send one line at a time
through the pipeline. Don't assign the output to a variable first; instead,
redirect it to a file, e.g.:

gc c:\largeFile.txt -read 1 | ? {<filters>} > c:\filteredFile.txt

# don't assign the output to a variable like this
$filterContent = gc c:\largeFile.txt -read 1 | ? {<filters>}

--
Kiron

06-14-2007   #4
Jacques Barathon [MS]

Re: Searching large text files

"Kiron" <Kiron@discussions.microsoft.com> wrote in message
news:67AC4B33-13F4-4A00-BD9F-080549674FDE@microsoft.com...
> Use Get-Content's -ReadCount parameter, set it to 1 to send a line at a
> time
> through the pipeline but don't assign this to a variable before, instead
> redirect the output to a file, e.g.:
>
> gc c:\largeFile.txt -read 1 | ? {<filters>} > c:\filteredFile.txt
>
> # don't assign the out to variable like this
> $filterContent = gc c:\largeFile.txt -read 1 | ? {<filters>}


Alternatively you can use the System.IO.File class:

[io.file]::ReadAllLines("c:\largeFile.txt")

It is definitely faster than get-content, and it may also make better use of
memory.
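Note that ReadAllLines still returns every line of the file as one string array, so the whole file is held in memory at once. If that is the problem, reading through a System.IO.StreamReader one line at a time is one alternative. The following is only a sketch and is not from this thread: the function name Select-MatchingLine and its parameters are hypothetical, and it assumes PowerShell 2.0 or later for try/finally.

```powershell
# Hypothetical helper (not from the thread): stream a file line by line
# and emit only the lines that match a regular expression.
function Select-MatchingLine {
    param(
        [string]$Path,      # file to scan
        [string]$Pattern    # regular expression tested against each line
    )
    # .NET does not follow PowerShell's current location, so pass a full path.
    $fullPath = (Resolve-Path $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader -ArgumentList $fullPath
    try {
        # ReadLine returns $null at end of file; an empty line is still a string.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line -match $Pattern) { $line }
        }
    }
    finally {
        # Release the file handle even if the loop is interrupted.
        $reader.Close()
    }
}
```

Used the same way as the redirect example quoted above, this might look like `Select-MatchingLine c:\largeFile.txt 'dg\s*$' > c:\filteredFile.txt`, so that only one line at a time needs to be held in memory.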

Jacques

06-14-2007   #5
Kiron

Re: Searching large text files

Thanks for the tip. The [IO.File] method does get the contents faster, but I
suppose the memory overflow issue remains because the contents would go
through the pipeline as one big chunk.
Get-Content's -ReadCount could be set to a higher value than 1 to get larger
chunks of data (and therefore run faster) without overflowing the memory.
Unfortunately, the comparison operators (-like, -notlike, -match, -notmatch)
don't work as expected then: many lines are skipped, missed or ignored.

--
Kiron

06-14-2007   #6
Keith Hill [MVP]

Re: Searching large text files

"Kiron" <Kiron@discussions.microsoft.com> wrote in message
news1FF3ADA-7974-4805-BFFC-6E668C42DAA4@microsoft.com...
> Thanks for the tip. The [IO.File] Method does get the contents faster but
> I
> suppose the memory overflow issue remains because it would go through the
> pipeline as a big chunk.
> Get-Content's -ReadCount could be set to a higher value than 1 to get
> larger
> chunks of data -therefore faster- without overflowing the memory,
> unfortunately, the comparison operators
> (-like, -notlike, -match, -notmatch)
> don't work efficently then, many lines are skipped, missed or ignored.
>


Even though "get-content -readcount 1000" reads 1000 lines at a time and
sends them down the pipeline, the next stage of the pipeline still sees each
individual line. So that should not impact operators like -like
and -notlike. This would matter for -match *if* you needed to use
singleline/multiline regex mode, in which case you need all the contents as a
single string.
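To illustrate the singleline case, here is a small sketch that is not from this thread. It assumes a throwaway demo file in the temp folder (the file name and its contents are made up) and uses [System.IO.File]::ReadAllText to get the contents as a single string, which means the whole file is loaded into memory.

```powershell
# Demo data (hypothetical): two blocks, each spanning three lines.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'multiline-demo.txt'
Set-Content $path 'begin','one','end','begin','two','end'

# Read everything as ONE string; a line-by-line pipeline could not match across lines.
$text = [System.IO.File]::ReadAllText($path)

# (?s) is singleline mode: '.' also matches newline characters,
# so a single match can span several lines.
$blocks = [regex]::Matches($text, '(?s)begin.*?end')
$blocks.Count

Remove-Item $path
```

For files in the hundreds of megabytes this brings the memory problem back, so it only makes sense when a pattern really has to span lines.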

--
Keith

06-14-2007   #7
Kiron

Re: Searching large text files

Thanks Keith. That's what I thought Where-Object would do --filter one
object at a time-- but when the objects are sent through the pipeline from
Get-Content with the -ReadCount parameter set to other than 0 or 1, lines
are skipped.

Try this, it's pretty simple, ten lines, but the Count varies instead of
constantly being 10:

@'
a
ab
abc
abcd
abcde
abcdef
abcdefg
abcdefgh
abcdefghi
abcdefghij
'@ > test.txt

gc test.txt

(gc test.txt -read 1 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 2 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 3 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 4 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 5 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 6 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 7 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 8 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 9 | ? {$_ -like '*a*'} | mo).count
(gc test.txt -read 10 | ? {$_ -like '*a*'} | mo).count

(gc test.txt -read 1 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 2 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 3 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 4 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 5 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 6 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 7 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 8 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 9 | ? {$_ -match 'a'} | mo).count
(gc test.txt -read 10 | ? {$_ -match 'a'} | mo).count

# delete when done
ri test.txt

--
Kiron

06-15-2007   #8
Kiron

Re: Searching large text files

mo is an alias for Measure-Object, oops!

Try this, it's pretty simple, ten lines, but the Count varies instead of
constantly being 10:

@'
a
ab
abc
abcd
abcde
abcdef
abcdefg
abcdefgh
abcdefghi
abcdefghij
'@ > test.txt

gc test.txt

(gc test.txt -read 1 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 2 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 3 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 4 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 5 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 6 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 7 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 8 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 9 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 10 | ? {$_ -like '*a*'} | Measure-Object).count

(gc test.txt -read 1 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 2 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 3 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 4 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 5 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 6 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 7 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 8 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 9 | ? {$_ -match 'a'} | Measure-Object).count
(gc test.txt -read 10 | ? {$_ -match 'a'} | Measure-Object).count

# delete when done
ri test.txt

--
Kiron

06-15-2007   #9
Kiron

Re: Searching large text files

Now try filtering each object with an If statement inside a Foreach-Object
scriptblock. Count is constantly 10 as expected.
Where-Object and Get-Content's -ReadCount <-gt 1> don't get along:

@'
a
ab
abc
abcd
abcde
abcdef
abcdefg
abcdefgh
abcdefghi
abcdefghij
'@ > test.txt

gc test.txt

(gc test.txt -read 1 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 2 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 3 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 4 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 5 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 6 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 7 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 8 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 9 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
(gc test.txt -read 10 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count

(gc test.txt -read 1 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 2 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 3 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 4 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 5 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 6 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 7 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 8 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 9 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
(gc test.txt -read 10 | % {if ($_ -match 'a') {$_}} | Measure-Object).count

# delete when done
ri test.txt

--
Kiron

06-15-2007   #10
Keith Hill [MVP]

Re: Searching large text files

"Kiron" <Kiron@discussions.microsoft.com> wrote in message
news:O2H3piwrHHA.4364@TK2MSFTNGP04.phx.gbl...
> Now try filtering each object with an If statement inside a Foreach-Object
> scriptblock. Count is constantly 10 as expected.
> Where-Object and Get-Content's -ReadCount <-gt 1> don't get along:
>


Yeah, what I said wasn't quite right. Setting -readcount to something like 5
reads five lines at a time and sends each group down the pipeline as an array
object. For the ten-line test file that is two arrays, each with 5 strings in it:

64> gc test.txt -read 5 | get-typename # get-typename from PSCX
Object[]
Object[]

65> gc test.txt -read 5 | %{$_} | get-typename
String
String
String
String
String
String
String
String
String
String

"Typically" these arrays are dealt with in the same way as if you had sent
the strings one at a time, but not in all cases. In the Foreach-Object example
above, the scriptblock outputs the array, and the pipeline unrolls it into its
individual elements. In the case of -like, it will work on an array as well
as a scalar:

66> gc test.txt -read 5 | where {$_ -like 'a*'}
a
ab
abc
abcd
abcde
abcdef
abcdefg
abcdefgh
abcdefghi
abcdefghij

or

68> (ql a ab abc abcd) -like "a*" # ql or quote-list from PSCX
a
ab
abc
abcd

Many cmdlets will accept an array of input and then operate on each element
individually. However, in your case, what you are measuring with
measure-object is the fact that the Where-Object cmdlet just sends the
"original" object (which is an array) on down the pipeline if the expression
evaluates to true, so the count is the number of arrays, not the number of
lines. Fortunately both -like and -match operate on arrays and
return just the elements that match:

2> (ql ab ba cd af) -match '^a'
ab
af
3> (ql ab ba cd af) -like 'a*'
ab
af
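Putting these pieces together, here is a sketch (not from the original posts) of two ways to keep a -ReadCount greater than 1 and still count lines rather than arrays. It assumes a throwaway copy of the ten-line test file in the temp folder; the file name and variable names are made up for illustration.

```powershell
# Recreate the ten-line test file from this thread in a temp location.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
'a','ab','abc','abcd','abcde','abcdef','abcdefg','abcdefgh','abcdefghi','abcdefghij' |
    Set-Content $path

# Option 1: unroll each chunk first, so Where-Object sees one string at a time.
$unrolled = (Get-Content $path -ReadCount 3 |
    ForEach-Object { $_ } |
    Where-Object { $_ -like '*a*' } |
    Measure-Object).Count

# Option 2: apply -like to the whole chunk. On an array it returns the matching
# elements, and the pipeline unrolls them. @() keeps a one-line chunk an array.
$perChunk = (Get-Content $path -ReadCount 3 |
    ForEach-Object { @($_) -like '*a*' } |
    Measure-Object).Count

# For comparison: Where-Object alone passes whole chunks through,
# so this counts arrays rather than lines.
$chunks = (Get-Content $path -ReadCount 3 |
    Where-Object { $_ -like '*a*' } |
    Measure-Object).Count

Remove-Item $path
```

Both $unrolled and $perChunk should come out as 10 here. Option 2 does less per-line work in the pipeline, which may matter on very large files, but that has not been measured in this thread.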

What I'm not seeing is get-content ballooning the memory requirements of
PowerShell. I ran the following command on a 77 MB text file:

84> measure-command { gc large.txt | ?{$_ -match 'dg\s*$'} }

Days : 0
Hours : 0
Minutes : 2
Seconds : 18
Milliseconds : 162
Ticks : 1381622340
TotalDays : 0.00159909993055556
TotalHours : 0.0383783983333333
TotalMinutes : 2.3027039
TotalSeconds : 138.162234
TotalMilliseconds : 138162.234

and PowerShell never gets above ~53 MB of private memory.

--
Keith
