Windows Vista Forums

Searching large text files
  1. #1


    Chris Harris Guest

    Searching large text files

    I have very large text files (100MB-500MB+) that I need to process in order
    to extract useful pieces of information. Unfortunately, I can't find an
    efficient way of doing this with powershell as get-content tries to pull the
    entire contents of the file into memory and doesn't seem to store it there
    very efficiently.

    While I have 2GB of memory on my machine, I keep getting
    System.OutOfMemoryException errors from get-content. When I look at the
    powershell.exe process in task manager I see it using over 1.5GB of memory.

    Is there a more efficient way to do this with Powershell?

    Thanks,
    Chris




  2. #2


    Marco Shaw Guest

    Re: Searching large text files

    Chris Harris wrote:
    > I have very large text files (100MB-500MB+) that I need to process in order
    > to extract useful pieces of information. Unfortunately, I can't find an
    > efficient way of doing this with powershell as get-content tries to pull the
    > entire contents of the file into memory and doesn't seem to store it there
    > very efficiently.
    >
    > While I have 2GB of memory on my machine, I keep getting
    > System.OutOfMemoryException errors from get-content. When I look at the
    > powershell.exe process in task manager I see it using over 1.5GB of memory.
    >
    > Is there a more efficient way to do this with Powershell?
    >
    > Thanks,
    > Chris


    Can you be more specific about what you are trying to accomplish?
    PowerShell may not be the best tool in this case (until v.Next).

    Marco


  3. #3


    Kiron Guest

    Re: Searching large text files

    Use Get-Content's -ReadCount parameter, set to 1, to send one line at a time
    through the pipeline. Don't assign the result to a variable first; instead,
    redirect the output to a file, e.g.:

    gc c:\largeFile.txt -read 1 | ? {<filters>} > c:\filteredFile.txt

    # don't assign the output to a variable like this
    $filterContent = gc c:\largeFile.txt -read 1 | ? {<filters>}
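
    A minimal sketch of the streaming approach above, written out with full
    cmdlet names. The file names, the sample lines and the 'ERROR' pattern are
    made up for illustration, and a tiny stand-in file is created so the
    snippet runs on its own:

```powershell
# Hypothetical file names and pattern, for illustration only.
$tmp     = [IO.Path]::GetTempPath()
$source  = Join-Path $tmp 'largeFile-demo.txt'
$target  = Join-Path $tmp 'filteredFile-demo.txt'
$pattern = 'ERROR'

# Small stand-in for the large file so the sketch runs on its own.
'INFO start', 'ERROR disk full', 'INFO retry', 'ERROR timeout' |
    Set-Content -Path $source

# Each line flows through the filter and straight to disk;
# nothing accumulates in a variable along the way.
Get-Content -Path $source -ReadCount 1 |
    Where-Object { $_ -match $pattern } |
    Set-Content -Path $target

# Read back only the (small) filtered result to check it.
$filtered = @(Get-Content -Path $target)
$filtered.Count

Remove-Item -Path $source, $target
```

    Only the final check reads anything into a variable, and that is the
    filtered file, not the large one.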

    --
    Kiron



  4. #4


    Jacques Barathon [MS] Guest

    Re: Searching large text files

    "Kiron" <Kiron@discussions.microsoft.com> wrote in message
    news:67AC4B33-13F4-4A00-BD9F-080549674FDE@microsoft.com...
    > Use Get-Content's -ReadCount parameter, set it to 1 to send a line at a time
    > through the pipeline but don't assign this to a variable before, instead
    > redirect the output to a file, e.g.:
    >
    > gc c:\largeFile.txt -read 1 | ? {<filters>} > c:\filteredFile.txt
    >
    > # don't assign the out to variable like this
    > $filterContent = gc c:\largeFile.txt -read 1 | ? {<filters>}


    Alternatively you can use the System.IO.File class:

    [io.file]::ReadAllLines("c:\largeFile.txt")

    It is definitely faster than Get-Content, and it may also make better use
    of memory.
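
    One caveat: ReadAllLines returns the whole file as a string array, so
    every line is in memory at once. If that turns out to be too much, a
    possible variation (not what was suggested above, just a sketch) is to
    read one line at a time with System.IO.StreamReader. The file name,
    sample lines and the 'alpha*' filter are invented for the example:

```powershell
# Hypothetical demo file; .NET needs a full path because its current
# directory is not necessarily PowerShell's current location.
$path = Join-Path ([IO.Path]::GetTempPath()) 'streamreader-demo.txt'
'alpha', 'beta', 'alphabet', 'gamma' | Set-Content -Path $path

$reader  = New-Object System.IO.StreamReader($path)
$matched = 0
try {
    # ReadLine returns $null at end of file, so only one line
    # is held in $line at any moment.
    while ($null -ne ($line = $reader.ReadLine())) {
        if ($line -like 'alpha*') { $matched++ }
    }
}
finally {
    # Always release the file handle, even if the loop throws.
    $reader.Close()
}
$matched

Remove-Item -Path $path
```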

    Jacques



  5. #5


    Kiron Guest

    Re: Searching large text files

    Thanks for the tip. The [IO.File] method does get the contents faster, but I
    suppose the memory overflow issue remains because the file would go through
    the pipeline as one big chunk.
    Get-Content's -ReadCount could be set to a value higher than 1 to get larger
    chunks of data (and therefore run faster) without overflowing memory.
    Unfortunately, the comparison operators (-like, -notlike, -match, -notmatch)
    don't work as expected then: many lines are skipped, missed or ignored.

    --
    Kiron



  6. #6


    Keith Hill [MVP] Guest

    Re: Searching large text files

    "Kiron" <Kiron@discussions.microsoft.com> wrote in message
    news1FF3ADA-7974-4805-BFFC-6E668C42DAA4@microsoft.com...
    > Thanks for the tip. The [IO.File] Method does get the contents faster but I
    > suppose the memory overflow issue remains because it would go through the
    > pipeline as a big chunk.
    > Get-Content's -ReadCount could be set to a higher value than 1 to get larger
    > chunks of data -therefore faster- without overflowing the memory,
    > unfortunately, the comparison operators (-like, -notlike, -match, -notmatch)
    > don't work efficiently then, many lines are skipped, missed or ignored.
    >


    Even though "get-content -readcount 1000" reads 1000 lines at a time and
    sends them down the pipeline, the next stage of the pipeline still sees each
    individual line. So that should not impact operators like -like and
    -notlike. This would matter for -match *if* you needed to use
    singleline/multiline regex mode, in which case you need all the contents as
    a single string.
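
    To illustrate the single-string case, here is a small sketch. The file
    name, the BEGIN/END markers and the pattern are made up, and note that
    ReadAllText loads the entire file, so this only suits files that fit in
    memory:

```powershell
# Hypothetical demo file with a block that spans several lines.
$path = Join-Path ([IO.Path]::GetTempPath()) 'multiline-demo.txt'
'BEGIN', 'payload line', 'END', 'noise' | Set-Content -Path $path

# Line by line: no single line contains both BEGIN and END,
# so nothing comes out of the filter.
$perLine = @(Get-Content -Path $path |
    Where-Object { $_ -match 'BEGIN.*END' })

# As one string: the (?s) option lets '.' match newlines too,
# so the pattern can span lines.
$text  = [IO.File]::ReadAllText($path)
$whole = $text -match '(?s)BEGIN.*END'

$perLine.Count
$whole

Remove-Item -Path $path
```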

    --
    Keith



  7. #7


    Kiron Guest

    Re: Searching large text files

    Thanks Keith. That's what I thought Where-Object would do --filter one
    object at a time-- but when the objects are sent through the pipeline from
    Get-Content with the -ReadCount parameter set to other than 0 or 1, lines
    are skipped.

    Try this, it's pretty simple, ten lines, but the Count varies instead of
    constantly being 10:

    @'
    a
    ab
    abc
    abcd
    abcde
    abcdef
    abcdefg
    abcdefgh
    abcdefghi
    abcdefghij
    '@ > test.txt

    gc test.txt

    (gc test.txt -read 1 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 2 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 3 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 4 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 5 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 6 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 7 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 8 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 9 | ? {$_ -like '*a*'} | mo).count
    (gc test.txt -read 10 | ? {$_ -like '*a*'} | mo).count

    (gc test.txt -read 1 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 2 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 3 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 4 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 5 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 6 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 7 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 8 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 9 | ? {$_ -match 'a'} | mo).count
    (gc test.txt -read 10 | ? {$_ -match 'a'} | mo).count

    # delete when done
    ri test.txt

    --
    Kiron



  8. #8


    Kiron Guest

    Re: Searching large text files

    mo is an alias for Measure-Object, oops!

    Try this, it's pretty simple, ten lines, but the Count varies instead of
    constantly being 10:

    @'
    a
    ab
    abc
    abcd
    abcde
    abcdef
    abcdefg
    abcdefgh
    abcdefghi
    abcdefghij
    '@ > test.txt

    gc test.txt

    (gc test.txt -read 1 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 2 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 3 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 4 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 5 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 6 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 7 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 8 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 9 | ? {$_ -like '*a*'} | Measure-Object).count
    (gc test.txt -read 10 | ? {$_ -like '*a*'} | Measure-Object).count

    (gc test.txt -read 1 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 2 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 3 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 4 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 5 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 6 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 7 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 8 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 9 | ? {$_ -match 'a'} | Measure-Object).count
    (gc test.txt -read 10 | ? {$_ -match 'a'} | Measure-Object).count

    # delete when done
    ri test.txt

    --
    Kiron


  9. #9


    Kiron Guest

    Re: Searching large text files

    Now try filtering each object with an If statement inside a Foreach-Object
    scriptblock. Count is constantly 10 as expected.
    Where-Object and Get-Content's -ReadCount <-gt 1> don't get along:

    @'
    a
    ab
    abc
    abcd
    abcde
    abcdef
    abcdefg
    abcdefgh
    abcdefghi
    abcdefghij
    '@ > test.txt

    gc test.txt

    (gc test.txt -read 1 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 2 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 3 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 4 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 5 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 6 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 7 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 8 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 9 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count
    (gc test.txt -read 10 | % {if ($_ -like '*a*') {$_}} | Measure-Object).count

    (gc test.txt -read 1 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 2 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 3 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 4 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 5 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 6 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 7 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 8 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 9 | % {if ($_ -match 'a') {$_}} | Measure-Object).count
    (gc test.txt -read 10 | % {if ($_ -match 'a') {$_}} | Measure-Object).count

    # delete when done
    ri test.txt

    --
    Kiron



  10. #10


    Keith Hill [MVP] Guest

    Re: Searching large text files

    "Kiron" <Kiron@discussions.microsoft.com> wrote in message
    news:O2H3piwrHHA.4364@TK2MSFTNGP04.phx.gbl...
    > Now try filtering each object with an If statement inside a Foreach-Object
    > scriptblock. Count is constantly 10 as expected.
    > Where-Object and Get-Content's -ReadCount <-gt 1> don't get along:
    >


    Yeah, what I said wasn't quite right. Setting -readcount to something like 5
    will read five lines at a time and send each batch down the pipeline as an
    array. For your ten-line test file that means two array objects, each with
    5 strings in it:

    64> gc test.txt -read 5 | get-typename # get-typename from PSCX
    Object[]
    Object[]

    65> gc test.txt -read 5 | %{$_} | get-typename
    String
    String
    String
    String
    String
    String
    String
    String
    String
    String

    "Typically" these arrays are dealt with in the same way as if you had sent
    the strings one at a time, but not in all cases. In the ForEach-Object
    example above, emitting $_ sends the array down the pipeline, which shreds
    the array and sends the individual elements. In the case of -like, it will
    work on an array as well as a scalar:

    66> gc test.txt -read 5 | where {$_ -like 'a*'}
    a
    ab
    abc
    abcd
    abcde
    abcdef
    abcdefg
    abcdefgh
    abcdefghi
    abcdefghij

    or

    68> (ql a ab abc abcd) -like "a*" # ql or quote-list from PSCX
    a
    ab
    abc
    abcd

    Many cmdlets will accept an array of input and then operate on each element
    individually. However, in your case, what you are measuring with
    Measure-Object is the fact that the Where-Object cmdlet just sends the
    "original" object (which is an array) on down the pipeline if the expression
    evaluates to true. Fortunately both -like and -match operate on arrays and
    return just the elements that match:

    2> (ql ab ba cd af) -match '^a'
    ab
    af
    3> (ql ab ba cd af) -like 'a*'
    ab
    af
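
    Put together, a rough sketch of why the counts differ, using the same ten
    lines as the test file (the temp file name is made up). With -ReadCount 5
    I would expect the first count to be the number of arrays and the second
    the number of lines:

```powershell
# Hypothetical demo file holding the same ten lines as test.txt.
$path = Join-Path ([IO.Path]::GetTempPath()) 'readcount-demo.txt'
'a', 'ab', 'abc', 'abcd', 'abcde',
'abcdef', 'abcdefg', 'abcdefgh', 'abcdefghi', 'abcdefghij' |
    Set-Content -Path $path

# Where-Object receives two 5-element arrays and passes each one
# through whole, so Measure-Object counts arrays, not lines.
$chunks = (Get-Content -Path $path -ReadCount 5 |
    Where-Object { $_ -like '*a*' } |
    Measure-Object).Count

# Unrolling each array first gives Where-Object one string at a time.
$lines = (Get-Content -Path $path -ReadCount 5 |
    ForEach-Object { $_ } |
    Where-Object { $_ -like '*a*' } |
    Measure-Object).Count

$chunks
$lines

Remove-Item -Path $path
```

    The extra ForEach-Object stage keeps the larger read size while still
    filtering line by line.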

    What I'm not seeing is get-content ballooning the memory requirements of
    PowerShell. I ran the following command on a 77 MB text file:

    84> measure-command { gc large.txt | ?{$_ -match 'dg\s*$'} }

    Days : 0
    Hours : 0
    Minutes : 2
    Seconds : 18
    Milliseconds : 162
    Ticks : 1381622340
    TotalDays : 0.00159909993055556
    TotalHours : 0.0383783983333333
    TotalMinutes : 2.3027039
    TotalSeconds : 138.162234
    TotalMilliseconds : 138162.234

    and PowerShell never gets above ~53 MB of private memory.
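
    For anyone who wants to check this on their own files, one rough way
    (a sketch only; it samples before and after, not the peak, and the
    generated demo file is a hypothetical stand-in for a real log) is to read
    the private memory size of the current process around the pipeline:

```powershell
# Hypothetical stand-in file; point $path at a real large file instead.
$path = Join-Path ([IO.Path]::GetTempPath()) 'memory-demo.txt'
1..1000 | ForEach-Object { "line $_ dg" } | Set-Content -Path $path

# $PID is the id of the current PowerShell process.
$before = (Get-Process -Id $PID).PrivateMemorySize64

# Stream the file through the filter and keep only the count.
$count = (Get-Content -Path $path |
    Where-Object { $_ -match 'dg\s*$' } |
    Measure-Object).Count

$after   = (Get-Process -Id $PID).PrivateMemorySize64
$deltaMB = [math]::Round(($after - $before) / 1MB, 1)

"matched $count lines, private memory changed by $deltaMB MB"

Remove-Item -Path $path
```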

    --
    Keith


