Windows Vista Forums
Much faster than Get-Content. Why?

Old 10-16-2006   #1 (permalink)
Roman Kuzmin

Much faster than Get-Content. Why?

The functions f1() and f2() below do the same common job: they read lines
from a file. f1() uses the standard Get-Content, f2() uses an interesting
feature of the switch statement with an empty regular expression. The code
produces:

...
TotalMilliseconds : 3.0332
...
TotalMilliseconds : 1.5639

PS> 3.0332/1.5639
1.93951019886182

Thus, such a peculiar way to read a file as "switch -regex -file" looks
almost twice as fast as Get-Content, which is designed for this task.

Is this reproducible? If yes:

*) Why is it so?
*) Is this going to change?

CODE:

# an existing file
$file = "$pshome\about_globbing.help.txt"

# Get-Content
function f1
{
    Get-Content $file
}

# switch -regex -file
function f2
{
    switch -regex -file ($file) {
        '' {$_}
    }
}

# measure
measure-command {f1}
measure-command {f2}

# to be sure that f1 and f2 do the same job
Compare-Object (f1) (f2)

--
Thanks,
Roman

Old 10-16-2006   #2 (permalink)
Roman Kuzmin

RE: Much faster than Get-Content. Why?

Just for fun, two more competitors are in the game: the ItemContent variable
and .NET [System.IO.File]::ReadAllLines:

TotalMilliseconds:

3.0423 # Get-Content
1.5551 # switch -regex -file
1.8145 # ItemContent variable
1.0138 # ReadAllLines

( these are averaged results from Measure-CommandEx.ps1
http://nightroman.spaces.live.com/blog/cns!F011223B604739FA!120.entry )

So far ReadAllLines is the fastest way (but it is .NET), and "switch -regex
-file" is still the fastest native PowerShell way. Are there any faster
methods to get lines from a file?

CODE:

# an existing file
$file = "$pshome\about_globbing.help.txt"

# Get-Content
function f1
{
    Get-Content $file
}

# switch -regex -file
function f2
{
    switch -regex -file ($file) {
        '' {$_}
    }
}

# ItemContent variable
function f3
{
    Invoke-Expression "`${$file}"
}

# ReadAllLines
function f4
{
    [System.IO.File]::ReadAllLines($file)
}

# measure
(measure-command {f1}).TotalMilliseconds
(measure-command {f2}).TotalMilliseconds
(measure-command {f3}).TotalMilliseconds
(measure-command {f4}).TotalMilliseconds

# to be sure that f1, f2, f3, f4 do the same job
Compare-Object (f1) (f2)
Compare-Object (f1) (f3)
Compare-Object (f1) (f4)
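
Single Measure-Command runs on a 12-line file are noisy, which is why the
numbers above are averages. A minimal sketch of one way to average several
runs follows. The function name Measure-Average and its parameters are
hypothetical and are not taken from Measure-CommandEx.ps1, which may work
differently.

```powershell
# Hypothetical helper: run a script block several times and return the
# mean elapsed time in milliseconds.
function Measure-Average
{
    param(
        [scriptblock] $Script,
        [int] $Count = 20
    )

    $total = 0.0
    for ($i = 0; $i -lt $Count; $i++) {
        # Measure-Command returns a TimeSpan for one execution of the block
        $total += (Measure-Command $Script).TotalMilliseconds
    }
    $total / $Count
}
```

Usage would look like "Measure-Average {f1}" or "Measure-Average {f4} -Count 50".
The first run usually includes one-time costs, so a warm-up call before
measuring may give steadier averages.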

--
Thanks,
Roman

Old 10-16-2006   #3 (permalink)
klumsy@xtra.co.nz

Re: Much faster than Get-Content. Why?

Sadly, PowerShell's slow speed is one of its weaknesses. I think that once
PowerShell is released, becomes better known, and is adopted because of its
use in Microsoft products like Exchange, people are going to benchmark it
against Perl, Python, and even Ruby. It will come up lacking, and many may
write it off because of that.

However, I think PowerShell is first of all a shell language, and the
majority of scripts are going to be small and rather immediate. Things the
size of some of the larger Perl applications out there may be quite hard to
do well in PowerShell, but that's not such a big problem to me. I also know
that in future versions Microsoft will spend a lot of time making things go
a lot faster. Personally, I hope the benchmarks come in more favourable than
I presume, mostly because I don't want PowerShell to be written off because
of them and its benefits overlooked.

Like Perl and other dynamic languages, it starts from a premise. In the
past, computer time cost more than developer time. With modern hardware, the
performance difference is negligible for most things, especially considering
that over 90% of execution time is spent in OS API calls, drivers, etc. The
modern reality is that the time of the developer or IT professional is more
expensive than the computing hardware. And PowerShell is very fast in this
regard: its slight learning curve pays off again and again with its
consistency, interactivity, reusability, composability, etc.

Karl

Old 10-17-2006   #4 (permalink)
dreeschkind

RE: Much faster than Get-Content. Why?

"Roman Kuzmin" wrote:

> TotalMilliseconds:
>
> a) 3.0423 # Get-Content
> b) 1.5551 # switch -regex -file
> c) 1.8145 # ItemContent variable
> d) 1.0138 # ReadAllLines
>
> Much faster than Get-Content. Why?


I'll try to answer your original question:

I think it is understandable why d) is the fastest way. PowerShell itself is
a .NET application, therefore .NET calls from PowerShell just need to be
translated to native .NET calls, which seems to work very efficiently.
Solutions b) and c) involve more complicated PowerShell language constructs
with a little more language overhead than d). Finally, a) is the slowest
solution, since it requires analyzing the command to invoke (get-content
could be an alias, for example), creating new instances of the get-content
cmdlet class, etc.

In summary, I'd say the higher the abstraction, the more overhead you'll get.
So you basically have two ways to go: use cmdlets to write scripts in very
little time, or use direct .NET calls to get scripts with better performance.
However, if you're going to avoid cmdlets in PowerShell completely because
performance matters more to you than the time it takes to write a script,
then maybe IronPython or Ruby.NET is a better language for your needs.
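
As an illustration of the "direct .NET calls" route, here is a minimal
sketch that streams lines through a StreamReader instead of loading the
whole file as ReadAllLines does. The function name Read-LinesDotNet is
hypothetical, and try/finally assumes a PowerShell version that supports it.
The path is resolved first because .NET file APIs resolve relative paths
against the process working directory, which can differ from the current
PowerShell location.

```powershell
# Hypothetical helper: emit the lines of a text file one by one using
# System.IO directly, without going through Get-Content.
function Read-LinesDotNet
{
    param([string] $Path)

    # Turn a PowerShell path into a full file system path for .NET
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $reader = [System.IO.File]::OpenText($fullPath)
    try {
        # ReadLine returns $null at end of file; empty lines are kept
        while ($null -ne ($line = $reader.ReadLine())) {
            $line
        }
    }
    finally {
        $reader.Dispose()
    }
}
```

The output can be piped like any other command, for example
"Read-LinesDotNet $file | Select-String 'pattern'". The lines are plain
strings without any extra properties attached.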

--
greetings
dreeschkind
Old 10-17-2006   #5 (permalink)
Roman Kuzmin

RE: Much faster than Get-Content. Why?

"dreeschkind" wrote: ...
"Karl" wrote: ...

Yes, in general I agree with everything.

Meanwhile, it looks like my benchmark was not a serious one. The test file
was too small: "about_globbing.help.txt" - 161 bytes, 12 lines. Now let's try
a large file: "microsoft.powershell.commands.management.dll-help.xml" -
886281 bytes, 17420 lines:

TotalMilliseconds:

a) 1710.3015 # Get-Content
b) 715.2535 # switch -regex -file
c) 25.0589 # ItemContent variable
d) 25.097 # ReadAllLines

c) and d) are the fastest and actually almost the same. Now "switch -regex
-file" does not look fast at all. But it is still much faster than
Get-Content.

As for Get-Content, I can understand some overhead at startup, which should
be insignificant for large files. But the difference is simply enormous
(~70 times for this example). IMHO, taking into account the popularity and
importance of file reading operations, Get-Content should use efficient file
operations directly, avoiding the provider calls or whatever else makes it
so slow. I believe an exception for files is necessary. A user does not
care how a cmdlet works internally if it works fine.

In my practice, parsing numerous large text files is quite an everyday
task. It's a pity that I actually have to use alternatives to Get-Content,
which is supposed to be the standard way.

> then maybe IronPython or Ruby.NET is a better language for your needs


Actually I am pretty happy with Perl, with its power and performance. Also,
C# is my good old friend for more complex or performance-sensitive tasks. But
like many of us I am already familiar with PPSS (Post PowerShell Syndrome)...
So I would like to do more things in PowerShell, preferably in its native
ways.

--
Thanks,
Roman

Old 10-18-2006   #6 (permalink)
Bruce Payette [MSFT]

Re: Much faster than Get-Content. Why?

This is a known issue with the way Get-Content works. For each object
returned from the pipe, it adds a bunch of extra information to that object
in the form of NoteProperties. You can see these properties using
get-member:

PS (37) > get-content file1.txt | gm -type noteproperty

   TypeName: System.String

Name         MemberType   Definition
----         ----------   ----------
PSChildName  NoteProperty System.String PSChildName=file1.txt
PSDrive      NoteProperty System.Management.Automation.PSDriveInfo PSDrive=C
PSParentPath NoteProperty System.String PSParentPath=C:\Temp\files
PSPath       NoteProperty System.String PSPath=C:\Temp\files\file1.txt
PSProvider   NoteProperty System.Management.Automation.ProviderInfo PSProvider=Mi
ReadCount    NoteProperty System.Int64 ReadCount=1

These properties are being added for *every* object processed in the
pipeline. We do this to allow cmdlets to work more effectively together.
It's important because things like the Path property may vary across
different object types. In effect, we're doing "property name
normalization". Unfortunately, while this technique provides significant
benefits by making the system more consistent, it isn't free. It adds
significant overhead both in terms of processing time and memory space.
We're investigating ways to reduce these costs without losing the benefits
but in the end, we may need to add a way to suppress adding this extra
information.
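
The difference described above can be observed directly. The following is a
small sketch, assuming a scratch file in the temp directory (the file name
is made up for the demonstration). It compares a line read by Get-Content
with the same line read by [System.IO.File]::ReadAllLines.

```powershell
# Hypothetical scratch file, created only for this demonstration
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'noteproperty-demo.txt'
Set-Content -Path $path -Value 'first line', 'second line'

$viaCmdlet = @(Get-Content -Path $path)[0]
$viaDotNet = [System.IO.File]::ReadAllLines($path)[0]

# Both values are strings with the same text
$sameText = $viaCmdlet -eq $viaDotNet

# Only the Get-Content string carries the extra provider properties
$cmdletHasPSPath = $null -ne $viaCmdlet.PSObject.Properties['PSPath']
$dotNetHasPSPath = $null -ne $viaDotNet.PSObject.Properties['PSPath']

"Same text: $sameText"
"Get-Content line has PSPath:  $cmdletHasPSPath"
"ReadAllLines line has PSPath: $dotNetHasPSPath"

Remove-Item -Path $path
```

The plain .NET string is returned as-is, so none of the per-object
decoration cost is paid for it.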

One trick to work around this is to use the -ReadCount parameter. This
somewhat misnamed parameter controls the number of records Get-Content
writes into the pipeline at a time. So - if you execute

Get-Content -readcount 10 foo.txt | out-null

you'll see a significant perf improvement because the extra information is
being added to each collection of 10 records instead of to each record. Take
a look at the performance impact -readcount has in some simple examples:

PS (42) > (measure-command { get-content junk.txt | out-null }).TotalMilliseconds
249.6448
PS (43) > (measure-command { get-content -readcount 10 junk.txt | out-null }).TotalMilliseconds
52.6695
PS (44) > (measure-command { get-content -readcount 100 junk.txt | out-null }).TotalMilliseconds
7.8794
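
One consequence of -ReadCount is that each pipeline object becomes an array
of lines rather than a single line, so code that processes individual lines
needs an inner loop. A minimal sketch, assuming a made-up 25-line scratch
file in the temp directory:

```powershell
# Hypothetical scratch file with 25 lines, created only for this example
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
1..25 | ForEach-Object { "line $_" } | Set-Content -Path $path

# Each object written to the pipeline is now a batch of up to 10 lines
$batches = @(Get-Content -Path $path -ReadCount 10)

# Unroll the batches to get back to individual lines
$lines = foreach ($batch in $batches) {
    foreach ($line in $batch) {
        $line
    }
}

"Batches: $($batches.Count)"
"Lines:   $($lines.Count)"

Remove-Item -Path $path
```

A larger batch size means fewer decorated objects. A value of 0 asks
Get-Content to send all lines in a single batch.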

-bruce

--
Bruce Payette [MSFT]
Windows PowerShell Technical Lead
Microsoft Corporation
This posting is provided "AS IS" with no warranties, and confers no rights.

Visit the Windows PowerShell Team blog at:
http://blogs.msdn.com/PowerShell
Visit the Windows PowerShell ScriptCenter at:
http://www.microsoft.com/technet/scr.../hubs/msh.mspx
My Book: http://manning.com/powershell
Old 10-18-2006   #7 (permalink)
Roman Kuzmin

Re: Much faster than Get-Content. Why?

Bruce Payette [MSFT] wrote:
> …


Bruce,

Thank you for the detailed explanation of the issue and the very useful
information. I see now that Get-Content is really more complex than I used to
think, and its performance penalty is perhaps inevitable.

>We're investigating ways to reduce these costs without losing the benefits
>but in the end, we may need to add a way to suppress adding this extra
>information.


I wish you all success in making things that are already good even better.

I am not sure about this, but here is a thought: perhaps this mechanism
should be disabled by default for some special cases, e.g. reading lines
from a file. A user could then optionally enable it only when it is really
necessary.

--
Thanks,
Roman
