Vista - how to improve hashtable performance

04-13-2007   #1
Doug

how to improve hashtable performance

I have a file of 5500 lines, about 100K words.
This function returns a hash table of the unique words.

<snippet>
function train($text)
{
    $text = [string]::join(" ", $text)
    [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
}

measure-command { train (gc "test.txt") }
</snippet>

It takes 9 seconds. About 8 of those seconds are spent piping the regex
results to foreach to build the hashtable.

Here is the Python 2.5 approach, which is subsecond.

How can I get the same speed in PowerShell?

<python>
import re
import collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

train(words(file('test.txt').read()))
</python>

04-14-2007   #2
Keith Hill

Re: how to improve hashtable performance

"Doug" <Doug@discussions.microsoft.com> wrote in message
news:6F6C1320-62B1-46B4-B211-88537F588FEB@microsoft.com...
>I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.
>
> <snippet>
> function train($text)
> {
> $text = [string]::join(" ", $text)
> [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
> }
>
> measure-command { train (gc "test.txt") }
> </snippet>
>
> Takes 9 seconds. ~ 8 seconds is in piping the regex results to foreach
> creating the hashtable.
>
> Here is the python 2.5 approach and is subsecond.
>
> How can I get the same speed in PowerShell?


I'm not sure whether this will help performance, but for correctness, I
don't think you want to create a new hashtable for every word sent down the
pipe, right? See if this works any better (if not any faster):

# $text is assumed to already hold the lines of the file, e.g. $text = gc "test.txt"
$ht = @{}
measure-command { $text = [string]::join(" ", $text) }
measure-command { $split = [regex]::split($text, '\W+') }
measure-command { $split | % {$ht[$_] += 1} }
$ht

Note that the ToLower() is unnecessary if you use PowerShell syntax to
create a hashtable ($ht = @{}) since PowerShell creates the hashtable with
case-insensitive keys.
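
A quick way to check the case-insensitivity claim. This is only a sketch: the sample words are made up, and the second table is created through New-Object purely for contrast, on the assumption that a Hashtable built that way keeps .NET's default case-sensitive comparer.

```powershell
# A hashtable literal (@{}) treats 'The', 'the' and 'THE' as one key.
$ht = @{}
foreach ($w in 'The', 'the', 'THE') { $ht[$w] += 1 }
$ht.Count      # 1
$ht['the']     # 3

# For contrast: a Hashtable created directly through .NET is case-sensitive.
$cs = New-Object System.Collections.Hashtable
foreach ($w in 'The', 'the', 'THE') { $cs[$w] += 1 }
$cs.Count      # 3
```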

--
Keith

04-14-2007   #3
nweissma

RE: how to improve hashtable performance


"Doug" wrote:

> I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.

Sorry to sound like an idiot, but what language is this?

>
> <snippet>
> function train($text)
> {
> $text = [string]::join(" ", $text)
> [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
> }
>
> measure-command { train (gc "test.txt") }
> </snippet>
>
> Takes 9 seconds. ~ 8 seconds is in piping the regex results to foreach
> creating the hashtable.
>
> Here is the python 2.5 approach and is subsecond.
>
> How can I get the same speed in PowerShell?


What is Python used for, and how is it implemented on Windows?
>
> <python>
> import re
>
> def words(text): return re.findall('[a-z]+', text.lower())
>
> def train(features):
> model = collections.defaultdict(lambda: 1)
> for f in features:
> model[f] += 1
> return model
>
> train(words(file('test.txt').read()))
> </python>

04-14-2007   #4
Doug

Re: how to improve hashtable performance

Thanks Keith.

I had thought the

| % {$h=@{}} {$h[$_] = ''} {$h}

followed the "begin, process, end" pattern, so the hash table only gets
created once, the first time through.

Good to know the keys are case-insensitive.

Performance didn't improve, so I am going to pre-process the file and do an
Export-Clixml.
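
One thing not tried in this thread is swapping the pipeline for a foreach statement, which avoids sending every word through ForEach-Object one at a time. The following is only a sketch: the function name Get-UniqueWords is made up, the sample lines stand in for (gc "test.txt"), and no timing has been measured on the original file, so treat any speedup as something to verify with measure-command.

```powershell
# Hypothetical helper: builds the same kind of table as train, but with a
# foreach statement instead of piping each word to ForEach-Object.
function Get-UniqueWords($lines)
{
    $text = [string]::Join(' ', $lines)
    $h = @{}
    foreach ($word in [regex]::Split($text.ToLower(), '\W+'))
    {
        # The split can yield empty strings at the edges; skip them.
        if ($word -ne '') { $h[$word] = '' }
    }
    $h
}

# Sample input standing in for (gc "test.txt").
$sample = 'The quick brown fox', 'jumps over the lazy dog'
$words = Get-UniqueWords $sample
$words.Count    # 8
```

Unlike the original train, this version drops the empty-string key that the split can produce.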

04-14-2007   #5
Keith Hill

Re: how to improve hashtable performance

"Doug" <Doug@discussions.microsoft.com> wrote in message
news:33572893-3BB9-4723-A412-61A4E90C228D@microsoft.com...
> Thanks Keith.
>
> I had thought the
>
> | % {$h=@{}} {$h[$_] = ''} {$h}
>
> followed the "begin, process, end" pattern. Then the hash table only gets
> created the first time through.


Yes, you are right. I just wasn't mentally parsing all the curly braces
correctly. :-)
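
For anyone else who finds the three bare script blocks hard to read, the same shape can be spelled out with ForEach-Object's named parameters. A small sketch with placeholder words standing in for the regex split output; the variable names are illustrative only.

```powershell
# Placeholder words standing in for the output of [regex]::split.
$words = 'a', 'b', 'a', 'c'

# -Begin runs once before the first item, -Process once per item,
# -End once after the last item.
$result = $words |
    ForEach-Object -Begin { $table = @{} } -Process { $table[$_] = '' } -End { $table }

$result.Count    # 3
```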

--
Keith

04-14-2007   #6
Flowering Weeds

Re: how to improve hashtable performance



>I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.
>

Perhaps another way: Log Parser 2.2

PS> $filename = "$pshome\about_assignment_operators.help.txt"
PS> $query = '"SELECT text, COUNT(*)'
PS> $query += " FROM $filename"
PS> $query += " GROUP BY text"
PS> $query += ' ORDER BY COUNT(*) DESC"'
PS> $lpOptions = " -i:textword -statsff -rtp:1000"
PS> $theList = LogParser "$query$lpOptions"
PS> 0..5 | foreach { $theList[$_] }
Text           COUNT(ALL *)
-------------- ------------
the            209
to             104
value          80
a              77
PS> $theList.Count
519
PS> $theList | select-string "the"

the 209
The 20
then 12
THE 6
whether 2
either 2
another 2
other 2
there 1
then: 1

PS> $theList | select-string "^the "

the 209
The 20
THE 6

Just another way!



04-14-2007   #7
Doug

Re: how to improve hashtable performance

Cool, I like logparser.

I wanted a solution that doesn't have a dependency on another install.
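
A rough built-in counterpart to the Log Parser query can be sketched with Group-Object and Sort-Object, which need no extra install. The sample text below is invented, and no claim is made about how this performs on a 100K-word file.

```powershell
# Invented sample text; in practice this would come from gc on the file.
$text = 'the cat and the dog and the bird'

# Split into words, drop empty pieces, count each word, most frequent first.
$counts = [regex]::Split($text, '\W+') |
    Where-Object { $_ -ne '' } |
    Group-Object |
    Sort-Object Count -Descending

$counts[0].Name     # the
$counts[0].Count    # 3
```

Group-Object ignores case by default, so "the" and "The" are counted together, unlike the Log Parser output in the previous post; its -CaseSensitive switch changes that.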

Interestingly, if the hashtable that the earlier post builds on the fly is
piped to Export-Clixml, a PoSH script can then Import-Clixml it and sort in
subsecond response time, compared to the 8 seconds it takes to build it on
the fly.
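
The save-once, load-later idea might look roughly like this. It is a sketch under assumptions: the cache file name and location are made up, and a tiny hand-written table stands in for the one built from test.txt.

```powershell
# Hypothetical cache path; any writable location would do.
$cache = Join-Path ([System.IO.Path]::GetTempPath()) 'words.clixml'

# One-time step: build the table (a stand-in here) and save it.
$h = @{ the = 3; and = 2; cat = 1 }
$h | Export-Clixml $cache

# Later runs: load the saved table and sort its entries by count.
$loaded = Import-Clixml $cache
$top = $loaded.GetEnumerator() | Sort-Object Value -Descending
$top[0].Key    # the
```

The cache has to be regenerated whenever the source file changes, which is the trade-off for skipping the build step.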
