| | #1 (permalink) |
how to improve hashtable performance

I have a file, 5500 lines, about 100K words. This function returns a hash table of unique words.

<snippet>
function train($text)
{
    $text = [string]::join(" ", $text)
    [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
}

measure-command { train (gc "test.txt") }
</snippet>

Takes 9 seconds. About 8 of those seconds go to piping the regex results to foreach to create the hashtable.

Here is the Python 2.5 approach, which is subsecond.

How can I get the same speed in PowerShell?

<python>
import collections
import re

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

train(words(file('test.txt').read()))
</python>
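For readers who want to run the Python comparison today, here is a minimal sketch of the same counting idea in Python 3. It is not the poster's exact script: `file()` is a Python 2 builtin, so this version takes the text as a string, and the sample text is made up for illustration. Like the original, it starts each count at 1 via `defaultdict(lambda: 1)`, so a word seen n times ends up mapped to n + 1.

```python
import collections
import re


def words(text):
    # Lowercase the text and pull out runs of ASCII letters.
    return re.findall('[a-z]+', text.lower())


def train(features):
    # Each unseen word starts at 1, so a word seen n times maps to n + 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


# Hypothetical sample text standing in for test.txt.
sample = "The quick brown fox. The lazy dog!"
model = train(words(sample))
print(len(model))    # number of unique words
print(model["the"])  # "the" appears twice, so 1 + 2
```

Note that Python dictionary keys are case-sensitive, which is why this version lowercases the text before counting.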
| | #2 (permalink) |
Re: how to improve hashtable performance

"Doug" <Doug@discussions.microsoft.com> wrote in message news:6F6C1320-62B1-46B4-B211-88537F588FEB@microsoft.com...
> I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.
>
> <snippet>
> function train($text)
> {
>     $text = [string]::join(" ", $text)
>     [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
> }
>
> measure-command { train (gc "test.txt") }
> </snippet>
>
> Takes 9 seconds. ~ 8 seconds is in piping the regex results to foreach
> creating the hashtable.
>
> Here is the python 2.5 approach and is subsecond.
>
> How can I get the same speed in PowerShell?

I'm not sure if this will help performance, but to help with correctness: I don't think you want to create a new hashtable for every instance of a word sent down the pipe, right? See if this works any better (if not any faster):

$ht = @{}
measure-command { $text = [string]::join(" ", $text) }
measure-command { $split = [regex]::split($text, '\W+') }
measure-command { $split | % {$ht[$_] += 1} }
$ht

Note that the ToLower() is unnecessary if you use PowerShell syntax to create a hashtable ($ht = @{}), since PowerShell creates the hashtable with case-insensitive keys.

--
Keith
| | #3 (permalink) |
RE: how to improve hashtable performance

"Doug" wrote:
> I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.

Sorry to sound like an idiot: what language is this?

>
> <snippet>
> function train($text)
> {
>     $text = [string]::join(" ", $text)
>     [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
> }
>
> measure-command { train (gc "test.txt") }
> </snippet>
>
> Takes 9 seconds. ~ 8 seconds is in piping the regex results to foreach
> creating the hashtable.
>
> Here is the python 2.5 approach and is subsecond.
>
> How can I get the same speed in PowerShell?

What is Python used for? And how is it implemented in Windows?

>
> <python>
> import re
>
> def words(text): return re.findall('[a-z]+', text.lower())
>
> def train(features):
>     model = collections.defaultdict(lambda: 1)
>     for f in features:
>         model[f] += 1
>     return model
>
> train(words(file('test.txt').read()))
> </python>
| | #4 (permalink) |
Re: how to improve hashtable performance

Thanks Keith.

I had thought the

| % {$h=@{}} {$h[$_] = ''} {$h}

followed the "begin, process, end" pattern, so the hash table only gets created the first time through.

Good to know the keys are case-insensitive.

Performance didn't improve, so I am going to pre-process the file and do an Export-Clixml.

"Keith Hill" wrote:
> "Doug" <Doug@discussions.microsoft.com> wrote in message
> news:6F6C1320-62B1-46B4-B211-88537F588FEB@microsoft.com...
> > I have a file, 5500 lines, about 100K words.
> > This function returns a hash table of unique words.
> >
> > <snippet>
> > function train($text)
> > {
> >     $text = [string]::join(" ", $text)
> >     [regex]::split($text.ToLower(), '\W+') | % {$h=@{}} {$h[$_] = ''} {$h}
> > }
> >
> > measure-command { train (gc "test.txt") }
> > </snippet>
> >
> > Takes 9 seconds. ~ 8 seconds is in piping the regex results to foreach
> > creating the hashtable.
> >
> > Here is the python 2.5 approach and is subsecond.
> >
> > How can I get the same speed in PowerShell?
>
> I'm not sure if this will help performance but to help with the correctness
> I don't think you want to create a new hashtable for every instance of a
> word sent down the pipe, right? See if this works any better (if not any
> faster):
>
> $ht = @{}
> measure-command { $text = [string]::join(" ", $text) }
> measure-command { $split = [regex]::split($text, '\W+') }
> measure-command { $split | % {$ht[$_] += 1} }
> $ht
>
> Note that the ToLower() is unnecessary if you use PowerShell syntax to
> create a hashtable ($ht = @{}) since PowerShell creates the hashtable with
> case-insensitive keys.
>
> --
> Keith
| | #5 (permalink) |
Re: how to improve hashtable performance

"Doug" <Doug@discussions.microsoft.com> wrote in message news:33572893-3BB9-4723-A412-61A4E90C228D@microsoft.com...
> Thanks Keith.
>
> I had thought the
>
> | % {$h=@{}} {$h[$_] = ''} {$h}
>
> followed the "begin, process, end" pattern. Then the hash table only gets
> created the first time through.

Yes, you are right. I just wasn't mentally parsing all the curly braces correctly. :-)

--
Keith
| | #6 (permalink) |
Re: how to improve hashtable performance

> I have a file, 5500 lines, about 100K words.
> This function returns a hash table of unique words.
>

Perhaps another way: Log Parser 2.2

PS> $filename = "$pshome\about_assignment_operators.help.txt"
PS> $query = '"SELECT text, COUNT(*)'
PS> $query += " FROM $filename"
PS> $query += " GROUP BY text"
PS> $query += ' ORDER BY COUNT(*) DESC"'
PS> $lpOptions = " -i:textword -stats ff -rtp:1000"
PS> $theList = LogParser "$query$lpOptions"
PS> 0..5 | foreach { $theList[$_] }
Text           COUNT(ALL *)
-------------- ------------
the            209
to             104
value          80
a              77
PS> $theList.Count
519
PS> $theList | select-string "the"

the            209
The            20
then           12
THE            6
whether        2
either         2
another        2
other          2
there          1
then:          1

PS> $theList | select-string "^the "

the            209
The            20
THE            6

Just another way!
| | #7 (permalink) |
Re: how to improve hashtable performance

Cool, I like logparser. I wanted a solution that doesn't have a dependency on another install.

Interestingly, regarding the previous post where the hashtable is built on the fly: if the hashtable is piped to Export-Clixml, a PoSH script can then Import-Clixml and sort it with subsecond response time, compared to the 8 seconds it takes to build it on the fly.

"Flowering Weeds" wrote:
>
> > I have a file, 5500 lines, about 100K words.
> > This function returns a hash table of unique words.
> >
>
> Perhaps another way Log Parser 2.2
>
> PS> $filename = "$pshome\about_assignment_operators.help.txt"
> PS> $query = '"SELECT text, COUNT(*)'
> PS> $query += " FROM $filename"
> PS> $query += " GROUP BY text"
> PS> $query += ' ORDER BY COUNT(*) DESC"'
> PS> $lpOptions = " -i:textword -stats ff -rtp:1000"
> PS> $theList = LogParser "$query$lpOptions"
> PS> 0..5 | foreach { $theList[$_] }
> Text           COUNT(ALL *)
> -------------- ------------
> the            209
> to             104
> value          80
> a              77
> PS> $theList.Count
> 519
> PS> $theList | select-string "the"
>
> the            209
> The            20
> then           12
> THE            6
> whether        2
> either         2
> another        2
> other          2
> there          1
> then:          1
>
> PS> $theList | select-string "^the "
>
> the            209
> The            20
> THE            6
>
> Just another way!
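The build-once, reload-later idea described here is not specific to PowerShell. As a rough analogue only, the sketch below does the same thing in Python, with `json.dump` and `json.load` standing in for Export-Clixml and Import-Clixml. The file name `word_counts.json`, the helper names, and the sample table are all made up for illustration, and no timings are claimed.

```python
import json
import os
import tempfile


def save_counts(counts, path):
    # Write the word table to disk once, after the slow build step.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(counts, fh)


def load_counts(path):
    # Later runs read the saved table instead of rebuilding it.
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical word table; a real one would come from the build step.
counts = {"the": 209, "to": 104, "value": 80, "a": 77}

cache_path = os.path.join(tempfile.mkdtemp(), "word_counts.json")
save_counts(counts, cache_path)
restored = load_counts(cache_path)

# Sort by count, highest first, as a later script might.
top = sorted(restored.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])
```

The trade-off is the usual one for a cache: the saved file has to be regenerated whenever the source text changes.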