Windows Vista Forums
Vista - Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

Old 07-03-2009   #1 (permalink)


 
 

Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

For proper character support, I need UTF8 or Unicode support.

Powershell provides this for the purposes of writing to text files via the Add-Content -Encoding UTF8 feature.

Set-Content -Encoding UTF8 ....works great

...... but......

Add-Content -Encoding UTF8.....adds strange chars to the beginning of lines in the text file!!

These extra chars look like little squares... They are not visible when I use Set-Content -Encoding UTF8, only when I use Add-Content -Encoding UTF8, so I am assuming that it is not the end of line chars.

I am using MS Notepad to look at the *.txt file. Using Powershell 1 and XP.


Can anyone explain this? How to get around this? Is it going to be fixed soon?

Old 07-03-2009   #2 (permalink)
tojo2000


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

On Jul 3, 5:10 pm, ioioio322 <gu...@xxxxxx-email.com> wrote:
Quote:

> For proper character support, I need UTF8 or Unicode support.
>
> Powershell provides this for the purposes of writing to text files via
> the Add-Content -Encoding UTF8 feature.
>
> Set-Content -Encoding UTF8 ....works great
>
> ...... but......
>
> Add-Content -Encoding UTF8.....adds strange chars to the beginning of
> lines in the text file!!
>
> These extra chars look like little squares... They are not visible
> when I use Set-Content -Encoding UTF8, only when I use Add-Content
> -Encoding UTF8, so I am assuming that it is not the end of line chars.
>
> I am using MS Notepad to look at the *.txt file. Using Powershell 1
> and XP.
>
> Can anyone explain this? How to get around this? Is it going to be
> fixed soon?
>
> --
> ioioio322
This may be a stupid question, but are you using Add-Content to add
data to a file that was created as a Windows UTF-8 file?

There's a well-known issue with Windows UTF-8 files where they include
a BOM, where most Linux/Unix/Other utilities may not be expecting
it. If you're adding to an existing file, and you're not sure if it
has the BOM, then maybe it might be best to just read the old file,
add your data, and then over-write the file.
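
A minimal sketch of that read, add, and overwrite approach, in case it helps. The temp-file path and sample text are made up for illustration, and whether -Encoding UTF8 writes a BOM at all depends on the PowerShell version, so treat this as an outline rather than a tested fix for v1:

```powershell
# Hypothetical demo path; not taken from the thread.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-rewrite-demo.txt'

# Create a starting file the same way the original post does.
Set-Content -Path $path -Encoding UTF8 -Value 'hello world'

# Read every existing line into memory first, so the file is closed
# again before it gets overwritten.
$lines = @(Get-Content -Path $path)

# Add the new data in memory instead of calling Add-Content.
$lines += 'good bye'

# Overwrite the whole file in a single write, so any BOM is emitted
# at most once, at the start of the file.
Set-Content -Path $path -Encoding UTF8 -Value $lines
```

The trade-off is that the whole file is rewritten on every change, which is fine for small text files but wasteful for large logs.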
Old 07-04-2009   #3 (permalink)


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

^Sometimes I'm adding a line to an already existing file (that's what Add-Content is for). Other times I use it to construct a complex txt file 1 line at a time from scratch.
Same effect. Only happens with Unicode or UTF-8 files....and these files were not made UNIX side.

I haven't tried the txt file UNIX side. So notepad should be able to open it without special chars appearing (like empty squares at the beginning of lines).

Has this bug been addressed in newer versions? Does MS know about it?
Old 07-04-2009   #4 (permalink)
Bob Landau


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bu

I understand how easy it is to point the finger at Microsoft; however, this
time the bug is with whatever tool you are using in the *nix world.

I suggest you bring this up with the vendor of the tool you're having problems
with. Refer them to this FAQ, which explains why and when you need the BOM.

http://www.unicode.org/faq/utf_bom.html


There is a very good reason why PowerShell and Notepad use the BOM: it tells
the reader which encoding the file is in. You've named 3 possibilities:

is this ASCII
is this UTF-8
is this UNICODE big-endian

but there are more.


"ioioio322" wrote:
Quote:

>
> ^Sometimes I'm adding a line to an already existing file (that's what
> add-content is for). Other times i use it to construct a complex txt
> file 1 line at a time from scratch.
> Same effect. Only happens with Unicode or UTF-8 files....and these
> files were not made UNIX side.
>
> I haven't tried the txt file UNIX side. So notepad should be able to
> open it without special chars appearing (like empty squares at the
> beginning of lines).
>
> Has this bug been addressed in newer versions? Does MS know about it?
>
>
> --
> ioioio322
>
Old 07-04-2009   #5 (permalink)


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

^This has nothing to do with nix. You misread my post. I said I am NOT using unix\linux...

I specifically said what I was using.
-Powershell 1
-MS Notepad
-Win XP Pro
Old 07-04-2009   #6 (permalink)
Bob Landau


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bu

My apologies, I misread the Unix part. I guess it was tojo who said that some
Linux tools are not aware of this.

Nevertheless, what I said applies to whatever "tool" is reading the file.
There is a section in this FAQ, "How should I deal with BOMs?", which explains
how you or whoever should handle this.

The BOM is only written once regardless of # of times Add-Content is used
and is absolutely required in order for any text processor to be able to
interpret the character encoding. Both Add-Content and Set-Content add a BOM
unless the character encoding is ASCII.

Look

Set-Content -Encoding UTF8 -path sc.txt 'hello world'
Get-Content -encoding byte sc.txt | format-hex
ef bb bf 68 65 6c 6c 6f 20 77 6f 72 6c 64 0d 0a hello.world..
^^^^^^
BOM = UTF-8

Add-Content -Path ac.txt -Encoding UTF8 "hello world"
Add-Content -Path ac.txt -Encoding UTF8 "good bye"
Get-Content -encoding byte ac.txt | format-hex
ef bb bf 68 65 6c 6c 6f 20 77 6f 72 6c 64 0d 0a hello.world..
67 6f 6f 64 20 62 79 65 0d 0a good.bye..

## here GC is reading the BOM and interpreting the text according to the
character encoding

Get-Content ac.txt
hello world
good bye
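
For anyone without a format-hex command, here is a rough sketch that dumps the same bytes using only .NET calls and counts how many times the UTF-8 BOM sequence (ef bb bf) appears. The temp-file path is hypothetical, and the count you get depends on your PowerShell version: one BOM at the start is the behaviour described above, more than one would match the squares reported at the start of later lines.

```powershell
# Hypothetical demo path; not taken from the thread.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-hexdump-demo.txt'

# Start from a fresh file, then append a second line.
Set-Content -Path $path -Encoding UTF8 -Value 'hello world'
Add-Content -Path $path -Encoding UTF8 -Value 'good bye'

# Read the raw bytes directly through .NET.
$bytes = [System.IO.File]::ReadAllBytes($path)

# Print 16 bytes per row as two-digit lowercase hex.
for ($i = 0; $i -lt $bytes.Length; $i += 16) {
    $last = [Math]::Min($i + 15, $bytes.Length - 1)
    $row = @($bytes[$i..$last] | ForEach-Object { $_.ToString('x2') })
    [string]::Join(' ', $row)
}

# Count every occurrence of the UTF-8 BOM byte sequence EF BB BF.
$bomCount = 0
for ($i = 0; $i -le $bytes.Length - 3; $i++) {
    if ($bytes[$i] -eq 0xEF -and $bytes[$i + 1] -eq 0xBB -and $bytes[$i + 2] -eq 0xBF) {
        $bomCount++
    }
}
"UTF-8 BOM sequences found: $bomCount"
```
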

If what I said above has nothing to do with what you're asking, repost and I
promise to stay away from this thread.

"ioioio322" wrote:
Quote:

>
> ^This has nothing to do with nix. You misread my post. I said I am
> *NOT* using unix\linux...
>
> I specifically said what I was using.
> -Powershell 1
> -MS Notepad
> -Win XP Pro
>
>
> --
> ioioio322
>
Old 07-04-2009   #7 (permalink)
tojo2000


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bu

On Jul 4, 1:21 pm, Bob Landau <BobLan...@xxxxxx> wrote:
Quote:

> My apologies, I misread the Unix part. I guess it was tojo who said that some
> Linux tools are not aware of this.
>
> Nevertheless, what I said applies to whatever "tool" is reading the file.
> There is a section in this FAQ, "How should I deal with BOMs?", which explains
> how you or whoever should handle this.
>
> The BOM is only written once regardless of # of times Add-Content is used
> and is absolutely required in order for any text processor to be able to
> interpret the character encoding. Both Add-Content and Set-Content add a BOM
> unless the character encoding is ASCII.
>
> Look
>
> Set-Content -Encoding UTF8 -path sc.txt 'hello world'
> Get-Content -encoding byte sc.txt | format-hex
> ef bb bf 68 65 6c 6c 6f 20 77 6f 72 6c 64 0d 0a hello.world..
> ^^^^^^
> BOM = UTF-8
>
> Add-Content -Path ac.txt -Encoding UTF8 "hello world"
> Add-Content -Path ac.txt -Encoding UTF8 "good bye"
> Get-Content -encoding byte ac.txt | format-hex
> ef bb bf 68 65 6c 6c 6f 20 77 6f 72 6c 64 0d 0a hello.world..
> 67 6f 6f 64 20 62 79 65 0d 0a good.bye..
>
> ## here GC is reading the BOM and interpreting the text according to the
> character encoding
>
> Get-Content ac.txt
> hello world
> good bye
>
> If what I said above has nothing to do with what you're asking, repost and I
> promise to stay away from this thread.
>
> "ioioio322" wrote:
>
> > ^This has nothing to do with nix. You misread my post. I said I am
> > *NOT* using unix\linux...
> >
> > I specifically said what I was using.
> > -Powershell 1
> > -MS Notepad
> > -Win XP Pro
> >
> > --
> > ioioio322
If it is the BOM then I think that is a bug, because I think it should
only be added to the beginning of the file, but the original post
sounds like it's adding it to the beginning of each line, which would
explain why Notepad can't figure out what to do with it.

The Linux thing is a red herring in this situation, I think. BOM
issues with UTF-8 are usually associated with Linux because the BOM is
almost never used there (it interferes with the shebang for executable
files).
Old 07-04-2009   #8 (permalink)


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

Unfortunately "format-hex is not recognized as a cmdlet" in my version of Powershell (v1)...

but the empty square chars happened on EVERY text line....except... the first line of text. And Linux/Unix is not involved in any way.
Old 07-04-2009   #9 (permalink)
tojo2000


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bugs

On Jul 4, 4:48 pm, ioioio322 <gu...@xxxxxx-email.com> wrote:
Quote:

> Unfortunately "format-hex is not recognized as a cmdlet" in my version
> of Powershell (v1)...
>
> but the empty square chars happened on EVERY text line....except... the
> first line of text. And Linux/Unix is not involved in any way.
>
> --
> ioioio322
That makes sense that you wouldn't see them on the first line of text
because notepad would correctly identify that as the BOM and not show
it in the output. Are you going to submit that as a bug? Someone
should.
Old 07-06-2009   #10 (permalink)
Bob Landau


 
 

Re: Add-Content -Encoding UTF8 and -Encoding Unicode Powershell bu

Unfortunately it's not part of v2 either. Format-hex is a work in progress.
I've been writing it because I've not found one that worked quite the way I
wanted.

I thought that showing the command line and output in v2 would be clearer
than putting it in words.

If you are seeing the BOM being added to each line then I would also call
this a bug. However, given that this has been fixed in v2, you may find it
difficult to convince them to fix it in v1; the bug would likely be resolved
as already fixed.

A workaround would be to use either the string member StartsWith or a regex
such as ^efbbbf (against the hex bytes) to find and eliminate these.
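
A rough sketch of one way to do that cleanup, working on the decoded text rather than the hex. Once decoded, the bytes ef bb bf become the single character U+FEFF, so the sketch removes every U+FEFF and then writes the file back with one BOM at the start. The temp-file path and the sample file (built by hand to imitate a BOM on every line) are assumptions for illustration. String.Replace is used instead of StartsWith here because culture-sensitive .NET string comparisons can treat U+FEFF as ignorable, whereas Replace compares ordinally.

```powershell
# Hypothetical demo path; not taken from the thread.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-strip-demo.txt'

# U+FEFF is the character that the bytes EF BB BF decode to.
$bom = [string][char]0xFEFF

# Build a sample file that imitates the symptom: a BOM in front of every line.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
$sample = $bom + "hello world`r`n" + $bom + "good bye`r`n"
[System.IO.File]::WriteAllText($path, $sample, $utf8NoBom)

# Read the text back and drop every U+FEFF, wherever it appears.
$text = [System.IO.File]::ReadAllText($path, $utf8NoBom)
$clean = $text.Replace($bom, '')

# Write the cleaned text back as UTF-8 with a single BOM at the start.
$utf8WithBom = New-Object System.Text.UTF8Encoding $true
[System.IO.File]::WriteAllText($path, $clean, $utf8WithBom)
```
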

"ioioio322" wrote:
Quote:

>
> Unfortunately "format-hex is not recognized as a cmdlet" in my version
> of Powershell (v1)...
>
> but the empty square chars happened on EVERY text line....except... the
> first line of text. And Linux/Unix is not involved in any way.
>
>
> --
> ioioio322
>