| Welcome to Windows Vista Forums. Our forum is dedicated to helping you find solutions with any problems, errors or issues you are experiencing with Windows Vista. The Vista forum also covers news and updates and has an extensive Windows Vista tutorial section that covers a wide range of tips and tricks. |
| | #1 (permalink) |
| | RegExp to find hex value 0D fails

I have a set of text files from an antiquated DOS editor, in which I need to convert a soft CR to a normal one in certain circumstances. I have tried using regular expressions to locate the soft CR, but have had no success. The match string I use is this:

    \.[A-Z]{2}\x8D

that is, a literal period, followed by two letters, followed by the soft CR (8D, or ASCII 141). The search finds the period and letters fine, but when I add the \x8D, it fails to locate anything. I have tried .IgnoreCase both true and false; the results differ, but neither is correct. Yet when I use the string without the \x8D, locate an occurrence and display the characters in the area, the ASCII 141 is there, immediately following the period and letters, so the .ReadAll method I use to bring in the text is doing so correctly. I've tried other hex codes, like 20 for a space, and they work fine. Is there a bug in the hex parsing routine, or am I doing something stupid (again)?

Pete

--
This e-mail address is fake, to keep spammers and their address harvesters out of my hair. If you need to get in touch personally, I am 'pdanes' and I use yahoo mail. But please use the newsgroups whenever possible, so that all may benefit from the exchange of ideas. |
| | #2 (permalink) |
| | Re: RegExp to find hex value 0D fails

Pete wrote:
Quote:
> The match string I use is this:
>
>     \.[A-Z]{2}\x8D
>
> that is, a literal period, followed by two letters, followed by the soft
> CR (8D, or ascii 141). [...] Is there a bug in the hex parsing routine,
> or am I doing something stupid (again)?

Can you use "\r"? My documentation indicates that will match a carriage-return. And "\n" matches a new-line character. Also, why 8D? Shouldn't it be 0D?

--
Richard Mueller
MVP Directory Services
Hilltop Lab - http://www.rlmueller.net
-- |
| | #3 (permalink) |
| | Re: RegExp to find hex value 0D fails

Hello Richard,

the 8D is correct; that is the code for the soft CR in the old editor format (a CR with the high bit, 128, set), which is what I'm trying to eliminate. 0D is a normal CR, but those are fine where they are. I don't want to find normal newline characters, only the old soft ones, and I know of no other way to get at them besides the hex code, but that's not working.

Pete

"Richard Mueller [MVP]" <rlmueller-nospam@xxxxxx> wrote in message news:OMVa1hG4JHA.5048@xxxxxx
Quote:
> Can you use "\r"? My documentation indicates that will match a
> carriage-return. And "\n" matches a new-line character. Also, why 8D?
> Shouldn't it be 0D?
>
> --
> Richard Mueller
> MVP Directory Services
> Hilltop Lab - http://www.rlmueller.net |
| | #4 (permalink) |
| | Re: RegExp to find hex value 0D fails

"Petr Daneš" <skru.spammers@xxxxxx> wrote in message news:%23G4pRpG4JHA.4632@xxxxxx
Quote:
> Hello Richard,
>
> the 8D is correct, that is the code for the soft CR in the old editor
> format (CR plus the 128 bit), which is what I'm trying to eliminate. 0D is
> a normal CR, but those are okay where they are, so I don't want to find
> normal newline characters, only the old soft ones, and I know of no other
> way to get them beside the hex code, but that's not working.

Pete, could you please post an example file as an attachment? Perhaps zipped so that it comes across exactly intact. |
| | #5 (permalink) |
| | Re: RegExp to find hex value 0D fails

"Petr Daneš" <skru.spammers@xxxxxx> wrote in message news:%23G4pRpG4JHA.4632@xxxxxx
Quote:
> the 8D is correct, that is the code for the soft CR in the old editor
> format (CR plus the 128 bit), which is what I'm trying to eliminate. 0D is
> a normal CR, but those are okay where they are, so I don't want to find
> normal newline characters, only the old soft ones, and I know of no other
> way to get them beside the hex code, but that's not working.

I think you are being bitten by side effects of the internationalization of Windows. For example, the URL:
http://msdn.microsoft.com/en-us/libr...ffice.11).aspx
titled:
HTML Character Sets
has a gap from &#128; through &#159;:

    Char   Code     Entity    Description
    }      &#125;   ---       Right curly brace
    ~      &#126;   ---       Tilde
    ---    &#127;   ---       Unused
           &#160;   &nbsp;    Nonbreaking space
    ¡      &#161;   &iexcl;   Inverted exclamation
    ¢      &#162;   &cent;    Cent sign

The character set code points within this gap are now 'undefined', or seemingly inconsistently defined, in the scripting regular expression engine. Some of these code points have a kind of duality, partially dependent on the 'locale' that CScript/WScript is running under.

Try the following short script, which demonstrates this duality: the Asc and AscW values for some characters are different.

    Dim i, sMsg
    For i = 128 To 159
        sMsg = sMsg & vbCrLf & i & vbTab & Chr(i) & vbTab & _
            Asc(Chr(i)) & vbTab & AscW(Chr(i))
    Next
    MsgBox sMsg

In the 1082 locale (Maltese), the Asc and AscW values for all characters are the same, and the Asc value can be greater than 255; this boggles my mind.

Yes, do give us a short zipped example of the text you are searching through.

Also, if you want to just get rid of all occurrences of Chr(141), you might try the Replace function:

    sNewText = Replace(sOldText, Chr(141), "")

-Paul Randall |
| | #6 (permalink) |
| | Re: RegExp to find hex value 0D fails

Hello James,

here is one file as an example. When I view it in Notepad on my system, the Chr(141) characters, which I want to change, show up as a capital T with a check mark over the top. But I have a Czech version of Windows, so I don't know what it will look like on your machine. In any case, it's always a period and two capital letters followed by the Chr(141), which I want to change to a period and two capital letters followed by a normal Chr(13). A hex editor (I use NEO) will show you the contents exactly.

Pete

"James Whitlow" <jwhitlow.60372693@xxxxxx> wrote in message news:uP1I3xL4JHA.5728@xxxxxx
Quote:
> Pete, could you please post an example file as an attachment? Perhaps
> zipped so that it comes across exactly intact. |
| | #7 (permalink) |
| | Re: RegExp to find hex value 0D fails

Hello Paul,

Quote:
> Try following short script which kind of demonstrates this duality, in
> that the Asc and AscW values for some characters are different:

An ASCII code greater than 255 is impossible, surely? This must be an artefact of the AscW function. I wonder if it expects two bytes and reaches into whatever is before the number, maybe part of the code. But I've run into unexpected results with these functions myself. For example, when I try printing AscW(Chr(0) & Chr(i)), I get all zeros, but AscW(Chr(i) & Chr(0)) gives me the same results as simply AscW(Chr(i)).

Quote:
> Also, if you want to just get rid of all occurrences of chr(141), you
> might try the replace function:
> sNewText = Replace(sOldText, chr(141), "")

Thanks for the suggestion, but I can't just eliminate the Chr(141); I need to convert it to a Chr(13), and only in certain places. It's important that the 'period and two capitals' command be on a line by itself. That's the format which a database expects, which is one of the uses for these files. In other places the Chr(141) is a legitimate line continuation character and I don't want to destroy those. There are in fact long, continued lines of text in these files, as well as the incorrect ones I am trying to repair.

There are over two thousand of these files and more being created constantly. This troublesome set of about 160 was imported from somewhere, and somewhere in the conversion process, years ago, this substitution of 141 for 13 occurred. The editor (an ancient DOS format editor, written in the Czech Republic, formerly Czechoslovakia) is what's used to maintain these text files (library catalogs, essentially) and it doesn't care about 13 or 141. But subsequent software, recently developed, does care, and so we're trying to put the catalogs in order. There are many other errors in them as well; they are the result of many people's work over many years, none particularly well supervised. Some are amenable to mass mayhem conversions like this one, if I can ever get it to work properly. Others are individual errors that have to be corrected by hand, but naturally, I want to do it in code whenever possible.

I've included an example file as a zipped attachment in my reply to James Whitlow.

Pete |
| | #8 (permalink) |
| | Re: RegExp to find hex value 0D fails

Petr Danes wrote:
Quote:
> here is one file as an example. When I view it in Notepad on my system,
> the Chr(141) characters, which I want to change, show up as a capital T
> with a check mark over the top. But I have a Czech version of Windows, I
> don't know what it will do on your machine. In any case, it's always a
> set of period, two capital letters followed by the Chr(141), which I
> want to change to period, two capital letters followed by a normal
> Chr(13). A hex editor (I use NEO) will show you the contents exactly.
>
> Pete

Using this code:

    Dim oFS : Set oFS = CreateObject( "Scripting.FileSystemObject" )
    Dim sFSpecIn : sFSpecIn = ".\KYNA002.TXT"
    Dim sFSpecOut : sFSpecOut = ".\KYNA002-OUT.TXT"
    Dim oRE : Set oRE = New RegExp
    oRE.Global = True
    ' Group 1 is the command (period plus two capitals), group 2 the soft CR.
    oRE.Pattern = "(\.[A-Z]{2})(\x8D)"
    Dim sText : sText = oFS.OpenTextFile( sFSpecIn ).ReadAll
    Dim oMTS : Set oMTS = oRE.Execute( sText )
    Dim nCnt : nCnt = 0
    Dim oMT
    ' List every match: counter, offset (decimal and hex), hex dump, matched text.
    For Each oMT In oMTS
        WScript.Echo Join( Array( _
              Right( "    " & nCnt, 4 ) _
            , Right( "      " & oMT.firstIndex, 6 ) _
            , Right( "000000" & Hex( oMT.firstIndex ), 6 ) _
            , hexDump( oMT.firstIndex, sText ) _
            , oMT.Value ), "|" )
        nCnt = nCnt + 1
    Next
    WScript.Echo "found", nCnt, "matches for pattern", oRE.Pattern
    ' Keep the command, replace the soft CR with a normal CR.
    oFS.CreateTextFile( sFSpecOut, True ).Write oRE.Replace( sText, "$1" & vbCr )

    ' Returns seven characters as hex bytes. firstIndex is 0-based and Mid is
    ' 1-based, so the dump starts one character before the match (this would
    ' fail for a match at the very start of the text).
    Function hexDump( nStartPos, sText )
        Dim aRVal( 6 )
        Dim nPos : nPos = nStartPos
        Dim nIdx
        For nIdx = 0 To UBound( aRVal )
            aRVal( nIdx ) = Right( "0" & Hex( AscB( Mid( sText, nPos, 1 ) ) ), 2 )
            nPos = nPos + 1
        Next
        hexDump = Join( aRVal, " " )
    End Function

on your KYNA002.TXT file, I got

    === repl8D: replace 8D in text file ===========================================
       0|   291|000123|0A 2E 41 55 8D 0A 46|.AU?
       1|   306|000132|0A 2E 54 49 8D 0A 4C|.TI?
       2|   414|00019E|0A 2E 49 4D 8D 0A 41|.IM?
       3|   445|0001BD|0A 2E 50 4F 8D 0A 8D|.PO?
       4|   454|0001C6|0A 2E 53 49 8D 0A 4B|.SI?
       5|   496|0001F0|0A 2E 53 46 8D 0A 52|.SF?
       6|   510|0001FE|0A 2E 44 41 8D 0A 31|.DA?
    .....
    1023| 35719|008B87|0A 2E 54 49 8D 0A 47|.TI?
    1024| 35782|008BC6|0A 2E 49 4D 8D 0A 52|.IM?
    1025| 35801|008BD9|0A 2E 50 4F 8D 0A 8D|.PO?
    1026| 35810|008BE2|0A 2E 53 49 8D 0A 4B|.SI?
    1027| 35853|008C0D|0A 2E 53 46 8D 0A 52|.SF?
    1028| 35867|008C1B|0A 2E 44 41 8D 0A 31|.DA?
    found 1029 matches for pattern (\.[A-Z]{2})(\x8D)
    === repl8D: 0 done (00:00:00) =================================================

and:

    fc /b KYNA002.TXT KYNA002-OUT.TXT >tmp.txt

    Comparing files KYNA002.TXT and KYNA002-OUT.TXT
    00000126: 8D 0D
    00000135: 8D 0D
    000001A1: 8D 0D
    000001C0: 8D 0D
    000001C9: 8D 0D
    000001F3: 8D 0D
    00000201: 8D 0D
    .....
    00008B45: 8D 0D
    00008B81: 8D 0D
    00008B8A: 8D 0D
    00008BC9: 8D 0D
    00008BDC: 8D 0D
    00008BE5: 8D 0D
    00008C10: 8D 0D
    00008C1E: 8D 0D

If you think it worthwhile, try the code on your machine. |
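If the \x8D pattern finds nothing on a machine whose ANSI code page differs from the one used in the run above, one option worth trying is to read the file so that every byte maps to the character with the same code, which keeps byte 8D at code point 008D whatever the locale. The following is only a sketch under assumptions: it uses ADODB.Stream with the "iso-8859-1" charset, which as far as I know maps bytes one-to-one on Windows; the helper names ReadLatin1 and WriteLatin1 are made up for illustration, the file names are the ones from the example above, and it has not been run against the KYNA002.TXT file.

```vbscript
Option Explicit

Const adTypeText = 2             ' ADODB.Stream text mode
Const adSaveCreateOverWrite = 2  ' overwrite the output file if it exists

' Hypothetical helper: read a file so each byte 00..FF becomes the
' character with the same code (assumes iso-8859-1 is a 1:1 mapping).
Function ReadLatin1(sPath)
    Dim oStream : Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = adTypeText
    oStream.Charset = "iso-8859-1"
    oStream.Open
    oStream.LoadFromFile sPath
    ReadLatin1 = oStream.ReadText
    oStream.Close
End Function

' Hypothetical helper: write the text back with the same 1:1 mapping,
' so untouched characters return to their original byte values.
Sub WriteLatin1(sPath, sText)
    Dim oStream : Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = adTypeText
    oStream.Charset = "iso-8859-1"
    oStream.Open
    oStream.WriteText sText
    oStream.SaveToFile sPath, adSaveCreateOverWrite
    oStream.Close
End Sub

Dim oRE : Set oRE = New RegExp
oRE.Global = True
' Same pattern as in the thread: period, two capitals, soft CR.
oRE.Pattern = "(\.[A-Z]{2})\x8D"

Dim sText : sText = ReadLatin1(".\KYNA002.TXT")
WScript.Echo "matches:", oRE.Execute(sText).Count
WriteLatin1 ".\KYNA002-OUT.TXT", oRE.Replace(sText, "$1" & vbCr)
```

The point of reading and writing with the same charset is that all other high-bit characters in the catalog text pass through unchanged; only the soft CRs that follow a command are rewritten. Comparing input and output with fc /b, as above, would confirm that.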
| | #9 (permalink) |
| | Re: RegExp to find hex value 0D fails

"Petr Danes" <skru.spammers@xxxxxx> wrote in message news:OgjS61s4JHA.1416@xxxxxx
Quote:
> An ASCII code greater than 255 is impossible, surely? This must be an
> artefact of the ASCW function.
> [...]
> Thanks for the suggestion, but I can't just eliminate the Chr(141), I need
> to convert it to a Chr(13), and only in certain places.
> [...]
> I've included an example file as a zipped attachment in my reply to James
> Whitlow.

Hopefully ekkehard.horner's regular expression will work for you. It worked for me on a US English WXP-SP2. If it does not work, then perhaps you should try changing the locale to 1033 in his test script prior to the use of the regular expression. If you are still having problems, let us know what OS and service packs you are using, and whether the computer was purchased with Czech as its native language or whether a language pack or something similar was used to change to the language you are now using.

To see how different locales affect what you see, try the following script, which basically repeats the script I posted earlier in three locales:

    Locale(1033) 'US English
    Locale(1082) 'Maltese
    Locale(1029) 'Czech

Notice that for a number of characters, such as Chr(141), the AscW value for the US English locale is different from the AscW value for Czech. On my system, no character is displayed for Chr(141) in the US English locale, but it shows up as a capital T with a check mark over the top in the Czech locale.

    Dim i, sMsg, OldLocale

    OldLocale = SetLocale(1033) 'US English
    sMsg = "US English lcid = 1033"
    For i = 128 To 159
        sMsg = sMsg & vbCrLf & i & vbTab & Chr(i) & vbTab & _
            Asc(Chr(i)) & vbTab & AscW(Chr(i))
    Next
    MsgBox sMsg

    OldLocale = SetLocale(1082) 'Maltese
    sMsg = "Maltese lcid = 1082"
    For i = 128 To 159
        sMsg = sMsg & vbCrLf & i & vbTab & Chr(i) & vbTab & _
            Asc(Chr(i)) & vbTab & AscW(Chr(i))
    Next
    MsgBox sMsg

    OldLocale = SetLocale(1029) 'Czech
    sMsg = "Czech lcid = 1029"
    For i = 128 To 159
        sMsg = sMsg & vbCrLf & i & vbTab & Chr(i) & vbTab & _
            Asc(Chr(i)) & vbTab & AscW(Chr(i))
    Next
    MsgBox sMsg

-Paul Randall |
| | #10 (permalink) |
| | Re: RegExp to find hex value 0D fails

Hi, Petr

I'd like you to try one more thing: I believe that on your computer, if the script does not change the locale, when you create a character with Chr(141), you get a character whose AscW value is 356 (hex 164). In your locale, will your original regular expression find the character if you specify it as \u0164 (the \u escape takes exactly four hex digits, hence the leading zero)? Please test with the IgnoreCase property both true and false.

-Paul Randall |
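One way to try this suggestion without hard-coding a locale-specific value is to let the script work out which Unicode character Chr(141) becomes on the machine it runs on, and build the \u escape from that. This is a minimal sketch, assuming the file is read with ReadAll in its default ANSI mode as elsewhere in the thread and that Chr uses the same conversion; the helper name UEscape and the sample string are made up for illustration.

```vbscript
Option Explicit

' Hypothetical helper: returns a \uXXXX regex escape for the first
' character of sChar.
Function UEscape(sChar)
    Dim nCode : nCode = AscW(sChar)
    ' AscW can return a negative number for characters above 7FFF;
    ' fold those back into the range 0..65535.
    If nCode < 0 Then nCode = nCode + 65536
    UEscape = "\u" & Right("0000" & Hex(nCode), 4)
End Function

Dim oRE : Set oRE = New RegExp
oRE.Global = True
' Chr(141) goes through the current code page, so the escape follows
' whatever character the locale maps that byte to.
oRE.Pattern = "(\.[A-Z]{2})" & UEscape(Chr(141))

' Made-up sample: a command line ending in a soft CR, followed by a
' continued text line whose soft CR must be left alone.
Dim sText : sText = ".AU" & Chr(141) & vbLf & "Some text" & Chr(141) & vbLf
Dim sFixed : sFixed = oRE.Replace(sText, "$1" & vbCr)

WScript.Echo "pattern:", oRE.Pattern
WScript.Echo "matches:", oRE.Execute(sText).Count
```

If the belief above is right, the echoed pattern would end in \u0164 under the Czech locale and in a different escape under US English, which would also show directly why \x8D matches on one machine and not on the other.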