| | #1 (permalink) |
| | Parsing MIME mail headers
I have thousands of .EMLX files from a client. I need to parse the To, From, Subject, Date, and CC fields from the headers. My code runs and it works, but I can't help thinking it isn't as efficient as it could be.
I'm reading the files, which are plain text, line by line and looking for the field names. The To and CC fields are particularly tricky because they can span multiple lines. When I hit one of these fields, I start a sub-loop that captures lines until it finds another field, which seems to work okay.
I wrote practically identical versions of this code in VBA and VBS to store the data in an Access DB. Does anyone know of a DLL, COM object, or .NET object that can do this more efficiently? I have no problem rewriting my code if necessary. I'm going to be getting a hard drive full of these soon, so I'd like to speed up my code as much as I can.
I also need to extract attachments. I have my Base64 decoder working using MSXML2, but I'm having trouble identifying the beginning and end of the attachment. |
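On the "is there a library for this" question, one option, assuming a rewrite outside VBA/VBS is acceptable, is the `email` package in Python's standard library, which parses the header block and unfolds multi-line To and CC values itself. The sketch below is not from the thread: `RAW`, `FIELDS`, and `extract_headers` are made-up names, and the sample message is invented for illustration.

```python
from email import policy
from email.parser import BytesHeaderParser

# Invented sample message; note the To field folded onto a second line.
RAW = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com,\r\n"
    b"\tcarol@example.com\r\n"
    b"CC: dave@example.com\r\n"
    b"Date: Tue, 4 Mar 2008 17:25:43 -0600\r\n"
    b"Subject: Re: something\r\n"
    b"\r\n"
    b"Body text is not needed for header extraction.\r\n"
)

# The five fields the original post asks for.
FIELDS = ("To", "From", "Subject", "Date", "CC")


def extract_headers(raw_bytes):
    """Return a dict of the wanted fields; a missing field maps to None."""
    # BytesHeaderParser stops after the header block, so large bodies
    # and attachments are not parsed.
    msg = BytesHeaderParser(policy=policy.default).parsebytes(raw_bytes)
    result = {}
    for name in FIELDS:
        value = msg[name]  # lookup is case-insensitive ("CC" matches "Cc")
        result[name] = str(value) if value is not None else None
    return result


if __name__ == "__main__":
    for field, value in extract_headers(RAW).items():
        print(f"{field}: {value}")
```

For a drive full of files, the same function could be called on the bytes of each file in turn. Whether it beats the existing VBA loop would have to be measured.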
| | #2 (permalink) |
| | Re: Parsing MIME mail headers
Hi Mike,
Mike wrote:
> I'm reading the files, which are plain text, and line-by-line looking for
> the field names.
http://msdn.microsoft.com/en-us/libr...01(VS.85).aspx
http://msdn.microsoft.com/en-us/libr...ffice.10).aspx
Kind regards,
--> stefan <-- |
| | #3 (permalink) |
| | Re: Parsing MIME mail headers
These are not XML, and perhaps they're not MIME, but they are standard internet mail files. Here are some of the headers.
Received: <Removed for privacy> 12 Feb 1998 13:05:31 -0000
Received: from unknown <Removed for privacy> <Removed for privacy> by <Removed for privacy> with SMTP for <Removed for privacy> 12 Feb 1998 13:05:31 -0000
Received: <Removed for privacy>; 12 Feb 1998 13:05:31 -0000
Received: from <Removed for privacy> (envelope-sender <Removed for privacy>) by <Removed for privacy> with SMTP for <Removed for privacy>; 12 Feb 1998 13:05:30 -0000
Received: from unknown [<Removed for privacy>] (<Removed for privacy>) by <Removed for privacy> (mxl_mta-5.4.0-1) with ESMTP id <Removed for privacy>(envelope-from <Removed for privacy>); Tue, 12 Feb 1998 06:05:30 -0700 (MST)
Received: from unknown [<Removed for privacy>] (EHLO<Removed for privacy>) by <Removed for privacy>(mxl_mta-5.4.0-1) over TLS secured channel with ESMTP id <Removed for privacy>(envelope-from <Removed for privacy>); Tue, 12 Feb 1998 06:05:28 -0700 (MST)
Received: from <Removed for privacy> ([<Removed for privacy>]) by <Removed for privacy> ([<Removed for privacy>]) with mapi; Tue, 12 Feb 1998 08:00:12 -0500
From: <Removed for privacy>
To: <Removed for privacy>
CC: <Removed for privacy>, <Removed for privacy>, <Removed for privacy>, <Removed for privacy>, <Removed for privacy>, <Removed for privacy>
Date: Tue, 12 Feb 1998 08:00:11 -0500
Subject: Re:something, something
Thread-Topic: <Removed for privacy>
Thread-Index: <Removed for privacy>
Message-ID: <Removed for privacy>
Accept-Language: en-US
Content-Language: en-US
"Stefan Hoffmann" <stefan.hoffmann@xxxxxx> wrote in message news:eEv9mj0FJHA.5224@xxxxxx
> hi Mike,
>
> Mike wrote:
>> I'm reading the files, which are plain text, and line-by-line looking for
>> the field names.
>
> http://msdn.microsoft.com/en-us/libr...01(VS.85).aspx
> http://msdn.microsoft.com/en-us/libr...ffice.10).aspx
>
> mfG
> --> stefan <-- |
| | #4 (permalink) |
| | Re: Parsing MIME mail headers
"Mike" <nospam@xxxxxx> wrote in message news:eAWuvN0FJHA.1000@xxxxxx
> I have thousands of .EMLX files from a client. I need to parse the To, From, Subject, Date, and CC fields from the headers. I've got my code running, and it works. I just can't help but think it isn't as efficient as it could be.
>
> I'm reading the files, which are plain text, and line-by-line looking for the field names. The To and CC fields are particularly tricky because they are multi-line (or can be). When I hit one of these fields, I start a sub loop that captures the lines until it finds another field, which seems to work okay.
>
> I wrote versions of this code, which are practically identical, in VBA and VBS to store this data in an access DB. Does anyone know a dll, COM object, or .Net object that can do this more efficiently? I have no problem re-writing my code if necessary. I'm going to be getting a hard drive full of these soon, so I'd like to speed up my code as much as I can.
>
> I also need to extract attachments. I have my Base64 decoder working using MSXML2, but I'm having trouble identifying the beginning and end of the attachment.
Regular expressions work quite well for what you are wanting to do. See below for a small example.
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oRegEx = CreateObject("VBScript.RegExp")
' Multiline makes ^ match at the start of every line, not just the start of the string.
oRegEx.Multiline = True
' Read the whole message into one string (1 = ForReading).
sEmail = oFSO.OpenTextFile("Email.txt", 1).ReadAll
' Capture everything after "To:" up to the next line that contains a colon (normally the next field name).
oRegEx.Pattern = "^To:([\x00-\xff]*?[\n\r\f]*?)[\n\r\f]*?.*?:"
sTo = oRegEx.Execute(sEmail)(0).Submatches(0)
oRegEx.Pattern = "^CC:([\x00-\xff]*?[\n\r\f]*?)[\n\r\f]*?.*?:"
sCC = oRegEx.Execute(sEmail)(0).Submatches(0)
MsgBox "To: " & sTo & vbCr & "CC:" & sCC |
| | #5 (permalink) |
| | Re: Parsing MIME mail headers
Hi Mike,
Mike wrote:
> This is not XML, perhaps they're not MIME, but they are standard internet
> mail files. Here are some of the headers.
where XML files...
Kind regards,
--> stefan <-- |
| | #6 (permalink) |
| | Re: Parsing MIME mail headers
Looking in the wrong places - asking how not to use VBA or VBS in VBA and VBS groups :~)
Actually, file access is about the same speed in any environment, so it's not going to make any difference how you code it.
Avoid string concatenation, because the fully managed string class in VBA and VBS does concatenation a lot slower than a C string or a TP string. Also, VBS is unable to do string folding or optimise out constant values. But that's unlikely to make any difference in a file-to-file filter application.
Having said that, on my PC, VBA is faster than .NET, but I'm sure that's all just overhead: .NET is probably faster for some complex thing on some better computer.
For the Access part of the loop, use bound variables rather than field collection members.
Post your code for suggestions.
(david)
"Mike" <nospam@xxxxxx> wrote in message news:eAWuvN0FJHA.1000@xxxxxx
> I have thousands of .EMLX files from a client. I need to parse the To, From, Subject, Date, and CC fields from the headers. I've got my code running, and it works. I just can't help but think it isn't as efficient as it could be.
>
> I'm reading the files, which are plain text, and line-by-line looking for the field names. The To and CC fields are particularly tricky because they are multi-line (or can be). When I hit one of these fields, I start a sub loop that captures the lines until it finds another field, which seems to work okay.
>
> I wrote versions of this code, which are practically identical, in VBA and VBS to store this data in an access DB. Does anyone know a dll, COM object, or .Net object that can do this more efficiently? I have no problem re-writing my code if necessary. I'm going to be getting a hard drive full of these soon, so I'd like to speed up my code as much as I can.
>
> I also need to extract attachments. I have my Base64 decoder working using MSXML2, but I'm having trouble identifying the beginning and end of the attachment. |
| | #7 (permalink) |
| | Re: Parsing MIME mail headers
Much of the email format is laid out with blank lines and "boundary markers". There's usually a blank line between parts of the message. (When you look at the raw code, the blank lines all serve a purpose. They're not just for readability.) A boundary marker can be any string, with certain limitations, but most email programs go overboard and create them from something like a GUID + computer name, so they're very recognizable in the email body.
The details of MIME format are available, but they exist in excessively official, absurdly abstruse, nearly unreadable technical documents. If you want to check that out, search for:
RFC2045
RFC2046
RFC 822
It's hard to find more readable documentation because few people deal with MIME format directly. Usually when programmers want to send email, they're using a component or automating an email program that does the formatting internally.
This might be somewhat helpful:
www.jsware.net/jsware/vbcode.php5#mail
It's VB code for sending email with no dependencies. I know that's not what you need, but since the code has to do the whole job of composing the actual email in this case, I needed to figure out the format of email messages. After having done so, I included an explanatory file named MIME format.txt in the download. That file outlines the basic MIME structure. After days poring over those abominable RFC files, I figured that I should do what I could to save others from the same horrible fate in the future. |
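To make the boundary idea concrete, here is a rough Python sketch (not the VB code mentioned above) that lets the standard `email` package find the boundary markers and decode a Base64 attachment. The message in `RAW`, the boundary string `XYZ-boundary`, and the function name `extract_attachments` are all invented for illustration. Real mail is messier, for example attachments without a Content-Disposition header, which this sketch would miss.

```python
from email import policy
from email.parser import BytesParser

# Invented two-part message: a text body and one Base64 attachment.
# Each part starts after a "--<boundary>" line; the part's own headers
# are separated from its content by a blank line.
RAW = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: report\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ-boundary"\r\n'
    b"\r\n"
    b"--XYZ-boundary\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--XYZ-boundary\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b'Content-Disposition: attachment; filename="hello.txt"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"SGVsbG8sIHdvcmxkIQ==\r\n"
    b"--XYZ-boundary--\r\n"
)


def extract_attachments(raw_bytes):
    """Return a list of (filename, decoded_bytes) for each attachment part."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    found = []
    # walk() visits the container and every nested part in order.
    for part in msg.walk():
        if part.get_content_disposition() == "attachment":
            # decode=True undoes the Content-Transfer-Encoding (Base64 here).
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found


if __name__ == "__main__":
    for name, data in extract_attachments(RAW):
        print(name, len(data), "bytes")
```

The closing marker is the boundary with two extra hyphens appended (`--XYZ-boundary--`), which is how the end of the last part is recognized.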
| | #8 (permalink) |
| | Re: Parsing MIME mail headers
Thanks for the code!! The To: field extracts perfectly! My only issue is with this line:
sCC = oRegEx.Execute(sEmail)(0).Submatches(0)
I changed it to:
If oRegEx.Test(sEmail) = True Then sCC = oRegEx.Execute(sEmail)(0).Submatches(0)
The CC: field is not a required field. Actually, an email needs only one of the To:, CC:, or BCC: fields to be compliant. |
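The same "test before you index" guard can be sketched in Python's `re` module, here with a pattern that follows folded lines explicitly (a continuation line starts with a space or tab). This is an illustration rather than a port of the VBScript above: `HEADERS` and `header_value` are made-up names, and the text passed in is assumed to be only the header block, since a body line starting with `To:` would otherwise match as well.

```python
import re

# Invented header block: a folded To field and no CC field at all.
HEADERS = (
    "From: alice@example.com\r\n"
    "To: bob@example.com,\r\n"
    " carol@example.com\r\n"
    "Subject: no CC on this one\r\n"
)


def header_value(name, text):
    """Return the unfolded value of header `name`, or None if it is absent."""
    pattern = re.compile(
        # Field name at the start of a line, then the rest of that line,
        # then any continuation lines that begin with a space or tab.
        r"^" + re.escape(name) + r":[ \t]*(.*(?:\r?\n[ \t]+.*)*)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    if match is None:
        # Optional field not present: report that instead of raising.
        return None
    # Unfold: replace each line break plus its leading whitespace with one space.
    return re.sub(r"\r?\n[ \t]+", " ", match.group(1)).strip()


if __name__ == "__main__":
    print(header_value("To", HEADERS))
    print(header_value("CC", HEADERS))
```

Checking the result of a single search avoids running the pattern twice, which is what a `Test` call followed by `Execute` does.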
| | #9 (permalink) |
| | Re: Parsing MIME mail headers
OK, I'm REALLY new to RegEx. I have this field in an email:
Date: Tue, 4 Mar 2008 17:25:43 -0600
I have this code:
oRegEx.Pattern = "^Date:( *\s*(Sun|Mon|Tue|Wed|Thu|Fri|Sat),\s*)?(0?[1-9]|[1-2][0-9]|3[01])\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(19[0-9]{2}|[2-9][0-9]{3}|[0-9]{2})\s+(2[0-3]|[0-1][0-9]):([0-5][0-9])(?::(60|[0-5][0-9]))?\s+([-\+][0-9]{2}[0-5][0-9]|(?:UT|GMT|(?:E|C|M|P)(?:ST|DT)|[A-IK-Z]))(\s*\((\\\(|\\\)|(?<=[^\\])\((?<C>)|(?<=[^\\])\)(?<-C>)|[^\(\)]*)*(?(C)(?!))\))*\s*$"
If oRegEx.Test(sEmail) = True Then sDT = oRegEx.Execute(sEmail)(0).Submatches(0)
I used the tester at http://regexlib.com/RETester.aspx?regexp_id=969 to test this pattern, and it works there. My code fails on the If test portion with:
"Application-defined or Object-defined error"
Where did I go wrong? |
| | #10 (permalink) |
| | Re: Parsing MIME mail headers
"krazymike" <krazymike@xxxxxx> wrote in message news:5154613a-f4b0-4b9c-a94a-5da93deeb917@xxxxxx
> Ok, I'm REALLY new to RegEx. I have this Field in an email:
> Date: Tue, 4 Mar 2008 17:25:43 -0600
>
> I have this code:
> oRegEx.Pattern = "^Date:( *\s*(Sun|Mon|Tue|Wed|Thu|Fri|Sat),\s*)?(0?[1-9]|[1-2][0-9]|3[01])\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(19[0-9]{2}|[2-9][0-9]{3}|[0-9]{2})\s+(2[0-3]|[0-1][0-9]):([0-5][0-9])(?::(60|[0-5][0-9]))?\s+([-\+][0-9]{2}[0-5][0-9]|(?:UT|GMT|(?:E|C|M|P)(?:ST|DT)|[A-IK-Z]))(\s*\((\\\(|\\\)|(?<=[^\\])\((?<C>)|(?<=[^\\])\)(?<-C>)|[^\(\)]*)*(?(C)(?!))\))*\s*$"
> If oRegEx.Test(sEmail) = True Then sDT = oRegEx.Execute(sEmail)(0).Submatches(0)
>
> I used the tester on http://regexlib.com/RETester.aspx?regexp_id=969
> to test this pattern and it works. My code fails on the If test
> portion.
> "Application-defined or Object-defined error"
>
> Where did I go wrong?
Regular Expression Workbench can give a somewhat English interpretation of how the dot net engine would see a regular expression. Here is what it says about yours:
^ (anchor to start of string)
Date:
Capture
  * (zero or more times)
  Any whitespace character * (zero or more times)
  Capture
    Sun or Mon or Tue or Wed or Thu or Fri or Sat
  End Capture
  ,
  Any whitespace character * (zero or more times)
End Capture
? (zero or one time)
Capture
  0 ? (zero or one time)
  Any character in "1-9"
  or
  Any character in "1-2"
  Any character in "0-9"
  or
  3
  Any character in "01"
End Capture
Any whitespace character + (one or more times)
Capture
  Jan or Feb or Mar or Apr or May or Jun or Jul or Aug or Sep or Oct or Nov or Dec
End Capture
Any whitespace character + (one or more times)
Capture
  19
  Any character in "0-9" Exactly 2 times
  or
  Any character in "2-9"
  Any character in "0-9" Exactly 3 times
  or
  Any character in "0-9" Exactly 2 times
End Capture
Any whitespace character + (one or more times)
Capture
  2
  Any character in "0-3"
  or
  Any character in "0-1"
  Any character in "0-9"
End Capture
:
Capture
  Any character in "0-5"
  Any character in "0-9"
End Capture
Non-capturing Group
  :
  Capture
    60
    or
    Any character in "0-5"
    Any character in "0-9"
  End Capture
End Capture
? (zero or one time)
Any whitespace character + (one or more times)
Capture
  Any character in "-\+"
  Any character in "0-9" Exactly 2 times
  Any character in "0-5"
  Any character in "0-9"
  or
  Non-capturing Group
    UT or GMT or
    Non-capturing Group
      E or C or M or P
    End Capture
    Non-capturing Group
      ST or DT
    End Capture
    or
    Any character in "A-IK-Z"
  End Capture
End Capture
Capture
  Any whitespace character * (zero or more times)
  (
  Capture
    \( or \) or
    zero-width positive lookbehind
      Any character not in "\\"
    End Capture
    (
    Capture to <C>
    End Capture
    or
    zero-width positive lookbehind
      Any character not in "\\"
    End Capture
    )
    Capture
      ? (zero or one time)
      <-C>
    End Capture
    or
    Any character not in "\(\)" * (zero or more times)
  End Capture
  * (zero or more times)
  Conditional Subexpression
    if: C
    match:
    zero-width negative lookahead
    End Capture
  End Capture
  )
I'm no regex expert, but perhaps someone else can identify and comment on anything that VBScript's regular expression engine can't handle.
-Paul Randall |
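A hedged note on the pattern above: it uses lookbehind `(?<=...)`, a named group `(?<C>)`, a balancing group `(?<-C>)`, and a conditional `(?(C)...)`. These are .NET regex features, and as far as I know the VBScript RegExp object supports none of them, which would be consistent with the pattern passing a .NET-based tester and then failing in VBA. If the goal is just to read the Date field, a dedicated date parser avoids the giant pattern entirely. The Python sketch below is an assumption-laden illustration, not a fix for the VBA code; `parse_date_header` is a made-up helper name.

```python
from email.utils import parsedate_to_datetime


def parse_date_header(value):
    """Parse an RFC 2822 style date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input,
        # newer ones raise ValueError.
        return None


if __name__ == "__main__":
    # The Date value from the question, without the "Date:" field name.
    stamp = parse_date_header("Tue, 4 Mar 2008 17:25:43 -0600")
    print(stamp.isoformat())
    print(stamp.utcoffset())
```

The returned value is a timezone-aware `datetime`, so the -0600 offset is kept and the timestamp can be converted to UTC before it is stored.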