RFC 2822 (and RFC 822) allow embedded comments, whitespace, and newlines within *some* parts of an email address, but the pattern above does not. See this thread for more info, including a version that does not use.
This accepts RFC 2822 email addresses in the form:
domain = rfc2821domain | rfc2821domain-literal
An RFC 2821 domain (except that the final sub-domain must consist of 2 or more letters only). (Note, no attempt is made to fully validate an IPv6 address-literal.)
This pattern uses (.NET/Perl only?) features named group "(?

I’ve been implementing some PHP/IMAP-based email handling functionality lately, and have most everything working great, except for message body decoding (in some circumstances). I think that, by now, I’ve half-memorized RFC 2822 (the ‘Internet Message Format’ document guidelines), read through email-handling code for half a dozen open source CMSes, and read a bajillion forum posts, blog posts, etc.

I’ve also forked and completely rewritten a class for PHP, Imap, and the class handles email respectably well. I have some helpful methods in there to detect autoresponders (for out of office, old addresses, etc.), decode base64 and 8bit messages, etc. However, the one thing I simply can’t get to work reliably (or, sometimes, at all) is when a message comes in with Content-Transfer-Encoding: 7bit.

It seems that different email clients/services interpret 7BIT to mean different things. I’ve gotten some emails that are supposedly 7BIT that are actually Base64-encoded. I’ve gotten some that are actually quoted-printable-encoded. And some that are not encoded in any way whatsoever. And some that are HTML, but aren’t indicated as being HTML, and they’re also listed as 7BIT…

Here are a few examples (snips) of message bodies received with 7Bit encodings:

2:
PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwi
IHhtbG5zOm89InVybjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9m

3:
tangerine apricot pepper.=0A=C2=A0=0ALet me know if you have any availabili=
_=0AFrom: Names Witheld =0ATo: Names Withheld=

These are all sent with ‘7Bit’ encodings (well, at least according to PHP/imap_*), but they’re obviously in need of more decoding before I can pass them along as plaintext.

Is there any way to reliably convert all messages with supposedly-7Bit encodings to plaintext?

After spending a bit more time, I decided to just write up some heuristic detection, as Max suggested in the comments on my original question. I’ve built a more robust decode7Bit() method in Imap.php, which goes through a bunch of common encoded characters (like =A0) and replaces them with their UTF-8 equivalents, and then also decodes messages if they look like they are base64-encoded:

    /**
     * PHP seems to think that most emails are 7BIT-encoded, therefore this
     * decoding method assumes that text passed through may actually be
     * base64-encoded, quoted-printable encoded, or just plain text. Instead
     * of passing the email directly through a particular decoding function,
     * this method runs through a bunch of common encoding schemes to try to
     * decode everything and simply end up with something *resembling* plain
     * text.
     *
     * Results are not guaranteed, but it's pretty good at what it does.
     */

    // If there are no spaces on the first line, assume that the body is
    // actually base64-encoded, and decode it.
    $first_line_words = explode(' ', $lines[0]);

    // Manually convert common encoded characters into their UTF-8 equivalents.

    // Loop through the encoded characters and replace any that are found.
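
The heuristic described in this post (treat a body with no spaces on its first line as base64, otherwise look for quoted-printable escapes, otherwise leave the text alone) could be sketched roughly as follows. This is not the decode7Bit() method from Imap.php. The function name `decode7BitSketch` is hypothetical. It uses PHP's built-in `quoted_printable_decode()` in place of the manual character replacement table the post mentions. It also accepts a base64 result only if it is valid UTF-8, which is an assumption added here to reduce false positives. Like the original, it gives no guarantees.

```php
<?php

/**
 * Heuristically decode a message body that was labelled 7bit.
 * Hypothetical helper for illustration only.
 */
function decode7BitSketch(string $text): string
{
    // Split on any newline style and look only at the first line.
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    $firstLine = $lines[0];

    // No whitespace on the first line: the body may really be base64.
    if ($firstLine !== '' && !preg_match('/\s/', $firstLine)) {
        // Strip line breaks, then decode in strict mode so input with
        // characters outside the base64 alphabet is rejected.
        $compact = (string) preg_replace('/\s+/', '', $text);
        $decoded = base64_decode($compact, true);

        // Assumption: only accept the result if it is valid UTF-8.
        if ($decoded !== false && $decoded !== '' && preg_match('//u', $decoded) === 1) {
            return $decoded;
        }
    }

    // "=XX" hex escapes or a trailing "=" soft line break suggest
    // quoted-printable content.
    if (preg_match('/=[0-9A-F]{2}|=\r?\n/', $text)) {
        return quoted_printable_decode($text);
    }

    // Otherwise assume the body is already plain text.
    return $text;
}
```

Calling `decode7BitSketch($body)` on each of the sample bodies above should decode the base64 snippet to HTML and the quoted-printable snippet to text with real line breaks. A one-word plain-text body, or plain text that happens to contain a sequence like `=20`, can still be misdetected, so it is worth logging the original alongside the decoded result.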