Python : How to parse the Body from a raw email , given that raw email does not have a "Body" tag or anything
To be highly positive you work with the actual email body (yet, still with the possibility you're not parsing the right part), you have to skip attachments, and focus on the plain or html part (depending on your needs) for further processing.
As the before-mentioned attachments can and very often are of text/plain or text/html part, this non-bullet-proof sample skips those by checking the content-disposition header:
b = email.message_from_string(a)
body = ""
if b.is_multipart():
for part in b.walk():
ctype = part.get_content_type()
cdispo = str(part.get('Content-Disposition'))
# skip any text/plain (txt) attachments
if ctype == 'text/plain' and 'attachment' not in cdispo:
body = part.get_payload(decode=True) # decode
break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
body = b.get_payload(decode=True)
BTW, walk()
iterates marvelously on mime parts, and get_payload(decode=True)
does the dirty work on decoding base64 etc. for you.
Some background - as I implied, the wonderful world of MIME emails presents a lot of pitfalls of "wrongly" finding the message body. In the simplest case it's in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world - it's often surrounded in multipart/alternative, related, mixed etc. content. Wikipedia describes it tightly - MIME, but considering all these cases below are valid - and common - one has to consider safety nets all around:
Very common - pretty much what you get in normal editor (Gmail,Outlook) sending formatted text with an attachment:
multipart/mixed
|
+- multipart/related
| |
| +- multipart/alternative
| | |
| | +- text/plain
| | +- text/html
| |
| +- image/png
|
+-- application/msexcel
Relatively simple - just alternative representation:
multipart/alternative
|
+- text/plain
+- text/html
For good or bad, this structure is also valid:
multipart/alternative
|
+- text/plain
+- multipart/related
|
+- text/html
+- image/jpeg
Hope this helps a bit.
P.S. My point is don't approach email lightly - it bites when you least expect it :)
Use Message.get_payload
b = email.message_from_string(a)
if b.is_multipart():
for payload in b.get_payload():
# if payload.is_multipart(): ...
print payload.get_payload()
else:
print b.get_payload()
There is very good package available to parse the email contents with proper documentation.
import mailparser
mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)
How to Use:
mail.attachments: list of all attachments
mail.body
mail.to
Python 3.6+ provides built-in convenience methods to find and decode the plain text body as in @Todor Minakov
's answer. You can use the EMailMessage.get_body()
and get_content()
methods:
msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body:
body = body.get_content()
print(body)
Note this will give None
if there is no (obvious) plain text body part.
If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage
factory:
mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
...
Note you must pass email.policy.default
as the policy, since it's not the default...