As reported in #618: mail files sometimes end up being recognized as either text/html or text/plain. This happens for example when ingesting .pst files: their outgoing mail messages don't have Received: headers but instead seem to start with a header Status: RO.
Please note that the root cause of this problem is using libmagic, which actually is a sort of we-don't-know-how-it-works-but-it-seems-to-work type of file type / mime-type detection. It can do wonders but it can also get things horribly wrong.
A proper fix would be to make use of the fact that readpst spits out its e-mails with a clear .eml file name extension, so we already know that they're message/rfc822. The ingestor should be told the mime-type of the resulting files instead of trying to re-detect it (and getting it wrong). But that's beyond scope here.
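To illustrate the idea of trusting the extension before falling back to content sniffing, here is a minimal sketch. It is not how ingestors currently works: `TRUSTED_EXTENSIONS`, `guess_mime_type` and the `sniff` callback (standing in for a libmagic lookup) are hypothetical names made up for this example.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical mapping: extensions whose type we trust because we
# produced the files ourselves (readpst -e writes mails as .eml).
TRUSTED_EXTENSIONS = {".eml": "message/rfc822"}


def guess_mime_type(path: str, sniff: Optional[Callable[[str], str]] = None) -> str:
    """Return a MIME type, preferring a trusted extension over sniffing."""
    suffix = Path(path).suffix.lower()
    if suffix in TRUSTED_EXTENSIONS:
        # We already know the type, so the content is never inspected.
        return TRUSTED_EXTENSIONS[suffix]
    if sniff is not None:
        # Assumed to wrap libmagic or any other content-based detector.
        return sniff(path)
    return "application/octet-stream"
```

With this shape, a message that starts with Status: RO would still come out as message/rfc822, because the detector is only consulted for files whose extension is not in the trusted mapping.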
Workaround
Hand importing PST archives works best as follows:
use readpst as used in ingestors/email/outlookpst.py, i.e. readpst -e -D -8 -cv
libmagic will detect message/rfc822 if a message begins with a Received header. This apparently doesn't need to be a proper RFC 2822 compliant header; just adding Received: from localhost (127.0.0.1) on top of the message will do.
Thus, a simple script to only fix the problematic messages could be:
find . -type f -name '*.eml' -print0 | xargs -0 file --mime-type | grep -v message/rfc822 | cut -f1 -d: | while IFS= read -r f; do sed -i '1iReceived: from localhost (127.0.0.1)' "$f"; done
Fix (Dirty)
But I'm actually thinking that a simpler fix would be to just add Received: from localhost (127.0.0.1) to every message, right after calling readpst: find . -type f -name '*.eml' -print0 | xargs -0 sed -i '1iReceived: from localhost (127.0.0.1)'
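For anyone who would rather do this from Python than from the shell, a rough sketch follows. It is a variant, not a transcription of the command above: instead of asking libmagic, it skips any file that already begins with a Received: header, so running it twice does not stack headers. The function names are hypothetical, and each file is read fully into memory, which is assumed to be acceptable for mail messages.

```python
from pathlib import Path

# Placeholder header from this issue; not a real trace of any mail server.
HEADER = b"Received: from localhost (127.0.0.1)\n"


def add_received_header(path: Path) -> bool:
    """Prepend HEADER unless the file already starts with a Received: header.

    Returns True if the file was modified.
    """
    data = path.read_bytes()
    if data[:9].lower() == b"received:":
        return False
    path.write_bytes(HEADER + data)
    return True


def fix_tree(root: Path) -> int:
    """Patch every regular *.eml file below root; return how many changed."""
    return sum(
        add_received_header(p) for p in sorted(root.rglob("*.eml")) if p.is_file()
    )
```

Calling `fix_tree(Path("some/readpst/output"))` right after readpst would cover the whole export; the path is only an example. Like the shell version, this touches anything named *.eml, including attachments that happen to carry that name.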
Please note that I do not know what happens if an Outlook / Exchange mailbox contains an actual attachment with the name 123.eml. Does readpst work around this? Does it overwrite the 123.eml mail message? The above script would surely "enhance" this e-mail attachment, too, even if it weren't an actual .eml file. But that's for another time.