libratom/email-processing-resources

email-processing-resources

Reference list of email processing resources; focus on analysis, PII handling, preservation, and access.

Tools and Libraries

Apache Pony Mail: Apache Pony Mail is a web-based mail archive browser built to scale to millions of archived messages with hundreds of requests per second. It allows you to browse, search, and interact with mailing lists including creating replies to mailing list threads.
https://ponymail.incubator.apache.org/

ePADD: ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.
https://github.com/ePADD/epadd

Email4n6: A simple cross-platform forensic application for processing email files
https://github.com/Marten4n6/Email4n6

imapfw: imapfw is a simple and powerful framework to work with mails.
https://github.com/OfflineIMAP/imapfw

libpff: Library and tools to access the Personal Folder File (PFF) and the Offline Folder File (OFF) format
https://github.com/libyal/libpff

libpst: Library for reading Microsoft Outlook PST files
http://hg.five-ten-sg.com/libpst/

maildir2mbox.py: Convert maildirs (including subfolders) to mbox format
https://gist.github.com/nyergler/1709069

mbox: Package mbox parses the mbox file format into messages and formats messages into mbox files.
https://github.com/blabber/mbox

mstor: A JavaMail provider supporting the unofficial mbox mail storage format
https://github.com/benfortuna/mstor

Muse: Revive Precious Memories Using Email
https://mobisocial.stanford.edu/muse/

OfflineIMAP: Read/sync your IMAP mailboxes
https://github.com/OfflineIMAP/offlineimap

DArcMail: Digital Archiving of eMail
CERP descriptive link: https://siarchives.si.edu/what-we-do/digital-curation/email-preservation-cerp
Direct download link: https://siarchives.si.edu/sites/default/files/DArcMail/DArcMail-v1.2-2018-03-07.zip

TOMES: Transforming Online Mail with Embedded Semantics
https://github.com/StateArchivesOfNorthCarolina?utf8=%E2%9C%93&q=tomes&type=public&language=

Avocado Research Email Collection
https://catalog.ldc.upenn.edu/LDC2015T03
https://github.com/ic4f/pluto

PST Indexer using libpff (simple example from Learning Python for Forensics)
https://github.com/PacktPublishing/Learning-Python-for-Forensics/blob/master/Chapter%2010/pst_indexer.py

Forensic Email Visualization
https://www.cs1.tf.fau.de/research/archive/forensic-email-visualization/

Sotera Newman: Email analysis and visualization
https://github.com/Sotera/newman

Node.js PST tool
https://github.com/epfromer/pst-extractor
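
Several of the tools above convert between the maildir and mbox storage formats. As a rough illustration of what such a conversion involves, here is a minimal sketch using only Python's standard-library `mailbox` module. It is not the code from maildir2mbox.py or any other listed project; the function name `maildir_to_mbox` and the helper `_walk_folders` are hypothetical, and the sketch assumes Maildir++-style subfolders and does no file locking.

```python
import mailbox


def _walk_folders(box):
    """Yield a Maildir and, recursively, each of its subfolders."""
    yield box
    for name in box.list_folders():
        yield from _walk_folders(box.get_folder(name))


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message under maildir_path into a single mbox file.

    Returns the number of messages written. Folder names are not
    preserved; all messages end up in one flat mbox. No locking is
    done, so do not run this against a mailbox that is in active use.
    """
    source = mailbox.Maildir(str(maildir_path), create=False)
    dest = mailbox.mbox(str(mbox_path))
    count = 0
    try:
        for folder in _walk_folders(source):
            for message in folder:
                # mboxMessage() carries over maildir flags where the
                # two formats have an equivalent.
                dest.add(mailbox.mboxMessage(message))
                count += 1
        dest.flush()
    finally:
        dest.close()
    return count
```

Calling `maildir_to_mbox("/path/to/Maildir", "/path/to/archive.mbox")` with real paths would append the converted messages to the target file. For preservation work, checking the returned count against the source is a sensible first sanity check.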

Datasets and Dataset Annotations

Apache Software Foundation Public Mail Archives
https://aws.amazon.com/datasets/apache-software-foundation-public-mail-archives/

Email Research Data Sets
https://sites.google.com/site/emailresearchorg/datasets

Enron / CALO project
https://www.cs.cmu.edu/~./enron/

Enron / Nuix set v1.3
http://info.nuix.com/EnronDownload2013.html

Jeb Bush's Gubernatorial Email Archive
https://ab21www.s3.amazonaws.com/JebBushEmails-Text.7z

Labeled training and test data for email intent machine learning (for sentence-level speech acts)
https://github.com/ParakweetLabs/EmailIntentDataSet

Guides and Documentation

[MS-PST]: Outlook Personal Folders (.pst) File Format
https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-pst

[pstviewtool]: Microsoft's open source tool for viewing PST structure
https://archive.codeplex.com/?p=pstviewtool

[pstsdk]: Microsoft's cross platform header only C++ library for reading PST files
https://archive.codeplex.com/?p=pstsdk

How MAPI tables work
http://www.dimastr.com/redemption/mapitable.htm

Digital Preservation Coalition (Portal link for articles on email)
https://www.dpconline.org/knowledge-base/preservation-lifecycle/email

Strategies for Preserving Institutional and Researcher Email
https://www.cni.org/wp-content/uploads/2018/09/CNI-email-preservation-ERreport-Spring18.pdf

The Future of Email Archives: A Report from the Task Force on Technical Approaches for Email Archives
https://www.clir.org/pubs/reports/pub175/

Office 365: PII Guidelines ("What the sensitive information types look for")
https://docs.microsoft.com/en-us/office365/securitycompliance/what-the-sensitive-information-types-look-for

Office 365: Overview of retention policies
https://docs.microsoft.com/en-us/office365/securitycompliance/retention-policies

DArcMail Users Guide
https://siarchives.si.edu/sites/default/files/forum-pdfs/SIA_DArcMail_UsersGuide.pdf

Reading List

A Forensic Email Analysis Tool Using Dynamic Visualization
https://commons.erau.edu/jdfsl/vol12/iss1/6/

A Comprehensive Gold Standard for the Enron Organizational Hierarchy
http://www.aclweb.org/anthology/P12-2032

Machine Learning for email insight
https://towardsdatascience.com/how-i-used-machine-learning-to-classify-emails-and-turn-them-into-insights-efed37c1e66

Network Analysis with the Enron Email Corpus
https://arxiv.org/pdf/1410.2759.pdf

Work Hard, Play Hard: Email Classification on the Avocado and Enron Corpora
https://pdfs.semanticscholar.org/d103/24c0a31845cb29e6d0157b60fb1130f89624.pdf

A Content-Based Approach to Email Triage Action Prediction: Exploration and Evaluation
https://www.groundai.com/project/a-content-based-approach-to-email-triage-action-prediction-exploration-and-evaluation/
