METHOD FOR GENERATING EMAIL TIMELINES - HER MAJESTY THE QUEEN IN RIGHT OF CANADA AS REPRESENTED BY THE MINI OF THE DEPT OF NAT DEFENCE

Title:

METHOD FOR GENERATING EMAIL TIMELINES

Document Type and Number:

WIPO Patent Application WO/2016/145508

Abstract:

Systems and methods for the generation of timelines for emails found on a computer. The files relating to emails on a computer are first located on the machine. These files are then analyzed and each email is extracted into a separate file. Each separate email file is further analyzed and a unique hash value for that separate email is generated. Data relating to that separate email is then retrieved and saved in a timeline file along with the unique hash value. Such data may include the email recipient, the email sender, a routing return path for the email, the file name for the separate email file, the file size for the separate email file, and the directory or path for that separate email file.

Inventors:

CARBONE RICHARD R (CA)

Application Number:

PCT/CA2015/050193

Publication Date:

September 22, 2016

Filing Date:

March 16, 2015

Assignee:

HER MAJESTY THE QUEEN IN RIGHT OF CANADA AS REPRESENTED BY THE MINI OF THE DEPT OF NAT DEFENCE (CA)

International Classes:

G06F17/00; G06F5/00; G06Q50/10

Foreign References:

US20060101285A1

2006-05-11

Other References:

NEREK NEWTON: "Information security insights and other ramblings - Searching and extracting data from PST files", 27 February 2011 (2011-02-27), Retrieved from the Internet [retrieved on 20150707]
SLEUTHKITWIKI: "Body file", 27 April 2009 (2009-04-27), Retrieved from the Internet
DAVID NIDES: "General Forensic Analysis checklist v.1.1", 16 December 2011 (2011-12-16), Retrieved from the Internet [retrieved on 20150707]

Attorney, Agent or Firm:

BRION RAFFOUL (Ottawa, Ontario K1L 7N6, CA)

Claims:

We claim

1. A method for generating a timeline for emails stored on a computer, the method comprising: a) locating files containing said emails on said computer; b) extracting emails from said files and storing each of said emails into a separate file to result in multiple separate email files; c) determining a unique identifier for each of said emails; d) gathering data about each separate email from said multiple separate email files and saving said data into a timeline file; wherein said timeline file associates each separate email with its unique identifier determined in step c).

2. A method according to claim 1 wherein, prior to step a), an image of a disk containing data for said computer is created.

3. A method according to claim 1 wherein step c) includes applying a hash function to an email and using a resulting hash value as said unique identifier for said email.

4. A method according to claim 1 wherein said data gathered in step d) includes sender information and addressee information for each of said emails.

5. A method according to claim 1 wherein said data gathered in step d) includes data in a subject field in said emails.

6. A method according to claim 1 further including a step of converting dates for said emails into a single consistent date format.

7. A method according to claim 1 further including a step of converting time references in said email into a single consistent time format.

8. A method according to claim 7 wherein only time references in headers of said emails are converted.

9. A method according to claim 6 wherein only dates in headers of said emails are converted.

10. A method according to claim 1 further including generating at least one other file containing data from headers of said emails, said data from headers being at least one of: cc data, recipient data, sender data, email identifier data, and subject data.

11. A method according to claim 1 wherein said timeline file contains separate entries for each of said emails, each entry having data relating to at least one of:

- an addressee of said email;

- a sender of said email;

- a preceding email to which said email is a reply to;

- a subject of said email;

- a date of said email;

- a status of said email;

- a message ID of said email;

- a generated hash value for said email;

- a routing return path for said email;

- a file name for a separate email file for said email;

- a file size for said separate email file for said email;

- a directory name for said separate email file for said email; and

- attachments for said email.

12. A method according to claim 11 wherein said timeline file separates data for each entry using the characters I |-| I .

13. A method according to claim 1 wherein said method generates at least one output file containing separate entries for each of said emails, each entry having data relating to at least one of:

- an addressee of said email;

- a sender of said email;

- a preceding email to which said email is a reply to;

- a subject of said email;

- a date of said email;

- a status of said email;

- a message ID of said email;

- a generated hash value for said email;

- a routing return path for said email;

- a file name for a separate email file for said email;

- a file size for said separate email file for said email;

- a directory name for said separate email file for said email; and

- attachments for said email.

14. Computer readable media having encoded thereon computer readable and computer executable instructions which, when executed, implements a method for generating a timeline for emails stored on a computer, the method comprising: a) locating files containing said emails on said

Description:

METHOD FOR GENERATING EMAIL TIMELINES

A portion of the disclosure of this document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the document or the disclosure, as it appears in a relevant patent office application or patent file or records, but otherwise reserves all copyright rights whatsoever. All code sections in this document are written and coded by the inventor. All code in this document is governed by Copyright © 2013, Her Majesty the Queen in Right of Canada, as Represented by the Minister of National Defence.

TECHNICAL FIELD

[0001] The present invention relates to the forensic examination of computers. More specifically, the present invention relates to tools, methods, and systems for generating a timeline for emails stored on a computer.

BACKGROUND OF THE INVENTION

[0002] The computer and communications revolution of the late 20th and early 21st century has changed how people communicate and interact with one another. Since email is now ubiquitous, it has become one of the main communication tools for people and, as such, it can be the source of important information when it comes to litigation, criminal investigations, and the like.

[0003] Because of the importance of email, investigators would be greatly served if they could be provided with a timeline, a listing, or an investigative tool which allows for an easier view of emails discovered on a computer. The problem is that while various file system-based timeline solutions exist, none has placed any emphasis or focus on the technical aspect of e-mail based timeline analysis. Moreover, no single e-mail specific timeline analysis format has been found in the publicly available literature. Thus, there is currently no open e-mail timeline format on which investigators can agree.

[0004] This capability is urgently needed as more and more criminal investigations and civil litigations are relying on highly complex chains of events that can be readily corroborated through e-mail timeline analysis, as an individual's emails contain a large part of said individual's online presence. The use of digital forensic timelines can be of great assistance to forensic investigators and to legal counsel in this situation.

[0005] For the purposes of this document, an email forensics timeline is a representation of the information commonly contained within a computer disk-based file system in email files (e.g. .pst, .ost). The timeline's objective is to represent all the various information that can be obtained from emails contained in these files and their attachments. What differentiates an email timeline from regular email listings is that the data generated for a timeline is specific to the date/time of the objects contained therein.

[0006] Currently, there is no known publicly available tool which provides for the generation of e-mail timelines.

[0007] Based on the above, there is therefore a need for systems, methods, and tools which allow for the automatic generation of timelines for emails found on a computer.

SUMMARY OF INVENTION

[0008] The present invention relates to systems and methods for the generation of timelines for emails found on a computer. The files relating to emails on a computer are first located on the machine. These files are then analyzed and each email is extracted into a separate file. Each separate email file is further analyzed and a unique hash value for that separate email is generated. Data relating to that separate email is then retrieved and saved in a timeline file along with the unique hash value. Such data may include the email recipient, the email sender, a routing return path for the email, the file name for the separate email file, the file size for the separate email file, and the directory or path for that separate email file.
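As an illustration of the hashing step described above, the following C sketch computes a hash value over the contents of one separate email file. It is a minimal sketch under stated assumptions: this section does not name the hash function, so 64-bit FNV-1a is used purely as a simple, dependency-free stand-in, and the names fnv1a64_update and hash_email_file are hypothetical. FNV-1a is not a cryptographic hash; a forensic tool would more plausibly rely on a cryptographic digest such as SHA-256 from an external library.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in hash only: FNV-1a is NOT cryptographic and is not named in this
   document. Swap in a cryptographic digest for real forensic use. */
#define FNV1A64_OFFSET_BASIS 14695981039346656037ULL
#define FNV1A64_PRIME        1099511628211ULL

/* Fold len bytes of data into a running 64-bit FNV-1a hash. */
uint64_t fnv1a64_update(uint64_t hash, const unsigned char *data, size_t len)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= FNV1A64_PRIME;
    }
    return hash;
}

/* Hash the full contents of one separate email file (hypothetical helper).
   Returns 0 and stores the hash in *out, or -1 on an I/O error. */
int hash_email_file(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    size_t n;
    uint64_t hash = FNV1A64_OFFSET_BASIS;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        hash = fnv1a64_update(hash, buf, n);

    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = hash;
    return 0;
}
```

The resulting value would be written into the timeline entry next to the other fields gathered for the same email, so that each entry can be tied back to one extracted file.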

[0009] In a first aspect, the present invention provides a method for generating a timeline for emails stored on a computer, the method comprising: a) locating files containing said emails on said computer; b) extracting emails from said files and storing each of said emails into a separate email file to result in multiple separate email files; c) determining a unique identifier for each of said emails; d) gathering data about each separate email from said multiple separate email files and saving said data into a timeline file; wherein said timeline file associates each separate email with its unique identifier determined in step c).

[0010] In a second aspect, the present invention provides computer readable media having encoded thereon computer readable and computer executable instructions which, when executed, implements a method for generating a timeline for emails stored on a computer, the method comprising: a) locating files containing said emails on said computer; b) extracting emails from said files and storing each of said emails into a separate file to result in multiple separate email files; c) determining a unique identifier for each of said emails; d) gathering data about each separate email from said multiple separate email files and saving said data into a timeline file; wherein said timeline file associates each separate email with its unique identifier determined in step c).

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] The embodiments of the present invention will now be described by reference to the following figures, in which identical reference numerals in different figures indicate identical elements and in which:

FIGURE 1 is a flowchart detailing the steps in a method according to one aspect of the invention.

DETAILED DESCRIPTION

[0012] In one aspect, the present invention provides methods which permit the generation of e-mail based timelines for use in digital forensic investigations, suitable for use in government, law enforcement and the commercial marketplace as well as an open e-mail timeline format which readily permits the exchange of e-mail timeline information between investigators and which can be readily integrated into existing digital forensic timelines.

[0013] In one implementation, the present invention relies on a filesystem timeline, also known as a super-timeline. While a super-timeline is not required to generate email timelines, using one is a preferred method. In one implementation, the method begins with a script that examines the super-timeline for the existence of e-mail folders. Once these are found, the script begins the pre-processing, which consists of extracting all e-mails and attachments from all detected email folders. The extracted e-mails and attachments are copied to a specific location. A list of all e-mails and attachments is then compiled. Based on this list, the system generates its e-mail timeline.

[0014] It should be noted that, while there are many different email formats available, most of these formats can be converted into a format which is easier to work with. Open email formats, such as the Mbox standard, are easy to work with for reading and parsing the e-mails contained therein, as well as for extracting their associated attachments. Specifically, e-mails and their associated attachments are stored within contiguous text-based data files, generally one data file per e-mail folder, where each folder represents the same structure found within the e-mail client.

[0015] Attachments are encoded and stored within their associated e-mail, generally using a text-based encoding. The Mbox standard is loosely defined in RFC 4155, although the original specification is found in RFC 822. Moreover, today there exist several derivative Mbox formats including mboxo, mboxrd, mboxcl and mboxcl2, all of which are wholly incompatible with Mbox. Thus, use of the Mbox format readily facilitates e-mail analysis and timeline analysis due to its simple structure and the format of the e-mails and attachments contained therein. Accordingly, a variety of text-based data processing tools (e.g. awk, tr, cat, sort, uniq, cut, paste, sed, etc.) and Mbox-specific tools (e.g. mboxgrep, mailgrep, grepmail, etc.) can be used to process and analyse Mbox data files in order to conduct e-mail analysis and timeline generation (with additional scripting).
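To make the point about the simplicity of Mbox parsing concrete, the following C sketch counts the messages in an Mbox data file. It is a minimal sketch, assuming the common convention that each message begins with a line starting with "From " that is either the first line of the file or follows a blank line; the function name count_mbox_messages is hypothetical, and the Mbox variants mentioned above may need different handling.

```c
#include <stdio.h>
#include <string.h>

/* Count messages in an mbox stream (hypothetical helper).
   Assumption: a message starts at a line beginning with "From " that is
   the first line of the file or directly follows a blank line. */
long count_mbox_messages(FILE *mbox)
{
    char line[4096];
    long count = 0;
    int prev_blank = 1;      /* start of file counts as a boundary */
    int at_line_start = 1;   /* 0 while inside a line longer than the buffer */

    while (fgets(line, sizeof line, mbox) != NULL) {
        size_t len = strlen(line);

        if (at_line_start) {
            if (prev_blank && strncmp(line, "From ", 5) == 0)
                ++count;
            /* Remember whether this line is blank (LF or CRLF only). */
            prev_blank = (line[0] == '\n' ||
                          (line[0] == '\r' && line[1] == '\n'));
        }
        /* The next chunk starts a new line only if this one ended in LF. */
        at_line_start = (len > 0 && line[len - 1] == '\n');
    }
    return count;
}
```

A body line that happens to begin with "From " is not counted unless it follows a blank line, which is why the blank-line check is kept alongside the prefix test.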

[0016] In contrast to Mbox, the Maildir format stores each e-mail in its own file within a common e-mail folder. Associated attachments are also stored within the associated e-mail data file. The folder structure used by this format represents the same folder structure found and used within the e-mail client itself. Although the Maildir format is popular and is used by a variety of e-mail clients and servers, it has not yet been defined as an RFC standard. The Maildir format, however, is used by many different UNIX and FOSS-based e-mail servers and clients. As such, it too can be readily considered an open e-mail based data storage format. Moreover, the standard UNIX text processing tools can be used against Maildir (e.g. awk, tr, cat, sort, uniq, cut, paste, sed, etc.) as can other specialized tools including mboxgrep. Furthermore, many e-mail clients, especially the various FOSS-based ones, fully support the importation and exportation of this format.

[0017] Of course, conversion tools exist to convert between Mbox and Maildir formats. For conversion from Mbox to Maildir, tools such as mbox2maildir, mb2md and perfect_maildir can be used, to name just a few. From Maildir to Mbox, two readily available tools are maildir2mbox and mutt. With mutt, the following command can be used to convert a Maildir archive to an Mbox archive:

mutt -f Archive/ -e 'set mbox_type=mbox; set confirmcreate=no; set delete=no; push "T.*<enter>;sarchive<enter><quit>"'

[0018] A final open format worth mentioning is the MH Message Handling System format, or just MH. This is actually an open source program consisting of several command line tools for reading and managing e-mails. MH is neither widely used nor has it undergone recent development. In contrast to Mbox and Maildir-based storage formats, all e-mails collected by MH are stored as individual files, each stored in its own directory with each attachment being stored within its associated e-mail. As such, this program is neither efficient at managing the disk space consumed by its e-mails nor does it provide high-performance e-mail management. However, its e-mails are stored in a text-based format, thereby making their contents readily accessible and trivial to process for e-mail analysis and timeline generation, both of which can be achieved using relatively simple customized scripts and programs.

[0019] Regarding Microsoft Outlook based formats, contrary to popular opinion, the Microsoft Outlook PST/OST format is an open format, complete with a formal specification made public by Microsoft in 2010. It is important to understand that this specification, despite its size (199 pages in length), should generally be considered complete. In fact, the overall PST/OST file format is as complex and field-heavy as many modern file system specifications. However, prior to 2010, understanding the PST and OST formats for Microsoft Outlook required developers to reverse engineer the format. Both formats are similar, although there are nevertheless differences between them. A thorough reading of references 7, 9, and 10 (listed at the end of this document) will help to verse the reader in their subtleties. It is also important to note that the PST and OST formats up to and including Outlook 2002 used 32-bit internal pointers while Outlook 2003 and subsequent versions use 64-bit internal pointers. However, most PST and OST-aware programs can seamlessly compensate for these differences.

[0020] For Outlook Express emails, the email database can be converted. Converters for Outlook Express' proprietary Mbx-based data storage format include ol2mbox, mbx2mbox and dbx2mbox. Older versions of Outlook Express used the Mbx format while recent versions of Outlook Express adopted the Dbx format, another closed proprietary format. However, this format has been successfully reverse engineered by Arne Schloh and Ulrich Krebs. Avi Rozen of the UnDBX project based his code on Krebs' work. Finally, one other noteworthy program is DbxConv, a free Windows-based Dbx to Mbox conversion utility.

[0021] To assist with the implementation of the present invention, subroutine libraries as well as extraction and information collection tools may be used. Specifically, the libraries LibPFF and LibPST can be used to extract emails from email databases. Like LibPFF, LibPST is a collection of C libraries and PST-based extraction and information collection tools. Its shared library file (.so file) can be linked into other programs requiring specific functionality, or the various C header files (.h files) can be incorporated into custom software. LibPST's software tools are readpst and lspst. Other tools exist in this library, but they are not used for extracting or listing information about PST and OST data storage files. The readpst program can convert PST and OST files into separate e-mails, a KMail-based e-mail storage file, or several different sub-Mbox formats. It can also recover e-mails (and their underlying attachments) that have been deleted but not permanently removed (Shift+deleted). This means that e-mails found in the "Deleted" subfolder are readily recoverable if the correct command line parameter is specified. Conversely, the lspst program does not extract e-mails from PST and OST data storage files; instead, it is used to list their contents. Although LibPST is fully functional and provides useful C routines for working with PST and OST-based data storage files, it cannot recover Shift+deleted e-mails and their attachments from PST and OST data storage files. When e-mails are deleted in this fashion, they are returned to "unallocated" space within an e-mail based data storage file and remain intact until either a new e-mail (or its attachment) overwrites the location where the deleted e-mail resided or the storage data file is compacted. In order to recover these e-mails and attachments, investigators can turn to commercial e-mail forensic software (e.g. P2 Commander) or use LibPFF, which has this capability. The LibPFF library can be accessed at https://github.com/libyal/libpff while the LibPST library can be found at http://www.five-ten-sg.com/libpst.

[0022] It was found that LibPST was the more useful of the two libraries. The reasons for this are two-fold. First, e-mails and attachments recovered from unallocated space have a significantly higher chance of being damaged or corrupted, thereby making e-mail analysis and timeline generation all the more difficult. Second, LibPFF and its e-mail and attachment extraction tool, pffexport, do not extract e-mails and attachments in as straightforward a manner as readpst.

[0023] As for the latter reason, it is important to understand that when readpst is instructed to extract all e-mails from a given e-mail based storage data file, it extracts each e-mail and its associated attachment using the same structure found within the data storage file itself, respecting folder and subfolder hierarchies. The same is true of the tool pffexport; however, the key difference between them is that readpst extracts only one file per e-mail message whereas pffexport exports multiple files per message. Specifically, pffexport uses multiple text files to express what readpst denotes using a single e-mail text file. Of course, both programs extract all associated attachments using a 1:1 mapping. Moreover, readpst extracts and renames all extracted e-mails and associated attachments using a simple numbering system on a per-folder basis. For example, the first e-mail extracted in a given subfolder will be renamed "1" while its attachments will be renamed "1-1.extension", "1-2.extension", "1-x.extension", etc., where ".extension" represents the actual filename extension of the attachment as stored within the associated e-mail. In addition, in order to prevent RTF attachment extraction for e-mails that have embedded RTF bodies, the appropriate readpst command line-based option can be specified to reduce the number of unnecessary attachments.
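The numbering scheme just described lends itself to simple parsing. The C sketch below classifies an extracted file name as either an e-mail ("1") or an attachment ("1-2.extension"). It is only a sketch, assuming names follow exactly the pattern given above; the function name parse_extracted_name and its output parameters are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse a name produced by the per-folder numbering scheme (hypothetical
   helper). On a match, returns 1 and sets *email_no to the e-mail number
   and *attach_no to the attachment index (0 when the name is the e-mail
   itself). Returns 0 otherwise; the outputs are then not meaningful. */
int parse_extracted_name(const char *name, long *email_no, long *attach_no)
{
    char *end;

    if (!isdigit((unsigned char)name[0]))
        return 0;
    *email_no = strtol(name, &end, 10);

    if (*end == '\0') {              /* e.g. "12": the e-mail itself */
        *attach_no = 0;
        return 1;
    }
    if (*end != '-' || !isdigit((unsigned char)end[1]))
        return 0;
    *attach_no = strtol(end + 1, &end, 10);

    /* The index ends the name or is followed by ".extension". */
    return (*end == '\0' || *end == '.');
}
```

For example, "1-2.pdf" yields e-mail number 1 and attachment index 2, while a name such as "readme.txt" is rejected.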

[0024] Finally, it is important to point out that pffexport is more advanced than readpst, as are its C libraries. Thus, the functionality of LibPFF is technologically superior to LibPST, although for timeline generation, LibPST is ideal. For purposes of the present invention, LibPST and readpst are the preferable choice for timeline generation. Moreover, the use of readpst required less additional code to be written for the prototype, as fewer files had to be handled in order to extract all the pertinent information which readpst places within each extracted e-mail message. This is in contrast to pffexport, which uses multiple files to achieve the same overall functionality.

[0025] It should be noted that LibPST is not the only option when it comes to libraries for email extraction. Alternatives to LibPST exist including, as previously mentioned, LibPFF. Others are Java LibPST and Python LibPST. With respect to Java LibPST, Python LibPST and LibPFF, only the latter provides PST/OST-specific user tools. The aforementioned LibPST-based libraries provide library-only functionality to the developer. This requires the developer to use the functions and classes present therein to write his own PST/OST-capable tool. For forensic investigators lacking the time or ability to write such tools, it is preferred to use LibPST with its two tools, readpst and lspst, or LibPFF, which also provides PST/OST extraction and information listing programs, pffexport and pffinfo, respectively.

[0026] It should be noted that, in one implementation, a tool called log2timeline was used to generate the files for emails stored in a disk volume. While the use of log2timeline was not necessary, it was convenient. Other tools, such as timescanner, may be used in place of log2timeline.

[0027] The log2timeline tool can be found at http://plaso.kiddaland.net or at http://log2timeline.net. An older version of this tool can be found at http://log2timeline.net. The timescanner tool has been implemented in the newer version of log2timeline.

[0028] As noted above, a super-timeline (generated by tools such as log2timeline) is not necessary to generate an email timeline. All that is needed is a script or other program that goes through one or more supported disk image-based file systems looking for email PST and OST folders. Once these folders are found, they are copied to a working directory and, using the readpst program, all e-mails and attachments therein are extracted. It is preferable that a logical structure be used when locating, copying and extracting the e-mails and attachments. Otherwise, it will be difficult to determine from where a given data file originated.

[0029] It should also be noted that, depending on the circumstances surrounding the data disk or the email files to be analyzed, it might be preferable, if not advisable, to first obtain an image of the disk or storage media containing the emails to be analyzed. The image can then be used for the email analysis and timeline generation instead of the original storage media. Creating a disk image, or simply an image of the storage media to be analyzed, is well-known in the art and is within the purview of a person skilled in the art, techniques and science of digital forensics.

[0030] To search for and detect email files (in this case, Outlook files), the following Bash shell code may be used:

#=======================
#Find all Outlook files:
#=======================
cd "$location/timeline/$image_name/$part_number"
echo ""
echo "Processing Outlook files..."
find "$mount_point" -type f -iname "*.pst" > outlook_found.txt
find "$mount_point" -type f -iname "*.ost" >> outlook_found.txt
cat outlook_found.txt | wc -l > num_outlook_fldrs_found.txt

# Copy each located file to outlook/<n>.outlook and record the mapping.
num=1
echo "Copying located Outlook folders..."
while IFS= read -r line
do
    echo "$line"
    cp "$line" "outlook/$num.outlook"
    echo "Outlook file: $line = $num.outlook" >> outlook_mapping.txt
    num=$((num + 1))
done < outlook_found.txt
cat outlook_mapping.txt | wc -l > num_outlook_fldrs_created.txt
echo "Copying completed..."

outlk_found=0
outlk_created=0

# Read back the two counts written above.
while read -r line
do
    outlk_found=$line
done < num_outlook_fldrs_found.txt
while read -r line
do
    outlk_created=$line
done < num_outlook_fldrs_created.txt

# Extract with readpst only if every located file was copied.
if [ "$outlk_found" = "$outlk_created" ]
then
    find outlook/ -type f > create_outlook_dirs.txt
    while IFS= read -r line
    do
        echo "$line"
        mkdir "$line.extract"
        readpst -D -S -o "$line.extract" "$line"
    done < create_outlook_dirs.txt
    rm create_outlook_dirs.txt
    mv outlook_mapping.txt outlook/
    mv outlook_found.txt outlook/
    mv num_outlook_fldrs_created.txt outlook/
    mv num_outlook_fldrs_found.txt outlook/
else
    echo ""
    echo "The number of Outlook folders found does not equal the number of Outlook processing directories."
    echo ""
fi

[0031] Once the email files have been found, these emails can be processed prior to the timeline generation. The above code creates directory-based e-mail file listings, and these file listings can be converted into a file-specific directory listing with suitable formatting. The core logic of this conversion can be seen in the code snippet below:

while (fgets(line_buffer, BUF_SIZE, outlook_input_file) != NULL)
{
    /* A separator line starts a new directory block. */
    if (strncmp(separator, line_buffer, strlen(separator)) == 0)
    {
        found_separator = 0;
    }

    /* The first line after a separator holds the directory name. */
    if (found_separator == 1)
    {
        length = strlen(line_buffer);
        line_buffer[length - 1] = '\0';   /* strip the trailing newline */
        strcpy(dir_name, line_buffer);
        //fprintf(stdout, "%s/\n", dir_name);
    }

    /* Every later line names one file in that directory. */
    if (found_separator >= 2)
    {
        line_buffer[0] = '\x20';
        line_buffer[1] = '\x20';
        sscanf(line_buffer, "%s", file_name);
        fprintf(outlook_output_file, "%s/%s\n", dir_name, file_name);
    }
    ++found_separator;
}

[0032] The result of the above processing is the extraction of emails from formatted email files on the volume being examined. Each email is now in a separate file and the multitude of separate email files can now be processed to generate a timeline.

[0033] To determine the size of the email being processed, a function may be created to determine this email's size. The function below may be used for this process .

// This function tries to get the size of the actual email being read
// and processed by this program. If it fails, then it exits. However,
// a note about the attempt to read this email can be found in the debug
// file.
void email_file_size(char email_file_name[])
{
    long int size = 0;
    struct stat sb;

    if (stat(email_file_name, &sb) == -1)
    {
        perror("stat");
        fprintf(stderr, "ERROR. THE SIZE OF FILE: %s COULD NOT BE DETERMINED.\n",
                email_file_name);
        fprintf(stderr, "Exiting...\n");
        tmp_sleep();
        exit(EXIT_FAILURE);
    }
    else
    {
        size = (long) sb.st_size;
    }

    emailfilesize = size;
}

[0034] As part of the data generated per email, the directory and file name for each email is determined and stored. A function which uses the UNIX programs dirname and basename can be created which returns the directory name and file name for each separate email file. These programs merely need to be provided with the separate email file.
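The paragraph above relies on the UNIX programs dirname and basename. As a hedged alternative for illustration, the C sketch below uses the POSIX library functions of the same names from <libgen.h> rather than spawning the programs. The function name split_email_path, the fixed 4096-byte working buffer, and the example path are assumptions made for this sketch only.

```c
#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Split path into its directory and file-name parts (hypothetical helper).
   Returns 0 on success, -1 if the path or a result does not fit. */
int split_email_path(const char *path, char *dir_out, char *base_out,
                     size_t out_size)
{
    char tmp[4096];
    int n;

    if (strlen(path) >= sizeof tmp)
        return -1;

    /* dirname() and basename() may modify their argument, so each call
       works on a fresh copy of the path. */
    strcpy(tmp, path);
    n = snprintf(dir_out, out_size, "%s", dirname(tmp));
    if (n < 0 || (size_t)n >= out_size)
        return -1;

    strcpy(tmp, path);
    n = snprintf(base_out, out_size, "%s", basename(tmp));
    if (n < 0 || (size_t)n >= out_size)
        return -1;

    return 0;
}
```

Called with a path such as "outlook/1.outlook.extract/Inbox/12", the sketch stores "outlook/1.outlook.extract/Inbox" as the directory and "12" as the file name.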

[0035] The number of attachments that an individual email may have is also determined and forms part of the timeline. The number of attachments within the directory for each separate email file is found and documented. This is then associated with each specific email.
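One way the attachment count could be obtained is sketched below in C. The sketch assumes that attachments sit in the same directory as their e-mail and are named with the e-mail number followed by a hyphen (for example "1-1.pdf" for e-mail "1"), as in the readpst naming scheme described earlier; the function names is_attachment_of and count_attachments are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* True if name starts with "<email_no>-", i.e. it looks like an attachment
   of that e-mail under the assumed naming scheme. */
int is_attachment_of(const char *name, long email_no)
{
    char prefix[32];
    int n = snprintf(prefix, sizeof prefix, "%ld-", email_no);

    return n > 0 && (size_t)n < sizeof prefix &&
           strncmp(name, prefix, (size_t)n) == 0;
}

/* Count the attachments of one e-mail inside dir_path.
   Returns the count, or -1 if the directory cannot be opened. */
long count_attachments(const char *dir_path, long email_no)
{
    struct dirent *entry;
    long count = 0;
    DIR *dir = opendir(dir_path);

    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        if (is_attachment_of(entry->d_name, email_no))
            ++count;
    }
    closedir(dir);
    return count;
}
```

Matching on the full "<number>-" prefix keeps the attachments of e-mail 1 from being confused with those of e-mail 11.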

[0036] It should be noted that the separate email files may require some extra processing to deal with badly formed emails or emails which have non-standard formatting. When emails with non-standard formatting are encountered, the data within them cannot be processed and a listing of such "bad" emails is generated. Similarly, if a separate email file is found to be a calendar entry (e.g. a vCalendar entry), a card entry (e.g. a vCard), or a journal entry (e.g. a vJournal), these files are detailed as such in the final timeline file. Detecting these calendar, card, or journal entries can be simple, as it merely involves matching header information in the email to an expected calendar, card, or journal header. The code below streams data from a separate email file and checks the header for a calendar header. If a calendar header is found, then the separate email file (in reality containing a calendar entry) is recorded in a calendar file listing, the file is listed as such in the debugging file, and a counter for files examined is incremented along with a counter for calendar files. For a vCalendar entry, the string to be matched is BEGIN:VCALENDAR. This section of code may be replicated and adjusted to cover scanning for journal or card entries.

first_word[0] = '\0';
second_word[0] = '\0';
sscanf(line_buffer, "%s %s", first_word, second_word);
//
// This specific check looks to ensure that if a VCALENDAR is found
// it is properly handled
//
if ((strcasecmp(first_word, match_VCALENDAR)) == 0)
{
    //fprintf(vcalendars_file, "%s\n", read_in_email_filename);
    fprintf(vcalendars_file, "%s\n", email_file_dirname);
    fseek(vcalendars_file, -1, SEEK_CUR);   /* back up over the trailing newline */
    fprintf(vcalendars_file, "/");
    fprintf(vcalendars_file, "%s\n", email_file_basename);
    fprintf(vcalendars_file, "%s", print_VCALENDAR);
    fprintf(debugging, "%s", print_VCALENDAR);
    for (j = 0; j < LINE_DIVIDER; ++j)
    {
        fprintf(debugging, "=");
        fprintf(vcalendars_file, "=");
    }
    fprintf(debugging, "\n");
    fprintf(vcalendars_file, "\n");
    fclose(email);
    ++counter;
    ++vcalendar_counter;
    break;
}

[0037] To detect non-standard email formatting (e.g. UTF-7 or UTF-8 formatted emails), a string indicating such formatting can be scanned for. This string can take the form of =?utf-7?, =?utf-8?, or any similar string. The code below attempts to match the CC field in the email with these UTF-7 or UTF-8 strings. If a match is found, then a specific UTF-7 or UTF-8 indication is placed in the CC field in the timeline. The code can be adjusted to examine any of the TO, FROM, or SUBJECT fields in the email as well.

    if ((strcasecmp(first_word, match_CC)) == 0)
    {
        strcpy(print_CC, line_buffer);
        //
        // Check for encoded UTF 7 or 8 streams!
        //
        if (strncasecmp(second_word, match_utf7, 8) == 0)
        {
            strcpy(print_CC, "Cc: ENCODED UTF7\n");
        }
        else if (strncasecmp(second_word, match_utf7_, 9) == 0)
        {
            strcpy(print_CC, "Cc: ENCODED UTF7\n");
        }
        else if (strncasecmp(second_word, match_utf8, 8) == 0)
        {
            strcpy(print_CC, "Cc: ENCODED UTF8\n");
        }
        else if (strncasecmp(second_word, match_utf8_, 9) == 0)
        {
            strcpy(print_CC, "Cc: ENCODED UTF8\n");
        }
    }
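The same check can be applied to the TO, FROM, or SUBJECT fields by moving the prefix comparison into a helper. The following is only a minimal sketch: the function name encoded_charset and the lookup table are hypothetical and do not appear in the original listing, and only the two prefixes named in the text (=?utf-7? and =?utf-8?) are covered.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: returns "UTF7" or "UTF8" when the header value
 * starts with the corresponding encoded-word prefix, or NULL otherwise. */
const char *encoded_charset(const char *value)
{
    static const struct {
        const char *prefix;
        const char *label;
    } table[] = {
        { "=?utf-7?", "UTF7" },
        { "=?utf-8?", "UTF8" },
    };

    /* Skip blanks between the field name and its value. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* Case-insensitive prefix comparison against each known marker. */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strncasecmp(value, table[i].prefix, strlen(table[i].prefix)) == 0)
            return table[i].label;
    }
    return NULL;
}
```

A caller could pass the text that follows "To:", "From:", "Cc:", or "Subject:" and, when the result is not NULL, write an indication such as "ENCODED UTF8" in place of the raw field value.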

[0038] It should be noted that, as part of error-handling, problematic dates and date format handling should be considered and dealt with. As examples of incorrect date formatting, the hour may be less than zero or greater than 24 or the minutes or seconds may be greater than 60 or less than 0. To ensure proper formatting of dates, these eventualities should be caught and dealt with.
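As an illustration of such a check, the sketch below validates the clock portion of a date string. It is not taken from the original program: the function names are hypothetical, the input is assumed to be in HH:MM:SS form, and the bounds (hours 0 to 23, minutes 0 to 59, seconds 0 to 60 to allow a leap second) are an assumption that is slightly stricter than the limits mentioned above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true when each clock field is within range. */
bool time_fields_valid(int hour, int minute, int second)
{
    if (hour < 0 || hour > 23)
        return false;
    if (minute < 0 || minute > 59)
        return false;
    if (second < 0 || second > 60)   /* 60 permits a leap second */
        return false;
    return true;
}

/* Hypothetical helper: parses "HH:MM:SS" and validates the fields.
 * Returns false when the text does not contain three numeric fields. */
bool clock_string_valid(const char *text)
{
    int hour, minute, second;

    if (sscanf(text, "%d:%d:%d", &hour, &minute, &second) != 3)
        return false;
    return time_fields_valid(hour, minute, second);
}
```

An email whose date fails such a check could then be recorded in an error listing instead of being written to the timeline with a malformed date.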

[0039] To create the actual timeline, the relevant fields from each separate email file are copied to a specific variable and the value of that variable is printed into the relevant timeline file. In the first code section below, the relevant fields are copied into specific variables, while in the second code section these specific variables are written into the relevant timeline files.

[0040] **** copy field string to variable ****

    if ((strcasecmp(first_word, match_PRIORITY)) == 0)
    {
        strcpy(print_PRIORITY, line_buffer);
    }
    else if ((strcasecmp(first_word, match_ERRORS_TO)) == 0)
    {
        strcpy(print_ERRORS_TO, line_buffer);
    }
    else if ((strcasecmp(first_word, match_IMPORTANCE)) == 0)
    {
        strcpy(print_IMPORTANCE, line_buffer);
    }
    else if ((strcasecmp(first_word, match_MESSAGE_ID)) == 0)
    {
        strcpy(print_MESSAGE_ID, line_buffer);
    }
    else if ((strcasecmp(first_word, match_REFERENCES)) == 0)
    {
        strcpy(print_REFERENCES, line_buffer);
    }
    else if ((strcasecmp(first_word, match_IN_REPLY_TO)) == 0)
    {
        strcpy(print_IN_REPLY_TO, line_buffer);
    }
    else if ((strcasecmp(first_word, match_RETURN_PATH)) == 0)
    {
        strcpy(print_RETURN_PATH, line_buffer);
    }
    else if ((strcasecmp(first_word, match_SENSITIVITY)) == 0)
    {
        strcpy(print_SENSITIVITY, line_buffer);
    }

[0041] **** write variable string into timeline files ****

    if ((strncasecmp(line_buffer, delimiter, 19)) == 0)
    {
        //
        // Debugging and emails files should state the actual processed
        // email number so that it lines up with data from messageid,
        // to, from, cc, subject, and sha1.
        //
        fprintf(debugging, "Email No.: %ld\n", email_counter + 1);
        fprintf(emails_file, "Email No.: %ld\n", email_counter + 1);
        //
        // As stated above.
        //
        fprintf(debugging, "%s", print_TO);
        fprintf(emails_file, "%s", print_TO);
        fprintf(timeline_to, "Email No.: %ld\t%s\n", email_counter + 1, print_TO);
        fprintf(debugging, "%s", print_FROM);
        fprintf(emails_file, "%s", print_FROM);
        fprintf(timeline_from, "Email No.: %ld\t%s\n", email_counter + 1, print_FROM);
        fprintf(debugging, "%s", print_CC);
        fprintf(emails_file, "%s", print_CC);
        fprintf(timeline_cc, "Email No.: %ld\t%s\n", email_counter + 1, print_CC);
        fprintf(debugging, "%s", print_IN_REPLY_TO);
        fprintf(emails_file, "%s", print_IN_REPLY_TO);
        fprintf(debugging, "%s", print_SUBJECT);
        fprintf(emails_file, "%s", print_SUBJECT);
        fprintf(timeline_subject, "Email No.: %ld\t%s\n", email_counter + 1, print_SUBJECT);

[0042] To assign a unique identifier to each email, a hash function is applied to each separate email file. The code below builds a SHA1 hash command and applies that command to each separate email file. The hash value for each separate email file is then written to the timeline file.

    // Here we run SHA1 hash against email file.
    // Use popen() command to run sha1sum against email and retrieve hash
    // value.
    // Build SHA1 hash command
    //
    strcat(sha1_hash_command, sha1_hash_cmd1);
    strcat(sha1_hash_command, email_file_name);
    strcat(sha1_hash_command, "/");
    strcat(sha1_hash_command, email_file_basename);
    strcat(sha1_hash_command, "\"");
    pf = popen(sha1_hash_command, "r");
    //
    // SHA1 hashing is required. An unsuccessful hashing results in program
    // termination.
    //
    if (!pf)
    {
        fprintf(stderr, "\nCOULD NOT OPEN PIPE FOR COMMAND OUTPUT.\n");
        fprintf(stderr, "\nExiting...\n");
        tmp_sleep();
        exit(EXIT_FAILURE);
    }
    else
    {
        fscanf(pf, "%s", retrieve_sha1_stream);
        fprintf(stdout, "Computed SHA1 Hash = %s\n", retrieve_sha1_stream);
        fprintf(debugging, "Computed SHA1 Hash = %s\n", retrieve_sha1_stream);
        fprintf(emails_file, "Computed SHA1 Hash = %s\n", retrieve_sha1_stream);
        fprintf(timeline_file, "%s||-||", retrieve_sha1_stream);
        fprintf(timeline_SHA1, "Email No.: %ld\t%s\n", email_counter + 1, retrieve_sha1_stream);
        fflush(pf);
    }
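The pattern of building a command string and reading its first output token can also be written with bounded buffers. The sketch below is an illustration under stated assumptions, not the original code: the function names build_sha1_command and read_first_token are hypothetical, the sha1sum utility is assumed to be on the PATH, and file names containing double quotes or other shell metacharacters are assumed not to occur, since they would need additional escaping.

```c
#define _POSIX_C_SOURCE 200809L   /* popen and pclose */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes  sha1sum "<dir>/<base>"  into cmd.
 * Returns 0 on success, -1 if the command does not fit in the buffer. */
int build_sha1_command(char *cmd, size_t size, const char *dir, const char *base)
{
    int n = snprintf(cmd, size, "sha1sum \"%s/%s\"", dir, base);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: runs cmd through popen() and copies the first
 * whitespace-delimited token of its output into out.
 * Returns 0 on success, -1 on any failure. */
int read_first_token(const char *cmd, char *out, size_t size)
{
    char line[256];
    FILE *pf = popen(cmd, "r");

    if (!pf)
        return -1;

    int got_line = fgets(line, sizeof line, pf) != NULL;
    int status = pclose(pf);          /* also reaps the child process */
    if (!got_line || status != 0)
        return -1;

    /* sha1sum prints "<hash>  <file>"; keep only the hash. */
    size_t len = strcspn(line, " \t\r\n");
    if (len == 0 || len >= size)
        return -1;
    memcpy(out, line, len);
    out[len] = '\0';
    return 0;
}
```

Using snprintf() instead of repeated strcat() calls makes truncation detectable, and checking the status returned by pclose() lets the caller distinguish a failed hashing command from an empty result.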

[0043] Other fields associated with an email are also copied from each separate email file and written to specific variables created for that purpose. These variables are then written to the different reporting files including the ultimate timeline file.

    fprintf(debugging, "%s", print_STATUS);
    fprintf(emails_file, "%s", print_STATUS);
    fprintf(debugging, "%s", print_PRIORITY);
    fprintf(emails_file, "%s", print_PRIORITY);
    fprintf(debugging, "%s", print_IMPORTANCE);
    fprintf(emails_file, "%s", print_IMPORTANCE);
    fprintf(debugging, "%s", print_SENSITIVITY);
    fprintf(emails_file, "%s", print_SENSITIVITY);
    fprintf(debugging, "%s", print_MESSAGE_ID);
    fprintf(emails_file, "%s", print_MESSAGE_ID);
    fprintf(timeline_messageid, "Email No.: %ld\t%s\n", email_counter + 1, print_MESSAGE_ID);
    fprintf(debugging, "%s", print_ERRORS_TO);
    fprintf(emails_file, "%s", print_ERRORS_TO);
    fprintf(debugging, "%s", print_RETURN_PATH);
    fprintf(emails_file, "%s", print_RETURN_PATH);
    fprintf(debugging, "%s", print_REFERENCES);

[0044] In one implementation of the present invention, multiple output files are generated. One of these files is a text file named debugging.txt. This file or one similar to it helps the investigator correct aberrant situations should the system of the invention crash. Although the system of the invention can be written to detect and handle various errors, some may still cause crashes or aborts. If the system crashes, it is preferred that important information concerning the e-mail that was open for analysis at the time of the crash has already been written to the debugging file. This information can be used to remove the problematic email from the list of e-mails previously generated.

[0045] Another debugging file is a text file called errors.txt, found within the directory errorlogs. This file or one similar to it can be used to capture two specific types of complications. The first is an inability to read an e-mail, usually due to insufficient filesystem privileges. When this error occurs, the details regarding the error can be written to the errors file so it can be reviewed at a later date. The other issue that can be caught is that of incorrect or mismatched date/time strings which fail to return a correctly formatted date/time. When these errors are encountered, the details regarding the errors (e.g. the name of the separate email file being examined, the string that caused the error, etc.) can be written to the errors file and, if necessary, to the debugging file.
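A minimal sketch of how the two error types might be recorded is given below. The function name, the enumeration, and the layout of each log line are hypothetical, since the format of the errors file is not specified here; the caller is assumed to have already opened the errors file for writing.

```c
#include <stdio.h>

/* The two complications described above. */
enum error_kind { ERR_UNREADABLE_EMAIL, ERR_BAD_DATETIME };

/* Hypothetical helper: appends one line describing the error to an
 * already opened errors file. Returns 0 on success, -1 on a write error. */
int log_email_error(FILE *errors, enum error_kind kind,
                    const char *email_file, const char *detail)
{
    const char *label = (kind == ERR_UNREADABLE_EMAIL)
                            ? "UNREADABLE E-MAIL"
                            : "BAD DATE/TIME";

    int n = fprintf(errors, "%s: file=%s detail=%s\n", label, email_file, detail);

    /* Flush immediately so the entry survives a later crash. */
    if (fflush(errors) != 0)
        return -1;
    return n < 0 ? -1 : 0;
}
```

The same function could be called a second time with the debugging file as its first argument when the error should also appear there.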

[0046] The system can be configured to create two additional directories, analysis and results. The analysis directory contains the following files: cc.txt, from.txt, messageid.txt, SHA1.txt, subject.txt, timeline.txt, and to.txt. The cc.txt file lists all the addresses that have been cc'd on the emails. The from.txt file contains the addresses of all the originators of the emails. The messageid.txt file contains the message IDs for the emails that have message identification numbers. The SHA1.txt file contains the hash values for all the separate email files, while the subject.txt file contains the subject lines for all the emails. The to.txt file contains all the recipient email addresses from the emails. The timeline.txt file contains a listing of all the emails along with the relevant data for each email (e.g. to, from, subject, email size, attachment, hash value, etc.).

[0047] The results directory would contain the following files: emails.txt, vcalendars.txt, vcards.txt, and vjournals.txt. These files detail the various types of files encountered during the timeline generation process, including calendar files, card files, journal files, and, of course, email files.
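One way to route each separate file to the matching results listing is sketched below. This is illustrative only: the function and enumeration names are hypothetical, BEGIN:VCALENDAR is the vCalendar header string, and the BEGIN:VCARD and BEGIN:VJOURNAL strings are assumptions that follow the same pattern.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

enum entry_kind { ENTRY_EMAIL, ENTRY_VCALENDAR, ENTRY_VCARD, ENTRY_VJOURNAL };

/* Hypothetical helper: classifies a separate file by its first line. */
enum entry_kind classify_first_line(const char *line)
{
    static const struct {
        const char *header;
        enum entry_kind kind;
    } table[] = {
        { "BEGIN:VCALENDAR", ENTRY_VCALENDAR },
        { "BEGIN:VCARD",     ENTRY_VCARD },      /* assumed header string */
        { "BEGIN:VJOURNAL",  ENTRY_VJOURNAL },   /* assumed header string */
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strncasecmp(line, table[i].header, strlen(table[i].header)) == 0)
            return table[i].kind;
    }
    return ENTRY_EMAIL;   /* anything else is treated as an email */
}

/* Maps each kind to the results file named in the text. */
const char *results_file_for(enum entry_kind kind)
{
    switch (kind) {
    case ENTRY_VCALENDAR: return "vcalendars.txt";
    case ENTRY_VCARD:     return "vcards.txt";
    case ENTRY_VJOURNAL:  return "vjournals.txt";
    default:              return "emails.txt";
    }
}
```

A table of header strings keeps the three checks in one place, so covering a further entry type means adding one row rather than replicating a block of code.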

[0048] The format of the aforementioned files is self-evident, with the exception of the timeline.txt file format, which is explained below.

[0049] It should be noted that although e-mail Message-IDs are usually sufficient to distinguish between e-mails, not all email clients generate such message IDs. Thus, to determine an e-mail's uniqueness, SHA1 hashing is used. The system, using the function detailed above, generates a SHA1 hash for every e-mail. This can be extended to add SHA1 hashing capabilities to all extracted e-mail attachments.

[0050] It should be noted that the system generates the file emails.txt. The entries in this file have the following format:

To:

From:

In-Reply-To:

Subject:

Date:

Status:

Message-ID:

SHA1 Hash:

Return-Path:

Extracted e-mail file name:

Extracted e-mail file size:

Extracted e-mail directory name:

Extracted e-mail attachments:

[0051] The file emails.txt contains enough information about each e-mail that the investigator can refer to it rather than consult a given e-mail directly.

[0052] For the timeline.txt file, the entries have the following format:

DATE||-||E-MAIL SIZE||-||SUBJECT||-||FROM||-||TO||-||MESSAGE-ID||-||SHA1 HASH||-||E-MAIL LOCATION ON DISK

where "||-||" is the divisor between the various fields of the timeline. This specific divisor was chosen as it is unique enough not to come into conflict with potential text embedded in an e-mail when performing additional text-based processing. This format can be used as the basis for a new and standardized e-mail timeline format.
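To illustrate how an entry in this format could be produced and later split for text-based processing, a short sketch follows. It assumes the divisor is the literal five-character string ||-|| and that each entry holds the eight fields listed above in that order; the function names and the macros are hypothetical and not part of the original program.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_DIVISOR   "||-||"
#define TIMELINE_FIELDS 8

/* Hypothetical helper: joins the eight fields with the divisor.
 * Returns 0 on success, -1 if the entry does not fit in out. */
int format_timeline_entry(char *out, size_t size,
                          const char *fields[TIMELINE_FIELDS])
{
    size_t used = 0;

    for (int i = 0; i < TIMELINE_FIELDS; i++) {
        int n = snprintf(out + used, size - used, "%s%s",
                         i ? FIELD_DIVISOR : "", fields[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: splits an entry in place at each divisor.
 * Stores up to max field pointers and returns how many were stored;
 * any text beyond the last stored field is dropped. */
int split_timeline_entry(char *line, char *fields[], int max)
{
    int count = 0;
    char *cursor = line;

    while (count < max) {
        fields[count++] = cursor;
        char *next = strstr(cursor, FIELD_DIVISOR);
        if (!next)
            break;
        *next = '\0';                          /* terminate this field */
        cursor = next + strlen(FIELD_DIVISOR); /* start of the next one */
    }
    return count;
}
```

Because the divisor is a multi-character string, strstr() is used for splitting rather than strtok(), which treats its delimiter argument as a set of single characters.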

[0053] The method according to one aspect of the invention may be summarized as detailed in the flowchart of Figure 1. The method begins with the creation of an image of the storage media being examined (step 10). While this step is not strictly necessary, it is advisable to ensure that errors do not destroy the data being analyzed.

[0054] The next step (step 20) is that of locating the email files or the database files containing emails on the storage device or on the image of the storage device.

[0055] Once the files have been located, the various emails in the database of emails are then extracted, with each email being allocated a single separate email file. Attachments can be saved in separate files (step 30).

[0056] After each email has been extracted into a separate email file, a unique identifier can be generated for each email. This can be done using the separate email file generated for each email. A hash value can be determined for each separate email file and this hash value can be used as the unique identifier for the email (step 40).

[0057] With separate email files generated, a timeline can be generated. This is done by analyzing each separate email file and gathering the relevant data into a timeline file. This data can include the email addresses of recipients, the originators of the email, the subject of the email, as well as other data usually found in the headers of emails (step 50).

[0058] Other output files can also be generated as the timeline file is being generated. As an example, files containing the addresses of the recipients of emails, those copied on the emails, and the originators of the emails can all be gathered in separate output files. This can be done while the timeline file is being generated (step 60).

[0059] The software system according to one aspect of the invention may be implemented on a suitable computer system or device. Such a computer or device would, preferably, be equipped with storage devices that can handle intensive I/O operations. High speed/high capacity disks were found to be suitable in one implementation. In the same implementation, 12 core CPUs (overclocked to 5 GHz) with 32 GB of RAM and six 1 TB disks were used to generate the super timeline. For the actual timeline implementation, 12 TB of data storage was used in an implementation that involved more than 10 million emails and more than 30 million attachments. The invention is suitable for handling production-size loads.

[0060] Regarding operating systems and other supporting software, in one implementation a Linux operating system was used with the appropriate dependencies installed (i.e. The Sleuth Kit, Python, Perl, LibPST (or other similar program), a C/C++ compiler and required libraries). The invention will also run without much modification under Solaris and Mac OS X. The system of the invention can also be made to run under Windows using the Cygwin framework.

[0061] For a better understanding of the invention, the following references may be consulted. These references are hereby incorporated by reference in their entirety.

[1] Campaign Monitor. Email Client Popularity. Informational web site. August 2012. http://www.campaignmonitor.com/resources/will-it-work/email-clients.

[2] Wikipedia. Mbox. Online encyclopaedic entry. Wikimedia Foundation Inc. May 2012. http://en.wikipedia.org/wiki/Mbox.

[3] Wikipedia. Maildir. Online encyclopaedic entry. Wikimedia Foundation Inc. July 2012. http://en.wikipedia.org/wiki/Maildir.

[4] Wikipedia. MH Message Handling System. Online encyclopaedic entry. Wikimedia Foundation Inc. September 2012. http://en.wikipedia.org/wiki/Maildir.

[5] Wikipedia. Email. Online encyclopaedic entry. Wikimedia Foundation Inc. September 2012. http://en.wikipedia.org/wiki/Email.

[6] Wikipedia. Email. Online encyclopaedic entry. Wikimedia Foundation Inc. September 2012. http://en.wikipedia.org/wiki/MH_Message_Handling_System.

[7] Microsoft MSDN. Outlook Personal Folders (.pst) File Format. Format specification. July 2012. Document No.: [MS-PST] - V20120630. http://msdn.microsoft.com/en-us/library/ff385210.aspx.

[8] July 2012 [MS-PST]: Outlook Personal Folders (.pst) File Format. Informational web site. 2012. http://msdn.microsoft.com/en-us/library/ff385210.aspx.

[9] Metz, Joachim. Personal Folder File (PFF) file format specification. File format specification. August 2012. http://code.google.com/p/libpff/downloads/detail?name=Personal%20Folder%20File%20%28PFF%29%20format.pdf&can=2&q.

[10] Smith, Dave. File Format for Outlook pst files. File format specification. September 2004. http://alioth.debian.org/frs/download.php/2492/libpst-0.5.3.tar.gz.

[11] Wikipedia. Personal Storage Table. Online encyclopaedic entry. Wikimedia Foundation Inc. September 2012. http://en.wikipedia.org/wiki/Personal_Storage_Table.

[12] Wikipedia. Microsoft Entourage. Online encyclopaedic entry. Wikimedia Foundation Inc. July 2012. http://en.wikipedia.org/wiki/Microsoft_Entourage.

[13] Wikipedia. Outlook_Express. Online encyclopaedic entry. Wikimedia Foundation Inc. August 2012. http://en.wikipedia.org/wiki/Outlook_Express.

[14] Schloh, Arne. Outlook Express dbx file format by Arne Schloh. Informational web site. April 2002. http://oedbx.aroh.de/.

[0062] The embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps. Similarly, an electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art, may be programmed to execute such method steps. As well, electronic signals representing these method steps may also be transmitted via a communication network.

[0063] Embodiments of the invention may be implemented in any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g. "C") or an object-oriented language (e.g. "C++", "java", "PHP", "PYTHON" or "C#"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system. Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product). A person understanding this invention may now conceive of alternative structures and embodiments or variations of the above, all of which are intended to fall within the scope of the invention as defined in the claims that follow.
