CFEs and forensic accountants are seekers. We spend our days searching for the most relevant information about our client requested investigations from an ever-growing and increasingly tangled data sphere and trying to make sense of it. Somewhere hidden in our client’s computers, networks, databases, and spreadsheets are signs of the alleged fraud, accompanying control weaknesses and unforeseen risks, as well as possible opportunities for improvement. And the more data the client organization has, the harder all this is to find. Although most computer-assisted forensic audit tests focus on the numeric data contained within structured sources, such as financial and transactional databases, unstructured or text based data, such as e-mail, documents, and Web-based content, represents an estimated 8o percent of enterprise data within the typical medium to large-sized organization. When assessing written communications or correspondence about fraud related events, CFEs often find themselves limited to reading large volumes of data, with few automated tools to help synthesize, summarize, and cluster key information points to aid the investigation.
Text analytics is a relatively new investigative tool for CFEs in actual practice although some report having used it extensively for at least the last five or more years. According to the ACFE, the software itself stems from a combination of developments in our sister fields of litigation support and electronic discovery, and from counterterrorism and surveillance technology, as well as from customer relationship management, and research into the life sciences, specifically artificial intelligence. So, the application of text analytics in data review and criminal investigations dates to the mid-1990s.
Generally, CFEs increasingly use text analytics to examine three main elements of investigative data: the who, the what, and the when.
The Who: According to many recent studies, substantially more than a half of business people prefer using e-mail to use of the telephone. Most fraud related business transactions or events, then, will likely have at least some e-mail communication associated with them. Unlike telephone messages, e-mail contains rich metadata, information stored about the data, such as its author, origin, version, and date accessed, and can be documented easily. For example, to monitor who is communicating with whom in a targeted sales department, and conceivably to identify whether any alleged relationships therein might signal anomalous activity, a forensic accountant might wish to analyze metadata in the “to,” “from,” “cc,” or “bcc” fields in departmental e-mails. Many technologies for parsing e-mail with text analytics capabilities are available on the market today, some stemming from civil investigations and related electronic discovery software. These technologies are like the social network diagrams used in law enforcement or in counterterrorism efforts.
The What: The ever-present ambiguity inherent in human language presents significant challenges to the forensic investigator trying to understand the circumstances and actions surrounding the text based aspects of a fraud allegation. This difficulty is compounded by the tendency of people within organizations to invent their own words or to communicate in code. Language ambiguity can be illustrated by examining the word “shred”. A simple keyword search on the word might return not only documents that contain text about shredding a document, but also those where two sports fans are having a conversation about “shredding the defense,” or even e-mails between spouses about eating Chinese “shredded pork” for dinner. Hence, e-mail research analytics seeks to group similar documents according to their semantic context so that documents about shredding as concealment or related to covering up an action would be grouped separately from casual e-mails about sports or dinner, thus markedly reducing the volume of e-mail requiring more thorough ocular review. Concept-based analysis goes beyond traditional search technology by enabling users to group documents according to a statistical inference about the co-occurrence of similar words. In effect, text analytics software allows documents to describe themselves and group themselves by context, as in the shred example. Because text analytics examines document sets and identifies relationships between documents according to their context, it can produce far more relevant results than traditional simple keyword searches.
Using text analytics before filtering with keywords can be a powerful strategy for quickly understanding the content of a large corpus of unstructured, text-based data, and for determining what is relevant to the search. After viewing concepts at an elevated level, subsequent keyword selection becomes more effective by enabling users to better understand the possible code words or company-specific jargon. They can develop the keywords based on actual content, instead of guessing relevant terms, words, or phrases up front.
The When: In striving to understand the time frames in which key events took place, CFEs often need to not only identify the chronological order of documents (e.g., sorted by or limited to dates), but also link related communication threads, such as e-mails, so that similar threads and communications can be identified and plotted over time. A thread comprises a set of messages connected by various relationships; each message consists of either a first message or a reply to or forwarding of some other message in the set. Messages within a thread are connected by relationships that identify notable events, such as a reply vs. a forward, or changes in correspondents. Quite often, e-mails accumulate long threads with similar subject headings, authors, and message content over time. These threads ultimately may lead to a decision, such as approval to proceed with a project or to take some other action. The approval may be critical to understanding business events that led up to a particular journal entry. Seeing those threads mapped over time can be a powerful tool when trying to understand the business logic of a complex financial transaction.
In the context of fraud risk, text analytics can be particularly effective when threads and keyword hits are examined with a view to considering the familiar fraud triangle; the premise that all three components (incentive/pressure, opportunity, and rationalization) are present when fraud exists. This fraud triangle based analysis can be applied in a variety of business contexts where increases in the frequency of certain keywords related to incentive/pressure, opportunity, and rationalization, can indicate an increased level of fraud risk.
Some caveats are in order. Considering the overwhelming amount of text-based data within any modern enterprise, assurance professionals could never hope to analyze all of it; nor should they. The exercise would prove expensive and provide little value. Just as an external auditor would not reprocess or validate every sales transaction in a sales journal, he or she would not need to look at every related e-mail from every employee. Instead, any professional auditor would take a risk-based approach, identifying areas to test based on a sample of data or on an enterprise risk assessment. For text analytics work, the reviewer may choose data from five or ten individuals to sample from a high-risk department or from a newly acquired business unit. And no matter how sophisticated the search and information retrieval tools used, there is no guarantee that all relevant or high-risk documents will be identified in large data collections. Moreover, different search methods may produce differing results, subject to a measure of statistical variation inherent in probability searches of any type. Just as a statistical sample of accounts receivable or accounts payable in the general ledger may not identify fraud, analytics reviews are similarly limited.
Text analytics can be a powerful fraud examination tool when integrated with traditional forensic data-gathering and analysis techniques such as interviews, independent research, and existing investigative tests involving structured, transactional data. For example, an anomaly identified in the general ledger related to the purchase of certain capital assets may prompt the examiner to review e-mail communication traffic among the key individuals involved, providing context around the circumstances and timing, of events before the entry date. Furthermore, the forensic accountant may conduct interviews or perform additional independent research that may support or conflict with his or her investigative hypothesis. Integrating all three of these components to gain a complete picture of the fraud event can yield valuable information. While text analytics should never replace the traditional rules-based analysis techniques that focus on the client’s financial accounting systems, it’s always equally important to consider the communications surrounding key events typically found in unstructured data, as opposed to that found in the financial systems.