Analytics Confronts the Normal

The Information Audit and Control Association (ISACA) tells us that we produce and store more data in a day now than mankind did altogether in the last 2,000 years. The data that is produced daily is estimated to be one exabyte, which is the computer storage equivalent of one quintillion bytes, which is the same as one million terabytes. Not too long ago, about 15 years, a terabyte of data was considered a huge amount of data; today the latest Swiss Army knife comes with a 1 terabyte flash drive.

When an interaction with a business is complete, the information from the interaction is only as good as the pieces of data that get captured during that interaction. A customer walks into a bank and withdraws cash. The transaction that just happened gets stored as a monetary withdrawal transaction with certain characteristics in the form of associated data. There might be information on the date and time when the withdrawal happened; there may be information on which customer made the withdrawal (if there are multiple customers who operate the same account). The amount of cash that was withdrawn, the account from which the money was extracted, the teller/ATM who facilitated the withdrawal, the balance on the account after the withdrawal, and so forth, are all typically recorded. But these are just a few of the data elements that can get captured in any withdrawal transaction. Just imagine all the different interactions possible on all the assorted products that a bank has to offer: checking accounts, savings accounts, credit cards, debit cards, mortgage loans, home equity lines of credit, brokerage, and so on. The data that gets captured during all these interactions goes through data-checking processes and gets stored somewhere internally or in the cloud.  The data that gets stored this way has been steadily growing over the past few decades, and, most importantly for fraud examiners, most of this data carries tons of information about the nuances of the individual customers’ normal behavior.

In addition to what the customer does, from the same data, by looking at a different dimension of the data, examiners can also understand what is normal for certain other related entities. For example, by looking at all the customer withdrawals at a single ARM, CFEs can gain a good understanding of what is normal for that particular ATM terminal.  Understanding the normal behavior of customers is very useful in detecting fraud since deviation from normal behavior is a such a primary indicator of fraud. Understanding non-fraud or normal behavior is not only important at the main account holder level but also at all the entity levels associated with that individual account. The same data presents completely different information when observed in the context of one entity versus another. In this sense, having all the data saved and then analyzed and understood is a key element in tackling the fraud threat to any organization.

Any systematic, numbers-based system of understanding of the phenomenon of fraud as a past occurring event is dependent on an accurate description of exactly what happened through the data stream that got accumulated before, during, and after the fraud scenario occurred. Allowing the data to speak is the key to the success of any model-based system. This data needs to be saved and interpreted very precisely for the examiner’s models to make sense. The first crucial step to building a model is to define, understand, and interpret fraud scenarios correctly. At first glance, this seems like a very easy problem to solve. In practical terms, it is a lot more complicated process than it seems.

The level of understanding of the fraud episode or scenario itself varies greatly among the different business processes involved with handling the various products and functions within an organization. Typically, fraud can have a significant impact on the bottom line of any organization. Looking at the level of specific information that is systematically stored and analyzed about fraud in financial institutions for example, one would arrive at the conclusion that such storage needs to be a lot more systematic and rigorous than it typically is today. There are several factors influencing this. Unlike some of the other types of risk involved in client organizations, fraud risk is a censored problem. For example, if we are looking at serious delinquency, bankruptcy, or charge-off risk in credit card portfolios, the actual dollars-at-risk quantity is very well understood. Based on past data, it is relatively straightforward to quantify precise credit dollars at risk by looking at how many customers defaulted on a loan or didn’t pay their monthly bill for three or more cycles or declared bankruptcy. Based on this, it is easy to quantify the amount at risk as far as credit risk goes. However, in fraud, it is virtually impossible to quantify the actual amount that would have gone out the door as the fraud is stopped immediately after detection. The problem is censored as soon as some intervention takes place, making it difficult to precisely quantify the potential risk.

Another challenge in the process of quantifying fraud is how well the fraud episode itself gets recorded. Consider the case of a credit card number getting stolen without the physical card getting stolen. During a certain period, both the legitimate cardholder and the fraudster are charging using the card. If the fraud detection system in the issuing institution doesn’t identify the fraudulent transactions as they were happening in real time, typically fraud is identified when the cardholder gets the monthly statement and figures out that some of the charges were not made by him/her. Then the cardholder calls the issuer to report the fraud.  In the not too distant past, all that used to get recorded by the bank was the cardholder’s estimate of when the fraud episode began, even though there were additional details about the fraudulent transactions that were likely shared by the cardholder. If all that gets recorded is the cardholder’s estimate of when the fraud episode began, ambiguity is introduced regarding the granularity of the actual fraud episode. The initial estimate of the fraud amount becomes a rough estimate at best.
In the case in which the bank’s fraud detection system was able to catch the fraud during the actual fraud episode, the fraudulent transactions tended to be recorded by a fraud analyst, and sometimes not too accurately. If the transaction was marked as fraud or non-fraud incorrectly, this problem was typically not corrected even after the correct information flowed in. When eventually the transactions that were actually fraudulent were identified using the actual postings of the transactions, relating this back to the authorization transactions was often not a straightforward process. Sometimes the amounts of the transactions may have varied slightly. For example, the authorization transaction of a restaurant charge is sometimes unlikely to include the tip that the customer added to the bill. The posted amount when this transaction gets reconciled would look slightly different from the authorized amount. All of this poses an interesting challenge when designing a data-driven analytical system to combat fraud.

The level of accuracy associated with recording fraud data also tends to be dependent on whether the fraud loss is a liability for the customer or to the financial institution. To a significant extent, the answer to the question, “Whose loss is it?” really drives how well past fraud data is recorded. In the case of unsecured lending such as credit cards, most of the liability lies with the banks, and the banks tend to care a lot more about this type of loss. Hence systems are put in place to capture this data on a historical basis reasonably accurately.

In the case of secured lending, ID theft, and so on, a significant portion of the liability is really on the customer, and it is up to the customer to prove to the bank that he or she has been defrauded. Interestingly, this shift of liability also tends to have an impact on the quality of the fraud data captured. In the case of fraud associated with automated clearing house (ACH) batches and domestic and international wires, the problem is twofold: The fraud instances are very infrequent, making it impossible for the banks to have a uniform method of recording frauds; and the liability shifts are dependent on the geography.  Most international locations put the onus on the customer, while in the United States there is legislation requiring banks to have fraud detection systems in place.

The extent to which our client organizations take responsibility also tends to depend on how much they care about the customer who has been defrauded. When a very valuable customer complains about fraud on her account, a bank is likely to pay attention.  Given that most such frauds are not large scale, there is less need to establish elaborate systems to focus on and collect the data and keep track of past irregularities. The past fraud information is also influenced heavily by whether the fraud is third-party or first-party fraud. Third-party fraud is where the fraud is committed clearly by a third party, not the two parties involved in a transaction. In first-party fraud, the perpetrator of the fraud is the one who has the relationship with the bank. The fraudster in this case goes to great lengths to prevent the banks from knowing that fraud is happening. In this case, there is no reporting of the fraud by the customer. Until the bank figures out that fraud is going on, there is no data that can be collected. Also, such fraud could go on for quite a while and some of it might never be identified. This poses some interesting problems. Internal fraud where the employee of the institution is committing fraud could also take significantly longer to find. Hence the data on this tends to be scarce as well.

In summary, one of the most significant challenges in fraud analytics is to build a sufficient database of normal client transactions.  The normal transactions of any organization constitute the baseline from which abnormal, fraudulent or irregular transactions, can be identified and analyzed.  The pinpointing of the irregular is thus foundational to the development of the transaction processing edits which prevent the irregular transactions embodying fraud from even being processed and paid on the front end; furnishing the key to modern, analytically based fraud prevention.

Leave a Reply