One person's spam is another person's treasure

17 Jan

This is a post in an ongoing series going in-depth into the DoubleCheck Email Manager product. Don’t miss the first part of this series, summarizing the enhancements we’ve made to DoubleCheck in 2010.

Identifying spam and unwanted bulk email is inherently an imprecise art. One person’s spam is another person’s treasure. So how do you tailor the filtering to the individual preferences of each end-user? This is where our adaptive, hierarchical, statistical spam classification technology comes into play.

Broadly speaking, email falls into four categories:

  • Hard-core spam (everyone agrees that nobody wants it)
  • “Iffy” bulk email (e.g., an advertisement from a company that you bought something from 5 years ago, or maybe even their affiliate)
  • Double opt-in bulk email (something you signed up for and also confirmed by clicking a link in a confirmation email at signup)
  • Email you want that is addressed specifically to you (colloquially called “ham”)

DoubleCheck is composed of many different filtering layers and strategies that, generally speaking, sift email into three “buckets”: reject it outright (nobody wants it), quarantine it (in case it’s wanted), or deliver it (you definitely want it). The system can be fine-tuned for unusual requirements (e.g., message tagging), but this is how it usually works. Many of our filtering layers (global spamtraps, sender and domain reputation, message fingerprinting, etc.) can easily identify hard-core spam and iffy bulk email sent out en masse.
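To make the three buckets concrete, here is a minimal sketch in Python. It assumes the filtering layers have already been combined into a single spam probability between 0 and 1. The `route` function and its two thresholds are hypothetical names and values chosen for illustration, not DoubleCheck’s actual settings.

```python
from enum import Enum


class Bucket(Enum):
    REJECT = "reject"          # nobody wants it
    QUARANTINE = "quarantine"  # hold it in case it is wanted
    DELIVER = "deliver"        # the recipient definitely wants it


def route(spam_probability, reject_at=0.99, quarantine_at=0.5):
    """Map a combined spam probability to one of the three buckets.

    The threshold values are illustrative assumptions only.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= reject_at:
        return Bucket.REJECT
    if spam_probability >= quarantine_at:
        return Bucket.QUARANTINE
    return Bucket.DELIVER
```

In a sketch like this, the quarantine bucket is where the grey-area mail lands: the message is neither dropped nor delivered until a user has had a chance to release it.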

The grey area begins with other kinds of email. Let’s say one of your customers is a law firm focused on real estate, and another is in pharmaceuticals. For the law firm, receiving an email mentioning the names of drugs should be quite unusual, whereas it would be quite normal for the pharmaceutical customer. Likewise, the two would view an email about a mortgage approval quite differently. So “expected” mail patterns can differ between domains. But even individuals at the same company can have conflicting mail patterns. For example, let’s say one of the partners at the law firm attended medical school before pursuing law, and still maintains an interest in medical topics by subscribing to several medical newsletters. This individual would find it normal to receive medical-related email, whereas for the encompassing domain it would not be so normal. Additionally, the system needs to get better at identifying the messages that all users believe are spam.

Remember, eFolder’s goal is to make products that just work, without a lot of manual tweaking and adjusting. So how do we improve accuracy without requiring endless tweaking?

This is where the idea of hierarchical adaptivity comes into play. A unique feature of DoubleCheck is that it organizes users, domains, and groups of domains into a multilevel hierarchy of arbitrary depth. At each level in the hierarchy, DoubleCheck learns the mail patterns for that part of the hierarchy (for each user, for each domain, for each group of domains, etc.). When a message comes in, we apply some “secret sauce” to make a classification decision considering the collective intelligence of every level in the hierarchy that contains that user’s mailbox (the mailbox level, the domain level, groups of domains, globally, etc.). As users report missed spam or release wanted email from their quarantine, the system also updates the learning algorithms at each relevant level in the hierarchy. We’ve tuned the system so that individuals benefit from the overall mail patterns of their domain and the collective learning from all users, while the filtering remains highly customized to the types of email they prefer to receive.
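The actual “secret sauce” is not described here, so the following is only a sketch of one plausible way to combine levels, written in Python. It assumes each level keeps simple spam and ham counts per feature, and that each level’s estimate acts as the starting point for the more specific level beneath it. The `Level` class, its method names, and the `prior_strength` value are all hypothetical.

```python
from collections import Counter


class Level:
    """One node in the hierarchy: a mailbox, a domain, a group of domains, or global."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.spam = Counter()  # feature -> number of spam reports containing it
        self.ham = Counter()   # feature -> number of released/wanted messages containing it

    def chain(self):
        """Yield this level and every enclosing level, most specific first."""
        node = self
        while node is not None:
            yield node
            node = node.parent

    def learn(self, features, is_spam):
        """Record user feedback at this level and at every enclosing level."""
        for node in self.chain():
            counts = node.spam if is_spam else node.ham
            for feature in set(features):
                counts[feature] += 1

    def feature_spam_rate(self, feature, prior=0.5, prior_strength=2.0):
        """Estimate P(spam | feature), backing off from general to specific levels.

        Starting at the most general level, each level's estimate is used as
        the prior for the level below it, so sparse levels inherit from their
        parents while well-trained levels override them.
        """
        estimate = prior
        for node in reversed(list(self.chain())):
            s, h = node.spam[feature], node.ham[feature]
            estimate = (s + prior_strength * estimate) / (s + h + prior_strength)
        return estimate
```

Applied to the law firm example: if most mailboxes at the firm report messages containing a drug name as spam, a new mailbox with no history inherits the domain’s view, while the partner who keeps releasing medical newsletters from quarantine ends up with a much lower estimate for the same feature.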

So, how is the “learning” accomplished at each level in the hierarchy? In short, emails are broken up into a series of features that characterize the message, such as individual words, and full or partial phrases. The system uses statistics and probabilities to model the likelihood that individual features, and more importantly, combinations of features, are signs of spam. For example, the features “inheritance” and “million dollars” on their own perhaps are not strong indicators of spam, but taken together they may be a very strong indicator of spam (unless, perhaps, you have some wealthy and aging relatives!), especially if they are in a particular order in relation to each other.
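As a rough illustration of the idea (not DoubleCheck’s actual feature set or statistics), the Python sketch below extracts single words, adjacent two-word phrases, and ordered word pairs that occur near each other, then combines per-feature spam probabilities by summing their log-odds. The function names, the `max_gap` window, and the `a->b` notation for ordered pairs are assumptions made for this example.

```python
import math
import re


def extract_features(text, max_gap=5):
    """Break a message into words, adjacent phrases, and ordered nearby pairs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    features = set(words)
    # Adjacent two-word phrases, e.g. "million dollars".
    features.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    # Ordered pairs within a small window, e.g. "inheritance->million".
    # Order matters: "a->b" is only emitted when a appears before b.
    for i, first in enumerate(words):
        for second in words[i + 1 : i + 1 + max_gap]:
            features.add(f"{first}->{second}")
    return features


def combine(probabilities):
    """Combine per-feature spam probabilities by summing their log-odds."""
    log_odds = 0.0
    for p in probabilities:
        p = min(max(p, 1e-6), 1 - 1e-6)  # avoid log(0) on extreme values
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))
```

With this kind of model, “inheritance” and “million dollars” might each carry a probability only slightly above 0.5, while the ordered pair “inheritance->million” carries its own, much higher probability learned from feedback. Because the pair is a feature in its own right, the combination can tip a message toward spam even when the individual words would not.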

In summary, this is an important new layer in the DoubleCheck filtering regime. While no one layer will have perfect accuracy, this layer provides the extremely important ability for DoubleCheck to automatically adapt to each user’s perception of which email is spam and which is not. We carefully tune the system so that a smart classification decision is made for each message based on the results of many layers.

Less manual work + better accuracy = everyone’s happier.