A Microsoft patent application has just been published that describes in detail the anti-phishing "predictive model" technology incorporated into Outlook and Outlook Express, which may also be made available to third-party providers.
The application is entitled "Finding phishing sites." It is arguably the most thorough description available of how Microsoft email software attempts to detect phishing email.
A useful preview of what is being described here is contained in the Abstract:
Described is a technology by which phishing-related data sources are processed into aggregated data and a given site is evaluated against the aggregated data using a predictive model to automatically determine whether the given site is likely to be a phishing site. The predictive model may be built using machine learning based on training data, e.g., including known phishing sites and/or known non-phishing sites.
To determine whether an object corresponding to a site is likely a phishing-related object, various criteria are evaluated, including one or more features of the object when evaluated. The determination is output in some way, e.g., made available to a reputation service, used to block access to a site or warn a user before allowing access, and/or used to assist a hand grader in being more efficient in evaluating sites.
The exact criteria used to flag emails as phishing attempts are fascinating. To show how this is done, I'd like to walk through the accompanying documentation, which goes into detail about how this anti-phishing detection technology works.
To summarize, the anti-phishing system uses various data sources to find phishing sites, including sources that are closely affiliated with the email and internet access services being offered (i.e., non-third-party sources). The combination of sources provides a stronger model, especially when aggregated across both email-based and browser-based sources, and the model is further strengthened by using data sources that contain known non-phishing sites (e.g., good mail from the feedback loop, or FBL).
Features are extracted about the sites, including aggregations done at a host/site level, and probabilistic models are used to make predictions regarding phishing sites. The probabilities that are found may be used to automatically warn users or block users from visiting dangerous sites, as well as to help human graders be more efficient in grading. Trend analysis may be used as well, e.g., spikes or anomalies may be an indicator of something unusual.
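To make the trend-analysis idea concrete, here is a minimal sketch of how a spike in daily reports for one host might be flagged. This is my own illustration, not the method in the application: the function name, the minimum history length, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, pstdev


def is_spike(daily_counts, threshold=3.0):
    """Return True if the latest count is far above the earlier history.

    daily_counts: report counts per day for one host, oldest first.
    threshold: hypothetical cutoff, in standard deviations above the mean.
    """
    if len(daily_counts) < 3:
        return False  # too little history to judge
    history, latest = daily_counts[:-1], daily_counts[-1]
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest > mu  # flat history: any increase stands out
    return (latest - mu) / sigma > threshold
```

A host that normally draws two or three reports a day and suddenly draws forty would be flagged, while ordinary day-to-day variation would not.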
In general, the system 300 works by monitoring how web hosts or site URLs appear in the data sources that are available, and by using machine learning to build models that predict the probability that any given host is a phishing site from the way a host appears in the data sources.
For example, consider a host that gets reported as a false negative (FN) by an internet service user: the phishing filter indicated the site was safe, but the user thinks it is a phish. If that host appeared ten times in a feedback loop on messages that got a `SenderID pass` from a known good sender, then the system may be fairly certain that the reported host is not a phishing site.
The system would be more suspicious when the host is a numeric IP, and it appears in ten spam messages in an email service feedback loop, and in every one of these messages it is an exception with a known phishing target.
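The two scenarios above can be sketched as host-level aggregation followed by a probabilistic score. The application describes models learned from training data; the logistic function, the feature names, and the hand-picked weights below are my own assumptions, chosen only to show the shape of the idea.

```python
import math

# Hypothetical weights; a real system would learn these from training data.
WEIGHTS = {
    "spam_reports": 0.3,      # times the host appeared in spam-reported mail
    "good_sid_pass": -0.5,    # times it appeared in Sender ID pass mail from a good sender
    "omo_with_target": 0.4,   # times it was the exception alongside a phishing target
    "is_numeric_host": 1.5,   # 1 if the host is a raw IP address
}
BIAS = -2.0  # hypothetical prior: most hosts are not phishing sites


def aggregate_host(reports, is_numeric_host):
    """Collapse one host's per-message reports into host-level counts."""
    return {
        "spam_reports": sum(r.get("is_spam", False) for r in reports),
        "good_sid_pass": sum(r.get("sid_pass_good_sender", False) for r in reports),
        "omo_with_target": sum(r.get("omo_with_target", False) for r in reports),
        "is_numeric_host": int(is_numeric_host),
    }


def phishing_probability(features):
    """Logistic score over the aggregated features of one host."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Ten good-sender messages versus ten spam messages on a numeric host.
benign = aggregate_host([{"sid_pass_good_sender": True}] * 10, False)
suspicious = aggregate_host([{"is_spam": True, "omo_with_target": True}] * 10, True)
```

With these made-up weights, `phishing_probability(benign)` comes out close to zero and `phishing_probability(suspicious)` close to one, which matches the intuition in the text.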
Whenever a new message or report arrives on one of the data sources 301-307, the message or complaint report is scanned by the system 300, any web hosts (or site URLs) it contains are extracted, and a report is recorded.
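As a rough sketch of the extraction step, the snippet below pulls URLs out of a message body and collects the distinct hosts. The regular expression and the function name are my own assumptions; the application does not say how the scanning is implemented.

```python
import re
from urllib.parse import urlsplit

# Simplified pattern: http(s) URLs up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"""https?://[^\s<>"']+""", re.IGNORECASE)


def extract_hosts(message_body):
    """Return the distinct hosts of all URLs found in the text, in order."""
    hosts = []
    for url in URL_PATTERN.findall(message_body):
        try:
            hostname = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if hostname and hostname not in hosts:
            hosts.append(hostname)
    return hosts
```

Each host returned this way would then get its own report record.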
For example, with respect to email messages, for every URL in a message that arrives via the feedback loop, properties are recorded in the report. Such properties may include:
GUID--the GUID of the message (note that this is an identifier for the message, not technically a property used for determining phish/not-phish)
reportTime--the time the Feedback user reported the message as spam or good
rcvdTime--the time the Feedback user received the message
host--the host of the URL (e.g., foo.com for three letter TLDs, bar.bax.us for two letter TLDs, and the complete IP address for numeric hosts)
url
hasTargetWordBody--True if the body of the message contains one of several dozen phishing related words (including commonly phished brands, login, password, and so forth)
hasTargetPRD--true if the PRD (purported responsible domain) of the message is a commonly phished domain
hasPhishTrick--True if the message has any URL that triggers one of the phishing heuristics
hasSIDFail--True if the message has a fail result code from a sender ID check
numDomainsInMessage--The number of unique domains of web hosts in the message
isOMOWithTargetDomain--True if the host from this report is not a commonly phished domain, but every other web host in the message is a commonly phished domain
isWithTargetDomain--True if the host from this message is not a commonly phished domain, but there is a web host in the message from a commonly phished domain
isTargetDomain--True if the host from this report is from a commonly phished domain
isTopTrafficDomain--True if the host from this report is on a top traffic list
isNumericHost--True if the host from this report is a numeric IP address
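The host rule in the list above (foo.com for three-letter TLDs, bar.bax.us for two-letter TLDs, the complete IP address for numeric hosts) can be sketched as follows. The function names are hypothetical, and treating every TLD that is not two letters long like a three-letter one is my assumption.

```python
import ipaddress
from urllib.parse import urlsplit


def is_numeric_host(hostname):
    """True if the hostname is a literal IP address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def report_host(url):
    """Reduce a URL to the host form recorded in a report."""
    hostname = urlsplit(url).hostname or ""
    if is_numeric_host(hostname):
        return hostname  # keep the complete IP address
    labels = hostname.split(".")
    # Two-letter TLDs keep three labels; everything else keeps two (assumption).
    keep = 3 if len(labels[-1]) == 2 else 2
    return ".".join(labels[-keep:])
```

For example, `report_host("http://www.foo.com/login")` gives `foo.com`, `report_host("http://a.bar.bax.us/x")` gives `bar.bax.us`, and a URL on `192.168.1.10` keeps the full address.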
For every browser-initiated report that arrives, properties including some or all of the following examples may be recorded in the report:
GUID--the GUID of the report (again, similar to a message GUID, an identifier for the report, not technically a property used for determining phish/not-phish)
Host--the host of the reported URL (foo.com for 3 letter TLDs, bar.bax.us for 2 letter TLDs, and the complete IP address for numeric hosts)
reportTime--the time the URL was reported
isTargetDomain--True if the host from this report is from a commonly phished domain
isTopTrafficDomain--True if the host from this report is on a top traffic list
isFp--True if the browser phishing filter marked the URL as phish but the user reports that it is not phishing
isTp--True if the browser phishing filter marked the URL as phish and the user reports that it is phishing
isFn--True if the browser phishing filter marked the URL as not phish but the user reports that it is phishing
isTn--True if the browser phishing filter marked the URL as not phish and the user reports that it is not phishing
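The four outcome flags are simply the cells of a confusion matrix built from the filter's verdict and the user's report. A minimal sketch, with a hypothetical function name and parameter names of my own choosing:

```python
def outcome_flags(filter_says_phish, user_says_phish):
    """Derive the four mutually exclusive outcome flags for one report."""
    return {
        "isFp": filter_says_phish and not user_says_phish,
        "isTp": filter_says_phish and user_says_phish,
        "isFn": not filter_says_phish and user_says_phish,
        "isTn": not filter_says_phish and not user_says_phish,
    }
```

Exactly one flag is true for any report, so a user who reports a phish that the filter passed as safe produces a record with only `isFn` set.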
The system 300 also has two sources of classifications that it may monitor, including grader marks on the browser-generated reports and hand-tagged FBL messages.
With respect to grader marks on the browser-generated reports, each complaint report may be eventually examined by a human grader who may give it one of the following marks:
phish=true phishing URL
nocat=not a phishing site
dead=the site was unreachable
placeholder=the site was a placeholder page
foreign=foreign site could not be graded
redirect=the site was a redirector
norepro=the grader couldn't reproduce what the customer said about the site
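One way to picture how these marks could feed a training set is sketched below. The enumeration mirrors the marks listed above; the mapping to labels is my assumption, namely that only "phish" and "nocat" yield a usable label and that the other marks are left unlabeled.

```python
from enum import Enum


class GraderMark(Enum):
    PHISH = "phish"              # true phishing URL
    NOCAT = "nocat"              # not a phishing site
    DEAD = "dead"                # site was unreachable
    PLACEHOLDER = "placeholder"  # site was a placeholder page
    FOREIGN = "foreign"          # foreign site could not be graded
    REDIRECT = "redirect"        # site was a redirector
    NOREPRO = "norepro"          # grader could not reproduce the complaint


def training_label(mark):
    """Map a grader mark to 1 (phish), 0 (not phish), or None (no label).

    Assumption: inconclusive marks are excluded from training.
    """
    if mark is GraderMark.PHISH:
        return 1
    if mark is GraderMark.NOCAT:
        return 0
    return None
```

A mark read from a report can be parsed with `GraderMark("phish")` and passed straight to `training_label`.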