A couple of weeks ago, Microsoft’s search engine Bing began a search insights series (the first post was about the results page) to educate users about how Bing actually works and to offer an inside look at recent algorithm changes and updates.
Yesterday the second post of the series was published, this time about how Bing deals with junk links and snippets. Dr. Richard Qian, the core search development manager on the Bing team, explains how his team handles the different sorts of junk links as well as junky and empty snippets.
Junk Link 1 – Dead Links
The dry definition of a “dead link” is a page that returns a 4xx or 5xx error code, which means that instead of real content users simply receive an error message. Linking to this kind of page from the search results obviously makes for a bad user experience.
However, removing these pages from the search index immediately is problematic: often the error is only temporary (a server failure, for example) and may be resolved quickly. Therefore, Bing operates “classifiers” that determine whether the problem is temporary or permanent.
If the classifiers can’t provide a sufficient answer, Bing may increase the crawl frequency of the page so that it is crawled again shortly after the error is first discovered.
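To make the temporary-versus-permanent idea concrete, here is a minimal Python sketch. It is not Bing's classifier: the function names, the three-error threshold, and the recrawl intervals are all assumptions chosen for illustration.

```python
# Illustrative only: names, thresholds and intervals are assumptions,
# not details published by Bing.
PERMANENT_STREAK = 3  # assumed: this many consecutive errors counts as permanent


def classify_dead_link(history):
    """Classify a page from its crawl history (HTTP status codes, oldest first)."""
    if not history:
        return "unknown"
    if history[-1] < 400:
        return "ok"
    # Count how many of the most recent crawls in a row returned an error.
    streak = 0
    for status in reversed(history):
        if status < 400:
            break
        streak += 1
    latest = history[-1]
    if latest == 410 or streak >= PERMANENT_STREAK:
        return "permanent"   # 410 Gone, or errors that keep repeating
    if latest >= 500:
        return "temporary"   # server-side failures often resolve themselves
    return "unknown"         # e.g. a single 404: not enough evidence yet


def recrawl_delay_hours(verdict, normal_hours=168.0):
    """Pick the delay before the next crawl; None means no recrawl is scheduled."""
    if verdict == "permanent":
        return None                    # candidate for removal from the index
    if verdict == "unknown":
        return min(6.0, normal_hours)  # crawl again soon to gather evidence
    if verdict == "temporary":
        return min(24.0, normal_hours)
    return normal_hours
```

For example, `classify_dead_link([200, 503])` is treated as temporary, while a lone 404 after a successful crawl stays "unknown" and is simply scheduled for an earlier recrawl.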
Junk Link 2 – Soft 404
A soft 404 is a page that returns a 200 status code when it should in fact return a 404. In plain language: even though the page has been deleted from the site, the server is still “telling” the search bot that it exists.
Again, Bing’s classifiers try to determine whether the page is legitimate by scanning its title, content and URL.
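A rough sketch of how title, content and URL could be combined into a soft 404 check is shown below. The phrase list, the URL pattern, the weights and the threshold are hypothetical; the post does not describe what signals Bing's classifiers actually use.

```python
import re

# Hypothetical signals; a real classifier would be trained, not hand-written.
NOT_FOUND_PHRASES = (
    "page not found",
    "404 not found",
    "no longer available",
    "page does not exist",
)
ERROR_URL = re.compile(r"(^|[/_\-=])(404|not[-_]?found|error)([/_\-.?]|$)", re.IGNORECASE)


def soft_404_score(status, url, title, body_text):
    """Return a score from 0 to 10; higher means more likely a soft 404."""
    if status != 200:
        return 0  # genuine error codes are a different problem (dead links)
    score = 0
    if any(p in title.lower() for p in NOT_FOUND_PHRASES):
        score += 5  # the title is the strongest hint in this sketch
    body = body_text.lower()
    # Only trust the body phrase on short pages; long articles may quote it.
    if len(body.split()) < 80 and any(p in body for p in NOT_FOUND_PHRASES):
        score += 3
    if ERROR_URL.search(url):
        score += 2
    return score


def looks_like_soft_404(status, url, title, body_text, threshold=5):
    return soft_404_score(status, url, title, body_text) >= threshold
```

A page served with status 200, titled "Page Not Found" and containing only a short apology would score 8 here and be flagged, while an ordinary product page would score 0.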
Junk Link 3 – Parked Domains
Parked domains are websites that don’t yet contain anything except ads (such as AdSense for Domains, which is set to close soon). Almost no user wants to land on this kind of website, as it provides no real value.
The problem is that a parked domain has the basic structure of a regular website. To identify parked domains effectively, Bing maintains a signature index of parked-domain patterns.
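As a rough illustration of the signature-index idea (not Bing's actual index, whose contents are not public), a page's text can be checked against a list of known parked-page patterns. The patterns and the two-hit threshold below are made up for the example.

```python
import re

# Hypothetical signatures of parked pages; Bing's real index is not public.
PARKED_SIGNATURES = [
    re.compile(r"this domain (name )?is (for sale|parked)", re.IGNORECASE),
    re.compile(r"buy this domain", re.IGNORECASE),
    re.compile(r"related (searches|links)", re.IGNORECASE),
    re.compile(r"sponsored listings", re.IGNORECASE),
]


def matching_signatures(page_text):
    """Return the patterns from the signature list that appear in the page."""
    return [sig.pattern for sig in PARKED_SIGNATURES if sig.search(page_text)]


def is_parked(page_text, min_hits=2):
    """Flag the page as parked when enough distinct signatures match."""
    return len(matching_signatures(page_text)) >= min_hits
```

Requiring more than one match is a design choice in this sketch: a single phrase such as "related links" also appears on legitimate sites, so one hit alone is weak evidence.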
Junky snippets appear when some sort of code has “infiltrated” one of the meta tags and produces a garbage description or title. Bing uses multiple detectors (a document converter and an HTML parser, for instance) to identify these junky snippets and replace them with more suitable text from the page.
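One simple way to picture such a detector is a set of patterns that look for code fragments inside a meta description. The patterns and function names below are assumptions for illustration only and say nothing about how Bing's detectors are built.

```python
import re

# Hypothetical junk patterns: leftover markup, unrendered templates, script code.
JUNK_PATTERNS = [
    re.compile(r"<[^>]+>"),                           # stray HTML or PHP tags
    re.compile(r"\{\{.*?\}\}|\{%.*?%\}|<\?|\$\{"),    # unrendered template syntax
    re.compile(r"\b(function|var)\s+\w+\s*[=(]|document\.write"),  # script fragments
]


def is_junky(text):
    """True when the text looks like code rather than a human-readable summary."""
    return any(p.search(text) for p in JUNK_PATTERNS)


def replace_if_junky(meta_description, body_text, max_len=160):
    """Keep a clean description; otherwise fall back to the start of the body text."""
    if not is_junky(meta_description):
        return meta_description.strip()
    return body_text.strip()[:max_len]
```

With these patterns, a description such as `{{ page.description }}` would be discarded in favour of the page's own text, while an ordinary sentence passes through unchanged.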
Unlike junky snippets, which contain garbage text, empty snippets contain no text at all. This can happen for many reasons: for instance, AJAX and Flash pages don’t offer the necessary snippet text, and sometimes the webmaster simply forgot to add it. In this case as well, Bing’s detectors and classifiers find suitable text from the page.
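The fallback for an empty snippet can be sketched with Python's standard `html.parser` module: use the meta description when it exists, otherwise build a snippet from the visible page text. The class name, function name and 160-character limit are assumptions, and this is only one plausible approach rather than Bing's method.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the meta description and the visible text of a page."""

    SKIP = {"script", "style", "noscript", "title"}  # content never shown as body text

    def __init__(self):
        super().__init__()
        self.parts = []
        self.meta_description = ""
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip()
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def snippet_for(html, max_len=160):
    """Return the meta description, or the start of the visible text if it is empty."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    if parser.meta_description:
        return parser.meta_description
    return " ".join(parser.parts)[:max_len]
```

Note that this only helps when the text is present in the HTML itself; content that an AJAX page loads later, or that lives inside a Flash object, would not be visible to a parser like this.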