Perfect Third Party Ads

Everyone who has ever worked on a high-profile web site built with web standards knows there’s an inevitable moment when the site has to be handed over to its new owners. More often than not, we fall in love with the site’s tight structure and semantic markup. My heart breaks at the very moment I have to insert a third-party advertisement script, which I know in advance is tag soup. But can we do something about it?

Lucky for Us, We Have Regular Expressions

With regular expressions and a bit of PHP we can clean up the HTML junk in third-party scripts and turn it into nice, tidy markup. I assume that you have at least a working knowledge of PHP.

Let’s look at a typical piece of HTML we often get from an ad supplier:

<style>
a { COLOR: RED; BACKGROUND: GREEN; }
a:hover { COLOR: YELLOW; BACKGROUND: ORANGE; }
</style>
<table border=0 width=100% background=#FFFFF>
<tr>
<td><a href="http://somesite.com/">ad link 1</a></td>
</tr>
<tr>
<td><a href="http://somesite.com/">ad link 2</a></td>
</tr>
<tr>
<td><a href="http://somesite.com/">ad link 3</a></td>
</tr>
</table>

The above is most likely to be inserted somewhere under the vertical navigation or among the rest of the ugly advertisement kids. The <style> element shouldn’t be anywhere but inside the <head> element. I also personally don’t like that table, so we’ll attempt to transform it into a much more appropriate unordered list. But I suppose that’s just my weird taste.

Clean it While It’s in the Buffer

I suppose that if you’re working on a content-driven, large-scale, high-profile web site, you already output everything from a buffer, but that’s a discussion for another occasion. It is best to use output buffering to clean bad markup before it’s sent to the browser.
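
As a rough sketch of the idea: ob_start() accepts a callback that receives the buffered output as a string and returns whatever should actually be sent. The function name tidy_ads() and its single replacement are placeholders for illustration, not the clean-up code developed later in this article.

```php
<?php
// Hypothetical callback: receives the buffered output as one string.
function tidy_ads($buffer) {
    // Placeholder clean-up: drop the presentational border attribute.
    return str_replace(' border=0', '', $buffer);
}

// Everything echoed from here on is collected instead of being sent.
ob_start('tidy_ads');
echo '<table border=0><tr><td>ad</td></tr></table>';
// Flushing runs the callback and sends its return value to the browser.
ob_end_flush();
```

The callback only sees the markup as text, so any string function can be used inside it.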

Careful Planning is the Key

First, let’s make a new structure. To make the HTML semantically correct we should place the above sample links into an unordered list. The final markup should look like the following:

<div id="advertisementId" class="ads">
<ul>
<li><a href="http://somesite.com/">ad link 1</a></li>
<li><a href="http://somesite.com/">ad link 2</a></li>
<li><a href="http://somesite.com/">ad link 3</a></li>
</ul>
</div>

The CSS should be appended to your main CSS file or, better yet (and if the ad supplier permits this kind of modification of the ads), you or your team’s designer should style the ad links according to the site’s main look and feel. That is still very rarely the case, but lately it’s getting better, especially with clients who begin to understand that what attracts visitors is the web site’s content.

If you start replacing without a plan, you’ll probably end up spending too many of your valuable hours, which could have been spent on cross-browser debugging or accessibility improvements, to name a few.

In PHP, strings are replaced with two very powerful functions: str_replace() and preg_replace(). The former is useful for small, fixed chunks and has the advantage of speed over the latter. preg_replace() deals well with regular expression patterns, but it is also more demanding on the server’s processor, which is something you don’t want to take lightly on a popular web site. However, if applied carefully, it shouldn’t noticeably affect performance.
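
To make the difference concrete, here is a minimal sketch that runs both functions on a made-up fragment. The sample strings and variable names are assumptions for illustration. str_ireplace(), the case-insensitive sibling of str_replace(), is included for comparison.

```php
<?php
$html = '<TD>ad link</td><td>another</td>';

// Fixed string, case-sensitive: only the lowercase <td> is replaced.
$a = str_replace('<td>', '<li>', $html);

// Fixed string, case-insensitive: both opening cells are replaced.
$b = str_ireplace('<td>', '<li>', $html);

// Pattern: matches an opening cell with any attributes, in any letter case.
$c = preg_replace('/<td[^>]*>/i', '<li>', '<TD width=100%>ad link</td>');

echo $a, "\n";  // <TD>ad link</td><li>another</td>
echo $b, "\n";  // <li>ad link</td><li>another</td>
echo $c, "\n";  // <li>ad link</td>
```

When the text to replace is fixed, the plain string functions are enough. A pattern earns its cost once attributes or spacing can vary.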

Step By Step Replacement

First things first: let’s remove white space between HTML elements. It will save us a lot of trouble later:

$content = preg_replace('/>(\s|\n|\r)*</si', '><', $content);
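
To see what this step does in isolation, here is a small sketch. It uses the shorter pattern />\s+</ (an assumed simplification, since \s already matches \n and \r) and made-up sample strings.

```php
<?php
$html = "<tr>\r\n  <td><a href=\"#\">ad</a></td>\n</tr>";

// Collapse any run of white space that sits between two tags.
$tight = preg_replace('/>\s+</', '><', $html);
echo $tight, "\n";  // <tr><td><a href="#">ad</a></td></tr>

// Caveat: a meaningful space between inline elements disappears too.
$inline = preg_replace('/>\s+</', '><', '<a>one</a> <a>two</a>');
echo $inline, "\n";  // <a>one</a><a>two</a>
```

That caveat is harmless for the table-based ad used here, but it is worth keeping in mind if the buffer also contains running text.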

The next step is also very simple – we’ll remove everything within <style> and </style>, including those two.

$content = preg_replace('/<style.*?style>/si', '', $content);
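
Since the removed rules are meant to end up in the main style sheet, it can help to capture them before deleting them. The following is only a sketch: the $css variable, the sample string and the slightly stricter pattern (which spells out the closing tag) are illustrative additions, not part of the original code.

```php
<?php
$content = "<style>\na { COLOR: RED; }\n</style><table><tr><td>ad</td></tr></table>";

// Collect the contents of every <style> block before deleting them.
preg_match_all('/<style[^>]*>(.*?)<\/style>/si', $content, $matches);
$css = trim(implode("\n", $matches[1]));

// Stricter variant of the removal step, matching the full closing tag.
$content = preg_replace('/<style[^>]*>.*?<\/style>/si', '', $content);

echo $css, "\n";      // a { COLOR: RED; }
echo $content, "\n";  // <table><tr><td>ad</td></tr></table>
```

The captured rules can then be reviewed by hand and merged into the site’s own CSS.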

Now that we have got rid of the improperly placed <style> element, we should remove the <table> tags and place all those table rows into a division. At this point we can also define an id and a class attribute for that division and add the unordered list that will hold the list items.

$content = preg_replace('/<table.*?>(.*)?<\/table>/si', '<div id="advertisementId" class="ads"><ul>$1</ul></div>', $content);

We simply pulled everything inside the <table></table> and pushed it into a <div id="advertisementId" class="ads"><ul></ul></div>. This is still pretty unappetizing tag soup, but the only thing left to do is to transform each table row into a list item…

$content = preg_replace('/<tr><td>/si', '<li>', $content);
$content = preg_replace('/<\/td><\/tr>/si', '</li>', $content);

… and there it is: perfectly tight markup. Below is the complete code, which you can copy to a file with a .php extension and try in the safety of your home or office:

<?php
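function clean_HTML($content) {
   // Remove white space between tags
   $content = preg_replace('/>(\s|\n|\r)*</si', '><', $content);
   // Drop the misplaced <style> element
   $content = preg_replace('/<style.*?style>/si', '', $content);
   // Swap the table wrapper for a division containing an unordered list
   $content = preg_replace('/<table.*?>(.*)?<\/table>/si', '<div id="advertisementId" class="ads"><ul>$1</ul></div>', $content);
   // Turn each table row into a list item
   $content = preg_replace('/<tr><td>/si', '<li>', $content);
   $content = preg_replace('/<\/td><\/tr>/si', '</li>', $content);
   return $content;
}
// Run the page output through clean_HTML() before it is sent
ob_start('clean_HTML');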
?>

<style>
a { COLOR: RED; BACKGROUND: GREEN; }
a:hover { COLOR: YELLOW; BACKGROUND: ORANGE; }
</style>
<table border=0 width=100% background=#FFFFF>
<TR>
<TD><a href="http://somesite.com/">ad link 1</a></td>
</tr>
<tr>
<td><a href="http://somesite.com/">ad link 2</a></td>
</tr>
<tr>
<td><a href="http://somesite.com/">ad link 3</a></td>
</tr>
</table>
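
The function can also be checked without a browser by calling it directly on a string. The sketch below assumes a shortened sample ad and an $ad variable chosen for illustration. The first pattern is written with its closing delimiter before the flags, so that preg_replace() accepts it.

```php
<?php
// Same steps as the article's listing, applied to a string instead of the buffer.
function clean_HTML($content) {
    $content = preg_replace('/>(\s|\n|\r)*</si', '><', $content);
    $content = preg_replace('/<style.*?style>/si', '', $content);
    $content = preg_replace('/<table.*?>(.*)?<\/table>/si', '<div id="advertisementId" class="ads"><ul>$1</ul></div>', $content);
    $content = preg_replace('/<tr><td>/si', '<li>', $content);
    $content = preg_replace('/<\/td><\/tr>/si', '</li>', $content);
    return $content;
}

// Shortened, hypothetical stand-in for the supplier's markup.
$ad = <<<'HTML'
<style>
a { COLOR: RED; }
</style>
<table border=0 width=100%>
<TR>
<TD><a href="http://somesite.com/">ad link 1</a></td>
</tr>
<tr>
<td><a href="http://somesite.com/">ad link 2</a></td>
</tr>
</table>
HTML;

$result = clean_HTML($ad);
echo $result, "\n";
```

The output is the planned division and list on a single line, because the first step strips the line breaks between tags.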

Marko Dugonjić is a designer specialized in user experience design, web typography and web standards. He runs a nanoscale user interface studio Creative Nights and organizes FFWD.PRO, a micro-conference and workshops for web professionals.
