@twaddington sparked a Twitter debate when he tweeted:
Why do people still insist on posting their emails as "something [at] domain [dot] com?" #petpeve
Below is the conversation that ensued.
- twaddington: Why do people still insist on posting their emails as "something [at] domain [dot] com?" #petpeve
- nickcummings: @twaddington I'm sure most crawlers can parse those sorts of things by now, so what's a more secure way to list an email address online?
- aaronpk: @twaddington Agreed. I stopped doing that and other obfuscation techniques when I started forwarding everything to Gmail.
- aaronpk: @nickcummings I've seen people write something like "email 'aaron' at the domain you're looking at"
- nickcummings: @aaronpk That's what I've been doing on @sasquatchgaming, and so far we haven't received any spam. :)
- nickcummings: @aaronpk To clarify: We're not doing the stupid ___ [at] ____.com thing. We've evolved past that. Take that, robots!
- kchrist: @twaddington I'll stop doing that when spammers stop harvesting email addresses.
- twaddington: .@kchrist you think a simple spider using a regex can't figure that out? @nickcummings I've never gotten spam from listing my email.
- twaddington: Mainly it's a usability barrier. Links were designed to be clicked!
- kchrist: @twaddington Not if there are hidden markup tags in the middle of it.
- aaronpk: @kchrist can't hidden markup tags in an email address just be removed by a call to strip_tags() or equivalent? /cc @twaddington
- lvidmar: @aaronpk Nobody ever claimed that spammers were geniuses. /cc @twaddington @kchrist
- kchrist: @aaronpk They need to identify the email addr in the text first. "<span>user</span> at <span>domain</span>. com" doesn't match any pattern.
- twaddington: @kchrist, @aaronpk, @lvidmar but have any of you ever received spam from publishing your email on your site?
- aaronpk: @twaddington i'm not sure how much spam is a result of publishing my address, but gmail filters everything out 99.9% perfectly.
- twaddington: @kchrist strpos("contact") or strpos("email") then strip_tags() then regex lookaround for something "at" somethingelse "tld" /cc @aaronpk
- twaddington: @aaronpk I haven't gotten much spam at my new account since I switched. I think forums are a big target for email harvesting. /cc @kchrist
Afterwards, I wrote a quick bit of PHP to test scraping email addresses from a website.
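<?php
// $html must hold the source of the page being scraped; the URL here is a placeholder.
$html = file_get_contents('http://example.com/');
// Strip markup so that addresses split across tags become plain text.
$plain = strip_tags($html);
// Undo common substitutions: the &#64; and &#46; entities and the words "at" and "dot".
$plain = preg_replace(array('/&#0*64;/', '/\s+at\s+/', '/&#0*46;/', '/\s+dot\s+/'), array('@', '@', '.', '.'), $plain);
// Collect everything that now looks like an email address.
if(preg_match_all('/[a-z0-9_-]+@[a-z0-9_-]+\.[a-z0-9]{2,4}/i', $plain, $matches))
{
print_r($matches);
}
?>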
This works surprisingly well, and only needs a few additional cases put into the preg_replace line to match things like [at] instead of just "at".
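A few of those additional cases might look like the sketch below. It is not the script I ran above: the function name `deobfuscate_emails` and the sample string are made up for illustration, the bracket patterns only cover `[at]`, `(at)`, `{at}` and the matching `dot` forms, and it assumes entities can simply be decoded with `html_entity_decode()`.

```php
<?php
// Hypothetical helper (not part of the original test script): normalizes common
// word-substitution obfuscations, then extracts anything shaped like an address.
function deobfuscate_emails($html)
{
    // Drop markup and decode entities such as &#64; and &#46;
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    $patterns = array(
        '/\s*[\[({]\s*at\s*[\])}]\s*/i',   // [at], (at), {at}
        '/\s*[\[({]\s*dot\s*[\])}]\s*/i',  // [dot], (dot), {dot}
        '/\s+at\s+/i',                     // bare " at "
        '/\s+dot\s+/i',                    // bare " dot "
    );
    $replacements = array('@', '.', '@', '.');
    $plain = preg_replace($patterns, $replacements, $plain);

    // Local part, "@", one or more domain labels, then a TLD of two or more letters.
    preg_match_all('/[a-z0-9._-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}/i', $plain, $matches);
    return $matches[0];
}

print_r(deobfuscate_emails('<p>Mail something [at] domain [dot] com or <span>user</span> at <span>example</span> dot org</p>'));
```

The bracketed patterns run before the bare-word ones so that "[at]" is consumed whole. Bare " at " and " dot " will also rewrite ordinary sentences, but that only matters if the result happens to look like an address.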
The point is that word-substitution-based obfuscation techniques are relatively easy to crack. Some better techniques are:
- hiding the address in an image (this only works on low-profile websites; otherwise the image becomes a target for OCR)
- decoding the email address in JavaScript, which will prevent most spiders from finding the address (unless they run JavaScript too)
- puzzle-based obfuscation such as "our email addresses are our first names at our domain name", which would be very difficult to find automatically
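The JavaScript technique can be generated on the server. The following is only a rough sketch under a few assumptions: `js_obfuscated_mailto` is a hypothetical helper name, the address is plain ASCII, and the encoding (a list of character codes) is just one of many possible choices. A spider that executes JavaScript will still recover the address.

```php
<?php
// Hypothetical helper: emits an inline script that rebuilds a mailto link
// from character codes, so the address never appears verbatim in the HTML.
function js_obfuscated_mailto($email)
{
    // Convert each byte of the address to its numeric code, e.g. "a" => 97.
    // Assumes a plain ASCII address.
    $codes = implode(',', array_map('ord', str_split($email)));

    return '<script>'
        . '(function(){'
        . 'var a=String.fromCharCode(' . $codes . ');'
        . 'document.write(\'<a href="mailto:\'+a+\'">\'+a+\'</a>\');'
        . '})();'
        . '</script>'
        . '<noscript>Enable JavaScript to see the email address.</noscript>';
}

echo js_obfuscated_mailto('user@example.com');
```

Visitors with JavaScript enabled get an ordinary clickable link, which keeps the usability that the "[at]" style gives up.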
Alternatively, you can do what I ended up doing: forward all your email to Gmail and let it sort out the spam. It is nearly 100% effective after a little bit of training. The only messages that still end up in spam by mistake are emails from automated scripts on my servers where I didn't set the proper headers.