Automatically Unshortening Links in WordPress Posts

August 19th, 2015 • 2 min read #Meta #Programming

On this site, I have the Broken Link Checker plugin chugging away in the background. He tirelessly checks and rechecks every link in every post to find URLs that no longer work; pages sometimes just disappear.

In most cases, I'm able to use the Internet Archive Wayback Machine to find archived snapshots of the long-gone links so that the context of my writing archive remains preserved.

I also recently imported my old Twitter posts from years past into my Microblog. Quite a few of those tweets contain links I shared.

At some point, Twitter started automatically shortening links to go through their service. Link shortening (https://en.wikipedia.org/wiki/URL_shortening) has become somewhat commonplace. Lots of companies exist to provide link shortening services (e.g., bit.ly); one of their value propositions is that they provide interesting analytics about the kinds of sites people visit.

Others have written about the problems with link shorteners.

A primary concern is that link shortening creates a single point of failure on the web; this is the antithesis of the way the Internet is supposed to work. If any one of these shortening services goes down, then suddenly those short links point to nothing, effectively breaking the web. This is a real issue; it actually happens.

Furthermore, if the page behind a short link goes away, the short link hides the original URL, which makes finding an archived copy nearly impossible.

Brett Terpstra's StretchLink is an invaluable tool that watches your clipboard for shortened links to expand in the background. However, manually going through the thousands of back posts on my blog to unshorten links by copying and pasting seems a bit obsessive and not really worth my time. Automatic cross-posting happens using IFTTT, and I don't want to have to "fix" posts that are inbound from Twitter.

So I quickly hacked some code to automatically unshorten links in my posts. It uses a code snippet I found by Jonathon Hill and Gruber's URL matching regex.

I noticed that the unshortened links tended to have analytics-enabling "UTM" parameters, so I strip those out as well.
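As an alternative to a regex, here is a minimal sketch of stripping those parameters by splitting the query string into its `key=value` pairs. The function name `strip_utm_params` is hypothetical and is not part of the snippet I actually use; it assumes that every tracking parameter starts with `utm_` and that pairs are separated by `&`.

```php
<?php
// Hypothetical helper (not from the original gist): drop utm_* query parameters.
function strip_utm_params(string $url): string {
    // Set the fragment aside so it can be re-attached untouched.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // No query string means there is nothing to strip.
    $queryPos = strpos($url, '?');
    if ($queryPos === false) {
        return $url . $fragment;
    }

    $base  = substr($url, 0, $queryPos);
    $pairs = explode('&', substr($url, $queryPos + 1));

    // Keep every non-empty pair whose key does not start with "utm_".
    $kept = array_filter($pairs, function ($pair) {
        return $pair !== '' && stripos($pair, 'utm_') !== 0;
    });

    return $base . ($kept ? '?' . implode('&', $kept) : '') . $fragment;
}

echo strip_utm_params('http://example.com/?a=1&utm_source=twitter&b=2'), "\n";
```

Working on whole pairs leaves the remaining parameters exactly as they were written, including their order and encoding.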

A next step would be to somehow "bake" the older links using the Wayback Machine or via downloading snapshots so that they remain in an unchanged format.
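I haven't built this yet, but a first step could look something like the sketch below: it builds a Wayback Machine address for a link as of a given date (say, the post's publish date). The function name `wayback_url` is hypothetical. It relies on the `https://web.archive.org/web/<timestamp>/<url>` address form, and it does not check whether a snapshot actually exists.

```php
<?php
// Hypothetical helper: build a Wayback Machine address for $url near $when.
// It only constructs the address; it does not verify that a snapshot exists.
function wayback_url(string $url, DateTimeInterface $when): string {
    // Wayback timestamps are UTC, formatted as YYYYMMDDhhmmss.
    $stamp = gmdate('YmdHis', $when->getTimestamp());
    return 'https://web.archive.org/web/' . $stamp . '/' . $url;
}

$published = new DateTimeImmutable('2015-08-19 00:00:00 UTC');
echo wayback_url('http://example.com/page', $published), "\n";
```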

Just add this code to the functions.php of your WordPress theme and you're on your way to abandoning shortened links whenever you save or update a post.

function unshorten_url($matches) {
    // preg_replace_callback passes an array; element 0 is the full matched URL
    $ch = curl_init($matches[0]);
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => TRUE, // the magic sauce
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_HEADER => TRUE,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_SSL_VERIFYHOST => FALSE, // suppress certain SSL errors (this disables certificate checks)
        CURLOPT_SSL_VERIFYPEER => FALSE,
    ));
    curl_exec($ch);
    // the URL we ended up at after following all redirects
    $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    // strip utm_* parameters; only match right after ? or & so the remaining parameters stay separated
    $url = preg_replace( '/(?<=[?&])utm_[^&#\s]*&?/', '', $url );
    $url = str_replace("%5C", '', $url); // hack - sometimes wikipedia appends a backslash
    $url = rtrim($url, "?&");
    return $url;
}

function find_links_for_unshortening($text) {
    // Gruber's URL matching regex
    $pattern = "#\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'\".,<>?«»“”‘’\s]))#i";
    // run every matched URL through unshorten_url()
    $text = preg_replace_callback($pattern,'unshorten_url',$text);
    return $text;
}

// unshorten links whenever a post is saved or updated
add_filter('content_save_pre', 'find_links_for_unshortening', 999);
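To see how `preg_replace_callback` hands each matched link to the callback without touching the network, here is a small offline sketch. Everything in it is a stand-in: `fake_unshorten` and its lookup table are hypothetical, and the pattern is a deliberately simplified substitute for Gruber's regex.

```php
<?php
// Hypothetical stand-in for unshorten_url(): it resolves links from a fixed
// table instead of following redirects, so it runs offline.
function fake_unshorten(array $matches): string {
    $known = array(
        'http://bit.ly/abc123' => 'http://example.com/a-long-article',
    );
    // $matches[0] is the full text of the URL that the pattern matched.
    return isset($known[$matches[0]]) ? $known[$matches[0]] : $matches[0];
}

// Simplified pattern (an assumption, not Gruber's regex): an http or https
// URL running up to the next whitespace, quote, or angle bracket.
$pattern = '#\bhttps?://[^\s<>"]+#i';

$text   = 'Worth reading: http://bit.ly/abc123 and http://example.org/ too.';
$result = preg_replace_callback($pattern, 'fake_unshorten', $text);

echo $result, "\n";
// Worth reading: http://example.com/a-long-article and http://example.org/ too.
```

Links that the callback cannot resolve come back unchanged, which is also the behavior you want from the real function when a request fails.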