Setting custom values for _trackPageview in Google Analytics may have unexpected consequences

I recently deployed Joost de Valk’s awesome Google Analytics for WordPress plugin on a few sites. (I mean, he even SEO’d the plugin name, how brilliant is that?)  One feature of this plugin adds custom parameters to the URL in your _trackPageview call so that you can gain insight into what kind of results your site search returns.

For example, if you search for “K-Stew” and get no results, that will get tracked in GA as ["_trackPageview","http://www.example.com/?s=no-results:k-stew&cat=no-results"].  Great for being able to view GA reports and realize, hey, people want to see K-Stew and I’m not giving them any!

So what’s the problem?
OK, here’s the takeaway:

  • Always noindex,follow your search results pages
  • Add both Disallow: /?s= and Disallow: /search/ to your robots.txt.
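To see which URLs those two rules actually cover, here is a minimal Python sketch of the plain prefix matching that robots.txt Disallow rules rely on. The function name and sample URLs are made up for illustration, and it deliberately ignores the wildcard patterns some crawlers also understand. In a real robots.txt the two Disallow lines sit under a User-agent line.

```python
from urllib.parse import urlsplit

# The two rules recommended above.
DISALLOW_RULES = ["/?s=", "/search/"]


def is_disallowed(url, rules=DISALLOW_RULES):
    """Return True if the URL's path plus query string starts with any rule.

    Only plain prefix rules are modelled here; wildcard (*) and
    end-anchor ($) patterns are left out of this sketch.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(rule) for rule in rules)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in [
        "http://www.example.com/?s=k-stew&cat=plus-5-posts",
        "http://www.example.com/search/k-stew/",
        "http://www.example.com/2012/some-post/",
        "http://www.example.com/blog/?s=k-stew",
    ]:
        print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

Note the last sample: because the match is a prefix on the path, a rule of /?s= does not cover a WordPress install living in a subdirectory such as /blog/, which would need its own rule.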

What does Kristen Stewart have to do with the battery life of my cellphone?

Like a lot of other people, I wasn’t noindexing our search results pages.  It didn’t seem to really matter much: occasionally an author would link to a search results page in their post, a spider would come along and crawl it, we’d get a few high CPU alerts from CloudWatch and move on.  Something to fix on another day.

But about 4 days after deploying this plugin, our database CPU utilization started going nuts.  Web server response latency was up (the time it took for the server to generate the page and send it back to the load balancer).  This isn’t normal behaviour for high traffic on our sites.  Normally when we get spidered or have an increase in visitors, we see a lot of memcache traffic (lots of gets and sets) and a spike in CPU utilization of the database master.  This time we were seeing a lot of gets but no corresponding sets, and all the database CPU utilization was on the read slave replica.

The number of alerts drained the battery of our sysadmin’s phone overnight.

So we started tailing access logs and parsing slow query logs.  The first thing that became obvious was that the traffic was primarily searches.  Further investigation showed that it was coming from Googlebot.
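If you want to do the same kind of triage, here is a rough Python sketch that counts search requests per crawler from access log lines. It assumes the common "combined" log format and treats any request with an “s” query parameter as a search; the regex, function name and sample lines are illustrative, not what we actually ran.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumes the "combined" access log format; adjust for your server.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def search_hits_by_agent(lines):
    """Count requests carrying an "s" query parameter, grouped by crawler."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        query = parse_qs(urlsplit(match.group("target")).query,
                         keep_blank_values=True)
        if "s" not in query:
            continue  # not a WordPress search request
        agent = match.group("agent")
        hits["Googlebot" if "Googlebot" in agent else "other"] += 1
    return hits


# Made-up sample lines, using a documentation-only IP address.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2000:00:00:00 +0000] '
    '"GET /?s=k-stew&cat=plus-5-posts HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2000:00:00:01 +0000] '
    '"GET /2012/some-post/ HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [01/Jan/2000:00:00:02 +0000] '
    '"GET /?s=k-stew HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(search_hits_by_agent(SAMPLE_LINES))
```

Matching on the user agent string alone is only a first pass, since anyone can claim to be Googlebot.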

Why is Googlebot searching our site?

Sure, we have the occasional link to search results in our posts, but nothing that would explain such crazy behaviour.  So we worked backwards through a sample of the search result pages shown in the access log:

  • Were there links to searches in our code?  No.
  • Were there links to the searches in posts or other content on the site?  No.  There were links to searches, but not the ones we were looking at, and not that many.
  • Were the searches somehow in our sitemap?  No.
  • Were there links to searches from 3rd party sites?  No.  A quick trip to Google Webmaster Tools confirmed this.

That’s when our sysadmin pointed out that the search URLs looked different than when you just did a search through the site.  Up until this point I was looking for searches based on the slow query log and terms he saw in the access log URLs.  When I scoot over to his desk and look, I see all of them have “&cat=plus-5-posts” appended to the querystring.  Where’s that coming from?

I start my code editor searching for the string “plus-5-posts” and fire off a simultaneous search in Google while that’s working.  A lot of other sites have the same thing indexed: search results pages with “plus-5-posts” tacked on.  And they’re all WordPress sites.  When I tab back to my code editor I see the culprit: 1 hit, Google Analytics for WordPress.
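The pattern is easy to pull out of request URLs once you know what to look for. Here is a small Python sketch that tallies the “cat” values on search requests; the function name and sample URLs are hypothetical, and the only values it knows about are the ones mentioned in this post.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def tally_search_categories(targets):
    """Count the "cat" value on every request that has an "s" parameter."""
    counts = Counter()
    for target in targets:
        query = parse_qs(urlsplit(target).query, keep_blank_values=True)
        if "s" not in query:
            continue  # not a search request
        # Searches typed into the site itself carry no "cat" parameter.
        counts[query.get("cat", ["(none)"])[0]] += 1
    return counts


# Hypothetical request targets, for illustration only.
SAMPLE_TARGETS = [
    "/?s=k-stew&cat=plus-5-posts",
    "/?s=no-results:k-stew&cat=no-results",
    "/?s=k-stew",
    "/2012/some-post/",
]

if __name__ == "__main__":
    for value, count in tally_search_categories(SAMPLE_TARGETS).most_common():
        print(value, count)
```

A large count for anything other than “(none)” suggests the crawler picked the URL up from the tracking call rather than from a real link or a visitor’s search.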

When great features go awry

So here’s what happened: Google saw something that looked like a link inside JavaScript.  Being the clever creature it is, Google decided to come pay us a visit and see what this page was.  For every search.  For every page of results.  And because we hadn’t paid enough attention to our search pages, they were still being indexed.  And then other search engines like Bing started joining the party.
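To get a feel for how little it takes, here is a toy Python sketch that pulls URL-looking strings out of an inline script. It is emphatically not how Googlebot works internally (I have no idea what it does); the regex and the sample snippet are my own assumptions, modelled on the tracking call shown earlier.

```python
import re

# Any quoted string that starts with http:// or https://.
URL_IN_STRING = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def urls_in_script(script_text):
    """Return every quoted absolute URL found in a blob of JavaScript."""
    return URL_IN_STRING.findall(script_text)


# Hypothetical inline script, similar to what a tracking plugin might emit.
SNIPPET = '''
var _gaq = _gaq || [];
_gaq.push(["_trackPageview","http://www.example.com/?s=k-stew&cat=plus-5-posts"]);
'''

if __name__ == "__main__":
    for url in urls_in_script(SNIPPET):
        print(url)
```

Anything that can be found this easily in your page source should be treated as a link someone will eventually follow.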

It was a dumb oversight on our part, but also a common one.  After working to implement a “noindex,follow” robots meta tag on our search results pages I also noticed that Joost’s WordPress SEO plugin offers the same option, and he has a one-off post recommending it.

It’s also more proof that nothing put on the internet is private, even if it’s a customized URL sent only to an analytics provider.

Bonus Pro-Tip

When using GA with WordPress, don’t forget to tell Google Analytics about your site search!  Add the “s” querystring param as the search query parameter in your GA site settings, and if you’re using the Google Analytics for WordPress plugin you may also want to add “cat” there as the category parameter.
