Integrate your server logs into Data Studio automatically

Today, in SEO, analysing your server logs is more than important, especially when you have access to them. The problem is: you have to deal with heavy, complicated files. You have to download them, clean them, sort them… Only after all this hard work can you finally analyse them. Below, you'll find an easy way to automate this process and visualize the results in Google Data Studio.

Only the main part of the algorithm is explained in this post. I can send you the full script for free: you only need to send me a DM on Twitter. And don't forget: please share this post on social networks. Thank you 🙂

Before starting, please make sure you have (or have access to):

  • your server credentials
  • your server logs
  • a database to store the data (in this post we'll use MySQL)
  • a computer able to run a script that retrieves the logs, cleans them and puts them into the database. If you have a dedicated server, it's even better.

1st step: get the logs

In my situation, my website is hosted on OVH (a French domain/hosting provider), but the logic should be the same with your provider. Please read their documentation to learn more, and you'll be able to adapt the logic to your situation. The language I'm using is PHP, but it can be done in Java or even in R.

In my case, to retrieve the logs, I have to call a URL and fetch its content. It has the following structure:

https://log.ovh.net/{{host}}/logs-{{month}}-{{year}}/{{host}}-{{date format day-month-year}}

I'm 100% sure you'll need some credentials to get the data. No surprise: otherwise, anybody could access the logs. In OVH's case, this is HTTP authentication.

[Screenshot: OVH HTTP authentication prompt]

We'll have to add this information at the beginning of the URL we call. Here is the PHP code to get yesterday's data:
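Something along these lines (a minimal sketch: $host, $user and $pass are placeholders, and the exact file name may differ from one provider to another):

```php
<?php
// Minimal sketch: $host, $user and $pass are placeholders, and the exact
// file name/extension may differ; check your provider's documentation.
$host = 'www.example.com';
$user = 'your_login';
$pass = 'your_password';

$yesterday = strtotime('-1 day');
$month = date('m', $yesterday);
$year  = date('Y', $yesterday);
$day   = date('d-m-Y', $yesterday);

// The credentials are added at the beginning of the URL (HTTP authentication).
$url  = "https://{$user}:{$pass}@log.ovh.net/{$host}/logs-{$month}-{$year}/{$host}-{$day}.gz";
$data = file_get_contents($url);

if ($data !== false) {
    file_put_contents("{$host}-{$day}.gz", $data);
}
```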

Careful: the data you receive might be compressed (a .gz file in that case). You'll have to decompress it in order to get a plain log file. A function is provided in the full script (function transform_file).
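For reference, the decompression can be done with PHP's zlib functions (a sketch; the actual transform_file may differ):

```php
<?php
// One way to decompress the .gz archive with PHP's zlib functions.
// $archive is the file downloaded in the previous step (placeholder name).
$archive = 'www.example.com-12-10-2017.gz';
$gz  = gzopen($archive, 'rb');
$out = fopen(str_replace('.gz', '.log', $archive), 'wb');
while (!gzeof($gz)) {
    fwrite($out, gzread($gz, 8192));
}
gzclose($gz);
fclose($out);
```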

2nd step: read the log file

A log file contains a lot of information: visitor data and all the robots crawling your website. In our example, the goal will be to get all the visits (hits) made by Google. For reference, you'll find the list of Google bots here.

In order to optimize the code and limit the calls to the database, we'll send 5,000 lines at a time with every insert. This reduced my run time by 90%.

Here is the logic (a PHP sketch follows the list):

  1. Read every line in the file (one by one)
  2. If the line contains a Google bot pattern…
  3. Explode the line
  4. Get all the main information: date, response code, crawled URL, bot name…
  5. Put the data in an array ($insert_value)
  6. Do it again
  7. When the array reaches 5,000 lines, insert it into the database
  8. Etc.
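
Translated into PHP, the loop could look like this (a minimal sketch assuming the Apache combined log format; flush_to_db is the batched insert detailed in the 3rd step):

```php
<?php
// A sketch of the reading loop, assuming the Apache combined log format.
// flush_to_db() is the batched INSERT detailed in the 3rd step.
$handle = fopen('www.example.com-12-10-2017.log', 'r');
$insert_value = [];

while (($line = fgets($handle)) !== false) {
    // 2. Keep only the lines matching a Google bot pattern
    //    (refine with a regexp to distinguish Desktop, Mobile, Image…).
    if (stripos($line, 'Googlebot') === false) {
        continue;
    }

    // 3. + 4. Explode the line and pick the main fields.
    $parts = explode(' ', $line);
    $raw   = trim($parts[3], '[');   // e.g. 12/Oct/2017:06:25:01
    // Convert the Apache date into a MySQL DATETIME.
    $date  = DateTime::createFromFormat('d/M/Y:H:i:s', $raw)->format('Y-m-d H:i:s');
    $url   = $parts[6];              // URL crawled
    $code  = $parts[8];              // response code
    $size  = $parts[9];              // size downloaded by the bot

    // 5. Stack the row; the "1" is the hit counter.
    $insert_value[] = [$date, 'Googlebot', $url, 1, $code, $size];

    // 7. Every 5,000 lines, send a single INSERT to the database.
    if (count($insert_value) >= 5000) {
        flush_to_db($insert_value);
        $insert_value = [];
    }
}

// Don't forget the last batch at the end of the file.
if (!empty($insert_value)) {
    flush_to_db($insert_value);
}
fclose($handle);
```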

3rd step: write in the database

As you saw, I only keep a few fields, so the structure of my MySQL table will be very simple (a possible definition is sketched below):

  • hit date
  • bot name
  • URL crawled
  • number of hits (always 1)
  • response code
  • size downloaded by the bot

[Screenshot: MySQL log table structure]
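
A possible definition in PHP/MySQL (a sketch: the column names are my assumptions, matching the fields listed above):

```php
<?php
// A possible table definition (a sketch: the column names are assumptions
// matching the fields listed above).
$mysqli = new mysqli('localhost', 'db_user', 'db_password', 'logs');

$mysqli->query("
    CREATE TABLE IF NOT EXISTS bot_hits (
        hit_date      DATETIME      NOT NULL,
        bot_name      VARCHAR(50)   NOT NULL,
        url           VARCHAR(2000) NOT NULL,
        hits          TINYINT       NOT NULL DEFAULT 1,
        response_code SMALLINT      NOT NULL,
        size_bytes    INT UNSIGNED  NOT NULL
    )
");
```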

And here is the PHP code to insert the data into the table:
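
A minimal sketch of that batched insert (this is the hypothetical flush_to_db helper used in the 2nd step's sketch; adapt the connection settings and table/column names to your setup):

```php
<?php
// A sketch of the batched insert: one single INSERT statement for up to
// 5,000 rows, which means far fewer round trips to the database.
function flush_to_db(array $insert_value): void
{
    $mysqli = new mysqli('localhost', 'db_user', 'db_password', 'logs');

    $values = [];
    foreach ($insert_value as $row) {
        $escaped  = array_map([$mysqli, 'real_escape_string'], $row);
        $values[] = "('" . implode("','", $escaped) . "')";
    }

    $mysqli->query(
        'INSERT INTO bot_hits (hit_date, bot_name, url, hits, response_code, size_bytes) VALUES '
        . implode(',', $values)
    );
    $mysqli->close();
}
```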


Last step: schedule this script so it runs every day; Google Data Studio will then be updated automatically. Again, if you'd like the full script, please feel free to drop me a DM on Twitter. And don't forget: please share this post on social networks and RT.
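For example, with cron on a Linux server (the script path is a placeholder):

```
# crontab -e — run the log import every day at 6:00 AM
0 6 * * * php /path/to/import_logs.php
```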

4th step: the dashboards in Google Data Studio

You'll need to connect your database to Data Studio (don't forget to enter your MySQL credentials).

[Screenshot: MySQL connector in Data Studio]

Regarding the dashboards, here is what I’ve built:

  • Global KPIs view: it includes the total number of unique hits, the total hits, and the total size of the data downloaded by the bots, split by bot and over time. Here you can filter by bot type: Desktop, Mobile, Image…

[Screenshot: global log KPI dashboard]

  • The next dashboard focuses on the server responses: 200 (OK), 3XX, 4XX or 5XX. The first graph is interesting because it shows the breakdown of the pages crawled by Google. In this screenshot, we can see that less than 50% of the pages seen returned a 200 (oops…). Here you'll be able to filter by response code.

[Screenshot: response code KPI dashboard]

  • Last dashboard: analysis by page category. Are my French pages crawled more often than my English pages? What about my pictures? My admin pages? My robots.txt/sitemap files? Here you can filter by page type.

[Screenshot: log category dashboard]

Here is the result in Google Data Studio. Note that you can only see Sept/Oct 2017 on this demo account. Here is the direct link for full screen. Don't forget to change dashboards to see all of them (top left).

Next step: integrate Google Analytics/Search Console data and blend everything together. Please feel free to leave a comment to help optimize the PHP code or the dashboards. And don't forget to share on Twitter or any other social network.

You might also be interested in:

  • Why you should never trust Search Console data (Part 1)
  • Why you should never trust Search Console data (Part 2)

