Google Analytics in R

Load libraries

You need to install rga from GitHub. I'm also using ggplot2, which produces much nicer plots than the default R plotting facilities.

library(rga)

## Loading required package: bitops

## Loading required package: RCurl

## Loading required package: rjson

## Loading required package: lubridate

## Loading required package: httr

library(ggplot2)
library(scales)

Get the data

Download the data, and do some light pre-processing.

options(width = 140)
rga.open(instance = "ga", where = "~/temp/ga-api")

# The ga.df dataset simply retrieves the number of visits (and the average
# time on site) per day for the past year.
ga.df <- ga$getData("XXXXXXXX", start.date = "2012-03-23", end.date = "2013-03-23", 
    metrics = "ga:visits,ga:avgTimeOnSite", dimensions = "ga:date,ga:nthDay,ga:dayOfWeek", 
    max = 1500, sort = "ga:nthDay")
ga.df$nthDay <- as.integer(ga.df$nthDay)
# Turn the day-of-week codes into a factor with readable day names.
ga.df$dayOfWeek <- factor(ga.df$dayOfWeek, labels = c("Sunday", "Monday", "Tuesday", 
    "Wednesday", "Thursday", "Friday", "Saturday"))

# The ga.pvs dataset is not only faceted by time (like above), but also by
# page and the source of traffic (medium).
ga.pvs <- ga$getData("XXXXXXXX", start.date = "2012-03-23", end.date = "2013-03-23", 
    metrics = "ga:pageviews,ga:visits,ga:visitors", dimensions = "ga:pageTitle,ga:medium,ga:nthDay", 
    sort = "-ga:pageviews", batch = TRUE)
ga.pvs$nthDay <- as.integer(ga.pvs$nthDay)
# Remove the recurring blog-name suffix from pageTitle
ga.pvs$pageTitle <- gsub(" - Branch and Bound", "", ga.pvs$pageTitle)
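
The two pre-processing steps above can be tried out on made-up values. This is a minimal sketch, not the code used for this post: the raw vectors are hypothetical, and passing explicit levels to factor() and fixed = TRUE to gsub() are additions of this sketch. Explicit levels keep the day labels aligned even if some day code is missing from the data, and fixed = TRUE treats the pattern as a literal string rather than a regular expression.

```r
# Hypothetical raw day-of-week codes, assuming "0" stands for Sunday
# (which matches the label order used above).
rawDays <- c("5", "6", "0", "1", "5")
dayNames <- c("Sunday", "Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday")

# Explicit levels: labels stay correct even though codes 2, 3 and 4 are absent here.
days <- factor(rawDays, levels = as.character(0:6), labels = dayNames)
print(days)

# Hypothetical page titles, one of them carrying the blog-name suffix.
titles <- c("Scala is like Git - Branch and Bound", "About")
cleanTitles <- gsub(" - Branch and Bound", "", titles, fixed = TRUE)
print(cleanTitles)
```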

First look at the data

Let's have a look at the ga.df dataset. The head() command shows us the first few entries of the dataset. The summary() command gives some statistics for the values in each column. The last statement, using sapply(), lists the data type of each column.

head(ga.df)

##         date nthDay dayOfWeek visits avgTimeOnSite
## 1 2012-03-23      0    Friday    220         43.35
## 2 2012-03-24      1  Saturday     96         37.68
## 3 2012-03-25      2    Sunday     34          0.00
## 4 2012-03-26      3    Monday     51         24.00
## 5 2012-03-27      4   Tuesday     11         14.09
## 6 2012-03-28      5 Wednesday     12         29.17

summary(ga.df)

##       date                nthDay          dayOfWeek      visits       avgTimeOnSite   
##  Min.   :2012-03-23   Min.   :  0.0   Sunday   :52   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:2012-06-22   1st Qu.: 91.2   Monday   :52   1st Qu.:   6.0   1st Qu.:   1.5  
##  Median :2012-09-21   Median :182.5   Tuesday  :52   Median :  15.0   Median :  23.2  
##  Mean   :2012-09-21   Mean   :182.5   Wednesday:52   Mean   :  63.6   Mean   :  60.5  
##  3rd Qu.:2012-12-21   3rd Qu.:273.8   Thursday :52   3rd Qu.:  49.0   3rd Qu.:  61.0  
##  Max.   :2013-03-23   Max.   :365.0   Friday   :53   Max.   :2077.0   Max.   :2689.2  
##                                       Saturday :53

sapply(ga.df[1, ], class)

##          date        nthDay     dayOfWeek        visits avgTimeOnSite 
##        "Date"     "integer"      "factor"     "numeric"     "numeric"

The summary shows, for example, that I have had an average of about 64 visits per day. The median is quite a bit lower at 15 visits, which indicates a strongly right-skewed distribution: a few very busy days pull the mean up. And indeed, the summary also shows that there was a maximum of 2077 visits on one day. That's what happens when Martin Odersky tweets one of your posts to his followers!
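
To see how a single spike drives the mean and the median apart, here is a small sketch with made-up daily visit counts (only the 2077 is borrowed from the real data as the spike; the other values are invented for illustration):

```r
# Hypothetical daily visit counts with one extreme day.
visits <- c(6, 10, 15, 20, 49, 2077)

# The mean is dragged towards the spike, the median is not.
cat("mean:  ", mean(visits), "\n")
cat("median:", median(visits), "\n")

# Without the spike the two statistics are much closer together.
quiet <- visits[visits < 1000]
cat("mean without spike:  ", mean(quiet), "\n")
cat("median without spike:", median(quiet), "\n")
```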

We can run the same exploratory commands on the ga.pvs dataset. Keep in mind that these figures are split over multiple dimensions (pageTitle, medium and nthDay), whereas the previous numbers were split by day only. So we'll have to do some slicing and dicing later on to make sense of the actual numbers.

head(ga.pvs)

##                         pageTitle   medium nthDay pageviews visits visitors
## 1               Scala is like Git referral    272      1809   1675     1631
## 2 Cross-Build Injection in action referral    200      1148   1099     1083
## 3               Scala is like Git referral    273       966    846      798
## 4      Preparing a technical talk referral     75       842    785      766
## 5  Modern concurrency and Java EE referral    123       730    688      662
## 6               Scala is like Git referral    274       633    567      533

summary(ga.pvs)

##   pageTitle            medium              nthDay      pageviews        visits          visitors     
##  Length:3111        Length:3111        Min.   :  0   Min.   :   0   Min.   :   0.0   Min.   :   1.0  
##  Class :character   Class :character   1st Qu.:159   1st Qu.:   1   1st Qu.:   1.0   1st Qu.:   1.0  
##  Mode  :character   Mode  :character   Median :252   Median :   2   Median :   1.0   Median :   1.0  
##                                        Mean   :230   Mean   :   9   Mean   :   7.5   Mean   :   7.7  
##                                        3rd Qu.:304   3rd Qu.:   4   3rd Qu.:   3.0   3rd Qu.:   3.0  
##                                        Max.   :364   Max.   :1809   Max.   :1675.0   Max.   :1631.0

sapply(ga.pvs[1, ], class)

##   pageTitle      medium      nthDay   pageviews      visits    visitors 
## "character" "character"   "integer"   "numeric"   "numeric"   "numeric"
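
Because every row of ga.pvs describes one combination of page title, medium and day, totals have to be rebuilt by summing over the dimensions you are not interested in. A minimal sketch of that idea with a made-up data frame (the rows are hypothetical, only the column names mirror ga.pvs):

```r
# Hypothetical rows in the same shape as ga.pvs.
toy <- data.frame(
  pageTitle = c("Post A", "Post A", "Post B", "Post A"),
  medium    = c("referral", "organic", "referral", "referral"),
  nthDay    = c(1L, 1L, 1L, 2L),
  pageviews = c(10, 3, 5, 7),
  stringsAsFactors = FALSE
)

# Collapse pageTitle and medium: total pageviews per day.
perDay <- aggregate(pageviews ~ nthDay, data = toy, FUN = sum)
print(perDay)

# Collapse medium and nthDay: total pageviews per post.
perPost <- aggregate(pageviews ~ pageTitle, data = toy, FUN = sum)
print(perPost)
```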

Analysis and visualization

Now that we have a better understanding of the raw input data, we'll try to get some visualizations going. Can we spot any trends or interesting datapoints?

Visits

We'll start with calculating the total number of visits this year. Easy:

sum(ga.df$visits)

## [1] 23262

Next, let's plot the visits per day over the last year. This type of plot is directly available in the Google Analytics dashboard as well, but let's reproduce it using ggplot2 in R and add something extra:

qplot(data = ga.df, x = nthDay, y = visits) + geom_smooth(method = lm)  # add linear regression line with confidence interval in dark gray

[Plot: visits per day (nthDay on the x-axis, visits on the y-axis) with a linear regression line]

The line represents a linear regression model of the number of visits against the number of days my blog has run. Clearly a linear model does not account very well for the spiky data, but at least it shows a rising trend :)

The line in the plot was created by ggplot, but we can also calculate the linear model using lm() to see some more information about it.

# calculate the linear regression model
ga.lm <- lm(ga.df$visits ~ ga.df$nthDay)
print(summary(ga.lm))

## 
## Call:
## lm(formula = ga.df$visits ~ ga.df$nthDay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -103.0  -55.6  -39.3  -18.7 1994.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   24.1479    18.3569    1.32    0.189  
## ga.df$nthDay   0.2159     0.0871    2.48    0.014 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 176 on 364 degrees of freedom
## Multiple R-squared: 0.0166,  Adjusted R-squared: 0.0139 
## F-statistic: 6.15 on 1 and 364 DF,  p-value: 0.0136

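The details of the linear model show that I receive about 0.22 more visits for each additional day that my blog runs (see the coefficient of ga.df$nthDay). The p-value of 0.014 isn't half-bad either, although with an R-squared of only 0.017 the line explains very little of the day-to-day variation.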
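
If you want the slope, p-value and R-squared as numbers instead of reading them off the printed summary, they can be pulled out of the summary object. The following is a sketch on simulated data, not on the real ga.df: the data frame toy and its true slope of 0.2 are assumptions made for illustration. It also uses the data = argument of lm(), which keeps the coefficient names short.

```r
# Simulated visits: a true slope of 0.2 visits per day plus noise (assumed values).
set.seed(42)
toy <- data.frame(nthDay = 0:99)
toy$visits <- 20 + 0.2 * toy$nthDay + rnorm(100, sd = 5)

fit <- lm(visits ~ nthDay, data = toy)

# coef(summary(fit)) is a matrix with one row per term of the model.
coefs <- coef(summary(fit))
slope <- coefs["nthDay", "Estimate"]
pval  <- coefs["nthDay", "Pr(>|t|)"]
r2    <- summary(fit)$r.squared

cat("slope:", slope, " p-value:", pval, " R-squared:", r2, "\n")
```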

The following boxplot shows the distribution of visits over the days of the week. The boxes span the 25% to 75% quantiles of the data. The bar inside each box is the median value for that day of the week.

qplot(data = ga.df, x = dayOfWeek, y = visits) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = quantile(ga.df$visits, c(0.1, 0.9)))

[Plot: boxplot of visits per day of the week, y-axis zoomed to the 10% to 90% quantile range]

Note that I have zoomed in on the y-axis (using coord_cartesian) to keep the chart readable, since there are some outliers. The maximum number of visits on a day is 2077, as we saw earlier, whereas this boxplot is restricted to the more common visit counts (the 10% to 90% quantile range).
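
The choice of coord_cartesian matters: it only zooms the view, so the boxplot statistics are still computed from all observations. Setting limits on the scale instead (for example with ylim()) would discard out-of-range values before the statistics are computed. The sketch below mimics that difference in base R on made-up visit counts; it does not call ggplot2 itself.

```r
# Hypothetical visit counts with one extreme outlier.
visits <- c(0, 0, 4, 6, 9, 15, 22, 49, 80, 2077)

# The zoom window used for the y-axis: the 10% and 90% quantiles.
window <- quantile(visits, c(0.1, 0.9))
print(window)

# Zooming keeps all data, so the median is computed over everything.
medianZoomed <- median(visits)

# Dropping values outside the window first changes the statistic.
inside <- visits[visits >= window[1] & visits <= window[2]]
medianDropped <- median(inside)

cat("median, all data kept:      ", medianZoomed, "\n")
cat("median, outliers dropped:   ", medianDropped, "\n")
```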

Weekends are particularly slow in comparison to normal weekdays. Programming blogs are read at work, it seems. Makes sense. I don't really have an explanation as to why Tuesdays are relatively lower than other weekdays, though.

Popular posts

We're now going to switch to the ga.pvs dataset, so we can zoom in more on particular posts and traffic patterns. Let's count the total number of pageviews and unique visitors:

print(sum(ga.pvs$pageviews))

## [1] 27860

print(sum(ga.pvs$visitors))

## [1] 23903

Now break it down by article:

# Create subtotals per pageTitle
pageViewsByPost <- aggregate(cbind(pageviews, visits, visitors) ~ pageTitle, data = ga.pvs, FUN = sum)
# Drop everything that has insignificant views and will only clutter our table
pageViewsByPost <- pageViewsByPost[pageViewsByPost$pageviews > 15, ]
# Sort, most popular article first (by pageviews)
print(pageViewsByPost[order(-pageViewsByPost$pageviews), ])

##                                                        pageTitle pageviews visits visitors
## 21                                             Scala is like Git      9383   8124     7936
## 18                                Modern concurrency and Java EE      4300   3849     3695
## 29 Unify client-side and server-side rendering by embedding JSON      3141   2807     2769
## 14                               Cross-Build Injection in action      2355   2111     2148
## 19                                    Preparing a technical talk      2343   2104     1995
## 10                            Branch and Bound Blog - Sander Mak      1225    791      948
## 15        Cross-build injection attacks: how safe is your build?      1218    883     1051
## 30                                 Verify dependencies using PGP       784    628      680
## 17                            JFall 2012: Dutch Java awesomeness       729    620      648
## 7                                    Book review: DSLs in Action       633    575      598
## 16                                      JEEConf 2012 trip report       505    442      446
## 12      Branch and Bound Blog - pruning the software universe...       285    156      200
## 5                                               About Sander Mak       277     21      231
## 6                                                        Archive       188     10      162
## 11     Branch and Bound Blog - pruning the software universe ...       124     72       82
## 4                                                          About       103      4       87
## 27                                           Talks by Sander Mak        90      4       79
## 8                             Book review: Infinity and the Mind        70     50       55
## 13                                                    Categories        55      0       49
## 26                                                          Tags        35      2       32

The article 'Scala is like Git' accounts for approximately a third of my pageviews! That one really exceeded my expectations. Unfortunately, one of my personal favorites, the book review of 'Infinity and the Mind', languishes near the bottom of the barrel. Oh well, at least I had fun reading the book and writing the post!

Also, I noted that some entries have fewer visits than visitors. That can't be right? Haven't found an explanation yet.
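
Both observations can be checked with a few lines of R. The sketch below copies four rows from the table above into a small data frame (the name posts is made up for this example), computes each post's share of the 27860 total pageviews, and lists the rows where visits are lower than visitors. It only finds those rows; it does not explain them.

```r
# Four rows copied from the table above.
posts <- data.frame(
  pageTitle = c("Scala is like Git", "Cross-Build Injection in action",
                "Preparing a technical talk", "About Sander Mak"),
  pageviews = c(9383, 2355, 2343, 277),
  visits    = c(8124, 2111, 2104, 21),
  visitors  = c(7936, 2148, 1995, 231),
  stringsAsFactors = FALSE
)

# Total pageviews as reported by sum(ga.pvs$pageviews) earlier.
totalPageviews <- 27860

# Share of all pageviews per post, in percent.
posts$share <- round(100 * posts$pageviews / totalPageviews, 1)
print(posts[, c("pageTitle", "share")])

# Rows where the number of visits is lower than the number of visitors.
suspicious <- posts[posts$visits < posts$visitors, ]
print(suspicious$pageTitle)
```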

Organic traffic vs. referrals vs. direct

What traffic patterns emerged over the last year? The following plot has again time (number of days the blog was live) on the x-axis. The y-axis represents the percentage of traffic that originated from one of the three categories, which are also colored.

pageViewsByMediumAndDay <- aggregate(pageviews ~ medium + nthDay, data = ga.pvs, FUN = sum)

# Percentage of pageviews from search, direct and referral traffic over time (7-day bins).
# geom_histogram() replaces geom_bar(), which no longer accepts binwidth in current ggplot2.
ggplot(pageViewsByMediumAndDay, aes(nthDay, weight = pageviews, fill = medium)) + geom_histogram(position = "fill", binwidth = 7) + 
    scale_y_continuous(labels = percent) + xlab("nth day") + ylab("Percentage") + 
    scale_fill_discrete("Traffic source\n(medium)", breaks = c("(none)", "organic", "referral"), labels = c("Direct", "Search", "Referral"))

[Plot: percentage of pageviews per traffic source (Direct, Search, Referral) over time, in 7-day bins]

The majority of traffic is driven by referrals, mostly from link-sharing sites like Reddit and DZone, but also from Twitter. It is good to see how the proportion of search traffic is increasing over time. Apparently my pagerank is increasing, and more content also means more opportunities for Google to send traffic to my blog.
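
The percentages in the plot can also be computed as plain numbers. Here is a rough base R sketch of the same idea on made-up rows: group the days into weeks, sum the pageviews per medium and week, and divide by the weekly total. The toy data and the week column are assumptions for illustration, and the week boundaries will not necessarily line up exactly with the bins ggplot2 chooses.

```r
# Hypothetical rows in the shape of pageViewsByMediumAndDay.
toy <- data.frame(
  medium    = c("(none)", "organic", "referral", "(none)", "organic", "referral"),
  nthDay    = c(0L, 3L, 6L, 7L, 9L, 13L),
  pageviews = c(10, 5, 35, 10, 20, 10),
  stringsAsFactors = FALSE
)

# Integer division groups days 0-6 into week 0, days 7-13 into week 1, and so on.
toy$week <- toy$nthDay %/% 7

# Pageviews per medium and week.
weekly <- aggregate(pageviews ~ medium + week, data = toy, FUN = sum)

# Share of each medium within its week: divide by the weekly total.
weekly$share <- weekly$pageviews / ave(weekly$pageviews, weekly$week, FUN = sum)
print(weekly)
```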

Looking at the absolute amount of Google search traffic over time, the trend is even clearer:

# Pageviews from search (the "organic" medium) in absolute numbers over time.
organicTraffic <- pageViewsByMediumAndDay[pageViewsByMediumAndDay$medium == "organic", ]
ggplot(organicTraffic, aes(nthDay, pageviews)) + geom_point() + geom_smooth(method = lm) + ylim(0, max(organicTraffic$pageviews))

[Plot: pageviews from search traffic per day (nthDay vs. pageviews) with a linear regression line]

That's all for now! Hopefully this will give you an idea of how to start playing around with Google Analytics data in R.