You need to install rga from GitHub. I'm also using ggplot2, which produces much nicer plots than the default R plotting facilities.
library(rga)
## Loading required package: bitops
## Loading required package: RCurl
## Loading required package: rjson
## Loading required package: lubridate
## Loading required package: httr
library(ggplot2)
library(scales)
Download the data, and do some light pre-processing.
options(width = 140)
rga.open(instance = "ga", where = "~/temp/ga-api")
# The ga.df dataset simply retrieves the number of visits (and the average
# time on site) per day for the past year.
ga.df <- ga$getData("XXXXXXXX", start.date = "2012-03-23", end.date = "2013-03-23",
metrics = "ga:visits,ga:avgTimeOnSite", dimensions = "ga:date,ga:nthDay,ga:dayOfWeek",
max = 1500, sort = "ga:nthDay")
ga.df$nthDay <- as.integer(ga.df$nthDay)
# Make the days of the week readable by converting them to a factor with named labels.
ga.df$dayOfWeek <- factor(ga.df$dayOfWeek, labels = c("Sunday", "Monday", "Tuesday",
"Wednesday", "Thursday", "Friday", "Saturday"))
# The ga.pvs dataset is not only faceted by time (like above), but also by
# page and the source of traffic (medium).
ga.pvs <- ga$getData("XXXXXXXX", start.date = "2012-03-23", end.date = "2013-03-23",
metrics = "ga:pageviews,ga:visits,ga:visitors", dimensions = "ga:pageTitle,ga:medium,ga:nthDay",
sort = "-ga:pageviews", batch = TRUE)
ga.pvs$nthDay <- as.integer(ga.pvs$nthDay)
# Remove recurring prefix in pagetitle
ga.pvs$pageTitle <- gsub(" - Branch and Bound", "", ga.pvs$pageTitle)
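As a small illustration of the factor conversion above: Google Analytics reports ga:dayOfWeek as the strings "0" (Sunday) through "6" (Saturday), and factor() attaches the labels in the sorted order of the distinct values. The vector `raw.days` below is made up for the example and is not taken from the real data.

```r
# Hypothetical raw values, as character strings like the API returns them.
raw.days <- c("5", "6", "0", "1", "2", "3", "4")

# factor() sorts the distinct values ("0" to "6") into levels and then
# attaches the labels in that order, so "0" becomes "Sunday".
day.names <- factor(raw.days, labels = c("Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday", "Friday", "Saturday"))

print(as.character(day.names))
```

This only lines up when all seven weekdays occur in the data. Passing `levels = 0:6` explicitly is probably the safer choice for shorter date ranges.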
Let's have a look at the ga.df dataset. The head() command shows the first few entries of the dataset, summary() gives some statistics for the values in each column, and the last statement lists the data type of each column.
head(ga.df)
##         date nthDay dayOfWeek visits avgTimeOnSite
## 1 2012-03-23      0    Friday    220         43.35
## 2 2012-03-24      1  Saturday     96         37.68
## 3 2012-03-25      2    Sunday     34          0.00
## 4 2012-03-26      3    Monday     51         24.00
## 5 2012-03-27      4   Tuesday     11         14.09
## 6 2012-03-28      5 Wednesday     12         29.17
summary(ga.df)
## date nthDay dayOfWeek visits avgTimeOnSite
## Min. :2012-03-23 Min. : 0.0 Sunday :52 Min. : 0.0 Min. : 0.0
## 1st Qu.:2012-06-22 1st Qu.: 91.2 Monday :52 1st Qu.: 6.0 1st Qu.: 1.5
## Median :2012-09-21 Median :182.5 Tuesday :52 Median : 15.0 Median : 23.2
## Mean :2012-09-21 Mean :182.5 Wednesday:52 Mean : 63.6 Mean : 60.5
## 3rd Qu.:2012-12-21 3rd Qu.:273.8 Thursday :52 3rd Qu.: 49.0 3rd Qu.: 61.0
## Max. :2013-03-23 Max. :365.0 Friday :53 Max. :2077.0 Max. :2689.2
## Saturday :53
sapply(ga.df[1, ], class)
## date nthDay dayOfWeek visits avgTimeOnSite
## "Date" "integer" "factor" "numeric" "numeric"
The summary shows, for example, that I have had an average of about 64 visits per day. The median is quite a bit lower at 15 visits, which indicates a strongly skewed distribution: most days are quiet and a few days are very busy. And indeed, the summary also shows that there was a maximum of 2077 visits on one day. That's what happens when Martin Odersky tweets one of your posts to his followers!
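To see why a handful of viral days pulls the mean so far above the median, here is a minimal sketch on made-up numbers (the vector `visits.example` is hypothetical, not the real data):

```r
# Hypothetical daily visit counts: mostly quiet days plus one viral spike.
visits.example <- c(5, 8, 12, 15, 15, 20, 30, 2000)

print(mean(visits.example))    # dominated by the single spike
print(median(visits.example))  # barely affected by it
```

The median stays with the typical day, which is why it is the more useful "normal traffic" figure for spiky data like this.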
We can run the same exploratory commands on the ga.pvs dataset. Keep in mind that these figures are split over multiple dimensions (pageTitle, medium and nthDay), whereas the previous numbers had just one dimension (nthDay). So we'll have to do some slicing and dicing later on to make sense of the actual numbers.
head(ga.pvs)
##                         pageTitle   medium nthDay pageviews visits visitors
## 1               Scala is like Git referral    272      1809   1675     1631
## 2 Cross-Build Injection in action referral    200      1148   1099     1083
## 3               Scala is like Git referral    273       966    846      798
## 4      Preparing a technical talk referral     75       842    785      766
## 5  Modern concurrency and Java EE referral    123       730    688      662
## 6               Scala is like Git referral    274       633    567      533
summary(ga.pvs)
## pageTitle medium nthDay pageviews visits visitors
## Length:3111 Length:3111 Min. : 0 Min. : 0 Min. : 0.0 Min. : 1.0
## Class :character Class :character 1st Qu.:159 1st Qu.: 1 1st Qu.: 1.0 1st Qu.: 1.0
## Mode :character Mode :character Median :252 Median : 2 Median : 1.0 Median : 1.0
## Mean :230 Mean : 9 Mean : 7.5 Mean : 7.7
## 3rd Qu.:304 3rd Qu.: 4 3rd Qu.: 3.0 3rd Qu.: 3.0
## Max. :364 Max. :1809 Max. :1675.0 Max. :1631.0
sapply(ga.pvs[1, ], class)
## pageTitle medium nthDay pageviews visits visitors
## "character" "character" "integer" "numeric" "numeric" "numeric"
Now that we have a better understanding of the raw input data, we'll try to get some visualizations going. Can we spot any trends or interesting datapoints?
We'll start with calculating the total number of visits this year. Easy:
sum(ga.df$visits)
## [1] 23262
Next, let's plot the visits per day over the last year. This type of plot is directly available in the Google Analytics dashboard as well, but let's reproduce it using ggplot2 in R and add something extra:
qplot(data = ga.df, x = nthDay, y = visits) + geom_smooth(method = lm) # add linear regression line with confidence interval in dark gray
The line represents a linear regression model of the number of visits against the number of days my blog has run. Clearly a linear model does not account very well for the spiky data, but at least it shows a rising trend :)
The line in the plot was created by ggplot, but we can also calculate the linear model using lm() to see some more information about it.
# calculate the linear regression model
ga.lm <- lm(ga.df$visits ~ ga.df$nthDay)
print(summary(ga.lm))
##
## Call:
## lm(formula = ga.df$visits ~ ga.df$nthDay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.0 -55.6 -39.3 -18.7 1994.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.1479 18.3569 1.32 0.189
## ga.df$nthDay 0.2159 0.0871 2.48 0.014 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176 on 364 degrees of freedom
## Multiple R-squared: 0.0166, Adjusted R-squared: 0.0139
## F-statistic: 6.15 on 1 and 364 DF, p-value: 0.0136
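The details of the linear model show that I receive about 0.22 more visits per day for every additional day that my blog runs (see the coefficient of ga.df$nthDay). The p-value of 0.014 isn't half-bad either, although an R-squared below 0.02 means the trend explains very little of the day-to-day variation.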
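The individual numbers in such a summary can also be pulled out programmatically. The sketch below uses a synthetic data frame (`toy.trend`, with a known slope of 0.2 plus noise) instead of the real ga.df, and the `data =` form of the formula, which keeps the coefficient names short:

```r
set.seed(42)  # make the synthetic noise reproducible

# Synthetic stand-in for ga.df: a slow upward trend plus random noise.
toy.trend <- data.frame(nthDay = 0:365)
toy.trend$visits <- 25 + 0.2 * toy.trend$nthDay + rnorm(nrow(toy.trend), sd = 10)

toy.lm <- lm(visits ~ nthDay, data = toy.trend)
toy.summary <- summary(toy.lm)

# Extra visits per additional day (the slope of the fitted line).
slope <- unname(coef(toy.lm)["nthDay"])
# Significance of the slope and share of variance explained by the model.
p.value <- toy.summary$coefficients["nthDay", "Pr(>|t|)"]
r.squared <- toy.summary$r.squared

print(c(slope = slope, p.value = p.value, r.squared = r.squared))
```

Because the toy data is far less spiky than real traffic, its R-squared comes out much higher than the one in the output above.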
The following boxplot shows the distribution of visits over the days of the week. Each box spans the 25% to 75% quantiles of the data. The bar inside each box is the median value for that day of the week.
qplot(data = ga.df, x = dayOfWeek, y = visits) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = quantile(ga.df$visits, c(0.1, 0.9)))
Note that I have zoomed in on the y-axis (using coord_cartesian) to keep the chart readable, since there are some outliers. The maximum number of visits on a day is 2077, as could be seen earlier, whereas this boxplot is restricted to the range between the 10% and 90% quantiles of the daily visits.
Weekends are particularly slow in comparison to normal weekdays. Programming blogs are read at work it seems. Makes sense. I don't really have an explanation as to why Tuesdays are relatively lower than other weekdays though.
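The numbers behind each box can also be computed directly. A minimal sketch on a made-up data frame (`toy.week` and its values are hypothetical), using tapply() and quantile():

```r
# Hypothetical visits for two weekdays, four observations each.
toy.week <- data.frame(
    dayOfWeek = factor(rep(c("Monday", "Saturday"), each = 4)),
    visits = c(10, 20, 30, 40, 1, 2, 3, 4)
)

# For every weekday: lower quartile, median and upper quartile of visits.
quartiles <- tapply(toy.week$visits, toy.week$dayOfWeek,
    quantile, probs = c(0.25, 0.5, 0.75))

print(quartiles)
```

On the real data the same call would take ga.df$visits and ga.df$dayOfWeek as its first two arguments.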
We're now going to switch to the ga.pvs dataset, so we can zoom in more on particular posts and traffic patterns. Let's count the total number of pageviews and unique visitors:
print(sum(ga.pvs$pageviews))
## [1] 27860
print(sum(ga.pvs$visitors))
## [1] 23903
Now break it down by article:
# Create subtotals per pageTitle
pageViewsByPost <- aggregate(cbind(pageviews, visits, visitors) ~ pageTitle, data = ga.pvs, FUN = sum)
# Drop everything that has insignificant views and will only clutter our table
pageViewsByPost <- pageViewsByPost[pageViewsByPost$pageviews > 15, ]
# Sort, most popular article first (by pageviews)
print(pageViewsByPost[order(-pageViewsByPost$pageviews), ])
## pageTitle pageviews visits visitors
## 21 Scala is like Git 9383 8124 7936
## 18 Modern concurrency and Java EE 4300 3849 3695
## 29 Unify client-side and server-side rendering by embedding JSON 3141 2807 2769
## 14 Cross-Build Injection in action 2355 2111 2148
## 19 Preparing a technical talk 2343 2104 1995
## 10 Branch and Bound Blog - Sander Mak 1225 791 948
## 15 Cross-build injection attacks: how safe is your build? 1218 883 1051
## 30 Verify dependencies using PGP 784 628 680
## 17 JFall 2012: Dutch Java awesomeness 729 620 648
## 7 Book review: DSLs in Action 633 575 598
## 16 JEEConf 2012 trip report 505 442 446
## 12 Branch and Bound Blog - pruning the software universe... 285 156 200
## 5 About Sander Mak 277 21 231
## 6 Archive 188 10 162
## 11 Branch and Bound Blog - pruning the software universe ... 124 72 82
## 4 About 103 4 87
## 27 Talks by Sander Mak 90 4 79
## 8 Book review: Infinity and the Mind 70 50 55
## 13 Categories 55 0 49
## 26 Tags 35 2 32
The article 'Scala is like Git' accounts for approximately a third of my pageviews! That one really exceeded my expectations. Unfortunately, one of my personal favorites, the book review of 'Infinity and the Mind', languishes near the bottom of the barrel. Oh well, at least I had fun reading the book and writing the post!
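The share of that top article follows directly from the totals printed above: 9383 pageviews out of 27860 overall. As a quick check:

```r
# Numbers copied from the output above.
article.pageviews <- 9383   # 'Scala is like Git'
total.pageviews <- 27860    # sum(ga.pvs$pageviews)

# Fraction of all pageviews that went to this one article.
share <- article.pageviews / total.pageviews
print(round(100 * share, 1))  # as a percentage
```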
Also, I noted that some entries have fewer visits than visitors. That can't be right? Haven't found an explanation yet.
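A quick way to list the affected rows is to filter on that condition. The sketch below rebuilds a few rows of the table by hand (`toy.posts` is only a stand-in for pageViewsByPost):

```r
# A few rows copied from the table above.
toy.posts <- data.frame(
    pageTitle = c("Scala is like Git", "Cross-Build Injection in action",
        "About Sander Mak", "Archive"),
    visits = c(8124, 2111, 21, 10),
    visitors = c(7936, 2148, 231, 162),
    stringsAsFactors = FALSE
)

# Keep only the posts that report fewer visits than visitors.
suspicious <- toy.posts[toy.posts$visits < toy.posts$visitors, ]
print(suspicious)
```

On the real data the same filter would be applied to pageViewsByPost instead of the hand-built data frame.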
What traffic patterns emerged over the last year? The following plot again has time (number of days the blog was live) on the x-axis. The y-axis represents the percentage of pageviews that originated from each of the three traffic sources, which are distinguished by color.
pageViewsByMediumAndDay <- aggregate(pageviews ~ medium + nthDay, data = ga.pvs, FUN = sum)
# Percentage of pageviews from search, direct visits and referrals over time, in bins of 7 days.
# geom_histogram() is used because recent ggplot2 versions no longer accept binwidth in geom_bar().
ggplot(pageViewsByMediumAndDay, aes(nthDay, weight = pageviews, fill = medium)) +
    geom_histogram(position = "fill", binwidth = 7) + scale_y_continuous(labels = percent) +
    xlab("nth day") + ylab("Percentage") + scale_fill_discrete("Traffic source\n(medium)",
    breaks = c("(none)", "organic", "referral"), labels = c("Direct", "Search", "Referral"))
The majority of traffic is driven by referrals, mostly from link-sharing sites like Reddit and DZone but also from Twitter. It is good to see that the proportion of search traffic is increasing over time. Apparently my PageRank is increasing, and more content also means more opportunities for Google to send traffic to my blog.
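The proportions per traffic source can also be tabulated instead of plotted. A minimal sketch, assuming a made-up data frame `toy.pv` with the same columns as pageViewsByMediumAndDay: days are grouped into weeks with integer division, and prop.table() turns the weekly totals into shares. The week boundaries here need not match the bins that ggplot2 chooses.

```r
# Hypothetical pageviews for two weeks and two traffic sources.
toy.pv <- data.frame(
    medium = c("organic", "referral", "organic", "referral"),
    nthDay = c(0, 3, 7, 10),
    pageviews = c(10, 90, 40, 60)
)

# Week index: days 0-6 are week 0, days 7-13 are week 1, and so on.
toy.pv$week <- toy.pv$nthDay %/% 7

# Rows are weeks, columns are traffic sources, cells are summed pageviews.
weekly <- xtabs(pageviews ~ week + medium, data = toy.pv)

# Divide every row by its total so that each week sums to 1.
shares <- prop.table(weekly, margin = 1)
print(shares)
```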
Looking at the absolute amount of Google search traffic over time, the trend is even clearer:
# Pageviews from Google search in absolute numbers over time.
organicTraffic <- pageViewsByMediumAndDay[pageViewsByMediumAndDay$medium == "organic", ]
ggplot(organicTraffic, aes(nthDay, pageviews)) + geom_point() + geom_smooth(method = lm) + ylim(0, max(organicTraffic$pageviews))
That's all for now! Hopefully this will give you an idea of how to start playing around with Google Analytics data in R.