R Snippets

How to combine several date columns into one:

weather <- data.frame(Year.UTC=c(2012, 2013), Month.UTC=c(1,2), Day.UTC=c(1,2), Hour.UTC=c(22,23))

weather <- within(weather, datetime <- as.POSIXct(paste(Year.UTC, Month.UTC, Day.UTC, Hour.UTC, sep="-"), format="%Y-%m-%d-%H", tz="UTC"))
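A hedged alternative sketch: base R's ISOdatetime() builds the POSIXct value directly from the numeric columns, skipping the paste()/format string (uses the same weather data frame as above; datetime2 is just an illustrative name).

# ISOdatetime(year, month, day, hour, min, sec, tz) returns a POSIXct
weather$datetime2 <- with(weather,
                          ISOdatetime(Year.UTC, Month.UTC, Day.UTC, Hour.UTC, 0, 0, tz = "UTC"))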

Link

Plot multiple histograms in a vertical stack based on a factor

library(ggplot2)

ggplot(movies, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, aes(fill = factor(Comedy))) +
  facet_grid(Comedy ~ .)

Link

Check the column names of a data frame and whether particular columns exist

df <- data.frame(Name = "Ben", Age = 12, Address = "CA", ContactNo = 1234567)

# Note the names attribute for the data frame df:

names(df)
[1] "Name" "Age" "Address" "ContactNo"

#Then you can check to see if the variables of interest are in the set of variables in the data frame:

c("Gender", "Age") %in% names(df)
[1] FALSE  TRUE

Link

Sort a dataframe by multiple columns

dd[with(dd, order(-z, b)), ]

    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Link

The apply functions

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that “I should be using an *apply function here”, but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

apply – When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues).

# Two dimensional matrix

M <- matrix(seq(1,16), 4, 4)

# apply min to rows

apply(M, 1, min)

[1] 1 2 3 4

# apply max to columns

apply(M, 2, max)

[1] 4 8 12 16

# 3 dimensional array

M <- array( seq(32), dim = c(4,4,2))

# Apply sum across each M[*, , ] – i.e. sum across the 2nd and 3rd dimensions

apply(M, 1, sum)

# Result is one-dimensional

[1] 120 128 136 144

# Apply sum across each M[*, *, ] – i.e. sum across the 3rd dimension

apply(M, c(1,2), sum)

# Result is two-dimensional

     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
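For instance, with the 4 x 4 matrix from the first example (a quick sketch):

M <- matrix(seq(1, 16), 4, 4)

rowSums(M)   # same result as apply(M, 1, sum), but much faster
colMeans(M)  # same result as apply(M, 2, mean)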

lapply – When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

x <- list(a = 1, b = 1:3, c = 10:100)

lapply(x, FUN = length)

$a

[1] 1

$b

[1] 3

$c

[1] 91

lapply(x, FUN = sum)

$a

[1] 1

$b

[1] 6

$c

[1] 5005

sapply – When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(…)), stop and consider sapply.

x <- list(a = 1, b = 1:3, c = 10:100)

#Compare with above; a named vector, not a list

sapply(x, FUN = length)

a b c

1 3 91

sapply(x, FUN = sum)

a b c

1 6 5005

In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

sapply(1:5, function(x) rnorm(3, x))

If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

sapply(1:5, function(x) matrix(x, 2, 2))

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
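When the lengths differ, sapply cannot simplify and quietly falls back to returning a list, as in this small sketch:

# seq_len() returns vectors of different lengths, so the result stays a list
sapply(1:3, seq_len)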

vapply – When you want to use sapply but perhaps need to squeeze some more speed out of your code.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

x <- list(a = 1, b = 1:3, c = 10:100)

# Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.

vapply(x, FUN = length, FUN.VALUE = 0L)

a b c

1 3 91
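The template also buys some safety: if a result does not match FUN.VALUE, vapply stops with an error instead of silently coercing. A minimal sketch:

# Matches the template: length() always returns a single integer
vapply(x, FUN = length, FUN.VALUE = integer(1))

# Does not match: a character template against integer results,
# so this call would throw an error rather than simplify anyway
# vapply(x, FUN = length, FUN.VALUE = character(1))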

mapply – For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.

#Sums the 1st elements, the 2nd elements, etc.

mapply(sum, 1:5, 1:5, 1:5)

[1] 3 6 9 12 15

#To do rep(1,4), rep(2,3), etc.

mapply(rep, 1:4, 4:1)

[[1]]

[1] 1 1 1 1

[[2]]

[1] 2 2 2

[[3]]

[1] 3 3

[[4]]

[1] 4
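A closely related base function is Map(), which is essentially mapply() with SIMPLIFY = FALSE, so it always returns a list:

# Same call as the sum example above, but the result stays a list
Map(sum, 1:5, 1:5, 1:5)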

rapply – For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I’m sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

#Append ! to string, otherwise increment

myFun <- function(x){
    if (is.character(x)){
        return(paste(x, "!", sep = ""))
    } else {
        return(x + 1)
    }
}

#A nested list structure

l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

#Result is named vector, coerced to character

rapply(l, myFun)

#Result is a nested list like l, with values altered

rapply(l, myFun, how = "replace")

tapply – For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor. The black sheep of the *apply family, of sorts. The help files' use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:

x <- 1:20

A factor (of the same length!) defining groups:

y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

tapply(x, y, sum)

a b c d e

10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors, as in the sketch below. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
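A brief sketch of that multi-factor case: pass a list of grouping factors and tapply returns an array with one dimension per factor.

x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
z <- factor(rep(c("lo", "hi"), times = 10))

# Sum of x within each (y, z) combination; the result is a 5 x 2 array
tapply(x, list(y, z), sum)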

Link

SQL-type joins in R

Given two data frames

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))

df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

> df1
  CustomerId Product
1          1 Toaster
2          2 Toaster
3          3 Toaster
4          4   Radio
5          5   Radio
6          6   Radio

> df2
  CustomerId   State
1          2 Alabama
2          4 Alabama
3          6    Ohio

How does one do:

An inner join of df1 and df2

An outer join of df1 and df2

A left outer join of df1 and df2

A right outer join of df1 and df2

Can I do a SQL-style select statement? (See the sqldf sketch after the merge examples below.)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by="CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)
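For the SQL-style select question, one common route (a sketch, not from the linked answer) is the sqldf package, which also appears in the import example further down:

library(sqldf)

# Inner join written as a SQL select against the data frames above
sqldf("SELECT df1.CustomerId, Product, State
       FROM df1 JOIN df2 ON df1.CustomerId = df2.CustomerId")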

Link

Import large amounts of data

What is the best way to import 2GB or more of text data into an R data frame? Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.

Set up the test data:

bigdf <- data.frame(dim = sample(letters, replace = T, 4e7), fact1 = rnorm(4e7), fact2 = rnorm(4e7, 20, 50))

write.csv(bigdf, "bigdf.csv", quote = F)

I restarted R before running the following import routine:

library(sqldf)

f <- file("bigdf.csv")

system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
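As an aside (not part of the original post), data.table's fread() is another widely used fast reader for files of this size; a minimal sketch with the same file:

library(data.table)

# fread() auto-detects the separator, header and column types
bigdf <- fread("bigdf.csv")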

Two lines in the same plot

x <- seq(-2, 2, 0.05)

y1 <- pnorm(x)

y2 <- pnorm(x,1,1)

plot(x, y1, type = "l", col = "red")

lines(x, y2, col = "green")

Link

Heat map of stock

stock <- "MSFT"
start.date <- "2006-01-12"
end.date <- Sys.Date()

quote <- paste("http://ichart.finance.yahoo.com/table.csv?s=",
               stock, "&a=", substr(start.date, 6, 7),
               "&b=", substr(start.date, 9, 10),
               "&c=", substr(start.date, 1, 4),
               "&d=", substr(end.date, 6, 7),
               "&e=", substr(end.date, 9, 10),
               "&f=", substr(end.date, 1, 4),
               "&g=d&ignore=.csv", sep = "")

stock.data <- read.csv(quote, as.is=TRUE)

stock.data <- transform(stock.data,
                        week = as.POSIXlt(Date)$yday %/% 7 + 1,
                        wday = as.POSIXlt(Date)$wday,
                        year = as.POSIXlt(Date)$year + 1900)

library(ggplot2)

ggplot(stock.data, aes(week, wday, fill = Adj.Close)) +
    geom_tile(colour = "white") +
    scale_fill_gradientn(colours = c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384")) +
    facet_wrap(~ year, ncol = 1)

Link

Pivot-table like manipulations with R

#Create some random data

genes = paste("MMP", sprintf("%04d", 1:10), sep = "")

data = expand.grid(gene = genes, condition = c("copper", "cheetos", "beer", "pizza"))

data$value = rnorm(40)

data

library(reshape2)

dcast(data, gene ~ condition)
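As written, dcast() guesses the value column and prints a note; naming it explicitly is an optional but clearer variant:

# Spell out the value column instead of letting dcast() guess it
dcast(data, gene ~ condition, value.var = "value")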

How to remove rows of a df that contain NAs

Remove the lines in a dataframe that contain NAs, across all columns.

Example:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0    2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA    1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0    1    2    3    2

Desired result is a DF such as:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Also – how to filter on only some columns, with a resulting DF like this:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

I know this is straightforward DF manipulation, but I never get it right :-(

Answer – Try complete.cases:

final[complete.cases(final), ]

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NAs. complete.cases allows partial selection by using part of the data frame:

final[complete.cases(final[, 5:6]), ]

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Link

How to drop levels in a subset of a data frame

When a data frame contains a factor, and a subset of this dataframe is made using subset() or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels — even when they do not exist in the new data frame.

This creates headaches when doing faceted plotting or using functions that rely on factor levels.

What is the best way to remove levels from a factor in a new data frame?

Example:

df <- data.frame(letters = letters[1:5],
                 numbers = seq(1:5))

levels(df$letters)

## [1] "a" "b" "c" "d" "e"

subdf <- subset(df, numbers <= 3)

##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3

## but the levels are still there!

levels(subdf$letters)

## [1] "a" "b" "c" "d" "e"

Answer

subdf$letters

[1] a b c

Levels: a b c d e

subdf$letters <- factor(subdf$letters)

subdf$letters

[1] a b c

Levels: a b c
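Base R (since 2.12) also offers droplevels(), which drops unused levels from a single factor or from every factor in a data frame at once; a brief sketch:

# Drop unused levels from all factor columns in one call
subdf <- droplevels(subset(df, numbers <= 3))

levels(subdf$letters)

## [1] "a" "b" "c"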

Link

Pivot-table like actions for huge files

Pivot table type operations

For a huge csv file – a data table with 1000 columns named var1, var2, …, var1000, and so on. Here is a small version:

library(data.table)

dt <- data.table(uid = c("a", "b"), var1 = c(1, 2), var2 = c(100, 200))

The goal is output similar to reshape’s melt function:

melt(dt, id=c("uid"))

  uid variable value
1   a     var1     1
2   b     var1     2
3   a     var2   100
4   b     var2   200

That is, all the columns except for uid are listed under a single column, with the corresponding values in an adjoining column. melt is not an option due to the time it takes to run the operation on a dataset that is several GBs in size.

Answer:

dt[, list(variable = names(.SD), value = unlist(.SD, use.names = F)), by = uid]
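Worth adding (not from the linked answer): newer data.table versions ship their own fast melt method, so the same reshape can also be written directly as:

library(data.table)

# melt() dispatches to data.table's C implementation for data.tables
melt(dt, id.vars = "uid")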

Link

Using reshape2 to get pivot-table analysis:

Given this type of data:

  Country Salesperson Order.Date OrderID Order.Amount
1      UK      Suyama 10/07/2003   10249      1863.40
2     USA     Peacock 11/07/2003   10252      3597.90
3     USA     Peacock 12/07/2003   10250      1552.60
4      UK   Dodsworth 15/07/2003   10255      2490.50
5     USA   Leverling 15/07/2003   10251       654.06
6      UK    Buchanan 16/07/2003   10248       440.00

library(reshape2)

Melt

Melt transforms a data frame from the original format to a so-called long format, where all the observed variables (called measures) appear, with their respective values, in two adjacent columns named variable and value. Each row of this new data format is identified by a unique combination of the id variables, also part of the original data frame.

# id is assigned the columns you may want to pivot on – i.e., let's look at totals by salesperson, or averages by country. measure is the value column that gets aggregated:

data.m <- melt(data, id=c(1:4), measure=c(5))

This is what the “melted” data looks like:

  Country Salesperson Order.Date OrderID     variable   value
1      UK      Suyama 10/07/2003   10249 Order.Amount 1863.40
2     USA     Peacock 11/07/2003   10252 Order.Amount 3597.90
3     USA     Peacock 12/07/2003   10250 Order.Amount 1552.60
4      UK   Dodsworth 15/07/2003   10255 Order.Amount 2490.50
5     USA   Leverling 15/07/2003   10251 Order.Amount  654.06
6      UK    Buchanan 16/07/2003   10248 Order.Amount  440.00

Now, cast the melted data to build the pivot table. Cast requires us to indicate how we want to re-aggregate the values.

The general form is:

dcast(data, formula = ... ~ variable, fun.aggregate = NULL, ..., margins = FALSE, subset = TRUE)

# The actual code:

data.c <- dcast(data.m, Salesperson ~ variable, sum, margins = c("grand_row"))

To filter the pivot table

# subset uses plyr-style quoting, i.e. .(Country == "USA")
data.c <- dcast(data.m, Salesperson ~ variable, sum, margins = c("grand_row"), subset = .(Country == "USA"))

# An example from my seasonal appendectomy study

df <- data.frame(Appendectomy.Seasonal.Pt.Abstract.Trimmed)

library(reshape2)

df.m <- melt(df, id=c(2), measure=c(5))

df.c <- dcast(df.m, Hospital_City ~ variable, mean)

head(df.c)

Link

How to quickly check the integrity and structure of a dataframe

str(df)

gives:

'data.frame': 142166 obs. of 34 variables:
 $ Hospital_Number : int 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 ...
 $ Hospital_City   : Factor w/ 45 levels "Akron","Atlanta",..: 41 41 41 41 41 41 41 41 41 41 ...

Convert date to correct format

data$Order.Date <- as.Date(data$Order.Date, "%d/%m/%Y")

Create a pivot table with some flexibility

library(reshape2)

library(ggplot2)

dataset <- data.frame(var1 = rep(c("a", "b", "c", "d", "e", "f"), each = 4),
                      var2 = rep(c("level1", "level1", "level2", "level2"), 6),
                      var3 = rep(c("h", "m"), 12), meas = rep(1:12))

head(dataset)

# simply pivot table

dcast(dataset, var1 ~ var2 + var3)

# Using meas as value column. Use the value argument to cast to override this choice

# mean by var1 and var2

dcast(dataset, var1 ~ var2, mean)

# Using meas as value column. Use the value argument to cast to override this choice

# mean by var1 and var3

dcast(dataset, var1 ~ var3, mean)

# Using meas as value column. Use the value argument to cast to override this choice

# mean by var1, var2 and var3 (version 1)

dcast(dataset, var1 ~ var2 + var3, mean)

# Using meas as value column. Use the value argument to cast to override this choice

# mean by var1, var2 and var3 (version 2)

dcast(dataset, var1 + var2 ~ var3, mean)

# Using meas as value column. Use the value argument to cast to override this choice

# use package plyr to create flexible data frames…

library(plyr)

dataset_plyr <- ddply(dataset, .(var1, var2), summarise,
                      mean = mean(meas),
                      se = sd(meas),
                      CV = sd(meas)/mean(meas))

dataset_plyr

# Make a plot

qplot(var1, mean, colour = var2, size = CV, data = dataset_plyr, geom = "point")

Link

Create Sums by Month in ggplot2

Question

Given this data –

x <- structure(list(date = structure(c(1264572000, 1266202800, 1277362800, 1277456400, 1277859600, 1278032400, 1260370800, 1260892800, 1262624400, 1262707200), class = c("POSIXt", "POSIXct"), tzone = ""), data = c(-0.00183760994446658, 0.00089738603087497, 0.000423513598318936, 0, -0.00216496690393131, -0.00434836817931339, -0.0224199153445617, 0.000583823085470003, 0.000353088613905206, 0.000470295331234771)), .Names = c("date", "data"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")

#The binning by month, saving as a date

x$month <- as.Date(cut(x$date, breaks = "month"))

What’s the best way to plot this as a bar plot in ggplot that would show the total value per month (with the month name as text)?

Answer

#Plotting

library(ggplot2)

p <- ggplot(x, aes(month, data)) + stat_summary(fun.y = sum, geom = "bar")

#My suggestions for display

minmax <- max(abs(x$data))

p + geom_hline(yintercept = 0) +
    scale_x_date(date_minor_breaks = "1 month") +
    ylim(-minmax, minmax)

p

# or, more idiomatically in ggplot2:
# + coord_cartesian(ylim = c(-minmax, minmax))

p + scale_x_date(date_breaks = "1 month", date_labels = "%b")

# or

p + scale_x_date(date_breaks = "1 month", date_labels = "%B %Y")

Print the System Date in R Markdown

The date is: "`r Sys.Date()`"

Get rid of the NAs in the first row

Link http://stackoverflow.com/questions/6286313/remove-an-entire-column-from-a-data-frame-in-r#

temps_frequency = temps_frequency[-1,]

Remove columns in R:

Data$genome <- NULL

Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above

You can remove multiple columns via:

Data[1:2] <- list(NULL) # Marek
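A hedged aside, not from the linked answer: dplyr's select() reads well for the same job:

library(dplyr)

# Drop one column by name, or a block of columns by position
Data <- select(Data, -genome)
Data <- select(Data, -(1:2))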

Get the top-ranked rows with dplyr

Flights %>%
  group_by(cpt_code) %>%
  select(Month, Year) %>%
  top_n(50) %>%
  arrange(uniqueCarrier, desc(deldelay))