Saturday, June 20, 2015

SELECTing from Multiple Data Frames : JOINing


So far we discussed operations on a single Data Frame. However, it is often needed to JOIN two data frames. This is rather straight forward to do in R using the merge() function


JOINING Based on Single Column


Let's consider these two Data Frames:

Customers <- data.frame(Id=integer(),
                        Name=character(),
                        Revenue=double(),
                        Geography=factor(levels=c('South','North','East','West')),
                        IsActive=logical(),
                        stringsAsFactors=FALSE)


ContactPersons <- data.frame(Id=integer(), 
                        CustomerId=integer(),
                        Name=character(),
                        Email=character(),

                        stringsAsFactors=FALSE)


The two Data Frames are related based on the Customer Id.

INNER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId")

CROSS JOIN

merge(Customers, ContactPersons, by= NULL)

LEFT OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE)


RIGHT OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.y = TRUE)


FULL OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE, all.y = TRUE)

JOINING Based on Multiple Columns


We can join based on multiple columns by passing vectors of column names to by.x and by.y.

For example, here is how we can perform an inner join between the two Data Frames based on Id as well as Name columns. (Whether it makes sense to join based on these two columns or not is a different issue)

merge(Customers, ContactPersons, by.x=c("Id","Name"),by.y=c("CustomerId","Name"))


Ordering and Subsetting after Joins

merge() function returns a Data Frame that can be ordered and subsetted using techniques discussed in the previous post named SELECTing from a Data Frame.

In addition, the function also takes a logical parameter named sort that would sort the resultant Data Frame based on the joining column.

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", sort=TRUE)

No comments:

Post a Comment