So far we discussed operations on a single Data Frame. However, it is often needed to JOIN two data frames. This is rather straight forward to do in R using the merge() function
JOINING Based on Single Column
Let's consider these two Data Frames:
Customers <- data.frame(Id=integer(),
Name=character(),
Revenue=double(),
Geography=factor(levels=c('South','North','East','West')),
IsActive=logical(),
stringsAsFactors=FALSE)
ContactPersons <- data.frame(Id=integer(),
CustomerId=integer(),
Name=character(),
Email=character(),
stringsAsFactors=FALSE)
The two Data Frames are related based on the Customer Id.
INNER JOIN
merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId")
CROSS JOIN
merge(Customers, ContactPersons, by= NULL)
LEFT OUTER JOIN
merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE)
RIGHT OUTER JOIN
merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.y = TRUE)
FULL OUTER JOIN
merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE, all.y = TRUE)
JOINING Based on Multiple Columns
We can join based on multiple columns by passing vectors of column names to by.x and by.y.
For example, here is how we can perform an inner join between the two Data Frames based on Id as well as Name columns. (Whether it makes sense to join based on these two columns or not is a different issue)
merge(Customers, ContactPersons, by.x=c("Id","Name"),by.y=c("CustomerId","Name"))
Ordering and Subsetting after Joins
merge() function returns a Data Frame that can be ordered and subsetted using techniques discussed in the previous post named SELECTing from a Data Frame.
In addition, the function also takes a logical parameter named sort that would sort the resultant Data Frame based on the joining column.
merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", sort=TRUE)
No comments:
Post a Comment