R for SQL minded: SELECTing Data from a Data Frame

I expect this post to be most intense and most fun. We will discuss how we can perform different SELECT type of operations with Data Frames. This post is restricted to reading from a single Data Frame. In the subsequent post, I will discuss reading from multiple Data Frames by 'joining' them.

SELECT Name, Geography
FROM Customers
WHERE Revenue > 10000

The basic approach for selecting data from Data Frame is by subsetting. We will discuss subsetting before getting into examples and more complex data extraction examples.

We think of a Data Frame as a tabular structure with R rows and C columns. We can subset and select only certain rows and columns from the Data Frame based on our criteria. In general, we can write

myDF[r,c, drop=FALSE], where r and c are our criteria for row selection and column selection respectively. drop=FALSE is a special nuance that is explained later.

r and c both can take the following values:

Empty (nothing): Selects all
Index: Selects entity with the given index.
Name: Selects entity with the given name. (Rows can have names too)
A vector containing indices: Selects entries at the specified index positions. A negative index excludes the entity at the specified index position.
A vector containing names: Selects entries matching row and column names
A vector containing Boolean values: Selects only those entries for which the corresponding entry in the vector is true
A Boolean criteria: Selects only those entries for which the Boolean

Here are few examples:

Customers[,]
Row Selection: All, Column Selection: All

Customers[1,]
Row Selection: 1st Row, Column Selection: All

Customers[,-1]
Row Selection: All, Column Selection: All but first

Customers[,'Geography', drop=FALSE]
Row Selection: All, Column Selection:Geography

drop? What is that?
If the resultant Data Frame is of only one column, R would convert the result to a vector; in other words, drop a dimension. Welcome to R. In order to avoid this, we need to add drop=FALSE. In fact, as a new comer, we may decide to use drop=FALSE always while subsetting Data Frames.

Customers[c(1,2),]
Row Selection: 1st and 2nd Rows, Column Selection: All

Customers[,c(2,2)]
Row Selection: All, Column Selection:2nd Column selected twice

Customers[,c('Name','Geography')]
Row Selection: All, Column Selection:Name and Geography

Customers[c('1','2'),c('Name','Geography')]
Row Selection: Rows whose names are '1' and '2', Column Selection:Name and Geography

Customers[TRUE,FALSE]
Row Selection: All, Column Selection:None. (Why would you do that?)

Customers[c(TRUE,FALSE),]
Row Selection: Alternate starting with First, Column Selection: All

What?? Alternate? How?

Here is how: If we provide a Boolean vector whose length is less than the number of rows or columns, the vector would be repeated for the rest of the entries. Welcome to R.

Customers[1==1,]
Row Selection: All, Column Selection:All. (Remember 1==1 is TRUE)

Customers[Customers$Revenue>10000,c('Name','Geography')]
Row Selection: Customers with Revenue > 10000, Column Selection:Name and Geography.

Using subset() function

We can also use subset function, which has a more intuitive syntax, to subset a Data Frame. However, I have read that its usage is not encouraged as in some cases it may have unanticipated consequences. It is best to use the subsetting mechanism discussed above.

subset(Customers, Customers$Revenue>10000 , c('Name','Geography'))

SELECT TOP 10 Name, Geography
FROM Customers
WHERE Revenue > 10000 AND IsActive = TRUE

Use the head() function

head(Customers[(Customers$Revenue>10000)&(Customers$IsActive==TRUE), c('Name','Geography')], 10)

SELECT TOP 10 Name, Geography
FROM Customers
WHERE Revenue > 100 AND IsActive = TRUE
ORDER BY Name DESC

In this case, we need to order before selecting the top 10. We will do it in two steps.

tmp<-Customers[(Customers$Revenue>100)&(Customers$IsActive==TRUE), c('Name','Geography')]

head(tmp[order(tmp$Name, decreasing=TRUE),],10)

SELECT TOP 10 Name, Geography
FROM Customers
WHERE Revenue > 100 AND IsActive = TRUE
ORDER BY Name DESC, Geography ASC

This one is little tricky. The first temptation is to use the order function as above. However, the order function takes only one parameter, decreasing, that applies to all the fields. So, we will use the rank() function inside the order function and use a negative sign before the rank for Name.

tmp<-Customers[(Customers$Revenue>100)&(Customers$IsActive==TRUE), c('Name','Geography')]

head(tmp[order(-rank(tmp$Name),rank(tmp$Geography)),],10)

SELECT Geography, IsActive, SUM(Revenue)
FROM Customer
WHERE Revenue > 500
GROUP BY Geography, IsActive

We use the aggregate() function to calculate the value by group.

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,], FUN=sum)

Here, the FUN can be one of the summary statistics functions such as mean, median, minimum, maximum etc.

SELECT Geography, IsActive, SUM(Revenue), COUNT(Revenue)
FROM Customer
WHERE Revenue > 500
GROUP BY Geography, IsActive

Note that we are using multiple aggregate functions. Here, we can use a custom function that would return both the aggregates.

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,],
FUN= function(x) c(sum(x),length(x)))

SELECT Geography, IsActive, SUM(Revenue)
FROM Customer
WHERE Revenue > 500
GROUP BY Geography, IsActive
HAVING SUM(Revenue)> 2000

This is easy to do in two steps. The first step uses the above approach and the second step uses the basic subsetting.

tmp<- aggregate(Revenue ~ Geography+IsActive, data=Customers[Customers$Revenue>500,], sum)

tmp[tmp$Revenue>2000,]

R for SQL minded

Saturday, June 20, 2015

SELECTing Data from a Data Frame

No comments:

Post a Comment