Sunday, June 21, 2015

UNION and UNION ALL



The commonly used function to perform a UNION ALL operation in R is rbind().

Assuming we have Customers and ArchivedCustomers as two Data Frames, we can call the following:

rbind(Customers, ArchivedCustomers)

However, a few things need to be in place for the above to work properly. Both Customers and ArchivedCustomers must have the same number of columns and the same column names. The column order can differ between the two Data Frames, because rbind() matches columns by name rather than by position. This behavior can be useful, but it is not what a SQL-minded person may expect, so be careful.
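For example, here is a minimal sketch (with made-up data) showing that rbind() aligns the columns of the second Data Frame to the first by name, not by position:

df1 <- data.frame(Id=c(1,2), Name=c('A','B'), stringsAsFactors=FALSE)
df2 <- data.frame(Name=c('C','D'), Id=c(3,4), stringsAsFactors=FALSE)

rbind(df1, df2)   # columns are matched by name, so Id comes out as 1,2,3,4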

Using rbind() when numbers of columns are different

If the number of columns is different, either drop the unwanted columns by subsetting the Data Frame, or add NA columns with matching names to the Data Frame that is missing them.
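For example, assuming ArchivedCustomers has an extra column, say ArchivedOn, that Customers does not have (a hypothetical column used only for illustration), either of the following would work:

# Option 1: keep only the columns that are present in Customers
rbind(Customers, ArchivedCustomers[, names(Customers), drop=FALSE])

# Option 2: add a matching NA column to Customers before stacking
Customers$ArchivedOn <- NA
rbind(Customers, ArchivedCustomers)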

Using rbind() when column names are different

If the column names are different and we want to UNION ALL based on the column order, we can use the following:

rbind(Customers, 
     setNames(ArchivedCustomers, 
          names(Customers)
     )
)

Here, setNames() returns a copy of ArchivedCustomers with its columns renamed to those of the Customers Data Frame, so rbind() effectively stacks the two Data Frames by column position.

Performing UNION

Remember, rbind() does UNION ALL and not UNION. We can wrap rbind() in the unique() function, which removes duplicate rows, to achieve a result similar to UNION in SQL.


unique(
     rbind(Customers, 
          setNames(ArchivedCustomers, 
               names(Customers)
          )
     )
)

Saturday, June 20, 2015

SELECTing from Multiple Data Frames : JOINing


So far we have discussed operations on a single Data Frame. However, we often need to JOIN two Data Frames. This is rather straightforward to do in R using the merge() function.


JOINING Based on Single Column


Let's consider these two Data Frames:

Customers <- data.frame(Id=integer(),
                        Name=character(),
                        Revenue=double(),
                        Geography=factor(levels=c('South','North','East','West')),
                        IsActive=logical(),
                        stringsAsFactors=FALSE)


ContactPersons <- data.frame(Id=integer(),
                        CustomerId=integer(),
                        Name=character(),
                        Email=character(),
                        stringsAsFactors=FALSE)


The two Data Frames are related through the customer Id: Customers$Id on one side and ContactPersons$CustomerId on the other.
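Since the Data Frames defined above are empty, the joins below would return empty results. To make the output visible, we might first populate a few hypothetical rows (all values below are made up for illustration):

Customers <- data.frame(Id=c(1,2,3),
                        Name=c('AFitness','BFitness','CFitness'),
                        Revenue=c(1500.50, 300.00, 12000.75),
                        Geography=factor(c('North','South','East'),
                                         levels=c('South','North','East','West')),
                        IsActive=c(TRUE,FALSE,TRUE),
                        stringsAsFactors=FALSE)

ContactPersons <- data.frame(Id=c(1,2),
                        CustomerId=c(1,3),
                        Name=c('John','Mary'),
                        Email=c('john@example.com','mary@example.com'),
                        stringsAsFactors=FALSE)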

INNER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId")

CROSS JOIN

merge(Customers, ContactPersons, by= NULL)

LEFT OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE)


RIGHT OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.y = TRUE)


FULL OUTER JOIN

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", all.x = TRUE, all.y = TRUE)

JOINING Based on Multiple Columns


We can join based on multiple columns by passing vectors of column names to by.x and by.y.

For example, here is how we can perform an inner join between the two Data Frames based on Id as well as Name columns. (Whether it makes sense to join based on these two columns or not is a different issue)

merge(Customers, ContactPersons, by.x=c("Id","Name"),by.y=c("CustomerId","Name"))


Ordering and Subsetting after Joins

The merge() function returns a Data Frame that can be ordered and subsetted using the techniques discussed in the previous post, SELECTing Data from a Data Frame.

In addition, the function also takes a logical parameter named sort that sorts the resultant Data Frame on the joining columns.

merge(Customers, ContactPersons, by.x="Id",by.y="CustomerId", sort=TRUE)

SELECTing Data from a Data Frame

I expect this post to be the most intense and the most fun. We will discuss how we can perform different SELECT types of operations on Data Frames. This post is restricted to reading from a single Data Frame. In the subsequent post, I will discuss reading from multiple Data Frames by 'joining' them.

SELECT Name, Geography 
FROM Customers 
WHERE Revenue > 10000

The basic approach for selecting data from a Data Frame is subsetting. We will discuss subsetting in general before getting into more complex data extraction examples.

We think of a Data Frame as a tabular structure with rows and columns. We can subset it to select only certain rows and columns based on our criteria. In general, we can write

myDF[r,c, drop=FALSE],  where r and c are our criteria for row selection and column selection respectively. drop=FALSE is a special nuance that is explained later.

r and c both can take the following values:

Empty (nothing): Selects all
Index: Selects entity with the given index.
Name: Selects entity with the given name. (Rows can have names too)
A vector containing indices: Selects entries at the specified index positions. A negative index excludes the entity at the specified index position.
A vector containing names: Selects entries matching row and column names
A vector containing Boolean values: Selects only those entries for which the corresponding entry in the vector is TRUE
A Boolean criterion: Selects only those entries for which the Boolean expression evaluates to TRUE

Here are a few examples:

Customers[,]
Row Selection: All, Column Selection: All

Customers[1,]
Row Selection: 1st Row, Column Selection: All

Customers[,-1]
Row Selection: All, Column Selection: All but first

Customers[,'Geography', drop=FALSE]
Row Selection: All, Column Selection: Geography

drop? What is that?
If the resultant Data Frame has only one column, R converts the result to a vector; in other words, it drops a dimension. Welcome to R. To avoid this, we need to add drop=FALSE. In fact, as newcomers, we may decide to always use drop=FALSE while subsetting Data Frames.
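A quick illustration of the difference:

class(Customers[, 'Geography'])                # not a data.frame any more - the dimension is dropped
class(Customers[, 'Geography', drop=FALSE])    # "data.frame" - the dimension is kept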

Customers[c(1,2),]
Row Selection: 1st and 2nd Rows, Column Selection: All


Customers[,c(2,2)]
Row Selection: All, Column Selection: 2nd Column selected twice

Customers[,c('Name','Geography')]
Row Selection: All, Column Selection: Name and Geography


Customers[c('1','2'),c('Name','Geography')]
Row Selection: Rows whose names are '1' and '2', Column Selection: Name and Geography

Customers[TRUE,FALSE]
Row Selection: All, Column Selection: None. (Why would you do that?)

Customers[c(TRUE,FALSE),]
Row Selection: Alternate starting with First, Column Selection: All

What?? Alternate? How? 

Here is how: if we provide a Boolean vector whose length is less than the number of rows or columns, the vector is recycled, i.e., repeated, to cover the remaining entries. Welcome to R.

Customers[1==1,]
Row Selection: All, Column Selection: All. (Remember 1==1 is TRUE)

Customers[Customers$Revenue>10000,c('Name','Geography')]
Row Selection: Customers with Revenue > 10000, Column Selection: Name and Geography.


Using subset() function

We can also use the subset() function, which has a more intuitive syntax, to subset a Data Frame. However, I have read that its usage is discouraged because in some cases it can have unanticipated consequences. It is best to use the subsetting mechanism discussed above.

subset(Customers, Customers$Revenue>10000 , c('Name','Geography'))


SELECT TOP 10 Name, Geography 
FROM Customers 
WHERE Revenue > 10000 AND IsActive = TRUE

Use the head() function

head(Customers[(Customers$Revenue>10000)&(Customers$IsActive==TRUE), c('Name','Geography')], 10)

SELECT TOP 10 Name, Geography 
FROM Customers 
WHERE Revenue > 100 AND IsActive = TRUE 
ORDER BY Name DESC

In this case, we need to order before selecting the top 10. We will do it in two steps.

tmp<-Customers[(Customers$Revenue>100)&(Customers$IsActive==TRUE), c('Name','Geography')]

head(tmp[order(tmp$Name, decreasing=TRUE),],10)


SELECT TOP 10 Name, Geography 
FROM Customers 
WHERE Revenue > 100 AND IsActive = TRUE 
ORDER BY Name DESC, Geography ASC


This one is a little tricky. The first temptation is to use the order() function as above. However, order() takes only a single decreasing parameter, which applies to all of the fields. So we will use the rank() function inside order() and put a negative sign before the rank for Name.

tmp<-Customers[(Customers$Revenue>100)&(Customers$IsActive==TRUE), c('Name','Geography')]

head(tmp[order(-rank(tmp$Name),rank(tmp$Geography)),],10)

SELECT Geography, IsActive, SUM(Revenue)
FROM Customer 
WHERE Revenue > 500 
GROUP BY Geography, IsActive

We use the aggregate() function to calculate the value by group.

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,], FUN=sum)

Here, FUN can be any of the summary statistics functions such as mean, median, min, max, etc.
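For instance, the SQL AVG(Revenue) would map to FUN=mean:

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,], FUN=mean)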

SELECT Geography, IsActive, SUM(Revenue), COUNT(Revenue)
FROM Customer 
WHERE Revenue > 500 
GROUP BY Geography, IsActive

Note that we are now using multiple aggregate functions. Here, we can pass a custom function that returns both aggregates.

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,], 
                FUN= function(x) c(sum(x),length(x)))
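Naming the elements returned by the custom function makes the output easier to read (a small variation on the line above):

aggregate(Revenue~Geography+IsActive, data=Customers[Customers$Revenue>500,], 
                FUN= function(x) c(Sum=sum(x), Count=length(x)))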


SELECT Geography, IsActive, SUM(Revenue)
FROM Customer 
WHERE Revenue > 500 
GROUP BY Geography, IsActive
HAVING SUM(Revenue)> 2000

This is easy to do in two steps. The first step uses the above approach and the second step uses the basic subsetting.


tmp<- aggregate(Revenue ~ Geography+IsActive, data=Customers[Customers$Revenue>500,], sum)


tmp[tmp$Revenue>2000,]


Sunday, June 14, 2015

INSERT data into Data Frames

More often than not, we read data into a Data Frame from an external source such as a database, a csv file, or even a compressed csv file. R has functions and libraries that make this reading easy. I intend to write separately on data exchange between R and a SQL database. In this post, I will focus on common ways of populating a Data Frame.

Reading data from a CSV File

Here is how we can read the data from a csv file:

Customers <- read.csv("Customers.csv")

What we get is a Data Frame named Customers.

read.csv() can read not only from a file path but also from a URL on the internet. It can even read a few types of compressed files and uncompress them internally before reading. It takes parameters for column names, headers, etc. One should read the help file for the details.
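For example, here is a sketch with a few commonly used arguments (the particular choices here are just for illustration):

Customers <- read.csv("Customers.csv",
                      header=TRUE,              # the first row holds the column names
                      stringsAsFactors=FALSE,   # keep text columns as character, not factor
                      na.strings=c("", "NA"))   # treat empty strings as NA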

Populating a Data Frame

If we want to populate the Data Frame computationally rather than read it from an external data source, there are a few ways: we can use rbind(), or row-index-based population.

Inserting Row using Row Index

We can explicitly specify the row index at which to populate a Data Frame. If a row already exists at that index, its values are replaced. If there is no existing row at that index, a new row is created.

This approach requires knowing the index of the new row to be inserted. This is how we can insert a new row at the end of the Data Frame in a generic manner:

Customers[nrow(Customers)+1,]<-c(4,'YFitness',123.435,'North',TRUE,'Issaquah')

nrow() gives the number of rows in a Data Frame.

We can also use this approach to insert a new row at a particular location in the Data Frame. Let's say we want to insert a new row at the 3rd position in the Customers Data Frame. This is done in two steps. In the first step, we move rows 3:end to 4:(end+1). In the second step, we insert our new row at the 3rd position.

Customers[seq(4,nrow(Customers)+1),] <- Customers[seq(3,nrow(Customers)),]
Customers[3,] <- c(3,'HFitness',123.435,'North',TRUE,'Issaquah')

Inserting Data using rbind()

This is useful when we have a few similar Data Frames and we want to create a UNION of those.

Customers <- rbind(Customers, 
                   OldCustomers, 
                   c(4,'YFitness',123.435,'North',TRUE,'Issaquah'))

In the above line, we are doing a UNION of the Customers Data Frame, the OldCustomers Data Frame, and a vector created on the fly, and assigning the result back to the Customers Data Frame. The rows are added in the order in which they appear in rbind().

rbind() can also be used for inserting a new record at a specified location. Let's say we want to insert a new row at the 3rd position in the Customers Data Frame. We split the Customers Data Frame into two subsets at the position of interest, and then create a UNION of these two Data Frames with the new record in the middle.

Customers <- rbind(Customers[1:2,],
                   c(3,'HFitness',123.435,'North',TRUE,'Issaquah'),
                   Customers[-(1:2),])


Saturday, June 13, 2015

Creating and Modifying 'Tables' in R

Data Frames and Data Tables are your Tables

Two objects in R come close to a SQL table: the Data Frame and the Data Table. Data Tables are relatively new to the R world, have simpler syntax, and arguably have better performance and memory utilization. However, Data Frames are more pervasive in the R world.

For now, we will stick to Data Frames and see how we can perform SQL type DDL operations on Data Frames.

We will use a customer database for our examples. We will introduce the tables and columns as needed.

Create Table

A Data Frame is a very simple in-memory data structure. Things that we take for granted in a SQL table do not even find a mention in R: there is no Primary Key, Foreign Key, Unique Key, Identity Column, Default Value, or Index. Every column is NULLable, or I should say NA-able - if you do not provide a valid value, it becomes NA. 'Factors' come closest to 'Constraints'.

SQL provides a lot of tools for maintaining data integrity. In R, maintaining the integrity of the data is your responsibility. In some cases, even where you can enforce a constraint (as we will see later), R lets you off with just a warning. If you are not careful, you may end up with incorrect results.

Note: One important distinction to keep in mind is that, unlike SQL, R preserves the order of records. SQL does not guarantee that records are returned in the order in which they were inserted, but R does.

If you need to create an empty Data Frame, here is how to do it.

Customers <- data.frame(Id=integer(),
                        Name=character(),
                        Revenue=double(),
                        Geography=factor(levels=c('South','North','East','West')),
                        IsActive=logical(),
                        stringsAsFactors=FALSE)


Notice the data type of the Geography column. It is a factor with predefined levels, which is the closest thing to a column constraint in SQL. However, if you insert a value outside those levels into the Geography column, R warns you and inserts NA instead.
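For example, here is a sketch of what happens when we assign a value outside the predefined levels (the value 'Central' is made up):

Customers[1, 'Geography'] <- 'Central'
# Warning message: invalid factor level, NA generated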

Describe Table

If you want to see the structure of the Data Frame, use the str() function.

> str(Customers)
'data.frame':    0 obs. of  5 variables:
 $ Id       : int
 $ Name     : chr
 $ Revenue  : num
 $ Geography: Factor w/ 4 levels "South","North",..:
 $ IsActive : logi


If you just want to see the names of the columns, use the names() or colnames() functions

> names(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"
> colnames(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"


However, most of the time you load data into R from an external data source and do not need to create a Data Frame from scratch. Be careful about the data types when you import data from external sources; R might have imported everything as character. It is a good idea to check and enforce data types after the import.

A very common way to load data in R is to read from a csv file using the read.csv() function.
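For example (a sketch; the file name and the chosen conversions are assumptions about what the columns should contain):

Customers <- read.csv("Customers.csv", stringsAsFactors=FALSE)
str(Customers)                                    # verify what was actually imported
Customers$Id       <- as.integer(Customers$Id)    # enforce the types we expect
Customers$Revenue  <- as.numeric(Customers$Revenue)
Customers$IsActive <- as.logical(Customers$IsActive)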


Alter Table - Add Column

In order to add a column, you just need to pick a column name and assign it some value; R creates your 'column' automatically. Be careful: if the column already exists, its values are replaced with the new values.

Add an empty column to a Data Frame:

Customers$City <- NA

You can also assign values to the column right when you create it. Create a vector and put values in the order corresponding to your row order. Then assign the vector to the new column in the Data Frame.

Assuming your Customers Data Frame has 3 records in it, you can use the following to create a new column and assign values to it at the same time.

Customers$City <- c("Bellevue", "Redmond", "Seattle")

There are alternative syntaxes for the same thing:

Customers[["City"]] <- c("Bellevue", "Redmond", "Seattle")
Customers[,"City"] <- c("Bellevue", "Redmond", "Seattle")

Alter Table - Rename Column

Renaming a column is a little convoluted if you do not want to use the column index:

names(Customers)[names(Customers)=="Id"] <- "CustId"

However, if you know the index of the column, you can use it directly.

names(Customers)[1] <- "CustId"

Alter Table - Remove Column

To remove a column, assign NULL to it using any of the syntaxes above.

Customers$Geography<-NULL

Alter Table - Change Data Type of Column

You can use the as.X() family of functions to change the data type of a column from its existing type to X. If R cannot meaningfully convert the values, it sets them to NA and warns you.

Customers$City<-as.numeric(Customers$City)

The above line tries to convert City to numeric; since city names are not numbers, R warns you and fills the City column with NAs. If the conversion were meaningful, it would simply perform it.


Drop Table


Data Frames are in-memory objects, and you can get rid of them just like any other object in R by using the rm() or remove() functions.

rm(Customers)

Saturday, June 6, 2015

Primitive Data Types in R

Data Types are fundamental building blocks that we need to learn before we actually get into data manipulation. So let's start with these.

Note: I often see R tutorial websites mixing R's primitive data types and R objects and listing the two together. It is like saying that Table is a data type in SQL. This can create confusion. So, let's keep the two separate.


Primitive Data Types


R has the following main primitive data types:
Character, Logical, Integer, Double, Complex, and Date


Mapping SQL Data Types to R Data Types

This is how SQL data types map to R data types:

Char, Varchar : Character
Binary, Varbinary : No primitive type but can be stored as raw vectors.
Boolean : Logical
SmallInt, Integer: Integer
Float, Real, Double, BigInt, Decimal, Numeric : Double
Date, Time : Date


NULL

In SQL, NULL represents missing data. In R, however, there are two related concepts: NA and NULL. NA represents a missing value - a placeholder for something that exists but is unknown. NULL represents something that does not exist at all. The distinction is more nuanced than this, but for now just keep in mind that SQL's NULL splits into these two forms in R.

The is.na() and is.null() functions tell you whether a value is NA or NULL.
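A few quick checks illustrate the difference:

is.na(NA)       # TRUE
is.null(NULL)   # TRUE
is.null(NA)     # FALSE - NA exists, its value is just unknown
is.na(NULL)     # logical(0) - there is nothing to test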

NaN and Inf

Other interesting values in R are NaN and Inf. NaN represents an undefined number, such as 0/0. Inf represents an infinite value, such as 1/0.
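For example:

0/0               # NaN
1/0               # Inf
is.nan(0/0)       # TRUE
is.infinite(1/0)  # TRUE
is.na(NaN)        # TRUE - NaN is also treated as a missing value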

CAST and CONVERT

Often, when you load data into R from an external source, you may find that the data types are not what you expected them to be. This can be frustrating because R does some implicit conversions and the results may differ from what you expect.

The following R functions are helpful in such situations:

class(x): Tells the class of 'x'
as.X(y): Converts the data type of y to X. X can be character, numeric, logical etc.
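For example:

class("123")        # "character"
as.numeric("123")   # 123
as.numeric("abc")   # NA, with a warning that NAs were introduced by coercion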




Motivation behind this blog


SQL and R are meant for different purposes. R is arguably the richest statistical tool, whereas SQL is a robust programming language for managing relational databases. However, there is an overlap between the two when it comes to manipulating data. R needs to manipulate data while reading, cleansing, and structuring it before feeding it to its powerful libraries. SQL, on the other hand, manipulates data for a living.

When I started with R, I used to think in SQL and then try to perform the same operation in R. It was very time consuming, and I wished there were a tutorial or guide that would tell me how to do SQL kinds of things in R. Therefore I started this blog to help people like me who come from a SQL background into the world of R. I will try to give cookbook-style solutions in R for common SQL operations.

Keep your inputs coming.

Cheers!

My favorite SQL Joke:
A SQL walks into a bar and sees two tables. He walks up to them and says, "Can I join you?"

My SQL Poem:
SELECT Friends
FROM World
WHERE Hatred is NOT NULL
GROUP BY Friends
HAVING  SUM(Fun) > 0
-- ORDER BY ... (Friends are friends, they should not be ordered by an attribute).