R for SQL minded: Data Frames

More often than not, we read data into a Data Frame from external sources such as a database, a csv file, or even a compressed csv file. R has ways and libraries to make the reading easy. I intend to write separately on data exchange between R and a SQL database. In this post, I would focus on common ways of populating a Data Frame.

Reading data from a CSV File

Here is how we can read the data from a csv file:

Customers <- read.csv("Customers.csv")

What we get is a Data Frame named Customers.

read.csv can read not only using a file path but also using a URL from internet. It can even read few types of compressed files and internally uncompress it before reading. It takes parameters for column names, headers etc. One should read the help file for the details.

Populating a Data Frame

If we want to populate the data frame computationally and not read from an external data source, there are a few ways. We can use rbind(), or row index based population.

Inserting Row using Row Index

We can explicitly specify the row number to populate at in a Data Frame. If there is an existing row at that index, the values would be replaced. If there is no existing row at that index, a new row would be created.

This approach requires knowing the index of the new row to be inserted. This is how we can insert a new row at the end of the Data Frame in a generic manner:

Customers[nrow(Customers)+1,]<-c(4,'YFitness',123.435,'North',TRUE,'Issaquah')

nrow() gives the number of rows in a Data Frame.

We can also use this approach to insert a new row at a particular location in the Data Frame. Let's say we want to insert a new row at the 3rd position in the Customers Data Frame. This is done in two steps. In the first step, we move rows 3:end to 4:(end+1). In the second step, we insert our new row at 3rd position.

Customers[seq(4,nrow(Customers)+1),] <- Customers[seq(3,nrow(Customers)),]
Customers[3,] <- c(3,'HFitness',123.435,'North',TRUE,'Issaquah')

Inserting Data using rbind()

This is useful when we have few similar Data Frames and we want to create a UNION of the those.

Customers<-rbind(Customers,
                               OldCustomers,
                               c(4,'YFitness',123.435,'North',TRUE,'Issaquah'))

In the above line, we are doing a Union of Customers Data Frame, OldCustomers Data Frame, and a vector that is created on the fly; and assigning the result to the Customers Data Frame. The rows are added in the order in which the appear in rbind.

rbind() can also be used for inserting a new record at a specified location. Let's say we want to insert a new row at the 3rd position in the Customers Data Frame. We create two subsets of the Customers Data Frame at the position of interest, and then create a UNION of these two Data Frames along with the new record in the middle.

Customers <- rbind(Customers[1:2,],
            c(3,'HFitness',123.435,'North',TRUE,'Issaquah'),
                                Customers[-(1:2),])

Data Frames and Data Tables are your Tables

Two objects in R come close to a SQL Table: Data Frame, and Data Table. Data Tables are relatively new to the R world, have simpler syntax, and arguably have better performance and memory utilization. However, Data Frames are more pervasive in the R world.

For now, we will stick to Data Frames and see how we can perform SQL type DDL operations on Data Frames.

We will use a customer database for our examples. We will introduce the tables and columns as needed.

Create Table

Data Frame is a very simple in-memory data structure. Things that we take for granted in a SQL table do not even find a mention in R. There is no Primary Key, Foreign Key, Unique Key, Identity Column, Default Value, or Indexes in R. Every column is NULLable, or I should say NAable - If you do not provide a valid value, it becomes NA. 'Factors' come closest to 'Constraints'.

SQL provides a lot of tools for maintaining the Data Integrity. In R, maintaining the integrity of data is your responsibility. In some cases, even if you are able to enforce some constraint as we will see later, R would let you go with a warning. If you are not careful, you may end up with incorrect results.

Note: One important distinction to keep in mind that unlike SQL, the order or records in R is preserved. SQL does not guarantee that the records would be stored in the same order in which it was stored, but R does.

If you need to create an empty Data Frame, here is how to do it.

Customers <- data.frame(Id=integer(),
                        Name=character(),
                        Revenue=double(),
                        Geography=factor(levels=c('South','North','East','West')),
                        IsActive=logical(),
                        stringsAsFactors=FALSE)

Notice the data type of Geography column. It is a factor data type with predefined levels. It is closest to column constraint in SQL. However, if you insert a different value in the Geography column, R would warn you and insert NA instead.

Describe Table

If you want to see the structure of the Data Frame, use str() function.

> str(Customers)
'data.frame':    0 obs. of 5 variables:
$ Id       : int
$ Name     : chr
$ Revenue : num
$ Geography: Factor w/ 4 levels "South","North",..:
$ IsActive : logi

If you just want to see the names of the columns, use names() or colnames() functions

> names(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"
> colnames(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"

However, most of the times you load the data in R from an external data source and you do not need to create a Data Frame from scratch. Be careful about the data types when you import the data from external sources. R might have imported everything as a character. It is a good idea to check and enforce data types after data import.

A very common way to load data in R is to read from a csv file using read.csv function.

Alter Table - Add Column

In order to add a column, you just need to pick a column name and assign it some value. R creates your 'column' automatically. Careful, if the column already exists, its values would be replaced with the new values.

Add an empty column to a Data Frame:

Customers$City <- NA

You can also assign values to the column right when you create it. Create a vector and put values in the order corresponding to your row order. Then assign the vector to the new column in the Data Frame.

Assuming your Customer data frame has 3 records in it, you can use the following to create a new column as well as assign values to it.

Customers$City <- c("Bellevue", "Redmond", "Seattle")

There are alternate syntax to do the same thing:

Customers[["City"]] <- c("Bellevue", "Redmond", "Seattle")
Customers[,"City"] <- c("Bellevue", "Redmond", "Seattle")

Alter Table - Rename Column

Renaming a column is little convoluted if you do not want to use the column index:

names(Customers)[names(Customers)=="Id"] <- "CustId"

However, if you know the index of the column, you can use it directly.

names(Customers)[1] <- "CustId"

Alter Table - Remove Column
For removing a column, assign NULL to it using any of the above syntax.

Customers$Geography<-NULL

Alter Table - Change Data Type of Column

You can use as.X() function to change the data type of a column from existing type to X. If R fails to meaningfully convert the data type, it would set NAs and warn you.

Customers$City<-as.numeric(Customers$City)

The above line would try to convert City to Numeric and will warn you, setting NAs in the city column. However, if the conversion were meaningful, it would do the required conversion.

Drop Table

Data Frames are in-memory objects and you can get rid of those just like any other object in R by using the rm() or remove() functions.

rm(Customers)

R for SQL minded

Sunday, June 14, 2015

INSERT data into Data Frames

Saturday, June 13, 2015

Creating and Modifying 'Tables' in R