Saturday, June 13, 2015

Creating and Modifying 'Tables' in R

Data Frames and Data Tables are your Tables

Two objects in R come close to a SQL Table: Data Frame, and Data Table.  Data Tables are relatively new to the R world, have simpler syntax, and arguably have better performance and memory utilization. However, Data Frames are more pervasive in the R world.

For now, we will stick to Data Frames and see how we can perform SQL type DDL operations on Data Frames.

We will use a customer database for our examples. We will introduce the tables and columns as needed.

Create Table

Data Frame is a very simple in-memory data structure. Things that we take for granted in a SQL table do not even find a mention in R. There is no Primary Key, Foreign Key, Unique Key, Identity Column, Default Value, or Indexes in R. Every column is NULLable, or I should say NAable - If you do not provide a valid value, it becomes NA. 'Factors' come closest to 'Constraints'.

SQL provides a lot of tools for maintaining the Data Integrity. In R, maintaining the integrity of data is your responsibility. In some cases, even if you are able to enforce some constraint as we will see later, R would let you go with a warning. If you are not careful, you may end up with incorrect results. 

Note: One important distinction to keep in mind that unlike SQL, the order or records in R is preserved.  SQL does not guarantee that the records would be stored in the same order in which it was stored, but R does.

If you need to create an empty Data Frame, here is how to do it.

Customers <- data.frame(Id=integer(),
                        Name=character(),
                        Revenue=double(),
                        Geography=factor(levels=c('South','North','East','West')),
                        IsActive=logical(),
                        stringsAsFactors=FALSE)


Notice the data type of Geography column. It is a factor data type with predefined levels. It is closest to column constraint in SQL. However, if you insert a different value in the Geography column, R would warn you and insert NA instead.

Describe Table

If you want to see the structure of the Data Frame, use str() function. 

> str(Customers)
'data.frame':    0 obs. of  5 variables:
 $ Id       : int
 $ Name     : chr
 $ Revenue  : num
 $ Geography: Factor w/ 4 levels "South","North",..:
 $ IsActive : logi


If you just want to see the names of the columns, use names() or colnames() functions

> names(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"
> colnames(Customers)
[1] "Id"        "Name"      "Revenue"   "Geography" "IsActive"


However, most of the times you load the data in R from an external data source and you do not need to create a Data Frame from scratch. Be careful about the data types when you import the data from external sources. R might have imported everything as a character. It is a good idea to check and enforce data types after data import.

A very common way to load data in R is to read from a csv file using read.csv function.


Alter Table - Add Column

In order to add a column, you just need to pick a column name and assign it some value. R creates your 'column' automatically. Careful, if the column already exists, its values would be replaced with the new values.

Add an empty column to a Data Frame:

Customers$City <- NA

You can also assign values to the column right when you create it. Create a vector and put values in the order corresponding to your row order. Then assign the vector to the new column in the Data Frame.

Assuming your Customer data  frame has 3 records in it, you can use the following to create a new column as well as assign values to it.

Customers$City <- c("Bellevue", "Redmond", "Seattle")

There are alternate syntax to do the same thing:

Customers[["City"]] <- c("Bellevue", "Redmond", "Seattle")
Customers[,"City"] <- c("Bellevue", "Redmond", "Seattle")

Alter Table - Rename Column

Renaming a column is little convoluted if you do not want to use the column index:

names(Customers)[names(Customers)=="Id"] <- "CustId"

However, if you know the index of the column, you can use it directly.

names(Customers)[1] <- "CustId"

Alter Table - Remove Column
For removing a column, assign NULL to it using any of the above syntax. 

Customers$Geography<-NULL

Alter Table - Change Data Type of Column

You can use as.X() function to change the data type of a column from existing type to X. If R fails to meaningfully convert the data type, it would set NAs and warn you.

Customers$City<-as.numeric(Customers$City)

The above line would try to convert City to Numeric and will warn you, setting NAs in the city column. However, if the conversion were meaningful, it would do the required conversion.


Drop Table


Data Frames are in-memory objects and you can get rid of those just like any other object in R by using the rm() or remove() functions.

rm(Customers)

No comments:

Post a Comment