R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That's great if your existing data is in a spreadsheet, but if you have large amounts of data, it's probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.
A convenient way to access an Oracle database from R is the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it. ODBC stands for Open Database Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first to provide an ODBC product, versions now exist for Linux and Macintosh platforms as well. ODBC is built into current versions of Windows. If you are using a different operating system, you'll need to install an ODBC driver manager.
Before you can access a database from R, you'll need to create a Data Source Name, or DSN. This is an alias for the database, which provides the connection details. In Windows, you create the DSN using the ODBC Data Source Administrator. This tool can be found in the Control Panel. In Windows 10, it's under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type "ODBC" in the search box. On my system, it looks like this:
As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you'll get this box:
Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:
For "Data Source Name," you can put in almost anything you want. This is the name you will use in R when you connect to the database.
The "Description" field is optional, and again, you can put in whatever you want.
"TNS Service Name" is the name that you (or your company's database administrator) assigned when configuring the Oracle database. "User ID" is the ID that you use with the database.
After you fill in these fields, click the "Test Connection" button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click "OK." You should see a "Connection Successful" message. If not, check the Service Name, User ID, and Password.
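If you want to confirm from inside R that the new DSN is visible, one option is RODBC's odbcDataSources() function, which returns a named character vector whose names are the DSNs known to the driver manager. The sketch below wraps that lookup in a small helper. The name dsn_exists is hypothetical, and "DATABASE" stands in for whatever you typed in the "Data Source Name" field.

```r
# Hypothetical helper: TRUE if a DSN with this name is visible to ODBC.
# `sources` defaults to RODBC::odbcDataSources(), a named character vector
# whose names are the DSN names. It is a parameter so the check can be
# tried without a live ODBC setup.
dsn_exists <- function(name, sources = RODBC::odbcDataSources()) {
  name %in% names(sources)
}

# With RODBC installed and the DSN created, the real check would be:
# dsn_exists("DATABASE")
```

If the check comes back FALSE, the DSN name in R probably does not match the one entered in the configuration box.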
Now you are ready to connect R to the database.
Here's the R code that you need:
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")

# Query the database and put the results into the data frame
# "dataframe"
dataframe <- sqlQuery(channel, "SELECT * FROM SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)
A couple of comments about this code are in order:
First, I don't like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:
pswd <- readline("Input Password: ")
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd=pswd)
This lets you make the connection without storing the password in the program file. Note that readline() still displays the password on screen as you type it.
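Another way to keep the password out of the script is to read it from an environment variable and prompt only when that variable is missing. This is a minimal sketch, not a complete security solution: get_password is a hypothetical helper, and ORACLE_PWD is an assumed variable name that you would have to set yourself (RODBC does not define it).

```r
# Hypothetical helper: read the password from an environment variable,
# falling back to an interactive prompt if the variable is unset or empty.
# ORACLE_PWD is an assumed variable name, not something RODBC defines.
get_password <- function(var = "ORACLE_PWD") {
  pswd <- Sys.getenv(var, unset = "")
  if (!nzchar(pswd)) {
    pswd <- readline("Input Password: ")
  }
  pswd
}

# Usage (needs a configured DSN, so it is commented out here):
# channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = get_password())
```

The variable can be set outside R, for example in your user environment, so the password never appears in the program or in your R history.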
Second, sqlQuery passes to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term ‘query’ includes any valid SQL statement, including table creation, updates, etc., as well as ‘SELECT’s.
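One consequence is worth knowing: with its default settings, sqlQuery does not raise an R error when Oracle rejects a statement. It returns the error messages as a character vector in place of the data frame. A minimal sketch of a check for SELECT statements follows; check_select is a hypothetical helper name, not part of RODBC.

```r
# Hypothetical helper: stop with the database's message if a SELECT
# did not come back as a data frame.
check_select <- function(result) {
  if (!is.data.frame(result)) {
    stop("Query failed: ", paste(result, collapse = "\n"), call. = FALSE)
  }
  result
}

# Usage with an open channel (commented out because it needs a database):
# dataframe <- check_select(
#   sqlQuery(channel, "SELECT * FROM SCHEMA.DATATABLE")
# )
```

Without a check like this, a typo in the SQL can go unnoticed until later code tries to treat the error text as data.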
Finally, I should mention that R works with data that is loaded into the computer's memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer's memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
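As an illustration of letting Oracle do the work, the sketch below builds a query that returns one summary row per group, not every row in the table. The helper build_summary_query and the column names REGION and AMOUNT are hypothetical; substitute columns that exist in your own table. The names are pasted directly into the SQL text, so only use it with names you trust.

```r
# Hypothetical helper: build a query that counts and averages in Oracle,
# so R receives one row per group and not the full table.
build_summary_query <- function(table, group_col, value_col) {
  sprintf(
    "SELECT %s, COUNT(*) AS N, AVG(%s) AS MEAN_VALUE FROM %s GROUP BY %s",
    group_col, value_col, table, group_col
  )
}

# REGION and AMOUNT are assumed column names for illustration only.
sql <- build_summary_query("SCHEMA.DATATABLE", "REGION", "AMOUNT")

# With an open channel, the summary would be fetched like this:
# summary_df <- sqlQuery(channel, sql)
```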
Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies