Pentaho/R Integration Plugin

Pentaho R Integration Plugin is used to execute R code direct from Pentaho Data Integration (PDI).
R Script Plugin can be downloaded here: Downloads

NOTE: This plugin is an old and open source version and is not the same as R script executor, which is available from Pentaho.

Installation

R script plugin can be installed by performing the following steps:

1) Copy the plugin folder (RScriptPlugin) into the folder of your PDI installation:
data-integration\plugins\steps

2) Install R project from the site http://www.r-project.org/

3) Install rJava package in R by executing: install.packages(“rJava”)

4) Copy JRI library rJava/jri/jri.dll (windows) or rJava/jri/libjri.so (linux) to: data-integration/libswt/{your operation system}, for example: data-integration/libswt/win64

5) Specify location of R using R_HOME environment variable. (like: C:\R\R-3.0.2).

6) For windows, it is also needed to put in PATH variable the path to R.dll, for example: C:\R\R-3.0.2\bin\{your operation system}

7) The plugin can be found under Sripting Folder in Design Tab:

Note: to check the installation press Help button in R plugin in Spoon.


Usage

We demonstrate the usage of “Execute R Script” plugin by solving the following task. In CSV file we have the values for two variables a and b. We should calculate the sum of these variables for every row in CSV file.

For this we create transformation, which contains three steps:

  • CSV Step to load input values
  • Execute R Script Step
  • Select Values Step

The transformation is demonstrated on the following picture:

CSV file has the following values:

R script contains two sections. The first one is used to define mapping between input fields in PDI (Pentaho Data Integration) and variables in R script. The second one is used to define mapping between output variables from R script and output fields in PDI.


We have two fields in PDI with names a and b. In R file we have two variables with names a and b. In R script we define the third variable c as a sum of a and b.

In R script there is one service variable OUTPUT of type list. Using this variable we can define output values. In our case the output is only variable c. For this variable we can define a new field in PDI. In our case this is c_res field. In Select Values step we show the values for three fields a, b, and c_res.

The result is demonstrated on the picture below.

For more complicated tasks it is recommended to define functions and use direct database connection in R script. For example, we use Run.R in PDI and call function MainScript, which is defined in MainScript.R.

/code/r/Run.R

library(RMySQL)
library(R.utils)

sourceDirectory(“/code/r/func/”);
ctx.date_from=’01.01.2000′;#this value can be transferred from PDI
ctx.date_to=’01.12.2010′;#this value can be transferred from PDI
OUTPUT<-list(“err_train”=0, “err_gen”=0);
OUTPUT<-MainScript(ctx);

/code/r/func/MainScript.R
MainScript<-function(ctx) {
con <-dbConnect(MySQL(), user=”root”, password=””, dbname=”lab”, host=”localhost”);
on.exit(dbDisconnect(con));
printf(“\nTime Period: %s-%s”, ctx.date_from, ctx.date_to)
flush.console()
#Select data from database
query = paste(“”,”select * from table where column between “);
query = paste(query, “str_to_date(‘”, ctx.date_from, “‘, ‘%d.%m.%Y’) and str_to_date(‘”, ctx.date_to, “‘, ‘%d.%m.%Y’)”, sep=””);
rs <- dbSendQuery(con, query);
result <- fetch(rs, n = -1)
printf(“\nSize: %g”, nrow(result))
dbDisconnect(con)
#Here the model can be evaluated and results can be saved in variables err_train and err_gen.
#results to PDI
OUTPUT<-list(“err_train”=err_train, “err_gen”=err_gen);
return(OUTPUT)
}