RWeka Hacks: Applying Supervised Filters to Test Sets

As a researcher, analyzing data is one of my core responsibilities. While there are many toolkits, libraries and languages dedicated to statistical analysis, I generally find myself using R. R does mostly what I want. But, sometimes it doesn’t.

For example, I have a pretty specific workflow for building performant classifiers (especially the kind that are atheoretic, so their performance need not be explained). Maybe one day I’ll write a post about that, but it generally involves supervised discretization [1], then feature subset selection [2], followed by model selection [3]. Weka is a machine learning toolkit that makes all of these steps easy. R probably has some combination of libraries that does all of these things independently, but I have yet to find a franken-workflow that makes this process as painless as just using Weka. Weka exposes a Java library, though, and I find using Java for vector processing cumbersome.

Luckily, you can talk to Weka in R with the RWeka package. The RWeka package is great. It uses rJava to create all of the links required to convert data frames into ArrayLists and whatnot. But, it’s one of those many libraries-on-the-fringe of conventional development that is only half built. It exposes some of the functionality of Weka to R through a nice interface, and provides a framework for reaching the rest, but if you need anything other than the typical uses of Weka then you’ll have to do some reverse engineering.

At this point, I’ve done a lot of that. So, I thought I’d share a series of posts where I talk about how I implemented some of the functionality of Weka that I would like in R but that is not supported by the prepackaged RWeka interface. This first post is about applying a supervised discretization filter that is fitted on a training set but then applied to a test set.

Details

I’ll keep this short. This is a function I modified from the RWeka source code that takes in a train set, a test set and the name of a supervised attribute filter [4]. The function will train the filter on the train set and transform the train set accordingly. It will then apply the trained filter to the test set and transform the test set accordingly.

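# Requires rJava (for .jnew, .jcall, .jarray and .jcast) and RWeka
# (which puts the Weka classes on the Java classpath).
library(rJava)
library(RWeka)

# Takes in a train and test model frame along with the name
# of a supervised Weka filter. Trains the filter on the training
# model frame, applies the fitted filter to the test model frame
# and returns the transformed train and test frames as a list
# with elements "train" and "test".
rWeka.apply.filter <- function(train.mf, test.mf, name, control=NULL)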
{
  input <- list(
    train=read_data_into_Weka(train.mf),
    test=read_data_into_Weka(test.mf)
  )

  ## Build and initialize filter.
  filter <- .jnew(name)
  control <- as.character(control)
  if (length(control)) {
    .jcall(filter, "V", "setOptions", .jarray(control))
  }
  .jcall(filter, "Z", "setInputFormat", input$train)

  # helper function to apply filter
  .apply.filter <- function(dataset) {
    read_instances_from_Weka(.jcall(
      "weka/filters/Filter",
      "Lweka/core/Instances;",
      "useFilter",
      .jcast(dataset,"weka/core/Instances"),
      .jcast(filter,"weka/filters/Filter")
    ))
  }

  # apply filter; train comes first in the list, so the filter is
  # fitted on it before the test set is transformed
  lapply(input, .apply.filter)
}

Usage:

# train is any data frame, test is any other data frame of the same form as train.
rWeka.apply.filter(train, test, "weka.filters.supervised.attribute.Discretize")

While I've only tested with the Discretize attribute filter (because that's the only supervised attribute filter I typically use), this function should theoretically work for any other supervised attribute filter as well. It should also be easy to generalize this function to more than just a train and test set. I would just accept a list argument and lapply the .apply.filter helper function over it.
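A minimal sketch of that generalization might look like the following. The function name rWeka.apply.filter.list and the others argument are hypothetical names chosen for illustration, and the sketch assumes that read_data_into_Weka and read_instances_from_Weka are already defined in your session. I have not tested it beyond the train/test case.

```r
library(rJava)
library(RWeka)

# Hypothetical generalization: fit the filter on train.mf, then apply it
# to train.mf and to every data frame in the named list `others`.
# Assumption: read_data_into_Weka() and read_instances_from_Weka() are
# already available in the session.
rWeka.apply.filter.list <- function(train.mf, others, name, control = NULL) {
  # Keep the training set first so the filter is fitted on it.
  input <- lapply(c(list(train = train.mf), others), read_data_into_Weka)

  # Build and initialize the filter.
  filter <- .jnew(name)
  control <- as.character(control)
  if (length(control)) {
    .jcall(filter, "V", "setOptions", .jarray(control))
  }
  .jcall(filter, "Z", "setInputFormat", input$train)

  # Apply the filter to each dataset in order and convert back to R.
  lapply(input, function(dataset) {
    read_instances_from_Weka(.jcall(
      "weka/filters/Filter",
      "Lweka/core/Instances;",
      "useFilter",
      .jcast(dataset, "weka/core/Instances"),
      .jcast(filter, "weka/filters/Filter")
    ))
  })
}
```

A call such as rWeka.apply.filter.list(train, list(test = test, holdout = holdout), "weka.filters.supervised.attribute.Discretize") would then return a list with elements train, test and holdout, where holdout stands in for any extra data frame of the same form as train.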

Unfortunately, in order to get this function to work, you'll also need to copy over a couple of RWeka's internal functions that are not exposed by simply installing the package. To do so, you'll have to download the RWeka source. Once you do, open up readers.R and copy over the following two functions:

# copy over from RWeka/readers.R
read_data_into_Weka(x, classIndex = ncol(x))
read_instances_from_Weka(x)
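
If you would rather not copy source files around, R can also look up unexported functions directly in a package namespace. The sketch below assumes your installed RWeka version still defines both helpers under these names and that read_data_into_Weka accepts a classIndex argument as in the signature above. The name weka_read is just a placeholder for illustration.

```r
library(RWeka)

# Assumption: the installed RWeka still has these two unexported helpers;
# getFromNamespace() stops with an error if either one is missing.
weka_read <- utils::getFromNamespace("read_data_into_Weka", "RWeka")
read_instances_from_Weka <- utils::getFromNamespace("read_instances_from_Weka", "RWeka")

# Thin wrapper that treats the last column as the class attribute by
# default, matching the signature quoted above. The packaged default for
# classIndex may differ, so it is passed explicitly.
read_data_into_Weka <- function(x, classIndex = ncol(x)) {
  weka_read(x, classIndex = classIndex)
}
```

Functions fetched this way keep the package namespace as their environment, so any other internal helpers they call still resolve. The trade-off is that internal functions can change or disappear between package versions without notice.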

That's about it. This one was pretty simple, but hopefully it's helpful!

Footnotes:

[1] Supervised discretization is just the process of transforming continuous variables into discrete variables by binning them along cut-points that best distinguish between classes. I generally use Fayyad and Irani's entropy minimization approach.
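
To make the idea concrete, here is a small base R sketch of the core step: picking the single cut-point that minimizes the weighted class entropy of the two resulting bins. It is only an illustration. The function names are made up for this example, and the full Fayyad-Irani method applies this step recursively with an MDL-based stopping rule, which is not shown here.

```r
# Shannon entropy (in bits) of a vector of class labels.
class_entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty classes to avoid 0 * log2(0)
  -sum(p * log2(p))
}

# Find the single cut-point on x that minimizes the weighted class
# entropy of the two bins it creates. Candidate cuts are the midpoints
# between adjacent distinct values of x.
best_entropy_cut <- function(x, labels) {
  values <- sort(unique(x))
  if (length(values) < 2) return(NA_real_)
  candidates <- (head(values, -1) + tail(values, -1)) / 2
  weighted <- vapply(candidates, function(cut) {
    left <- labels[x <= cut]
    right <- labels[x > cut]
    (length(left) * class_entropy(left) +
       length(right) * class_entropy(right)) / length(labels)
  }, numeric(1))
  candidates[which.min(weighted)]
}

# Toy data: the classes separate cleanly between 3 and 10.
x <- c(1, 2, 3, 10, 11, 12)
labels <- c("a", "a", "a", "b", "b", "b")
best_entropy_cut(x, labels)  # 6.5
```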

[2] Feature subset selection is the process of selecting a performant subset of features for building a classifier. It's important because it helps avoid overfitting your model to training data, particularly when you do not have a lot of training data. If I have few features, I'll use the Wrapper approach that tests every possible subset of features. If not, I'll generally use correlation feature selection.

[3] This is less systematic. I generally have some intuition as to which models will work best based on the shape and properties of my data, but it's mostly just trial and error and parameter tuning.

[4] In making this function, I referenced both the Weka javadocs and the RWeka internal function for using filters (RWeka_use_filter in filters.R). I basically just generalized the existing function to apply a filter to multiple datasets and return a list.
