string operations on levels

03 Sep 2015

At various times I have had to do string operations on the levels of a
factor. The standard example is when you have an ID that consists of
multiple pieces of information (think of “customer1-item1”). The
operation

works for this. For large factors this can become unnecessarily slow.
The factor is first converted to a character vector, and the extract
operation is then performed on every element of that vector. This means
that stri_extract has to extract customer1 from customer1-item1 4
times in the above code, once per element rather than once per level.
To make this more efficient we can use the following operations
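A minimal sketch of what such an operation could look like. The helper name map_levels and the example IDs are hypothetical, and base R's sub() stands in for stri_extract so that the sketch runs without extra packages. The idea is to apply the string function once per level instead of once per element.

```r
# Hypothetical helper (name and signature are illustrative, not from the post):
# apply a string function to the levels of a factor instead of its elements.
map_levels <- function(f, fun, ...) {
  stopifnot(is.factor(f))
  new_levels <- fun(levels(f), ...)  # one string operation per distinct level
  levels(f) <- new_levels            # `levels<-` merges levels that became equal
  f
}

# Made-up example data: four elements, three distinct levels.
ids <- factor(c("customer1-item1", "customer1-item2",
                "customer2-item1", "customer1-item1"))

# Element-wise version: sub() sees every element of the vector.
customer_slow <- factor(sub("-.*$", "", as.character(ids)))

# Level-wise version: sub() sees each level only once.
customer_fast <- map_levels(ids, function(l) sub("-.*$", "", l))

levels(customer_fast)
```

Both versions give the same customer labels. The level-wise one relies on the assignment to levels() to merge "customer1-item1" and "customer1-item2" into a single "customer1" level.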

then the above becomes

On large factors this can be much more efficient.
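A rough way to check this on your own machine, sketched with made-up data and base R's sub() in place of stri_extract. The sizes (1,000 IDs, 500,000 elements) are arbitrary assumptions, and the timings will vary, so no numbers are claimed here.

```r
set.seed(1)

# Illustrative data: at most 1,000 distinct IDs spread over 500,000 elements.
ids <- paste0("customer", sample(100, 1000, replace = TRUE),
              "-item", seq_len(1000))
x <- factor(sample(ids, 5e5, replace = TRUE))

# Element-wise: sub() processes all 500,000 strings.
t_slow <- system.time(
  slow <- factor(sub("-.*$", "", as.character(x)))
)

# Level-wise: sub() processes only the distinct levels.
t_fast <- system.time({
  fast <- x
  levels(fast) <- sub("-.*$", "", levels(fast))
})

# Elapsed seconds for each approach.
print(c(element_wise = t_slow[["elapsed"]], level_wise = t_fast[["elapsed"]]))
```

The comparison is only fair if both approaches yield the same labels, which is worth checking with identical(as.character(slow), as.character(fast)) before reading anything into the timings.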

You might want to go one step further and too far by replacing the
line

by the line

This fails in ways that will lead to interesting debugging later on:
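One way to reproduce this kind of failure is sketched below. It assumes the shortcut writes the levels attribute directly with attr<-, which skips the merging that levels<- performs. The post's own listing is not shown here, so the data and the exact shortcut are assumptions.

```r
# Made-up example data: four elements, three distinct levels.
ids <- factor(c("customer1-item1", "customer1-item2",
                "customer2-item1", "customer1-item1"))

# Safe: `levels<-` merges levels that become identical.
safe <- ids
levels(safe) <- sub("-.*$", "", levels(safe))
levels(safe)
# "customer1" "customer2"

# Shortcut (assumed form): overwrite the attribute without any merging.
shortcut <- ids
attr(shortcut, "levels") <- sub("-.*$", "", levels(shortcut))
levels(shortcut)
# "customer1" "customer1" "customer2"

# The integer codes are untouched, so elements 1 and 2 carry the same
# label but different codes.
as.integer(shortcut)
# 1 2 3 1
```

A check such as anyDuplicated(levels(shortcut)) returns a non-zero value for the malformed factor, which makes it a cheap guard if you do take the shortcut.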

Note that the level “customer1” occurs twice in the levels. If you can
ensure that the levels remain distinct, this shortcut is indeed a lot
faster again.