next up previous
Next: About this document ... Up: Programming in S-PLUS1 Previous: Debugging Functions

An Example

Note: in this section, numerous mistakes are made, for illustrative purposes.

Consider a function for breaking a vector into any number of quantiles. If the data look like c(.42,.15,.25,.27,.35,.38,.41,.1) for instance, the two-quantile breakdown would be c(2,1,1,1,2,2,2,1) and the four-quantile version would be c(4,1,2,2,3,3,4,1).

S-PLUS has a function, quantile which takes a vector of data and a vector of probability breakpoints (c(0,.5,1) for two quantiles, c(0,.25,.5,.75,1) for four, etc. ) and returns the breakpoints for the quantiles of data. This will be the guts of the function.

The function will look something like this:

by.quantiles <- function(x, n)
{
Generate the vector of probability breakpoints for n quantiles.

Use quantile to generate the actual breakpoints.

Use these breakpoints to classify the data, x, into n quantiles.

Return the result.

}

The first step is to figure out how to generate the probability breakpoints. We know that this will be some vector, starting with 0 and ending with 1. The vector to split the data in half contained three elements (0, .5, 1) and the vector to split the data in four parts took five elements (0, .25, .5, .75, 1) so we can figure that the vector for splitting the data into n parts will require n+1 elements. We can use the rep function to create a framework for this vector. It is probably best to try out each part of the function separately, until we know that all the pieces will fit together.

> by.quantiles <- function(n,x)
+ {
+ pbreaks <- rep(0, n+1)
+ pbreaks[n+1] <- 1
+ pbreaks
+ }
> by.quantiles(c(1,2,3,4),2)
Error in rep.int: rep() only defined for length(times)==1 or length(x): rep.int(
1:xlen, times)
        . . .
Dumped

Woops, x should be the first argument, n the second. After that correction, the test run produces ``[1] 0 0 1'' as output, which looks right so far. We need to figure a way to calculate the middle numbers in the vector. One way to do this could be a for-loop. We know the first and last numbers in the vector, and want to fill in the middle numbers, one at a time. In the example of two quantiles the vector counted by halves, (0, .5, 1) and in the example of four quantiles the vector counted by quarters, (0, .25, .5, .75, 1). In the case of n quantiles, the vector should count by 1/n.

Use the ed or edit function to change by.quantiles to the following form:

function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:(n-1))
                pbreaks[i] <- 1/n
        pbreaks
}

Now let's try this out:

> by.quantiles(c(1,2,3,4),2)
[1] 0.5 0.5 1.0

Hmm, that doesn't look right, the first one should be 0. Let's try some different input to get a better idea of what is going wrong.

> by.quantiles(c(1,2,3,4),4)
[1] 0.00 0.25 0.25 0.00 1.00

This shows a couple of problems: the 0 is showing up next-to-last, and the numbers are not increasing. The second problem is coming from the pbreaks[i] <- 1/n line. We want to get i somewhere on the right side of that equation. The first problem is because we used n-1 instead of n in the for-loop. The vector has n+1 elements, so the next-to-last element is element n, not n-1.

Fixing the range of the for-loop should solve the first problem, and let's try i/n instead of 1/n to solve the second problem.

> by.quantiles(c(1,2,3,4),2)
[1] 0 1 1
> by.quantiles(c(1,2,3,4),4)
[1] 0.00 0.50 0.75 1.00 1.00

Woops, it should be (i-1)/n. After fixing that bug, this part of the function does what it is supposed to.

> by.quantiles
function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
        pbreaks
}
> by.quantiles(c(1,2,3,4),4)
[1] 0.00 0.25 0.50 0.75 1.00

The next step is to use the quantile function to calculate the breaks between the quantiles in the data. Type help(quantile) to find out how to use the this function. To prevent having to type in a vector of data every time we test the program, store a vector of data as sample.

> sample <- c(.42,.15,.25,.27,.35,.38,.41,.1)

Now, let's put in the quantile function and test it out on sample. The values returned for n=4 should be something like (.1,.2,.3,.4,.5).

> by.quantiles
function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
        dbreaks <- quantile(x, probs = pbreaks)
        dbreaks
}
> by.quantiles(sample, 4)
   0%   25%  50%    75% 100%
  0.1 0.225 0.31 0.3875 0.42

That looks about right. The next step is to use dbreaks to write the data by quantiles. What we want to do is look at each element of data, and if it is between two breakpoints, it is in that quantile.

We can start by generating n+1 zeroes, and then using a for-loop to fill them in. Actually, we will need two for-loops: one to bring in each element of data, another to check that element against each pair of quantiles. Note that this means that each element will continue to be checked against each quantile, even after it has been placed in the correct quantile. We could write a faster function which stops checking once it finds the correct quantile, but that would be more complicated to write.

The two for-loops (one on i, the other on j) will contain an if statement, which asks whether the element fits in the j'th quantile. If it does, the value j is recorded for that element.

> by.quantiles <- edit(by.quantiles)
> by.quantiles
function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
        dbreaks <- quantile(x, probs = pbreaks)
        data.by.quantiles <- rep(0, n + 1)
        for(i in 1:(n + 1))
                for(j in 1:n)
                        if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1])
                                  data.by.quantiles[i] <- j
        data.by.quantiles
}
> by.quantiles(sample, 4)
[1] 0 1 2 2 3

How come the output is not as long as the original data? Let's use inspect to try to figure out what is going wrong inside those for-loops. (Also, what is that zero doing there?)

> inspect(by.quantiles(sample, 4))
entering function by.quantiles
stopped in by.quantiles (frame 3), at:
        pbreaks <- rep(0, n + 1)
d> do
stopped in by.quantiles (frame 3), at:
        pbreaks[n + 1] <- 1
d> do
stopped in by.quantiles (frame 3), ahead of loop:
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
d> do
stopped in by.quantiles (frame 3), after loop:
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
d> do
stopped in by.quantiles (frame 3), at:
        dbreaks <- quantile(x, probs = pbreaks)
d> do
stopped in by.quantiles (frame 3), at:
        data.by.quantiles <- rep(0, n + 1)
d> do
stopped in by.quantiles (frame 3), ahead of loop:
        for(i in 1:(n + 1))
                for(j in 1:n)
                        if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1])
        ...
d> eval x
[1] 0.42 0.15 0.25 0.27 0.35 0.38 0.41 0.10
d> eval dbreaks
   0%   25%  50%    75% 100%
  0.1 0.225 0.31 0.3875 0.42
d> eval data.by.quantiles
[1] 0 0 0 0 0
d> quit

Aha! The data.by.quantiles vector should have the same length as x, not the same length as dbreaks. This means i should go from 1 to length(x).

> by.quantiles <- edit(by.quantiles)
> by.quantiles
function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
        dbreaks <- quantile(x, probs = pbreaks)
        data.by.quantiles <- rep(0, length(x))
        for(i in 1:length(x))
                for(j in 1:n)
                        if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1])
                                  data.by.quantiles[i] <- j
        data.by.quantiles
}

This looks better, but still doesn't work.

> by.quantiles(sample, 4)
[1] 0 1 2 2 3 3 0 1

Let's use inspect again and see what's going wrong. Remember, the output should be (4, 1 2, 2, 3, 3, 4, 1), so how come the function is generating 0's instead of 4's? Or, since the output starts with 0's, why is it leaving them as 0's instead of generating 4's?

> inspect(by.quantiles(sample, 4))
entering function by.quantiles
stopped in by.quantiles (frame 3), at:
        pbreaks <- rep(0, n + 1)
d> do 6
stopped in by.quantiles (frame 3), ahead of loop:
        for(i in 1:length(x))
                for(j in 1:(n - 1))
                        if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1])
        ...
d> step 5
stopped in by.quantiles (frame 3), at:
        if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1])
                data.by.quantiles[i] <- j
d> eval j
[1] 4
d> eval data.by.quantiles[1]
[1] 0

At this point S-PLUS has checked the first element (.42) against the fourth quantile, but has recorded a 0 meaning that it does not fit. Why not?

d> eval x[i] >= dbreaks[j]
  75%
    T
d> eval x[i] < dbreaks[j+1]
 100%
    F

Hmm, the first element is greater than or equal to the 75th percentile, but not less than the 100th percentile. Why not?

d> eval x[i]
[1] 0.42
d> eval dbreaks[j+1]
 100%
 0.42

Aha! The test should be whether x[i] is less than or equal to dbreaks[j+1] instead of strictly less than. After this correction, the function works as it is supposed to. The working function looks like this:

function(x, n)
{
        pbreaks <- rep(0, n + 1)
        pbreaks[n + 1] <- 1
        for(i in 2:n)
                pbreaks[i] <- (i - 1)/n
        dbreaks <- quantile(x, probs = pbreaks)
        data.by.quantiles <- rep(0, length(x))
        for(i in 1:length(x))
                for(j in 1:n)
                        if(x[i] >= dbreaks[j] & x[i] <= dbreaks[j + 1])
                                  data.by.quantiles[i] <- j
        data.by.quantiles
}


next up previous
Next: About this document ... Up: Programming in S-PLUS1 Previous: Debugging Functions
Brian Junker 2002-08-26