Note: in this section, numerous mistakes are made, for illustrative purposes.
Consider a function for breaking a vector into any number
of quantiles. If the data look like c(.42,.15,.25,.27,.35,.38,.41,.1)
for instance, the two-quantile breakdown would be c(2,1,1,1,2,2,2,1)
and the four-quantile version would be c(4,1,2,2,3,3,4,1)
.
S-PLUS has a function, quantile
which takes a vector of data
and a vector of probability breakpoints (c(0,.5,1)
for two quantiles,
c(0,.25,.5,.75,1)
for four, etc. ) and returns the breakpoints
for the quantiles of data. This will be the guts of the function.
The function will look something like this:
by.quantiles <- function(x, n) {Generate the vector of probability breakpoints for n quantiles.
Use quantile
to generate the actual breakpoints.
Use these breakpoints to classify the data, x, into n quantiles.
Return the result.
}
The first step is to figure out how to generate the probability
breakpoints. We know that this will be some vector, starting
with 0 and ending with 1. The vector to split the data in half
contained three elements (0, .5, 1) and the vector to split the
data in four parts took five elements (0, .25, .5, .75, 1) so
we can figure that the vector for splitting the data into n
parts
will require n+1
elements. We can use the rep
function
to create a framework for this vector. It is probably best to
try out each part of the function separately, until we know that
all the pieces will fit together.
> by.quantiles <- function(n,x) + { + pbreaks <- rep(0, n+1) + pbreaks[n+1] <- 1 + pbreaks + } > by.quantiles(c(1,2,3,4),2) Error in rep.int: rep() only defined for length(times)==1 or length(x): rep.int( 1:xlen, times) . . . Dumped
Woops, x
should be the first argument, n
the second. After
that correction, the test run produces ``[1] 0 0 1
'' as output,
which looks right so far. We need to figure a way to calculate the
middle numbers in the vector. One way to do this could be a for-loop.
We know the first and last numbers in the vector, and want to fill in
the middle numbers, one at a time. In the example of two quantiles
the vector counted by halves, (0, .5, 1) and in the example of four
quantiles the vector counted by quarters, (0, .25, .5, .75, 1). In
the case of n
quantiles, the vector should count by 1/n.
Use the ed
or edit
function to change by.quantiles
to the following form:
function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:(n-1)) pbreaks[i] <- 1/n pbreaks }
Now let's try this out:
> by.quantiles(c(1,2,3,4),2) [1] 0.5 0.5 1.0
Hmm, that doesn't look right, the first one should be 0. Let's try some different input to get a better idea of what is going wrong.
> by.quantiles(c(1,2,3,4),4) [1] 0.00 0.25 0.25 0.00 1.00
This shows a couple of problems: the 0 is showing up next-to-last,
and the numbers are not increasing. The second
problem is coming from the pbreaks[i] <- 1/n
line. We
want to get i
somewhere on the right side of that equation.
The first problem is because we used n-1
instead of n
in the for-loop. The vector has n+1
elements, so the
next-to-last element is element n
, not n-1
.
Fixing the range of the for-loop should solve the first problem, and
let's try i/n
instead of 1/n
to solve the second problem.
> by.quantiles(c(1,2,3,4),2) [1] 0 1 1 > by.quantiles(c(1,2,3,4),4) [1] 0.00 0.50 0.75 1.00 1.00
Woops, it should be (i-1)/n
. After fixing that bug,
this part of the function does what it is supposed to.
> by.quantiles function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:n) pbreaks[i] <- (i - 1)/n pbreaks } > by.quantiles(c(1,2,3,4),4) [1] 0.00 0.25 0.50 0.75 1.00
The next step is to use the quantile
function to calculate
the breaks between the quantiles in the data. Type help(quantile)
to find out how to use the this function. To prevent having to type
in a vector of data every time we test the program,
store a vector of data as sample
.
> sample <- c(.42,.15,.25,.27,.35,.38,.41,.1)
Now, let's put in the quantile
function and test it out on
sample
. The values returned for n
=4 should be something
like (.1,.2,.3,.4,.5).
> by.quantiles function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:n) pbreaks[i] <- (i - 1)/n dbreaks <- quantile(x, probs = pbreaks) dbreaks } > by.quantiles(sample, 4) 0% 25% 50% 75% 100% 0.1 0.225 0.31 0.3875 0.42
That looks about right. The next step is to use dbreaks
to
write the data by quantiles. What we want to do is look at each
element of data, and if it is between two breakpoints, it is in that
quantile.
We can start by generating n+1
zeroes, and then using a for-loop
to fill them in. Actually, we will need two for-loops: one to bring in
each element of data, another to check that element against each pair
of quantiles. Note that this means that each element will continue to
be checked against each quantile, even after it has been placed in the
correct quantile. We could write a faster function which stops
checking once it finds the correct quantile, but that would be more
complicated to write.
The two for-loops (one on i
, the other on j
) will contain
an if statement, which asks whether the element fits in the
j
'th quantile. If it does, the value j
is recorded
for that element.
> by.quantiles <- edit(by.quantiles) > by.quantiles function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:n) pbreaks[i] <- (i - 1)/n dbreaks <- quantile(x, probs = pbreaks) data.by.quantiles <- rep(0, n + 1) for(i in 1:(n + 1)) for(j in 1:n) if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1]) data.by.quantiles[i] <- j data.by.quantiles } > by.quantiles(sample, 4) [1] 0 1 2 2 3
How come the output is not as long as the original data?
Let's use inspect
to try to figure out what is going wrong
inside those for-loops. (Also, what is that zero doing there?)
> inspect(by.quantiles(sample, 4)) entering function by.quantiles stopped in by.quantiles (frame 3), at: pbreaks <- rep(0, n + 1) d> do stopped in by.quantiles (frame 3), at: pbreaks[n + 1] <- 1 d> do stopped in by.quantiles (frame 3), ahead of loop: for(i in 2:n) pbreaks[i] <- (i - 1)/n d> do stopped in by.quantiles (frame 3), after loop: for(i in 2:n) pbreaks[i] <- (i - 1)/n d> do stopped in by.quantiles (frame 3), at: dbreaks <- quantile(x, probs = pbreaks) d> do stopped in by.quantiles (frame 3), at: data.by.quantiles <- rep(0, n + 1) d> do stopped in by.quantiles (frame 3), ahead of loop: for(i in 1:(n + 1)) for(j in 1:n) if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1]) ... d> eval x [1] 0.42 0.15 0.25 0.27 0.35 0.38 0.41 0.10 d> eval dbreaks 0% 25% 50% 75% 100% 0.1 0.225 0.31 0.3875 0.42 d> eval data.by.quantiles [1] 0 0 0 0 0 d> quit
Aha! The data.by.quantiles
vector should have the same length
as x
, not the same length as dbreaks
. This means
i
should go from 1
to length(x)
.
> by.quantiles <- edit(by.quantiles) > by.quantiles function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:n) pbreaks[i] <- (i - 1)/n dbreaks <- quantile(x, probs = pbreaks) data.by.quantiles <- rep(0, length(x)) for(i in 1:length(x)) for(j in 1:n) if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1]) data.by.quantiles[i] <- j data.by.quantiles }
This looks better, but still doesn't work.
> by.quantiles(sample, 4) [1] 0 1 2 2 3 3 0 1
Let's use inspect
again and see what's going wrong.
Remember, the output should be (4, 1 2, 2, 3, 3, 4, 1),
so how come the function is generating 0's instead of 4's?
Or, since the output starts with 0's, why is it leaving
them as 0's instead of generating 4's?
> inspect(by.quantiles(sample, 4)) entering function by.quantiles stopped in by.quantiles (frame 3), at: pbreaks <- rep(0, n + 1) d> do 6 stopped in by.quantiles (frame 3), ahead of loop: for(i in 1:length(x)) for(j in 1:(n - 1)) if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1]) ... d> step 5 stopped in by.quantiles (frame 3), at: if(x[i] >= dbreaks[j] & x[i] < dbreaks[j + 1]) data.by.quantiles[i] <- j d> eval j [1] 4 d> eval data.by.quantiles[1] [1] 0
At this point S-PLUS has checked the first element (.42) against the fourth quantile, but has recorded a 0 meaning that it does not fit. Why not?
d> eval x[i] >= dbreaks[j] 75% T d> eval x[i] < dbreaks[j+1] 100% F
Hmm, the first element is greater than or equal to the 75th percentile, but not less than the 100th percentile. Why not?
d> eval x[i] [1] 0.42 d> eval dbreaks[j+1] 100% 0.42
Aha! The test should be whether x[i]
is less than or equal
to dbreaks[j+1]
instead of strictly less than. After
this correction, the function works as it is supposed to.
The working function looks like this:
function(x, n) { pbreaks <- rep(0, n + 1) pbreaks[n + 1] <- 1 for(i in 2:n) pbreaks[i] <- (i - 1)/n dbreaks <- quantile(x, probs = pbreaks) data.by.quantiles <- rep(0, length(x)) for(i in 1:length(x)) for(j in 1:n) if(x[i] >= dbreaks[j] & x[i] <= dbreaks[j + 1]) data.by.quantiles[i] <- j data.by.quantiles }