How to split data into training/testing sets using sample function
I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:
sample(x, size, replace = FALSE, prob = NULL)
I have a dataset that I need to put into a training (75%) and testing (25%) set. I'm not sure what information I'm supposed to put into the x and size? Is x the dataset file, and size how many samples I have?
There are numerous approaches to achieve data partitioning. For a more complete approach take a look at the createDataPartition
function in the caret
package.
Here is a simple example:
data(mtcars)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
It can be easily done by:
set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now Selecting 75% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
test <- data[-sample, ]
By using caTools package:
require(caTools)
set.seed(101)
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
I would use dplyr
for this, makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add it if doesn't contain already.
mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test <- dplyr::anti_join(mtcars, train, by = 'id')