Stratified random sampling in R from a data frame

Important update

The function originally posted here has been deleted. In its place, I’ve posted a much improved version for the benefit of others visiting this page. The function presently takes the following arguments:


  • df: The input data.frame
  • group: The grouping column(s). Can be a character vector or the numeric positions of the columns.
  • size: The desired sample size. Can be a decimal less than 1 (the proportion to sample from each group) or a whole number (the same number of samples from each group).
  • select: A named list with optional subsetting statements.
  • replace: Logical. Should sampling be done with replacement?
  • bothSets: Logical. Should both the sampled and the non-sampled rows be returned as a list? Useful when setting up "training" and "testing" sets.
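
To make the arguments above concrete, here is a minimal base R sketch of how a function with this interface might behave. It is not the author's implementation: the name `stratified_sketch`, the list element names `sampled` and `remainder`, and the rule that a `size` below 1 is treated as a proportion are all assumptions made for illustration.

```r
# Hypothetical sketch, not the author's code: stratified sampling in base R.
stratified_sketch <- function(df, group, size, select = NULL,
                              replace = FALSE, bothSets = FALSE) {
  # Optionally keep only rows matching each named entry in `select`,
  # e.g. select = list(Species = c("setosa", "virginica")).
  if (!is.null(select)) {
    for (nm in names(select)) {
      df <- df[df[[nm]] %in% select[[nm]], , drop = FALSE]
    }
  }

  # Split row positions by the grouping column(s).
  # `group` may hold column names or numeric column positions.
  strata <- split(seq_len(nrow(df)), df[group], drop = TRUE)

  picked <- unlist(lapply(strata, function(rows) {
    # Assumption: size < 1 is a proportion, otherwise a fixed count per group.
    n <- if (size < 1) round(length(rows) * size) else size
    if (!replace && n > length(rows)) {
      stop("a group has fewer rows than the requested sample size")
    }
    # Sample positions within `rows` so a one-row group is handled correctly.
    rows[sample.int(length(rows), n, replace = replace)]
  }), use.names = FALSE)

  sampled <- df[picked, , drop = FALSE]
  if (bothSets) {
    # Rows that were never picked form the second set.
    rest <- setdiff(seq_len(nrow(df)), picked)
    list(sampled = sampled, remainder = df[rest, , drop = FALSE])
  } else {
    sampled
  }
}

# Usage with the built-in iris data (50 rows per species).
set.seed(1)
five_each <- stratified_sketch(iris, "Species", 5)        # fixed count per group
ten_pct   <- stratified_sketch(iris, 5, 0.1)              # group column by position
two_only  <- stratified_sketch(iris, "Species", 3,
                               select = list(Species = c("setosa", "virginica")))
parts     <- stratified_sketch(iris, "Species", 0.8, bothSets = TRUE)
```

With `bothSets = TRUE`, the two list elements can serve directly as "training" and "testing" sets, since no row appears in both when `replace = FALSE`.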


And here are some examples of the function in action:

There is also a data.table version that is much more efficient but has the same functionality.
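
As a rough illustration of what a data.table approach could look like, the sketch below uses the common `.I` idiom to collect sampled row indices per group. It is an assumption-laden sketch rather than the author's data.table version: the name `stratified_dt_sketch` is hypothetical, it covers only `group`, `size`, and `replace`, and it requires the data.table package to be installed.

```r
library(data.table)

# Hypothetical sketch, not the author's data.table version.
stratified_dt_sketch <- function(df, group, size, replace = FALSE) {
  dt <- as.data.table(df)
  # Accept numeric column positions as well as column names.
  by_cols <- if (is.numeric(group)) names(dt)[group] else group

  # .I holds the row numbers of the current group within dt;
  # the unnamed result column is called V1.
  idx <- dt[, {
    # Assumption: size < 1 is a proportion, otherwise a fixed count per group.
    n <- if (size < 1) round(.N * size) else size
    .I[sample.int(.N, n, replace = replace)]
  }, by = by_cols]$V1

  dt[idx]
}

set.seed(1)
dt_sample <- stratified_dt_sketch(iris, "Species", 5)
```

One caveat with this sketch: inside `j`, column names take precedence over function arguments, so a data set with columns named `size` or `replace` would need those arguments renamed.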

