Tidy data

Mostly a summary of the paper "Tidy Data" by Hadley Wickham.

Example of badly formatted data:

| | treatmenta | treatmentb |
|---|---|---|
| John Smith | - | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |

Better formatted version of that data:

| name | trt | result |
|---|---|---|
| Jane Doe | a | 16 |
| Jane Doe | b | 11 |
| John Smith | a | - |
| John Smith | b | 2 |
| Mary Johnson | a | 3 |
| Mary Johnson | b | 1 |

Important guidelines:

  • Rows: observations
  • Columns: variables
  • Values: variable values at specific observations

Order:

  • Variables: fixed variables (those describing the experiment) come first, then measured variables, with related variables always placed next to each other
  • Observations: sort by the first variable, then break ties with the following variables
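
The ordering rules can be sketched in a few lines of Python. This is only a minimal illustration, assuming the tidy table above is held as a list of dictionaries. The names `records` and `order_observations` are made up for this example, and the missing value is represented as `None`.

```python
# Tidy records from the example above, deliberately shuffled.
# Keys are listed with the fixed variables first (name, trt)
# and the measured variable last (result).
records = [
    {"name": "Mary Johnson", "trt": "b", "result": 1},
    {"name": "John Smith", "trt": "a", "result": None},  # "-" in the table: missing
    {"name": "Jane Doe", "trt": "b", "result": 11},
    {"name": "Mary Johnson", "trt": "a", "result": 3},
    {"name": "Jane Doe", "trt": "a", "result": 16},
    {"name": "John Smith", "trt": "b", "result": 2},
]


def order_observations(rows, fixed_variables):
    """Sort rows by the first fixed variable, breaking ties with the following ones."""
    return sorted(rows, key=lambda row: tuple(row[v] for v in fixed_variables))


ordered = order_observations(records, ["name", "trt"])
for row in ordered:
    print(row["name"], row["trt"], row["result"])
```

Sorting on a tuple of the fixed variables gives the "first variable, then tie-breakers" behaviour directly, and the rows come out in the same order as in the tidy table.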

This leads to a standard, which is important because programs then know how their input is structured. They can take tidy data, transform it, and return tidy data again.
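
As a rough sketch of "tidy in, tidy out", the hypothetical function below takes the tidy treatment records and returns a new tidy table with one row per treatment. The function name, the `mean_result` column, and the choice to skip missing values are assumptions made for this illustration, not something taken from the paper.

```python
from collections import defaultdict


def mean_result_by_treatment(rows):
    """Take tidy rows (name, trt, result) and return tidy rows (trt, mean_result)."""
    values_by_trt = defaultdict(list)
    for row in rows:
        if row["result"] is not None:  # assumption: missing measurements are skipped
            values_by_trt[row["trt"]].append(row["result"])
    # The output is tidy again: one row per observation (a treatment),
    # one column per variable.
    return [
        {"trt": trt, "mean_result": sum(values) / len(values)}
        for trt, values in sorted(values_by_trt.items())
    ]


tidy = [
    {"name": "Jane Doe", "trt": "a", "result": 16},
    {"name": "Jane Doe", "trt": "b", "result": 11},
    {"name": "John Smith", "trt": "a", "result": None},
    {"name": "John Smith", "trt": "b", "result": 2},
    {"name": "Mary Johnson", "trt": "a", "result": 3},
    {"name": "Mary Johnson", "trt": "b", "result": 1},
]

summary = mean_result_by_treatment(tidy)
for row in summary:
    print(row["trt"], row["mean_result"])
```

Because the function only relies on the column names, it works without knowing anything else about the layout of its input, and its output can be fed to the next tool in the same way.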

From messy to tidy

To get a dataset from messy to tidy one can employ three operations:

Melting

Turns multiple columns whose headers are really values into two columns: one holding the former column names and one holding the values. Columns that are already variables (here, row) are kept as they are.

Messy:

| row | a | b | c |
|---|---|---|---|
| A | 1 | 4 | 7 |
| B | 2 | 5 | 8 |
| C | 3 | 6 | 9 |

Molten:

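| row | column | value |
|---|---|---|
| A | a | 1 |
| A | b | 4 |
| A | c | 7 |
| B | a | 2 |
| B | b | 5 |
| B | c | 8 |
| C | a | 3 |
| C | b | 6 |
| C | c | 9 |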
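
Melting can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation. The function `melt` and its parameters `id_vars`, `var_name` and `value_name` are names chosen for this example. The input uses the values from the example, with 8 for row B, column c, as in the molten table.

```python
def melt(rows, id_vars, var_name="column", value_name="value"):
    """Turn every column not listed in id_vars into (var_name, value_name) pairs."""
    molten = []
    for row in rows:
        for key, value in row.items():
            if key in id_vars:
                continue  # already a variable: carried over unchanged
            record = {v: row[v] for v in id_vars}
            record[var_name] = key      # former column header
            record[value_name] = value  # cell value under that header
            molten.append(record)
    return molten


messy = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]

molten = melt(messy, id_vars=["row"])
for record in molten:
    print(record["row"], record["column"], record["value"])
```

Each messy row with three value columns becomes three molten rows, so the 3 x 3 table turns into nine observations. For real work, pandas offers `DataFrame.melt` for the same kind of reshaping.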

String splitting

Casting