Distribution of height & weight in sumo divisions

library(tidyverse)

Now that we’ve downloaded the sumo data, let’s have a look at sumo wrestlers’ height and weight.

Read banzuke.csv with hard-coded column types:

df <- read_csv(
    "banzuke.csv",
    col_types = "ciccccDddcii"
)
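For readers unfamiliar with the compact `col_types` string: each letter stands for one column, in order (`c` = character, `i` = integer, `d` = double, `D` = date). A minimal sketch on a made-up three-column CSV passed inline via `I()` (available in readr 2.0 and later); the column selection and values here are assumptions for illustration, not the real layout of banzuke.csv:

```r
library(readr)

# Hypothetical inline CSV, for illustration only
toy_csv <- I("basho,height,weight\n2019.09,185,149.5\n")

# "cdd": first column as character, next two as double
toy <- read_csv(toy_csv, col_types = "cdd")
toy
```

Reading `basho` as character keeps values such as 2019.09 as text instead of turning them into numbers.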

The dataset goes a long way back:

df %>% summarise_at("basho", c("min", "max"))
## # A tibble: 1 x 2
##   min     max    
##   <chr>   <chr>  
## 1 1983.01 2019.09

The first one or two letters in the rank column indicate which sumo division the wrestler belongs to:

divisions <- c("Jk", "Jd", "Sd", "Ms", "J", "M", "K", "S", "O", "Y")

df %>% 
    mutate(division = str_extract(rank, "^\\D+")) %>% 
    mutate_at("division", ordered, levels = divisions) %>% 
    count(division)
## # A tibble: 10 x 2
##    division     n
##    <ord>    <int>
##  1 Jk       18359
##  2 Jd       59584
##  3 Sd       43889
##  4 Ms       26473
##  5 J         5904
##  6 M         6653
##  7 K          464
##  8 S          482
##  9 O          797
## 10 Y          477
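To see what the regular expression does, here is a quick check on a few made-up rank strings (illustrative assumptions, not values copied from banzuke.csv). The pattern `^\\D+` matches the leading run of non-digit characters, which is the division prefix:

```r
library(stringr)

# Hypothetical rank strings, for illustration only
sample_ranks <- c("Y1e", "M12w", "Jk25e")

# "^\\D+" keeps everything before the first digit
prefixes <- str_extract(sample_ranks, "^\\D+")
prefixes
```

For these three inputs the extracted prefixes are Y, M and Jk.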

To keep things simple, let’s put komusubi, sekiwake, ōzeki and yokozuna into the “M” division:

df <- df %>% 
    mutate(division = str_extract(rank, "^\\D+")) %>% 
    mutate_at("division", recode, K = "M", S = "M", O = "M", Y = "M") %>% 
    mutate_at("division", ordered, levels = head(divisions, -4))

df %>% count(division)
## # A tibble: 6 x 2
##   division     n
##   <ord>    <int>
## 1 Jk       18359
## 2 Jd       59584
## 3 Sd       43889
## 4 Ms       26473
## 5 J         5904
## 6 M         8873

Some records in the lower divisions are missing height/weight (about 15% in Jk, under 4% in every other division), but that’s unlikely to introduce much bias:

df %>% 
    group_by(division) %>% 
    summarise(
        total = n(),
        no_height = sum(is.na(height)),
        no_weight = sum(is.na(weight))
    )
## # A tibble: 6 x 4
##   division total no_height no_weight
##   <ord>    <int>     <int>     <int>
## 1 Jk       18359      2843      2843
## 2 Jd       59584      2132      2132
## 3 Sd       43889       123       123
## 4 Ms       26473        74        74
## 5 J         5904         2         2
## 6 M         8873         0         0
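Raw counts are hard to compare across divisions of very different sizes, so a share can be more telling. A minimal sketch on a toy tibble (the `toy` data and the `share_no_height` column name are made up for illustration), relying on the fact that the mean of a logical vector is the proportion of `TRUE` values:

```r
library(dplyr)

# Toy data standing in for df; values are invented
toy <- tibble(
    division = c("Jk", "Jk", "Jk", "Jk", "M", "M"),
    height = c(175, NA, 178, 180, 186, 190)
)

# mean(is.na(x)) gives the fraction of missing values per group
missing_share <- toy %>%
    group_by(division) %>%
    summarise(share_no_height = mean(is.na(height)))
missing_share
```

The same `summarise()` call should work on `df` in place of `toy`.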

Now, this is interesting – the higher the division, the taller the average wrestler:

df %>% 
    drop_na(height) %>% 
    ggplot(aes(height, colour = division)) +
        geom_density() +
        scale_colour_brewer(palette = "Greys") +
        theme_minimal()

The increase in average weight is less of a surprise:

df %>% 
    drop_na(weight) %>% 
    ggplot(aes(weight, colour = division)) +
        geom_density() +
        scale_colour_brewer(palette = "Greys") +
        theme_minimal()

Calculating the mean, or the median to reduce the influence of outliers, yields neatly increasing figures:

df %>% 
    group_by(division) %>% 
    summarise_at(
        vars(height, weight),
        median,
        na.rm = TRUE
    )
## # A tibble: 6 x 3
##   division height weight
##   <ord>     <dbl>  <dbl>
## 1 Jk         176    106 
## 2 Jd         178    114 
## 3 Sd         180.   126.
## 4 Ms         182.   135.
## 5 J          183    144 
## 6 M          185    149
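To illustrate why the median is the safer choice, here is a small sketch that computes both statistics side by side with `across()`. The `toy` data is invented, with one deliberately heavy outlier in the "J" group; it is not taken from the dataset:

```r
library(dplyr)

# Toy weights; 190 is an artificial outlier in division "J"
toy <- tibble(
    division = c("J", "J", "J", "M", "M", "M"),
    weight = c(140, 144, 190, 145, 149, 150)
)

# Mean and median per division, named weight_mean and weight_median
toy_summary <- toy %>%
    group_by(division) %>%
    summarise(across(
        weight,
        list(
            mean = ~ mean(.x, na.rm = TRUE),
            median = ~ median(.x, na.rm = TRUE)
        )
    ))
toy_summary
```

In this toy example the outlier lifts the "J" mean above the "M" mean, while the medians keep the expected order.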