30 Day Chart Challenge 2025

Let’s make some charts!
Author

gregers kjerulf dubrow

Published

April 25, 2025

image listing the prompts for the 30 day chart challenge 2025.

The Plan for the 2025 Challenge

It’s time again for the 30 Day Chart Challenge, a social-media-driven data viz challenge to create and share a data visualization on any topic, but somehow corresponding to the daily prompt as suggested by the organizers.

I did participate last year, though started late and did not get to every prompt. Like last year I am going to keep all the prompt posts in one mega-post, adding to it regularly. Perhaps not each day, but I will try to get to every prompt. Like last year:

  • I will post contributions to social media via my Bluesky, and LinkedIn accounts.
  • I will add contributions to this post, meaning the posting date will refresh to the date I’ve added the chart. So If I get to a lot or even all of them, this will be a long post.
  • Use the table of contents to the right to navigate to a specific chart.
  • Images are “lightboxed”, so click on them to enlarge. Click esc or outside the image to restore the page.

One major change is that this year I am going to only pull data from Danmarks Statistik (Statistics Danmark, the national statistics agency) via the danstat package for r. To the extent possible I’ll be focusing on education data. These constraints will help me by not having to figure out which data source to use each day.

It may not be a straight linear narrative and not every post will necessarily connect to the whole. I may not even know exactly what I have until I’m done. I have sketched out ideas for many of the prompts, but things may change as I go along. If the daily prompt just doesn’t lend itself to Danish education data, or Danish data at all, the subject for that chart will be something else.

As with all of my posts, I’m offering a tutorial/how-to as much as I am trying to convey insights from the data. If a topic or approach interests you, hopefully there’s enough explanation in the writing or the code so that you can do something similar or use it to hone your r skills.

Let’s get to the charts.

Prompt #1 - Fractions

The first week of prompts is devoted to comparisons, and the first day is fractions. I cheated a bit and replicated last year’s 1st prompt “Part-to-Whole”. But where last year I looked at educational attainment in the UK, this year it’s Denmark.

I won’t do too much explaining about education levels in Denmark, you can read up on them on the ministry’s page.

We’ll do a horizontal stacked bar chart displaying the highest education attained by Danes aged 25-69 as of 2023, with this population divided into four age groups.

To pull data from danstat, which accesses the Danmarks Statistik API, you have to know at least the table name and number. So it’s necessary to poke around Danmarks Statistik’s StatBank page and figure out which table has the data you need. The value added of the API and package is not having to download and keep flat files.

The table_meta command produces a nested tibble with values for each field in the table. To get the data you first create a list of fields and values (see the varaibles_ed object) then use the get_data command to actually fetch the data.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot

# some custom functions
source("~/Data/r/basic functions.R")


# metadata for table variables, click thru nested tables to find variables and ids for filters
table_meta <- danstat::get_table_metadata(table_id = "hfudd11", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "bopomr", values = c("000", "081", "082", "083", "084", "085")),
    list(code = "hfudd", values = c("H10", "H20", "H30", "H35",
                                        "H40", "H50", "H60", "H70", "H80", "H90")),
    list(code = "køn", values = c("TOT")),
    list(code = "alder", values = c("25-29", "30-34", "35-39", "40-44", "45-49",
                                            "50-54", "55-59", "60-64", "65-69")),
    list(code = "tid", values = 2023))

# past variable list along with table name.
# note that in package, table name is lower case, though upper case on statbank page.
edattain1 <- get_data("hfudd11", variables_ed, language = "da") %>%
    as_tibble() %>%
    select(region = BOPOMR, age = ALDER, edlevel = HFUDD, n = INDHOLD)

# arrange data for chart
# collapse age and education levels into groups
edattain <- edattain1 %>%
    mutate(age_group = case_when(age %in% c("25-29 år", "30-34 år", "35-39 år") ~ "25-39",
                age %in% c("40-44 år","45-49 år") ~ "40-49",
                age %in% c("50-54 år","55-59 år") ~ "50-59",
                age %in% c("60-64 år","65-69 år") ~ "60-69")) %>%
    mutate(ed_group = case_when(edlevel == "H10 Grundskole" ~ "Grundskole/Primary",
                edlevel %in% c("H20 Gymnasiale uddannelser",
                "H30 Erhvervsfaglige uddannelser",
                "H35 Adgangsgivende uddannelsesforløb") ~ "Secondary",
                edlevel == "H40 Korte videregående uddannelser, KVU" ~ "Tertiary - 2yr",
                edlevel %in% c("H50 Mellemlange videregående uddannelser, MVU",
                "H60 Bacheloruddannelser, BACH") ~ "Tertiary - Bachelor",
                edlevel == "H70 Lange videregående uddannelser, LVU" ~ "Tertiary - Masters",
                edlevel == "H80 Ph.d. og forskeruddannelser" ~ "Tertiary - PhD",
                edlevel == "H90 Uoplyst mv." ~ "Not stated")) %>%
    group_by(region, age_group, ed_group) %>%
    mutate(n2 = sum(n)) %>%
    ungroup() %>%
    select(-n, -age) %>%
    distinct(region, age_group, ed_group, .keep_all = T) %>%
    rename(n = n2) %>%
    mutate(ed_group =
                factor(ed_group,
                levels = c("Grundskole/Primary", "Secondary", "Tertiary - 2yr",
                "Tertiary - Bachelor", "Tertiary - Masters", 
                "Tertiary - PhD", "Not stated")))

Now that we have a nice tidy tibble, let’s make a chart. The approach is fairly straight-forward ggplot. I used tricks like fct_rev to order the age groups on the chart, and the scales and ggtext packages to make things look nicer.

Show code for making the chart

## chart - all DK, horizontal bar age groups, percent stacks percent by deg level
edattain %>%
    filter(region == "Hele landet") %>%
    group_by(age_group) %>%
    mutate(age_total = sum(n)) %>%
    mutate(age_pct = round(n/age_total, 2)) %>%
    select(age_group, ed_group, age_pct) %>%
    # the line below passes temporary changes to the data object through to the chart commands
    {. ->> tmp} %>%
    ggplot(aes(age_pct, fct_rev(age_group), fill = fct_rev(ed_group))) +
    geom_bar(stat = "identity") +
    scale_x_continuous(expand = c(0,0),
                breaks = c(0, 0.25, 0.50, 0.75, 1),
                labels = c("0", "25%", "50%", "75%", "100%")) +
    geom_text(data = subset(tmp, age_pct >0.025),
                aes(label = scales::percent(round(age_pct , 2))),
                position = position_stack(vjust = 0.5),
                color= "white", vjust = 0.5, size = 12) +
    scale_fill_brewer(palette = "Set3") +
    labs(title = "Danes under 40 have a higher rate of post-Bachelor educational attainment than other age groups",
             subtitle = "Highest education level attained by age groups",
             caption = "*Data from Danmarks Statistik via danstat package*",
             x = "", y = "Age Group") +
    theme_minimal() +
    theme(legend.position = "bottom", legend.spacing.x = unit(0, 'cm'),
                legend.key.width = unit(4, 'cm'), legend.margin=margin(-10, 0, 0, 0),
                legend.text = element_text(size = 12), legend.title = element_text(size = 16),
                plot.title = element_text(hjust = .5, size = 20),
                plot.subtitle = element_text(size = 16),
                plot.caption = element_markdown(size = 12, face = "italic"),
                axis.text.x = element_text(size = 14),
                axis.text.y = element_text(size = 14),
                panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    guides(fill = guide_legend(label.position = "bottom", reverse = TRUE, 
                direction = "horizontal", nrow = 1, 
                title = "Highest Educational Attainment", title.position = "top")) 
rm(tmp)

horizontal bar chart showing educational attainment by age group

As you might expect, younger Danes tend to have higher levels of educational attainment than Danes over 60. That’s not uncommmon in countries with post-secondary education policies designed to increase access and attainment. As in the US this is a post-WWII development, though accelerated in the 1960s and after.

It’s been successful public policy in Denmark for many decades to increase access to schooling beyond the Grundskole (primary years…to about grade 9 in a US context). More than 45% of Danes between 25-40 have at least a Bachelor’s degree, and 20% have a Master’s. There would of course be more nuance in disaggregated age groups, but I wanted to keep this first story a bit more simple.

created and posted March 24, 2025

Prompt #2 - Slope

Building on what we saw in Prompt 1 for educational attainment by age group in 2023, let’s use the slope prompt to compare 2023 to 2005, the furthest back we can get educational attainment by age statistics via Danmarks Statistik’s StatBank.

The process for getting data is more or less the same as in Prompt 1, so I won’t repeat the code here. The only thing I did differently was collapse the 2-year and Bachelor’s degrees into one “college” category and collapsed Master’s and PhD into one post-bacc group.

For this chart I want to see if there are differences in highest educational attainment level by age group in two distinct years. Instead of one chart with too many lines, essentially a spaghetti graph, I made four charts and used the patchwork package (created by a Dane!) to facet out the individual charts into a (hopefully) coherent whole.

Keep in mind that while of course many people will be represented in both 2005 and 2023, this is not a longitudinal study. It’s a snapshot of the Danish population by age groups at two distinct points in time.

We have the data, and it’s time to build the charts. But instead of copy-pasting the ggplot code to build each charts, let’s create a function and use that four times.

Show code for the chart function
# plotdf is the placeholder name in the function for the data you pass to run it
slope_graph <- function(plotdf) {
    plotdf %>%
    ggplot(aes(x = year, y = age_ed_pct, group = age_group, color = age_group)) +
        geom_line(size = 1) +
        geom_point(size = .2) +
        scale_x_continuous(
            breaks = c(2005, 2023),
            labels = c("2005", "2023")) +
        scale_y_continuous(limits = c(0, .5),
            labels = scales::label_percent()) +
        scale_color_brewer(palette = "Set2") +
        labs(x = "", y = "") +
        theme_minimal() +
        theme(legend.position = "bottom", legend.spacing.x = unit(0, 'cm'),
            legend.key.width = unit(3, 'cm'), legend.margin=margin(-10, 0, 0, 0),
            legend.text = element_text(size = 10), legend.title = element_text(size = 12),
            plot.title = element_text(hjust = .5, size = 16),
            plot.subtitle = element_markdown(size = 14, vjust = -.5),
            plot.caption = element_markdown(size = 12, face = "italic"),
            axis.text.x = element_text(size = 11),
            axis.text.y = element_text(size = 11),
            panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
            strip.background = element_blank()) +
        guides(color = guide_legend(label.position = "bottom", reverse = FALSE, 
            direction = "horizontal", nrow = 1,
            title = "Age Group", title.position = "top"))
}

Now let’s put the function to use. We’ll create four plots, then stitch them together. Each plot will have its own descriptive title and subtitle, and the final chart will have one as well. Notice that we’re using the titles to tell a quick story or share an insight, not to dryly name what the chart is doing. The ggtext package gives us markdown enhancements for chart elements like the title and subtitle.

Show code for making the final chart
# create individual plots with titles and annotations
plot_grundsk <-
    edattain2 %>%
    filter(ed_group2 == "Grundskole/Primary") %>%
    slope_graph() +
    labs(title = "Primary school (thru grade 10)",
             subtitle = "*In 2023 Danes of all ages were much less likely to have stopped their education at<br>primary school
             than they were in 2005.<br>*")

plot_hs <-
    edattain2 %>%
    filter(ed_group2 == "Secondary") %>%
    slope_graph() +
    labs(title = "Gymnasium & Vocational (High School)",
             subtitle = "*From 2005 to 2023, there was a steep decline in the percentage of Danes<br>
             aged 25-49 who were finished with education at the high school level, especially<br>
             Danes under 40. For Danes older than 50 there was a very slight increase.*")

plot_colldegs <-
    edattain2 %>%
    filter(ed_group2 == "Tertiary - 2yr/Bach") %>%
    slope_graph() +
    labs(title = "2-year & Bachelor's Degrees",
             subtitle = "*For Danes of all age groups, but especially those under 50, there was a<br>
             noticable increase in the percentage earning 2 or 4 year degrees.*")

plot_masters <-
    edattain2 %>%
    filter(ed_group2 == "Tertiary - Masters+") %>%
    slope_graph() +
    labs(title = "Master's & PhD Degrees",
             subtitle = "*The percentage of Danes earning a Master's or PhD increased across all age groups between 2005 and 2025;<br>
             the increase was strongest among Danes under 50.*")

plot_grundsk + plot_hs + plot_colldegs + plot_masters +
    plot_annotation(
        title = "Danes of all ages have become more likely to continue their education beyond primary level. Danes under 50 have over time become likely to earn a Master's.",
        subtitle = "*Highest level of education earned by age groups, in 2005 and 2023.*",
        caption = "*Data from Danmarks Statistik via danstat package. Groups are not longitudinal - age is for the person in the year of data collection.*") &
    theme(plot.title = element_text(size = 16), plot.subtitle = element_markdown(),
        plot.caption = element_markdown(),
        # plot.background add the grey lines between plots
        plot.background = element_rect(colour = "grey", fill=NA))

four slope graphs displaying educational attainment by age group

The story here is hopefully clear. In 2005, more Danes, especially Danes 60+ years of age, had ended their education at primary level. By 2023 many more Danes were likely to have progressed beyond the primary level, and younger Danes especially were finishing Bachelor’s degrees and earning Master’s degrees.

It’s been government policy to support furthering education levels via free tuition and living stipends. This is made possible by a democratic-socialism approach to the economy, with higher tax rates on some earners than you might see in the US or UK, but for those taxes you received and your kids get at least 4 years of college paid for, and a small living stipend to boot.

In the US public higher education affordability models vary from state to state. Some states try to keep cost-of-attendance low via more direct state support. But not all students get full rides, even with need-based aid. I am much more a fan of the socialized approach to college affordability. the data shows it works in terms of increasing overall educational attainment.

created and posted March 24, 2025

Prompt #3 - Circular

Went wild with the circles here, sixty of them altogether. I took inspiration from this Albert Rapp post on using a bubble chart in place of a heat map.

We are still pulling data from from Danmarks Statistik via the danstat package. This time we’ll use the “Labour & income” domain to look at earnings by employment sector and highest level of education attained, the LONS11 table; løn is the Danish word for wage or salary. Of the twenty-five available earnings indicators, we’re using the MDRSNIT field, which is standardized monthly earnings. I couldn’t easily find the documentation on exactly how it was imputed, but given the analysis is across sectors, I thought it made the most sense. We’ll also pull in whether the employee was salaried or hourly (spoiler alert…it matters).

Let’s first get and clean the data.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot

# some custom functions
source("~/Data/r/basic functions.R")

# lons11
table_meta <- danstat::get_table_metadata(table_id = "lons11", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = c("H10", "H20", "H30", "H40", "H50", "H60", "H70", "H80")),
    list(code = "sektor", values = c(1016, 1020, 1025, 1046)),
    list(code = "lønmål", values = "MDRSNIT"), # avg monthly
#   list(code = "køn", values = c("M", "K")),
  list(code = "afloen", values = c("TIME", "FAST")),
    list(code = "tid", values = 2023))

sal1 <- get_data("lons11", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# earnings came in character form, so changing to numeric and rounding up to whole numbers
# group the ed levels and change the names, and change names on sector. make them factors with orders.
sal2 <- sal1 %>%
    mutate(income = as.numeric(indhold)) %>%
    mutate(income = round(income, 2)) %>%
    mutate(sector = case_when(
        sektor == "Corporations and organizations" ~ "Private sector",
        sektor == "Government including social security funds" ~ "Gov't - National",
        sektor == "Municipal government" ~ "Gov't - Municipal",
        sektor == "Regional government" ~ "Gov't - Regional")) %>%
    mutate(sector =
        factor(sector,
        levels = c("Gov't - Municipal", "Gov't - Regional",
        "Gov't - National", "Private sector"))) %>%
    mutate(ed_level =
        case_when(uddannelse == "H10 Primary education" ~ "Primary",
        uddannelse == "H20 Upper secondary education" ~ "HS-Academic",
        uddannelse == "H30 Vocational Education and Training (VET)" ~
        "HS-Vocational",
        uddannelse == "H40 Short cycle higher education" ~
        "Short-cycle college",
        uddannelse == "H50 Vocational bachelors educations" ~
        "Bachelor-Vocational",
        uddannelse == "H60 Bachelors programs" ~ "Bachelor-Academic",
        uddannelse == "H70 Masters programs" ~ "Masters",
        uddannelse == "H80 PhD programs" ~ "PhD")) %>%
    mutate(ed_level =
        factor(ed_level,
        levels = c("Primary", "HS-Academic", "HS-Vocational",
        "Short-cycle college", "Bachelor-Vocational", 
        "Bachelor-Academic", "Masters", "PhD")))

Ok, we have the data, onto making the circles. Again, we’re doing a bubble chart that functions as a heat map by pegging the bubble size to the value of the earnings. I also added text labels next to the bubbles.

Show code for making the chart
sal2 %>%
    ggplot(aes(x = sector, y = ed_level)) +
    geom_point(aes(col = income, fill = income, size = income), shape = 21) +
    theme_minimal() +
    scale_size_area(max_size = 15) +
    labs(x = "", y = "",
             title = "More education = higher monthly earnings across all sectors. Larger effect for salaried vs hourly workers",
             subtitle = "Monthly standardized earnings by education level & earner category",
             caption = "*Data from Danmarks Statistik via danstat package*") +
    facet_wrap(~ afloen) +
    theme(legend.position = 'top',
                legend.justification = c(.95,0),
                plot.title = element_text(size = 20),
                plot.subtitle = element_text(size = 16),
                plot.caption = element_markdown(size = 12, face = "italic")) +
    guides(
        col = guide_none(),
        size = guide_none(),
        fill = guide_colorbar(
            barheight = unit(0.5, 'cm'),
            barwidth = unit(10, 'cm'),
            title.position = 'top',
            title = "Standardized Monthly Earnings (Danish Kroner)")) +
    scale_fill_continuous(limit = c(25000, 80000),
        breaks = c(30000, 40000, 50000, 60000, 70000, 80000)) +
    geom_text(data = subset(sal2, !is.na(income)),
        aes(label =
                    paste0(round(income, 0), " DKK")), nudge_x = 0.35, size = 3)

I see the story here being that education level matters when it comes to how much Danish workers earn in a month. If you work in the private sector, having a Masters 8,000 to 10,000 more DKK per month than Bachelor’s degrees. The difference is even greater working for regional government agencies. Also, hourly workers with the same education earn less per month than salaried workers. But without the component parts of the total earnings I can’t explain exactly why.

The bottom line is, as we’re seeing so far in the charts, there’s both an increase in the number of people earning college and post-bacc degrees, and the payoff seems to be there for the extra year of school it takes for most programs.

But again, given that education is socialized and subsidized through taxes, the opportunity cost for that extra year is nothing, because the basic higher education coverage is for five years, for tuition and living stipend. And higher wage earners pay more taxes, helping to boost everyone’s opportunity to earn the degrees they desire, without being worried about cost.

It’s worth mentioning here, that there’s discussion ongoing in the Danish government to squeeze education spending and spots available in the gymnasiums (essentially high school) that feed university programs. I need to read up on that a bit more as I tell this Danish education story while doing the chart challenge.

Oh, the exchange rate is usually around 6.5 or so DKK to USD. So while it may seem amazing to earn 50000 per month, and while that’s not bad in this market, adjust your bearings if you’re thinking in USD or EURO.

bubble chart displaying monthly earnings by educational attainment and employment sector

created and posted April 2, 2025

Prompt #4 - Big or Small

For this post I’ll stay with a circle theme and make a simple circle packing chart, sort of a heat map or tree map but with circles, the size of the circles corresponding to the variable value. I’m using the code-through from the R Graph Gallery to get going, as I’ve never done this type of chart before. The packcircles package does the computational work to create the circle objects.

Keeping with the thematic constraint I’ve set, I’ll once again be using Danish education data from Danmarks Statistik via the danstat package.

This chart will display Bachelor’s degrees awarded in 2023, by top-line degree field (Science, Humanities, Education, etc). For instance, Law is a sub-field in Social Sciences, one of many. To make it a bit easier I’ll limit this chart to academic Bachelor’s, not the vocational degrees. There’s definitely a more focused project worth doing that displays the sub-fields within the top-line areas, shown as nested smaller circles within larger circles, or something interactive. for now this will suffice.

To start, let’s get the data. It’s from a different table than the last chart, so I’ll show the work.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable
# the uddanelse codes are the top-line degree fields.
variables_ed <- list(
    list(code = "uddannelse", values = c("H6020", "H6025", "H6030", "H6035",
                 "H6039", "H6059", "H6075", "H6080", "H6090")),
    list(code = "fstatus", values = c("F")),
    list(code = "tid", values = 2023))

# start with a larger dataset in case we want to do more analysis or viz with it
degs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble() %>%
    # shorten the given degree field names
    mutate(deg_field = case_when(UDDANNELSE == "H6020 Educational, BACH" ~ "Educ.",
            UDDANNELSE == "H6025 Humanities and theological, BACH" ~ "Humanities",
            UDDANNELSE == "H6030 Arts, BACH" ~ "Arts",
            UDDANNELSE == "H6035 Science, BACH" ~ "Science",
            UDDANNELSE == "H6039 Social Sciences, BACH" ~ "Social Science",
            UDDANNELSE == "H6059 Technical sciences, BACH" ~ "Tech Science",
            UDDANNELSE == "H6075 Food, biotechnology and laboratory technology, BACH" 
                        ~ "Food/Biotech/LabTech",
            UDDANNELSE == "H6080 Agriculture, nature and environment, BACH"
                     ~ "Agricultural Science",
            UDDANNELSE == "H6090 Health science, BACH" ~ "Health Sciences")) 

# keep only what we ween to do the circles
degs2 <- degs1 %>%
    select(group = deg_field, value = INDHOLD, text)

# this creates x & y coordinates and circle radius values.
degs_packing <- circleProgressiveLayout(degs2$value, sizetype='area')
# new tibble putting it together
degs3 <- cbind(degs2, degs_packing)
# not sure exactly what this does, but it's an important step
degs.gg <- circleLayoutVertices(degs_packing, npoints=50)

Ok, we now have a nice clean tibble, time for the plot. What took the most time here was working on the labels…getting them to the correct size, adjusting for circle size. You’ll notice one label set off to the side thanks to the ggrepel package.

Show code for making the chart
ggplot() +
    geom_polygon_interactive(
        data = degs.gg,
        aes(x, y, group = id, fill=id, tooltip = data$text[id], data_id = id),
        colour = "black", alpha = 0.6) +
    scale_fill_viridis() +
    geom_text(data = degs3 %>% filter(value > 6000),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=7, color="black") +
    geom_text(data = degs3 %>% filter(between(value, 2410, 3350)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=7, color="black") +
    geom_text(data = degs3 %>% filter(between(value, 600, 2410)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=6, color="black") +
    geom_text_repel(data = degs3 %>% filter(between(value, 200, 400)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=4, color="black",
            max.overlaps = Inf, nudge_x = -110, nudge_y = 50,
            segment.curvature = 0,
            segment.ncp = 8,
            segment.angle = 30) +
    labs(x = "", y = "",
             title = "Social Sciences were the most popular Bachelor's degrees awarded by Danish universities in 2023",
             subtitle = "*Labels not diplayed: Education = 134, Food Science = 61*",
             caption = "*Data from Danmarks Statistik via danstat package*") +
    theme_void() +
    theme(legend.position="none", plot.margin=unit(c(0,0,0,0),"cm"),
            plot.title = element_markdown(size = 16),
            plot.subtitle = element_markdown(size = 12),
            plot.caption = element_markdown(size = 8)) +
    coord_equal()

packed circle graph displaying Bachelors degrees awarded by field

There you have it, academic Bachelor’s degrees awarded by Danish universities in 2023. The Social Sciences category accounts for more than twice as many degrees as the next field. Again, the next interesting step would be to visualize the disaggregation of the main fields, and see what subfields are the most popular. It would also be interesting to track the numbers over time, see which majors rose or fell.

created and posted April 2, 2025

Prompt #5 - Ranking

For this prompt I desperately wanted to do something relating to music from the band The English Beat because of Ranking Roger. But I’d already committed to the Danish education bit, and Spotify shut down all the interesting API endpoints and don’t seem to be reanimating them any time soon. Something about “safety”, but it’s likely AI crawlers, the same ones increasing hosting costs for Wikipedia and other websites as they steal content for their models. And funny for Spotify to complain about someone else using their work to monetize a product.

But anyway, let’s talk ranks. Specifically let’s look at the rankings of Bachelor degrees awarded in Denmark in 2023. I’ll focus on the academic (4-year) degrees because I wanted to look at majors within broader disciplines and look at differences by gender.

Again, data comes from Danmarks Statistik via the danstat package.

So first, let’s get the data. It’s from the same table used in prompt 4 but retreived slightly differently so I’ll show it again. This time we’re getting all the majors within the broader disciplines so in the list call we use list(code = "uddannelse", values = "*").

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable.
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = c("M", "K")),
    #list(code = "alder", values = c("TOT")),
    list(code = "tid", values = 2023))

bacdegs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble()

bacdegs <- bacdegs1 %>%
    # clean up the degree text and code field
    mutate(deg_code = str_extract(UDDANNELSE, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', UDDANNELSE)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_group = case_when(
        deg_code %in% c("H6020", "H6025", "H6030", "H6035", "H6039",
                                        "H6059", "H6075", "H6080", "H6090") ~ "Main",
        deg_code == "H60" ~ "All",
        TRUE ~ "Sub")) %>%
    # create discipline category for making individual plots.
    mutate(deg_field =  case_when(
        str_detect(deg_code, "H6020") ~ "Education",
        str_detect(deg_code, "H6025") ~ "Humanities",
        str_detect(deg_code, "H6030") ~ "Arts",
        str_detect(deg_code, "H6035") ~ "Science",
        str_detect(deg_code, "H6039") ~ "Social Sciences",
        str_detect(deg_code, "H6059") ~ "Technical sciences",
        str_detect(deg_code, "H6075") ~ "Technical sciences",
        str_detect(deg_code, "H6080") ~ "Agriculture & Veterinary",
        str_detect(deg_code, "H6090") ~ "Health science",
        TRUE ~ "no")) %>%
    rename(degs_n = INDHOLD, sex = KØN)

Now that we have tidy data, let’s make the charts. This time out it’s horizontal bar charts, ordered by number of degrees awarded. Note that because I’m faceting by gender, and because I want to show the top majors in descending order for each gender, I use the scales = "free_y" option in the facet function.

A few things before we see the work. First, I had hoped to write a function and output all the by-discipline charts but it was too complicated to get each chart to look exactly as I wanted, and I realized I’d have to do lots of individual formatting anyway. So sadly it’s “copy-paste-amend” for all the charts. I’ll only show the code for the chart showing all degrees and for one of the discipline charts, as the rest of the discipline charts were essentially the same, just minor edits on the geom_text() code.

Second, I had to make a design choice regarding the x axes. I decided to keep them standard to the maximum number of degrees by major within each discipline, so the x axes for men & women is the same. To me, having different x axes maximums in the same chart was distorting the scope of the bars, making different numbers look the same.

Third, I am only displaying degree counts, not percentages. I went back and forth on that as well, and decided to go with the counts and save the percentages for the “Diverging” prompt on Day 9.

Show code for making the charts

bacdegs %>%
    filter(deg_group == "Main") %>%
    {. ->> tmp} %>%
    ggplot(aes(x = degs_n, y = reorder_within(deg_name, degs_n, sex), fill = sex)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_y_reordered() +
    scale_x_continuous(labels = comma) +
    labs(x = "", y = "",
             title = "All Bachelor degrees by field & sex, 2023",
             subtitle = glue::glue("*Total degrees earned = 19,199: Men = 8,694, Women = 10,505*"),
             caption = "*Data from Danmarks Statistik via danstat package*") +
    facet_wrap(~ sex, scales = "free_y") +
    geom_text(data = subset(tmp, degs_n > 3000),
                        aes(label = comma(degs_n)), hjust = 1.5, color = "white") +
    geom_text(data = subset(tmp, degs_n < 3000),
                        aes(label = comma(degs_n)), hjust = -.5, color = "black") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
                legend.position = "none",
                plot.subtitle = element_markdown(),
                plot.caption = element_markdown())


# agriculture and veterinary sciences
bacdegs %>%
    filter(deg_group == "Sub") %>%
    filter(deg_field == "Agriculture & Veterinary") %>%
    {. ->> tmp} %>%
    ggplot(aes(x = degs_n, y = reorder_within(deg_name, degs_n, sex), fill = sex)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_y_reordered() +
    scale_x_continuous(labels = comma) +
    labs(x = "", y = "",
             title = "Agriculture & Veterinary Science Bachelor degrees by field & sex, 2023",
             subtitle = glue::glue("*Total degrees earned = 308: Men = 55, Women = 253*")) +
    facet_wrap(~ sex, scales = "free_y") +
    geom_text(data = subset(tmp, degs_n > 100),
                        aes(label = comma(degs_n)), hjust = 1.5, color = "white") +
    geom_text(data = subset(tmp, degs_n < 100),
                        aes(label = degs_n), hjust = -.5, color = "black") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
                legend.position = "none",
                plot.subtitle = element_markdown())

horizontal bar graph displaying Bachelors degrees awarded by field & sex

This first chart shows all degrees awarded by discipline and sex. The most immediate thing to notice is that more Bachelor degrees were awarded to women than to men. But that tracks with a chart I did last year, showing rates of educational attainment by sex and it’s similar to other countries, including the US.

Social science degrees were the most popular for both men and women. After that it changes a bit. Humanities were 2nd most popular for women, Technical Sciences next for men. And so forth. To be honest, it’s kind of a shock to see degree fields so starkly gendered now. I would have though some disciplines would have been a bit more evenly split. Again, we’ll dive deeper into the percentage differences on Day 9 for the “Diverging” prompt.

In the tab set below are the rest of the charts, one for each major discipline. If you’re interested you can comb through and see which majors were most popular in each discipline. They are lightboxed, so click on the image to enlarge it, and click esc or outside the image to restore it to the page.

The only ones I did not make were for Education and the Food-Biotech-Lab Tech degrees as each only had one major with degrees awarded. In Education there were 134 degrees awarded, 117 to women, 17 to men. For Food-Bio-Lab Tech there were 61 degrees total, 20 to Men, 41 to Women all in a general Food & Nutrition Science major.

This makes me think I could have done this in Tableau and let the user filter by discipline. Or I could have done parameterized charts in Quarto. Either way I could have added other degree levels. Lots of data prep, but it would be a useful chart. Maybe after the 30 Day Challenge is over.

created and posted April 4, 2025

Prompt #6 - Florence Nightingale

The obvious association with Florence Nightingale is that of nursing, so today let’s look at health-care degrees, and who gets them. I’m especially interested in degrees for a job called in shorthand here SOSU, or “social- og sundhedsassistent”. Essentially a nurses aid/nursing assistant. When I was in the hospital after my bicycle accident in 2024 I noticed almost all SOSUs were women and many first or second-generation immigrants.

Which makes sense when you think of recent immigrants to any country and the types of jobs the first wave immigrants, mainly those who don’t come in with college degrees or professional credentials, tend to get…merchants, housekeeping, health care assistants. They establish a stable life and their kids tend to get more education and higher paying jobs. It’s the immigration story told often over the decades and in many countries.

I feel like I need to make clear here that this is not intended to “other” immigrants to Denmark. I want to help explain SOSU degrees in the conext of their paths to economic and social stability through education and decently paying stable jobs.

You can read more about the path to a SOSU degree and job in this explainer and more about vocational secondary education in particular on pages 10-11 here in this PDF report.

Again, data comes from Danmarks Statistik via the danstat package. This time we’ll look at table UDDAKT35, which has degrees completed and includes immigration category.

First we’ll get the data.

Show code for getting and cleaning data
# Nursing and SOSU degrees by gender and immigrant status

library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot
library(scales)
library(tidytext)

# some custom functions
source("~/Data/r/basic functions.R")

# UDDAKT35 - get english and danish versions of metadata to make sure we get the right degree codes.
table_meta <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = c("H30", "H3010", "H301020")),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = "*"),
    list(code = "herkomst", values = "*"),
    list(code = "tid", values = c(2019, 2020, 2021, 2022, 2023, 2024)))

# fetch the data using the variable list
sosu1 <- get_data("uddakt35", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# set up the main dataset...create a main degree field, clean up the national origin field
sosu_all <- sosu1 %>%
    rename(sex = kon, national_origin = herkomst, year = tid, n = indhold) %>%
    select(-fstatus) %>%
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = case_when(
        deg_code == "H30" ~ "VET All",
        deg_code == "H3010" ~ "Health care/Educ All",
        deg_code == "H301020" ~ "SOSU")) %>%
    mutate(national_origin = ifelse(
        national_origin == "Persons of Danish origin", "Danish origin", national_origin)) %>%
    mutate(national_origin =
        factor(national_origin,
        levels = c("Total", "Danish origin", "Descendant",
        "Immigrants", "Unknown origin"))) %>%
    mutate(sex = ifelse(sex == "Sex, total", "Total", sex))

We have the base frame now from which to make a few charts. All of them will cover the years 2019 to 2024. The chart code isn’t all that complicated so I won’t include them in this post. If you’re interested you can find them in the github repo for this year’s challenge.

We’ll first look at total SOSU degrees, then SOSU as a percent of all the vocational degrees in it’s main group. We’ll also look at sex and national origin.

bar graph displaying SOSU degrees awarded 2019 to 2024

The key insight that jumps out here is the more than 200% increase in SOSU degrees awarded during this period. The one thing I’m not sure of here if that has to do with any policy change or something else that helped the numbers increase so much over the last five years.

How about SOSU degrees about relative to all H30 vocational programs?

bar graph displaying SOSU degrees awarded 2019 to 2024 as a pct of all vocational degrees

SOSU degrees were 2% of the H30 vocational category in 2019 and 4% in 2024. The H30 group grew by almost 1,900 (6%) during that period, meaning SOSU degrees were responsible for more than the growth of all degrees in this group.

So who gets SOSU degrees in terms of gender and national origin status? Unsurprisingly, given the very gendered nature of nursing health care assistant jobs, about 90% of SOSU degrees go to women.

bar graph displaying SOSU degrees awarded 2019 to 2024

What about national origin? First it helps to explain how that’s defined in Denmark for these statistics (page is in Danish). Someone of Danish origin has at least 1 parent born in and a citizen of Denmark, regardless of where the person was born. A Descendant is considered someone to born in Denmark but neither parent is a citizen or born in Denmark. However if at least one of a Descendant’s parents acquire Danish citizenship, then that person is now of Danish origin. Finally, an Immingrant is considered as someone born abroad and neither parent is a citizen of or born in Denmark. Immigrants can of course gain citizenship, but I’m not sure if they then become considered of Danish origin for these statistics.

bar graph displaying SOSU degrees awarded 2019 to 2024

For the past few years, immigrants have comprised about 30% or so of all who earned a SOSU degree.

So how do these gender and national origin percentages compare to all H30 vocational programs?

bar graph displaying H30 degrees awarded 2019 to 2024

Well, the H30 degrees are not only much more balanced, a slight majority are awarded to men.

As for national origin…

bar graph displaying H30 degrees awarded 2019 to 2024

…immigrants are growing as a percent of all who earn H30 vocational degrees, at 14% in 2024 up from 8% in 2019. but that’s less than half the 34% percent of SOSU degrees awarded to immigrants.

So to sum up, its as expected; a vast majority of SOSU degrees go to women, and the percentage of SOSU degrees going to immigrants has been rising.

There’s plenty more to examine here, for instance:

  • How do SOSU degree demographics compare to nursing degrees and other medical fields?
  • What’s the demographic picture in other vocational fields? Or in academic secondary, or academic higher education?

created and posted April 5, 2025

Prompts #7 & 8 - Outliers & Histogram

These next six prompts are under a general theme of Distributions, or how data is arrayed or spread out by one of more factors.

Taking another look at vocational degrees, it turned out to be best to combine these first two prompts, outliers and histogram, into one post. So we’ll use the same table UDDAKT35, from Danmarks Statistik via the danstat package.

This time we’ll look at all degrees awarded and see the overall distribution (histogram), and degrees which had the most and fewest awarded (outliers). So reversing the prompts I guess, but keeping with the spirit of the challenge.

As always, we’ll fetch and clean the data first. Some slight changes to the previous prompt where we used this table. This time we’ll get all degree programs and just for 2024. We won’t worry about demographics this time.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot
library(scales)
library(tidytext)

# some custom functions
source("~/Data/r/basic functions.R")

# UDDAKT35 - get english and danish versions of metadata to make sure we get the right degree codes.
table_meta <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
# getting all education programs, pare down
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "tid", values = 2024))

# fetch all the data
vet1 <- get_data("uddakt35", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# clean up a bit, create category fields to help with charts
vet_all <- vet1 %>%
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    filter(!grepl("H29", deg_code)) %>%
    filter(!grepl("Total", deg_code)) %>%
    mutate(deg_level = if_else(str_length(deg_code) == 5, "Main", "Sub")) %>%
    mutate(deg_level = if_else(deg_code == "H30", "All", deg_level)) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) 

# create outlier categories. only majors with +0 degrees
vet_majors <- vet_all %>%
    filter(deg_level == "Sub") %>%
    filter(indhold > 0) %>%
    mutate(degs_mean = mean(indhold),
                 degs_std = sd(indhold),
                 degs_pctl10 = quantile(indhold, 0.10),
                 degs_pctl90 = quantile(indhold, 0.90)) %>%
    mutate(outlier = case_when(
        indhold >= degs_pctl90 ~ "outlier+ > 90thpctl",
        indhold <= degs_pctl10 ~ "outlier- < 10thpctl",
        TRUE ~ "not outlier")) %>%
    mutate(outlier = factor(outlier,
        levels = c("outlier+ > 90thpctl", "not outlier", "outlier- < 10thpctl")))

The vet_majors dataframe created at the end is what we’ll use for outlier analysis – it keeps only the majors (not the top-line category) and only for degrees awarded greater-than 0. As we’ll see in the histogram, the distributions were so skewed I set outlier status at equal-or-above the 90th percentile and equal-or-below the 10th percentiles.

Ok, let’s build the histogram. I added vertical lines at the median and mean to highlight the skew.

Show code for making the histogram
vet_majors %>%
    ggplot(aes(indhold)) +
    geom_histogram(fill = "#CC79A7", bins = 100) +
    geom_vline(aes(xintercept = mean(indhold)), color="black") +
    geom_vline(aes(xintercept = median(indhold)), color="#A9A9A9") +
    labs(x = "", y = "",
            title = "Eight programs with +1,000 degrees awarded accounted for 43% of the H30 Vocational degrees awarded in 2024",
        subtitle = "*Total H30 degrees awarded in 2024 = 30,570*",
        caption = "*Data from Danmarks Statistik table UDDAKT35 via danstat package*") +
    scale_y_continuous(expand = c(.001, 0), limits = c(0, 20),
            breaks = c(0, 1, 3, 6, 9, 12, 15, 18, 20)) +
    scale_x_continuous(breaks = c(0, 500, 1000, 2000, 3000, 4000, 5000, 6000),
        labels = comma) +
    theme_minimal() +
    theme(plot.title = element_markdown(size = 16, hjust = 0),
        plot.subtitle = element_markdown(size = 14, hjust = 0),
        plot.caption = element_markdown(),
        legend.title = element_blank(),
        legend.position = c(.98, .05),
        legend.justification = c("right", "bottom"),
        legend.direction = "horizontal",
        legend.box.just = "right",
        legend.margin = margin(6, 6, 6, 6),
        panel.grid = element_blank(),
        panel.border = element_blank())

histogram displaying H30 degrees awarded 2024

Most of the fields awarded fewer than 500 degrees. Only 8 awarded more than 1,000. So a heavy left skew to the distribution.

So which fields had the most and least relative to the others? The code is below, we’ll talk about the chart below the image. I used the white-space resulting from the array of data as space for the legend and explanatory notes. For those annotations I created a couple of text vectors to plug in, one with coloring to use as richtext in the annotate call.

Show code for making the outlier chart
# horizontal bar with fills based on outlier status, bars for mean and median
# for annotation,
caption1 <- paste(strwrap("Health, education, business (office, retail, trading (mercantile)), and
        craft trades (electrician, carpentry, mechanic) are by far the most popular vocational degree choices.", 75), collapse = "\n")

caption2 <- glue::glue(
    "<span style = 'color:#A9A9A9;'>Median = 97   </span><span style = 'color:#000000;'>
    Mean = 392</span>"
)

vet_majors %>%
    ggplot(aes(x = indhold, y = reorder(deg_name, indhold), fill = outlier)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#0072B2", "#CC79A7", "#D55E00")) +
    geom_vline(aes(xintercept = mean(indhold)), color="black") +
    geom_vline(aes(xintercept = median(indhold)), color="#A9A9A9") +
    scale_x_continuous(limits = ~ c(0, max(.x) + 0.1), breaks = pretty_breaks(),
        expand = c(0, 0), labels = comma) +
    labs(x = "", y = "",
        title = "Total H30 degrees awarded in 2024 = 30,570",
        caption = "*Data from Danmarks Statistik table UDDAKT35 via danstat package*") +
    annotate("text", x = 2000, y = "Storage, port and terminal education",
        label = caption1, size = 5, hjust = 0) +
    annotate("richtext", x = 4700, y = "Train preparation educations",
        label = caption2,
        size = 4.5, label.color = NA) +
    theme_minimal() +
    theme(plot.title = element_markdown(size = 16),
        plot.caption = element_markdown(),
        axis.text.y = element_text(size = 7),
        legend.title = element_blank(),
        legend.position = c(.98, .05),
        legend.justification = c("right", "bottom"),
        legend.direction = "horizontal",
        legend.box.just = "right",
        legend.margin = margin(6, 6, 6, 6),
        panel.grid = element_blank(),
        panel.border = element_blank())

horizontal bar displaying H30 degrees awarded by field 2024

What we see is there are outliers, if we define an outlier as outside the 10th or 90th percentiles. We knew from prompt 6 that SOSU degrees (social health and education) would account for the greatest proportion of vocational degrees, so no surprise that’s tops. We also see business & commerce related fields as well as craft & trades also being very popular. Glazier and fitness instructor are among the low outliers, below the 10th percentile in the distribution.

A quick note on what might look like odd degree names. Obviously the original names were in Danish. But the translation to English isn’t always perfect. So for example, gartneruddannelsen translates to “Gartner Education” but what it should be is “Gardening”, or training to be a florist or work in landscaping. “Trading Education” comes from handelsuddannelsen which is essentially training to be a merchant - sourcing, logistics, etc.

created and posted April 5, 2025

Prompt #9 - Diverging

For the Diverging prompt last year I looked at educational attainment by gender in Denmark. This year I’ll do a similar chart, and we’ll take a different look at the same data used in the Day 5 Ranking prompt. Here though, instead of arraying degrees and majors in terms of raw counts, we’ll look at the percentage differences between gender by discipline and major.

We will go back to the UDDAKT60 table from Danmarks Statistik via the danstat package.

It’s a bit of a long post because I’m walking though the process of building the nine discipline charts via a function. If you’re most interested in the charts, scroll through the how-to parts.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = c("M", "K")),
    #list(code = "alder", values = c("TOT")),
    list(code = "tid", values = 2023))

# fetch the data, use the janitor package to clean the names
bacdegs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# build the main dataset for charting
bacdegs <- bacdegs1 %>%
    select(-fstatus) %>%
    # create categories for filtering - deg_group is category for all, discipline, or major 
    # deg_name is major, deg_field is discipline, 
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_group = case_when(
        deg_code %in% c("H6020", "H6025", "H6030", "H6035", "H6039",
                "H6059", "H6075", "H6080", "H6090") ~ "Main",
        deg_code == "H60" ~ "All",
        TRUE ~ "Sub")) %>%
    mutate(deg_field =  case_when(
        str_detect(deg_code, "H6020") ~ "Education",
        str_detect(deg_code, "H6025") ~ "Humanities",
        str_detect(deg_code, "H6030") ~ "Arts",
        str_detect(deg_code, "H6035") ~ "Science",
        str_detect(deg_code, "H6039") ~ "Social Sciences",
        str_detect(deg_code, "H6059") ~ "Technical sciences",
        str_detect(deg_code, "H6075") ~ "Food/Biotech/Lab Tech",
        str_detect(deg_code, "H6080") ~ "Agriculture Nature Environment",
        str_detect(deg_code, "H6090") ~ "Health science",
        TRUE ~ "All")) %>%
    mutate(deg_field = factor(deg_field,
        levels = c("All", "Agriculture Nature Environment", "Arts", "Education", "Food/Biotech/Lab Tech",
        "Health science", "Humanities", "Science", "Social Sciences", "Technical sciences"))) %>%
    rename(degs_n = indhold, sex = kon)

## dataframe to create faint highlight lines in chart
vlines_df <- data.frame(xintercept = seq(-100, 100, 20))

Putting this together I first built a chart at the discipline level, showing the percentage differences by gender on a diverging graph. It’s a bit of a remix of last year’s chart for the same prompt. The code for that is below as is the chart.

Show code for making the first chart
bacdegs %>%
    filter(deg_group == "Main") %>%
    select(deg_field, sex, degs_n) %>%
    group_by(deg_field) %>%
    mutate(deg_field_tot = sum(degs_n)) %>%
    ungroup() %>%
    group_by(deg_field, sex) %>%
    mutate(deg_sex_pct = round(degs_n /deg_field_tot, 3)) %>%
    mutate(deg_sex_pct = ifelse(sex == "Men", deg_sex_pct *-1, deg_sex_pct)) %>%
    mutate(deg_sex_pct2 = round(deg_sex_pct * 100, 1)) %>%
    ungroup() %>%
    mutate(deg_field = fct_reorder(deg_field, desc(deg_field))) %>%
    {. ->> tmp} %>%
    ggplot() +
    geom_col(aes(x = -50, y = deg_field), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = 50, y = deg_field), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = deg_sex_pct2, y = deg_field, fill = sex, color = sex), width = 0.75) +
    scale_x_continuous(expand = c(0, 0),
        labels = function(x) abs(x), breaks = seq(-100, 100, 20)) +
    geom_vline(data = vlines_df, aes(xintercept = xintercept), color = "#FFFFFF", size = 0.1, alpha = 0.5) +
    coord_cartesian(clip = "off") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_color_manual(values = c("white", "white")) +
    geom_text(data = subset(tmp, sex == "Men"),
        aes(x = deg_sex_pct2, y = deg_field, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = -0.09) +
    geom_text(data = subset(tmp, sex == "Women"),
        aes(x = deg_sex_pct2, y = deg_field, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = 1.1) +
    geom_text(aes(x = -100, y = deg_field, label = paste0("Total awarded = ", comma(deg_field_tot))),
        size = 4, color = "#3b3b3b", hjust = -.1) +
    labs(x = "", y = "",
        title = "<span style = 'color: #5D3A9B;'>Women</span> earn more Bachelor's degrees than
        <span style = 'color: #E66100;'>men</span> in all disciplines except Science and Technical sciences",
        subtitle = glue::glue("*Total degrees awarded = 19,199:
        <span style = 'color: #E66100;'>Men = 8,694 (45%)</span>,
        <span style = 'color: #5D3A9B;'>Women = 10,505 (55%)</span>*"),
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
        plot.title = element_markdown(size = 16),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(),
        legend.position = "none", legend.title = element_blank(),
        axis.text.y = element_text(size = 10))
rm(tmp)

diverging horizontal bar displaying bachelor degrees by sex 2023

The next step is to repeat this chart for each discipline, showing the top 10 majors by overall degrees awarded. Rather than copy-paste-amend like in prompt 5, this time I set about doing it via a plot function. As you can see from the code for the first chart though, I wanted to add descriptive and informational text in the title and subtitle. It was a bit complicated to do this functionally and I’m sure it took me much longer to sort out the data structures and function mapping, but it worked. I’ll walk through the steps here before displaying the charts. After the challenge I’ll do a walk-through as a stand-alone post.

First up is building dataframes for each discipline, each frame with consistent naming. I went about it this way to avoid errors in filtering when running the plot function across disciplines. I took a similar approach in my October 2024 post on attendance at football (soccer) matches around the world.

So first a function to build the dataframes.

Show code for making the data frames
## function for individual discipline tables
disc_top10 <- function(degfield, dfname = "NA") {
dfout <- bacdegs %>%
    filter(deg_group == "Sub") %>%
    filter(deg_field == degfield) %>%
    select(deg_field, deg_name, sex, degs_n) %>%
    mutate(degs_all_tot = sum(degs_n)) %>%
# creates fields to use in plot dynamic subtitle
    group_by(sex) %>%
    mutate(degs_all_sex_tot = sum(degs_n)) %>%
    mutate(degs_all_sex_pct = round(degs_all_sex_tot / degs_all_tot, 3) * 100) %>%
    mutate(degs_all_sex_pct = paste0(degs_all_sex_pct, "%")) %>%
    ungroup() %>%
    group_by(deg_name) %>%
    mutate(deg_name_tot = sum(degs_n)) %>%
    ungroup() %>%
    arrange(desc(deg_name_tot), deg_name) %>%
    mutate(rank=row_number()) %>%
    group_by(deg_name, sex) %>%
    mutate(deg_sex_pct = round(degs_n /deg_name_tot, 3)) %>%
    mutate(deg_sex_pct = ifelse(sex == "Men", deg_sex_pct *-1, deg_sex_pct)) %>%
    mutate(deg_sex_pct2 = round(deg_sex_pct * 100, 1)) %>%
    ungroup() %>%
    # more fields to use in plot dynamic subtitle 
    mutate(deg_all_men_tot = ifelse(sex == "Men", degs_all_sex_tot, NA)) %>%
    mutate(deg_all_women_tot = ifelse(sex == "Women", degs_all_sex_tot, NA)) %>%
    mutate(deg_all_men_pct = ifelse(sex == "Men", degs_all_sex_pct, NA)) %>%
    mutate(deg_all_women_pct = ifelse(sex == "Women", degs_all_sex_pct, NA)) %>%
    mutate(deg_field_men_pct = ifelse(sex == "Men", paste0(abs(deg_sex_pct2), "%"), NA)) %>%
    mutate(deg_field_women_pct = ifelse(sex == "Women", paste0(deg_sex_pct2, "%"), NA))  %>%
    group_by(deg_name) %>%
    fill(deg_all_men_tot) %>%
    fill(deg_all_women_tot, .direction = "up") %>%
    fill(deg_all_men_pct) %>%
    fill(deg_all_women_pct, .direction = "up") %>%
    fill(deg_field_men_pct) %>%
    fill(deg_field_women_pct, .direction = "up") %>%
    ungroup()
# uses the second input to name the dataframe and put it in the global environment
assign(str_c(dfname, "_pcts"), dfout, envir=.GlobalEnv)
}

To run the function once it’s disc_top10("Science", "Science"). The first input is the discipline name, the 2nd is for the dataframe name.

But how do we map it across the disciplines? Well, we need a list of disciplines, and we do that by creating a list of fields.

degfields <- unique(bacdegs$deg_field)

Now we map it over the function, passing each mapping component to the .x sign.

map(degfields, ~disc_top10(degfield = .x, dfname = .x))

Ok, let’s build the plot function.

Show code for the plot function
## plot function
majorplot <- function(plotdf, degfield, plotname = "NA") {

    ## for faint highlight lines in chart
    vlines_df <- data.frame(xintercept = seq(-100, 100, 20))

    #create plot
    plotout <- plotdf %>%
        filter(deg_field == degfield) %>%
        filter(rank < 21) %>%
        # to get top 10 for plot sorted in order of total degrees in major
        arrange(desc(deg_name_tot)) %>%
        mutate(deg_name = fct_reorder(deg_name, deg_name_tot)) %>%
        # not used, but this orders top 10 for plot in degree name order
    # mutate(deg_name = fct_reorder(deg_name, desc(deg_name))) %>%
        ggplot() +
    # makes the grey bars
    geom_col(aes(x = -50, y = deg_name), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = 50, y = deg_name), width = 0.75, fill = "#e0e0e0") +
        # makes the bars for each gender, fill colors in the scale_fill_manual below
    geom_col(aes(x = deg_sex_pct2, y = deg_name, fill = sex, color = sex), width = 0.75) +
    scale_x_continuous(expand = c(0, 0),
                                         labels = function(x) abs(x), breaks = seq(-100, 100, 20)) +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    # adds faint guide lines for the bars
    geom_vline(data = vlines_df, aes(xintercept = xintercept), color = "#FFFFFF", size = 0.1, alpha = 0.5) +
    coord_cartesian(clip = "off") +
    scale_color_manual(values = c("white", "white")) +
    # adds text labels for each major's percent of men & women
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21),
        aes(x = deg_sex_pct2, y = deg_name, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = -0.09) +
    geom_text(data = subset(plotdf, sex == "Women" & rank < 21),
        aes(x = deg_sex_pct2, y = deg_name, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = 1.1) +
    # inserts the total number of degrees for each discipline into the bars to the 
        # left or right of the percentage bars
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21 & deg_sex_pct > -0.85),
        aes(x = -100, y = deg_name, label = paste0("Total awarded = ", comma(deg_name_tot))),
        size = 3.5, color = "#3b3b3b", hjust = -.1) +
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21 & deg_sex_pct < -0.85),
        aes(x = 80, y = deg_name, label = paste0("Total awarded = ", comma(deg_name_tot))),
        size = 3.5, color = "#3b3b3b", hjust = .5) +
    labs(x = "", y = "",
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*",
        # uses html tags for color and fields from the input df to make the title
          # specific to the discipline
        subtitle = glue::glue("*Total degrees awarded in {degfield} = {comma(plotdf$degs_all_tot)}:
            <span style = 'color: #E66100;'>Men =
            {comma(plotdf$deg_all_men_tot)} ({plotdf$deg_all_men_pct})</span>,
            <span style = 'color: #5D3A9B;'>Women =
            {comma(plotdf$deg_all_women_tot)} ({plotdf$deg_all_women_pct})</span>.
            Only top 10 majors displayed.*")) +
    theme_minimal() +
    theme(panel.grid = element_blank(),
            plot.title = element_markdown(size = 16, hjust = 0),
            plot.subtitle = element_markdown(size = 12, hjust = 0),
            plot.caption = element_markdown(),
            legend.position = "none", legend.title = element_blank(),
            axis.text.y = element_text(size = 10))
    # to create individual plot names and place plots in the global environment
    assign(str_c(plotname, "_plot"), plotout, envir=.GlobalEnv)
    # outputs plots 
    return(plotout)
}

The function takes three inputs - the dataframe to run it on, the filter for the discipline in the plot, and the output lot name. The 2nd & 3rd input are the same. To map it over the nine disciplines we need the list of dataframes and the list of disciplines. The code below shows how to build those

# create list of data frames to mutate function over
df_list <- list(mget(ls(pattern = "local")))[[1]]
df_list = mget(ls(pattern = "_pcts"))

# for degree fields to loop over in function, need a character vector of just major names
# first a df of just the field names
bacdegs_maj <- bacdegs %>%
    filter(deg_group == "Sub") %>%
    mutate(deg_fieldc = as.character(deg_field)) %>%
    select(deg_fieldc)

# then output names to character vector, making sure to sort so that the filter matches to 
 # the 
degfieldsc <- (unique(bacdegs_maj$deg_fieldc)) %>%
    str_sort()

Now we’re ready to build the charts. Because of the way the input dataframes were built we’ll have title and subtitle text specific to each discipline. Here’s the mapping call:

map2(df_list, degfieldsc, ~majorplot(plotdf = .x, degfield = .y, plotname = .y))

That outputs nine plots that only need minor customization for the explanatory title. Everything else was taken care of in the functions. Because Education and the Food Science disciplines only have one major each, no additional charts for them. The gender difference is in the main chart.

I’ll display the seven annotated charts in the tabset below, showing the customization for the Humanities discipline. They are lightboxed, so click on the image to enlarge it, and click esc or outside the image to restore it to the page.

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

Humanities_plot +
    labs(title = "Humanities degrees mostly went to <span style = 'color: #5D3A9B;'>women</span>.
             By major, <span style = 'color: #E66100;'>men</span> only earned more degrees in History")

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

The overarching take-away is one of some noticeable difference or divergence by gender in many disciplines and majors. To be honest, I was surprised at the extent and size of some differences. I remember seeing it in some colleges and majors at UC Berkeley, but even in the so-called “male” majors women were becoming more and more evenly represented.

The next step for this subject would be to look at some of the more gendered majors and examine whether or not the differences have held steady over time, and if anything explains the change or stasis. Maybe after the challenge or for one of the challenge time-series prompts.

created and posted April 6, 2025

Prompt #13 - Clusters

For this “back to the future” post, a prompt coming between 9 (posted on April 6) and 14 (posted on April 18) but posted after April 18, let’s take a look at R&D expenses by source and research field in Denmark. The prompt is “clusters”, so I’m clustering research activity by the general field and the sector doing the work.

As always for this year’s challenge, the data is from Danmarks Statistik via the danstat package. This time we’re using table RDCE05, which is Expenses to R&D by sector and field of research and development. I’ll mention it more than once, but all expense figures are in millions of Danish kroner. So 1,500 DKK here means 1.5 billion. As of the date of posting, one US dollar gets you about 6.57 kroner. That’s about half a kroner lower than this time last year.

It’s also worth mentioning at least once that the unit of analysis is research expenses, not funding. Though you could argue that expenses are a proxy for funding, because the money has to come from somewhere. Expenses as reported here means money spent on any and all R&D activity and the people who do it. The funding sources are irrelevant.

I would love to dig in a bit deeper and connect funding sources, the sectors that get funding, and for which fields. But that means at the very least connecting a number of tables or having micro-data access.

I won’t spend any time explaining the research fields and sectors. You can get a detailed look at how Danmarks Statistik describes the data is here. The research fields follow the rubric laid out in the OECD’s Frascati Manual, with the fields themselves enumerated here.

Do keep in mind that science is interdisciplinary. So data science work, for example, might be in the Engineering and Tech field per the Frascati Manual. But it’s quite likely that data science work is happening in medical or social sciences. It’s hard to know how every activity is being classified without access to microdata and then being able to tease out interdisciplinary overlaps.

For the plot itself, I took inspiration for the scatterplot from Nicola Rennie’s cluster chart, modifying the colors and other aesthetics a bit. But the code structure starts with Nicola’s work, and I learned a bit about some good coding practice and tips. Which is the point of an exercise like this, right?

Let’s start by getting and cleaning the data.

Show code for getting and cleaning data
#package and process for getting & cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(gt) # for tables
library(ggiraph) # interactive plot with tooltip
library(ggtext) # add markdown to plot labels
library(scales) # better number formatting

source("~/Data/r/basic functions.R")

# RDCE05
table_meta <- danstat::get_table_metadata(table_id = "rdce05", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "rdce05", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "sektor", values = "*"),
    list(code = "videnhoved", values = "*"),
    list(code = "tid", values = c(2017, 2023)))

r_and_d1 <- get_data("rdce05", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

r_and_d_main <- r_and_d1 %>%
    mutate(sektor = str_remove(sektor, " sector")) %>%
    mutate(sektor = case_when(
        sektor == "Total source of funding" ~ "Total funding",
        sektor == "Business enterprise" ~ "Business",
        TRUE ~ sektor)) %>%
    mutate(videnhoved = case_when(
        videnhoved == "Agricultural and veterinary sciences" ~ "Agri & veterinary sci",
        videnhoved == "Engineering and technology" ~ "Eng & tech",
        videnhoved == "Humanities and the arts" ~ "Humanities & arts",
        videnhoved == "Medical and health sciences" ~ "Medical & health sci",
        TRUE ~ videnhoved)) %>%
    select(sector = sektor, field = videnhoved, year = tid, dkk_million = indhold)

Before the main plot, let’s get a lay of the land for R&D expenses. Note the pctchange() function. It’s one I have among a bunch of basic functions I created and load into every project using source("~/Data/r/basic functions.R") just after loading packages. The function itself is simple, just (x - lag(x)) / lag(x). But having it as a function saves me from having to redo the formula every time I use it.

The first table shows expenses for R&D by field of research in Denmark in 2017 & 2023. Below that we see expenses by sector. All values in millions of Danish kroner (DKK).

Show code for making the 1st table

# arrange the data
r_and_d_main %>%
    filter(sector == "Total funding") %>%
    group_by(field) %>%
    mutate(pct_chg_funding = pctchange(dkk_million)) %>%
    ungroup() %>%
    pivot_wider(names_from = year, values_from = c(dkk_million, pct_chg_funding)) %>%
    mutate(pct_chg_2023 = round(pct_chg_funding_2023, 3)) %>%
    select(field, dkk_million_2017, dkk_million_2023, pct_chg_2023) %>%
    arrange(field) %>%
    # build the table
    gt() %>%
    cols_label(
        field = "Research area",
        dkk_million_2017 = "2017",
        dkk_million_2023 = "2023",
        pct_chg_2023 = "% change 2017-2023") %>%
    tab_style(
        style = cell_text(align = "center"),
        locations = cells_column_labels(
            columns = c(dkk_million_2017, dkk_million_2023, pct_chg_2023))) %>%
    fmt_number(columns = c(dkk_million_2017, dkk_million_2023),
        sep_mark = ",", decimals = 0) %>%
    fmt_percent(columns = pct_chg_2023, decimals = 1) %>%
    opt_align_table_header(align = "left") %>%
    opt_stylize(style = 5)
Research area 2017 2023 % change 2017-2023
Agri & veterinary sci 1,747 2,968 69.9%
Eng & tech 25,651 27,440 7.0%
Humanities & arts 2,039 1,985 −2.6%
Medical & health sci 20,416 34,213 67.6%
Natural sciences 11,033 14,473 31.2%
Social sciences 3,397 4,561 34.3%
Sector 2017 2023 % change 2017 to 2023
Business 40,757 53,338 30.9%
Government 1,973 2,157 9.3%
Higher education 21,355 29,821 39.6%
Private non-profit 198 325 64.1%

Agriculture/Veterinary & Medical/Health sciences had biggest increases in R&D expenses from 2017-2023. Engineering/Tech & Medical had highest R&D expenses in 2017, but was overtaken by Medical & Health Sciences in 2023. Again, however, given how prevalent data science technologies are in medical sciences and other fields, it’s entirely possible that there’s lots of data science going on in the medical fields that might have been classified as Engineering & Information Technology in prior years.

The Private non-profit sector had biggest increase in R&D expenses 2017-2023, though Business sector by far has the highest R&D expenses.

My initial reaction to this was surprise, but that’s due to my skewed perspective of R&D in the US; I am so steeped in the story of federally-sponsored research and how it grew out of WWII research funding. Billions of dollars in basic and applied research is indeed funded by the federal government (and some from foundations) and carried out via contracts with universities. I assumed the federal government provided far more than the business sector. But surprise, per this Brookings Institute report since the mid-1980s the business sector has been funding more R&D in the US than the federal government.

The story here in Denmark is not that different.

So let’s look at percent change by share of expenses in sector and the fields of research in which the sectors spend funds.

First we’ll do some data prep to get a dataframe to use for plotting.

Show code for making the scatterplot data
## first make colors for geom_points
bg_col <- "grey97"
text_col <- "black"
highlight_col <- "#0072B2"
highlight_col2 <- "#CC79A7"

# create the dataframe
r_and_d_plotdf <-
r_and_d_main %>%
    filter(!sector == "Total funding") %>%
    group_by(sector, field) %>%
    mutate(pct_chg_funding = pctchange(dkk_million)) %>%
    ungroup() %>%
    pivot_wider(names_from = year, values_from = c(dkk_million, pct_chg_funding)) %>%
    select(-pct_chg_funding_2017) %>%
    mutate(dkk_million_2023_c = formatC(dkk_million_2023, format = "d", big.mark = ",")) %>%
    mutate(dkk_million_2023_c = paste0(dkk_million_2023_c, " mill DKK")) %>%
    group_by(sector) %>%
    mutate(sector_funding  = sum(dkk_million_2023)) %>%
    mutate(sector_field_funding_pct  = dkk_million_2023 / sum(sector_funding)) %>%
    ungroup() %>%
    mutate(sector_funding_pct  = sector_funding / sum(dkk_million_2023)) %>%
    mutate(sector_short = case_when(
        sector == "Government" ~ "Gov't",
        sector == "Higher education" ~ "Higher ed",
        sector == "Private non-profit" ~ "NGO",
        TRUE ~ sector)) %>%
    mutate(sector_field = paste0(sector_short, "/", field)) %>%
    filter(!is.na(pct_chg_funding_2023)) %>%
    filter(dkk_million_2017 > 0) %>%
    mutate(col = case_when(
        pct_chg_funding_2023 > 0 ~ highlight_col2,
        pct_chg_funding_2023 == 0 ~ "grey60",
        pct_chg_funding_2023 < 0 ~ highlight_col))

Now we build the plot. I wanted to add hover-over tool-tips so I’m using the ggiraph package. The only real coding difference from a static scatterplot is using the geom_point_interactive function. After we create the plot as an object, we call back to it using the ggiraph::girafe function to render it.

When you hover over the dots you’ll see the sector/field combination and the amount of expenses they had in 2023, in millions of Danish kroner (DKK). Also note that hovering over say “Gov’t / Eng & tech” not only brings up the tool-tip with 2023 expenses, it also highlights all Government dots, so you can put the Engineering & Tech expenses in contrast with and comparison to other government R&D expenses. Note again, that expenses is essentially a proxy for funding, regardless of funding source.

Above median in expense share, expenses increased from 2017-2023 Below median in expense share, expenses decreased from 2017-2023 -100% 0% 100% 200% 0% 5% 10% Sector/field % of all expenses in 2023 Sector/field % change 2017-2023 Expenses include all R&D activity & personnel, regardless of funding source. Hover over for sector/field & 2023 expenses in millions of DKK. Which combination of sector and field of study spent the most on R&D? Data from Danmarks Statistik table STATUSV2 via danstat package

If you start with some of the outliers along the x axis (sector/field combo percentage of all expenses) you notice that most of the outliers are in the medical sciences field, across sectors. Which makes sense from the tables that showed us how much funding medical sciences gets overall relative to other fields. So voila, a cluster!

As you hover over the dots, what other patterns or clusters do you see?

created and posted April 21, 2025

Prompt #14 - Kinship

Full disclosure…depending on when you read this, you may or may not see any posts between this one and prompt 9. As you can see by the “created and posted” date, if/when I’ve backfilled some prompts between 9 and here, this was the next post I completed after 9. Life gets in the way sometimes.

Anyway, I knew when the prompts came out that for kinship that I’d want to make a sankey diagram to look at educational pathways depending on parent education. A sankey is a form of network graph, and I wanted to see the connection, or network effect, of parent education on their children’s educational attainment.

StatBank’s education section does have some tables for parent -> child education, but they aren’t as disaggregated as I’d like. So to see educational outcomes by parent educational attainment you have to choose whether to look at secondary outcomes or higher ed outcomes. We’re going here with higher ed, so it’s table STATUSV2: 25-45 year-olds, by status for higher education, age and parents attainment.

Slight aside…there is micro-data that can answer the question as detailed as I would like. The EduQuant project, located in the Economic Department at Københavns Universitet and funded by the Danish Education Ministry, has access to that data and has produced some work on social mobility and education.

There are a few ways to do sankeys in r. After some trial-and-error, I settled on using the networkD3 package. A big reason being that the data comes aggregated, and some other packages like ggsankey and ggalluvial weren’t handling the data as well as networkD3.

To get the final result I was helped by this post from Louise Sinks and this video from the UK’s NHS R community.

Let’s get started…first loading some packages and fetching and cleaning the data. The data cleaning is mainly to make the parent education levels easier to read, and to create age groups. I didn’t use the age groups for this chart, but I think I will for the “complicated” prompt.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(infer) # tidy statistical inference
library(scales)
library(networkD3) # for Sankey plots
library(htmlwidgets) # html widgets helps to handle the networkD3 objects
library(htmltools) # for formatting html code
library(gt)


# some custom functions
source("~/Data/r/basic functions.R")

table_meta <- danstat::get_table_metadata(table_id = "statusv2", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "statusv2", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "statusvid", values = "*"),
    list(code = "alder1", values = "*"),
    list(code = "forudd1", values = "*"),
    list(code = "tid", values = 2023))

# fetch the data
ed_attain1 <- get_data("statusv2", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# create a main data frame with cleaned up educational levels and renaming the variables
ed_attain_main <- ed_attain1 %>%
    # let's make the Total values easier to read
    mutate(statusvid =
            ifelse(statusvid == "Educational graduation statement, total", "Total", statusvid)) %>%
    mutate(alder1 =
            ifelse(alder1 == "Age, total", "Total", alder1)) %>%
    # let's make the parent ed values a bit easier to read and understand
    mutate(forudd1 =
            case_when(forudd1 == "H10 Primary education" ~ "Primary",
                forudd1 == "H20 Upper secondary education" ~
                    "HS-Academic",
                forudd1 == "H30 Vocational Education and Training (VET)" ~
                    "HS-Vocational",
                forudd1 == "H40 Short cycle higher education" ~
                    "Short-cycle college",
                forudd1 == "H50 Vocational bachelors educations" ~
                    "Bachelor-Vocational",
                forudd1 == "H60 Bachelors programs" ~
                    "Bachelor-Academic",
                forudd1 == "H70 Masters programs" ~ "Masters",
                forudd1 == "H80 PhD programs" ~ "PhD",
                forudd1 == "H90 Not stated" ~ "Not stated",
                TRUE ~ forudd1)) %>%
    mutate(statusvid = ifelse(
        statusvid == "No registrated education", "No registered education", statusvid)) %>%
    # create age groups in case I want to disaggregate that way
    mutate(child_age_group =
            case_when(
                alder1 %in% c("25 years", "26 years", "27 years", "28 years", "29 years") ~ "25-29",
                alder1 %in% c("30 years", "31 years", "32 years", "33 years", "34 years") ~ "30-34",
                alder1 %in% c("35 years", "36 years", "37 years", "38 years", "39 years") ~ "35-39",
                alder1 %in% c("40 years", "41 years", "42 years",
                    "43 years", "44 years", "45 years") ~ "40-45",
                TRUE ~ alder1)) %>%
    select(parent_ed = forudd1, child_age = alder1, child_age_group, child_ed = statusvid, N = indhold)

Before we get to the sankey diagram, first a quick table so we can see the breakdown of parent educational attainment and children’s higher education status, independent of each other.

Show code for making the descriptive tables
ed_attain_main %>%
    filter(!parent_ed == "Total") %>%
    filter(child_age == "Total") %>%
    filter(child_ed == "Total") %>%
    select(parent_ed, N) %>%
    mutate(pct = N/ sum(N)) %>%
    gt() %>%
    cols_label(parent_ed = "Parent education level") %>%
    fmt_number(columns = N, sep_mark = ",", decimals = 0) %>%
    fmt_percent(columns = pct, decimals = 1) %>%
    opt_stylize(style = 5)
Parent education level N pct
Primary 177,985 11.3%
HS-Academic 32,585 2.1%
HS-Vocational 562,666 35.6%
Qualifying educ paths 587 0.0%
Short-cycle coll 75,482 4.8%
Bach-Vocational 282,343 17.9%
Bach-Academic 12,126 0.8%
Masters 124,083 7.9%
PhD 10,408 0.7%
Not stated 302,241 19.1%
Higher ed status N pct
Completed education 826,652 52.3%
Undergoing education 51,141 3.2%
Discontinued education 78,037 4.9%
No registered education 624,676 39.5%

Of the 1,580,506 people in this data sample, more than half have completed some form of higher education. About half of the parents of the entire pool completed a vocational high school or college degree. There is no education status for almost 20% of parents.

To make the plot you need to data frames, one for the nodes and one with the data. The nodes frame contains all the unique values of parent and child educational attainment, or for this chart, the nodes. The sankey_plot_id frame has the frequencies and parent-to-child paths we need for the sankey band widths. Make sure to render them each as a data.frame, as that’s the format needed for the sankey functions to work.

Show code for making the sankey plot dfs
# make the nodes data frame
nodes_df <- as.data.frame(ed_attain_allage %>%
    select(parent_ed, child_ed) %>%
    pivot_longer(c("parent_ed", "child_ed"),
    names_to = "col_name", values_to = "name_match") %>%
    # this removes dupes keeps only name match
    # may need to clean names for labels to look nice
    select("name_match") %>%
    distinct())

# create ID df for order of nodes from ed_path_freqs & nodes; -1 ensures 0 index
sankey_plot_id <- as.data.frame(ed_attain_allage %>%
  mutate(IDIn = match(parent_ed, nodes_df$name_match) -1,
             IDOut = match(child_ed, nodes_df$name_match)-1))

Now we make the plot. We call the links from the df with the frequencies and the nodes fro the nodes df. Font size and family can be changed, as can the node widths and other aesthetic features. The sinksRight call places the data labels on the right side node, the end stage, outside the graph. There is no way to put the left-side labels outside the graph. Changing the size accommodates a title and caption added using htmlwidgets. Louise Sinks explained in her post that leaving the plot in native size from the sankeyNetwork call could truncate parts of the plot if you add a title.

One important thing it took some searching to solve was the order of the nodes. You can see in the data cleaning that I set them up as factors with the levels I wanted them ordered. It worked in the gt tables but initially not in the sankey. It turns out you need to add iterations = 0 to keep the auto-positioning from running.

If you hover over the node bands of the chart, you can see the total number for that path.

Show code for making the sankey plot
# create the plot as an object
edattain_sankey <-
sankeyNetwork(
    Links = sankey_plot_id, Nodes = nodes_df, 
    Source = "IDIn", Target = "IDOut",
    # "Freq" from freq df, "name" from nodes_df 
    Value = "N", NodeID = "name_match", 
    sinksRight = FALSE, iterations = 0, 
    width = 600,  height = 400,
    fontSize = 16, fontFamily = "Arial", nodeWidth = 20)

## add title and caption to plot object
edattain_sankey <-
  htmlwidgets::prependContent(
    edattain_sankey, htmltools::tags$h4("Higher education pathways for people depending on parent's educational attainment"))
edattain_sankey <-
  htmlwidgets::appendContent(
    edattain_sankey, htmltools::tags$em("Data from Danmarks Statistik table STATUSV2 via danstat package"))

edattain_sankey

Higher education pathways for people depending on parent's educational attainment

HS-Vocational → No registered education
274,390 
HS-Vocational → Completed education
237,502 
Not stated → Completed education
186,166 
Bachelor-Vocational → Completed education
184,619 
Primary → No registered education
117,370 
Not stated → No registered education
109,108 
Masters → Completed education
96,503 
Bachelor-Vocational → No registered education
67,829 
Primary → Completed education
46,572 
Short-cycle college → Completed education
42,090 
HS-Vocational → Discontinued education
32,888 
Short-cycle college → No registered education
25,226 
HS-Vocational → Undergoing education
17,886 
Bachelor-Vocational → Discontinued education
16,824 
HS-Academic → Completed education
16,158 
Masters → No registered education
14,975 
Bachelor-Vocational → Undergoing education
13,071 
HS-Academic → No registered education
12,436 
Primary → Discontinued education
9,516 
PhD → Completed education
8,509 
Bachelor-Academic → Completed education
8,238 
Masters → Discontinued education
6,472 
Masters → Undergoing education
6,133 
Short-cycle college → Discontinued education
4,887 
Primary → Undergoing education
4,527 
Not stated → Discontinued education
3,719 
Short-cycle college → Undergoing education
3,279 
Not stated → Undergoing education
3,248 
HS-Academic → Discontinued education
2,500 
Bachelor-Academic → No registered education
2,331 
HS-Academic → Undergoing education
1,491 
PhD → No registered education
798 
Bachelor-Academic → Undergoing education
785 
Bachelor-Academic → Discontinued education
772 
PhD → Undergoing education
688 
PhD → Discontinued education
413 
Primary
177,985
Primary
Completed education
826,357
Completed education
HS-Academic
32,585
HS-Academic
HS-Vocational
562,666
HS-Vocational
Short-cycle college
75,482
Short-cycle college
Bachelor-Vocational
282,343
Bachelor-Vocational
Bachelor-Academic
12,126
Bachelor-Academic
Masters
124,083
Masters
PhD
10,408
PhD
Not stated
302,241
Not stated
Undergoing education
51,108
Undergoing education
Discontinued education
77,991
Discontinued education
No registered education
624,463
No registered education
Data from Danmarks Statistik table STATUSV2 via danstat package

So what do we see? Well for parents who stopped out at the primary level, most of their kids are not recorded as having registered for higher education. Same for parents who stopped out at a vocational secondary diploma. We also see a lot of parents without any educational attainment recorded. But most of their kids have either completed higher education or started but then discontinued.

Some of this could be an artifact of the Danish personal registry system not having complete parental education records for the upper end of the 25-45 age pool here, or for parents of immigrant or 1st generation born in Denmark. Some of it could be a product of the older students here being less likely to complete higher education, something we’ve seen in earlier posts.

I’ll break down by age groups in the “complicated” prompt. Sadly we don’t have a table that shows both parent education and parent or child national origin status. I’d love to have access to micro-data to answer these more detailed questions; that would mean working for StatBank, the Education Ministry, or the EduQuant project (if it gets new funding). A man can dream…

created and posted April 18, 2025

Prompt #15 - Complicated

I took complicated to mean trying to visualize multiple dimensions. As I mentioned in prompt 14 I wanted to look at higher education completion by parent educational attainment, but broken down a bit more by age groups.

The chart for this prompt will be a waffle, which I did last year for prompts 3 & 4. We’ll use the waffle package which has a built-in geom to make waffles. Well, not make actual waffles, but a waffle plot. Sorry if this is making you hungry for waffles.

We’ll use the same data as prompt 14, so no need to redo the code to get it. We can get straight to making the charts. There was a bit of complication in making the charts, as I’ll explain.

We do need to create a dataframe to make the plot, and that’s where the first complication came in. Because of rounding, some plots went to 101, some stopped at 99. So I needed to do a bit of cleaning to make each plot even out at 100. I suppose I could have done a more nuanced rounding with an ifelse but I likely still would have had to examine the plot and check the data.

Show code for making the waffle plot data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(waffle) # to make the waffle plots
library(patchwork) # stitching together the final plots

# using the ed_attain_main df from prompt 14
ed_attain_waffle <- ed_attain_main %>%
    filter(!parent_ed == "Total") %>%
    filter(!parent_ed == "Qualifying educ paths") %>%
    filter(!child_ed == "Total") %>%
    filter(!child_age_group == "Total") %>%
    group_by(child_age_group, parent_ed) %>%
    mutate(age_grp_par_ed_sum = sum(N)) %>%
    ungroup() %>%
    group_by(child_age_group, parent_ed, child_ed) %>%
    mutate(age_par_ed_child_ed_sum = sum(N)) %>%
    ungroup() %>%
    mutate(age_par_ed_child_ed_pct = age_par_ed_child_ed_sum / age_grp_par_ed_sum) %>%
    mutate(age_par_ed_child_ed_pct2 = round(age_par_ed_child_ed_pct, 2) * 100) %>%
    distinct(parent_ed, child_age_group, child_ed, .keep_all = TRUE) %>%
    select(-child_age, -N) %>%
    mutate(parent_ed = factor(parent_ed,
        levels = c("Primary", "HS-Vocational", "HS-Academic", "Short-cycle coll",
            "Bach-Vocational", "Bach-Academic", "Masters", "PhD", "Not stated"))) %>%
    mutate(child_ed = factor(child_ed,
        levels = c("Completed education", "Undergoing education", "Discontinued education",
            "No registered education"))) %>%
    # clean up pct vals to make charts even out
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Vocational" & child_ed == "Completed education" & child_age_group == "25-29",
                40, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "No registered education"
                & child_age_group == "25-29", 31, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Masters" & child_ed == "Completed education" & child_age_group == "25-29",
        69, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "Discontinued education"
                & child_age_group == "30-34", 9, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "Completed education"
                & child_age_group == "30-34", 60, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "PhD" & child_ed == "Discontinued education"
                & child_age_group == "30-34", 4, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Primary" & child_ed == "No registered education"
                & child_age_group == "35-39", 68, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "No registered education"
                & child_age_group == "35-39", 39, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "Discontinued education"
                & child_age_group == "35-39", 7, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Bach-Vocational" & child_ed == "Discontinued education"
                & child_age_group == "35-39", 5, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Primary" & child_ed == "No registered education"
                & child_age_group == "40-45", 67, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "No registered education"
                & child_age_group == "40-45", 39, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Bach-Vocational" & child_ed == "Undergoing education"
                & child_age_group == "40-45", 1, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Not stated" & child_ed == "Undergoing education"
                & child_age_group == "40-45", 1, age_par_ed_child_ed_pct2))

If we are making more than one plot with the same variables, do we copy-paste-amend the code?

Not if we can help it.

Let’s make the basic plot into a function to limit the amount of overall code and chance for error. What we’ll do here is add a filter expression to the function call.

Show code for making the plot function
waffleplot <- function(plotdf, filter_expr) {
    
    # Convert the string expression to an actual R expression
    filter_expr <- rlang::parse_expr(filter_expr)
    # Filter the data
    filtered_df <- plotdf %>% filter(!!filter_expr)
    
    # Create the plot
    ggplot(filtered_df, (aes(fill = child_ed, values = age_par_ed_child_ed_pct2))) +
        geom_waffle(na.rm=TRUE, n_rows=10, flip=TRUE, size = 0.33, colour = "white") +
        facet_wrap(~parent_ed, nrow=1,strip.position = "bottom", scales = "free_x") +
        scale_x_discrete(labels = ) +
        scale_y_continuous(labels = function(x) x * 10, # make this multipler the same as n_rows
            expand = c(0,0)) +
        scale_fill_brewer(palette = "Set2") +
        theme_minimal() +
        theme(legend.position = "bottom", legend.justification = "left",
            legend.spacing.x = unit(0, 'cm'),
            legend.key.width = unit(1, 'cm'), legend.margin=margin(-10, 0, 0, 0),
            legend.title = element_text(size = 8), legend.text = element_text(size = 8),
            panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
        guides(fill = guide_legend(label.position = "bottom",
            title = "Higher education completion status", title.position = "top"))
}

Now that we have the function, it’s easy to call. In this case we pass the data frame and age group filter. I suppose I could have added the dataframe name in the function, but since I may want to reuse the function I kept some things as generic as possible.

Calling the function is simple: waffleplot(ed_attain_waffle, "child_age_group == '25-29'"). As you can see in the code below, we’ll change the age group for each call, creating a plot object each time that we can edit before stitching together for the final plot.

Show code for making the plots
# make each individual plot
plot_2529 <- waffleplot(ed_attain_waffle, "child_age_group == '25-29'")
plot_3034 <- waffleplot(ed_attain_waffle, "child_age_group == '30-34'")
plot_3539 <- waffleplot(ed_attain_waffle, "child_age_group == '35-39'")
plot_4045 <- waffleplot(ed_attain_waffle, "child_age_group == '40-45'")

### add subtitles to the plots with age-group displayed
plot_2529 <-
plot_2529 +
    labs(subtitle = "Age group 25-29")

plot_3034 <-
    plot_3034 +
    labs(subtitle = "Age group 30-34")

plot_3539 <-
plot_3539 +
    labs(subtitle = "Age group 35-39")

plot_4045 <-
plot_4045 +
    labs(subtitle = "Age group 40-45")

# usepatchwork to stitch the plots together and add general annotations.
plot_2529 + plot_3034 +  plot_3539 + plot_4045 +
    plot_annotation(
        title = "In Denmark, regardless of age, the likelihood of completing higher education increases as level of parent education increases.",
        subtitle = "Higher education completion status for people ages 25-45, by age group and parent educational attainment, 2023. Each block = 1 %",
        caption = "Data from Danmarks Statistik table STATUSV2 via danstat package")

waffle plot displaying higher ed completion by age group and parent education

One complication I could not adequately solve is the size of the x-axis labels. There’s something in the facet_wrap call to put together the parent ed levels that does not allow me to change the size the way I could in a stand-alone plot. Not sure why, so we’ll have to live with the truncated axis labels. Otherwise I’m happy with the output.

Also, for the legend, there was no way I know of to include it in only one plot that didn’t change the proportions of that plot relative to the others, throwing off the symmetry in the stitched-together plots. So the legend is in every plot. I don’t think it’s possible to include it as an annotation or extra object in the stitched plot, but maybe there’s a hack I don’t know of in patchwork.

A few things jump out here.

  • First, the pattern is the same regardless of age. If your parents have higher levels of education, you are more likely to complete higher education. I know from my career doing higher ed policy analysis in the US, that the pattern is the same there.
  • Not surprisingly the younger age group, 25-29, includes more people who are still in the proces of finishing their college education (the orange blocks).
  • The discontinued rates (blue blocks) seemed similar-ish across age-groups, so the likelihood of stopping out does not look to be affected by parental education level.

Of course if this were a more in-depth report there’d be some descriptive tables to set the stage. And again, if I had access to the micro-data I’d be running regression models to explain factors leading to higher completion rates.

created and posted April 20, 2025

Prompt #19 - Smooth

For this very-out-of-prompt-order post (being posted after the prompt that should come after this one) I’m revisiting prompt 9, “diverging”, to look at academic Bachelor degrees awarded by sex (as a percentage of the whole) but adding a time-series element to it as the theme for the week is time series. The “smooth” aspect will be either normalization or smoothing lines.

As in prompt 9 we’re using the UDDAKT60 table from Danmarks Statistik StatBank via the danstat package. The fetching & cleaning is slightly different; this time I’m fetching all available years, 2005 to 2024.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(patchwork) # to join plots
library(ggtext) # enhancements for text in ggplot
library(ggrepel)
library(scales)


# some custom functions
source("~/Data/r/basic functions.R")

# make sure numbers don't present in scientific notation
options(scipen=999)

# UDDAKT60
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = "*"),
    #list(code = "alder", values = c("TOT")),
    list(code = "tid", values = "*"))

bacdegs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

## the main cleaned dataset
bacdegs_main <- bacdegs1 %>%
    mutate(kon = ifelse(kon == "Sex, total", "Total", kon)) %>%
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_group = case_when(
        deg_code %in% c("H6020", "H6025", "H6030", "H6035", "H6039",
            "H6059", "H6075", "H6080", "H6090") ~ "Main",
        deg_code == "H60" ~ "All",
        TRUE ~ "Sub")) %>%
    mutate(deg_field =  case_when(
        str_detect(deg_code, "H6020") ~ "Education",
        str_detect(deg_code, "H6025") ~ "Humanities",
        str_detect(deg_code, "H6030") ~ "Arts",
        str_detect(deg_code, "H6035") ~ "Science",
        str_detect(deg_code, "H6039") ~ "Social Sciences",
        str_detect(deg_code, "H6059") ~ "Technical sciences",
        str_detect(deg_code, "H6075") ~ "Food/Biotech/Lab Tech",
        str_detect(deg_code, "H6080") ~ "Agriculture/Nature/Environment",
        str_detect(deg_code, "H6090") ~ "Health science",
        TRUE ~ "All")) %>%
    select(deg_code:deg_field, year = tid, sex = kon, degs_n = indhold)

We’ll start with two charts; first, the number of degrees earned each year, along with a 3-year moving average smoothing line. Then one showing the year-to-year percent change along with a line showing the average percent change. The 3-year average is computed using the rollmean function in the zoo package.

The percent changes are computed using my custom pctchange function loaded with the call source("~/Data/r/basic functions.R") when I load packages at the top of a script.

We can make one dataframe to do both charts, so I’ll just create it as it’s own object.

I had planned on doing a chart normalizing the degrees awarded to 100 starting in 2005. But it didn’t really show that much different than the total degrees chart. Though I decided against using the chart, I kept the process for creating the variable in the script for future use.

Show code for making the next df
# create percent change and norm to 100 varaibles
bacdegs_all <-
bacdegs_main %>%
    filter(deg_name == "Bachelors programs") %>%
    filter(sex == "Total") %>%
    select(year, degs_n) %>%
    # creates the 3-year rolling average
    mutate(degs_3yr = zoo::rollmean(degs_n, k = 3, fill = NA)) %>%
    # percent change variables
    mutate(pct_chg_degs = pctchange(degs_n)) %>%
    mutate(pct_chg_degs2 = ifelse(year > 2005,
        round(pct_chg_degs * 100, 3), pct_chg_degs)) %>%
    mutate(pct_chg_degs2 = ifelse(is.na(pct_chg_degs2), 0, pct_chg_degs2)) %>%
    # norming to 100 variables
    mutate(cum_change = cumsum(pct_chg_degs2)) %>%
    mutate(norm100 = cum_change + 100)

We have the dataframe, let’s make the charts. Notice for both charts I used some html styling in the subtitle to use them as legends. For the percent-change chart I’m using not only html styling in the caption but including a dynamic call to the dataframe to include the value of the mean percent change in the caption and the geom_hline. We’re using patchwork to join the plots into one image.

Show code for making the 1st 2 charts
# line plot with all degs & 3 year average
alldegs_plot <- 
bacdegs_all %>%
    ggplot(aes(x = year)) +
    geom_line(aes(y = degs_n), color = "#80B1D3", size = 1 ) +
    geom_line(aes(y = degs_3yr), color = "#FDB462", linetype = 2, size = 1) +
    scale_x_continuous(breaks = c(2005, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022, 2024))+
    scale_y_continuous(limits = c(8000, 21000),
        labels = label_comma()) +
    theme_minimal() +
    labs(x = "", y = "",
        title = "The number of academic bachelor's degrees awarded increased by 77% from 2005-2024",
        subtitle = "<span style = 'color: #80B1D3;'> Blue line is total degrees awarded.</span>
        <span style = 'color: #FDB462;'> Orange dotted line is 3-year rolling average.</span> ",
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*") +
    theme(
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        plot.title = element_markdown(size = 14, color = "grey35"),
        plot.subtitle = element_markdown(size = 11),
        plot.caption = element_markdown(size = 7),
        axis.text.x = element_text(size = 7, color = "grey50"),
        axis.text.y = element_text(size = 7, color = "grey50"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# percent changes
pctchange_plot <-
bacdegs_all %>%
    ggplot(aes(x = year, y = pct_chg_degs)) +
    geom_line(color = "#80B1D3", size = 1) +
    geom_hline(aes(yintercept = mean(pct_chg_degs, na.rm = TRUE)),
        color="#FDB462") +
    scale_x_continuous(
        breaks = c(2005, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022, 2024)) +
    scale_y_continuous(
        limits = c(-.2, .2), 
        breaks = c(-.2, -.15, -.10, -.05, 0, .05, .10, .15, .2),
        labels = label_percent()) +
    theme_minimal() +
    labs(x = "", y = "",
        title = "The percent change in number of academic bachelor's degrees awarded has fluctuated year-to-year 
        from 2005 to 2024",
        subtitle = glue::glue("<span style = 'color: #80B1D3;'> Blue line is annual % change in 
        degrees awarded starting in 2005.</span> 
        <span style = 'color: #FDB462;'> Average year-to-year change is 
        {round(mean(bacdegs_all$pct_chg_degs, na.rm = TRUE) *100, 2)}% </span> "),
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*") +
    theme(
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        plot.title = element_markdown(size = 14, color = "grey35"),
        plot.subtitle = element_markdown(size = 11),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())

alldegs_plot + pctchange_plot

line charts displaying degrees over time and percent change in degrees year to year

The number of Bachelor’s degrees awarded increased by 77% during the period covered here, albeit with some larger fluctuations in a few years. The 3-year average smoothing line doesn’t deviate too far from the general trend line. The percent-change chart shows the fluctuations year to year, though the changes are perhaps magnified a bit by the y-axis scale. With wider limits that line would be more flat.

For the next chart I’ll present degrees by sex in each major discipline. The code for that one is below. We’ll modify the main dataframe and pass right into the ggplot() calls, no need to create a separate object.

Keep an eye out for what I noticed while doing these charts the first time…look at the trend for “Technical Sciences. We’ll talk about that next.

Show code for making the 1st deg by sex chart
bacdegs_main %>%
    filter(!deg_name == "Bachelors programs") %>%
    filter(deg_group == "Main") %>%
    filter(!sex == "Total") %>%
    select(year, deg_field, sex, degs_n) %>%
    group_by(year, deg_field) %>%
    mutate(pct_sex_field = degs_n / sum(degs_n)) %>%
    ungroup() %>%
    ggplot(aes(x = year, y = pct_sex_field, color = sex)) +
    geom_point() +
    geom_smooth() +
    # for sex men (orange) & women (purple) colors
    scale_color_manual(values = c("#E66100", "#5D3A9B")) +
    scale_x_continuous(breaks =
            c(2005, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022, 2024)) +
    scale_y_continuous(labels = label_percent()) +
    labs(x = "", y = "") +
    facet_wrap(~ deg_field, scales = "free_x") +
    theme(
        legend.position = "none",
        panel.background = element_rect(fill = "white", colour = "grey95"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.title = element_markdown(size = 14, color = "grey35"),
        plot.subtitle = element_markdown(size = 11),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"))

dot charts with smoothing lines displaying percent in degrees awarded by sex and academic discipline from 2005-2024

Because the degree percentages by sex don’t really change too much year-to-year the smoothing lines don’t really show much. Where there is fluctuation and thus a smoothing line that runs between more spread out points, it’s mostly in disciplines with lower numbers of degrees awarded, so prone highlighting the the more noticeable fluctuations.

While making them I noticed a weird trend in the “Technical Sciences” discipline. It starts at near zero degrees awarded, and the first year the discipline has more degrees, the gendered nature appears. In prompt 9 I noticed the gender divide in lots of disciplines for year 2023, so one of the things I had wanted to examine at some point is the trend in gender splits over time.

So let’s dig a bit deeper into the Technical Sciences discipline and look at total degrees awarded by sex.

Show code for making the 2nd deg by sex chart
bacdegs_main %>%
    filter(!deg_name == "Bachelors programs") %>%
    filter(deg_group == "Sub") %>%
    filter(!sex == "Total") %>%
    filter(deg_field == "Technical sciences") %>%
    select(year, deg_name, sex, degs_n) %>%
    ggplot(aes(x = year, y = degs_n, color = sex)) +
    geom_point() +
    geom_smooth() +
    labs(x = "", y = "") +
    scale_color_manual(values = c("#E66100", "#5D3A9B")) +
    scale_x_continuous(breaks =
            c(2005, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022, 2024)) +
    facet_wrap(~ deg_name, scales = "free") +
    theme(
        legend.position = "none",
        panel.background = element_rect(fill = "white", colour = "grey95"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.title = element_markdown(size = 14, color = "grey35"),
        plot.subtitle = element_markdown(size = 11),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"))

dot charts with smoothing lines displaying percent in degrees awarded by sex and major in technical sciences from 2005-2024

First we see that all degrees were nearly at zero in 2005, so I’m guessing this discipline was newly classified in the Danish education system around then. I’m sure computer science and engineering degrees were awarded before 2005, but I’m not sure what the broader discipline category was before 2005.

Mostly what we notice is that the Electronics & IT major and the Engineering major are responsible for most of the gender split. Electronics & IT is by far the largest major and the gender split is quite stark. Women earn as many or more degrees in some fields, but not enough to make up for the split in Electronics & IT.

The smoothing lines show us the predicted smoother pass through the trends.

Now that I have a framework for the trends in degrees by sex, I can see doing a longer-form post and looking at majors in all disciplines.

created and posted April 23, 2025

Prompt #20 - Urbanization

Let’s look at educational attainment for people aged 15-69 by municipal group from 2008 to 2023. Given the prompt is “urbanization” within the theme of “time series”, my main questions here are:

  • To what extent the share of people living in the larger municipal areas grew between 2008 and 2023? (the years available on StatsBank)

  • Did the percentages of educational attainment change by municipal group over time?

The age range available is 15-69. I won’t disaggregate by age in this analysis. That would be interesting but I’m trying to keep this somewhat simple.

First, a quick primer on civic administrative divisions in Denmark, so we understand what the groups mean.

Municipalities (kommuner) are incorporated entities for the purposes of budgets, schools, and other services. There are often smaller towns and villages with their own names within a municipality.

For the purposes of this grouping, Danmarks Statistik has sorted municipalities into five groups. They do not correspond to the Regions, which are administrative units above the level of municipality. A municipality will often have separately named towns, villages or neighborhoods within it, for example Bellinge is a village that is part of the Odense kommune.

The five municipal groups in these charts are:

  • Capital municipalities (Hovedstadskommuner) is Copenhagen and the surrounding municipalities within roughly 45 minutes to an hour drive, as far south as Dragør just south of the airport, north almost to Helsingør, and west to Tastrup.

  • The Metropolitan municipalities (Storbykommuner) are the three next biggest cities, Århus, Aalborg, and Odense.

  • Provincial Municipalities (Provinsbykommuner) are smaller municipalities roughly more than an hour outside of Copenhagen and the metropolitan kommunes, for example Helsingør, Roskilde, and Viborg.

  • Commuter Municipalities (Oplandskommuner) are on the outskirts of the metropolitan and provincial municipalities, like Fredensborg, Holbæk, and Nyborg.

  • Rural municipalities (Landkommuner) are further away from the Commuter or Provincial municipalities.

You can find the groups and smaller municipalities within at this documentation page

The data is from Statistiks Danmark via the danstat package. We’ll be using table LABY19, Educational attainment by municipality groups.

The first step is of course to get and clean the data.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(tidyverse) # to do tidyverse things
library(danstat) # package to get Danish statistics via api
library(waffle) # to make waffles
library(gt)
library(scales)
library(patchwork)
library(ggtext)


source("~/Data/r/basic functions.R")

# LABY19: Educational attainment (15-69 years) (number) by municipality groups,
# highest education completed and age
table_meta <- danstat::get_table_metadata(table_id = "laby19", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "laby19", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "komgrp", values = c(1, 2, 3, 4, 5)),
    list(code = "hfudd2", values = c("TOT", "H10", "H20", "H30", "H35",
        "H40", "H50", "H60", "H70", "H80", "H90")),
    list(code = "alder", values = "TOT"),
    list(code = "tid", values = "*"))

ed_urban1 <- get_data("laby19", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

ed_urban_main <- ed_urban1 %>%
    mutate(komgrp = str_remove(komgrp, " municipalities")) %>%
    mutate(alder = ifelse(alder == "Age, total", "Total", alder)) %>%
    mutate(hfudd2 = case_when(
        hfudd2 == "H10 Primary education" ~ "Primary",
        hfudd2 == "H20 Upper secondary education" ~
            "HS-Academic",
        hfudd2 == "H30 Vocational Education and Training (VET)" ~
            "HS-Vocational",
        hfudd2 == "H35 Qualifying educational programs" ~
            "Qualifying educ paths",
        hfudd2 == "H40 Short cycle higher education" ~
            "Short-cycle coll",
        hfudd2 == "H50 Vocational bachelors educations" ~
            "Bach-Vocational",
        hfudd2 == "H60 Bachelors programs" ~
            "Bach-Academic",
        hfudd2 == "H70 Masters programs" ~ "Masters",
        hfudd2 == "H80 PhD programs" ~ "PhD",
        hfudd2 == "H90 Not stated" ~ "Not stated",
        TRUE ~ hfudd2)) %>%
        mutate(hfudd2 = factor(hfudd2,
        levels = c("Primary", "HS-Vocational", "HS-Academic", "Short-cycle coll",
            "Qualifying educ paths", "Bach-Vocational", "Bach-Academic",
            "Masters", "PhD", "Not stated", "Total"))) %>%
        mutate(komgrp = factor(komgrp,
        levels = c("Capital", "Metropolitan", "Commuter", "Provincial", "Rural"))) %>%
    select(muni_grp = komgrp, ed_highest = hfudd2, age = alder, year = tid, N = indhold)

Now that we have a general data frame to work with, let’s see if there’s been a change in the percentage of people aged 15-69 living in each of the groups from 2008 onward.

Show code for making chart 1
# stacked bar chart x axis is year, y is pct, fill is municipal group
ed_urban_main %>%
    filter(ed_highest == "Total") %>%
    select(muni_grp, year, N) %>%
    group_by(year) %>%
    mutate(muni_grp_pct = N /sum(N)) %>%
    ungroup() %>%
    select(-N) %>%
    arrange(year, muni_grp) %>%
    {. ->> tmp} %>%
    ggplot(aes(x = year, y = muni_grp_pct, fill = muni_grp)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_brewer(palette = "Set3") +
    geom_text(aes(label = scales::percent(round(muni_grp_pct, 3))),
        position = position_stack(vjust = 0.5),
        color = "grey20", size = 5) +
    scale_y_continuous(limits = c(0, 1),
        labels = label_percent()) +
    scale_x_continuous(breaks = c(2008, 2011, 2013, 2015, 2017, 2019, 2021, 2023))+
    coord_cartesian(expand = FALSE, clip = "off") +
    labs(x = "", y = "") +
    theme(
        plot.title = element_markdown(size = 16), plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(size = 8),
        legend.position = "bottom", legend.justification = "left",
        legend.spacing.x = unit(0, 'cm'),
        legend.key.width = unit(1, 'cm'), legend.margin=margin(-12, 0, 0, 0),
        legend.title = element_text(size = 8), legend.text = element_text(size = 8),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    guides(fill = guide_legend(label.position = "bottom",
        title = "Municipality group", title.position = "top"))
rm(tmp)

stacked bar chart showing percent of people ages 15-69 in denmark by municipal group

As we can see…share of people in the capital and bigger municipalties has been increasing, with provincial staying roughly the same, commuter declining slightly, and rural declining by 3 percentage points.

So there has been a trend towards more urbanization.

Slight aside… Using a stacked bar chart like this does bring up a data visualization conundrum that I want to highlight by showing the first almost finished version

stacked bar chart showing percent of people ages 15-69 in denmark by municipal group

Note how if you just look at the size of the bars within the bars and don’t pay attention to the percent labels, your eyes might lead you to believe that Metropolitan (yellow) and Provincial (orange) are shrinking, because the top line angles downward.

I solved that problem by adding fct_rev to the municipal group variable in the first aes(). I had ordered them as factors in the main dataset, and the ggplot honored that. But it didn’t look right. So reversing them kept the visual inference mostly the same as the data. Though you could sort of be led to believe that the provincial group (now yellow) is decreasing, not growing as it actually is.

The lesson here is be careful about your choice of visualization, and that it communicates exactly what you want it to, keeping in mind some people make inference by color, shape, and size, and some people go right to the numbers in the labels.

Next is the change in ed attainment by municipal group. Here we’ll facet out by municipal group, keeping year as the x axis. The image is “lightboxed” so click on it to enlarge. Hit esc to restore the screen.

Show code for making chart 2
# stacked bar chart x axis is year, y is pct, fill is ed attainment, facet my muni groups

ed_urban_main %>%
    filter(!ed_highest == "Total") %>%
    select(muni_grp, ed_highest, year, N) %>%
    arrange(year, muni_grp, ed_highest) %>%
    group_by(year, muni_grp) %>%
    mutate(muni_year_N = sum(N)) %>%
    mutate(muni_ed_pct = N /sum(N)) %>%
    ungroup() %>%
    select(-N) %>%
    {. ->> tmp} %>%
    ggplot(aes(x = year, y = muni_ed_pct, fill = fct_rev(ed_highest))) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_manual(values = c(
        "#BC80BD", "#D9D9D9",   "#FCCDE5", "#B3DE69", "#FDB462",
        "#80B1D3", "#FB8072", "#BEBADA", "#FFFFB3", "#8DD3C7")) +
    geom_text(
        data = subset(tmp, year %in% c(2008, 2015, 2023) & muni_ed_pct > 0.01),
        aes(label = scales::percent(muni_ed_pct, accuracy = 1)),
        position = position_stack(vjust = .5),
        color = "grey20", size = 3) +
    scale_y_continuous(labels = label_percent()) +
    scale_x_continuous(breaks = c(2008, 2011, 2013, 2015, 2017, 2019, 2021, 2023))+
    coord_cartesian(expand = FALSE, clip = "off") +
    facet_wrap(~muni_grp, scales = "free") +
    labs(x = "", y = "",
        title = "Danes in all municipal groups are earning higher levels of education over time.
        The Capital & Metropolitan groups have higher percentages of people with Bachelor & Master's degrees.",
        subtitle = "Percentage of people ages 15-69 by municipal group in Denmark, 2008-2023.",
        caption = "*Data from Danmarks Statistik table LABY19 via danstat package*") +
    theme(
        plot.title = element_markdown(size = 14), plot.subtitle = element_markdown(size = 11),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 7, color = "grey50"),
        axis.text.y = element_text(size = 7, color = "grey50"),
        strip.background.x = element_rect(fill = "grey90", color = "grey95"),
        legend.position = "bottom", legend.justification = "left",
        legend.spacing.x = unit(0, 'cm'),
        legend.key.width = unit(1, 'cm'), legend.margin=margin(-12, 0, 0, 0),
        legend.title = element_text(size = 8), legend.text = element_text(size = 8),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    guides(fill = guide_legend(label.position = "bottom", nrow = 1, reverse=T,
        title = "Highest level of education attained", title.position = "top"))
rm(tmp)

stacked bar chart showing percent of people ages 15-69 in denmark by educatonal attainment and municipal group

Here we see more people in all municipal groups attaining higher levels of education over time; the percentage of people stopping out after primary level is decreasing in all municipal groups. We also see a greater percentage of people with Bachelor’s and Master’s degrees in the Capital and Metropolitan regions.

Or in relation to today’s prompt - not only are more people gravitating to the larger urban areas or Copenhagen, Århus, Odense & Aalborg, those that do tend to have higher levels education.

On a general visualization choice note, I’m not sure if these stacked bars were the way to go. Waffles might have been the better choice, but I had just done waffles in prompt 15 so wanted something different here.

Since I had been falling behind in getting charts posted, I wanted to keep this analysis simple. Because age is available (in groups of 5 years) I could go back and see if there’s a difference in age composition that might explain changes to the educational attainment mix over time. Especially because we know from prompt 2 that Danes are earning higher levels of education over time.

Perhaps a deeper dive project after the chart challenge is over.

created and posted April 22, 2025

Prompt #23 - Log Scale

For this prompt I’m revisiting prompt 3, circular, to once again look at earnings. I’ll expand on that analysis to look across time, since the week’s theme is time series. I’ll also compare by sex and dig a bit deeper into education level to disaggregate by field or discipline where available.

Because the prompt is log, what I’ll do is run one chart with a log10 transformation and one without, to see if there’s a change to the slope.

Circling back to what we saw in prompt 9 and prompt 19, I’m especially interested in seeing if the gendered nature of degree types at the academic bachelor level, where men earn far more of the technical and science degrees, has an effect on earnings.

So as not to bury the lede, the answer to whether there is a gender pay gap is (sadly) yes. As we’ll see, it does vary by worker type (salaried or hourly) and by education level. There were too many layers as is so I did not further disaggregate by worker level (managers, line-staff, trainees, etc). And I couldn’t disaggrgate by age because that isn’t available with education level. However, we do see in the screenshot below from Table LIGELI3, that age is a factor in and of itself, though the effect varies by age group.

table from danmarks statstik statbank showing gender pay gap But for now let’s get back to the log scale prompt we’re here to work on.

First we get the data. As in prompt 3, the income field we’re getting is MDRSNIT, standardized monthly earnings. You can read more about the different income variables on pages 8 & 9 in this pdf document.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(tidyverse) # to do tidyverse things
library(danstat) # package to get Danish statistics via api
library(waffle) # to make waffles
library(gt)
library(scales)
library(patchwork)
library(ggtext)

# some custom functions
source("~/Data/r/basic functions.R")

# lons11
table_meta <- danstat::get_table_metadata(table_id = "lons11", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "sektor", values = c(1000, 1016, 1020, 1025, 1046)),
    list(code = "lønmål", values = "MDRSNIT"), # avg monthly
    list(code = "køn", values = "*"),
    list(code = "afloen", values = c("TIME", "FAST")),
    list(code = "tid", values = "*"))

# get the data from API
sal1 <- get_data("lons11", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# main df to work from clean up some fields, make filtering variables
sal_main <- sal1 %>%
    mutate(income = as.numeric(indhold)) %>%
    mutate(income = round(income, 2)) %>%
    mutate(kon = ifelse(kon == "Men and women, total", "Total", kon)) %>%
    mutate(afloen = ifelse(
        afloen == "Fixed salary-earners", "Salaried", "Hourly")) %>%
    mutate(sector = case_when(
        sektor == "Corporations and organizations" ~ "Private sector",
        sektor == "Government including social security funds" ~ "Gov't - National",
        sektor == "Municipal government" ~ "Gov't - Municipal",
        sektor == "Regional government" ~ "Gov't - Regional",
        TRUE ~ sektor)) %>%
    mutate(sector =
            factor(sector,
                levels = c("All sectors", "Gov't - Municipal", "Gov't - Regional",
                    "Gov't - National", "Private sector"))) %>%
    mutate(ed_level =
            case_when(
                str_detect(uddannelse, "H10") ~ "Primary",
                str_detect(uddannelse, "H20") ~ "HS-Academic",
                str_detect(uddannelse, "H30") ~ "HS-Vocational",
                str_detect(uddannelse, "H35") ~ "Qualifying Programs",
                str_detect(uddannelse, "H40") ~ "Short-cycle college",
                str_detect(uddannelse, "H50") ~ "Bachelor-Vocational",
                str_detect(uddannelse, "H60") ~ "Bachelor-Academic",
                str_detect(uddannelse, "H70") ~ "Masters",
                str_detect(uddannelse, "H80") ~ "PhD",
                str_detect(uddannelse, "H90") ~ "Not stated",
                uddannelse == "Total" ~ "Total")) %>%
    mutate(ed_level =
            factor(ed_level,
                levels = c("Primary", "HS-Academic", "HS-Vocational",
                    "Qualifying Programs", "Short-cycle college",
                    "Bachelor-Vocational", "Bachelor-Academic",
                    "Masters", "PhD", "Not stated", "Total"))) %>%
    ## clean up the degree names
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_name = str_remove(deg_name, ", SCE")) %>%
    mutate(deg_name = str_remove(deg_name, ", VBE")) %>%
    mutate(deg_name = str_remove(deg_name, ", MASTER")) %>%
    mutate(deg_name = str_replace(
        deg_name, "Educational, PhD", "Education")) %>%
    mutate(deg_name = str_remove(deg_name, ", PhD")) %>%
    mutate(deg_name = str_remove(deg_name, " PhD")) %>%
    mutate(deg_name = str_replace(deg_name, " \\s*\\([^\\)]+\\)", "")) %>%
    mutate(deg_name = str_replace(
        deg_name, "Upper secondary education", "Upper secondary")) %>%
    mutate(deg_name = str_replace(
        deg_name, "upper secondary education", "upper secondary")) %>%
    mutate(deg_name = str_replace(
        deg_name, "The technology area, ", "Technology-")) %>%
    mutate(deg_name = str_replace(
        deg_name, "in general", "general")) %>%
    mutate(deg_name = ifelse(uddannelse ==
        "H2010 Upper secondary education, General (stx, hf, student courses)",
        "Upper secondary, stx, hf", deg_name)) %>%
    mutate(deg_name = ifelse(uddannelse ==
        "H2020 Upper secondary education, General (hhx, htx)",
    "Upper secondary, hhx, htx", deg_name)) %>%
    mutate(deg_name = str_replace(
        deg_name, "Educational", "Education")) %>%
    mutate(deg_name = ifelse(
        deg_code %in% c("H5097","H6097","H7097","H8097"),
        "Not specified", deg_name)) %>%
    mutate(deg_name = case_when(
        deg_code == "H1010" ~ "up to 6th grade",
        deg_code == "H1020" ~ "7th-9th grade",
        deg_code == "H1030" ~ "10th grade",
        TRUE ~ deg_name)) %>%
    mutate(deg_name = str_replace(deg_name,
    "Food, biotechnology and laboratory technology",
    "Food/Biotech/Lab Tech")) %>%
    mutate(deg_name = str_replace(deg_name,
        "educations", "education")) %>%
    # create top-line degree level
    mutate(deg_level = case_when(
        deg_code %in%
            c("H10", "H20", "H30", "H35", "H40", "H50", "H60", "H70", "H80") ~ "Top level",
        deg_code == "Total" ~ "Total",
        TRUE ~ "Sub level")) %>%
    select(year = tid, income, afloen, sector, sex = kon, ed_level,
        deg_code, deg_level, deg_name)

Ok, now we do charts, first income for all education levels and sex.There will be a chart each for regular scale and log10, stitched together with patchwork. The log10 transformation is done in ggplot with the scale_y_log10 function.

Show code for making the 1st charts
# all income, facet by worker type.
plot_income_all <-
    sal_main %>%
    filter(sector == "All sectors") %>%
    filter(sex == "Total") %>%
    filter(ed_level == "Total") %>%
    select(year, income, afloen) %>%
    {. ->> tmp} %>%
    ggplot(aes(x = year)) +
    geom_line(aes(y = income), size = 2, color = "#1F78B4") +
    geom_text(
        data = subset(tmp, year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#1F78B4", vjust = 2) +
    scale_x_continuous(breaks = c(2015, 2017, 2019, 2021, 2023)) +
    scale_y_continuous(limits = c(20000, 60000),
        breaks = pretty_breaks(),
        labels = label_comma()) +
    labs(x = "", y = "",
        subtitle = "Monthly standardized income by worker type, 2015-2023;
        in Danish kroner (DKK)") +
    facet_wrap(~ afloen) +
    theme_minimal() +
    theme(
        plot.title = element_markdown(size = 14),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(size = 9),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"),
        strip.background.x = element_rect(fill = "grey90", color = "grey95"),
        strip.text = element_text(size = 11),
        panel.border = element_rect(color = "grey50", fill = NA, size = 1),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# log10 income
plot_log_income_all <-
    sal_main %>%
    filter(sector == "All sectors") %>%
    filter(sex == "Total") %>%
    filter(ed_level == "Total") %>%
    select(year, income, afloen) %>%
    ggplot(aes(x = year)) +
    geom_line(aes(y = income), size = 2, color = "#1F78B4", alpha = .6) +
    geom_text(
        data = subset(tmp, year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#1F78B4", vjust = 2) +
    scale_x_continuous(breaks = c(2015, 2017, 2019, 2021, 2023)) +
    labs(x = "", y = "",
        subtitle = "Monthly standardized income by worker type with log10 axis transformation, 2015-2023;
        in Danish kroner (DKK)",
        caption = "*Data from Danmarks Statistik table LØNS11 via danstat package*") +
    scale_y_log10(limits = c(20000, 60000),
        breaks = pretty_breaks(),
        labels = label_comma()) +
    facet_wrap(~ afloen) +
    theme_minimal() +
    theme(
        plot.title = element_markdown(size = 14),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"),
        strip.background.x = element_rect(fill = "grey90", color = "grey95"),
        strip.text = element_text(size = 11),
        panel.border = element_rect(color = "grey50", fill = NA, size = 1),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot_income_all / plot_log_income_all +
    plot_annotation(
        title = "Salaried workers have higher monthly earnings than salaried workers. Log10 axis transformation does not have much effect because increase in wages is steady.")

line graphs showing monthly income by salaried and hourly workers in denmark from 2015 to 2023. the bottom graphs have log10 transformations

There’s not much of an effect in changing the y axis to a log10 scale, mainly because the slopes aren’t that positively or negtively steep at any point. We can see the gaps in ticks on the y axis as the log transforms those spaces.

Now for some charts by sex. We’ll do log10 one more time, but spoiler alert, again it won’t matter much.

Show code for making the income charts by sex
# all income by sex, facet by worker type.
plot_income_sex_all <-
    sal_main %>%
    filter(sector == "All sectors") %>%
    filter(!sex == "Total") %>%
    filter(ed_level == "Total") %>%
    select(year, income, sex, afloen) %>%
    {. ->> tmp} %>%
    ggplot(aes(x = year)) +
    geom_line(
        data = subset(tmp, sex == "Men"),
        aes(y = income), size = 2, color = "#5D3A9B") +
    geom_text(
        data = subset(tmp, sex == "Men" & year == 2015),
        aes(y = income, label = "Men"), color = "#5D3A9B", vjust = -1) +
    geom_text(
        data = subset(tmp, sex == "Men" & year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#5D3A9B", vjust = 2) +
    geom_line(
        data = subset(tmp, sex == "Women"),
        aes(y = income), size = 2, color = "#E66100") +
    geom_text(
        data = subset(tmp, sex == "Women" & year == 2015),
        aes(y = income, label = "Women"), color = "#E66100", vjust = -1) +
    geom_text(
        data = subset(tmp, sex == "Women" & year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#E66100", vjust = 2) +
    scale_x_continuous(breaks = c(2015, 2017, 2019, 2021, 2023)) +
    scale_y_continuous(limits = c(20000, 60000),
        breaks = pretty_breaks(),
        labels = label_comma()) +
    labs(x = "", y = "",
        subtitle = "Monthly standardized income by sex and worker type, 2015-2023;
        in Danish kroner (DKK)") +
    facet_wrap(~ afloen) +
    theme_minimal() +
    theme(
        plot.title = element_markdown(size = 14),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(size = 9),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"),
        strip.background.x = element_rect(fill = "grey90", color = "grey95"),
        strip.text = element_text(size = 11),
        panel.border = element_rect(color = "grey50", fill = NA, size = 1),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())
rm(tmp)

# log10 income
plot_log_income_sex_all <-
    sal_main %>%
    filter(sector == "All sectors") %>%
    filter(!sex == "Total") %>%
    filter(ed_level == "Total") %>%
    select(year, income, sex, afloen) %>%
    {. ->> tmp} %>%
    ggplot(aes(x = year)) +
    geom_line(
        data = subset(tmp, sex == "Men"),
        aes(y = income), size = 2, color = "#5D3A9B", alpha = .6) +
    geom_text(
        data = subset(tmp, sex == "Men" & year == 2015),
        aes(y = income, label = "Men"), color = "#5D3A9B", vjust = -1) +
    geom_text(
        data = subset(tmp, sex == "Men" & year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#5D3A9B", vjust = 2) +
    geom_line(
        data = subset(tmp, sex == "Women"),
        aes(y = income), size = 2, color = "#E66100", alpha = .6) +
    geom_text(
        data = subset(tmp, sex == "Women" & year == 2015),
        aes(y = income, label = "Women"), color = "#E66100", vjust = -1) +
    geom_text(
        data = subset(tmp, sex == "Women" & year %in% c(2015, 2019, 2023)),
        aes(y = income, label = scales::comma(round(income, 0))),
        color = "#E66100", vjust = 2) +
    scale_x_continuous(breaks = c(2015, 2017, 2019, 2021, 2023)) +
    scale_y_log10(limits = c(20000, 60000),
        breaks = pretty_breaks(),
        labels = label_comma()) +
    labs(x = "", y = "",
        subtitle = "Monthly standardized income by sex & worker type with log10 axis transformation, 2015-2023;
        in Danish kroner (DKK)",
        caption = "*Data from Danmarks Statistik table LØNS11 via danstat package*") +
    facet_wrap(~ afloen) +
    theme_minimal() +
    theme(
        plot.title = element_markdown(size = 14),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(size = 8),
        axis.text.x = element_text(size = 9, color = "grey50"),
        axis.text.y = element_text(size = 9, color = "grey50"),
        strip.background.x = element_rect(fill = "grey90", color = "grey95"),
        strip.text = element_text(size = 11),
        panel.border = element_rect(color = "grey50", fill = NA, size = 1),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())
rm(tmp)

plot_income_sex_all / plot_log_income_sex_all +
    plot_annotation(
        title = "Without controlling for education level or age, men earn higher monthly wages than women.")

line graphs showing monthly income by sex and by salaried and hourly workers in denmark from 2015 to 2023. the bottom graphs have log10 transformations

What we see here is that gender pay gap I mentioned at the start. It occurs regardless of employee type (hourly or salaried).

Disaggregating by education level will show us some interesting patterns. I won’t show the full code here, as the only real difference is using a facet_grid(afloen ~ ed_level, axes = "all", axis.labels = "all_x") call to put the worker types in the rows and education level in the columns. I had to play around with the income lables depending on year, tweaking the hjust or vjust to get them to line up nicely. I also kept the panel background grey as I found it easier on the eye to scan the lines over grey than over white.

Because the log10 transformation didn’t make a noticeable difference here any more than in the previous charts, I’m not including it here.

line graphs showing monthly income by sex, education level and by salaried and hourly workers in denmark from 2015 to 2023.

The first thing we see confirms the table I showed at the top of this post - there is a persistent gender pay gap in Denmark. The gaps are not as stark among hourly workers, and at some educational attainment levels there is almost no gap at all. For salaried workers the closest to parity comes at the PhD level. The gap is widest for salaried workers for those with vocational Bachelors. Perhaps a call-back to prompt 6 where we saw who was earning SOSU degrees? It tended to be mostly women, and the earnings on a SOSU degree aren’t very high.

It’s a bit surprising at first glance, because salaries for so many jobs are negotiated by the unions that many Danes belong to. But digging deeper into the gaps in degree fields might explain it. So would adding age and if/how many children a woman has. Maybe the reduction in total pay that happens during maternity leave cuts the average earnings if enough women at similar life points stop out for leave?

As this is already a very long post I won’t dig deeper into specific degree fields (humanities, sciences) within the degree levels (vocational or academic Bachelor, Masters, etc), but will get into that after the challenge is done.

created and posted April 25, 2025