Using R for efficient production of standardized content

Bogdan Micu

European Commission

Data Narrative

Example of data narrative

In December 04, the highest value was recorded on Anoat (338) and the lowest on Sullust (84); other planets with low values were Iego (99) and Alderaan (90).

Alderaan continued to increase after 3 months increasing on average by 8.2%; its values range between 78 (August 04) and 692 (February 01).

Anoat continued to decrease after 3 months decreasing on average by 5%, reaching the lowest value in seven months (since May); its values range between 114 (March 01) and 409 (August 04).

Iego continued to decrease after 3 months decreasing on average by 8.4%, reaching the lowest value in almost one year (since February); its values range between 96 (February 04) and 485 (July 01).

Sullust continued to decrease after 3 months stable on average; its values range between 84 (December 04) and 563 (March 01).

The same month half of all planets recorded a decrease in value, five planets recorded an increase, and three planets (Yavin, Naboo, and Ord) remained stable compared to November.

A month ago (November 04), the highest value was recorded on Anoat (351) and the lowest on Alderaan (82); another planet with a low value was Sullust (97).

A year ago (December 03), the highest value was recorded on Dantooine (322) and the lowest on Sullust (163); another planet with a high value was Yavin (318) whileanother planet with a low value was Alderaan (166).

Generation of data narrative

  • The structure of the message and the aspects to investigate are agreed in advance
    (e.g., extremes, change, patterns, outliers, relationships, category size)
  • data elements are generated
    challenge : select elements to highlight
    • how to select near-extreme values (presented)
    • whether to include cases with large change but small weight
  • data elements are verbalized (three planets, highest value, increase/decrease) with the proper noun number and verb agreement (one object is / several objects are)
  • text elements are connected by conjunctions or verbs indicating continuity or change, similarity or contrast

Generation of data narrative (continued)

Stakeholders’ agreement on

  • numeric criteria for assigning values to
    • quantitative categories (stable, extreme, close/far, strong/weak, majority)
    • change thresholds (moderate, large, very large increase/decrease)
    • verbalized percentages : half, two-thirds, one in six
  • level of detail
    e.g., when a category includes less than n elements, these are named, but more than n only the number and range is indicated
  • timeframe
    after several periods of non-significant change, the cumulative effect could be significant (if all change is in the same direction) or not (if direction of change is alternating)

Generation of data narrative : Example

The same month half of all planets recorded a decrease in value, five planets recorded an increase, and three planets (Yavin, Naboo, and Ord) remained stable compared to November.

  1. change groups (increasing / stable / decreasing) generated
  • objects sorted by absolute value
  • if more than 1 objects in group, last one prefixed by “and”
  • if more than 2 objects in group, separated by comma
  1. change groups sorted by number of objects
    the largest group is reported first
  2. group size (n) verbalized as fraction of total number of objects (N), or as word form (for n up to 5)
  3. change group descriptions combined

Generation of data narrative : Code for example

data %>%
    group_by(PLANET) %>%
    mutate(MoM=100*((value/dplyr::lag(value,
                                      order_by=date))-1),
           descr_MoM=case_when(abs(MoM)<.lim_stable ~
                                   "stable",
                               MoM<= -.lim_stable ~
                                   "decreasing",
                               MoM>= .lim_stable ~
                                   "increasing")) %>%
    filter(!is.na(descr_MoM)) %>%
    left_join(eval(.) %>%
                  group_by(date) %>%
                  summarize(N=n()),
              by=join_by(date)) %>%
    group_by(date,
             descr_MoM) %>%
    arrange(desc(abs(MoM))) %>%
    mutate(planet_s=paste0(ifelse(max(row_number())>1 &
                                      row_number()==max(row_number()),
                                  "and ",""),
                           PLANET)) %>%
    summarize(N=mean(N,
                     na.rm=TRUE),
              n=n(),
              planet_s=glue::glue_collapse(planet_s,
                                           sep=ifelse(n()>2,
                                                      ", ", " ")),
              .groups="drop") %>%
    group_by(date) %>%
    arrange(desc(n)) %>%
    mutate(descr.n=case_when(n==N ~
                                 "all planets",
                             between(N-n,1,2) ~
                                 "nearly all planets",
                             between(5*round(round(100*n/N)/5),
                                     76,90) ~
                                 "almost all planets",
                             between(5*round(round(100*n/N)/5),
                                     56,75) ~
                                 "more than half of all planets",
                             between(5*round(round(100*n/N)/5),
                                     45,55) ~
                                 "half of all planets",
                             between(5*round(round(100*n/N)/5),
                                     35,44) ~
                                 "almost half of all planets",
                             between(n,2,5) ~
                                 paste0(english::english(n),
                                        " planets"),
                             n==1 ~
                                 "a single planet",
                             TRUE ~ NA_character_),
           info=paste0(descr.n,
                       ifelse(n<=3,
                              paste0(" (",planet_s,")"),
                              ""),
                       ifelse(str_detect(descr_MoM,
                                         "stable"),
                              " remained stable",
                              paste0(" recorded ",
                                     ifelse(str_detect(descr_MoM,
                                                       "incr"),
                                            "an ","a "),
                                     str_replace(descr_MoM,
                                                 "ing","e"))),
                       ifelse(row_number()==min(row_number()),
                              " in value",""),
                       ifelse(row_number()==max(row_number()),
                              paste0(" compared to ",
                                     month.name[month(date +
                                                          months(-1))],
                                     ifelse(year(date)==year(date +
                                                                 months(-1)),
                                            "",
                                            year(date +
                                                     months(-1)))),
                              ""))) %>%
    summarize(info=glue::glue_collapse(info,
                                       sep=", ",
                                       last=", and ")) %>%
    mutate(info=glue::glue("The same month {info}.")) %>%
    filter(date==t.date) %>%
    pull(info) %>%
    as.character()

Selection of extremes and near-extremes

  • identify “natural groups” of data points, identified based on within-group and between-group distances
  • aim to reproduce the set of values that would be selected by the stakeholders

Extremes and Near-extremes on random numbers

Other Challenges

consolidated output

Self-contained html file

YAML header

---
format:
  html:
    embed-resources: true
---

embedded attachments

xfun::embed_file

different text in html & docx

code that can be rendered into both formats visible conditionally

::: {.content-visible when-format="html"}
::: {.content-visible when-format="docx"}

code that crashes the script when rendered into one of the two output formats executed conditionally

if (knitr::is_html_output())
if (knitr::pandoc_to("docx"))

highlight text in both html and word

  • html : css
  • docx : custom-reference-doc.docx, reference formatting template, where custom-style is defined (see Quarto documentation)

styles were applied simultaneously, w/out conflicts

::: {`r fmt.01.html`}
[`r txt.001`]{`r fmt.01.docx`}
[`r txt.002`]{`r fmt.01.docx`}
:::

adding data to pre-existing xlsx file containing charts and formatted cells

xxx = openxlsx::loadWorkbook(file.xlsx)
xxx %>%
  openxlsx::writeData(sheet="sheet",
                      x=data,
                      startRow=1,
                      startCol=1)
xxx %>%
  openxlsx::saveWorkbook(file.xlsx,
                         overwrite=TRUE)

extract non-formatted text from text w/ html tags

LongFmtText %>%
  rvest::minimal_html() %>%
  rvest::html_elements("body") %>%
  rvest::html_text()

Turpis, maecenas, nostra, duis nec, eleifend posuere sed fusce venenatis, curae euismod. Lacinia sed, feugiat id aptent, metus vel Alderaan - between 78 (August 04) and 692 (February 01). Fringilla aliquet, vestibulum diam Anoat - between 114 (March 01) and 409 (August 04). Nisl inceptos aliquam hendrerit, quis risus. Ligula sed curae aptent erat aliquam turpis Bespin - between 127 (January 04) and 445 (September 02). Ac urna blandit interdum cum erat torquent vel tincidunt mollis phasellus Corellia - between 102 (August 01) and 702 (January 04). Velit nam ridiculus at magna lobortis et vehicula. Ante torquent sed sociis, dolor nisl Coruscant - between 154 (December 01) and 432 (June 03). Amet amet id Dagobah - between 124 (December 04) and 425 (July 01). Sit, adipiscing turpis et netus magnis hendrerit nec lacinia. Amet pellentesque commodo viverra, ac ac phasellus, luctus in Dantooine - between 154 (January 04) and 403 (August 02). Tristique nisi eu mauris nam, vel in sem magnis ligula, ante, ultricies Endor - between 151 (November 04) and 477 (March 03). Netus ornare, quis nunc nisl. Fermentum sed nullam consectetur nunc adipiscing potenti mauris litora consequat iaculis Hoth - between 98 (September 01) and 519 (February 04). Luctus facilisis consequat libero, phasellus nam Iego - between 96 (February 04) and 485 (July 01). Ut a mauris etiam diam primis. Quis justo vel ut nec et nulla Kessel - between 122 (January 04) and 480 (April 02). Accumsan, varius, ac primis rhoncus nunc tortor vitae, aliquet Malastare - between 155 (September 04) and 567 (January 02). Ut eleifend nibh amet ullamcorper molestie. Velit efficitur porta porta et, sit lectus scelerisque mi ut a nulla Naboo - between 112 (October 01) and 469 (April 04). Nec potenti scelerisque eget netus quis eget leo sed, dis feugiat fringilla, ultricies Ord - between 101 (October 04) and 566 (March 02). Dapibus sit at neque ridiculus et vel augue. Id sed nec natoque mauris purus magna sit montes eu Sullust - between 84 (December 04) and 563 (March 01). Risus interdum suscipit class cras magna ut eu augue sociis Tatooine - between 126 (October 01) and 475 (April 04). Ante nibh turpis, nisi blandit lacus scelerisque in in. Non urna curabitur, eleifend aptent ac sit eu taciti dignissim parturient leo Yavin - between 159 (March 01) and 400 (September 02).

Source & Auxiliary files