Tuesday, 21 June 2022, 9:00 - 10:15am CDT: Opening remarks and Paula Moraga

Paula Moraga - R for geospatial data science and public health surveillance

Geospatial health data are essential to inform public health and policy. These data can be used to quantify disease burden, understand geographic and temporal patterns, identify risk factors and measure inequalities. In this talk, I will give an overview of R packages and statistical methods for geospatial data analysis and health surveillance. I will discuss data biases and availability issues, and how public health challenges require robust analytical tools and predictive models that can integrate complex data from different sources and at different geographic and temporal resolutions. I will present R packages for disease mapping, detection of clusters, and risk assessment of travel-related spread of disease, and health surveillance applications where R has been used to model health, environmental, demographic and climatic data to predict disease risk and identify targets for intervention. Finally, I will show how R can help with communication and dissemination, which are essential to enable broad access to data and to develop and implement appropriate health policies and improve population health globally.

Paula Moraga is an assistant professor of statistics at King Abdullah University of Science and Technology (KAUST). Her research focuses on the development of statistical methods and computational tools for geospatial data analysis and health surveillance, and the impact of her work has directly informed strategic policy in reducing the burden of diseases such as malaria and cancer in several countries. Paula has worked on the development of several R packages for Bayesian risk modeling, detection of disease clusters, and risk assessment of travel-related spread of disease. She is the author of the book Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny.

Session chair: Matt Shotwell


Tuesday, 21 June 2022, 2:45 - 4:00pm CDT: Amanda Cox (sponsored by Appsilon)

Amanda Cox - How tools shape thinking about graphics

I spent a long time making graphics at the New York Times. When I’d give talks in the world, one of the most popular questions was about tools–usually just some version of “what are they?” My go-to answer was that the tools didn’t really matter: you could draw your chart in chalk on the sidewalk, and, if it was good, it would still be good. But, I’ve spent the last few months hiring a new team. And I’ve started to suspect I was wrong. A love letter of sorts.

Amanda Cox is head of special data projects at USAFacts. Until 2022, she was the data editor of the New York Times. She joined its graphics department in 2005, making charts and maps for the paper and its website. In 2016, she was named the editor of The Upshot section, which offers an analytical approach to the day’s news. She is a leader in the field of data visualization. Before joining the Times, she worked at the Federal Reserve Board and earned a master’s degree in statistics from the University of Washington.

Session chair: Thomas Stewart

Wednesday, 22 June 2022, 9:00 - 10:15am CDT: afrimapr (sponsored by Oracle)

afrimapr - Perspectives of doing R stuff* in an emerging region

*stuff <- c("coding", "community", "capacity")

In 2020, the afrimapr project team set out to support analysts in Africa with open-source R approaches for mapping and visualising data. afrimapr aimed to develop open-source code components and open-access training materials and to support the development of a community of practice. The project was initially funded through Wellcome.

Over the past 30 months, we created several R packages, including {afrilearndata} with training datasets and {afrihealthsites} for health facility data, and we published a paper about the use of open health facility data in Africa. We developed and delivered a 4-hour online workshop with materials in English and French and developed related {learnr} tutorials. We prototyped interactive Shiny apps as demonstration tools of what is possible and started an online open textbook. Our learning materials have been used in various workshops, most notably at useR! 2021 and by Public Health England as part of training for the Africa Centres for Disease Control and Prevention in 2021 and 2022. We run monthly community meetups online where community members or invited speakers do code-walkthrough sessions or share experiences related to mapping, data, and more. We’ve also partnered with the R for Data Science Online Learning Community, where we host the #chat-africa-maps channel.

This presentation will give a brief overview of the project to date. One of our Malawian community members will also showcase his mapping work using R and share how it is helping to shape policy in public health and providing a better understanding to help combat vector-borne diseases in Malawi. The most significant part of the presentation will, however, focus on the collective experiences of team members around growing an active and inclusive community of practice on the African continent.

Speakers: Anelda van der Walt (Talarify), Anne Treasure (Talarify), Andy South (Liverpool School of Tropical Medicine), Clinton Nkolokosa (Malawi Liverpool Wellcome Trust Clinical Research Programme), and Ghislain Nono Gueye (Louisiana Tech University)

Session chair: Peg Duthie


Wednesday, 22 June 2022, 2:45 - 4:00pm CDT: Julia Silge

Julia Silge - Applied machine learning with tidymodels

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. Whether you are just starting out today or have years of experience with ML, tidymodels offers a consistent, flexible framework for your work. In this talk, learn how to think about the steps of building a model from beginning to end, how to fluently use different modeling and feature engineering approaches, and how to avoid common pitfalls of modeling like overfitting and data leakage. Training a model is often not the final or truly useful goal of an ML project, so we will also discuss how to version and deploy reliable models trained in R, and approach MLOps tasks like monitoring a deployed model.
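
As a flavor of the workflow described above, here is a minimal, hypothetical tidymodels sketch (using the built-in mtcars data; the talk's own examples may differ):

```r
# Minimal tidymodels sketch: split, preprocess, fit, evaluate.
library(tidymodels)

set.seed(123)
split <- initial_split(mtcars, prop = 0.8)   # held-out test set guards against leakage
train <- training(split)
test  <- testing(split)

rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())   # feature engineering as a recipe step

mod <- linear_reg() |> set_engine("lm")

wf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(mod) |>
  fit(data = train)

predict(wf_fit, test) |>
  bind_cols(test) |>
  metrics(truth = mpg, estimate = .pred)     # rmse, rsq, mae on the test set
```

The same recipe and workflow objects can later be swapped to a different model or engine without rewriting the preprocessing.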

Julia Silge is a data scientist and software engineer at RStudio PBC, where she works on open source modeling tools. She holds a PhD in astrophysics and has worked as a data scientist in tech and the nonprofit sector, as well as a technical advisory committee member for the US Bureau of Labor Statistics. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning practice. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Session chair: Andrea Vargas


Thursday, 23 June 2022, 9:00 - 10:15am CDT: Sebastian Meyer and R Core panel

Sebastian Meyer - Junior R-core experiences

At the useR! 2020 panel discussion, members of the R Core Team talked about the role of R-core and succession plans. In this talk, I will give an overview of selected changes in R 4.2.0, and also show how to contribute to base R development and share some personal experiences with that.

Sebastian Meyer is a statistician and research fellow at the Institute of Medical Informatics, Biometry and Epidemiology at Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany. He holds a PhD in epidemiology and biostatistics from the University of Zurich and maintains the R package surveillance. He is an editor of the Journal of Statistical Software and the newest member of the R Core Team.

Session chair: Dirk Eddelbuettel

Thursday, 23 June 2022, 2:45 - 4:00pm CDT: Closing remarks and Mine Dogucu

Mine Dogucu - Teaching accessibly and teaching accessibility

The World Health Organization estimates that over one billion people–about 15% of the global population–experience a disability. As R educators, whether we teach in a classroom, at a meetup, or on social media, we need to ensure that our teaching is accessible to all learners. There is a close link between what gets taught and what gets practiced. Thus, not only should we be teaching accessibly but we should also be teaching about accessibility to learners. The inclusion of accessibility in the curricula can enable current and future R educators and developers to utilize accessibility recommendations in their teaching and R products. In this talk, I will share examples from statistics and data science classes where my collaborators and I started incorporating accessibility practices into our teaching. In addition, I will share a selection of R examples that I use to teach accessibility to my students.

Mine Dogucu is an assistant professor of teaching in the Department of Statistics at the University of California Irvine and an incoming lecturer (teaching) in the Department of Statistical Science at University College London. She is an educator with an interest in statistics and data science education and an applied statistician with experience in educational research. She works towards the goal of making statistics and data science physically and cognitively accessible. She enjoys teaching (with) R. She is the coauthor of the book Bayes Rules! An Introduction to Applied Bayesian Modeling and the accompanying R package bayesrules.

Session chair: Yanina Bellini Saibene


Tuesday, 21 June 2022, 10:45am - 12:00pm CDT

Session 5, Big Data Management (sponsored by Oracle)

Session chair: Isabella Bicalho

Will Landau - Data version control for reproducible analysis pipelines

In computationally demanding data analysis pipelines, the targets R package maintains an up-to-date set of results while skipping tasks that do not need to be rerun. This process increases speed and enhances the reproducibility of the final product. However, it also overwrites old output with new output, and past results disappear by default. To preserve historical output, two major enhancements have arrived in the targets ecosystem. The first enhancement is version-aware cloud storage. If you opt into Amazon-backed storage formats and supply an Amazon S3 bucket with versioning turned on, then the pipeline metadata automatically records the version ID of each target. That way, if the metadata file is part of the source code version control repository of the pipeline, the user can roll back to a previous code commit and automatically recover the old data, all without invalidating any targets or cueing the pipeline to rerun. The second enhancement to the ecosystem is gittargets, an alternative cloud-agnostic data version control system. The gittargets package captures version-controlled snapshots of the local data store, and each snapshot points to the underlying commit of the source code. That way, when the user rolls back the code to a previous branch or commit, gittargets recovers the data contemporaneous with that commit so that all targets remain up to date. With cloud versioning and gittargets, the targets package now combines the virtues of both Airflow-like and Make-like tools.
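
A sketch of what opting into both enhancements might look like (the bucket name, prefix, and targets below are hypothetical; this is a configuration outline, not the talk's own code):

```r
# _targets.R sketch: version-aware AWS S3 storage for all targets.
library(targets)

tar_option_set(
  repository = "aws",                       # store target data in S3
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-versioned-bucket",       # bucket with versioning enabled
      prefix = "pipeline-data"
    )
  )
)

list(
  tar_target(raw,   read.csv("data.csv")),
  tar_target(model, lm(y ~ x, data = raw))
)

# Alternatively, for cloud-agnostic local snapshots with gittargets
# (run interactively after committing the code):
#   library(gittargets)
#   tar_git_init()       # one-time: initialize the data repository
#   tar_git_snapshot()   # snapshot the data store, tied to the current code commit
#   tar_git_checkout()   # later: restore the data matching a checked-out commit
```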

Ilias Moutsopoulos - `bulkAnalyseR`: An accessible, interactive pipeline for analysing and sharing bulk multi-modal sequencing data

Co-authors: Eleanor Williams and Irina Mohorianu

Bulk sequencing experiments (single- and multi-omics) are essential for exploring wide-ranging biological questions. To facilitate interactive exploratory tasks, coupled with the sharing of easily accessible information, we present bulkAnalyseR, a package integrating state-of-the-art approaches, using an expression matrix as the starting point (pre-processing functions are available as part of the package). Static summary images are replaced with interactive panels illustrating quality-checking, differential expression analysis (with noise detection) and biological interpretation (enrichment analyses and identification of expression patterns, followed by inference and comparison of regulatory interactions). bulkAnalyseR can handle different modalities, facilitating robust integration and comparison of cis-, trans- and customised regulatory networks.

Oliver Reiter - Providing large trade datasets for research using Apache Arrow

Co-author: David Zenz

In this talk, we document a real-world application of Apache Arrow at an international economic research institute that draws on multiple large datasets (e.g., bilateral trade flow data from UN Comtrade or monthly trade statistics from EU Comext). Until recently, we relied on a workflow using Stata scripts to download, process and store the data. The overall process is lengthy, often requires manual intervention and uses a large amount of storage.

Using Apache Arrow and its R package arrow, we were able to streamline the processing of the data, drastically reduce the required amount of storage (by slightly more than 95%) and, as a side effect, reduce the effort needed to execute custom queries against this dataset.

Furthermore, we provide a performance comparison of five commonly used queries, comparing the “older” process with the “new” Apache Arrow implementation. Our results show that the timings of the new implementation are largely unaffected by the complexity of the query, whereas the timings of the old implementation increase dramatically. We also showcase the implementation of the ingestion pipeline that is used for keeping an up-to-date version of the monthly EU Comext data at our institute.
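
The kind of workflow described can be sketched with the arrow package (the data frame, path, and column names below are hypothetical illustrations, not the institute's actual schema):

```r
library(arrow)
library(dplyr)

# One-time: write the trade data as a partitioned, compressed Parquet dataset
write_dataset(trade_flows, "comext_parquet",
              format = "parquet",
              partitioning = c("year", "flow"))

# Queries are pushed down to Arrow and only read the partitions they need
open_dataset("comext_parquet") |>
  filter(year == 2021, reporter == "AT") |>
  group_by(partner) |>
  summarise(total_value = sum(value)) |>
  collect()
```

Because filtering and aggregation happen before `collect()`, query time depends mostly on the partitions touched rather than on the full dataset size, which is consistent with the timing results reported above.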

Phuong Quan - `daiquiri`: Data quality reporting for temporal datasets

Co-authors: Martin Landray, Sarah Walker, Timothy Peto, and Benjamin Lacey

Large routinely-collected datasets are increasingly being used in research. Events occurring at the institutional level, such as software updates or new machinery or processes, can cause temporal artefacts that, if not identified and taken into account, can lead to biased results and incorrect conclusions. While checks for data quality issues should theoretically be conducted by the researcher at the initial data analysis stage, in practice it is unclear to what extent this is actually done, since it is rarely, if ever, reported in published papers. With the increasing drive towards greater transparency and reproducibility within the scientific community, this essential yet often-overlooked part of the analysis process will inevitably begin to come under greater scrutiny. Therefore, helping researchers to conduct it thoroughly and consistently will increase the quality of their studies as well as trust in the scientific process.

The daiquiri package addresses this need by generating data quality reports that enable quick visual review of temporal shifts in record-level data. Time series plots showing aggregated values are automatically created for each data field (column) depending on its contents (e.g., min/max/mean values for numeric data, number of distinct values for categorical data), as well as overviews for missing values, non-conformant values, and duplicated rows. The resulting reports are shareable and can contribute to forming a transparent record of the entire analysis process.
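
A hypothetical usage sketch (the column names are invented, and the field-type helper names reflect the package's documented API as best we can tell):

```r
library(daiquiri)

# Declare what each column contains so daiquiri knows how to aggregate it
fts <- field_types(
  PrescriptionID   = ft_uniqueidentifier(),
  PrescriptionDate = ft_timepoint(),      # drives the temporal aggregation
  DrugName         = ft_categorical(),
  Quantity         = ft_numeric()
)

raw <- read_data("prescriptions.csv")
daiquiri_report(raw, field_types = fts)   # writes a shareable HTML report
```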

Session 6, Clustering Methods

Session chair: Kapil Choudhary

Ishan Saran - Standardizing acute kidney injury across different populations

Co-authors: Shivam Saran and Francis Perry Wilson

Due to the complex etiology of its presentation, acute kidney injury (AKI) has been defined in a variety of ways. A standardized definition is necessary to make accurate comparisons across captured patient populations, to refine clinical understanding of the syndrome for better prognosis, and to lower administrative costs associated with AKI.

We streamline the standardization of AKI by presenting AKIFlagger, an open-source computational tool available in Python, in R, and as a standalone web application, which implements a standardized AKI definition based on KDIGO guidelines while allowing for variational definitions of historical baseline. We applied AKIFlagger to a dataset of patients hospitalized with COVID-19 with three functional approaches to defining AKI: (1) a rolling-window definition, (2) a historical baseline definition, and (3) imputation based on demographic information.
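
The rolling-window criterion can be illustrated in base R (a simplified sketch, not AKIFlagger's implementation; KDIGO flags a creatinine rise of at least 0.3 mg/dL within 48 hours):

```r
# Flag AKI when creatinine rises >= 0.3 mg/dL above the minimum value
# observed in the preceding 48 hours (simplified rolling-window criterion).
flag_aki_rolling <- function(time_hours, creatinine, window = 48, delta = 0.3) {
  vapply(seq_along(creatinine), function(i) {
    in_window <- time_hours >= time_hours[i] - window & time_hours < time_hours[i]
    if (!any(in_window)) return(FALSE)           # no prior values to compare against
    creatinine[i] - min(creatinine[in_window]) >= delta
  }, logical(1))
}

t  <- c(0, 12, 24, 48, 60)        # hours since admission
cr <- c(1.0, 1.0, 1.1, 1.5, 1.6)  # serum creatinine, mg/dL
flag_aki_rolling(t, cr)
#> [1] FALSE FALSE FALSE  TRUE  TRUE
```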

In our dataset, we demonstrate that subtle changes in definition can have a large impact on which patient populations are captured: using a “historical baseline” definition of creatinine and allowing for imputed historical baseline creatinine values increase the size of captured patient populations by 20.7% and 57.1%, respectively, versus KDIGO standards.

Subtle differences in the definition of AKI can lead to drastic differences in which patient populations are captured by the definition. As a standardized tool, AKIFlagger can be used by researchers to ensure that definitions are uniform across studies.

Natalya Pya Arnqvist - `fdaMocca`: An R package for model-based clustering for functional data with covariates

Co-authors: Per Arnqvist and Sara Sjöstedt de Luna

Global concern about climate change, global heating, and the impact of human activities on the environment has led to a surge of interest in climate variations over millennia. Varved lake sediment has the potential to play an important role in understanding past climate with its inherent annual time resolution and within-year seasonal patterns. This talk presents fdaMocca, an R package that provides routines for model-based functional cluster analysis for functional data with optional covariates. The idea is to cluster functional subjects (often called functional objects) into homogeneous groups by using spline smoothers (for functional data) together with scalar covariates. The spline coefficients and the covariates are modeled as a multivariate Gaussian mixture model, where the number of mixtures corresponds to the number of clusters. The parameters of the model are estimated by maximizing the observed mixture likelihood via an EM algorithm (Arnqvist and Sjöstedt de Luna, 2019). The clustering method is used to analyze annual sediments from Lake Kassjön (Northern Sweden), which cover more than 6400 years and can be seen as historical records of weather and climate.

Reference: Arnqvist and Sjöstedt de Luna (2019).

Wenxi Zhang - k-means clustering usage in datasets with missing values

Co-author: Norman Matloff

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest cluster centroid. However, the standard k-means algorithm fails to accommodate data with missing values. Our modified k-means algorithm takes missing values into account. When calculating the sum of squared errors of each data point to the centroid, we only consider the partial distance of entries with non-NA values. Visualization features are also included in the package. This innovation in the algorithm could be beneficial for large sparse datasets with missing values, especially for datasets of recommendation systems.
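
The partial-distance idea can be sketched in base R (a simplified illustration of the assignment step; the package's actual implementation may differ):

```r
# Squared distance to a centroid computed over non-NA entries only,
# rescaled by the fraction of entries observed.
partial_dist2 <- function(x, centroid) {
  ok <- !is.na(x)
  if (!any(ok)) return(Inf)                        # row carries no information
  sum((x[ok] - centroid[ok])^2) * length(x) / sum(ok)
}

# Assign each row of a matrix with NAs to its nearest centroid
assign_clusters <- function(X, centroids) {
  apply(X, 1, function(x)
    which.min(apply(centroids, 1, partial_dist2, x = x)))
}

X <- rbind(c(1,    2,   NA),
           c(10,   NA,  9),
           c(1.5,  2.5, 0))
centroids <- rbind(c(1, 2, 0), c(10, 10, 10))
assign_clusters(X, centroids)
#> [1] 1 2 1
```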

Christoph Kiefer - Subgroup discovery in structural equation models

Co-authors: Florian Lemmerich, Benedikt Langenberg, and Axel Mayer

Structural equation modeling is one of the most popular statistical frameworks in the social and behavioural sciences. Often, detection of groups with distinct sets of parameters in structural equation models (SEM) is of key importance for applied researchers–for example, when investigating differential item functioning for a mental ability test or examining children with exceptional educational trajectories. In this talk, we present a new approach, combining subgroup discovery–a well-established toolkit of supervised learning algorithms and techniques from the field of computer science–with SEM. We provide an introduction on what distinguishes subgroup discovery from common approaches and how subgroup discovery can be applied to detect subgroups with exceptional parameter constellations in SEM based on user-defined interestingness measures. Our approach is illustrated using both artificial and real-world data (from a large-scale assessment study). The illustrative examples were conducted in the R package subgroupsem, which is a viable implementation of our approach for applied researchers.

Session 7, Shiny Applications

Session chair: Thomas Rose

Andrew Patterson - What happens next? The day after deployment

Congratulations, you’ve made a lovely Shiny app, configured it, and deployed it in the cloud. Job done, right?

What happens next? How do you keep your Shiny server, environment, and data safe? This talk introduces you to possible fates that unassuming servers may meet at the hands of the malicious users, bots, botnets, and actors that are ever present on the modern internet, including:

  • vulnerabilities
  • attack vectors
  • data breaches
  • vulnerable cloud configurations
  • configuration drift

But the aim of this talk is not just to scare unsuspecting data scientists! I’ll also introduce key methods and ideas to provide lines of defence for servers left alone in the wild, including:

  • maintenance planning and unattended-upgrades
  • firewalling and encryption
  • log-handling
  • configuration-as-code

Simon Gonzalez - Developing R Shiny apps to enhance speech ultrasound visualization and analysis

The area of speech visualisation and analysis is experiencing rapid growth. This demands access to efficient technologies for the accurate description and analysis of articulatory speech patterns. In the area of tongue ultrasound studies, the visualization and analysis processes generally require a solid knowledge of programming languages, as well as a deep understanding of articulatory phenomena. This demands a variety of programs for efficient use of the data collected. In this presentation, we introduce a multimodal app for visualizing and analyzing tongue contours: UVA (Ultrasound Visualization and Analysis). This app combines the computational power of R and the interactivity of Shiny web apps to allow users to manipulate and explore tongue ultrasound data using cutting-edge methods. One of the greatest strengths of the app is that it can be modified to adapt to users’ needs. This gives it potential as an innovative tool for diverse academic and industry audiences.

Robert Bischoff - Using R Shiny and `Neo4j` to build the `CatMapper` prototype application

Co-author: Daniel Hruschka

CatMapper is a set of user-friendly web-based tools designed to help researchers overcome a common bottleneck in comparative research–integrating data across diverse datasets by complex categories (e.g., ethnicities, languages, religions, archaeological artifact types) that vary from dataset to dataset. We received startup funds to build these tools, but not enough to fund a software engineer. We undertook the at-times overwhelming task of building this application as social scientists using the R Shiny platform, which significantly lowered the technical barrier to achieving a functional web application. We relied on R’s extensive ecosystem to wrangle data; connect to, build, and query our Neo4j graph database; access Amazon S3; display network graphs; and display interactive maps. The progress enabled by R and its various packages has allowed us to demonstrate the feasibility of our project and enabled us to gain additional grant funding.

In this presentation, we demonstrate how we built this application using R and the graph database Neo4j, and how we currently manage it using Docker containers. We discuss some of the advantages and challenges we encountered throughout this process.

Session 8, R in Teaching

Session chair: Laure Cougnaud

Amelia McNamara - Teaching modeling in introductory statistics: A comparison of formula and tidyverse syntaxes

There is considerable debate about which R syntax to teach novices, but much of it is not based on data. This talk will report on an experiment run in a pair of introductory statistics labs, attempting to determine which of two R syntaxes was better for introductory teaching and learning: formula or tidyverse. One lab was conducted fully in the formula syntax, the other in tidyverse. Analysis of incidental data from YouTube and RStudio Cloud shows interesting distinctions. The formula section appeared to watch a larger proportion of pre-lab YouTube videos, but spent less time computing on RStudio Cloud. Conversely, the tidyverse section watched a smaller proportion of the videos and spent more time on RStudio Cloud. Analysis of lab materials showed that tidyverse labs tended to be slightly longer (in terms of lines in the provided R Markdown materials, as well as minutes of the associated YouTube videos), and the tidyverse labs exposed students to more distinct R functions. However, both labs relied on a quite small vocabulary of consistent functions. Analysis of pre- and post-survey data shows no differences between the two labs, so students appeared to have a positive experience regardless of section. This work provides additional evidence for instructors looking to choose between syntaxes for introductory statistics teaching.
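
For readers unfamiliar with the distinction, here is the same grouped summary in both styles (a generic illustration, not the study's lab materials; base R's aggregate stands in for the formula style, which packages such as mosaic extend further):

```r
# Formula syntax: the model-like formula describes the computation
aggregate(wt ~ cyl, data = mtcars, FUN = mean)

# tidyverse syntax: a pipeline of verbs
library(dplyr)
mtcars |>
  group_by(cyl) |>
  summarise(mean_wt = mean(wt))
```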

Jonathon Love - `jamovi`: An R-based statistical spreadsheet for the masses

Co-authors: Ravi Seller and Damian Dropmann

The jamovi project aims to provide an R-based free and open source statistical spreadsheet for people coming from the SPSS or Excel traditions. Although to some, a spreadsheet represents a backwards step in flexibility compared to programming languages such as R, spreadsheets occupy an important place in empowering “middle tier” users–users perhaps without the time to master R, but with more complex questions than can typically be answered by data dashboards.

Additionally, jamovi serves to invite and ease the spreadsheet user into using R. R syntax is available for each analysis, and Rj Editor allows R code to be directly run within the spreadsheet. At the same time, jamovi helps R users publish analyses that can be driven with a user interface from within the spreadsheet, making them accessible to a much broader audience.

Perhaps most significantly, jamovi has been written from the ground up to be a native web application, allowing it to be run both as a self-contained desktop application or in the cloud.

This talk introduces jamovi, demonstrates its feature set, and walks you through the simple steps to provide a user interface for an analysis inside jamovi. We encourage you to take a look at jamovi before the talk.

Carsten Lange - A better way to teach histograms, using the TeachHist package

Histograms are an essential concept in statistics, yet students often have trouble grasping the underlying ideas and interpreting the diagrams. The talk introduces flexible and adaptive teaching strategies for statistics instructors using the TeachHist R package. The package extends R’s functionality to generate histograms for educational purposes. TeachHist enables instructors to help students better visualize statistical concepts. Instructors can use real-world data, or TeachHist can generate normally distributed random data with a given mean and standard deviation.

Students often struggle to understand the relation between the original scale of a variable and the related normalized scale (z-values). TeachHist addresses this problem by generating histograms with two horizontal axes, one with the original scale and one with a z-scale. The package uses data visualization techniques to make statistical concepts more intuitive for students. Instructors can generate various histograms (count, relative frequency, and density) by writing very little code. This makes TeachHist suitable for demonstrating statistical concepts on the fly in the classroom (e.g., addressing a student’s question). TeachHist also includes functionality to teach confidence intervals and hypothesis testing in a visually appealing way.
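
The dual-axis idea can be illustrated in plain base R (a sketch of the concept only; TeachHist's own functions differ):

```r
# Draw a histogram on the original scale, then add a second axis in z-units.
set.seed(1)
x <- rnorm(500, mean = 100, sd = 15)     # hypothetical IQ-like scores

par(mar = c(7, 4, 4, 2))                 # extra bottom margin for the second axis
hist(x, main = "Original scale above, z-scale below", xlab = "")
z <- -3:3
axis(1, at = mean(x) + z * sd(x), labels = z, line = 3.5)  # z-axis below
mtext("z", side = 1, line = 5.5)
```

Each z tick is placed at mean(x) + z * sd(x) on the original scale, so students see both scales describing the same bars.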

Kristen B. Gorman - The untold story of `palmerpenguins`

Co-authors: Allison Horst and Alison Hill

The palmerpenguins R package provides a modern, approachable dataset containing body size measurements for three penguin species that nest on islands throughout the Palmer Archipelago, Western Antarctic Peninsula. Since the release of palmerpenguins on the Comprehensive R Archive Network (CRAN) in July 2020, the package has been downloaded over 340,000 times, was quickly adapted for use in other languages (including Python’s seaborn package and Google’s TensorFlow datasets), and has become a go-to option for data science and statistics educators worldwide.

In this talk, we share the untold story of the palmerpenguins package. From original data collection on rocky Antarctic shores to CRAN submission and beyond, we describe the penguins’ journey from polar research project to global teaching product. What started out as a simple methods paper for a dissertation project turned into a widely used data science product mainly because of initial efforts to make the data publicly available and easily accessible by others. The success of the palmerpenguins R package underscores the importance of proper data archiving for unknown future applications.

Session 9, Ecology and Environment

Session chair: Rita Giordano

Jonathan Callahan - Air quality recipes: Package APIs for R dabblers

Many people who work in government science agencies are software dabblers rather than zealous converts. With a “git ‘er done” attitude, they find example code on the interwebs and too often end up with a mix of Python, R, JavaScript, Excel, bash and whatever else they picked up in graduate school. Results vary.

Over the last decade, our work has focused on creating R packages for the air quality community. This community includes everyone from agency scientists at the national and state level to graduate and undergraduate students working in environmental health. Many in this community are willing to open up R/RStudio but don’t have the time or inclination to become “coders.” They use R because available packages make their lives easier.

This presentation will introduce a suite of packages and a style of coding that work together to make air quality analysis as easy as possible while still being rigorous. Features include compact, harmonized datasets; consistent naming conventions; thorough documentation; and a strong focus on using the pipe operator to create data analysis “recipes.” When packages are focused on a particular problem, many of the data munging steps can be encapsulated so that data analysis reads like a cake recipe:

county_daily_avg <- load() |> filterDate() |> filterCounty() |> collapse() |> dailyStatistic() |> getData()

Designing packages for recipe-style analysis will greatly improve the lives of those for whom data analysis is a task rather than a vocation.
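
The design choice above can be sketched with hypothetical verbs: each function takes and returns the same kind of data object, so steps chain cleanly with a pipe.

```r
# Hypothetical recipe-style verbs: each takes a data frame and returns one.
filterAbove <- function(d, threshold) d[d$value > threshold, , drop = FALSE]
dailyMean   <- function(d) aggregate(value ~ day, data = d, FUN = mean)

d <- data.frame(day = c(1, 1, 2, 2), value = c(5, 50, 10, 40))

d |>
  filterAbove(threshold = 8) |>
  dailyMean()
#>   day value
#> 1   1    50
#> 2   2    25
```

Because every verb shares the same input and output shape, users can reorder, drop, or extend steps without learning each function's internals.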


Lily Clements - Calculating CO2 equivalent (CO2e) emissions in R

Co-authors: David Stern and Danny Parsons

The last seven years have been the warmest on record due to increased levels of greenhouse gases. Additional effects include rising sea levels, extreme weather, and shrinking glaciers. As part of an initiative to be more environmentally responsible, the UK-based community interest company IDEMS International set itself the task of achieving carbon neutrality by estimating its current emission levels and then offsetting those emissions by supporting projects that reduce CO2 levels.

Current calculators are generally transparent, but their authors have often made decisions on details that cannot be easily amended. Additionally, these calculators cannot be easily tailored to different environments (e.g., emissions can differ depending on the region). To estimate the carbon emissions at IDEMS International, we developed the carbonr R package. There is complexity in estimating emissions, but open discussions can take place on GitHub to capture different components and viewpoints and subsequently inform the R functions. This allows complexity to be added to the functions to meet the needs of other users.

We have additionally written a calculator in Shiny to ensure that the tools to estimate CO2e emissions are accessible beyond those experienced in R. The dashboard also contains suggestions to offset emissions.

In this talk, I will outline the R package and run through the Shiny dashboard. I will also report on institutional changes that have come from the calculator.

Josue M. Polanco-Martinez - `RolWinMulCor`: An R package for estimating rolling window multiple correlation in ecological time series

RolWinMulCor estimates the rolling window correlation for bi- and multivariate cases between regular time series, with particular emphasis on ecological data. It is based on the concept of rolling, running, or sliding window correlation, which is useful for evaluating the evolution and stability of correlation over time. RolWinMulCor contains six functions to estimate and to plot the correlation coefficients and their respective p-values. The first two, rolwincor_1win and rolwincor_heatmap, focus on the bivariate case: they estimate the correlation coefficients and p-values for a single window-length (time-scale) or for all possible window-lengths (or a band of window-lengths), respectively. The next two functions, rolwinmulcor_1win and rolwinmulcor_heatmap, are designed to analyze the multivariate case, displaying results in the same way as the bivariate case; the two approaches are methodologically different, however, as the multivariate case estimates adjusted coefficients of determination rather than correlation coefficients. The last two functions, plot_1win and plot_heatmap, graphically represent the outputs of the four aforementioned functions as simple plots or as heat maps. The functions contained in RolWinMulCor are highly flexible, offering several parameters to control the estimation of correlation and the features of the plot output. The package also provides examples with synthetic and real-life ecological time series to illustrate its use.
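The core computation can be sketched in a few lines of base R (a concept illustration only, not RolWinMulCor's API, which adds p-values, heatmap output, and the multivariate case):

```r
# Rolling window Pearson correlation between two series (concept sketch):
# one coefficient per window position, for a single window length w
roll_cor <- function(x, y, w) {
  sapply(seq_len(length(x) - w + 1), function(i) {
    idx <- i:(i + w - 1)
    cor(x[idx], y[idx])
  })
}

x <- sin(seq(0, 4 * pi, length.out = 120))
y <- cos(seq(0, 4 * pi, length.out = 120))
r <- roll_cor(x, y, w = 30)  # 91 coefficients, one per window
```

Plotting such coefficients against the centre of each window shows how the association evolves over time; repeating the computation over a band of window lengths gives the heatmap view.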

Peder Engelstad - `climatchR`: An R package for climate matching with high-throughput applications

Co-authors: Richard A. Erickson, Catherine S. Jarnevich, Helen R. Sofaer, and Wesley M. Daniel

Invasive species are those that spread and cause negative impacts when introduced beyond their native ranges, and anticipating invasions can guide regulatory decisions and biosurveillance strategies. Climatic similarity is an important predictor of whether and where a species may establish if accidentally or intentionally released by humans. Climate matching compares climatic conditions between a species’ current geographic range and a focal location to evaluate potential climatic suitability. Climate matching is a key component of horizon scanning, a rapid risk screening wherein lists of species are evaluated for their potential threat to an uninvaded region. Previous horizon scans have lacked the ability to rapidly assess the climate match of the numerous species that comprise the global pool of potential invaders, often limiting the total number of species considered. Proprietary software and browser-based solutions implementing the widely used CLIMATCH algorithm have been developed but remain limited in their scope and accessibility. As a solution, we present climatchR, a novel R package implementing CLIMATCH. The package automates a reproducible climate matching workflow, from downloading data to summarizing species’ climate matches. climatchR is also designed to integrate high-throughput computing, providing users with the ability to rapidly generate climate match scores for thousands of species. Here we highlight a recent use case of the processing efficiency and utility of climatchR in a horizon scan involving over 10,000 species.

Session 10, Building the R Community 1

Session chair: Stephen Balogun

Guillaume Desachy - Successfully building a vibrant community of R users at AstraZeneca: Lessons learned!

Co-authors: Vera Hazelwood, Abhijit Das Gupta, and Parth Shah

In the past few years, there has been a true paradigm shift in the use of R in the pharmaceutical industry. Until very recently, one had to choose between R and SAS. Nowadays, statisticians are trained in both languages.

With this in mind, at AstraZeneca we have built on the growing interest in R, across every stage of drug development and company-wide. Since April 2021, we have launched several internal initiatives aimed at federating the community of R users within AstraZeneca. We started by stealing with pride a public initiative, TidyTuesday, and making it our very own, calling it #azTidyTuesday. On a bi-weekly basis, we creatively promote publicly available datasets to the community of AstraZeneca R users. This is done by aligning the #azTidyTuesday editions with either an AZ value or an ongoing internal or external event (e.g., Pride Month, IPCC report release, COP26). We also put in place R Subject Expert Mentors whom beginners can reach out to with questions. And in early 2022, we held the first AstraZeneca R Conference.

While building this community, we tried many things. Some worked really well from the very beginning; some required improvements and modifications. But all these initiatives bore fruit, as the number of members in the R users group more than tripled in just over 6 months, and the diverse community of R users is becoming more and more vibrant.

Ning Leng - The R Consortium R Submission Pilot 1 to FDA

Co-authors: Heng Wang, Yilong Zhang, Peikun Wu, Mike Stackhouse, Eli Miller, and Joe Rickert

On 22 November 2021, the R Consortium R Submissions Working Group successfully submitted an R-based test submission package through the FDA eCTD gateway. FDA staff were able to reproduce the numerical results.

This submission, an example package following eCTD specifications, included a proprietary R package, R scripts for analysis, R-based analysis data reviewer guide, and other required eCTD components.

To our knowledge, this is the first publicly available R-based or open-source-language-based FDA submission package. We hope that our materials and what we learned can serve as a good reference for future R-based regulatory submissions from different sponsors.

To bring an experimental clinical product to market, electronic submission of data, computer programs, and relevant documentation is required by health authority agencies from different countries. In the past, submissions have been mainly based on the SAS language. In recent years, the use of open-source languages, especially the R language, has become very popular in the pharmaceutical industry and research institutions. Although the health authorities accept submissions based on open-source programming languages, sponsors may be hesitant to conduct submissions using open-source languages due to a lack of working examples. Therefore, the R Consortium R Submissions Working Group aims to provide such examples as part of its focus on improving practices for R-based clinical trial regulatory submissions.

Njoki Lucy - Building an R-Ladies community during the COVID-19 pandemic

Co-presenters: Faith Musili, Margaret Wanjiru, and Shelmith Kariuki

R-Ladies Nairobi was started in May 2020, two months after the first COVID-19 case was announced in Kenya. Since then, the chapter has grown to 1000+ members.

How did we do it? During 2019 - 2020, four ladies (i.e., Faith, Njoki, Maggie and Shel) each had an intention of starting an R-Ladies Nairobi meetup chapter. One of us had recently been an R-Ladies curator and another had attended the useR! 2019 conference, where they met other amazing R-Ladies. After several discussions among ourselves, the chapter was launched.

In this talk, we’ll highlight a few things, including (i) the steps we took to officially launch the chapter, (ii) the activities we’ve had so far (including collaborations with other communities), (iii) how we source speakers, (iv) our social media presence, (v) how we ensure the success of the meetups, (vi) what motivates us, (vii) our latest achievements, and lastly our future plans.


David Smith - Easy R tutorials with Dev Containers

If you’ve ever published a blog post or tutorial on R, or hosted a workshop using R, you know that the experience of the reader or participant is rarely as smooth as your content suggests. The main difficulty lies in configuration: the environment in which you ran the R code will differ from the reader’s unless you provide detailed setup instructions and they follow them to the letter. In my experience, this is very unlikely.

Enter Dev Containers, an extension to Visual Studio Code that allows you to develop your R code within a container as if it were your local machine. With Dev Containers, you can set up your R version, packages, and companion software, and test your R code in a known-good environment. Dev Containers makes it easy to share that container with your readers/participants, so you can be sure that they will have the same experience running the code as you did developing it, with zero setup. In addition, if they have access to GitHub Codespaces, they can run your R code in the cloud with no additional configuration.
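A minimal devcontainer.json makes the setup concrete (the image name and extension ID below are illustrative examples, not part of the talk; check current tags before use):

```json
{
  "name": "r-tutorial",
  "image": "ghcr.io/rocker-org/devcontainer/r-ver:4.2",
  "customizations": {
    "vscode": {
      "extensions": ["REditorSupport.r"]
    }
  }
}
```

Committing this file under .devcontainer/ in the tutorial repository is enough for both VS Code and GitHub Codespaces to recreate the same R environment.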

Tuesday, 21 June 2022, 1:00 - 2:15pm CDT

Session 11, Containerization and Metaprogramming

Session chair: Robin Gower

Konrad Krämer - Translate R to Cpp

The ast2ast package aims to translate an R function into a C++ function; an external pointer to the C++ function is returned to the user. The motivation for the package: it is often cumbersome to use an R function in applications that must call it very often (more than 100 calls), such as ODE solving or optimization. A possible solution is to write the function in a faster programming language such as C, but learning such languages is difficult and time-consuming. ast2ast is therefore a decent alternative, as the function can be written in R. Moreover, ast2ast C++ objects can communicate with Rcpp, RcppArmadillo, and raw pointers.
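Usage is roughly as follows; this is a hedged sketch based on the package's documented translate() entry point, and argument details may differ between versions (not run here, as it requires a compiler toolchain):

```r
# Sketch of ast2ast usage (translate() per the package docs)
library(ast2ast)

f <- function(x) {
  # an R function that a solver or optimizer would call thousands of times
  x * 2
}

f_ptr <- translate(f)  # external pointer to the compiled C++ version of f
```

The returned pointer can then be handed to C++-level callers (e.g., ODE solvers) without paying R's per-call overhead.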

Peter Solymos - Best practices for Shiny apps with Docker

Shiny has established itself as a trusted framework to quickly create proof-of-concept and business-critical applications alike. Along with the diversification of use cases, there is a growing need to understand the available hosting options and somehow find the best one. Today, there are more than 20 ways to host a Shiny app. Half of the hosting options involve the use of Docker container technology. All the general advantages of containerized applications apply to Shiny apps. Docker provides isolation to applications. Images are immutable: once built, they cannot be changed, and if the app is working, it will work the same in the future. Another important consideration is scaling. Shiny apps are single-threaded, but running multiple instances of the same image can serve many users at the same time.

Besides outlining the benefits of dockerized Shiny applications, I will also review best practices for handling dependencies, building images, security, caching, CI/CD pipelines, etc. I will close the talk by introducing the Hosting Data Apps website, where I publish reviews and tutorials to help R users and Shiny developers learn more about hosting Shiny apps with Docker.
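As a concrete starting point, a dockerized Shiny app can be as small as the following Dockerfile (base image tag and package list are illustrative; pinning exact versions is what buys the immutability described above):

```dockerfile
# Pin the base image so rebuilds are reproducible
FROM rocker/shiny:4.2.1

# Install app dependencies at build time, not at run time
RUN install2.r --error dplyr ggplot2

# Copy the app into the location shiny-server serves from
COPY app/ /srv/shiny-server/app/

EXPOSE 3838
```

Running multiple containers from this one image behind a load balancer is the standard way to scale past Shiny's single-threaded process model.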

Jamie Lentin - Meta-programming in R: The `gadget3` model framework

Co-authors: Bjarki Elvarsson and Will Butler

Metaprogramming is the practice of treating code as data: programs can analyse and modify other code. These techniques are the core of the gadget3 package, a modeling framework designed for marine ecosystems. The metaprogramming techniques it uses allow you to produce parameterized models with multiple interacting species and fishing fleets. A model can then be transformed into a TMB objective function without writing any C++, utilising TMB’s automatic differentiation abilities to speed up optimisation. An equivalent R function can also be generated to use the model elsewhere.

This talk will present some of the techniques and tools available to manipulate R code from within R itself, and what this enables you to do. We will demonstrate the basic usage of gadget3, what can be done with it, and the advantages metaprogramming gives us.
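Base R already provides the primitives for treating code as data, which the following self-contained sketch demonstrates (gadget3's machinery is far more elaborate, but rests on the same idea):

```r
# Build an R expression as a data object
model_body <- quote(a * x + b)

# Substitute concrete values for the parameter symbols a and b
fitted_body <- do.call(substitute, list(model_body, list(a = 2, b = 1)))

# Turn the modified expression into a callable function of x
f <- function(x) NULL
body(f) <- fitted_body

f(3)  # 2 * 3 + 1 = 7
```

The same pattern, generating and rewriting expressions before they are ever evaluated, is what lets gadget3 emit both a TMB objective function and an equivalent plain R function from one model description.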

Alex Gold - Docker for data science

If you’re a practicing data scientist and R user, you’ve almost certainly heard of Docker–and maybe you’ve heard that you should be using it. But you might not have a great idea of what it really is or how it could be helpful to your day-to-day work.

In this talk, you’ll learn the concepts behind Docker, discover some Docker workflows that are particularly good for R-based work, and get an intro to the basic commands to manage Docker containers. By the end of this talk, you will be able to pinpoint whether and how Docker can be useful for you and your work, leaving you ready to get started!

Session 12, Inference Methods

Session chair: Janith Wanniarachchi

Max Welz - Generic machine learning inference on heterogeneous treatment effects using the package GenericML

Co-authors: Andreas Alfons, Mert Demirer, and Victor Chernozhukov

Recent developments have proposed the use of machine learning (ML) based methods for the estimation of treatment effects and potential heterogeneity therein. In particular, Chernozhukov, Demirer, Duflo and Fernández-Val (2020) propose a nearly assumption-free generic ML framework for estimation and uniformly valid inference on heterogeneous treatment effects in randomized experiments, which is also valid in high-dimensional settings. The GenericML package implements this framework while retaining a high degree of user flexibility. GenericML enables the specification of a wide variety of ML methods via the mlr3 ecosystem of Lang et al. (2019), supports nearly all types of randomized experiments, and allows for the customization of all components of the generic ML framework. GenericML follows a clear object-oriented design and takes advantage of parallel computing to reduce computing time. It provides rich methods for printing and plotting so that potential treatment effect heterogeneity along every supplied variable can be easily identified, both by means of inference and visualization. In addition, the package strictly adheres to a high standard of user-friendliness in its functionality and documentation with the goal of being easily usable for researchers in any domain. We will demonstrate GenericML in an example on the impact of a microcredit program in Morocco.

Chernozhukov, Demirer, Duflo and Fernández-Val (2020):

Lang et al. (2019):

Guillemette Marot - Variable selection with Multi-Layer Group-Lasso

Co-authors: Quentin Grimonprez, Samuel Blanck, and Alain Celisse

The MLGL (Multi-Layer Group-Lasso) R package implements a new procedure of variable selection in the context of redundancy between explanatory variables, a situation that commonly arises with high-dimensional data. A sparsity assumption is made–that is, only a few variables are assumed to be relevant for predicting the response variable. In this context, the performance of classical Lasso-based approaches strongly deteriorates as the redundancy strengthens.

The proposed approach combines variables aggregation and selection in order to improve interpretability and performance. First, a hierarchical clustering procedure provides at each level a partition of the variables into groups. Then, the set of groups of variables from the different levels of the hierarchy is given as input to group-Lasso, with weights adapted to the structure of the hierarchy. At this step, group-Lasso outputs sets of candidate groups of variables for each value of regularization parameter.

The versatility offered by MLGL to choose groups at different levels of the hierarchy a priori induces a high computational complexity. MLGL, however, exploits the structure of the hierarchy and the weights used in group-Lasso to greatly reduce the final time cost. The final choice of the regularization parameter–and therefore the final choice of groups–is made by a multiple hierarchical testing procedure.

John Ferguson - Causal analysis in R: Using Bayesian network models to predict the impact of public health interventions on disease prevalence in population health with population attributable fractions

Co-author: Maurice O’Connell

This talk introduces causal analysis of population attributable fractions (PAF) in R using graphPAF. graphPAF is intended to facilitate analysis of large real-world epidemiological data structures linking risk factors (such as smoking or pollution) to disease. It focuses on estimation and display of different types of PAF and impact fractions which measure the disease burden attributable to risk factors and can subsequently be used to prioritise public health interventions that best prevent disease on a population level.

For certain analyses, graphPAF assumes that risk factors, confounders and disease are causally linked via an expert Bayesian network model. Users can specify their causal knowledge regarding causal pathways linking risk factors to disease, which is then incorporated into this network model and subsequent estimation. This network-based approach will in many cases answer questions causally and generate results that are less biased and more informative than previous regression-based approaches.

In particular, graphPAF can:

• estimate PAF and impact fractions for discrete risk factors
• estimate and plot PAF for continuous risk factors
• estimate Pathway Specific PAF (PS-PAF)
• estimate joint PAF over several risk factors
• estimate average PAF and sequential PAF
• construct PAF fan-plots and PAF nomograms

This talk will appeal to statisticians, epidemiologists and data scientists who wish to answer questions causally in R with both simple and advanced causal modelling techniques applied at a population level.

Aymeric Stamm - `flipr`: FLexible Inference via Permutations in R

Co-authors: Alessia Pini and Simone Vantini

The goal of the flipr package is to provide a flexible framework for inference via permutation. The idea is to promote the permutation framework as a tool for inference on complex data. You supply your data, as complex as it might be, in lists where each entry stores a data point in a predetermined representation that you choose; flipr provides you with point estimates, confidence regions, or p-values of hypothesis tests. Permutation tests are especially appealing because (i) they only require exchangeability of the data and (ii) they are exact no matter how small your sample sizes are. You can also use the so-called non-parametric combination approach in this setting to combine several statistics to better target the alternative hypothesis you are testing against. Asymptotic consistency is guaranteed under mild conditions on the statistics you use.
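A bare-bones two-sample permutation test on the difference of means illustrates the mechanics (flipr generalizes the statistic, the data representation, and the outputs far beyond this base R sketch):

```r
# Two-sample permutation test for a difference in means (concept sketch)
perm_test <- function(x, y, B = 999) {
  obs <- mean(x) - mean(y)
  pooled <- c(x, y)
  m <- length(x)
  perm_stats <- replicate(B, {
    idx <- sample(length(pooled), m)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # counting the observed statistic keeps the p-value strictly positive
  (1 + sum(abs(perm_stats) >= abs(obs))) / (B + 1)
}

set.seed(1)
p <- perm_test(rnorm(15), rnorm(15, mean = 2))  # clearly separated groups
```

Under exchangeability, every relabelling of the pooled observations is equally likely under the null, which is all the validity argument requires.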

The flipr package is intended as a low-level implementation of the permutation framework in the context of statistical inference. For now, it focuses on the two-sample problem. The mathematical object behind the scenes is the so-called plausibility function (implemented as an R6Class in flipr), sometimes called the p-value function, which is a curve that represents the variation of the p-value of a hypothesis test as the null parameters vary.

nevada, a child package, deals with the statistical analysis of populations of networks. Other child packages are in development.

Session 13, Interfaces with C, C++, Rust, and V

Session chair: Jason Cory Brunson

Charlie Gao - R's C interface: Perspectives from wrapping a C library

There has been a trend in recent years to integrate R with compiled code–mostly C++–with this combination presented as a relatively easy-to-implement “performance fix” for slow-running code. R’s native C interface, on the other hand, has received comparatively little attention, and it has a somewhat mythical reputation of being difficult to use and error-prone.

This talk introduces the C API as not only a viable but a flexible interface for combining R with some of the most performant, best-in-class software libraries available, often written in C. I draw on my own experience of creating nanonext, which wraps NNG (Nanomsg Next Gen), a high-performance socket library and concurrency framework considered a successor to ZeroMQ.

Main themes to be explored include: (i) how to translate a C API into an idiomatic R API, (ii) harnessing the power of the external pointer, (iii) producing a fully portable and CRAN-ready package, and (iv) engaging with and contributing back to upstream.

The aim is to inspire others to try out the C API and ultimately enrich the R ecosystem with better-wrapped libraries.

Jonathan Berrisch - An introduction to Rcpp modules

Co-author: Florian Ziel

Rcpp modules are an R interface to C++ classes. Compared to functions, classes offer more flexibility. However, package maintainers often struggle to design small distinct functions, particularly for C++. The main problem is that functions usually do not share data, so functions must pass common data from one to another. The latter is incredibly tedious in C++.

Consequently, packages often end up with large C++ functions exposed to R via Rcpp. This isn’t good for several reasons:

  • Debugging becomes harder.
  • Your R function becomes a black box for users unfamiliar with C++.
  • R users cannot simply execute parts of that function, modify data, and continue the computation.

Classes solve this issue by bundling data and functions (called methods) together. These C++ methods can be arbitrarily small and have direct access to the data. The methods can then be called from within R using Rcpp modules–that is, your high-level R function can create an instance of the exposed class and subsequently call its methods. This offers excellent transparency and flexibility to the R user, who can analyze (or even manipulate) the data between subsequent method calls.
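A minimal module sketch shows the shape of this design (it compiles only inside an Rcpp-enabled package or via Rcpp::sourceCpp(), so it is illustrative rather than standalone):

```cpp
#include <Rcpp.h>

// A C++ class bundling state (sum_, n_) with small methods that share it
class RunningMean {
  double sum_ = 0.0;
  int n_ = 0;
public:
  void update(Rcpp::NumericVector x) {   // feed in one batch of data
    sum_ += Rcpp::sum(x);
    n_ += x.size();
  }
  double value() const {                 // inspect the state at any point
    return n_ > 0 ? sum_ / n_ : NA_REAL;
  }
};

// Expose the class to R; from R: m <- new(RunningMean); m$update(1:10); m$value()
RCPP_MODULE(running_mean_module) {
  Rcpp::class_<RunningMean>("RunningMean")
    .constructor()
    .method("update", &RunningMean::update)
    .method("value", &RunningMean::value);
}
```

Each method call from R can be interleaved with ordinary R code, which is exactly the transparency the large monolithic function design lacks.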

We apply this design in online, the primary function of the profoc package. Profoc implements CRPS learning, an aggregation algorithm for probabilistic forecasting recently published in the Journal of Econometrics.

This talk addresses Rcpp beginners. A basic understanding of Rcpp is expected.

CRPS learning:

David B. Dahl - Writing R extensions in Rust

This talk complements *Writing R Extensions*, the official guide for writing R extensions, for those interested in developing R packages using Rust. It highlights idiosyncrasies of R and Rust that must be addressed by any integration and describes how to develop Rust-based packages that comply with the CRAN Repository Policy. The talk introduces the cargo framework, a transparent Rust-based API that wraps commonly used parts of R’s API with minimal overhead and allows a programmer to easily add additional wrappers.

Edwin de Jonge - rvee: Recreational V programming for R

V, or vlang, is a simple, safe, and fast programming language with the speed of C. It compiles to C or JavaScript and can be used for low- and high-level programming tasks.

R has an excellent track record in interfacing with other programming languages (e.g., C, Fortran, C++, Java, Python, and Rust among others).

The rvee R package provides the means to create R extension packages with the programming language V. It implements a V wrapper for the R API and can generate the necessary R package wrapper code from V code annotations, in the same spirit as Rcpp does for the C++ language.

Session 14, Learning ggplot2

Session chair: Patrick Weiss

Jonathan Carroll - `ggeasy`: Easy access to `ggplot2` commands

Contributors to the package: Alicia Schep, Jonathan Sidi, Bob Rudis, Mohamed El Fodil Ihaddaden, and Thomas Neitmann

The ggplot2 R package offers an immensely flexible system for generating data visualisations, elegantly implemented with a sophisticated and well-thought-out system of commands and arguments, to cover almost any configuration a user may wish to generate. A frequent complaint in the community of regular users is that it is extremely easy to forget how to perform common transformations and what should be simple tasks, such as “rotate the x-axis labels.” To this end, ggeasy offers a suite of shortcuts to the theme() arguments of ggplot2, named so that auto-complete guides the user towards their goal, such as easy_rotate_x_labels(). This is paired with an option to present the “canonical” command that achieves the goal (i.e., the very easy-to-forget theme(axis.text.x = element_text(angle = 90, hjust = 0))). This talk will feature a demonstration of ggeasy, our motivation for producing it, and our intentions for future updates.
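In context, the shortcut reads as follows (assuming ggplot2 and ggeasy are installed; mpg is ggplot2's built-in example dataset):

```r
library(ggplot2)
library(ggeasy)

ggplot(mpg, aes(manufacturer, hwy)) +
  geom_boxplot() +
  easy_rotate_x_labels()  # the shortcut for the theme() incantation above
```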

Presentation slides:

Nicola Rennie - Learning `ggplot2` with generative art

Generative art, the practice of creating art with code, is becoming ever more popular. Rtistry, the name often given to generative art when the language of choice is R, might not be the most obvious approach to learning how to use ggplot2. However, there’s a lot we can learn from it and take into our everyday work.

Many of the data visualizations that data scientists produce in practice tend to fall into one of the basic categories–scatter plots, line plots, or histograms–and we tend to spend quite a bit of time preparing our data beforehand, to get it into the “right” format. And so we never discover some of the intricacies beneath the functions we use.

Much of generative art relies on messier data, randomness, disorder, and unusual structures–things that, as data scientists, we often try to remove from our data before we visualize it. But when we think about our data, and therefore the visualization process, from a different perspective, we can learn new things about the tools we rely on every day. This talk will highlight a few of the more subtle differences in some aspects of ggplot2 discovered through experimenting with Rtistry. It will also highlight how we might exploit these subtleties to make more informative “data science” plots. You might even become convinced that being an Rtist could make you a better data scientist.

James Otto - `ggdensity`: Improved bivariate density visualization in R

Co-author: David Kahle

A popular strategy for visually summarizing bivariate data is plotting contours of an estimated density surface. Most commonly, the density is estimated with a kernel density estimator (KDE), and the plotted contours correspond to equally spaced intervals of the estimated density’s height. Notably, this is the case for geom_density_2d() and geom_density_2d_filled() from ggplot2. The proposed ggdensity package extends ggplot2, providing more interpretable visualizations of bivariate density estimates using highest density regions (HDRs). geom_hdr() and geom_hdr_lines() serve as drop-in replacements for the aforementioned ggplot2 functions, plotting density contours that are chosen to be inferentially relevant. By default, they plot the smallest regions containing 50%, 80%, 95%, and 99% of the estimated density (the HDRs). ggdensity also implements the estimation and plotting of HDRs resulting from estimators other than the standard KDE; densities can be estimated by histograms, by frequency polygons, or by fitting a parametric bivariate normal model. Also included are the functions geom_hdr_fun() and geom_hdr_fun_lines() for plotting HDRs of user-specified probability density functions. This allows for the plotting of a much larger class of HDR estimators than the four available for geom_hdr(). Users can specify and estimate arbitrary parametric models, providing the resulting pdf estimates to geom_hdr_fun() for contouring.
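For example, the drop-in replacement amounts to the following (assuming ggdensity is installed; faithful is a built-in dataset):

```r
library(ggplot2)
library(ggdensity)

ggplot(faithful, aes(eruptions, waiting)) +
  geom_hdr()  # 50%, 80%, 95%, and 99% highest density regions by default
```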

June Choe - Stepping into `ggplot2` internals with `ggtrace`

The declarative semantics of ggplot2, which allow users to build up a plot “layer by layer,” are a widely celebrated breakthrough for making code-based data visualizations more accessible and scalable. However, the inner workings of ggplot2 are overwhelmingly foreign even to experienced users, precisely because its internal object-oriented system is hidden from the user-facing functions. This presents a challenge for the future of the ggplot2 ecosystem, which depends heavily on community-driven development of extension packages: the fact that ggplot2 internals cannot be learned through the use of ggplot2, by design, unintentionally renders this knowledge inaccessible to aspiring developers.

Luckily, a unique strength of the R community is that many useRs are self-taught and excel in learning through trial and error. Thus, a promising solution to this knowledge gap is to arm users with a tool that lets them interact with and explore the execution pipeline of a ggplot object. The ggtrace package offers this capability by allowing users to inject and log arbitrary expressions into any step of the pipeline through the low-level functions ggtrace() and with_ggtrace(). Additionally, ggtrace aims to expose ggplot internals in familiar functional programming terms through a family of high-level “workflow” functions in the form of ggtrace_{action}_{value}, which can inspect the value of local variables, capture a copy of a ggproto method’s runtime environment, and even hijack the return value of a method. We hope that this reimagining of ggplot internals through ggtrace can empower aspiring and experienced developers alike.

Session 15, Machine Learning

Session chair: Guillaume Desachy

Bernardo Lares and Igor Skokan - `Robyn`: Continuous and semi-automated marketing mix model from Meta Marketing Science

Robyn is an experimental, semi-automated, and open-source marketing mix modeling (MMM) package from Facebook Marketing Science. It uses various machine learning techniques (ridge regression with cross-validation, a multi-objective evolutionary algorithm for hyperparameter optimization, time-series decomposition for trend and season, gradient-based optimization for budget allocation, etc.) to define media channel efficiency and effectiveness and to explore adstock rates and saturation curves. It’s built for granular datasets with many independent variables and is therefore especially suitable for digital and direct response advertisers with rich data sources.

Niklas Koenen - Interpreting deep neural networks with the R package `innsight`

Co-author: Marvin N. Wright

In the last decade, deep neural networks have established themselves in almost all areas of research, industry and public life. Several interpretability methods have been proposed, including so-called feature attribution methods. A major limitation so far is that most of these are Python-exclusive and not directly available in R, although the most popular deep learning libraries (such as Keras, Tensorflow or PyTorch) are becoming increasingly embedded in R.

We present the R package innsight, providing the most common feature attribution methods in a unified framework. Compared to implementations in Python (e.g., iNNvestigate or Captum), innsight stands out in two ways. First, it works independently of the deep learning library the model was trained with; consequently, it is possible to interpret neural networks from any R package, including keras, torch, neuralnet, and even custom models. Despite its high flexibility, innsight benefits internally from the fast and efficient array calculations of the torch package, which is built directly on libtorch (PyTorch’s C++ backend) without a Python dependency. Second, innsight offers a variety of visualization methods for interpreting tabular data, time series, and image data. With the help of the plotly package, these can even be rendered interactively. In summary, we present a flexible, fast, and user-friendly neural network interpretability package.

Marvin N. Wright - Testing conditional independence in supervised learning algorithms with the `cpi` package

Co-authors: Kristin Blesch and David S. Watson

We introduce a flexible R package for computing conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. The method is highly modular and works in conjunction with any valid knockoff sampler (e.g., those in the knockoff package), supervised learning algorithm, and loss function. The package is built on top of the mlr3 ecosystem, which provides a unified interface for selecting models and risk estimators. We implement tools for frequentist and Bayesian inference, allowing users to evaluate the magnitude, significance, and precision of CPI estimates. We demonstrate the method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. In summary, the cpi package provides a powerful model-agnostic and easy-to-use conditional independence test for supervised learning tasks.

Wednesday, 22 June 2022, 1:00 - 2:15pm CDT

Session 19, Panel Discussion: Graphical User Interfaces for R

Robert Muenchen - Graphical User Interfaces for R

Graphical User Interfaces (GUIs) help non-programmers analyze data without R code, help students learn R code, and help programmers speed R code development. This panel, moderated by Robert Muenchen, will feature representatives of seven GUI development teams who will provide information on each interface and answer questions from participants:

• BlueSky Statistics - Aaron Rangel

• JASP - Eric-Jan Wagenmakers

• jamovi - Jonathon Love

• R AnalyticFlow - Ryota Suzuki

• R Commander - John Fox

• RKWard - Thomas Friedrichsmeier and Meik Michalke

• R-Instat - David Stern

Session 20, Data Visualization

Session chair: Jonathan Berrisch

Harriet Mason - Teaching computers to see patterns in scatterplots with scagnostics

Co-authors: Di Cook, Ursula Lai, and Stuart Lee

As the number of dimensions in a data set increases, the process of visualising its structure and variable dependencies becomes more tedious. Scagnostics (scatterplot diagnostics) are a set of numerical measures that describe the visual features of scatterplots. These features can be used to identify interesting and abnormal scatterplots within high-dimensional datasets and thus give a sense of priority to the variables we choose to visualise. A set of scagnostics is implemented in the new cassowaryr R package, which provides a user-friendly method to apply these diagnostics to data. The set of scagnostics available in cassowaryr includes measures previously defined in the literature, as well as some new measures specific to this package. The scagnostics have been effective tools in both high-dimensional data visualisation and projection pursuit, which we illustrate respectively with examples from sports and astrophysics.
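A minimal sketch of the workflow, scoring scatterplots on a few visual features (function names follow the package documentation; the particular measures chosen here are illustrative):

```r
# Hedged sketch of cassowaryr usage on Anscombe's quartet
library(cassowaryr)

# Score a single scatterplot on selected scagnostics
calc_scags(anscombe$x3, anscombe$y3,
           scags = c("outlying", "monotonic", "stringy"))

# Score every pair of variables at once, to prioritise
# which scatterplots in a high-dimensional data set to inspect first
calc_scags_wide(anscombe, scags = c("outlying", "monotonic"))
```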

Paul Harrison - `langevitour`: A random or focussed tour of high-dimensional data at 60 FPS

langevitour is an HTML widget that tours 2D projections of a high-dimensional numerical dataset. The tour can proceed at random or be directed by the user in various ways. A particular application is examining the gene expression of thousands of individual cells from scRNA-seq data. The interactive display allows the user to gain a sense of the complex ways that cells are related to each other. Similar cells make continuous small motions together, a channel of visual information not possible in a static visualization.

Under the hood, langevitour essentially uses a physics simulation governed by Langevin dynamics to perform a random walk with momentum, sampling orthonormal projections of the data. A potential energy function can focus the random walk around “good” projections of the data (projection pursuit), or around variables or groups of interest. User controls operate by adjusting the parameters of the physics simulation and the energy function.

langevitour can be used directly from R or included in a self-contained HTML document using R Markdown.
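Getting started takes one call (this mirrors the package's basic usage; the iris example is illustrative):

```r
# Minimal langevitour example: tour the four numeric measurements
# in iris, coloured by species. The result is an HTML widget that
# displays in the RStudio viewer or a browser, and can be embedded
# in an R Markdown document.
library(langevitour)

langevitour(iris[, 1:4], group = iris$Species)
```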

H. Sherry Zhang - A new tidy data structure to support exploration of multivariate spatio-temporal data

Co-authors: Dianne Cook, Patricia Menéndez, Ursula Laa, and Nicolas Langrené

Spatio-temporal data refer to measurements taken across space and time. In practice, spatio-temporal data can be decomposed into spatial and temporal components: at one time, we would select a spatial location and inspect the temporal trend; at another time, we might select one or more time values and explore the spatial distribution. Ideally, we could make multiple maps and multiple time series to explore these together; however, doing all of these actions is complicated when data arrive fragmented in multiple objects. To make it easy to do all these tasks, ideally spatial and temporal variables are in a single data object that we can slice and dice in different ways to conduct different visualisations. In this talk, we suggest a new data structure, cubble, to organise spatio-temporal data so that different types of information can be easily accessed for exploratory data analysis. cubble is also capable of handling data with a hierarchical structure, matching data from multiple sources, constructing interactive graphics, and performing spatio-temporal transformation. Data from Australian climate weather stations, river level, and climate reanalysis (ERA5) will be used to demonstrate cubble.

Presentation slides:

Stephanie Lussier - `dataxray`: An interactive table interface for data summaries

Co-author: Augustin Calatroni

Every time a dataset is created, either for data management purposes or for statistical analyses, it is imperative that each variable be reviewed to detect potential errors. Not only should the evaluation involve summary statistics and graphical displays, but it should also present the results in a thorough and succinct manner.

Originally developed a couple of decades ago, the Hmisc::describe function has been a useful tool for data exploration prior to analysis. Hmisc::describe provides key information about input datasets, including variable attributes and summary statistics, using a concise print method to create a static report (HTML or PDF). It also provides the ability to interface with SAS-formatted datasets, which remain widely used in the clinical research industry even as the R language continues to grow in popularity.

For some time now, we have wanted to provide a wrapper for the aforementioned describe function to provide a modern and interactive interface to the Hmisc::describe output. Utilizing the power of the reactable package embedded with plotly interactive figures within a flexdashboard, concise summaries of every variable in a dataset can be generated with minimal user configuration. In order for other users to readily deploy such a powerful summary table, we wrapped our work into the dataxray package.
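The relationship between the two layers might be sketched as follows (the dataxray function names are taken from the package README and are illustrative; the package may need to be installed from GitHub):

```r
# The static starting point: Hmisc's concise per-variable report
library(Hmisc)
describe(mtcars)   # counts, missings, distinct values, quantiles per variable

# The interactive wrapper: the same summaries in an explorable
# reactable with embedded plotly figures
library(dataxray)
mtcars |>
  make_xray() |>
  view_xray()
```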

Session 21, Parallel Computing

Session chair: Ilias Moutsopoulos

Mark Hornick - Exploiting data parallelism for R scalability

Scaling solutions for large-volume data can often be challenging. While some solutions require complex algorithm modifications to achieve parallelism and scalability, others can take advantage of more immediate data parallelism. The concept of data parallelism is often called out as addressing those “embarrassingly parallel” solutions: “embarrassing” because they’re so easy. A prime example is scoring data with a machine learning model. However, even with its conceptual simplicity, achieving a robust implementation can make production deployment more complex. Having ready-made and well-integrated infrastructure to support data parallelism can greatly reduce development overhead while improving the likelihood of project success.

In this session, you’ll learn about the data parallelism (and task parallelism) provided with Oracle Database through Oracle Machine Learning for R (OML4R). Define an R function, store it (and any related R objects) in the database, and have the database environment spawn and manage multiple R engines to enable scalability and performance. We’ll also demonstrate OML4R embedded R execution, which provides this capability.

Orcun Oltulu - Parallelization of variable selection in nonparametric regression

Co-author: Fulya Gokalp Yavuz

The nonparametric approach in modeling maintains its popularity with increasing momentum for statistical and machine learning methods. The increasing size of data and the number of variables make it necessary to develop variable selection methods that work effectively and fast. However, due to the nature of nonparametric methods, when the variable selection step is added, the calculations become cumbersome. This study accelerates the variable selection algorithm in nonparametric (kernel) regression with parallelization. The algorithm combines two steps in nonparametric regression: bandwidth selection for the Nadaraya-Watson estimator and variable selection. It consists of independent sequential calculations that iterate over each observation point, creating a high time cost as the dimensions of the dataset grow. We apply parallelization to these independent sequential calculations to reduce the elapsed time while keeping the same accuracy level. We construct a simulation design to compare results for different dimensions of the artificial data and different numbers of cores used in the parallelization. In the simulation, the parallelized methods in R show a significant gain in computation time while yielding identical accuracy measurements.
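The structure of the problem can be illustrated with a small, hypothetical sketch (not the authors' code): leave-one-out bandwidth selection for a Nadaraya-Watson estimator, where each per-bandwidth evaluation is independent and can be farmed out to worker processes with base R's parallel package.

```r
# Hypothetical sketch: parallel leave-one-out bandwidth selection
# for the Nadaraya-Watson estimator with a Gaussian kernel
library(parallel)

nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)        # kernel weights at evaluation point x0
  sum(w * y) / sum(w)             # weighted local average
}

loo_cv <- function(h, x, y) {
  # leave-one-out squared prediction error; each term is independent
  errs <- vapply(seq_along(x), function(i) {
    (y[i] - nw(x[i], x[-i], y[-i], h))^2
  }, numeric(1))
  mean(errs)
}

set.seed(1)
x <- runif(500); y <- sin(2 * pi * x) + rnorm(500, sd = 0.3)
grid <- seq(0.01, 0.2, length.out = 40)

cl <- makeCluster(2)
clusterExport(cl, c("nw", "loo_cv"))
cv <- parSapply(cl, grid, loo_cv, x = x, y = y)  # bandwidths scored in parallel
stopCluster(cl)

grid[which.min(cv)]  # selected bandwidth; identical to the serial computation
```

The same pattern extends to the variable selection step: each candidate variable subset is scored independently, so the outer loop parallelizes the same way.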

Pyry Kantanen - Experiences from CSC high-performance computing environments by handling synthetic national identification numbers

Co-author: Leo Lahti

National identification numbers (NINs) and other identification code systems form an often overlooked but important bedrock for managing and governing populations. The hetu and sweidnumbr R packages provide tools for validating and extracting information from Finnish and Swedish identification numbers.

As a form of function creep, the operational scope of NINs in both countries has spread from their original use in population registers to encompass most public sector data systems, and even many private sector CRM systems. NINs enable the linking of data from multiple registers, forming a basis for register-based official statistics production and register-based knowledge production in academia, among other uses. Even modest register-based studies can contain significantly more observations than traditional social scientific studies. Larger datasets prompt the need for high-performance computing services and code parallelization.

In our talk, we will discuss how the random NIN generation and checking functions found in the hetu package can be used for testing and learning purposes in the Puhti supercomputer environment, hosted by the Finnish CSC - IT Center for Science. The CSC remote environment differs from local R environments in that code is run via batch jobs, and efficient use of cores must be monitored by running scaling tests. Lessons learned from experimenting with and improving the hetu R package functions can be applied to other packages as well, contributing to joint efforts in building a shared FIN-CLARIAH research and methods infrastructure for the social sciences and humanities.

The sweidnumbr package is described in this preprint:

Henrik Bengtsson - Futureverse: Profile parallel code

In this presentation, I share recent enhancements that allow developers and end-users to profile R code running in parallel via the future framework. With these new, frequently requested features, we can study how and where our computational resources are used. With the help of visualization (e.g., ggplot2 and Shiny), we can identify bottlenecks in our code and parallel setup. For example, if we find that some parallel workers are more idle than expected, we can tweak settings to improve the overall CPU utilization and thereby increase the total throughput and decrease the turnaround time (latency). These new benchmarking tools work out of the box on existing code and packages that build on the future package, including future.apply, furrr, and doFuture.

The future framework, available on CRAN since 2016, has been used by hundreds of R packages and is among the top 1% of most downloaded packages. It is designed to unify and leverage common parallelization frameworks in R and to make new and existing R code faster with minimal effort from the developer. The futureverse allows you, the developer, to stay with your favorite programming style, and end-users are free to choose the parallel backend to use (e.g., on a local machine, across multiple machines, in the cloud, or on a high-performance computing (HPC) cluster).
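The framework in action looks like the following; because the new profiling tools work out of the box on code that builds on future, code like this needs no modification to be profiled:

```r
# Parallel map with future.apply; switching backends is one line
library(future.apply)

plan(multisession, workers = 2)    # parallel workers on the local machine

slow_sqrt <- function(x) { Sys.sleep(0.1); sqrt(x) }
y <- future_lapply(1:8, slow_sqrt) # runs across the two workers

plan(sequential)                   # identical code, now runs serially
```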

Session 22, Programming and Graphics Frameworks

Session chair: David B. Dahl

Sebastian Krantz - `collapse`: Advanced and fast statistical computing and data manipulation in R

collapse is a C/C++ based package that offers new possibilities for advanced statistical programming in R, in particular for complex problems involving grouped/weighted computations, time-series and panel data, and programs utilizing multiple different R objects and data structures. For these tasks it provides a large set of statistical functions that are fully vectorised along multiple dimensions (columns, groups, weights, time indexing and sweeping-out), together with low-level building blocks such as grouping objects, math by reference, and utilities for memory efficient programming. It is class-agnostic, supporting all of R’s basic data structures (vectors, factors, matrices, data frames, lists) and further popular ones ((grouped) tibble, data.table, sf, xts, pseries/pdata.frame). The package also provides a set of highly efficient data manipulation functions that can substitute (partly) for base R and tidyverse functions while taking full advantage of the fast statistical functions and computational backend of the package.

By providing new statistical possibilities and efficient algorithms in a way that accommodates a very broad range of R objects and programming styles, collapse enables the design of flexible and highly efficient programs for complex statistical computing problems, and adds a unique programming experience and sense of performance to the R ecosystem.
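A small taste of the grouped vectorisation described above (function names as documented; the iris example is illustrative):

```r
# collapse: grouped statistical functions and piped data manipulation
library(collapse)

# Grouped mean of all numeric columns in one vectorised call
fmean(iris[1:4], g = iris$Species)

# The same idea in the piped data-manipulation style
iris |>
  fgroup_by(Species) |>
  fsummarise(mean_sl = fmean(Sepal.Length),
             sd_sl   = fsd(Sepal.Length))
```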

Presentation recording:

Paul Murrell - Enriching the vocabulary of R graphics

At the heart of the R graphics system lies a graphics engine. This defines a graphics vocabulary for R: a set of possible graphics operations, like drawing a line, colouring in a polygon, or setting a clipping region. Graphics packages like ggplot2 allow users to describe a plot in terms of high-level concepts like geoms, scales, and aesthetics, but that high-level description has to be reduced to a set of graphics operations that the graphics engine can understand.

Unfortunately, the R graphics engine has a limited vocabulary. There are things that it cannot say, like “draw the outline of this text.”

In R 4.1.0 the vocabulary of the graphics engine was expanded to include gradient fills, pattern fills, clipping paths, and masks. This talk will describe recent work that expands the graphics engine vocabulary even further to include stroking and filling paths, isolated groups, compositing operators, and affine transformations.
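The R 4.1.0 additions are exposed through the grid package; a minimal example of the new gradient-fill vocabulary (requires R >= 4.1 and a graphics device that supports it, such as the cairo devices or pdf()):

```r
# Draw a rectangle filled with a linear gradient (new in R 4.1.0)
library(grid)

grid.newpage()
grid.rect(width = 0.6, height = 0.6,
          gp = gpar(fill = linearGradient(c("white", "steelblue"))))

# radialGradient() and pattern() express the other new fill types
```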

Hadley Wickham - R7: A new OOP system for R

R7 is a new OOP system for R that attempts to take the best parts of S3 and S4 and create a true successor to both. In this talk, I’ll introduce you to the main features of R7, showing you how you can create classes, generics, and methods. R7 has been designed to work seamlessly with S3 and S4, so you’ll also learn how to use R7 in conjunction with your existing OOP code.

R7 is being developed by the RConsortium Working Group on Object Oriented Programming, which includes representatives from R-Core, Bioconductor, RStudio, and the wider R community. While it’s currently prototyped as a GitHub package, the goal is to include it in a future release of R itself.

R7 is still under active development, so we are looking for people to try it out, give us feedback, and help us develop it.

Michael J. Mahoney - `unifir`: A unifying API for working with Unity in R

Co-authors: Colin M. Beier and Aidan C. Ackerman

This talk introduces unifir, a new package for using R to control the Unity video game engine for producing interactive 3D environments. unifir lets users write idiomatic R code to create immersive virtual environments in the Unity engine, translating R code into C# scripts and system commands to create terrain objects, place and manipulate 3D models, and create “player characters” that can move across the virtual space. The use of R allows these environments to be created more quickly and reproducibly than previous manual approaches, and allows users to produce these environments with minimal knowledge of the underlying rendering engine or the skills it requires.

This talk walks through the unifir package from two perspectives: users leveraging the package to produce “scenes,” and developers interested in the underlying mechanics. Users can take advantage of a number of scaffolding functions and permissively licensed 3D models from within standard R scripts, which can then be used to quickly produce reproducible scenes within the Unity engine. Developers may be interested in the use of the R6 class system, which enables easy extension of the base set of functionality provided by unifir directly, as well as the metaprogramming involved in translating instructions from R code to C# scripts and Unity configuration files. Overall this approach provides a framework for producing such visualizations and is already being used by the terrainr package to produce large-scale landscape visualizations.


Session 23, Publishing and Reproducibility

Session chair: Jane Ho

Laure Cougnaud - Clinical data review reporting tool

In clinical trials, the frequent review of safety data collected on patients is a key process. clinDataReview, an interactive reporting tool, has been developed to help medical monitors in the exploration of standard clinical data (CDISC SDTM format) collected during the trial. It displays overviews of patients for each domain of interest (e.g., enrollment, demography, adverse events, laboratory abnormalities), linked to patient-specific views (via the patientProfilesVis package). Tables of descriptive summary statistics in interactive and CSR-ready format (via the inTextSummaryTable package), interactive visualizations (treemap, sunburst, spaghetti plot, time profiles, boxplot) and listings (comparison of multiple data deliveries) are available.

The report consists of a set of self-contained HTML pages created from R Markdown “template” chapters. These chapters contain standard visualizations, tables and listings. Via a set of configuration files (in YAML format), both R and non-R users can easily tailor the chapters to specific datasets and analyses of interest for the particular study and disease. The combination of the study-specific configuration files with a fixed version of the R package(s) containing the template chapters and functionalities (stored in a version control system) ensures the full reproducibility and traceability of each analysis.

The documentation of each chapter’s input parameters (in Rd format) is generated automatically at package creation using JSON Schema, which is also used to validate the configuration/input parameters for a chapter. The tool will be demonstrated on a public clinical dataset.

Christophe Dervieux - A tour of `knitr` engines: `knitr` not only knits R

Co-author: Yihui Xie

In this talk, we’ll focus on available knitr engines. The different knitr engines are at the heart of the R package, as they evaluate the content of code chunks and output the results in a certain way. Although the R engine is the most common, there is a set of other engines that can be used to extend the usage of R Markdown documents and do more with less code. They can be used to run external software from a chunk, help build markdown content, or even pass content to other R packages. Let’s take a tour, looking at both the newest additions and existing ones.
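You can see the full set of engines that ship with knitr from R itself:

```r
# List the engines bundled with knitr -- R is just one of many
library(knitr)
names(knit_engines$get())
# includes e.g. "python", "bash", "sql", "Rcpp", "stan", "js", "css", "asis"

# In an R Markdown document, the engine is chosen by the chunk header,
# e.g. ```{bash} or ```{python} instead of ```{r}.
```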


Yihui Xie - Creating a blog (or website) with blogdown that will not be down

From 2018 to 2020, some users suffered a lot from creating and maintaining websites with blogdown, because there were too many possible factors that could affect the functionality of a site or even break it, such as the ever-changing Hugo site generator and its themes. Most of these issues were addressed in blogdown in late 2020. In this talk, I will share tips on how to create and maintain a stable website with blogdown. In particular, I will introduce the troubleshooting function blogdown::check_site(), and demonstrate a few stable Hugo themes (hugo-xmin, hugo-apero, and hugo-prose). Hopefully your blogdown site will never go down again. But if it does, you will know how to troubleshoot it.

Meike Steinhilber - Reliable scientific software development in R using the sprtt package as an example

Co-authors: Martin Schnuerch and Anna-Lena Schubert

The implementation and further development of statistical test procedures is an important contribution to scientific research. However, critical software development steps are often omitted or implemented only half-heartedly in the scientific context. This is particularly problematic since relevant decisions in research are made based on such statistical implementations. Using the sprtt package as an example, we will show what reliable software development with R can look like within scientific contexts. The sprtt package was developed for scientific purposes and allows the application of frequentist sequential tests (e.g., sequential t-tests). The package is conceived as a toolbox and will be developed continuously. This goal can only be achieved in the long term by using sustainable software practices.

Session 24, Web Frameworks

Session chair: Sanjay Kumar

Matthias Mueller - Serving insights to stakeholders using R and the Slack API

Slack has become omnipresent in many companies, as teams use it to foster collaboration and teamwork. For data scientists, this presents a unique opportunity to integrate and embed data analytics into conversations that are already happening: specifically, we can proactively serve insights to key stakeholders without needing them to actively seek out information. In this talk, Matthias will present how his team built a custom Slackbot with R that automatically serves insights to the greater organization.

Casper Hart - `detourr`: Interactive and performant tour visualisations for the web

The tour provides a useful vehicle for exploring high-dimensional datasets. It works by combining a sequence of projections (the tour path) into an animation (the display method). Current display implementations in R are limited in their interactivity and portability, and have poor performance and jerky animations even for small datasets.

We take a detour into web technologies, such as Three.js, WebGL, TensorFlow.js, and WebAssembly, that support smooth and performant tour visualisations. The detourr R package implements a set of display tools in TypeScript that allow for rich interactions (including orbit controls, scrubbing, and brushing) and smooth animations for large datasets. It provides a declarative R interface using htmlwidgets and supports linked views using crosstalk and R Shiny. The resulting animations are portable and accessible across a wide range of browsers and devices. It is designed to be extensible, allowing for the addition of custom display methods either from scratch or using existing display methods as a base.

John Coene - `Ambiorix`: A web framework inspired by express.js

Ambiorix is a web framework for R with a syntax inspired by express.js. It allows the building of any web service with a single syntax/DSL, including single-page and multi-page web applications, as well as RESTful APIs. It is fully featured and extensible (via middleware); there are already a number of extensions for logging, security, and more.
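A minimal application in the express.js style looks roughly like this (routes and port are illustrative; method names follow the package documentation):

```r
# Sketch of a minimal ambiorix application
library(ambiorix)

app <- Ambiorix$new()

app$get("/", function(req, res) {
  res$send("Hello from R!")
})

app$get("/api/mean", function(req, res) {
  res$json(list(mean = mean(mtcars$mpg)))
})

# start() blocks while the server runs, so only launch interactively here
if (interactive()) app$start(port = 3000L)
```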

Agustin Calatroni - Interactive dashboards without Shiny

Co-author: Stephanie Lussier

RStudio’s flexdashboard package is a powerful tool for creating interactive dashboards in R using R Markdown. A variety of layouts can be quickly generated, including multiple pages, storyboards, and commentaries, as well as embedded tabs and drop-down menus. Additionally, with minimal programming effort, the dashboards can be customized via prepackaged themes or custom CSS. Dashboards can be further extended for user interactivity with tables and visualizations by judicious use of HTML widgets to create a standalone HTML file with no special client or server requirements. In this talk, we will present a workflow utilizing flexdashboard and leveraging the abilities of other individual packages, such as trelliscopejs, plotly, DT, reactable, leaflet, and crosstalk, to create highly interactive clinical trial reports for data monitoring and/or statistical analysis results. By avoiding the use of Shiny, these reports can be conveniently emailed, deployed on an internal company webpage, or added to GitHub pages for widespread accessibility.
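The key ingredient is client-side linking of HTML widgets via crosstalk, so no Shiny server is needed; a minimal linked filter, plot, and table (the mtcars example is illustrative):

```r
# Linked widgets without Shiny: crosstalk + plotly + DT
library(crosstalk)
library(plotly)
library(DT)

shared <- SharedData$new(mtcars)   # one shared data source for all widgets

bscols(
  filter_slider("hp", "Horsepower", shared, column = ~hp),
  plot_ly(shared, x = ~wt, y = ~mpg, type = "scatter", mode = "markers"),
  datatable(shared)
)
# The result is plain HTML + JavaScript: email it or host it as a static page
```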


Thursday, 23 June 2022, 10:45am - 12:00pm CDT

Session 27, Data Crunching with R

Session chair: Edgar Manukyan

Patrice Godard - Managing and leveraging knowledge catalogs with `TKCat`

Research organizations generate, manage, and use more and more knowledge resources, which can be highly heterogeneous in their origin, their scope, and their structure. Making this knowledge compliant with F.A.I.R. (Findable, Accessible, Interoperable, Reusable) principles is critical for the generation of new insights leveraging it. The aim of the TKCat (Tailored Knowledge Catalog) R package is to facilitate the management of such resources, which are frequently used alone or in combination in research environments. In TKCat, knowledge resources are manipulated as modeled database (MDB) objects. These objects provide access to the data tables, along with a general description of the resource and a detailed data model (generated with ReDaMoR) that documents the tables, their fields, and their relationships. These MDBs are then gathered in catalogs that can be easily explored and shared. TKCat provides tools to easily subset, filter and combine MDBs and create new catalogs suited for specific needs. Currently, three different implementations of MDBs are supported by TKCat: in R memory (memoMDB), in files (fileMDB), and in ClickHouse (chMDB).


Miguel Alvarez - Database list against the matrix: Use of `taxlist` and `vegtable` for the assessment of vegetation-plot data

Vegetation-plot information, like any biodiversity record, cannot be efficiently stored in a single table template and usually requires relational models for data storage. Considering this fact, and the lack of an object class to handle vegetation plots in R, I developed vegtable. This R package is inspired by Turboveg 2 and includes functions for importing and exporting data as well as common data manipulation.

In this talk, I introduce some capabilities and functions of vegtable, including summaries, subsets, conversion of cover values, and export in cross-table formats. I also discuss the common practice of handling vegetation data in matrices against more efficient work with database lists.

An important dependency of vegtable is the package taxlist, which focuses on taxonomic lists, usually of plant species recorded in plots. The package taxlist is able to contain taxonomic ranks, parent-child relationships among taxa and used synonyms, allowing the harmonization of nomenclature for data collected from different data sources. While taxlist can be applied to groups of organisms other than plants, some recent works have demonstrated its use for syntaxonomic classifications. The possibility of including spatial information, electronic libraries, and lists of specimen vouchers will also be highlighted.


Robin Gower - Linked data frames

Linked data uses the Resource Description Framework (RDF) to identify resources with Uniform Resource Identifiers (URIs) and describe them with a set of statements, each specifying the value of a given property for the resource. These statements connect together to form a knowledge graph spanning the web.

The linked-data-frames package makes this data more amenable for idiomatic use in R by using the vctrs package to encapsulate resource descriptions.

We believe this is a novel use of vctrs to tabulate graphs. Learn about our practical experiences and the problems we encountered.

The package also helps users to download linked-data from the web, weaving together a variety of W3C standards and other linked-data vocabularies for working with statistical data cubes. The work was funded by the Integrated Data Programme, a cross-government initiative in the UK bringing together data from across the UK government and devolved administrations. This work to publish linked statistical open data in interoperable formats may also be of interest to R users.

Presentation slides:

Hannes Mühleisen - `DuckDB`: An in-process analytical DBMS

Using databases to wrangle and retrieve data from R and Python can be challenging. Traditional systems like SQLite or MySQL are not built for analytical workloads, and moving data into the analysis environment can suffer from low bandwidth.

DuckDB is a new database management system that runs directly in-process, greatly streamlining setup and data transfer. DuckDB uses a column-vectorized query processing architecture to run analytical SQL queries very quickly.

DuckDB is deeply integrated with R. It can, for example, directly run queries on data that lives in R data frames without a dedicated data importing step. I will describe the rationale behind building DuckDB as well as give some usage examples for statistical programming.
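Querying an R data frame in place, with no import step, looks like this (the table name is arbitrary):

```r
# DuckDB from R: SQL over a data frame without copying it first
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Register a data frame as a virtual table; nothing is imported or copied
duckdb_register(con, "cars", mtcars)

res <- dbGetQuery(con, "
  SELECT cyl, AVG(mpg) AS mean_mpg, COUNT(*) AS n
  FROM cars
  GROUP BY cyl
  ORDER BY cyl
")
print(res)

dbDisconnect(con, shutdown = TRUE)
```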

Session 28, Package Development

Session chair: Njoki Lucy

Zuguang Gu - On the heaviness of package dependencies

In the last decade, R has rapidly become a major programming language for developing software for data analysis. As the number of R packages increases, dependencies among packages become complicated. On CRAN and Bioconductor, there were 187,661 direct dependency relations among 22,083 packages as of 28 October 2021.

If a package (P) depends on a large number of parent packages, it brings with it the following consequences:

(1) Many additional indirect dependency packages need to be installed with P, which creates the risk of upstream package installation failures breaking the installation of P.

(2) It takes a long time to load P, and the namespaces loaded into the R session after loading P will be huge.

(3) P will be “heavy” and bring heavy dependencies to downstream packages that depend on P.

We developed an R package named pkgndep and proposed a new measure called “dependency heaviness,” which measures the number of unique dependencies a parent brings to its child. The pkgndep package provides an intuitive way of visualizing dependency heaviness and helps to find out which parents contribute the most dependency heaviness. We also performed a global dependency analysis on all packages on CRAN/Bioconductor, showing which packages’ dependencies are most affected by their parents and which packages contribute the most heaviness to their child and downstream packages. The global analysis has been integrated into the pkgndep package as a comprehensive web-based database.

Presentation slides:

Lorenz Walthert - Better commits with `pre-commit`

Nearly everyone who works with code uses the version control system git. However, making good commits (content, scope, message) is something one has to learn. The precommit R package provides an interface to the language-agnostic framework pre-commit, which allows users to run checks, so-called git hooks, on the code to commit. These help you to sort out trivial problems like code formatting, as well as linting, ensuring up-to-date derivatives (e.g., Rd files), and more. This increases the quality of commits, and ultimately of the whole code base.

By attending this talk, you will learn how to set up pre-commit for your projects locally, as well as how to enforce the hooks on continuous integration via the recently added support for R. The author will also demo various configuration settings that pre-commit supports and show you how to create pre-commit hooks specific to your projects.
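Local setup amounts to two calls (shown here commented out because they modify your machine and repository; function names follow the package documentation):

```r
# Setting up precommit for a project (sketch)
library(precommit)

# One-time, machine-level installation of the underlying Python framework:
# install_precommit()

# Per-repository setup: writes .pre-commit-config.yaml and installs the hooks:
# use_precommit()

# After that, every `git commit` runs the configured hooks (e.g., styler,
# lintr, roxygen2 checks) and aborts the commit if any hook fails.
```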

Daan Seynaeve - Authenticating R package distribution

Co-author: Tobia De Koninck

We present an approach for authenticating CRAN-like R package distribution via OAuth 2.0.

Securing package installation is necessary when R packages are intended for a limited audience due to confidential or proprietary package content. Current approaches largely rely on restricting access on the network level or using alternative means of package distribution.

We show how a web repository with an interface similar to CRAN can be secured using the OAuth 2.0 Device Code Flow to create an authentication flow that is intuitive and user-friendly. We present a reference implementation of such a repository server and matching R client. We additionally show how this integrates with earlier work on an open-source solution for R package management.

Juliane Manitz - Learnings and reflection on the implementation of risk-based package assessment

Co-authors: Andy Nicholls, Joe Rickert, Marley Gotti, Doug Kelkhoff, Yilong Zhang, Paulo Bargo, and Keaven Anderson

This presentation is on the implementation of risk-based approaches to assess R package accuracy within a validated infrastructure. The discussion reflects thoughts from the R Validation Hub Working Group, a cross-industry initiative funded by the R Consortium. Our mission is to enable the use of R by the biopharmaceutical industry in a regulatory setting, where the output may be used in submissions to regulatory agencies. In early 2020, the R Validation Hub published a white paper which addresses concerns raised by statisticians, statistical programmers, informatics teams, executive leadership, quality assurance teams and others within the pharmaceutical industry about the use of R and selected R packages as a primary tool for statistical analysis for regulatory submission work. Meanwhile, the R Consortium successfully submitted a fully R-based test package to the FDA, and various companies have implemented the concept of risk-based R package validation into their standard processes. We present our learnings from those applied case studies, highlighting which aspects were easy to implement into practice and where difficulties occurred. We also review how new developments in the riskmetric R package and Shiny app can help.
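
The riskmetric workflow referenced above follows a three-step pipeline (as documented by the package; the package assessed here is arbitrary):

```r
library(riskmetric)

# pkg_ref -> pkg_assess -> pkg_score: reference a package, gather
# evidence (documentation coverage, test results, downloads, ...),
# then convert that evidence into numeric risk metrics.
pkg_ref("riskmetric") |>
  pkg_assess() |>
  pkg_score()
```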

Session 29, Expanding Tidyverse

Session chair: Carsten Lange

Patrick Weiss - Tidy Finance with R

Co-authors: Christoph Scheuch, Stefan Voigt, and Patrick Weiss

Financial economics is a vibrant area of research and a central part of all business activities. Despite a vast number of empirical studies of financial phenomena, the field suffers from the lack of public code for key concepts of financial economics. This lack of transparent code not only leads to numerous replication efforts (and their failures), but also constitutes a waste of resources on problems that have already been solved by countless others in secrecy. Our book Tidy Finance with R (in development) aims to lift the curtain on reproducible finance by providing a fully transparent code base for many common financial applications. We hope to inspire others to share their code publicly and take part in our journey towards more reproducible research in the future.

The book comprises 5 parts: (i) an introduction to empirical finance using tidyverse and tidyquant, (ii) accessing and managing financial data using SQLite, (iii) asset pricing using tidy coding principles, (iv) modelling and machine learning using the tidymodels framework, and (v) portfolio optimization in a tidy manner.
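
Part (i) has roughly the following flavor (a sketch assuming the tidyquant download interface; the ticker and dates are illustrative):

```r
library(tidyquant)
library(dplyr)

# Download daily prices and compute simple returns in a tidy pipeline.
prices <- tq_get("AAPL", get = "stock.prices",
                 from = "2020-01-01", to = "2020-12-31")

returns <- prices |>
  group_by(symbol) |>
  mutate(ret = adjusted / lag(adjusted) - 1) |>
  ungroup()
```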

We write this book for three audiences: students, who want to acquire the basic tools required to conduct financial research, ranging from undergrad to graduate level; instructors, who look for materials to teach in empirical finance courses; and data analysts or statisticians, who work on issues pertaining to financial data and need practical tools to do so.

Dax Kellie - Exploring biodiversity databases is tidier than ever

Co-authors: Jenna Wraith, Shandiya Balasubramaniam, and Martin Westgate

Over 100 million records of more than 150,000 Australian species are held within the Atlas of Living Australia (ALA), a data infrastructure that aggregates Australia’s biodiversity data across citizen science programs, museums, herbaria and government agencies. These data are used by a wide range of people, including researchers, public servants, industry workers, and citizen scientists, for monitoring and conservation. Downloading data from the ALA requires users to construct and send a coherent query to the ALA’s programming interface, which searches for and returns the matching data. However, building a query for data one intends to download, rather than data one has already downloaded, hasn’t always been easy for R users. This task can quickly become clunky and unintuitive when trying to match R syntax to an existing programming interface.

The ALA’s latest R package, galah, uses an innovative solution to query building that allows users to build data queries in a similar way to wrangling their data using dplyr. We demonstrate how galah makes use of tidy evaluation and pipes to make filtering and downloading biodiversity data easier, and discuss its implications for public and scientific communities.
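
A query in galah reads much like a dplyr pipeline (a sketch based on the package documentation; the taxon and filter values are illustrative, and a registered ALA email is required):

```r
library(galah)
galah_config(email = "registered-user@example.org")  # your ALA-registered email

# Build the query first, then request only the matching record counts.
galah_call() |>
  galah_identify("Litoria") |>
  galah_filter(year >= 2020) |>
  atlas_counts()
```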

Jason Cory Brunson - Toward a tidy package ecosystem for topological data analysis

Co-authors: Raoul Wadhwa and Matthew Piekenbrock

Topological data analysis (TDA) applies techniques from combinatorial and algebraic topology to characterize the structure of high-dimensional data. Now in its third decade, TDA comprises many standard workflows in practical domains as well as ongoing theoretical and experimental work. Despite this, software to conduct TDA in R, while extensive, is largely ad hoc, independent, and inaccessible to non-specialists.

The tidyverse collection exemplifies a methodical, coordinated, and general-purpose approach to R package development. It relies on a shared set of structural and syntactic conventions to improve legibility, learnability, and extensibility, but also relies on rigorous grammars and opinionated features to promote practice standards.

Our goal is to support the production and refinement of a similarly principled R package collection for TDA, and to couple it where appropriate to existing tidyverse infrastructure. We view the challenge as two-fold: to catalog sound TDA applications into general workflows, and to develop functional units (packages) that realize them. The process is recursive, as implementations inform theory and vice versa.

This is work in progress, and the packages tdaunif, ripserr, TDAstats, simplextree, Mapper, and ggtda are under active development. In this presentation, we will review some common TDA analysis pipelines, illustrate them using our packages, and survey short-term needs and long-term goals. Feedback, interest, and contributions will be encouraged.
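
A minimal pipeline using two of these packages might look like this (a sketch; argument names follow the packages’ documentation):

```r
library(tdaunif)  # samplers for topologically interesting spaces
library(ripserr)  # fast Vietoris-Rips persistent homology

pts <- sample_circle(n = 100, sd = 0.05)  # noisy points on the unit circle
ph  <- vietoris_rips(pts)                 # birth/death pairs by dimension
head(ph)
```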

Bryan Shalloway - Five ways to do "tidy" pairwise operations

The dplyr `_at`, `_if`, and `_all` variants of mutate() & summarise(), and (in the last couple of years) dplyr::across(), have made iterating operations across many columns easier in the tidyverse. These functions, though, still don’t facilitate operations across complex combinations of columns (e.g., operations across pairs of a specified set of columns). Pairwise operations are an example of a pattern that is easy to conceptualize but takes effort to code, as there is little guidance on best practices.

In this talk I’ll walk through examples of five different tidyverse-friendly packages/approaches that can be used for doing pairwise operations. Some of these are for pairwise “mutating” operations, others for “summarising,” and some for either:

  • A mutating pairwise operation generally returns the same number of rows as was input. For example, calculate ratios of all possible pairs of inputted columns.
  • A summarising operation generally returns a single row (for each group). For example, compare the distributions between columns by calculating the K-S statistic between all pairs of columns.

I will then walk through examples using each of widyr, corrr, recipes, pwiser, and dplyover for doing pairwise mutating/summarising operations. I’ll summarize the functionality enabled by each package across dimensions of “handles mutating?” “handles summarising?” and “handles arbitrary operations?” I will spend the most time on pwiser and especially dplyover, which offer the most flexibility for doing pairwise operations and facilitate a dplyr::across() style syntax.
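
For orientation, here is the baseline column-wise idiom alongside one of the pairwise approaches (corrr), on built-in data:

```r
library(dplyr)
library(corrr)

# Column-wise with across(): one result per column, not per pair.
mtcars |> summarise(across(c(mpg, disp, hp), mean))

# Pairwise "summarising" with corrr: one result per pair of columns.
correlate(mtcars[, c("mpg", "disp", "hp")])
```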

Session 30, Experimental Design

Session chair: Nasrin Attar

Kristen Hunter - Power Under Multiplicity Project (PUMP): Estimating power, minimum detectable effect size, and sample size when adjusting for multiple outcomes

Co-authors: Luke Miratrix, Kristin Porter, and Zarni Htet

For randomized controlled trials (RCTs) with a single intervention being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust p-values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially, which reduces the probability of detecting effects when they do exist. However, this consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures.

We introduce the PUMP R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. Multiple outcomes are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in p-values from applying a multiple testing procedure. Second, as researchers change their focus from one outcome to multiple outcomes, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power, as some may be more appropriate for the goals of their study. The package estimates power for frequentist multi-level mixed effects models, and supports a variety of commonly used RCT designs and models and multiple testing procedures. In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore sensitivity of these quantities to changes in underlying assumptions.
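
The core issue can be illustrated with a small base-R simulation (a conceptual sketch only, not the PUMP interface): adjusting p-values for M outcomes lowers the power for any single outcome.

```r
set.seed(1)
M <- 5; n <- 100; effect <- 0.3; reps <- 1000

# Simulated power to detect one outcome at a given significance threshold.
power_for <- function(alpha) {
  mean(replicate(reps, {
    p1 <- t.test(rnorm(n, effect), rnorm(n, 0))$p.value
    p1 < alpha
  }))
}

power_for(0.05)      # unadjusted power
power_for(0.05 / M)  # Bonferroni-adjusted: noticeably lower
```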

Lukas Baumann - `baskexact`: Planning a basket trial based on power priors

Co-authors: Johannes Krisam and Meinhard Kieser

In a basket trial a new treatment is tested in several subgroups. Basket trials are mostly used in uncontrolled phase II studies, where tumor response is the primary outcome and the subgroups comprise patients with different primary tumor locations but a common biomarker. Many of the recently proposed designs for the analysis of basket trials utilize Bayesian tools to partly share information between baskets depending on similarity to increase the power. baskexact implements a family of basket trials based on power priors and using empirical Bayes methodology. In these designs, baskets are at first analyzed individually using a beta-binomial model. The amount of information that is shared between the subgroups is determined by weights which are, for example, derived from a similarity measure of the individual posterior distributions.

With baskexact, exact calculation of the operating characteristics of these designs–such as the type I error rate, power, and expected sample size–is possible. baskexact makes use of the formal S4 class system and is built to be easily extendable, such that different functions to calculate the weights or to conduct the interim analyses can be added.

Carlos de la Calle-Arroyo - `optedr`: An optimal experimental design package

Co-authors: Jesús López-Fidalgo and Licesio J. Rodríguez-Aragón

Often in optimal experimental design research, efforts concentrate on generating optimal designs for particular problems or families of problems, or on algorithms and general procedures to find such designs. However, a design can be optimal for a certain criterion and still be inadequate. Experimenters may have particular needs or constraints, preferences for certain experimental points, statistical requirements, etc. In those cases, the experimenter can either use the optimal design as a benchmark, or augment or modify the design to bring it closer to their preferences.

optedr allows the calculation of optimal designs for non-linear models with an independent variable, for different criteria. The package has been implemented with an applied approach in mind, offering a simple interface to generate such designs.

Aside from generating optimal designs, the package allows the comparison of user-generated designs with the optimum, to use them as a benchmark. It also implements a methodology to D-augment designs in an informative way, controlling the efficiency.

Lastly, as the package works with approximate designs, a rounding algorithm has been implemented to transform the approximate optimal and augmented designs to exact designs, ready for experimenter use.

Francesca Graziano - Using the `design2phase` library to estimate power and efficiency of a two-phase design with survival outcome

Co-author: Paola Rebora

The availability of large epidemiological cohorts and stored biological specimens allows us to reuse these data to answer new research questions. Two-phase sampling is a general approach to sub-sampling that significantly reduces time and cost. However, when the aim is to estimate the association of a novel biomarker with an outcome in a subset of patients, choosing the best-performing sampling design remains rare in practice, partly due to the lack of convenient and flexible tools.

The design2phase library, implemented in R, is a tool that provides a simulation-based investigation of sub-sampling performance, with the aim of estimating the association between a new marker and a time-to-event outcome in a two-phase study. The library can estimate the power and efficiency of a wide variety of sampling designs simultaneously (e.g., simple random sampling, case-control, probability proportional to size, nested case-control, and countermatching), applying a two-phase Cox model weighted by the inverse of the empirical inclusion probability.

This user-friendly tool could help researchers plan a second phase study by providing simple commands to perform stratified sampling and to visualize power curves. The architecture of the package and its applicability are illustrated by using data from childhood acute lymphoblastic leukemia to evaluate the role of different genetic polymorphisms on treatment failure due to relapse.

Session 31, Forecasting & Nowcasting

Session chair: Max Welz

Kapil Choudhary - VMD-based time delay neural network hybrid model for agricultural price forecasting

Co-authors: Girish Kumar Jha, Ronit Jaiswal, and Rajeev Ranjan Kumar

Agricultural price forecasting is one of the challenging areas of time-series forecasting due to its strong dependence on biological processes. To enhance the accuracy of agricultural price forecasting, we propose a variational mode decomposition (VMD)-based hybrid model, VMD-TDNN, that combines VMD with a time delay neural network (TDNN). The proposed hybrid model is based on the “divide and conquer” concept. VMD is used as a preprocessing technique to decompose a complex agricultural price series into a set of intrinsic mode functions (IMFs) with different center frequencies. Due to its adaptiveness and sound mathematical theory, VMD overcomes the mode-mixing limitation of the empirical mode decomposition (EMD) method. Further, a TDNN with a single hidden layer is constructed to forecast each IMF individually. Finally, the prediction results of all IMFs are aggregated to form an ensemble output for the agricultural price series. The proposed model’s prediction ability is evaluated using level and direction forecasting evaluation criteria. The empirical results, using monthly international maize, palm oil, and soybean oil price series, demonstrate that the proposed decomposition-based ensemble model (VMD-TDNN) significantly improves the prediction accuracy of agricultural price series.

Ronit Jaiswal - Hybrid time series forecasting model based on STL decomposition and ELM

Co-authors: Girish Kumar Jha, Kapil Choudhary, and Rajeev Ranjan Kumar

In this study, we integrated a decomposition technique viz. seasonal trend decomposition procedure based on loess (STL) with an efficient neural network-based forecasting technique (i.e., extreme learning machine [ELM]) and developed an ensemble hybrid model called STL-ELM for a nonstationary, nonlinear and seasonal agricultural price series. First, the STL technique is used to decompose the original price series into the seasonal, trend and remainder components. Then, an ELM with a single hidden layer is constructed to forecast these components individually. Finally, the prediction results of all components are aggregated to formulate an ensemble output for the agricultural price series. The hybrid model captures the temporal patterns of a complex time series effectively through analysis of the simple decomposed components. The study further compared the price forecasting ability of the developed STL-ELM model with time delay neural network (TDNN), ELM and SARIMA models using monthly price series of potato for two major markets of India. The empirical results clearly demonstrated the superiority of the developed hybrid model over the other models in terms of two forecasting evaluation criteria. Moreover, the accuracy of the forecasts obtained by all the models is also evaluated using the Diebold-Mariano test, which shows that the STL-ELM-based model has a clear advantage over the other three models.
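
The decomposition step can be sketched with base R’s stl() on a built-in seasonal series; in the hybrid model, each resulting component would then be forecast separately (with an ELM) before the forecasts are recombined:

```r
# Decompose a monthly seasonal series into seasonal, trend, and remainder.
fit   <- stl(log(AirPassengers), s.window = "periodic")
comps <- fit$time.series   # ts matrix: seasonal, trend, remainder columns
head(comps)
plot(fit)                  # visualize the three components
```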

Sam Abbott - Evaluating semi-parametric nowcasts of COVID-19 hospital admissions in Germany

Co-author: Sebastian Funk

COVID-19 hospitalisations in Germany are released by date of positive test rather than by date of admission. This has some advantages when they are used as a tool for surveillance, as these data are closer to the date of infection and so easier to link to underlying transmission dynamics and public health interventions. Unfortunately, however, when released in this way the latest data are right-censored, meaning that final hospitalisations for a given day are initially underreported. This issue is often found in datasets used for the surveillance of infectious diseases and can lead to delayed or biased decision making. Fortunately, when data from a series of days is available we can estimate the level of censoring and provide estimates for the truncated hospitalisations adjusted for truncation with appropriate uncertainty. This is usually known as a nowcast.

In this talk, we evaluate a series of novel semi-parametric nowcasting model formulations in real-time and provide an example workflow to allow others to do similarly. This project is part of a wider collaborative assessment of nowcasting methods. All models are implemented using the epinowcast R package. The nowcasting and evaluation pipeline is implemented using the targets R package. All input data, interim data, and output data are available.

Nikos Bosse - Evaluating forecasts with the `scoringutils` R package

Co-authors: Sam Abbott, Sebastian Funk, and Hugo Gruson

Forecasts play an important role in a variety of fields. Their role in informing public policy has attracted increased attention from the general public with the emergence of the COVID-19 pandemic. Much theoretical work has been done on the development of proper scoring rules and other scoring metrics that can help evaluate these forecasts. However, there is a vast choice of scoring rules available for different types of data, and there has been less of a focus on facilitating their use by those without expertise in forecast evaluation. In this talk, we introduce scoringutils, an R package that, given a set of forecasts and truth data, automatically chooses, applies and visualises a set of appropriate scores. It gives the user access to a wide range of scoring metrics for various types of forecasts, as well as a variety of ways to visualise the evaluation. We give an overview of the evaluation process and the metrics implemented in scoringutils and show an example evaluation of forecasts for COVID-19 cases and deaths submitted to the European Forecast Hub between May and September 2021.
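
The intended usage is compact (a sketch following the scoringutils documentation, using the package’s bundled example forecasts):

```r
library(scoringutils)

scores <- score(example_quantile)        # apply metrics suited to quantile forecasts
summarise_scores(scores, by = "model")   # aggregate the scores, e.g., per model
```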

Session 32, Building the R Community 2

Session chair: Amelia McNamara

Nadja Bodner - `ConNEcT`: An R package to build contingency measure-based networks on binary time series

Co-author: Eva Ceulemans

Dynamic networks are valuable tools to depict and investigate the concurrent and temporal interdependencies of various variables (e.g., during a dyadic interaction) across time. Although several software packages for computing and drawing dynamic networks have been developed, software that allows investigating the pairwise associations between a set of binary intensive longitudinal variables is still missing. To fill this gap, we introduce an R package that yields contingency measure-based networks. ConNEcT implements different contingency measures: proportion of agreement, corrected and classic Jaccard index, phi correlation coefficient, Cohen’s Kappa, odds ratio, and log odds ratio. Moreover, users can easily add alternative measures, if needed. Importantly, ConNEcT also allows conducting non-parametric significance tests on the obtained contingency values that correct for the inherent serial dependence in the time series, through a permutation approach or model-based simulation. In this talk, we provide an overview of all available ConNEcT features and showcase their usage.

Neale Batra - The Epi R Handbook: Getting R in the hands of frontline public health responders

Co-author: Alex Spina

The intense recent spotlight on public health has highlighted a vast global workforce of frontline outbreak responders hindered by substandard and proprietary analytical tools. These practitioners often desire to use R, but lack discipline-specific training resources. In May 2021, the nonprofit Applied Epi launched the free Epidemiologist R Handbook, a bookdown with 50 chapters of example R code targeted to address the daily tasks of applied epidemiologists and public health responders. This handbook, now the foundational R resource for public health, has been used by 130,000 people in 203 countries/territories, is being translated into 10 languages, and has been adopted at field levels by Doctors without Borders, the World Health Organization, and countless local health agencies. In this talk, we describe the grassroots effort that led 150 practitioners to create the Handbook and its translations. We continue by detailing Applied Epi’s ongoing global, multilingual training campaign of interactive tutorials, live R courses, and R package development tailored to support practitioners first learning to code and those transitioning from other languages. Applied Epi aims to accelerate the adoption of R across all of epidemiology and public health–not just academia and mathematical modeling. This discussion will explore how R can evolve to better serve and center ground-level practitioners–whether in epidemiology or other fields.

Kieran Martin - R journey: Switching to R in the pharmaceutical industry

In the pharmaceutical industry we have been using SAS as our core tool for data science for a very long time. Shifting towards an open source language like R doesn’t just involve updating our code, it means thinking differently, working differently, and approaching our problems in a different way.

In this talk I will discuss some of the hurdles we have needed to climb in our journey at Roche towards R, how we did so, and what we still have yet to do.

Nicholas Tierney - Reflections one year into working as a research software engineer

Despite the obvious impact of software on research, we are still working out how to adequately acknowledge research software in academia. How do we provide rewarding career paths for those who want to write research software as academic output? The relatively new field of research software engineering can help address this. A research software engineer combines professional software expertise with an understanding of research. I have been working as a research software engineer in academia for the past year. In this talk, I will explain how this role fits into academia, describe what I do as a research software engineer, summarise what I’ve learnt, and share how I see–and hope–a career in research software engineering developing over time.


Thursday, 23 June 2022, 1:00 - 2:15pm CDT

Session 33, R GUIs

Session chair: Susanne Dandl

Ross Dierkhising - Transitioning from commercial software to the R-based GUI BlueSky Statistics in a large academic medical center

Making the change from one software package to another can be a difficult process, especially at scale in large institutions, given the heterogeneity of the user base, which includes users who are not themselves statisticians. This talk will discuss how the Mayo Clinic transitioned from the commercial software JMP to the R-based software BlueSky Statistics in both academic and research areas. It focuses on the issues to be solved, the plan put in place to solve them, the implementation, the challenges, thoughts for the future, and lessons learned.

Sanjay Kumar - The new architecture for BlueSky Statistics R GUI

R is powerful, flexible, and extensible, but it can be intimidating without a programming background and poses a steep learning curve. This results in the continuing popularity of proprietary menu-based tools such as SPSS and Minitab. BlueSky Statistics is a free and easy-to-use graphical user interface that unleashes the power of R for non-programmers. BlueSky has been adopted by hundreds of universities across 40 countries. Programmers can also use BlueSky to speed up their work and ramp up their R learning, since BlueSky displays the underlying R code for every GUI-based analysis. Version 10 of BlueSky, released in early 2022, is built upon an entirely new architecture with a modern user interface. This architecture enables any R programmer to add their own code and dialog boxes. It also adds support for R Markdown and LaTeX to the already publication-quality output tables. In addition, it adopts a form of Markdown as BlueSky’s native file format, allowing easy compatibility with users of other interfaces, such as RStudio. The new architecture also adds GUI-level reproducibility to the previous approach, which depended on R code. This talk will describe these and other new features in this release.

Steven A. Miller - Multilevel modeling in the BlueSky Statistics environment: Growth curves, daily diaries, and experience sampling

Multilevel models (MLMs) have been frequently utilized to handle non-independent observations (e.g., students nested within classrooms, time points nested within individuals, etc.); they are characterized by fixed effect parameters as well as random effects that model variability due to nesting units. While such models were originally estimated in specialized software, they have recently been implemented in SAS, SPSS, and R. Some of the difficulty in conceptualizing these models has led to the development of much literature to help naïve users of such models. BlueSky Statistics offers a user-friendly GUI-based analytic facility for multilevel modeling and it generates the necessary R code for you. In this presentation, features relating to random effects will be discussed extensively, including modeling random intercepts, random slopes, uncorrelated random intercept and slope, and correlated random intercept and slope. BlueSky also provides features for model diagnosis and post-hoc analysis of categorical predictors and interactions. Demonstrations will be provided from experience sampling, daily diary, and growth-curve data. Comparisons to other BlueSky procedures (e.g., repeated-measures ANOVA) will be provided, along with demonstration of commonly used data restructuring. Discussion of future BlueSky features that might also be used for non-independent observations (e.g., latent variable growth curves) will also be provided. Specialized options that are not commonly available in other pieces of commercial software will be highlighted (e.g., semi-partial-R-squared effect size, estimates of effect size).
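
The random-effects structures discussed map directly onto lme4 formula syntax, which the R code generated by BlueSky resembles (sleepstudy is lme4’s built-in example dataset):

```r
library(lme4)

m_int   <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)      # random intercept
m_uncor <- lmer(Reaction ~ Days + (Days || Subject), sleepstudy)  # uncorrelated intercept + slope
m_cor   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)   # correlated intercept + slope
summary(m_cor)
```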

Christophe Genolini - R++: An easy-to-use graphical R interface for medical doctors

Co-author: Timothy Bell

R++ is a user-friendly R graphical interface for medical doctors. It greatly simplifies the use of R.

R is a comprehensive and powerful language/environment. However, for non-statisticians, using R can be difficult. In particular, some specialists have a job that has nothing to do with statistics, and they only occasionally use statistics as a tool. In this case, they have to relearn R each time they want to use it.

With R++, the learning curve is extremely short–around 1 hour to master everything. To achieve such simplicity, we have limited R++ to a single sector of activity: medicine. We then worked along two lines: (1) Limiting the number of tools at hand: R offers a very large number of tools, of which medical doctors use only a tiny fraction. In R++, we integrated only the tools used by the vast majority of the medical world and left out the others. The result is a very minimal, clean interface. (2) In collaboration with HCI experts, we conducted more than a hundred “user meetings,” during which medical doctors told us about their difficulties and their needs. We designed a first interface to meet these needs and submitted it to user criticism. We then made a second interface, taking the feedback into account, and so on, until convergence.

The result is a streamlined software that is very easy to learn and responds exactly to the needs of the users.

Session 34, Regression Models

Session chair: Agustin Calatroni

Tobias Schoch - `robsurvey`: Robust survey statistics estimation

Co-author: Beat Hulliger

The robsurvey package provides robust estimation methods for data from complex sample surveys. The package implements the following methods: (1) basic outlier-robust location estimators of the population mean and total using weight reduction, trimming, winsorization, and M-estimation (robust Horvitz-Thompson and Hajek estimators); (2) robust survey regression M- and GM-estimators of the Mallows and Schweppe types; (3) robust model-assisted estimators of the population mean and total.

A key design pattern of the package is that the methods are available in two flavors: bare-bone functions and survey methods. Bare-bone functions are stripped-down versions of the survey methods in terms of functionality. They may serve package developers as building blocks. The survey methods are much more capable and depend–for variance estimation–on the R package survey.
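
The two flavors might be used as follows (a sketch; the function and argument names are taken from the robsurvey documentation and should be checked against the installed version):

```r
library(survey)
library(robsurvey)
data(api)  # example survey data shipped with the survey package

# Bare-bone function: a vector of values plus sampling weights.
weighted_mean_trimmed(apisrs$api00, apisrs$pw, LB = 0.05, UB = 0.95)

# Survey method: uses a design object and supports variance estimation.
des <- svydesign(ids = ~1, weights = ~pw, data = apisrs)
svymean_trimmed(~api00, des, LB = 0.05, UB = 0.95)
```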

The talk is organized into three parts: (1) an overview of the robust methods in robsurvey, including a comparison with other R packages (survey, robustbase, robeth, and MASS), Stata (robstat and rreg), SAS (robustreg), NAG, and the GNU Scientific Library; (2) design patterns and possible extensions of the package; (3) use cases and applications of the package.

Hannah Frick - `censored`: A tidymodels package for survival analysis

Co-authors: Emil Hvitfeldt and Max Kuhn

Survival analysis is an important field in modeling, and there are many R packages available which implement various models, from “classic” parametric models to boosted trees. While they cover a great variety of model types, they also come with considerable amounts of heterogeneity in syntax. The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. It provides a consistent interface to a variety of modeling functions, along with tools for resampling, assessing performance, and hyperparameter tuning.

The censored package now extends the model coverage of tidymodels's parsnip package for survival analysis. It offers a tidymodels interface to parametric survival models, (regularized) proportional hazards models, and various tree-based models such as decision trees, boosted trees, and random forests for survival analysis. Additionally, it offers predictions of time to event, linear predictor, survival probability, or hazard in a tibble format consistent across different models. The censored and parsnip packages can be used on their own or in conjunction with other tidymodels packages.
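The consistent interface might look as follows; this is a hedged sketch assuming the engine and mode names from the censored documentation ("survival" engine, "censored regression" mode) and the standard survival::lung example data.

```r
library(parsnip)
library(censored)  # registers survival engines and the censored regression mode

# specify a proportional hazards model with the usual parsnip grammar
spec <- proportional_hazards() |>
  set_engine("survival") |>
  set_mode("censored regression")

f <- fit(spec, survival::Surv(time, status) ~ age + ph.ecog,
         data = survival::lung)

# predictions return as a tibble with the same columns across engines
predict(f, new_data = head(survival::lung), type = "time")
```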

Achim Zeileis - `distributions3`: From basic probability to probabilistic regression

Co-authors: Moritz Lang and Alex Hayes

The distributions3 package provides a beginner-friendly and lightweight interface to probability distributions. It allows users to create distribution objects in the S3 paradigm that are essentially data frames of parameters, for which standard methods are available: e.g., evaluation of the probability density, cumulative distribution, and quantile functions, as well as random samples. It has been designed such that it can be employed in introductory statistics and probability courses. By not only providing objects for a single distribution but also for vectors of distributions, users can transition seamlessly to a representation of probabilistic forecasts from regression models such as GLM (generalized linear models), GAMLSS (generalized additive models for location, scale, and shape), etc. We show how the package can be used both in teaching and in applied statistical modeling, for interpreting fitted models, visualizing their goodness of fit (e.g., via the topmodels package), and assessing their performance (e.g., via the scoringRules package).
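A short sketch of the basic interface, using constructors and generics documented in distributions3 (parameter names assumed from the package documentation):

```r
library(distributions3)

N <- Normal(mu = 0, sigma = 1)  # a single distribution object
pdf(N, 0)                       # density at 0
cdf(N, 1.96)                    # P(X <= 1.96)
quantile(N, 0.975)              # upper 2.5% quantile
random(N, 5)                    # five random draws

# a *vector* of distributions, e.g. one per observation of a fitted
# regression model -- methods are evaluated elementwise
B <- Binomial(size = 10, p = c(0.2, 0.5, 0.8))
cdf(B, 5)
```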


Pierre Masselot - The R package `cirls`: Constrained estimation in generalized linear models

Co-author: Antonio Gasparrini

The cirls R package provides functions to fit generalized linear models with coefficients subject to linear constraints. The estimation is based on iterative calls to a quadratic programming algorithm to produce fits that respect the constraints. The main routine in the package is meant to be called through the usual glm function provided in the stats package, which allows taking advantage of the whole glm machinery to produce outputs and summaries and to extract coefficients. The package provides additional methods to produce corrected (co)variance matrices and confidence intervals for the coefficients, consistent with the estimation framework. We illustrate how to use the package to fit shape-constrained splines, obtaining nonlinear functions with monotonicity and convexity constraints.
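The core idea, replacing the least-squares step inside iterative fitting with a quadratic program, can be sketched for the Gaussian case with the quadprog package. This illustrates the principle only (a monotonicity-type constraint on a polynomial fit), not cirls's actual code:

```r
library(quadprog)

# constrained least squares: minimize ||y - X b||^2 subject to C b >= 0
set.seed(1)
x <- sort(runif(50))
y <- 2 * x + rnorm(50, sd = 0.2)
X <- cbind(1, x, x^2)

# monotonicity constraint: derivative b2 + 2*b3*t >= 0 on a grid of t
grid <- seq(0, 1, length.out = 10)
C <- cbind(0, 1, 2 * grid)

# solve.QP minimizes -d'b + b'Db/2 subject to t(Amat) %*% b >= bvec
b <- solve.QP(Dmat = crossprod(X), dvec = crossprod(X, y),
              Amat = t(C), bvec = rep(0, nrow(C)))$solution
```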

Session 35, Spatial Statistics

Session chair: Jana Dlouhá

Bryan A. Fuentes - `rassta`: Raster-based spatial stratification algorithms

Co-authors: Minerva J. Dorantes and John R. Tipton

Spatial stratification of landscapes allows for the development of efficient sampling surveys, the inclusion of domain knowledge in data-driven modeling frameworks, and the production of information relating the spatial variability of response phenomena to that of landscape processes. This work presents the rassta package as a collection of algorithms dedicated to the spatial stratification of landscapes, the calculation of landscape correspondence metrics across geographic space, and the application of these metrics for spatial sampling and modeling of environmental phenomena. The theoretical background of rassta is presented through references to several studies which have benefited from landscape stratification routines. The functionality of rassta is presented through code examples which are complemented with the geographic visualization of their outputs.

Nikolas Kuschnig - Bayesian spatial econometrics

Bayesian approaches to spatial econometric models are relatively uncommon in applied work, but play an important role in the development of new methods. This is partly due to a lack of easily accessible, flexible software for the Bayesian estimation of spatial models. Established probabilistic software struggles with the computational specifics of these models, while classical implementations cannot harness the flexibility of Bayesian modelling. In this talk, I present bsreg, an object-oriented R package that bridges this gap. The package enables quick and easy estimation of spatial econometric models and is readily extensible. Using the package, I demonstrate the merits of the Bayesian approach by means of a well-known dataset on cigarette demand. Bayesian and frequentist point estimates coincide, but posterior inference affords better insights into uncertainty. I find that in previous work using distance-based connectivities, average spillover effects were considerably overestimated, highlighting the need for tried and tested software.

Aritz Adin - `bigDM`: An R package to fit scalable Bayesian spatial and spatio-temporal disease mapping models for high-dimensional data

Co-authors: Erick Orozco-Acosta and María Dolores Ugarte

The use of spatial and spatio-temporal count data models is crucial in areas such as cancer epidemiology, since these models permit investigators to reliably obtain incidence or mortality risk estimates of cancer in small areas, avoiding the huge variability of classical risk estimation measures such as standardized mortality ratios or crude rates. However, the scalability of these models (i.e., their use when the number of space-time domains increases significantly) has not yet been studied in depth.

The bigDM R package implements several spatial and spatio-temporal scalable disease mapping models for high-dimensional count data using the INLA technique for approximate Bayesian inference in latent Gaussian models. The main algorithms are based on the “divide and conquer” methodology so that local spatio-temporal models can be fitted simultaneously. The adaptation of this idea to the context of disease mapping is very appropriate in practice, since spatio-temporal conditional autoregressive (CAR) models induce local smoothness in both spatial and temporal dimensions by means of neighbouring areas and time points.

The package allows users to adapt the modelling scheme to their own processing architecture, using parallel and/or distributed computation strategies (via the future package) to speed up computations. The new version of the package will also include functions to fit scalable multivariate spatial models to jointly analyse several diseases, and scalable ecological regression models that account for confounding between potential risk factors and the model's random effects.
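A usage sketch, assuming the `CAR_INLA()` wrapper and the example cartography shipped with bigDM as described in its documentation; argument names here are assumptions, not a verbatim call:

```r
library(bigDM)
library(future)

# parallel fitting of the submodels produced by the divide-and-conquer step
plan(multisession, workers = 4)

# carto: sf object with polygons and count columns;
# model = "partition" triggers the divide-and-conquer strategy, with
# k-order neighbourhood overlap between subregions
fit <- CAR_INLA(carto = Carto_SpainMUN, ID.area = "ID",
                O = "obs", E = "exp",
                model = "partition", k = 1)
```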

Crystal Wai - `spatialEpisim`: An R Shiny app for tracking COVID-19 in low- and middle-income (LMIC) countries

Co-authors: Ashok Krishnamurthy, Gursimran Dhaliwal, Jake Doody, Timothy Pulfer, and Ryan Darby

It is essential to understand what future epidemic trends will be, as well as the effectiveness and potential impact of public health intervention measures. Our goal is to provide insights to support informed, data-driven decision making. We present spatialEpisim, an R Shiny app that integrates mathematical modeling and open-source tools for tracking the spatial spread of COVID-19 in low- and middle-income (LMIC) countries.

We present spatial compartmental models of epidemiology (e.g., SEIR, SEIRD, SVEIRD) to capture the transmission dynamics of the spread of COVID-19. Our interactive app can be used to output and visualize how COVID-19 spreads across a large geographical area. The rate of spread of the disease is influenced by changing the model parameters and human mobility patterns.

First, we run the spatial simulations under the worst-case scenario, in which there are no major public health interventions. Next, we account for mitigation efforts, including strict mask wearing and social distancing mandates, and widespread vaccine rollout to priority groups.

As a test case, numbers of newly infected and death cases in Nigeria are estimated and presented. Projections for disease prevalence with and without mitigation efforts are presented via time-series graphs for the epidemic compartments.

We seek primarily to clarify mathematical ideas rather than to offer definitive medical answers. Our analyses may shed light more broadly on how COVID-19 spreads across a large geographical area, including places where no empirical data are recorded or observed.
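The compartmental structure underlying the app can be illustrated with a minimal non-spatial SEIRD model solved with deSolve; this is a toy sketch of the model class, not the app's code (the app additionally resolves the spatial spread across a raster of locations).

```r
library(deSolve)

# non-spatial SEIRD toy model
seird <- function(t, y, p) {
  with(as.list(c(y, p)), {
    N  <- S + E + I + R
    dS <- -beta * S * I / N           # susceptibles become exposed
    dE <-  beta * S * I / N - sigma * E
    dI <-  sigma * E - (gamma + mu) * I
    dR <-  gamma * I                  # recoveries
    dD <-  mu * I                     # deaths
    list(c(dS, dE, dI, dR, dD))
  })
}

y0   <- c(S = 1e6 - 10, E = 0, I = 10, R = 0, D = 0)
pars <- c(beta = 0.4, sigma = 1/5, gamma = 1/10, mu = 0.002)
out  <- ode(y = y0, times = 0:180, func = seird, parms = pars)
```

Mitigation scenarios correspond to lowering beta (contact reduction) or moving individuals out of S (vaccination).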

Session 36, Synthetic Data and Text Analysis

Session chair: Charlie Gao

Johannes Gussenbauer - Synthetic data generation with `simPop`: New features XGBoost and advanced calibration

Co-authors: Alexander Kowarik, Siro Friedmann, and Matthias Templ

Synthetic data generation methods are used to transform original data into privacy-compliant synthetic copies (twin data) that can be used as training data, open-access data, or internal datasets to speed up analyses, for remote execution, and much more. The CRAN package simPop allows the simulation of simple to very complex datasets with complex sampling designs, missing values, realistic cluster structures (such as persons in households), and mixed-scaled variables in a computationally efficient manner. With simPop, synthetic data can be simulated at the same size as the input data or at any other size; in the case of finite populations, even the entire population can be simulated.

We show (1) a new and powerful synthetic data generation method in combination with (2) an improved calibration method to adjust the synthetic data to known population margins. Both are now fully integrated in simPop.

  1. The proposed XGBoost-based method shows strong performance, especially for synthetic categorical variables, and outperforms the other tested methods. The tuning parameters, an important ingredient in applying XGBoost, can be estimated using a modified k-fold cross-validation.
  2. After data generation, adjusting the synthetic data to known population margins is recommended. For this purpose, we implemented a simulated annealing algorithm capable of using multiple population margins at once. In addition, the algorithm is implemented efficiently, making it feasible even when the adjusted populations contain 100 million or more observations.
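A typical simPop pipeline might look as follows. Variable names come from the package's eusilcS example data, and the "xgboost" method string for simCategorical is the new addition described above; treat both as assumptions from the package documentation.

```r
library(simPop)

data(eusilcS)  # synthetic EU-SILC survey sample shipped with simPop

# declare household id, strata, and sampling weights of the input survey
inp <- specifyInput(eusilcS, hhid = "db030",
                    strata = "db040", weight = "rb050")

# simulate the household structure of the synthetic population
pop <- simStructure(inp, method = "direct",
                    basicHHvars = c("age", "rb090"))

# simulate an additional categorical variable with the new XGBoost method
pop <- simCategorical(pop, additional = "pl030", method = "xgboost")
```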

Emil Hvitfeldt - Improvements in text preprocessing using `textrecipes`

Text constitutes an ever-growing part of the data available to us today. However, it is a non-trivial task to transform text, represented as long strings of characters, into numbers that we can use in our statistical and machine learning models. textrecipes has been around for a couple of years to aid the practitioner in transforming text data into a format that is suitable for machine learning models. This talk gives a brief overview of the basic functionality of the package and a look at exciting recent additions.
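The basic workflow can be sketched with a tiny recipe; the step functions shown (`step_tokenize()`, `step_tokenfilter()`, `step_tfidf()`) are part of textrecipes' documented API, though the toy data and tuning values here are illustrative only.

```r
library(textrecipes)  # attaches recipes as a dependency

df <- data.frame(
  text  = c("R is great for text analysis", "turning text into numbers"),
  class = c("a", "b")
)

rec <- recipe(class ~ text, data = df) |>
  step_tokenize(text) |>                     # split strings into tokens
  step_tokenfilter(text, max_tokens = 10) |> # keep most frequent tokens
  step_tfidf(text)                           # tf-idf numeric features

# estimate the preprocessing on df and apply it
bake(prep(rec), new_data = NULL)
```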

Janith Wanniarachchi - `scatteR`: Generating instance space based on scagnostics

Co-author: Thiyanga Talagala

Modern data synthesizers consist of model-based methods where the focus is primarily on tuning the parameters of the model, not on specifying the structure of the data itself. Scagnostics is an exploratory graphical method capable of encapsulating the structure of bivariate data through graph-theoretic measures. An inverse scagnostic measure would therefore provide an entry point to generate datasets based on the characteristics of the instance space rather than on a model-based simulation approach. scatteR is a novel data generation method with controllable characteristics based on scagnostic measurements. We use a generalized simulated annealing optimizer to discover, at each iteration, the arrangement of data points that minimizes the distance between the current and target measurements. As a pedagogical tool, scatteR can be used to generate datasets for teaching statistical methods, and as a data synthesizer it can be used to synthesize existing datasets. Based on the results of this study, scatteR is capable of generating 50 data points in under 30 seconds with an average root mean squared error of 0.05.


Olivier Delmarcelle - `sentopics`: An R package for joint sentiment and topic analysis of textual data

This paper presents the R package sentopics through a framework that joins topic modelling and sentiment analysis. The package offers the tools to estimate simple topic models (Latent Dirichlet Allocation) or extensions including sentiment (Joint Sentiment/Topic model). It is then possible to enrich the models estimated using sentopics with external measures of sentiment to create topical-sentiment series. The package also includes numerous off-the-shelf visualizations, aiming to ease the burden of analyzing topic model outputs.

Session 37, Unique Applications and Methods

Session chair: Emilio López Cano

Vathy M. Kamulete - Call me when it hurts

Statistical tests for detecting dataset shift in performance monitoring can be susceptible to false alarms: they are sensitive to minor (negligible) differences. We introduce a robust framework to detect adverse shifts based on outlier scores, D-SOS for short. The D-SOS null hypothesis holds that the new (test) sample is not substantively worse than the reference (training) sample, rather than that the two are equal. The idea is to sound the alarm only when it truly hurts. Our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation.

Ashutosh Dalal - `NBBDesigns` and `rsdNE`: R packages for the generation of designs and analysis of data incorporating neighbour effects

Co-authors: Seema Jaggi, Eldho Varghese, Arpan Bhowmik, Cini Varghese, and Anindita Datta

Neighbour effects from adjacent units are very common during experimentation, especially in agricultural field experiments, when the units are arranged linearly without any gaps. The effects due to these neighbouring units contribute to variability in experimental results and lead to a substantial loss in efficiency if not accounted for in design and analysis. Hence, in order to avoid bias when comparing treatment effects in this situation, it is essential to ensure that no treatment is unduly disadvantaged by its neighbours. Designs have been developed in the literature to address the problem of neighbour effects, but they are difficult for experimenters to access, owing to the theoretical framework required to construct designs suited to the experimental situation and then to analyze the data generated from such experiments.

In this talk, we will highlight two packages developed in R, viz., NBBDesigns and rsdNE, for the generation of single-factor and multifactor designs suitable for experimental situations where neighbour effects are suspected. The purpose of these packages is to make these designs freely available for various situations while providing easy accessibility to experimenters and researchers.

Alex Zhu - R on Raspberry Pi: The `RaspberryPiR` package for collecting and analysing streaming sensor data

Co-authors: Pierre Lafaye de Micheaux, Pablo Mozharovskyi, and Fabien Navarro

Raspberry Pi is a powerful, popular, low-cost minicomputer with the ability to collect physical environmental data, such as temperature, luminosity, gas concentration, images, and infrared radiation levels, from sensors and circuits. Our new R package RaspberryPiR can store sensor data, read via the Pi's sensor-controlling modules (GPIO pins), into shared memory. The data analysis can then be done in a streaming manner, using various streaming statistical and machine learning algorithms.

In its current implementation, our package is compatible with the following sensors: the DHT11 temperature and humidity sensor, a photoresistor, the MQ2 gas sensor, and the Raspberry Pi Camera Module V2, which already allows for numerous streaming applications. We review and suggest implementations of a set of existing statistical tools for windowed data streams, such as control charts and the Tukey region, which can help visualize data streams collected with our package.

To summarize, our package simplifies the process of collecting data streams from the surroundings using a Raspberry Pi. This permits scientists, statisticians, data scientists, and practitioners to stay in control of their environmental research and data projects without needing to understand the complexities of data storage and electric circuits on the Raspberry Pi.

Matthew Pocernich - Creating unbiased TV commercial exposure estimates

To measure viewership for TV commercials, information is gathered from millions of TVs, using internal software that shares viewing information through the internet. TV viewing data may be joined with household demographic information such as income, age, education and presence of children to describe the reached population. Since connected TVs are not distributed randomly, unadjusted reach estimates are biased. To get unbiased national estimates of commercial viewing, a standard practice in the ad tech industry is to use weighting to adjust for biases.

This talk discusses the way R is used to evaluate initial biases in the viewing data, select relevant attributes, apply a raking algorithm to estimate device weights, and evaluate the robustness of the resulting weights. Raking algorithms are used when information is available only for the marginal distributions. On a daily basis, tens of millions of TVs need to be weighted and dozens of attributes considered, each with two to hundreds of levels. Consequently, the efficiency of the methodology is important. The implementation of this methodology uses Oracle Machine Learning for R (OML4R), which provides such efficiencies.
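Raking in general, independent of the OML4R implementation described above, can be illustrated with survey::rake, which iteratively adjusts weights until the weighted sample matches each known marginal distribution. The toy attributes and margins below are illustrative only.

```r
library(survey)

# toy device sample with two attributes whose population margins are known
set.seed(42)
df <- data.frame(
  income = sample(c("low", "high"), 1000, replace = TRUE, prob = c(0.3, 0.7)),
  kids   = sample(c("yes", "no"),  1000, replace = TRUE, prob = c(0.6, 0.4))
)
d <- svydesign(ids = ~1, data = df)  # starts from equal weights

# known population margins, expressed as counts
m_income <- data.frame(income = c("low", "high"), Freq = c(500, 500))
m_kids   <- data.frame(kids   = c("yes", "no"),  Freq = c(450, 550))

# iterative proportional fitting over the two margins
r <- rake(d, sample.margins = list(~income, ~kids),
          population.margins = list(m_income, m_kids))
summary(weights(r))  # distribution of adjusted device weights
```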