Posters & Elevator Pitches

Awards

At the closing session of useR! 2022, five posters were designated as the best of the conference. Each poster was scored by at least two members of the useR! 2022 program committee. The scoring criteria included quality of abstract, outline, introduction, or summary; organization and clarity; quality of content; quality of graphics and visualizations; and impact and added value. The scores were averaged and adjusted for reviewer bias. Winners will receive a free book from the CRC Press catalog, which includes 64 titles in the R Series; we thank the publisher for their support: https://www.routledge.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER

The winners are:

  • Abdul Aziz Nurussadad and Akbar Rizki, for “Twitter Bot using rvest, rtweet and a GitHub Action” : https://twitter.com/panganBot/status/1540235020224573440/photo/1

  • Maciej Nasinski, for “Conscious R packages maintenance”

  • Cara Thompson, for “Level up your labels: Tips and tricks for annotating plots” : https://cararthompson.com/talks/user2022

  • Peter Fortunato, for “How R helps me evaluate the safety performance of a metropolitan highway network”

  • Konrad Oberwimmer, for “Automation in (mass) production of charts with svgtools”

Post-conference availability

Posters can be viewed on the conference platform until 23 July 2022. There are no plans for permanently archiving them. An option for individual presenters is to upload their poster to a repository that provides DOIs, such as OSF or ScienceOpen Posters.

https://osf.io

https://www.scienceopen.com/

Session structure (posted before the conference)

English:

The poster session will be held on 22 June 2022, from 10:45am to 12:30pm CDT. The posters have been organized by topic into groups of up to five presenters (see below). Each group will have a virtual “lounge” that will be open throughout the conference, with a tab for asynchronous/written discussion and a Zoom-like “Live Forum” for audio-video chat.

Each group has been assigned to “Round A” or “Round B.” Presenters in Round A will be in their lounges from 10:45am to 11:30am CDT to deliver their elevator pitches and converse with attendees live. Presenters in Round B will be in their lounges from 11:45am to 12:30pm CDT.

Español:

La sesión de posters se llevará a cabo el 22 de junio de 2022, de 10:45 am a 12:30 pm CDT. Los posters se han organizado por tema en grupos de tres a cinco (ver más abajo). Cada grupo tendrá un “salón” virtual que estará abierto durante toda la conferencia, con una pestaña para debates asincrónicos/escritos y un “Foro en vivo” similar a Zoom para chat de audio y video.

Cada grupo ha sido asignado a “Ronda A” o “Ronda B”. Las personas de la Ronda A estarán en sus salas desde las 10:45 am hasta las 11:30 am CDT para dar sus charlas rápidas y conversar con participantes. Las personas de la Ronda B estarán en sus salones de 11:45 am a 12:30 pm CDT.

Français:

La présentation par affiches est le 22 juin 2022, de 10h45 à 12h30 HAC. Les affiches sont organisées par thème en groupes de trois à cinq (voir ci-dessous). Chaque groupe aura un “lounge” virtuel qui sera ouvert pendant toute la conférence, pour la discussion asynchrone/écrite et un “Forum en direct” de type Zoom pour le chat audio-vidéo.

Chaque groupe a été affecté au “Round A” ou au “Round B”. Les personnes du tour A seront dans leurs salons de 10 h 45 à 11 h 30 HAC pour présenter leurs sujets de discussion et converser en direct. Les personnes du tour B seront dans leurs salons de 11h45 à 12h30 HAC.

Round A, 22 June 2022, 10:45am - 11:30am CDT

Posters on Bayesian Methods in R

Issei Tsunoda - Using R in radiology for signal detection theory

In radiology, physicians look for nodules in images acquired by MRI, CT, and other modalities. To quantify physicians’ recognition ability, we use the so-called Free-response Receiver Operating Characteristic (FROC) analysis, fitting models to data with R packages such as rstan. I developed the package BayesianFROC, which implements Bayesian models for FROC analysis and also provides a graphical user interface (GUI) built with Shiny. BayesianFROC is available at https://CRAN.R-project.org/package=BayesianFROC

Teck Kiang Tan - The forthcoming way of hypothesis testing: Informative hypothesis

Null hypothesis significance testing (NHST) has been, and still is, the dominant way of carrying out hypothesis testing. The basic idea of NHST is to test whether the null hypothesis of no effect can be rejected based on the observed data, by comparing the p-value to a pre-specified significance level. However, the use of a pre-specified significance level (usually .05) is the main target of criticism, as it may not be a sensible formulation. More importantly, under the NHST framework, multiple testing becomes a painfully tedious process that requires at least a two-step procedure. Together with NHST’s inability to accept the null hypothesis, this makes the informative hypothesis, which incorporates the Bayes factor, an attractive option that allows for the direct evaluation of a set of predetermined hypotheses. This approach lets researchers specify hypotheses as inequality constraints and accept or reject them within a Bayesian framework. Applied researchers who often set expectations about the order and the direction of the parameters in their statistical model will find that informative hypotheses directly meet their requirements. This poster covers how to carry out informative hypothesis testing using the package bain. Using data from the National University of Singapore data warehouse (the ALSET Data Lake), three informative hypotheses are demonstrated to show their applicability in answering research questions. These three examples concentrate on three statistical models: ANOVA, regression, and structural equation modeling.
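As a hedged sketch of the kind of analysis described above (the data, variable names, and the specific order hypothesis are invented for illustration), an informative hypothesis for an ANOVA-type model can be evaluated with bain roughly as follows:

```r
# Illustrative only: simulated data and an invented order-constrained hypothesis.
library(bain)

set.seed(1)
dat <- data.frame(
  score = c(rnorm(30, 5.0), rnorm(30, 5.5), rnorm(30, 6.0)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Fit without an intercept so the coefficients are the group means
fit <- lm(score ~ group - 1, data = dat)

# Informative (order-constrained) hypothesis: mean of C exceeds B, which exceeds A
bain(fit, hypothesis = "groupC > groupB > groupA")
```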

Marta Sánchez Sánchez - A Bayesian model on the number of infected by a disease based on the Gompertz curve by using R: The COVID-19 case

The COVID-19 pandemic has highlighted the need for mathematical models that forecast the evolution of a contagious disease. In this work, we consider an epidemiological model in which the number of new infected cases follows a non-homogeneous Poisson process with an intensity function based on the Gompertz curve. Our main aim is to provide the scientific community with a robust way to tackle, from a Bayesian perspective, the probabilistic model that describes the number of new infected cases in a specific region, particularly in the case of SARS-CoV-2. To this end, we implement the well-known Bayesian analysis of the Poisson process, in which the likelihood of n events recorded in the interval (0, T) is defined in terms of the intensity function and the mean value function. We introduce a probabilistic tool whose results relate directly to the posterior distributions of the Gompertz curve parameters and, as a by-product, we are able to forecast the number of new cases in near-future time intervals. In summary, we provide a free tool for making forecasts about the evolution of a pandemic, such as COVID-19, in a given population.
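As a rough illustration of the model structure (not the authors' implementation), the following base-R sketch encodes a Gompertz mean value function, the corresponding non-homogeneous Poisson process log-likelihood, and a forecast of expected new cases from posterior draws:

```r
# Cumulative infections follow a Gompertz mean value function; new cases in an
# interval are Poisson with mean equal to the increment of that curve.
gompertz <- function(t, a, b, c) a * exp(-b * exp(-c * t))

# Log-likelihood of event times t in (0, T) for a non-homogeneous Poisson process
# whose intensity is the derivative of the Gompertz curve
loglik_nhpp <- function(par, t, T) {
  a <- par[1]; b <- par[2]; c <- par[3]
  lambda <- a * b * c * exp(-c * t) * exp(-b * exp(-c * t))
  sum(log(lambda)) - (gompertz(T, a, b, c) - gompertz(0, a, b, c))
}

# Expected new cases in (T, T + h) for each posterior draw of (a, b, c):
# the increment of the Gompertz curve over the forecast window
forecast_new_cases <- function(draws, T, h) {
  apply(draws, 1, function(p) gompertz(T + h, p[1], p[2], p[3]) - gompertz(T, p[1], p[2], p[3]))
}
```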

Posters on Bioinformatics I

Thilini Mahanama - Risk assessment of drug-induced liver injuries based on in vitro assays with mechanistic knowledge

The rapid development of in vitro assays for toxicity assessment has provided a tremendous opportunity to improve toxicity risk assessment by utilizing assays more relevant to human biology while reducing the reliance on animals. However, important challenges still exist for the effective use of in vitro assays in risk assessment practices. Although machine learning methods have shown significant power in predicting toxicity endpoints, directly utilizing the huge number of in vitro assays currently available is neither practical nor effective. On the other hand, the adverse outcome pathway (AOP) framework has shown great promise in encoding expert knowledge on relevant biological pathways pertaining to toxicity. We discuss our work on using AOPs to filter the large number of in vitro assays to construct parsimonious and high-performing predictive models for toxicity, using drug-induced liver injury as an example. Another challenge for developing predictive models using in vitro assay data is the difficulty of corroborating the result with human data due to the scarcity of suitable datasets. We partially address this problem by taking advantage of real-world data. A novel statistical method is outlined for analyzing spontaneous adverse event reporting databases for drug safety. By connecting real-world data for adverse events in routine medical care with machine learning models based on in vitro assays, we demonstrate a new avenue to further strengthen the power of machine learning in toxicity studies.

Max Beesley - Tracing amniotic fluid stem cells using spatial transcriptomics

I use computationally intensive transcriptomic techniques to characterise the autologous multipotent stem cells (AFSCs) present in human amniotic fluid. AFSCs can be expanded and differentiated during gestation, making them an ideal candidate for fetal and neonatal autologous regenerative medicine. However, there is a lack of consensus on the origin and identity of these cells, which hinders their full clinical potential. I used bulk RNA-sequencing (RNAseq) to characterise these cells with the aim of determining their origin. Gene set enrichment analysis provided potential candidate tissues within the fetus. This was then narrowed down further to a particular cell lineage using novel analyses that overlap the bulk RNAseq data with established reference single-cell RNAseq datasets. To confirm the precise anatomical origin of the AFSCs, we applied spatial transcriptomics to human fetal samples. I re-applied the overlap protocol and was able to identify the precise location from which the AFSCs originate, confirming our working hypothesis about how the cells translocate to the amniotic fluid. Determining the origin of these stem cells will improve their clinical applications and accelerate further research. I will explain my computational pipeline and discuss the mixture of R-based techniques I used to fully harness these data and achieve our aims.

Myriam Maumy - `SelectBoost`: A general algorithm to enhance the performance of variable selection methods in correlated datasets

Variable selection has become one of the major challenges in statistics, due both to the growth of big data and to technological innovations that make it possible to measure large amounts of data in a single observation. As a consequence, problems in which the number P of variables is larger than the number N of observations have become common. Although many methods have been proposed in the literature, their performance in terms of recall and precision is limited when the number of variables far exceeds the number of observations or when the variables are highly correlated.

The SelectBoost package implements a new general algorithm (https://doi.org/10.1093/bioinformatics/btaa855) that improves the precision of any existing variable selection method. The algorithm is based on highly intensive simulations and takes the correlation structure of the data into account. It can either produce a confidence index for variable selection or be used from an experimental design planning perspective.

David Shilane - Analyzing panel data with `tvtools`

Panel data presents an efficient method for storing longitudinal information in studies that can update the records at any time. Panel data is structured so that a subject has multiple rows linked by a unique identifier. Each row records an interval of time. Other measurements for the subject are considered constant for the duration of the interval. Often used in medical studies, panel data allows us to track changes in each patient’s profile. However, the structure of panel data must be incorporated into analyses. Many common applications require consideration of the time period and variable number of records per subject. Furthermore, with many records per subject, panel data structures are necessarily large relative to the sample size. With these concerns in mind, the authors developed the tvtools package for R to analyze panel data.

The tvtools package offers methods for summarization, quality checks, and analyses. The amount of missing records can be tabulated over time. The user can identify records with gaps, overlaps, or events of unusual duration. For analyses, one can extract cross-sectional data structures, determine the length of observation by subject, measure the times to events, and calculate the utilization of medications. Events can be counted by total records or as distinct events spanning multiple intervals. Crude event rates can be calculated overall or in eras of time. Grouped computations are easily incorporated into the methods. The presentation will detail the applications and benefits of the tvtools package.
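The data layout and the kind of summaries involved can be illustrated with a small base-R sketch (generic code for clarity, not tvtools's own API):

```r
# One row per interval of follow-up, linked by a subject id; measurements are
# treated as constant within each interval.
panel <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  begin   = c(0, 30, 90, 0, 45),      # interval start (days)
  end     = c(30, 90, 180, 45, 120),  # interval end (days)
  on_drug = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  event   = c(0, 0, 1, 0, 0)          # event observed during the interval
)

# Length of observation per subject and a crude event rate per person-year
followup   <- tapply(panel$end - panel$begin, panel$id, sum)
crude_rate <- sum(panel$event) / (sum(followup) / 365.25)
crude_rate
```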

Posters on Bioinformatics II

Layla Bouzoubaa - `pmcFetchR`: PMC full-text retrieval for text mining

pmcFetchR is a novel R package that allows users to retrieve full-text articles from NCBI’s PMC OA dataset on AWS. This package includes the function fetch_pmcid for retrieving PMCIDs given a vector or string of PMIDs, and the function fetch_fulltext, which takes the output of fetch_pmcid or a given vector of PMCIDs to retrieve full-text articles.

In this pitch/poster, I will introduce the package, including how it utilizes several NCBI APIs, as well as the motivation behind it. I will also walk attendees through a use case in which the package is useful for NLP tasks, since the fetch_fulltext function returns the requested articles in a tidy, tokenized dataframe. I would also like to engage attendees for feedback on improving the package or on features that would be helpful.
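A minimal sketch of the retrieval workflow follows; only the two function names come from the abstract, while the example PMIDs and argument handling are assumptions:

```r
library(pmcFetchR)

pmids  <- c("31452104", "32511222")   # illustrative PMIDs, not a real query
pmcids <- fetch_pmcid(pmids)          # map PMIDs to PMCIDs
texts  <- fetch_fulltext(pmcids)      # tidy, tokenized data frame of full texts
head(texts)
```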

Bastian Pfeifer - Efficient genome-wide signal reconstruction from ranked genes with `TopKSignal`

The ranking of items is widely used to rate their relative quality or relevance across multiple assessments. Beyond classical rank aggregation, it is of special interest to estimate the, usually unobservable, latent signals that inform a consensus ranking. Under the only assumption of independent assessments, we have developed and implemented an indirect inference approach via linear or quadratic convex optimization. The final estimates of the signals and their standard errors can be obtained from classical bootstrap or from the computationally more efficient Poisson bootstrap.

This novel methodology can be used for a variety of bioinformatics tasks where rank observations are the only available input or are preferred to metric input. The latter applies to gene expression analysis. We retrieved sequencing-based kidney cancer profiles from The Cancer Genome Atlas (TCGA) in order to infer the genome-wide consensus signals for surviving and non-surviving patient groups. For technical reasons, the patient-specific sequencing counts are not observed on a common metric scale, so we transformed them to an ordinal scale. Each patient can then be imagined as an independent ranker of the set of genes. For each group of patients (survival and non-survival) we could thus form an input rank matrix for signal estimation. The resulting group-specific consensus signal estimates of gene expression reflect genome-wide gene importance orderings indicative of survival status. Our methodology is implemented in the R package TopKSignal, freely available from GitHub (https://github.com/pievos101/TopKSignal).
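The construction of the input rank matrix described above can be sketched in base R as follows (illustrative data only; this is not the TopKSignal API):

```r
set.seed(42)
expr <- matrix(rpois(20 * 5, lambda = 50), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("patient", 1:5)))

# Each patient acts as an independent ranker of the genes
# (highest expression = rank 1)
rank_matrix <- apply(expr, 2, function(x) rank(-x, ties.method = "average"))
head(rank_matrix)
```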

Daniela Corbetta - Procrustes analysis for high-dimensional data

In neuroscience and spatial transcriptomics, the analysis of between-subject variability is quite attractive, but it cannot be performed on raw data since the anatomical and functional structure of the brain differs between subjects. Aligning the images is therefore a preliminary and unavoidable step. Most of the best-performing alignment algorithms are based on Procrustes theory, a statistical shape analysis that aligns matrices in a common reference space using similarity transformations. The perturbation model rephrases the Procrustes method as a statistical model that defines matrices as a random perturbation of a common reference matrix plus an error term. However, its solution is not unique and lacks interpretability, since the aligned images lose their anatomical structure. To overcome this problem, Andreella and Finos (arXiv:2008.04631v4) proposed the ProMises model, which extends the perturbation model in a Bayesian context by assuming a von Mises-Fisher distribution as the prior for the rotation parameter. They also introduced the Efficient ProMises model, which reduces the computational load of the ProMises model for high-dimensional data without loss of information and is also suitable for matrices with different dimensions. These models allow users to incorporate information regarding the orientation of the rotation parameter through a proper specification of its prior hyperparameters. We present our package, alignProMises, which contains two main functions, ProMisesModel and EfficientProMisesSubj, that implement these two models. We show an application of our package to spatial transcriptomics data.
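For orientation, here is a minimal base-R sketch of the classical orthogonal Procrustes alignment that these models build on; it is not the alignProMises API, and the data are simulated:

```r
# Find the orthogonal matrix R minimising || X %*% R - Y ||_F
procrustes_rotation <- function(X, Y) {
  s <- svd(crossprod(X, Y))   # SVD of t(X) %*% Y
  s$u %*% t(s$v)              # optimal orthogonal matrix
}

set.seed(1)
Y      <- matrix(rnorm(100 * 3), 100, 3)       # reference configuration
R_true <- qr.Q(qr(matrix(rnorm(9), 3, 3)))     # a random orthogonal matrix
X      <- Y %*% t(R_true) + matrix(rnorm(300, sd = 0.01), 100, 3)

R_hat <- procrustes_rotation(X, Y)
max(abs(X %*% R_hat - Y))   # close to zero: X has been aligned to Y
```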

Alan Aw - Flexible tests of exchangeability with genomics applications

In scientific studies involving analyses of multivariate data, two questions often arise for the researcher. First, is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Second, are the features independent of one another, or can the features be grouped so that the groups are mutually independent? We propose a non-parametric approach that addresses these two questions. Our approach is based on permutations, and is fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. In the exchangeability detection setting, comparison against unsupervised tests of stratification based on random matrix theory shows that our approach compares favorably in various scenarios of interest. We demonstrate how our method can support bread-and-butter analyses in population genetics, including (1) finding evidence of population structure; and (2) finding optimal LD block splits. We also consider other application domains, applying our approach to post-clustering single-cell chromatin accessibility data and World Values Survey data, where we show how users can partition features into independent groups, which helps generate new scientific hypotheses about the features.

Posters on Biostatistics Methods

Erika Rasnick - Using R to democratize geospatial data

Many geospatial datasets are free and publicly available, but often require programming and spatial data expertise before they are in an analysis-ready format. Here we describe how we use R to curate geospatial data and make it truly accessible to both R users and non-R users. This process involves first accessing and transforming data from its raw format. This often includes breaking up large datasets into digestible chunks. Next, we develop an R package specifically for working with the curated dataset. It handles spatial overlays and any calculations required for assessing the data over user-specified space and time. Finally, we use that data-specific R package and other R tools to develop containerized software that streamlines the process for non-R users while maintaining data privacy and reproducibility. Through this framework, we democratize geospatial data by transforming it into products that are accessible to both R and non-R users.

Jonathan Gross - Exploring social determinants of health and health outcomes in neighborhoods using R Shiny

Exploratory analysis of geographic health indicators and outcomes is frequently performed. For example, examining correlations between social determinants of health (SDOH) and the homicide rate may provide insights into risk factors, protective factors and potential solutions. Examples of SDOH include median household income, the percent of students absent from high or middle school, and the lead paint violation rate. A Shiny app was created to explore correlations at the neighborhood level in Baltimore City, Maryland, using Neighborhood Health Profile (NHP) data for Baltimore City’s 55 Community Statistical Areas. The NHP 2017 dataset contains 102 continuous variables on SDOH and health outcomes. The Shiny app allows users to select an explanatory variable and an outcome variable for analysis of correlations and related statistics. In addition, the app contains maps for the explanatory and outcome variables, and an overlay map combining both, built with the leaflet package. Lastly, basic machine learning components were added for k-means clustering and principal component analysis, using factoextra. This Shiny app can be used to explore any geographic dataset containing many continuous variables. The rgeoda package will also be briefly discussed.
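A self-contained toy version of such an app (using the built-in mtcars data in place of the NHP dataset, and omitting the leaflet maps and clustering) might look like this:

```r
library(shiny)

vars <- names(mtcars)

ui <- fluidPage(
  selectInput("x", "Explanatory variable", vars, selected = "hp"),
  selectInput("y", "Outcome variable", vars, selected = "mpg"),
  verbatimTextOutput("cor"),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  # correlation test and scatterplot for the selected pair of variables
  output$cor <- renderPrint(cor.test(mtcars[[input$x]], mtcars[[input$y]]))
  output$scatter <- renderPlot(
    plot(mtcars[[input$x]], mtcars[[input$y]], xlab = input$x, ylab = input$y)
  )
}

shinyApp(ui, server)
```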

Jane Ho and Xingyu (Fred) Feng - Evaluating inter-laboratory method performance in Ontario's COVID-19 wastewater surveillance initiative

Wastewater surveillance based on measurement of SARS-CoV-2 biochemical signals is a promising tool to complement conventional epidemiological metrics for tracking COVID-19 disease prevalence in a community. However, sample processing methods are far from standardized, contributing to substantial variability in the analytical results between laboratories supporting a common surveillance network. To support the Ontario Ministry of the Environment, Conservation and Parks with the delivery of its wastewater surveillance initiative, the Ontario Clean Water Agency initiated an ongoing inter-laboratory program to facilitate method comparisons and QA/QC evaluations. Split samples of authentic wastewater are prepared and distributed to laboratories for analysis. R was used as a comprehensive platform for data cleaning and processing, statistical evaluation and visualizations.

Due to the rapidly evolving state-of-the-science, flexibility to perform exploratory data analysis and test hypotheses on inter-laboratory method comparison data sets was critical. Through manipulation of data frames afforded by the use of common R packages, datasets were efficiently explored. Hierarchies of data structures were investigated to gain insights about sources of variability within and between methods. Statistical summaries and graphical plots were generated to facilitate intuitive interpretations of the data. As methods matured, specialized plots were incorporated as additional visualizations to track intra- and inter-method variability over multiple rounds of inter-laboratory method comparisons (e.g., Youden and Mandel h & k plots). This program’s findings are a critical component in increasing end-user confidence in the validity of results emanating from different laboratory methods to support a common wastewater surveillance network.
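As an aside, the Mandel h and k consistency statistics mentioned above can be computed with a few lines of base R; the replicate results below are simulated, and this is not the program's own code:

```r
set.seed(7)
dat <- data.frame(
  lab    = rep(paste0("lab", 1:6), each = 4),
  result = rnorm(24, mean = 100, sd = rep(c(3, 3, 3, 3, 3, 8), each = 4))
)

lab_means <- tapply(dat$result, dat$lab, mean)
lab_sds   <- tapply(dat$result, dat$lab, sd)

h <- (lab_means - mean(lab_means)) / sd(lab_means)  # between-lab consistency
k <- lab_sds / sqrt(mean(lab_sds^2))                # within-lab consistency
round(cbind(h = h, k = k), 2)
```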

Rahmasari Nur Azizah - Testing monotonic trends on the dose-response relationship of nanomaterial toxicity using the R package `NMTox`

As nanomaterials are increasingly used in various fields and products, interest in nanomaterial toxicity is growing as well. NanoinformaTIX is an H2020 project that aims to build a user-friendly platform for risk management of engineered nanomaterials. One of the goals of this project is to develop a method for in vitro-in vivo extrapolation (IVIVE) of nanomaterial toxicity. As an initial step, the R package NMTox was developed as a tool for conducting a preliminary analysis of the dose-response relationship in nanomaterial toxicity.

The NMTox R package includes several trend tests, such as the likelihood ratio, Williams, Marcus, M, and modified M tests, that can be used to test for monotonic trends in order-restricted dose-response nanomaterial toxicity data. Methods to adjust for multiplicity are provided, and the package also includes several functions for data exploration.

We illustrate the analysis using data from nanomaterial toxicity studies with cell viability as the endpoint of interest. Trend testing was performed on 14 nanomaterials, divided into 82 subsets of data according to the cell type, the methods used in the experiment, the study provider, the exposure time and the concentration unit. Using a likelihood ratio test, a significant monotonic trend was found for 30 nanomaterials.

Posters on Business and Operations

Roberto Delgado Castro - Pivot tables in R for financial analysis: A real success case of automating a public trust supervision tool in Costa Rica

In response to a direct command from the Contraloría General de la República (CGR), the Dirección General de Desarrollo Social y Asignaciones Familiares (DESAF), part of the Ministry of Labor and Social Security in Costa Rica, developed the CAMEL (Capital, Assets, Management, Assessment and Liquidity) model. This model, one of a kind in Costa Rica and adapted from the one applied by SUGEF (Superintendencia General de Entidades Financieras) to the national financial system, carries out financial supervision of three public trusts whose consolidated patrimonies amount to US$70 million. The main objective is to provide conclusions and timely recommendations that contribute to their sustainability.

For each trust, the model’s data ranges and calculations have been developed in R using pivot tables, and the corresponding analysis of each metric was implemented through conditional algorithms. This consolidated work in R automates the whole supervision process; its inputs are the official balances of specific accounts from each trust’s financial statements.
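A generic sketch of a pivot-table style summary in R, with invented figures and column names rather than DESAF's actual model, could look like this:

```r
library(dplyr)
library(tidyr)

set.seed(2022)
balances <- data.frame(
  trust       = rep(c("Trust A", "Trust B", "Trust C"), each = 4),
  quarter     = rep(c("Q1", "Q2", "Q3", "Q4"), times = 3),
  assets      = runif(12, 10, 40),   # US$ millions (illustrative)
  liabilities = runif(12, 5, 25)
)

balances %>%
  mutate(liquidity_ratio = assets / liabilities) %>%
  pivot_wider(id_cols = trust, names_from = quarter, values_from = liquidity_ratio)
# one row per trust, one column per quarter: a pivot-table view of the metric
```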

Atilla Wohllebe - App-based in-store navigation in retail: Antecedents of usage and influence on app usage–an R-based application of SEM

Both the smartphone in general and mobile apps in particular are playing a decisive role in shaping the customer-facing digitization of retail. By leveraging augmented reality technology, mobile apps can be used to help consumers find products in a store through in-store navigation (ISN). This poster uses a survey of 1,500 consumers and a structural equation model (SEM) to show the antecedents of intention to use an ISN and how an ISN can increase usage of a retailer’s mobile app. To build the SEM, the authors use R, utilizing the packages lavaan and psych, among others.
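For readers unfamiliar with lavaan, the general shape of such a model is sketched below using lavaan's built-in HolzingerSwineford1939 data; the actual ISN constructs and survey items are not given in the abstract, so these variable names are only placeholders:

```r
library(lavaan)

model <- '
  # measurement model (placeholder latent constructs and indicators)
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9

  # structural part, analogous to regressing usage intention on its antecedents
  speed ~ visual + textual
'

fit <- sem(model, data = HolzingerSwineford1939)
summary(fit, standardized = TRUE, fit.measures = TRUE)
```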

Carl Ganz - Staffing a call center: Queues in action with R

Some research has been done using queuing systems to model call center management (Koole 2001), but there are not many exhaustively documented case studies. In our poster we walk through our work using queuing systems to staff our call center, including forecasting time-heterogeneous arrival and service rates from Twilio data, validating the model, and simulating different staffing schedules with the simmer package.
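A hedged sketch of an M/M/c-style call-centre simulation with simmer follows; the arrival rate, handling time, and staffing level are made up for illustration rather than the authors' fitted Twilio-based values:

```r
library(simmer)

call <- trajectory("call") %>%
  seize("agent", 1) %>%
  timeout(function() rexp(1, rate = 1 / 4)) %>%   # mean handling time: 4 minutes
  release("agent", 1)

env <- simmer("call_centre") %>%
  add_resource("agent", capacity = 5) %>%
  add_generator("call", call, function() rexp(1, rate = 1)) %>%  # ~1 call per minute
  run(until = 480)                                               # one 8-hour shift

head(get_mon_arrivals(env))   # per-call arrival, start, and end times
```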

Posters on Computing Frameworks

Sayantani Karmakar - An R-package for generating incomplete row-column designs

Row-column designs are widely recommended for experimental situations with two well-identified, cross-classified factors representing known sources of variability. However, these designs are not readily available when the number of treatments exceeds the levels of the row and column blocking factors. Here, an algorithmic approach is proposed for constructing a new series of row-column designs with incomplete rows and columns by amalgamating two incomplete block designs. A wide range of incomplete block designs available in the literature (balanced incomplete block designs, partially balanced incomplete block designs, and t-designs) can be selected as input designs for constructing the proposed series. To spare users the complexity of the construction algorithm, the R package iRoCoDe has been developed to generate the proposed designs, and a catalogue of designs with up to 20 treatments has been prepared using it.

Drew Schmidt - Introducing `fmlr`: A novel high-performance matrix framework for R

Many statistical algorithms are dominated by matrix computations. For the research statistician, there are many benefits to implementing novel methods in software that can run on high-performance computing (HPC) resources, like campus clusters or national supercomputers. This not only reaches a new category of users, but can also unlock otherwise unavailable computing grants. However, most statisticians do not have the combination of background and desire to scale their codes out in this way.

Here, we introduce the fmlr package, a novel HPC framework for matrix computations with R. It provides numerous linear algebra and statistical methods, for data stored on a CPU or GPU, or distributed across an MPI cluster. Each backend is managed by a common interface, so code written for a laptop can easily scale to cloud resources or large HPC systems. Unlike other high-level matrix frameworks, fmlr places an emphasis on minimizing memory consumption. For example, we can zero-copy inherited CPU data from R and modify it without making additional copies. We also fully support 32-bit floating point data, in addition to the standard 64-bit, with some additional support for 16-bit float in the case of GPU. We will introduce this framework, discuss some of its more unique capabilities, and demonstrate its value with some example performance benchmarks.

James Duncan - `simChef`: An intuitive framework for reliable simulation studies in R

We introduce simChef, an R package that simplifies the design, coding, computation, evaluation, visualization, and documentation of simulation experiments so that authors can focus on their scientific questions. simChef emphasizes modularity, developer productivity, computational efficiency, automated comprehensive documentation, and real-world data as a central component of simulation experiments. We will illustrate simChef’s main features, including: a) its tidyverse-inspired grammar of data-driven simulation experiments, which eases simulation design across a wide range of data-generating processes, methods, and parameters; b) its flexible utilities for distributed computation and checkpointing across simulation scenarios; and c) its automated documentation of results using R Markdown.

https://yu-group.github.io/simChef/

Charlie Gao - `nanonext` and `mirai`: A messaging and concurrency framework for R

nanonext is a lightweight zero-dependency R binding for NNG (Nanomsg Next Gen), a C socket library providing high-performance scalability protocols, implementing a cross-platform standard for messaging and communications. Considered a successor to ZeroMQ, protocols encompass common use patterns such as RPC, pub/sub, service discovery, pipeline etc. nanonext serves as a concurrency framework for building distributed applications, utilising ‘Aio’ objects which automatically resolve upon completion of asynchronous operations. nanonext provides the interface for code and processes to communicate with each other–receive data generated in Python, perform analysis in R, and send results to a C++ program–all on the same computer or on networks spanning the globe. nanonext further provides (asynchronous) http(s) and websocket clients built using NNG and MbedTLS.

mirai (meaning “future” in Japanese) is a package written using the nanonext framework and implementing asynchronous execution of arbitrary R code in (optionally persistent) background processes.
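A minimal sketch of the basic mirai workflow described above (the computation is a stand-in; consult the package documentation for current details):

```r
library(mirai)

m <- mirai({
  Sys.sleep(2)            # stand-in for an expensive computation
  summary(rnorm(1e6))
})

unresolved(m)   # TRUE while the background process is still working
# ... carry on with other work in the main session ...
call_mirai(m)   # block until the result is available
m$data          # the value returned by the background computation
```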

Posters on Data Mining, Machine Learning, Deep Learning, and AI

Mohieddin Jafari - Nominal data mining with `NIMAA` R package

Nominal data is data that has been “labeled” and grouped into a number of distinct, unordered groups depending on the labels assigned. Because extensive numerical procedures are impractical for this type of data, its analysis is typically trivial. On the other hand, graphs and networks are composed of collections of nodes and edges, which can be thought of as independent nominal variables. We present the R package NIMAA, which combines graph theory and data mining approaches to provide a nominal data mining pipeline for further information exploration. NIMAA includes functions for constructing weighted and unweighted bipartite graphs, analyzing the similarity of nominal variable labels, clustering labels or categories into super-labels, validating clustering results, predicting bipartite edges via missing weight imputation, and providing a variety of visualization tools based on the nominal variable labels in a dataset. Additionally, I will demonstrate how nominal data mining was applied to a biological dataset comprising a significant number of nominal variables.

Kenneth Geers - R & Peace: Data mining Russia's heaviest novel

This useR! poster offers a textual analysis, in R, of Leo Tolstoy’s novel War and Peace. The author introduces three subjects: R, the novel, and data mining. The text, an English translation of War and Peace, is downloaded for free from Project Gutenberg. The text is prepared for analysis, including the tokenization of words, sentences, paragraphs, and chapters. Stop words are removed from the token sets, which yields large and rich datasets for analysis. Topic models are created via clustering and bigram analysis. Sentiment analysis (using lexicons like Bing and NRC) is used to analyze the novel’s topics, characters, and chapters, as well as the relationships between them. Finally, the author explains how this R code can be modified and repurposed to analyze any other type of text.
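A hedged sketch of this pipeline using gutenbergr and tidytext; the Project Gutenberg ID for the English translation of War and Peace (2600) is an assumption worth double-checking with gutenberg_works():

```r
library(gutenbergr)
library(tidytext)
library(dplyr)

war_and_peace <- gutenberg_download(2600)

tokens <- war_and_peace %>%
  unnest_tokens(word, text) %>%          # tokenize into words
  anti_join(stop_words, by = "word")     # remove stop words

# Simple sentiment tally with the Bing lexicon
tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)
```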

Susanne Dandl - `counterfactuals`: An R package for counterfactual explanation methods

Counterfactual explanation methods are a valuable technique for explaining single predictions of black box models. They generate counterfactual data points that show how feature values of individuals need to be changed to obtain a desired prediction. Knowledge about them increases trust in a deployed machine learning model, for example, by justifying or helping to detect biases of individual predictions. Despite the increasing amount of proposed methods in research, the current software landscape is rather sparse; interfaces and requirements of existing implementations vary widely.

The counterfactuals package provides a modular and unified R6-based interface for counterfactual explanation methods. It embeds three existing counterfactual explanation methods with some optional methodological extensions to generalize these methods to different scenarios and make them more comparable. It also provides additional functionality to evaluate and visualize the created counterfactuals. Multiple components for generating counterfactuals can be exchanged, allowing methods to be easily extended and tailored to specific needs. Due to the object-oriented concept of the package, users can also easily add their own counterfactual explanation methods.

Bryan Shalloway - Handling uncertainty in predictions: Approaches to building prediction intervals within a `tidymodels` framework

In many settings your predictive model must output a range rather than just a point estimate. Three common approaches for outputting prediction intervals are to use…

  1. a parametric method where the prediction intervals are solved for analytically
  2. a simulation or conformal inference based approach
  3. a method that outputs quantiles

In this elevator pitch, I will briefly walk through examples of how you can do each from within the tidymodels ecosystem. (See http://bryanshalloway.com for more detailed written examples.)
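For instance, approach 1 (parametric intervals) can be sketched within tidymodels as follows, using the built-in mtcars data; the conformal and quantile approaches are not shown here:

```r
library(tidymodels)

fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:3, ]

bind_cols(
  predict(fit, new_cars),                                  # point estimates
  predict(fit, new_cars, type = "pred_int", level = 0.90)  # analytic prediction intervals
)
```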

Posters on Data Visualization

Rohan Reddy Tummala - Visualizing dichotomous data correlations using 2-sample corrgrams

Corrgrams enable visualization of multivariate correlation matrices for a data set by using heat maps. Comparing multivariate correlations between two groups of interest using traditional corrgrams requires two separate one-sample corrgrams, one per group. Here, we introduce two-sample correlation matrices and corrgrams as an efficient solution for visualizing multivariate correlations in data sets stratified into dichotomous groups, along with the R package corrarray, which streamlines the generation of these two-sample correlation matrices. The lower and upper triangular correlation matrices of the first and second sample, respectively, are displayed on opposite sides of the principal diagonal of a single correlation matrix. When a data set’s grouping variable has more than two levels, the package can also produce a multi-sample array comprising individual correlation matrices for the k levels of the grouping variable. Visualizing the correlation matrices of a dichotomous data set in a single two-sample corrgram eliminates the redundancy and inefficient use of space that would otherwise result from two traditional corrgrams.
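The two-sample layout can be illustrated with a few lines of base R plus corrplot (generic code, not corrarray's own API):

```r
library(corrplot)

vars <- c("mpg", "wt", "hp", "disp", "qsec")
g1 <- cor(mtcars[mtcars$am == 0, vars])   # group 1: automatic transmission
g2 <- cor(mtcars[mtcars$am == 1, vars])   # group 2: manual transmission

combined <- g1
combined[upper.tri(combined)] <- g2[upper.tri(g2)]  # upper triangle from group 2

corrplot(combined, method = "color", diag = FALSE)
```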

Christie D. Williams and Lee Noel - W.E.B. Du Bois's visualizations of 21st-century big data in R

W.E.B. Du Bois’s visualization work captured civil rights in the 1900s by visualizing Black America. “The problem of the twentieth century is the problem of the color line,” according to Du Bois (1901). Many R users recreate Du Bois’s unique style using small datasets that reproduce two specific types of his visual work (the black-and-white bar diagram and the circle chart). However, many of those attempts do not scale to 21st-century big data on the United States voting population. Our research question: will today’s big data fit Du Bois’s data design? Our presentation looks at 2020 census data through the eyes of the Du Bois catalog using R. To utilize Du Bois’s visual catalog of 14 diverse visualization types, we will demonstrate our modifications of his original designs, adding more statistical computing power than was available to Du Bois along with add-ons to ggplot2. We will use R to illustrate that Du Bois’s work is more relevant now than ever, and that the call for visualization of Black America is still as important as it was in the 1900s.

Chun Fung Kwok - `animate`: A web-based graphics device for animated visualisation in R

Animated visualisation is a powerful tool for capturing complex space-time dynamics. Yet animating visualisations in real time and beyond XY plots, as in the cases of agent-based models and sports analytics, remains challenging with the current R graphics system. Here, I present animate, an R package that implements a new web-based graphics device to enable flexible real-time visualisation in R.

Utilising the R-to-JavaScript transpilation provided by the sketch package, the device allows users to take full advantage of the d3 JavaScript library using the R base plot syntax. The base plot syntax is extended and adapted to support animation, including both frame-by-frame and motion-tweening options. There is little new to learn other than the differences between static and animated plots, so users can get productive quickly. The device integrates well with Shiny and R Markdown documents, making it easy to create responsive and shareable applications online or as a standalone HTML document. We will go through the package’s API and showcase the many exciting new possibilities the package brings to the table.

Cara Thompson - Level up your labels: Tips and tricks for annotating plots

Poster Award winner

Polished annotations can make all the difference between a good plot that contains all the necessary information, and a great plot that engages readers with a clear story. Whether we’re using annotations to highlight different groups, to tell stories about an outlier data point, to add detail about key values or to explain how a predictive model works, applying a few simple tricks allows them to shine as integral parts of our data visualisations. These include:

  • how to use colors and fonts to draw attention to key elements in annotations
  • how to format text on the fly to change fonts, colors and text sizes within the same annotation using ggtext
  • the different alignment options for text and for arrows, and how to assign them programmatically depending on where the annotation sits compared to its data point
  • how to add multiple arrows with different curvature values by passing a tibble containing their parameters into annotate()

Covering both the design and the coding aspect of these tips, this poster presentation aims to equip you to level up the effectiveness of your annotations and the code that underpins them.
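Two of these tricks can be sketched with ggplot2 and ggtext; the labels, colours, and coordinates below are invented for illustration and are not taken from the poster:

```r
library(ggplot2)
library(ggtext)
library(grid)   # for unit()

label_df <- data.frame(
  x = 3.6, y = 31,
  label = "One <span style='color:#b2182b;'>notable outlier</span>"
)
arrow_df <- data.frame(x = 3.6, y = 30.3, xend = 3.0, yend = 33.2)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # inline styling (a colour change mid-annotation) via ggtext
  geom_richtext(data = label_df, aes(x, y, label = label),
                fill = NA, label.color = NA, hjust = 0, inherit.aes = FALSE) +
  # a curved arrow whose parameters live in a data frame, as the poster suggests
  geom_curve(data = arrow_df, aes(x, y, xend = xend, yend = yend),
             curvature = 0.3, arrow = arrow(length = unit(2, "mm")),
             inherit.aes = FALSE)
```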

https://cararthompson.com/talks/user2022

Posters on Dissemination of Information

John Paul Helveston - `xaringanBuilder`: A better way to build (`xaringan`) slides

Thanks to packages like xaringan and xaringanExtra, R is increasingly being used to create high-quality, interactive presentation slides, which are typically rendered in html format. While highly effective for presentations, html files are less convenient for sharing with others, and converting them to popular formats (such as a pdf) can be cumbersome. The xaringanBuilder package was created to simplify this process and make it easier to build slides into multiple formats. Currently, the package can render xaringan slides to html, pdf, png, gif, mp4, and pptx formats as well as png images optimally sized for sharing on social media. Slides that contain panelsets or other html widgets are supported, and a new slide can be built for each increment on slides that contain incremental animations. Finally, xaringanBuilder is not limited to only xaringan slides. For example, other html slides, such as ioslides presentations, can be converted from html to other formats, and any pdf can be converted to png, gif, mp4, and pptx formats. More information about the package can be found at https://jhelvy.github.io/xaringanBuilder/.
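Based on the formats listed above, typical calls look roughly like the following; the build_*() function names are assumptions inferred from the package's purpose and should be checked against its reference manual:

```r
library(xaringanBuilder)

# From a xaringan source (or its rendered html) to other formats
build_pdf("slides.Rmd")      # pdf
build_gif("slides.Rmd")      # animated gif of the deck
build_pptx("slides.Rmd")     # PowerPoint, one image per slide
build_social("slides.Rmd")   # png sized for sharing on social media
```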

Gergely Daroczi - Internationalization of R packages

Localization and translation of base R and the R manuals have been possible for a long time, even with community support at http://translation.r-project.org, although the related activity has been hectic from time to time. Recently, interest around internationalization has increased again; see, e.g., the “Translating R to Your Language” tutorial at last year’s useR! conference.

Although base R’s support for GNU gettext is extremely powerful and sufficient for the most common translation tasks, it lacks support for some of the more advanced GNU gettext features–such as providing comments for the translator, which is useful when the translation is being done by people other than the programmer. Rx Studio faced that problem when outsourcing the translation of its internal R package into 6 languages, and created helper functions to find and extract terms to be translated from R packages (either in source code or in translations), along with metadata to be shared with the translators, and related tooling for uploading po files to translation services.

This poster will introduce these helpers in the form of a freshly open-sourced R package that helps other R package developers support non-English-speaking users.
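For context, the standard base-R gettext workflow that these helpers extend looks roughly like this (the package and message names are invented):

```r
# Inside a hypothetical package "mypkg": mark user-visible strings as translatable.
hello <- function(name) {
  # gettextf() flags the format string for translation and does sprintf()-style
  # substitution; translations live in .po files under the package's po/ directory
  message(gettextf("Hello, %s!", name, domain = "R-mypkg"))
}

# Maintainer side: create or refresh the translation templates for the package
# tools::update_pkg_po("path/to/mypkg")
```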

Lara Spieker - Publishing your R book with CRC

In this very practical poster/elevator pitch, editors from Chapman and Hall/CRC discuss why you should consider publishing an R or data science book and why you should work with CRC. The poster will go over the publishing process and provide best practices for shaping your ideas and submitting a book proposal; the editors will discuss their bestsellers and popular series as well as emerging topics and trends.

Posters on Environment and Ecology

David Schruth - The `primate` R package enables functional access to All the World's Primates via SQL tables

Ecological investigation is invariably plagued by the inherent difficulty of collecting and analysing all possibly relevant causal input variables. While many datasets have been amassed to answer ancient ecological mysteries, and numerous tools have been developed to enable merging of such disparate sources of possible influence, these problems remain somewhat unresolved due to the inaccessibility of tools and data. For example, current understanding of when primates originated is uncertain, and the extent to which possible ecological influences (e.g., leaping, daylight, canopy height, or diet) shaped their emergence is likewise unclear. A database that has recently become available online derives from the “All the World’s Primates” project. This database, however, is built in SQL and currently requires the use of a web interface with only limited functionality. Here I showcase an additional tool, the primate R package, which enables easy-to-use, SQL-style functional access to these data. To highlight its utility, I review several primate origin puzzles and some possible solutions, and demonstrate how the package helps bring together disparate datasets compiled by thousands of independent specialized primatologists to begin answering these age-old questions with big data, all via a familiar R interface.

Victor Korir - Cloud-computing and collaborative code for ecological regionalization: The case of *Prosopis juliflora* invasion on the Marigat Plains

Mesquite (Prosopis juliflora, Fabaceae) is among the most noxious invasive plant species across the tropics worldwide. This shrub is native to tropical South America and was introduced to Kenya during the 1970s for afforestation and recovery of degraded land. Since the 1990s the plant has been recognized as invasive on the Marigat Plains, Kenya, because of its fast spread and its negative impact: it encroaches on pastures and arable land, and blocks access to watering points at Lake Baringo. Landscape and environmental variables have been considered important predictors for the degree of invasion by Prosopis. Nevertheless, the geological complexity of the Marigat Plains makes any attempt to produce comprehensive correlation models very difficult. We therefore used the NDVI (normalized difference vegetation index) calculated from LANDSAT imagery as a proxy for soil properties influencing vegetation phenology and total biomass. We targeted an ecological regionalization using CART (classification and regression trees), a semi-supervised classification method, and classified the study area into 7 different cover units as functions of NDVI statistics (e.g., maximum and variability), landscape metrics and spatial heterogeneity. The resulting classification was further used as a factor to enhance the predictive capacity of invasion models. In the context of this work, we implemented a series of tools for collaborative data assessment: cloud computing using the rgee package, an interface between Google Earth Engine and R; Sciebo as a data cloud; and GitLab as a Git repository for code and documentation.

Kapil Choudhary - An improved ensemble empirical mode decomposition based hybrid model for forecasting agricultural commodity prices

Agricultural price forecasting is one of the challenging areas of time series analysis due to the inherently noisy, nonstationary, and nonlinear characteristics of the data. In this study, an ensemble empirical mode decomposition (EEMD)-based neural network model is proposed for agricultural price forecasting. First, the original price series was decomposed into several independent intrinsic mode functions (IMFs) and one residue component: monthly prices of soybean oil from the international market were decomposed into eight independent intrinsic modes and one residue with different frequencies, revealing some interesting features of price volatility. A time-delay neural network (TDNN) with a single hidden layer was then constructed to forecast these IMFs and the residual component individually. Finally, the predictions of all IMFs, including the residual, were aggregated into an ensemble output for the original price series. Empirical results demonstrate that the proposed EEMD-TDNN model outperforms the TDNN model in terms of root mean square error and directional prediction statistics, mainly due to the nonlinear and nonstationary characteristics of the series.
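A hedged sketch of this decompose-forecast-aggregate strategy, using Rlibeemd for the EEMD step and forecast::nnetar() as a stand-in for the authors' time-delay neural network, on a built-in series rather than the soybean oil prices:

```r
library(Rlibeemd)
library(forecast)

y    <- co2                        # built-in monthly series as a stand-in for prices
imfs <- eemd(y, num_imfs = 0)      # IMFs plus the final residue, one per column

h <- 12
component_fc <- lapply(seq_len(ncol(imfs)), function(i) {
  forecast(nnetar(imfs[, i]), h = h)$mean   # neural-network forecast per component
})

ensemble_fc <- Reduce(`+`, component_fc)    # aggregate the component forecasts
ensemble_fc
```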

Virginia Andrea García Alonso - Retrieving and visualizing satellite sea water temperature data for marine analyses: a case study using the `rerddap` R package

Environmental variables such as sea water temperature and salinity are key determinants of many biological processes in marine ecosystems. Temperature variability is especially important in high-latitude environments, where species are subject to marked seasonal variations that influence their life cycles and development. Since obtaining in situ data in marine and oceanic areas entails logistic challenges and may provide inadequate spatio-temporal resolution, satellite data emerge as a powerful tool to boost marine analyses. In this poster we describe a workflow covering the steps needed to retrieve satellite data from the ERDDAP server with the rerddap package, reshape the data into a “tidy” format with dplyr, and visualize temperature patterns with ggplot2, among other packages. The study area is located at the southern border of the Southwest Atlantic Ocean, a region displaying both marked seasonality and a longitudinal gradient in water temperature across all seasons, making it an appropriate example area. Materials for the poster will be openly shared in a GitHub repository.
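A minimal sketch of the retrieval step with rerddap; the dataset id and field name below (a NOAA sea surface temperature product) are assumptions for illustration, not necessarily the ones used in the poster:

```r
library(rerddap)

sst_info <- info("jplMURSST41")    # dataset metadata from the ERDDAP server

sst <- griddap(sst_info,
               time      = c("2021-01-01", "2021-01-31"),
               latitude  = c(-55, -45),
               longitude = c(-70, -55),
               fields    = "analysed_sst")

head(sst$data)   # tidy data frame of time, coordinates, and temperature values
```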

Round B, 22 June 2022, 11:45am - 12:30pm CDT

Posters on Multivariate Analysis and Applications

Francisco J. Benítez Ríos - `rmoo`: An R package for multi-objective optimization

rmoo is a non-dominated sorting-based multi-objective optimization package built upon the GA package. It provides a complete and flexible framework for optimizing multiple supplied objectives, giving researchers a wide range of configuration options as well as real-valued, permutation and binary representations. The R language is widely used by statisticians and researchers in related areas, providing them with a large number of tools as well as a powerful toolbox for plotting graphs. rmoo has been built with these advantages in mind and is easy to use.

Aleix Alcacer - Discovering archetypal football teams and stats with biarchetype analysis

In 1994, Adele Cutler and Leo Breiman introduced archetypal analysis, an unsupervised learning method similar to cluster analysis. Rather than typical observations (cluster centers), it looks for extreme points in the dataset, called archetypes. We propose a new statistical methodology called biarchetype analysis (biAA), which, like all archetypal analysis techniques, seeks to extract extreme cases from a dataset. However, unlike the previous methods, biarchetype analysis allows for the extraction of extreme cases from both observations and variables.

In developing our methodology, we first present a detailed definition of biarchetype analysis, as well as a numerical method for solving it. BiAA was also implemented in the R programming language, and a package was created to make it easier to use. Finally, biAA was used to solve a sports analytics problem, applied to a data set of football team metrics, allowing us to discover hidden patterns in the data.

David Degras - Generalized tensor canonical correlation analysis

Canonical correlation analysis (CCA) is a celebrated statistical technique for finding linear combinations of variables that are maximally correlated between two datasets. By reducing the dimension of data, CCA facilitates understanding relationships between groups of variables. Individual scores on canonical variables can also provide useful features to machine learning algorithms (e.g., for classification, clustering, or regression). Since the 1970s various extensions of CCA have been developed to handle multiple datasets (MCCA), high-dimensional data, nonlinearity patterns, and more. In recent years, the wide availability of multiple data sources has renewed interest in multiblock analysis methods–including MCCA–for data integration and fusion. Related to this, new applications in biomedical research, computer vision, and remote sensing have prompted efforts to extend MCCA to tensor data (e.g., 2D/3D images and video sequences). We will present our ongoing research on MCCA for tensor data, including the new R package tensorMCCA. Focusing on computations, we will discuss challenges in initializing optimization algorithms, assessing the quality of solutions, determining higher-order canonical components, and processing large datasets. We will demonstrate tensorMCCA with an application to the multimodal integration of brain imaging data.

Laura Vicente-Gonzalez - PERMANOVA: Multivariate analysis of variance based on distances and permutations

Due to recent advances in data collection, it is increasingly common to have data matrices with a high number of variables, sometimes higher than the number of individuals. When the aim of the study is to establish the significance of the differences among several groups–arising, for example, from the treatments of a designed experiment–a multivariate analysis, rather than separate univariate analyses, should be used in order to control the Type I risk. The most popular method for multivariate comparisons is Multivariate Analysis of Variance (MANOVA). Normally, MANOVA is used together with a pictorial representation of the group centroids (Canonical Analysis) to help with interpretation when the hypothesis of no group differences is rejected. The problem with MANOVA is that it has very restrictive conditions for its correct application–namely, the data have to be multivariate normal and the structure of variation and covariation must be the same across groups. Moreover, the number of variables has to be much smaller than the number of individuals. In many applications none of these conditions holds, and it is necessary to use non-parametric methods. PERMANOVA (MANOVA through permutations) can be used as an alternative to MANOVA when the application conditions do not hold.

A package for PERMANOVA is presented. A pictorial representation based on principal coordinates of the group means to explore deviations from the null hypothesis is proposed. The main theoretical results will be applied to different sets of data.
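For comparison, the same kind of distance-based, permutation MANOVA is available in the widely used vegan package; the snippet below uses vegan's adonis2() and example data, not the authors' PERMANOVA package:

```r
library(vegan)

data(dune)       # site-by-species abundance matrix (many variables)
data(dune.env)   # grouping factors for the same sites

adonis2(dune ~ Management, data = dune.env,
        method = "bray", permutations = 999)
```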

Posters on Novel Statistical Methods

Frédéric Bertrand - `bootPLS`: Bootstrap hyperparameter selection for PLS models and extensions

Methods based on partial least squares (PLS) regression, which has recently gained much attention in the analysis of high-dimensional genomic datasets, have been developed since the early 2000s for performing variable selection. Most of these techniques rely on tuning parameters that are often determined by cross-validation (CV) based methods, which raises essential stability issues. To overcome this, we recently introduced non-parametric bootstrap-based techniques to determine the number of components for regular or sparse PLS ((s)PLS) and sparse GPLS regression ((s)GPLS), as well as a new dynamic bootstrap-based method for significant predictor selection, suitable for both PLS regression and its incorporation into generalized linear models (GPLS). These techniques rely on establishing bootstrap confidence intervals, which allow testing the significance of predictors at a preset Type I risk α and avoid CV. The bootPLS package provides implementations of these non-parametric, stable bootstrap-based techniques for determining the number of components in partial least squares linear or generalized linear regression models as well as sparse partial least squares linear or generalized linear regression models.

Benjamin Schwendinger - Holistic generalized linear models

Selecting a sensible model from the set of all reasonable models is an essential but typically time-consuming step in data analysis. To simplify this process, Bertsimas & King 2015 and Bertsimas & Li 2020 introduce the holistic linear model (HLM). The HLM is a constrained linear regression model where the constraints aim to automate the model selection process by utilizing quadratic mixed-integer optimization. The integer constraints are used to place cardinality constraints on the linear regression model. Placing a cardinality constraint on the total number of variables allowed in the final model leads to the classical best subset selection problem (Miller 2002):

\min_{\beta} \; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 \quad \text{subject to} \quad \lVert \beta \rVert_0 \le k

Adding cardinality constraints on user-defined groups of variables can be used to limit the pairwise multicollinearity or select the best (non-linear) transformation. Additionally, the HLM allows posing constraints on the global multicollinearity and linear constraints on the parameters.

This work introduces holiglm, an R package for formulating and fitting holistic generalized linear models (HGLMs). To our knowledge, we are the first to suggest using conic optimization to extend the results presented for linear regression by Bertsimas et al. to the class of generalized linear models. The holiglm package provides a flexible infrastructure for automatically translating constrained generalized linear models into conic optimization problems. The optimization problems are solved by utilizing the R optimization infrastructure package ROI (Theußl, Schwendinger & Hornik 2020). Using ROI makes it possible for the user to choose from a wide range of commercial and open-source optimization solvers. Additionally, a high-level interface is provided, which can be used as a drop-in replacement for the stats::glm() function. Using conic optimization instead of iteratively reweighted least squares (IRLS) has the advantage that no starting values are needed, the results are more reliable (proven optimality), and the solvers are designed to handle constraints. These advantages come at the cost of a longer runtime. However, as shown by Schwendinger, Grün & Hornik 2021, for some GLMs the speed of the conic formulation is similar to that of the IRLS implementation.
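To give a flavour of the ROI layer, here is a toy sketch (not holiglm's internal conic formulation) of a constrained least-squares problem expressed as a ROI optimization problem and handed to a QP-capable solver plugin; the constraint and solver choice are illustrative:

```r
library(ROI)
library(ROI.plugin.qpoases)   # any QP-capable ROI plugin works here

set.seed(1)
X <- cbind(1, matrix(rnorm(50 * 3), 50, 3))
y <- drop(X %*% c(1, 2, 0, -1)) + rnorm(50)

# minimize (1/2) b'X'Xb - (X'y)'b  subject to  b2 + b3 <= 1, b unrestricted in sign
# (ROI's Q_objective encodes 1/2 x'Qx + L'x)
op <- OP(
  objective   = Q_objective(Q = crossprod(X), L = as.numeric(-crossprod(X, y))),
  constraints = L_constraint(L = rbind(c(0, 1, 1, 0)), dir = "<=", rhs = 1),
  bounds      = V_bound(ld = -Inf, nobj = 4)   # lift ROI's default lower bound of 0
)
solution(ROI_solve(op, solver = "qpoases"))
```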

Darshana Jayakumari - A new goodness-of-fit diagnostic for count data based on half-normal plots

Goodness-of-fit diagnostics guide researchers in identifying the best available model for the data at hand. Many graphical and quantitative methods for model selection are available in the literature. In this project we aim to add a distance-based framework to an existing model selection method, bringing a quantitative basis to a qualitative approach applied specifically to count data. This technique helps assess the suitability of the assumed model and mean-variance relationship. The proposed framework is applied to half-normal plots with an added simulated envelope. The framework includes penalisation functions based on the envelope width and the distance of the residual points from the boundary of the envelope. The effectiveness of the penalisation functions was tested in a simulation study covering mild and strong overdispersion and different sample sizes. Preliminary results showed that the distance framework allows distinguishing between the single-parameter Poisson model and models with an added dispersion parameter. The framework was also tested on real-life datasets exhibiting underdispersion, overdispersion and zero inflation; it distinguishes the well-fitted model reliably and agrees with graphical model selection based on half-normal plots.
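The underlying diagnostic (a half-normal plot of residuals with a simulated envelope) can be produced, for example, with the hnp package; the sketch below fits a deliberately misspecified Poisson model to overdispersed counts, the situation in which such penalisation functions would then be applied (the data are synthetic and this is not the proposed distance framework itself):

```r
library(hnp)

# Overdispersed counts fitted with a plain Poisson model:
# many points should fall outside the simulated envelope.
set.seed(123)
d <- data.frame(x = runif(200))
d$y <- rnbinom(200, mu = exp(1 + 2 * d$x), size = 1)

fit <- glm(y ~ x, family = poisson, data = d)
hnp(fit, sim = 99)   # half-normal plot with a simulated envelope
```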

Carlos Pasquier - Using relative weight analysis with residualization to detect relevant nonlinear interaction effects in ordinary and logistic regression

Relative weights analysis is a classic tool for detecting whether one variable or interaction in a model is relevant. In this work, we present the construction of relative weights for non-linear interactions using restricted cubic splines. Using this idea, we provide a method to identify the most representative set of variables and interactions of a multivariate model using relative weights analysis. We tested this procedure using two simulated examples, giving representative results and demonstrating the usefulness of the method.

Posters on R Education

OpenSalud LAB - Coding to save lives

We designed an educational program in data science, entirely in Spanish, open and free of charge for public health servants in Latin America. It comprises almost six months of training and was created in association with various R communities and with the collaboration of different teachers. We hope that the people who take the bootcamp will be able to apply their programming skills to solving complex problems within their institutions, make the best evidence-based decisions and ultimately improve the quality of care for patients and their families. In addition, we want to encourage the use of R and democratize access to advanced knowledge as a common way to address healthcare management processes, through continuous training of public servants.

Why in Spanish? Because we want to improve access to advanced knowledge in programming and data science in Spanish-speaking communities, since most of the quality content is in English.

Why free? Because we do not want people to be unable to acquire this type of knowledge due to lack of money or economic barriers, depriving citizens of the advantages of using this technology.

Why R? Because it is close to the health field, where it is frequently used in scientific research; because it is simpler to introduce into the usual processes of hospitals; and because of the large global R community.

Tyler George - Utilizing open source resources to teach introductory data science

There is a plethora of open education resources for introductory data science courses. Utilizing these resources in a classroom presents a variety of challenges for instructors, including difficulties adapting the course materials to their particular program or campus, setting up computing infrastructure or cloud services, and learning new software. This poster covers a successful introductory data science course redesign, using primarily the open source resource Data Science in a Box by Mine Çetinkaya-Rundel, in a One-Course-at-A-Time semester calendar at a small liberal arts college. Four major areas of this Introduction to Data Science course implementation are covered:

  • The course design including the schedule, classroom setup, and daily course flow.
  • The technology and content an instructor needs to learn in order to teach the course effectively.
  • The use and effectiveness of the chosen teaching pedagogies, primarily collaborative learning, supported by open source activities that student groups completed collaboratively using RStudio and Git.
  • The setup of a minimal infrastructure to run RStudio Server on a campus computer.

Lastly, the difficulties and room for improvement in all four areas will be discussed.

Posters on R in Production (sponsored by Appsilon)

Maciej Nasinski - Conscious R packages maintenance

Poster Award winner

This poster presents the pacs package, a set of supplementary utilities for CRAN maintainers and R package developers, and its wide range of tools for keeping the R environment healthy and making the developer’s life easier. Each function was inspired by everyday challenges faced by experienced R developers in production-quality agile projects. The tools are designed to be universal.

Konrad Oberwimmer - Automation in (mass) production of charts with `svgtools`

Poster Award winner

When designing charts, statisticians sometimes have to abide by detailed corporate design rules. Conventional R packages may prove to be unhandy in such cases, either because of a lack of formatting options or because applying all rules results in long code. The SVG format provides the possibility of separating design and statistical concerns. Because of its vector-based nature and its XML file format, an SVG template of a chart can be modified to reflect correct statistical values (e.g., percentages in a bar chart). The R package svgtools does this by translating statistical values to coordinates and replacing the latter in a pre-existing SVG file. Resulting charts may be saved to disk or rendered directly (e.g., by R Markdown).

On this poster, we show applications of this approach to official statistics produced by an Austrian government agency. Besides presenting the general workflow, we give hints on how to use svgtools to mass-produce hundreds of charts from a small set of templates in a reasonable amount of computation time.
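The underlying idea, sketched here with xml2 rather than svgtools’ own API, is to treat the SVG template as XML and rewrite coordinate attributes from the statistical values; file names, the `bar` class and the pixel scale below are made-up placeholders:

```r
library(xml2)

# Open an SVG template, rescale the width of bar elements to data values,
# and write the finished chart back to disk.
doc <- read_xml("template.svg")
xml_ns_strip(doc)                                  # drop the SVG namespace for simple XPath

values <- c(0.42, 0.31, 0.27)                      # e.g. percentages for a bar chart
bars   <- xml_find_all(doc, "//rect[@class = 'bar']")

for (i in seq_along(values)) {
  xml_set_attr(bars[[i]], "width", format(values[i] * 500))   # 500 px per unit
}

write_xml(doc, "chart.svg")
```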

Marcin Dubel - The `data.validator` package as a safety net in R projects

Data is the backbone of every data science project. If it contains flaws, even the best code will produce misleading output. What is worse, we cannot mitigate this risk with unit tests. R offers great flexibility in data manipulation, which makes it even more crucial to ensure that the input is as expected. If the first step of a typical workflow is loading the data, the second step should ensure that the analysis can proceed with trustworthy input. This can be achieved with data.validator assertion rules.

Yet validating the data in R is only the beginning of the process. Since the data source is often beyond our control, violations need to be communicated to non-R programmers in a user-friendly way. This can be achieved with data.validator’s built-in or custom validation reports. Production solutions based on R, such as Shiny applications, present two cases in which validation is especially useful. If the workflow requires loading or selecting the data, users should be informed early about any problems. If the app is connected to a database source, it is crucial to automatically check the quality of the data. We cannot let users base their decisions on the wrong input.
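A minimal sketch of such assertion rules (the data frame, column names and thresholds are illustrative placeholders; data.validator pairs naturally with assertr predicates):

```r
library(data.validator)
library(assertr)
library(magrittr)

# Toy input standing in for real project data.
input_data <- data.frame(patient_id = c(1, 2, NA),
                         visit_date = Sys.Date() - 0:2,
                         dose       = c(10, -1, 5))

report <- data_validation_report()

validate(input_data, name = "Raw input checks") %>%
  validate_cols(not_na, patient_id, visit_date,
                description = "Identifiers and dates are complete") %>%
  validate_if(dose >= 0, description = "Doses are non-negative") %>%
  add_results(report)

save_report(report)        # HTML report that non-R users can read
# get_results(report) returns the results as a data frame for custom reporting.
```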

Cody L. Marquart - Automating package management using Gitlab CI/CD

With roughly 19,000 packages, R has a thriving community of scientists and researchers contributing to CRAN. Although these package developers may be experts in their respective fields, it does not mean they have the experience, or time, to maintain a library of open source code. Open source code maintenance can be time-consuming–especially in R, where CRAN enforces a strict set of requirements to ensure consistent behavior across all platforms. These rules require package maintainers to provide well-documented code that is tested on numerous operating systems using previous, current, and future versions of R. This is a daunting task and one that can become overwhelming very quickly. Discussed here is a process, modeled on a larger one used by Epistemic Analytics at the University of Wisconsin-Madison, that combines a set of existing tools (e.g., devtools, covr, rhub, pkgdown, drat) with the power of GitLab Continuous Integration and Deployment (CI/CD) in order to automate the majority of these tasks. With a little initial configuration to set up a pipeline, a package developer using standard git workflows can focus on the code, while the documentation, testing, and distribution are nearly fully automated, as seen in our example package hosted on GitLab. In the future, this work could be streamlined, requiring a developer to implement only a single file (utilizing configuration templates in our existing projects) to get the full benefit of this automation.
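The GitLab CI configuration itself is YAML, but each stage essentially runs a handful of R commands; a sketch of what the stages might call, based on the tools listed above (file and repository names are placeholders, not the Epistemic Analytics pipeline itself):

```r
# check stage: R CMD check, treating warnings as failures
devtools::check(error_on = "warning")

# coverage stage: report test coverage
covr::package_coverage()

# cross-platform stage: submit to the R-hub builders (requires a validated email)
# rhub::check_for_cran()

# docs stage: build the pkgdown site (GitLab Pages can serve the artifact)
pkgdown::build_site()

# deploy stage: insert the built tarball into a drat repository
# drat::insertPackage("mypackage_0.1.0.tar.gz", repodir = "drat")
```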

Posters on R in the Wild

Jean-François Rey - The `vmr` package to manage virtual machines for/with R

vmr allows users to manage, provision and use a virtual machine preconfigured for R. It can be used to develop, test and build a package in a clean environment. It offers a choice of OS providers to improve the quality, productivity, reproducibility and sharing of R productions. Here we present a pipeline over GitLab CI/CD to create VMs, a Vagrant cloud repository, how vmr uses Vagrant, and the possibilities offered by the vmr package to manipulate a VM using R code.

Abdul Aziz Nurussadad and Akbar Rizki - Twitter Bot using `rvest`, `rtweet` and a GitHub Action

Poster Award winner

panganBot posts the daily price of several food items in Indonesia, harvested from hargapangan.id using rvest. Diagrams are made using ggplot2 and sent to Twitter using rtweet and GitHub Actions. panganBot uses the same flow for panganBOTpublish and berasBOTpublish: it scrapes a table from hargapangan.id, turns it into a data frame, then makes a line graphic using ggplot2 and publishes it using rtweet. There are several tweaks here and there to keep the tweet as human-readable as possible (e.g., Indonesians use a point {.} as the thousands separator).

https://twitter.com/panganBot/status/1540235020224573440/photo/1
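A condensed sketch of that flow (the table index, column names and authentication are placeholders, not the bot’s exact code; rtweet ≥ 1.0 syntax):

```r
library(rvest)
library(ggplot2)
library(rtweet)

# 1. Scrape the daily price table (the real page structure may differ).
page <- read_html("https://hargapangan.id")
tbl  <- html_table(page)[[1]]

# 2. Plot prices over time, using the Indonesian thousands separator on the axis.
p <- ggplot(tbl, aes(x = Date, y = Price, colour = Commodity)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_number(big.mark = "."))
ggsave("prices.png", p, width = 8, height = 5)

# 3. Post the chart (authentication, e.g. rtweet_bot(), must be configured as a GitHub Action secret).
post_tweet(status = "Harga pangan hari ini",
           media = "prices.png",
           media_alt_text = "Daily food prices in Indonesia")
```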

Duy Nghia Pham - Non-trivial balance of centrifuge rotors

Loading tubes in opposite buckets has been used universally yet intuitively to balance centrifuge rotors. Most rotors support tube distributions with rotational symmetry of order not only 2 but also other prime divisors of the total bucket number. This potential allows rotors to be balanced by the nontrivial placement of tubes, which offers users greater flexibility and more safety in centrifuge operation. Based on linear combinations and random sampling, centrifugeR finds the number of tubes that can be loaded in centrifuge rotors in a single operation and shows various ways to balance them.
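A base-R sketch of the linear-combination idea: under the known characterisation that k tubes can be balanced in an n-hole rotor when both k and n − k are nonnegative integer combinations of the prime divisors of n, the admissible tube counts can be enumerated directly (a sketch only; centrifugeR goes further and returns the actual hole positions):

```r
# Which tube counts can be balanced in an n-hole rotor?
prime_divisors <- function(n) {
  ps <- integer(0); m <- n; d <- 2L
  while (d * d <= m) {
    if (m %% d == 0L) { ps <- c(ps, d); while (m %% d == 0L) m <- m %/% d }
    d <- d + 1L
  }
  if (m > 1L) c(ps, m) else ps
}

balanced_counts <- function(n) {
  ps <- prime_divisors(n)
  ok <- c(TRUE, rep(FALSE, n))          # ok[k + 1]: k is a combination of the prime divisors
  for (k in seq_len(n)) {
    prev <- k - ps
    prev <- prev[prev >= 0]
    ok[k + 1] <- length(prev) > 0 && any(ok[prev + 1])
  }
  which(ok & rev(ok)) - 1               # keep k where both k and n - k are representable
}

balanced_counts(30)   # admissible numbers of tubes for a 30-hole rotor
```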

Loren Lee - Translation trouble in *Le Roman de Silence*: Using Shiny and bookdown for digital editing of medieval texts

Since its first edition produced by Lewis Thorpe in 1972, Le Roman de Silence–a thirteenth-century Picard verse narrative–has been the fuel for much debate concerning its mysterious author/narrator, Heldris of Cornwall, and his message regarding gender. Some label the text a misogynist defense of the status quo (Gaunt 1990). Others revel in the gendered possibilities the text allows its readers to imagine (Barr 2020). Despite many interpretations, the common thread running through this contentious scholarship is a shared frustration with the inability of modern language translations to convey the text’s multilayered meanings. However, it is not the fault of editors and translators but rather the limitations of traditional print technology that prevent us from properly engaging with this unique manuscript. Unlike a traditional print edition, a flexible, interactive digital edition could reflect the mutability of Heldris’s language, thereby better articulating the ways Silence itself translates gender (Campbell 2019). I have developed this new digital edition and propose new methods toward the editing of medieval manuscripts such that the polysemy of the original is not lost on a modern audience. I will present a new translation of verses 2439-2688 of Silence in a bookdown book, and harness the interactive power of Shiny to allow the modern reader to fully engage with the text. By structuring my translation as tidy data, capturing all translation possibilities, and allowing the reader to interact with those possibilities via Shiny, the true depth of the text is unlocked as never before.

Posters on Shiny Web Applications

Mark Gallivan - Creating an image classifier R Shiny app with a multi-round disagreement workflow

The performance of any machine learning model depends on its underlying training data. In the absence of labeled datasets, creating accurate, consistent, and domain-expert-informed datasets can be cumbersome and expensive. Within the ophthalmology domain, images are considered an objective source of truth; however, manual labeling is required to obtain data quality assessments, anatomical measurements, and disease outcomes. The outputs of these classifications can power artificial intelligence (AI) models and drive research, insights, and business value. Through this poster and elevator speech, I will showcase the power of the Shiny package to let domain experts easily classify images and adjudicate disagreements between users through a multi-round process streamlined in the app. Shiny’s reactive programming paradigm is well suited to modifying the multi-round disagreement process based on user input. Particular attention is given to the benefits of the shinyjs package for providing a friendlier user experience and to the AWS command line interface, which serves as the glue of the app’s data flow. A key feature of the app is its ability to determine who labeled what, when, and why. A traceable audit trail is important both from a regulatory perspective and for evaluating inter-rater concordance and data quality. At the conclusion of the poster and speech, R users will understand how Shiny can create expert-generated datasets much faster and move closer towards data-centric AI.

Shazia Ruybal-Pesántez - `covidClassifyR`: Streamlining data analysis pipelines to enhance the monitoring of COVID-19 in Papua New Guinea

Due to the COVID-19 pandemic, the capacity of already under-resourced health systems in many low- and middle-income countries (LMICs) has been significantly strained. Measuring antibodies to SARS-CoV-2 in the laboratory has been instrumental as a public health tool because it enables a “snapshot” of the extent of community transmission and population-level immunity to better understand how many people in a given area have been infected and where transmission is occurring. We developed covidClassifyR, a user-friendly Shiny web application that streamlines data analysis pipelines to enable researchers on-the-ground to analyze antibody data generated in the lab and use built-in algorithms to robustly classify unknown patient samples as positive (i.e., recently exposed to COVID-19) or negative (not exposed). These results can then be leveraged to enhance and tailor public health strategies for monitoring COVID-19. Importantly, the app aims to make the downstream data processing, quality control and interpretation of the raw antibody data accessible to all researchers without the need for a specialist background in statistical methods or programming. It also allows users to perform quality control (QC) on their data, download an automated QC report, and visualize their data directly on the app through interactive plots. In this talk I will present a case study of the use of covidClassifyR in Papua New Guinea, and demonstrate how Shiny web applications can be leveraged to bridge the gap between data generation and data analysis for lab researchers on-the-ground and importantly to support disease surveillance efforts in LMICs.

Jacopo Baldacci - Development of an application for interactive exploration and quantification of diagnostic patterns of myopathy in muscular tissue histology images

Q&A will be handled by Ilaria Ceppa

Muscle biopsy is a routine diagnostic procedure for investigating the causes of muscle diseases. Routine histochemistry, typically performed on frozen tissue, commonly includes various stains, which allow the assessment of muscle fiber morphology and the identification of many pathological and oftentimes diagnostic patterns. Unfortunately, the lack of uniformity in the interpretation of these patterns by clinicians is an important issue in the management of neuromuscular disorders. In this study, we developed a Shiny app (integrated with a Python API, called via the httr package, and with JavaScript code, via the shinyjs package) that allows the interactive exploration of two important diagnostic patterns: “increased fiber size variation” and “increase in the number of internal nuclei.” These patterns are typically visible, under a light microscope, in muscle biopsies of patients affected by myopathy. Our software takes as input all the images–acquired with a scanner–that compose the whole scan of the muscle section. A segmentation algorithm is run on each image in order to recognize the edges and separate the fibers from each other. The software can therefore calculate the area of each segmented fiber and plot the distribution of the areas. Moreover, we developed an algorithm for detecting internal nuclei in each fiber. Our software can calculate the percentage of fibers with internal nuclei and count the number of internal nuclei each fiber contains. These quantifiable pieces of information could help the clinician collect objective and quantified data in order to detect important signs of myopathic damage.

Tobia De Koninck - The state of ShinyProxy--2022

With the ever-increasing amount of data to analyze, the need for interactive web applications keeps growing. A broad set of frameworks exists to build such applications–for instance, Shiny, Dash, H2O Wave, Streamlit, etc. ShinyProxy offers a 100% open source enterprise solution to run and manage such applications. Because of its scalability, ShinyProxy can sustain any number of users, while still being simple enough for small teams to deploy. Integration with existing systems is an important feature of ShinyProxy: connect with existing authentication systems (e.g., LDAP, Active Directory, OpenID Connect …), metric platforms (Prometheus, InfluxDB, Postgres), or logging tools (Loki, Elastic), or even adapt the look and feel to your own style. Do your users need more freedom to dive into the data? No problem, ShinyProxy can host IDEs such as RStudio or notebook servers (e.g., Jupyter or Zeppelin notebooks). Give your users the resources they need by automatically allocating CPUs, RAM and even GPUs.

In this pitch we present some common use cases and integrations of ShinyProxy and give an overview of the upcoming features. Stay tuned for an improved user experience!

Posters on Social Science Analysis

Andrew Vancil - Community health: Using R to coordinate, analyze and share community survey data

As more and more community survey data is collected in various efforts, the heterogeneity of that data can make harmonizing and learning from it challenging. Thankfully, R can help by directly interfacing with external software programs to unlock powerful and seamless analysis protocols. We have established a workflow to use R to access a REDCap (Research Electronic Data Capture) survey via the REDCap API, prepare and analyze the data, geocode addresses to neighborhoods, and compile a shareable report.

Cincinnati Children’s Hospital Medical Center is actively investigating household food availability in targeted neighborhoods. The primary data sources in this research are surveys taken by community members (presented in both English and Spanish). The surveys are constructed and hosted in REDCap and accessed in R via the REDCap API. Once the data is collected in R, it is passed to a geocoding program called DeGAUSS. DeGAUSS relies on Docker, which is called through an R Markdown chunk accessing the bash terminal within R. Once the geocoded data is pulled back into R, data analysis and visualization are conducted in an R Markdown file for easy dissemination to key stakeholders. The striking, interactive visualizations and maps that we generate in R are vital for identifying issues and spurring intervention.
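A sketch of the first step of that workflow (the API URL, token variable and DeGAUSS invocation are placeholders; REDCapR::redcap_read() is a higher-level alternative to a raw API call):

```r
library(httr)

# Pull survey records from the REDCap API (URL and token are placeholders).
resp <- POST("https://redcap.example.org/api/",
             body = list(token   = Sys.getenv("REDCAP_API_TOKEN"),
                         content = "record",
                         format  = "csv",
                         type    = "flat"),
             encode = "form")
records <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))

# Geocode with DeGAUSS by calling Docker from R, then read the output back in.
# system("docker run --rm -v $PWD:/tmp <degauss-geocoder-image> addresses.csv")
# geocoded <- read.csv("addresses_geocoded.csv")
```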

By tapping into R’s ability to access outside software programs, powerful data collection and analysis can be conducted and shared. These results are used by important decision makers whose actions can have significant impacts in disadvantaged neighborhoods in the Cincinnati area.

Jeanne Raquel de Andrade Franco - Mental disorders in the Latin American population

The most common mental disorders are depression and anxiety. Depression is characterized by the absence of positive expressions. Anxiety is a mental disorder that includes phobias, panic, and post-traumatic stress, among others. Mental disorders in the Latin American population constitute a growing health problem. Data collected in 2005 showed that Brazil, Colombia and Peru had the highest rates of people with depression. For generalized anxiety disorder, Brazil led with the highest percentage, followed by Colombia and Chile. The objective of this study was to evaluate depression, anxiety, and eating disorders in Latin American countries. The data was taken from the Our World in Data website and covers the years 1990 to 2017. The R packages used for the analyses were dplyr, psych, RVAideMemoire, car and rstatix; for producing the graphs the packages used were ggplot2, ggdark, gridExtra and grid. The statistical analysis used was the Kruskal-Wallis test, with post-hoc Dunn’s tests and Bonferroni p-value adjustment to analyze the differences between groups. The Kruskal-Wallis analyses showed that differences exist between Latin American countries for all three mental disorder variables; similarly, Dunn’s test showed differences in pairwise comparisons between some countries. The graphs showed that Chile, Brazil, and Argentina had the highest percentages of depression, anxiety, and eating disorders.
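A condensed sketch of that analysis pipeline (the data below are synthetic stand-ins, not the Our World in Data values, and variable names are placeholders):

```r
library(dplyr)
library(rstatix)
library(ggplot2)

# Synthetic illustrative data in the shape of the country-year extract.
set.seed(1)
owid <- data.frame(
  country          = rep(c("Brazil", "Chile", "Argentina"), each = 28),
  depression_share = c(rnorm(28, 4.5, 0.3), rnorm(28, 5.0, 0.3), rnorm(28, 4.0, 0.3))
)

# Overall test of differences among countries, then pairwise Dunn tests
# with Bonferroni-adjusted p-values.
kruskal.test(depression_share ~ country, data = owid)
owid %>% dunn_test(depression_share ~ country, p.adjust.method = "bonferroni")

ggplot(owid, aes(x = country, y = depression_share)) + geom_boxplot()
```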

Bastián González-Bustamante - Time-dependent data encoding and time-varying exposure survival models to study presidential crises

This presentation demonstrates the application of time-varying exposure survival models and the conversion of proportional hazards data into a time-dependent, non-proportional hazards dataset in order to analyse specific events such as crises. Specifically, we analyse the effect of low presidential approval rates on ministerial turnover in Brazil and Chile. Approval is measured with quarterly estimates using a dyad-ratio algorithm and merged into the time-dependent structure to evaluate individual ministerial terminations. This implies encoding the dataset so that each case has multiple observations according to defined time intervals–in this case, quarters of the year. The empirical strategy combines time-varying exposure Cox regressions with observational data and non-proportional hazards due to the dataset structure. We then employ propensity score and matching techniques to estimate precisely the effect of low approval on ministerial survival, and perform moderation analyses with different ministers’ profiles associated with presidential strategies for coping with turbulent times.
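A minimal sketch of the counting-process encoding and a time-varying exposure Cox fit (the data are synthetic and the variable names are placeholders, not the authors’ dataset):

```r
library(survival)

# One row per minister-quarter: (tstart, tstop] intervals with a time-varying
# low-approval indicator; 'exit' = 1 in the quarter the minister leaves office.
set.seed(7)
minister_quarters <- do.call(rbind, lapply(1:60, function(id) {
  q <- sample(2:8, 1)                           # number of quarters observed
  data.frame(minister_id  = id,
             tstart       = 0:(q - 1),
             tstop        = 1:q,
             low_approval = rbinom(q, 1, 0.4),
             exit         = c(rep(0, q - 1), rbinom(1, 1, 0.7)),
             country      = sample(c("Brazil", "Chile"), 1))
}))

fit <- coxph(Surv(tstart, tstop, exit) ~ low_approval + strata(country),
             data = minister_quarters, cluster = minister_id)
summary(fit)
```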

Peter Fortunato - How R helps me evaluate the safety performance of a metropolitan highway network

Poster Award winner

The statistic commonly used to evaluate roadway safety performance, the crash rate, is not statistically robust and can mislead transportation policymakers when deciding which elements of a roadway network need to be improved. While crash rates have the advantage of being easily understood by the public, they have the potential to mistakenly prioritize low-volume, low-collision sites, wasting millions of taxpayer dollars in the process. The Ohio Kentucky Indiana (OKI) Regional Council of Governments–the metropolitan planning organization (MPO) for the Cincinnati, Ohio, region–has undertaken the effort to implement methods presented in the Highway Safety Manual to derive a more helpful statistic, the potential for crash reduction (PCR). Our goal at OKI is to screen the 8-county region in order to determine which segments and intersections have the highest potential for a reduction in crashes. This will then help inform our prioritization process when allocating federal and state capital to local municipalities that need assistance in funding transportation improvement projects. The statistical methods featured in this analysis are the negative binomial regression model and the empirical Bayes method.
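A sketch of those two ingredients, a safety performance function fitted with MASS::glm.nb() and one common Highway Safety Manual-style empirical Bayes weighting (the data are synthetic, and the exact weighting and PCR definition shown here are illustrative, not OKI’s implementation):

```r
library(MASS)

# Synthetic segment data standing in for a highway network.
set.seed(2022)
segments <- data.frame(aadt = exp(runif(300, 7, 10)), length_mi = runif(300, 0.1, 2))
segments$crashes <- rnbinom(300, mu = 0.0005 * segments$aadt * segments$length_mi, size = 1.2)

# Safety performance function (SPF): expected crashes given exposure.
spf <- glm.nb(crashes ~ log(aadt) + offset(log(length_mi)), data = segments)

# Empirical Bayes: weight the SPF prediction against the observed count using
# the negative binomial overdispersion (here taken as 1/theta).
mu <- predict(spf, type = "response")
w  <- 1 / (1 + mu / spf$theta)
eb <- w * mu + (1 - w) * segments$crashes

# One common screening statistic: excess expected crashes (EB estimate minus SPF prediction).
segments$pcr <- eb - mu
head(segments[order(-segments$pcr), ])
```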

Edwin de Jonge - Hodge-decomposing Dutch internal migration

Migration, the flow of people changing residence, can be represented as a weighted directed graph and is therefore an example of network data. Analysis of network data can be challenging, and this is especially true for weighted directed networks. We present the package hodgedecompose, which helps extract gradient and cyclic flows in weighted directed networks. To demonstrate its functionality we analyze the open dataset of Dutch internal migration, which consists of the migration flows between 350 Dutch regions. The package decomposes a weighted directed graph into an undirected backbone, a gradient graph and a cyclic remainder.

Posters on Social Science Analysis during the COVID-19 Pandemic

Kelsey Edmond - E-Quality: Leveraging Twitter to understand the digital divide

The digital divide refers to the social stratification due to an unequal ability to access, adapt, and create knowledge via the use of information and communication technologies (ICTs). The demand for widespread digital access increased suddenly and dramatically at the onset of the COVID-19 pandemic. This study explores how digital divide discourse developed amid the crisis by employing a large-scale text analysis of verified tweets.

Leveraging the academictwitteR package in R, tweets were systematically collected and analyzed using descriptive statistics, sentiment analysis, time series analysis, regression discontinuity design, and topic modeling via latent Dirichlet allocation. The outcome of this research aims to inform the development of adequate policies targeted at more egalitarian digital use, ultimately aiming to decrease digital and, subsequently, social inequalities.

https://twitter.com/pshhkels/status/1539668775506653184
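A sketch of the collection step with academictwitteR (the query, dates and storage path are illustrative, and an Academic Research bearer token is required):

```r
library(academictwitteR)

tweets <- get_all_tweets(
  query        = '"digital divide" is:verified lang:en',   # verified accounts only
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets   = "2021-12-31T23:59:59Z",
  n            = 100000,
  data_path    = "data/tweets/",
  bind_tweets  = FALSE
)

# Later: bind_tweets("data/tweets/", output_format = "tidy") for a tidy tibble.
```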

Lauren Norris - Baltimore City Health Department (BCHD) COVID-19 vaccination clinic data visualization using an R Shiny app

COVID-19 vaccines are a key tool in ending the COVID-19 global pandemic. To promote the health and safety of Baltimore City, Maryland, residents, it is essential to reach and vaccinate as many eligible Baltimoreans as possible. Through the Baltimore City Health Department’s (BCHD) VALUE (Vaccine Acceptance & Access Lives in Unity, Engagement & Education) Communities Initiative, BCHD is working to increase COVID-19 vaccine access, acceptance, and uptake via educational outreach and pop-up vaccination clinics focused on some of Baltimore’s most vulnerable and/or hesitant populations. A Shiny app was created and deployed in October 2021 to easily visualize BCHD vaccination clinic data trends over time and to track progress towards pre-specified vaccine targets. Clinic-level (not individual-level) data is stored in and extracted from REDCap and contains 19 categorical and 48 continuous variables, including number of doses administered by dose number (first, second, third/booster), vaccine manufacturer, clinic location (zip code, address), etc. The Shiny app allows users to choose the clinic population of interest (e.g., older adults) for which they would like to view data and reactively displays numerous figures, including number of clinics over time, cumulative vaccines administered over time, and proportion of the population vaccinated through BCHD efforts/clinics by neighborhood. The Shiny app is for BCHD internal use only and is regularly presented to BCHD senior leadership to inform and improve vaccine strategy.

Merriah Croston - What's in a jab? The spread of COVID-19 vaccine misinformation versus fact-checks on Twitter

COVID-19 has infected hundreds of millions of people, caused millions of deaths, and resulted in extensive socio-economic damage. Although disease underlies these outcomes, misinformation has played an incalculable role. Indeed, there is some consensus that the spread of misinformation on social media during the COVID-19 pandemic is unprecedented. In response, several approaches have been taken to prevent the spread of misinformation, including fact-checking.

This is the first study to examine the spread of fact-check URLs versus the URLs they debunk using social network analysis of Twitter data. We were motivated to examine vaccine-related misinformation as a case study due to the threat of vaccine hesitancy and vaccine refusal to global health. Moreover, vaccine-related conspiracy theories are particularly prevalent during large-scale infectious disease outbreaks. We compared the retweet network that formed around vaccine-related misinformation URLs to that of corresponding fact-check URLs. Using social network analysis, we visualized these networks and answered the following questions: How are these networks structured and are there distinct communities? Which network members are most influential and supportive? What URLs are most popular according to the scale and rate of spread? Finally, what message and network member characteristics are associated with retweeting? To achieve this, we retrieved and examined a corpus of tweets that were posted June 1 - November 30, 2021, using a set of R packages, including academictwitteR, statnet, igraph, and ergm.count. Code and tweet IDs can be employed in additional case studies and will be available at https://github.com/mcroston/misinformation_covid19.git.

Posters on Specialized Methods and Graphics

Isra Ahmad - How much coffee do you really drink?

My partner and I have had ongoing debates on the household expenditures on Philz Coffee. I obtained Philz Coffee app data via the California Consumer Privacy Act and analyzed data over a 3-year time period. I used RStudio to clean, analyze and visualize the data. This data revealed overall dollars expended on Philz Coffee, the locations of expenditures, the visits by location, and breakdown of expenditures by location. This analysis resulted in confirmation of the hypothesis that there is in fact an exorbitant amount of coffee expenditures.

Thomas Rose - GlobaLID: Accessing and visualising lead isotope data for the reproducible reconstruction of raw material provenance

Lead (Pb) isotope geochemistry is an established key method in the archaeological sciences for reconstructing the resource provenance of metals and the trade networks of past civilisations. Successful application and interpretation of Pb isotope signatures of metal artefacts rely crucially on published ore data, which are partly available only from pre-digital or re-digitised publications. Most Pb isotope reference data collections were compiled by individual working groups, usually focussing on their own projects and regions of interest. Despite the importance of the reference data, there is currently no common approach to how, and with which metadata, lead isotope reference values should be reported. Consequently, no public database yet exists to access them, rendering reproducibility of results difficult and time-consuming.

GlobaLID aims to overcome this situation by establishing an open database as a central repository for such lead isotope reference data and by developing an accompanying Shiny app that provides access to the database and carries out the most common tasks in the reconstruction of resource provenance, up to the creation of publication-quality plots. The app implements the reproducible visualisation of the reference data in maps and publication-quality plots of various styles, the main task of the lead isotope method. The presentation will briefly outline the aims of the GlobaLID project. It will then focus on the implementation of reproducible but customisable plots in the already available prototype and on our plans for the full version.

James Foadi - A `cry` for help: Statistical crystallography with R

Statistical techniques have been applied to the wide field of crystallography ever since this discipline started to make use of x-ray diffraction. Novel statistical applications in crystallography have regularly appeared in research journals up to the present, but their grouping into statistical crystallography has attracted little interest outside the traditional disciplines using crystallographic techniques as a tool to explore the properties of matter. This lack of diffusion has greatly limited the expansion of this important mathematical field, especially considering the tremendous recent advances in statistics and statistical computing. Part of the reason can be attributed to the specialist jargon and techniques of crystallography, mostly relating to space symmetry and data format.

On this poster a new R package, cry, is introduced and described. The goal of the package is to bridge the technical divide between crystallography and statistics by providing functions to read the most widespread crystallographic data formats and to cope with symmetry and other standard crystallographic operations. A few examples are featured to attract interest and familiarise the reader with this engaging, fundamental and rewarding research area. The main goal is ultimately to attract the interest and expertise of professional statisticians to the field of crystallography.

Rita Giordano - Data visualisation with `cry`

cry is a package developed to perform statistics on crystallographic data and to improve the quality of charts for scientific papers and presentations. The package presents different functions to create charts, based on output from different crystallographic software. Output from the most popular crystallographic programs is appropriately reshaped as input for various R plotting functions. The net outcome is clear charts that improve the understanding of most scientific results. This makes it possible to reach a wider audience, not necessarily within the strict field of interest. As a consequence, the dissemination of knowledge is facilitated. I will show how such applications with cry are helpful and why they are needed for better scientific publications or reports. I will also highlight the importance of improved communication via charts and the appropriate display of scientific results.

Posters on Statistical Modelling

Miguel A. Sorrel - "Where am I?": Efficient diagnostic classification in educational settings using `cdcatR`

In education, we differentiate between summative evaluation (i.e., pass/fail, rank-order grading) and formative evaluation (i.e., detection of strengths and weaknesses, feedback). Formative assessment involves measuring on a rather frequent basis and providing feedback as immediately as possible so that teachers and students can implement remedial solutions–that is, finding out where each student is in order to intervene in the most adapted way possible. However, in applied settings time is always in short supply. One promising way to address this issue involves the use of computerized adaptive tests based on cognitive diagnostic models (CD-CAT). On the one hand, cognitive diagnostic models are a family of statistical models that do not require large samples to be estimated and allow the classification of examinees into attribute profiles. For example, the attribute profile {100} would indicate that the student masters the first attribute but not attributes 2 and 3. On the other hand, the computerized adaptive implementation of these models allows each person to be administered, as the test progresses, the item that best fits their pattern of correct and incorrect responses. For each student, the assessment ends when there is high confidence regarding their classification. This poster illustrates how to evaluate assessments of this type using the R package cdcatR by didactically describing its features in the context of a simulated diagnostic application. The main goal is to disseminate this methodology, which is expected to facilitate more frequent and optimized measurement in real educational environments.

Pablo Nájera - `cdmTools`: An R package to facilitate diagnosis feedback with cognitive diagnosis modeling

Cognitive diagnosis modeling (CDM) is a family of statistical models that have received increasing attention, especially in the educational field, due to their ability to provide fine-grained information about the students’ mastery status of skills, domains, or competences. Thus, rather than providing a single score on a continuous scale (e.g., 710 in Math), CDM identifies whether examinees master (or not) a series of more specific attributes (e.g., mastery of subtraction, non-mastery of multiplication), thus facilitating remedial instruction. Some R packages (e.g., GDINA, CDM, cdcatR) are already available for the analysis of CDM data. There are, however, some important features concerning CDM that are missing in these existing packages. To address this, we have developed the cdmTools R package, which contains: (1) useful functions for simulation studies, including the detection and generation of identified random Q-matrices; (2) two procedures (parallel analysis and model-fit comparison) for assessing the dimensionality of CDM data; (3) the discrete factor loading empirical Q-matrix estimation method; and (4) the Hull method for empirical Q-matrix validation. By filling these important gaps, cdmTools is a necessary complement to other R packages for conducting comprehensive and valid CDM analyses. In this poster and elevator pitch, the main functions of the cdmTools package, as well as synergies with other CDM packages, are illustrated using an educational dataset.

Alexandre Brouste - Fast and efficient estimation procedure for counting stochastic processes

In regular statistical experiments, the sequence of Le Cam’s one-step estimators presents certain advantages over the sequence of maximum likelihood estimators (MLE) and over other sequences of estimators (method of moments, quantile matching method, etc.) in terms of computational cost and asymptotic variance. It is much less computationally expensive than the MLE, while having the same rate and the same asymptotic variance. Since there is no full numerical optimization (only one computation of the Newton step or the Fisher scoring step), the procedure is faster and appropriate for very large datasets. At the same time, it is asymptotically optimal in terms of asymptotic variance, which is generally not the case for other sequences. We propose to extend this procedure, initially developed for i.i.d. random variables, to the estimation of the parameters in the (deterministic) intensity function of an inhomogeneous Poisson counting process or in the (stochastic) intensity function of a Hawkes process. Monte Carlo simulations are carried out for several examples in order to exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost.
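For i.i.d. data, the classical one-step update can be written compactly (a standard textbook form shown for orientation only; the poster extends the idea to Poisson and Hawkes intensities):

```latex
\hat{\theta}_n = \tilde{\theta}_n
  + \frac{1}{n}\, I(\tilde{\theta}_n)^{-1}
    \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(X_i;\, \tilde{\theta}_n)
```

where θ̃ₙ is a preliminary √n-consistent estimator (e.g., from the method of moments) and I(θ) is the Fisher information.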

Emily Tupaj and Jerzy Wieczorek - `CIPerm`: An R package for computationally efficient confidence intervals from permutation tests

We carry out computationally efficient construction of confidence intervals from permutation tests for simple differences in means. When using a permutation test to evaluate H_0: mu_A - mu_B = 0, the naive approach to constructing a CI for the (mu_A - mu_B) parameter would require carrying out many new permutation tests at different values of (mu_A - mu_B). Instead, our package constructs a CI cheaply using a single set of permutations, making such CIs feasible for much larger datasets.
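To see why the naive inversion is expensive, here is a base-R sketch of it: each candidate value of mu_A - mu_B requires its own full permutation test on shifted data, whereas CIPerm derives the interval from a single set of permutations (the data and grid below are illustrative):

```r
# Naive CI by inverting permutation tests over a grid of candidate differences.
perm_pval <- function(a, b, delta, B = 2000) {
  a <- a - delta                       # under H0: mu_A - mu_B = delta, shifted groups are exchangeable
  pooled <- c(a, b); nA <- length(a)
  obs  <- mean(a) - mean(b)
  perm <- replicate(B, {
    idx <- sample(length(pooled), nA)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(perm) >= abs(obs))
}

set.seed(1)
a <- c(19, 22, 25, 26)
b <- c(23, 33, 40)
grid <- seq(-30, 10, by = 0.5)
keep <- vapply(grid, function(d) perm_pval(a, b, d) >= 0.05, logical(1))
range(grid[keep])    # approximate 95% CI, at the cost of length(grid) * B permutations
```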