<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>mpadge blog</title>
    <link>https://mpadge.github.io/blog</link>
    <description>R, C++, spatial, open data</description>
    <atom:link href="https://mpadge.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Debugging C++ code in R with a single command</title>
      <link>https://mpadge.github.io/blog/blog012.html</link>
      <guid>https://mpadge.github.io/blog/blog012.html</guid>
      <description>There are several great references for how to debug C or C++ code in R, but descriptions of how to start a debugger are often not so informative. This post describes how to start a debugging session in R with a single command.</description>
      <content:encoded><![CDATA[


<div id="debugging-in-r-with-a-single-command" class="section level1">
<h1>debugging in R with a single command</h1>
<p>The art of debugging C++ code in R has been covered in many other
places, notably including <a href="https://blog.davisvaughan.com/posts/2019-04-05-debug-r-package-with-cpp">this
post by Davis Vaughan</a>, and <a href="https://tdhock.github.io/blog/2019/gdb/">this helpful
introduction</a> by Toby Hocking, which provides a little more detail on
how to get a debugger started. The detail on starting a debugger is
nevertheless brief, and only added as an aside to the main point of the
post. The aim of this post is to provide a detailed reference on how to
start a source code debugger in R. This blog post will not describe
details of common debugging environments, if only because <a href="https://blog.davisvaughan.com/posts/2019-04-05-debug-r-package-with-cpp">Davis
Vaughan has already done such a great job of that</a>. As he describes
there, the two most common debugging environments used on Linux systems
are <a href="https://www.sourceware.org/gdb/">“gdb” (the GNU Project
Debugger)</a> and <a href="https://lldb.llvm.org/">“lldb”</a>. This
whole post presumes code to be debugged is in an R package, and that all
commands that follow are executed from within the root directory of that
package. Debugging code from other packages requires modifying the
following procedure to ensure that debug symbols are inserted within the
source code of those other packages. In practice it is generally easier
to do that from within the root directory of the package you want
debugged, so we’ll presume that from here on.</p>
<h2>How to start a source-code debugger in R</h2>
<p>Starting an R session causes a
computer console to enter a dedicated computational environment where R
commands can be typed, and will be appropriately interpreted and
executed. Similarly, starting a source-code debugger generally results
in entering a dedicated debugging environment where debugging commands
can be entered in order to debug source code which has been pre-loaded
into that environment. Starting a debugger from within an R environment
generally consists of two steps:</p>
<ol>
<li>Re-compiling source code to include debugging symbols; and then</li>
<li>Starting an R session in “debug” mode.</li>
</ol>
<p>Although people experienced with debugging might see these steps as
trivial, they can present insurmountable challenges to anybody who has
never used a source-code debugger before. This post describes a simple
setup for “automating away” these two steps, reducing them to a single
command.</p>
<h3>Re-compiling source code to include debugging symbols</h3>
<p>While there are several ways
source code in an R package (or elsewhere) can be re-compiled with
debugging symbols, perhaps the easiest is to insert the symbols within a
<code>Makevars</code> file. These files are used to control compilation
of source code. The following line in a <code>Makevars</code> file will
insert debugging symbols when the code is re-compiled:</p>
<pre><code>PKG_CPPFLAGS = -UDEBUG -g</code></pre>
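The following is a minimal sketch of a helper to do this, written for this post rather than taken from <code>mpmisc</code>; the function name and exact behaviour are assumptions. It creates <code>src/Makevars</code> with the debug flags, or appends them to an existing file:

```r
# Minimal sketch only, not the actual mpmisc::debug() implementation:
# create or extend a package's src/Makevars to compile with debug symbols.
add_debug_flags <- function (pkg_dir = ".") {
    mv <- file.path (pkg_dir, "src", "Makevars")
    flag_line <- "PKG_CPPFLAGS += -UDEBUG -g"
    if (!file.exists (mv)) {
        writeLines (flag_line, mv)
    } else if (!any (grepl ("-UDEBUG -g", readLines (mv), fixed = TRUE))) {
        # append the debug flags to any existing compilation flags
        writeLines (c (readLines (mv), flag_line), mv)
    }
    invisible (mv)
}
```

A subsequent call to <code>pkgbuild::compile_dll()</code> then re-compiles with those symbols included.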
<p>The code can then be compiled with an R command like <a href="https://r-lib.github.io/pkgbuild/reference/compile_dll.html"><code>pkgbuild::compile_dll()</code></a>,
and will then include the debug symbols needed in the debugging
environment. I have a function in my <a href="https://github.com/mpadge/mpmisc">personal R package of general
purpose functions</a>, called <a href="https://github.com/mpadge/mpmisc/blob/main/R/debug.R"><code>debug()</code></a>
which automatically creates a Makevars file to insert these symbols (or
modifies an existing file to append the debug flags to any existing
compilation flags). The following section describes how this function is
used to implement a one-line command to start debugging.</p>
<h3>Starting an R session in debug mode</h3>
<p>As described above, R is effectively a
self-contained computational environment within some other environment
(such as a terminal environment, or RStudio). A debugger is also a
self-contained computational environment. Unsurprisingly, this means
that a debugging environment can not be started from within R, but must
be started from the “host” environment from where you usually start R,
such as a terminal, or a shell environment in RStudio. Debuggers also
need to be started by evaluating some specified R expression, generally
specified as an external R script. This means debugging some function,
<code>f()</code>, requires creating a simple file, say “script.R”, which
calls that function. The script must include any other lines necessary
for R to know how to load the function, such as <code>library()</code>
calls, or the full function definition. For example, if I wanted to
debug a function within <a href="https://github.com/mpadge/mpmisc">my
<code>mpmisc</code> package</a> - which would be silly, because it
contains no source code, but the principle applies regardless - then I
would create a “script.R” file with the following lines:
</p>
<pre><code>library (mpmisc)
check &lt;- increment_dev_version () # or whatever function I want debugged.</code></pre>
<p>
The debugger can then be started from that location by running:
</p>
<pre><code>R -d gdb -e &#39;source(&quot;script.R&quot;)&#39;</code></pre>
<p>
This command calls R, and must be run within a shell environment, not
from within R! The <code>-d</code> flag tells R to run in debug mode,
and requires specifying which debugger to use, such as “gdb” as in that
example, or “lldb”, or any other available debugger. The <code>-e</code>
flag specifies a command for R to evaluate while debugging.</p>
<h3>Putting it together in a single command</h3>
<p>My single-command solution is implemented via a shell alias, for
which I use “debugr”, which just calls a shell script:</p>
<pre><code>alias debugr=&quot;bash /&lt;path&gt;/&lt;to&gt;/debug.bash&quot;</code></pre>
<p>
The shell script is in <a href="https://github.com/mpadge/dotfiles/blob/main/system/debug.bash">my
<code>dotfiles</code> repo</a>, and contains these lines:</p>
<pre><code>#!/usr/bin/bash
echo &quot;---------------------------------------------------&quot;
echo &quot;         Debug an R script with gdb or lldb&quot;
echo &quot;---------------------------------------------------&quot;
read -p &quot;Enter name of script (empty = default &#39;script.R&#39;): &quot; SCRIPT
Rscript -e &quot;mpmisc::debug (); pkgbuild::compile_dll()&quot;
DEBUGGER=gdb
if [ &quot;$SCRIPT&quot; == &quot;&quot; ]; then
    R -d $DEBUGGER -e &quot;source(&#39;script.R&#39;)&quot;
else
    R -d $DEBUGGER -e &quot;source(&#39;$SCRIPT&#39;)&quot;
fi</code></pre>
<p>
That script calls <code>mpmisc::debug()</code> to create or modify
Makevars to include debug symbols, and then re-compiles the source
object by calling <a href="https://r-lib.github.io/pkgbuild/reference/compile_dll.html"><code>pkgbuild::compile_dll()</code></a>.
It also includes an interactive prompt to specify the script to be used
for debugging, with a default of “script.R”. I can then debug any
package by creating a debug script, and then simply calling
<code>debugr</code> to drop me straight into a debugging environment. As
said at the outset, this post is only intended to describe how to get
that far. See the links given at the top for what to do once you’re
there.</p>
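For orientation only, since the links above cover actual debugger usage, a session started with the command above first drops into a <code>gdb</code> prompt before R itself runs; a typical opening then looks like the following, where the function name is purely illustrative:

```gdb
(gdb) break my_cpp_function    # illustrative name; gdb offers to make it pending
(gdb) run                      # starts R, which sources "script.R"
(gdb) continue                 # resume execution after inspecting a breakpoint
(gdb) quit
```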
</div>]]></content:encoded>
<pubDate>Tue, 30 Aug 2022 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Timeout on parallel jobs in R</title>
      <link>https://mpadge.github.io/blog/blog011.html</link>
      <guid>https://mpadge.github.io/blog/blog011.html</guid>
<description>Python's `multiprocessing` and `threading` libraries both have a timeout parameter for re-joining threads after they've finished. This provides an easy way to launch multi-threaded jobs while ensuring that no single thread exceeds a specified timeout. This post describes two ways to implement equivalent functionality in R.</description>
      <content:encoded><![CDATA[


<div id="timeout-on-parallel-jobs-in-r" class="section level1">
<h1>Timeout on parallel jobs in R</h1>
<p>Python’s <a href="https://docs.python.org/3/library/multiprocessing.html"><code>multiprocessing</code></a>
and <a href="https://docs.python.org/3/library/threading.html"><code>threading</code></a>
libraries both have a timeout parameter for re-joining threads after
they’ve finished. This provides an easy way to launch multi-threaded
jobs while ensuring that no single thread runs for longer than a
specified timeout. This is very useful in implementing a standard
“timeout on a function call” operation, as detailed in <a href="https://stackoverflow.com/questions/492519/timeout-on-a-function-call">this
Stack Overflow question of that title</a> which offers a bewildering
variety of approaches to that problem. Among the easiest of those is <a href="https://stackoverflow.com/a/14924210">the recommendation to rely
on the <code>multiprocessing</code> library’s <code>join()</code>
operation</a> which accepts a <code>timeout</code> parameter, <a href="https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join">as
described in the library’s documentation</a>. There is also an
equivalent parameter <a href="https://docs.python.org/3/library/threading.html#threading.Thread.join">for
Python’s other main parallelisation library, <code>threading</code></a>.
A nice example of the usefulness of this <code>timeout</code> parameter
in action is given in <a href="https://github.com/cokelaer/fitter">the
<code>fitter</code> package</a> by <a href="https://github.com/cokelaer">@cokelaer</a> for fitting probability
distributions to observed data. The main function fits a wide range of
different distributions, and can even automagically select the best
distribution according to specified criteria. This is done through
fitting different distributions in parallel on different threads,
generally greatly speeding up calculations. Distributional fitting is,
however, often an iterative procedure, meaning the duration required to
generate a fit within some specified tolerance can not be known in
advance. Parallel threads by default must wait for all to terminate
before individual results can be joined. To ensure distributional fits
are generated within a reasonable duration, <a href="https://github.com/cokelaer/fitter/blob/cf222aab741492917bd3a2d1af821e0b5344907d/src/fitter/fitter.py#L429"><code>fitter</code>
has a <code>_timed_run</code> function</a> to:</p>
<blockquote>
<p>spawn a thread and run the given function … and return the given
default value if the timeout is exceeded.</p>
</blockquote>
<p>The bit of that function which controls the timeout consists of the
following lines (with code for exception handling removed here):</p>
<pre><code>def _timed_run (self, func, args=()):
    class InterruptableThread(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self.result = default
        def run(self):
            self.result = func(args)
    it = InterruptableThread()
    it.start()
    it.join(self.timeout)
    return it.result</code></pre>
<p>
That represents a succinct way to run a multi-threaded job in which each
thread obeys a specified timeout parameter. This post describes two
approaches to implementing equivalent functionality in R.</p>
<h2>Timeout in R’s ‘parallel’ package</h2>
<p>R’s <a href="https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf"><code>{parallel}</code>
package</a> offers one way to implement a <code>timeout</code>
parameter, via <a href="https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcparallel.html">the
<code>mccollect()</code> function</a>, which is (almost) equivalent to
Python’s <code>.join()</code> operator. This can be illustrated with
this arbitrarily slow function:
</p>
<pre><code>fn &lt;- function (x = 10L) {
    vapply (seq (x), function (i) {
        Sys.sleep (0.2)
        runif (1)
    }, numeric (1))
}</code></pre>
<p>
Calculating this in parallel is straightforward with the
<code>mcparallel()</code> and <code>mccollect()</code> functions. This
code generates 10 random inputs to <code>fn()</code> which will take
random durations up to 20 * 0.2 = 4 seconds each.
</p>
<pre><code>set.seed (1)
n &lt;- sample (1:20, size = 10, replace = TRUE)
library (parallel)
jobs &lt;- lapply (n, function (i) mcparallel (fn (i)))
system.time (
    res &lt;- mccollect (jobs)
)</code></pre>
<pre><code>c (user = 0.006, system = 0.000, elapsed = 3.615)</code></pre>
<p>
That took much less than the expected duration of,
</p>
<pre><code>sum (n) / 5</code></pre>
<p>
The <code>mccollect()</code> function has a <code>timeout</code>
parameter “to check for job results”. Specifying that in the above
function then gives the following, noting that the parameter
<code>wait</code> also has to be passed with its non-default value of
<code>FALSE</code> to activate <code>timeout</code>.
</p>
<pre><code>jobs &lt;- lapply (n, function (i) mcparallel (fn (i)))
system.time (
    res &lt;- mccollect (jobs, wait = FALSE, timeout = 2)
)</code></pre>
<pre><code>c (user = 0.000, system = 0.000, elapsed = 0.003)</code></pre>
<p>
That seems much too quick! What does the result look like?
</p>
<pre><code>res</code></pre>
<pre><code>list (`24053` = 0.6096623)</code></pre>
<p>
It seems that <code>mccollect()</code> has only returned one result. The
reason can be seen by tracing the implementation of the
<code>timeout</code> parameter from <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/library/parallel/R/unix/mcparallel.R#L48-L65">the
<code>mccollect()</code> function</a> through to <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/library/parallel/R/unix/mcfork.R#L55-L67">the
<code>selectChildren()</code> function</a> into <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/library/parallel/src/fork.c#L808">the
C function, <code>select_children()</code></a>, and finally to the <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/library/parallel/src/fork.c#L905-L922">lines
which implement the waiting procedure</a>. These lines show that the
function returns as soon as it collects a value from any of the “child”
processes (via <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/include/R_ext/eventloop.h#L86-L88">the
<code>R_ext/R_SelectEx()</code> function</a> which is <a href="https://github.com/wch/r-source/blob/5ab79ec84040684c74dc9c901fde944fff6e8375/src/unix/sys-std.c#L115">implemented
here</a>). So setting <code>timeout</code> in <code>mccollect()</code>
will then return results as soon as the first result has been
generated. That of course means that the remaining jobs continue to be
processed, and can be returned by subsequent calls to
<code>mccollect()</code>. Two consecutive calls will then naturally
return the first two results to be processed. To check this, we need to
note that the <code>jobs</code> list contains process ID
(<code>pid</code>) values, one of which is detached by the first call to
<code>mccollect()</code>, and so has to be removed from the
<code>jobs</code> list.
</p>
<pre><code>jobs &lt;- lapply (n, function (i) mcparallel (fn (i)))
pids &lt;- vapply (jobs, function (i) i$pid, integer (1))
system.time (
    res1 &lt;- mccollect (jobs, wait = FALSE, timeout = 2)
)</code></pre>
<pre><code>c (user = 0.000, system = 0.000, elapsed = 0.007)</code></pre>
<pre><code>jobs &lt;- jobs [which (!pids %in% names (res1))]
system.time (
    res2 &lt;- mccollect (jobs, wait = FALSE, timeout = 2)
)</code></pre>
<pre><code>c (user = 0.000, system = 0.000, elapsed = 0.003)</code></pre>
<p>
The two returned values are then,
</p>
<pre><code>res1; res2</code></pre>
<pre><code>list (`26140` = 0.05318079,
      `26146` = 0.7513229)</code></pre>
<p>
So R has a <code>timeout</code> parameter on parallel jobs, but it
doesn’t work like the equivalent Python parameters, and arguably doesn’t
work how one might expect. That code exploration is nevertheless
sufficient to understand how a pythonic version could be implemented:
</p>
<pre><code>par_timeout &lt;- function (f, n, timeout) {
    jobs &lt;- lapply (n, function (i) mcparallel (f (i)))
    Sys.sleep (timeout)
    mccollect (jobs, wait = FALSE)
}
par_timeout (fn, n, 2)</code></pre>
<pre><code>list (`26913` = 0.008293313,
      `26908` = c (0.2473093, 0.9442306),
      `26907` = 0.8032608,
      `26906` = c (0.1900972, 0.8134690, 0.2745623, 0.3148808, 0.3954601, 0.7415558, 0.9394560),
      `26905` = c (0.7566425, 0.2494607, 0.4848817, 0.3469343))</code></pre>
<p>
And we get five out of the expected 10 results returning within our
specified <code>timeout</code> of 2 seconds. We can estimate from the
generated values of <code>n</code> which ones should have returned,
given that <code>fn</code> takes 0.2s per unit of the input,
<code>x</code>, repeating the initial code used to generate those
values.
</p>
<pre><code>set.seed (1)
n &lt;- sample (1:20, size = 10, replace = TRUE)
timeout &lt;- 2 # in seconds
data.frame (n = n, should_work = n / 5 &lt;= 2)</code></pre>
<p>
And we might have expected 6 values to have returned, of which we
actually got only 5, but perhaps the value of <code>n = 10</code>
extended just beyond the timeout? We’ll nevertheless compare this result
with an alternative approach below. But first, there are some notable
drawbacks to the approach illustrated here:</p>
<ol>
<li>The <a href="https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcparallel.html">documentation
for the <code>mcparallel()</code> and <code>mccollect()</code>
functions</a> states at the very first line, “These functions are based
on forking and so are not available on Windows.” While that might not
concern those who develop packages on other systems, it will greatly
reduce the use of any code implementing parallel timeouts in this
way.</li>
<li>There are many “wrapper” packages around R’s core
<code>{parallel}</code> functionality, notably including the <a href="https://futureverse.org">“futureverse” family of packages</a>, the
primary aim of which is to make parallelisation in R simpler, by
enabling any call to be wrapped in parallelisation functions like
<code>future()</code>. These packages offer no direct way of controlling
the <code>timeout</code> parameter of <code>mccollect()</code>, or any
equivalent functionality.</li>
</ol>
<p>The next section explores a different approach that is
operating-system independent.</p>
<h2>Timeout via ‘callr’</h2>
<p>The <a href="https://callr.r-lib.org">callr package by Gábor Csárdi and Winston
Chang</a> is designed for ‘calling R from R’ – that is, for:</p>
<blockquote>
<p>performing computation in a separate R process, without affecting the
current R process</p>
</blockquote>
<p>The package offers two main modes of calling
processes: <a href="https://callr.r-lib.org/reference/r.html">as
blocking, foreground processes via <code>callr::r()</code></a>, or <a href="https://callr.r-lib.org/reference/r_bg.html">as non-blocking,
background processes via <code>callr::r_bg()</code></a>. The foreground
<code>r()</code> function has an explicit <code>timeout</code>
parameter, which returns a <code>system_command_timeout_error</code> if
the specified timeout (in seconds) is exceeded. The following code calls
the <code>fn()</code> function from above to demonstrate this
functionality, wrapping the main call in <code>tryCatch()</code> to
process the timeout errors:
</p>
<pre><code>timeout_fn &lt;- function (x = 1L, timeout = 2) {
    tryCatch (
        callr::r (fn, args = list (x = x), timeout = timeout),
        error = function (e) NA
    )
}</code></pre>
<p>
Passing a value of <code>x</code> larger than around 5 should then time
out at 1 second, as this code demonstrates:</p>
<pre><code>system.time (
    x &lt;- timeout_fn (x = 10, timeout = 1)
)</code></pre>
<pre><code>c (user = 0.152, system = 0.035, elapsed = 0.959)</code></pre>
<p>
The returned value is then:
</p>
<pre><code>x</code></pre>
<pre><code>NA</code></pre>
<p>That function timed out as expected. Compare what happens when the
<code>timeout</code> is extended well beyond that limit:</p>
<pre><code>timeout_fn (x = 5, timeout = 10)</code></pre>
<pre><code>runif (5)</code></pre>
<p>The
<code>timeout</code> parameter of <code>callr::r()</code> can thus be
used to directly implement a timeout parameter. The following
sub-section demonstrates how to extend this to parallel jobs.</p>
<h2>Parallel timeout via ‘callr’</h2>
<p>To illustrate a different approach than the
previous <code>mcparallel()</code> function, the following code uses the
<a href="https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mclapply.html"><code>mclapply</code>
function of the <code>parallel</code> package</a>, which unfortunately
also does not work on Windows, but suffices to demonstrate the
principles.
</p>
<pre><code>set.seed (1)
n &lt;- sample (1:20, size = 10, replace = TRUE)
nc &lt;- parallel::detectCores () - 1L
system.time (
    res &lt;- parallel::mclapply (mc.cores = nc, n, function (i)
                               timeout_fn (x = i, timeout = 2))
)</code></pre>
<pre><code>c (user = 1.754, system = 0.544, elapsed = 3.008)</code></pre>
<pre><code>print (res)</code></pre>
<pre><code>res &lt;- as.list (rep (NA, 10L))
res [[1]] &lt;- c (0.20134728, 0.09508085, 0.75240848, 0.30041337)
res [[2]] &lt;- c (0.5837042, 0.6133771, 0.3121486, 0.2943205, 0.4455983, 0.5102744, 0.8867751)
res [[3]] &lt;- 0.9381157
res [[4]] &lt;- c (0.9201705, 0.9656466)
res [[9]] &lt;- 0.7515151
print (res)</code></pre>
<p>
And that returned 5 out of the 10 jobs, as for the previous example
using <code>mccollect()</code>. (The actual values differ due to random
number generators being seeded differently in the two lots of jobs.)
This approach, of using <code>callr</code> to control function
<code>timeout</code> parameters, enables parallel jobs to be implemented
on all operating systems through replacing the <code>mclapply()</code>
or <code>mcparallel()</code> functions with, for example, <a href="https://cran.r-project.org/web/packages/snow/index.html">equivalent
functions from the <code>{snow}</code> package</a>. These
<code>{snow}</code> functions (such as the <code>parApply</code> family
of functions) also do not implement a <code>timeout</code> parameter,
and so this <code>{callr}</code> approach offers one practical way to do
so via those packages.</p>
<h3>Timeout parameters and ‘future’ packages</h3>
<p>
Processes triggered by the <code>{callr}</code> package do not generally
play nicely with the core <code>{future}</code> package, which was
likely one motivation for Henrik Bengtsson to develop <a href="https://future.callr.futureverse.org/">the
<code>{future.callr}</code> package</a> which explicitly uses
<code>{callr}</code> to run each process. The processes are nevertheless
triggered as <code>callr::r_bg()</code> processes which do not have a
<code>timeout</code> parameter. While it is possible to directly
implement a timeout parameter of <code>r_bg</code> processes by
monitoring until timeout and then using the <code>kill</code> method,
the <code>future.callr</code> package does not directly expose the
<code>r_bg</code> processes necessary to enable this. There is therefore
currently no safe way to implement a timeout parameter along the lines
demonstrated here within any <code>futureverse</code> packages.</p>
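To make that final point concrete, the following is a sketch of the “monitor until timeout, then kill” approach applied directly to a <code>callr::r_bg()</code> process. The function is illustrative only, and is not part of <code>callr</code> or of any futureverse package:

```r
# Illustrative sketch only: run f(x) in a background R process via
# callr::r_bg(), poll until `timeout` seconds have passed, and kill the
# process if it has not finished, returning NA in that case.
r_bg_timeout <- function (f, x, timeout = 2, poll = 0.1) {
    p <- callr::r_bg (f, args = list (x = x))
    t0 <- Sys.time ()
    while (p$is_alive ()) {
        if (difftime (Sys.time (), t0, units = "secs") > timeout) {
            p$kill ()
            return (NA)
        }
        Sys.sleep (poll)
    }
    p$get_result ()
}
```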
</div>]]></content:encoded>
<pubDate>Fri, 14 Jan 2022 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>GitHub notifications from the terminal</title>
      <link>https://mpadge.github.io/blog/blog010.html</link>
      <guid>https://mpadge.github.io/blog/blog010.html</guid>
      <description>I work almost entirely from the terminal, and regret the few remaining tasks which still require me to venture elsewhere, such as a web browser. Until recently, one of my main reasons for constantly switching to my browser was to check my GitHub notifications. This post describes how I view my notifications within the terminal, including an option to mark them as "read" on GitHub.</description>
      <content:encoded><![CDATA[


<div id="github-notifications-from-the-terminal" class="section level1">
<h1>GitHub notifications from the terminal</h1>
<p>I work almost entirely from the terminal, and regret the few
remaining tasks which still require me to venture elsewhere, such as a
web browser. Until recently, one of my main reasons for constantly
switching to my browser was to check my GitHub notifications. This post
describes how I view my notifications within the terminal, including an
option to mark them as “read” on GitHub. The internal functionality is
encoded in R, although the functions are mere <code>httr::GET</code>
calls which could easily be translated into any other language. The code
described and linked to here uses GitHub’s REST (version 3) API, because
notifications are not yet (at the time of writing) able to be accessed
via the more recent GraphQL (version 4) API. The <a href="https://cli.github.com">GitHub Command-Line-Interface (cli)</a>
relies exclusively on the GraphQL API, and so also can’t (yet) be used
to access notifications. Once notifications are accessible via GraphQL
queries, the <code>cli</code> will be able to be used directly to do
everything described here and much more. Until that time, the following
provides one way to access GitHub notifications from the terminal. ##
The script Like almost everything I do, I associate this with an alias,
in this case <code>gn</code> for, of course, GitHub Notifications. The
alias calls the following very simple <code>bash</code> script:
<code>{bash, eval = FALSE} #!/usr/bin/bash if [ &quot;$1&quot; == &quot;&quot; ]; then     Rscript -e &quot;mpmisc::gh_notifications ()&quot; elif [ &quot;$1&quot; == &quot;done&quot; ]; then     Rscript -e &quot;mpmisc::mark_gh_notifications_as_read()&quot; elif [[ &quot;$1&quot; =~ ^[0-9]+$ ]]; then     Rscript -e &quot;mpmisc::open_gh_notification ($1)&quot; else     echo &quot;gn only accepts &#39;done&#39; or a single number&quot;     exit 1 fi</code>
That shows the three options currently implemented:</p>
<ol>
<li><code>gn</code> to view notifications;</li>
<li><code>gn done</code> to mark all notifications as read; and</li>
<li><code>gn &lt;number&gt;</code> to open the nominated notification in
GitHub.</li>
</ol>
<p>All of these options call functions from an <a href="https://github.com/mpadge/mpmisc">R package I use to hold my
miscellaneous functions, <code>mpmisc</code></a>.</p>
<h2>The <code>gh_notifications</code> functions</h2>
<p>All of these functions are
contained within <a href="https://github.com/mpadge/mpmisc/blob/master/R/github-notifications.R">a
single file of that package</a>, itself containing less than 200 lines
of code. The main <code>gh_notifications()</code> function is a simple
<a href="https://docs.github.com/en/rest/reference/activity#notifications"><code>GET</code>
call to the API endpoint</a>. The request requires authentication with a
GitHub API token, and returns notifications for the user associated with
the token. The request returns a wealth of JSON data <a href="https://docs.github.com/en/rest/reference/activity#list-notifications-for-the-authenticated-user">described
in the API docs (under “Response”)</a>, from which I extract a few
essential details including:</p>
<ol>
<li>Title of the notification;</li>
<li>Repository, in <code>org/repo</code> format;</li>
<li>Issue number (where present; not for notifications from such things
as commit messages);</li>
<li>Notification URL;</li>
<li>Time at which the notification was updated or issued; and</li>
<li>Time at which the notification or issue was last read.</li>
</ol>
<p>These notifications are then cached for immediate recall by other
functions. Finally, the notifications are printed to screen with a
separate function which formats output using <a href="https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_(Select_Graphic_Rendition)_parameters">ANSI
escape codes</a>. The result then looks something like this:</p>
<p>→ <span style="color:red">org1/repo2 #3</span>:<span style="color:green"> title
one</span><br> → <span style="color:red">org4/repo5 #6</span>:<span style="color:green"> title two</span></p>
<p>Typing <code>gn 1</code> will then
open the first notification in my default web browser. The notifications
for <a href="https://github.com/mpadge/mpmisc/blob/master/R/github-notifications.R#L102-L133">the
<code>open_gh_notification()</code> function</a> are loaded from the
cached version, so opening is effectively instantaneous. Finally, the
REST API offers <a href="https://docs.github.com/en/rest/reference/activity#mark-notifications-as-read">one
additional function to mark all notifications as read by issuing a
<code>PUT</code> command to the same API endpoint</a>. <a href="https://github.com/mpadge/mpmisc/blob/c4fbbeb32b9a6e9ef9a10c58643fe9ea18afb470/R/github-notifications.R#L141-L151">The
<code>mark_gh_notifications_as_read()</code> function</a> does exactly
that, and is aliased in the above shell script to
<code>gn done</code>.</p>
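The two underlying requests can be sketched as follows, using <code>httr</code> against the documented REST endpoints. The helper names here are hypothetical, and the real <code>mpmisc</code> implementations differ (caching, formatting, and error handling are omitted):

```r
library (httr)

# Hypothetical helper: list notification titles for the authenticated user.
gh_notification_titles <- function (token = Sys.getenv ("GITHUB_TOKEN")) {
    res <- GET ("https://api.github.com/notifications",
                add_headers (Authorization = paste ("token", token)))
    stop_for_status (res)
    vapply (content (res), function (i) i$subject$title, character (1))
}

# Hypothetical helper: marking all notifications as read is a PUT to the
# same endpoint.
mark_all_read <- function (token = Sys.getenv ("GITHUB_TOKEN")) {
    PUT ("https://api.github.com/notifications",
         add_headers (Authorization = paste ("token", token)))
}
```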
</div>]]></content:encoded>
<pubDate>Wed, 27 Oct 2021 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>The allcontributors package</title>
      <link>https://mpadge.github.io/blog/blog009.html</link>
      <guid>https://mpadge.github.io/blog/blog009.html</guid>
<description>An alternative implementation in R of the original 'allcontributors.org' to acknowledge all contributors in your 'README' (or elsewhere). The original is intended to help acknowledge all contributions including those beyond the contents of an actual repository, such as community or other less-tangible organisational contributions. This version only acknowledges tangible contributions to a repository, but automates that task to a single function call, in the hope that such simplicity will spur greater usage.</description>
      <content:encoded><![CDATA[


<div id="the-allcontributors-package" class="section level1">
<h1>The allcontributors package</h1>
<p>The <a href="https://docs.ropensci.org/allcontributors"><code>allcontributors</code>
package</a> is an alternative implementation in R of the original <a href="https://allcontributors.org/"><code>all-contributors</code></a> to
acknowledge all contributors in your ‘README’ (or elsewhere). The
original is intended to help acknowledge <em>all</em> contributions
including those beyond the contents of an actual repository, such as
community or other, less-tangible organisational contributions. This
version only acknowledges tangible contributions to a repository, but
automates that task to a single function call, in the hope that such
simplicity will spur greater usage. In short: This package can’t do
everything the original does, but it makes what it does much easier.</p>
<h2>Why then?</h2>
<p>The original <a href="https://allcontributors.org/"><code>all-contributors</code></a> is
primarily a bot which responds to commit messages such as
<code>add @user for &lt;contribution&gt;</code>, where
<code>&lt;contribution&gt;</code> is one of the <a href="https://allcontributors.org/docs/en/emoji-key">recognized
types</a>. As said above, the relative advantage of that original system
lies primarily in the diversity of contribution types able to be
acknowledged, with each type for a given user appearing as a
corresponding <a href="https://allcontributors.org/docs/en/emoji-key">emoji</a> below
their github avatar as listed on the README. In comparison, this R
package:</p>
<ol>
<li>Works automatically, by calling <code>add_contributors()</code> at
any time to add or update contributor acknowledgements.</li>
<li>Works locally, without any bot integration.</li>
<li>Can add contributors to any file, not just the main README.</li>
<li>Offers a variety of formats for listing contributors: (i) divided
into sections by types of contributions, or as a single section; (ii)
presented as full grids (like <a href="https://github.com/all-contributors/all-contributors/blob/master/README.md#contributors-">the
original</a>), numbered lists of github user names only, or single text
strings of comma-separated names.</li>
</ol>
<h2>Usage</h2>
<p>The primary function of the
package, <a href="https://docs.ropensci.org/allcontributors/reference/add_contributors.html"><code>add_contributors()</code></a>,
adds a table of all contributors by default to the main
<code>README.md</code> file (and <code>README.Rmd</code> if that
exists). Tables or lists can be added to other files by specifying the
<code>files</code> argument of that function. The appearance of the
contributors table is determined by several parameters in that function,
including:</p>
<ol>
<li><code>type</code> For the type of contributions to include
(code, contributors who open issues, contributors who discuss
issues).</li>
<li><code>num_sections</code> For whether to present contributors in 1,
2, or 3 distinct sections, dependent upon which <code>type</code>s of
contributions are to be acknowledged.</li>
<li><code>format</code> Determining whether contributors are presented
in a grid with associated avatars of each contributor, as in <a href="https://github.com/all-contributors/all-contributors/blob/master/README.md#contributors-">the
original</a>, an enumerated list of github user names only, or a single
text string of comma-separated names.</li>
</ol>
<p>Contribution data are obtained by
querying the github API, for which a local key should be set as an
environment variable containing the name <code>&quot;GITHUB&quot;</code> (either
via <code>Sys.setenv()</code>, or as an equivalent entry in a file
<code>~/.Renviron</code>). If the main <code>README</code> file(s)
contains a markdown section entitled <code>&quot;Contributors&quot;</code>, the <a href="https://docs.ropensci.org/allcontributors/reference/add_contributors.html"><code>add_contributors()</code></a>
function will add a table of contributors there, otherwise it will be
appended to the end of the document(s). If you wish your contributors
table to be somewhere other than at the end of the <code>README</code>
file(s), start by adding an empty <code>&quot;## Contributors&quot;</code> section
to the file(s) and the function will insert the table at that point. Any
time you wish to update your contributor list, simply re-run the
<code>add_contributors()</code> function. There’s even an
<code>open_issue</code> parameter that will automatically open or update
a github issue on your repository so that contributors will be pinged
about them being added to your list of contributors. The data used to
construct the contributions table can also be extracted without writing
to the <code>README</code> file(s) with the function <a href="https://docs.ropensci.org/allcontributors/reference/get_contributors.html"><code>get_contributors()</code></a>:</p>
<pre><code>library (allcontributors)
get_contributors(org = &quot;ropensci&quot;, repo = &quot;allcontributors&quot;)</code></pre>
<h2>Updating Contributor Acknowledgements</h2>
<p>“Contributors” sections of
files will be automatically updated to reflect any new contributions by
simply calling <a href="https://docs.ropensci.org/allcontributors/reference/add_contributors.html"><code>add_contributors()</code></a>.
If your contributors have not changed, then your lists of
acknowledgements will not be changed. The <a href="https://docs.ropensci.org/allcontributors/reference/add_contributors.html"><code>add_contributors()</code></a>
function has an additional parameter which may be set to
<code>force_update = TRUE</code> to force lists to be updated regardless
of whether contributions have changed. This can be used to change the
formats of acknowledgements at any time. If anything goes wrong, the
easiest way to replace a contributions section is to simply delete the
old ones from all files, and call <a href="https://docs.ropensci.org/allcontributors/reference/add_contributors.html"><code>add_contributors()</code></a>
again.</p>
<h2>More Information</h2>
<p>The package has a <a href="https://docs.ropensci.org/allcontributors/articles/allcontributors.html">single
vignette</a> which visually demonstrates the various formats in which an
“allcontributors” section can be presented.</p>
</div>]]></content:encoded>
      <pubDate>10 Mar 21</pubDate>
    </item>
    <item>
      <title>The troubles with getting help files in R</title>
      <link>https://mpadge.github.io/blog/blog008.html</link>
      <guid>https://mpadge.github.io/blog/blog008.html</guid>
      <description>A primer on ways to extract the actual content of help files. Because one day people will hopefully start text-mining these things, and show us all sorts of things we never knew about the people who make R packages. When they do, this entry will hopefully help.</description>
      <content:encoded><![CDATA[


<div id="the-troubles-with-getting-help-files-in-r" class="section level1">
<h1>The troubles with getting help files in R</h1>
<div id="databases-of-help-files-in-r" class="section level2">
<h2>Databases of help files in R</h2>
<p>R has a very well structured system for documenting and accessing
help for packages. In most systems, attempts to access help files will
result in a dedicated window opening up with nicely formatted help
content for a requested topic. This blog entry addresses the issue of
how to extract the underlying text of those files, for example in order
to do any kind of text mining-type analyses. The content of the help
files can be extracted for a given package via the <a href="https://stat.ethz.ch/R-manual/R-devel/library/tools/html/Rdutils.html"><code>tools::Rd_db()</code>
function</a>. That function works like this:</p>
<pre><code>x &lt;- tools::Rd_db (package = &quot;tools&quot;)
class (x)
length (x)
class (x [[1]])</code></pre>
<p>
Say I want to extract the help file shown on the <code>html</code> page
for <code>Rd_db</code> linked to immediately above. Then I just have to
find the entry in the <code>Rd_db</code> data. As a first try, I simply
examine the names of the files in the database, and try to match (via
<code>grep</code>) the one called something like <code>rd_db</code>:</p>
<pre><code>grep (&quot;rd_db&quot;, names (x), ignore.case = TRUE)</code></pre>
<p>
The database contains no file called <code>Rd_db</code> or the like. If
you click again on the above link to the html entry you’ll notice that
the page itself is called, <code>Rdutils</code>. Where does that name
come from? Help files for R packages are contained within a
<code>/man</code> directory of the package source. When a package is
installed, all files within that directory (which end with a suffix of
<code>.Rd</code>) are compiled into a binary database object which can
then be read by the <a href="https://stat.ethz.ch/R-manual/R-devel/library/tools/html/Rdutils.html"><code>Rd_db()</code>
function</a>. So the databases of help files for any given package
contain one entry for each file in the original <code>/man</code>
directory of the package source, with the names of those original files
transferred over to the names of the corresponding entries in the
<code>Rd_db</code> file. These databases in installed packages are no
longer contained within directories named <code>/man</code>, rather they
are compiled within a directory called <code>/help</code>. The contents
of this directory can be readily examined with code like the following:</p>
<pre><code>loc &lt;- file.path (R.home (), &quot;library&quot;, &quot;tools&quot;, &quot;help&quot;)
list.files (loc)</code></pre>
<p>And the two <code>tools.rdb</code> and <code>tools.rdx</code> files
represent the binary database of help files for the <code>tools</code>
package. An alternative way to access the databases contained within
that directory is via the <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/lazyload.html"><code>lazyLoad()</code>
function</a>. (And clicking on that link reveals another
inconsistency: the function is called <code>lazyLoad</code>, yet the
page is named <code>lazyload</code>, for reasons which should become
clear as you read on.)</p>
<pre><code>package &lt;- &quot;tools&quot;
loc &lt;- file.path (R.home (), &quot;library&quot;, package, &quot;help&quot;, package)
e &lt;- new.env ()
chk &lt;- lazyLoad (loc, envir = e)
head (names (x))
head (ls (envir = e))</code></pre>
<p>
Those last two commands reveal that the entries in the object returned
from <code>Rd_db()</code> are the original and full file names within
the <code>/man</code> directory of the package source, while the
corresponding names when <code>lazyLoad</code>ed have the suffix,
<code>.Rd</code>, removed. The following line nevertheless confirms that
the two methods yield identical results:</p>
<pre><code>all (ls (envir = e) == gsub (&quot;\\.Rd$&quot;, &quot;&quot;, names (tools::Rd_db (package = &quot;tools&quot;))))</code></pre>
<h3>An alternative approach</h3>
<p>An alternative approach to extract some of the information contained
in the <code>Rd_db</code> object uses the fact that the <a href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/help.html"><code>utils::help</code>
function</a> can be called without specifying a <code>topic</code>.
Additionally specifying <code>help_type = &quot;text&quot;</code> will then
retrieve a few components of the database in text form.</p>
<pre><code>package &lt;- &quot;tools&quot;
h &lt;- utils::help (package = eval (substitute (package)), help_type = &quot;text&quot;)
class (h)</code></pre>
<p>
At that point, attempting to <code>print</code> the object
<code>h</code> will simply open the help file the usual way, rather than
giving you the textual content. Noting the output of the following,</p>
<pre><code>str (h)</code></pre>
<p>leads to the obvious next step of examining</p>
<pre><code>h$info</code></pre>
<p>
And the second component of the <code>h$info</code> object has the names
and descriptions of each entry in the help database.</p>
<h2>Getting help content for a particular function</h2>
<p>If we want to analyse the textual
content of help files, then we obviously need a way to extract that
content for any given function. Armed with the basics described above,
let’s say we want to extract the content of the help file for the <a href="https://stat.ethz.ch/R-manual/R-devel/library/tools/html/Rdutils.html"><code>tools::Rd_db()</code>
function</a>. If you click on that link, you’ll notice that the page
which describes only the function <code>Rd_db()</code> is actually
called, <code>Rdutils</code>. So how could we automatically extract the
content of the help file for <code>Rd_db()</code>, or indeed any
particular function, when the help files describing our desired function
may have entirely arbitrary names? The full entry for
<code>Rdutils</code> looks like this:</p>
<pre><code>x [[&quot;Rdutils.Rd&quot;]]</code></pre>
<p>And you’ll notice at the top that
<code>Rd_db</code> is given as an <code>alias</code>. The structure of
these files is described in a section of the <a href="https://cran.r-project.org/doc/manuals/R-exts.html#Documenting-functions">“Writing
R Extensions” manual</a>, which explains that these files contain a
“name”, a “title”, and optional “alias” entries. Comparing the above
text to the formatted <code>html</code> help page for <a href="https://stat.ethz.ch/R-manual/R-devel/library/tools/html/Rdutils.html"><code>Rd_db()</code></a>
reveals what those three fields are:</p>
<ol>
<li>The “name” field defines the name of a single help topic, which may
or may not be the name of the original <code>/man</code> directory file
in the package source (more on this below).</li>
<li>The “title” field specifies an arbitrary description which will
appear at the top of the help file.</li>
<li>The “alias” fields specify topics which will be linked to the given
help file.</li>
</ol>
<p>So to locate
the help entry for a nominated function, we need to match that
function with an <code>alias</code> entry for some help file which we do
not necessarily know the name of. As long as we know the package in
which we are searching, we can then simply extract all
<code>alias</code> entries for every single help file. The example in
the help file for the <a href="https://stat.ethz.ch/R-manual/R-devel/library/tools/html/Rdutils.html"><code>Rd_db()</code></a>
function uses a non-exported function called
<code>.Rd_get_metadata()</code> (non-exported meaning that function can
only be called via the triple-colon method as
<code>tools:::.Rd_get_metadata()</code>, and also meaning that there
will be no help entry for this function). This function can be used to
extract all “alias” fields for every help topic:</p>
<pre><code>aliases &lt;- lapply (x, function (i) tools:::.Rd_get_metadata (i, &quot;alias&quot;))</code></pre>
<p>
Code like the following can then be used to find the file which
describes the <code>Rd_db</code> function.</p>
<pre><code>myfn &lt;- &quot;Rd_db&quot;
aliases [which (vapply (aliases, function (i) myfn %in% i, logical (1)))]</code></pre>
<p>
And that gives us the name of the help file describing our desired
function.</p>
<h2>Getting help content for a particular function (#2)</h2>
<p>An
alternative approach to finding the names of help files associated with
a specified function is to use the <a href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/help.search.html"><code>help.search()</code>
function</a>, which returns the following kinds of results:</p>
<pre><code>hs &lt;- help.search (pattern = &quot;Rd_db&quot;, package = &quot;tools&quot;)
str (hs)
hs$matches</code></pre>
<p>We can see there that the final,
<code>&quot;Entry&quot;</code> column includes <code>Rd_db</code>, and specifies
that it is an <code>&quot;alias&quot;</code>. The name of the associated file is
also given there as “Rdutils”.</p>
<h2>Names of help topics; names of help files</h2>
<p>I indicated above that the “name” field of an “Rd” file</p>
<blockquote>
<p>defines the name of a single help topic, which may or may not be the
name of the original <code>/man</code> directory file in the package
source</p>
</blockquote>
<p>An example of a help topic which differs from the name of the
underlying file arises courtesy of the <code>&quot;formatC&quot;</code> function
from the “base” package.</p>
<pre><code>hs &lt;- help.search (pattern = &quot;formatC&quot;, package = &quot;base&quot;)
hs$matches [hs$matches$Topic == &quot;formatC&quot;, ]</code></pre>
<p>
So “formatC” is the official name of one of the help topics within the
“base” package, and therefore should also be the name of the entry
within its help database. And yet look what happens when the help
database is accessed via <code>lazyLoad</code>:</p>
<pre><code>package &lt;- &quot;base&quot;
loc &lt;- file.path (R.home (), &quot;library&quot;, package, &quot;help&quot;, package)
e &lt;- new.env ()
chk &lt;- lazyLoad (loc, envir = e)
fns &lt;- ls (envir = e)
fns [grep (&quot;formatC&quot;, fns, ignore.case = TRUE)]</code></pre>
<p>
And the entry in the database is called <code>formatc</code> (lower-case
“c”), yet the content of that entry declares a “name” of
<code>formatC</code> (upper-case “C”). So the “Name” entry in the object
returned by the <a href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/help.search.html"><code>help.search()</code>
function</a> is not actually the name of the <code>/man</code> file in
the source package, rather it is the “name” entry specified in the
actual “Rd” file. The contents of the corresponding file can
nevertheless be extracted via <code>Rd_db</code> like this:</p>
<pre><code>x &lt;- tools::Rd_db (package = &quot;base&quot;)
aliases &lt;- lapply (x, function (i) tools:::.Rd_get_metadata (i, &quot;alias&quot;))
myfn &lt;- &quot;formatC&quot;
i &lt;- which (vapply (aliases, function (i) myfn %in% i, logical (1))) # number of entry in db
names (x) [i]
tools:::.Rd_get_metadata (x [[i]], &quot;name&quot;)</code></pre>
<p>
This indicates that the <code>help.search()</code> function ought not to
be used to extract the contents of help files, but that the
<code>Rd_db()</code> function can be used as illustrated above, with no
need to worry about the names of the underlying files. A final function
might look something like the following:</p>
<pre><code>help_text &lt;- function (fn_name, package) {
    x &lt;- tools::Rd_db (package = package)
    aliases &lt;- lapply (x, function (i) tools:::.Rd_get_metadata (i, &quot;alias&quot;))
    i &lt;- which (vapply (aliases, function (i) fn_name %in% i, logical (1)))
    return (x [[i]])
}</code></pre>
<h3>Conclusion</h3>
<p>Hopefully this has been helpful to anyone wanting to
extract the actual contents of R help files. While the objects extracted
by the methods described above can generally be treated as
<code>character</code> objects, and any form of parsing applied, they
are also objects with a defined class of <code>Rd</code>. There are
several methods already available to parse such objects, in particular
those described in the help file for the <a href="https://stat.ethz.ch/R-manual/R-patched/library/tools/html/parse_Rd.html"><code>parse_Rd()</code>
function</a>, which claims at the outset that,</p>
<blockquote>
<p>This function parses ‘Rd’ files according to the specification given
in <a href="https://developer.r-project.org/parseRd.pdf">https://developer.r-project.org/parseRd.pdf</a></p>
</blockquote>
<p>The document referred to there goes into extensive detail about
methods for parsing these objects.</p>
</div>
</div>]]></content:encoded>
      <pubDate>29 Sep 20</pubDate>
    </item>
    <item>
      <title>Using RcppParallel to aggregate to a vector</title>
      <link>https://mpadge.github.io/blog/blog007.html</link>
      <guid>https://mpadge.github.io/blog/blog007.html</guid>
      <description>This article was recently published in the Rcpp Gallery, and demonstrates using the RcppParallel package to aggregate to an output vector. It extends directly from previous demonstrations of single-valued aggregation, through providing necessary details to enable aggregation to a vector, or by extension, to any arbitrary form.</description>
      <content:encoded><![CDATA[


<div id="using-rcppparallel-to-aggregate-to-a-vector" class="section level1">
<h1>Using RcppParallel to aggregate to a vector</h1>
<p>This article was <a href="https://gallery.rcpp.org/articles/parallel-aggregate-to-vector/">recently
published in the Rcpp Gallery</a>, and demonstrates using the <a href="https://rcppcore.github.io/RcppParallel">RcppParallel</a> package
to aggregate to an output vector. It extends directly from previous
demonstrations of <a href="https://gallery.rcpp.org/articles/parallel-vector-sum">single-valued
aggregation</a>, through providing necessary details to enable
aggregation to a vector, or by extension, to any arbitrary form.</p>
<h3>The General Problem</h3>
<p>Many tasks require aggregation to a vector result, and
many such tasks can be made more efficient by performing such
aggregation in parallel. The general problem is that the vector in which
results are to be aggregated has to be shared among the parallel
threads. This is a <code>parallelReduce</code> task: we need to split
the singular task into effectively independent, parallel tasks, perform
our aggregation operation on each of those tasks, yielding as many
instances of our aggregate result vector as there are parallel tasks,
and then finally join all of those resultant vectors from the parallel
tasks into our desired singular result vector. The general structure of
the code demonstrated here extends from the previous Gallery article on
<a href="https://gallery.rcpp.org/articles/parallel-vector-sum">parallel
vector sums</a>, through extending to summation to a vector result,
along with the passing of additional variables to the parallel worker.
The following code demonstrates aggregation to a vector result that
holds the row sums of a matrix, noting at the outset that it is not
intended to represent efficient code; rather, it is written to
explicitly emphasise the principles of using <code>RcppParallel</code>
to aggregate over a vector result.</p>
<h3>The parallelReduce Worker</h3>
<p>The following code
defines our parallel worker, in which the input is presumed for
demonstration purposes to be a matrix stored as a single vector, and so
has a total length of <code>nrow * ncol</code>. The demonstration includes
a few notable features:</p>
<ol>
<li>The main <code>input</code> simply provides an integer index into
the rows of the matrix, with the parallel job splitting the task among
elements of that index. This explicit specification of an index vector
is not necessary, but serves here to clarify what the worker is actually
doing. An alternative would be for <code>input</code> to be
<code>the_matrix</code>, and subsequently call the parallel worker only
over <code>[0 ... nrow]</code> of that vector which has a total length
of <code>nrow * ncol</code>.</li>
<li>We are passing two additional variables specifying <code>nrow</code>
and <code>ncol</code>. Although one of these could be inferred at run
time, we pass them simply to demonstrate how this is done. Note in
particular the form in the second constructor, called for each
<code>Split</code> job, which accepts as input the variables as defined
by the main constructor, and so all variable definitions are of the
form, <code>nrow(oneJob.nrow)</code>. The initial constructor also has
input variables explicitly defined with <code>_in</code> suffixes, to
clarify exactly how such variable passing works.</li>
<li>No initial values for the <code>output</code> are passed to the
constructors. Rather, <code>output</code> must be resized to the desired
size by each of those constructors, and so each repeats the line
<code>output.resize(nrow, 0.0)</code>, which also initialises the
values. (This is more readily done using a <code>std::vector</code> than
an <code>Rcpp</code> vector, with final conversion to an
<code>Rcpp</code> vector result achieved through a simple
<code>Rcpp::wrap</code> call.)</li>
</ol>
<pre><code>#include &lt;Rcpp.h&gt;
// [[Rcpp::depends(RcppParallel)]]
#include &lt;RcppParallel.h&gt;

using namespace Rcpp;
using namespace RcppParallel;

struct OneJob : public Worker {

    RVector&lt;int&gt; input;
    const NumericVector the_matrix;
    const size_t nrow;
    const size_t ncol;
    std::vector&lt;double&gt; output;

    // Constructor 1: The main constructor
    OneJob (
            const IntegerVector input_in,
            const NumericVector the_matrix_in,
            const size_t nrow_in,
            const size_t ncol_in) :
        input(input_in), the_matrix(the_matrix_in),
        nrow(nrow_in), ncol(ncol_in), output()
    {
        output.resize(nrow, 0.0);
    }

    // Constructor 2: Called for each split job
    OneJob (
            const OneJob &amp;oneJob,
            Split) :
        input(oneJob.input), the_matrix(oneJob.the_matrix),
        nrow(oneJob.nrow), ncol(oneJob.ncol), output()
    {
        output.resize(nrow, 0.0);
    }

    // Parallel function operator
    void operator() (std::size_t begin, std::size_t end)
    {
        for (size_t i = begin; i &lt; end; i++)
        {
            // Very inefficient yet explicit way to calculate row sums:
            for (size_t j = 0; j &lt; ncol; j++) {
                // static_cast because (i,j,nrow) are size_t, aka unsigned long,
                // but Rcpp vectors require `R_xlen_t`, aka long.
                output[i] += the_matrix[static_cast&lt;R_xlen_t&gt;(i + j * nrow)];
            }
        }
    } // end parallel function operator

    void join (const OneJob &amp;rhs)
    {
        for (size_t i = 0; i &lt; nrow; i++) {
            output[i] += rhs.output[i];
        }
    }
};</code></pre>
<p>
The worker can then be called via <code>parallelReduce</code> with the
following code, in which <code>static_cast</code>s are necessary because
<code>.size()</code> applied to <code>Rcpp</code> objects returns an
<code>R_xlen_t</code> or <code>long</code> value, but we need to pass
<code>unsigned long</code> or <code>size_t</code> values to the worker
to use as indices into standard C++ vectors. The <code>output</code> of
<code>oneJob</code> is a <code>std::vector&lt;double&gt;</code>, which
is converted to an <code>Rcpp::NumericVector</code> through a simple
call to <code>Rcpp::wrap</code>.</p>
<pre><code>// [[Rcpp::export]]
NumericVector vector_aggregator (IntegerVector index, NumericVector x)
{
    const size_t nrow = static_cast &lt;size_t&gt; (index.size ());
    const size_t ncol = static_cast &lt;size_t&gt; (x.size ()) / nrow;

    OneJob oneJob (index, x, nrow, ncol);
    parallelReduce (0, nrow, oneJob);

    return wrap (oneJob.output);
}</code></pre>
<p>
</p>
<h3>Demonstration</h3>
<p>Finally, the following code demonstrates that this
parallel worker correctly returns the row sums of the input matrix.</p>
<pre><code># allocate a vector
nrow &lt;- 1e5
ncol &lt;- 10
x &lt;- runif (nrow * ncol) # input matrix
res &lt;- vector_aggregator (seq(nrow), x)
# confirm that this equals rowsums of the matrix:
xmat &lt;- matrix(x, ncol = ncol)
identical(res, rowSums(xmat))</code></pre>
<p>
You can learn more about using RcppParallel at <a href="https://rcppcore.github.io/RcppParallel">https://rcppcore.github.io/RcppParallel</a>.</p>
</div>]]></content:encoded>
      <pubDate>07 Nov 19</pubDate>
    </item>
    <item>
      <title>Github 2FA, git push, and password entry</title>
      <link>https://mpadge.github.io/blog/blog006.html</link>
      <guid>https://mpadge.github.io/blog/blog006.html</guid>
      <description>Activating github two-factor authentication (2FA) offers an indubitable security boost, with one notable side effect--https authentication requires entering a Personal Access Token instead of password. This entry explains how I reconfigured my git push commands with 2FA to be able to enter my password once again, instead of a random 32-character token.</description>
      <content:encoded><![CDATA[


<div id="github-2fa-git-push-and-password-entry" class="section level1">
<h1>Github 2FA, git push, and password entry</h1>
<p>Activating github two-factor authentication (2FA) offers an
indubitable security boost, with one notable side effect:
<code>https</code> authentication requires entering a Personal Access
Token instead of password, as very clearly explained in the official
<a target="_blank" rel="noopener noreferrer" href="https://help.github.com/en/github/authenticating-to-github/accessing-github-using-two-factor-authentication#authenticating-on-the-command-line-using-https">
github documentation </a>, which states:</p>
<blockquote>
<p>The command line prompt won’t specify that you should enter your
personal access token when it asks for your password.</p>
</blockquote>
<p>So everything <em>looks</em> like it stays the
same, except now I have to enter a random 32-character long Personal
Access Token (PAT), instead of my former, sensibly memorable, and
readily typeable password. But I liked things the old way! This blog
entry describes the process I went through to effectively restore the
previous behaviour of the git prompt prior to me switching on 2FA on
github, enabling me to type a password for <code>git push</code>,
instead of the un-typeable PAT.</p>
<h2>Why enter a password each time?</h2>
<p>Many
– maybe most? – people are likely content with SSH authentication, which
avoids any of these issues, and simply allows your <code>git push</code>
commands to be identified through connecting your local <code>ssh</code>
agent with github to do the authentication. <code>git push</code> then
just works. My problem with this is twofold:</p>
<ol>
<li><em>i like typing in both my github name, and my password</em>,
especially because i have long learnt to appreciate the brief cognitive
disconnect this gives me, one which not infrequently leads to me
realising that, no, i really do not want to push that commit. The
necessity of me manually entering my name and password for each push
provides an extra level of security against me inadvertently pushing
breaking or otherwise silly commits. I like that.</li>
<li>The immediacy of SSH pushes disturbs me somewhat. Yes, my local
machine is absolutely authenticated, but this means that anybody who
happens to get their maws on my machine can push whatever they want
anytime. Although this is wildly unlikely to ever happen, the mere
notion that it could nevertheless disturbs me. I like having to type my
name and password.</li>
</ol>
<p>It is impossible for me to type my name and PAT. For
a brief moment after having switched on 2FA on github, i feared that i
was going to have to constantly copy-paste my PAT for every commit. I
didn’t wanna do that, so i did the following … but first a brief
digression into my SSL habits. ## OpenSSL encryption I use
<a target="_blank" rel="noopener noreferrer" href="https://www.openssl.org/">
OpenSSL </a> a lot. I encrypt any and all sensitive information, and use
a host of local scripts and bash aliases to do so. I wasn’t going to
leave my github PAT just lying around on my machine, so it naturally
gets encrypted too, simply by storing it as a single line in a text
file, and typing:
</p>
<pre class="bash"><code>openssl des3 -salt -md sha256 -pbkdf2 -in gitpat.txt -out gitpat</code></pre>
<p>
That command prompts me to enter and repeat a password. See the
<a target="_blank" rel="noopener noreferrer" href="https://www.openssl.org/">
OpenSSL </a> manual for what all those flags mean; or just believe me
that they ensure that it’s really encrypted. Delete
<code>gitpat.txt</code> — and don’t forget any extra files like
<code>.gitpat.txt.un~</code> on linux, or whatever traces might be left
lying around on other operating systems — and your PAT is secure.
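Assuming <code>openssl</code> is on the path, the whole encrypt/decrypt
round trip can be checked non-interactively with a throwaway token,
supplying the password via <code>-pass</code> rather than the
interactive prompt (token and password here are obviously fake):</p>

```shell
# round-trip sanity check of the recipe, with throwaway values:
printf 'ghp_exampletoken1234' > gitpat.txt
openssl des3 -salt -md sha256 -pbkdf2 -pass pass:demo -in gitpat.txt -out gitpat
rm gitpat.txt          # only the encrypted copy remains
openssl des3 -salt -md sha256 -pbkdf2 -d -pass pass:demo -in gitpat -out gitpat.txt
PAT=$(cat gitpat.txt)  # recovered token
echo "$PAT"
rm gitpat gitpat.txt
```

<p>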
Decrypting pretty much just reverses the above:
</p>
<pre class="bash"><code>openssl des3 -salt -md sha256 -pbkdf2 -d -in gitpat -out gitpat.txt</code></pre>
<p>
Then I’ve got my token in <code>gitpat.txt</code>, which I can …
copy-and-paste each time I need to <code>git push</code>? No way! And so
… on to my solution.</p>
<h2 id="github-2fa-via-https-with-password-entry">github 2FA via https with password entry, and not an untypeable token</h2>
<p>My solution involved two main tricks:</p>
<ol>
<li><p>Replacing my pushes of the form <code>git push origin master</code>
– where <code>origin</code> can be identified via
<code>git remote -v</code> as something like
<code>https://github.com/mpadge/&lt;repo&gt;</code>, and which
necessitates entering <code>&quot;mpadge&quot;</code> and my PAT – with
<code>git push https://mpadge:&lt;PAT&gt;@github.com/mpadge/&lt;repo&gt;</code>,
where the PAT is passed directly to github, circumventing the need to
enter it manually, so that the push is directly sent and accepted;
and</p></li>
<li><p>Writing a script that requires my (github or other) password,
uses that to automatically decrypt my PAT, converts the result to an
environment variable, and uses that variable to convert
<code>git push</code> into the form above with my PAT embedded.</p></li>
</ol>
<p>The second of those steps looks, in the form of a <code>bash</code>
script, like this:</p>
<pre class="bash"><code>read -s -p &quot;Enter Password: &quot; PASS
echo &quot;&quot;
openssl des3 -salt -md sha256 -pbkdf2 -d -in /&lt;my&gt;/&lt;secret&gt;/&lt;path&gt;/gitpat -out gitpat.txt -pass pass:$PASS
PASS=&quot;&quot;
PAT=$(&lt;gitpat.txt)
rm gitpat.txt</code></pre>
<p>
I then have a variable, <code>&quot;PAT&quot;</code>, containing my PAT, with no
other traces of its value, or of my password, left on my machine. Note
that the password required is whatever was entered for the initial
encryption of <code>gitpat.txt</code> to <code>gitpat</code>. The first
step then inserts this PAT, and my github user name, into a
<code>git push</code> command via the following <code>bash</code> code,
presuming here that my github user name is stored in a variable named
<code>UNAME</code>:
</p>
<pre class="bash"><code>REMOTE=$(git remote -v | head -n 1)
# REMOTE=&quot;origin https://github.com/&lt;org&gt;/&lt;repo&gt; (fetch)&quot; (or similar)
# function to cut string by delimiter
cut () {
    local s=$REMOTE$1
    while [[ $s ]]; do
        array+=( &quot;${s%%&quot;$1&quot;*}&quot; );
        s=${s#*&quot;$1&quot;};
    done;
}
# cut terminal bit &quot;(fetch)&quot; from remote, returning first part as array[0]:
array=(); cut &quot; &quot;
REMOTE=&quot;${array[0]}&quot;
# cut remainder around &quot;github.com&quot;, returning 2nd part as &quot;/&lt;org&gt;/&lt;repo&gt;&quot;
array=(); cut &quot;github.com&quot;
# convert REMOTE given above to
# REMOTE=&quot;https://&lt;UNAME&gt;:&lt;PAT&gt;@github.com/&lt;org&gt;/&lt;repo&gt;&quot; (or similar)
printf -v REMOTE &quot;https://%s:%s@github.com%s&quot; &quot;$UNAME&quot; &quot;$PAT&quot; &quot;${array[1]}&quot;
echo $REMOTE</code></pre>
<p>That script gives our desired output:</p>
<pre><code>https://mpadge:&lt;mypat&gt;@github.com/&lt;org&gt;/&lt;repo&gt;</code></pre>
<p>
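The string-cutting logic can also be sketched with plain parameter
expansion, shown here with hypothetical values in place of a real remote
and PAT:</p>

```shell
# sketch of the remote-rewriting step alone, with hypothetical values:
REMOTE="origin https://github.com/mpadge/repo (fetch)"
UNAME="mpadge"
PAT="<mypat>"
URL=${REMOTE#* }               # drop leading "origin "
URL=${URL% (fetch)}            # drop trailing " (fetch)"
REPO_PATH=${URL#*github.com}   # leaves "/mpadge/repo"
AUTH_REMOTE=$(printf "https://%s:%s@github.com%s" "$UNAME" "$PAT" "$REPO_PATH")
echo "$AUTH_REMOTE"            # https://mpadge:<mypat>@github.com/mpadge/repo
```

<p>The custom <code>cut</code> function in the script generalises this
to arbitrary delimiters.</p>
<p>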
</p>
<h2 id="final-script">Final script</h2>
<p>My solution then just involved combining those two tricks within a
single script, designed to <em>almost</em> but not quite reflect the old
<code>git push</code> prompt and behaviour I was trying to emulate, and
including an additional option to call the script with an extra
parameter specifying the branch to push to, or otherwise defaulting to
the current branch:</p>
<pre class="bash"><code>#!/bin/bash
read -p &quot;User name for &#39;https://github.com&#39;: &quot; UNAME
read -s -p &quot;Password (NOT PAT) for &#39;https://$UNAME@github.com&#39; &quot; PASS
echo &quot;&quot;
openssl des3 -salt -md sha256 -pbkdf2 -d -in /&lt;my&gt;/&lt;secret&gt;/&lt;path&gt;/gitpat -out gitpat.txt -pass pass:$PASS
PASS=&quot;&quot;
PAT=$(&lt;gitpat.txt)
rm gitpat.txt
# get git branch:
if [ &quot;$1&quot; == &quot;&quot; ]; then
    BRANCH=$(git branch --show-current)
else
    BRANCH=$1
fi
REMOTE=$(git remote -v | head -n 1)
# REMOTE=&quot;origin https://github.com/&lt;org&gt;/&lt;repo&gt; (fetch)&quot; (or similar)
# function to cut string by delimiter
cut () {
    local s=$REMOTE$1
    while [[ $s ]]; do
        array+=( &quot;${s%%&quot;$1&quot;*}&quot; );
        s=${s#*&quot;$1&quot;};
    done;
}
# cut terminal bit &quot;(fetch)&quot; from remote, returning first part as array[0]:
array=(); cut &quot; &quot;
REMOTE=&quot;${array[0]}&quot;
# cut remainder around &quot;github.com&quot;, returning 2nd part as &quot;/&lt;org&gt;/&lt;repo&gt;&quot;
array=(); cut &quot;github.com&quot;
# convert REMOTE given above to
# REMOTE=&quot;https://&lt;UNAME&gt;:&lt;PAT&gt;@github.com/&lt;org&gt;/&lt;repo&gt;&quot; (or similar)
printf -v REMOTE &quot;https://%s:%s@github.com%s&quot; &quot;$UNAME&quot; &quot;$PAT&quot; &quot;${array[1]}&quot;
git push $REMOTE $BRANCH
# clear variables:
PAT=&quot;&quot;
REMOTE=&quot;&quot;</code></pre>
<p>
I then only needed to set an alias to that script in
<code>~/.bash_aliases</code>, along the lines of</p>
<pre class="bash"><code>alias gitpush=&quot;bash /&lt;my&gt;/&lt;secret&gt;/&lt;path&gt;/gitpatscript.bash&quot;</code></pre>
<p>and then to replace my former <code>git push</code> with
<code>gitpush</code>, enabling me to once again type in my password
like I always liked to do.</p>
</div>]]></content:encoded>
      <pubDate>25 Oct 19</pubDate>
    </item>
    <item>
      <title>What are matrices in R?</title>
      <link>https://mpadge.github.io/blog/blog005.html</link>
      <guid>https://mpadge.github.io/blog/blog005.html</guid>
      <description>If everything in R is a vector, then what is a matrix? This entry will demonstrate that even matrices are vectors, and that processing of matrices can in certain circumstances be considerably more efficient if they are treated as simple vectors.</description>
      <content:encoded><![CDATA[


<div id="what-are-matrices-in-r" class="section level1">
<h1>What are matrices in R?</h1>
<p>“R is a shockingly dreadful language for an exceptionally useful data
analysis environment” (
<a target="_blank" rel="noopener noreferrer" href="http://arrgh.tim-smith.us/">
Tim Smith &amp; Kevin Ushey </a>). One of the strangest manifestations
of claims like these is that,
<a target="_blank" rel="noopener noreferrer" href="https://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html">
“Everything in R is a vector” </a>. The simple question that then arises
is, What is a matrix? One commonly cited current repository of things R
is Hadley Wickham’s book,
<a target="_blank" rel="noopener noreferrer" href="http://adv-r.had.co.nz">
Advanced R </a>, which has a section on
<a target="_blank" rel="noopener noreferrer" href="http://adv-r.had.co.nz/Data-structures.html">
Data Structures </a> which simply states that a matrix is the two
dimensional equivalent of a vector, and that,
<a target="_blank" rel="noopener noreferrer" href="http://adv-r.had.co.nz/Data-structures.html#matrices-and-arrays">
“Adding a <code>dim</code> attribute to an atomic vector allows it to
behave like a multi-dimensional array.” </a> The chapter linked to above
goes on to say that, “Vectors are not the only 1-dimensional data
structure. You can have matrices with a single row or single column, or
arrays with a single dimension. They may print similarly, but will
behave differently. The differences aren’t too important.” This blog
entry will attempt to illustrate the kind of circumstances under which
differences between vectors and matrices actually become quite important
indeed.</p>
<h2 id="an-initial-illustration">An initial illustration</h2>
<p>Vectors do differ from matrices, as
the following code clearly illustrates:</p>
<pre class="r"><code>n &lt;- 1e6
x &lt;- runif (n)
y &lt;- runif (n)
xy &lt;- cbind (x, y) # a matrix
rbenchmark::benchmark (
                       res &lt;- x + y,
                       res &lt;- rowSums (xy),
                       replications = 100,
                       order = NULL) [, 1:4]</code></pre>
<p>
Adding the two columns of a matrix takes 3-4 times longer than adding two
otherwise equivalent vectors. And okay, that’s very likely something to
do with the <code>rowSums</code> function rather than the matrix itself,
but why should these two behave so differently? At that point, I must
freely admit to being not sufficiently clever to have uncovered the
actual reason in the
<a target="_blank" rel="noopener noreferrer" href="https://github.com/wch/r-source/blob/trunk/src/main/array.c">
underlying C source code. </a> The answer must lie somewhere in there,
so any pointers would be greatly appreciated. Short of that, the
following is a phenomenological explanation, derived through attempting
to reconstruct in C code what <code>rowSums</code> is actually doing.
Direct vector addition must work something like the following C code,
written here in a form able to be directly parsed in R via the
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=inline">
inline package </a>.
</p>
<pre class="r"><code>library (inline)
add &lt;- cfunction(c(a = &quot;numeric&quot;, b = &quot;numeric&quot;), &quot;
                 int n = LENGTH (a);
                 SEXP result = PROTECT (Rf_allocVector (REALSXP, n));
                 double *ra, *rb, *rout;
                 ra = REAL (a);
                 rb = REAL (b);
                 rout = REAL (result);
                 for (int i = 0; i &lt; n; i++)
                     rout [i] = ra [i] + rb [i];
                 UNPROTECT (1);
                 return result;
                 &quot;)</code></pre>
<p>
That’s a simple C function to add two vectors and return the result,
with most of the code providing the necessary scaffolding for an R
function. The following benchmark compares that with the previous two
equivalent functions.
</p>
<pre class="r"><code>rbenchmark::benchmark (
                       res &lt;- x + y,
                       res &lt;- add (x, y),
                       res &lt;- rowSums (xy),
                       replications = 100,
                       order = NULL) [, 1:4]</code></pre>
<p>
So our <code>add</code> function is broadly equivalent to R’s underlying
code for vector addition, and correspondingly, considerably more
efficient than <code>rowSums</code> applied to an equivalent matrix.
This naturally fosters the question of whether the inefficiency arises
in <code>rowSums</code> itself, or whether it is somehow something
inherent to R’s internal representation of matrices and/or matrix
operations? The following code provides an initial answer to that
question.</p>
<pre class="r"><code>rbenchmark::benchmark (
                       res &lt;- x + y,
                       res &lt;- add (x, y),
                       res &lt;- rowSums (xy),
                       res &lt;- xy [, 1] + xy [, 2],
                       replications = 100,
                       order = NULL) [, 1:4]</code></pre>
<p>
And direct addition of two columns of a matrix, through indexing into
those columns, is roughly as <em>inefficient</em> as
<code>rowSums</code> itself, while direct addition of the equivalent
vectors remains 3-4 times more efficient.</p>
<h3 id="how-are-matrices-stored">How are matrices stored?</h3>
<p>
So the reason for the relative inefficiency of <code>rowSums</code> is
likely to extend directly from the column selection operation,
<code>xy[, i]</code>. The reference manual for the C-level details of
data storage and sub-selection in R is the online compendium,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/doc/manuals/r-release/R-ints.html#SEXPs">
R Internals </a>, yet even this has remarkably little to say in regard
to how matrices are actually stored or manipulated. The key is a single
incidental statement that,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/doc/manuals/R-ints.html#Large-matrices">
“Matrices are stored as vectors” </a>. The storage can then be
understood through reading the details of vector storage, and then
simply figuring out how the indexing of a matrix-as-vector is
implemented. This can be easily discerned from direct conversion within
R:</p>
<pre class="r"><code>as.vector (cbind (1:5, 6:10))</code></pre>
<p>The
columns of a matrix are directly concatenated within the vector object.
This enables us to then re-write the above C code for vector addition to
instead accept a matrix object, noting that the indices <code>i</code>
and <code>n + i</code> respectively refer to the first and second
columns of the matrix.
</p>
<pre class="r"><code>matadd &lt;- cfunction(c(a = &quot;numeric&quot;), &quot;
                 int n = floor (LENGTH (a) / 2.0);
                 SEXP result = PROTECT (Rf_allocVector (REALSXP, n));
                 double *ra, *rout;
                 ra = REAL (a);
                 rout = REAL (result);
                 for (int i = 0; i &lt; n; i++)
                     rout [i] = ra [i] + ra [n + i];
                 UNPROTECT (1);
                 return result;
                 &quot;)</code></pre>
<p>
Benchmarking that against the previous versions, and including an
additional comparison of direct matrix addition, gives the following
results.
</p>
<pre class="r"><code>rbenchmark::benchmark (
                       res &lt;- x + y,
                       res &lt;- add (x, y),
                       res &lt;- rowSums (xy),
                       res &lt;- xy [, 1] + xy [, 2],
                       res &lt;- matadd (xy),
                       res &lt;- xy + xy,
                       replications = 100,
                       order = NULL) [, 1:4]</code></pre>
<p>
That benchmark demonstrates that operations on matrix columns are only
as efficient as equivalent operations on vectors when the matrices are
treated as singular vector objects. Direct addition of entire matrices
(<code>xy + xy</code>) is also as efficient as vector addition, taking
here around twice as long because twice as many values are being added.
Inefficiencies arise in handling matrices only when extracting
individual rows or columns – the <code>xy[, i]</code> operations,
presumably because these operations involve creating an additional copy
of the entire row or column.</p>
<h2 id="conclusion">Conclusion</h2>
<p>What the above code was
intended to demonstrate was that matrices should only be considered to
be <strong>like</strong> vectors in the sense of operations on the
entire objects. Sub-setting or sub-selecting of matrices involves
creating additional copies of the sub-set/sub-selected portions, and is
comparably less efficient than equivalent vector operations. In
particular, efficient C or C++ operations on matrices should index
directly into the underlying vector object, rather than sub-setting
particular rows or columns of the matrices. The assertion that
everything in R is a vector hereby deepens: Even matrices in R are
vectors, and should in many circumstances be treated as such.</p>
</div>]]></content:encoded>
      <pubDate>31 Jul 19</pubDate>
    </item>
    <item>
      <title>Calling external files from C in R</title>
      <link>https://mpadge.github.io/blog/blog004.html</link>
      <guid>https://mpadge.github.io/blog/blog004.html</guid>
      <description>I recently encountered a problem while bundling an old C library into a new R package. The library itself depends on, and includes, an external "dictionary" in plain text format used to construct a large lookup table. The creators of this library of course assume that this dictionary file will always reside in the same directory as the compiled object, and so can always be directly linked. The `src` directory of R packages is, however, only permitted to contain source code, which text files definitively are not. This blog entry is about where to put such files, and how to link them within the source code.</description>
      <content:encoded><![CDATA[


<div id="calling-external-files-from-c-in-r" class="section level1">
<h1>Calling external files from C in R</h1>
<p>I recently encountered a problem while bundling an old C library into
a new R package. The library itself depends on, and includes, an
external “dictionary” in plain text format used to construct a large
lookup table. The creators of this library of course assume that this
dictionary file will always reside in the same directory as the compiled
object, and so can always be directly linked. The <code>src</code>
directory of R packages is, however, only permitted to contain source
code, which text files definitively are <em>not</em>. This blog entry is
about where to put such files, and how to link them <em>within the
source code</em>. The answer turns out to be very simple, yet was
nevertheless one which occupied a couple of days of my time, hence this
documentation for the sake of posterity. As with many “external” files
within R packages, the recommended location is within the
<code>inst</code> directory, or some sub-directory thereof. Any files
within this directory will be copied “recursively to the installation
directory” (from Writing R Extensions). Such files can nevertheless
<em>not</em> be called directly from any <code>src</code> code, because
there is no way for a compiled source object to find them – relative
paths can not be used, because they will be implemented relative to the
directory from which the compiled object is called. Tests, for example,
will call the compiled object from the <code>./tests</code> directory,
while direct use within the package directory will call from
<code>.</code>. For general usage, the directory from which the object
is called could be anywhere, and external files can not be linked. In
other words, it is not possible to directly link a compiled object in an
R package with other package-local files, because the only “local” in R
is the currently working directory. It is thus necessary to step back
“out” from the source into the R environment to obtain the path to the
external file – in my case, to the dictionary. This information needs
somehow to be fed to the source code whenever and wherever the package
is used: precisely the kind of job for which the <code>.onLoad()</code>
function is intended. An additional problem in my particular case was
that the source code relied very extensively on defining the dictionary
file through a simple C macro:</p>
<pre class="c"><code>#define MY_DICTIONARY &quot;dictionary.txt&quot;</code></pre>
<p>Literally dozens of functions then call that simple macro to read
from the dictionary. Rewriting all of them to accept a dynamic parameter
defining the location would have been way too much work, and so I
urgently needed a simpler solution. The easiest turned out to be to use
environmental variables, which are universally accessible by any
programming language. I just needed to define and write the
environmental variable of the package dictionary in the
<code>.onLoad()</code> function as:</p>
<pre class="r"><code>Sys.setenv (&quot;DICT_DIR&quot; = system.file (package = &quot;my_package&quot;, &quot;subdir&quot;, &quot;my_dict.txt&quot;))</code></pre>
<p>
Accessing this within the source code was then as simple as defining an
equivalent function in C to read that variable:</p>
<pre class="c"><code>char * getDictPath()
{
    char *ret = getenv(&quot;DICT_DIR&quot;);
    return ret;
}</code></pre>
<p>and then replacing the hard-coded macro with a functional
equivalent:</p>
<pre class="c"><code>#define MY_DICTIONARY getDictPath()</code></pre>
<p>The entire bundled source then remained intact, with the
<code>getDictPath()</code> function returning the appropriate path as
defined within R itself, and accessible through the
<code>system.file()</code> function, and leaving the C code able to
simply call the macro <code>MY_DICTIONARY</code> to access the local
copy of that file. Credit and gratitude to Iñaki Ucar and Martin Morgan
for suggestions on the <a href="https://stat.ethz.ch/pipermail/r-package-devel/2019q2/004113.html">r-package-devel
mailing list</a>.</p>
</div>]]></content:encoded>
      <pubDate>04 Jul 19</pubDate>
    </item>
    <item>
      <title>Caching via background R processes</title>
      <link>https://mpadge.github.io/blog/blog003.html</link>
      <guid>https://mpadge.github.io/blog/blog003.html</guid>
      <description>Caching is implemented because it saves time, generally by saving the results of one function call for subsequent reuse. Background processes are also commonly implemented as time-saving measures, through delegating long-running tasks to "somewhere else", allowing you to keep focussing on whatever (un)important things you were doing in the meantime. This blog entry describes how to combine the two to save double time through caching via background processes.</description>
      <content:encoded><![CDATA[


<div id="caching-via-background-r-processes" class="section level1">
<h1>Caching via background R processes</h1>
<p>The title of this blog entry should be fairly self-evident for those
who might incline to read it, yet is motivated by the simple fact that
there currently appear to be no online sources that clearly describe the
relatively straightforward process of using background processes in
<strong>R</strong> to cache objects. (Check out search engine results
for
<a target="_blank" rel="noopener noreferrer" href="https://duckduckgo.com/?q=caching+background+R+processes&amp;t=ffab&amp;ia=web">
“caching background R processes” </a>: most of the top entries are for
Android, and even opting for other search engines
<a target="_blank" rel="noopener noreferrer" href="https://duckduckgo.com/?q=!g+caching+background+R+processes&amp;t=ffab&amp;ia=web">
does little to help uncover any useful information </a>.) Caching is
implemented because it saves time, generally by saving the results of
one function call for subsequent reuse. Background processes are also
commonly implemented as time-saving measures, through delegating
long-running tasks to “somewhere else”, allowing you to keep focussing
on whatever (un)important things you were doing in the meantime.
Straightforward caching of the results of single function calls is often
achieved through
<a target="_blank" rel="noopener noreferrer" href="https://duckduckgo.com/?q=!w+memoization&amp;t=ffab&amp;ia=web">
“memoisation” </a>, implemented in several <strong>R</strong> packages
including
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=R.cache">
R.cache </a>,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=memoise">
memoise </a>,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=memo">
memo </a>,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=simpleCache">
simpleCache </a>, and
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=simpleRCache">
simpleRCache </a>, not to mention the extremely useful cache-management
package,
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=hoardr">
hoardr </a>. None of these packages offer the ability to perform the
caching via a background process, and thus the initial call to a
function to-be-cached will have to wait until that function finishes
before returning a value. This blog entry describes how to implement
caching via background processes. Using a background process to cache an
object naturally requires a measure of anticipation that the object to
be cached is likely to be useful sometime in the future, as opposed to
necessarily needed right now. This is nevertheless a relatively common
situation in complex, multi-stage analyses, where the results of one
stage generally proceed in a predictable manner to subsequent stages.
The typical inputs and outputs of those subsequent stages are the things
that can be anticipated, and the results pre-calculated via background
processes, and then cached for subsequent <em>and immediate</em> recall.
So having briefly described “standard” caching (“foreground” caching, if
you like), it’s time to describe background processes in
<strong>R</strong>.</p>
<h2 id="background-processes-in-r">Background processes in R</h2>
<p>Background processes
are, among other things, the key to the much-used
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=future">
future package </a>. This package seems at first like a barely
intelligible miracle of mysterious implementation. What are these
“futures”? The host of highly informative vignettes provide a wealth of
information on how the users of this package can implement their own
“futures”, yet little information on how the futures themselves are
implemented. (This is not a criticism; it reflects a reasonably
self-justifying design choice, because the average user of this package
will be generally satisfied with knowing how to use the package, and
won’t necessarily want or need to know <em>how</em> the magic is
performed.) In short: a “future” is just a background process that dumps
its results somewhere ready for later recall. What is a background
process? Simply another <strong>R</strong> session running as a separate
<a target="_blank" rel="noopener noreferrer" href="https://duckduckgo.com/?q=!w+computer+process&amp;t=ffab&amp;ia=web">
process </a>. It’s easy to implement in base R. We first need a simple
<strong>R</strong> script, as for example generated by the following
code:
</p>
<pre class="r"><code>my_code &lt;- c (&quot;x &lt;- rnorm (1e6)&quot;,
              &quot;y &lt;- x ^ 2&quot;,
              &quot;y [x &lt; 0] &lt;- -y [x &lt; 0]&quot;,
              &quot;saveRDS (sd (y), file = &#39;myresult.Rds&#39;)&quot;)
writeLines (my_code, con = &quot;myfile.R&quot;)</code></pre>
<p>
That script can be executed as a background process by simply calling
<a target="_blank" rel="noopener noreferrer" href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html">
Rscript </a> via a
<a target="_blank" rel="noopener noreferrer" href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html">
system </a> or
<a target="_blank" rel="noopener noreferrer" href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/system2.html">
system2 </a> call, where the latter two allow <code>wait = FALSE</code>
to send the process to the background. (The more recent implementation
of system calls via the
<a target="_blank" rel="noopener noreferrer" href="https://github.com/jeroen/sys">
sys package </a> and its simple <code>exec_background()</code> function
also deserves a mention here.) In base R terms, a script can be called
from an interactive session via
</p>
<pre class="r"><code>system2 (command = &quot;Rscript&quot;, args = &quot;myfile.R&quot;, wait = FALSE)
list.files (pattern = &quot;^my&quot;)</code></pre>
<p>
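The same fire-and-forget pattern can be sketched in plain shell, which
is essentially all that <code>wait = FALSE</code> does; the job and
file names here are hypothetical:</p>

```shell
# background a job that writes its result to a file, then read the file:
printf 'expr 21 \\* 2 > myresult.txt\n' > myjob.sh
sh myjob.sh &        # fire and forget, like system2 (..., wait = FALSE)
wait                 # here we just wait; in real use you keep working
RESULT=$(cat myresult.txt)
echo "$RESULT"       # prints 42
rm myjob.sh myresult.txt
```

<p>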
The script has been executed as a background process, and the result
dumped to the file, “myresult.Rds”. This can then simply be read to
retrieve the cached result generated by that background process:
</p>
<pre class="r"><code>readRDS (&quot;myresult.Rds&quot;)</code></pre>
<p>And that value was calculated
in, and cached from, a background process. Simple.</p>
<h3 id="complications">Complications</h3>
<p>
Where was the above value stored? In the working directory of that
<strong>R</strong> session, of course. This is often neither a
practicable nor sensible approach, for example whenever any control over
storage locations is desired. These cached values are generally going to
be temporary in nature, and the <code>tempdir()</code> of the current
<strong>R</strong> session offers an alternative location, and is in
fact the only location acceptable for CRAN packages to write to during
package tests. Other common options include a sub-directory of
<code>~/.Rcache</code>, as used for example in the
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/package=R.cache">
R.cache </a> package. I’ll only consider <code>tempdir()</code> from
here on, but doing so will also reveal why the more enduring location of
<code>~/.Rcache</code> is often preferred. Another complication arises
in calling
<a target="_blank" rel="noopener noreferrer" href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html">
Rscript </a>, by virtue of the claims in
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/doc/manuals/r-release/R-exts.html">
“Writing R Extensions” </a> – the official CRAN guide to
<strong>R</strong> packages – that one should,</p>
<blockquote><p>… not invoke R by plain R, Rscript or (on Windows) Rterm
in your examples, tests, vignettes, makefiles or other scripts. As
pointed out in several places earlier in this manual, use something like
“$(R_HOME)/bin/Rscript” or
“$(R_HOME)/bin$(R_ARCH_BIN)/Rterm”</p></blockquote>
<p>That comment is not very helpful
because the “several places” alluded to are in different contexts, and
offer only examples rather than actual guidelines. The problem is that
those suggestions will usually, <em>but not always</em>, work, depending
on operating-system idiosyncrasies. So calling
<a target="_blank" rel="noopener noreferrer" href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rscript.html">
Rscript </a> directly is less straightforward than it might seem. A
further problem arises in that both <code>system</code> and
<code>system2</code> will generally return values of <code>0</code> when
everything works okay. “Works” then means that the process has been
successfully started. But where is that process in relation to the
current <strong>R</strong> session? And likely most importantly, has
that process finished or is it still operating? While it is possible to
use further <code>system</code> calls to determine the
<a target="_blank" rel="noopener noreferrer" href="https://duckduckgo.com/?q=!w+process+identifier&amp;t=ffab&amp;ia=web">
process identifier (PID) </a>, that process itself is fraught and
perilous. There are further complications which arise through directly
calling background <strong>R</strong> processes via
<code>Rscript</code>, but those should suffice to argue for the fabulous
alternative available thanks to Gábor Csárdi and …</p>
<h2>The processx package</h2>
<p>The
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/processx">
processx </a> package states simply that it provides,</p>
<blockquote><p>“Tools to run system processes in the background”</p></blockquote>
<p>This package is designed to run
<em>any</em> available system process, including ones that potentially
have nothing to do with <strong>R</strong> let alone a current
<strong>R</strong> session. Using
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/processx">
processx </a> to run background <strong>R</strong> process thus requires
calling <code>Rscript</code>, with the associated problems described
above. Fortunately for us, Gábor foresaw this need and created the
“companion” package,
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/callr">
callr </a> to simply</p>
<blockquote><p>“Call R from R”</p></blockquote>
<p>
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/callr">
callr </a> relies directly on
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/processx">
processx </a>, but provides the far simpler function,
<a target="_blank" rel="noopener noreferrer" href="https://callr.r-lib.org/reference/r_bg.html">
r_bg </a> to</p>
<blockquote><p>“Evaluate an expression in another R session, in the
background”</p></blockquote>
<p>So
<a target="_blank" rel="noopener noreferrer" href="https://callr.r-lib.org/reference/r_bg.html">
r_bg </a> provides the perfect tool for our needs. This function
directly evaluates R code, without needing to render it to text as we
did above in order to write it to an external script file. An
<a target="_blank" rel="noopener noreferrer" href="https://callr.r-lib.org/reference/r_bg.html">
r_bg </a> version of the above would look like this:
</p>
<pre class="r"><code>f &lt;- function () {
    x &lt;- rnorm (1e6)
    y &lt;- x ^ 2
    y [x &lt; 0] &lt;- -y [x &lt; 0]
    saveRDS (sd (y), file = &quot;myresult.Rds&quot;)
}
callr::r_bg (f)</code></pre>
<p>
We immediately see that
<a target="_blank" rel="noopener noreferrer" href="https://callr.r-lib.org/reference/r_bg.html">
r_bg </a> returns a handle to the process itself, along with the single
piece of critical diagnostic information: Whether the process is still
running or not:
</p>
<pre class="r"><code>px &lt;- callr::r_bg (f)
px
Sys.sleep (1)
px</code></pre>
<p>
Multiple processes can be generated and queried this way. The package is
designed around, and returns,
<a target="_blank" rel="noopener noreferrer" href="https://github.com/r-lib/R6">
R6 </a> class objects, enabling function calls on the objects, notably
including the following:
</p>
<pre class="r"><code>px &lt;- callr::r_bg (f)
px
while (px$is_alive ())
    px$wait ()
px</code></pre>
<p>
The <code>px$is_alive()</code> and <code>px$wait()</code> functions are
all that is needed to wait until a background process is finished. In
the context of using background processes to cache objects, these lines
enable the primary <strong>R</strong> session to simply wait until
caching is finished before retrieving the object.</p>
<h2>processx, callr, and caching</h2>
<p>There is only one remaining issue with the above code: Where
is “myresult.Rds” in the following code?
</p>
<pre class="r"><code>f &lt;- function () {
    x &lt;- rnorm (1e6)
    y &lt;- x ^ 2
    y [x &lt; 0] &lt;- -y [x &lt; 0]
    saveRDS (sd (y), file = file.path (tempdir (), &quot;myresult.Rds&quot;))
}
px &lt;- callr::r_bg (f)</code></pre>
<p>
It’s in <code>tempdir()</code>, but <em>not</em> the
<code>tempdir()</code> of the current process. Where is this other
<code>tempdir()</code>? It’s temporary of course, so has been dutifully
cleaned up, thereby removing our desired result. What is needed is a way
to store the result in the <code>tempdir()</code> of the current – active
– <strong>R</strong> session. This <code>tempdir()</code> is merely
specified as a character string, which we can pass directly to our
function:
</p>
<pre class="r"><code>f &lt;- function (temp_dir) {
    x &lt;- rnorm (1e6)
    y &lt;- x ^ 2
    y [x &lt; 0] &lt;- -y [x &lt; 0]
    saveRDS (sd (y), file = file.path (temp_dir, &quot;mynewresult.Rds&quot;))
}</code></pre>
<p>
We then only need to note that the second parameter of
<a target="_blank" rel="noopener noreferrer" href="https://callr.r-lib.org/reference/r_bg.html">
r_bg </a> is <code>args</code>, which is,</p>
<blockquote><p>“Arguments to pass to the function. Must be a list.”</p></blockquote>
<p>That is then all we need, so let it run …
</p>
<pre class="r"><code>px &lt;- callr::r_bg (f, list (tempdir ()))
while (px$is_alive ())
    px$wait ()
list.files (tempdir (), pattern = &quot;^my&quot;)</code></pre>
<p>
And there is our new result, along with all we need to understand how to
cache objects via background <strong>R</strong> processes.</p>
<h2>Summary</h2>
<ol>
<li>Define a function to generate the object to be cached, and include a
<code>tempdir()</code> parameter if that is to be used as the cache
location.</li>
<li>Use <code>callr::r_bg()</code> to call that function in the
background and deliver the result to the desired location.</li>
<li>Examine the handle of the process returned by <code>r_bg()</code> to
determine whether it has finished or not.</li>
<li>… use the cached result.</li>
</ol>
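<p>Those four steps condense to just a few lines. The following sketch
simply assembles them from the code above, with illustrative file and
function names:</p>
<pre class="r"><code>f &lt;- function (cache_dir) {                # 1. cache location as parameter
    res &lt;- sd (rnorm (1e6))
    saveRDS (res, file.path (cache_dir, &quot;myresult.Rds&quot;))
}
px &lt;- callr::r_bg (f, list (tempdir ()))   # 2. generate in the background
while (px$is_alive ())                     # 3. wait until finished
    px$wait ()
readRDS (file.path (tempdir (), &quot;myresult.Rds&quot;))   # 4. use the result</code></pre>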
</div>]]></content:encoded>
      <pubDate>06 Jun 19</pubDate>
    </item>
    <item>
      <title>C++ templates and Rcpp</title>
      <link>https://mpadge.github.io/blog/blog002.html</link>
      <guid>https://mpadge.github.io/blog/blog002.html</guid>
      <description>C++ templates are really useful. Templates allow you to code a function able to accept arguments of different types that can't necessarily be known until compile time. There is, however, no such thing as an Rcpp template -- all inputs and outputs must have defined types. This blog entry is about how to maximise the usefulness of C++ templates in an Rcpp context.</description>
      <content:encoded><![CDATA[


<div id="c-templates-and-rcpp" class="section level1">
<h1>C++ templates and Rcpp</h1>
<p>C++ templates are really useful. Templates allow you to code a
function able to accept arguments of different types that can’t
necessarily be known until compile time. The R language is, however,
written in C, and knows nothing of templates. Rcpp opens up to the R
language the extensions offered by C++ over C, yet integrating templates
within Rcpp code is not straightforward. This blog entry will hopefully
clarify the steps needed to use C++ templates in an Rcpp context. As
often in programming, employing templates in Rcpp is about finding the
most efficient level of abstraction. Templates are one of the coolest
ways to “abstract” C++ code – generally meaning abstracting away from
specific variable types (or classes, structures, whatever …) to generic
templated forms that accept multiple, or indeed any possible, types.
Templates in
<a target="_blank" rel="noopener noreferrer" href="https://rust-lang.org">
rust </a> just work - types are directly inferred, and any potential
conflicts will be caught at compile time.
<a target="_blank" rel="noopener noreferrer" href="https://rust-lang.org">
rust </a> is the gold standard in which template abstraction is as
pain-free as possible. C++ templates are, in contrast, somewhat more
painful, as a minimal generic template must be explicitly specified.
This is often as simple as replacing some function definition, say:</p>
<pre class="c++"><code>int my_function (int my_integer_input)
{
    int result = my_integer_input;
    // do something with `result`
    return result;
}</code></pre>
<p>with a templated version:</p>
<pre class="c++"><code>template &lt;class T&gt;
T my_function_t (T my_generic_input)
{
    T result = my_generic_input;
    // do something with `result`
    return result;
}</code></pre>
<p>As it stands, <code>my_function_t</code> will accept inputs of any
arbitrary kinds. (There are also ways to permit templated code to only
accept objects of some pre-defined classes.) An Rcpp version of the
first function might look like this:</p>
<pre class="c++"><code>// [[Rcpp::export]]
int my_rcpp_function (int my_integer_input)
{
    int result = my_integer_input;
    // do something with `result`
    return result;
}</code></pre>
<p>The problem arises when you try to do something like this:</p>
<pre class="c++"><code>// [[Rcpp::export]]
template &lt;class T&gt;
int rcpp_template (T input)
{
    int result = Rcpp::as &lt;int&gt; (input);
    // do something with `result`
    return result;
}</code></pre>
<p>and that takes you here:</p>
<pre class="c++"><code>RcppExports.cpp:46:36: error: use of undeclared identifier &#39;T&#39;
    Rcpp::traits::input_parameter&lt; T &gt;::type input(inputSEXP);
                                                              ^
RcppExports.cpp:46:41: error: no type named &#39;type&#39; in the global namespace
    Rcpp::traits::input_parameter&lt; T &gt;::type input(inputSEXP);
                                                                    ~~^
2 errors generated.</code></pre>
<p>This provides highly informative error messages which clearly
indicate that the cause is the inability to infer an appropriate type
for <code>inputSEXP</code> (itself a consequence of <strong>R</strong>
being written in C, and so knowing nothing about inferred types or
templates, as stated above). What we can nevertheless do here is replace
our undefined type, <code>T</code>, with an equivalently undefined and
generic <code>SEXP</code> (and let’s define our function while we’re at
it to square the input; and also, if you’re wondering what all this
<code>SEXP</code> stuff is, you could take a wee digression over to
<a target="_blank" rel="noopener noreferrer" href="https://bragqut.github.io/2016/05/26/milesmcbain-rnoprimitives/">
Miles McBain’s brief but illuminating ramblings on the topic </a>)</p>
<pre class="c++"><code>// [[Rcpp::export]]
int rcpp_template (SEXP input)
{
    int result = Rcpp::as &lt;int&gt; (input);
    return result * result;
}</code></pre>
<p>This can be then called from <strong>R</strong>, and will return an
integer output. (The <code>Rcpp::as &lt;int&gt; ()</code> is a wrapper
for <code>static_cast &lt;int&gt; ()</code>, which simply truncates
decimals, so <code>rcpp_template(1.9)</code> will give 1.) What about
generic return values? The next obvious step would be to try this:</p>
<pre class="c++"><code>// [[Rcpp::export]]
SEXP rcpp_template2 (SEXP input)
{
    return input * input;
}</code></pre>
<p>This would obviously be rather dangerous if it actually worked, but
we don’t need to worry because it fails with this:</p>
<pre class="c++"><code>error: invalid operands to binary expression (&#39;SEXP&#39; (aka &#39;SEXPREC *&#39;) and &#39;SEXP&#39;)
SEXP result = input * input;
                        ~~~~~ ^ ~~~~~
1 error generated.</code></pre>
<p>The arrow points to the operands, indicating that this has defaulted
to an attempt to multiply two generic pointer objects (an
<code>SEXP</code> is nothing but a pointer to an underlying
<code>SEXPREC</code> structure).
So that all leaves us now knowing that the most we can do is to send
generic inputs from <strong>R</strong> to C++ functions as
<code>SEXP</code> parameters, and then coerce them with the magic of
<code>Rcpp::as()</code>. This is of course also potentially
dangerous:</p>
<pre class="r"><code>rcpp_template (2.9) # = 4; okay
rcpp_template (&quot;2.9&quot;)
# Error in rcpp_template(input) :
#   Not compatible with requested type: [type=character; target=integer].</code></pre>
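<p>One defensive pattern from the <strong>R</strong> side is to validate
and coerce before crossing into C++. The following wrapper is my own
sketch, with <code>x * x</code> standing in for the compiled
<code>rcpp_template()</code>:</p>
<pre class="r"><code># Hypothetical wrapper: validate and coerce in R before calling C++,
# making the truncation to integer explicit rather than silent.
safe_square &lt;- function (x) {
    if (!is.numeric (x))
        stop (&quot;input must be numeric, not &quot;, class (x))
    x &lt;- as.integer (x) # explicit truncation, as Rcpp::as &lt;int&gt; does
    x * x               # stand-in for the compiled rcpp_template()
}
safe_square (2.9) # truncates to 2L, then squares to 4L</code></pre>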
<h2>A better level of abstraction</h2>
<p>Remembering that
<a target="_blank" rel="noopener noreferrer" href="http://adv-r.had.co.nz/C-interface.html">
R’s C interface </a> only knows about <code>SEXP</code>
(“S-EXPression”), and that all <code>SEXP</code> objects are mere
pointers to C arrays, suggests something like the following code—which
does not work:</p>
<pre class="c++"><code>#include &lt;Rcpp.h&gt;
template &lt;class T&gt;
T mysquare (T &amp;x)
{
    for (size_t i = 0; i &lt; x.size (); i++)
        x (i) = x (i) * x (i);
    return x;
}
// [[Rcpp::export]]
SEXP rcpp_mysquare (SEXP &amp;x)
{
    return mysquare (x);
}</code></pre>
<p>That code fails to compile because of “incomplete definition of type
‘SEXPREC’” (where a <code>SEXPREC</code> is a structure pointed to by an
<code>SEXP</code>)—in other words, R has no way of inferring the type of
data pointed to by the <code>SEXP</code>. The trick to getting this to
compile, and thereby to using C++ templates via Rcpp, is to have an
additional “type-selector” function that recognises and typecasts the
input type as one of the
<a target="_blank" rel="noopener noreferrer" href="https://cran.r-project.org/doc/manuals/R-exts.html#Registering-native-routines">
six possible R types </a>. We’re only interested in a couple of those
here, representing the integer and real or floating-point types, which
are respectively <code>INTSXP</code> and <code>REALSXP</code>. Recalling
that there is no distinction between a single integer or numeric
(floating-point) value and equivalent vectors of these, we can
distinguish these two cases through casting via <code>Rcpp::as</code> to
<code>Rcpp</code> equivalents of either integer or numeric vectors with
the following additional code, representing our “type selector”
function:</p>
<pre class="c++"><code>SEXP mysquare (SEXP &amp;x)
{
    switch (TYPEOF (x))
    {
        case INTSXP: {
                         Rcpp::IntegerVector iv = Rcpp::as &lt;Rcpp::IntegerVector&gt; (x);
                         return mysquare (iv);
                     }
        case REALSXP: {
                         Rcpp::NumericVector nv = Rcpp::as &lt;Rcpp::NumericVector&gt; (x);
                         return mysquare (nv);
                     }
        default: { Rcpp::stop (&quot;incompatible type&quot;);    }
    }
    return x; // this should never happen
}</code></pre>
<p>This function takes a generic (<code>SEXP</code>) input and returns a
generic output, yet deploys actual calls to the templated version of
<code>mysquare</code> with specified (<code>Rcpp</code>) types, ensuring
that the above templated function will always be able to infer the input
type. The <code>default</code> <code>Rcpp::stop</code> ensures that
types other than our desired two are not processed further, preventing
for example attempts to calculate the square of <code>&quot;a&quot;</code>.
Inserting this “type-selector” code in the above code permits a generic
<code>SEXP</code>-in / <code>SEXP</code>-out function (our
<code>rcpp_mysquare</code> in the above code) to be deployed to specific
types, and then simply passed to a generic C++ template function.
Presuming this C++ code to be in a file <code>src.cpp</code>, the whole
thing then works like this:
</p>
<pre class="r"><code>Rcpp::sourceCpp (&quot;src.cpp&quot;) # source the file, placing the Rcpp::export-ed function in workspace
x &lt;- 1:5
x &lt;- rcpp_mysquare (x)
x
class (x)
storage.mode (x) &lt;- &quot;numeric&quot;
x &lt;- rcpp_mysquare (x)
x
class (x)</code></pre>
<p>
An integer vector gives integer return values, and a numeric
(floating-point) vector gives numeric return values. There you have it:
templating through the magic of <code>SEXP</code>. Gratitude extended to
Dirk Eddelbuettel and David Cooley for advice and helpful pointers.</p>
<h2>The final code</h2>
<p>Just to make it clear, here’s the above code all in a single
place:</p>
<pre class="c++"><code>#include &lt;Rcpp.h&gt;
template &lt;class T&gt;
T mysquare (T &amp;x)
{
    for (size_t i = 0; i &lt; x.size (); i++)
        x (i) = x (i) * x (i);
    return x;
}
SEXP mysquare (SEXP &amp;x)
{
    switch (TYPEOF (x))
    {
        case INTSXP: {
                         Rcpp::IntegerVector iv = Rcpp::as &lt;Rcpp::IntegerVector&gt; (x);
                         return mysquare (iv);
                     }
        case REALSXP: {
                         Rcpp::NumericVector nv = Rcpp::as &lt;Rcpp::NumericVector&gt; (x);
                         return mysquare (nv);
                     }
        default: { Rcpp::stop (&quot;error&quot;);    }
    }
    return x; // this never happens
}
// [[Rcpp::export]]
SEXP rcpp_mysquare (SEXP &amp;x)
{
    return mysquare (x);
}</code></pre>
<div id="update-31-july-2019" class="section level2">
<h2>Update (31 July 2019)</h2>
<p>Since writing that, I found
<a target="_blank" rel="noopener noreferrer" href="https://gallery.rcpp.org/articles/rcpp-return-macros/">
this very clear and more extensive explanation </a> in an
<a target="_blank" rel="noopener noreferrer" href="https://gallery.rcpp.org">
Rcpp Gallery post </a>.</p>
</div>
</div>]]></content:encoded>
      <pubDate>07 May 19</pubDate>
    </item>
    <item>
      <title>how i made this site</title>
      <link>https://mpadge.github.io/blog/blog001.html</link>
      <guid>https://mpadge.github.io/blog/blog001.html</guid>
      <description>This site is built with zurb foundation, because i had read that it did everything that hugo could, but that final products were more lightweight and flexible. Plus i had no idea about it, and learning something new is sometimes worthwhile. I was also frustrated that standard hugo advice seemed to be, ''oh, just pick a template and off you go,'' yet there is surprisingly little advice on how to modify any given template, let alone how to start from scratch. It turned out that foundation at least made starting from scratch fairly easy, and so this entry is about that process.</description>
      <content:encoded><![CDATA[


<div id="how-i-made-this-site-from-scratch" class="section level1">
<h1>how i made this site (from scratch)</h1>
New blog, new website, so here we go. I’ll start by describing how i
built the website. From scratch. The site is built with
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
zurb foundation </a> , because i had read that it did everything that
<a target="_blank" rel="noopener noreferrer" href="https://gohugo.io">
hugo </a> could, but that final products were more lightweight and
flexible. Plus i had no idea about it, and learning something new is
<del>always</del> <del>often</del> sometimes worthwhile. I was also
frustrated that standard
<a target="_blank" rel="noopener noreferrer" href="https://gohugo.io">
hugo </a> advice seemed to be, ‘’oh, just pick a template and off you
go,’’ yet there is surprisingly little advice on how to modify any given
template, let alone how to start from scratch. It turned out that
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> at least made starting from scratch fairly easy, and so
this entry is about that process. Note that i consider myself a
technically-oriented, back-end programmer more focussed on getting stuff
in and processing it than on getting stuff out. So when i say ‘’starting
from scratch,’’ i mean that most sincerely.
<h2 id="visual-style">visual style</h2>
This is
largely <code>html</code>-related ramblings, so if you’re interested in
the code stuff, you might like to skip straight ahead to the <a href="#the-content">next section</a>.
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
zurb </a> provides a template (see
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/starter-projects.html">
here </a> for details) which deposits a basic infrastructure on your
local playground, along with the required
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> libraries. The basic system is fairly well
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs"> documented </a>, so
there’s little point going into that here. The top of this site is a
standard <a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/top-bar.html">
top bar </a>, and most of the rest is built from standard
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/callout.html">
callout </a> containers or plain cells. This and all blog pages, for
example, are full-width <a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/xy-grid.html">
xy-grid </a> containers with simple headers of
<pre><code class="hljs xml">&lt;div class=&quot;grid-x grid-padding-x&quot;&gt;
    &lt;div class=&quot;cell medium-12 large-12&quot;&gt;
</code></pre>
<p>The entire site lives within the local <code>src/</code> directory,
with the remainder being stuff used by
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> to build the site. This <code>src</code> directory
really is impressively lightweight. The primary components of
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> are ‘’pages’’ and ‘’partials,’’ with the latter
identical to most other systems for building websites. Crudely
interpreted, ‘’pages’’ hold the actual content, while ‘’partials’’
define the styles, generally as <code>html</code> header and footer
components inserted before and after the content of a page.
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> integrates directly with arbitrarily-structured
<code>yaml</code> files, which made auto-generation of my main web page
particularly easy. The files themselves live in the
<code>src/data</code> directory, with the blog entries, for example,
read straight from a <code>src/data/blog.yaml</code> file that looks
like this:</p>
<pre class="markdown"><code>-
    title: how i made this site
    description: &lt; blah blah blah &gt;
    created: 06 May 19
    modified: 06 May 19
    link: blog/blog001.html
- 
    title: C++ templates and Rcpp
    description: &lt; blah blah blah &gt;
    created: 07 May 19
    modified: 07 May 19
    link: blog/blog002.html</code></pre>
More on how that gets automatically generated below; for now, just
pretend it’s a static file. This has two entries, each of which has a
variety of components (such as <code>title</code>,
<code>description</code>, and <code>link</code>). The ‘’blog’’ section
on the main page is generated directly from these <code>yaml</code>
meta-data, using the {{#each blog}} command to automatically loop over
each of the above entries in the <code>data/blog.yml</code> file, using
the same double-curly-bracket syntax from zurb’s
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/panini.html">
panini </a> to insert variables into the <code>html</code> code:
<pre><code class="hljs xml">{{#each blog}}
    {{&gt; blog_header}}
        &lt;a href={{ link }}, style=&quot;color:#262626;&quot;&gt;
            &lt;div align=&quot;center&quot;&gt;
                &lt;h3&gt;{{ title }}&lt;/h3&gt;
            &lt;/div&gt;
            &lt;div align=&quot;center&quot;&gt;
                &lt;p&gt;{{ description }}&lt;/p&gt;
            &lt;/div&gt;
        &lt;/a&gt;
    {{&gt; blog_footer}}
{{/each blog}}
</code></pre>
<p>The whole site is set up with a grid 12 squares across, so these are
full-width containers with <code>grid-padding-x</code>, which by default
reads values from the global
<code>/src/assets/scss/_settings.scss</code> file. Yep, it’s an
<a target="_blank" rel="noopener noreferrer" href="https://sass-lang.com/">
scss </a> file, which is both great and … not so great. It means that
almost all variables used to generate your site - this site - can be
modified through directly modifying the values in
<code>src/assets/scss/_settings.scss</code>. The not so great is that
these are <em>global variables</em> which are translated during
compilation into <code>css</code> variables which generally won’t share
the same names. So if you want to change these values locally rather
than globally, you can’t ‘just do it’, you are forced to revert to
standard <code>css</code> (to define class structures) or
<code>html</code> (to explicitly define elements). This blog page, for
example, is defined by a simple entry in
<code>src/assets/scss/app.scss</code> – the sole location needed to
define all local classes - as:</p>
<pre class="css"><code>.blogClass{
    margin-top: 0px;
    margin-left: 50px;
    margin-right: 50px;
}</code></pre>
<p>The <code>margin-</code> elements are bog-standard <code>css</code>,
and absent these custom definitions all inherit the global properties
specified in <code>src/assets/scss/_settings.scss</code> (defining
standard properties of foundation’s <code>xy-grid</code>):</p>
<pre class="css"><code>$grid-margin-gutters: (
    small: 20px,
    medium: 30px
);
$grid-padding-gutters: $grid-margin-gutters;
$grid-container-padding: $grid-padding-gutters;</code></pre>
<p>Examples of <code>html</code> modifications to the global default
<code>scss</code> variables are the background colours for each
component of the code and blog sections. Remember that everything on the
main page is a ‘callout’, meaning that they all inherit the global
variables defined in <code>src/assets/scss/_settings.scss</code>. I
defined the global background as</p>
<pre class="css"><code>$callout-background: transparent;</code></pre>
<p>so the background image would appear underneath everything by
default. This required local changes to render the components
semi-transparent white, which was achieved with a simple two-line
<code>src/partials/blog_header.html</code> of:</p>
<pre class="html"><code>&lt;div class=&quot;large-4 medium-6 cell&quot; style=&quot;background-color:#ffffffaa&quot;&gt;</code></pre>
The code above with {{&gt; blog_header }} simply inserts that header in
its rightful place. That is the very short version of how i got this
site to look the way it does. It’s simple, but it was fairly easy, and
most important to me was that i didn’t have to borrow somebody else’s
arbitrary and way-more-difficult-to-modify-than-i-thought template for
whatever other site/blog-generating system i may otherwise have chosen.
<h2 id="the-content">the content</h2>
The steps roughly described above yielded a static site
largely as you see here. The only remaining step was automating the
procedure of updating the site. Perhaps the easiest approach would be to
do this manually, but as most of the content is contained within
<code>yaml</code> files, this is a procedure ripe for automation. As the
end product of most of my coding efforts is packaged in
<strong>R</strong>-form, i opted to automate this procedure within
<strong>R</strong>, although the same principles apply to any other
language. What this section effectively describes is how easy
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> made the task of effectively recreating
<a target="_blank" rel="noopener noreferrer" href="https://yihui.name">
Yihui Xie </a> ’s fabulous <a target="_blank" rel="noopener noreferrer" href="https://github.com/rstudio/blogdown"> blogdown </a> package.
Subjective judgement here, but the blogdown package was first released
to cran in August 2017, and a lot has changed in that short time. As
often happens, the enormity of the task Yihui achieved with that package
can now be recreated in
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> form much easier, and with much less code. In the case
of this site, it effectively amounts to connecting some kind of
<code>blog_render()</code> function to a simple update of a
<code>yaml</code> text file, with a few more tricks for other included
elements, notably graphics. With the help of partials, the entire
<code>html</code> formatting of a blog page is as simple as a header
with these few lines:
<pre><code class="hljs xml">{{&gt; header}}
&lt;div class=&quot;blogClass&quot;&gt;
  &lt;div class=&quot;grid-x grid-padding-x&quot;&gt;
    &lt;div class=&quot;cell medium-12 large-12&quot;&gt;
      {{#markdown}}
</code></pre>
and a footer simply closing each section with:
<pre><code class="hljs xml">    {{/markdown}}
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
</code></pre>
<p>(plus just a couple of extra lines to add the navigation bar at the
side – shown <a href="https://github.com/mpadge/mpadge.github.io/blob/master/src/pages/blog/make_entry.R#L67">here</a>
in a <code>navbar()</code> function, if you’re interested). In between
is ‘’standard’’ markdown (at least in a form I’ve yet to encounter any
particular idiosyncrasies with …), which
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> interprets seamlessly. Converting an
<strong>R</strong>markdown (<code>.Rmd</code>) document to a blog entry
is thus in essence as simple as rendering (via
<code>rmarkdown::render()</code>) it to some kind of standard markdown,
renaming that to <code>.html</code>, and inserting the five lines of header
and four lines of footer shown above. The following function forms the
basis of a <code>blog_render()</code> function:
</p>
<pre><code class="hljs r">blog_render &lt;- function (fname) {
    rmarkdown::render (paste0 (fname, &quot;.Rmd&quot;),
                       rmarkdown::md_document (variant = &#39;gfm&#39;))
    file.rename (paste0 (fname, &quot;.md&quot;), paste0 (fname, &quot;.html&quot;))
    conn &lt;- file (paste0 (fname, &quot;.html&quot;))
    md &lt;- readLines (conn)
    header &lt;- c (&lt;... defined above ...&gt;)
    footer &lt;- c (&lt;... defined above ...&gt;)
    md &lt;- c (header, md, footer)
    writeLines (md, conn)
    close (conn)
}
</code></pre>
<p>
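</p>
<p>Among the character-field touch-ups described further below is the
re-formatting of <strong>R</strong>markdown chunk delimiters. A minimal
sketch, with a hypothetical helper name:</p>
<pre><code class="hljs r"># Strip the curly brackets from rendered chunk delimiters, so that
# "```{r name, ...}" becomes plain "``` r":
fix_chunk_delims = function (md) {
    gsub ("^```\\{r.*\\}\\s*$", "``` r", md)
}
fix_chunk_delims ("```{r fig-move, eval = FALSE}")
## "``` r"
</code></pre>
<p>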
Simply calling <code>blog_render (&quot;this_page&quot;)</code> will then render
and transform <code>this_page.Rmd</code> into
<code>this_page.html</code> formatted for this website. The full
function used to generate these pages has a couple of other
sub-functions, mostly to move images to locations accessible by
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a>, and to replace a few character fields not otherwise
interpretable in either standard <code>html</code> or
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> terms. Examples of the latter are the {{ breadcrumbs }}
used by <a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com/sites/docs/panini.html">
foundation’s panini</a> interpreter, which are replaced by corresponding
<code>html</code> encodings; or the <strong>R</strong>markdown code
chunk delimiter, <code>```{r}</code>, from which the curly brackets must
be removed, replacing it with <code>``` r</code>.</p>
<h3 id="images">images</h3>
<p>While it is possible to specify an
image directory in the <code>yaml</code> front-matter of an
<code>.Rmd</code> document, it was just as easy, and more explicit, to add
another function to my <code>blog_render()</code> function to move images
to the appropriate place in the <a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
foundation </a> directory, which is <code>assets/img</code>, and any
arbitrary sub-directories thereof. The following lines achieve this:</p>
<pre><code class="hljs r">path &lt;- file.path (paste0 (fname, &quot;_files&quot;), &quot;figure-gfm&quot;)
flist &lt;- list.files (path, full.names = TRUE)
newpath &lt;- file.path (&quot;..&quot;, &quot;..&quot;, &quot;assets&quot;, &quot;img&quot;, fname)
if (!dir.exists (newpath))
    dir.create (newpath, recursive = TRUE)
file.rename (flist, file.path (newpath, list.files (path)))
unlink (paste0 (fname, &quot;_files&quot;), recursive = TRUE)
</code></pre>
<p>along with a simple replacement in the main file of the former path
with the latter. A final parameter called <code>center_images</code>, when
<code>TRUE</code>, inserts simple <code>&lt;center&gt;</code> and
<code>&lt;/center&gt;</code> lines before and after the standard
markdown image insertion command
(<code>![](&lt;path&gt;/&lt;to&gt;/&lt;image&gt;)</code>).</p>
<h2 id="meta-data-yaml-data-and-the-front-page">meta-data, yaml data, and the front page</h2>
<p>The <code>blog_render()</code> function
then worked, but I still needed to automatically update the front page
to link directly to the latest entry. Another fairly straightforward
<code>yaml</code>-processing task, this time stripping the
<code>yaml</code> headers from all blog entries. This became the second,
and only other, main function, <code>update_main()</code>. This function
essentially just strips the <code>yaml</code> header data out of each
<code>.Rmd</code> blog entry, and re-formats it slightly as the
<code>data/blog.yml</code> file. This in turn relies on one main
function, <code>get_one_blog_dat()</code> which, for example, converts
the metadata for this entry of:</p>
<pre class="markdown"><code>---
title: how i made this site
description: &lt;blah blah blah&gt;
date: 06/05/2019
link: blog/blog001.html
---</code></pre>
<p>into the only slightly-modified version in <code>data/blog.yml</code>
of</p>
<pre class="markdown"><code>-
    title: how i made this site
    description: &lt;blah blah blah&gt;
    created: 06 May 19
    modified: 06 May 19
    link: blog/blog001.html</code></pre>
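<p>The date re-formatting in that conversion can be sketched in base
<strong>R</strong> (a hypothetical helper; the actual
<code>get_one_blog_dat()</code> function does rather more than this):</p>
<pre><code class="hljs r"># Convert the "dd/mm/yyyy" date of the .Rmd front matter to the
# "dd Mon yy" format used in data/blog.yml:
convert_date = function (date) {
    format (as.Date (date, format = "%d/%m/%Y"), "%d %b %y")
}
convert_date ("06/05/2019")
## "06 May 19" (in an English locale)
</code></pre>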
<p>The “created” date is read from the original <code>date</code> field
of the <code>.Rmd</code> metadata, while the “modified” date is the
actual date of file modification. These two dates enable blog entries to
be sorted by dates of either creation or modification with a simple
binary parameter.</p>
<h1 id="conclusion">conclusion</h1>
<p>That’s it. It took me a little while to
construct this site, but most of that time was spent learning how
<a target="_blank" rel="noopener noreferrer" href="https://foundation.zurb.com">
zurb foundation </a> works. Most of the mechanics of site construction
and updating are nevertheless done via the <strong>R</strong> code,
which is really very short and efficient. If you’re interested, the two
files that do the work are <a href="https://github.com/mpadge/mpadge.github.io/blob/source/src/pages/blog/make_entry.R">here,
for rendering a blog entry</a> and <a href="https://github.com/mpadge/mpadge.github.io/blob/source/src/pages/blog/update_main.R">here,
for updating the main page</a>. The <code>blog_render()</code> function
calls the main updating function anyway, so all I ever need to do is to
call one simple function to render any new blog entry and update the
website. The site itself is housed on the <code>master</code> branch of
<a href="https://github.com/mpadge/mpadge.github.io">mpadge.github.io</a>,
while the generating code behind the site is on the <a href="https://github.com/mpadge/mpadge.github.io/tree/source"><code>source</code>
branch</a>. Deployment is controlled with a very simple <a href="https://github.com/mpadge/mpadge.github.io/blob/source/script.sh">bash
script</a>, called by a single <a href="https://github.com/mpadge/mpadge.github.io/blob/source/makefile"><code>makefile</code>
command</a>, which builds the foundation site, copies everything across
from the <code>source</code> to <code>master</code> branches, adds the
changes to <code>git</code>, and creates a commit to update the site.
That’s it. Advantages of having done this my own way:</p>
<ul>
<li>no borrowed templates!</li>
<li>no blogdown</li>
<li>full control over everything</li>
</ul>
</div>]]></content:encoded>
      <pubDate>06 May 19</pubDate>
    </item>
  </channel>
</rss>
