Published

January 13, 2026

Introduction

Code
set.seed(123)

source("Baseline/utils.R")

##############
## Packages ##
##############

library(plyr) # Used for mapping values
suppressPackageStartupMessages(library(tidyverse)) # ggplot2, dplyr, and magrittr
library(readxl) # Read in Excel files
library(lubridate) # Handle dates
library(datefixR) # Standardise dates
library(patchwork) # Arrange ggplots

# Generate tables
suppressPackageStartupMessages(library(table1))
library(knitr)
library(pander)

# Generate flowchart of cohort derivation
library(DiagrammeR)
library(DiagrammeRsvg)

# paths to PREdiCCt data
if (file.exists("/docker")) { # If running in docker
  data.path <- "data/final/20221004/"
  redcap.path <- "data/final/20231030/"
  prefix <- "data/end-of-follow-up/"
  outdir <- "data/processed/"
} else { # Run on OS directly
  data.path <- "/Volumes/igmm/cvallejo-predicct/predicct/final/20221004/"
  redcap.path <- "/Volumes/igmm/cvallejo-predicct/predicct/final/20231030/"
  prefix <- "/Volumes/igmm/cvallejo-predicct/predicct/end-of-follow-up/"
  outdir <- "/Volumes/igmm/cvallejo-predicct/predicct/processed/"
}

demo <- readRDS(paste0(outdir, "demo-IBD.RDS"))

labs <- as.data.frame(read_xlsx(paste0(
  data.path,
  "Baseline2022/bloodtestresults.xlsx"
)))

labs <- labs %>%
  distinct(ParticipantNo, .keep_all = TRUE)


labs <- labs[, c(
  "ParticipantNo",
  "CReactiveProtein",
  "Haemoglobin",
  "HaemoglobinUnit",
  "WCC",
  "WCCUnit",
  "Platelets",
  "PlateletsUnit",
  "Albumin",
  "AlbuminUnit"
)]
demo <- merge(demo, labs, by = "ParticipantNo", all.x = TRUE, all.y = FALSE)

Participants’ IBD care teams reported biochemistry results via REDCap. Whilst data from many types of blood tests were collected, only the most relevant tests are presented here. Where possible, reference levels have been denoted on plots via vertical lines.

Faecal calprotectin

As a reminder, only subjects with a faecal calprotectin are included in this analysis, and therefore NA values are not reported for faecal calprotectin. Faecal calprotectin has been grouped into FC<50, 50 ≤ FC ≤ 250, and FC > 250.

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = cat, color = cat, fill = cat)) +
  geom_bar() +
  labs(x = "FC category", y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("#408585", "#F59954", "#E4384E")) +
  scale_color_manual(values = c("#244644", "#A35805", "#9A1027"))

C-reactive protein

Code
perc <- (1 - demo %>%
  drop_na(cat) %>%
  select(CReactiveProtein) %>%
  is.na() %>%
  sum() /
  demo %>%
    drop_na(cat) %>%
    nrow()) * 100

Alongside FC, CRP is the most requested laboratory test for monitoring IBD disease activity. Unlike FC which acts as a proxy for gastrointestinal inflammation, CRP provides an indication of inflammation across the body. The reference level for CRP is considered to be < 5 mg/L.

CRP was generally well-completed with 76.4% of subjects having an associated CRP. As can be seen in Figure 1, there were some extreme, albeit physiologically plausible, values reported.

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = CReactiveProtein)) +
  geom_histogram(bins = 100, fill = "#EB9486", color = "#BF7163") +
  theme_minimal() +
  xlab("CRP (mg/L)") +
  ylab("Frequency")
Figure 1: Distribution of CRP at recruitment for the FC cohort.

As should be expected, CRP was found to be associated with FC.

Code
pander(summary(aov(CReactiveProtein ~ cat, data = demo)))
Table 1: ANOVA between CRP and FC groups.
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cat 2 1949 974.4 7.98 0.0003556
Residuals 1635 199628 122.1 NA NA
Code
p <- demo %>%
  drop_na(cat) %>%
  ggplot(aes(x=cat, y = CReactiveProtein)) +
  geom_boxplot(staplewidth  = 1,
               fill = "#C0D8E0",
               color = "#42797B",
               outlier.shape = NA) +
  ylim(0, 12) +
  theme_minimal() +
  xlab("Faecal calprotectin category (\U03BCg/g)") +
  ylab("C-reactive protein (mg/L)")
cairo_pdf("plots/CRP-FC-boxplot.pdf", width = 8, height = 6)
p
Warning: Removed 601 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Code
Warning: Removed 601 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Figure 2: Boxplot of CRP by FC category.

Haemoglobin

Code
perc <- (1 - demo %>%
  drop_na(cat) %>%
  select(Haemoglobin) %>%
  is.na() %>%
  sum() /
  demo %>%
    drop_na(cat) %>%
    nrow()) * 100

Haemoglobin is a protein which facilitates the transport of oxygen in red blood cells. Haemoglobin has been reported for 83.07% of subjects in the FC cohort. Haemoglobin differs between sex and accordingly has sex-specific reference levels (130-170 g/L for men and 120-150g/L for women).

Whilst a normal distribution for haemoglobin is observed overall (Figure 3), there appears to be some observations substantially lower than one would expect for this distribution. Investigating these observations suggests that these observations are likely to be in units of g/dL instead of the common g/L units (Table 2).

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = Haemoglobin)) +
  geom_histogram(bins = 100, fill = "#76E5FC", color = "#2EB6CD") +
  theme_minimal() +
  xlab("Haemoglobin") +
  ylab("Frequency")
Figure 3: Distribution of haemoglobin before processing.
Code
knitr::kable(table(subset(demo, Haemoglobin < 50)$HaemoglobinUnit),
  col.names = c("Units", "Frequency")
)
Table 2: Reported unit measurements for Haemoglobin reported less than 50.
Units Frequency
g/dL 6
g/l 1
g/L 5
mmol/mol 1

Having assumed any Haemoglobin < 50 has been reported in g/dL, any Haemoglobin observations <50 have been multiplied by 10 to be in g/L units. This results in the distribution seen in Figure 4.

Code
demo <- demo %>%
  mutate(Haemoglobin = if_else(Haemoglobin < 50,
    Haemoglobin * 10,
    Haemoglobin
  )) %>%
  select(-HaemoglobinUnit)

p1 <- demo %>%
  drop_na(cat) %>%
  filter(Sex == "Female") %>%
  ggplot(aes(x = Haemoglobin)) +
  geom_histogram(bins = 100, fill = "#8789C0", color = "#6A6B9C") +
  theme_minimal() +
  xlab("Haemoglobin (g/L)") +
  ylab("Frequency") +
  xlim(0, 190) +
  geom_vline(xintercept = c(120, 150), color = "#93032E") +
  ggtitle("Female haemoglobin")



p2 <- demo %>%
  drop_na(cat) %>%
  filter(Sex == "Male") %>%
  ggplot(aes(x = Haemoglobin, fill = Sex, color = Sex)) +
  geom_histogram(bins = 100, fill = "#225560", color = "#044651") +
  theme_minimal() +
  xlab("Haemoglobin (g/L)") +
  ylab("Frequency") +
  xlim(0, 190) +
  geom_vline(xintercept = c(130, 170), color = "#93032E") +
  ggtitle("Male haemoglobin")

(p1 / p2)
Figure 4: Distribution of haemoglobin after processing.

Haemoglobin was not found to be significantly associated with FC at a 5% significance level.

Code
pander(summary(aov(Haemoglobin ~ cat, data = demo)))
Table 3: ANOVA between haemoglobin and FC groups.
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cat 2 940.6 470.3 2.539 0.07921
Residuals 1778 329292 185.2 NA NA

White cell count

Code
perc <- (1 - demo %>%
  drop_na(cat) %>%
  select(WCC) %>%
  is.na() %>%
  sum() /
  demo %>%
    drop_na(cat) %>%
    nrow()) * 100

White cell count (WCC) is a measurement of all white blood cells (neutrophils, lymphocytes, monocytes, basophils and eosinophils) which help to fight infections. 83.02% of subjects have a WCC recorded.

Whilst some extreme WCC values are observed, most observations fall within the reference range (4-11 (×10^9/L)) and a conventional bell curve is seen overall.

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = WCC)) +
  geom_histogram(bins = 100, fill = "#7FD8BE", color = "#2D977E") +
  theme_minimal() +
  xlab("White cell count (×10^9/L") +
  ylab("Frequency") +
  geom_vline(xintercept = c(4, 11), color = "#FF686B", alpha = 0.8)
Figure 5: Distrbution of WCC for the FC cohort. Vertical lines indicate the reference range.

WCC is reported to be associated with FC.

Code
pander(summary(aov(WCC ~ cat, data = demo)))
Table 4: ANOVA between WCC and FC groups.
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cat 2 95.27 47.63 8.615 0.0001891
Residuals 1777 9826 5.529 NA NA

Platelets

Code
perc <- (1 - demo %>%
  drop_na(cat) %>%
  select(Platelets) %>%
  is.na() %>%
  sum() /
  demo %>%
    drop_na(cat) %>%
    nrow()) * 100

The majority of subjects (83.02%) have a platelets test result recorded.

Whilst some extreme values are observed, most fall within the reference range.

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = Platelets)) +
  geom_histogram(bins = 100, fill = "#84E6F8", color = "#018897") +
  theme_minimal() +
  xlab("Platelets (×10^9/L") +
  ylab("Frequency") + 
  geom_vline(xintercept = c(150, 450), color = "#7b1907", alpha = 0.8)
Figure 6: Distrbution of Platelets for the FC cohort. Vertical lines indicate the reference range.

Platelets is reported to be significantly associated with FC.

Code
pander(summary(aov(Platelets ~ cat, data = demo)))
Table 5: ANOVA between platelets and FC groups.
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cat 2 96868 48434 9.635 6.891e-05
Residuals 1777 8933162 5027 NA NA

Albumin

Code
perc <- (1 - demo %>%
  drop_na(cat) %>%
  select(Albumin) %>%
  is.na() %>%
  sum() /
  demo %>%
    drop_na(cat) %>%
    nrow()) * 100

Albumin is globular protein made in the liver which carries hormones, vitamins, and enzymes. Albumin has a reference range of 35-50 (g/L). 79.1% of subjects in the FC cohort have a recorded Albumin result.

Code
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = Albumin)) +
  geom_histogram(bins = 100, fill = "#21D19F", color = "#26A37D") +
  theme_minimal() +
  xlab("Albumin") +
  ylab("Frequency") +
  geom_vline(xintercept = c(35, 50), color = "#FF686B")
Figure 7: Distribution of Albumin before processing.

As can be seen in Figure 7, a very extreme value has been reported. This has been assumed to have been off by a factor of 10. Adjusting this observation results in the distribution seen in Figure 8.

Code
demo <- demo %>%
  mutate(Albumin = if_else(Albumin > 100, Albumin / 10, Albumin)) %>%
  select(-AlbuminUnit)
demo %>%
  drop_na(cat) %>%
  ggplot(aes(x = Albumin)) +
  geom_histogram(bins = 40, fill = "#21D19F", color = "#26A37D") +
  theme_minimal() +
  xlab("Albumin (g/L)") +
  ylab("Frequency") +
  geom_vline(xintercept = c(35, 50), color = "#FF686B")
Figure 8: Distribution of Albumin after processing.

There is strong evidence for an association between Albumin and FC.

Code
pander(summary(aov(Albumin ~ cat, data = demo)))
Table 6: ANOVA between Albumin and FC groups.
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cat 2 1069 534.5 23.85 6.098e-11
Residuals 1693 37943 22.41 NA NA

Missingness

Code
demo %>%
  select(
    cat,
    CReactiveProtein,
    Haemoglobin,
    WCC,
    Platelets,
    Albumin,
    SiteName
  ) %>%
missing_plot2(title = "Biochemistry missingness")

Code
saveRDS(demo, paste0(outdir, "demo-biochem.RDS"))

Reproduction and reproducibility

Session info

R version 4.4.0 (2024-04-24)

Platform: aarch64-unknown-linux-gnu

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: DiagrammeRsvg(v.0.1), DiagrammeR(v.1.0.11), pander(v.0.6.5), knitr(v.1.47), table1(v.1.4.3), patchwork(v.1.2.0), datefixR(v.1.6.1), readxl(v.1.4.3), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1), tidyverse(v.2.0.0) and plyr(v.1.8.9)

loaded via a namespace (and not attached): shape(v.1.4.6.1), gtable(v.0.3.5), xfun(v.0.44), htmlwidgets(v.1.6.4), visNetwork(v.2.1.2), lattice(v.0.22-6), tzdb(v.0.4.0), vctrs(v.0.6.5), tools(v.4.4.0), generics(v.0.1.3), curl(v.5.2.1), fansi(v.1.0.6), pan(v.1.9), jomo(v.2.7-6), pkgconfig(v.2.0.3), Matrix(v.1.7-0), RColorBrewer(v.1.1-3), lifecycle(v.1.0.4), compiler(v.4.4.0), farver(v.2.1.2), munsell(v.0.5.1), codetools(v.0.2-20), htmltools(v.0.5.8.1), yaml(v.2.3.8), finalfit(v.1.0.7), glmnet(v.4.1-8), Formula(v.1.2-5), mice(v.3.16.0), nloptr(v.2.0.3), pillar(v.1.9.0), MASS(v.7.3-60.2), iterators(v.1.0.14), rpart(v.4.1.23), boot(v.1.3-30), mitml(v.0.4-5), foreach(v.1.5.2), nlme(v.3.1-164), tidyselect(v.1.2.1), digest(v.0.6.35), stringi(v.1.8.4), splines(v.4.4.0), labeling(v.0.4.3), fastmap(v.1.2.0), grid(v.4.4.0), colorspace(v.2.1-0), cli(v.3.6.2), magrittr(v.2.0.3), survival(v.3.5-8), utf8(v.1.2.4), broom(v.1.0.6), withr(v.3.0.0), scales(v.1.3.0), backports(v.1.5.0), timechange(v.0.3.0), rmarkdown(v.2.27), nnet(v.7.3-19), lme4(v.1.1-35.3), cellranger(v.1.1.0), hms(v.1.1.3), evaluate(v.0.23), V8(v.4.4.2), rlang(v.1.1.3), Rcpp(v.1.0.12), glue(v.1.7.0), minqa(v.1.2.7), jsonlite(v.1.8.8) and R6(v.2.5.1)

Licensed by CC BY unless otherwise stated.