从Dataset-JSON v1.1 User's Guide中，提到两个和R相关的值得注意的细节：SAS与R的四舍五入差异和R中浮点数的特性，这可能是用R进行数据处理无法回避的两个细节，本篇文章是对这两篇文章的解读和学习，文章地址在文末。

R与SAS在四舍五入函数的差别及解决办法

在SAS里使用round()处理0.5的四舍五入后值是1，处理-0.5四舍五入后值是-1。

data _null_;
    a = 0.5;
    b = round(0.5);
    put "四舍五入0.5后值为: " b;
    c = -0.5;
    d = round(c);
    put  "四舍五入-0.5后值为: " d;
run;
 
 四舍五入0.5后值为: 1
 四舍五入-0.5后值为: -1

但是在R里，使用原生函数round()却出现了不一样的结果：

> round(0.5)
[1] 0
> round(1.5)
[1] 2

这是因为SAS的round()采取了远离零点的舍入规则，0.5会四舍五入为1；而R中base::round()采取的是向偶数取整的规则，因此round(0.5)得到0，而round(1.5)得到2。

处理办法

原生R中无法得到和SAS四舍五入一样的结果，因此在R中可以用janitor::round_half_up()或者tidytlg::roundSAS()两个包中的函数达到目的。

> janitor::round_half_up(0.5)
[1] 1
> tidytlg::roundSAS(0.5)
[1] 1

此外，作者提到这篇文章https://www.lexjansen.com/phuse-us/2020/ct/CT05.pdf中介绍了另一个方法，自己写round函数。虽然大多数时候这个函数都凑效，但在有些取值精度上还是会出现些许错误，比如案例中的2436.845保留2为小数，就变成了2436.84而不是2436.85。

# a function that rounds half up
# exact copy from: https://www.lexjansen.com/phuse-us/2020/ct/CT05.pdf
ut_round <- function(x, n = 0) {
  # x is the value to be rounded
  # n is the precision of the rounding
  scale <- 10^n
  y <- trunc(x * scale + sign(x) * 0.5) / scale
  # Return the rounded number
  return(y)
}
# round half up for 2436.845, with 2 decimal places
ut_round(2436.845, 2)

[1] 2436.84

这是一个取值精度的问题，在函数里加了一个R语言的内置常量（系统所能表示的最小正浮点数），就可以缓解浮点精度问题。

# revised rounds half up
ut_round1 <- function(x, n = 0) {
  # x is the value to be rounded
  # n is the precision of the rounding
  scale <- 10^n
  y <- trunc(x * scale + sign(x) * 0.5 + sqrt(.Machine$double.eps)) / scale
  # Return the rounded number
  return(y)
}
# round half up for 2436.845, with 2 decimal places
ut_round1(2436.845, 2)

[1] 2436.85

可以使用的函数

这个问题在开源论坛StackOverflow被广泛讨论之后，以下几个函数得到了精度修复：

janitor::round_half_up() version >= 2.1.0
tidytlg::roundSAS()
scrutiny::round_up_from()/round_up() version >= 0.2.5

需要理解的容差

虽然上述的四舍五入函数已经得到了修复，但它们和Base R里的round()还是有精度上的差距，比如减去最小正浮点数，前两个函数依旧将1.5四舍五入为2，而base R的round()值为1，这是因为R语言中存在默认容差导致的，1.5末尾的浮点数，让1.5 - sqrt(.Machine$double.eps)和1.5视为相等。

> a <- 1.5 - sqrt(.Machine$double.eps)
> tidytlg::roundSAS(a)
[1] 2
> janitor::round_half_up(a, digits = 0)
[1] 2
> round(a)
[1] 1

也有文章提到可以用cards::round5()，测试了一下和上面两个包效果一样。

结论

不同编程语言在一些细节处理上会有差异，重要的点在于了解这些差异，并且确保在统计分析计划中需要明确记录精确四舍五入的方法，无论使用哪种编程语言都遵循这个标准。

R中的浮点数值

浮点数是表示整数之间的数值，比如0.5、3.1415926等，它的储存并非像整数那样，显示数字的近似值，如果用format()函数增加显示位数，浮点数的精度就变了：

> format(3.14, digits = 22)
[1] "3.140000000000000124345"

# 一样的浮点数，精确到小数点许多位之后就出现了差异
> (0.1 + 0.2) %>% format(digits = 22)
[1] "0.3000000000000000444089"
> 0.3 %>% format(digits = 22)
[1] "0.2999999999999999888978"

造成的影响

用浮点数来进行比较的时候，应避免精确比较运算符如：==和>=，而这种比较在ADaM数据集衍生一些分组变量时其实是经常用到的，如果没有采取措施直接比较浮点数，可能会造成错误结果。

潜在解决方案

用signif()函数将浮点数四舍五入到指定有效位数再比较两个浮点数，举个栗子：

# 直接比较两个浮点数，因为小数位数，两者不相等
> (0.1 + 0.2) %>% format(digits = 22)
[1] "0.3000000000000000444089"
> 0.3 %>% format(digits = 22)
[1] "0.2999999999999999888978"
> (0.1+0.2)*1.1 == 0.3 * 1.1
[1] FALSE

# 用signif函数四舍五入到指定有效位数15位，再比较
> signif((0.1+0.2)*1.1, 15) == signif(0.3 * 1.1, 15)
[1] TRUE

> signif((0.1+0.2)*1.1, 20) == signif(0.3 * 1.1, 20)
[1] FALSE

可以注意到，signif()保留到第15位，两者就是相等的，但是保留到20位就不相等了。具体到多少位，可以根据实际情况与统计协商，记录在SAP再执行。

参考文章

https://wiki.cdisc.org/display/PUB/Precision+and+Rounding

https://pharmaverse.github.io/blog/posts/2023-07-24_rounding/rounding.html

https://psiaims.github.io/CAMIS/Comp/r-sas_rounding.html

https://pharmaverse.github.io/blog/posts/2023-10-30_floating_point/floating_point.html