I inerited a piece of R code written by a former SAS programmer. It did not work as intended.
I came across some code at work where the author wanted to condition an action based on the presence of missing data. Many of the people I work with also started out as and still work as SAS programmers, so we have a habit of trying to do things in R like:
if (x == "") {do something}
rather than is.na(x)
. I found the bug because the code wasn’t working properly. It was supposed to do something when data was missing. I had missing data but nothing was happening as a consequence.
Let’s see the difference between the the two.
Find the missing data with x == ""
x == ""
[1] FALSE FALSE TRUE NA FALSE
y == ""
[1] FALSE FALSE NA FALSE
What about is.na()
?
So, x == ""
is true if it is literally true. In SAS programming, we can use ""
as shorthand for missing. But with R, if a vector has a value NA
, the logical test returns NA
. The code I was working with failed because the missing data was NA
so the logical test in the code did not return TRUE
.
The function is.na()
is the best way to find missing values in both character and numerical vectors with one small caveat; I have seen character data where missing values were recorded as ""
. If you have data like that, is.na()
won’t find it.
For attribution, please cite this work as
Bleier (2021, Feb. 28). Random Digits Blog: Missing does not equal blank: your SAS jedi mind tricks won't work here.. Retrieved from https://randomdigits.io/posts/2021-02-28-missing-does-not-equal/
BibTeX citation
@misc{bleier2021missing, author = {Bleier, Jonathan}, title = {Random Digits Blog: Missing does not equal blank: your SAS jedi mind tricks won't work here.}, url = {https://randomdigits.io/posts/2021-02-28-missing-does-not-equal/}, year = {2021} }