URL: https://www.progressiverobot.com/replace-in-r/
Introduction
In data analysis, you may need to deal with missing, negative, or inaccurate values in a dataset. You can handle these by replacing them with 0, NA, or the mean.
In this article, you will explore how to use the replace() and is.na() functions in R.
Replacing the Values in a Vector with replace()
This section will show how to replace a value in a vector.
The replace() function takes three arguments: the target vector, an index vector, and the replacement values:
replace(target, index, replacement)
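As a quick illustration of the index argument, the sketch below shows that it can hold several positions or a logical condition. The vector name `fruits` and the replacement values are hypothetical, chosen only for this example.

```r
# Hypothetical example vector; the names below are illustrative only.
fruits <- c("apple", "orange", "grape", "banana")

# A numeric index vector can target several positions in one call.
by_position <- replace(fruits, c(2, 4), c("blueberry", "cranberry"))

# A logical index vector replaces every element where the condition is TRUE.
by_condition <- replace(fruits, fruits == "grape", "kiwi")

# replace() returns a modified copy; the original vector is left unchanged.
print(by_position)
print(by_condition)
print(fruits)
```

Because replace() returns a copy, you need to assign the result to a variable to keep it.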
First, create a vector:
df <- c('apple', 'orange', 'grape', 'banana')
df
This will create a vector with apple, orange, grape, and banana:
[secondary_label Output]
"apple" "orange" "grape" "banana"
Now, let's replace the second item in the vector:
dy <- replace(df, <^>2<^>, <^>'blueberry'<^>)
dy
This will replace orange with blueberry:
[secondary_label Output]
"apple" "blueberry" "grape" "banana"
Now, we'll replace the fourth item in the vector:
dx <- replace(dy, <^>4<^>, <^>'cranberry'<^>)
dx
This will replace banana with cranberry:
[secondary_label Output]
"apple" "blueberry" "grape" "cranberry"
Replacing NA Values with 0 in R
Consider a scenario where you have a data frame containing measurements:
[label air_quality]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
Here is the data in CSV format:
[label air_quality.csv]
Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12
The file uses the string NA ("Not Available") wherever data is missing. By default, read.csv() reads these entries as missing values.
You can replace the NA values with 0.
First, define the data frame:
df <- read.csv('air_quality.csv')
Use is.na() to check if a value is NA. Then, replace the NA values with 0:
df[is.na(df)] <- 0
df
The data frame is now:
[secondary_label Output]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 0 0 14.3 56 5 5
6 28 0 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 0 194 8.6 69 5 10
11 7 0 6.9 74 5 11
12 16 256 9.7 69 5 12
All occurrences of NA in the data frame have been replaced.
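Before overwriting missing values, it can help to see how many there are in each column. The following is a minimal sketch that assumes the same data as above, loaded from an inline string instead of a file so it runs on its own. The names `csv_text`, `air`, `na_counts`, and `air_zero` are illustrative.

```r
# Same data as air_quality.csv, embedded as text so no file is needed.
csv_text <- "Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12"

air <- read.csv(text = csv_text)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(air))
print(na_counts)

# Work on a copy so the original data stays available for comparison.
air_zero <- air
air_zero[is.na(air_zero)] <- 0

# No missing values remain in the copy.
print(sum(is.na(air_zero)))
```

Keeping the untouched data in `air` makes it easy to compare summaries before and after the replacement.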
Replacing NA Values with the Mean of the Values in R
In many cases, replacing NA values with the mean of the column preserves more of the data than dropping incomplete rows. The mean() function calculates the mean value.
With this approach, each NA is replaced by the mean of the remaining values in the same column, so no rows are discarded.
Consider the following input data set with NA values:
[label air_quality]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
Read the CSV file:
df <- read.csv('air_quality.csv')
Use is.na() and mean() to replace NA:
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)
First, this code finds all the occurrences of NA in the Ozone column. Next, it calculates the mean of all the values in the Ozone column – excluding the NA values with the na.rm argument. Then each instance of NA is replaced with the calculated mean.
Then round() the values to whole numbers:
df$Ozone <- round(df$Ozone, digits = 0)
The data frame is now:
[secondary_label Output]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 21 NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 21 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
The NA values in the Ozone column are now replaced by the rounded mean of the values in the Ozone column (21).
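The same pattern can be wrapped in a small helper and reused for other columns, such as Solar.R. This is a sketch only: `impute_mean` is a hypothetical function name, not part of base R, and the two vectors simply repeat the Ozone and Solar.R values from the data above.

```r
# Hypothetical helper: fill NA entries with the mean of the non-missing values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Ozone and Solar.R columns copied from the air quality data above.
ozone <- c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16)
solar <- c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256)

# Impute, then round to whole numbers as in the Ozone example.
ozone_filled <- round(impute_mean(ozone), digits = 0)
solar_filled <- round(impute_mean(solar), digits = 0)

print(ozone_filled)
print(solar_filled)
```

Rounding is optional. It is used here only to match the whole-number values in the rest of the column.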
Replacing the Negative Values with 0 or NA in R
Sometimes you will want to replace the negative values in a data frame with 0 or NA. For measurements that cannot be negative, such as counts, a negative value usually indicates a recording error, and leaving it in place will skew summary statistics such as the mean.
Consider the following input data set with negative values:
[label negative_values]
count entry1 entry2 entry3
1 1 345 -234 345
2 2 65 654 867
3 3 23 345 3456
4 4 87 867 9
5 5 2345 34 867
6 6 876 98 76
7 7 35 -456 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 -87 234
Here is the data in CSV format:
[label negative_values.csv]
count,entry1,entry2,entry3
1,345,-234,345
2,65,654,867
3,23,345,3456
4,87,867,9
5,2345,34,867
6,876,98,76
7,35,-456,123
8,87,98,345
9,-765,67,765
10,4567,-87,234
Read the CSV file:
df <- read.csv('negative_values.csv')
Replacing the Negative Values with 0
Use replace() to change the negative values in the entry2 column to 0:
data_zero <- df
data_zero$entry2 <- replace(df$entry2, df$entry2 < 0, 0)
data_zero
The data frame is now:
[secondary_label Output]
count entry1 entry2 entry3
1 1 345 0 345
2 2 65 654 867
3 3 23 345 3456
4 4 87 867 9
5 5 2345 34 867
6 6 876 98 76
7 7 35 0 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 0 234
The negative values in the entry2 column have been replaced with 0.
Replacing the Negative Values with NA
Use replace() to change the negative values in the entry2 column to NA:
data_na <- df
data_na$entry2 <- replace(df$entry2, df$entry2 < 0, NA)
data_na
The data frame is now:
[secondary_label Output]
count entry1 entry2 entry3
1 1 345 NA 345
2 2 65 654 867
3 3 23 345 3456
4 4 87 867 9
5 5 2345 34 867
6 6 876 98 76
7 7 35 NA 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 NA 234
The negative values in the entry2 column have been replaced with NA.
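The examples above change only entry2, so the negative value in entry1 (row 9) is left as it is. To treat several columns at once, one option is to apply replace() to each of them with lapply(). The sketch below assumes the same data, loaded from an inline string so it runs without a file. The names `csv_text`, `neg`, `cols`, and `neg_zero` are illustrative.

```r
# Same data as negative_values.csv, embedded as text so no file is needed.
csv_text <- "count,entry1,entry2,entry3
1,345,-234,345
2,65,654,867
3,23,345,3456
4,87,867,9
5,2345,34,867
6,876,98,76
7,35,-456,123
8,87,98,345
9,-765,67,765
10,4567,-87,234"

neg <- read.csv(text = csv_text)

# Columns to clean; count is left out because it is an identifier.
cols <- c("entry1", "entry2", "entry3")

# Apply the same replace() call to each selected column of a copy.
neg_zero <- neg
neg_zero[cols] <- lapply(neg_zero[cols], function(x) replace(x, x < 0, 0))

print(neg_zero)
```

Swapping the final 0 for NA in the replace() call marks the negative values as missing instead.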
Conclusion
Replacing values is a routine part of preparing data for analysis in R. Using replace() and is.na(), you can swap NA and negative values for 0, NA, or the mean to clean up a dataset before analysis.
Continue your learning with How To Use sub() and gsub() in R.