URL: https://www.progressiverobot.com/replace-in-r/

Introduction

In data analysis, you may need to address missing values, negative values, or inaccurate values in a dataset. You can handle these problems by replacing the affected values with 0, NA, or the mean.

In this article, you will explore how to use the replace() and is.na() functions in R.

Prerequisites

To complete this tutorial, you will need:

Replacing the Values in a Vector with replace()

This section will show how to replace a value in a vector.

The syntax of the replace() function takes the target vector, an index vector, and the replacement values:

				
					replace(target, index, replacement)
				
			

First, create a vector:

				
					df <- c('apple', 'orange', 'grape', 'banana')
df
				
			

This will create a vector with apple, orange, grape, and banana:

				
					[secondary_label Output]
"apple"  "orange"  "grape"  "banana"
				
			

Now, replace the second item in the vector:

				
					dy <- replace(df, 2, 'blueberry')
dy
				
			

This will replace orange with blueberry:

				
					[secondary_label Output]
"apple"  "blueberry"  "grape"  "banana"
				
			

Next, replace the fourth item in the vector:

				
					dx <- replace(dy, 4, 'cranberry')
dx
				
			

This will replace banana with cranberry:

				
					[secondary_label Output]
"apple"  "blueberry"  "grape"  "cranberry"
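
The two single-item replacements above can also be combined. The following is a minimal sketch, assuming a vector named `fruits` (a name chosen for this example only), that passes several indices at once and then uses a logical condition as the index:

```r
# `fruits` is a hypothetical name used only in this sketch.
fruits <- c('apple', 'orange', 'grape', 'banana')

# Replace positions 2 and 4 in a single call.
by_position <- replace(fruits, c(2, 4), c('blueberry', 'cranberry'))

# Replace by condition: every element equal to 'grape' becomes 'melon'.
by_condition <- replace(fruits, fruits == 'grape', 'melon')

# replace() returns a modified copy, so `fruits` itself is unchanged.
print(by_position)
print(by_condition)
print(fruits)
```

Because replace() does not modify its input, assign the result to a variable if you want to keep it.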
				
			

Replacing NA Values with 0 in R

Consider a scenario where you have a data frame containing measurements:

				
					[label air_quality]
    Ozone  Solar.R  Wind  Temp  Month  Day
1      41      190   7.4    67      5    1
2      36      118   8.0    72      5    2
3      12      149  12.6    74      5    3
4      18      313  11.5    62      5    4
5      NA       NA  14.3    56      5    5
6      28       NA  14.9    66      5    6
7      23      299   8.6    65      5    7
8      19       99  13.8    59      5    8
9       8       19  20.1    61      5    9
10     NA      194   8.6    69      5   10
11      7       NA   6.9    74      5   11
12     16      256   9.7    69      5   12
				
			

Here is the data in CSV format:

				
					[label air_quality.csv]
Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12
				
			

The file contains the string NA ("Not Available") wherever a measurement is missing. When read.csv() loads the file, it treats these strings as missing values by default.
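
Before replacing anything, it can help to see how many values are missing. The sketch below is one possible way to do that; it assumes the CSV content is supplied inline through the `text` argument of read.csv() instead of being read from a file, and the variable names are chosen for illustration:

```r
# The CSV content from above, supplied inline so the sketch is self-contained.
csv_text <- "Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12"

air <- read.csv(text = csv_text)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_per_column <- colSums(is.na(air))
total_na <- sum(is.na(air))

print(na_per_column)
print(total_na)
```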

You can replace the NA values with 0.

First, define the data frame:

				
					df <- read.csv('air_quality.csv')
				
			

Use is.na() to check if a value is NA. Then, replace the NA values with 0:

				
					df[is.na(df)] <- 0
df
				
			

The data frame is now:

				
					[secondary_label Output]
    Ozone  Solar.R  Wind  Temp  Month  Day
1      41      190   7.4    67      5    1
2      36      118   8.0    72      5    2
3      12      149  12.6    74      5    3
4      18      313  11.5    62      5    4
5       0        0  14.3    56      5    5
6      28        0  14.9    66      5    6
7      23      299   8.6    65      5    7
8      19       99  13.8    59      5    8
9       8       19  20.1    61      5    9
10      0      194   8.6    69      5   10
11      7        0   6.9    74      5   11
12     16      256   9.7    69      5   12
				
			

All occurrences of NA in the data frame have been replaced.
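
If only one column should be changed, replace() and is.na() can be combined on that column alone. This is a minimal sketch, assuming a small data frame named `sample_df` built inline from the first six rows of the Ozone and Solar.R columns:

```r
# `sample_df` is a hypothetical name; the values are the first six rows above.
sample_df <- data.frame(
  Ozone   = c(41, 36, 12, 18, NA, 28),
  Solar.R = c(190, 118, 149, 313, NA, NA)
)

# Only Solar.R is changed; the NA in Ozone is left alone.
sample_df$Solar.R <- replace(sample_df$Solar.R, is.na(sample_df$Solar.R), 0)

print(sample_df)
```

Limiting the replacement to one column avoids writing 0 into columns where a missing value should stay marked as missing.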

Replacing NA Values with the Mean of the Values in R

Many R functions, including mean(), return NA when their input contains NA values. One common way to handle this is to replace each NA with the mean of the remaining values in the same column. The mean() function calculates the mean value.

This approach keeps every row in the dataset, so no observations are lost.

Consider the following input data set with NA values:

				
					[label air_quality]
    Ozone  Solar.R  Wind  Temp  Month  Day
1      41      190   7.4    67      5    1
2      36      118   8.0    72      5    2
3      12      149  12.6    74      5    3
4      18      313  11.5    62      5    4
5      NA       NA  14.3    56      5    5
6      28       NA  14.9    66      5    6
7      23      299   8.6    65      5    7
8      19       99  13.8    59      5    8
9       8       19  20.1    61      5    9
10     NA      194   8.6    69      5   10
11      7       NA   6.9    74      5   11
12     16      256   9.7    69      5   12
				
			

Read the CSV file:

				
					df <- read.csv('air_quality.csv')
				
			

Use is.na() and mean() to replace NA:

				
					df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)
				
			

First, this code finds all the occurrences of NA in the Ozone column. Next, it calculates the mean of the remaining values in the Ozone column; the na.rm = TRUE argument excludes the NA values from the calculation. Then each instance of NA is replaced with the calculated mean.
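
The three steps can also be written out separately. The following sketch assumes the Ozone values are held in a plain vector named `ozone` (a name used here for illustration) so that each intermediate result can be inspected:

```r
# The Ozone column from the table above, as a standalone vector.
ozone <- c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16)

# Step 1: find the positions of the NA values.
missing <- is.na(ozone)

# Step 2: calculate the mean of the non-missing values.
ozone_mean <- mean(ozone, na.rm = TRUE)

# Step 3: write the mean into the NA positions.
ozone[missing] <- ozone_mean

print(which(missing))
print(ozone_mean)
print(ozone)
```

The ten non-missing values sum to 208, so the mean is 20.8.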

Then round() the values to whole numbers:

				
					df$Ozone <- round(df$Ozone, digits = 0)
				
			

The data frame is now:

				
					[secondary_label Output]
    Ozone  Solar.R  Wind  Temp  Month  Day
1      41      190   7.4    67      5    1
2      36      118   8.0    72      5    2
3      12      149  12.6    74      5    3
4      18      313  11.5    62      5    4
5      21       NA  14.3    56      5    5
6      28       NA  14.9    66      5    6
7      23      299   8.6    65      5    7
8      19       99  13.8    59      5    8
9       8       19  20.1    61      5    9
10     21      194   8.6    69      5   10
11      7       NA   6.9    74      5   11
12     16      256   9.7    69      5   12
				
			

The NA values in the Ozone column are now replaced by the rounded mean of the values in the Ozone column (21).
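
The Solar.R column still contains NA values. One way to apply the same idea to every column is a loop over the column names. This is a sketch under assumptions: the data frame `air_sample` is built inline from three of the columns above, and only the imputed mean is rounded so that existing decimal values are left as they are:

```r
# `air_sample` is a hypothetical name; values are copied from the table above.
air_sample <- data.frame(
  Ozone   = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16),
  Solar.R = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256),
  Wind    = c(7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9.7)
)

for (col in names(air_sample)) {
  missing <- is.na(air_sample[[col]])
  # Skip columns that have no missing values.
  if (any(missing)) {
    col_mean <- mean(air_sample[[col]], na.rm = TRUE)
    air_sample[[col]][missing] <- round(col_mean, digits = 0)
  }
}

print(air_sample)
```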

Replacing the Negative Values with 0 or NA in R

During data analysis, you may want to replace the negative values in a data frame with 0 or NA. Negative values are often invalid for quantities that cannot fall below zero, such as counts, and leaving them in place can distort summaries such as sums and means.

Consider the following input data set with negative values:

				
					[label negative_values]
    count  entry1  entry2  entry3
 1      1     345    -234     345
 2      2      65     654     867
 3      3      23     345    3456
 4      4      87     867       9
 5      5    2345      34     867
 6      6     876      98      76
 7      7      35    -456     123
 8      8      87      98     345
 9      9    -765      67     765
10     10    4567     -87     234
				
			

Here is the data in CSV format:

				
					[label negative_values.csv]
count,entry1,entry2,entry3
1,345,-234,345
2,65,654,867
3,23,345,3456
4,87,867,9
5,2345,34,867
6,876,98,76
7,35,-456,123
8,87,98,345
9,-765,67,765
10,4567,-87,234
				
			

Read the CSV file:

				
					df <- read.csv('negative_values.csv')
				
			

Replacing the Negative Values with 0

Use replace() to change the negative values in the entry2 column to 0:

				
					data_zero <- df
data_zero$entry2 <- replace(df$entry2, df$entry2 < 0, 0)
data_zero
				
			

The data frame is now:

				
					[secondary_label Output]
   count entry1 entry2 entry3
1      1    345      0    345
2      2     65    654    867
3      3     23    345   3456
4      4     87    867      9
5      5   2345     34    867
6      6    876     98     76
7      7     35      0    123
8      8     87     98    345
9      9   -765     67    765
10    10   4567      0    234
				
			

The negative values in the entry2 column have been replaced with 0.
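
The negative value in the entry1 column is untouched, because only entry2 was passed to replace(). To cover every column at once, a logical matrix can be used as the index. The sketch below assumes the data frame is built inline instead of read from the CSV file, and the names `neg_df` and `data_zero_all` are chosen for illustration:

```r
# Same values as negative_values.csv, built inline for a self-contained sketch.
neg_df <- data.frame(
  count  = 1:10,
  entry1 = c(345, 65, 23, 87, 2345, 876, 35, 87, -765, 4567),
  entry2 = c(-234, 654, 345, 867, 34, 98, -456, 98, 67, -87),
  entry3 = c(345, 867, 3456, 9, 867, 76, 123, 345, 765, 234)
)

data_zero_all <- neg_df

# `data_zero_all < 0` is a logical matrix marking every negative cell.
data_zero_all[data_zero_all < 0] <- 0

print(data_zero_all)
```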

Replacing the Negative Values with NA

Use replace() to change the negative values in the entry2 column to NA:

				
					data_na <- df
data_na$entry2 <- replace(df$entry2, df$entry2 < 0, NA)
data_na
				
			

The data frame is now:

				
					[secondary_label Output]
   count entry1 entry2 entry3
1      1    345     NA    345
2      2     65    654    867
3      3     23    345   3456
4      4     87    867      9
5      5   2345     34    867
6      6    876     98     76
7      7     35     NA    123
8      8     87     98    345
9      9   -765     67    765
10    10   4567     NA    234
				
			

The negative values in the entry2 column have been replaced with NA.
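
The choice between 0 and NA affects later calculations. As a rough illustration, the sketch below compares the mean of the entry2 values under both replacements; it assumes the column is held in a standalone vector named `entry2`:

```r
# The entry2 column from negative_values.csv as a standalone vector.
entry2 <- c(-234, 654, 345, 867, 34, 98, -456, 98, 67, -87)

with_zero <- replace(entry2, entry2 < 0, 0)
with_na   <- replace(entry2, entry2 < 0, NA)

# Zeros count as observations, so they pull the mean down.
mean_zero <- mean(with_zero)

# NA values are skipped when na.rm = TRUE, so only 7 values are averaged.
mean_na <- mean(with_na, na.rm = TRUE)

print(mean_zero)
print(mean_na)
```

Replacing with 0 treats the value as a real observation, while replacing with NA marks it as unknown and leaves it out of calculations that use na.rm = TRUE.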

Conclusion

Replacing values in a vector or data frame is a common data-cleaning step in R. Using replace() and is.na(), you can substitute 0, NA, or the mean for missing and negative values to clean up a dataset before analysis.

Continue your learning with How To Use sub() and gsub() in R.