Data forms the backbone of any data analytics you do. Regarding data, there are many things to go wrong – be it construction, arrangement, formatting, spelling, duplication, extra space, and so on. In order to perform data analytics properly we need various data cleaning techniques so that our data is ready for analysis. It is commonly said that,
“Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.”
Thus, it is necessary to be accustomed to the process of data cleansing techniques and all means of clearing data, which are related to data cleansing methods. This post provides a very basic introduction to data cleansing techniques in Excel.
This post covers the following data cleaning steps in Excel along with data cleansing examples:
- Get Rid of Extra Spaces
- Select and Treat All Blank Cells
- Convert Numbers Stored as Text into Numbers
- Remove Duplicates
- Highlight Errors
- Change Text to Lower/Upper/Proper Case
- Spell Check
- Delete all Formatting
Data cleansing or data cleaning is the process of identifying and removing (or correcting) inaccurate records from a dataset, table, or database and refers to recognizing unfinished, unreliable, inaccurate or non-relevant parts of the data and then restoring, remodeling, or removing the dirty or crude data.
Data cleaning techniques may be performed as batch processing through scripting or interactively with data cleansing tools.
Are Data Cleaning Techniques Essential?
Data cleaning techniques are not only an essential part of the data science process – it’s also the most time-consuming part. As the New York Times reported in a 2014 article called “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”,
“Data scientists… spend 50 percent to 80 percent of their time, before this useful nugget can be discovered, rooted in this more mundane labor of collecting and preparing uncontrolled digital data.
Unfortunately, data cleaning techniques are not commonly reported in the media nor are they taught in most intro data science courses because it is not as important for training neural networks or identifying images, but those Cleaning data plays a very important role in getting things done.
Without the data cleaning techniques, the neural networks and image identification modules will not be as efficient as we want them to be.
Data Cleansing Examples and Data Cleaning Methods in Excel
In this post, I will show you various ways to clean data in Excel with data cleansing examples & data cleansing techniques.
1. Data Cleaning Techniques-Get Rid of Extra Spaces
Here I have the text Welcome Toshivatechnical written in four different ways.The first one is a regular way with only one space between words, in the second case I have more than one space between words, in the third case I have some leading spaces and some spaces between words and in the fourth case I have After the spaces have, you can see that there are a couple of spaces after the last word.
Now, this could typically be the case if you get this data from a colleague or you get it from a text file or imported from a database. So to clean this data and get rid of these extra spaces you can use the function trim.
Trim function takes one single argument which could either be the text which you type manually or it could be the cell reference, in this case, I will take the cell reference A1 and what this function does is it would remove all the leading spaces and trailing spaces and extra spaces between words except one single space that is allowed.
So if I drag this down you would see that it has corrected all these texts. It has removed the extra space here between welcome into it has removed the leading spaces and trailing spaces.
2. Data Cleaning Techniques-Select and Treat All Blank Cells
If you only need to use the text, you can convert it into values using special paste. Here are the names of the students and their three subjects. You can see that there are some gaps in this dataset which may be due to the student not appearing in the exam.
Now you may not want to leave this data set with blanks, you may want to type not appear in all these cells which are blank. So to do that you can either go and select each cell manually and type not appear. But if you have a huge data set that because this could be very tiresome. So to do it at one go,
Select the entire data set,
Go to find and select and select this option Go to Special this opens the go-to special dialog box. You can also use the keyboard shortcut F5 and when you do this it opens the go-to dialog box here you have special button, click on it and it again opens it equal to special dialogue box
Click blanks and click okay, this would select all the blank cells in your data set at the same time
So now you have these cells in gray and the first cell is in white color because it is the active cell so to start typing it does not start appearing in all these cells, do not start typing and enter ctrl + and the like Hee you hit ctrl + enter it enters all cells.
3. Data Cleaning Techniques-Convert Numbers Stored as Text into Numbers
Here I have this number entered in three different ways,
In the first case, it is a number as you can see that it aligns to the right of the cell number is always aligned to the right while the text aligns to the left of a cell and in the other two cases, you can see these Are text formats because they align to the left. Now these three have to be converted back into numbers.
The first one is already a number but there are two ways to do it to convert these two into one number. The first one I will go into the formatting box and I will write the normal one and when I enter the second one will be converted back to a number because in this case it was only in text format but the third case is a bit more difficult because it using a major apostrophe Is recorded and many people do so, many people enter the number to begin with apostrophy so that it is a Changes to the text and it could cause some problems.
A very foolproof method is type in any of the blank cells go to the cell and copy this now select these cells go to paste -> paste special and this opens the paste special dialog box. Here you have operation category within this select multiply and click okay so what this does is this multiplies this number with one and any number multiplied by one is unchanged but this also takes care of the apostrophe so now all these three numbers get converted back into a number format.
4. Data Cleaning Techniques-Remove Duplicates
Here I have a data set of students and their marks in three subjects and there are duplicates in this data, so you can see that there is a duplicate for Bill and a duplicate for Phil now if you want to remove these duplicate values There are two ways it is using conditional formatting first
- Highlight cell rules -> duplicate values and as soon as you select this it gives you the option to highlight duplicates and the formatting. I will keep the formatting as a red fill with dark red text and when I hit OK you can see that this has been highlighted and all those numbers and names that appear more than once it highlighted in red.
I can manually see that Phil repeats twice and Bill repeats twice so I can select this data and manually delete this.
The other way to remove duplicate and clear data is to select the entire data set that goes into the data and here I have the option to remove duplicate, I click on this option and it opens the delete duplicate dialog box Make sure that if you have a data header that is in this case, it has this option. If it is not selected then it is also included as a part of your data. Not known, this should not happen when you have selected this option. These names are the names of the columns. So I can see a student column in the Chemistry column is the Mathematics column. The Physics column, then I want all those rows or all the data. I can select OK to remove the sets that are duplicates, but in this case it will not delete a number that repeats again but rather the entire row is an exact duplicate. Be it
For example, Jack and these three marks have to meet exactly the numbers and name here Jack and these three marks and if that is the case then this row gets deleted similar is the case with Jill and these numbers so now when I click OK it says one duplicate values found and removed unique values remain the reason being that in case of Jill you can see that the marks do not match so this entire row is not an exact match and hence it remains but since in case of Bill it was an exact match then the row was removed.
5. Data Cleaning Techniques-Highlight Errors
Here I have a dataset for five companies. I have their income numbers for three years and net income numbers for three years and using these numbers I have calculated the net income margin which is the net income by revenue.
Now you can see that there are errors in data for Company X and Company Z, the reason being that there is no revenue number for these companies in and hence I get a division error because I try to divide their net income by nothing. Now this is a small data set and you can visually spot these errors but if you have a huge data set these errors could be difficult to spot so to do that you can use two methods first is using conditional formatting,
So select this entire dataset go-to home -> New Rule within the Conditional Formatting and New Formatting Rules dialog box. Select only the cells that have this drop-down from Select Errors and when you select Errors you will get the option to format the cells. Which has an error, in this case, let me select red and I click OK and as soon as I do this all the errors occur which are highlighted in red
We control Z that the other way to do this would be to select cells that have errors and you can do this using the Go-to-Special dialog box so that press F5 it opens the Go-to dialog box . You have a special button, click on it, it opens the Go-to-Special dialog box.
Here select formulas and within formulas as soon as you selected all these four options get available, deselect the first three options and only keep the errors option selected and now click OK. When you do this all those cells which have error in it get selected now you can manually either delete all these cells or type something like not available and hit control enter so that it gets entered in all the cells which have error in.
6. Data Cleaning Techniques-Change Text to Lower/Upper/Proper Case
Here I have names written in different ways that you can see or it can be all caps, it can be all lowercase and in some cases, it is a mix-and-match of uppercase lowercase so that it all You can use one of these three threads to make it consistent,
LOWER() formula takes one argument, it could be either the text that you type in or you can use a cell reference in this case if I’ll use the cell reference A1 and when I hit ctrl enter this gives me mary jane the name but all the alphabets have been converted into lowercase and when I drag this down this is the case for all these names all these names now look consistent in lowercase you may want these all in the uppercase so in that case you can use the formula UPPER() and you can see this these are all in uppercase now as I drag this down the most used way is proper case because it would keep the first alphabet of your name as in capital and the rest all would be in the lowercase may show you I will select PROPER() and I hit ctrl enter and you can see M of Mary and J of Jane is in caps and rest all the alphabets are in the lowercase and now I drag it down so these are three formulas that can very quickly make your text consistent this could be the case when you are sharing a worksheet or you get it from a text file where a lot of people enter it in different ways these formulas can quickly make these consistent.
7. Data Cleaning Techniques-Spell Check
If you have a very large data set and you only want to extract part of it as clean data, while Microsoft PowerPoint and Microsoft Word have a feature where it will outline whether any errors are grammatical errors or spelling errors in Microsoft Excel Although it does not have the feature that you can still do a spell check and fix these errors.
So to do that select the data and press F7 and when you do that it runs the spellcheck for you and it is the same thing that you see in Microsoft Word or PowerPoint it will show you the text that it thinks is a spelling error and it will show you the suggestions as well so you can change these and once it is done it will show you that spellcheck is complete and you are good to go.
8. Data Cleaning Techniques-Delete all Formatting