8 ways to clean data using data cleaning techniques

data cleaning techniques

Data forms the backbone of any data analytics you do. Regarding data, there are many things to go wrong – be it construction, arrangement, formatting, spelling, duplication, extra space, and so on. In order to perform data analytics properly we need various data cleaning techniques so that our data is ready for analysis. It is commonly said that,

 

“Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.”

 

Thus, it is necessary to be accustomed to the process of data cleansing techniques and all means of clearing data, which are related to data cleansing methods. This post provides a very basic introduction to data cleansing techniques in Excel.

 

This post covers the following data cleaning steps in Excel along with data cleansing examples:

  1. Get Rid of Extra Spaces
  2. Select and Treat All Blank Cells
  3. Convert Numbers Stored as Text into Numbers
  4. Remove Duplicates
  5. Highlight Errors
  6. Change Text to Lower/Upper/Proper Case
  7. Spell Check
  8. Delete all Formatting

Digital Marketing क्या है?

What is Data Cleaning?

Data cleansing or data cleaning is the process of identifying and removing (or correcting) inaccurate records from a dataset, table, or database and refers to recognizing unfinished, unreliable, inaccurate or non-relevant parts of the data and then restoring, remodeling, or removing the dirty or crude data.

Data cleaning techniques may be performed as batch processing through scripting or interactively with data cleansing tools.

 

Are Data Cleaning Techniques Essential?

 

Data cleaning techniques are not only an essential part of the data science process – it’s also the most time-consuming part. As the New York Times reported in a 2014 article called “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”,

 

Data scientists… spend 50 percent to 80 percent of their time, before this useful nugget can be discovered, rooted in this more mundane labor of collecting and preparing uncontrolled digital data.

 

Unfortunately, data cleaning techniques are not commonly reported in the media nor are they taught in most intro data science courses because it is not as important for training neural networks or identifying images, but those Cleaning data plays a very important role in getting things done.

Without the data cleaning techniques, the neural networks and image identification modules will not be as efficient as we want them to be.

 

With the rise of big data, methods of data cleansing have become more important than ever. Every industry – banking, healthcare, retail, hospitality, education – is now navigating a large ocean of data.
And as the data pool is getting bigger, the changes of things are also getting bigger. When you cannot view the entire dataset in a spreadsheet on your computer, it becomes difficult to find each fault. In fact, this may be true for several reasons.

Data Cleansing Examples and Data Cleaning Methods in Excel

In this post, I will show you various ways to clean data in Excel with data cleansing examples & data cleansing techniques.

1. Data Cleaning Techniques-Get Rid of Extra Spaces

Here I have the text Welcome Toshivatechnical written in four different ways.The first one is a regular way with only one space between words, in the second case I have more than one space between words, in the third case I have some leading spaces and some spaces between words and in the fourth case I have After the spaces have, you can see that there are a couple of spaces after the last word.

Now, this could typically be the case if you get this data from a colleague or you get it from a text file or imported from a database.  So to clean this data and get rid of these extra spaces you can use the function trim.

Trim function takes one  single argument which could either be the text which you type manually or it could be the cell reference, in this case, I will take the cell reference A1 and what this function does is it would  remove all the leading spaces and trailing spaces and extra spaces between words except one single space that is allowed.

So if I drag this down you would see that it has corrected all these texts. It has removed the extra space here between welcome into it has removed the leading spaces and trailing spaces.

2. Data Cleaning Techniques-Select and Treat All Blank Cells

If you only need to use the text, you can convert it into values using special paste. Here are the names of the students and their three subjects. You can see that there are some gaps in this dataset which may be due to the student not appearing in the exam.

Now you may not want to leave this data set with blanks, you may want to type not appear in all these cells which are blank. So to do that you can either go and select each cell manually and type not appear. But if you have a  huge data set that because this could be very tiresome. So to do it at one go,

 

Select the entire data set,

 

Go to find and select and select this option Go to Special this opens the go-to special dialog box. You can also use the keyboard shortcut F5 and when you do this it opens the go-to dialog box here you have special button, click on it and it again opens it equal to special dialogue box

data-cleaning-techniques

Click  blanks  and click okay, this would select all the blank cells in your data set at the same time

data-cleaning-techniques

So now you have these cells in gray and the first cell is in white color because it is the active cell so to start typing it does not start appearing in all these cells, do not start typing and enter ctrl + and the like Hee you hit ctrl + enter it enters all cells.

data-cleaning-techniques

3. Data Cleaning Techniques-Convert Numbers Stored as Text into Numbers

 

Here I have this number  entered in  three different ways,

 

In the first case, it is a number as you can see that it aligns to the right of the cell number is always aligned to the right while the text aligns to the left of a cell and in the other two cases, you can see these Are text formats because they align to the left. Now these three have to be converted back into numbers.

 

The first one is already a number but there are two ways to do it to convert these two into one number. The first one I will go into the formatting box and I will write the normal one and when I enter the second one will be converted back to a number because in this case it was only in text format but the third case is a bit more difficult because it using a major apostrophe Is recorded and many people do so, many people enter the number to begin with apostrophy so that it is a Changes to the text and it could cause some problems.

 

A very foolproof method is type in any of the blank cells go to the cell and copy this now select these cells go to paste -> paste special and this opens the paste special dialog box. Here you have operation category within this select multiply and click okay so what this does is this multiplies this number with one and any number multiplied by one is unchanged but this also takes care of the apostrophe so now all these three numbers get converted back into a number format.

4. Data Cleaning Techniques-Remove Duplicates

Here I have a data set of students and their marks in three subjects and there are duplicates in this data, so you can see that there is a duplicate for Bill and a duplicate for Phil now if you want to remove these duplicate values There are two ways it is using conditional formatting first

 

  • Highlight cell rules -> duplicate values and as soon as you select this it gives you the option to highlight duplicates and the formatting. I will keep the formatting as a red fill with dark red text and when I hit OK you can see that this has been highlighted and all those numbers and names that appear more than once it highlighted in red.

I can manually see that Phil repeats twice and Bill repeats twice so I can select this data and manually delete this.

data-cleaning-techniques

The other way to remove duplicate and clear data is to select the entire data set that goes into the data and here I have the option to remove duplicate, I click on this option and it opens the delete duplicate dialog box Make sure that if you have a data header that is in this case, it has this option. If it is not selected then it is also included as a part of your data. Not known, this should not happen when you have selected this option. These names are the names of the columns. So I can see a student column in the Chemistry column is the Mathematics column. The Physics column, then I want all those rows or all the data. I can select OK to remove the sets that are duplicates, but in this case it will not delete a number that repeats again but rather the entire row is an exact duplicate. Be it

 

For example, Jack and these  three marks have to meet exactly the  numbers and name here Jack and these  three marks and if that is the case then  this row gets deleted similar is the  case with Jill and these numbers so now when I click OK it says one duplicate  values found and removed  unique  values remain the reason being that in  case of Jill you can see  that the marks do not match so this entire row is not an exact match and hence it remains but since in case of Bill it was an exact match then the  row was removed.

data-cleaning-techniques

5. Data Cleaning Techniques-Highlight Errors

Here I have a dataset for five companies. I have their income numbers for three years and net income numbers for three years and using these numbers I have calculated the net income margin which is the net income by revenue.

 

Now you can see that there are errors in data for Company X and Company Z, the reason being that there is no revenue number for these companies in and hence I get a division error because I  try to divide their net income by nothing. Now this is a small data set and  you can visually spot these errors but if you have a huge data set these  errors could be difficult to spot so to do that you can use two methods first is  using conditional formatting,

 

So select this entire dataset go-to home -> New Rule within the Conditional Formatting and New Formatting Rules dialog box. Select only the cells that have this drop-down from Select Errors and when you select Errors you will get the option to format the cells. Which has an error, in this case, let me select red and I click OK and as soon as I do this all the errors occur which are highlighted in red

data-cleaning-techniques

We control Z that the other way to do this would be to select cells that have errors and you can do this using the Go-to-Special dialog box so that press F5 it opens the Go-to dialog box . You have a special button, click on it, it opens the Go-to-Special dialog box.

Here select formulas and within formulas as soon as you selected all these four options get available, deselect the first three options and only keep the errors option selected and now click OK. When you do this all those cells which have error in it get selected now you can manually either delete all these cells or type something like not available and hit control enter so that it gets entered in all the cells which have error in.

data-cleaning-techniques

6. Data Cleaning Techniques-Change Text to Lower/Upper/Proper Case

Here I have names written in different ways that you can see or it can be all caps, it can be all lowercase and in some cases, it is a mix-and-match of uppercase lowercase so that it all You can use one of these three threads to make it consistent,

 

LOWER() formula takes one argument, it could be either the text that you type  in or you can use a cell reference in this case if I’ll use the cell reference A1 and when I hit ctrl enter this gives  me mary jane the name but all the  alphabets have been converted into  lowercase and when I drag this down this  is the case for all these names all  these names now look consistent in  lowercase you may want these all in the  uppercase so in that case you can use  the formula UPPER() and you can see this  these are all in uppercase now as I drag  this down the most used way is proper  case because it would keep the first  alphabet of your name as in capital and  the rest all would be in the lowercase  may show you I will select PROPER()  and I hit ctrl enter and you can see M of Mary and J of Jane is in caps  and rest all the alphabets are in the  lowercase and now I drag it down so  these are three formulas that can very  quickly make your text consistent this  could be the case when you are sharing a  worksheet or you get it from a text file  where a lot of people enter it in different ways these formulas can quickly make these consistent.

7. Data Cleaning Techniques-Spell Check

If you have a very large data set and you only want to extract part of it as clean data, while Microsoft PowerPoint and Microsoft Word have a feature where it will outline whether any errors are grammatical errors or spelling errors in Microsoft Excel Although it does not have the feature that you can still do a spell check and fix these errors.

So to do that select the data and press F7 and when you do that it runs the spellcheck for you and it is the same thing that you see in Microsoft Word or PowerPoint it will show you the text that it thinks is a spelling error and it will show you the suggestions as well so you can change these and once it is done it will show you that spellcheck is complete and you are good to go.

8. Data Cleaning Techniques-Delete all Formatting

Data Cleansing Tools

Here are some interesting Data Cleansing tools relating to data cleaning techniques, analysis and modelling of data,

JASP – Open Source statistical software similar to SPSS with support of COS

Rattle – GUI for user-friendly machine learning with R

RapidMiner – Another point and click machine learning package

Orange – Open Source GUI for user-friendly machine learning with Python

Talend data preparation – Data cleaning, preparation tool with smarts

Trifacta Wranger – Data cleaning, preparation tool with the match by example feature

सबसे सस्ता जैकेट बम्फरधमाका

One thought on “8 ways to clean data using data cleaning techniques”

  1. Wow that was strange. I just wrote an very long comment but
    after I clicked submit my comment didn’t appear.
    Grrrr… well I’m not writing all that over again. Anyways,
    just wanted to say wonderful blog!

Leave a Reply

Your email address will not be published. Required fields are marked *