Data cleansing in the context of "Data analysis"

⭐ Core Definition: Data cleansing

Data cleansing or data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset, table, or database. It involves detecting incomplete, incorrect, or inaccurate parts of the data and then replacing, modifying, or deleting the affected data. Data cleansing may be performed interactively with data-wrangling tools, or as batch processing, often via scripts or a data quality firewall.
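As a concrete illustration, a minimal batch-cleansing script might look like the Python sketch below. The dataset, column names, and validity rules are all hypothetical, and pandas is assumed to be available; a real cleansing job would draw its rules from the system's data dictionary.

```python
import pandas as pd

# Hypothetical raw data exhibiting typical defects: a duplicate record,
# inconsistent casing, and a physically impossible value.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "country":     ["us", "US", "DE", "de"],
    "age":         [34.0, 34.0, -5.0, 41.0],   # -5 is clearly corrupt
})

cleaned = (
    raw
    .drop_duplicates(subset="customer_id")                  # remove duplicate records
    .assign(country=lambda df: df["country"].str.upper())   # standardize casing
)

# Replace impossible ages with a missing-value marker rather than guessing.
cleaned.loc[~cleaned["age"].between(0, 120), "age"] = float("nan")

print(cleaned)
```

Note that the corrupt value is nulled rather than replaced with a guess: cleansing corrects what it can and marks the rest for follow-up.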

After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably rejects bad data at the time of entry into the system, rather than operating on batches of data after the fact.
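To make the distinction concrete, the sketch below contrasts entry-time validation, which rejects a record before it enters the system, with batch cleansing, which repairs or drops records already stored. The field names and rules are purely illustrative.

```python
def validate_at_entry(record: dict) -> None:
    """Validation: reject a bad record before it enters the system."""
    if not record.get("email") or "@" not in record["email"]:
        raise ValueError("rejected at entry: invalid email")
    if not 0 <= record.get("age", -1) <= 120:
        raise ValueError("rejected at entry: implausible age")

def clean_batch(records: list[dict]) -> list[dict]:
    """Cleansing: repair or drop records already in the system."""
    cleaned = []
    for r in records:
        r = dict(r)                        # work on a copy
        if not 0 <= r.get("age", -1) <= 120:
            r["age"] = None                # repair: null out the bad value
        if r.get("email") and "@" in r["email"]:
            cleaned.append(r)              # keep only salvageable records
    return cleaned

# A record that validation would reject outright can sometimes be
# partially salvaged by batch cleansing instead.
print(clean_batch([{"email": "ok@example.com", "age": 400},
                   {"email": "not-an-email", "age": 34}]))
```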

👉 Data cleansing in the context of Data analysis

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a variety of unstructured data. All of the above are varieties of data analysis.
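The EDA/CDA split can be seen in miniature in the sketch below: exploratory summaries look for features without a fixed hypothesis, and a confirmatory test then checks one specific hypothesis. The data are synthetic and the setup is illustrative; SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)   # synthetic samples
group_b = rng.normal(loc=5.5, scale=1.0, size=50)

# Exploratory (EDA): summarize the data to discover candidate features.
print("means:  ", group_a.mean(), group_b.mean())
print("spreads:", group_a.std(ddof=1), group_b.std(ddof=1))

# Confirmatory (CDA): test the specific hypothesis that the means differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # a small p favours a real difference
```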

Data cleansing in the context of Computational scientist

A computational scientist is a person skilled in scientific computing. This person is usually a scientist, a statistician, an applied mathematician, or an engineer who applies high-performance computing and sometimes cloud computing to advance the state of the art in their applied discipline: physics, chemistry, the social sciences, and so forth. Scientific computing has thus increasingly influenced many areas, such as economics, biology, law, and medicine. Because a computational scientist's work is generally applied to science and other disciplines, they are not necessarily trained in computer science specifically, though concepts from computer science are often used. Computational scientists are typically researchers at universities, national laboratories, or technology companies.

One of the tasks of a computational scientist is to analyze large amounts of data, often from astrophysics or related fields, which can generate huge volumes of it. Computational scientists frequently have to clean and calibrate these data into a usable form before an effective analysis is possible. They are also tasked with creating artificial data through computer models and simulations.
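As a toy version of that workflow, the sketch below calibrates noisy instrument-style readings against a known reference signal and then generates artificial data from the fitted model. Every number here is invented for illustration; real calibration pipelines are far more involved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Raw "instrument" readings with a systematic gain and offset error plus noise.
true_signal = np.linspace(0.0, 10.0, 100)
raw = 1.8 * true_signal + 0.7 + rng.normal(0.0, 0.3, size=true_signal.size)

# Calibrate: fit gain and offset against the known reference signal.
gain, offset = np.polyfit(true_signal, raw, deg=1)
calibrated = (raw - offset) / gain
print("max calibration error:", np.abs(calibrated - true_signal).max())

# Create artificial data from the fitted model for downstream analysis.
simulated = gain * np.linspace(0.0, 10.0, 1000) + offset
print(f"estimated gain = {gain:.2f}, offset = {offset:.2f}")
```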

Data cleansing in the context of Data quality

Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Data is deemed of high quality if it correctly represents the real-world construct to which it refers. Apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose.
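One way to picture the internal-consistency concern: as records for the same entities accumulate in different stores, a simple cross-source check can surface disagreements regardless of any one source's fitness for use. The source names and fields below are hypothetical.

```python
# Hypothetical records for the same customers held in two separate systems.
crm     = {101: {"email": "a@example.com"}, 102: {"email": "b@example.com"}}
billing = {101: {"email": "a@example.com"}, 102: {"email": "b@other.com"}}

# Flag entities whose attributes disagree between sources.
for key in crm.keys() & billing.keys():
    if crm[key]["email"] != billing[key]["email"]:
        print(f"inconsistent record {key}: "
              f"{crm[key]['email']!r} vs {billing[key]['email']!r}")
```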

Views on data quality often diverge, even among people discussing the same set of data used for the same purpose. When this is the case, businesses may adopt recognised international standards for data quality. Data governance can also be used to establish agreed-upon definitions and standards, including international ones, for data quality. In such cases, data cleansing, including standardization, may be required to ensure data quality.
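Standardization, the cleansing step mentioned above, can be as simple as mapping free-form values onto an agreed vocabulary, as in the sketch below; the mapping table is invented for illustration, and in practice it would come from the governance standard itself.

```python
# An agreed-upon vocabulary, e.g. one fixed by a data-governance policy.
COUNTRY_CODES = {
    "united states": "US", "usa": "US", "u.s.": "US",
    "germany": "DE", "deutschland": "DE",
}

def standardize_country(value: str) -> str | None:
    """Map a free-form country name onto the agreed code, or flag it."""
    return COUNTRY_CODES.get(value.strip().lower())

print(standardize_country(" USA "))        # -> "US"
print(standardize_country("Deutschland"))  # -> "DE"
print(standardize_country("??"))           # -> None (needs manual review)
```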
