Outlier in the context of Data set




⭐ Core Definition: Outlier

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, an indication of novel data, or the result of experimental error; the latter are sometimes excluded from the data set. An outlier can indicate an exciting possibility, but can also cause serious problems in statistical analyses.

Outliers can occur by chance in any distribution, but they can also indicate novel behaviour or structures in the data set, measurement error, or that the population has a heavy-tailed distribution. In the case of measurement error, one wishes to discard them or use statistics that are robust to outliers, while in the case of heavy-tailed distributions, they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
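
As a rough illustration of the mixture idea, the sketch below draws most values from a narrow normal distribution ('correct trial') and a small fraction from a much wider one ('measurement error'), then counts how many values land far from the centre. The function names, the 5% error rate, the two standard deviations, and the cut-off of 3 are all illustrative assumptions, not taken from the source.

```python
import random

def sample_mixture(n, error_rate=0.05, seed=0):
    """Draw n values: mostly N(0, 1), occasionally N(0, 10) 'errors'."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    values = []
    for _ in range(n):
        # With probability error_rate, draw from the wide component.
        sigma = 10.0 if rng.random() < error_rate else 1.0
        values.append(rng.gauss(0.0, sigma))
    return values

def count_beyond(values, limit=3.0):
    """Count values whose absolute size exceeds limit."""
    return sum(1 for v in values if abs(v) > limit)

pure = sample_mixture(10_000, error_rate=0.0)   # single normal distribution
mixed = sample_mixture(10_000)                  # contaminated mixture
print(count_beyond(pure), count_beyond(mixed))
```

Under these assumptions the contaminated sample should show many more extreme values than the pure one, even though only a small share of its points come from the wide component.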


Outlier in the context of Data

Data (/ˈdeɪtə/ DAY-tə, US also /ˈdætə/ DAT-ə) are a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally. A datum is an individual value in a collection of data. Data are usually organized into structures such as tables that provide additional context and meaning, and may themselves be used as data in larger structures. Data may be used as variables in a computational process. Data may represent abstract ideas or concrete measurements. Data are commonly used in scientific research, economics, and virtually every other form of human organizational activity. Examples of data sets include price indices (such as the consumer price index), unemployment rates, literacy rates, and census data. In this context, data represent the raw facts and figures from which useful information can be extracted.

Data are collected using techniques such as measurement, observation, query, or analysis, and are typically represented as numbers or characters that may be further processed. Field data are data that are collected in an uncontrolled, in-situ environment. Experimental data are data that are generated in the course of a controlled scientific experiment. Data are analyzed using techniques such as calculation, reasoning, discussion, presentation, visualization, or other forms of post-analysis. Prior to analysis, raw data (or unprocessed data) are typically cleaned: outliers are removed, and obvious instrument or data entry errors are corrected.


Outlier in the context of Robust statistics

Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.
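
A minimal sketch of the first motivation: the Python below compares the mean and sample standard deviation with two commonly used robust alternatives, the median and the median absolute deviation (MAD), on a small made-up sample before and after one gross outlier is appended. The data values and the helper name `mad` are illustrative assumptions, not part of the source.

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [100.0]  # one gross outlier appended

for name, sample in (("clean", clean), ("contaminated", contaminated)):
    print(
        name,
        "mean:", statistics.mean(sample),
        "stdev:", statistics.stdev(sample),
        "median:", statistics.median(sample),
        "MAD:", mad(sample),
    )
```

In this toy sample the single outlier drags the mean from about 10 to about 25 and inflates the standard deviation, while the median and MAD barely move.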


Outlier in the context of Standard deviation

In statistics, the standard deviation is a measure of the amount of variation of the values of a variable about its mean. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is commonly used in the determination of what constitutes an outlier and what does not. Standard deviation may be abbreviated SD or std dev, and is most commonly represented in mathematical texts and equations by the lowercase Greek letter σ (sigma), for the population standard deviation, or the Latin letter s, for the sample standard deviation.

The standard deviation of a random variable, sample, statistical population, data set, or probability distribution is the square root of its variance. (For a finite population, variance is the average of the squared deviations from the mean.) A useful property of the standard deviation is that, unlike the variance, it is expressed in the same unit as the data. Standard deviation can also be used to calculate standard error for a finite sample, and to determine statistical significance.
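
As a hedged sketch of how the standard deviation can feed an outlier rule, the code below computes the population and sample standard deviations with Python's `statistics` module and flags values lying more than a chosen number of sample standard deviations from the mean. The threshold of 2.5, the function name `zscore_outliers`, and the readings are assumptions made for illustration; the source does not prescribe a particular cut-off.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, s
    if sd == 0:
        return []  # all values identical, so nothing can be flagged
    return [x for x in values if abs(x - mean) / sd > threshold]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]

# Population standard deviation (sigma) is the square root of the population variance.
print(statistics.pstdev(readings), statistics.pvariance(readings) ** 0.5)
print(statistics.stdev(readings))
print(zscore_outliers(readings))
```

One caveat of this rule is that the outlier itself inflates both the mean and the standard deviation, so in small samples an extreme value can partly mask itself.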


Outlier in the context of Average

An average of a collection or group is a value that is most central or most common in some sense, and represents its overall position.

In mathematics, especially in colloquial usage, it most commonly refers to the arithmetic mean, so the "average" of the list of numbers [2, 3, 4, 7, 9] is generally considered to be (2+3+4+7+9)/5 = 25/5 = 5. In situations where the data is skewed or has outliers, and it is desired to focus on the main part of the group rather than the long tail, "average" often instead refers to the median; for example, the average personal income is usually given as the median income, so that it represents the majority of the population rather than being overly influenced by the much higher incomes of the few rich people. In certain real-world scenarios, such as computing the average speed from multiple measurements taken over the same distance, the average used is the harmonic mean. In situations where a histogram or probability density function is being referenced, the "average" could instead refer to the mode. Other statistics that can be used as an average include the mid-range and geometric mean, but they would rarely, if ever, be colloquially referred to as "the average".
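
The different averages mentioned above can be sketched with Python's standard `statistics` module. The list [2, 3, 4, 7, 9] comes from the text; the income figures, speeds, and the other small lists are made-up values chosen only to illustrate each statistic.

```python
import statistics

values = [2, 3, 4, 7, 9]
arithmetic_mean = statistics.mean(values)      # (2+3+4+7+9)/5
median_value = statistics.median(values)       # middle of the sorted list
mid_range = (min(values) + max(values)) / 2    # halfway between the extremes

# Hypothetical incomes: one very high value pulls the mean far above the median.
incomes = [30_000, 32_000, 35_000, 38_000, 1_000_000]
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Hypothetical speeds over two legs of equal distance: the harmonic mean applies.
average_speed = statistics.harmonic_mean([40, 60])

most_common = statistics.mode([1, 2, 2, 3, 3, 3])
geometric = statistics.geometric_mean([2, 8])

print(arithmetic_mean, median_value, mid_range)
print(mean_income, median_income)
print(average_speed, most_common, geometric)
```

With the made-up incomes, the mean is 227,000 while the median is 35,000, which shows why the median is usually preferred for skewed data.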


Outlier in the context of Raw data

Raw data, also known as primary data, are data (e.g., numbers, instrument readings, figures, etc.) collected from a source. In the context of examinations, the raw data might be described as a raw score (after test scores).

If a scientist sets up a computerized thermometer that records the temperature of a chemical mixture in a test tube every minute, the list of temperature readings, as printed out on a spreadsheet or viewed on a computer screen, is "raw data". Raw data have not been subjected to processing, to "cleaning" by researchers to remove outliers, obvious instrument reading errors, or data entry errors, or to any analysis (e.g., determining central tendency aspects such as the average or median result). Nor have raw data been subject to any other manipulation by a software program or a human researcher, analyst, or technician. Raw data is a relative term (see data), because even once raw data have been "cleaned" and processed by one team of researchers, another team may consider these processed data to be "raw data" for another stage of research. Raw data can be input to a computer program or used in manual procedures such as analyzing statistics from a survey. The term "raw data" can also refer to the binary data on electronic storage devices, such as hard disk drives (also referred to as "low-level data").


Outlier in the context of Quartile

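In statistics, quartiles are a type of quantile that divide the number of data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three quartiles, resulting in four data divisions, are the first (lower) quartile Q1, which is the 25th percentile; the second quartile Q2, which is the median or 50th percentile; and the third (upper) quartile Q3, which is the 75th percentile.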

Along with the minimum and maximum of the data (which are also quartiles), the three quartiles described above provide a five-number summary of the data. This summary is important in statistics because it provides information about both the center and the spread of the data. Knowing the lower and upper quartile provides information on how big the spread is and if the dataset is skewed toward one side. Since quartiles divide the number of data points evenly, the range is generally not the same between adjacent quartiles (i.e. usually (Q3 - Q2) ≠ (Q2 - Q1)). Interquartile range (IQR) is defined as the difference between the 75th and 25th percentiles or Q3 - Q1. While the maximum and minimum also show the spread of the data, the upper and lower quartiles can provide more detailed information on the location of specific data points, the presence of outliers in the data, and the difference in spread between the middle 50% of the data and the outer data points.
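
A minimal sketch of the five-number summary and the IQR, using `statistics.quantiles` from the Python standard library. Several conventions exist for computing quartiles; the `inclusive` method used here is one choice among them, and the helper names and data are assumptions made for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population;
    # other quartile conventions can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

def interquartile_range(values):
    """IQR = Q3 - Q1, the spread of the middle 50% of the data."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]
summary = five_number_summary(data)
iqr = interquartile_range(data)
print(summary, iqr)
```

For this evenly spaced sample, Q3 - Q2 equals Q2 - Q1; in skewed data the two differences would generally differ, which is the asymmetry the summary is meant to reveal.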


Outlier in the context of Kurtosis

Kurtosis (from Greek: κυρτός (kyrtos or kurtos), meaning 'curved, arching') refers to the degree of tailedness in the probability distribution of a real-valued, random variable in probability theory and statistics. Similar to skewness, kurtosis provides insight into specific characteristics of a distribution. Various methods exist for quantifying kurtosis in theoretical distributions, and corresponding techniques allow estimation based on sample data from a population. It is important to note that different measures of kurtosis can yield varying interpretations.

The standard measure of a distribution's kurtosis, originating with Karl Pearson, is a scaled version of the fourth moment of the distribution. This number is related to the tails of the distribution, not its peak; hence, the sometimes-seen characterization of kurtosis as peakedness is incorrect. For this measure, higher kurtosis corresponds to greater extremity of deviations (or outliers), and not the configuration of data near the mean.
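
To make the "scaled fourth moment" concrete, here is a small sketch that computes Pearson's kurtosis as the fourth central moment divided by the square of the second central moment. It uses the simple population (biased) form; bias-corrected sample estimators differ. The function name and the two toy samples are assumptions made for illustration.

```python
def pearson_kurtosis(values):
    """Population kurtosis: m4 / m2**2, where m_k is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (second central moment)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / m2 ** 2

two_point = [-1, 1, -1, 1]        # no tails at all
with_outlier = [0] * 9 + [10]     # one extreme value among identical points

print(pearson_kurtosis(two_point))
print(pearson_kurtosis(with_outlier))
```

The sample with a single extreme value scores far higher than the two-point sample, consistent with kurtosis reflecting the extremity of deviations rather than the shape near the mean. For reference, a normal distribution has a kurtosis of 3 on this scale.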


Outlier in the context of Box plot

In descriptive statistics, a box plot or boxplot is a method for demonstrating graphically the locality, spread, and skewness of groups of numerical data through their quartiles.

In addition to the box on a box plot, there can be lines (called whiskers) extending from the box to indicate variability outside the upper and lower quartiles; for this reason, the plot is also called a box-and-whisker plot or box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length).
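
The numbers behind a Tukey-style box plot can be sketched without any plotting library. The code below assumes the conventional factor of 1.5 times the IQR for the fences, which the text above does not state, and uses the `inclusive` quartile method; the function name `box_plot_stats` and the data are likewise illustrative.

```python
import statistics

def box_plot_stats(values, k=1.5):
    """Compute the box, whisker ends, and outlying points for a box plot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme data points still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points outside the fences are drawn individually as outliers.
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this sample the value 30 falls above the upper fence, so it would be drawn as an individual point, and the upper whisker would end at 9.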


Outlier in the context of King effect

In statistics, economics, and econophysics, the king effect is the phenomenon in which the top one or two members of a ranked set show up as clear outliers. These top one or two members are unexpectedly large because they do not conform to the statistical distribution or rank-distribution which the remainder of the set obeys.

Distributions typically followed include the power-law distribution, which is a basis for the stretched exponential function, and the parabolic fractal distribution. The king effect has been observed in the distribution of:


Outlier in the context of Atmospheric anomaly

In the natural sciences, especially in atmospheric and Earth sciences involving applied statistics, an anomaly is a persisting deviation in a physical quantity from its expected value, e.g., the systematic difference between a measurement and a trend or a model prediction. Similarly, a standardized anomaly equals an anomaly divided by a standard deviation. A group of anomalies can be analyzed spatially, as a map, or temporally, as a time series. An anomaly should not be confused with an isolated outlier. There are examples in atmospheric sciences and in geophysics.
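
The two definitions above reduce to simple arithmetic, sketched here in Python. The function names, the temperature values, the expected (baseline) values, and the standard deviation of 0.5 are all hypothetical; in practice the expected value and the standard deviation would come from a trend, a model, or a reference period.

```python
def anomalies(observed, expected):
    """Anomaly = observed value minus its expected value, element by element."""
    return [o - e for o, e in zip(observed, expected)]

def standardized_anomalies(observed, expected, sd):
    """Standardized anomaly = anomaly divided by a standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return [a / sd for a in anomalies(observed, expected)]

observed = [15.0, 16.5, 14.0]   # hypothetical measurements
expected = [15.0, 15.0, 15.0]   # hypothetical baseline for the same times
raw = anomalies(observed, expected)
scaled = standardized_anomalies(observed, expected, sd=0.5)
print(raw, scaled)
```

Dividing by the standard deviation expresses each deviation in units of typical variability, which makes anomalies from different locations or variables comparable.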


Outlier in the context of Trimmed estimator

In statistics, a trimmed estimator is an estimator derived from another estimator by excluding some of the extreme values, a process called truncation. This is generally done to obtain a more robust statistic, and the extreme values are considered outliers. Trimmed estimators also often have higher efficiency for mixture distributions and heavy-tailed distributions than the corresponding untrimmed estimator, at the cost of lower efficiency for other distributions, such as the normal distribution.

Given an estimator, the x% trimmed version is obtained by discarding the x% lowest observations, the x% highest observations, or x% at both ends: it is a statistic on the middle of the data. For instance, the 5% trimmed mean is obtained by taking the mean of the 5% to 95% range. In some cases a trimmed estimator discards a fixed number of points (such as the maximum and minimum) instead of a percentage.
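
A minimal sketch of a symmetric trimmed mean follows, discarding the same percentage of observations from both ends before averaging. The function name `trimmed_mean`, the rounding-down of the number of discarded points, and the sample data are assumptions made for illustration; libraries may handle fractional counts differently.

```python
import statistics

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # points to drop per end, rounded down
    if 2 * k >= len(ordered):
        raise ValueError("trimming would discard every observation")
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

data = list(range(1, 20)) + [1000]   # 19 ordinary values and one extreme value
plain = statistics.mean(data)
trimmed = trimmed_mean(data, 5)      # drops the single lowest and highest value
print(plain, trimmed)
```

With 20 observations, 5% trimming removes one point from each end, so the extreme value 1000 no longer influences the result: the plain mean is 59.5, the trimmed mean 10.5.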
