What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
Key motivations of data exploration include
- Helping to select the right tool for preprocessing or analysis
- Making use of humans’ abilities to recognize patterns
- People can recognize patterns not captured by data analysis tools
Techniques Used In Data Exploration
In exploratory data analysis (EDA), as originally defined by Tukey
- The focus was on visualization
- Clustering and anomaly detection were viewed as exploratory techniques
- In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
In our discussion of data exploration, we focus on
- Summary statistics
- Visualization
- Online Analytical Processing (OLAP)
Summary statistics are numbers that summarize properties of the data
Summarized properties include frequency, location and spread
Examples: location – mean
spread – standard deviation
Most summary statistics can be calculated in a single pass through the data
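To illustrate the single-pass claim, here is a minimal sketch that computes the count, mean, and sample standard deviation while reading each value only once. It uses Welford's update, which is one common way to do this; the function name and the sample values are made up for illustration.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) from one pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Sample standard deviation (divides by n - 1); defined as 0.0 for n < 2 here.
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std

n, mean, std = single_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(n, mean, round(std, 3))
```

Because the function only iterates over its input, it also works on a generator or a file stream that cannot be held in memory.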
Frequency and Mode
The frequency of an attribute value is the percentage of time the value occurs in the data set
- For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
- The mode of an attribute is the most frequent attribute value
- The notions of frequency and mode are typically used with categorical data
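As a small sketch of these two notions, the following computes value frequencies (as percentages) and the mode of a categorical attribute. The helper names and the list of colors are hypothetical.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the percentage of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
freq = frequencies(colors)
most_common_color = mode(colors)
print(freq, most_common_color)
```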
For continuous data, the notion of a percentile is more useful.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
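Several conventions exist for computing percentiles from a finite sample, and they can differ slightly on small data sets. The sketch below assumes the nearest-rank convention, which returns an observed value; it is one possible reading of the definition above, not the only one. The function name is hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile of values using the nearest-rank convention."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank of the smallest observation with at least p% of the data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = [7, 1, 9, 3, 10, 5, 2, 8, 4, 6]
median = percentile_nearest_rank(data, 50)
p90 = percentile_nearest_rank(data, 90)
print(median, p90)
```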
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
Visualization of data is one of the most powerful and appealing techniques for data exploration.
- Humans have a well-developed ability to analyze large amounts of information that is presented visually
- Can detect general patterns and trends
- Can detect outliers and unusual patterns
Selection is the elimination or the de-emphasis of certain objects and attributes
Selection may involve choosing a subset of attributes
- Dimensionality reduction is often used to reduce the number of dimensions to two or three
- Alternatively, pairs of attributes can be considered
Selection may also involve choosing a subset of objects
- A region of the screen can only show so many points
- Can sample, but want to preserve points in sparse areas
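One possible way to sample while keeping points in sparse areas is to cap the number of points drawn per grid cell, so that dense regions are thinned and isolated points survive. This is only a sketch of that idea, not a method prescribed here; the function name, cell size, and cap are assumptions chosen for illustration.

```python
from collections import defaultdict

def thin_points(points, cell_size, max_per_cell):
    """Keep at most max_per_cell points in each square grid cell of side cell_size."""
    kept = []
    counts = defaultdict(int)
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        if counts[cell] < max_per_cell:
            counts[cell] += 1
            kept.append((x, y))
    return kept

# Hypothetical data: 50 crowded points in one cell plus two isolated points.
dense = [(0.01 * i, 0.01 * i) for i in range(50)]
sparse = [(5.5, 5.5), (9.2, 1.3)]
kept = thin_points(dense + sparse, cell_size=1.0, max_per_cell=5)
print(len(kept))
```

This version keeps the first points it encounters in each cell; choosing randomly within each cell would be a natural variation.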
On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
Relational databases put data into tables, while OLAP uses a multidimensional array representation.
- Such representations of data previously existed in statistics and other fields
There are a number of data analysis and data exploration operations that are easier with such a data representation.
OLAP Operations: Data Cube
The key operation of OLAP is the formation of a data cube
A data cube is a multidimensional representation of data, together with all possible aggregates.
By all possible aggregates, we mean the aggregates that result from selecting a proper subset of the dimensions and summing over all remaining dimensions.
For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of that type.
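The idea of "all possible aggregates" can be sketched as follows: for every proper subset of the dimensions, keep those dimensions and sum the counts over the rest. The cell counts below are invented for illustration and are not real Iris measurements; the dimension names and function names are likewise assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("species", "petal_width", "petal_length")

# Toy multidimensional array stored sparsely: counts are invented, not real Iris data.
cells = {
    ("setosa", "low", "low"): 46,
    ("setosa", "low", "medium"): 4,
    ("versicolour", "medium", "medium"): 43,
    ("versicolour", "medium", "high"): 7,
    ("virginica", "high", "high"): 44,
    ("virginica", "medium", "high"): 6,
}

def aggregate(cells, keep):
    """Sum counts over every dimension whose index is not listed in keep."""
    totals = defaultdict(int)
    for key, count in cells.items():
        totals[tuple(key[i] for i in keep)] += count
    return dict(totals)

def data_cube(cells, n_dims):
    """Aggregates for every proper subset of dimensions (the empty subset is the grand total)."""
    cube = {}
    for size in range(n_dims):
        for keep in combinations(range(n_dims), size):
            cube[keep] = aggregate(cells, keep)
    return cube

cube = data_cube(cells, len(DIMENSIONS))
by_species = cube[(0,)]   # keep only the species dimension
grand_total = cube[()]    # sum over all dimensions
print(by_species, grand_total)
```

Keeping only the species dimension yields a one-dimensional result with three entries, matching the Iris example above.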
OLAP Operations: Slicing and Dicing
- Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
- Dicing involves selecting a subset of cells by specifying a range of attribute values.
– This is equivalent to defining a subarray from the complete array.
- In practice, both operations can also be accompanied by aggregation over some dimensions.
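A minimal sketch of slicing and dicing on a sparse multidimensional array follows. The sales figures, dimension order (product, location, month), and function names are all hypothetical.

```python
# Cells keyed by (product, location, month); values are hypothetical sales totals.
sales = {
    ("shirt", "Paris", 1): 10,
    ("shirt", "Paris", 2): 12,
    ("shirt", "Rome", 1): 7,
    ("lamp", "Paris", 1): 3,
    ("lamp", "Rome", 2): 5,
    ("lamp", "Rome", 3): 8,
}

def slice_cells(cells, dim, value):
    """Slicing: keep cells where dimension dim equals one specific value."""
    return {key: v for key, v in cells.items() if key[dim] == value}

def dice_cells(cells, dim, low, high):
    """Dicing: keep cells where dimension dim lies in the range [low, high]."""
    return {key: v for key, v in cells.items() if low <= key[dim] <= high}

paris = slice_cells(sales, 1, "Paris")      # fix location = Paris
first_two_months = dice_cells(sales, 2, 1, 2)  # months 1 through 2
paris_total = sum(paris.values())           # slice followed by aggregation
print(paris_total, len(first_two_months))
```

The last step shows a slice accompanied by aggregation: summing the Paris slice over the product and month dimensions.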
OLAP Operations: Roll-up and Drill-down
Attribute values often have a hierarchical structure.
- Each date is associated with a year, month, and week.
- A location is associated with a continent, country, state (province, etc.), and city.
- Products can be divided into various categories, such as clothing, electronics, and furniture. Note that these categories often nest and form a tree or lattice
- A year contains months, which contain days
- A country contains states, which contain cities
This hierarchical structure gives rise to the roll-up and drill-down operations.
- For sales data, we can aggregate (roll up) the sales across all the dates in a month.
- Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
- Likewise, we can drill down or roll up on the location or product ID attributes
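Roll-up and drill-down along the time dimension can be sketched as below. The daily sales figures and function names are hypothetical. Note that drilling down needs the finer-grained (daily) data to still be available; monthly totals alone cannot be split back into days.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales totals.
daily_sales = {
    date(2024, 1, 5): 100,
    date(2024, 1, 20): 150,
    date(2024, 2, 3): 80,
    date(2024, 2, 14): 120,
}

def roll_up_to_month(daily):
    """Roll up: aggregate daily totals into (year, month) totals."""
    monthly = defaultdict(int)
    for day, amount in daily.items():
        monthly[(day.year, day.month)] += amount
    return dict(monthly)

def drill_down(daily, year, month):
    """Drill down: return the daily totals that make up one month."""
    return {d: a for d, a in daily.items() if (d.year, d.month) == (year, month)}

monthly = roll_up_to_month(daily_sales)
january_days = drill_down(daily_sales, 2024, 1)
print(monthly, january_days)
```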