What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

Key motivations of data exploration include

  • Helping to select the right tool for preprocessing or analysis
  • Making use of humans’ abilities to recognize patterns
    – People can recognize patterns not captured by data analysis tools



Techniques Used In Data Exploration

In EDA, as originally defined by Tukey

  • The focus was on visualization
  • Clustering and anomaly detection were viewed as exploratory techniques
  • In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

In our discussion of data exploration, we focus on

  • Summary statistics
  • Visualization
  • Online Analytical Processing (OLAP)


Summary Statistics

Summary statistics are numbers that summarize properties of the data.

Summarized properties include frequency, location, and spread.

  Examples: location – mean
            spread – standard deviation

Most summary statistics can be calculated in a single pass through the data.
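The single-pass claim can be illustrated with Welford's online algorithm, which computes the count, mean, and standard deviation without revisiting any value (a minimal sketch in Python; the function name and data are illustrative):

```python
import math

def one_pass_stats(values):
    """Compute count, mean, and standard deviation in a single pass
    using Welford's online algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)          # running sum of squared deviations
    std = math.sqrt(m2 / n) if n else float("nan")  # population std
    return n, mean, std

n, mean, std = one_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mean is 5.0 and the population standard deviation is 2.0 for this data
```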


Frequency and Mode 

The frequency of an attribute value is the percentage of time the value occurs in the data set

  • For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
  • The mode of an attribute is the most frequent attribute value
  • The notions of frequency and mode are typically used with categorical data
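Computed directly, frequency and mode amount to little more than counting (a Python sketch on made-up data; this toy sample only approximates the 50% figure from the text):

```python
from collections import Counter

def frequency(values):
    """Fraction of records taking each attribute value."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def mode(values):
    """Most frequent attribute value."""
    return Counter(values).most_common(1)[0][0]

genders = ["female", "male", "female", "female", "male"]  # illustrative data
freq = frequency(genders)   # {'female': 0.6, 'male': 0.4}
most_common = mode(genders) # 'female'
```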



Percentiles

For continuous data, the notion of a percentile is more useful.

Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
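The definition above can be implemented with the simple nearest-rank rule, one of several common percentile definitions (the function and data below are illustrative):

```python
def percentile(values, p):
    """Smallest observed value x_p such that at least p% of the
    observations are <= x_p (nearest-rank definition; interpolating
    definitions also exist)."""
    xs = sorted(values)
    k = max(1, -(-len(xs) * p // 100))   # ceil(n * p / 100), at least 1
    return xs[int(k) - 1]

data = [15, 20, 35, 40, 50]
median = percentile(data, 50)    # 35: the 50th percentile is the median
top = percentile(data, 100)      # 50: the largest observed value
```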


Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

  • Humans have a well developed ability to analyze large amounts of information that is presented visually
  • Can detect general patterns and trends
  • Can detect outliers and unusual patterns

Selection

Selection is the elimination or the de-emphasis of certain objects and attributes.

Selection may involve choosing a subset of attributes

  • Dimensionality reduction is often used to reduce the number of dimensions to two or three
  • Alternatively, pairs of attributes can be considered

Selection may also involve choosing a subset of objects

  • A region of the screen can only show so many points
  • Can sample, but want to preserve points in sparse areas





On-Line Analytical Processing (OLAP)

On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.

Relational databases put data into tables, while OLAP uses a multidimensional array representation.

  – Such representations of data previously existed in statistics and other fields

There are a number of data analysis and data exploration operations that are easier with such a data representation.


OLAP Operations: Data Cube


The key operation of OLAP is the formation of a data cube


A data cube is a multidimensional representation of data, together with all possible aggregates.

By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.


For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
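The species example can be mimicked with a toy fact table (a Python sketch; the records and dimension names are made up, not the actual Iris data):

```python
from collections import defaultdict

# Toy fact table: (species, location, count) records; illustrative data
records = [
    ("setosa",     "siteA", 30), ("setosa",     "siteB", 20),
    ("versicolor", "siteA", 25), ("versicolor", "siteB", 25),
    ("virginica",  "siteA", 10), ("virginica",  "siteB", 40),
]

def aggregate(records, keep):
    """Sum the measure over every dimension NOT in `keep`,
    i.e. compute one aggregate of the data cube."""
    totals = defaultdict(int)
    for species, location, count in records:
        key = tuple(value for value, name in
                    ((species, "species"), (location, "location"))
                    if name in keep)
        totals[key] += count
    return dict(totals)

by_species = aggregate(records, keep={"species"})
# {('setosa',): 50, ('versicolor',): 50, ('virginica',): 50}
grand_total = aggregate(records, keep=set())   # {(): 150}
```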


OLAP Operations: Slicing and Dicing  


  • Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
  • Dicing involves selecting a subset of cells by specifying a range of attribute values.

– This is equivalent to defining a subarray from the complete array.

  • In practice, both operations can also be accompanied by aggregation over some dimensions.
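Both operations correspond to ordinary array indexing on a multidimensional array (a Python sketch using nested lists; the dimension names and numbers are illustrative):

```python
# A small 3-D array (product x location x date) as nested lists;
# cube[p][l][d] holds a sales count.
cube = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]

# Slicing: fix a value for one dimension (product = 0),
# leaving a 2-D location x date subarray.
slice_p0 = cube[0]                      # [[1, 2], [3, 4]]

# Dicing: keep a range of values in one or more dimensions,
# here locations 0..0 and dates 1..1 for every product.
dice = [[row[1:2] for row in plane[0:1]] for plane in cube]
# [[[2]], [[6]]]

# Aggregation over the date dimension after slicing:
sales_by_location = [sum(dates) for dates in slice_p0]   # [3, 7]
```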

OLAP Operations: Roll-up and Drill-down  

Attribute values often have a hierarchical structure.

  • Each date is associated with a year, month, and week.
  • A location is associated with a continent, country, state (province, etc.), and city.
  • Products can be divided into various categories, such as clothing, electronics, and furniture. Note that these categories often nest and form a tree or lattice
  • A year contains months, which contain days
  • A country contains states, which contain cities

This hierarchical structure gives rise to the roll-up and drill-down operations.

  • For sales data, we can aggregate (roll up) the sales across all the dates in a month.
  • Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
  • Likewise, we can drill down or roll up on the location or product ID attributes
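The sales example can be sketched as follows, assuming dates are stored as ISO "YYYY-MM-DD" strings so that the month level is just a prefix (illustrative data):

```python
from collections import defaultdict

# Daily sales records: (date "YYYY-MM-DD", amount); made-up numbers
daily = [
    ("2013-01-05", 100), ("2013-01-20", 150),
    ("2013-02-03", 200), ("2013-02-14",  50),
]

def roll_up_to_month(daily):
    """Aggregate daily totals up the date hierarchy to months."""
    monthly = defaultdict(int)
    for date, amount in daily:
        monthly[date[:7]] += amount      # "YYYY-MM" is the month level
    return dict(monthly)

monthly = roll_up_to_month(daily)        # {'2013-01': 250, '2013-02': 250}

def drill_down(daily, month):
    """Split one monthly total back into its daily components."""
    return [(d, a) for d, a in daily if d.startswith(month)]

jan_days = drill_down(daily, "2013-01")
# [('2013-01-05', 100), ('2013-01-20', 150)]
```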




Big Data Analytics

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations, and other business benefits.


R in the cloud

When R programmers talk about “big data,” they don’t necessarily mean data that goes through Hadoop. They generally use “big” to mean data that can’t be analyzed in memory.

The fact is you can easily get 16GB of RAM in a desktop or laptop computer. R running in 16GB of RAM can analyze millions of rows of data with no problem. Times have changed quite a bit since the days when a database table with a million rows was considered big.

One of the first steps many developers take when their program needs more RAM is to run it on a bigger machine. You can run R on a server; a common 4U Intel server can hold up to 2TB of RAM. Of course, hogging an entire 2TB server for one personal R instance might be a bit wasteful. So people run large cloud instances for as long as they need them, run VMs on their server hardware, or run the likes of RStudio Server on their server hardware.

RStudio Server comes in Free and Pro editions. Both have the same features for individual analysts, but the Pro version offers more in the way of scale: authorization and security, management visibility, performance tuning, support, and a commercial license. According to Roger Oberg of RStudio, the company’s intent is not to create paid-only features for individuals.


Big Data Strategies in R

If Big Data has to be tackled with R, five different strategies can be considered:

  • Sampling
  • Bigger hardware
  • Storing objects on hard disc and analyzing them chunkwise
  • Integration of higher performing programming languages like C++ or Java
  • Alternative interpreters

Sampling

If data is too big to be analyzed in full, its size can be reduced by sampling. Naturally, the question arises whether sampling decreases the performance of a model significantly. More data is of course better than less data. But according to Hadley Wickham's useR! talk, sample-based model building is acceptable, at least once the size of the data crosses the one billion record threshold.

If sampling can be avoided, it is advisable to use another Big Data strategy. But if for whatever reason sampling is necessary, it can still lead to satisfactory models, especially if the sample is

  • still (kind of) big in total numbers,
  • not too small in proportion to the full data set,
  • not biased.
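The idea is language-agnostic; this Python sketch (synthetic data, illustrative sample size) fits the simplest possible "model", a mean, on a 10% simple random sample and on the full data, and the two estimates come out close:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
# Synthetic "big" data: 100,000 draws from a normal distribution
full_data = [random.gauss(10.0, 2.0) for _ in range(100_000)]

# A 10% simple random sample (unbiased by construction)
sample = random.sample(full_data, 10_000)

full_mean = sum(full_data) / len(full_data)
sample_mean = sum(sample) / len(sample)
# With a large, unbiased sample the sample estimate tracks the
# full-data estimate closely.
```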


Bigger hardware

  • R keeps all objects in memory. This can become a problem if the data gets large. One of the easiest ways to deal with Big Data in R is simply to increase the machine’s memory. Today, R can address 8 TB of RAM if it runs on 64-bit machines. That is in many situations a sufficient improvement compared to about 2 GB addressable RAM on 32-bit machines.


Store objects on hard disc and analyze them chunkwise

As an alternative, there are packages available that avoid storing data in memory. Instead, objects are stored on hard disc and analyzed chunkwise. As a side effect, the chunking also leads naturally to parallelization, if the algorithms allow parallel analysis of the chunks in principle. A downside of this strategy is that only those algorithms (and R functions in general) can be performed that are explicitly designed to deal with hard disc specific datatypes.
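The chunkwise principle (read a piece, update running aggregates, discard the piece) is sketched below in Python rather than R, purely for illustration; the file name, layout, and chunk size are made up:

```python
import csv
import os
import tempfile

def chunked_mean(path, column, chunk_size=2):
    """Mean of one column of a CSV file, processed chunkwise so the
    whole file is never held in memory at once."""
    total, n = 0.0, 0
    chunk = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            chunk.append(float(row[column]))
            if len(chunk) == chunk_size:      # process one chunk...
                total += sum(chunk)
                n += len(chunk)
                chunk = []                    # ...then discard it
    total += sum(chunk)                       # leftover rows
    n += len(chunk)
    return total / n if n else float("nan")

# Tiny demonstration file
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["amount"])
    writer.writerows([[10], [20], [30], [40]])

mean_amount = chunked_mean(path, "amount")    # 25.0
```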

“ff” and “ffbase” are probably the most famous CRAN packages following this principle. Revolution R Enterprise, as a commercial product, uses this strategy with their popular “scaleR” package as well. Compared to ff and ffbase, Revolution scaleR offers a wider range and faster growth of analytic functions. For instance, the Random Forest algorithm has recently been added to the scaleR function set, which is not yet available in ffbase.

Integration of higher performing programming languages like C++ or Java

The integration of high performance programming languages is another alternative. Small parts of the program are moved from R to another language to avoid bottlenecks and performance-expensive procedures. The aim is to balance R's more elegant way of dealing with data on the one hand and the higher performance of other languages on the other.

The outsourcing of code chunks from R to another language can easily be hidden inside functions. In that case, proficiency in the other programming language is mandatory for the developers of these functions, but not for their users.


Alternative interpreters

A relatively new direction to deal with Big Data in R is to use alternative interpreters. The first one that became popular with a bigger audience was pqR (pretty quick R). Duncan Murdoch from the R-Core team announced that pqR's suggestions for improvements shall be integrated into the core of R in one of the next versions.

Another very ambitious open-source project is Renjin. Renjin reimplements the R interpreter in Java, so it can run on the Java Virtual Machine (JVM). This may sound like a Sisyphean task, but it is progressing astonishingly fast. A major milestone in the development of Renjin is scheduled for the end of 2013.

Tibco created a C++-based interpreter called TERR. Besides the language, TERR differs from Renjin in how object references are modeled. TERR is available for free for scientific and testing purposes. Enterprises have to purchase a licensed version if they use TERR in production.

Another alternative R interpreter is offered by Oracle. Oracle R uses Intel's math library and therefore achieves higher performance without changing R's core. Besides the interpreter, which is free to use, Oracle offers Oracle R Enterprise, a component of Oracle's "Advanced Analytics" database option. It allows any R code to be run on the database server and provides a rich set of functions that are optimized for high-performance in-database computation. These optimized functions cover, besides data management operations and traditional statistical tasks, a wide range of data-mining algorithms such as SVMs, neural networks, decision trees, etc.