Text analytics

Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence. Text analytics processes can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated solutions.


Why is text analytics important?

Emails, online reviews, tweets, call center agent notes, and the vast array of other written feedback all hold insight into customer wants and needs. But only if you can unlock it. Text analytics is the way to extract meaning from this unstructured text and to uncover patterns and themes.


Several text analytics use cases exist:

  • Case management—for example, insurance claims assessment, healthcare patient records and crime-related interviews and reports
  • Competitor analysis
  • Fault management and field-service optimization
  • Legal ediscovery in litigation cases
  • Media coverage analysis
  • Pharmaceutical drug trial improvement
  • Sentiment analytics
  • Voice of the customer

A well-understood process for text analytics includes the following steps:

  1. Extracting raw text
  2. Tokenizing the text—that is, breaking it down into words and phrases
  3. Detecting term boundaries
  4. Detecting sentence boundaries
  5. Tagging parts of speech—words such as nouns and verbs
  6. Tagging named entities so that they are identified—for example, a person, a company, a place, a gene, a disease, a product and so on
  7. Parsing—for example, extracting facts and entities from the tagged text
  8. Extracting knowledge to understand concepts such as a personal injury within an accident claim
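
As a rough illustration of steps 2 through 6, the sketch below runs a short passage through an open-source NLP pipeline. It assumes spaCy and its small English model (en_core_web_sm) are installed; the source does not prescribe any particular tool, so this is only one possible realization.

```python
# A rough sketch of steps 2-6, assuming spaCy and its en_core_web_sm model
# are installed (pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

raw_text = (
    "The claimant reported a personal injury after the accident in Dallas. "
    "Acme Insurance opened a case on March 3."
)

doc = nlp(raw_text)  # tokenization, sentence splitting, POS tagging, NER

for sent in doc.sents:                 # sentence boundaries
    print("SENTENCE:", sent.text)

for token in doc:                      # tokens with part-of-speech tags
    print(token.text, token.pos_)

for ent in doc.ents:                   # named entities (person, org, place, date, ...)
    print(ent.text, ent.label_)
```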


Qualitative Technique

An array of techniques may be employed to derive meaning from text. The most accurate method is an intelligent, trained human being reading the text and interpreting its meaning. This is the slowest method and the most costly, but the most accurate and powerful. Ideally, the reader is trained in qualitative research techniques and understands the industry and contextual framework of the text. A well-trained qualitative researcher can extract extraordinary understanding and insight from text. In a typical project, the qualitative researcher might read hundreds of paragraphs to analyze the text, develop hypotheses, draw conclusions, and write a report. This type of analysis is subject to the risks of bias and misinterpretation on the part of the qualitative researcher, but these limitations are with us always—regardless of method. The power of the human mind cannot be equaled by any software or any computer system. Decision Analyst’s team of highly trained qualitative researchers are experts at understanding text.

Content Analysis or Open-End Coding

The history of text analytics traces back to World War II and the development of “content analysis” by governmental intelligence services. That is, intelligence analysts would read documents, magazines, records, dispatches, etc., and assign numeric codes to different topics, concepts, or ideas. By summing up these numeric codes, the analyst could quantify the different concepts or ideas, and track them over time. This approach was further developed by the survey research industry after the war. Today as then, open-end questions in surveys are analyzed by someone reading the textual answers and assigning numeric codes. These codes are then summarized in tables, so that the analyst has a quantitative sense of what people are saying. This remains a powerful method of text mining or text analytics. It leverages the power of the human mind to discern subtleties and context.


The first step is careful selection of a representative sample of respondents or responses. In surveys the sample is usually representative and comparatively small (fewer than 2,000 respondents), so all open-ended questions are coded. However, in the case of social media text, a CRM system, or a customer complaint system, the text might consist of millions of customer comments, so the first step is the random selection of a few thousand records, which are checked for duplicates, geographic distribution, and so on. Then a human being reads each and every paragraph of text and assigns numeric codes to different meanings and ideas. These codes are tabulated and statistical summaries are prepared for the analyst. This is text mining or text analytics at its apogee: open-end coding offers the strength of numbers (statistical significance) and the intelligence of the human mind. Decision Analyst operates a large multilanguage coding facility with highly trained staff specifically for content analysis and text analytics.
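
As a hypothetical illustration of the sampling and tabulation steps, the pandas sketch below draws a de-duplicated random sample for human coding and then summarizes the numeric codes the coders assign. The file names and column names (customer_comments.csv, comment_id, code) are invented for the example.

```python
# Hypothetical sketch: draw a de-duplicated random sample for human coding,
# then tabulate the numeric codes assigned by the coders. File and column
# names are invented for illustration.
import pandas as pd

comments = pd.read_csv("customer_comments.csv")        # could be millions of rows
sample = (comments
          .drop_duplicates(subset="comment_id")        # check for duplicates
          .sample(n=3000, random_state=42))            # random selection
sample.to_csv("sample_for_coding.csv", index=False)    # handed to human coders

coded = pd.read_csv("coded_sample.csv")                # one numeric code per row
code_counts = coded["code"].value_counts()             # statistical summary by code
print(code_counts.head(20))
```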


Machine Text Mining or Text Analytics

With the explosion of keyboard-generated text driven by the spread of PCs and the Internet over the past two decades, many companies are searching for automated ways to analyze large volumes of textual data. Decision Analyst offers several text-analytic services, based on different software systems, to analyze and report on textual data. These software systems are very powerful, but they cannot take the place of the thinking human brain. Their results should be thought of as approximations, as crude indicators of truth and trends, and they must always be verified against other methods and other data.

 


Business intelligence

 

Business intelligence (BI) can be described as “a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes”. The term “data surfacing” is also often associated with BI functionality. BI technologies are capable of handling large amounts of structured and sometimes unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.


Components:

Business intelligence is made up of an increasing number of components, including:

* Multidimensional aggregation and allocation

* De-normalization, tagging and standardization

* Real-time reporting with analytical alerts

* A method of interfacing with unstructured data sources

* Group consolidation, budgeting and rolling forecasts

* Statistical inference and probabilistic simulation

* Key performance indicator optimization

* Version control and process management

* Open item management


Data warehousing:

Often BI applications use data gathered from a data warehouse (DW) or from a data mart, and the concepts of BI and DW sometimes combine as “BI/DW” or as “BIDW”. A data warehouse contains a copy of analytical data that facilitates decision support. To distinguish between the concepts of business intelligence and data warehouses, Forrester Research defines business intelligence in one of two ways:

1. Using a broad definition: “Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” Under this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master-data management, text- and content-analytics, and many others that the market sometimes lumps into the “Information Management” segment. Therefore, Forrester refers to data preparation and data usage as two separate but closely linked segments of the business-intelligence architectural stack.

2. Forrester defines the narrower business-intelligence market as referring to just the top layers of the BI architectural stack, such as reporting, analytics and dashboards.

Comparison with competitive intelligence:

BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. Understood broadly, business intelligence can include competitive intelligence as a subset.


Business Intelligence Trends:

Organizations are starting to see that data and content should not be considered separate aspects of information management, but should instead be managed in an integrated enterprise approach. Enterprise information management brings Business Intelligence and Enterprise Content Management together. Organizations are also moving towards Operational Business Intelligence, a segment that is currently underserved and uncontested by vendors. Traditionally, Business Intelligence vendors have targeted only the top of the pyramid, but there is now a paradigm shift toward taking Business Intelligence to the bottom of the pyramid with a focus on self-service business intelligence.

Business needs:

Because of the close relationship with senior management, a critical question that must be assessed before the project begins is whether there is a business need and a clear business benefit to doing the implementation. Another reason for a business-driven approach to BI implementation is the acquisition of other organizations: when acquisitions enlarge the original organization, it can be beneficial to implement DW or BI in order to create more oversight.

BI Portals:

A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications. The BI portal is the user’s first impression of the DW/BI system. It is typically a browser application from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The following features are desirable for web portals in general and BI portals in particular:

  • Usable: users should easily find what they need in the BI tool.
  • Content rich: the portal is not just a report-printing tool; it should also contain advice, help, support information and documentation.
  • Easy to use: the portal should be implemented in a way that makes its functionality easy to use and encourages users to adopt it.
  • Scalable and customizable: the portal should give each user the means to fit it to their own needs.


BIG DATA IN ORGANIZATIONS

 

Big data analytics is a trending practice that many companies are adopting. Before jumping in and buying big data tools, though, organizations should first get to know the landscape.

In essence, big data analytics tools are software products that support predictive and prescriptive analytics applications running on big data computing platforms: typically, parallel processing systems based on clusters of commodity servers, scalable distributed storage, and technologies such as Hadoop and NoSQL databases.

In addition, big data analytics tools provide the framework for using data mining techniques to analyze data, discover patterns, propose analytical models to recognize and react to identified patterns, and then enhance the performance of business processes by embedding the analytical models within the corresponding operational applications.


Powering analytics: Inside big data and advanced analytics tools

Big data analytics yields a long list of vendors. However, many of these vendors provide big data platforms and tools that support the analytics process, for example data integration, data preparation and other types of data management software. We focus on tools that meet the following criteria:

  • They provide the analyst with advanced analytics algorithms and models.
  • They’re engineered to run on big data platforms such as Hadoop or specialty high-performance analytics systems.
  • They’re easily adaptable to use structured and unstructured data from multiple sources.
  • Their performance is capable of scaling as more data is incorporated into analytical models.
  • Their analytical models can be or already are integrated with data visualization and presentation tools.
  • They can easily be integrated with other technologies.


Big data and advanced analytics tools:

While some individuals in the organization are looking to explore and devise new predictive models, others look to embed these models within their business processes, and still others will want to understand the overall impact that these tools will have on the business. In other words, organizations that are adopting big data analytics need to accommodate a variety of user types, such as:

The data scientist, who likely performs more complex analyses involving more complex data types and is familiar with how underlying models are designed and implemented to assess inherent dependencies or biases.

The business analyst, who is likely a more casual user looking to use the tools for proactive data discovery or visualization of existing information, as well as some predictive analytics.

The business manager, who is looking to understand the models and conclusions.

IT developers, who support all the prior categories of users.


How Applications of Big Data Drive Industries:

Generally, most organizations have several goals for adopting big data projects. While the primary goal for most organizations is to enhance customer experience, other goals include cost reduction, better targeted marketing and making existing processes more efficient. In recent times, data breaches have also made enhanced security an important goal that big data projects seek to incorporate. More importantly, however, where do you stand when it comes to big data? You will very likely find that you are either:

  • Trying to decide whether there is true value in big data or not
  • Evaluating the size of the market opportunity
  • Developing new services and products that will utilize big data
  • Repositioning existing services and products to utilize big data, or
  • Already utilizing big data solutions.

With this in mind, having a bird’s eye view of big data and its application in different industries will help you better appreciate what your role is or what it is likely to be in the future, in your industry or across different industries.

In this article, I shall examine 10 industry verticals that are using big data, industry-specific challenges that these industries face, and how big data solves these challenges.


Conclusion:

Having gone through 10 industry verticals and how big data plays a role in these industries, here are a few key takeaways:

  • There is substantial real spending around big data
  • To capitalize on big data opportunities, you need to familiarize yourself with and understand where spending is occurring
  • Match market needs with your own capabilities and solutions
  • Vertical industry expertise is key to utilizing big data effectively and efficiently

INTEGRATED BIG DATA APPLICATIONS

Integrated application systems are infrastructure systems pre-integrated with databases, applications software or both, providing appliance-like functionality for Big Data solutions, analytic platforms or similar demands. For ISVs, there are five key reasons to consider delivering your Big Data/analytics solutions in the form of integrated application systems: benefits that can make the difference between market-moving success and tepid sales and profits.

Operational Environment:

The typical IT enterprise evolved to its current state by utilizing standards and best practices. These range from simple things like data naming conventions to more complex ones such as a well-maintained enterprise data model. New data-based implementations require best practices in organization, documentation and governance. With new data and processes in the works, you must update documentation, standards and best practices and continue to improve quality.

Costs and benefits of new mainframe components typically involve software license charges. The IT organization will need to re-budget and perhaps even re-negotiate current licenses and lease agreements. As always, new hardware comes with its own requirements of power, footprint, and maintenance needs.

A Big Data implementation brings additional staff into the mix: experts on new analytics software, experts on special-purpose hardware, and others. Such experts are rare, so your organization must hire, rent, or outsource this work. How will they fit into your current organization?  How will you train current staff to grow into these positions?


Start with the Source System:

This is your core data from operational systems. Interestingly, many beginning Big Data implementations will attempt to access this data directly (or at least to store it for analysis), thereby bypassing succeeding steps. This happens because Big Data sources have not yet been integrated into your IT architecture; indeed, these data sources may be brand new or never accessed before.

Those who support the source data systems may not have the expertise to assist in analytics, while analytics experts may not understand the source data. Analytics accesses production data directly, so any testing or experimenting is done in a production environment.

Analyze Data Movement:

These data warehouse subsystems and processes first access data from the source systems. Some data may require transformations or ‘cleaning’. Examples include missing data or invalid data such as all zeroes for a field defined as a date. Some data must be gathered from multiple systems and merged, such as accounting data. Other data requires validation against other systems.

Data from external sources can be extremely problematic.  Consider data from an external vendor that was gathered using web pages where numbers and dates were entered in free-form text fields. This opens the possibility of non-numeric characters in numeric data fields. How can you maximize the amount of data you process, while minimizing the issues with invalid fields?  The usual answer is ‘cleansing’ logic that handles the majority of invalid fields using either calculation logic or assignment of default values.
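
A minimal sketch of such cleansing logic is shown below; the accepted formats and default values are assumptions chosen for illustration, not a prescription.

```python
# A sketch of simple 'cleansing' logic for free-form numeric and date fields
# coming from an external source; the formats and defaults are assumptions.
from datetime import date, datetime

def clean_number(raw, default=0.0):
    """Strip common non-numeric noise and fall back to a default value."""
    try:
        return float(str(raw).replace(",", "").replace("$", "").strip())
    except ValueError:
        return default

def clean_date(raw, default=date(1900, 1, 1)):
    """Accept a couple of common formats; reject all-zero or garbage dates."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(str(raw).strip(), fmt).date()
        except ValueError:
            continue
    return default

print(clean_number(" $1,234.50 "))      # 1234.5
print(clean_number("twelve"))           # 0.0 (default value assigned)
print(clean_date("00000000"))           # 1900-01-01 (default value assigned)
print(clean_date("2015-07-10"))         # 2015-07-10
```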


Review Data Storage for Analytics:

This is the final point, the destination where all data is delivered. From here, we get direct access to data for analysis, perhaps by approved query tools. Some subsets of data may be loaded into data marts, while others may be extracted and sent to internal users for local analysis. Some implementations include publish-and-subscribe features or even replication of data to external sources.

Coordination between current processes and the big data process is required. IT support staff will have to investigate whether options to get early use of the data are available. It may also be possible to load current data and the corresponding big data tables in parallel. Delays in loading data will impact the accuracy and availability of analytics; this is a business decision that must be made, and will differ from implementation to implementation.

Greater solution consistency:

In an integrated application system, you integrate the hardware and software for your product, so you control the environment that supports your product. That ensures your Big Data and analytics applications have all the processing, storage, and memory resources they need to deliver optimal performance on compute-intensive jobs. In short, your application runs the way it was designed, so customer satisfaction is optimized.

Better system security:

Analytics and other Big Data systems frequently deal with financial, medical or other proprietary information. By delivering your product as an integrated application system, you can build in the security tools necessary to prevent access by unauthorized users, hackers or other intruders. Your application is safer, so your customers gain confidence in your products.


Conclusion:

Big data today has scale-up and scale-out issues. Further, it often involves integration of dissimilar architectures. When we insist that we can deal with big data by simply scaling up to faster, special-purpose hardware, we are neglecting more fundamental issues.

NEURAL NETWORKS - DATA MINING

A neural network is a powerful computational data model that is able to capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform “intelligent” tasks similar to those performed by the human brain.


A neural network acquires knowledge through learning.

A neural network’s knowledge is stored within inter-neuron connection strengths known as synaptic weights.

The true power and advantage of neural networks lies in their ability to represent both linear and non-linear relationships and in their ability to learn these relationships directly from the data being modeled. Traditional linear models are simply inadequate when it comes to modeling data that contains non-linear characteristics.

The most common neural network model is the Multilayer Perceptron (MLP). This type of neural network is known as a supervised network because it requires a desired output in order to learn. The goal of this type of network is to create a model that correctly maps the input to the output using historical data so that the model can then be used to produce the output when the desired output is unknown.

Consider a demonstration of a neural network learning to model the exclusive-or (XOR) function. The XOR data is repeatedly presented to the neural network. With each presentation, the error between the network output and the desired output is computed and fed back to the neural network. The neural network uses this error to adjust its weights such that the error decreases. This sequence of events is usually repeated until an acceptable error has been reached or until the network no longer appears to be learning.
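
The sketch below reproduces this XOR demonstration with scikit-learn's MLPClassifier; the library, network size and training parameters are assumptions chosen for illustration, not part of the original demonstration.

```python
# A minimal sketch of a supervised network learning XOR with scikit-learn's
# MLPClassifier (an assumption; the text does not name a specific library).
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # inputs
y = [0, 1, 1, 0]                       # desired XOR outputs

# A small multilayer perceptron; weights are adjusted until the error between
# the network output and the desired output is acceptably small.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)

print(mlp.predict(X))   # ideally [0 1 1 0]; a different random_state may be
                        # needed, since very small networks can get stuck
```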

A good way to introduce the topic is to take a look at a typical application of neural networks. Many of today’s document scanners for the PC come with software that performs a task known as optical character recognition (OCR). OCR software allows you to scan in a printed document and then convert the scanned image into an electronic text format such as a Word document, enabling you to manipulate the text. In order to perform this conversion the software must analyze each group of pixels (0’s and 1’s) that forms a letter and produce a value that corresponds to that letter. Some of the OCR software on the market uses a neural network as the classification engine.


Consider a neural network used within an optical character recognition (OCR) application. The original document is scanned into the computer and saved as an image. The OCR software breaks the image into sub-images, each containing a single character. The sub-images are then translated from an image format into a binary format, where each 0 and 1 represents an individual pixel of the sub-image. The binary data is then fed into a neural network that has been trained to make the association between the character image data and a numeric value that corresponds to the character.
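
As a rough stand-in for this OCR workflow, the sketch below binarizes the small scikit-learn digits images and trains a multilayer perceptron to map pixel patterns to character values. The dataset, library and model settings are illustrative assumptions; production OCR engines are far more elaborate.

```python
# An illustrative sketch of the OCR idea using scikit-learn's small digits
# dataset as a stand-in for scanned sub-images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()                      # 8x8 grayscale character images
X = (digits.images > 8).reshape(len(digits.images), -1).astype(int)  # binarize pixels to 0/1
y = digits.target                           # numeric value for each character

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)                   # learn the image-to-character association

print("accuracy:", clf.score(X_test, y_test))
print("predicted character:", clf.predict(X_test[:1])[0])
```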


Neural networks have been successfully applied to a broad spectrum of data-intensive applications, such as:

Process Modeling and Control – Creating a neural network model for a physical plant then using that model to determine the best control settings for the plant.

Machine Diagnostics – Detect when a machine has failed so that the system can automatically shut down the machine when this occurs.

Portfolio Management – Allocate the assets in a portfolio in a way that maximizes return and minimizes risk.

Target Recognition – Military application which uses video and/or infrared image data to determine if an enemy target is present.

Medical Diagnosis – Assisting doctors with their diagnosis by analyzing the reported symptoms and/or image data such as MRIs or X-rays.

Credit Rating – Automatically assigning a company’s or individual’s credit rating based on their financial condition.

Targeted Marketing – Finding the set of demographics which have the highest response rate for a particular marketing campaign.

Voice Recognition – Transcribing spoken words into ASCII text.

Financial Forecasting – Using the historical data of a security to predict the future movement of that security.

Quality Control – Attaching a camera or sensor to the end of a production process to automatically inspect for defects.

Intelligent Searching – An internet search engine that provides the most relevant content and banner ads based on the users’ past behavior.

Fraud Detection – Detect fraudulent credit card transactions and automatically decline the charge.


CLUSTERING - DATA MINING

Clustering is the process of breaking down a large population that has a high degree of variation and noise into smaller groups with lower variation. It is a popular data mining activity. In a poll conducted by KDnuggets, clustering was voted the third most frequently used data mining technique in 2011; only decision trees and regression got more votes.

Cluster analysis is an important part of an analyst’s arsenal. One needs to master this technique as it is going to be used often in business situations. Some common applications of clustering are:

  1. Clustering customer behavior data for segmentation
  2. Clustering transaction data for fraud analysis in financial services
  3. Clustering call data to identify unusual patterns
  4. Clustering call-centre data to identify outlier performers (high and low)


Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.

Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.

Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.

Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.

Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.

Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.

There are also finer distinctions possible, for example:

strict partitioning clustering: here each object belongs to exactly one cluster

strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers.

overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.

hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster

subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.


k-means clustering

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders).

K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based classification, and Lloyd’s algorithm as a variation of the Expectation-maximization algorithm for this model.
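
A short sketch of k-means in practice, using scikit-learn's KMeans on artificial two-dimensional data; the library and the toy data are assumptions for illustration.

```python
# A k-means sketch with scikit-learn: choose k, fit, inspect the centers,
# and assign a new point to the nearest cluster center.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of 2-D points around different centers.
data = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
                  rng.normal(loc=(5, 5), scale=0.5, size=(100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print("cluster centers:\n", kmeans.cluster_centers_)
print("first labels:", kmeans.labels_[:5])
print("assign a new point:", kmeans.predict([[4.8, 5.1]]))
```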


Data Mining – Handling Missing Values

One of the important stages of data mining is preprocessing, where we prepare the data for mining. Real-world data tends to be incomplete, noisy, and inconsistent, and an important task when preprocessing the data is to fill in missing values, smooth out noise and correct inconsistencies.

1. Ignore the data row

This is usually done when the class label is missing (assuming your data mining goal is classification), or many attributes are missing from the row (not just one). However, you’ll obviously get poor performance if the percentage of such rows is high.

For example, let’s say we have a database of student enrollment data (age, SAT score, state of residence, etc.) and a column classifying their success in college as “Low”, “Medium” or “High”. Let’s say our goal is to build a model predicting a student’s success in college. Data rows that are missing the success column are not useful in predicting success, so they could very well be ignored and removed before running the algorithm.
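
A minimal pandas sketch of this technique, with an invented toy table and column names:

```python
# Minimal sketch with an invented toy table: drop rows whose class label
# ("success") is missing before modeling.
import pandas as pd

students = pd.DataFrame({
    "age":     [18, 19, 18, 20],
    "sat":     [1200, 1350, None, 1100],
    "success": ["High", None, "Medium", "Low"],
})

trainable = students.dropna(subset=["success"])   # ignore unlabeled rows
print(trainable)
```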

2. Use a global constant to fill in for missing values

Decide on a new global constant value, like “unknown“, “N/A” or minus infinity, that will be used to fill all the missing values.
This technique is used because sometimes it just doesn’t make sense to try and predict the missing value.

For example, let’s look at the student enrollment database again. Assume the state of residence attribute is missing for some students. Filling it in with some arbitrary state doesn’t really make sense, as opposed to using something like “N/A”.
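
A minimal pandas sketch, again with an invented column name:

```python
# Minimal sketch with an invented "state" column: fill missing values with a
# global constant rather than guessing a state.
import pandas as pd

students = pd.DataFrame({"state": ["TX", None, "CA", None]})
students["state"] = students["state"].fillna("N/A")
print(students)
```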

3. Use attribute mean

Replace missing values of an attribute with the mean (or the median, if the attribute is discrete) value for that attribute in the database.

For example, in a database of US family incomes, if the average income of a US family is X, you can use that value to replace missing income values.
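
A minimal pandas sketch with an invented income column:

```python
# Minimal sketch with an invented "income" column: replace missing values
# with the attribute mean.
import pandas as pd

incomes = pd.DataFrame({"income": [52000, 61000, None, 48000, None]})
incomes["income"] = incomes["income"].fillna(incomes["income"].mean())
print(incomes)
```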

4. Use attribute mean for all samples belonging to the same class

Instead of using the mean (or median) of a certain attribute calculated by looking at all the rows in a database, we can limit the calculations to the relevant class to make the value more relevant to the row we’re looking at.

Let’s say you have a car pricing database that, among other things, classifies cars as “Luxury” and “Low budget”, and you’re dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you’d get if you factored in the low budget cars.
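
A minimal pandas sketch of class-conditional imputation; the segment and cost columns are invented:

```python
# Minimal sketch with invented "segment" and "cost" columns: fill a missing
# cost using the mean cost of cars in the same class.
import pandas as pd

cars = pd.DataFrame({
    "segment": ["Luxury", "Luxury", "Luxury", "Low budget", "Low budget"],
    "cost":    [90000, None, 110000, 15000, 17000],
})

cars["cost"] = cars.groupby("segment")["cost"].transform(
    lambda s: s.fillna(s.mean())
)
print(cars)   # the missing luxury cost becomes 100000, not the overall mean
```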

5. Use a data mining algorithm to predict the most probable value

The value can be determined using regression, inference-based tools using Bayesian formalism, decision trees, or clustering algorithms (k-means/median, etc.).

For example, we could use a clustering algorithm to create clusters of rows which will then be used for calculating an attribute mean or median as specified in technique #3.
Another example could be using a decision tree to try and predict the probable value of the missing attribute, according to other attributes in the data.
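
A minimal sketch of model-based imputation using a decision tree from scikit-learn; the library, columns and toy values are assumptions for illustration.

```python
# Minimal sketch of model-based imputation: predict the missing "cost" values
# from the other attributes with a decision tree (columns and values invented).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

cars = pd.DataFrame({
    "horsepower": [300, 120, 110, 450, 95],
    "weight":     [1800, 1200, 1150, 2000, 1100],
    "cost":       [90000, 16000, None, 120000, None],
})

known   = cars[cars["cost"].notna()]
missing = cars[cars["cost"].isna()]

model = DecisionTreeRegressor(random_state=0)
model.fit(known[["horsepower", "weight"]], known["cost"])

cars.loc[cars["cost"].isna(), "cost"] = model.predict(missing[["horsepower", "weight"]])
print(cars)
```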

Strategies for Dealing with Missing Data
There are three main strategies for dealing with missing data. The simplest solution to the missing values imputation problem is the reduction of the data set and elimination of all samples with missing values. Another solution is to treat missing values as special values. Finally, the missing values problem can be handled by various missing values imputation methods. Unfortunately, missing values imputation methods are suitable only for values missing under the missing completely at random (MCAR) mechanism, and some of them for the missing at random (MAR) mechanism. If missing values are caused by the not missing at random (NMAR) mechanism, they must be handled by going back to the source of the data and obtaining more information, or an appropriate model for the missing data mechanism has to be taken into account.

The Missing Values Problem
A missing value is a value that we intended to obtain during data collection (interview, measurement, observation) but did not, for various reasons. Missing values can appear because a respondent did not answer all questions in a questionnaire, because of errors during the manual data entry process, incorrect measurement, a faulty experiment, because some data are censored or anonymous, and for many other reasons. Luengo (2011) identified three problems associated with missing values: loss of efficiency; complications in handling and analyzing the data; and bias resulting from differences between missing and complete data. Loss of efficiency is caused by the time-consuming process of dealing with missing values. This is closely connected to the second problem, complications in handling and analyzing the data, which lie in the fact that most methods and algorithms are usually unable to deal with missing values, so the missing values problem must be solved prior to analysis, during the data preparation phase. The third problem, bias resulting from differences between missing and complete data, lies in the fact that imputed values are not exactly the same as the known values of a complete data set and should not be handled the same way. The same problem also occurs if the data set is reduced and some cases (rows of the data set) are removed or ignored.