Sentiment Analysis for Product Rating


Sentiment is a judgment or opinion based on feeling. Sentiment plays a major role when products from different brands are developed to the same quality: it can help one brand’s product win a better market position than another’s. Sentiment analysis is opinion mining that deals with sentiment polarity categorization. The various processes involved in sentiment analysis are explained below:


Data collection

Online portals like Twitter provide an API to extract data, but most other portals do not offer such a mechanism; scripting languages are then used to extract content from the portal’s pages. In general, the data in these cases consists of rating numbers (1 to 5 or 1 to 10) and a rating description. Extracting, analyzing and charting rating numbers is considerably easier than analyzing the rating description.
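As a sketch of the extraction step, the following stands up a small HTML parser over an inline sample page. The `review`, `rating` and `desc` class names are hypothetical stand-ins for whatever markup a real portal uses, and in practice the page content would first be fetched with a scripting language or HTTP library:

```python
from html.parser import HTMLParser

# Sample page content; in practice this would be fetched from the portal.
# The class names below are hypothetical.
PAGE = """
<div class="review"><span class="rating">4</span><p class="desc">Great phone</p></div>
<div class="review"><span class="rating">2</span><p class="desc">Battery drains fast</p></div>
"""

class ReviewExtractor(HTMLParser):
    """Collects (rating number, rating description) pairs from review markup."""
    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None      # which field we are currently inside
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "rating":
            self._field = "rating"
        elif cls == "desc":
            self._field = "desc"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
        if "rating" in self._current and "desc" in self._current:
            self.reviews.append((int(self._current.pop("rating")),
                                 self._current.pop("desc")))

parser = ReviewExtractor()
parser.feed(PAGE)
print(parser.reviews)   # [(4, 'Great phone'), (2, 'Battery drains fast')]
```

The rating numbers can then be charted directly, while the descriptions feed the feature-extraction step below.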


Feature extraction

A bag of words consists of a set of standard words, and those words are compared with the words in a review to derive a binary feature vector. However, this method is not effective on phrases, so collocation detection is done with bigram functions. Bigrams help in identifying negation, since negation words occur as a pair or group of words. During feature extraction, a spell check needs to be done to clean up the data. Part-of-speech tag identification is a key part of feature extraction.
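A minimal sketch of this, assuming a tiny illustrative vocabulary: unigrams and bigrams from a review are matched against the vocabulary to produce a binary feature vector, with the bigram "not good" surviving as a single negation feature:

```python
# Illustrative vocabulary of unigram and bigram features.
VOCAB = ["good", "bad", "not good", "battery", "screen"]

def features(review: str) -> list:
    tokens = review.lower().split()
    # Build bigrams by pairing each token with its successor.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    present = set(tokens) | set(bigrams)
    # Binary feature vector: 1 if the vocabulary term occurs in the review.
    return [1 if term in present else 0 for term in VOCAB]

vec = features("The screen is good but the battery is not good")
print(vec)   # -> [1, 0, 1, 1, 1]
```

Note how the unigram model alone would see "good" twice and miss the negation, while the bigram feature "not good" captures it.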



For classifying the data there are various methods, such as the Naive Bayes method, which uses Bayes’ rule to calculate the probability that a feature vector belongs to a class. This method is somewhat complicated, and it is hard to trace back which probabilities are causing certain classifications. The decision list is another method; it operates as a rule-based tagger and has the advantage of human readability.
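The Naive Bayes calculation can be sketched as follows. The four-review training set and its labels are invented for illustration, and add-one (Laplace) smoothing keeps an unseen word from zeroing out a class probability:

```python
import math
from collections import Counter

# Tiny illustrative training set of (review text, label) pairs.
train = [("good great love", "pos"), ("excellent good", "pos"),
         ("bad terrible hate", "neg"), ("bad poor", "neg")]

# Count word occurrences per class and the class priors.
word_counts = {"pos": Counter(), "neg": Counter()}
class_totals = Counter()
for text, label in train:
    class_totals[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text: str) -> str:
    tokens = text.split()
    best, best_lp = None, -math.inf
    for label in word_counts:
        # log prior + log likelihood with add-one (Laplace) smoothing.
        lp = math.log(class_totals[label] / sum(class_totals.values()))
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(classify("good excellent"))   # -> pos
```

The log-probability sums also illustrate the traceability problem: the final score is a blend of many per-word terms, so it takes extra work to see which words drove a given classification.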



Based on the sentiment analysis, the results are charted or represented in tabular form. For simple rating-number analysis, the results are charted in a graph with products on the x-axis and the review rating number on the y-axis. For the Naive Bayes and decision list methods, the results are formatted as a table with features in one column and scores in the others.



Lexicon-based approach: in this method, for each review the matched negative and positive words from a predefined set of sentiment lexicons are counted. The polarity of the review is calculated by counting the polar words and assigning a polarity value such as positive, negative or neutral.
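A minimal sketch of the lexicon-based count, with an illustrative handful of polar words standing in for a real sentiment lexicon:

```python
# Illustrative stand-ins for a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(review: str) -> str:
    tokens = review.lower().split()
    pos = sum(t in POSITIVE for t in tokens)   # count matched positive words
    neg = sum(t in NEGATIVE for t in tokens)   # count matched negative words
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(polarity("Great screen but terrible battery and poor build"))   # -> negative
```
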



The challenges in sentiment analysis start right at data collection. Most of the data is free text embedded in HTML pages. In many cases the rating numbers and the rating description do not match, so a simple analysis based on rating numbers alone is inaccurate; this forces analysis of the rating description using machine learning tools, which is inherently more complex. Since ratings are open to all customers, junk reviews are likely and spelling mistakes are common in rating content.



Natural Language Toolkit (NLTK) – Runs on the Python platform. It provides features such as tokenizing, identifying named entities and parsing.

Stanford CoreNLP Suite – Provides tools for part-of-speech tagging, grammar parsing and named-entity recognition.

GATE and Apache UIMA – Help in building complex NLP workflows that integrate different processing steps.




SAS Text Analytics – Provides text analytics software to extract information from text content. It discovers patterns and trends in text using natural language processing, advanced linguistic technologies and advanced statistical modeling.

IBM Text Analytics – Converts unstructured survey text into quantitative data. It automates the categorization process to eliminate manual processing, and uses linguistics-based technologies to reduce the ambiguities of human language.

Lexalytics Text Analytics – Salience is a text analytics engine built by Lexalytics. It is useful for social media monitoring, sentiment analysis and voice-of-the-customer surveys.

Smartlogic – Provides rule-based classification modelling and information visualization. It applies metadata and classification to deliver navigation and search experiences.



Customers today leave trails of information and data across the Internet – bits of insight into who they are, what they like, and what they are going to buy. Like most businesses today, the automotive industry is gathering and utilizing as much of this data as it can collect. Of course, not all data is created equal. Data may be incomplete, unstructured, or outright wrong. To make matters worse, this data is not necessarily easy to gather and turn into meaningful information.


If you don’t know who your customers and best prospects are, how can you send them messages that bring them into your dealership or repair center? Sure, you may have data on when they last visited your place of business, along with a few contact details here and there. Perhaps you also have their billing address if they have done business with you before. But, as is often the case, you are probably missing the many details that would help you understand your customers on a much more personalized level.

You might want to know which families have children and may be ready to upgrade to a larger vehicle, who has a teenage driver in the house, or who is interested in the outdoors. Details such as income, marital status, occupation, hobbies, lifestyle, and age are a few examples of demographics that can be used to create targeted marketing messages to which your shoppers are most likely to relate.

Specialized Auto Data

A few specialized data solution providers can supply detailed data on vehicles and their owners. Look for a data solution that includes:

  • Make, model and year 100% populated, obtained directly from VINs.
  • A completely populated database in which each lead record includes data such as name, address, make, model and year.
  • Premium selects such as in-market status for a new vehicle, buyer demographics, segmented wealth modeling, email addresses, and full VIN.
  • Options such as engine size, fuel type, drive train, engine block, and engine cylinders.
  • Validated mailing addresses, directory-assistance-confirmed telephone numbers, and automotive marketing data.


With the Modi government in power, there are expectations of increased focus on reforms and a ramp-up in infrastructure. Government spending on infrastructure such as roads and airports, together with higher GDP growth in the future, will benefit the auto sector in general. We expect a slew of launches in both passenger cars and utility vehicles (UVs), given that competition has intensified.

Our prospect targeting expands marketing results by focusing your business messages and financing offers on ready, willing and able-to-buy customers. It drives the next generation of automotive marketing by combining deeper insight into buyer attitudes and behaviors with meaningful direction in the use of direct, traditional and digital media, real-time lead scoring and advertising targeting.

Key Features and Benefits

  • Cross media – direct mail, email, online advertising and more
  • Innovative – the next generation of patented methodologies
  • Affordable – value driven
  • Effective – validation results available.

Individualized service marketing campaigns are vital to retaining customers and capturing the most customer value for dealerships. This service marketing program builds sales and profits by targeting customers in equity, or those near the end of a term, lease or warranty. It uses DMS data, OEM incentives, book values and third-party data to give vehicle dealers a competitive edge by identifying current and conquest opportunities with the highest likelihood of purchasing or servicing with a dealership. Customized alerts deliver ready-to-buy opportunities at the right time, to the right associate, in an RO dashboard that is simple and transparent. Turn current service customers into loyal, repeat sales customers by getting them into another vehicle with a similar payment for little or no money down.


“PROSPECT DATA and PLANNING the strategy make a clear path to becoming a successful automobile company.”




Applications of Data Mining

Service providers

The first example of Data Mining and Business Intelligence comes from service providers in the mobile phone and utilities industries. Mobile phone and utilities companies use Data Mining and Business Intelligence to predict ‘churn’, the terms they use for when a customer leaves their company to get their phone/gas/broadband from another provider. They collate billing information, customer services interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at a higher risk of churning.
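A toy version of such a probability score might combine per-customer metrics with a logistic function. The metric names and weights below are invented for illustration; a real provider would fit them on historical churn data:

```python
import math

def churn_probability(metrics: dict, weights: dict, bias: float = -2.0) -> float:
    """Combine customer metrics into a churn score via a logistic function.
    The weights are illustrative; in practice they come from a model
    trained on historical churn outcomes."""
    z = bias + sum(weights[k] * v for k, v in metrics.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical metric weights.
weights = {"complaints_last_90d": 0.8, "competitor_page_visits": 0.5,
           "months_since_upgrade": 0.1}

low_risk = churn_probability({"complaints_last_90d": 0,
                              "competitor_page_visits": 0,
                              "months_since_upgrade": 3}, weights)
high_risk = churn_probability({"complaints_last_90d": 3,
                               "competitor_page_visits": 4,
                               "months_since_upgrade": 20}, weights)
print(round(low_risk, 2), round(high_risk, 2))   # 0.15 0.99
```

Customers whose score crosses a chosen threshold would then be targeted with retention offers and incentives.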


Another example of Data Mining and Business Intelligence comes from the retail sector. Retailers segment customers into ‘Recency, Frequency, Monetary’ (RFM) groups and target marketing and promotions to those different groups. A customer who spends little but often, and last did so recently, will be handled differently from a customer who spent big but only once, and some time ago. The former may receive loyalty, upsell and cross-sell offers, whereas the latter may be offered a win-back deal, for instance.
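The RFM split can be sketched with illustrative thresholds; the customer records and cutoffs below are invented, and real programmes typically score each dimension by quantile rather than a fixed cutoff:

```python
from datetime import date

# Illustrative customers: (last purchase date, number of orders, total spend).
customers = {
    "small_but_recent": (date(2024, 5, 1), 12, 300.0),
    "big_one_off":      (date(2022, 8, 1), 1, 2500.0),
}

def rfm_segment(last_purchase, frequency, monetary, today=date(2024, 6, 1)):
    recency_days = (today - last_purchase).days
    recent = recency_days <= 90     # hypothetical recency cutoff
    frequent = frequency >= 6       # hypothetical frequency cutoff
    if recent and frequent:
        return "loyalty/upsell offers"
    if not recent and monetary >= 1000:
        return "win-back deal"
    return "standard mailing"

for name, (d, freq, spend) in customers.items():
    print(name, "->", rfm_segment(d, freq, spend))
```
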


Perhaps some of the best-known examples of Data Mining and Analytics come from e-commerce sites. Many e-commerce companies use Data Mining and Business Intelligence to offer cross-sells and up-sells through their websites. One of the most famous of these is, of course, Amazon, which uses sophisticated mining techniques to drive its ‘People who viewed that product also liked this’ functionality.


Supermarkets provide another good example of Data Mining and Business Intelligence in action. Famously, supermarket loyalty card programmes are usually driven mostly, if not solely, by the desire to gather comprehensive data about customers for use in data mining. One notable recent example of this was the US retailer Target. As part of its Data Mining programme, the company developed rules to predict whether its shoppers were likely to be pregnant. By looking at the contents of customers’ shopping baskets, it could spot customers it thought were likely to be expecting and begin targeting promotions for nappies (diapers), cotton wool and so on. The prediction was so accurate that Target made the news by sending promotional coupons to families who did not yet realise that a member was pregnant.

Crime agencies

The use of Data Mining and Business Intelligence is not solely reserved for corporate applications and this is shown in our final example. Beyond corporate applications, crime prevention agencies use analytics and Data Mining to spot trends across myriads of data – helping with everything from where to deploy police manpower (where is crime most likely to happen and when?), who to search at a border crossing (based on age/type of vehicle, number/age of occupants, border crossing history) and even which intelligence to take seriously in counter-terrorism activities.

Text analytics

Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence. Text analytics processes can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated solutions.


Why is text analytics important?

Emails, online reviews, tweets, call center agent notes, and the vast array of other written feedback, all hold insight into customer wants and needs. But only if you can unlock it. Text analytics is the way to extract meaning from this unstructured text, and to uncover patterns and themes.


Several text analytics use cases exist:

  • Case management—for example, insurance claims assessment, healthcare patient records and crime-related interviews and reports
  • Competitor analysis
  • Fault management and field-service optimization
  • Legal ediscovery in litigation cases
  • Media coverage analysis
  • Pharmaceutical drug trial improvement
  • Sentiment analytics
  • Voice of the customer

A well-understood process for text analytics includes the following steps:

  1. Extracting raw text
  2. Tokenizing the text—that is, breaking it down into words and phrases
  3. Detecting term boundaries
  4. Detecting sentence boundaries
  5. Tagging parts of speech—words such as nouns and verbs
  6. Tagging named entities so that they are identified—for example, a person, a company, a place, a gene, a disease, a product and so on
  7. Parsing—for example, extracting facts and entities from the tagged text
  8. Extracting knowledge to understand concepts such as a personal injury within an accident claim
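Steps 2, 4 and 6 of the process above can be sketched with the standard library alone. The sentence splitter and the capitalized-run entity heuristic below are deliberately naive stand-ins for what a real NLP toolkit provides:

```python
import re

text = "Acme Corp recalled 500 cars. John Smith filed an injury claim."

# Step 4: naive sentence boundary detection on terminal punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Step 2: tokenization of each sentence into words.
tokens = [re.findall(r"[A-Za-z0-9]+", s) for s in sentences]

# Step 6 (very naive): treat runs of two or more capitalized words
# as candidate named entities.
entities = []
for sent in tokens:
    run = []
    for tok in sent + [""]:          # empty sentinel flushes the last run
        if tok[:1].isupper():
            run.append(tok)
        else:
            if len(run) >= 2:
                entities.append(" ".join(run))
            run = []

print(sentences)
print(entities)   # ['Acme Corp', 'John Smith']
```

A production pipeline would replace each of these heuristics with trained models (as in NLTK or Stanford CoreNLP) and add the part-of-speech tagging, parsing and knowledge-extraction steps.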


Qualitative Technique

An array of techniques may be employed to derive meaning from text. The most accurate method is an intelligent, trained human being reading the text and interpreting its meaning. This is the slowest method and the most costly, but the most accurate and powerful. Ideally, the reader is trained in qualitative research techniques and understands the industry and contextual framework of the text. A well-trained qualitative researcher can extract extraordinary understanding and insight from text. In a typical project, the qualitative researcher might read hundreds of paragraphs to analyze the text, develop hypotheses, draw conclusions, and write a report. This type of analysis is subject to the risks of bias and misinterpretation on the part of the qualitative researcher, but these limitations are with us always—regardless of method. The power of the human mind cannot be equaled by any software or any computer system. Decision Analyst’s team of highly trained qualitative researchers are experts at understanding text.

Content Analysis or Open-End Coding

The history of text analytics traces back to World War II and the development of “content analysis” by governmental intelligence services. That is, intelligence analysts would read documents, magazines, records, dispatches, etc., and assign numeric codes to different topics, concepts, or ideas. By summing up these numeric codes, the analyst could quantify the different concepts or ideas, and track them over time. This approach was further developed by the survey research industry after the war. Today as then, open-end questions in surveys are analyzed by someone reading the textual answers and assigning numeric codes. These codes are then summarized in tables, so that the analyst has a quantitative sense of what people are saying. This remains a powerful method of text mining or text analytics. It leverages the power of the human mind to discern subtleties and context.


The first step is careful selection of a representative sample of respondents or responses. In surveys the sample is usually representative and comparatively small (fewer than 2,000), so all open-ended questions are coded. However, in the case of social media text, a CRM system, or a customer complaint system, the text might be made up of millions of customer comments. So the first step is the random selection of a few thousand records, and these records are checked for duplicates, geographic distribution, etc. Then a human being reads each and every paragraph of text and assigns numeric codes to different meanings and ideas. These codes are tabulated, and statistical summaries are prepared for the analyst. This is text mining or text analytics at its apogee. Open-end coding offers the strength of numbers (statistical significance) and the intelligence of the human mind. Decision Analyst operates a large multilanguage coding facility with highly trained staff specifically for content analysis and text analytics.
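The selection and duplicate check described above might look like this on a simulated comment stream (the comment texts and sample size are illustrative):

```python
import random

# Simulated comment stream; a real CRM dump could have millions of rows.
comments = [f"comment about billing #{i % 800}" for i in range(10_000)]

# Random selection of a few thousand records.
random.seed(42)
sample = random.sample(comments, 2000)

# Check for and drop exact duplicates before handing the text to coders.
deduped = list(dict.fromkeys(sample))   # preserves first-seen order

print(len(sample), len(deduped))
```

The deduplicated records would then go to human coders, whose numeric codes are tabulated into the statistical summaries described above.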


Machine Text Mining or Text Analytics

With the explosion of keyboard-generated text related to the spread of PCs and the Internet over the past two decades, many companies are searching for automated ways to analyze large volumes of textual data. Decision Analyst offers several text-analytic services, based on different software systems, to analyze and report on textual data. These software systems are very powerful, but they cannot take the place of the thinking human brain. The results from these software systems should be thought of as approximations, as crude indicators of truth and trends, but the results must always be verified by other methods and other data.


Business intelligence


Business intelligence (BI) can be described as “a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes”. The term “data surfacing” is also increasingly associated with BI functionality. BI technologies are capable of handling large amounts of structured and sometimes unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.



Business intelligence is made up of an increasing number of components, including:

*Multidimensional aggregation and allocation

*De-normalization, tagging and standardization

*Real time reporting with analytical alert

*A method of interfacing with unstructured data sources

*Group consolidation, budgeting and rolling forecasts

*Statistical inference and probabilistic simulation

*Key performance indicators optimization

*Version control and process management

*Open item management.


Data warehousing:

Often BI applications use data gathered from a data warehouse(DW) or from a data mart, and the concepts of BI and DW sometimes combine as “BI/DW” or as “BIDW”. A data warehouse contains a copy of analytical data that facilitates decision support. To distinguish between the concepts of business intelligence and data warehouses, Forrester Research defines business intelligence in one of two ways:

1. Using a broad definition: “Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” Under this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master-data management, text and content analytics, and many others that the market sometimes lumps into the “Information Management” segment. Therefore, Forrester refers to data preparation and data usage as two separate but closely linked segments of the business-intelligence architectural stack.

2. Forrester defines the narrower business-intelligence market as referring to just the top layers of the BI architectural stack, such as reporting, analytics and dashboards.

Comparison with competitive intelligence:

BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. If understood broadly, business intelligence can include the subset of competitive intelligence.


Business Intelligence Trends:

Organizations are starting to see that data and content should not be considered separate aspects of information management, but instead should be managed in an integrated enterprise approach; enterprise information management brings Business Intelligence and Enterprise Content Management together. Organizations are also moving toward Operational Business Intelligence, a segment that is currently under-served and uncontested by vendors. Traditionally, Business Intelligence vendors have targeted only the top of the pyramid, but there is now a paradigm shift toward taking Business Intelligence to the bottom of the pyramid with a focus on self-service business intelligence.

Business needs:

Because of the close relationship with senior management, another critical thing that must be assessed before the project begins is whether there is a business need and a clear business benefit from the implementation. Another driver for a business-led approach to BI is the acquisition of other organizations that enlarge the original organization; in such cases it can be beneficial to implement DW or BI in order to create more oversight.

BI Portals:

A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications. The BI portal is the user’s first impression of the DW/BI system. It is typically a browser application from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The following is a list of desirable features for web portals in general and BI portals in particular:

  • Usable – users should easily find what they need in the BI tool.
  • Content rich – the portal is not just a report-printing tool; it should also contain advice, help, support information and documentation.
  • Easy to use – the portal should be implemented in a way that makes its functionality easy to reach and encourages users to use it.
  • Scalable and customizable – give users the means to fit the portal to their own needs.




Big data analytics is a trending practice that many companies are adopting. Before jumping in and buying big data tools, though, organizations should first get to know the landscape.

In essence, big data analytics tools are software products that support predictive and prescriptive analytics applications running on big data computing platforms – typically parallel processing systems based on clusters of commodity servers, scalable distributed storage, and technologies such as Hadoop and NoSQL databases.

In addition, big data analytics tools provide the framework for using data mining techniques to analyze data, discover patterns, propose analytical models to recognize and react to identified patterns, and then enhance the performance of business processes by embedding the analytical models within the corresponding operational applications.


Powering analytics: Inside big data and advanced analytics tools

A search for big data analytics tools yields a long list of vendors. However, many of these vendors provide big data platforms and tools that support the analytics process indirectly – for example, data integration, data preparation and other types of data management software. Here we focus on tools that meet the following criteria:

  • They provide the analyst with advanced analytics algorithms and models.
  • They’re engineered to run on big data platforms such as Hadoop or specialty high-performance analytics systems.
  • They’re easily adaptable to use structured and unstructured data from multiple sources.
  • Their performance is capable of scaling as more data is incorporated into analytical models.
  • Their analytical models can be or already are integrated with data visualization and presentation tools.
  • They can easily be integrated with other technologies.


Big data and advanced analytics tools:

While some individuals in the organization are looking to explore and devise new predictive models, others look to embed these models within their business processes, and still others will want to understand the overall impact that these tools will have on the business. In other words, organizations that are adopting big data analytics need to accommodate a variety of user types, such as:

The data scientist, who likely performs more complex analyses involving more complex data types and is familiar with how underlying models are designed and implemented to assess inherent dependencies or biases.

The business analyst, who is likely a more casual user looking to use the tools for proactive data discovery or visualization of existing information, as well as some predictive analytics.

The business manager, who is looking to understand the models and conclusions.

IT developers, who support all the prior categories of users.


How Applications of Big Data Drive Industries:

Generally, most organizations have several goals for adopting big data projects. While the primary goal for most organizations is to enhance customer experience, other goals include cost reduction, better-targeted marketing and making existing processes more efficient. In recent times, data breaches have also made enhanced security an important goal for big data projects. More importantly, however, where do you stand when it comes to big data? You will very likely find that you are:

  • Trying to decide whether there is true value in big data or not
  • Evaluating the size of the market opportunity
  • Developing new services and products that will utilize big data
  • Repositioning existing services and products to utilize big data, or
  • Already utilizing big data solutions.

With this in mind, having a bird’s eye view of big data and its application in different industries will help you better appreciate what your role is or what it is likely to be in the future, in your industry or across different industries.

In this article, I shall examine 10 industry verticals that are using big data, industry-specific challenges that these industries face, and how big data solves these challenges.



Having gone through 10 industry verticals, including how big data plays a role in each, here are a few key takeaways:

  • There is substantial real spending around big data
  • To capitalize on big data opportunities, familiarize yourself with and understand where spending is occurring
  • Match market needs with your own capabilities and solutions
  • Vertical industry expertise is key to utilizing big data effectively and efficiently


Integrated application systems are infrastructure systems pre-integrated with databases, application software or both, providing appliance-like functionality for Big Data solutions, analytic platforms or similar demands. For ISVs, there are five key reasons to consider delivering Big Data/analytics solutions as integrated application systems – benefits that can make the difference between market-moving success and tepid sales and profits.

Operational Environment:

The typical IT enterprise evolved to its current state by applying standards and best practices, ranging from simple things like data naming conventions to more complex ones such as a well-maintained enterprise data model. New data-based implementations require best practices in organization, documentation and governance. With new data and processes in the works, you must update documentation, standards and best practices, and continue to improve quality.

Costs and benefits of new mainframe components typically involve software license charges. The IT organization will need to re-budget and perhaps even re-negotiate current licenses and lease agreements. As always, new hardware comes with its own requirements of power, footprint, and maintenance needs.

A Big Data implementation brings additional staff into the mix: experts on new analytics software, experts on special-purpose hardware, and others. Such experts are rare, so your organization must hire, rent, or outsource this work. How will they fit into your current organization?  How will you train current staff to grow into these positions?


Start with the Source System:

This is your core data from operational systems. Interestingly, many beginning Big Data implementations attempt to access this data directly (or at least to store it for analysis), thereby bypassing the succeeding steps. This happens because Big Data sources have not yet been integrated into your IT architecture; indeed, these data sources may be brand new or never before accessed.

Those who support the source data systems may not have the expertise to assist in analytics, while analytics experts may not understand the source data. Analytics accesses production data directly, so any testing or experimenting is done in a production environment.

Analyze Data Movement:

These data warehouse subsystems and processes first access data from the source systems. Some data may require transformations or ‘cleaning’. Examples include missing data or invalid data such as all zeroes for a field defined as a date. Some data must be gathered from multiple systems and merged, such as accounting data. Other data requires validation against other systems.

Data from external sources can be extremely problematic. Consider data from an external vendor that was gathered using web pages where numbers and dates were entered in free-form text fields. This opens the possibility of non-numeric characters in numeric data fields. How can you maximize the amount of data you process while minimizing the issues with invalid fields? The usual answer is ‘cleansing’ logic that handles the majority of invalid fields using either calculation logic or assignment of default values.
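A minimal sketch of such cleansing logic for a free-form amount field, assuming the chosen policy is to salvage what parses and assign a default value to the rest:

```python
def clean_amount(raw: str, default: float = 0.0) -> float:
    """Strip common noise ($ signs, commas, stray spaces) from a free-form
    numeric field, falling back to a default when it still won't parse."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return default

# '12O4' contains a letter O typed in place of a zero.
rows = ["1,250.50", " $99 ", "N/A", "12O4"]
print([clean_amount(r) for r in rows])   # [1250.5, 99.0, 0.0, 0.0]
```

Whether to default, reject, or flag such rows for review is a business decision; the sketch shows only the salvage-then-default variant.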


Review Data Storage for Analytics:

This is the final point, the destination where all data is delivered. From here, we get direct access to data for analysis, perhaps by approved query tools. Some subsets of data may be loaded into data marts, while others may be extracted and sent to internal users for local analysis. Some implementations include publish-and-subscribe features or even replication of data to external sources.

Coordination between current processes and the big data process is required. IT support staff will have to investigate whether options to get early use of the data are available. It may also be possible to load current data and the corresponding big data tables in parallel. Delays in loading data will impact the accuracy and availability of analytics; this is a business decision that must be made, and will differ from implementation to implementation.

Greater solution consistency:

In an integrated application system, you integrate the hardware and software for your product, so you control the environment that supports your product. That ensures your Big Data and analytics applications have all the processing, storage, and memory resources they need to deliver optimal performance on compute-intensive jobs. In short, your application runs the way it was designed, so customer satisfaction is optimized.

Better system security:

Analytics and other Big Data systems frequently deal with financial, medical or other proprietary information. By delivering your product as an integrated application system, you can build in the security tools necessary to prevent access by unauthorized users, hackers or other intruders. Your application is safer, so your customers gain confidence in your products.


The conclusion:

Big data today has scale-up and scale-out issues. Further, it often involves integration of dissimilar architectures. When we insist that we can deal with big data by simply scaling up to faster, special-purpose hardware, we are neglecting more fundamental issues.

Neural networks in data mining:

A neural network is a powerful computational data model that is able to capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform “intelligent” tasks similar to those performed by the human brain.


A neural network acquires knowledge through learning.

A neural network’s knowledge is stored within inter-neuron connection strengths known as synaptic weights.

The true power and advantage of neural networks lies in their ability to represent both linear and non-linear relationships and in their ability to learn these relationships directly from the data being modeled. Traditional linear models are simply inadequate when it comes to modeling data that contains non-linear characteristics.

The most common neural network model is the Multilayer Perceptron (MLP). This type of neural network is known as a supervised network because it requires a desired output in order to learn. The goal of this type of network is to create a model that correctly maps the input to the output using historical data so that the model can then be used to produce the output when the desired output is unknown.

Consider a demonstration of a neural network learning to model the exclusive-or (XOR) function. The XOR data is repeatedly presented to the neural network. With each presentation, the error between the network output and the desired output is computed and fed back to the network, which uses this error to adjust its weights so that the error decreases. This sequence of events is usually repeated until an acceptable error has been reached or until the network no longer appears to be learning.
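The training loop described above can be sketched as a small multilayer perceptron trained by backpropagation. This is a pure-NumPy illustration; the hidden-layer size, learning rate, and iteration count are arbitrary choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR truth table: inputs and desired outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 units is enough to separate XOR
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(10000):
    # forward pass: present the XOR data to the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # the error is computed and fed back (backpropagation)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # weights adjusted so that the error decreases
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # outputs approach [0, 1, 1, 0]
```

In practice, training would stop once the error falls below an acceptance threshold rather than after a fixed number of iterations.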

A good way to introduce the topic is to take a look at a typical application of neural networks. Many of today’s document scanners for the PC come with software that performs a task known as optical character recognition (OCR). OCR software allows you to scan in a printed document and then convert the scanned image into an electronic text format such as a Word document, enabling you to manipulate the text. In order to perform this conversion, the software must analyze each group of pixels (0’s and 1’s) that forms a letter and produce a value that corresponds to that letter. Some of the OCR software on the market uses a neural network as the classification engine.


Consider how a neural network is used within an optical character recognition (OCR) application. The original document is scanned into the computer and saved as an image. The OCR software breaks the image into sub-images, each containing a single character. The sub-images are then translated from an image format into a binary format, where each 0 and 1 represents an individual pixel of the sub-image. The binary data is then fed into a neural network that has been trained to make the association between the character image data and a numeric value that corresponds to the character.
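The image-to-binary translation step can be sketched as follows (the 5×4 sub-image and the threshold of 127 are invented for illustration; real OCR software works on much larger sub-images):

```python
import numpy as np

# Hypothetical 5x4 grayscale sub-image of a character (0 = white, 255 = black)
sub_image = np.array([
    [  0, 200, 200,   0],
    [200,   0,   0, 200],
    [200, 200, 200, 200],
    [200,   0,   0, 200],
    [200,   0,   0, 200],
])

# Threshold each pixel to 0/1, then flatten into the binary feature
# vector that would be fed to the trained network
binary_vector = (sub_image > 127).astype(int).ravel()
print(binary_vector)
# → [0 1 1 0 1 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1]
```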


Neural networks have been successfully applied to a broad spectrum of data-intensive applications, such as:

Process Modeling and Control – Creating a neural network model of a physical plant, then using that model to determine the best control settings for the plant.

Machine Diagnostics – Detecting when a machine has failed so that the system can automatically shut it down.

Portfolio Management – Allocating the assets in a portfolio in a way that maximizes return and minimizes risk.

Target Recognition – Using video and/or infrared image data in military applications to determine whether an enemy target is present.

Medical Diagnosis – Assisting doctors with their diagnoses by analyzing reported symptoms and/or image data such as MRIs or X-rays.

Credit Rating – Automatically assigning a company’s or individual’s credit rating based on its financial condition.

Targeted Marketing – Finding the set of demographics with the highest response rate for a particular marketing campaign.

Voice Recognition – Transcribing spoken words into ASCII text.

Financial Forecasting – Using the historical data of a security to predict its future movement.

Quality Control – Attaching a camera or sensor to the end of a production process to automatically inspect for defects.

Intelligent Searching – An internet search engine that provides the most relevant content and banner ads based on the user’s past behavior.

Fraud Detection – Detecting fraudulent credit card transactions and automatically declining the charge.







Clustering is the process of breaking down a large population with a high degree of variation and noise into smaller groups with lower variation. It is a popular data mining activity: in a poll conducted by KDnuggets, clustering was voted the third most frequently used data mining technique in 2011, behind only decision trees and regression.

Cluster analysis is an important part of an analyst’s arsenal, and one worth mastering because it comes up often in business situations. Some common applications of clustering are:

  1. Clustering customer behavior data for segmentation
  2. Clustering transaction data for fraud analysis in financial services
  3. Clustering call data to identify unusual patterns
  4. Clustering call-centre data to identify outlier performers (high and low)


Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Cluster analysis itself is not one specific algorithm but a general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.

The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Typical cluster models include:

Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.

Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.

Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.

Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.

Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.

Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

Graph-based models: a clique (a subset of nodes in a graph such that every two nodes in the subset are connected by an edge) can be considered a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
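As an illustration of the density-model family, here is a minimal sketch of DBSCAN’s core idea in pure NumPy. The sample points and the eps/min_pts values are invented for illustration, and a production implementation would use an indexed neighbour search rather than a full distance matrix:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from 'core' points that
    have at least min_pts neighbours within radius eps; points
    reachable from no core point are labelled -1 (noise)."""
    n = len(points)
    labels = np.full(n, -1)  # -1 = noise / unassigned
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue  # already assigned, or not a core point
        # expand a new cluster outward from core point i
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:  # j is also core: keep expanding
                    queue.extend(neighbours[j])
        cluster += 1
    return labels

# Two dense blobs plus one isolated outlier
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10],
                [5, 5]], dtype=float)
print(dbscan(pts, eps=1.5, min_pts=2))  # → [0 0 0 1 1 1 -1]
```

Note how the number of clusters is not specified in advance: it emerges from the density parameters, and the isolated point is reported as noise rather than forced into a cluster.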

There are also finer distinctions possible, for example:

strict partitioning clustering: here each object belongs to exactly one cluster.

strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers.

overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.

hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster.

subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.


k-means clustering

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, these algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders).

K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based classification, and Lloyd’s algorithm as a variation of the Expectation-maximization algorithm for this model.
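The assign-then-recompute loop behind k-means can be sketched as a minimal Lloyd’s algorithm in pure NumPy (the sample data, k, and iteration cap are illustrative; empty-cluster handling is omitted for brevity):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm sketch: alternate between assigning each
    point to its nearest centroid and recomputing each centroid as
    the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # initialise centroids as k distinct data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest centroid per point
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its members
        new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):
            break  # assignments stable: converged
        centroids = new
    return labels, centroids

pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centres = kmeans(pts, k=2)
# the two tight blobs end up in separate clusters
```

Each iteration can only decrease the within-cluster sum of squared distances, which is why the loop is guaranteed to terminate, though possibly at a local rather than global optimum.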