Sentiment analysis for product rating


Sentiment is a judgment or thought based on feeling. Sentiment plays a major role when products from different brands are developed with the same quality, because it helps one brand’s product gain a better market than the other. Sentiment analysis, also known as opinion mining, deals with sentiment polarity categorization. The various processes involved in sentiment analysis are explained below:


Data collection

Online portals like Twitter provide an API to extract data, but most other portals do not provide such a mechanism. Scripting languages are used to extract content from those portals. In general, the data in these cases consists of rating numbers (1 to 5 or 1 to 10) and a rating description. Extracting, analyzing and charting rating numbers is relatively easier than analyzing the rating description.


Feature extraction

A bag of words consists of a set of standard words; those words are compared with the review data to derive a binary feature vector. However, this method is not effective on phrases, so collocations are captured with bigram functions. Bigrams help in identifying negations, since negation words occur as part of a pair or group of words. During feature extraction, a spell check needs to be done to clean up the data. Part-of-speech tagging is a key part of feature extraction.
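The bag-of-words and bigram steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the sample review and the function names are hypothetical, and tokenizing by whitespace stands in for a real tokenizer.

```python
# Hypothetical vocabulary of standard words and one bigram; not from a real dataset.
VOCABULARY = ["good", "bad", "battery", "display", "not good"]


def tokenize(text):
    """Lowercase the text and split it into word tokens on whitespace."""
    return text.lower().split()


def bigrams(tokens):
    """Return adjacent word pairs joined by a space, e.g. 'not good'."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def binary_feature_vector(text, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary entry found in the review (as a word or a bigram), else 0."""
    tokens = tokenize(text)
    present = set(tokens) | set(bigrams(tokens))
    return [1 if entry in present else 0 for entry in vocabulary]


review = "Battery is good but display is not good"
print(binary_feature_vector(review))
```

Note that the single word "good" and the bigram "not good" both fire for this review, which is why a later classification step has to weigh the bigram feature against the single-word feature.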



For classifying the data there are various methods, such as Naive Bayes. It uses Bayes’ rule to calculate the probability that a feature vector belongs to a class. This method is somewhat complicated, and it is hard to trace back which probabilities cause a given classification. The decision list is another method; it operates as a rule-based tagger and has the advantage of human readability.



Based on the sentiment analysis, the results are charted or presented in tabular form. In a simple analysis of rating numbers, the results are charted in a graph with products on the x axis and review rating numbers on the y axis. With the Naive Bayes and decision list methods, the results are formatted as a table with the features in one column and the scores in the other columns.



In the lexicon-based approach, the words in each review that match a predefined sentiment lexicon of positive and negative words are counted. The polarity of the review (positive, negative or neutral) is then assigned based on the counts of these polar words.
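A minimal sketch of this counting scheme follows. The two word sets are a tiny hypothetical lexicon chosen for illustration; a real sentiment lexicon would be far larger, and the tie-breaking rule (equal counts give neutral) is an assumption.

```python
# Hypothetical miniature sentiment lexicon; real lexicons contain thousands of entries.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "slow"}


def polarity(review):
    """Count matched positive and negative words and return the review polarity."""
    words = [word.strip(".,!?") for word in review.lower().split()]
    positive_count = sum(word in POSITIVE for word in words)
    negative_count = sum(word in NEGATIVE for word in words)
    if positive_count > negative_count:
        return "positive"
    if negative_count > positive_count:
        return "negative"
    return "neutral"  # assumption: equal counts (including zero matches) are neutral


print(polarity("Great display, good battery"))
print(polarity("Good display but slow"))
```

Plain counting ignores negation, so a phrase like "not good" would still count as positive here; that is one reason bigram features or aspect-level analysis are used alongside it.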



The challenges in sentiment analysis start right from data collection. Most of the data is free text available on HTML pages. In many cases the rating number and the rating description do not match, so a simple analysis of rating numbers is not accurate; this leads to analyzing the rating description with various machine learning tools. Analysis using these tools is complex in nature. Since ratings are open to all customers, junk reviews are quite possible, and spelling mistakes are common in rating content.



Natural Language Toolkit (NLTK) – It runs on the Python platform. It provides features like tokenizing, identifying named entities and parsing.

Stanford CoreNLP suite – It provides tools for part-of-speech tagging, grammar parsing and named entity recognition.

GATE and Apache UIMA – They help in building complex NLP workflows that integrate different processing steps.




SAS Text Analytics – It provides text analytics software to extract information from text content. It discovers patterns and trends in text using natural language processing, advanced linguistic technologies and advanced statistical modeling.

IBM Text Analytics – It converts unstructured survey text into quantitative data. It automates the categorization process to eliminate manual processing. To reduce the ambiguities of human language, it uses linguistics-based technologies.

Lexalytics Text Analytics – Salience is the text analytics engine built by Lexalytics. It is helpful for social media monitoring, sentiment analysis and voice-of-the-customer surveys.

Smartlogic – It provides rule-based classification modelling and information visualization. It applies metadata and classification to deliver a navigation and search experience.


E-Commerce product rating based on Customer review mining

E-commerce product rating review mining is the process of analyzing customer reviews from an online portal to understand customer feedback on a product. A simple analysis for a sample product, Laptop A, which collects the ratings for the product, plots them in a chart and then summarizes the popular ratings, is shown below.
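As a rough illustration of this kind of simple analysis, the sketch below tallies a list of rating numbers, finds the most popular rating and draws a text bar chart. The ratings are made-up sample values for Laptop A, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical 1-to-5 ratings collected for the sample product Laptop A.
laptop_a_ratings = [5, 4, 4, 3, 5, 4, 2, 5, 4, 1]


def summarize(ratings):
    """Return the count per rating, the most popular rating and the average rating."""
    counts = Counter(ratings)
    most_popular, _ = counts.most_common(1)[0]
    average = sum(ratings) / len(ratings)
    return counts, most_popular, average


def text_chart(counts, low=1, high=5):
    """Build one text bar per rating value, e.g. '4 | ####'."""
    return [f"{rating} | {'#' * counts[rating]}" for rating in range(low, high + 1)]


counts, most_popular, average = summarize(laptop_a_ratings)
print("\n".join(text_chart(counts)))
print("Most popular rating:", most_popular)
print("Average rating:", average)
```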


Disadvantages of simple e-commerce review mining

In many cases the rating description and the rating number do not match. Ratings are based on various components of the product: for a laptop, for example, some users rate based only on performance speed, while others rate based on battery, display and sound. So parsing the text description and deriving accurate ratings from it is more important than simply plotting the rating numbers.

Examples of complicated text ratings for laptops are:

  • Battery life is good but display is not good
  • Good but expected more
  • Display is good but overall ok
  • Price is little higher but overall good

So sentiment analysis techniques need to be applied to find the polarity of the review, that is, whether the review is positive, negative or neutral. Various sentiment analysis techniques are explained as follows.

Document level Sentiment analysis

In this method, to determine the polarity of the reviewers’ sentiment toward a particular product, all the reviews in the whole document are taken together as the input for analysis.

Aspect level Sentiment Analysis

This method identifies the various aspects of a particular entity from the review text and then assigns a sentiment polarity to each aspect. The key steps are explained as follows. Aspect extraction is the process of extracting the aspects and their opinion words from the review. Aspect sentiment classification involves deciding whether the sentiment toward an aspect is positive, negative or neutral. Machine learning and natural language processing techniques are used in this classification process.
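The two steps can be illustrated with a deliberately simple rule-based sketch. It is not a machine learning classifier: the aspect list and opinion words are hypothetical, clauses are split on the word "but", and a preceding "not" flips the opinion word. It only shows how a review such as "Battery life is good but display is not good" yields one polarity per aspect.

```python
# Hypothetical aspect terms and opinion words for laptop reviews.
ASPECTS = {"battery", "display", "price", "sound"}
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "poor"}


def aspect_sentiments(review):
    """Return a dict mapping each aspect found in the review to its polarity."""
    results = {}
    for clause in review.lower().split(" but "):
        words = clause.split()
        # Aspect extraction: the first known aspect term in the clause.
        aspect = next((word for word in words if word in ASPECTS), None)
        if aspect is None:
            continue
        # Aspect sentiment classification: sum opinion words, flipping after "not".
        score = 0
        for index, word in enumerate(words):
            sign = -1 if index > 0 and words[index - 1] == "not" else 1
            if word in POSITIVE:
                score += sign
            elif word in NEGATIVE:
                score -= sign
        if score > 0:
            results[aspect] = "positive"
        elif score < 0:
            results[aspect] = "negative"
        else:
            results[aspect] = "neutral"
    return results


print(aspect_sentiments("Battery life is good but display is not good"))
```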

Machine Learning

There are two approaches to machine learning: supervised learning and unsupervised learning.

Supervised learning

In supervised learning, labeled training data is used to build the predictor model. KNN is one type of supervised machine learning algorithm; it has a training set and classifies the input based on its similarity to its K nearest neighbors. Naive Bayes is another type of supervised machine learning algorithm; it determines the probability that a sentiment word belongs to a particular class using Bayes’ rule, which is P(label | features) = P(label) * P(features | label) / P(features). Support Vector Machines are another supervised machine learning algorithm, which uses linear classification techniques.
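To make the Bayes’ rule formula concrete, here is a small Naive Bayes sketch written from scratch. Several things are assumptions rather than part of the text above: the four training reviews are invented, words are treated as independent features, add-one (Laplace) smoothing is used so unseen words do not give zero probability, and P(features) is dropped because it is the same for every label.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect label counts and per-label word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label that maximizes log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)  # log P(label)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            # log P(word | label) with add-one smoothing (an assumption of this sketch)
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_label + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training reviews.
training_data = [
    ("good battery", "positive"),
    ("great display", "positive"),
    ("poor sound", "negative"),
    ("bad battery", "negative"),
]
model = train(training_data)
print(classify("good display", model))
print(classify("poor battery", model))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer reviews.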

Unsupervised learning

It deals with classification algorithms that use unlabeled data. It involves clustering and classification techniques. It is based on finding recurring patterns in the available dataset and using those inferences to classify new input data.

The lexicon-based approach is a common unsupervised learning method, which involves the analysis of phrases and idioms that form opinions. There are two types of lexicon-based approach: the dictionary-based approach and the semantic-based approach. In the dictionary-based approach, a list of words is used to map to the extracted sentiment. The drawback of this approach is that dictionary words may not be able to determine the sentiment based on context. In the semantic-based approach, the sentiment score is predicted based on similarities between words.

Sentiment analysis in Hadoop

Hadoop can be used for analyzing structured and unstructured data. Structured data analysis in Hadoop is handled with an SQL-like query tool called Hive. Unstructured data analysis is handled by Apache HBase, a NoSQL database. Other features of the Hadoop ecosystem are a machine learning library called Apache Mahout, and Apache Storm for real-time data analysis. It provides a wide variety of tools for sentiment analysis of user comments from online portals. It also provides a distributed computing environment.

Distributed computing: Hadoop


A distributed system uses software to coordinate tasks that are performed on different computers at the same time. The computers communicate with each other by sending messages in order to accomplish a shared objective.

Distributed computing is used to tackle complex computational problems that cannot be finished within a reasonable amount of time on a single computer. The time needed to finish the whole computation is reduced by harnessing the power of many computers.

A distributed architecture can serve as an umbrella for a wide range of frameworks. Hadoop is one example of a framework that brings together a broad array of tools.


Hadoop is an open source distributed computing framework for processing large-scale datasets.


The two main components of Hadoop are:

Hadoop Distributed File System (HDFS) – HDFS has a master–slave architecture. The master is the name node and the slaves are the data nodes. There is a single name node and multiple data nodes. The metadata of the file system is maintained by the name node.

MapReduce – MapReduce also works on a master–slave architecture. The master is the JobTracker and the slaves are the TaskTrackers. The JobTracker typically runs on the master node alongside the name node, and a TaskTracker runs on each data node. The JobTracker accepts a job, converts it into multiple tasks and decides where to run each task. The job of the TaskTrackers is to execute those tasks, monitor their execution and continuously send feedback to the JobTracker.
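The map and reduce steps that the tasks carry out can be imitated in a single Python process. The sketch below is not Hadoop code and uses no Hadoop API; it only mirrors the programming model (map, group by key, reduce) on a word count over hypothetical review texts, with made-up function names.

```python
from collections import defaultdict


def map_phase(review):
    """Map task: emit a (word, 1) pair for every word in one review."""
    return [(word, 1) for word in review.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values emitted for one key."""
    return key, sum(values)


def word_count(reviews):
    """Run the three steps in order on a list of review texts."""
    pairs = [pair for review in reviews for pair in map_phase(review)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input; in Hadoop each review could be handled by a different task.
print(word_count(["good battery", "good display"]))
```

In a real cluster each call to the map function could run as a separate task on a different node, which is what lets the work be rescheduled elsewhere when a node fails.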


HBase – HBase is also known as the Hadoop database. It is used to store tables that consist of millions of rows and columns. It is a non-relational database. Hadoop on its own performs only batch processing, which means that the data is processed sequentially. For random access, HBase is used.

Hive – Hive was developed by Facebook and later donated to the Apache foundation. It enables bulk loading of data, which is stored in HDFS. It deals with structured data stored in HDFS. The data is queried with a query language similar to SQL.

Pig – Pig was developed by Yahoo. It is a scripting language that does not require Java, Python or SQL. A script consists of a series of operations performed on the input, which run as MapReduce programs in the back end to deliver the output.



Mahout – Mahout is an open source machine learning library from Apache, written in Java. It covers classification, filtering and clustering. Mahout mainly aims to be the machine learning tool to use when the volume of data to be processed is very large.

Oozie – Oozie is a workflow scheduler system that manages Hadoop jobs. Oozie is implemented as a Java web application that runs in a Java servlet container. It runs the jobs of a workflow in sequence.

ZooKeeper – ZooKeeper was developed to handle the partial failures that occur between nodes. It is an open source server which provides highly reliable distributed coordination. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services.

Hadoop has various advantages. Some of them are:

The Hadoop distributed file system, HDFS for short, can store huge datasets with high scalability, availability and reliability. In this way Hadoop provides partial failure support.

HDFS also gives great fault tolerance, which means that a partial failure will not result in the failure of the whole system, and data recoverability is provided for partial failures.

Hadoop implements MapReduce. The MapReduce framework detects failed tasks or nodes and automatically reschedules those tasks on another node.

Hadoop also provides data locality, which means that the MapReduce framework is colocated with the nodes that store the data, so tasks run where their data resides. Tasks are divided among the nodes so that they do not depend on each other, which avoids dependency problems and saves bandwidth.

Hadoop is used by many companies, including Google, Yahoo, Cloudera, eBay, Intel, Microsoft, Facebook and many more.


Visualization tools

Business intelligence visualization tools provide interactive capabilities for visualizing data and allow users to drill into the data. A few of the presentation methods used in the latest visualization tools are dials, gauges, infographics, geographic maps, heat maps and sparklines. Visualization technologies are embedded into business intelligence software to create these tools. These tools help in making fact-based and insightful decisions.

Gauge report sample:


Infographics report sample:




Geographic maps report sample:



QlikView

It is a user-friendly platform. It provides customized industry solutions for banking, insurance and retail. This tool allows manipulation of existing data, insight and data analytics.

SAS Business Intelligence

This is a self-service tool that is easy to use. It provides real-time analytics for the entire organization. The solution includes scalability and centralized metadata. It has a feature called auto-charting, which eliminates the need for coding.

Oracle Business Intelligence Enterprise Edition

It provides interactive dashboards. It integrates various components like Hyperion Web Analysis, Hyperion Financial Reporting, Hyperion SQR Production Reporting, Hyperion Interactive Reporting, MS Office plug-in, BI Delivers, BI Publisher, BI Interactive Dashboards and BI Server.


MicroStrategy has data analytics and dashboards that can transform an organization’s information into reports. It has desktop and mobile versions. It can leverage cloud storage through Amazon Web Services. It can run on data stored in data warehouses and operational databases.

Clear Analytics

It is similar to MS Excel. On top of regular Excel reporting, it provides features like report scheduling, sharing capabilities, version control and auditability.

SAP Business Intelligence

It has data visualization capabilities through which data from multiple sources is combined. It provides infographics capabilities. It can deliver reports to mobile devices.


The primary objectives of this reporting tool are user friendliness and high performance on large and diverse data. It provides interactive dashboards with a wide range of data visualization options.

Board all in one business intelligence

It provides a toolkit approach for interacting with Board’s interface. It provides a self-service solution for analytics, understanding data and reporting. This integrated BI and performance management is labelled ‘Management Intelligence’. Multi-dimensional analysis, ad hoc querying and dashboarding are additional features.

Microsoft Business Intelligence tools

This tool helps users analyze, drill down into and visualize data through Excel. It comes with a lot of self-service capabilities.


It is one of the most commonly used tools across various business areas. It comes with self-service products. Its scalable architecture makes it available to any application, even on mobile devices.


It has the capability to convert complicated data into actionable information and can provide visual insights. It is also termed a self-service tool. It can import data and reports from flat files, Google Analytics, Google BigQuery, Salesforce CRM, OLAP engines and Google Spreadsheets.

INTELEX Business Management Software

It is a user-friendly tool that can be easily managed and tracked. It can generate real-time reports from key metrics. It complies with standards like ISO 14001, ISO 9001 and OHSAS 18001.

Adaptive Suite

Adaptive Suite is built from the ‘cloud up’. This tool is commonly used and is extremely powerful. It can be accessed through web and mobile applications.


This tool is capable of creating management dashboards. The analytical feature allows data drill down to get specific insights. It does not need any database backup for generating reports.


It is a SaaS BI platform which has data discovery, analytics and reporting in a single self-service platform. The dashboard offers a wide range of visualization options.


It is a BI platform from Bitam. It provides reporting and management dashboards for mid-market companies. It provides customized dashboards to track different aspects of the business.

Stonefield Query

It is a self-service business analytics software. It provides ad hoc data analysis and reporting. The drill-down filters allow users to narrow search results without programming.

WebKPI Business intelligence dashboard

It provides a SaaS dashboard solution for small and medium-sized businesses. This tool is scalable and self-service. It can publish reports through email.



A data warehouse is a repository of data from various sources, and it is the core component of business intelligence. Data stored in the data warehouse is used for analytics and for generating reports in the business environment. A data warehouse makes it possible to analyze data from multiple sources. It uses a multidimensional data model for efficient querying and analysis.

There are two approaches to building a data warehouse:

Top down approach: once the complete data warehouse has been created, data marts for specific groups of users are derived from it.

Bottom up approach: the data marts are built first, and then those data marts are combined into a single data warehouse.


ETL is one of the processes in data warehousing. ETL stands for extraction, transformation and loading. It is responsible for pulling the data out of the source systems and placing it into the data warehouse. The tasks performed by ETL are as follows:


Extraction is the process of extracting data from different source systems like ERP, SAP, CRM and other operational systems. This data will be in different formats. It is then converted into a single format and consolidated into the data warehouse.


Transforming the data involves various processes such as cleaning, filtering, merging data from multiple sources, transposing columns and rows, splitting one column into multiple columns or one row into multiple rows, and so on.


Loading is the process of loading the transformed data into the data warehouse.
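The three ETL tasks can be sketched end to end with the Python standard library. Everything concrete here is an assumption made for illustration: the CSV text stands in for a source system, an in-memory SQLite database stands in for the data warehouse, and the table and column names are invented.

```python
import csv
import io
import sqlite3

# Hypothetical source data with stray spaces and one missing rating.
CSV_SOURCE = "product,rating\nLaptop A, 5 \nLaptop A,4\nLaptop B,\n"


def extract(csv_text):
    """Extraction: read rows from the source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transformation: clean whitespace, filter out incomplete rows, convert types."""
    cleaned = []
    for row in rows:
        rating = row["rating"].strip()
        if not rating:
            continue  # filter rows with no rating
        cleaned.append((row["product"].strip(), int(rating)))
    return cleaned


def load(rows, connection):
    """Loading: write the transformed rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ratings (product TEXT, rating INTEGER)"
    )
    connection.executemany("INSERT INTO ratings VALUES (?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(CSV_SOURCE)), warehouse)
print(warehouse.execute(
    "SELECT product, AVG(rating) FROM ratings GROUP BY product"
).fetchall())
```

Keeping the three steps as separate functions mirrors how ETL tools let each stage be configured and rerun independently.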


Some of the ETL tools that are in practice:

  • SAP – BusinessObjects Data Integrator
  • IBM – Cognos Data Manager
  • Informatica – Power Center
  • Oracle – Data Integrator
  • IBM – WebSphere DataStage
  • SAS – Data Integration Studio
  • Oracle – Warehouse Builder
  • Microsoft – SQL Server Integration Services
  • Talend – Talend Open Studio
  • ETL Solutions Ltd. – Transformation Manager
  • Sybase – Data Integrated Suite ETL
  • OpenSys – CloverETL
  • IBM – DB2 Warehouse Edition



There are three types of data warehouse model:

  • Virtual Warehouse
  • Data mart
  • Enterprise Warehouse

Virtual Warehouse

A set of views over operational databases is known as a virtual warehouse. A virtual warehouse requires excess capacity on the operational database servers.

Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups in an organization, and a data mart contains data specific to a particular group. For example, a marketing data mart may contain data related to items, trade and sales.

Windows-based or Unix/Linux-based servers are used to implement data marts. Data marts are implemented on low-cost servers. The implementation cycle of a data mart is measured in short periods of time, that is, in weeks rather than months or years.

The life cycle of a data mart may become complex in the long run if its planning and design are not organization-wide. Data marts are small in size and customized by department. The source of a data mart is a departmentally structured data warehouse, and data marts are flexible.

Enterprise Warehouse

An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides enterprise-wide data integration. The data is integrated from operational systems and external information providers, and its size can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.



Applications of data warehouse in industries:

Retail industry:

With the help of a data warehouse, retailers can understand customers’ buying patterns using historical data. With access to all the data, a marketing manager can find out the status of each product and plan promotional strategies accordingly. Other uses include sales forecasting with historical data, product analysis and inventory management.

Banking industry:

A data warehouse helps in making financial decisions, supporting management planning, giving bankers insight into customer relationships and profitability, and accessing customers’ account details, asset details, loan details and so on.

Health care industry:

In the health care industry, large amounts of data are stored in the data warehouse, which helps in tracking the services provided in the hospital, patients’ details, doctors’ details, reports for all patients, details about the utilization of resources, and the different diseases and their suggestion reports.