Sentiment Analysis
Data
CHAPTER 1 INTRODUCTION
The amount of raw data stored in corporate databases is exploding. In today’s fiercely competitive business environment, companies need to turn these raw data into significant insights rapidly. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting their meaning. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions.
The analysis of large volumes of data with data mining methods is generally regarded as a field for specialists. They build more or less complex analysis processes, often with strikingly expensive software, to predict, for example, imminent notices of termination or the sales figures of a product. The economic benefit is obvious, and so it was long assumed that using data mining software necessarily came with high license costs and with the support that the complexity of the subject often demands. At the latest since the open-source software RapidMiner was developed, it has been hard to maintain seriously that data mining software has to be expensive or difficult to use.
Several data mining tools, such as Weka, Orange, Rattle, and KNIME, are available as open source. Within this group, RapidMiner stands out for its efficient performance. Today RapidMiner is the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.
1.1 The Tool
RapidMiner is licensed under the GNU Affero General Public License version 3 and is currently available in version 5.3. RapidMiner contains more than 500 operators for all tasks of professional data analysis, i.e. operators for input and output as well as data processing (ETL), modeling and other aspects of data mining. Methods for text mining, web mining, automatic sentiment analysis of Internet discussion forums (opinion mining), and time series analysis and prediction are also available to the analyst. In addition, RapidMiner contains more than 20 methods for visualizing high-dimensional data and models. Moreover, all learning methods and weighting schemes of the Weka toolbox have been fully and seamlessly integrated into RapidMiner, so that the complete range of functions of Weka, which is equally widespread in research at the moment, is added to the already extensive range of functions of RapidMiner.
CHAPTER 2
2.2 Installation
Download the appropriate installation package for your operating system from http://www.rapid-i.com and install RapidMiner according to the instructions on the website. All common Windows versions are supported, as well as Macintosh, Linux, and UNIX systems.
2.3 Perspectives and Views
When you open RapidMiner, you will be welcomed by the so-called Welcome Perspective. The upper section shows typical actions which you as an analyst will perform frequently after starting RapidMiner:
1. New: Starts a new analysis process. First you must define a location and a name within the process and data repository; then you can start designing a new process.
2. Open Recent: Opens the process which is selected in the list below the actions. Alternatively, you can open this process by double-clicking inside the list. Either way, RapidMiner will then automatically switch to the Design Perspective.
3. Open: Opens the repository browser and allows you to select a process to be opened within the Design Perspective.
4. Open Template: Shows a selection of different pre-defined analysis processes, which can be configured in a few clicks.
5. Online Tutorial: Starts a tutorial which can be used directly within RapidMiner and gives an introduction to some data mining concepts using a selection of analysis processes.
Figure 1: Welcome Perspective of RapidMiner.
You will find an icon for each perspective within the right-hand area of the toolbar:
Figure 2: Toolbar Icons for Perspectives
The icons shown here take you to the following perspectives:
1. Design Perspective: This is the central RapidMiner perspective where all analysis processes are created and managed.
2. Result Perspective: If a process supplies results in the form of data or models, RapidMiner takes you to the Result Perspective, where you can look at several results at the same time.
3. Welcome Perspective: The perspective already described above, which RapidMiner shows after the program starts.
You can switch to the desired perspective by clicking inside the toolbar or via the menu entry “View” - “Perspectives”, followed by the selection of the target perspective. Where switching seems a good idea, RapidMiner will also ask you automatically whether to change to another perspective, e.g. to the Result Perspective on completing an analysis process.
Design Perspective
Since the Design Perspective is the central working environment of RapidMiner, the following sections discuss each of its parts and the fundamental functionality of the associated views. In the standard setting, there are two central views in this area.
Figure 3: Design Perspective of RapidMiner
Operators View
All work steps (operators) available in RapidMiner are presented here in groups and can be included in the current process. You can navigate within the groups easily and browse the operators provided to your heart's content. If RapidMiner has been extended with one of the available extensions, the additional operators can also be found here.
Without extensions you will find at least the following groups of operators in the tree structure:
1. Process Control: Operators such as loops or conditional branches which can control the process flow.
2. Utility: Auxiliary operators which, alongside the operator “Subprocess” for grouping subprocesses, also contain the important macro operators as well as the operators for logging.
3. Repository Access: Contains the two operators for read and write access in repositories.
4. Import: Contains a large number of operators for reading data and objects from external formats such as files, databases etc.
5. Export: Contains a large number of operators for writing data and objects into external formats such as files, databases etc.
6. Data Transformation: Probably the most important group in the analysis in terms of size and relevance. All operators for transforming both data and meta data are located here.
7. Modeling: Contains the actual data mining methods such as classification methods, regression methods, clustering, weightings, methods for association rules, correlation and similarity analyses, as well as operators for applying the generated models to new data sets.
8. Evaluation: Operators with which you can estimate the quality of a model, and thus its expected quality on new data, e.g. cross-validation, bootstrapping etc.
You can select operators within the Operators View and add them in the desired place in the process by simply dragging and dropping.
Repositories View
The repository is a central component of RapidMiner which was introduced in Version 5. It is used to manage and structure your analysis processes into projects and serves at the same time as a source of both data and the associated meta data.
Process View
The Process View shows the individual steps within the analysis process as well as their interconnections.
Inserting Operators
You can insert new operators into the process in several ways:
1. Via drag and drop from the Operators View, as described above,
2. Via double click on an operator in the Operators View,
3. Via the dialog opened by the first icon in the toolbar of the Process View,
4. Via the dialog opened by the menu entry “Edit” - “New Operator...” (CTRL-I),
5. Via the context menu in a free area of the white process area, using the submenu “New Operator” and selecting an operator.
Parameters View
Numerous operators require one or several parameters to be set in order to function correctly. For example, operators that read data from files require the file path. Note that some parameters are only shown when other parameters have a certain value. For example, an absolute number of desired examples can only be set for the operator “Sampling” when “absolute” has been selected as the type of sampling.
Help and Comment View
Each time you select an operator in the Operators View or in the Process View, the help window within the Help View shows a description of this operator. These descriptions include:
1. A short synopsis which summarizes the function of the operator in one or a few sentences,
2. A detailed description of the functionality of the operator,
3. A list of all parameters, including a short description of each parameter, the default value (if available), an indication of whether the parameter is an expert parameter, and an indication of parameter dependencies.
Comment View
Unlike the Help View, the Comment View is not dedicated to pre-defined descriptions but to your own comments on individual steps of the process. Simply select an operator and enter any text about it in the comment field. The text is saved together with your process definition and can be useful for tracing individual steps in the design later on.
Problems and Log View
A further central element and valuable source of help during the design of your analysis processes is the Problems View. Any warnings and error messages are clearly listed in a table here. In the first column, named “Message”, you will find a short summary of the problem. The last column, named “Location”, shows where the problem arises, in the form of the operator name and the name of the input port concerned. A considerable innovation of RapidMiner 5, however, is that it can also suggest solutions for such problems and implement them directly. These solutions are called quick fixes. The second column gives an overview of the possible solutions, either directly as text if there is only one possible solution or as an indication of how many different solutions exist.
Log View
During the design, and in particular during the execution of processes, numerous messages are logged. Particularly in the event of an error, they can indicate how the error can be eliminated by changing the process design. You can copy the text within the Log View as usual and process it further in other applications. You can also save the text to a file, delete the entire contents, or search through the text using the actions in the toolbar.
CHAPTER 3 SYSTEM REQUIREMENTS
Hardware Requirements:
• Processor: Pentium 4
• Memory Size: 1 GB RAM
• Storage: 80 GB Hard Disk
• Display: EGA/VGA Color Monitor, 800x600 pixels resolution, High Color (16 bit)
• Keyboard: Any with minimum required keys
Software Requirements:
• Operating System: Windows XP and above, Linux, Mac
• Java SE 1.6 and above
CHAPTER 4 Data Sentiment Analysis with RapidMiner
Sentiment analysis, or opinion mining, is an application of text analytics that identifies and extracts subjective information from source materials. A basic task in sentiment analysis is classifying the opinion expressed in a document, a sentence, or an entity feature as positive or negative. The example presented here takes a list of movies and classifies the reviews of each as Positive or Negative.
The process is evaluated with precision and recall. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved in a search. In other words, high recall means that an algorithm returned most of the relevant results, and high precision means that it returned more relevant results than irrelevant ones.
At first, both positive and negative reviews of a certain movie are taken. All of the words are stemmed into root words. The words are then stored under their polarity (positive or negative). Both a vector wordlist and a model are created. Then the required list of movies is given as input. The model compares every word from the reviews of the given movies with the words stored earlier under each polarity. A review is classified according to the polarity under which the majority of its words fall. For example, for Django Unchained, the reviews are compared with the vector wordlist created at the beginning. Most of the words fall under the positive polarity, so the outcome is Positive. The same applies to a Negative outcome.
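The definitions of precision and recall given above can be made concrete with a small calculation. The following is a minimal sketch in Python, not part of the RapidMiner process itself; the function name `precision_recall` and the six example labels are assumptions made purely for illustration and are not taken from the movie data used in this report.

```python
def precision_recall(actual, predicted, positive="Positive"):
    """Return (precision, recall) for one class from paired label lists."""
    pairs = list(zip(actual, predicted))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False positives: predicted positive but actually something else.
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    # False negatives: actually positive but predicted something else.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for six reviews (illustrative only).
actual = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Positive", "Negative", "Negative"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three reviews predicted as Positive really are positive, and two of the three truly positive reviews are found, so both values come out at two thirds.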
The first step in implementing this analysis is processing the documents, i.e. extracting the positive and negative reviews of a movie and storing them under different polarities. The model is shown in Figure 1.
Figure 1.
Under the Process Document operator, click on Edit List on the right. Load the positive and negative reviews under the class names "Positive" and "Negative", as shown in Figure 2.
Figure 2.
Within the Process Document operator, nested operations take place: tokenizing the words, filtering the stop words, stemming the words into root words, and keeping only tokens between 4 and 25 characters long, as shown in Figure 3.
Figure 3.
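The four nested steps can be sketched outside RapidMiner as follows. This is a rough illustration in Python under stated assumptions: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, so its output will differ from what the RapidMiner operators produce.

```python
import re

# Tiny illustrative stop-word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "and", "was", "this", "that", "with", "were", "have", "from"}


def naive_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_len=4, max_len=25):
    """Tokenize, drop stop words, stem, then filter tokens by length."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize on non-letters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    tokens = [naive_stem(t) for t in tokens]             # stem to root forms
    return [t for t in tokens if min_len <= len(t) <= max_len]  # length filter


sample_tokens = preprocess("The acting was amazing and the ending surprised everyone")
print(sample_tokens)
```

The order of the steps follows the order named in the text. Because the length filter runs last, a word such as "acting" is dropped once it has been stemmed to a three-letter root.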
Then two operators are used, the Store operator and the Validation operator, as shown in Figure 1. The Store operator is used to write the word vector to a file in a directory of our choosing. The Validation operator performs cross-validation, a standard way to assess the accuracy and validity of a statistical model. The data set is divided into two parts, a training set and a test set. The model is trained on the training set only and its accuracy is evaluated on the test set. This is repeated n times, each time with a different split. Double-click on the Validation operator. There will be two panels, Training and Testing. Under the Training panel, a linear Support Vector Machine (SVM) is used, a popular classifier whose decision function is a linear combination of all the input variables. To test the model, we use the ‘Apply Model’ operator to apply the trained model to our test set. To measure the model accuracy, we use the ‘Performance’ operator. The operations under Validation are shown in Figure 4.
Figure 4.
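The train-and-test cycle performed by the Validation operator can be sketched in a few lines of Python. This is a simplified illustration only: the helper names are hypothetical, the folds are formed by simple interleaving rather than by whatever sampling RapidMiner applies, and a trivial majority-label learner stands in for the linear SVM. It assumes the number of folds does not exceed the number of examples.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(examples, labels, k, train, predict):
    """Average accuracy over k folds for the given train/predict callables."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        # Train on the training part only.
        model = train([examples[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        # Evaluate on the held-out part.
        hits = sum(predict(model, examples[j]) == labels[j] for j in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical stand-in learner: always predicts the most frequent training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


labels = ["Positive"] * 6 + ["Negative"] * 3
examples = list(range(len(labels)))  # placeholder feature values
mean_accuracy = cross_validate(examples, labels, 3, train_majority, predict_majority)
print(f"mean accuracy over 3 folds: {mean_accuracy:.2f}")
```

Passing the learner in as `train` and `predict` callables mirrors the two panels of the Validation operator: whatever is placed in the Training panel is only ever shown the training part of each split.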
Then run the model. The resulting class recall and precision percentages are shown in Figure 5. The model and the vector wordlist are stored in a repository.
Figure 5.
Then retrieve both the model and the vector wordlist from the repository where you stored them earlier. Connect the output of the Retrieve operator for the wordlist to the Process Document operator, as shown in Figure 6. The operations under Process Document are the same as those shown in Figure 3.
Figure 6.
Then click on the Process Document operator and click Edit List on the right. This time I have added a list of 5 movie reviews from the Rotten Tomatoes website, stored in a directory. Assign the class name Unlabeled, as shown in Figure 7.
Figure 7.
The Apply Model operator takes the model from a Retrieve operator and the unlabeled data from Process Document as input and delivers the labeled data at its ‘lab’ port, so connect that to the ‘res’ (results) port. The result is shown in Figure 8. For Les Miserables, the model is 86.4% confident that the reviews are positive and 13.6% confident that they are negative, because the reviews match the wordlist under the positive polarity more strongly than the wordlist under the negative polarity.
Figure 8.
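The majority-of-words explanation used in this chapter can be illustrated with a toy calculation. The sketch below is a deliberate simplification written in Python: it counts how many tokens of a review match a positive or a negative wordlist and reports the shares. The process described here uses a linear SVM, whose confidence values are computed differently, so this is not how the 86.4% figure was obtained. The wordlists, tokens, and the function name `polarity_shares` are hypothetical.

```python
def polarity_shares(tokens, positive_words, negative_words):
    """Return (label, positive_share, negative_share) from wordlist matches."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    matched = pos + neg
    if matched == 0:
        # No token matched either wordlist, so no decision can be made.
        return None, 0.0, 0.0
    pos_share, neg_share = pos / matched, neg / matched
    label = "Positive" if pos_share >= neg_share else "Negative"
    return label, pos_share, neg_share


# Hypothetical wordlists and review tokens (illustrative only).
positive_words = {"brilliant", "moving", "superb"}
negative_words = {"dull", "overlong"}
review_tokens = ["brilliant", "moving", "overlong", "superb", "cast"]

label, pos_share, neg_share = polarity_shares(review_tokens, positive_words, negative_words)
print(label, f"{pos_share:.1%}", f"{neg_share:.1%}")
```

Tokens that appear in neither wordlist, such as "cast" here, do not influence the outcome in this simplified rule.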
CHAPTER 5 COMPARISON
Procedure
KNIME
RapidMiner
Weka
TANAGRA
Pass (but limited partitioning methods)
Pass (but limited partitioning methods)
Pass (but limited partitioning methods)
Pass (but limited partitioning methods)
Pass
Fail (cannot save parameters for scaling to apply to future datasets)
Fail (cannot save parameters for scaling to apply to future datasets)
Fail (no scaling methods)
Fail (no wrapper methods)
Descriptor selection
Pass
Fail (wrapper methods valid for logistic regression only)
Fail (no wrapper methods)
Pass (but is not part of KnowledgeFlow)
Parameter optimization of machine learning/statistical methods
Pass
Fail (not automatic)
Fail (not automatic)
Fail (not automatic)
Pass
Pass (but cannot save model so have to rebuild model for every future dataset)
Fail (cannot validate independent validation set)
Pass (but cannot save model so have to rebuild model for every future dataset)
Pass (but limited partitioning methods)
Partitioning of dataset into training and testing sets
Descriptor scaling
Pass
Fail (not automatic)
Model validation using cross-validation and/or independent validation set
Pass (but limited error measurement methods)
Orange
Table 1.
CHAPTER 6 ADVANTAGES AND DISADVANTAGES
Advantages
• The free version offers enough functionality for a small business to do without big-name commercial products
• It is a quality tool, as shown by its ranking among commercial products
• The GUI is very user friendly; it is used to build data mining processes from operators, which are stored as XML files
• XML standardization makes it easy to use various data sources
• Ease of use and available tutorials
• Works on any operating system
Disadvantages
• Some options are not available in the free product, but you can upgrade
• Possibly less customer service available for the free version
• There can be some restrictions on customized use
• Beginners may face some difficulty in understanding the tool