Tutoriels en anglais

Tanagra website statistics for 2017

2018-01-03 (Tag:7277733514764790930)

The year 2017 ends, 2018 begins. I wish you all a very happy year 2018.

A small statistical report on the website statistics for 2017. All sites (Tanagra, course materials, e-books, tutorials) have been visited 222,293 times this year, i.e. 609 visits per day.

Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 2,333,371 visits (644 daily visits).

Who are you? The majority of visits come from France and the Maghreb. Then comes a large share of French-speaking countries, notably because some pages are exclusively in French. Among the non-francophone countries, we observe mainly the United States, India, the UK, Germany, ...

39 new course materials and tutorials were posted online this year: 18 in French, 21 in English.

The pages containing course materials about Data Science and Programming (R and Python) are the most popular ones. This is not really surprising.

Happy New Year 2018 to all.

Ricco.
Slideshow: Website statistics for 2017


Sparse data file format

2018-01-02 (Tag:2004554045513585992)

The data to be processed with machine learning algorithms are increasing in size. This is especially true when we need to process unstructured data. The data preparation (e.g. the use of a bag-of-words representation in text mining) leads to the creation of large data tables where, often, the number of columns (descriptors) is higher than the number of rows (observations), with the peculiarity that the table contains many zero values. In this context, storing all these zero values in the data file is not appropriate. A data compression strategy without loss of information must be implemented, and it must remain simple enough for the file to stay readable with a text editor.

In this tutorial, we describe the use of the sparse data file format handled by Tanagra (from version 1.4.4). It is based on the file format processed by famous machine learning libraries (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, well known in data mining. We will observe that using this kind of sparse format makes it possible to reduce the data file size dramatically.
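
For readers who want to see how such a sparse file can be read outside Tanagra, here is a minimal Python sketch using scikit-learn's load_svmlight_file. The file name "sample.svmlight" is hypothetical, not the Reuters file of the tutorial.

# Minimal sketch: reading a svmlight/libsvm sparse file in Python. Each row is
# "label index:value index:value ..." (e.g. "1 3:0.5 12:2.0"); zero values are simply not stored.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("sample.svmlight")   # hypothetical file name
print(X.shape)    # X is a scipy.sparse CSR matrix
print(X.nnz)      # number of non-zero values actually stored
print(y[:10])     # class labels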

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Components: VIEW DATASET, CONT TO DISC, UNIVARIATE DISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, "SVMlight: Support Vector Machine".
UCI Repository, "Reuters-21578 Text Categorization Collection".


Configuration of a multilayer perceptron

2017-12-29 (Tag:890217977981619790)

The multilayer perceptron is one of the most popular neural network approaches for supervised learning. It can be very effective if we know how to determine the right number of neurons in the hidden layers.

In this tutorial, we will try to explain the role of neurons in the hidden layer of the multilayer perceptron (when we have one hidden layer). Using an artificial toy dataset, we show the behavior of the classifier when we modify the number of neurons.

We work with Tanagra in a first step. Then, we use R (nnet package) to create a program that automatically determines the right number of neurons in the hidden layer.
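
To fix ideas, here is a rough Python sketch of the same idea (not the R/nnet program of the tutorial): the number of hidden neurons is selected by cross-validation on an artificial two-dimensional dataset.

# Rough sketch: choosing the number of hidden neurons by cross-validation (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)   # artificial 2D dataset

for h in [1, 2, 3, 5, 10, 20]:
    clf = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=1)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print("neurons = %2d, CV accuracy = %.3f" % (h, acc))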

Keywords: neural network, perceptron, multilayer perceptron, MLP
Components: MULTILAYER PERCEPTRON, FORMULA
Tutorial: Configuration of a MLP
Dataset: artificial2d.zip
References:
Tanagra Tutorials, "Single layer and multilayer perceptron (slides)", September 2014.
Tanagra Tutorials, "Multilayer perceptron - Software comparison", November 2008.


CDF and PPF in Excel, R and Python

2017-10-25 (Tag:6479019453967541775)

This tutorial shows how to compute the cumulative distribution functions and the percent point functions of various commonly used distributions in Excel, R and Python.

I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my courses about data mining and statistics at the University. Often, I ask students to perform hypothesis tests or to calculate confidence intervals, etc.

Since we work on computers, it is obviously out of the question to use statistical tables to obtain the quantiles or p-values of the commonly used distributions. In this tutorial, I present the main functions for the normal distribution, Student's t-distribution, the chi-squared distribution and the Fisher-Snedecor distribution. I realized that students sometimes find it difficult to relate the reading of statistical tables to the corresponding functions, which they struggle to identify in the software. It is also an opportunity for us to verify the equivalences between the functions proposed by Excel, R (stats package) and Python (scipy package). Whew! At least on the few illustrative examples given in our document, the results are consistent.
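
For the Python side, here are a few of the scipy.stats calls in question; cdf() returns the cumulative probability, ppf() the quantile (percent point function). The values in the comments are approximate.

# CDF and PPF of the usual distributions with scipy.stats.
from scipy.stats import norm, t, chi2, f

print(norm.cdf(1.96))               # P(N(0,1) <= 1.96), about 0.975
print(norm.ppf(0.975))              # quantile of the standard normal, about 1.96
print(t.ppf(0.95, df=30))           # Student's t quantile, 30 degrees of freedom
print(chi2.cdf(3.84, df=1))         # chi-squared cdf, about 0.95
print(f.ppf(0.95, dfn=2, dfd=10))   # Fisher-Snedecor quantile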

Keywords: excel, r, stats package, python, scipy package, p-value, quantile, cdf, cumulative distribution function, ppf, percent point function, quantile function
Tutorial: CDF and PPF


The "compiler" package for R

2017-10-18 (Tag:5196526638006581625)

It is widely agreed that R is not a fast language, notably because it is an interpreted language. To overcome this issue, some solutions exist that allow us to compile functions written in R. The gains in computation time can be considerable. But it depends on our ability to write code that can benefit from these tools.

In this tutorial, we study the efficiency of Luke Tierney's “compiler” package, which is provided in the base distribution of R. We program two standard data analysis treatments, (1) with and (2) without loops: the scaling of the variables of a data frame; the calculation of a correlation matrix by matrix product. We compare the efficiency of the non-compiled and compiled versions of these functions.

We observe that the gain of the compiled version is dramatic for the variant with loops, but negligible for the second variant. We note also that, in the R 3.4.2 version used here, it is not necessary to compile explicitly the functions containing loops because a JIT (just-in-time compilation) mechanism already ensures maximal performance for our code.

Keywords: package compiler, cmpfun, byte code, package rbenchmark, benchmark, JIT, just in time
Tutorial: en_Tanagra_R_compiler_package.pdf
Program: compilation_r.zip
References:
Luke Tierney, "A Byte Code Compiler for R", Department of Statistics and Actuarial Science, University of Iowa, March 30, 2012.
Package 'compiler' - "Byte Code Compiler"


Regression analysis in Python

2017-10-09 (Tag:1337581716368637229)

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

In this tutorial, we will try to identify the potentialities of StatsModels by conducting a case study in multiple linear regression. We will discuss: the estimation of the model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, and the calculation of the prediction interval for a new instance.
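
To give the flavor of the API, here is a minimal sketch of an OLS regression with statsmodels; the file name and variable names ("cars.csv", consumption, weight, horsepower) are hypothetical, not those of the dataset used in the tutorial.

# Minimal OLS sketch with the statsmodels formula API.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cars.csv")                                   # hypothetical data file
model = smf.ols("consumption ~ weight + horsepower", data=df).fit()
print(model.summary())                                         # coefficients, tests, R2, etc.
print(model.get_prediction(df.iloc[:1]).summary_frame(alpha=0.05))   # prediction interval for one instance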

Keywords: regression, statsmodels, pandas, matplotlib
Tutorial: en_Tanagra_Python_StatsModels.pdf
Dataset and program: en_python_statsmodels.zip
References:
StatsModels: Statistics in Python


Document classification in Python

2017-10-05 (Tag:7622424874300862006)

The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. The originality lies in the nature of the input attribute, which is a textual document. It is not possible to apply predictive methods directly; a data preparation phase is necessary.

In this tutorial, we will describe a text categorization process in Python using mainly the text mining capabilities of the scikit-learn package, which also provides the data mining method used here (logistic regression). We want to classify SMS messages as "spam" (malicious) or "ham" (legitimate). We use the “SMS Spam Collection v.1” dataset.
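
A compact sketch of this kind of process is given below; it assumes the raw SMS Spam Collection file is tab-separated with the label in the first column and the message in the second, which may differ from the exact file layout used in the tutorial.

# Bag-of-words + logistic regression sketch with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

sms = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])
X_train, X_test, y_train, y_test = train_test_split(
    sms["message"], sms["label"], test_size=0.3, random_state=1)

vec = TfidfVectorizer()                          # bag-of-words / tf-idf representation
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
pred = clf.predict(vec.transform(X_test))
print(classification_report(y_test, pred))       # precision, recall, F1-score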

Keywords: text mining, document categorization, corpus, bag of words, f1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit learn, python
Tutorial: Spam identification
Dataset: Corpus and Python program
References:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, "A. Contributions to the Study of SMS Spam Filtering: New Collection and Results", in Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.


SVM: Support Vector Machine in R and Python

2017-09-28 (Tag:1201163151736813140)

This tutorial completes the course material devoted to the Support Vector Machine approach (SVM).

It highlights two important dimensions of the method: the position of the support points and the definition of the decision boundaries in the representation space when we construct a linear separator; and the difficulty of determining the “best” values of the parameters for a given problem.

We will use R (“e1071” package) and Python (“scikit-learn” package).
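
On the Python side, a short sketch of the parameter issue mentioned above: the C and gamma parameters of an RBF SVM are searched by cross-validation on a toy nonlinear dataset (illustrative grid, not the one of the tutorial).

# Tuning an SVM with GridSearchCV (scikit-learn).
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=1)    # toy nonlinear data
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
print(grid.best_estimator_.support_vectors_.shape)    # number of support points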

Keywords: svm, e1071 package, R software, Python software, scikit-learn package, sklearn
Tutorial: SVM - Support Vector Machine
Dataset and programs: svm_r_python.zip
References:
Tanagra Tutorial, "Support Vector Machine", May 2017.
Tanagra Tutorial, "Implementing SVM on large dataset", July 2009.


Association rule learning with ARS

2017-09-11 (Tag:8700823904090274821)

SIPINA is known for its decision tree induction algorithms. In fact, the distribution includes two other tools that are little known to the public: REGRESS, which is specialized in multiple linear regression and which we described in one of our tutorials; and an association rule extraction tool, simply called Association Rule Software (ARS).

In this tutorial, I describe the use of the ARS tool. Its interactivity with the Excel spreadsheet is its main advantage. We launch the software from Excel using the “sipina.xla” add-in. We can easily retrieve the rules in the spreadsheet. Then, we can explore them (the mined rules) using Excel's data handling capabilities. The ability to filter and sort the rules according to different criteria is a great help in detecting interesting rules. This is a very important aspect because the profusion of rules can quickly confuse the data miner.

Keywords: ARS, association rule software, excel spreadsheet, filtering and sorting rules, interestingness measures
Components: ASSOCIATION RULE SOFTWARE
Tutorial: en_Tanagra_Association_Sipina.pdf
Dataset: market_basket.zip
References:
Tanagra Tutorial, "Association rule learning (slides)", August 2014.


Linear classifiers

2017-08-25 (Tag:6941480369418642202)

In this tutorial, we study the behavior of 5 linear classifiers on artificial data. Linear models are often the baseline approaches in supervised learning. Indeed, based on a simple linear combination of predictive variables, they have the advantage of simplicity: the reading of the influence of each descriptor is relatively easy (signs and values of the coefficients); learning techniques are often (not always) fast, even on very large databases. We are interested in: (1) the naive bayes classifier; (2) the linear discriminant analysis; (3) the logistic regression; (4) the perceptron (single-layer perceptron); (5) the support vector machine (linear SVM).

The experiment was conducted under R. The source code accompanies this document. My idea, besides the theme of the linear classifiers that concerns us here, is also to describe the different stages of setting up an experiment for the comparison of learning techniques. In addition, we also show the results provided by the linear approaches implemented in various tools such as Tanagra, Knime, Orange, Weka and RapidMiner.
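
The experiment of the tutorial is written in R; the sketch below is only a rough Python transcription of the idea, fitting the same five classifiers on the same artificial data and comparing their test accuracies.

# Comparing several linear classifiers on a common train/test split (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {"naive bayes": GaussianNB(),
          "linear discriminant analysis": LinearDiscriminantAnalysis(),
          "logistic regression": LogisticRegression(max_iter=1000),
          "perceptron": Perceptron(),
          "linear SVM": LinearSVC(max_iter=5000)}

for name, clf in models.items():
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print("%s: test accuracy = %.3f" % (name, acc))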

Keywords: linear classifier, naive bayes, linear discriminant analysis, logistic regression, perceptron, neural network, linear svm, support vector machine, decision tree, rpart, random forest, k-nn, nearest neighbors, e1071 package, nnet package, rf package, class package
Components: NAIVE BAYES CONTINUOUS, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, MULTILAYER PERCEPTRON, SVM
Tutorial: en_Tanagra_Linear_Classifier.pdf
Programs and dataset: linear_classifier.zip
References:
Wikipedia, "Linear Classifier".


Discriminant analysis and linear regression

2017-08-18 (Tag:7562189820002276064)

Linear discriminant analysis and linear regression are both supervised learning techniques. But the first one is related to classification problems, i.e. the target attribute is categorical, while the second one is used for regression problems, i.e. the target attribute is continuous (numeric).

However, there are strong connections between these approaches when we deal with a binary target attribute. From a practical example, we describe the connections between the two approaches in this case. We detail the formulas for obtaining the coefficients of discriminant analysis from those of linear regression.

We perform the calculations under Tanagra and R.
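
As a quick numerical check of this connection (a sketch only, not the formulas detailed in the tutorial): for a binary target coded 0/1, the OLS coefficients and the LDA discriminant coefficients should be (approximately) proportional.

# Comparing OLS and LDA coefficients on a binary target (scikit-learn, toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=1)

ols = LinearRegression().fit(X, y)                 # regression on the 0/1 target
lda = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)

print(np.round(ols.coef_, 4))
print(np.round(lda.coef_[0], 4))
print(np.round(lda.coef_[0] / ols.coef_, 4))       # the ratios should be (nearly) constant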

Keywords: linear discriminant analysis, predictive discriminant analysis, multiple linear regression, wilks' lambda, mahalanobis distance, score function, linear classifier, sas, proc discrim, proc stepdisc
Components: LINEAR DISCRIMINANT ANALYSIS, MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_LDA_and_Regression.pdf
Programs and dataset: lda_regression.zip
References:
C.J. Huberty, S. Olejnik, "Applied MANOVA and Discriminant Analysis", Wiley, 2006.
R. Tomassone, M. Danzart, J.J. Daudin, J.P. Masson, "Discrimination et Classement", Masson, 1988.


Gradient boosting with R and Python

2017-08-11 (Tag:3868853721025357437)

This tutorial follows the course material devoted to “Gradient Boosting”, to which we refer constantly in this document. It also complements the course materials and tutorials for the Bagging, Random Forest and Boosting approaches (see References).

The thread will be basic: after importing the data, which are split into two data files (learning and testing) in advance, we build predictive models and evaluate them. The test error rate criterion is used to compare the performance of the various classifiers.

The question of the parameters, particularly sensitive in the context of gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, even if we can guess the paths to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and setting the right values is difficult, especially because the various parameters can interact with each other. Here, more than for other machine learning methods, the trial-and-error strategy takes on a lot of importance.

We use R and Python with their appropriate packages.
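
For the Python part, a small sketch of this kind of parameter search with scikit-learn is given below; the grid is illustrative, not the one explored in the tutorial.

# Grid search over gradient boosting parameters (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

param_grid = {"n_estimators": [100, 300],
              "learning_rate": [0.05, 0.1, 0.3],
              "max_depth": [2, 3, 5]}
search = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print("test error rate = %.3f" % (1.0 - search.best_estimator_.score(X_test, y_test)))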

Keywords: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest
Tutorial: Gradient boosting
Programs and datasets: gradient_boosting.zip
References:
Tanagra tutorial, "Gradient boosting - Slides", June 2016.
Tanagra tutorial, "Bagging, Random Forest, Boosting - Slides", December 2015.
Tanagra tutorial, "Random Forest and Boosting with R and Python", December 2015.


Statistical analysis with Gnumeric

2017-08-04 (Tag:4890688456419529549)

The spreadsheet is a valuable tool for data scientists. This is what the annual KDnuggets polls have revealed in recent years, where the Excel spreadsheet is always well placed. In France, this popularity is largely confirmed by its almost systematic presence in job postings related to data processing (statistics, data mining, data science, big data/data analytics, etc.). Excel is specifically mentioned, but this success must be viewed as an acknowledgment of the skills and capabilities of spreadsheet tools in general.

This tutorial is devoted to the Gnumeric Spreadsheet 1.12.12. It has interesting features: the setup and installation programs are small because it is not part of an office suite; it is fast and lightweight; it is dedicated to numerical computation and natively incorporates a "Statistics" menu with the common statistical procedures (parametric tests, non-parametric tests, regression, principal component analysis, etc.); and it seems more accurate than some popular spreadsheet programs. These last two points caught my attention and convinced me to study it in more detail. In the following, we make a quick overview of Gnumeric's statistical procedures. Whenever possible, we compare the results with those of Tanagra 1.4.50.

Keywords: gnumeric, spreadsheet, descriptive statistics, principal component analysis, pca, multiple linear regression, wilcoxon signed rank test, welch test unequal variance, mann and whitney, analysis of variance, anova
Tanagra components: MORE UNIVARIATE CONT STAT, PRINCIPAL COMPONENT ANALYSIS, MULTIPLE LINEAR REGRESSION, WILCOXON SIGNED RANKS TEST, T-TEST UNEQUAL VARIANCE, MANN-WHITNEY COMPARISON, ONE-WAY ANOVA
Tutorial: en_Tanagra_Gnumeric.pdf
Dataset: credit_approval.zip
References:
Gnumeric, "The Gnumeric Manual, version 1.12".


Failure resolved

2017-08-02 (Tag:6146953071795307963)

Hi,

It seems that the failure has been resolved since yesterday (August 1st, 2017).

Again, sorry for the inconvenience. I hope that the continuity of service will be ensured throughout the summer.

Kind regards,

Ricco (August 2nd, 2017).


File server outage

2017-07-27 (Tag:8699855444328773935)

For a few days now (since approximately 07/24/2017), the server of the Eric laboratory that hosts the Tanagra project files (software, books, course materials, tutorials...) has been down. After a power outage, there is nobody to restart the server during the summer period. And the server is located in a room to which I do not have access.

So we wait. And it will take a little time: the summer break lasts a month, and our University (and Lab) officially reopens on August 21st! I am sorry for the users who work from the documents that I put online. This difficulty is totally beyond my control and I cannot do anything about it.

Some internet users have reported the problem to me. I am taking the initiative to inform you. As soon as the situation is back in order, I will let you know.

Kind regards,

Ricco (July 27th, 2017).


Interpreting cluster analysis results

2017-07-22 (Tag:1029144818975973319)

Interpretation of the clustering structure and of the clusters is an essential step in unsupervised learning. Identifying the characteristics that underlie the differentiation between groups allows us to ensure their credibility.

In this course material, we explore univariate and multivariate techniques. The former have the merit of ease of calculation and reading, but do not take into account the joint effect of the variables. The latter are a priori more powerful, but require additional expertise to fully understand the results.
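
As an illustration of the univariate tools, here is a small sketch of the test value (v-test) for a quantitative variable in one cluster, using the usual formula v = (cluster mean - overall mean) / sqrt(((N - n_k)/(N - 1)) * s^2 / n_k), where s^2 is the overall variance; the toy data are invented for the example.

# Sketch of the "test value" (v-test) computation for one cluster.
import numpy as np

def test_value(x, cluster, k):
    x = np.asarray(x, dtype=float)
    in_k = (np.asarray(cluster) == k)
    N, n_k = x.size, in_k.sum()
    s2 = x.var()                                  # overall variance (ddof=0)
    return (x[in_k].mean() - x.mean()) / np.sqrt((N - n_k) / (N - 1.0) * s2 / n_k)

x = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]                # toy values
cluster = [1, 1, 1, 2, 2, 2]                      # toy cluster membership
print(test_value(x, cluster, 2))                  # large positive value: cluster 2 has high values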

Keywords: cluster analysis, clustering, unsupervised learning, percentage of variance explained, V-Test, test value, distance between centroids, correlation ratio
Slides: Characterizing the clusters
References:
Tanagra Tutorial, "Understanding the 'test value' criterion", May 2009.
Tanagra Tutorial, "Hierarchical agglomerative clustering", June 2017.
Tanagra Tutorial, "K-Means clustering", June 2017.


Kohonen map with R

2017-07-14 (Tag:3303603523797285689)

This tutorial complements the course material concerning the Kohonen map or self-organizing map (June 2017). First, we try to highlight two important aspects of the approach: its ability to summarize the available information in a two-dimensional space; and its combination with a cluster analysis method, which associates the topological representation (and the reading that one can make of it) with the interpretation of the groups obtained from the clustering algorithm. We use the R software and the “kohonen” package (Wehrens and Buydens, 2007). Second, we carry out a comparative study of the quality of the partitioning with the one obtained with the K-means algorithm. We use an external evaluation, i.e. we compare the clustering results with pre-established classes. This procedure is often used in research to evaluate the performance of clustering methods. It takes on its full meaning when it is applied to artificial data where the true class membership is known. We use the K-Means and Kohonen-Som components of Tanagra.

This tutorial is based on Shane Lynn's article on the R-bloggers website (Lynn, 2014). I completed it by introducing the intermediate calculations, to better understand the meaning of the charts, and by conducting the comparative study.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package, k-means, external evaluation, heatmaps
Components: KOHONEN-SOM
Tutorial: Kohonen map with R
Program and dataset: waveform - som
References:
Tanagra tutorial, "Self-organizing map (slides)", June 2017.
Tanagra Tutorial, "Self-organizing map (with Tanagra)", July 2009.


Cluster analysis with Python - HAC and K-Means

2017-07-08 (Tag:4225442780229963653)

This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). The aim is to determine groups of homogeneous cheeses in view of their properties. We inspect and test two approaches using two Python procedures: the hierarchical agglomerative clustering algorithm (SciPy package) and the K-Means algorithm (scikit-learn package).

One of the interests of this tutorial is that we previously conducted the same analysis with R, following the same steps. We can compare the commands used and the results provided by the available procedures. We observe that these tools have comparable behaviors and are substitutable in this context.
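
A condensed sketch of the two approaches is shown below. It assumes "fromage.txt" is tab-separated with the cheese names in the first column, and it standardizes the variables, which is a common choice but not necessarily the exact sequence of steps followed in the tutorial.

# HAC (SciPy) and K-Means (scikit-learn) on the cheese dataset.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

fromage = pd.read_csv("fromage.txt", sep="\t", index_col=0)   # 29 cheeses, 9 variables
Z = StandardScaler().fit_transform(fromage)                   # standardized variables

tree = linkage(Z, method="ward")                              # hierarchical clustering (Ward)
groups_hac = fcluster(tree, t=4, criterion="maxclust")        # cut into 4 clusters

groups_km = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(Z)   # K-Means

print(pd.crosstab(groups_hac, groups_km))                     # compare the two partitions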

Keywords: python, scipy, scikit-learn, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, principal component analysis, PCA
Tutorial: hac and k-means with Python
Dataset and source code: hac_kmeans_with_python.zip
References:
Marie Chavent, Teaching Page, University of Bordeaux.
Tanagra Tutorials, "Cluster analysis with R - HAC and K-Means", July 2017.


Cluster analysis with R - HAC and K-Means

2017-07-06 (Tag:4659275108429397791)

This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). The aim is to determine groups of homogeneous cheeses in view of their properties.

We inspect and test two approaches using two procedures of the R software: the hierarchical agglomerative clustering algorithm (hclust) and the K-Means algorithm (kmeans).

The data file "fromage.txt" comes from the teaching page of Marie Chavent from the University of Bordeaux. The excellent course materials and corrected exercises (commented R code) available on its website will complete this tutorial, which is intended firstly as a simple guide for the introduction of the R software in the context of the cluster analysis.

Keywords: R software, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, fpc package, principal component analysis, PCA
Components: hclust, kmeans, kmeansruns
Tutorial: hac and k-means with R
Dataset and source code: hac_kmeans_with_r.zip
References:
Marie Chavent, Teaching Page, University of Bordeaux.


k-medoids clustering (slides)

2017-07-03 (Tag:5677495135905999822)

K-medoids is a partitioning-based clustering algorithm. It is related to k-means but, instead of using the centroid as the reference data point of a cluster, we use the medoid, which is the individual nearest to all the other points within its cluster. One of the main consequences of this approach is that the resulting partition is less sensitive to outliers.

This course material describes the algorithm. Then, we focus on the silhouette tool which can be used to determine the right number of clusters, a recurring open problem in cluster analysis.
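
As a small illustration of the silhouette criterion, here is a Python sketch that scans the number of clusters; k-means is used as a stand-in for PAM, which is not part of scikit-learn's core distribution.

# Choosing the number of clusters with the mean silhouette width (scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)   # toy data with 4 groups

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print("k = %d, mean silhouette = %.3f" % (k, silhouette_score(X, labels)))
# the k maximizing the mean silhouette width is a reasonable choice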

Keywords: cluster analysis, clustering, unsupervised learning, partitioning method, relocation approach, medoid, PAM, partitioning around medoids, CLARA, clustering large applications, silhouette, silhouette plot
Slides: Cluster analysis - k-medoids algorithm
References:
Wikipedia, "k-medoids".


k-means clustering (slides)

2017-06-20 (Tag:8754699310714352180)

K-Means clustering is a popular cluster analysis method. It is simple and its implementation does not require keeping the entire dataset in memory, which makes it possible to process very large databases.

This course material describes the algorithm. We focus on the different extensions such as the processing of qualitative or mixed variables, fuzzy c-means, and clustering of variables (clustering around latent variables). We note that the k-means method is relatively adaptable and can be applied to a wide range of problems.

Keywords: cluster analysis, clustering, unsupervised learning, partition method, relocation
Slides: K-Means clustering
References:
Wikipedia, "k-means clustering".
Wikipedia, "Fuzzy clustering".


Self-Organizing Map (slides)

2017-06-13 (Tag:1766453588608808002)

A self-organizing map (SOM) or Kohonen network or Kohonen map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, which preserves the topological properties of the input space (Wikipedia).

SOM is useful for dimensionality reduction, data visualization and cluster analysis. In this course material, we outline the mechanisms underlying the approach. We focus on its practical aspects (e.g. the various visualization possibilities, prediction for a new instance, extension of SOM to the clustering task, ...).

Illustrative examples in R (kohonen package) and Tanagra are briefly presented.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package
Components: KOHONEN-SOM
Slides: Kohonen SOM
References:
Wikipedia, "Self-organizing map".


Hierarchical agglomerative clustering (slides)

2017-06-10 (Tag:5968170649431258184)

In data mining, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) (Wikipedia).

In this course material, we focus on hierarchical agglomerative clustering (HAC). Starting from the individuals, which initially each form their own group, the algorithm merges the groups in a bottom-up fashion until all the instances are gathered in a single group. The process is materialized by a dendrogram, which allows us to evaluate the nature of the solution and helps to determine the appropriate number of clusters.

Examples of analysis under R, Python and Tanagra are described.

Keywords: hac, cluster analysis, clustering, unsupervised learning, tandem analysis, two-step clustering, R software, hclust, python, scipy package
Components: HAC, K-MEANS
Slides: cah.pdf
References:
Wikipedia, "Cluster analysis".
Wikipedia, "Hierarchical clustering".


Support vector machine (slides)

2017-05-20 (Tag:77039719047140551)

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis (Wikipedia).

These slides show the background of the approach in the classification context. We address the binary classification problem, the soft-margin principle, the construction of the nonlinear classifiers by means of the kernel functions, the feature selection process, the multiclass SVM.

The presentation is complemented by the implementation of the approach with the open source tools Python (scikit-learn), R (e1071) and Tanagra (SVM and C-SVC).

Keywords: svm, e1071 package, R software, Python, scikit-learn package, sklearn
Components: SVM, C-SVC
Slides: Support Vector Machine (SVM)
Dataset: svm exemples.xlsx
References:
Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010.


Tanagra website statistics for 2016

2017-01-05 (Tag:4508013158364414048)

The year 2016 ends, 2017 begins. I wish you all a very happy year 2017.

A small statistical report on the website statistics for 2016. All sites (Tanagra, course materials, e-books, tutorials) have been visited 264,045 times this year, i.e. 721 visits per day.

Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 2,111,078 visits (649 daily visits).

Who are you? The majority of visits come from France and the Maghreb. Then comes a large share of French-speaking countries, notably because some pages are exclusively in French. Among the non-francophone countries, we observe mainly the United States, India, the UK, Brazil, Germany, ...

The pages containing course materials about Data Mining and R Programming are the most popular ones. This is not really surprising.

Happy New Year 2017 to all.

Ricco.
Slideshow: Website statistics for 2016


Text mining - Document classification

2016-09-17 (Tag:5623761064048193888)

The statistical approach of the "text mining" consists in to transform a collection of text documents in a matrix of numeric values on which we can apply machine learning algorithms.

The "unstructured document" designation is often used when one talks about text documents. This does not mean that he does not have a certain organization (titles, chapters, paragraphs, questions and answers, etc.). It shows first of all that we cannot express directly the collection in the form of a data table that is usually handled in data mining. To obtain this kind of data representation, a preprocessing phase is needed, then we extract relevant features to define the data table. These steps can influence heavily the relevance of the results.

In this tutorial, I take an exercise that I conduct with my students for my text mining course at the University. We perform the whole analysis under R with dedicated text mining packages such as “XML” or “tm”. The aim here is to perform exactly the same study using other tools such as Knime 2.9.1 or RapidMiner 5.3 (note: these were the versions available when I wrote the French version of this tutorial in April 2014). We will see that these tools provide specialized libraries which make it possible to perform a statistical text mining process efficiently.

Keywords: text mining, document classification, text categorization, decision tree, j48, linear svm, reuters collection, XML format, stemming, stopwords, document-term matrix
Tutorial: en_Tanagra_Text_Mining.pdf
Dataset: text_mining_tutorial.zip
References:
Wikipedia, "Document classification".
S. Weiss, N. Indurkhya, T. Zhang, "Fundamentals of Predictive Text Mining", Springer, 2010.


Image classification with Knime

2016-06-25 (Tag:6717376866857603908)

The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to automatically assign a label to an image based on its visual content. The whole process is identical to the standard data mining process. We learn a classifier from a set of labeled images. Then, we can apply the classifier to a new image in order to predict its class membership. The particularity is that we must extract a vector of numerical features from the image before launching the machine learning algorithm, and before applying the classifier in the deployment phase.

We deal with an image classification task in this tutorial. The goal is to detect automatically the images which contain a car. The main result is that, even though I have only basic knowledge of image processing, I can conduct the analysis with an ease which is symptomatic of the usability of Knime in this context.

Keywords: image mining, image classification, image processing, feature extraction, decision tree, random forest, knime
Tutorial: en_Tanagra_Image_Mining_Knime.pdf
Dataset and program (Knime archive): image mining tutorial
References:
Knime Image Processing, https://tech.knime.org/community/image-processing
S. Agarwal, A. Awan, D. Roth, « UIUC Image Database for Car Detection » ; https://cogcomp.cs.illinois.edu/Data/Car/


Gradient boosting (slides)

2016-06-19 (Tag:6349915657908667406)

The "gradient boosting" is an ensemble method that generalizes boosting by providing the opportunity of use other loss functions ("standard" boosting uses implicitly an exponential loss function).

These slides show the ins and outs of the method. Gradient boosting for regression is detailed initially. The classification problem is presented thereafter.

The solutions implemented in the packages for R and Python are studied.

Keywords: boosting, regression tree, package gbm, package mboost, package xgboost, R, Python, package scikit-learn, sklearn
Slides: Gradient Boosting
References:
R. Rakotomalala, "Bagging, Random Forest, Boosting", December 2015.
Natekin A., Knoll A., "Gradient boosting machines, a tutorial", in Frontiers in Neurorobotics, December 2013.


Tanagra and Sipina add-ins for Excel 2016

2016-06-13 (Tag:2872695640292050458)

The add-ins “tanagra.xla” and “sipina.xla” contribute greatly to the popularity of the Tanagra and Sipina software applications. They incorporate menus dedicated to data mining in Excel. They implement a simple bridge between the data in the spreadsheet and Tanagra or Sipina.

I developed and tested the latest versions of the add-ins for Excel 2007 and 2010. I recently had access to Excel 2016 and checked the add-ins. The conclusion is that the tools work without a hitch.

Keywords: data importation, excel data file, add-in, add-on, xls, xlsx
Tutorial: en_Tanagra_Add_In_Excel_2016.pdf
References:
Tanagra, "Tanagra add-in for Excel 2007 and 2010", August 2010.
Tanagra, "Sipina add-in for Excel 2007 and 2010", June 2016.


Sipina add-in for Excel 2007 and 2010

2016-06-12 (Tag:7908681530450891648)

SIPINA is a Data Mining Software which implements various supervised learning paradigms. This is an old tool but it is still used because this is the only free tool which provides fully functional interactive decision tree capabilities.

This tutorial briefly describes the installation and the use of the "sipina.xla" add-in in Excel 2007. The approach is easily generalized to Excel 2010. A similar document exists for Tanagra. It nevertheless seemed necessary to me to clarify the procedure, especially because several users have asked for it. Other tutorials exist for earlier versions of Excel (1997-2003) and for Calc (LibreOffice and OpenOffice).

A new tutorial will come soon, showing that the add-in also operates properly under Excel 2016.

Keywords: data importation, excel data file, add-in, add-on, xls, xlsx
Tutorial: en_sipina_excel_addin.pdf
Dataset: heart.xls
References:
Tanagra, "Tanagra add-in for Office 2007 and Office 2010", august 2010.
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.


Categorical predictors in logistic regression

2016-04-03 (Tag:2542102591011369401)

The aim of logistic regression is to build a model for predicting a binary target attribute from a set of explanatory variables (predictors, independent variables), which may be numeric or categorical. They are used as such when they are numeric. We must recode them when they are categorical. Dummy coding is undeniably the most popular approach in this context.

The situation becomes more complicated when we perform feature selection. The idea is to determine the predictors that contribute significantly to the explanation of the target attribute. There is no problem when we consider a numeric variable: it is either excluded from or kept in the model. But how should we proceed when we handle a categorical explanatory variable? Should we treat the dichotomous variables associated with a categorical predictor as a whole, to be excluded or included in the model together? Or should we treat each dichotomous variable independently? How should we interpret the coefficients of the selected dichotomous variables in that case?

In this tutorial, we study the approaches proposed by various tools: R 3.1.2, SAS 9.3, Tanagra 1.4.50 and SPAD 8.0. We will see that the feature selection algorithms rely on criteria that are specific to each software. We will also see that they use different approaches in the presence of categorical predictor variables.
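
For readers more familiar with Python, here is a minimal sketch of the dummy coding step with pandas and statsmodels; the file and column names ("heart.csv", chest_pain, thal, disease) are hypothetical, not those of the dataset used in the tutorial.

# Dummy coding of categorical predictors before a logistic regression.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("heart.csv")                            # hypothetical file
X = pd.get_dummies(df[["age", "chest_pain", "thal"]],    # categorical columns become
                   drop_first=True, dtype=float)         # 0/1 indicators (reference coding)
X = sm.add_constant(X)
y = (df["disease"] == "presence").astype(int)            # binary target

logit = sm.Logit(y, X).fit()
print(logit.summary())    # one coefficient per dummy; the reference category is omitted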

Keywords: logistic regression, dummy coding, categorical predictor variables, feature selection
Components: O_1_BINARIZE, BINARY LOGISTIC REGRESSION, BACKWARD-LOGIT
Tutorial: Feature selection - Categorical predictors - Logistic Regression
Dataset: heart-c.xlsx
References:
Wikipedia, "Logistic Regression"


Dummy coding for categorical predictor variables

2016-03-31 (Tag:6387332575165516206)

In this tutorial, we show how to perform a dummy coding for categorical predictor variables in the context of the logistic regression learning process.

In fact, this is an old tutorial that I wrote a long time ago (2007), but it was not referenced in this blog (which was created in 2008). I found it in my archives because I plan to write a tutorial soon about strategies for the selection of categorical variables in logistic regression. I was wondering whether I had already written something related to this subject (the treatment of categorical predictors in logistic regression) in the past. Obviously, I should check my archives more often.

We use Tanagra 1.4.50 in this tutorial.

Keywords: logistic regression, dummy coding, categorical predictor variables
Components: SAMPLING, O_1_BINARIZE, BINARY LOGISTIC REGRESSION, TEST
Tutorial: Dummy coding - Logistic Regression
Dataset: heart-c.xlsx
References:
Wikipedia, "Logistic Regression"


Cost-Sensitive Learning (slides)

2016-03-13 (Tag:7180769223465224963)

This course material presents approaches for taking misclassification costs into account in supervised learning. The baseline method is the one that does not take the costs into account.

Two issues are studied: the metric used for the evaluation of the classifier when a misclassification cost matrix is provided, i.e. the expected cost of misclassification (ECM); and some approaches which make it possible to guide the machine learning algorithm towards the minimization of the ECM.
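
To make the ECM concrete, here is a small sketch that computes it from a confusion matrix and a cost matrix; the figures are purely illustrative.

# Expected cost of misclassification (ECM) from a confusion matrix and a cost matrix.
import numpy as np

confusion = np.array([[850,  50],     # rows = actual class, columns = predicted class
                      [ 30,  70]])
costs = np.array([[0, 1],             # cost(actual, predicted); correct predictions cost 0
                  [5, 0]])            # here a false negative costs 5 times a false positive

ecm = (confusion * costs).sum() / confusion.sum()
print("expected cost of misclassification = %.4f" % ecm)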

Keywords: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost
Slides: Cost Sensitive Learning
References:
Tanagra Tutorial, "Cost-senstive learning - Comparison of tools", March 2009.
Tanagra Tutorial, "Cost-sensitive decision tree", November 2008.


Hyper-threading and solid-state drive

2016-03-03 (Tag:488071904345948912)

After more than 6 years of good and faithful service, I decided to change my computer. It must be said that the former one (Intel Core 2 Quad Q9400 2.66 GHz - 4 cores - running Windows 7 - 64 bit) had begun to make disturbing sounds. I had to put on music to cover the rumbling of the beast and be able to work quietly.

The choice of the new computer was another matter. I am past the age of the race for power, which is necessarily fruitless anyway given the rapid evolution of PCs. Nevertheless, I was sensitive to two aspects that I could not evaluate previously: is the hyper-threading technology effective for programming multithreaded data mining algorithms? Does the use of temporary files to relieve memory occupation take advantage of SSD disk technology?

The new PC runs under Windows 8.1 (I wrote the French version of this tutorial one year ago). The processor is a Core i7 4770S (3.1 GHz). It has 4 physical cores but 8 logical cores with hyper-threading technology. The system disk is an SSD. These characteristics allow us to evaluate their influence on (1) the multithreaded implementation of the linear discriminant analysis described in a previous paper (“Load balanced multithreading for LDA”, September 2013), where the number of threads used can be specified by the user; (2) the use of temporary files for the decision tree induction algorithm, which enables us to handle very large datasets (“Dealing with very large dataset in Sipina”, January 2010; up to 9,634,198 instances and 41 variables).

In this tutorial, we reproduce the two studies using the SIPINA software. Our goal is to evaluate the behavior of these solutions (multithreaded implementation, copying of the data into temporary files to alleviate memory occupation) on our new machine which, due to its characteristics, should take full advantage of them.

Keywords: hyper-threading, ssd disk, solid-state drive, multithread, multithreading, very large dataset, core i7, sipina, decision tree, linear discriminant analysis, lda
Tutorial: en_Tanagra_Hyperthreading.pdf
References:
Tanagra Tutorial, "Load balanced multithreading for LDA", September 2013.
Tanagra Tutorial, "Dealing with very large dataset in Sipina", January 2010.
Tanagra Tutorial, "Multithreading for decision tree induction", November 2010.


Tanagra website statistics for 2015

2016-01-04 (Tag:4999754759810076632)

The year 2015 ends, 2016 begins. I wish you all a very happy year 2016.

A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) have been visited 255,386 times this year, i.e. 700 visits per day.

Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 1,848,033 visits (639 visits per day).

Who are you? The majority of visits come from France and the Maghreb. Then comes a large share of French-speaking countries, notably because some pages are exclusively in French. Among the non-francophone countries, we observe mainly the United States, India, the UK, Germany, Brazil, ...

Which pages are visited? The most successful pages are those related to documentation about Data Science: course materials, tutorials, links to other documents available online, etc. This is not really surprising. I myself spend more time writing booklets and tutorials, and studying the behavior of different tools, including Tanagra.

Happy New Year 2016 to all.

Ricco.
Slideshow: Website statistics for 2015


R online with R-Fiddle

2015-12-31 (Tag:7662089783343482203)

R-Fiddle is a programming environment for R available online. It allows us to write and run a program in R.

Although R is free and there are also good free programming environments for R (e.g. RStudio Desktop, Tinn-R), this type of tool has several advantages. It is suitable for mobile users who frequently change machines. As long as we have an Internet connection, we can work on a project without having to worry about the R installation on the PCs. Collaborative work is another context in which this tool can be particularly advantageous: it allows us to avoid transferring files and managing versions. Last, the solution allows us to work on a lightweight front-end, a laptop for example, and offload the calculations to a powerful remote server (in the cloud, as we would say today).

In this tutorial, we will briefly review the features of R-Fiddle.

Keywords: R software, R programming, cloud computing, linear discriminant analysis, logistic regression, classification tree, klaR package, rpart package, feature selection
Tutorial: en_Tanagra_R_Fiddle.pdf
Files: en_r_fiddle.zip
References:
R-Fiddle - http://www.r-fiddle.org/#/


Random Forest - Boosting with R and Python

2015-12-30 (Tag:2614807124647872780)

This tutorial follows the slideshow devoted to the "Bagging, Random Forest and Boosting". We show the implementation of these methods on a data file. We will follow the same steps as the slideshow i.e. we first describe the construction of a decision tree, we measure the prediction performance, and then we see how ensemble methods can improve the results. Various aspects of these methods will be highlighted: the measure of the variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc.

As a first step, we focus on R (rpart, adabag and randomForest packages) and Python (scikit-learn package). Programming allows us to multiply the analyses; among other things, we can evaluate the influence of the parameters on the performance. As a second step, we explore the capabilities of software (Tanagra and Knime) providing turnkey solutions, very simple to implement and more accessible for people who do not like programming.
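
A brief Python counterpart of the kind of models built in the tutorial is sketched below (random forest, boosting, variable importance), on a toy dataset rather than the tutorial's data files.

# Random forest and boosting with scikit-learn, with variable importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)
ada = AdaBoostClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

print("random forest test error = %.3f" % (1 - rf.score(X_test, y_test)))
print("boosting test error      = %.3f" % (1 - ada.score(X_test, y_test)))
print(rf.feature_importances_.round(3))          # measure of variable importance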

Keywords: R software, R programming, decision tree, classification tree, adabag package, rpart package, randomforest package, Python, scikit-learn package, bagging, boosting, random forest
Components: BAGGING, RND TREE, BOOSTING, C4.5, DISCRETE SELECT EXAMPLES
Tutorial: Bagging, Random Forest et Boosting
Files: randomforest_boosting_en.zip
References:
R. Rakotomalala, "Bagging, Random Forest, Boosting (slides)", December 2015.


Bagging, Random Forest, Boosting (slides)

2015-12-23 (Tag:4389426411970738562)

This course material presents ensemble methods: bagging, random forest and boosting. These approaches are based on the same guiding idea: a set of base classifiers produced by a single learning algorithm is fitted to different versions of the dataset.

For bagging and random forest, the models are fitted independently on bootstrap samples. Random forest incorporates an additional mechanism in order to “decorrelate” the models, which are necessarily decision trees.

Boosting works in a sequential fashion. A model at step (t) is fitted to a weighted version of the sample in order to correct the errors of the model learned at the preceding step (t-1).

Keywords: bagging, boosting, random forest, decision tree, rpart package, adabag package, randomforest package, R software
Slides: Bagging - Random Forest - Boosting
References:
Breiman L., "Bagging Predictors", Machine Learning, 26, p. 123-140, 1996.
Breiman L., "Random Forests", Machine Learning, 45, p. 5-32, 2001.
Freund Y., Schapire R., "Experiments with the new boosting algorithm", International Conference on Machine Learning, p. 148-156, 1996.
Zhu J., Zou H., Rosset S., Hastie T., "Multi-class AdaBoost", Statistics and Its Interface, 2, p. 349-360, 2009.


Python - Machine Learning with scikit-learn (slides)

2015-12-20 (Tag:8169489315039118140)

This course material presents some modules and classes of scikit-learn, a library for machine learning in Python.

As a first step, we focus on a typical classification process: the subdivision of the dataset into training and test sets; the learning of a logistic regression on the training sample; the application of the model to the test set in order to obtain the predicted class values; and the evaluation of the classifier using the confusion matrix and the calculation of performance measures.

In the second step, we study other important domains of the classification task: the cross-validation error evaluation when we deal with a small dataset; the scoring process for direct marketing; the grid search for detecting the optimal parameters of algorithms for a given dataset; the feature selection issue.

Keywords: python, numpy, pandas, scikit-learn, logistic regression, predictive analytics
Slides: Machine Learning with scikit-learn
Dataset and programs: scikit-learn - Programs and dataset
References:
"scikit-learn -- Machine Learning in Python" on scikit-learn.org
Python - Official Site


Python - Statistics with SciPy (slides)

2015-12-08 (Tag:6665224531524379965)

This course material presents the use of some modules of SciPy, a library for scientific computing in Python. We study especially the stats package, which allows us to perform statistical tests such as the comparison of means for independent and paired samples, the comparison of variances, and the measurement of the association between two variables. We also study the cluster package, especially the k-means and hierarchical agglomerative clustering algorithms.

SciPy handles NumPy vectors and matrices which were presented previously.
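
A few of the scipy.stats calls covered in the slides are illustrated below on toy data (the choice of Bartlett's test for the comparison of variances is mine, for the sake of the example).

# Some statistical tests with scipy.stats on simulated samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=30)
b = rng.normal(11.0, 2.0, size=30)

print(stats.ttest_ind(a, b))            # comparison of means, independent samples
print(stats.ttest_rel(a, b))            # comparison of means, related (paired) samples
print(stats.bartlett(a, b))             # comparison of variances
print(stats.pearsonr(a, b))             # association between two variables
print(stats.shapiro(a))                 # normality test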

Keywords: python, numpy, scipy, descriptive statistics, cumulative distribution functions, sampling, random number generator, normality test, test for comparing populations, pearson correlation, spearman correlation, cluster analysis, k-means, hac, dendrogram
Slides: scipy.stats and scipy.cluster
Dataset and programs: SciPy - Programs and dataset
References:
SciPy Reference Guide on SciPy.org
Python - Official Site


Python - Handling matrices with NumPy (slides)

2015-10-28 (Tag:8795812480538413643)

This course material presents the manipulation of matrices using NumPy. The array type is common to vectors and matrices. The distinctive feature is the addition of a second dimension, so as to arrange the values within a rows x columns structure.

Matrices pave the way for operators that play a fundamental role in statistical modeling and exploratory data analysis (e.g. matrix inversion, solving equations, calculation of eigenvalues and eigenvectors, singular value decomposition, etc.).
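
Here, for illustration, are some of these operators with NumPy on a small 2 x 2 example.

# A few matrix operators with numpy.linalg.
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

print(np.linalg.inv(A))                 # matrix inversion
print(np.linalg.solve(A, b))            # solving the linear system A x = b
print(np.linalg.eig(A))                 # eigenvalues and eigenvectors
U, s, Vt = np.linalg.svd(A)             # singular value decomposition
print(s)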

Keywords: python language, numpy, vector, matrix, array, creation, extraction
Slides: NumPy Matrices
Datasets and programs: Matrices
References:
NumPy Reference on SciPy.org
Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes".
Python - Official Site


Python - Handling vectors with NumPy (slides)

2015-10-08 (Tag:1090097754049242595)

Python is becoming more and more popular in the eyes of Data Scientists. I decided to introduce Statistical Programming in Python among my teachings at the University (reference page in French).

This first course material describes the handling of vectors from the NumPy library. Their structure and functionality bear a certain similarity to vectors in R.

Keywords: python language, numpy, vector, array, creation, extraction
Slides: NumPy Vectors
Datasets and programs: Vectors
References:
NumPy Reference on SciPy.org
Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes".
Python - Official Site


Cross-validation, leave-one-out, bootstrap (slides)

2015-06-02 (Tag:3134865842970372878)

In supervised learning, it is commonly accepted that one should not use the same sample to build a predictive model and to estimate its error rate. The error obtained under these conditions - called the resubstitution error rate - is (very often) too optimistic, leading one to believe that the model will perform excellently in prediction.

A typical approach is to divide the data into 2 parts (holdout approach): a first sample, called the train sample, is used to construct the model; a second sample, called the test sample, is used to measure its performance. The measured error rate honestly reflects the behavior of the model in generalization. Unfortunately, on small datasets, this approach is problematic. By reducing the amount of data presented to the learning algorithm, we cannot correctly learn the underlying relation between the descriptors and the class attribute. At the same time, since the part devoted to testing remains limited, the measured error has a high variance.

In this document, I present resampling techniques (cross-validation, leave-one-out and bootstrap) for estimating the error rate of the model constructed from the totality of the available data. A study on simulated data (the "waves" dataset; Breiman et al., 1984) is used to analyze the behavior of the approaches according to various learning algorithms (decision trees, linear discriminant analysis, neural networks [perceptron]).
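
A minimal Python sketch of the first two estimators is given below, with a linear discriminant analysis on a toy dataset standing in for the classifiers studied in the slides (the bootstrap estimator, more involved, is left to the slides).

# Cross-validation and leave-one-out error estimation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
clf = LinearDiscriminantAnalysis()

cv10 = cross_val_score(clf, X, y, cv=10)                    # 10-fold cross-validation
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())          # leave-one-out
print("10-fold CV error rate = %.3f" % (1 - cv10.mean()))
print("leave-one-out error   = %.3f" % (1 - loo.mean()))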

Keywords: resampling, cross-validation, leave-one-out, bootstrap, error rate estimation, holdout, resubstitution, train, test, learning sample
Components (Tanagra): CROSS-VALIDATION, BOOTSTRAP, TEST, LEAVE-ONE-OUT
Slides: Error rate estimation
References:
A. Molinaro, R. Simon, R. Pfeiffer, « Prediction error estimation: a comparison of resampling methods », in Bioinformatics, 21(15), pages 3301-3307, 2005.
Tanagra tutorial, "Resampling methods for error estimation", July 2009.


R programming under Hadoop

2015-04-11 (Tag:7437674239658033803)

The aim of this tutorial is to show the programming of the famous "word count" algorithm from a set of files stored in the HDFS file system.

The "word count" is a state-of-the-art example for the programming under Hadoop. It is described everywhere on the web. But, unfortunately, the tutorials which describe the task are often not reproducible. The dataset are not available. The whole process, including the installation of the Hadoop framework, are not described. We do not know how to access to the files stored in the HDFS file system. In short, we cannot run programs and understand in details how they work.

In this tutorial, we describe the whole process. We detail first the installation of a virtual machine which contains a single-node Hadoop cluster. Then we show how to install R and RStudio Server which allows us to write and run a program. Last, we write some programs based on the mapreduce scheme.

The steps, and therefore the sources of errors, are numerous. We use many screenshots so that each operation can be actually understood. This is the reason for this unusual presentation format for a tutorial.

Keywords: big data, big data analytics, mapreduce, rmr2 package, rhdfs package, hadoop, rhadoop, R software, rstudio, rstudio server, cloudera, R language
Tutorial: en_Tanagra_Hadoop_with_R.pdf
Files: hadoop_with_r.zip
References:
Tanagra Tutorial, "MapReduce with R", Feb. 2015.
Hugh Devlin, "Mapreduce in R", Jan. 2014.


MapReduce with R

2015-02-22 (Tag:6836762413698817832)

Big data has been a very popular topic over the last few years. Big data analytics refers to the process of discovering useful information or knowledge from big data. This is an important issue for organizations. In concrete terms, the aim is to extend, adapt or even create novel exploratory data analysis or data mining approaches for new data sources whose main characteristics are “volume”, “variety” and “velocity”.

Distributed computing is essential in the big data context. It is illusory to want to increase the power of servers indefinitely to follow the exponential growth of the information to be processed. The solution relies on the efficient cooperation of a myriad of networked computers, ensuring both volume management and computing power. Hadoop is a solution commonly cited for this requirement. It is an open-source software framework written in Java for the distributed storage and distributed processing of very large datasets (big data) on computer clusters built from commodity hardware. For the implementation of distributed programs, the MapReduce programming model plays an important role. The processing of large datasets can be implemented with parallel algorithms on a cluster of connected computers (nodes).

In this tutorial, we are interested in MapReduce programming in R. We use the RHadoop technology from the Revolution Analytics company. The "rmr2" package in particular allows us to learn MapReduce programming without having to install the Hadoop environment, which is complicated enough already. There are some tutorials about this subject on the web. Hugh Devlin's (January 2014) is undoubtedly one of the most interesting, but it is perhaps too sophisticated for students who are not very familiar with programming in R. So I decided to start afresh with very simple examples at first. Then, we progress by programming a simple data mining method such as multiple linear regression.
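
To fix ideas about the scheme itself, independently of rmr2 and Hadoop, here is a tiny plain-Python illustration of the map, shuffle and reduce steps for the word count problem; it only illustrates the programming model, it is not rmr2/Hadoop code.

# Tiny plain-Python illustration of the map/reduce scheme for word counting.
from itertools import groupby

documents = ["the cat sat on the mat", "the dog sat"]

# map step: emit (key, value) pairs, here (word, 1)
pairs = [(word, 1) for doc in documents for word in doc.split()]

# shuffle step: group the pairs by key
pairs.sort(key=lambda kv: kv[0])
grouped = groupby(pairs, key=lambda kv: kv[0])

# reduce step: aggregate the values of each key
counts = {word: sum(v for _, v in values) for word, values in grouped}
print(counts)   # {'cat': 1, 'dog': 1, 'mat': 1, 'on': 1, 'sat': 2, 'the': 3}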

Keywords: big data, big data analytics, mapreduce, rmr2 package, hadoop, rhadoop, one-way anova, linear regression
Tutorial: en_Tanagra_MapReduce.pdf
Dataset: en_mapreduce_with_r.zip
References:
Hugh Devlin, "Mapreduce in R", Jan. 2014.
Tutoriel Tanagra, "Parallel programming in R", october 2013.


Correlation analysis (slides)

2014-12-12 (Tag:7772721999601016602)

The aim of correlation analysis is to characterize the existence, the nature and the strength of the relationship between two quantitative variables. The visual inspection of scatter plots is a valuable first step, when we have no idea about the form of the underlying relationship between the variables. In a second step, we need statistical tools to measure the strength of the relationship and to assess its significance.

In these slides, we present Pearson's product-moment correlation. We show how to estimate its value from a sample. We present the inferential tools which enable us to perform hypothesis tests and confidence interval estimation.

But the Pearson correlation is only appropriate for characterizing a linear relationship. We study possible solutions for problematic situations, among others Spearman's rank correlation coefficient (Spearman's rho).

Last, the partial correlation coefficient and the related inferential tools are described.
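
As an illustration, the main computations can be reproduced with base R; x, y and z below are assumed to be numeric vectors of the same length.

cor.test(x, y)                        # Pearson correlation, t test and confidence interval
cor.test(x, y, method = "spearman")   # Spearman's rho for monotonic, non-linear relationships
# partial correlation of x and y given a third variable z, via residuals
cor(residuals(lm(x ~ z)), residuals(lm(y ~ z)))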

Keywords: correlation, partial correlation, pearson, spearman, hypothesis testing, significance, confidence interval
Components (Tanagra): LINEAR CORRELATION
Slides: Correlation analysis
References:
M. Plonsky, “Correlation”, Psychological Statistics, 2014.


Clustering of categorical variables (slides)

2014-12-02 (Tag:2158552116245790433)

The aim of clustering of categorical variables is to group variables according to their relationships: the variables in the same cluster are highly related; variables in different clusters are weakly related. In these slides, we describe an approach based on Cramer's V measure of association. We observe that the approach can highlight subsets of variables, which is useful - for instance - in a variable selection process for a subsequent supervised learning task. But, on the other hand, we have no indication about the nature of these associations; the interpretation of the groups is not obvious.

This leads us to deepen the analysis and to take an interest in the clustering of the categories of nominal variables. An approach based on a measure of similarity between categories using the indicator variables (dummy variables) is described. Other approaches are also reviewed. The main advantage of this kind of analysis (clustering of categories) is that we can easily interpret the underlying nature of the groups.
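
As a rough illustration of the first approach, here is a minimal sketch in R which builds a Cramer's V matrix and feeds it to a hierarchical clustering; it assumes a data frame d whose columns are all factors, and it is not the implementation used in Tanagra.

cramer <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}
p <- ncol(d)
V <- matrix(1, p, p, dimnames = list(names(d), names(d)))
for (i in 1:(p - 1)) for (j in (i + 1):p) V[i, j] <- V[j, i] <- cramer(d[[i]], d[[j]])
tree <- hclust(as.dist(1 - V), method = "average")   # 1 - V used as a dissimilarity
plot(tree)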

Keywords: categorical variables, qualitative variables, categories, clustering, clustering variables, latent variable, cramer's v, dice's index, clusters, groups, bottom-up, hierarchical agglomerative clustering, hac, top down, mca, multiple correspondence analysis
Components (Tanagra): CATVARHCA
Slides: Clustering of categorical variables
References:
H. Abdallah, G. Saporta, « Classification d’un ensemble de variables qualitatives » (Clustering of a set of categorical variables), in Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.
F. Harrell Jr, «Hmisc: Harrell Miscellaneous», version 3.14-5.


Discretization of continuous attributes (slides)

2014-11-19 (Tag:8075515832670511859)

Discretization consists in transforming a continuous attribute into a discrete (ordinal) attribute. The process determines a finite number of intervals from the available values, to which discrete values are assigned. The two main issues of the process are: how to determine the number of intervals; and how to determine the cut points.

In these slides, we present some discretization methods for the unsupervised and supervised contexts.
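
For the two basic unsupervised methods, a minimal sketch in base R could look as follows, assuming a numeric vector x.

k <- 4
eq.width <- cut(x, breaks = k)                                        # equal-width intervals
eq.freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, 1 / k)),
                include.lowest = TRUE)                                # equal-frequency intervals
table(eq.width); table(eq.freq)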

Keywords: discretization, data preprocessing, chi-merge, mdlpc, equal-frequency, equal-width, clustering, top-down, bottom-up, feature construction
Components (Tanagra): EQFREQ DISC, EQWIDTH DISC, MDLPC, BINARY BINNING, CONT TO DISC
Slides: Discretization
Tutorials:
Tanagra Tutorials, "Discretization of continuous features", may 2010.


Clustering variables (slides)

2014-09-24 (Tag:2504838243607242877)

The aim of clustering variables is to divide a set of numeric variables into disjoint clusters (subsets of variables). In these slides, we present an approach based on the concept of a latent component. A subset of variables is summarized by a latent component, which is the first factor of a principal component analysis performed on these variables. This is a kind of "centroid" variable which maximizes the sum of the squared correlations with the variables of the cluster. Various clustering algorithms based on this idea are described: a hierarchical agglomerative algorithm; a top-down approach; and an approach inspired by the k-means method.
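
A minimal sketch of the latent component idea in R, assuming a numeric data frame X and a candidate cluster made of three of its columns (hypothetical names):

cluster <- X[, c("x1", "x2", "x3")]
latent <- prcomp(cluster, scale. = TRUE)$x[, 1]   # first principal component of the cluster
cor(cluster, latent)^2                            # squared correlations: homogeneity of the cluster
sum(cor(cluster, latent)^2)                       # the quantity the partition tries to maximize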

Keywords: clustering, clustering variables, latent variable, latent component, clusters, groups, bottom-up, hierarchical agglomerative clustering, top down, varclus, k-means, pca, principal component analysis
Components (Tanagra): VARHCA, VARKMEANS, VARCLUS
Slides: Clustering variables
Tutorials:
Tanagra tutorials, "Variable clustering (VARCLUS)", 2008.


Single layer and multilayer perceptrons (slides)

2014-09-16 (Tag:269083441246067909)

Artificial neural networks are computational models inspired by an animal's central nervous system (in particular the brain) which is capable of machine learning as well as pattern recognition (Wikipedia).

In these slides, we present the single-layer and multilayer perceptrons, which are devoted to supervised learning. We describe the baseline of the approaches: the difference between the linear (single-layer) and non-linear (multilayer) classifiers; the representational power of the models; and the learning algorithms (the Widrow-Hoff rule and the backpropagation algorithm).
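
As an illustration of the two architectures, here is a minimal sketch with the R package nnet, assuming a data frame d with a binary factor target d$y; the settings are arbitrary.

library(nnet)
set.seed(1)
mlp <- nnet(y ~ ., data = d, size = 3, decay = 0.01, maxit = 500)  # multilayer: 3 hidden neurons
slp <- nnet(y ~ ., data = d, size = 0, skip = TRUE, maxit = 500)   # no hidden layer: a linear (single-layer) model
table(d$y, predict(mlp, d, type = "class"))                        # confusion matrix on the learning sample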

Keywords: artificial neural network, perceptron, single layer, SLP, multilayer, MLP, widrow-hoff rule, backpropagation algorithm, linear classifier, non linear classifier
Components (Tanagra): MULTILAYER PERCEPTRON
Slides: Single layer and multilayer perceptrons
Tutorials:
Tanagra tutorials, "Configuration of a multilayer perceptron", December 2017.
Tanagra tutorials, "Multilayer perceptron - Software comparison", 2008.


Filter approaches for feature selection (slides)

2014-09-13 (Tag:4241355573683212143)

In the supervised learning context, the filter approach for feature selection consists in selecting, prior to the modeling step, the variables that are the most appropriate for any machine learning algorithm subsequently used for the construction of the model.

The methods are mostly based on the correlation concept (in a broad sense). They are interesting because they enable us to handle high-dimensional datasets quickly. On the other hand, they are questionable because they do not take into account the characteristics of the model (e.g. linear, non-linear) that will be built from the selected variables.
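
As a rough illustration of the filter idea, here is a minimal correlation-based ranking in R, assuming a data frame d with a binary factor target d$y and numeric descriptors; it is not one of the methods implemented in Tanagra.

X <- d[, setdiff(names(d), "y")]
y01 <- as.numeric(d$y == levels(d$y)[2])
score <- sapply(X, function(x) abs(cor(x, y01)))   # correlation of each descriptor with the class
sort(score, decreasing = TRUE)                     # ranking of the descriptors
names(score)[score > 0.2]                          # arbitrary cut-off, for illustration only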

Keywords: feature selection, filter methods, embedded methods, wrapper methods
Components (Tanagra): CFS FILTERING, FCBF FILTERING, MIFS FILTERING, MODTREE FILTERING, FEATURE RANKING, FISHER FILTERING, RUNS FILTERING, STEPDISC
Slides: Filter methods
Tutorials:
Tanagra tutorials, "Filter methods for feature selection", 2010.
Tanagra tutorials, "Filter methods for feature selection (continuation)", 2010.


Association rule learning (slides)

2014-08-31 (Tag:6518942386798258881)

Association rule learning is a popular approach to extract rules from large databases. Initially intended for transactional data, especially for market basket analysis, the method can be applied to any binary or binarized data.

In these slides, we show the outline of the approach. We present a basic algorithm to generate association rules from data. We highlight the influence of the settings (minimum support and minimum confidence) on the reduction of the search space, and thus on the amount of computation.
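
For readers who want to experiment, a minimal sketch with the R package arules could look as follows, assuming a data frame d of logical (binarized) columns; the thresholds are arbitrary.

library(arules)
trans <- as(d, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.7))
inspect(head(sort(rules, by = "lift"), 5))   # the five rules with the highest lift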

Keywords: association rule, association rules, itemset, frequent itemset, eclat algorithm, support, confidence, lift
Components (Tanagra): A PRIORI, A PRIORI MR, A PRIORI PT, FREQUENT ITEMSETS, SPV ASSOC RULE, SPV ASSOC TREE
Slides: Association rule learning
References:
Wikipedia, "Association Rule Learning".
M. Zaki, S. Parthasaraty, M. Ogihara, W. Li, “New Algorithms for Fast Discovery of Association Rules”, in Proc. of KDD’97, p. 283-296, 1997.


ROC curve (slides)

2014-08-12 (Tag:1963481117570900414)

The ROC curve is a graphical tool for the evaluation and comparison of binary classifiers. It provides a more complete evaluation than the confusion matrix and the error rate. It remains valid even if we deal with a non-representative test set, i.e. when the observed class frequencies are not an estimate of the prior class probabilities. It is especially useful when we deal with class imbalance, and when the misclassification cost matrix is not well established.

In these slides, we show: the ideas underlying the ROC curve; the construction of the curve from a dataset; the calculation of the AUC (area under curve), a synthetic indicator derived from the ROC curve; and the use of the ROC curve for model comparison.
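
As an illustration, the curve and the AUC can be computed by hand in R, assuming a numeric vector score (higher means more likely positive) and a 0/1 label vector y on a test sample.

ord <- order(score, decreasing = TRUE)
tpr <- cumsum(y[ord] == 1) / sum(y == 1)     # true positive rate
fpr <- cumsum(y[ord] == 0) / sum(y == 0)     # false positive rate
plot(fpr, tpr, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)                        # diagonal: random classifier
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)   # Mann-Whitney formulation of the AUC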

Keywords: receiver operating characteristic, roc curve, auc, area under curve, binary classifier, evaluation, model comparison, class probability estimate, score
Components (Tanagra): SCORING, ROC CURVE
Slides: ROC curve
References:
Wikipedia, "Receiver Operating Characteristic".
T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters, 27, 861-874, 2009.


Customer targeting (slides)

2014-08-04 (Tag:2500270173564999198)

Customer targeting is one component of direct marketing. The aim is to identify the customers who are the most interested in a new product. We are in a data mining context because we create a classifier from a learning sample, but we do not want to classify the instances: we want to measure the probability that individuals will buy the product, i.e. their score, their propensity to purchase. In this context, we use a specific tool - the gain chart (or cumulative lift curve) - to assess the efficiency of the analysis.

In these slides, we detail the overall process. We emphasize the reading of the gain chart, and especially how to transpose the reading of the chart from a labeled sample to the customer database (for which we do not know the values of the target attribute).
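
A minimal sketch of the gain chart in R, assuming a numeric vector score and a 0/1 label vector y on a labeled sample:

ord <- order(score, decreasing = TRUE)
targeted <- seq_along(y) / length(y)          # fraction of the base contacted
gain <- cumsum(y[ord]) / sum(y)               # fraction of the buyers reached
plot(targeted, gain, type = "l", xlab = "targeted fraction", ylab = "gain")
abline(0, 1, lty = 2)                         # baseline: random targeting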

Keywords: customer targeting, direct marketing, scoring, score, propensity to purchase
Components (Tanagra): SCORING, LIFT CURVE
Slides: Customer targeting
References:
Microsoft, “Lift chart (Analysis Services – Data Mining)”, SQL Server 2014.
H. Hamilton, “Cumulative Gains and Lift Charts”, in CS 831 – Knowledge Discovery in Databases, 2012.


Descriptive discriminant analysis (slides)

2014-08-02 (Tag:8988355612596107382)

The descriptive discriminant analysis (DDA) or canonical discriminant analysis is a statistical approach which performs a multivariate characterization of differences between groups. It is related to other factorial approaches such as principal component analysis or canonical correlation analysis.

In these slides, we present the main issues of the approach and the reading of the results. We also show how descriptive discriminant analysis is related to predictive discriminant analysis (linear discriminant analysis), which, for its part, relies on restrictive statistical assumptions.

Keywords: discriminant analysis, descriptive discriminant analysis, canonical discriminant analysis, predictive discriminant analysis, correlation ratio, R, lda package MASS, sas, proc candisc
Components (Tanagra): CANONICAL DISCRIMINANT ANALYSIS
Slides: DDA
Dataset: wine_quality.xls
References:
SAS, "CANDISC procedure".


Clustering tree (slides)

2014-07-11 (Tag:7074531890785949471)

The clustering tree algorithm is both a clustering approach and a multi-objective supervised learning method.

In the cluster analysis framework, the aim is to group objects in clusters, where the objects in the same cluster are similar in a certain sense. The clustering tree algorithm enables us to perform this kind of task: we obtain a decision tree as the clustering structure. Thus, the deployment of the classification rule in the information system is really easy.

But we can also consider the clustering tree as an extension of the classification/regression tree, because we can distinguish two sets of variables: the explained (active) variables, which are used to determine the similarities between the objects; and the predictive (illustrative) variables, which allow us to describe the groups.

In these slides, we show the main features of this approach.

Keywords: cluster analysis, clustering, clustering tree, groups characterization
Slides: Clustering tree
References:
M. Chavent (1998), « A monothetic clustering method », Pattern Recognition Letters, 19, 989—996.
H. Blockeel, L. De Raedt, J. Ramon (1998), « Top-Down Induction of Clustering Trees », ICML, 55—63.


Sipina - Version 3.12

2014-05-19 (Tag:4559301184276256341)

The transfer between the Excel spreadsheet and Sipina has been improved for databases of moderate size (for large databases, with several hundreds of thousands of rows, it is better to import the data file directly in text format, TXT). The handling of the decimal point has also been improved. The automatic processing is now much faster than before.

The precision of the numerical cut points displayed in a decision tree is now customizable. Users have a new item in the "Tree Management" menu.

Sipina website: Sipina
Download: Setup file


Binary classification via regression (slides)

2014-05-03 (Tag:1982787729674790467)

In these slides, we study the analogy between linear discriminant analysis and regression on an indicator variable when we deal with a binary classification problem.

The tests for the global significance of the model and the individual significance of the coefficients are equivalent. The coefficients are proportional, including the intercepts when we treat the balanced case. In the other case (unbalanced classes), an additional correction of the regression intercept is needed to obtain the linear discriminant analysis intercept.

For the multiclass classification, the equivalence between the regression and the linear discriminant analysis is no longer valid.
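
The equivalence can be checked numerically in R; a minimal sketch for the binary case, assuming a data frame d with a factor target d$y and two numeric descriptors x1 and x2 (hypothetical names):

library(MASS)
fit.lm  <- lm(as.numeric(y == levels(y)[2]) ~ x1 + x2, data = d)   # regression on a 0/1 indicator
fit.lda <- lda(y ~ x1 + x2, data = d)
coef(fit.lm)[-1] / fit.lda$scaling[, 1]   # the ratios should be (approximately) constant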

Keywords: supervised learning, linear discriminant analysis, multiple linear regression, R2, wilks lambda
Slides: Classification via regression
References:
R.O. Duda, P.E. Hart, D. Stork, "Pattern Classification", 2nd Edition, Wiley, 2000.
C.J. Huberty, S. Olejnik, "Applied MANOVA and Discriminant Analysis", Wiley, 2006.


Naive Bayes classifier (slides)

2014-03-28 (Tag:6503620475187110142)

Here are the slides I use for my course about "Naive Bayes Classifier". The main originality of this presentation is that I show it is possible to extract an explicit model based on the calculations of the conditional distributions. This makes the deployment of the model in real case studies much easier. This aspect is not widely known. The feature selection problem in the context of naive Bayes classifier learning is also highlighted.
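
As an illustration, here is a minimal sketch with the naiveBayes function of the R package e1071, assuming a data frame d with a factor target d$y and categorical descriptors; this is not the material of the slides, only a quick way to look at the conditional distributions.

library(e1071)
nb <- naiveBayes(y ~ ., data = d)
nb$tables                            # the conditional distributions, i.e. the "explicit model"
predict(nb, head(d), type = "raw")   # posterior probabilities for the first instances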

Keywords: machine learning, supervised methods, naive bayes, independence assumption, independent feature model, feature selection, cfs
Slides: Naive Bayes classifier
References:
T. Mitchell, "Generative and discriminative classifiers: naive bayes and logistic regression", in "Machine Learning", McGraw Hill, 2010; Draft of January 2010.
Wikipedia, "Naive Bayes classifier".


Linear discriminant analysis (slides)

2014-03-14 (Tag:6241246831826502851)

Here are the slides I use for my course about "Linear Discriminant Analysis" (LDA). The two main assumptions which enable us to obtain a linear classifier are highlighted. LDA is very interesting because we can interpret the classifier in different ways: it is a parametric method based on the MAP (maximum a posteriori) decision rule; it is a classifier based on a distance to the conditional centroids; and it is a linear separator which defines various regions in the representation space.

Statistical tools for the overall model evaluation and the checking of the relevance of the predictive variables are presented.

Keywords: machine learning, supervised methods, discriminant analysis, predictive discriminant analysis, linear discriminant analysis, linear classification functions, wilks lambda, stepdisc, feature selection
Slides: linear discriminant analysis
References:
G. James, D. Witten, T. Hastie, R. Tibshirani, "An introduction to statistical learning with applications in R", Springer, 2013.
R. Duda, P. Hart, D. Stork, "Pattern Classification", Wiley, 2000.


Regression Trees

2014-03-07 (Tag:6548856795648379672)

Here are the slides I use for my course about "Regression Trees". Because this course comes after the one about "Decision Trees", only the specific features for the handling of a continuous target attribute are highlighted. The algorithms described correspond roughly to the AID and CART approaches.
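
A minimal sketch of a regression tree with the R package rpart (roughly the CART approach), assuming a data frame d with a numeric target d$y:

library(rpart)
rt <- rpart(y ~ ., data = d, method = "anova")   # "anova" = regression tree
printcp(rt)                                      # complexity table, used to select the right-sized tree
pred <- predict(rt, d)                           # predicted values of the continuous target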

Keywords: machine learning, supervised methods, regression tree, aid, cart, continuous class attribute
Slides: regression trees
References:
L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and Regression Trees”, Wadsworth Int. Group, 1984.
J. Morgan, J.A. Sonquist, "Problems in the Analysis of Survey Data and a Proposal", JASA, 58:415-435, 1963.


Decision tree learning algorithms

2014-03-01 (Tag:7456939298371498895)

Here are the slides I use for my course about the existing decision tree learning algorithms. Only the most popular ones are described: C4.5, CART and CHAID (a variant). The differences between these approaches are highlighted according to: the splitting measure; the merging strategy during the splitting process; and the approach for determining the right-sized tree.

Keywords: machine learning, supervised methods, decision tree learning, classification tree, chaid, cart, c4.5
Slides: C4.5, CART and CHAID
References:
L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and Regression Trees”, Wadsworth Int. Group, 1984.
G. Kass, “An exploratory technique for Investigating Large Quantities of Categorical Data”, Applied Statistics, 29(2), 1980, pp. 119-127.
R. Quinlan, “C4.5: Programs for machine learning”, Morgan Kaufman, 1993.


Introduction to Decision Trees

2014-02-27 (Tag:8631032447217503359)

Here are the lecture notes I use for my course “Introduction to Decision Trees”. The basic concepts of the decision tree algorithm are described. The underlying method is rather similar to the CHAID approach.

Keywords: machine learning, supervised methods, decision tree learning, classification tree
Slides: Introduction to Decision Trees
References:
T. Mitchell, "Decision Tree Learning", in "Machine Learning", McGraw Hill, 1997; Chapter 3, pp. 52-80.
L. Rokach, O. Maimon, "Decision Trees", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 9, pp. 165-192.


Introduction to Supervised Learning

2014-02-22 (Tag:4603781498701255990)

Here are the lecture notes I use for my course “Introduction to Supervised Learning”. The presentation is very simplified. But, all the important elements are described: the goal of the supervised learning process, the Bayes rule, the evaluation of the models using the confusion matrix.

Keywords: machine learning, supervised methods, model, classifier, target attribute, class attribute, input attributes, descriptors, bayes rule, confusion matrix, error rate, sensitivity, precision, specificity
Slides: Introduction to Supervised Learning
References:
O. Maimon, L. Rokach, "Introduction to Supervised Methods", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 8, pp. 149-164.
T. Hastie, R. Tibshirani, J. Friedman, "The elements of Statistical Learning", Springer, 2009.


Cluster analysis for mixed data

2014-02-05 (Tag:3282675408585794344)

The aim of clustering is to gather together the instances of a dataset in a set of groups. The instances in the same cluster are similar according to a similarity (or dissimilarity) measure; the instances in distinct groups are different. The influence of the measure used, which is often a distance measure, is essential in this process. These measures are well known when we work on attributes of the same type: the Euclidean distance is often used when we deal with numeric variables; the chi-square distance is more appropriate when we deal with categorical variables. The problem is much more complicated when we deal with mixed data, i.e. with both numeric and categorical values. It is admittedly possible to define a measure which handles the two kinds of variables simultaneously, but we then run into a weighting problem. We must define a weighting system which balances the influence of the attributes; indeed, the results must not depend on the type of the variables. This is not easy.

Previously, we studied the behavior of the factor analysis for mixed data (AFDM in French). It is a generalization of principal component analysis which can handle both numeric and categorical variables. From a set of mixed variables, we can calculate components which summarize the information available in the dataset. These components are a new set of numeric attributes. We can use them to perform the cluster analysis with standard approaches for numeric values.

In this paper, we present a tandem analysis approach for the clustering of mixed data. First, we perform a factor analysis on the original set of variables, both numeric and categorical. Second, we launch the clustering algorithm on the most relevant factor scores. The main advantage is that we can use any clustering algorithm for numeric variables in the second phase. We also expect that, by selecting a small number of components, we keep only the relevant information from the dataset, so that the results are more reliable.

We use Tanagra 1.4.49 and R (ade4 package) in this case study.
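
A minimal sketch of the tandem approach in R with the ade4 package, assuming a data frame d mixing numeric and factor columns; the number of factors and the number of clusters are arbitrary here.

library(ade4)
afdm <- dudi.mix(d, scannf = FALSE, nf = 3)          # factor analysis for mixed data
tree <- hclust(dist(afdm$li), method = "ward.D2")    # HAC on the first factor scores
groups <- cutree(tree, k = 4)                        # cut the dendrogram into 4 clusters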

Keywords: AFDM, FAMD, factor analysis for mixed data, clustering, cluster analysis, hac, hierarchical agglomerative clustering, R software, hclust, ade4 package, dudi.mix, cutree, groups description
Components: AFDM, HAC, GROUP CHARACTERIZATION, SCATTERPLOT
Tutorial: en_Tanagra_Clustering_Mixed_Data.pdf
Dataset: bank_clustering.zip
References:
Tanagra, "Factor Analysis for Mixed Data".
Jerome Pages, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquee, tome 52, n°4, 2004 ; pages 93-111.


Scilab and R - Performance comparison

2014-01-15 (Tag:2730598585376828796)

We studied the Scilab tool in a data mining scheme in a previous tutorial. We noted that Scilab is well suited for data mining and is a credible alternative to R. But we also observed that the available toolboxes for statistical processing and data mining are not very numerous compared to those of R. In this second tutorial, we evaluate the behavior of Scilab when we deal with a dataset with 500,000 instances and 22 attributes. We compare its performance with that of R. Two criteria are used: the memory occupation, measured in the Windows task manager; and the execution time at each step of the process.

It is not possible to obtain an exhaustive point of view. To delimit the scope of our study, we specified a standard supervised learning scenario: loading a data file, building the predictive model with the linear discriminant analysis approach, and calculating the confusion matrix and the resubstitution error rate. Of course, this study is incomplete, but it seems that Scilab is less efficient in the data management step. It is however quite efficient in the modeling step, although this last assessment depends on the toolbox used.

Keywords: scilab, toolbox, nan, linear discriminant analysis, R software, sipina, tanagra
Tutorial: en_Tanagra_Scilab_R_Comparison.pdf
Dataset: waveform_scilab_r.zip
References:
Scilab - https://www.scilab.org/en
Michaël Baudin, "Introduction to Scilab (in French)", Developpez.com.


Data Mining with Scilab

2014-01-07 (Tag:1772935362941038242)

I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical data processing and data mining. Recently a mathematician colleague spoke to me about this tool. He was surprised about the low visibility of Scilab within the data mining community, knowing that it proposes functionalities which are quite similar to those of R software. I confess that I did not know Scilab from this perspective. I decided to study Scilab by setting a basic goal: is it possible to perform simply a predictive analysis process with Scilab? Namely: loading a data file (learning sample), building a predictive model, obtaining a description of its characteristics, loading a test sample, applying the model on this second set of data, building the confusion matrix and calculating the test error rate.

We will see in this tutorial that the whole task was completed easily. Scilab is perfectly able to carry out statistical treatments. But two small drawbacks appear when getting to grips with Scilab: the library of statistical functions exists but is not as comprehensive as that of R, and its documentation is not very extensive at this time. However, I am very satisfied with this first experience. I discovered an excellent free tool, flexible and efficient, very easy to get started with, which turns out to be a credible alternative to R in the field of data mining.

Keywords: scilab, toolbox, nan, libsvm, linear discriminant analysis, R software, predictive analytics
Tutorial: en_Tanagra_Scilab_Data_Mining.pdf
Dataset: data_mining_scilab.zip
References:
Scilab - https://www.scilab.org/fr
ATOMS : Homepage - http://atoms.scilab.org/


Tanagra, tenth anniversary

2014-01-02 (Tag:2779894367511347925)

First of all, let me offer you all my wishes of happiness, health and success for the year 2014 which is beginning.

For Tanagra, 2014 is of quite particular importance. Ten years ago, almost to the day, the first version of the software was put online. Originally designed as a tool for students and researchers in the data mining domain, the project has somewhat changed in nature in recent years. Today, Tanagra is an academic project which provides a point of access to statistical and data mining techniques. It is addressed to students, but also to researchers from other areas (psychology, sociology, archeology, etc.). It makes, I hope, the implementation of these techniques on real case studies more attractive and clearer.

This change was accompanied by a refocusing of my activity. The Tanagra software is still evolving (we are at version 1.4.50), new methods are added and existing components are regularly improved, but at the same time I put the emphasis on documentation in the form of books, training materials and tutorials. The underlying idea is very simple: understanding the ins and outs of the methods is the best way to learn how to use the software that implements them.

Over the past 5 years (2009/01/01 to 2013/12/31), my site has received 677 visits per day on average. The 10 countries that visit most often are: France, Morocco, Algeria, Tunisia, United States, India, Canada, Belgium, United Kingdom and Brazil. The page with the materials for my data mining courses is the most visited (https://eric.univ-lyon2.fr/ricco/cours/supports_data_mining.html; 99 visits per day, 6 minutes 35 seconds average time spent on the page). At the same time, I note with great satisfaction that the English pages are, overall, visited as much as the French ones. I think that the effort to write documentation in English has been fruitful.

I hope that this work will remain useful for a long time, and that 2014 will bring exchanges that are as rewarding as ever for everybody.

Ricco.


Tanagra - Version 1.4.50

2013-12-18 (Tag:2414981878985523354)

Improvements have been introduced and a new component has been added.

HAC. Hierarchical agglomerative clustering. The computation time has been dramatically improved. We will detail the new procedure in a forthcoming tutorial.

CATVARHAC. Clustering of the levels of nominal variables. The calculations are based on the work of Abdallah and Saporta (1998). The component performs an agglomerative hierarchical clustering of the levels of the qualitative variables. Dice's index is used as the distance measure. Several kinds of linkage criteria are proposed: single linkage, complete linkage, average linkage, and Ward's method. A tutorial describing the method will come soon.

Download page : setup


Parallel programming in R

2013-10-27 (Tag:976144477561843298)

Personal computers are becoming more and more powerful. They are mostly equipped with multi-core processors. At the same time, most data mining tools, free or not, are based on single-threaded calculations: only one core is used during the computations, while the others remain inactive.

Previously, we introduced two multithreaded variants of linear discriminant analysis in Sipina 3.10 and 3.11. During the analysis that allowed me to develop these solutions, I studied at length the parallelization mechanisms available in other data mining tools. They are rather scarce. I noted that highly sophisticated strategies are proposed for the R software. These are often environments that enable us to develop programs for multi-core machines, multiprocessor machines, and even computer clusters. I studied in particular the "parallel" package, which is itself derived from the "snow" and "multicore" packages. Let us be quite clear: the library cannot miraculously accelerate an existing procedure. It gives us the opportunity to use the machine's resources effectively by properly rearranging the calculations. Basically, the idea is to break down the process into tasks that can be run in parallel; when these tasks are completed, we consolidate the results.

In this tutorial, we detail the parallelization of the calculation of the within-class covariance matrix under R 3.0.0. In a first step, we describe a single-threaded, but easily convertible, approach, i.e. one whose elementary tasks are easily identifiable. In a second step, we use the tools of the "parallel" and "doParallel" packages to run these elementary tasks on the available cores. We then compare the processing times. We note that, unlike the toy examples available on the web, the results are mixed: the bottleneck is the management of the data when we handle a large dataset.
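
To give the flavor of the approach, here is a minimal sketch with the parallel package, assuming a numeric matrix X and a factor y giving the classes; the details (and the data handling) differ from the program of the tutorial.

library(parallel)
cl <- makeCluster(detectCores() - 1)
parts <- split(as.data.frame(X), y)                                # one block of rows per class
covs <- parLapply(cl, parts, function(d) (nrow(d) - 1) * cov(d))   # per-class scatter matrices, in parallel
stopCluster(cl)
W <- Reduce("+", covs) / (nrow(X) - length(parts))                 # pooled within-class covariance matrix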

Keywords: linear discriminant analysis, within-class covariance matrix, R software, parallel package, doparallel package, parLapply, mclapply, foreach
Tutorial: en_Tanagra_Parallel_Programming_R.pdf
Files: parallel.R, multithreaded_lda.zip
References:
R-core, Package 'parallel', April 18, 2013.


Load balanced multithreading for LDA

2013-09-29 (Tag:9023299614701519666)

In a previous paper, we described a multithreading strategy for linear discriminant analysis. The aim was to take advantage of the multicore processors of recent computers. We noted that, for the same memory occupation as the standard implementation, we can dramatically decrease the computation time, depending on the dataset characteristics. The solution however had two drawbacks: the number of cores used depended on the number of classes K of the dataset; and the load of the cores depended on the class distributions. For instance, for one dataset with K = 2 highly unbalanced classes, the gain was negligible compared to the single-threaded version.

In this paper, we present a new approach for the multithreaded implementation of linear discriminant analysis, available in Sipina 3.11. It overcomes the two bottlenecks of the previous version: the capacity of the machine is fully used. More interestingly, the number of threads (cores) used becomes customizable, allowing the user to adapt the machine resources used to process the database. But this comes at a cost: the memory occupation is increased, and it depends on both the characteristics of the data and the number of cores that we want to use.

To evaluate the improvement introduced in this new version, we use various benchmark datasets to compare its computation time with those of the previous multithreaded approach, the single-threaded version, and the state-of-the-art proc discrim of SAS 9.3.

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package, load balancing
Components: LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Sipina_LDA_Threads_Bis.pdf
Dataset: multithreaded_lda.zip
References:
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis, PennState, Online Learning: Department of Statistics.


Tanagra - Version 1.4.49

2013-09-15 (Tag:1401998939033736436)

Some enhancements regarding the factor analysis approaches (PCA - principal component analysis, MCA - multiple correspondence analysis, CA - correspondence analysis, AFDM - factor analysis for mixed data) have been incorporated. In particular, the outputs have been completed.

The VARIMAX rotation has been improved. Thanks to Frédéric Glausinger for optimized source code.

The Benzecri correction is added in the MCA outputs. Thanks to Bernard Choffat for this suggestion.

Download page : setup


Sipina - Version 3.11

2013-05-30 (Tag:1084203423507322078)

A new multithreaded version of linear discriminant analysis is added to Sipina 3.11. Compared to the previous one, it has two assets: (1) it is able to use all the resources available on machines with multi-core processors or multiple processors; (2) the load balancing is better. It requires however more memory, since the internal calculation structures are duplicated M times (M being the number of threads).

A tutorial will come to compare the behavior of this approach with the previous version and the single-threaded implementation.

Keywords: linear discriminant analysis, multithreaded implementation, multithread
Sipina website: Sipina
Download: Setup file


Multithreading for linear discriminant analysis

2013-05-29 (Tag:2335893243752123651)

Most modern personal computers have a multicore CPU, which considerably increases their processing capabilities. Unfortunately, the popular free data mining tools do not really incorporate multithreaded processing in the data mining algorithms they provide, aside from particular cases such as ensemble methods or the cross-validation process. The main reason for this scarcity is that it is impossible to define a generic framework valid whatever the mining method. We must carefully study the sequential algorithm, detect the opportunities for multithreading, and reorganize the calculations. We deal with several constraints: we must not increase the memory occupation excessively, we must use all the available cores, and we must balance the load across the threads. Of course, the solution must be simple and operational on usual personal computers.

Previously, we implemented a solution for decision tree induction in Sipina 3.5. We also studied the solutions incorporated in Knime and RapidMiner. We showed that the multithreaded programs outperform the single-threaded version, which is wholly natural. But we also observed that there is not a unique solution: the internal organization of the multithreaded calculations influences the behavior and the performance of the program. In this tutorial, we present a multithreaded implementation of linear discriminant analysis in Sipina 3.10. The main property of the solution is that its calculation structure requires the same amount of memory as the sequential program. We note that, in some situations, the execution time can be decreased significantly.

Linear discriminant analysis is interesting in our context. We obtain a linear classifier whose classification performance is similar to that of the other linear methods on most real databases, especially compared with logistic regression, which is really popular (Saporta, 2006 - page 480; Hastie et al., 2013 - page 128). But the computation of discriminant analysis is comparatively much faster. We will see that this characteristic can be further enhanced when we take advantage of the multicore architecture.

To better evaluate the improvements induced by our strategy, we compare our execution times with those of tools such as SAS 9.3 (proc discrim), R (lda from the MASS package) and Revolution R Community (an "optimized" version of R).

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package
Components: LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Sipina_LDA_Threads.pdf
Dataset: multithreaded_lda.zip
References:
Tanagra, "Multithreading for decision tree induction".
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis, PennState, Online Learning: Department of Statistics.


Sipina - Version 3.10

2013-05-23 (Tag:8960288905933953545)

The linear discriminant analysis has been enhanced. All operations are performed in a single pass on the data.

A multithreaded version of the linear discriminant analysis has been added. It improves the execution speed by distributing the calculations over the cores (computer with a multi-core processor) or processors (multi-processor computer) available on the machine.

A tutorial will describe the behavior of these new implementations on some large databases.

Keywords: linear discriminant analysis, multithreaded implementation
Sipina website: Sipina
Download: Setup file


Factor Analysis for Mixed Data

2013-03-31 (Tag:2875833766211338559)

Usually, as a factor analysis approach, we use principal component analysis (PCA) when the active variables are all quantitative, and multiple correspondence analysis (MCA) when they are all categorical. But what should we do when we have a mix of these two types of variables?

A possible strategy is to discretize the quantitative variables and use MCA. But this procedure is not recommended if we have a small dataset (a small number of instances), or if the number of qualitative variables is low in comparison with the number of quantitative ones. In addition, discretization implies a loss of information, and the choice of the number of intervals and the calculation of the cut points are not obvious.
Another possible strategy is to replace each qualitative variable by a set of dummy variables (a 0/1 indicator for each category of the variable to recode), and then use PCA. This strategy has a drawback: because the dispersions of the variables (the quantitative variables and the indicator variables) are not comparable, we obtain biased results.

The Jérôme Pages' "Multiple Factor Analysis for Mixed Data" (2004) [AFDM in French] relies on this second idea. But it introduces an additional refinement. It uses dummy variables, but instead of the 0/1, it uses the 0/x values, where 'x' is computed from the frequency of the concerned category of the qualitative variable. We can therefore use a standard program for PCA to lead the analysis (Pages, 2004; page 102). The calculation process is thus well controlled. But the interpretation of the results requires a little extra effort since it will be different depending on whether we study the role of a quantitative or qualitative variable.

In this tutorial, we show how to perform an AFDM with Tanagra 1.4.46 and R 1.15.1 (FactoMineR package). We emphasize the reading of the results: we must study simultaneously the influence of the quantitative and the qualitative variables for the interpretation of the factors.

Keywords: PCA, principal component analysis, MCA, multiple correspondence analysis, AFDM, correlation, correlation ratio, FactoMineR package, R software
Components: AFDM, SCATTERPLOT WITH LABEL, CORRELATION SCATTERPLOT, VIEW MULTIPLE SCATTERPLOT
Tutorial: en_Tanagra_AFDM.pdf
Dataset: AUTOS2005AFDM.txt
References:
Jerome Pages, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquee, tome 52, n°4, 2004 ; pages 93-111.


Correspondence Analysis - Tools comparison

2013-03-02 (Tag:8596356504858196836)

Correspondence analysis (or factorial correspondence analysis) is an exploratory technique which enables us to detect the salient associations in a two-way contingency table. It offers an attractive graphical display where the rows and the columns of the table are depicted as points. Thus, we can visually identify the similarities and the differences between the row profiles (and between the column profiles). We can also detect the associations between rows and columns.

Correspondence analysis (CA) can be viewed as an approach to decompose the chi-squared statistic associated with a two-way contingency table into orthogonal factors. In fact, because CA is a descriptive technique, it can be applied to tables even if the chi-square test of independence is not appropriate. The only restrictions are that the table must contain positive or zero values, that the sums of the rows and of the columns can be calculated, and that the row and column profiles can be interpreted.

Correspondence analysis can also be viewed as a factorial technique: the factors are latent variables defined as linear combinations of the row profiles (or column profiles). We can use the factor score coefficients to calculate the coordinates of supplementary rows or columns.

In this tutorial, we show how to implement CA on a realistic dataset with various tools: Tanagra 1.4.48, which incorporates new features for a better reading of the results; R, using the "ca" and "ade4" packages; OpenStat; and SAS (PROC CORRESP). We will see - as always - that all these tools produce exactly the same numerical results (fortunately!). The differences lie mainly in the organization of the outputs.
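
As an illustration of the R side, a minimal sketch with the ca package, assuming that tab is a two-way contingency table (a matrix of counts):

library(ca)
res <- ca(tab)
summary(res)   # eigenvalues, contributions, quality of representation
plot(res)      # symmetric map of the row and column points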

Keywords: correspondence analysis, symmetric graph, R software, package ca, package ade4, openstat, sas
Components: CORRESPONDENCE ANALYSIS
Tutorial: en_Tanagra_Correspondence_Analysis.pdf
Dataset: statements_foods.zip
References:
M. Bendixen, « A practical guide to the use of the correspondence analysis in marketing research », Marketing Research On-Line, 1 (1), pp. 16-38, 1996.
Tanagra Tutorial, "Correspondence Analysis".


Exploratory Factor Analysis

2013-02-05 (Tag:4545576267743879367)

PCA (Principal Component Analysis) is a dimension reduction technique which enables us to obtain a synthetic description of a set of quantitative variables. It produces latent variables called principal components (or factors) which are linear combinations of the original variables. The number of useful components is much lower than the number of original variables, because the latter are (more or less) correlated. PCA also enables us to reveal the internal structure of the data, because the components are constructed so as to explain the variance of the data optimally.

PFA (Principal Factor Analysis) is often confused with PCA. There has been significant controversy about the equivalence, or otherwise, of the two techniques. One point of view which enables us to distinguish them is to consider that the factors from PCA account for the maximal amount of variance of the available variables, while those from PFA account only for the common variance in the data. The latter seems more appropriate if the goal of the analysis is to produce latent variables which highlight the underlying relations between the original variables; the influence of the variables which are not related to the others should be excluded.

The two approaches thus differ in the nature of the information they make use of. But the nuance is not obvious, especially as they are often grouped in the same tool in some popular software (e.g. "PROC FACTOR" in SAS; "ANALYZE / DATA REDUCTION / FACTOR" in SPSS; etc.). In addition, their outputs and their interpretation are very similar.

In this tutorial, we present three approaches: Principal Component Analysis (PCA); non-iterative Principal Factor Analysis (PFA); and non-iterative Harris Component Analysis (Harris). We highlight the differences by comparing the matrix (the correlation matrix for PCA) used for the diagonalization process. We detail the steps of the calculations using a program for R. We check our results by comparing them to those of SAS (PROC FACTOR). Thereafter, we implement these methods with Tanagra, with R using the psych package, and with SPSS.
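
As an illustration of the R side, a minimal sketch with the psych package, assuming a data frame X of quantitative variables; the number of factors is arbitrary here.

library(psych)
R <- cor(X)
pca <- principal(R, nfactors = 2, rotate = "varimax", n.obs = nrow(X))   # principal components
pfa <- fa(R, nfactors = 2, fm = "pa", rotate = "none", n.obs = nrow(X))  # principal (axis) factors
pca$loadings
pfa$loadings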

Keywords: PCA, principal component analysis, correlation matrix, principal factor analysis, harris, reproduced correlation, residual correlation, partial correlation, varimax rotation, R software, psych package, principal( ), fa( ), proc factor, SAS, SPSS
Components: PRINCIPAL COMPONENT ANALYSIS, PRINCIPAL FACTOR ANALYSIS, HARRIS COMPONENT ANALYSIS, FACTOR ROTATION
Tutorial: en_Tanagra_Principal_Factor_Analysis.pdf
Datasets: beer_rnd.zip
References:
D. Suhr, "Principal Component Analysis vs. Exploratory Factor Analysis".
Wikipedia, "Factor Analysis".


New features for PCA in Tanagra

2013-01-18 (Tag:561068860734465992)

Principal Component Analysis (PCA) is a very popular dimension reduction technique. The aim is to produce a small number of factors which summarize as well as possible the information in the data. The factors are linear combinations of the original variables. From a certain point of view, PCA can be seen as a compression technique.

The determination of the appropriate number of factors is a difficult problem in PCA. Various approaches are possible; no state-of-the-art method really exists. The only way to proceed is to try different approaches in order to obtain a clear indication of a good solution. We showed how to program them under R in a recent paper. These techniques are now incorporated into Tanagra 1.4.45. We have also added the KMO index (Measure of Sampling Adequacy - MSA) and Bartlett's test of sphericity to the Principal Component Analysis tool.

In this tutorial, we present these new features incorporated into Tanagra on a realistic example. To check our implementation, we compare our results with those of SAS PROC FACTOR when the equivalent is available.

Keywords: principal component analysis, pca, sas, proc princomp, proc factor, bartlett's test of sphericity, R software, scree plot, cattell, kaiser-guttman, karlis saporta spinaki, broken stick approach, parallel analysis, randomization, bootstrap, correlation, partial correlation, varimax, factor rotation, variable clustering, msa, kmo index, correlation circle
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT, PARALLEL ANALYSIS, BOOTSTRAP EIGENVALUES, FACTOR ROTATION, SCATTERPLOT, VARHCA
Tutorial: en_Tanagra_PCA_New_Tools.pdf
Dataset : beer_pca.xls
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"
Tanagra - "Choosing the number of components in PCA"


Choosing the number of components in PCA

2013-01-12 (Tag:8029704394964592082)

Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors (or components) are linear combinations of the original variables.

Choosing the right number of factors is a crucial problem in PCA. If we select too many factors, we include noise from sampling fluctuations in the analysis. If we choose too few factors, we lose relevant information and the analysis is incomplete. Unfortunately, there is no indisputable approach for determining the number of factors. As a rule of thumb, we must select only the interpretable factors, knowing that the choice depends heavily on domain expertise. And yet, this expertise is not always available; we intend precisely to build on the data analysis to get a better knowledge of the studied domain.

In this tutorial, we present various approaches for the determination of the right number of factors for PCA based on the correlation matrix. Some of them, such as the Kaiser-Guttman rule or the scree plot method, are very popular even if they are not really statistically sound; others seem more rigorous, but are seldom if ever used because they are not available in the popular statistical software suites.

At first, we use Tanagra and the Excel spreadsheet to implement some methods; then, especially for the resampling-based approaches, we write R programs based on the results of the princomp() procedure.
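
A minimal sketch of some of these rules in base R, assuming a numeric data frame X; the number of replications and the quantile level of the parallel analysis are arbitrary.

eig <- eigen(cor(X))$values
p <- length(eig)
sum(eig > 1)                                            # Kaiser-Guttman rule
bstick <- sapply(1:p, function(k) sum(1 / (k:p))) / p   # broken-stick thresholds
sum(eig / sum(eig) > bstick)                            # broken-stick rule
# parallel analysis by randomization: compare with eigenvalues of column-shuffled data
rand <- replicate(200, eigen(cor(apply(X, 2, sample)))$values)
sum(eig > apply(rand, 1, quantile, probs = 0.95))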

Keywords: principal component analysis, factor analysis, pca, princomp, R software, bartlett's test of sphericity, xlsx package, scree plot, kaiser-guttman rule, broken-stick method, parallel analysis, randomization, bootstrap, correlation, partial correlation
Components: PRINCIPAL COMPONENT ANALYSIS, LINEAR CORRELATION, PARTIAL CORRELATION
Tutorial: en_Tanagra_Nb_Components_PCA.pdf
Dataset: crime_dataset_pca.zip
References:
D. Jackson, “Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches”, in Ecology, 74(8), pp. 2204-2214, 1993.
P. Neto, D. Jackson, K. Somers, “How Many Principal Components? Stopping Rules for Determining the Number of non-trivial Axes Revisited”, in Computational Statistics & Data Analysis, 49(2005), pp. 974-997, 2004.
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"


PCA using R - KMO index and Bartlett's test

2013-01-07 (Tag:3564034170267732499)

Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors are linear combinations of the original variables. The approach can handle only quantitative variables.

We have presented PCA in previous tutorials. In this paper, we describe in detail two indicators used to check whether implementing a PCA on a dataset is worthwhile: Bartlett's sphericity test and the KMO index. They are directly available in some commercial tools (e.g. SAS or SPSS). Here, we describe the formulas and show how to program them under R. We compare the results obtained with those of SAS on a dataset.
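
For readers who only want the results (rather than the formulas programmed in the tutorial), the psych package provides ready-made functions; a minimal sketch, assuming a numeric data frame X:

library(psych)
R <- cor(X)
cortest.bartlett(R, n = nrow(X))   # Bartlett's test of sphericity
KMO(R)                             # overall KMO index (MSA) and per-variable MSA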

Keywords: principal component analysis, pca, spss, sas, proc factor, princomp, kmo index, msa, measure of sampling adequacy, bartlett's sphericity test, xlsx package, psych package, R software
Components: VARHCA, PRINCIPAL COMPONENT ANALYSIS
Tutorial: en_Tanagra_KMO_Bartlett.pdf
Dataset: socioeconomics.zip
References:
Tanagra Tutorial - "Principal Component Analysis (PCA)"
Tanagra Tutorial - "VARIMAX rotation in Principal Component Analysis"
SPSS - "Factor algorithms"
SAS - "The Factor procedure"


Discriminant Correspondence Analysis

2012-12-30 (Tag:3757425129067992641)

The aim of canonical discriminant analysis is to explain the membership of the instances of a dataset in pre-defined groups. The groups are specified by a dependent categorical variable (class attribute, response variable); the explanatory variables (descriptors, predictors, independent variables) are all continuous. We thus obtain a small number of latent variables which enable us to distinguish the groups as far as possible. These new features, called factors, are linear combinations of the initial descriptors. The process is a valuable dimensionality reduction technique. But its main drawback is that it cannot be directly applied when the descriptors are discrete. Even if the calculations are possible when we recode the variables, using dummy variables for instance, the interpretation of the results - which is one of the main goals of canonical discriminant analysis - is not really obvious.

In this tutorial, we present a variant of discriminant analysis, due to Hervé Abdi (2007), which is applicable to discrete descriptors. The approach is based on a transformation of the raw dataset into a kind of contingency table: the rows of the table correspond to the values of the target attribute; the columns are the indicators associated with the predictors' values. The author thus suggests using a correspondence analysis, on the one hand to distinguish the groups, and on the other hand to detect the relevant relationships between the values of the target attribute and those of the explanatory variables. The author called his approach "discriminant correspondence analysis" because it uses a correspondence analysis framework to solve a discriminant analysis problem.

In what follows, we detail the use of discriminant correspondence analysis with Tanagra 1.4.48. We use the example described in Hervé Abdi's paper. The goal is to explain the origin of 12 wines (3 possible regions) using 5 descriptors related to characteristics assessed by professional tasters. In a second part (section 3), we reproduce all the calculations with a program written in R.
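
A minimal sketch of the underlying data transformation in R, assuming a data frame d with a factor target d$y and factor descriptors; it only mimics the general idea, not Abdi's exact procedure.

library(ca)
X <- d[, setdiff(names(d), "y")]
tab <- do.call(cbind, lapply(X, function(f) table(d$y, f)))   # classes in rows, descriptor categories in columns
res <- ca(tab)                                                # correspondence analysis of the condensed table
summary(res)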

Keywords: canonical discriminant analysis, descriptive discriminant analysis, correspondence analysis, R software, xlsx package, ca package
Components: DISCRIMINANT CORRESPONDENCE ANALYSIS
Tutorial: Tutorial DCA
Dataset: french_wine_dca.zip
References:
H. Abdi, « Discriminant correspondence analysis », In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage. pp. 270-275, 2007.


Tanagra - Version 1.4.48

2012-12-01 (Tag:6547546986750083794)

New components have been added.

K-Means Strengthening. This component was suggested to me by Mrs. Claire Gauzente. The idea is to strengthen an existing partition (e.g. from a HAC) by using the K-Means algorithm. A comparison of the groups before and after the optimization is provided, indicating the efficiency of the optimization. The approach can be plugged into any clustering algorithm in Tanagra. Thanks to Claire for this valuable idea.

Discriminant Correspondence Analysis. This is an extension of canonical discriminant analysis to discrete attributes (Hervé Abdi, 2007). The approach is based on a clever transformation of the dataset: the initial dataset is transformed into a crosstab, with the values of the target attribute in rows and all the values of the input attributes in columns. The algorithm performs a correspondence analysis on this new data table to identify the associations between the values of the target and the input variables. Thus, we have the tools of correspondence analysis at our disposal for a comprehensive reading of the results (factor scores, contributions, quality of representation).

Other components have been improved.

HAC. After the choice of the number of groups in the dendrogram of the Hierarchical Agglomerative Clustering, a last pass over the data is performed: it assigns each individual of the learning sample to the group with the nearest centroid. Thus, there may be a discrepancy between the number of instances displayed on the tree nodes and the number of individuals in the groups. Tanagra displays the two partitions. Only the last one is used when Tanagra applies the clustering model to new instances, when it computes conditional statistics, etc.

Correspondence Analysis. Tanagra now provides the coefficients of the factor score functions for supplementary columns and rows in the factorial correspondence analysis. Thus, it is possible to easily calculate the factor scores of new points described by their row or column profile. Finally, the result tables can be sorted according to the contributions of the categories to the factors.

Multiple correspondence analysis. Several improvements have been made to the multiple correspondence analysis: the component can now take into account supplementary continuous and discrete variables; the variables can be sorted according to their contribution to the factors; all the indicators for the interpretation can be brought together in a single large table for a synthetic visualization of the results (this feature is especially interesting if we have a small number of factors); and the coefficients of the factor score functions are provided, so that we can easily calculate the factorial coordinates of supplementary individuals outside Tanagra.

Some tutorials will come soon to describe the use of these components on realistic case studies.

Download page : setup


Linear Discriminant Analysis - Tools comparison

2012-11-05 (Tag:1683994649238490811)

Linear discriminant analysis is a popular method in statistics, machine learning and pattern recognition. Indeed, it has interesting properties: it is rather fast on large databases; it naturally handles multi-class problems (target attribute with more than 2 values); it generates a linear classifier that is easy to interpret; it is robust and fairly stable, even on small databases; and it has an embedded variable selection mechanism. Personally, I appreciate linear discriminant analysis because it allows multiple interpretations (probabilistic, geometric), and thus highlights various aspects of supervised learning.

In this tutorial, we highlight the similarities and the differences between the outputs of Tanagra, R (MASS and klaR packages), SAS, and SPSS. The main conclusion is that, even if the presentation is not always the same, we ultimately obtain exactly the same results. This is what matters most.
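
As a rough sketch of the R side of such a comparison (the data frame alcohol and its target TYPE are illustrative names, not necessarily those used in the tutorial):

library(MASS)
library(klaR)
lda.fit <- lda(TYPE ~ ., data = alcohol)               # predictive discriminant analysis
pred <- predict(lda.fit, alcohol)$class
mean(pred != alcohol$TYPE)                             # resubstitution error rate
greedy.wilks(TYPE ~ ., data = alcohol, niveau = 0.05)  # stepwise selection, akin to STEPDISC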

Keywords: linear discriminant analysis, predictive discriminant analysis, canonical discriminant analysis, variable selection, feature selection, sas, stepdisc, candisc, R software, xlsx package, MASS package, lda, klaR package, greedy.wilks, confusion matrix, resubstitution error rate
Components: LINEAR DISCRIMINANT ANALYSIS, CANONICAL DISCRIMINANT ANALYSIS, STEPDISC
Tutorial: en_Tanagra_LDA_Comparisons.pdf
Dataset: alcohol
References :
Wikipedia - "Linear Discriminant Analysis"


Handling missing values in prediction process

2012-10-29 (Tag:3618529355750362645)

The treatment of missing values during the learning process has received a lot of attention from researchers. We have published a tutorial on this topic in the context of logistic regression induction. By contrast, the handling of missing values during the classification process, i.e. when we apply the classifier to an unlabeled instance, is less studied. Yet the problem is important. Indeed, the model is designed to work only when the instance to label is fully described. If some values are not available, we cannot apply the model directly. We need a strategy to overcome this difficulty.

In this tutorial, we are in the supervised learning context. The classifier is a logistic regression model and all the descriptors are continuous. We want to evaluate, on various datasets from the UCI repository, the behavior of two imputation methods: the univariate approach and the multivariate approach. The constraint is that the imputation models must rely only on information from the learning sample. We assume that the learning sample itself does not contain missing values.

Note that in our experiments the missing values on the instances to classify are "missing completely at random" (MCAR), i.e. each descriptor has the same probability of being missing.
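
A minimal sketch of the two imputation strategies in R, assuming a complete training frame Xtrain and a one-row data frame newx with some NA descriptors (all names are illustrative):

# univariate imputation: replace each missing descriptor by its mean on the learning sample
mu <- colMeans(Xtrain)
miss <- which(is.na(newx))
newx.univ <- newx
newx.univ[1, miss] <- mu[miss]

# multivariate imputation: regress the missing descriptor on the others, on the learning sample
# (sketch for a single missing descriptor, here X1)
fit1 <- lm(X1 ~ ., data = Xtrain)
newx.multi <- newx
newx.multi$X1 <- predict(fit1, newdata = newx)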

Keywords: missing values, missing features, classification model, logistic regression, multiple linear regression, r software, glm, lm, NA
Components: Binary Logistic Regression
Tutorial: en_Tanagra_Missing_Values_Deployment.pdf
Dataset and programs (R language): md_logistic_reg_deployment.zip
References:
Howell, D.C., "Treatment of Missing Data".
M. Saar-Tsechansky, F. Provost, “Handling Missing Values when Applying Classification Models”, JMLR, 8, pp. 1625-1657, 2007.


Handling Missing Values in Logistic Regression

2012-10-14 (Tag:3251410656580534676)

The handling of missing data is a difficult problem. Not because of its management, which is simple (we just flag the missing value with a specific code), but because of the consequences of its treatment on the characteristics of the models learned from the treated data.

We have already analyzed this problem in a previous paper, where we studied the impact of the different missing value treatment techniques on a decision tree learning algorithm (C4.5). In this paper, we repeat the analysis by examining their influence on the results of logistic regression. We consider the following configuration: (1) the missing values are MCAR; we wrote a program which randomly removes some values from the learning sample; (2) we apply logistic regression to the pre-treated training data, i.e. a dataset to which a missing value processing technique has been applied; (3) we evaluate the different missing data treatment techniques by measuring the accuracy of the classifier on a separate test sample which has no missing values.

First, we conduct the experiments with R. We compare the listwise deletion approach with univariate imputation (the mean for the quantitative variables, the mode for the categorical ones). We will see that the latter is a very viable approach in the MCAR situation. Then, we study the tools available in Orange, Knime and RapidMiner. We will observe that, despite their sophistication, they are not better than univariate imputation in our context.
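
A hedged sketch of the R part of the experiment, assuming a training frame train with a binary target y (the names and the 10% missingness rate are illustrative):

set.seed(1)
# (1) inject MCAR missing values into the descriptors
mcar <- function(v, p = 0.10) { v[runif(length(v)) < p] <- NA; v }
train.na <- data.frame(lapply(train[setdiff(names(train), "y")], mcar), y = train$y)
# (2a) listwise deletion: glm simply drops the incomplete rows
m.del <- glm(y ~ ., data = train.na, family = binomial, na.action = na.omit)
# (2b) univariate imputation: mean for numeric variables, mode for categorical ones
imp <- function(v) {
  if (is.numeric(v)) v[is.na(v)] <- mean(v, na.rm = TRUE)
  else v[is.na(v)] <- names(which.max(table(v)))
  v
}
m.imp <- glm(y ~ ., data = data.frame(lapply(train.na, imp)), family = binomial)
# (3) compare the error rates of m.del and m.imp on a complete, separate test sample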

Keywords: missing value, missing data, logistic regression, listwise deletion, casewise deletion, univariate imputation, R software, glm
Tutorial: en_Tanagra_Missing_Values_Imputation.pdf
Dataset and programs: md_experiments.zip
References:
Howell, D.C., "Treatment of Missing Data".
Allison, P.D. (2001), « Missing Data ». Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA : Sage.
Little, R.J.A., Rubin, D.B. (2002), « Statistical Analysis with Missing Data », 2nd Edition, New York : John Wiley.


Tanagra - Version 1.4.47

2012-09-24 (Tag:8227361639516944264)

Non iterative Principal Factor Analysis (PFA). This is an approach which tries to detect underlying structures in the relationships between the variables of interest. Unlike PCA, PFA focuses only on the shared variance of the set of variables. It is suited when the goal is to uncover the latent structure of the variables. It works on a slightly modified version of the correlation matrix, where each diagonal element (the prior communality estimate of the variable) is replaced by its squared multiple correlation with all the other variables.
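
A compact sketch of this non-iterative principal factor analysis in R, assuming a numeric data frame X (two factors retained here for illustration):

R <- cor(X)
smc <- 1 - 1 / diag(solve(R))      # squared multiple correlation of each variable with the others
Rr <- R
diag(Rr) <- smc                    # prior communality estimates replace the diagonal
dec <- eigen(Rr)
loadings <- dec$vectors[, 1:2] %*% diag(sqrt(dec$values[1:2]))   # loadings on the first two factors
rownames(loadings) <- colnames(X)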

Harris Component Analysis. This is a non-iterative factor analysis approach. It tries to detect underlying structures in the relationships between the variables of interest. Like Principal Factor Analysis, it focuses on the shared variance of the set of variables and works on a modified version of the correlation matrix.

Principal Component Analysis. Two features have been added: the reproduced and residual correlation matrices can be computed, and the variables can be sorted according to their loadings in the output tables.

These three components can be combined with the FACTOR ROTATION component (varimax or quartimax).

They can also be combined with the resampling approaches for the detection of the relevant number of factors (PARALLEL ANALYSIS and BOOTSTRAP EIGENVALUES).

Download page: setup


Tanagra - Version 1.4.46

2012-09-01 (Tag:132794067523065377)

AFDM (Factor analysis for mixed data). It extends principal component analysis (PCA) to data containing a mixture of quantitative and qualitative variables. The method was developed by Pagès (2004). A tutorial describing the use of the method and the reading of the results will follow.
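
In R, a close equivalent can be obtained with FactoMineR (the function is called FAMD in recent versions of the package, AFDM in older ones); mydata is an assumed data frame mixing numeric and factor columns:

library(FactoMineR)
res <- FAMD(mydata, ncp = 5, graph = FALSE)
res$eig          # eigenvalues of the factors
res$ind$coord    # coordinates of the individuals on the factors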

Download page: setup


CVM and BVM from the LIBCVM toolkit

2012-07-06 (Tag:4142785508552234581)

Support Vector Machine algorithms are well known in the supervised learning domain. They are especially appropriate when we handle a dataset with a large number “p” of descriptors, but they are much less efficient when the number of instances “n” is very high. Indeed, a naive implementation has a time complexity of O(n^3) and a memory complexity of O(n^2). As a consequence, instead of the optimal solution, the learning algorithms often provide near-optimal solutions within a tractable computation time.

I recently discovered the CVM (Core Vector Machine) and BVM (Ball Vector Machine) approaches. The idea of the authors is really clever: since only near-optimal solutions can be obtained anyway, their approaches solve an equivalent problem which is easier to handle - the minimum enclosing ball problem in computational geometry - to detect the support vectors. So, we get a classifier which is as accurate as those obtained by the other SVM learning algorithms, but with an enhanced ability to process datasets with a large number of instances.

I found the papers really interesting. They are all the more interesting because all the tools needed to reproduce the experiments are provided: the program and the datasets. So, all the results shown in the papers can be verified. This contrasts with too many papers whose authors flaunt tremendous results that can never be reproduced.

The CVM and BVM methods are incorporated into the LIBCVM library. This is an extension of LIBSVM (version 2.85), which is already included in Tanagra. Since the source code of LIBCVM is available, I compiled it as a DLL (dynamic-link library) and also included it in Tanagra 1.4.44.

In this tutorial, we describe the behavior of the CVM and BVM supervised learning methods on the "Web" dataset available on the website of the authors. We compare the results and the computation time to those of the C-SVC algorithm based on the LIBSVM library.

Keywords: support vector machine, svm, libcvm, cvm, bvm, libsvm, c-svc
Components: SELECT FIRST EXAMPLES, CVM, BVM, C-SVC
Tutorial: en_Tanagra_LIBCVM_library.pdf
Dataset: w8a.txt.zip
References :
I.W. Tsang, A. Kocsor, J.T. Kwok : LIBCVM Toolkit, Version: 2.2 (beta)
C.C Chang, C.J. Lin : LIBSVM -- A Library for Support Vector Machines


Revolution R Community 5.0

2012-07-04 (Tag:930022275718163171)

The R software is a fascinating project. It has become a reference tool for the data mining process. With the R package system, we can extend its features almost without limit. Nearly all existing statistical / data mining techniques are available in R.

But while there are many packages, there are very few projects which aim to improve the R core itself. The source code is freely available; in theory, anyone can modify part or even all of the software. Revolution Analytics proposes an improved version of R. It provides Revolution R Enterprise which, according to their website, dramatically speeds up some calculations, can handle very large databases, and provides a visual development environment with a debugger. Unfortunately, this is a commercial tool, so I could not check these features. Fortunately, a community version is available. Of course, I have downloaded the tool to study its behavior.

Revolution R Community is a slightly improved version of base R. The enhancements essentially concern computational performance: it incorporates the Intel Math Kernel Library, which is especially efficient for matrix calculations, and it can also, in some circumstances, take advantage of multi-core processors. Performance benchmarks are available on the editor's website. The results are impressive, but we note that they are based on artificially generated datasets.

In this tutorial, we extend the benchmark to other data mining methods. We analyze the behavior of Revolution R Community 5.0 - 64-bit version in various contexts: binary logistic regression (glm); linear discriminant analysis (lda from the MASS package); induction of decision trees (rpart from the rpart package); and principal component analysis based on two different principles, the first one relying on the calculation of the eigenvalues and eigenvectors of the correlation matrix (princomp), the second one on a singular value decomposition of the data matrix (prcomp).
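
The comparison essentially boils down to timing the same R calls under both distributions; a sketch with assumed data frames (a binary target y for glm, a categorical class for lda and rpart, a numeric table Xnum for the PCA):

library(MASS); library(rpart)
system.time(m1 <- glm(y ~ ., data = dataset1, family = binomial))   # logistic regression
system.time(m2 <- lda(class ~ ., data = dataset2))                  # linear discriminant analysis
system.time(m3 <- rpart(class ~ ., data = dataset2))                # decision tree
system.time(p1 <- princomp(Xnum, cor = TRUE))   # PCA via eigen decomposition of the correlation matrix
system.time(p2 <- prcomp(Xnum, scale. = TRUE))  # PCA via singular value decomposition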

Keywords: R software, revolution analytics, revolution r community, logistic regression, glm, linear discriminant analysis, lda, principal components analysis, pca, princomp, prcomp, matrix calculations, eigenvalues, eigenvectors, singular value decomposition, svd, decision tree, cart, rpart
Tutorial: en_Tanagra_Revolution_R_Community.pdf
Dataset: revolution_r_community.zip
References :
Revolution Analytics, "Revolution R Community".


Introduction to SAS proc logistic

2012-07-02 (Tag:8915699812661951935)

In my courses at the University, I use only free data mining tools (R, Tanagra, Sipina, Knime, Orange, etc.) and spreadsheet applications (free or not). Sometimes, my students ask me whether the commercial tools (e.g. SAS, which is very popular in France) behave differently, in terms of usage or of how to read the results. I tell them that some of these commercial tools are available on the computers of our department, and that they can learn how to use them by starting from the tutorials available on the Web.

Unfortunately, especially in French, such tutorials about logistic regression are not numerous. We need a didactic document with clear screenshots which shows how to: (1) import a data file into a SAS library; (2) define an analysis with the appropriate settings; (3) read and understand the results.

In this tutorial, we describe the use of SAS PROC LOGISTIC (SAS 9.3). We measure its speed on a moderately sized dataset. We compare the results with those of Tanagra 1.4.43.

Keywords: sas, proc logistic, binary logistic regression
Components: BINARY LOGISTIC REGRESSION
Tutorial: en_Tanagra_SAS_Proc_Logistic.pdf
Dataset: wave_proc_logistic.zip
References :
SAS - "The LOGISTIC Procedure"
Tanagra - "Logistic regression - Software comparison"
Tanagra - "Logistic regression on large dataset"


SAS Add-In 4.3 for Excel

2012-06-30 (Tag:7800599215347878724)

The connection between a data mining tool and a spreadsheet application such as Excel is a really valuable feature. We benefit from the power of the former, and from the popularity and ease of use of the latter. Many people use a spreadsheet in their data preparation phase. Recently, I presented an add-in for the connection between R and Excel. In this document, I describe a similar tool for the SAS software.

SAS is a popular tool, well known to statisticians. But SAS is not really simple to use for non-specialists: we must know the command syntax before being able to perform a statistical analysis. With the SAS add-in for Excel, some of these drawbacks are alleviated: we do not need to load and organize the dataset into a SAS library; we do not need to know the command syntax to perform an analysis and set the associated parameters (we use a menu and dialog boxes instead); and the results are automatically incorporated into a new sheet of an Excel workbook (the post-processing of the results becomes easy).

In this tutorial, I describe the behavior of the add-in for various kinds of analyses (nonparametric statistics, logistic regression). We compare the results with those of Tanagra.

Keywords: excel, sas, add-on, add-in, logistic regression, nonparametric test
Components : MANN-WHITNEY COMPARISON, KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, ANSARI-BRADLEY SCALE TEST, KLOTZ SCALE TEST, MOOD SCALE TEST
Tutorial: en_Tanagra_SAS_AddIn_4_3_for_Excel.pdf
Dataset: scoring_dataset.xls
References :
SAS - http://www.sas.com/
SAS - "SAS Add-in for Microsoft Office"
Tanagra Tutorial - "Tanagra Add-In for Office 2007 and Office 2010"


Tanagra - Version 1.4.45

2012-06-12 (Tag:2713614976767144679)

New features for the principal component analysis (PCA).

PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: scree plot and cumulative explained variance curve. For the PCA on the correlation matrix: some outputs are provided for the detection of the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test); Bartlett's sphericity test is performed and Kaiser's measure of sampling adequacy (MSA) is calculated; the correlation matrix and the partial correlations between each pair of variables controlling for all the other variables (the negative anti-image correlation) are produced.

PARALLEL ANALYSIS. The component calculates the distribution of the eigenvalues for randomly generated datasets of the same size. It proceeds by randomization. It applies to principal component analysis and to multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than the 95th percentile of the randomized distribution (this setting can be modified).
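
A sketch of this randomization scheme for PCA in R (X is an assumed numeric data frame; 200 replicates and the 95th percentile mirror the default setting):

parallel.pca <- function(X, B = 200, probs = 0.95) {
  obs <- eigen(cor(X), only.values = TRUE)$values
  # permute each column independently and recompute the eigenvalues B times
  rand <- replicate(B, eigen(cor(apply(X, 2, sample)), only.values = TRUE)$values)
  thr <- apply(rand, 1, quantile, probs = probs)
  data.frame(eigenvalue = obs, threshold = thr, significant = obs > thr)
}
parallel.pca(X)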

BOOTSTRAP EIGENVALUES. It calculates bootstrap confidence intervals for the eigenvalues. A factor is considered significant if its eigenvalue is greater than a threshold which depends on the underlying factor method (PCA or MCA), or if the lower bound of its eigenvalue is greater than the upper bound of the eigenvalue of the following factor. The default confidence level (0.90) can be modified. This component can be applied to principal component analysis or to multiple correspondence analysis.

JITTERING. Jittering feature is incorporated to the scatter plot components (SCATTERPLOT, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL, VIEW MULTIPLE SCATTERPLOT).

RANDOM FOREST. Unused memory is released after the decision tree learning process. This feature is especially useful with ensemble learning approaches, where a large number of trees are stored in memory (BAGGING, BOOSTING, RANDOM FOREST). Memory occupation is reduced and processing capacity is improved.

Download page: setup


Tanagra - Version 1.4.44

2012-05-14 (Tag:8132510589199120224)

LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Update of the LIBSVM library for support vector machine algorithms (version 3.12, April 2012) [C-SVC, epsilon-SVR, nu-SVR]. The calculations are faster. The attributes can be normalized or not; previously they were always normalized automatically.

LIBCVM (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html; version 2.2). Incorporation of the LIBCVM library. Two methods are available: CVM and BVM (Core Vector Machine and Ball Vector Machine). The descriptors can be normalized or not.

TR-IRLS (http://autonlab.org/autonweb/10538). Update of the TR-IRLS library for logistic regression on large datasets (large number of predictive attributes) [last available version - 2006/05/08]. The deviance is automatically provided. The display of the regression coefficients is more precise (higher number of decimals). The user can tune the learning algorithm, especially the stopping rules.

SPARSE DATA FILE. Tanagra can now handle the sparse data file format (see the SVMlight or libsvm file formats). The data can be used in a supervised learning process or for a regression problem. A description of this kind of file is available online (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html). An example is shown below.
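
For illustration, a line of such a sparse (svmlight / libsvm style) file stores the class label followed only by the non-zero values, as index:value pairs (the figures here are made up):

+1 4:0.278 27:1 1534:0.066
-1 2:1 89:0.354 4076:0.5

All the columns that are not listed on a line are implicitly zero, which is what makes the file so compact.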

INSTANCE SELECTION. A new component for the selection of the first m individuals among n in a branch of the diagram is available [SELECT FIRST EXAMPLES]. This option is useful when the data file is the result of the concatenation of the learning and test samples.

Download page: setup


Using PDI-CE for model deployment (PMML)

2012-05-03 (Tag:3562460692384893631)

Model deployment is a crucial task of the data mining process. In supervised learning, it can mean applying the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). They all share one characteristic: the same tool is used for model construction and model deployment.

In this tutorial, we describe a process where the model is not constructed and deployed with the same tool. This is only possible if (1) the model is described in a standard format, and (2) the tool used for the deployment can handle both the database with unlabeled instances and the model. Here, we use the PMML standard description for sharing the model, and PDI-CE (Pentaho Data Integration Community Edition) for applying the model to the unseen cases.

We create a decision tree with various tools such as SIPINA, KNIME or RAPIDMINER; we export the model in the PMML format; then, we use PDI-CE to apply the model to a data file containing unlabeled instances. We see that the use of the PMML standard dramatically enhances the power of both the data mining tool and the ETL tool.

In addition, we describe other deployment solutions in this tutorial. We will see that Knime has its own PMML reader: it is able to apply a model to unlabeled datasets, whatever the tool used for the construction of the model, as long as the PMML standard is respected. In this sense, Knime can be substituted for PDI-CE. Another possible solution is Weka, which is included in the Pentaho Community Edition suite and can export the model in a proprietary format that PDI-CE can handle.

Keywords: model deployment, predictive model, pmml, decision tree, rapidminer 5.0.10, weka 3.7.2, knime 2.1.1, sipina 3.4
Tutorial: en_Tanagra_PDI_Model_Deployment.pdf
Dataset: heart-pmml.zip
References:
Data Mining Group, "PMML standard"
Pentaho, "Pentaho Kettle Project"
Pentaho, "Using the Weka Scoring Plugin"


Pentaho Data Integration - Kettle

2012-04-22 (Tag:6853802511405981081)

The Pentaho BI Suite is an open source Business Intelligence suite with integrated reporting, dashboard, data mining, workflow and ETL capabilities (http://en.wikipedia.org/wiki/Pentaho).

In this tutorial, we talk about the Pentaho BI Suite Community Edition (CE), which is freely downloadable. More precisely, we present Pentaho Data Integration (PDI-CE), also called Kettle. We show briefly how to load a dataset and perform a simple data analysis. The main goal of this tutorial is to introduce the next one, which focuses on the deployment of models designed with Knime, Sipina or Weka by using PDI-CE.

This document is based on the 4.0.1 stable version of PDI-CE.

Keywords: ETL, pentaho data integration, community edition, kettle, BI, business intelligence, data importation, data transformation, data cleansing
Tutorial: PDI-CE
Dataset: titanic32x.csv.zip
References :
Pentaho, Pentaho Community


Mining frequent itemsets

2012-04-09 (Tag:2064183514124742461)

Searching for regularities in datasets is the main goal of data mining. These regularities may take various forms. In market basket analysis, we look for co-occurrences of goods (items), i.e. goods which are often purchased together. These are called "frequent itemsets". For instance, one result may be "milk and bread are purchased together in 10% of the baskets".

Frequent itemset mining is often presented as the step preceding the association rule learning algorithm. At the end of that process, we highlight the direction of the relation: we obtain rules. For instance, a rule may be "90% of the customers who buy milk and bread also purchase butter". This kind of rule can be used in various ways; for instance, we can promote the sales of milk and bread in order to increase the sales of butter.

In fact, frequent itemsets also provide valuable information by themselves. Detecting the goods which are purchased together enables us to understand the relationship between them. It is a kind of variant of cluster analysis: we look for the items which go together. For instance, we can use this kind of information to reorganize the shelves of the store.

In this tutorial, we describe the use of the FREQUENT ITEMSETS component in Tanagra. It is based on Borgelt's "apriori.exe" program. We use a very small dataset, so that everyone can reproduce the calculations by hand. But first, we give some definitions related to the frequent itemset mining process.
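
For readers working in R, the same kind of extraction can be sketched with the arules package mentioned in the keywords (the file name and the thresholds are illustrative):

library(arules)
trans <- read.transactions("baskets.csv", format = "basket", sep = ",")   # one basket per line
fi <- apriori(trans, parameter = list(supp = 0.10, target = "frequent itemsets"))
inspect(head(sort(fi, by = "support"), 10))                               # most frequent itemsets
rules <- apriori(trans, parameter = list(supp = 0.10, conf = 0.90, target = "rules"))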

Keywords: frequent itemsets, closed itemsets, maximal itemsets, generator itemsets, association rules, R software, arules package
Components: FREQUENT ITEMSETS
Tutorial: en_Tanagra_Itemset_Mining.pdf
Dataset: itemset_mining.zip
References :
C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"
R. Lovin, "Mining Frequent Patterns"


Sipina add-on for OOCalc

2012-04-01 (Tag:4894565618570164225)

Combining a spreadsheet with data mining tools is essential for the popularity of the latter. Indeed, when we deal with a moderately sized dataset (thousands of rows and tens of variables), the spreadsheet is a practical tool for data preparation. It is also a valuable tool for preparing reports. It is thus not surprising that Excel, and more generally the spreadsheet, is one of the tools most used by data miners.

Both Tanagra and Sipina provide an add-on for Excel. The add-on inserts a data mining menu into the spreadsheet. The user can select a dataset and send it to Tanagra (or Sipina), which is automatically launched. But only Tanagra provides an add-on for Open Office Calc and Libre Office Calc; it was not available for Sipina.

This omission has been corrected for this new version of Sipina (Sipina 3.9). In this tutorial, we show how to install and use the “SipinaLibrary.oxt” add-on for Open Office Calc 3.3.0 (OOCalc). The process is the same for Libre Office 3.5.1.

Keywords: calc, open office, libre office, oocalc, add-on, add-in, sipina
Tutorial: en_sipina_calc_addon.pdf
Dataset: heart.xls
References :
Tutoriel Tanagra - Sipina add-in for Excel
Tutoriel Tanagra - Tanagra add-on for Open Office Calc 3.3
Open Office - http://www.openoffice.org
Libre Office - http://www.libreoffice.org/


Tanagra - Version 1.4.43

2012-03-29 (Tag:8902777730607995455)

A few bugs have been fixed and some new features added.

The computed contributions of individuals in PCA (PRINCIPAL COMPONENT ANALYSIS) have been corrected. They were not valid when working on a subsample of the data file. This error was reported by Mr. Gilbert Laffond.

The standardization of the factors after VARIMAX (FACTOR ROTATION) has been corrected so that their variance coincides with the sum of the squared correlations with the axes, and thus with the eigenvalue associated with each axis. This modification was suggested by Mr. Gilbert Laffond.

During the calculation of the confidence intervals of the PLS regression coefficients (PLS CONF. INTERVAL), an error could occur when the requested number of axes was greater than the number of predictor variables. It is now corrected. This error was reported by Mr. Alain Morineau.

In some circumstances, an error could occur in FISHER FILTERING, especially when Tanagra is run under Wine for Linux. We have introduced some additional checks. This error was reported by Mr. Bastien Barchiési.

Checking for missing values is now optional. Performance can be favored for the treatment of very large files; we then recover the performance of version 1.4.41 and earlier.

The "COMPONENT / COPY RESULTS" menu sends information in HTML format. It is now compatible with the spreadsheet Calc of Libre Office 3.5.1. It was operating with the Excel spreadsheet only before. Curiously, the copy to the OOCalc (Open Office spreadsheet) is not possible at the present time (Open Office 3.3.0).

Download page: setup


Sipina - Version 3.9

2012-03-23 (Tag:6180635437759286641)

The add-on “SipinaLibrary.oxt” has been added to the distribution. An additional menu is incorporated into the OOCalc spreadsheet; it enables SIPINA to be launched from a dataset (a range of cells). The add-on works with Open Office (tested with version 3.3.0) and Libre Office (version 3.5.1).

Note that a similar add-on exists for Excel (sipina.xla). It provides a connection between Sipina and Excel.

Keywords: sipina, OOCalc, open office, libre office, add-on, add-in
Sipina website: Sipina
Download: Setup file
References:
Tanagra - SIPINA add-in for Excel
Tanagra - Tanagra add-in for Excel 2007 and 2010
Open Office - http://www.openoffice.org/
Libre Office - http://www.libreoffice.org/


RExcel, a bridge between Excel and R

2012-03-21 (Tag:5804065042220992346)

Combining a specialized data mining tool with a spreadsheet is a very interesting idea. Most people know how to use a spreadsheet such as Excel (but also LibreOffice Calc, Open Office Calc, Gnumeric, etc.). It is really popular because it is a very easy-to-use tool for data manipulation.

Many data mining tools can read the XLS or XLSX file formats. But it is even more interesting to implement a bidirectional bridge between the data mining tool and Excel. We can then easily carry out the whole analysis by navigating between the tools: transforming the variables in Excel, performing the analysis in the data mining tool, and post-processing the results in Excel.

In this tutorial, we describe the RExcel library for R. It adds a new menu to Excel. Thus, we can send a dataset to R on the one hand, and retrieve a dataset, or more generally a vector or a matrix, from R on the other. The tool is really easy to use.
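
Once the data range has been transferred from Excel (say as a data frame named ventes), the R side of the analysis described in the tutorial is a standard regression; a sketch with assumed column names:

library(MASS)
m.full <- lm(sales ~ ., data = ventes)
m.sel  <- stepAIC(m.full, direction = "both")   # stepwise selection of the predictors
summary(m.sel)
pred <- predict(m.sel, newdata = ventes)        # values that can be sent back to the worksheet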

Keywords: data importation, excel file format, xls, xlsx, addin, add-in, addon, add-on, multiple linear regression
Components: lm, stepAIC, predict
Tutorial: en_Tanagra_RExcel.pdf
Dataset: ventes_regression_rexcel.zip
References :
T. Baier, E. Neuwirth, "Powerful data analysis from inside your favorite application"


PSPP, an alternative to SPSS

2012-03-04 (Tag:4525019325939557649)

I spend a lot of time analyzing the available free statistical and data mining tools. There is no bad software, but some tools are more appropriate than others for certain tasks. Thus, we must identify the one which is best suited to our configuration. For that, we must know a large number of tools.

In this tutorial, we describe PSPP. It is presented as an alternative to the well-known SPSS: "PSPP is a program for statistical analysis of sampled data. It is a free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions". Instead of describing each feature in detail (the documentation is available on the website), we present some statistical techniques. We compare the results with those of Tanagra, R 2.13.2 and OpenStat (build 24/02/2012). This is also a way to validate them: if they provide different results, it means that there is a problem.

Keywords: pspp, R software, openstat, spss, descriptive statistics, t-test , welch test, comparison of means, comparison of variances, levene's test, chi-squared test, contingency table, cross tabs, analysis of variance, anova, multiple regression, roc curve, auc, area under curve
Components: MORE UNIVARIATE CONT STAT, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, LEVENE'S TEST, T-TEST, T-TEST UNEQUAL VARIANCE, PAIRED T-TEST, ONE-WAY ANOVA, MULTIPLE LINEAR REGRESSION, ROC CURVE
Tutorial: en_Tanagra_PSPP.pdf
Dataset: autos_pspp.zip
References:
GNU PSPP, http://www.gnu.org/software/pspp/
R Project for Statistical Computing, http://www.r-project.org/
OpenStat, http://www.statprograms4u.com/


Regression analysis with LazStats (OpenStat)

2012-03-02 (Tag:3969808015279365028)

LazStats is a statistical package developed by Bill Miller, the author of OpenStat, a tool well known to statisticians for many years. These are tools of the highest quality. OpenStat is one of the tools I use when I want to validate my own implementations.

Several variants of OpenStat are available. In this tutorial, we study LazStats. It is a version programmed in Lazarus, a development environment very similar to Delphi and based on the Pascal language. Projects developed in Lazarus benefit from the "write once, compile anywhere" principle, i.e. we write our program under one OS (e.g. Windows), but we can compile it under any OS for which Lazarus and the compiler are available (e.g. Linux). This idea was proposed by Borland with Kylix some years ago: we could build a project for both Windows and Linux. Unfortunately, Kylix was discontinued. Lazarus seems more mature. In addition, it also enables us to compile the same project for the 32-bit and 64-bit versions of an OS.

In this tutorial, we present some of the regression analysis features of LazStats.

Keywords: linear regression, multiple regression, variable selection, forward, backward, stepwise, simultaneous regression
Tutorial: en_Tanagra_Regression_LazStats.pdf
Dataset: conso_vehicules_lazstats.txt
References :
LazStats - http://www.statprograms4u.com/
Lazarus - http://www.lazarus.freepascal.org/


Checking missing values in Tanagra

2012-02-19 (Tag:7518601898719913146)

Up to version 1.4.41, Tanagra did not handle missing values, because it seemed instructive to force the students, who are the main users of Tanagra, to think about and propose the most appropriate solution given the characteristics of their dataset and the goal of their analysis. Thus, Tanagra simply truncated the imported file at the first problematic row. This behavior often disconcerted users, especially since no error message was shown. They wondered why, although everything looked right, the data were not properly loaded.

From version 1.4.42, the importation of the text file format (tab separator) and of the XLS file format (Excel 97-2003), as well as the data transfer using the add-in for Excel (up to Excel 2010) and LibreOffice 3.5 / OpenOffice 3.3, have been modified. Tanagra reads all the rows of the base, but it skips the rows which are incomplete and/or inconsistent (e.g. a numeric value found in a column declared as a discrete attribute). Above all, an explicit message now reports the number of skipped rows, so the users are better informed.

In this tutorial, we show the management of missing data when we send the data from Excel to Tanagra using the Tanagra.xla add-in. Some cells are empty in the Excel data range. This example illustrates the new behavior of Tanagra. We would get the same behavior if we imported the XLS file directly, or if we imported the corresponding file in TXT format.

Keywords: missing values, missing data, inconsistent values, text file format importation, excel file format importation, add-in, add-on, tanagra.xla
Components: DATASET, VIEW DATASET
Tutorial: en_Tanagra_Missing_Data_Checking.pdf
Dataset: ronflement_with_missing_empty.zip
References:
Wikipedia, "Listwise deletion".
D.C. Howell, "Treatment of missing data".


Logistic regression on large dataset

2012-02-10 (Tag:3524244788060854256)

Programming fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, this translates into a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive: it improves our chances of obtaining the best model simply because we can try more configurations.

I have tried many solutions to improve the computation time of logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I hesitated a lot. Some studies helped me make the right choice.

Several tools implement logistic regression, so it is interesting to compare their computation times and memory occupation. I have already made this kind of comparison in the past. The novelty here is that I use a new operating system (the 64-bit version of Windows 7), and some tools are specifically intended for this system; their computing capabilities are greatly improved. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated with the original descriptors, but not with the class attribute; they should not be selected in the final model.

In this paper, in addition to Tanagra 1.4.41 (32-bit), we use R 2.13.2 (64-bit), Knime 2.4.2 (64-bit), Orange 2.0b (build 15 Oct 2011, 32-bit) and Weka 3.7.5 (64-bit).

Keywords: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka
Components: BINARY LOGISTIC REGRESSION, FORWARD LOGIT
Tutorial: en_Tanagra_Perfs_Bis_Logistic_Reg.pdf
Dataset: perfs_bis_logistic_reg.zip
References:
Tanagra, "Logistic regression - Software comparison", december 2008.
T.P. Minka, « A comparison of numerical optimizers for logistic regression », 2007.


Tanagra - Version 1.4.42

2012-02-04 (Tag:642080682076645612)

The Tanagra.xla add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.

With the FastMM memory manager, Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities, especially for the handling of large datasets, are improved.

The importation of the tab-delimited text file format and of the XLS file format (Excel 97-2003) has been made safer. Previously, the importation was interrupted and the dataset truncated when an invalid line (with missing or inconsistent values) was read. Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is reported in the importation report.

Download page: setup


ARS into the SIPINA package

2012-01-18 (Tag:5498871432178695292)

Association Rule Software (ARS) is a basic tool which extracts association rules from attribute-value datasets (categorical or binary attributes). It is distributed with the SIPINA package, which includes: a tool for the supervised learning framework, especially decision tree induction (SIPINA RESEARCH); a tool for linear regression (REGRESS); and ARS for association rule mining.

ARS automatically encodes the categorical attributes as dummy variables. If you want to use continuous attributes, you must discretize them beforehand.

This tutorial briefly describes the use of the Association Rule Software (ARS). Compared with the previous version, the GUI of the version incorporated into the SIPINA 3.8 package has been simplified.

Keywords: association rule mining, support, confidence, lift, conviction
Download: Sipina setup file
Tutorial: How to use ARS
References:
Wikipedia - Association rule learning


Sipina - Version 3.8

2012-01-18 (Tag:838180549957383950)

The tools (SIPINA RESEARCH, REGRESS and ASSOCIATION RULE SOFTWARE) included in the SIPINA distribution have been updated with some improvements.

SIPINA.XLA. The add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.

Importation of text data files. Processing time has been improved. This improvement also reduces the transfer time when we use the SIPINA.XLA add-in for Excel (which relies on a temporary file in the text file format).

Association rule software. The GUI has been simplified; the display of the rules is made more readable.

Because they are internally based on the FastMM memory manager, these tools can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities are improved.

Keywords: sipina, decision tree induction, association rule, multiple linear regression
Sipina website: Sipina
Download: Setup file
References:
Tanagra - SIPINA add-in for Excel
Tanagra - Tanagra add-in for Excel 2007 and 2010
Delphi Programming Resource - FastMM, a Fast Memory Manager


Tanagra website statistics for 2011

2012-01-02 (Tag:4312137056488766017)

The year 2011 ends, 2012 begins. I wish you all a very happy year 2012.

A small report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) have been visited 281,352 times this year, i.e. 770 visits per day. For comparison, we had 662 daily visits in 2010, 520 in 2009, and 349 in 2008.

Who are you? The majority of visits come from France and the Maghreb. Then comes a large proportion of French-speaking countries. Among non-francophone countries, we observe mainly the United States, India, the UK, Italy, Brazil, Germany, ...

Which pages are visited? The most popular pages are those that relate to documentation about data mining: course materials, tutorials, links to other documents available online, etc. This is not really surprising. I myself spend more time writing booklets and tutorials and studying the behavior of different software, including Tanagra.

Happy New Year 2012 to all.

Ricco.
Slideshow: Website statistics for 2011


Tanagra add-in for Excel 2010 - 64-bit version

2011-12-31 (Tag:6490662296817463764)

The current Tanagra.xla add-in works with the 32-bit version of Excel (up to Excel 2010), even when we are working under a 64-bit version of Windows. On the other hand, it does not work if we want to connect the 64-bit version of Excel to Tanagra. We must modify the add-in source code. These modifications are needed up to Tanagra version 1.4.41; they will be included automatically in the upcoming versions.

In this tutorial, we show the procedure to follow for this upgrade. The screenshots were taken with a French version of Excel 2007, but I think (and hope) that the adaptation to other versions (Excel 2010 and/or other languages) is easy.

Thank you very much to Mrs. Nathalie Jourdan-Salloum, who pointed out this problem and suggested the right solution to me.

Keywords: data importation, xls, xlsx, excel file format, macro-complémentaire, add-in, addin, add-on
Tutorial: en_Tanagra_Addin_Excel_64_bit.pdf
References:
Tanagra, "Tanagra add-in for Office 2007 and Office 2010".


Dealing with very large dataset (continuation)

2011-12-11 (Tag:446923616892985981)

Because I have recently updated my operating system (OS), I wondered how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 would handle a very large dataset, one which cannot be loaded into main memory on a 32-bit OS. This article completes a previous study in which we dealt with a moderately sized dataset of 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial, where we showed that, on a 32-bit OS, we cannot perform decision tree induction on this kind of database without a swapping system such as the one implemented in SIPINA. We noted that Tanagra can handle the dataset, but this is because it encodes the values of the categorical attributes on a single byte, so the memory occupation remains moderate.

In this tutorial, I analyze the behavior of the 64-bit versions of Knime and RapidMiner on this database. I use a 64-bit OS and 64-bit tools, but I have "only" 4 GB of available memory on my personal computer.

Keywords: very large dataset, decision tree, sampling, sipina, knime, rapidminer
Components: ID3
Tutorial: en_Tanagra_Tree_Very_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Dealing with very large dataset in Sipina".
Tanagra, "Decision tree and large dataset (continuation)".
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".


Decision tree and large dataset (continuation)

2011-10-29 (Tag:6587924256753614730)

One of the exciting aspects of computing is that things change very quickly. The machines are ever more efficient, the operating systems improve, and so does the software. Since writing an earlier tutorial about decision tree induction on a large dataset, I have acquired a new computer and I use a 64-bit OS (Windows 7). Some of the tools studied offer a 64-bit version (Knime, RapidMiner, R). I wondered how the various tools behave in this new context. To find out, I repeated the same experiment.

We note that a more powerful computer improves the computation time (by about 20%). The specific gain from a 64-bit version is relatively small, but it is real (about 10%). Some tools have clearly improved their implementation of decision tree induction (Knime, RapidMiner). On the other hand, we observe that the memory occupation remains stable for most of the tools in this new context.

Keywords: c4.5, decision tree, large dataset, wave dataset, knime2.4.2, orange 2.0b, r 2.13.2, rapidminer 5.1.011, sipina 3.7, tanagra 1.4.41, weka 3.7.4, windows 7 - 64 bits
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree_Suite.pdf
Screenshots : Experiment screenshots.
Dataset: wave500k.zip
References:
Tanagra, "Decision tree and large dataset".
R. Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufman, 1993.


A PRIORI PT updated

2011-09-25 (Tag:6057790099445204562)

A PRIORI PT is a tool dedicated to the extraction of association rules. This is one of the few components of Tanagra based on an external library: we use Borgelt's "apriori.exe" program. Until Tanagra 1.4.40, we used version 4.31 of "apriori.exe". From Tanagra 1.4.41, we introduce the latest update, 5.57 (2011/09/02). Even if the settings of the tool are slightly modified, the extracted rules and the reading of the results are identical.

We revisit a former tutorial to describe the behavior of this component (Association Rule Learning using A PRIORI PT); thus, we do not detail the construction of the diagram here. We try above all to highlight the improvement of the library, especially in terms of computation time. This improvement is really impressive.

Keywords: association rule, large dataset
Components: A priori PT
Tutorial: en_Tanagra_AprioriPT_Updated.pdf
Dataset: assoc_census.zip
Reference: C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"


Tanagra - Version 1.4.41

2011-09-22 (Tag:6024904027606357220)

A PRIORI PT. This component generates association rules. It is based on the Borgelt's apriori.exe program which has been recently updated (2011/09/02 - 5.57 version). The improvement of this new version, in terms of calculation time, is impressive.

FREQUENT ITEMSETS. Also based on the Borgelt's apriori.exe program (version 5.57), this component generates frequent (or closed, maximum, generators) itemsets.

Some tutorials are coming soon to describe the use of these new tools.

Download page: setup


New GUI for RapidMiner 5.0

2011-09-20 (Tag:7210427184341805629)

RapidMiner is a very popular data mining tool. It is (one of) the tools most used by data miners according to the annual KDnuggets polls (2011, 2010, 2009, 2008, 2007). There are two versions. We describe here the Community Edition, which is freely downloadable from the editor's website.

RapidMiner 5.0 has a new graphical user interface which is very similar to that of Knime. The organization of the workspace is the same. The sequence of data processing steps (using operators) is described with a diagram, called a "process" in the RapidMiner documentation. In fact, with version 5.0 RapidMiner has adopted the presentation used by the vast majority of data mining software. Some features are shared with many other tools, among others: the connection to the R software; the meta-nodes, which implement a loop or a standard sequence of operations; and the description of the method underlying each operator, permanently displayed in the right-hand part of the main window.

RapidMiner 5.0 has evolved substantially compared with previous versions (e.g. version 4.6, described in one of our tutorials), so I thought it appropriate to study it in detail by evaluating its behavior in the context of a standard data mining analysis. We want to implement the following process: (1) create a decision tree from a labeled dataset; (2) export the model (the classification tree) into an external file (PMML format) for later deployment; (3) assess the model performance using a cross-validation resampling scheme; (4) apply the model to a set of unlabeled instances, the results - i.e. the values of the descriptors and the assigned class - being exported into a CSV file. These are standard data mining tasks, which we have described in many tutorials. We want to check whether they are easy to implement with this new version of RapidMiner. Indeed, with the previous version, defining some sequences of operations was complicated; implementing a cross-validation, for instance, was not really intuitive.

Keywords: rapidminer, knime, cross-validation, decision tree, classification tree, deployment
Tutorial: en_Tanagra_RapidMiner_5.pdf
Dataset: adult_rapidminer.zip
References:
Rapid-I, "RapidMiner"
Knime, "Knime Desktop"


Regression model deployment

2011-09-19 (Tag:6573437796943113794)

Model deployment is one of the main objectives of the data mining process. We want to apply a model learned on a training set to unseen cases, i.e. any individual coming from the population. In the classification framework, the aim is to assign to each instance its class value from its description [e.g. Apply a classifier on a new dataset (Deployment)]. In the clustering framework, we try to detect the group which is most similar to the instance according to its characteristics (e.g. K-Means - Classification of a new instance).

Here, we are concerned with the regression framework. The aim is to predict the values of the dependent variable for unseen (unlabeled) instances from the observed values of the independent variables. The process is rather simple when we handle a linear regression model: we apply the estimated parameters to the unseen instances. But it becomes difficult when we want to treat more complex models, such as support vector regression with nonlinear kernels, or models built from a combination of techniques (e.g. a regression on the factors of a principal component analysis). In this context, it is essential that the deployment process be handled directly by the data mining tool.

With Tanagra, it is possible to easily deploy regression models, even when they result from a combination of techniques; we simply have to prepare the data file in a particular way. In this tutorial, we describe how to organize the data file in order to deploy various models in a unified framework: a linear regression model, a PLS regression model, a support vector regression model with an RBF (radial basis function) kernel, a regression tree model, and a regression model built on the factors of a principal component analysis. Then, we export the results (the predicted values of the dependent variable) into a new data file. Finally, we check whether the predicted values are similar across the various models.
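
The same organization of the data file can be mimicked in R: the unlabeled rows are appended to the labeled ones with a missing dependent variable, the model is fitted on the complete rows, and the predictions are exported. A sketch for a single linear model, where the file and the MEDV column name are assumptions:

housing <- read.csv("housing_with_unlabeled.csv")      # labeled + unlabeled rows concatenated
labeled   <- subset(housing, !is.na(MEDV))
unlabeled <- subset(housing,  is.na(MEDV))
m <- lm(MEDV ~ ., data = labeled)
unlabeled$MEDV <- predict(m, newdata = unlabeled)      # predicted values for the unseen rows
write.csv(unlabeled, "housing_scored.csv", row.names = FALSE)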

Keywords: model deployment, linear regression, pls regression, support vector regression, SVR, regression tree, cart, principal component analysis, pca, regression of factor scores
Components: MULTIPLE LINEAR REGRESSION, PLS REGRESSION, PLS SELECTION, C-RT REGRESSION TREE, EPSILON SVR, PRINCIPAL COMPONENT ANALYSIS, RECOVER EXAMPLES, EXPORT DATASET, LINEAR CORRELATION
Tutorial: en_Tanagra_Multiple_Regression_Deployment.pdf
Dataset: housing.xls
References :
R. Rakotomalala, Régression linéaire multiple - Diaporama (in French)


Data Mining with R - The Rattle Package

2011-08-27 (Tag:7176875015659908960)

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is fully justified (see Kdnuggets Polls - Data Mining / Analytic Tools Used - 2011). Among the reasons for this success, we can point out two very interesting characteristics: (1) the features of the tool can be extended almost indefinitely through the packages; (2) a programming language allows sequences of complex operations to be performed easily.

But this second property can also be a drawback: some users do not want to learn a new programming language before being able to carry out projects. For this reason, tools which allow the sequence of commands to be defined with diagrams (such as Tanagra, Knime, RapidMiner, etc.) remain a valuable alternative for data miners.

In this tutorial, we present the "Rattle" package, which allows data miners to use R without needing to know the associated programming language. All the operations are performed with simple clicks, as in any menu-driven software. In addition, all the underlying commands are recorded: we can save them in a file and then, in a new working session, easily repeat all the operations. Thus, we recover one of the important properties that menu-driven tools usually lack.

To describe the use of the rattle package, we perform an analysis similar to the one suggested by the package's author in his presentation paper (G.J. Williams, "Rattle: A Data Mining GUI for R", The R Journal, volume 1/2, pages 45-55, December 2009). We carry out the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; and assessing the models on the test sample (confusion matrix, error rate, some curves).
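
Getting started takes two commands; everything else is then driven from the GUI, and the Log tab keeps the generated R code:

install.packages("rattle")   # once, if the package is not installed yet
library(rattle)
rattle()                     # opens the Rattle window
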
Keywords: R software, R project, rpart, random forest, glm, decision tree, classification tree, logistic regression
Tutorial: en_Tanagra_Rattle_Package_for_R.pdf
Dataset: heart_for_rattle.txt
References:
Togaware, "Rattle"
CRAN, "Package rattle - Graphical user interface for data mining in R"
G.J. Williams, "Rattle: A Data Mining GUI for R", in The R Journal, Vol. 1/2, pages 45--55, december 2009.


Predictive model deployment with R (filehash)

2011-08-22 (Tag:6356475213443561375)

Model deployment is the last step of the data mining process. It covers several aspects, e.g. generating a report about the data exploration process and highlighting the useful results, or applying models within an organization's decision-making process.

In this tutorial, we look at the context of predictive data mining. We are concerned with the construction of the model from a labeled dataset; the storage of the model; the distribution of the model, without the dataset used for its construction; and the application of the model to new instances in order to assign them a class label from their description (the values of the descriptors).

We describe the filehash package for R, which allows a model to be deployed easily. The main advantage of this solution is that R can be launched under various operating systems. Thus, we can create a model with R under Windows, and apply it in another environment, for instance with R under Linux. The solution can easily be generalized on a large scale because R can be launched in batch mode, and future updates of the system will only concern the model file.

We write three R programs to distinguish the steps of the deployment process. The first one constructs a model from the dataset and stores it into a binary file (filehash format). The second one loads the model in another R session and uses it to label new instances from a second data file; the predictions are stored in a data file (CSV format). Finally, the third program loads the predictions and another data file containing the observed labels for these instances, and calculates the confusion matrix and the generalization error rate.

We use various predictive models in order to check the flexibility of the solution: a decision tree (rpart), a logistic regression (glm), a linear discriminant analysis (lda), and a linear discriminant analysis on the factors of a principal component analysis (lda + pca). The last one allowed us to check whether the system remains operational when we manipulate a combination of models.
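
A condensed sketch of the first two programs with the filehash API, assuming a training frame pima.train with target diabetes and an unlabeled frame pima.new (names are illustrative):

library(filehash); library(rpart)
# program 1: build the model and store it in a filehash database
m <- rpart(diabetes ~ ., data = pima.train, method = "class")
dbCreate("model.db"); db <- dbInit("model.db")
dbInsert(db, "model", m)
# program 2: in another R session (possibly another OS), reload and apply the model
db <- dbInit("model.db")
pred <- predict(dbFetch(db, "model"), newdata = pima.new, type = "class")
write.csv(data.frame(prediction = pred), "predictions.csv", row.names = FALSE)
# program 3 then confronts these predictions with the observed labels (confusion matrix, error rate)
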
Keywords: R software, filehash package, deployment, predictive model, rpart, lda, pca, glm, decision tree, linear discriminant analysis, logistic regression, principal component analysis, linear discriminant analysis on latent variables
Tutorial: en_Tanagra_Deploying_Predictive_Models_with_R.pdf
Dataset: pima-model-deployment.zip
References:
R package, "Filehash : Simple key-value database"
Kdnuggets, "Data mining deployment Poll"


REGRESS into the SIPINA package

2011-08-18 (Tag:6733979158907687783)

Few people know it, but several tools are installed when we launch the SETUP file of SIPINA (setup_stat_package.exe). This is the case of REGRESS, which is dedicated to multiple linear regression.

Even if a multiple linear regression procedure is already incorporated into Tanagra, REGRESS can be useful, essentially because it is very easy to use while remaining consistent with an undergraduate econometrics course. As such, it may be useful for anyone wishing to learn about regression without investing too much in learning a new piece of software.
Keywords: regress, econometrics, multiple linear regression, outliers, influential points, normality tests, residuals, Jarque-Bera test, normal probability plot, sipina.xla, add-in
Tutorial: en_sipina_regress.pdf
Dataset: ventes-regression.xls
References:
R. Rakotomalala, "Econométrie - Régression Linéaire Simple et Multiple".
D. Garson, "Multiple regression".


PLS Regression - Software comparison

2011-08-14 (Tag:5538024845921083630)

Comparing the behavior of tools is always a good way to improve them.
To check and validate the implementation of methods. The validation of the implemented algorithms is an essential point for data mining tools. Even if two programmers use the same references (books, articles), programming choices can modify the behavior of the approach (for instance, the interpretation of the convergence conditions). Analyzing the source code is a possible solution. But while it is often available for free software, this is not the case for commercial tools. Thus, the only way to check them is to compare the results provided by the tools on a benchmark dataset. If there are divergences, we must explain them by analyzing the formulas used.
To improve the presentation of results. There are certain standards to observe in the production of reports, a consensus initiated by reference books and/or leading tools in the field. Some ratios should be presented in a certain way. Users need reference points.
Our programming of the PLS approach is based on the Tenenhaus book (1998) which, itself, refers to the SIMCA-P tool. Having access to a limited version of this software (version 11), we have checked the results provided by Tanagra on various datasets. We show here the results of the study on the CARS dataset. We extend the comparison to other data mining tools.
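For the R part of the comparison, a minimal sketch with the pls package may look as follows (the file and column names are hypothetical):

library(pls)
cars <- read.table("cars_pls.txt", header = TRUE, sep = "\t")
# PLS regression with 2 components on standardized variables
model <- plsr(cbind(conso, price) ~ ., data = cars, ncomp = 2, scale = TRUE)
summary(model)          # explained variance per component
coef(model, ncomp = 2)  # regression coefficients, to be compared across tools
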
Keywords: pls regression, software comparison, simca-p, spad, sas, r software, pls package
Components: PLSR, VIEW DATASET, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL
Tutorial: en_Tanagra_PLSR_Software_Comparison.pdf
Dataset: cars_pls_regression.xls
References :
M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.
D. Garson, « Partial Least Squares Regression », from Statnotes: Topics in Multivariate Analysis.
UMETRICS. SIMCA-P for Multivariate Data Analysis.


The CART method under Tanagra and R (rpart)

2011-08-06 (Tag:2318214853926374974)

CART (Breiman et al., 1984) is a very popular classification tree (also called decision tree) learning algorithm. Rightly so. CART incorporates all the ingredients of good learning control: the post-pruning process makes it possible to handle the trade-off between bias and variance; the cost-complexity mechanism "smooths" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the characteristics of the data.

Breiman's algorithm is available under different names in free data mining tools. Tanagra uses the "C-RT" name. R, through a specific package, provides the "rpart" function.

In this tutorial, we describe these implementations of the CART approach with respect to the original book (Breiman et al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample called the "pruning set" (section 11.4), whereas rpart relies on the cross-validation principle (section 11.5).
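
On the rpart side, a minimal sketch of growing a large tree and post-pruning it from the cross-validation estimates (the file name is hypothetical):

library(rpart)
wave <- read.table("wave5300.txt", header = TRUE, sep = "\t")
# grow a deliberately large tree, then prune it back
tree <- rpart(class ~ ., data = wave, method = "class", cp = 0.0)
printcp(tree)   # cross-validated error (xerror) for each nested subtree
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best.cp)   # 0-SE pruning; an x-SE rule can be applied instead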

Keywords: decision tree, classification tree, recursive partitioning, cart, R software, rpart package
Components: DISCRETE SELECT EXAMPLES, C-RT, SUPERVISED LEARNING, TEST
Tutorial: en_Tanagra_R_CART_algorithm.pdf
Dataset: wave5300.xls
References:
L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", Chapman & Hall, 1984.
"The R project for Statistical Computing" - http://www.r-project.org/


PLS Discriminant Analysis - A comparative study

2011-07-24 (Tag:1853924429082354461)

PLS regression is a regression technique usually designed to predict the values taken by a group of Y variables (target variables, dependent variables) from a set of X variables (descriptors, independent variables). Initially defined for the prediction of continuous target variables, PLS regression can be adapted to the prediction of a discrete variable - i.e. adapted to the supervised learning framework - in different ways. The approach is called "PLS Discriminant Analysis" in this context. It brings into this new framework the valuable qualities that we already know: the ability to process a representation space of very high dimensionality, with a large number of noisy and/or redundant descriptors.

This tutorial is the continuation of a previous paper dedicated to the presentation of some variants of PLS-DA. We describe the behavior of one of them (PLS-LDA - PLS Linear Discriminant Analysis) on a learning set where the number of descriptors is moderately high (278 descriptors) in relation to the number of instances (232 instances). Even if the number of descriptors is not really very high, we note in our experiment a valuable characteristic of the PLS approach: we can control the variance of the classifier by adjusting the number of latent variables.

To assess this idea, we compare the behavior of PLS-LDA with state-of-the-art supervised learning methods such as K-nearest neighbors, SVM (Support Vector Machine from the LIBSVM library), Breiman's Random Forest approach, or Fisher's Linear Discriminant Analysis.
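
The idea behind PLS-LDA can be sketched in R in two steps, PLS on the 0/1-coded class attribute and then LDA on the retained latent variables; 'train', the column names and the number of components are hypothetical (the tutorial itself relies on Tanagra's PLS-LDA component):

library(pls)
library(MASS)
y01 <- as.numeric(train$class == levels(train$class)[1])    # 0/1 coding of the class
pls.fit <- plsr(y01 ~ ., data = train[, setdiff(names(train), "class")], ncomp = 5, scale = TRUE)
Z <- scores(pls.fit)[, 1:5]                                  # the latent variables
lda.fit <- lda(Z, grouping = train$class)                    # LDA on the factors
pred <- predict(lda.fit, Z)$class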

Keywords: pls regression, linear discriminant analysis, supervised learning, support vector machine, SVM, random forest, nearest neighbor
Components: K-NN, PLS-LDA, BAGGING, RND TREE, C-SVC, TEST, DISCRETE SELECT EXAMPLES, REMOVE CONSTANT
Tutorial: en_Tanagra_PLS_DA_Comparaison.pdf
Dataset: arrhytmia.bdm
References :
S. Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of PLS-DA in multivariate image analysis », in J. Chemometrics, 20 : 221-229, 2006.
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm


Tanagra add-on for OpenOffice Calc 3.3

2011-07-17 (Tag:3230155024695403642)

Tanagra add-on for OpenOffice 3.3 and LibreOffice 3.4.

The connection with spreadsheet applications is certainly a factor in the success of Tanagra. It is easy to manipulate a dataset in OpenOffice Calc (up to version 3.2) and send it to Tanagra for further analysis using the TanagraLibrary.zip extension.

Recently, users reported to me that this mechanism did not work with the recent versions of OpenOffice (version 3.3) and LibreOffice (version 3.4). I realized that, rather than a patch, it was more appropriate to develop a new module which meets the extension management standard of these tools. The new library "TanagraModule.oxt" is now incorporated into the distribution.

This tutorial describes how to install and use this add-on under OpenOffice Calc 3.3. The adaptation to LibreOffice 3.4 is very easy.

Keywords: data importation, spreadsheet application, openoffice, libreoffice, add-in, add-on, excel
Components: VIEW DATASET
Tutorial: en_Tanagra_Addon_OpenOffice_LibreOffice.pdf
Dataset: breast.ods
References:
Tanagra Tutorial, "OOo Calc file handling using an add-in"
Tanagra Tutorial, "Launching Tanagra from OOo Calc under Linux"


Tanagra - Version 1.4.40

2011-07-05 (Tag:3324900599297262143)

A few improvements for this new version.

A new add-on for the connection between Tanagra and the recent versions of the OpenOffice Calc spreadsheet has been created. The old one did not work with the recent versions - OpenOffice 3.3 and LibreOffice 3.4. A new library ("TanagraModule.oxt") is added during the installation process so as not to interfere with the old one, which is still functional for previous versions of OpenOffice (3.2 and earlier). A tutorial describing its installation and its use will be put online soon. I take this opportunity to highlight again how convenient a privileged connection between a spreadsheet and a specialized Data Mining tool is. The annual poll organized by the kdnuggets.com website shows the interest in this connection (2011, 2010, 2009, ...). We note that there is a similar add-on for the R software (R4Calc). This change was suggested by Jérémy Roos (OpenOffice) and Franck Thomas (LibreOffice).

Non-standardized PCA is now available. It can be obtained by unchecking the data standardization option in the Principal Component Analysis component. Change suggested by Elvire Antanjan.

Simultaneous regression was introduced. It is very similar to the method programmed into LazStats, which is unfortunately no longer freely accessible. The approach is described in a free booklet available online, "Practice of linear regression analysis" (in French) (section 3.6).

The color codes according to the p-value have been introduced for the Linear Correlation component. Change suggested by Samuel KL.

Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Download page: setup


Tanagra - Version 1.4.39

2011-05-26 (Tag:6899548697628617002)

Some minor corrections for the Tanagra 1.4.39 version.

For the PCA (principal component analysis) component, when we asked for all the factors, none were generated. Reported by Jérémy Roos.

In the previous 1.4.38 version, the results of the Multinomial Logistic Regression were not consistent with the tutorial on the website. The calculations were wrong. Reported by Nicole Jurado.

It is now possible to obtain the scores from the PLS-DA component (Partial Least Squares Regression - Discriminant Analysis). Reported by Carlos Serrano.

All these bugs are corrected in the 1.4.39 version. Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Download page: setup


Mining Association Rule from Transactions File

2011-04-05 (Tag:1419133677054019536)

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It is often used in the market basket analysis domain. But in fact, it can be implemented in various areas where we want to discover associations between variables. The association is described by an "IF THEN" rule. The IF part is called the "antecedent" of the rule; the THEN part corresponds to the "consequent", e.g. IF onions AND potatoes THEN burger (http://en.wikipedia.org/wiki/Association_rule_learning), i.e. if a customer buys onions and potatoes, then he also buys burgers.

It is possible to find co-occurrences in the standard attribute-value tables that are handled by most data mining tools. In this context, the rows correspond to the baskets (transactions); the columns correspond to the list of all possible products (items); at the intersection of a row and a column, we have an indicator (true/false or 1/0) which tells whether the item belongs to the transaction. But this kind of representation is too naive. Only a few products are included in each basket. Each row of the table contains a few 1's and many 0's. The size of the data file is unnecessarily large. Therefore, another data representation, called a "transactions file", is often used to minimize the data file size. In this tutorial, we treat a special case of the transactions file. The principle is based on the enumeration of the items included in each transaction. But in our case, we have only two values for each row of the data file: the transaction identifier and the item identifier. Thus, each transaction can be listed on several rows of the data file.

This data representation is quite natural considering the problem we want to treat. It also has the advantage of being more compact since only the items actually present in each transaction are enumerated. However, it appears that many tools do not know how to manage this kind of data representation directly. Curiously, we observe a distinction between professional tools and academic ones. The former can handle this kind of data file directly, without special data preparation. This is the case of SPAD 7.3 and SAS Enterprise Miner 4.3, which we study in this tutorial. On the other hand, the academic tools need a data transformation prior to the importation of the dataset. We use a small program written in VBA (Visual Basic for Applications) under Excel to prepare the dataset. Thereafter, we perform the analysis with Tanagra 1.4.37 and Knime 2.2.2 (note: a reader told me that we can transform the dataset with Knime without using an external program. This is true. I will describe this approach in a separate section at the end of this tutorial).

Note that we must respect the original specification, i.e. focus only on rules indicating the simultaneous presence of items in transactions. We must not, as a consequence of a bad "presence - absence" coding scheme, generate rules outlining the simultaneous absence of some items. This may be interesting in some cases, but it is not the purpose of our analysis.
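
As a side note, not covered in the tutorial, the R package arules reads this two-column transactions format directly; a minimal sketch with a hypothetical file name:

library(arules)
# each row of the file holds a transaction identifier and an item identifier
trans <- read.transactions("transactions.txt", format = "single", sep = "\t", cols = c(1, 2))
rules <- apriori(trans, parameter = list(supp = 0.02, conf = 0.75))
inspect(sort(rules, by = "lift"))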

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: en_Tanagra_Assoc_Rule_Transactions.pdf
Dataset: assoc_rule_transactions.zip
References:
Tanagra Tutorials, "Association rule learning from transaction dataset"
P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006; chapter 6, « Association analysis: Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"


Multiple Regression - Reading the results

2011-02-20 (Tag:6826794804495294855)

The aim of multiple regression is to predict the values of a continuous dependent variable Y from a set of continuous or binary independent variables (X1, ..., Xp).

In this tutorial, we want to model the relationship between the cars' consumption and their weight, engine size and horsepower. We describe the outputs of Tanagra by associating them with the formulas used. We highlight the importance of the unscaled covariance matrix of the estimated coefficients [(X'X)^(-1)] (Tanagra 1.4.38 and later). It is used for the subsequent analyses: individual significance of coefficients, simultaneous significance of several coefficients, testing linear combinations of coefficients, computation of the standard error for the prediction interval. These analyses are performed in the Excel spreadsheet.

Thereafter, we perform the same analyses with the R software. We identify the objects provided by the lm(.) procedure that we can use in the same context.
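
The same quantities can be retrieved from lm() along the following lines (the file and column names are hypothetical):

cars <- read.table("cars_consumption.txt", header = TRUE, sep = "\t")
model <- lm(consumption ~ weight + engine.size + horsepower, data = cars)
summary(model)                    # coefficients and individual t-tests
summary(model)$cov.unscaled       # the (X'X)^(-1) matrix highlighted in the tutorial
vcov(model)                       # = estimated error variance * (X'X)^(-1)
predict(model, newdata = cars[1, ], interval = "prediction")   # prediction interval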

Keywords: linear regression, multiple regression, R software, lm, summary.lm, testing significance, prediction interval
Components: MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_Multiple_Regression_Results.pdf
Dataset: cars_consumption.zip
References :
D. Garson, "Multiple Regression"


Tanagra - Version 1.4.38

2011-02-04 (Tag:1474569535027730219)

Some minor corrections for the Tanagra 1.4.38 version.

The color codes for the normality tests have been harmonized (Normality Test). In some configurations, the colors associated with the p-values were not consistent, which could mislead the users. This problem was reported by Lawrence M. Garmendia.

Following indications from Mr. Oanh Chau, I realized that the standardization of the variables for HAC (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself. But the sum of the level indices in the dendrogram was not consistent with the TSS (total sum of squares). This is unwelcome. The difference is especially noticeable on small datasets; it disappears when the dataset size increases. The correction has been introduced. Now the BSS ratio is equal to 1 when we have the trivial partition, i.e. one individual per group.

Multiple linear regression (MULTIPLE LINEAR REGRESSION) now displays the (X'X)^(-1) matrix. It allows us to deduce the variance-covariance matrix of the coefficients (by multiplying this matrix by the estimated variance of the error). It can also be used in generalized tests on the model coefficients.

Last, the outputs of the descriptive discriminant analysis (CANONICAL DISCRIMINANT ANALYSIS) were improved. The group centroids (Group centroids) on the factorial axes are directly provided.

Thank you very much to all those who help me to improve this work by their comments or suggestions.

Download page: setup


Tanagra website statistics for 2010

2011-01-04 (Tag:3749773494061368837)

The year 2010 ends, 2011 begins. I wish you all a very happy year 2011.

A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) have been visited 241,765 times this year, i.e. 662 visits per day. For comparison, we had 520 daily visits in 2009 and 349 in 2008.

Who are you? The majority of visits come from France and Maghreb (62%). Then there are a large part of French speaking countries. In terms of non-francophone countries, we observe mainly the United States, India, UK, Germany, Brazil,...

Which pages are visited? The most successful pages are those that relate to documentation about Data Mining: course materials, tutorials, links to other documents available online, etc. This is hardly surprising. I myself spend more time writing booklets and tutorials, and studying the behavior of different software, including Tanagra.

Happy New Year 2011 to all.

Ricco.
Slideshow: Website statistics for 2010


Creating reports with Tanagra

2010-12-09 (Tag:4866690558847292074)

The ability to automatically create reports from the results of an analysis is valuable functionality for Data Mining. But this is rather an asset of professional tools. The programming of this kind of functionality is not really promoted in the academic domain. I do not think I could publish a paper in a journal describing the ability of Tanagra to create attractive reports. This is the reason why the output of academic tools, such as R or Weka, is mainly formatted text.

Tanagra, which is an academic tool, also provides text outputs. The programming remains simple, as a glance at the source code shows. But, in order to make the presentation more attractive, it uses HTML to format the results. I take advantage of this special feature to generate reports without any particular programming effort. Tanagra is one of the few academic tools able to produce reports that can easily be displayed in office automation software. For instance, the tables can be copied into Excel spreadsheets for further calculations. More generally, the results can be viewed in a browser, independently of the data mining software.

These are the reporting features of Tanagra that we present in this tutorial.

Keywords: reporting, decision tree, c4.5, logistic regression, binary coding, roc curve, learning sample, test sample, forward, feature selection
Components: GROUP CHARACTERIZATION, SAMPLING, C4.5, TEST, O_1_BINARIZE, FORWARD-LOGIT, BINARY LOGISTIC REGRESSION, SCORING, ROC CURVE
Tutorial: en_Tanagra_Reporting.pdf
Dataset: heart disease


Multithreading for decision tree induction

2010-11-24 (Tag:3378022511446387135)

Nowadays, most modern personal computers (PCs) have multicore processors. The computer operates as if it had multiple processors. Software and data mining algorithms must be modified in order to benefit from this new feature.

Currently, few free tools exploit this opportunity because it is impossible to define a generic approach that would be valid regardless of the learning method used. We must modify each existing learning algorithm. For a given technique, decomposing an algorithm into elementary tasks that can be executed in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement.

In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited to our context for various reasons. (1) It is easy to program with modern programming languages. (2) Threads can share information; they can also modify common objects. Efficient synchronization tools make it possible to avoid data corruption. (3) We can launch multiple threads on a single-core, single-processor system. It is not really advantageous, but at least the system does not crash. (4) On a multiprocessor or multicore system, the threads will actually run at the same time, with each processor or core running a particular thread. But, because of the necessary synchronization between threads, the computation time is not simply divided by the number of cores.
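
To illustrate the decomposition only, and not SIPINA's actual code: at a given node, the costly part is the evaluation of one candidate split per attribute, and these evaluations are independent. A rough R sketch with forked workers and a hypothetical data file:

library(parallel)
node.data <- read.table("covtype.txt", header = TRUE, sep = "\t")
predictors <- setdiff(names(node.data), "class")
score.split <- function(v) {
  # chi-squared statistic between the attribute and the class, a CHAID-like split criterion
  suppressWarnings(chisq.test(table(node.data[[v]], node.data$class))$statistic)
}
# forked workers share the dataset (Linux/macOS); on Windows, parLapply() on a cluster would be used
scores <- unlist(mclapply(predictors, score.split, mc.cores = 4))
best.attribute <- predictors[which.max(scores)]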

First, we briefly present the modification of the decision tree learning algorithm needed to benefit from multithreading. Then, we show how to implement the approach with SIPINA (version 3.5 and later). We also show that multithreaded decision tree learners are available in various tools such as Knime 2.2.2 or RapidMiner 5.0.011. Lastly, we study the behavior of the multithreaded algorithms according to the dataset characteristics.

Keywords: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011
Tutorial: en_sipina_multithreading.pdf
Dataset: covtype.arff.zip
References :
Wikipedia, "Decision tree learning"
Wikipedia, "Thread (Computer science)"
Aldinucci, Ruggieri, Torquati, " Porting Decision Tree Algorithms to Multicore using FastFlow ", Pkdd-2010.


Naive bayes classifier for continuous predictors

2010-11-11 (Tag:3893972798363934517)

The naive bayes classifier is a very popular approach even if it is (apparently) based on an unrealistic assumption: the distributions of the predictors are mutually independent conditionally on the values of the target attribute. The main reason for this popularity is that the method proves to be as accurate as other well-known approaches, such as linear discriminant analysis or logistic regression, on the majority of real datasets.

But an obstacle to the use of the naive bayes classifier remains when we deal with a real problem. It seems that we cannot provide an explicit model for its deployment. The representation proposed by the PMML standard, for instance, is particularly unattractive. The interpretation of the model, especially the detection of the influence of each descriptor on the prediction of the classes, seems impossible.

This assertion is not entirely true. We showed in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references). We obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. Under the heteroscedastic assumption or the homoscedastic assumption, we obtain respectively a quadratic model or a linear model. This last one is especially interesting because we obtain a model that we can directly compare to the other linear classifiers (the sign and the values of the coefficients of the linear combination).

This tutorial is organized as follows. In the next section, we describe the approach. In section 3, we show how to implement the method with Tanagra 1.4.37 (and later). We compare the results to those of the other linear methods. In section 4, we compare the results provided by various data mining tools. We note that none of them proposes an explicit model that would be easy to deploy. They give only the estimated parameters of the conditional Gaussian distribution (mean and standard deviation). Last, in section 5, we show the advantage of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset - 16,592 instances and 5,408 predictors). The computation time and the memory occupancy are clearly favorable.
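
For instance, what the compared tools return can be seen with the e1071 package; 'train' and 'test' are hypothetical data frames with a discrete 'class' column:

library(e1071)
model <- naiveBayes(class ~ ., data = train)
model$tables            # per class: mean and standard deviation of each continuous predictor
pred <- predict(model, newdata = test)
table(test$class, pred) # confusion matrix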

Keywords: naive bayes classifier, rapidminer 5.0.10, weka 3.7.2, knime 2.2.2, R software, package e1071, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression
Components : NAIVE BAYES CONTINUOUS, BINARY LOGISTIC REGRESSION, SVM, C-PLS, LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf
Dataset: breast ; low birth weight
References :
Wikipedia, "Naive bayes classifier"
Tanagra, "Naive bayes classifier for discrete predictors"


Tanagra - Version 1.4.37

2010-10-19 (Tag:5157036174110104915)

Naive Bayes Continuous is a supervised learning component. It implements the naive bayes principle for continuous predictors (Gaussian assumption, heteroscedasticity or homoscedasticity). Its main originality is that it provides an explicit model corresponding to a linear combination of the predictors and, possibly, their squares.

Enhancement of the reporting module.


Filter methods for feature selection

2010-10-14 (Tag:3774951973658351277)

The nature of the predictor selection process has changed considerably. Previously, work in machine learning concentrated on the search for the best subset of features for a classifier, in a context where the number of candidate features was rather small and the computing time was not a major constraint. Today, it is common to deal with datasets comprising thousands of descriptors. Consequently, the feature selection problem still consists in finding the most relevant subset of predictors, but with a new strong constraint: the computing time must remain reasonable.

In this tutorial, we are interested in correlation-based filter approaches for discrete predictors. The goal is to highlight the most relevant subset of predictors, i.e. those which are highly correlated with the target attribute and, at the same time, weakly correlated with each other (not redundant). To evaluate the behavior of the various methods, we use an artificial dataset to which we add irrelevant and redundant candidate variables. Then, we perform a feature selection based on the approaches analyzed. We compare the generalization error rate of the naive bayes classifier learned from the various subsets of selected variables. We first lead the experiment with Tanagra. Then, we show how to perform the same analysis with other tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 - FSelector package).
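
For the R part, the FSelector package provides the correlation-based criteria; a minimal sketch, with 'vote.data' a hypothetical data frame whose 'Class' column is the target:

library(FSelector)
weights <- symmetrical.uncertainty(Class ~ ., data = vote.data)   # ranking criterion
print(cutoff.k(weights, 5))                                       # the 5 best-ranked predictors
selected <- cfs(Class ~ ., data = vote.data)                      # CFS subset search
f <- as.simple.formula(selected, "Class")                         # formula reusable by a classifier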

Keywords: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Tutorial: en_Tanagra_Filter_Method_Discrete_Predictors.pdf
Dataset: vote_filter_approach.zip
References:
Tanagra, "Feature Selection"


Connecting Sipina and Excel using OLE

2010-08-30 (Tag:7903721472103544510)

The connection between a data mining tool and Excel (and more generally a spreadsheet) is a very important issue. We have addressed this topic many times in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA. It is simple, reliable and highly efficient. It does not require developing specific versions. The connection with Excel is simply an additional functionality of the standard distribution.

Prior to reaching this solution, we had explored different paths. In this tutorial, we present the XL-SIPINA software based on Microsoft's OLE technology. Unlike the add-in solution, this version of SIPINA chooses to embed Excel into the data mining tool. The system works rather well. Nevertheless, it was finally dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transfer time between Excel and Sipina using OLE becomes prohibitive when the database size grows.

Thus, XL-SIPINA was essentially a short-lived attempt. There is always a bit of nostalgia when I look back on solutions I explored and finally abandoned. Maybe I did not explore this solution completely either.

Last, the application was initially developed for Office 97. I note that it is still usable today; it works fine with Office 2010.

Keywords: excel, spreadsheet, sipina, xls, xlsx, xl-sipina, decision tree induction
Download XL-SIPINA: XL-SIPINA
Tutorial: en_xls_sipina.pdf
Dataset: autos


Sipina add-in for Excel

2010-08-27 (Tag:1588652364488914774)

Data importation is a bottleneck for data mining tools. The majority of users work with a spreadsheet tool such as Excel, often coupled with specialized data mining software (see KDnuggets polls). Therefore, a recurring question from users is "how do I send my data from Excel to SIPINA?"

It is possible to import various file formats into SIPINA. For Excel workbooks, a particular mechanism has been implemented.

An add-in is automatically copied to the computer during the installation process. It must then be integrated into Excel. The add-in incorporates a new menu into Excel. After selecting the data range, the user only has to activate this menu, which leads to the following: (1) SIPINA starts automatically; (2) the data are transferred via the clipboard; (3) SIPINA considers that the first row of the range of cells corresponds to the variable names; (4) columns with numerical values are treated as quantitative variables; (5) columns with alphanumeric values are treated as categorical variables.

Unlike the other tutorials, the sequence of operations is described in a video. The description is valid only for versions up to Excel 2003. Another tutorial about using the add-in under Office 2007 and Office 2010 is described below.

Keywords: excel file format, add-in, decision tree
Installing the add-in : sipina_xla_installation.htm
Using the add-in: sipina_xla_processing.htm


Tanagra add-in for Office 2007 and Office 2010

2010-08-27 (Tag:2343715851998873339)

The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple: it embeds a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic, but this feature immensely facilitates the work of the data miner. Indeed, the spreadsheet is one of the most widely used tools for preparing datasets (see KDnuggets Polls: Tools / Languages for Data Cleaning - 2008). By embedding the data mining tool in the spreadsheet environment, it spares the practitioner tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking compatibility between data file formats, etc.

The installation and the use of the "tanagra.xla" add-in under the previous versions of Office are described elsewhere (Office 1997 to Office 2003). This description is obsolete for the latest versions of Office, because the organization of the menus has been modified in Office 2007 and Office 2010. And yet, the add-in is still operational. In this tutorial, we show how to install and use the Tanagra add-in under Office 2007 and 2010.

This transition to recent versions of Excel is not without consequences. Indeed, compared to the previous Excel versions, Excel 2007 (and 2010) can handle many more rows and columns. We can process a dataset of up to 1,048,575 observations (the first row corresponds to the variable names) and 16,384 variables. In this tutorial, we treat a database with 100,000 observations and 22 variables (wave100k.xlsx). This is a version of the famous waveform database. Note that this file, because of its number of rows, cannot be manipulated by earlier versions of Excel.

The process described in this document is also valid for the SIPINA add-in (sipina.xla).

Keywords: data importation, excel, add-in
Components: VIEW DATASET
Tutorial: en_Tanagra_Add_In_Excel_2007_2010.pdf
Dataset: wave100k.xlsx
References:
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.
Tanagra, "Excel file handling using an add-in".
Tanagra, "OOo Calc file handling using an add-in".
Tanagra, "Launching Tanagra from OOo Calc under Linux".
Tanagra, "Sipina add-in for Excel"


Naive bayes classifier for discrete predictors

2010-07-24 (Tag:8671610544546668437)

The naive bayes approach is a supervised learning method which is based on a simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Yet, despite this, it appears robust and efficient. Its performance is comparable to other supervised learning techniques.

We introduce in Tanagra (version 1.4.36 and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier.

In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as the logistic regression, the linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good performance of the method in comparison to others.

In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0). We try above all to understand the obtained results.

Keywords: naive bayes, linear classifier, linear discriminant analysis, logistic regression, linear support vector machine, svm
Components: NAIVE BAYES, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, SVM, 0_1_BINARIZE
Tutorial: en_Tanagra_Naive_Bayes_Classifier_Explained.pdf
Dataset: heart_for_naive_bayes.zip
References :
Wikipedia, "Naive bayes classifier".
T. Mitchell, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", in Machine Learning, Chapter 1, 2005.


Interactive decision tree learning with Spad

2010-07-21 (Tag:5071610824231660345)

In this tutorial, we are interested in SPAD. This is a French software package specialized in exploratory data analysis which has evolved a lot in recent years. We perform a sequence of analyses from a dataset stored in 3 worksheets of an Excel data file: (1) we create a classification tree from the learning sample in the first worksheet, we analyze some nodes of the tree in depth to highlight the characteristics of the covered instances, and we also modify interactively (manually) the properties of some splits; (2) we apply the classifier to the unseen cases of the second worksheet; (3) we compare the predictions of the model with the actual values of the target attribute contained in the third worksheet.

Of course, we can perform this process using free tools such as SIPINA (for the interactive construction of the tree) or R (for programming the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial tools (e.g. SPSS Modeler, SAS Enterprise Miner, STATISTICA Data Miner…), we can very easily specify the whole sequence, even if we are not especially familiar with data mining tools.

Keywords: decision tree, classification tree, interactive decision tree, spad, sipina, r software
Tutorial: en_Tanagra_Arbres_IDT_Spad.pdf
Dataset: pima-arbre-spad.zip
References:
SPAD, http://www.spad.eu/
SIPINA, https://eric.univ-lyon2.fr/ricco/sipina.html
R Project, http://www.r-project.org/


Supervised learning from imbalanced dataset

2010-07-12 (Tag:7702541561563780532)

In real problems, the classes are not equally represented in the dataset. The instances corresponding to the positive class, the one that we usually want to detect, are few. For instance, in a fraud detection problem, there are very few cases of fraud compared to the large number of honest connections; in a medical problem, ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate is not appropriate. We observe that the default classifier, which assigns all the instances to the majority class, is the one which minimizes the error rate.

For the dataset that we analyze in this tutorial, 1.77% of the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier which is able to do better. And yet we know that this is not a good classifier, especially because it does not supply a degree of membership to the classes (note: in fact, it assigns the same degree of membership to all the instances).

A strategy that improves the behavior of learning algorithms facing the imbalance problem is to artificially balance the dataset. We can do this by eliminating some instances of the over-sized class (downsizing, or undersampling) or by duplicating some instances of the small class (oversampling). But few people analyze the consequences of this solution on the performance of the classifier.

In this tutorial, we highlight the consequences of downsizing on the behavior of logistic regression.
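
The downsizing step itself can be sketched in a few lines of R (hypothetical data frame and class labels):

set.seed(1)
pos <- subset(train, class == "positive")
neg <- subset(train, class == "negative")
balanced <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])   # keep as many negatives as positives
model <- glm(class ~ ., data = balanced, family = binomial)
# note: the intercept has to be adjusted if calibrated probabilities are needed on the original distribution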

Keywords: imbalanced dataset, logistic regression, over sampling, under sampling
Components: BINARY LOGISTIC REGRESSION, DISCRETE SELECT EXAMPLES, SCORING, RECOVER EXAMPLES, ROC CURVE, TEST
Tutorial : en_Tanagra_Imbalanced_Dataset.pdf
Dataset : imbalanced_dataset.xls
References :
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley &Sons, Inc, Second Edition, 2000.


Handling large dataset in R - The "filehash" package

2010-06-09 (Tag:4731981155829981389)

The processing of very large datasets is a crucial problem in data mining. To handle them, we must avoid loading the whole dataset into memory. The idea is quite simple: (1) we write all or part of the dataset to disk in a binary file format that allows direct access; (2) the machine learning algorithms are modified to efficiently access the values stored on the disk. Thus, the characteristics of the computer are no longer a bottleneck for handling a large dataset.

In this tutorial, we describe the great "filehash" package for R. It allows us to copy (dump) any kind of R object into a file. We can then handle these objects without loading them into main memory. This is especially useful for data frame objects. Indeed, we can perform a statistical analysis with the usual functions directly from a database on the disk. The processing capacity is vastly improved and, at the same time, we will see that the increase in computation time remains moderate.

To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without the package, when learning a decision tree with rpart (rpart package) and a linear discriminant analysis with lda (MASS package). We perform the same experiments using SIPINA. Indeed, it also provides a swapping system (the data are dumped from main memory to temporary files) for handling very large datasets. We can then compare the performance of the various solutions.
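
A minimal sketch of the mechanism (the file and column names are hypothetical, and a simple lm() call stands in for the models used in the tutorial):

library(filehash)
# dump each column of the data frame into a key-value database on disk
dumpDF(read.table("wave2M.txt", header = TRUE, sep = "\t"), dbName = "wave.db")
env <- db2env(db = "wave.db")
# the columns are now fetched from disk on demand; the environment can be passed as 'data'
model <- lm(V1 ~ V2 + V3, data = env)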

Keywords: very large dataset, filehash, decision tree, linear discriminant analysis, sipina, C4.5, rpart, lda
Tutorial: en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Dataset: wave2M.txt.zip
References :
R package, "Filehash : Simple key-value database"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
Tanagra Tutorial, "MapReduce with R", February 2015.
Tanagra Tutorial, "R programming under Hadoop", April 2015.


Logistic Regression Diagnostics

2010-05-27 (Tag:4326915598201712548)

This tutorial describes the implementation of tools for the diagnostic and the assessment of a logistic regression. These tools are available in Tanagra version 1.4.33 (and later).

We deal with a credit scoring problem. Using logistic regression, we try to determine the factors underlying the agreement or refusal of credit to customers. We perform the following steps:
- Estimating the parameters of the classifier;
- Retrieving the covariance matrix of coefficients;
- Assessment using the Hosmer and Lemeshow goodness of fit test;
- Assessment using the reliability diagram;
- Assessment using the ROC curve;
- Analysis of residuals, detection of outliers and influential points.

We first use Tanagra 1.4.33. Then, we perform the same analysis using the R 2.9.2 software [glm(.) procedure].
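
With glm(), the corresponding quantities can be obtained as follows (hypothetical data frame and column names):

model <- glm(acceptation ~ ., data = credit, family = binomial)
vcov(model)                              # covariance matrix of the coefficients
r.pearson  <- residuals(model, type = "pearson")
r.deviance <- residuals(model, type = "deviance")
lev <- hatvalues(model)                  # leverage
cd  <- cooks.distance(model)             # Cook's distance
dfb <- dfbetas(model)                    # standardized DFBETAS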

Keywords: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()
Components: BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM, LOGISTIC REGRESSION RESIDUALS
Tutorial: en_Tanagra_Logistic_Regression_Diagnostics.pdf
Dataset: logistic_regression_diagnostics.zip
References :
D. Garson, "Logistic Regression"
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley &Sons, Inc, Second Edition, 2000.


Discretization of continuous features

2010-05-21 (Tag:6578202565102628458)

Discretization transforms a continuous attribute into a discrete one. To do that, it partitions the range of values into a set of intervals by defining a set of cut points. Thus we must answer two questions to carry out this data transformation: (1) how to determine the right number of intervals; (2) how to compute the cut points. These questions are not necessarily resolved in that order.

The best discretization is the one performed by a domain expert. Indeed, the expert takes into account information other than that provided by the available dataset alone. Unfortunately, this kind of approach is not always feasible: often, the domain knowledge is not available or does not allow us to determine the appropriate discretization; and the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process.

Discretization of continuous features as preprocessing for the supervised learning process. First, we must define the context in which we perform the transformation. Depending on the circumstances, it is clear that the process and the criteria used will not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process, i.e. we transform the continuous predictive attributes into discrete ones before presenting them to a supervised learning algorithm. In this context, the construction of intervals in which one and only one value of the target attribute is predominant is desirable. The relevance of the computed solution is often evaluated through impurity-based or entropy-based functions.

In this tutorial, we use only the univariate approaches. We compare the behavior of the supervised and the unsupervised algorithms on an artificial dataset. We use several tools for that: Tanagra 1.4.35, Sipina 3.3, R 2.9.2 (package dprep), Weka 3.6.0, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0. We highlight the settings of the algorithms and the reading of the results.
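
For reference, the two unsupervised approaches used in the comparison can be reproduced with base R alone (artificial values for illustration):

x <- runif(100)                                     # a continuous attribute
eq.width <- cut(x, breaks = 5)                      # 5 equal-width intervals
eq.freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.2)), include.lowest = TRUE)   # 5 equal-frequency intervals
table(eq.width)
table(eq.freq)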

Keywords: mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals
Components: MDLPC, Supervised Learning, Decision List
Tutorial: en_Tanagra_Discretization_for_Supervised_Learning.pdf
Dataset: data-discretization.arff
References :
F. Muhlenbach, R. Rakotomalala, « Discretization of Continuous Attributes », in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp. 397-402, 2005 (http://hal.archives-ouvertes.fr/hal-00383757/fr/).
Tanagra Tutorial, "Discretization and Naive Bayes Classifier"


Sipina Decision Graph Algorithm (case study)

2010-05-16 (Tag:1006811416263408525)

SIPINA is a data mining tool. But it is also a machine learning method. It corresponds to an algorithm for the induction of decision graphs (see References, section 9). A decision graph is a generalization of a decision tree where we can merge any two terminal nodes of the graph, and not only the leaves issued from the same node.

The SIPINA method is only available in version 2.5 of the SIPINA data mining tool. This version has some drawbacks. Among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool which implements the decision graph algorithm. This is the main reason why this version is still available online today. If we want to use a decision tree algorithm such as C4.5 or CHAID, or if we want to build a decision tree interactively, it is more advantageous to use the research version (also named version 3.0). The research version is more powerful and supplies much more functionality for data exploration.

In this tutorial, we show how to implement the Sipina decision graph algorithm with the Sipina software, version 2.5. We want to predict the low birth weight of newborns from the characteristics of their mothers. We want first and foremost to show how to use this 2.5 version, which is not well documented. We also want to point out the interest of decision graphs when we treat a small dataset, i.e. when data fragmentation becomes a crucial problem.

Keywords: decision graphs, decision trees, sipina version 2.5
Tutorial: en_sipina_method.pdf
Dataset: low_birth_weight_v4.xls
References:
Wikipedia, "Decision tree learning"
J. Oliver, Decision Graphs: An extension of Decision Trees, in Proc. of Int. Conf. on Artificial Intelligence and Statistics, 1993.
R. Rakotomalala, Graphes d'induction, PhD Dissertation, University Lyon 1, 1997 (URL: https://eric.univ-lyon2.fr/ricco/publications.html; in french).
D. Zighed, R. Rakotomalala, Graphes d'induction : Apprentissage et Data Mining, Hermes, 2000 (in French).


User's guide for the old Sipina 2.5 version

2010-05-14 (Tag:2804163829007233425)

SIPINA has a long history. Before the current version (version 3.3, May 2010), we distributed a data mining tool dedicated exclusively to the induction of decision graphs, a generalization of decision trees. Of course, the state-of-the-art decision tree algorithms are also included (such as C4.5 and CHAID).

This version, called 2.5, has been online since 1995. Its development was suspended in 1998 when I started programming version 3.0.

This version 2.5 is the only free tool which implements the decision graph algorithm. It is a real curiosity in this respect. This is the reason why I still distribute this version today.

On the other hand, this 2.5 version has some severe limitations. Among others, it can only handle small datasets, up to 16,380 instances. If you want to build a decision tree or if you want to handle a large dataset, it is always advisable to use the current version (version 3.0 and later).

Setup of the old 2.5 version: Setup_Sipina_V25.exe
User's guide: EnglishDocSipinaV25.pdf
References:
J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. Of the 4-th Int. workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
R. Rakotomalala, "Induction Graphs", PhD Thesis, University of Lyon 1, 1997 (in French).
D. Zighed, R. Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).


Solutions for multicollinearity in multiple regression

2010-05-10 (Tag:6453773133248611080)

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others (Wikipedia). Sometimes the signs of the coefficients are inconsistent with the domain knowledge; sometimes, explanatory variables which seem individually significant are invalidated when we add other variables.

There are two steps when we want to treat this kind of problem: (1) detecting the presence of the collinearity; (2) implementing solutions in order to obtain more consistent results.

In this tutorial, we study three approaches to avoid the multicollinearity problem: the variable selection; the regression on the latent variables provided by PCA (principal component analysis); the PLS regression (partial least squares).
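
The detection step and the PCA-based remedy can be sketched in R as follows (hypothetical file and column names; the vif() function assumes the car package is installed):

cars <- read.table("car_consumption.txt", header = TRUE, sep = "\t")
model <- lm(consumption ~ ., data = cars)
library(car)
vif(model)                   # variance inflation factors: large values reveal collinearity
# remedy: regression on the first latent variables of a PCA
pca <- prcomp(cars[, names(cars) != "consumption"], scale. = TRUE)
summary(lm(cars$consumption ~ pca$x[, 1:2]))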

Keywords: linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression
Components: Multiple linear regression, Linear Correlation, Forward Entry Regression, Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf. Interval
Tutorial: en_Tanagra_Regression_Colinearity.pdf
Dataset: car_consumption_colinearity_regression.xls
References :
Wikipedia, "Multicollinearity"


Linear discriminant analysis on PCA factors

2010-04-26 (Tag:3127643883353858241)

In this tutorial, we show that in certain circumstances, it is more convenient to use the factors computed from a principal component analysis (from the original attributes) as input features for the linear discriminant analysis algorithm.

The new representation space maintains the proximity between the examples. The new features, known as "factors" or "latent variables", which are linear combinations of the original descriptors, have several advantageous properties: (a) their interpretation very often allows us to detect patterns in the initial space; (b) a very small number of factors is enough to restore the information contained in the data; moreover, we can remove noise from the dataset by using only the most relevant factors (a sort of regularization by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, with which learning algorithms such as linear discriminant analysis behave better.

This approach is related to reduced-rank linear discriminant analysis. But, unlike the latter, the class information is not needed during the computation of the principal components. The computation can be very fast on a very high-dimensional dataset when we use an appropriate algorithm (such as NIPALS). On the other hand, it seems that the standard reduced-rank LDA tends to be better in terms of classification accuracy.
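
A minimal sketch of the two-step approach in R ('wave' is a hypothetical data frame whose 'class' column is the target; the number of retained factors is to be tuned):

library(MASS)
X <- wave[, names(wave) != "class"]
pca <- prcomp(X, scale. = TRUE)               # principal component analysis on the descriptors
Z <- as.data.frame(pca$x[, 1:10])             # keep the first 10 factors
Z$class <- wave$class
model <- lda(class ~ ., data = Z)             # linear discriminant analysis on the factors
pred <- predict(model, Z)$class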

Keywords: linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Principal Component Analysis, Scatterplot, Train-test
Tutorial: en_dr_utiliser_axes_factoriels_descripteurs.pdf
Dataset: dr_waveform.bdm
References:
Wikipedia, "Linear discriminant analysis".


Induction of fuzzy rules using Knime

2010-04-22 (Tag:509414641670518430)

This tutorial is the continuation of the one devoted to the induction of decision rules (Supervised rule induction - Software comparison). I did not include Knime in that comparison because it implements a method which is different from the other tools. Knime computes fuzzy rules. It requires the target variable to be continuous. That seems rather mysterious in the supervised learning context, where the class attribute is usually discrete. I thought it was more appropriate to detail the implementation of the method in a tutorial exclusively devoted to the Knime rule learner (version 2.1.1).

In particular, it is important to detail the reason for the data preparation and the reading of the results. To have a reference, we compare the results with those provided by the rule induction tool proposed by Tanagra.

Scientific papers about the method are available online.

Keywords: induction of rules, supervised learning, fuzzy rules
Components: SAMPLING, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Induction_Regles_Floues_Knime.pdf
Dataset: iris2D.txt
References :
M.R. Berthold, « Mixed fuzzy rule formation », International Journal of Approximate Reasoning, 32, pp. 67-84, 2003.
T.R. Gabriel, M.R. Berthold, « Influence of fuzzy norms and other heuristics on mixed fuzzy rule formation », International Journal of Approximate Reasoning, 35, pp.195-202, 2004.


"Wrapper" for feature selection (continuation)

2010-04-16 (Tag:3323366687966519762)

This tutorial is the continuation of the preceding one about wrapper feature selection in the supervised learning context (http://data-mining-tutorials.blogspot.com/2010/03/wrapper-for-feature-selection.html). We analyzed the behavior of Sipina, and we described the source code for the wrapper process (forward search) under R (http://www.r-project.org/). Now, we show how to use the same principle under Knime 2.1.1, Weka 3.6.0 and RapidMiner 4.6.

The approach is as follows: (1) we use the training set for the selection of the most relevant variables for classification; (2) we learn the model on selected descriptors; (3) we assess the performance on a test set containing all the descriptors.

This third point is very important. We cannot know in advance which variables will finally be selected. We should not have to manually prepare the test file by including only those selected by the wrapper procedure. This is essential for the automation of the process. Indeed, otherwise, each change of setting in the wrapper procedure leading to another subset of descriptors would require us to manually edit the test file. This would be very tedious.

In the light of this specification, it appeared that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file. But I could not (or did not know how to) apply the model to a test file containing all the original variables.

The naive bayes classifier is the learning method used in this tutorial.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer
Tutorial: en_Tanagra_Wrapper_Continued.pdf
Dataset: mushroom_wrapper.zip
References :
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.
Wikipedia, "Naive bayes classifier".


"Wrapper" for feature selection

2010-03-30 (Tag:3099475216909191828)

Feature selection is a crucial aspect of the supervised learning process. We must determine the relevant variables for the prediction of the target variable. Indeed, a simpler model is easier to understand and interpret; deployment is facilitated because less information needs to be collected for prediction; finally, a simpler model is often more robust in generalization, i.e. when we want to classify an unseen instance from the population.

Three kinds of approaches are often highlighted in the literature. Among them, the WRAPPER approach explicitly uses a performance criterion during the search for the best subset of descriptors. Most often, this is the error rate. But in reality, any kind of criterion can be used: the cost, if we use a misclassification cost matrix; the area under the curve (AUC), when we assess the classifier using ROC curves; etc. In this case, the learning method is considered as a black box. We try various subsets of predictors and choose the one that optimizes the criterion.

In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For the latter, we give the source code for a forward search strategy. Readers can easily adapt the program to other datasets. Moreover, a careful reading of the R source code gives a better understanding of the calculations made internally by SIPINA.
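
The skeleton of such a forward search looks like the following sketch; 'train' and the 'class' target are hypothetical, and the error is estimated by simple resubstitution to keep the sketch short (the tutorial uses a proper evaluation):

library(e1071)
target <- "class"
eval.error <- function(vars) {
  f <- as.formula(paste(target, "~", paste(vars, collapse = " + ")))
  model <- naiveBayes(f, data = train)
  mean(predict(model, train) != train[[target]])
}
selected <- c(); candidates <- setdiff(names(train), target); best <- Inf
repeat {
  errs <- sapply(candidates, function(v) eval.error(c(selected, v)))
  if (min(errs) >= best) break                     # stop when adding a variable no longer helps
  best <- min(errs)
  selected <- c(selected, names(which.min(errs)))
  candidates <- setdiff(candidates, selected)
}
print(selected)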

The WRAPPER strategy is a priori the best since it explicitly optimizes the performance criterion. We verify this by comparing the results with those provided by the FILTER approach (FCBF method) available in TANAGRA. The conclusions are not as obvious as one might think.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka package
Components: DISCRETE SELECT EXAMPLES, FCBF FILTERING, NAIVE BAYES, TEST
Tutorial: en_Tanagra_Sipina_Wrapper.pdf
Dataset: mushroom_wrapper.zip
References :
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.


Tanagra - Version 1.4.36

2010-03-23 (Tag:7411521367586985240)

ReliefF is a component for automatic variable selection in a supervised learning task. It can handle both continuous and discrete descriptors. It can be inserted before any supervised method.

Naive Bayes was modified. It now describes the prediction model in an explicit form (a linear combination), easy to understand and to deploy.


Supervised rule induction - Software comparison

2010-02-11 (Tag:3171657054469859571)

Supervised rule induction methods play an important role in the Data Mining framework. Indeed, they provide an easy-to-understand classifier. A rule uses the following representation: "IF premise THEN conclusion" (e.g. IF an account problem is reported for a client THEN the credit is not accepted).

Among the rule induction methods, the "separate and conquer" approaches were very popular during the 90's. Curiously, they are less present today in proceedings or journals. More troublesome still, they are not implemented in commercial software. They are only available in free tools from the Machine Learning community. However, they have several advantages compared to other techniques.

In this tutorial, we describe first two separate and conquer algorithms for the rule induction process. Then, we show the behavior of the classification rules algorithms implemented in various tools such as Tanagra 1.4.34, Sipina Research 3.3, Weka 3.6.0, R 2.9.2 with the RWeka package, RapidMiner 4.6, or Orange 2.0b.
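
For readers who want to try a separate-and-conquer learner directly from R, a minimal sketch with the RWeka package (which wraps Weka's RIPPER implementation, JRip) might look as follows; the iris data is only a placeholder and this is not the exact protocol of the tutorial.

# hedged sketch: separate-and-conquer rule induction via Weka's JRip (RWeka)
library(RWeka)                 # requires a Java runtime
data(iris)
rules <- JRip(Species ~ ., data = iris)
print(rules)                   # the induced "IF premise THEN conclusion" rule list
summary(rules)                 # resubstitution confusion matrix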

Keywords: rule induction, separate and conquer, top-down, CN2, decision tree
Components: SAMPLING, DECISION LIST, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Rule_Induction.pdf
Dataset: life_insurance.zip
References:
J. Furnkranz, "Separate-and-conquer Rule Learning", Artificial Intelligence Review, Volume 13, Issue 1, pages 3-54, 1999.
P. Clark, T. Niblett, "The CN2 Rule Induction Algorithm", Machine Learning, 3(4):261-283, 1989.
P. Clark, R. Boswell, "Rule Induction with CN2: Some recent improvements", Machine Learning - EWSL-91, pages 151-163, Springer Verlag, 1991.


Tanagra - Version 1.4.35

2010-01-19 (Tag:4712669683977530822)

CTP. The method for detecting the right size of the tree has been modified for the "Clustering Tree" with post-pruning component (CTP). It relies both on the angle between the half-lines at each point of the curve of the decreasing WSS (within-group sum of squares) on the growing sample, and on the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.

Regression Tree. The previous modification is incorporated into the Regression Tree component, which is a univariate version of CTP.

C-RT Regression Tree. A new regression tree component was added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE rule (Chapter 8, especially p. 226 about the formula for the variance of the MSE).

C-RT. The report of the C-RT decision tree induction has been completed. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in the x-SE rule) used to define the size of the pruned tree.

Some tutorials will describe these various changes soon.


Dealing with very large dataset in Sipina

2010-01-04 (Tag:40363561890475837)

The ability to handle large databases is a crucial problem in the data mining context. We want to handle a large dataset in order to detect hidden information. Most of the free data mining tools have problems with large datasets because they load all the instances and variables into memory. Thus, the limitation of these tools is the available memory.

To overcome this limitation, we should design solutions that allow us to copy all or part of the data to disk, and perform the treatments by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). If the solution is theoretically simple, it is difficult in practice. Indeed, the processing time should remain reasonable even if we increase the disk accesses. It is also very difficult to implement a strategy that is effective regardless of the learning algorithm (supervised learning, clustering, factorial analysis, etc.): the algorithms handle the data in very different ways; some of them make intensive use of matrix operations, others mainly search for co-occurrences between attribute-value pairs, etc.

In this tutorial, we present a specific solution in the decision tree induction context. The solution is integrated into SIPINA (as an option) because its internal data structure is especially intended for decision tree induction. Developing an approach which takes advantage of the specificities of the learning algorithm was easy in this context. We show that it is then possible to handle a very large dataset (41 variables and 9,634,198 observations) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on the nodes, etc.).

To fully appreciate the solution proposed by Sipina, we compare its behavior to generalist data mining tools such as Tanagra 1.4.33 or Knime 2.03.

Keywords: very large dataset, decision tree, sampling, sipina, knime
Components: ID3
Tutorial: en_Sipina_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, « Decision tree and large dataset ».
Tanagra, « Local sampling for decision tree learning ».


CART - Determining the right size of the tree

2010-01-02 (Tag:2075493981494483619)

Determining the appropriate size of the tree is a crucial task in the decision tree learning process. It determines the performance of the tree during its deployment into the population (the generalization process). There are two situations to avoid: the under-sized tree, too small, which poorly captures the relevant information in the training set; and the over-sized tree, which captures specific characteristics of the training set that are not relevant to the population. In both cases, the prediction model performs poorly during the generalization phase.

Among the many variants of decision tree learning algorithms, CART is probably the one that best detects the right size of the tree.

In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. We also show how to set the appropriate value of the algorithm's parameter in order to obtain a specific (user-defined) tree.
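
Outside Tanagra, the same x-SE selection can be sketched in R with the rpart package, whose cptable stores the cross-validated error (xerror) and its standard error (xstd) for each candidate subtree; train.set and class are placeholder names.

# hedged sketch: post-pruning with the 1-SE rule (rpart)
library(rpart)
fit <- rpart(class ~ ., data = train.set, method = "class", cp = 0)
tab <- fit$cptable
imin <- which.min(tab[, "xerror"])                          # subtree with minimal CV error
threshold <- tab[imin, "xerror"] + 1 * tab[imin, "xstd"]    # x = 1 gives the 1-SE rule
iopt <- which(tab[, "xerror"] <= threshold)[1]              # smallest tree under the threshold
pruned <- prune(fit, cp = tab[iopt, "CP"])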

Keywords: decision tree, CART, 1-SE Rule, post-pruning
Components: Discrete select examples, Supervised Learning, C-RT, Test
Tutorial: en_Tanagra_Tree_Post_Pruning.pdf
Dataset: adult_cart_decision_trees.zip
References:
L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", California: Wadsworth International, 1984.
R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf).


VARIMAX rotation in Principal Component Analysis

2009-12-24 (Tag:8966920901492095105)

A VARIMAX rotation is a change of coordinates used in principal component analysis (PCA) that maximizes the sum of the variances of the squared loadings. Thus, all the coefficients (squared correlations with the factors) will be either large or near zero, with few intermediate values.

The goal is to associate each variable with at most one factor, so that the interpretation of the PCA results is simplified: each variable is associated with one and only one factor, and the variables are split (as much as possible) into disjoint sets.

In this tutorial, we show how to perform this kind of rotation from the results of a standard PCA in Tanagra.
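
For comparison, the same processing can be sketched in base R, assuming X is a numeric data frame and that we rotate the first two factors (an arbitrary choice).

# hedged sketch: PCA followed by a VARIMAX rotation of the loadings (base R)
pca <- prcomp(X, scale. = TRUE)
k <- 2                                                     # number of factors kept
loadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])    # correlations variables/factors
rot <- varimax(loadings)                                   # stats::varimax
print(round(rot$loadings, 2))                              # rotated loadings: large or near-zero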

Keywords: PCA, principal component analysis, VARIMAX, QUARTIMAX
Components: Principal Component Analysis, Factor Rotation
Tutorial: en_Tanagra_Pca_Varimax.pdf
Dataset: crime_dataset_from_DASL.xls
References:
Tanagra, "New features for PCA in Tanagra"
Tanagra, "Principal Component Analysis (PCA)"
Wikipedia, "Varimax rotation"
H. Abdi, "Factor rotations in Factor Analyses"


Kruskal–Wallis one-way analysis of variance

2009-12-20 (Tag:5630831908638594898)

The tests for comparison of populations try to determine whether K (K >= 2) samples come from the same underlying population according to a dependent variable (X). In other words, we try to determine whether the underlying distribution of X is the same whatever the group.

We talk about nonparametric tests when we make no assumption about the shape of the distribution of the dependent variable. They are considered "distribution free" methods, as opposed to the parametric approaches.

In this tutorial, we implement various tests for differences in location. The Kruskal-Wallis test is certainly the most widely used one when we try to determine whether the scores among the groups are stochastically the same. But other tests exist, and we compare the results they provide. We complete the analysis by conducting multiple comparisons in order to identify the groups that differ significantly from each other.
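
For the record, the Kruskal-Wallis test and a simple multiple comparison procedure are available in base R; dat, score and group are placeholder names.

# hedged sketch: Kruskal-Wallis test and pairwise comparisons (base R)
kruskal.test(score ~ group, data = dat)
pairwise.wilcox.test(dat$score, dat$group, p.adjust.method = "holm")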

Keywords: non parametric test, independent samples, Kruskal-Wallis, Van der Waerden, Fisher-Yates-Terry-Hoeffding, median test, tests for differences in location
Components: KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, FYTH 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_KW_and_related.pdf
Dataset: wine_evaluation_nonparametric.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 14a. The Kruskal-Wallis Test for 3 or More Independent Samples.
Wikipedia. Kruskal–Wallis one-way analysis of variance.


Tests for differences in scale

2009-12-17 (Tag:3519676778641821964)

Parametric and non parametric tests for differences in scale.

The tests of equal variability (or dispersion, or scale, or simply variance) are often presented as preliminary tests before the comparison of means, in order to verify the homoscedasticity assumption. But this is not their only purpose. Comparing dispersions can be an end in itself. For example, we may wish to compare the performance of two heating systems: the average temperature at the center of the room is the same, but we may want to compare how the heat is diffused in the different parts of the room.

The parametric tests rely primarily on the Gaussian distribution; the test then becomes a test of homogeneity of variance. We highlight the Levene test in this tutorial. Other tests exist (the Bartlett test for instance); we also mention them.

When the normality assumption is questionable, when the sample size is low, or when the variable is ordinal rather than continuous, it is more appropriate to use nonparametric tests. These are called tests for equality of scales or dispersions; in fact, the procedures are not based on estimated variances. We use well-known techniques such as the Ansari-Bradley, Mood or Klotz tests. Being nonparametric, they have a broader scope. Some of these tests have a drawback: they are not applicable when the conditional distributions do not share the same parameter of central tendency (the median in general; but we can adjust the values by centering them on the median).

In this tutorial, we show how to implement these various tests with Tanagra.
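
Several of these tests also exist in base R; a minimal sketch, assuming a numeric variable x and a grouping factor g in a data frame dat:

# hedged sketches: tests for differences in scale (base R)
bartlett.test(x ~ g, data = dat)   # parametric, sensitive to non-normality
fligner.test(x ~ g, data = dat)    # robust rank-based alternative
ansari.test(x ~ g, data = dat)     # nonparametric, two samples
mood.test(x ~ g, data = dat)       # nonparametric, two samples
# the Levene and Brown-Forsythe tests are available in contributed packages (e.g. car::leveneTest)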

Keywords: parametric test, non parametric test, independent samples, Levene test, Bartlett test, Brown-Forsythe test, Mood test, Klotz test, Ansari-Bradley test
Components: LEVENE’S TEST, ANSARI-BRADLEY SCALE TEST, MOOD SCALE TEST, KLOTZ SCALE TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf
Dataset: tests_for_scale_differences.xls
References:
NIST, "Quantitative techniques", section 1.3.5 - http://www.itl.nist.gov/div898/handbook/eda/section3/eda35.htm


Outliers and influential points in regression

2009-12-09 (Tag:8315216096781488919)

The analysis of outliers and influential points is an important step of the regression diagnostics. The goal is to detect (1) the points which are very different from the others (outliers), i.e. which do not seem to belong to the analyzed population; and (2) the points which, if removed (influential points), lead us to a different model. The distinction between these kinds of points is not always obvious.

In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about the definitions of the indicators (some of them are calculated differently from one tool to another), we compare our results with those of state-of-the-art tools such as SAS and R. In a first step, we give the results described in the SAS documentation. In a second step, we describe the process and the results with Tanagra and R. In conclusion, we note that these tools give the same results.
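
For the record, most of these indicators are directly available in base R once the model has been fitted with lm; the formula below is a placeholder.

# hedged sketch: outlier and influence diagnostics for a linear regression (base R)
model <- lm(y ~ x1 + x2, data = dat)
rstandard(model)                      # standardized residuals
rstudent(model)                       # studentized residuals
hatvalues(model)                      # leverage
dffits(model)                         # DFFITS
cooks.distance(model)                 # Cook's distance
covratio(model)                       # COVRATIO
dfbetas(model)                        # DFBETAS
summary(influence.measures(model))    # flags the potentially influential points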

Keywords: linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software
Components: Multiple linear regression, Outlier detection, DfBetas
Tutorial: en_Tanagra_Outlier_Influential_Points_for_Regression.pdf
Dataset: USPopulation.xls
References:
SAS STAT User’s Guide, « The REG Procedure – Predicted and Residual Values »


Tests for comparing two related samples

2009-12-07 (Tag:3172181552173506727)

Dependent samples, also called related samples or correlated samples, occur when the response of the nth person in the second sample is partly a function of the response of the nth person in the first sample. There are several common forms of sample dependency: (1) before-after and other studies in which the same people are surveyed at different points in time, including panel studies; (2) matched-pairs studies in which each subject of the study is paired with a subject of a comparison group on the basis of matching factors (e.g. age, sex, income, etc.); (3) pairs that are simply inherent in the situation we are trying to analyze. For instance, one tries to compare the time spent watching television by the man and the woman within a couple: the blocks are naturally the households, and men and women should not be considered as independent observations.

The aim of the tests for related samples is to exclude the within-group variation from the analysis. The differences are computed within each pair of subjects. In this tutorial, we show how to implement three tests for two related samples. Two of them are nonparametric (the sign test and the Wilcoxon matched-pairs signed-rank test); the last one is the parametric t-test for related samples.
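
A minimal R sketch of the three tests, assuming two paired numeric vectors before and after:

# hedged sketches: tests for two related samples (base R)
d <- after - before
binom.test(sum(d > 0), sum(d != 0))         # sign test on the signs of the differences
wilcox.test(after, before, paired = TRUE)   # Wilcoxon matched-pairs signed-rank test
t.test(after, before, paired = TRUE)        # parametric paired t-test
shapiro.test(d)                             # normality check of the differences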

Keywords: parametric test, non-parametric test, paired samples, sign test, wilcoxon signed rank test, paired samples t-test, normality test
Components: SIGN TEST, WILCOXON SIGNED RANK TEST, PAIRED T-TEST, FORMULA, NORMALITY TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf
Dataset: comparison_2_related_samples.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 12a. The Wilcoxon Signed-Rank Test.


Multivariate tests for comparing populations

2009-12-02 (Tag:8727733864228159845)

Multivariate parametric hypothesis testing for comparing populations.

A multivariate test for the comparison of populations tries to determine whether K (K >= 2) samples come from the same underlying population according to a set of variables of interest (X1, ..., Xp).

We talk about a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of the distribution. For instance, if we assume that the data are drawn from a multivariate Gaussian distribution, the hypothesis testing relies on the mean vector or on the covariance matrix.
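
As an illustration, the two-sample Hotelling's T2 statistic can be computed by hand in R; a sketch assuming X1 and X2 are numeric matrices (rows = observations) drawn from the two groups, with equal covariance matrices.

# hedged sketch: two-sample Hotelling's T2 test
hotelling.t2 <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)
  d  <- colMeans(X1) - colMeans(X2)
  Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance matrix
  T2 <- drop((n1 * n2) / (n1 + n2) * t(d) %*% solve(Sp) %*% d)
  Fstat <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2             # F transformation of T2
  c(T2 = T2, F = Fstat, p.value = pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE))
}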

Keywords: Hotelling's T2, Wilks' Lambda, Box’s M test, Bartlett's test, mean vector, covariance matrix, MANOVA
Components: UNIVARIATE CONTINUOUS STAT, HOTELLING’S T2, HOTELLING’S T2 HETEROSCEDASTIC, BOX’S M TEST, ONE-WAY MANOVA
Tutorial: en_Tanagra_Multivariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
S. Rathburn, A. Wiesner, "STAT 505: Applied Multivariate Statistical Analysis", The Pennsylvania State University.


Parametric tests for comparing populations

2009-11-30 (Tag:8929581468242970928)

Parametric hypothesis testing for comparison of two or more populations. Independent and dependent samples.

The tests for comparison of populations try to determine whether K (K >= 2) samples come from the same underlying population according to a variable of interest (X). We talk about a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of the distribution. For instance, if we assume that the distribution of the data is Gaussian, the hypothesis testing relies on the mean or on the variance.

We handle univariate tests in this tutorial, i.e. we have only one variable of interest. When we want to analyze several variables simultaneously, we talk about multivariate tests.
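
Most of these parametric tests have direct counterparts in base R; a sketch with a numeric variable x and a grouping factor g:

# hedged sketches: parametric comparisons of populations (base R)
t.test(x ~ g, data = dat, var.equal = TRUE)        # t-test, equal variances
t.test(x ~ g, data = dat)                          # Welch t-test, unequal variances
oneway.test(x ~ g, data = dat, var.equal = TRUE)   # one-way ANOVA
oneway.test(x ~ g, data = dat)                     # Welch ANOVA
bartlett.test(x ~ g, data = dat)                   # homogeneity of variances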

Keywords: t-test, F-Test, Bartlett's test, Levene's test, Brown-Forsythe's test, independent samples, dependent samples, paired samples, matched-pairs samples, anova, welch's anova, randomized complete blocks
Components: MORE UNIVARIATE CONT STAT, NORMALITY TEST, T-TEST, T-TEST UNEQUAL VARIANCE, ONE-WAY ANOVA, WELCH ANOVA, FISHER’S TEST, BARTLETT’S TEST, LEVENE’S TEST, BROWN-FORSYTHE TEST, PAIRED T-TEST, PAIRED V-TEST, ANOVA RANDOMIZED BLOCKS
Tutorial: en_Tanagra_Univariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (Chapter 7, Product and Process Comparisons)


Three curves for classifier assessment

2009-11-26 (Tag:7075276611794896607)

The evaluation of classifiers is an important step of the supervised learning process. We want to measure the performance of the classifier. On the one hand, we have the confusion matrix and its associated indicators, very popular in academic publications. On the other hand, in real applications, users prefer curves which may seem mysterious to people outside the domain (e.g. the ROC curve for epidemiologists, the gain chart or cumulative lift curve in the marketing domain, the precision-recall curve in the information retrieval domain, etc.).

In this tutorial, we first detail the calculation of these curves by building them "by hand" in a spreadsheet. Then, we use Tanagra 1.4.33 and R 2.9.2 to obtain them. We use these curves to compare the performances of logistic regression and of a support vector machine (radial basis function kernel).
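
The core of the "by hand" computation fits in a few lines; a sketch assuming a vector score of estimated positive-class probabilities and a 0/1 vector y of true labels.

# hedged sketch: building the three curves from the sorted scores
o <- order(score, decreasing = TRUE)      # rank the instances by decreasing score
tp <- cumsum(y[o]); fp <- cumsum(1 - y[o])
P <- sum(y); N <- length(y) - P
roc  <- data.frame(fpr = fp / N, tpr = tp / P)                         # ROC curve
gain <- data.frame(size = seq_along(y) / length(y), recall = tp / P)   # gain chart
pr   <- data.frame(recall = tp / P, precision = tp / seq_along(y))     # precision-recall curve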

Keywords: roc curve, gain chart, precision recall curve, lift curve, logistic regression, support vector machine, svm, radial basis function kernel, rbf kernel, e1071 package, R software, glm
Components: DISCRETE SELECT EXAMPLES, BINARY LOGISTIC REGRESSION, SCORING, C-SVC, ROC CURVE, LIFT CURVE, PRECISION-RECALL CURVE
Tutorial: en_Tanagra_Spv_Learning_Curves.pdf
Dataset: heart_disease_for_curves.zip


Tanagra - Version 1.4.34

2009-11-22 (Tag:8604658843726329564)

A component for the induction of predictive rules (RULE INDUCTION) was added under the "Supervised Learning" tab. Its use is described in a tutorial available online (it will be translated soon).

The DECISION LIST component has been improved: we changed the test performed during the pre-pruning process. The formula is described in the tutorial above.

The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator ourselves.

Following an indication from Anne Viallefont, the calculation of the degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the database was filtered and some margins (row or column) contained a count equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sends me comments. Programming has always been a kind of leisure for me. The real work starts when it is necessary to check the results, compare them with the available references, cross them with other data mining tools, free or not, understand the possible differences, etc. At this step, your help is really valuable.


Handling Missing values in SIPINA

2009-11-09 (Tag:4094907759355781758)

Dealing with missing values is a difficult problem. The programming in itself is not an issue: we simply flag the missing value with a specific code. In contrast, their treatment before or during the data analysis is much more complicated.

Various techniques are available to handle missing values in SIPINA. In this tutorial, we show how to implement them and what their consequences are in the decision tree learning context (C4.5 algorithm; Quinlan, 1993).
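
The two simplest strategies (listwise deletion and imputation by a central value) can be sketched in a few lines of R, assuming a data frame dat with missing values coded as NA.

# hedged sketch: elementary treatments of missing values
complete <- na.omit(dat)                  # listwise / casewise deletion
impute <- function(v) {                   # mean (numeric) or mode (categorical) imputation
  if (is.numeric(v)) v[is.na(v)] <- mean(v, na.rm = TRUE)
  else v[is.na(v)] <- names(which.max(table(v)))
  v
}
dat.imputed <- as.data.frame(lapply(dat, impute))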

Keywords: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree
Tutorial: en_Sipina_Missing_Data.pdf
Dataset: ronflement_missing_data.zip
References:
P.D. Allison, « Missing Data », in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.
J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, « Handling Missing Data – Case Study », Statistical Society of Canada.
D. Garson, "Data Imputation for Missing Values"


Model deployment with Sipina

2009-11-04 (Tag:2118397077776989554)

Model deployment is the last step of the Data Mining process. In its simplest form, in a supervised learning task, it consists in applying a predictive model to unlabeled cases.

Applying the model to unseen cases is a very useful functionality. But it would be even more interesting if we could announce its accuracy. Indeed, a misclassification can have dramatic consequences. We must measure the risk we take when we make decisions from a predictive model. An indication of the performance of a classifier is important when deciding whether or not to deploy it.

In this tutorial, we show how to apply a classifier to an unlabeled sample with Sipina. We also show how to estimate the generalization error rate using a resampling scheme such as the bootstrap.
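
As an illustration of the deployment step itself, applying a classifier to unlabeled instances takes one line in R; the sketch below uses linear discriminant analysis (MASS package) and placeholder data frame names.

# hedged sketch: applying a trained classifier to unlabeled cases
library(MASS)
fit <- lda(y ~ ., data = labeled)                      # model built on the labeled sample
predicted <- predict(fit, newdata = unlabeled)$class   # predicted labels for the unseen cases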

Keywords: model deployment, unseen cases, unlabeled instances, decision tree, sipina, linear discriminant analysis


Sipina - Supported file format

2009-11-03 (Tag:6890770492290342043)

The data access is the first step of the data mining process. It is a crucial step and one of the main criteria used to assess the quality of a tool: if we are not able to load a dataset, we cannot perform any kind of analysis and the software is unusable; if the data access is not easy and requires complicated operations, we will devote less time to the other steps of the data exploration.

The first goal of this tutorial is to describe the various file formats supported by Sipina. Some of the solutions are described more deeply in other tutorials; we indicate the appropriate reference in these cases. The second goal is to describe the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.

Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when we process a large data file.

Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm
Tutorial: en_Sipina_File_Format.pdf
Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip


Importing Weka file (.arff) into Sipina

2009-10-31 (Tag:7169390756642272102)

WEKA is a very popular Data Mining tool. It supplies a very large set of machine learning methods. WEKA can handle various file formats, but it has a native one (.ARFF), which is a text file with additional specifications.

The text file format is very simple and very easy to manipulate. On the other hand, processing this kind of file is often slower than with a binary file format. When we deal with a moderately sized file, the text format is efficient enough: the differences in processing time are not discernible.

In this tutorial, we show how to import the ARFF file format into Sipina. We subdivide the dataset into train and test samples. Then we learn and we assess a decision tree.
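
For the record, an ARFF file can also be read from R; a sketch with the foreign package (the file name is the one supplied with the tutorial).

# hedged sketch: reading an ARFF file in R
library(foreign)
iono <- read.arff("ionosphere.arff")
str(iono)    # check the imported variables and their types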

Keywords: decision tree, c4.5, file format, data file importation, weka, arff
Tutorial: en_sipina_weka_file_format.pdf
Dataset: ionosphere.arff
References:
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I. Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.


Local sampling for decision tree learning

2009-10-28 (Tag:1532682281200943046)

During the decision tree learning process, the algorithm detects the best variable according to a goodness of fit measure when it tries to split a node. The calculation can take a long time, particularly when it deals with continuous descriptors for which it must detect the optimal cut point.

For all the decision tree algorithms, Sipina can use a local sampling option when it searches for the best splitting attribute on a node. The idea is the following: on a node, it draws a random sample of size n, and all the computations are then made on this sample. Of course, if the number of examples present on the node is lower than n, Sipina uses all the available examples; this occurs when we have a very large tree with a high number of nodes.

We have described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000). In this tutorial, we describe how to implement it with Sipina. We note that using a sample on each node dramatically reduces the execution time without loss of accuracy.

We use a version of the WAVEFORM dataset with 21 continuous descriptors and 2,000,000 instances. We obtain the tree in 3 seconds on our computer.

Keywords: decision tree, sampling, large dataset
Components: SAMPLING, ID3, TEST
Tutorial: en_Sipina_Sampling.pdf
Dataset: wave2M.zip
References:
J.H. Chauchat, R. Rakotomalala, « A new sampling strategy for building decision trees from large databases », Proc. of IFCS-2000, pp. 199-204, 2000.


Tanagra - Version 1.4.33

2009-10-03 (Tag:6510599570100452862)

Several logistic regression diagnostics and evaluation tools have been implemented; one of them (the reliability diagram) can be applied to any supervised learning method:

1. The estimated covariance matrix
2. The Hosmer-Lemeshow test
3. The reliability diagram (also called calibration plot)
4. The analysis of residuals, outliers and influential points (Pearson residuals, deviance residuals, difchisq, difdev, leverage, Cook's distance, dfbeta, dfbetas)

A tutorial describing the utilization of these tools will be available soon.


Using batch mode for Tanagra

2009-09-28 (Tag:5009643938825691974)

For large simulations, it is more convenient to use the BATCH mode capabilities of Tanagra rather than opening an interactive session. This is the case, for instance, when we compare the performance of various algorithms on the same dataset; when we try to find automatically the best parameters of a learning method; or when we repeat the same treatment on different datasets. In these contexts, it is more useful to save the diagrams in text mode (.TDM file format): they are easier to handle outside TANAGRA, with a text editor for instance.

In this tutorial, we want to compare the performances of the naïve bayes classifier with and without the feature selection process. We know that the naïve bayes classifier is highly sensitive to irrelevant features. The goal of this tutorial is to evaluate the efficiency of the FCBF feature selection method in this context.

Keywords: batch mode, supervised learning, naive bayes, feature selection, experiments
Components: NAIVE BAYES, FCBF, CROSS VALIDATION
Tutorial: english_dr_utiliser_tanagra_en_mode_batch.pdf
Dataset: tanagra_batch_execution.zip


Nonparametric tests for groups comparison - Independent samples - Differences in location

2009-07-15 (Tag:4133764444418670009)

The aim of a homogeneity test (or test for difference between groups) is to check whether K (K >= 2) samples are drawn from the same population according to a variable of interest. In other words, we check whether the probability distribution is the same in each sample.

The nonparametric tests make no assumptions about the distribution of the data. They are also called "distribution free" tests.

In this tutorial, we show how to implement nonparametric homogeneity tests for differences in location for K = 2 populations, i.e. the distributions of the populations are the same except for a shift in location (central tendency). The Kolmogorov-Smirnov test is the most general one: it checks any kind of difference between the cumulative distribution functions (CDF). Afterwards, we can implement other tests which characterize the difference more deeply. The Wilcoxon-Mann-Whitney test is certainly the most popular one, but we will see in this tutorial that other tests can also be implemented.

Some of the tests introduced here are also usable when the number of groups is greater than 2 (K > 2).
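
For comparison, the two most common of these tests are available in base R; a sketch with two numeric samples x1 and x2:

# hedged sketches: nonparametric tests for a difference in location (base R)
ks.test(x1, x2)                        # Kolmogorov-Smirnov: any difference between the CDFs
wilcox.test(x1, x2)                    # Wilcoxon-Mann-Whitney: shift in location
wilcox.test(x1, x2, conf.int = TRUE)   # with an estimate of the location shift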

Keywords: nonparametric test, Kolmogorov-Smirnov test, Wilcoxon-Mann-Whitney test, Van der Waerden test, Fisher-Yates-Terry-Hoeffding test, median test, location model
Components: FYTH 1-WAY ANOVA, K-S 2-SAMPLE TEST, MANN-WHITNEY COMPARISON, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_MW_and_related.pdf
Dataset: machine_packs_cartons.xls
References:
R. Rakotomalala, « Comparaison de populations. Tests non paramétriques », Université Lyon 2 (in French).
Wikipedia, « Non-parametric statistics ».


Resampling methods for error estimation

2009-07-09 (Tag:3576664280342751216)

The ability to predict correctly is one of the most important criteria used to evaluate classifiers in supervised learning. The preferred indicator is the error rate (1 - accuracy rate), which estimates the probability of misclassification of the classifier. In most cases we do not know the true error rate because we have neither the whole population nor the probability distribution of the data. So we need to compute an estimation from the available dataset.

In the small sample context, it is preferable to implement resampling approaches for the error rate estimation. In this tutorial, we study the behavior of cross-validation (cv), leave-one-out (loo) and the bootstrap. All of them are based on a repeated train-test process, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier built on the whole sample. Thus, the intermediate classifiers computed during each learning session are not really interesting in themselves; this is the reason why they are rarely provided by the data mining tools.

The main supervised learning method used is the linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another approach such as a decision tree learner (C4.5).
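
A minimal sketch of the cross-validation scheme for linear discriminant analysis in R (MASS package), assuming a data frame dat with a factor column y:

# hedged sketch: K-fold cross-validation of the error rate of LDA
library(MASS)
cv.lda.error <- function(dat, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))
  errs <- sapply(1:K, function(k) {
    fit  <- lda(y ~ ., data = dat[folds != k, ])
    pred <- predict(fit, dat[folds == k, ])$class
    mean(pred != dat$y[folds == k])
  })
  mean(errs)    # estimation of the generalization error rate
}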

Keywords: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5
Components: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5
Tutorial: en_Tanagra_Resampling_Error_Estimation.pdf
Dataset: wave_ab_err_rate.zip
Reference:
"What are cross validation and bootstrapping?"


Implementing SVM on large dataset

2009-07-05 (Tag:868264666218776416)

Support vector machines (SVM) are a set of related supervised learning methods used for classification and regression. Our aim is to compare various free implementations of SVM in terms of accuracy and computation time. Indeed, because of the heuristic nature of the algorithm, we can obtain different results with different tools on the same dataset. In fact, in publications describing the performance of SVM, we should not only specify the parameters of the algorithm but also indicate which tool was used, since the latter can influence the results.

SVM is effective in domains with a very high number of predictive variables, when the ratio between the number of variables and the number of observations is unfavorable. This tutorial places us in a domain which is particularly favorable to SVM: we want to discriminate two families of proteins from their description with amino acids. We use sequences of 4 characters (4-grams) as descriptors. Thus, we have a large number of descriptors (31,809) in comparison to the number of examples (135 instances).

We compare Tanagra 1.4.27, Orange 1.0b2, Rapidminer Community Edition 4.2 and Weka 3.5.6.
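
To fix ideas, training an SVM in R with the e1071 package (which, like Tanagra's C-SVC component, relies on LIBSVM) can be sketched as follows; dat and y are placeholders.

# hedged sketch: linear SVM with 10-fold cross-validation (e1071 / LIBSVM)
library(e1071)
model <- svm(y ~ ., data = dat, kernel = "linear", cost = 1, cross = 10)
summary(model)    # reports the cross-validated accuracy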

Keywords: svm, support vector machine
Components: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION
Tutorial: en_Tanagra_Perfs_Comp_SVM.pdf
Dataset: wide_protein_classification.zip
Reference:
Wikipedia (en), « Support vector machine »


Self-organizing map (SOM)

2009-07-01 (Tag:3728698017590737050)

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.

In this tutorial, we show how to implement Kohonen's SOM algorithm with Tanagra. We try to assess the properties of this approach by comparing the results with those of PCA. Then, we compare the results to those of K-Means, a clustering algorithm. Finally, we implement a two-step clustering process by combining the SOM algorithm with HAC (Hierarchical Agglomerative Clustering); it is a variant of the two-step clustering where K-Means is combined with HAC. We observe that the HAC primarily merges adjacent cells of the map.
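
Outside Tanagra, a Kohonen map can be sketched with the R kohonen package (version 3 assumed; the grid size below is arbitrary).

# hedged sketch: self-organizing map followed by HAC on the codebook vectors
library(kohonen)
X <- scale(dat)                                         # dat: placeholder numeric data frame
map <- som(X, grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"))
plot(map, type = "codes")                               # codebook vectors of the cells
hc <- hclust(dist(getCodes(map)), method = "ward.D2")   # two-step: HAC on the cells
clusters <- cutree(hc, k = 3)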

Keywords: Kohonen, self organizing map, SOM, clustering, dimensionality reduction, k-means, hierarchical agglomerative clustering, hac, two-step clustering
Components: UNIVARIATE CONTINUOUS STAT, UNIVARIATE OUTLIER DETECTION, KOHONEN-SOM, PRINCIPAL COMPONENT ANALYSIS, SCATTERPLOT, K-MEANS, CONTINGENCY CHI-SQUARE, HAC
Tutorial: en_Tanagra_Kohonen_SOM.pdf
Dataset: waveform_unsupervised.xls
Reference:
Wikipedia, « Self organizing map », http://en.wikipedia.org/wiki/Self-organizing_map


Univariate outlier detection methods

2009-06-29 (Tag:1893398724464907844)

The detection and treatment of outliers (individuals with unusual values) is an important task of data preparation. Unusual values can mislead the results of subsequent data analyses.

Outliers can be detected on one variable (a 158-year-old man) or on a combination of variables (a 12-year-old boy who runs the 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended for the univariate detection of outliers, i.e. it considers the variables individually.

The approaches implemented in the component come from the NIST website (see reference). We also use an additional rule based on an x-sigma deviation from the mean of the variable.
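
The two elementary rules can be sketched in a few lines of R for a numeric vector v:

# hedged sketches: univariate outlier detection rules
q <- quantile(v, c(0.25, 0.75)); iqr <- q[2] - q[1]
outliers.iqr <- v < q[1] - 1.5 * iqr | v > q[2] + 1.5 * iqr   # box-plot (IQR) rule
z <- (v - mean(v)) / sd(v)
outliers.sigma <- abs(z) > 3                                  # x-sigma rule, here with x = 3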

Keywords: outlier, influential point
Components: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT
Tutorial: en_Tanagra_Outliers_Detection.pdf
Dataset: body_mass_index.xls
References:
NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.1.6, « What are outliers in the data ? »
R. High, "Dealing with 'Outliers': How to Maintain Your Data's Integrity"


Copy paste feature into the diagram

2009-06-29 (Tag:6752361841708594803)

When we define the data analysis process in Tanagra, it is possible to copy components (or entire branches of components) to another location in the diagram. This feature is very helpful when we have to repeat sequences of treatments in different parts of the diagram. The settings are also duplicated.

In this tutorial, we show how to copy a component or a branch. We will see that this feature is helpful when, for instance, we compare the performances of supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same; only the method we want to evaluate changes.

We work on a single project here: we cannot copy and paste components between two opened projects. But, in another tutorial, we show how to save a part of the diagram in an external file; thus, the same processing sequence can be applied to multiple datasets.

Keywords: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction
Components: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis
Tutorial: en_Tanagra_Diagram_New_Features.pdf
Dataset: sonar.xls


The A PRIORI MR component

2009-06-27 (Tag:2122614456029825949)

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It is often used in the market basket analysis domain, e.g. if a customer buys onions and potatoes then he also buys beef. But it can in fact be applied in any area where we want to discover associations between variables.

We have already described the association rule mining tools of Tanagra in several tutorials. The A PRIORI approach is certainly the most popular one. But, despite its good properties, this method has a drawback: the number of obtained rules can be very high. The ability to highlight the most interesting, the most relevant, rules becomes a major challenge.

In this tutorial, we show how to implement the A PRIORI MR component. It differs from the others by offering additional tools for exploring and assessing the mined rules: original measures based on the "test value" principle allow the rules to be evaluated differently; the ability to copy the results into a spreadsheet allows a more detailed exploration of the rule base; and by subdividing the dataset into train and test sets, we obtain more reliable values of the interestingness measures of the rules.
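
For readers working in R, standard association rule mining (without the "test value" measures specific to A PRIORI MR) can be sketched with the arules package, assuming dat is a data frame of categorical variables.

# hedged sketch: classical a priori algorithm (arules)
library(arules)
trans <- as(dat, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.8))
inspect(head(sort(rules, by = "lift"), 10))   # the ten rules with the highest lift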

Keywords: association rule, a priori algorithm, interestingness measure, test value principle
Components: A PRIORI MR
Tutorial: en_Tanagra_APrioriMR_Component.pdf
Dataset: credit_assoc.xls
Reference:
Wikipedia, "Association rule learning"


Two-step clustering for handling large databases

2009-06-14 (Tag:5086959615464697845)

The aim of clustering is to identify homogeneous subgroups of instances in a population. In this tutorial, we implement a two-step clustering algorithm which is well suited when we deal with a large dataset. It combines the ability of K-Means to handle a very large dataset with the ability of hierarchical clustering (HCA, Hierarchical Cluster Analysis) to give a visual presentation of the results, called a "dendrogram". The latter describes the clustering process, starting from unrefined clusters until the whole dataset belongs to one cluster; it is especially helpful when we want to detect the appropriate number of clusters.

The implementation of the two-step clustering (also called "hybrid clustering") in Tanagra has already been described elsewhere. Following the recommendation of Lebart et al. (2000), we perform the clustering algorithm on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This pre-treatment cleans the dataset by removing irrelevant information such as noise. In this tutorial, we show the efficiency of the approach on a large dataset with 500,000 observations and 68 variables. We use Tanagra 1.4.27 and R 2.7.2, which are the only tools that allow us to implement the whole process easily.
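
A sketch of the whole chain in R (PCA, then K-Means with many centers, then HAC on the centers), assuming a numeric data frame X; the numbers of latent variables, pre-clusters and final clusters are arbitrary choices.

# hedged sketch: two-step (hybrid) clustering
pca <- prcomp(X, scale. = TRUE)
Z <- pca$x[, 1:10]                                   # first latent variables
km <- kmeans(Z, centers = 100, nstart = 5)           # step 1: many small pre-clusters
hc <- hclust(dist(km$centers), method = "ward.D2")   # step 2: HAC on the pre-cluster centers
plot(hc)                                             # dendrogram: choose the number of clusters
groups <- cutree(hc, k = 4)[km$cluster]              # final cluster of each instance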

Keywords: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_CAH_Mixte_Gros_Volumes.pdf
Dataset: sample-census.zip
References:
L. Lebart, A. Morineau, M. Piron, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000; chapter 2, sections 2.3 and 2.4.
D. Garson, "Cluster Analysis" from North Carolina State University.


K-Means - Comparison of free tools

2009-06-11 (Tag:5854170100391175797)

K-means is a clustering (unsupervised learning) algorithm. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.

The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=k-means). The goal here is to compare its implementation in various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without any additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.

Keywords: clustering, k-means, PCA, principal component analysis, MDS, multidimensional scaling
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_et_les_autres_KMeans.pdf
Dataset: cars_dataset.zip
Reference:
D. Garson, "Cluster Analysis"


Understanding the "test value" criterion

2009-05-30 (Tag:8326547432999274226)

The test value (VT) is a criterion used in various components of TANAGRA. It is mainly used for the characterization of a group of observations according to a continuous or categorical variable. The groups may be defined by the categories of a discrete variable; they can also be computed by a machine learning algorithm (e.g. a clustering algorithm, a node of a decision tree, etc.).

The principle is elementary: we compare the value of a descriptive statistic computed on the whole sample with its value computed on the subsample related to the group. For a continuous variable, we compare the means; for a discrete one, we compare the proportions.

Despite, or because of, its simplicity, the VT is very useful. The formulation we present in this tutorial is taken from the book by Lebart et al. (see reference below). The VT is intensively used in commercial software such as SPAD (http://eng.spad.eu/). It allows groups to be characterized, but it can also be used to strengthen the interpretation of the factors extracted by a factorial analysis process.

In this tutorial, we emphasize the formulas used for both categorical and continuous variables. We relate them to the results provided by the GROUP CHARACTERIZATION component of TANAGRA.
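
For a continuous variable, my reading of the formula (to be checked against the tutorial and the Lebart et al. book) can be sketched as follows in R, where v is the variable on the whole sample of size n and group.index gives the positions of the members of the group.

# hedged sketch: test value for a continuous variable
test.value <- function(v, group.index) {
  n  <- length(v); ng <- length(group.index)
  s2 <- var(v) * (n - 1) / n                 # empirical variance on the whole sample
  (mean(v[group.index]) - mean(v)) / sqrt((n - ng) / (n - 1) * s2 / ng)
}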

Keywords: test value, group characterization, clustering, factorial analysis
Components: Group characterization
Tutorial: en_Tanagra_Comprendre_La_Valeur_Test.pdf
Dataset: heart_disease_male.xls
Reference:
L. Lebart, A. Morineau, M. Piron, « Statistique exploratoire multidimensionnelle », Dunod, 2000; pages 181 to 184.


Descriptive statistics (continued)

2009-05-29 (Tag:7670428229899790262)

The aim of descriptive statistics is to describe the main features of a collection of data in quantitative terms. The visualization of the whole data table is seldom useful; it is preferable to summarize the characteristics of the data with a few selected numerical indicators.

In this tutorial, we distinguish two kinds of descriptive approaches: the univariate tools, which summarize the characteristics of each variable individually; and the bivariate tools, which characterize the association between two variables. According to the type of the variables (categorical or continuous), we use different indicators.
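
A minimal R counterpart of these indicators, with placeholder column names:

# hedged sketches: univariate and bivariate descriptive statistics (base R)
summary(dat$age)                          # numeric summary of a continuous variable
table(dat$sex)                            # frequencies of a categorical variable
cor(dat$age, dat$income)                  # linear correlation between two continuous variables
chisq.test(table(dat$sex, dat$smoker))    # association between two categorical variables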

Keywords: descriptive statistics
Components: UNIVARIATE DISCRETE STAT, CONTINGENCY CHI-SQUARE, UNIVARIATE CONTINUOUS STAT, SCATTERPLOT, LINEAR CORRELATION, GROUP CHARACTERIZATION
Tutorial: en_Tanagra_Descriptive_Statistics.pdf
Dataset: enquete_satisfaction_femmes_1953.xls
References:
Tanagra Tutorials, "Descriptive statistics"


ID3 on a large dataset

2009-05-01 (Tag:8989223456643444641)

In the data mining domain, the increasing size of the dataset is one of the major challenges in the recent years. The ability to handle large data sets is an important criterion to distinguish between research and commercial software.

Commercial tools often have very efficient data management systems which limit the amount of data loaded into memory at each step of the treatment. Research tools, on the other hand, keep all the data in memory; the limit is then clearly the memory capacity of the machine. This is certainly a drawback for the treatment of large files. We note however that, nowadays, we can have very powerful computers at low cost, so this drawback is constantly pushed back. With an appropriate encoding strategy, we can fit the whole dataset in memory, even when we handle a large data file.

In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. In contrast to other decision tree algorithms such as C4.5 or CART, the determination of the right size of the tree is based on a pre-pruning rule. We will see that the computation is fast because of this characteristic.

Keywords: large dataset, decision tree algorithm, ID3
Components: ID3, SPV LEARNING
Tutorial: en_Tanagra_Big_Dataset.pdf
Dataset: covtype.zip
References:
Tanagra tutorials, "Performance comparison under Linux"
Tanagra Tutorials, "Decision tree and large dataset"


Principal Component Analysis (PCA)

2009-04-30 (Tag:5345401340098416990)

PCA belongs to the factor analysis approaches. It is used to discover the underlying structure of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors (dimensions); as such, it is a "non-dependent" procedure, i.e. it does not assume that a dependent variable is specified.

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. We use the AUTOS_ACP.XLS dataset from Saporta's reference book. The interest of this dataset is that we can compare our results with those described in the book (pages 177 to 181). We simply show the sequence of operations and how to read the result tables; for a detailed interpretation, it is best to refer to the book.

Keywords: factor analysis, principal component analysis, correlation circle
Components: Principal Component Analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acp.pdf
Dataset: autos_acp.xls
References:
G. Saporta, " Probabilités, Analyse de données et Statistique ", Dunod, 2006 ; pages 177 to 181.
D. Garson, "Factor Analysis".
Statsoft Textbook, "Principal components and factor analysis".


Multiple Correspondence Analysis (MCA)

2009-04-30 (Tag:5740391324092332432)

The multiple correspondence analysis is a factor analysis approach. It deals with a tabular dataset where a set of examples are described by a set of categorical variables. The aim is to map the dataset in a reduced dimension space (usually two) which allows us to highlight the associations between the examples and the variables. It is useful to understand the underlying structure of a tabular dataset.

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. The opportunity to copy/paste the results into a spreadsheet is certainly one of the most interesting functionalities of the software. Indeed, it gives access to familiar tools (sorting, formatting, etc.) in an environment well known to data analysts. For example, the possibility of sorting the various tables according to the contributions and the COS2 proves really practical when one wishes to interpret the dimensions.

Keywords: factor analysis, multiple correspondence analysis
Components: Multiple correspondence analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acm.pdf
Dataset: races_canines_acm.xls
References:
M. Tenenhaus, " Méthodes statistiques en gestion ", Dunod, 1996 ; pages 212 to 222 (in French).
Statsoft Inc., "Multiple Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".


Support Vector Regression (SVR)

2009-04-26 (Tag:8226744122661027336)

Support Vector Machines (SVM) are a well-known approach in the machine learning community. They are usually implemented for a classification problem in a supervised learning framework. But SVM can also be used in a regression process, where we want to predict or explain the values taken by a continuous target attribute. We then speak of Support Vector Regression (SVR).

The method is not widely diffused among statisticians. Yet it combines qualities that rank it favorably compared with existing techniques. It behaves well even if the ratio between the number of variables and the number of observations becomes very unfavorable, or when the predictors are highly correlated. Another advantage is the kernel principle (the famous "kernel trick"): it is possible to construct a non-linear model without explicitly having to produce new descriptors. A deeper study of the characteristics of the method allows comparisons with penalized regression such as ridge regression.

The first subject of this tutorial is to show how to use the two new SVR components of the 1.4.31 version of Tanagra. They are based on the famous LIBSVM library; we use the same library for classification (see the C-SVC component). We compare our results to those of the R software (version 2.8.0), using the e1071 package, which is also based on the LIBSVM library.

The second subject is to propose a new assessment component for regression. In the supervised learning framework, it is usual to split the dataset into two parts, the first for the learning process and the second for its evaluation, in order to obtain an unbiased estimation of the performance; we can implement the same approach for regression. The procedure is even essential when we try to compare models with various complexities (or various degrees of freedom). We will see in this tutorial that the usual indicators calculated on the learning data are highly misleading in certain situations: we must use an independent test set when we want to assess a model.
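
As an illustration of the same evaluation principle in R, an epsilon-SVR can be fitted with the e1071 package (also based on LIBSVM) and assessed on a held-out test set; train and test are placeholder data frames with a numeric column y.

# hedged sketch: epsilon-SVR assessed on an independent test set (e1071)
library(e1071)
fit  <- svm(y ~ ., data = train, type = "eps-regression", kernel = "radial")
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))   # error measured outside the learning sample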

Keywords: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071
Components: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT
Tutorial: en_Tanagra_Support_Vector_Regression.pdf
Dataset: qsar.zip
References:
C.C. Chang, C.J. Lin, "LIBSVM - A Library for Support Vector Machines".
S. Gunn, « Support Vector Machine for Classification and Regression », Technical Report of the University of Southampton, 1998.
A. Smola, B. Scholkopf, « A tutorial on Support Vector Regression », 2003.


Launching Tanagra from OOo Calc under Linux

2009-04-23 (Tag:3985888956957546758)

The integration of Tanagra into a spreadsheet, such as Excel or Open Office Calc (OOo Calc or OOCalc), is undoubtedly an advantage. Without special knowledge about database formats, the user can handle the dataset in a familiar environment, the spreadsheet, and send it to a specialized Data Mining tool when he wants to lead a more sophisticated analysis.

The add-on for OOCalc was initially created for Windows. Recently, I described the installation and the use of Tanagra under Linux. The next step is of course the integration of Tanagra into OOCalc under Linux. Mr. Thierry Leiber has done this work for the 1.4.31 version of Tanagra: he has extended the existing add-on, and we can now launch Tanagra from OOCalc under both Windows and Linux. The add-on was tested under the following configurations: Windows XP + OOCalc 3.0.0; Windows Vista + OOCalc 3.0.1; Ubuntu 8.10 + OOCalc 2.4; Ubuntu 8.10 + OOCalc 3.0.1.

This document extends a previous tutorial, but we work now under the Linux environment (Ubuntu 8.10). All the screenshots are in French because my OS is in French, but I think the process is the same for a Linux system configured in another language.

Keywords: open office calc, add-on, principal component analysis, PCA, correlation circle, illustrative variable, linux, ubuntu 8.10 intrepid ibex
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT
Tutorial: en_Tanagra_OOCalc_under_Linux.pdf
Dataset: cereals.xls
References:
Tanagra, « Connection with Open Office Calc »
Tanagra, « Tanagra under Linux »


Tanagra - Version 1.4.31

2009-04-15 (Tag:4955018630663550474)

Thierry Leiber has improved the add-on making the connection between Tanagra and Open Office. It is now possible, under Linux, to install the add-on for Open Office and launch Tanagra directly after selecting the data (see the tutorials on installing Tanagra under Linux and the integration of add-on in Open Office Calc). Thierry, thank you very much for this contribution which helps the users of Tanagra.

Following a suggestion from Mr. Laurent Bougrain, the confusion matrix is now added to the automatic saving of results in experiments. Thank you to Laurent, and to all the others, who by their constructive comments help me improve Tanagra in the right direction.

In addition, two new components for regression using the support vector machine principle (support vector regression) were added: Epsilon-SVR and Nu-SVR. A tutorial presenting these methods and comparing our results with those of the R software will be available soon. Tanagra, like the R package "e1071", relies on the famous LIBSVM library.

Tutorials about these releases are coming soon.


Cost-sensitive learning - Comparison of tools

2009-03-19 (Tag:1019262174839808385)

Everyone agrees that taking the misclassification costs into consideration is an important aspect of the practice of Data Mining. For instance, diagnosing a disease for a healthy person does not have the same consequences as predicting health for an ill person. Yet despite its importance, the topic is seldom addressed, both from a theoretical point of view, i.e. how to integrate costs during the evaluation of models (easy) and during their construction (a little less easy), and from a practical point of view, i.e. how to implement the approach in software.

Using the misclassification cost during the classifier evaluation is easy: we make a cross-product between the misclassification cost matrix and the confusion matrix. We obtain an "expected misclassification cost" (or an expected gain if we multiply the result by -1). Its interpretation is not very easy; it is mainly used for the comparison of models.

Handling costs during the learning process is less usual. Several approaches are possible. In this tutorial, we show how to use some components of Tanagra intended for cost-sensitive supervised learning on a realistic dataset. We also programmed the same procedures in the R software (http://www.r-project.org/) to give a better visibility on what is implemented. We compare our results with those of Weka. The algorithm underlying our analysis is a decision tree; according to the software, we use C4.5, CART or J48.
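
The evaluation part can be sketched in two lines of R; the 2x2 cost matrix below is arbitrary and must share the same ordering of the classes as the confusion matrix.

# hedged sketch: expected misclassification cost from the confusion and cost matrices
conf  <- table(predicted, actual)                 # placeholder vectors of labels
costs <- matrix(c(0, 5, 1, 0), nrow = 2)          # rows = predicted, columns = actual
expected.cost <- sum(conf * costs) / sum(conf)    # cross-product averaged over the instances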

Keywords: supervised learning, cost sensitive learning, misclassification cost matrix, decision tree algorithm, Weka 3.5.8, R 2.8.0, rpart package
Tutorial: en_Tanagra_Cost_Sensitive_Learning.pdf
Dataset: dataset-dm-cup-2007.zip
References:
J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application", PKDD-2001.
"Cost-sensitive Decision Tree", Tutorials for Sipina.


Predictive association rules

2009-02-26 (Tag:7877212175111381971)

The algorithms for association rule extraction were originally developed to find logical relations between variables with the same status. Predictive association rules instead seek to generate associations of items that characterize a dependent attribute. We are then in a supervised learning framework.

Basically, the algorithm is not really modified; the exploration is simply limited to the itemsets that include the dependent variable, so the computation time is reduced. Two components of Tanagra are dedicated to this task: SPV ASSOC RULE and SPV ASSOC TREE. They are available in the Association tab. Compared to conventional approaches, the components of Tanagra introduce an additional specificity: we have the possibility to specify the class value ("dependent variable = value") that we wish to predict. The interest is to finely set the parameters of the algorithm, directly related to the characteristics of the data. This is crucial, for example, when the prior probabilities of the dependent variable values are very different.

We had already presented the SPV ASSOC TREE component elsewhere, but it was in the context of the multivariate characterization of groups of individuals (coming from a clustering algorithm for instance), where we compared it to the GROUP CHARACTERIZATION component. In this tutorial, we compare the behavior of SPV ASSOC TREE and SPV ASSOC RULE during a prediction task. We put forward their shared properties, the problems they can handle, and their differences. SPV ASSOC RULE, which supplies original rule interestingness measures (the "test value" indicator), has the ability to simplify the rule base.

Keywords: predictive association rules, interestingness measure, rule base ranking, rule base simplification
Components: SPV ASSOC TREE, SPV ASSOC RULE
Tutorial: en_Tanagra_Predictive_AssocRules.pdf
Dataset: credit_assoc.xls


Interestingness measures for association rules

2009-02-22 (Tag:9078262773107578774)

This document outlines the measures proposed by the A PRIORI MR and SPV ASSOC RULE components for assessing association rules. They come from studies reported in several publications by A. Morineau and R. Rakotomalala.

A measure characterizes the relevance of a rule. It can be used to rank the rules. It should also help to discern those that are "significantly interesting" from those that are irrelevant. This last point is totally prospective: there is no really satisfactory solution at this time.

The A PRIORI MR and SPV ASSOC RULE components are experimental tools for the evaluation of the rules extracted by the association rule induction algorithm. They allow the rules to be evaluated using measures based on the test value principle.

Keywords: association rules, interestingness measure, test value
Components: A PRIORI MR, SPV ASSOC RULE
Tutorial: en_Tanagra_APrioriMR_Measures.pdf
References:
R. Rakotomalala, A. Morineau, 2008. “The TVpercent principle for the counterexamples statistic”, in Statistical Implicative Analysis, Studies in Computational Intelligence Series, 127, 449-462, Springer, 2008 -- http://www.springerlink.com/content/g245317206950529/
Wikipedia, "Association rule learning"


Performance comparison under Linux

2009-01-27 (Tag:1238927418646001310)

The gain chart is an alternative to the confusion matrix for the evaluation of a classifier. Its name differs according to the tool (e.g. lift curve, lift chart, cumulative gain chart, etc.).

The main idea is to draw a chart where the X coordinate is the percentage of the population targeted and the Y coordinate is the percentage of positive values of the class attribute retrieved. The gain chart is used mainly in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
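
To make the construction concrete, here is a minimal base-R sketch on an invented score vector (the tutorial itself relies on the SCORING and LIFT CURVE components):

set.seed(1)
y     <- rbinom(1000, 1, 0.1)                # 0/1 class attribute, about 10% positives
score <- runif(1000) + 0.5 * y               # a crude score, better than random
o     <- order(score, decreasing = TRUE)     # rank individuals by decreasing score
x_pct <- seq_along(y) / length(y)            # percentage of the population targeted
y_pct <- cumsum(y[o]) / sum(y)               # percentage of positives retrieved
plot(x_pct, y_pct, type = "l",
     xlab = "% of population", ylab = "% of positives")
abline(0, 1, lty = 2)                        # baseline: random targeting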

The construction of the gain chart is already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we carry out the experiment under Linux (French version of Ubuntu 8.10 – see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and the utilization of Tanagra under Linux). The third originality is that we handle a large dataset with 2,000,000 examples and 41 variables. It is interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful. We note that some tools failed to complete the analysis on the full dataset.

Keywords: scoring, linear discriminant analysis, naive bayes classifier, lift curve, gain chart, cumulative gain chart, knime, rapidminer, weka, orange
Components: SAMPLING, LINEAR DISCRIMINANT ANALYSIS, SCORING, LIFT CURVE
Tutorial: en_Tanagra_Gain_Chart.pdf
Dataset: dataset_gain_chart.zip


Sipina under Linux

2009-01-24 (Tag:2659174898009779145)

In a recent tutorial, we showed that it is possible to work with Tanagra under Linux using Wine. In this document, we implement Sipina (a data mining tool intended for decision tree induction) in the same framework, i.e. we install and use Sipina in a Linux environment. We use the Ubuntu distribution (French version 8.10). All the functionalities of Sipina are available, especially the interactive tools which allow us to explore in depth the subpopulation covered by a node of the tree.

In this tutorial, we implement the following steps: (1) Installing Sipina under Linux; (2) Launching the software; (3) Loading a dataset (text file with tab separator); (4) Choosing the class attribute and the predictive variables; (5) Partitioning the dataset into a train set and a test set; (6) Computing the tree on the train set; (7) Evaluating the tree on the test set, e.g. computing the confusion matrix, the error rate, etc.; (8) Exploring a subpopulation related to a node of the tree; (9) Launching a new analysis on a subpopulation related to a node of the tree.

We will describe quickly the various features of the software in this tutorial. They are already presented in several documents available online (https://eric.univ-lyon2.fr/ricco/sipina.html, see the DOWNLOAD section). Our main goal here is to show the capabilities of Sipina under Linux.

We use the French Ubuntu 8.10 distribution; we have also installed Wine, a program which allows Windows programs to run under Linux.

Keywords: linux, ubuntu, wine, sipina, decision tree
Tutorial: en_Sipina_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine


Tanagra under Linux

2009-01-13 (Tag:4799220337499273941)

The users ask sometimes "Can I use Tanagra under Linux?" The answer is YES and NO.

NO, we cannot run Tanagra natively under Linux. It is a 32-bit program for Windows.

But YES, we can run Tanagra under Linux using WINE, a well-known Linux application which allows us to run Windows programs on Linux. We can then take full advantage of Tanagra without worrying about compatibility issues.

In this tutorial, we show how to install and run Tanagra under Ubuntu (a free of charge version of Linux) using WINE. We can fully use Tanagra in the Linux environment.

Keywords: linux, ubuntu, wine
Tutorial: en_Tanagra_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine


Logistic regression - Software comparison

2008-12-26 (Tag:4364396627072449055)

Logistic regression is a popular supervised learning method.

There are several reasons for this. The theoretical foundation of the method is attractive: it is in line with generalized regression, of which logistic regression is a well identified variant that one can use according to the kind of dependent variable (class attribute). Its predictive performance is comparable to that of the other approaches. Furthermore, it provides several indicators for the interpretation of the results; among them, the famous odds-ratio enables us to identify precisely the contribution of each predictor.

Logistic regression is available in many free tools. In this tutorial, we compare the implementation of this technique in Tanagra 1.4.27, R 2.7.2 (GLM command), Orange 1.0b2, Weka 3.5.6, and the RWeka 0.3-13 package for R. Beyond the comparison, this tutorial is also an opportunity to show how to chain the following operations with these tools: importing an ARFF file (the Weka file format); splitting the data into learning and test sets; computing the predictive model on the learning set; testing the model on the test set; selecting the relevant variables using a criterion consistent with the logistic regression; evaluating again the performance of the simplified model.

Keywords: logistic regression, supervised learning, software comparison
Components: BINARY LOGISTIC REGRESSION, SUPERVISED LEARNING, TEST, DISCRETE SELECT EXAMPLES
Tutorial: en_Tanagra_Perfs_Reg_Logistique.pdf
Dataset: wave_2_classes_with_irrelevant_attributes.zip
References:
D. Garson, "Logistic Regression"
Wikipedia, "Logistic Regression"


Association rule mining - Software comparison

2008-12-23 (Tag:4277159662482597321)

This document extends a previous tutorial dedicated to the comparison of various implementations of association rules mining (http://data-mining-tutorials.blogspot.com/2008/10/association-rule-learning.html). We had analyzed Tanagra, Orange and Weka. We extend here the comparison to R, RapidMiner and Knime.

We handle an attribute-value dataset. It is not the usual data format for association rule mining, where the "native" format is rather the transactional database. We see in this tutorial that some tools can automatically recode the data, while others require an explicit transformation. Thus, we must find the right components and the correct sequence of treatments to produce the transactional data format. The process is not always easy, depending on the software.

The tools studied in this tutorial are: Tanagra 1.4.28, R 2.7.2 (arules package 0.6-6), Orange 1.0b2, RapidMiner Community Edition, Knime 1.3.5 and Weka 3.5.6. These programs load the data and perform the calculations in memory. When the size of the database increases, the real bottleneck is the memory available on our personal computer.

Keywords: association rule, frequent itemset
Components: A PRIORI, A PRIORI PT
Tutorial: en_Tanagra_Assoc_Rules_Comparison.pdf
Dataset: credit-german.zip
References:
R. Rakotomalala, « Règles d’association »
Wikipedia, "Association rule learning"


K-Means - Classification of a new instance

2008-12-20 (Tag:2254319807531827093)

Deployment is an important step of the data mining framework. In the case of clustering, after building the clusters with a learning algorithm, we want to determine to which particular cluster (group) a new unlabelled instance belongs.

In this tutorial, we use the K-Means algorithm. We assign each new instance to the closest group according to the distance to the group centers. This approach is sound because the assignment technique used in the deployment phase is consistent with the learning algorithm. This would not be true with another learning algorithm, e.g. HAC (hierarchical agglomerative clustering) with the single linkage aggregation rule: the distance to the group centers is inadequate in that context. Thus, the classification strategy must be consistent with the learning strategy.
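
As an illustration of this assignment rule, here is a base-R sketch on a purely numeric toy dataset (the tutorial itself works on categorical descriptors transformed by multiple correspondence analysis):

km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
new_instance <- c(5.9, 3.0, 5.1, 1.8)        # an invented unlabelled observation
d <- apply(km$centers, 1,                    # Euclidean distance to each center
           function(cc) sqrt(sum((cc - new_instance)^2)))
assigned_cluster <- which.min(d)             # the new instance joins the closest group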

All the descriptors are discrete in our dataset. The K-Means algorithm does not handle this kind of data directly, so we must transform them. We use a multiple correspondence analysis algorithm.

In this tutorial, we compare the results of Tanagra 1.4.28 and R 2.7.2.

Keywords: data clustering, k-means, multiple correspondence analysis, factorial analysis, clusters interpretation, data exportation
Components: MULTIPLE CORRESPONDENCE ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, EXPORT DATASET
Tutorial: en_Tanagra_KMeans_Deploiement.pdf
Dataset: banque_classif_deploiement.zip
References:
Wikipedia (en), « K-Means algorithm ».
F. Husson, S. Lê, J. Josse, J. Mazet, « FactoMineR – A package dedicated to Factor Analysis and Data Mining with R ».


Decision tree and large dataset

2008-11-13 (Tag:8390368243502377409)

Dealing with large datasets is one of the most important challenges in data mining. In this context, it is interesting to analyze and compare the performances of various free implementations of the learning methods, especially the computation time and the memory occupation. Most of the programs load the whole dataset into memory, so the main bottleneck is the available memory.

In this tutorial, we compare the performance of several implementations of the C4.5 algorithm (Quinlan, 1993) when processing a file containing 500,000 observations and 22 variables. The programs used are: Knime 1.3.5; Orange 1.0b2; R (rpart package) 2.6.0; RapidMiner Community Edition; Sipina Research; Tanagra 1.4.27; Weka 3.5.6.

Our data file is a well-known artificial dataset described in the CART book (Breiman et al., 1984). We have generated a dataset with 500,000 observations. The class attribute has 3 values and there are 21 continuous predictors.

Keywords: c4.5, decision tree, classification tree, large dataset, knime, orange, r, rapidminer, sipina, tanagra, weka
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree.pdf
Dataset: wave500k.zip
Reference: R. Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufman, 1993.


Decision tree and cross validation (continued)

2008-11-11 (Tag:6387845628072692266)

In a previous tutorial, we compared the decision tree induction and cross-validation performances of three programs: TANAGRA, ORANGE and WEKA.

In this paper, we extend the same framework to the comparison of three new programs: R 2.7.2, KNIME 1.3.51 and RAPIDMINER Community Edition.

Keywords: supervised learning, decision tree, classification tree, classifier assessment
Components: Supervised learning, C-RT, Cross validation
Tutorial: en_Tanagra_Validation_Croisee_Suite.pdf
Dataset: heart.zip


Interactive induction of decision tree

2008-11-10 (Tag:708086023930514582)

Interactive induction of decision trees with SIPINA.

Various functionalities of SIPINA are not documented. In this tutorial, we show how to explore the nodes of a decision tree, in order to obtain a better understanding of the characteristics of the subpopulation at a node. This is an important task, for instance when we want to validate the rules with a domain expert.

Keywords: decision tree, classification tree, interactive analysis
Tutorial: en_sipina_interactive.pdf
Dataset: blood_pressure_levels.xls


Decision tree and contextual descriptive statistics

2008-11-10 (Tag:1300820465897562045)

SIPINA proposes some descriptive statistics functionalities. In itself, this is not really exceptional; many free tools do that.

It becomes more interesting when we combine these tools with the decision tree: the exploratory phase is improved. Indeed, every node of the tree corresponds to a subpopulation. The variables which do not appear in the tree are not necessarily irrelevant: perhaps some of them were hidden during the tree learning, which selects the "best" variables. By computing contextual descriptive statistics, in connection with each node, we better understand the prediction rules highlighted during the induction process.

Keywords: descriptive statistics, decision tree, interactive exploration
Tutorial: en_sipina_descriptive_statistics.pdf
Dataset: heart_disease_male.xls


Cost-sensitive Decision Tree

2008-11-10 (Tag:5234908511404244119)

Error rate evaluation is a key point of the induction process. A usual approach is to partition the dataset into a learning set, which is used for the induction of the classification model, and a test set, which is used for the performance evaluation.

The first subject of this tutorial is to show how to partition the dataset with SIPINA. Then, we build the tree on the first part of the dataset and classify the examples of the second part. We compare the predicted value and the true value, and we obtain an honest error rate estimation.

The second main subject of this document is to show how to take into account the misclassification costs during the learning process and the evaluation process. We use a slightly modified version of C4.5 (Quinlan, 1993).

Keywords: decision trees, C4.5, classifier evaluation, cost-sensitive learning, F-Measure, spams detection
Tutorial: en_sipina_cost_sensitive.pdf
Dataset: spam.xls


Semi-partial correlation

2008-11-10 (Tag:9088007003676927455)

The semi-partial correlation measures the additional information provided by an independent variable (X), compared with one or several control variables (Z1,..., Zp), that we can use for the explanation of a dependent variable (Y).

We can compute the semi-partial correlation in various ways. The square of the semi-partial correlation can be obtained as the difference between the squared multiple correlation coefficient of the regression Y / X, Z1,...,Zp (including X) and the same quantity for the regression Y / Z1,...,Zp (without X).

We can also obtain the semi-partial correlation by computing the residuals of the regression X / Z1,...,Zp; then, we compute the correlation between Y and these residuals. In other words, we seek to quantify the relationship between X and Y after removing the effect of the control variables from X only. The semi-partial correlation is an asymmetrical measure.

In this tutorial, we show the different ways for computing the semi-partial correlation.
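
As a minimal base-R sketch of these two equivalent computations (the variables are simulated for the example):

set.seed(1)
z1 <- rnorm(100); z2 <- rnorm(100)
x  <- 0.5 * z1 + rnorm(100)
y  <- 0.7 * x + 0.3 * z2 + rnorm(100)
# (1) difference between the two squared multiple correlation coefficients
r2_full    <- summary(lm(y ~ x + z1 + z2))$r.squared
r2_reduced <- summary(lm(y ~ z1 + z2))$r.squared
sr2 <- r2_full - r2_reduced                  # squared semi-partial correlation
# (2) correlation between y and the residuals of the regression x ~ z1 + z2
sr  <- cor(y, resid(lm(x ~ z1 + z2)))        # the sign carries the direction
all.equal(sr^2, sr2)                         # both approaches agree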

Keywords: correlation, Pearson's correlation, semi-partial correlation, multiple linear regression
Components: LINEAR CORRELATION, MULTIPLE LINEAR REGRESSION, SEMI-PARTIAL CORRELATION
Tutorial: en_Tanagra_Semi_Partial_Correlation.pdf
Dataset: cars_semi_partial_correlation.xls
Reference: M. Brannick, « Partial and Semipartial Correlation », University of South Florida.


Partial correlation

2008-11-10 (Tag:265754778257497397)

Partial correlation measures the degree of association between two random variables, with the effect of a set of controlling variables removed.

In this tutorial, we show how to use the PARTIAL CORRELATION component of Tanagra. We reproduce the example described online (see Reference). Thus, in addition to the presentation of the theoretical method, we can trace the detail of all the calculations.

Keywords: correlation, Pearson's correlation, rank correlation, Spearman's rho, partial correlation
Components: LINEAR CORRELATION, SPEARMAN’S RHO, PARTIAL CORRELATION
Tutorial: en_Tanagra_Partial_Correlation.pdf
Dataset: wechsler_adult_intelligence_scale.xls
Reference: S. Rathbun, A. Wiesner, « STAT 505 – Applied Multivariate Statistical Analysis », The Pennsylvania State University, Lesson 7 : Partial Correlations


Friedman Anova by Ranks

2008-11-10 (Tag:5518442633409766665)

In this tutorial, we show how to use the FRIEDMAN’S ANOVA BY RANKS component. We use this test when we want to check the null hypothesis that K related (matched) samples come from the same population. That matching can be achieved by studying the same group of individuals under each of the K conditions (repeated measure with various conditions).
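
For reference, the same test is available in base R through friedman.test; a minimal sketch on an invented repeated-measures layout:

# rows = subjects measured under each of the K matched conditions (columns)
scores <- matrix(c(7, 9, 8,
                   6, 5, 7,
                   9, 7, 6,
                   8, 5, 6), nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("subject", 1:4), paste0("cond", 1:3)))
friedman.test(scores)                        # chi-squared statistic and p-value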

Keywords: comparison of population, matched samples, analysis of variance, ranks, nonparametric, ANOVA
Components: Friedman’s ANOVA by Rank, One-way ANOVA, Kruskal-Wallis 1-way ANOVA
Tutorial: en_Tanagra_Friedman_Anova.pdf
Dataset: howell_book_friedman_anova_dataset.zip
Reference: Wikipedia, « Friedman test »


Manova - Multivariate analysis of variance

2008-11-10 (Tag:192798593959989620)

In this tutorial, we show how to use the ONE WAY MANOVA component (Multivariate Analysis of Variance): unlike classical ANOVA, there is more than one dependent variable.

We will see that a multivariate test and a combination of univariate tests can lead to different conclusions.

Keywords: multivariate analysis of variance, variance covariance matrix
Components: One-way ANOVA, One-Way MANOVA
Tutorial: en_Tanagra_Manova.pdf
Dataset: tomassone_p_29.xls
Reference:
S. Rathburn, A. Wiesner, "STAT 505 - Applied Multivariate Statistical Analysis", Penn State University, Departement of Statistics.


Normality test

2008-11-10 (Tag:2901142150823604471)

A goodness-of-fit test is used to decide if a sample comes from a population with a specific distribution. TANAGRA has a new component which uses several tests in order to check the normality assumption.

We use an artificial dataset in this tutorial; we generated the data from 3 distributions: uniform, normal and log-normal.

Keywords: Shapiro-Wilk's test, Lilliefors' test, Anderson-Darling's test, d’Agostino's test
Components: More Univariate cont stat, Normality Test
Tutorial: en_Tanagra_Normality_Test.pdf
Dataset: normality_test_simulation.xls
Reference:
Wikipedia - "Normality test"


Two-sample t-test

2008-11-10 (Tag:1187809649538460377)

In this tutorial, we show how to use TANAGRA to determine if two population means are equal. The variances may be assumed equal or unequal, i.e. we use a pooled or a separate variance estimation.
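
The same comparison can be sketched in base R on invented samples with t.test, which exposes the pooled / separate variance choice through var.equal:

set.seed(1)
x1 <- rnorm(30, mean = 10, sd = 2)
x2 <- rnorm(35, mean = 11, sd = 3)
t.test(x1, x2, var.equal = TRUE)             # pooled variance (classical Student test)
t.test(x1, x2, var.equal = FALSE)            # separate variances (Welch correction)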

Keywords: test for mean comparison, Student's t-test
Components: T-Test, T-Test Unequal Variance
Tutorial: en_Tanagra_Two_Sample_T_Test_For_Equal_Means.pdf
Dataset: auto83b.xls
Reference: NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.3.1 « Do two processes have the same mean ? ».


ANOVA and test for equality of variances

2008-11-10 (Tag:320130577535247639)

In this tutorial, we show how to use TANAGRA in an analysis of variance problem. We test also homogeneity of variances assumption on the same dataset.

We use the GEAR dataset (NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/). We consider that we have 10 machine tools which produce gears, with 10 batches of 10 observations. We want to test two assumptions: (1) is the average diameter of the gears the same for all the machines? (2) is the variability of the gear diameter the same for all the machines?
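
A minimal base-R sketch of both questions, on a simulated layout mimicking the 10 machines (the tutorial uses the real GEAR data and Tanagra components):

set.seed(1)
machine  <- factor(rep(1:10, each = 10))     # 10 machines, 10 gears each
diameter <- rnorm(100, mean = 0.99, sd = 0.005)
summary(aov(diameter ~ machine))             # (1) equality of the mean diameters
bartlett.test(diameter ~ machine)            # (2) homogeneity of the variances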

Keywords: analysis of variance, test for equality of variances, bartlett's test, levene's test, brown-forsythe's test
Components: One-way ANOVA, Bartlett’s test, Levene’s test, Brown-Forsythe test
Tutorial: Anova and Tests for Equality of Variances
Dataset: gear_data_from_nist.xls
Reference: NIST/SEMATECH, « e-Handbook of Statistical Methods », Chapitre 7 « Product and Process Comparisons ».


Nonparametric statistics

2008-11-09 (Tag:3071151768781228111)

In this tutorial we show how to implement some nonparametric statistics technique with Tanagra.

Various approaches are available: difference between populations for independent and related samples, nonparametric correlations, measures of association between nominal variables, etc.

Keywords: tests for independent samples, tests for related samples, analysis of variance, measures of association, wald and wolfowitz runs test, mann and whitney test, kruskal and wallis, spearman's rho, kendall's tau, sign test, wilcoxon signed rank test
Components: Mann-Whitney Comparison, Wald-Wolfowitz Runs Test, Kruskal-Wallis 1-way ANOVA, One-way ANOVA, Spearman’s rho, Kendall’s tau, Sign Test, Wilcoxon Signed Ranks Test, Paired T-Test
Tutorial: en_Tanagra_Nonparametric_Statistics.pdf
Dataset: nonpametric_statistics_dataset.xls
References:
S. Siegel, J. Castellan, « Nonparametric Statistics for the Behavioral Sciences », McGraw-Hill, 1988.
D. Sheskin, "Handbook of parametric and nonparametric statistical procedures", Chapman & Hall, 2007.


Measures of association for ordinal variables

2008-11-09 (Tag:6558740095479424579)

In this tutorial, we show how to use TANAGRA for measuring the association between ordinal variables.

All the measures that we present here rely on the concept of pairs. Although, from a theoretical point of view, the measures intended for continuous attributes such as the correlation coefficient are not appropriate in our context, we show in this tutorial that, in practice, they nevertheless give interesting results for studying the dependence between ordinal variables.

Keywords: concordant and discordant pairs, contingency table, goodman and kruskal's gamma, kendall's tau-c, sommers's d
Components: Goodman Kruskal Gamma, Kendall Tau-c, Sommers d, Linear Correlation
Tutorial: en_Tanagra_Measures_of_Association_Ordinal_Variables.pdf
Dataset: blood_pressure_ordinal_association.xls
Reference:
D. Garson, « Measures of association », in Statnotes : Topics in Multivariate Analysis.


Measures of association for nominal variables

2008-11-09 (Tag:4171249114620338136)

To measure the association between two continuous variables, we generally use the correlation coefficient. Its drawbacks and its qualities are well known.

When we want to characterize the association between nominal variables, the correlation coefficient is not suitable; we must use other indicators. The most widespread is certainly the chi-square test, which enables us to evaluate the absence of a relation. We see in this tutorial that other measures are available, and we show how to use them with TANAGRA.

Keywords: association between nominal variables, contingency table, chi-square test, tschuprow's t, cramer's v, asymmetrical association, pre measures (proportional reduction in error), goodman and kruskal's tau, theil's u, partial association, partial theil's u
Components: Contingency Chi-Square, Goodman-Kruskal Tau, Theil U, Partial Theil U, Discrete select examples
Tutorial: en_Tanagra_Measures_of_Association_Nominal_Variables.pdf
Dataset: fuel_consumption.xls
Reference:
D. Garson, « Measures of association », in Statnotes : Topics in Multivariate Analysis.


Correlation coefficient

2008-11-09 (Tag:6521324731063881363)

In this tutorial, we show how to compute the correlation coefficient and how to sort the results according to this indicator, a recurring task for the data miner.

We show how to quickly set up the computation of the linear correlation (1) between an endogenous variable and the exogenous variables, in order to detect relevant attributes; (2) between the exogenous variables, in order to detect collinearities.

Keywords: linear correlation coefficient, partial correlation
Components: Linear correlation, Residual scores
Tutorial: en_Tanagra_Linear_Correlation.pdf
Dataset: cars_acceleration.xls
References:
D. Garson, « Correlation », in Statnotes : Topics in Multivariate Analysis.
D. Garson, « Partial correlation », in Statnotes : Topics in Multivariate Analysis.


Descriptive statistics

2008-11-09 (Tag:6656161642256069224)

In this tutorial, we show how to compute univariate descriptive statistics for continuous and discrete variables.

We use mainly tabular descriptions and summary statistics.

Keywords: descriptive statistics
Components: View dataset, Univariate continuous stat, Univariate discrete stat, Group characterization
Tutorial: enBasics.pdf
Dataset: breast.txt
References:
Wikipedia - "Descriptive Statistics"
M. Chow, L. Strauss, "STAT 500 - Applied Statistics", Penn State University, Departement of Statistics.


PLS Regression - Number of factors

2008-11-08 (Tag:9088172375803477108)

In this tutorial, we show how to detect the right number of factors for a PLS regression using a resampling approach.

Standard criteria such as PRESS or Q2 are used. The upstream component (PLS Regression and derived components) can be automatically updated.

Keywords: pls regression, factor analysis
Components: PLS Factorial, PLS Selection
Tutorial: en_Tanagra_PLS_Selecting_Factors.pdf
Dataset: protien.txt


Clustering - The EM algorithm

2008-11-08 (Tag:7266904694489352823)

In the Gaussian mixture model-based clustering, each cluster is represented by a Gaussian distribution. The entire dataset is modeled by a mixture (a linear combination) of these distributions.

The EM (Expectation Maximization) algorithm is used in practice to find the “optimal” parameters of the distributions that maximize the likelihood function.

The number of clusters is a parameter of the algorithm. But we can also detect the “optimal” number of clusters by evaluating several values, i.e. testing 1 cluster, 2 clusters, etc. and choosing the best one (which maximizes the likelihood or another criterion such as AIC or BIC).
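
As an illustration outside Tanagra, a short sketch with the R mclust package (an assumption; it also fits Gaussian mixtures with EM and selects the number of clusters by BIC):

library(mclust)
X   <- iris[, 1:4]
fit <- Mclust(X, G = 1:6)                    # EM is run for 1 to 6 clusters, BIC picks one
summary(fit)                                 # chosen model and mixing proportions
head(fit$classification)                     # hard assignment of each observation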

Keywords: clustering, expectation maximization algorithm, gaussian mixture model
Components: EM-Clustering, K-Means, EM-Selection, scatterplot
Tutorial: en_Tanagra_EM_Clustering.pdf
Dataset: two_gaussians.xls
Reference:
Wikipédia (en) -- Expectation-maximization algorithm


Combining clustering and graphical approaches

2008-11-08 (Tag:894854674358031752)

In this tutorial, we show the complementarity between a clustering method (HAC - Hierarchical agglomerative clustering) and a factor analysis approach for multivariate data visualization (PCA - Principal Component Analysis).

The aim is to obtain a better understanding of the underlying concepts organizing the data.

Keywords: HAC, clustering, PCA, factor analysis, statistical graphics, visualizing multivariate data
Components: HAC, Group characterization, Principal component analysis, correlation scatterplot, scatterplot
Tutorial: en_Tanagra_hac_pca.pdf
Dataset: cars.xls


Clustering trees

2008-11-08 (Tag:7519117345347244041)

The aim of clustering is to build groups of individuals such that examples in the same group are similar and examples in different groups are dissimilar.

Top-down induction of clustering trees adapts the supervised decision/regression tree framework to clustering. The groups are built by recursive partitioning of the dataset; the internal nodes of the tree are classically split using input attributes. The resulting model, the clustering tree, describes the groups; the learning algorithm automatically selects the relevant attributes.

The clustering tree approach is not very well known; we show in this tutorial the interesting properties of this method. Our main references are the papers of Chavent (1998) and Blockeel et al. (1998).

Keywords: clustering algorithm, clustering tree, groups characterization
Components: Multiple Correspondence Analysis, CTP, Contingency Chi-Square, K-Means
Tutorial: en_Tanagra_Clustering_Tree.pdf
Dataset: zoo.xls
References:
M. Chavent (1998), « A monothetic clustering method », Pattern Recognition Letters, 19, 989—996.
H. Blockeel, L. De Raedt, J. Ramon (1998), « Top-Down Induction of Clustering Trees », ICML, 55—63.


Interactive Group Exploration

2008-11-08 (Tag:4506966100122792272)

Most of the time, the statistician must build groups of individuals and wants to characterize them. The main interest of this very simple approach is that the results are easy to read and understand.

In this tutorial, we show how to build groups with some (target) attributes and describe them with other (input) attributes. These components can be useful when we want to describe groups produced by a clustering algorithm, for instance.

Keywords: visual group exploration, group characterization
Components: Group characterization, Group Exploration
Tutorial: en_Tanagra_Group_Exploration.pdf
Dataset: autos.xls


Canonical discriminant analysis

2008-11-08 (Tag:6409693040279026084)

We show how to use the CANONICAL DISCRIMINANT ANALYSIS component.

One of the goals of this method is to produce new variables (“latent” variables) from a set of examples classified into predefined classes. These new variables optimize the separation between groups.

This approach can be seen as a sophisticated graphical method. We show mainly these graphical capabilities in this tutorial.

Keywords: canonical discriminant analysis, latent variables, visualization technique
Components: Canonical discriminant analysis, Scatterplot
Tutorial: en_Tanagra_Canonical_Discriminant_Analysis.pdf
Dataset: wine_quality.xls
Reference: D. Garson, "Statnotes: Topics in Multivariate Analysis - Discriminant Function Analysis".


Variable clustering (VARCLUS)

2008-11-08 (Tag:8286996505615799992)

Variable clustering can be viewed as a clustering of individuals applied to the transposed dataset. But instead of using the Euclidean distance to compute the similarities between examples, we use the correlation coefficient (or the squared correlation coefficient).

Variable clustering may be useful in several situations. It can be used to detect the main dimensions in the dataset; it may also be used in a feature selection process, in order to select the most relevant attributes for the subsequent analysis. The synthetic variable which represents a group, i.e. the first factor of a PCA (Principal Component Analysis) computed on the group, may also be used.

Keywords: variable clustering, latent variables
Components: VARHCA, VARKMeans, VARCLUS
Tutorial: en_Tanagra_VarClus.pdf
Dataset: crime_dataset_from_DASL.xls
References:
E. Vigneau et E. Qannari, « Clustering of variables around latent components », Simulation and Computation, 32(4), 1131-1150, 2003.
SAS OnlineDoc – Version 8, « The VARCLUS Procedure ».


K-Means algorithm on discrete attributes

2008-11-08 (Tag:4072775465719753179)

In this tutorial, we show how to perform a K-Means clustering. We validate the results by comparing the clusters with a predefined classification.

We address an additional problem in this tutorial: the descriptors are categorical, so we cannot directly launch the K-Means algorithm with the usual Euclidean distance. We propose a 2-step approach: (1) transform the original dataset using a correspondence analysis; (2) launch the K-Means on the first latent variables. We can then use the standard K-Means algorithm based on the Euclidean distance in this second step.
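
A short R sketch of this 2-step strategy (assuming the FactoMineR package and its bundled categorical tea dataset):

library(FactoMineR)
data(tea)
mca <- MCA(tea[, 1:18], ncp = 5, graph = FALSE)          # step 1: factor coordinates
km  <- kmeans(mca$ind$coord, centers = 3, nstart = 10)   # step 2: K-Means on the factors
table(km$cluster)                                        # size of the resulting groups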

Keywords: clustering, k-means, correspondence analysis, cluster description
Components: Multiple Correspondence Analysis, K-Means, Group characterization, Cross Tabulation
Tutorial: en_dr_clustering_validation_externe.pdf
Dataset: dr_vote.bdm
References:
Wikipédia, « K-means algorithm ».
Statsoft Inc., "Correspondence Analysis".


HAC and Hybrid Clustering

2008-11-07 (Tag:5101269674877212909)

HAC (Hierarchical Agglomerative Clustering) is a clustering method that produces "natural" groups of examples characterized by attributes. The clustering process is described by a tree, called a dendrogram, which shows the successive agglomerations, starting from one example per cluster until the whole dataset belongs to a single cluster.

The main advantage of HAC is that the user can guess the right partitioning by visualizing the tree; he usually prunes the tree between nodes showing an important variation. The main disadvantage is that it requires the computation of the distances between all pairs of examples, which becomes very time consuming when the dataset size increases.

TANAGRA implements the standard HAC, but it also implements a variation of HAC called HYBRID CLUSTERING. Since we often need only a small number of clusters, the construction of the lower part of the tree is delegated to a fast method.

There are two steps in the new algorithm:
• First, low-level clusters are built with a fast clustering method such as K-MEANS or SOM;
• Then, HAC starts from these clusters and builds the dendrogram.

Note that any clustering algorithm can provide the low-level clusters; users can also specify them. Finally, rather than the tree itself, it is the gap between the nodes which matters; these values are provided in a table.
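
A base-R sketch of this two-step (hybrid) strategy on a small numeric dataset, ignoring cluster weights for simplicity:

X   <- scale(iris[, 1:4])
km  <- kmeans(X, centers = 30, nstart = 5)               # step 1: many small pre-clusters
hac <- hclust(dist(km$centers), method = "ward.D2")      # step 2: HAC on the 30 centers
plot(hac)                                                # dendrogram with 30 leaves only
grp <- cutree(hac, k = 3)                                # prune to the desired partition
final_cluster <- grp[km$cluster]                         # final group of each observation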

Keywords: clustering, unsupervised learning, HAC, K-Means
Components: HAC , K-Means, Group characterization
Tutorial: enHAC_IRIS.pdf
Dataset: iris_hac.bdm
References:
L. Lebart, A. Morineau, M. Piron, " Statistique exploratoire multidimensionnelle ", Dunod, 2000 ; pp. 177 - 184.
Matteo Matteucci, "A tutorial on Clustering Algorithms - Hierarchical Clustering Algorithms".


Correspondence analysis

2008-11-07 (Tag:8598417245392297574)

Correspondence analysis is a visualization technique. It enables us to see the associations between the rows and the columns of a large contingency table. It belongs to the "factorial analysis" family of approaches. The method computes axes, which are latent variables that we interpret in order to understand the proximities between rows and/or columns.

TANAGRA is not really intended for contingency tables, so we use a trick: the rows are specified by a discrete attribute, and the columns correspond to several continuous attributes in our dataset. We cannot treat a contingency table with more than 255 rows.

This tutorial is suggested by the presentation of Lebart, Morineau and Piron, in their book, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000. Unfortunately, I don't think there is an English translation of this very good teaching book, which is really popular in France. However, I hope this tutorial is understandable without the book. If you read French, the description of the correspondence analysis is available at section 1.3 (pp. 67-107).

Keywords: correspondence analysis, contingency table, chi-square, factorial plane, contributions, squared cosines
Components: Correspondence analysis
Tutorial: en_Tanagra_Afc.pdf
Dataset: media_prof_afc.xls
Reference:
L. Lebart, A. Morineau, M. Piron, " Statistique exploratoire multidimensionnelle ", Dunod, 2000.
Statsoft Inc., "Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".


Regression trees

2008-11-06 (Tag:8932840394618401946)

The aim of regression analysis is to produce a model that can predict or explain the values of a continuous (endogenous) variable from the values of a list of continuous or discrete predictors (exogenous variables).

Multiple linear regression is certainly the best known method, but other approaches such as regression trees can perform this task. In that case, the predictive model takes the form of a tree: we use "if-then" conditions in order to predict the value of an individual from its description.
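
A minimal sketch with the R rpart package (a CART-like implementation) on a built-in dataset:

library(rpart)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",   # regression tree
             control = rpart.control(minsplit = 10, cp = 0.01))
print(fit)                                               # the "if-then" splits and leaf means
predict(fit, mtcars[1:3, ])                              # predictions for a few cars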

Keywords: regression trees, CART
Components: Regression tree
Tutorial: en_Tanagra_Regression_Tree.pdf
Dataset: housign.arff
Reference:
L. Breiman, J. Friedman, R. Olshen, C. Stone, « Classification and Regression Trees », Wadsworth International, 1984.
Statsoft Inc., "Classification and Regression Trees (C&RT)"


Forward selection for regression analysis

2008-11-06 (Tag:4163466382331447502)

In this tutorial, we show how to use the FORWARD ENTRY REGRESSION component: it performs a multiple linear regression with a forward variable selection based on partial correlation.

We use the CRIME_DATASET_FROM_DASL.XLS file from the DASL website. It contains various characteristics of 47 states of the USA. We want to explain the criminality from unemployment, education level, etc.

Keywords: linear multiple regression, variable selection, forward selection, stepwise, colinearity, partial correlation
Components: View Dataset, Multiple linear regression
Tutorial: en_Tanagra_Forward_Selection_Regression.pdf
Dataset: crime_dataset_from_DASL.xls
References:
Wikipedia - "Stepwise regression"


Multiple linear regression

2008-11-06 (Tag:3208359426565905213)

In this tutorial, we show how to perform a regression analysis with Tanagra.

Our dataset consists of car descriptions. We want to predict the "mpg" consumption from car characteristics such as weight, horsepower, etc.

Keywords: linear regression, endogenous variable, exogenous variables
Components: View Dataset, Multiple linear regression
Tutorial: enRegression.pdf
Dataset: autompg.bdm
References:
Wikipedia - "Linear regression"


PLS Regression

2008-11-06 (Tag:950432553303822031)

PLS (Partial Least Squares Regression) Regression can be viewed as a multivariate regression framework where we want to predict the values of several target variables (Y1, Y2, …) from the values of several input variables (X1, X2, …).

Roughly speaking, the algorithm is the following: “The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal”.

The dataset corresponds to 6 orange juices described by 16 physicochemical descriptors and evaluated by 96 judges [Source: Tenenhaus, M., Pagès, J., Ambroisine, L. and Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgements and product characteristics. Food Quality and Preference, 16, 4, pp. 315-325].

Keywords: pls regression, factorial analysis, multiple linear regression
Components: PLS Regression
Tutorial: en_Tanagra_PLS.pdf
Dataset: orange.bdm
References:
M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.
H. Abdi, "Partial Least Square Regression".
D. Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm


PLS Regression for Classification Task

2008-11-06 (Tag:4647321003713803836)

PLS (Partial Least Squares Regression) Regression can be viewed as a multivariate regression framework where we want to predict the values of several target variables (Y1, Y2, …) from the values of several input variables (X1, X2, …).

Roughly speaking, the algorithm is the following: “The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal”.

PLS regression was initially defined for the prediction of continuous target variables. But it can also be useful for supervised learning problems where we want to predict the value of a discrete attribute. In this tutorial, we propose a few variants of PLS regression adapted to the prediction of a discrete variable. The generic name "PLS-DA" (Partial Least Squares Discriminant Analysis) is often used in the literature.

Keywords: pls regression, discriminant analysis, supervised learning
Components: C-PLS, PLS-DA, PLS-LDA
Tutorial: en_Tanagra_PLS_DA.pdf
Dataset: breast-cancer-pls-da.xls
References:
S. Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of PLS-DA in multivariate image analysis », in J. Chemometrics, 20 : 221-229, 2006.
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm


Multinomial logistic regression

2008-11-06 (Tag:5898347942915049712)

In this tutorial, we show how to implement a multinomial logistic regression with TANAGRA.

Logistic regression is a technique for making predictions when the dependent variable is a dichotomy and the independent variables are continuous and/or discrete. The technique can be modified to handle a dependent variable with several (K > 2) levels.

When the response categories are unordered, we have the multinomial logistic regression. Roughly speaking, we compute a logit function for each of the (K - 1) categories relative to a reference category.
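
A short sketch with the R nnet package, whose multinom function fits this baseline-category logit model:

library(nnet)
fit <- multinom(Species ~ ., data = iris)    # Species has K = 3 unordered levels
summary(fit)                                 # (K - 1) coefficient sets vs. the reference level
head(predict(fit, type = "probs"))           # estimated class membership probabilities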

Keywords: multinomial logistic regression
Components: Supervised Learning, Multinomial Logistic Regression
Tutorial: en_Tanagra_Multinomial_Logistic_Regression.pdf
Dataset: brand_multinomial_logit_dataset.xls
References:
A. Slavkovic, « Multinomial Logistic Regression Models – Baseline-Category Logit Model », in « STAT 504 – Analysis of Discrete Data », Pennsylvania State University, 2007.


Feature selection for logistic regression

2008-11-06 (Tag:2297032423022769187)

In some circumstances, the goal of supervised learning is not to classify examples but rather to rank them, in order to point out the most interesting individuals. For instance, in a direct marketing campaign, we want to detect the customers who are most likely to respond to the solicitation. In this context, the confusion matrix is not really suitable for the evaluation of the predictive model. It is more valuable to use another tool, which shows the proportion of respondents as a function of the number of individuals reached: the "lift curve" ("gain chart").

In this tutorial, we use the binary logistic regression for the construction of the gain chart. We also show that variable selection is really useful when dealing with a large number of predictive variables.

We use a real/realistic dataset from a website (see Reference below). It contains 2158 examples and 200 predictive attributes. The target variable indicates whether or not a consumer responded to a direct mail campaign for a specific product.

Keywords: scoring, marketing campaign, logistic regression, feature selection, backward, forward, gain chart, lift curve
Components: Supervised learning, Binary logistic regression, Select examples, Scoring, Lift curve, Forward-logit, Backward-logit
Tutorial: en_Tanagra_Variable_Selection_Binary_Logistic_Regression.pdf
Dataset: dataset_scoring_bank.xls
References:
Statistical Society of Canada, "Data Mining - Case Studies - 2000"


STEPDISC - Feature selection for LDA

2008-11-06 (Tag:4693178585265269844)

In this tutorial, we use the stepwise discriminant analysis (STEPDISC) in order to determine relevant variables for a classification task.

STEPDISC (Stepwise Discriminant Analysis) is always associated with discriminant analysis because it relies on the same criterion, i.e. WILKS' LAMBDA. So it is often presented as a method especially intended for discriminant analysis. In fact, it can be useful for various linear models because they rely on the same representation bias (e.g. logistic regression, linear SVM, etc.). However, it is not really adapted to non-linear models such as nearest neighbors or the multilayer perceptron.

We implement the FORWARD and the BACKWARD strategies in TANAGRA. In the FORWARD approach, at each step, we determine the variable that contributes the most to the discrimination between the groups. We add this variable if its contribution is significant. The process stops when there is no more attribute to add to the model. In the BACKWARD approach, we begin with the complete model including all descriptors. We search for the least relevant variable. We remove this variable if its removal does not significantly deteriorate the discrimination between groups. The process stops when there is no more variable to remove.

Keywords: stepdisc, feature selection, linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Bootstrap, Stepdisc
Tutorial: en_Tanagra_Stepdisc.pdf
Dataset: sonar_for_stepdisc.xls
Reference: SAS/STAT User’s Guide, « The STEPDISC Procedure »


Random forest

2008-11-06 (Tag:8819485456460422219)

RANDOM FOREST is a combination of an ensemble method (BAGGING) and a particular decision tree algorithm ("Random Tree" in TANAGRA).

In this tutorial, we use the HEART dataset (UCI Machine Learning Repository). We aim to predict a heart disease from various descriptors such as the age of the patient, etc. We have already used this dataset in other tutorials (see http://data-mining-tutorials.blogspot.com/search?q=heart).
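
For comparison outside Tanagra, a minimal sketch with the R randomForest package (Breiman and Cutler's reference implementation):

library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
fit$confusion                                # out-of-bag confusion matrix
importance(fit)                              # variable importance, a by-product of the forest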

Keywords: random forest, ensemble methods, decision tree, cross-validation
Components: Bagging, Rnd Tree, Supervised Learning, Cross-validation, C4.5
Tutorial: en_Tanagra_Random_Forest.pdf
Dataset: dr_heart.bdm
Reference: L. Breiman, A. Cutler, « Random Forests ».


SVM using the LIBSVM library

2008-11-06 (Tag:698682448927872415)

The LIBSVM library contains various support vector algorithms for classification, regression, etc. The implementation is particularly efficient, especially regarding the processing time, as we will see below. Documentation is available on the authors' website.

We have compiled the C source code into a DLL to which TANAGRA connects. For now, only C-SVC, a multi-class support vector machine for classification, is available. We will add the other components in the future.
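
Outside Tanagra, the same LIBSVM solver is wrapped by the R e1071 package; a minimal C-classification sketch:

library(e1071)
fit  <- svm(Species ~ ., data = iris, type = "C-classification",
            kernel = "radial", cost = 1)
pred <- predict(fit, iris)
table(predicted = pred, actual = iris$Species)   # resubstitution confusion matrix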

Keywords: SVM, support vector machine, multi-class SVM
Components: Supervised Learning, C-SVC, Bootstrap
Tutorial: en_Tanagra_CSVC_LIBSVM.pdf
Dataset: Tanagra_Nipals.zip
Reference: C.C. Chang, C.J. Lin, « LIBSVM – A library for Support Vector Machines »


Association rule learning using APRIORI PT

2008-11-03 (Tag:8673863095527741335)

In this tutorial, we show how to build association rule on a large dataset using an external program.

Our implementation of A PRIORI is fast but needs a lot of memory, which limits its performance when we process a big dataset or generate numerous rules. I have discovered Christian BORGELT's work: he proposes a very powerful association rule generator, which can handle huge datasets and is very fast.

To run his implementation, we integrated a new mechanism into TANAGRA: launching and controlling an external program. At execution time, we create a temporary file which we transmit to his program (APRIORI.EXE). Then the rules are automatically retrieved and displayed.

Keywords: association rule, large dataset
Components: A priori PT
Tutorial: en_Tanagra_A_Priori_Prefix_Tree.pdf
Dataset: assoc_census.zip
Reference: C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"


"Supervised" Association Rules

2008-11-03 (Tag:5910104937104382359)

In many situations, we want to characterize a subset of the dataset. The GROUP CHARACTERIZATION component allows us to compare several subgroups: it computes and compares descriptive statistics on the subsets. But this component performs a univariate analysis: it uses the attributes individually and does not analyze the possible interactions between two or more variables.

In this tutorial, we show a new component, SPV ASSOC TREE, that allows us to characterize a subset of examples with conjunctions of variables. In fact, it is a "supervised-like" association rule algorithm where we define the consequent of the rules.

Keywords: association rules, clusters characterization, clustering
Components: Group Characterization, Spv Assoc Tree
Tutorial: en_Tanagra_Spv_Assoc_Tree.pdf
Dataset: vote.txt


Association rule learning from transaction dataset

2008-11-03 (Tag:6147402530420358950)

Association rules can be built from an attribute-value dataset, which is re-coded as a binary table. In certain cases, we have a transaction dataset, which is already a binary table; it is not necessary to re-code it. How do we handle this kind of dataset?

TANAGRA can handle only attribute-value datasets: the absence of an item in a transaction is coded as 0, and any other value is seen as a presence (1 if the file is correctly encoded).

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: enBinary_A_Priori.pdf
Dataset: transactions.bdm
References: P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006 ; chapitre 6, « Association analysis : Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"


Association rule learning from tabular dataset

2008-11-03 (Tag:6767376322185465496)

In this tutorial, we learn association rules from tabular dataset (examples x attributes).

We evaluate the influence of SUPPORT_MIN and CONFIANCE_MIN parameters on the number and the quality of computed rules.

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: enA_priori.pdf
Dataset: banque.bdm
References:
P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006 ; chapitre 6, « Association analysis : Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"


Decision lists and Decision Trees

2008-11-03 (Tag:1630274221585064962)

Decision lists were popular methods in the machine learning literature of the 90's. They produce a list of sorted production rules such as "IF condition_1 THEN conclusion_1 ELSE IF condition_2 THEN conclusion_2 ELSE IF…".

Decision lists and decision trees have a similar representation bias but not the same learning bias: decision lists use the "separate-and-conquer" principle instead of the "divide-and-conquer" principle. They can produce more specialized rules, but they can also lead to overfitting: setting the right learning parameters is very important for the decision list algorithm.

The algorithm that we have implemented in TANAGRA is suggested by CN2 (Clark & Niblett, ML-1989). We have introduced two main modifications: (1) we use a hill-climbing algorithm instead of a best-first search; (2) a new parameter, minimal support of a rule, can be adjusted to avoid non-significant rules.

Keywords: CN2, decision list, decision tree, CART, discretization
Components: Supervised Learning, MDLPC, Decision List, C-RT, Bootstrap
Tutorial: en_Tanagra_DL.pdf
Dataset: dr_heart.bdm
Reference: P. Clark, « CN2 – Rule induction from examples ».


SVM - Support vector machine

2008-11-03 (Tag:9104818427553290840)

SVM (Support Vector Machine) is a supervised learning algorithm which is well suited to high dimensional problems. We implement John Platt's SMO (sequential minimal optimization) algorithm in Tanagra.

In this tutorial, we show how to implement SVM with TANAGRA. We compare the classifier's performance with that of linear discriminant analysis (LDA). The error rate is measured using the bootstrap resampling approach.

Keywords: SVM, support vector machine, large margin classifier, linear discriminant analysis, kernel function
Components: Supervised Learning, SVM, Linear discriminant analysis, Bootstrap
Tutorial: en_Tanagra_SVM.pdf
Dataset: sonar.xls
References: Wikipedia – « Support vector machine »


NIPALS for dimensionality reduction

2008-11-03 (Tag:4579678469176555189)

In this tutorial, we use the NIPALS (Non-linear Iterative Partial Least Squares) algorithm for dimensionality reduction in a protein discrimination problem. The latent variables produced by NIPALS become the input variables of a nearest neighbor algorithm. The accuracy of the subsequent classifier is dramatically improved.

NIPALS is a possible implementation of the singular value decomposition (SVD); it enables us to compute the factors (latent variables) of a principal component analysis (PCA) without diagonalizing the correlation matrix. The computation time is reduced, especially when the dataset has many descriptors.
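
A base-R sketch of the same idea, where prcomp (an SVD-based PCA) stands in for NIPALS and feeds a nearest neighbor classifier:

library(class)                               # k-nearest-neighbor classifier
X <- scale(iris[, 1:4]); y <- iris$Species
Z <- prcomp(X)$x[, 1:2]                      # keep the first latent variables only
set.seed(1)
idx  <- sample(nrow(Z), 100)                 # crude train/test split for illustration
pred <- knn(train = Z[idx, ], test = Z[-idx, ], cl = y[idx], k = 5)
mean(pred != y[-idx])                        # test error rate in the reduced space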

Keywords: NIPALS, principal component analysis, PCA, K-NN, nearest neighbor, bootstrap
Components: Supervised Learning, NIPALS, K-NN, Bootstrap
Tutorial: en_Tanagra_NIPALS.pdf
Dataset: Tanagra_Nipals.zip


Classifier comparison - Cross validation

2008-11-03 (Tag:655168427941640483)

Comparing accuracy is a common way to select the most interesting classifier. To do this, we must produce a reliable error rate estimation.

Most of the time, we use a test set, a part of the dataset that is not used during the learning phase, to obtain an unbiased measure of the error rate. But this strategy is not feasible when we have a small dataset: reserving a part of the dataset for the classifier evaluation penalizes the learning process.

In the context of a small dataset, it is more judicious to use resampling approaches such as cross-validation. In this tutorial, we show how to implement cross-validation to compare two classifiers.
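
A base-R sketch of such a comparison (here linear discriminant analysis vs. 1-nearest-neighbor, chosen only for illustration), with a 10-fold cross-validation:

library(MASS); library(class)
X <- iris[, 1:4]; y <- iris$Species
set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(X)))
err <- sapply(1:10, function(k) {
  tr <- fold != k; te <- fold == k
  p_lda <- predict(lda(X[tr, ], y[tr]), X[te, ])$class
  p_knn <- knn(X[tr, ], X[te, ], y[tr], k = 1)
  c(lda = mean(p_lda != y[te]), knn1 = mean(p_knn != y[te]))
})
rowMeans(err)                                # cross-validated error rate of each classifier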

Keywords: cross validation, resampling method, classifier comparison, classifier assessment, nearest neighbor, k-nn, decision tree, id3
Components: Supervised Learning, K-NN, Cross-validation
Tutorial: en_dr_comparer_spv_learning.pdf
Dataset: dr_heart.bdm


Classifier comparison - Using a predefined test set

2008-11-03 (Tag:7769743527622851511)

In order to evaluate a supervised learning algorithm, we often split the dataset into a training set, which is used in the training process, and a test set, which is used to obtain an unbiased error rate estimation.

There are sampling components in TANAGRA which enable us to randomly subdivide the dataset, but in some circumstances, the user wants to use a predefined test set for the comparison. This is especially useful when we want to compare the performance of classifiers implemented in different software tools.

Keywords: supervised learning, classifier comparison, train and test set, error rate, confusion matrix, linear discriminant analysis, support vector machine, nearest neighbor classifier
Components: Select examples, Supervised learning, Linear discriminant analysis, SVM, K-NN
Tutorial: Classifier comparison
Dataset: sonar_with_test_set.xls


ROC Curve for classifier comparison

2008-11-03 (Tag:8444495832940981944)

ROC graphs enable us to compare two or more supervised learning algorithms. They have properties that make them especially useful for domains with a skewed class distribution and unequal classification error costs.

An ROC graph depicts the relative trade-off between the true positive rate and the false positive rate. It needs a continuous output from the classifier, an estimate of an instance's class membership probability. In fact, a "score", a numeric value that represents the degree to which an instance is a member of a class, is sufficient.

AUC (Area Under Curve) reduces the ROC performance to a single scalar value, which enables us to compare several classifiers: this area is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
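
This probabilistic reading of the AUC can be computed directly from the ranks of the scores (Mann-Whitney statistic); a base-R sketch on invented scores:

set.seed(1)
y     <- rbinom(200, 1, 0.3)                 # 1 = positive, 0 = negative
score <- runif(200) + 0.7 * y                # the classifier's continuous output
r  <- rank(score)
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)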

In this tutorial, we compare linear discriminant analysis (LDA) and support vector machine (SVM) on a heart-diseases detection problem.

Keywords: roc curve, roc graphs, auc, area under curve, classifier performance comparison, linear discriminant analysis, svm, support vector machine, scoring
Components: Sampling, 0_1_Binarize, Supervised Learning, Scoring, Roc curve, SVM, Linear discriminant analysis
Tutorial: en_Tanagra_Roc_Curve.pdf
Dataset: dr_heart.bdm
References:
T. Fawcett – « ROC Graphs: Notes and Practical Considerations for Researchers »
Wikipedia - "Receiver operating characteristic"


Lift Curve - CoIL Challenge 2000

2008-11-02 (Tag:8293379519146958489)

The detection of potential customers is an essential task for data miners. TANAGRA now has new tools to perform this kind of task.

We use the dataset of the CoIL Challenge 2000: targeting customers who will subscribe to a particular insurance policy.

There were 2 datasets: (1) a learning set with 5822 examples, where the target attribute is CLASS and there are 85 other descriptors, 43 of which are socio-demographic attributes derived from the zip code of the customer; (2) an unlabeled validation set of 4000 examples. We know that there are 238 positive examples in this dataset.

The challenge is to return to the organizers a file with 800 examples that contains the most positive customers.

Keywords: scoring, marketing targeting, discriminant analysis, lift curve, gain chart
Components: Supervised learning, Linear discriminant analysis, Select examples, Scoring, Lift curve
Tutorial: en_Tanagra_Scoring.pdf
Dataset: tcidata.zip


Apply a classifier on a new dataset (Deployment)

2008-11-02 (Tag:6029188503024580155)

How to apply a classifier to a new dataset?

This functionality goes beyond the scope of the TANAGRA framework, which is only intended to evaluate and compare data mining algorithms. But users often ask for it; in this tutorial we show how to proceed.

Data preparation is the essential step. Indeed, TANAGRA can handle only one data source, so it is in theory not possible to manipulate two datasets, and therefore to apply a classifier to a new dataset. The trick lies in the dataset preparation.
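
Outside Tanagra, the same deployment result can be sketched with scikit-learn (an illustration only; file and column names are assumptions, and a generic CART-like tree stands in for the C-RT component):

# Sketch (assumptions: pandas + scikit-learn, hypothetical file names)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

labeled = pd.read_csv("labeled.csv")          # training examples with the class attribute
new_data = pd.read_csv("unlabeled.csv")       # new examples, same descriptors, no class

X_cols = [c for c in labeled.columns if c != "class"]
cart = DecisionTreeClassifier(random_state=1).fit(labeled[X_cols], labeled["class"])

new_data["prediction"] = cart.predict(new_data[X_cols])
new_data.to_csv("deployment_output.csv", index=False)      # export the scored dataset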

Keywords: deployment, CART algorithm, dataset exportation
Components: Supervised learning, C-RT, Select examples, View dataset, Export dataset
Tutorial: en_Tanagra_Deployment.pdf
Dataset: tanagra_deployment_files.zip


Feature selection using MIFS algorithm

2008-11-02 (Tag:2577492907021780163)

Variable selection is a crucial step of the data mining process. In a supervised learning context, the detection of the relevant variables is paramount. Furthermore, according to the Occam's razor principle, we should always build the simplest possible model.

This tutorial describes the use of the MIFS component (Battiti, 1994) in a naive Bayes learning context. It is also interesting because the selection phase is preceded by a feature transformation step where continuous descriptors are discretized using the MDLPC algorithm (Fayyad and Irani, 1993).
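
MIFS itself is not available in the usual Python libraries; as a rough stand-in, the sketch below ranks the descriptors by their mutual information with the class (without MIFS's redundancy penalty) and, for simplicity, keeps the descriptors continuous with a Gaussian naive Bayes instead of discretizing them:

# Rough stand-in (assumptions: scikit-learn; no redundancy penalty, unlike true MIFS)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=1)       # mutual information of each descriptor with the class
keep = np.argsort(mi)[::-1][:2]                      # keep the 2 most informative descriptors
print("selected columns:", keep, "MI:", mi[keep])
print("CV accuracy: %.3f" % cross_val_score(GaussianNB(), X[:, keep], y, cv=10).mean())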

Keywords: feature selection, discretization, conditional independence model
Components: Supervised learning, Naive Bayes, MDLPC, MIFS filtering, Cross validation
Tutorial: enFeature_Selection_For_Naive_Bayes.pdf
Dataset: iris.bdm
References:
R. Battiti, "Using mutual information for selecting features in supervised neural net learning", IEEE Transactions on Neural Networks, 5(4), pp. 537-550, 1994.
U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning", in Proc. of IJCAI, pp. 1022-1027, 1993.


Discretization and Naive Bayes Classifier

2008-11-02 (Tag:8321387941855587850)

Building a naive Bayes classifier on continuous descriptors. The TANAGRA implementation of the naive Bayes classifier handles only discrete attributes, so we need to discretize the continuous descriptors before using them.

Because we are in a supervised learning context, we must use a supervised discretization algorithm such as Fayyad and Irani's state-of-the-art MDLPC algorithm.
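
scikit-learn has no MDLPC component; the sketch below uses an unsupervised equal-frequency discretization as a stand-in before a discrete naive Bayes classifier (library, dataset and parameters are assumptions, not the tutorial's settings):

# Sketch (assumption: KBinsDiscretizer is only an unsupervised stand-in for MDLPC)
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),   # discretize first
    CategoricalNB(min_categories=5)          # discrete naive Bayes; min_categories guards unseen bins
)
print("10-fold CV accuracy: %.3f" % cross_val_score(model, X, y, cv=10).mean())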

Keywords: contextual discretization, naive bayes classifier, cross-validation
Components: MDLPC, Supervised Learning, Naive bayes, Cross-validation
Tutorial: enSupervisedDiscretisation.pdf
Dataset: breast.bdm
References:
U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning", in Proc. of IJCAI, pp. 1022-1027, 1993.


Decision Tree - ID3 algorithm

2008-11-02 (Tag:6654100773330435184)

This tutorial shows how to implement the ID3 induction tree algorithm (supervised learning) on a dataset. We analyze the famous "breast cancer wisconsin" dataset.
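
A rough equivalent with scikit-learn follows (an assumption: its tree learner is CART-like, and the entropy criterion only approximates ID3; the bundled Wisconsin breast cancer dataset is a related but not identical version of the data used in the tutorial):

# Sketch (assumptions: scikit-learn, entropy criterion as an ID3-like approximation)
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))   # textual view of the tree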

Keywords: decision tree, classification tree, ID3, supervised learning
Components: Supervised Learning, ID3
Tutorial: enDecisionTree.pdf
Dataset: breast.bdm
References:
R. Quinlan, "Induction of Decision Trees", Machine Learning, 1, pp. 81-106, 1986.


Saving and loading a sub-diagram

2008-11-02 (Tag:5950824448353184181)

We can save a part of the stream diagram. The goal is to apply the same sequence of treatments to several similar datasets.

Keywords: save a diagram, copy paste, classifier comparison, supervised learning, cross validation, naive bayes classifier, feature selection, fcbf
Components: Supervised learning, Naive Bayes, Cross validation
Tutorial: en_Tanagra_Diagram_Save_Subdiagram.pdf
Dataset: congressvote_zoo.zip


Multilayer perceptron - Software comparison

2008-11-01 (Tag:8657259085780408491)

A Multilayer Perceptron for a classification task (neural network): comparison of TANAGRA, SIPINA and WEKA.

When we want to train a neural network, we have to follow these steps (see the sketch after this list):
· Import the dataset;
· Select the discrete target attribute and the continuous input attributes;
· Split the dataset into learning and test set;
· Choose and parameterize the learning algorithm;
· Execute the learning process;
· Evaluate the performance of the model on the test set.
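
These steps can be sketched with scikit-learn (the library, the toy data and the parameters are assumptions, not the tutorial's settings):

# Sketch of the listed steps (assumptions: scikit-learn, ionosphere-sized synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=351, n_features=34, random_state=1)     # steps 1-2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)  # step 3
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=1))  # step 4
mlp.fit(X_train, y_train)                                                     # step 5
print("test error rate: %.3f" % (1.0 - mlp.score(X_test, y_test)))            # step 6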

Keywords: neural network, multilayer perceptron, classifier assessment, sipina, weka
Components: Sampling, Supervised learning, Log-Reg TRIRLS, Linear Discriminant Analysis, C-SVC, Test
Tutorial: en_Tanagra_TSW_MLP.pdf
Dataset: ionosphere.arff
References:
Wikipedia - "Neural network"


Association rule learning - Software comparison

2008-11-01 (Tag:9203496538396619226)

Computing association rules with TANAGRA, ORANGE and WEKA.

We must respect the following steps if we want to compute association rules from a dataset:
• Import the dataset;
• Select the descriptors;
• Set the parameters of the association rule algorithm i.e. the minimal support and the minimal confidence;
• Execute the algorithm and visualize the rules.

Our three packages use attribute-value datasets. Each attribute-value pair becomes an item which can be used for generating rules.
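
A minimal Python sketch of this representation and of the support/confidence filtering, limited to one-item antecedents and consequents (real a priori implementations explore larger itemsets; the file name and the thresholds are assumptions):

# Sketch (assumptions: pandas only; one boolean column, i.e. one item, per attribute-value pair)
import pandas as pd
from itertools import combinations

df = pd.read_csv("vote.txt", sep="\t")
items = pd.get_dummies(df).astype(bool)
min_support, min_confidence = 0.3, 0.8

for a, b in combinations(items.columns, 2):
    support_ab = (items[a] & items[b]).mean()        # support of the pair {a, b}
    if support_ab >= min_support:
        for ant, cons in ((a, b), (b, a)):
            confidence = support_ab / items[ant].mean()
            if confidence >= min_confidence:
                print(f"{ant} -> {cons} (support={support_ab:.2f}, confidence={confidence:.2f})")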

Keywords: association rules, a priori
Components: A priori
Tutorial: en_Tanagra_TOW_Association_Rule.pdf
Dataset: vote.txt
References:
Wikipedia - "Association rule learning"


Interactive tree builder

2008-11-01 (Tag:8310003213688311060)

One of the main advantages of decision trees is the possibility, for the users, to interactively build the prediction model. In this tutorial, we show how, using SIPINA and ORANGE, to build and manually modify a decision tree; in particular, we select the split attribute and prune the tree.

SIPINA is one of my old projects. It was very useful but it had some limitations, which have been rectified in TANAGRA: it was intended only for supervised learning, and we could not define and save the sequences of treatments in a diagram. Nevertheless, I still use this version for my courses, in particular for its functionalities for the interactive construction of decision trees.

SIPINA uses a graphical representation of the tree; ORANGE uses a standard treeview component. We will see that they offer very similar functionalities and provide the same results.

Keywords: decision tree, classification tree, interactive analysis, classifier assessment, orange
Tutorial: en_Tanagra_Interactive_Tree_Builder.pdf
Dataset: iris_tree.txt
References:
Orange - "Interactive tree builder"


Performance evaluation using a predefined test set

2008-10-31 (Tag:4975749763535195854)

TANAGRA, ORANGE and WEKA: Comparison of learning algorithms using a predefined learning and test set.

Very often, we use the accuracy to compare the performances of the algorithms, and we then select the most accurate method. For the comparison to be rigorous, the same training and test sets must be used for each method.

We show in this tutorial how to implement this process in three data mining tools: TANAGRA, ORANGE and WEKA. We chose to compare the performances of an SVM (linear kernel), a logistic regression and a decision tree.

Keywords: supervised learning, decision tree, svm, logistic regression, classifier assessment, train and test set, orange, weka
Components: Select examples, Supervised learning, Binary logistic regression, C-RT, C-SVC, Test
Tutorial: en_Tanagra_TOW_Predefined_Test_Set.pdf
Dataset: breast_tow.zip


Learning classification tree - Software comparison

2008-10-31 (Tag:8122137893246251507)

Learning a decision tree with TANAGRA, ORANGE and WEKA. Estimation of the error rate using cross-validation.

When we build a decision tree from a dataset, we must follow these steps (not necessarily in this order):
• Import the dataset in the software;
• Select the class attribute (TARGET) and the descriptors (INPUT);
• Choose the induction algorithm; according to the implementation, we can obtain slightly different results;
• Run the learning process and view the decision tree;
• Use cross-validation in order to obtain an honest error rate estimate.

In this tutorial, we show how to perform these operations using various free tools.
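
For comparison, the cross-validated error rate of a tree can be sketched with an explicit fold loop in scikit-learn (file and column names are assumptions, and the descriptors are assumed numeric):

# Sketch (assumptions: pandas + scikit-learn, hypothetical file and class column)
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("heart.txt", sep="\t")
X, y = df.drop(columns="disease").values, df["disease"].values

errors = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=1).split(X, y):
    tree = DecisionTreeClassifier(random_state=1).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - tree.score(X[test_idx], y[test_idx]))     # error rate on the held-out fold
print("cross-validation error rate: %.3f" % (sum(errors) / len(errors)))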

Keywords: supervised learning, decision tree, classification tree, classifier assessment, performance evaluation, resampling method, cross-validation, orange, weka
Components: Supervised learning, C-RT, Cross validation
Tutorial: en_Tanagra_TOW_Decision_Tree.pdf
Dataset: heart.txt
References:
R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, pp. 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf) (fr)
Wikipedia, "Decision tree learning"


Computing ROC Curve

2008-10-31 (Tag:2295600375130376200)

TANAGRA, ORANGE and WEKA are free data mining tools. They represent the succession of treatments as a stream diagram or a knowledge flow. Sometimes there are small differences between these tools. Nevertheless, we show that, in spite of these differences, they often handle the same problems and give a very similar presentation of the results. In this tutorial, we build a ROC curve from a logistic regression.

Regardless of the software we use, even commercial software, we have to go through the following steps when we want to build a ROC curve:
• Import the dataset into the software;
• Compute descriptive statistics;
• Select target and input attributes;
• Select the “positive” value of the target attribute;
• Split the dataset into learning (e.g. 70%) and test set (30%);
• Choose the learning algorithm. Be careful, the tools can have different implementations and present slightly different results;
• Build the prediction model on the learning set and visualize the results;
• Build the ROC curve on the test set.

Depending on the software, the order can be different, but it is clear that we must, explicitly or not, go through these steps.
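
A minimal sketch of these steps with scikit-learn and matplotlib (the file name, the column names and the choice of the positive value are assumptions, and the descriptors are assumed numeric):

# Sketch (assumptions: pandas + scikit-learn + matplotlib, hypothetical file and columns)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

df = pd.read_csv("ds1.csv")
X, y = df.drop(columns="target"), (df["target"] == "positive").astype(int)   # select the "positive" value
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y, random_state=1)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)             # learn on the training set
fpr, tpr, _ = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])          # ROC on the test set
plt.plot(fpr, tpr, label="logistic regression (AUC = %.3f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "k--")                                              # diagonal: random classifier
plt.xlabel("false positive rate"); plt.ylabel("true positive rate"); plt.legend(); plt.show()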

Keywords: supervised learning, roc curve, classifier assessment, orange, weka, training set, learning set, test set
Components: MORE UNIVARIATE CONT STAT, SAMPLING, SUPERVISED LEARNING, LOG-REG TRIRLS, SCORING, ROC CURVE
Tutorial: en_Tanagra_Orange_Weka_Roc_curve.pdf
Dataset: ds1_10.zip
References:
R. Rakotomalala – « Courbe ROC » (fr)
T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers".


Handling a weka file format (*.arff)

2008-10-31 (Tag:4330825954443383019)

WEKA is a popular free data mining tool. It includes a large number of methods, mainly organized around supervised and unsupervised approaches.

WEKA has its own file format (*.arff), which is a text format with specific conventions for documenting the variables. Importing an ARFF file is easy, since we know how to handle a text file.

In this tutorial, we show how to import an ARFF file into TANAGRA. When there are missing values, very simple substitution strategies are used: the average for continuous variables; a new value is added for discrete variables.
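
The same substitution strategies can be sketched in Python with scipy and pandas (an illustration only, not what Tanagra does internally):

# Sketch (assumptions: scipy.io.arff + pandas; nominal values come back as bytes, missing ones as "?")
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("sick.arff")
df = pd.DataFrame(data)
for col in df.columns:
    if df[col].dtype == object:                       # nominal attribute
        df[col] = df[col].str.decode("utf-8")
        df[col] = df[col].replace("?", "MISSING")     # a new value for missing discrete values
    else:
        df[col] = df[col].fillna(df[col].mean())      # the average for continuous variables
print(df.head())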

Treatments can then start normally; a diagram is automatically created. If we save the diagram in the TDM format, the reference to the data file is saved with it. At the next loading of the diagram, the importation of the ARFF file is done automatically, without any specific manipulation.

Keywords: WEKA, ARFF file format, data file importation
Components: DATASET
Tutorial: en_Tanagra_Handle_WEKA_File.pdf
Dataset: sick.arff


Import tab-separated text file (csv)

2008-10-31 (Tag:5270006203822529310)

TANAGRA can handle the text file format where columns are delimited by tabulations and the first line contains the names of the variables. In this tutorial, we show how to: (1) prepare this type of file, from a spreadsheet for instance; (2) import the data by creating a new diagram in Tanagra.

Caution: the decimal separator of continuous variables depends on the configuration of your system. Also, Tanagra does not handle missing values.

The use of text files is preferable when processing a big dataset, i.e. in the order of several hundreds of thousands of rows. For moderate-sized files (several tens of thousands of individuals), it is better to use the Excel format.
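
For comparison, reading such a tab-delimited file in Python is a one-liner with pandas (an assumption shown only as a parallel; Tanagra imports the file directly):

# Sketch (assumptions: pandas, a hypothetical tab-delimited export of the weather data)
import pandas as pd

df = pd.read_csv("weather.txt", sep="\t", decimal=".")   # first line = variable names; mind the decimal separator
print(df.dtypes)    # check that continuous columns were not read as text
print(df.head())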

Keywords : data file importation, text file format
Components : Dataset
Tutorial : enImportDataset.pdf
Dataset : weather.xls


OOo Calc file handling using an add-in

2008-10-31 (Tag:1478321249268778096)

In the same way that it is possible to send data directly from EXCEL to TANAGRA using an add-in (see Excel Add-in), we have developed a similar add-in for the Open Office Calc spreadsheet.

The procedure is the same. When installing TANAGRA, the add-in is automatically copied to the disk. We then have to plug it into Open Office Calc following a specific procedure. The add-in has been available since TANAGRA version 1.4.12. The development and testing were done with Open Office version 2.1.0 (French).

This tutorial describes how to install the add-in into Open Office Calc. Two documents are available: (1) a classical PDF that describes the steps with screenshots; (2) an animated tutorial (Adobe Flash format, in English, but the menus are all in French).

Keywords : data file importation, open office calc spreadsheet, add-in
Components : View Dataset
Tutorial : en_Tanagra_OOoCalc_Addon.pdf
Animated tutorial : from_OOoCalc_To_Tanagra.htm
Dataset : breast.ods


Excel file format - Direct importation

2008-10-31 (Tag:7789385562433613551)

TANAGRA can import the XLS file format in a native mode, i.e. it is not necessary that the Excel spreadsheet software is installed on your computer. XLS versions 1997 to 2003 are handled (for Excel 2007, a confirmation is asked).

There are some minor restrictions on the workbook configuration: (1) the data must be located in the first worksheet of the workbook; (2) there must be no empty columns to the left of the data, nor blank rows above it; in other words, the data must start in the first row and first column; (3) TANAGRA considers that the first line contains the names of the variables.

In this tutorial, we show how to create a new diagram by directly importing an Excel file.
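
For comparison, the same layout conventions can be expressed with pandas (an illustrative assumption only; the workbook name is hypothetical):

# Sketch (assumptions: pandas; first worksheet, data starting at the first row/column)
import pandas as pd

df = pd.read_excel("adult_dataset.xls", sheet_name=0, header=0)   # first line = variable names
print(df.shape)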

Keywords : data file importation, excel spreadsheet
Components : Group characterization
Tutorial : en_Tanagra_Handle_Spreadsheet_File.pdf
Dataset : adult_dataset.zip


Excel file handling using an add-in

2008-10-31 (Tag:25559714519616417)

Starting from Tanagra version 1.4.11, a new EXCEL add-in is available. It makes it possible to define a data mining analysis directly from the EXCEL spreadsheet, without closing the EXCEL session.

The main asset of this functionality is that we can perform all the data preparation (data transformation, feature construction, etc.) and basic descriptive statistics (mean, standard deviation, pivot table, etc.) in the spreadsheet. Then we call TANAGRA, from EXCEL, only for the advanced machine learning techniques.

In this tutorial, we show how to install and use this EXCEL add-in.

Keywords : data file importation, excel format, excel add-in
Components : Univariate Discrete stat
Tutorial : en_Tanagra_Excel_AddIn.pdf
Dataset : add_in_dataset.zip