
Machine learning methods

Text mining can be combined with machine learning methodologies to identify ideas in text such as online product reviews. Christensen et al. [2017] discuss an approach using a machine learning classifier known as a support vector machine (SVM). An SVM is one of several machine learning methods that classify objects into groups, usually two groups for a binary classification. Other approaches are naive Bayes, decision trees, nearest neighbors, neural networks, and radial-basis support vector machines (as opposed to linear support vector machines). See Christensen et al. [2017, p. 27]. These are all in a general class of classification methodologies known as supervised learning methods; another class consists of unsupervised learning methods. Supervised learning requires a target variable, also called a dependent variable, that guides the estimation of unknown parameters from the independent variables. In regression analysis, the unknown parameters are the coefficients (and the variance) that are estimated from the independent variables to give the best estimate of the dependent, target variable. In this case, “best” is defined as the minimum sum of the squared distances between the dependent variable and the predicted dependent variable. Unsupervised learning does not have a target variable or set of unknown parameters. Here, the goal is to find patterns in the data, so the unsupervised learning methods are pattern identifiers, unlike the supervised learning methods, which are parameter-identifying methods. See Paczkowski [2018], Hastie et al. [2001], and James et al. [2013] for discussions of different learning methodologies.

In the approach outlined by Christensen et al. [2017], a set of product reviews is preprocessed using the tools described above: stop-words are eliminated, spelling is corrected, cases are fixed, and words are tokenized. A DTM is created which is used in an SVM. Defining the target variable is the key issue. In the Christensen et al. [2017] approach, a prior set of product reviews is combed for those that contain ideas and those that do not. Those that do contain ideas are coded as 1 while those that do not are coded as 0. This binary classification defines the target variable for the SVM. The target variable and the DTM are submitted to an SVM that produces classification rules. These rules can then be used against a larger untrained set of reviews. This larger set would, of course, have to be preprocessed to produce a new DTM. Those reviews classified as containing new product ideas would have to be examined to see which actually did contain new ideas, and those ideas would have to be rated for their usefulness. So there is still a human element involved that requires time and energy and thus a cost. Nonetheless, this cost would be smaller than the cost of having personnel read numerous product reviews looking for ideas. In short, this machine learning approach would minimize the cost of new idea identification, but not eliminate it. As machine learning and artificial intelligence (AI) develop further, newer, more capable approaches will be developed that will further aid idea identification for online product reviews.
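This pipeline can be sketched in Python. Everything below is hypothetical: the reviews and labels are invented, the stop-word list is abbreviated, and a simple perceptron stands in for the SVM (a real application would use an SVM implementation from a library such as scikit-learn):

```python
import re
from collections import Counter

# Toy labeled training reviews: 1 = contains a product idea, 0 = does not.
train = [
    ("wish it had a longer battery and a usb port", 1),
    ("add a strap so it can be carried hands free", 1),
    ("works fine arrived on time", 0),
    ("great color very happy with it", 0),
]

STOP = {"it", "a", "and", "so", "be", "on", "with", "had", "can", "the"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

# Build the vocabulary and the document-term matrix (DTM).
vocab = sorted({w for text, _ in train for w in tokenize(text)})

def vector(text):
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

X = [vector(text) for text, _ in train]
y = [label for _, label in train]

# Simple mistake-driven perceptron as a stand-in for a linear SVM.
w = [0.0] * len(vocab)
b = 0.0
for _ in range(20):  # a few passes over the training data
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        if pred != yi:  # update weights only on a misclassification
            step = 1 if yi == 1 else -1
            w = [wj + step * xj for wj, xj in zip(w, xi)]
            b += step

def classify(text):
    """Apply the learned classification rule to a new, unseen review."""
    return 1 if sum(wj * xj for wj, xj in zip(w, vector(text))) + b > 0 else 0

print(classify("please add a usb port"))  # 1 = flagged as containing an idea
```

Reviews flagged as 1 would still need a human read to confirm and rate the ideas, exactly as described above.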

Managing ideas and predictive analytics

Few businesses have only one new product in the pipeline at any one time, and certainly few have developed only one new product over the course of their existence. They typically have several products in the pipeline at any one point in time, and they also have a history of pushing new products into the market, some of which succeeded and others of which did not. This history could be used to determine which new product ideas should enter the current pipeline and even remain in the pipeline.

One way to accomplish this is to construct a data table of all past products with an indicator recording their market success or failure. The fields of the table contain the design parameters of the products: features, characteristics, technical specifications. Basically, everything that describes the products. The launch marketing parameters (e.g., price point) should also be included because the product may have failed in the market simply because the marketing was incorrect. Aside from the marketing parameters, the technical parameters can be used to model the success or failure of each product in the data table, and that model can then be used to predict how a new product idea will fare in the market. This is a predictive analytics approach to new product management. See Davenport and Spanyi [2016] for comments on this way of approaching new product development.

Predictive analytics is a broad area that is more an umbrella concept than an actual method or approach. It uses a variety of methods from several empirical disciplines such as statistics, data mining, and machine learning to say something about an unknown. Prediction and forecasting are often confused because both aim to accomplish this same objective. Forecasting is specifically concerned with future events. You forecast the sales of a new product “next year and the year after,” which are certainly unknown at the present time, but you predict whether or not a product will sell. Prediction is more general, with forecasting as a subset: all forecasts are predictions but not all predictions are forecasts. Predictive analytics is concerned with applying different tools to allow us to say something about an unknown, but not necessarily about a future time period. Forecasting per se for new products, in particular at launch time, is covered in Chapter 6 and forecast errors are covered in Chapter 7.

A major application of predictive analytics is to predict a customer behavior: to buy, sell, attend, make payments, default, and so on. Typically, a predictive model is specified and estimated using a subset of historical customer data and then the entire customer database is scored, meaning that each customer in the database is assigned a score which is usually interpreted as a probability. This score as a probability would be, say, the probability of buying a product given past purchase history. Customers can be sorted (in descending order) by their assigned score and those with the highest scores are marketed to first. The customers are marketed to in descending order of their scores until the marginal cost of marketing to one more customer exceeds the expected return from that customer.
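The score-and-sort logic can be sketched as follows; the customer scores, contact cost, and revenue figure are all invented for illustration:

```python
# Hypothetical scored customers: (customer id, predicted purchase probability).
scores = [("c1", 0.92), ("c2", 0.15), ("c3", 0.65), ("c4", 0.40), ("c5", 0.81)]

COST_PER_CONTACT = 5.0   # marginal cost of marketing to one more customer
REVENUE_IF_BUYS = 20.0   # return if the customer purchases

# Sort in descending order of score, then market down the list while the
# expected return (probability x revenue) still exceeds the marginal cost.
ranked = sorted(scores, key=lambda cs: cs[1], reverse=True)
targets = [cid for cid, p in ranked if p * REVENUE_IF_BUYS > COST_PER_CONTACT]
print(targets)
```

Here the expected-return cutoff works out to a score of 0.25, so the lowest-scoring customer is never contacted.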

Let the success/failure of past products be dummy or indicator coded such as

Y = 1 if the product succeeded in the market and Y = 0 otherwise.

Let X be a vector of attributes for the products in a product data table. Then a scoring model is the logit model

Pr(Y = 1) = e^Z / (1 + e^Z)

where Z = Xβ and β is a vector of unknown parameters to be estimated.
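As a numerical sketch of this scoring model in Python (the attribute values and coefficients below are hypothetical, not estimates from any real product table):

```python
import math

def logit_score(x, beta):
    """Return Pr(Y = 1) = e^Z / (1 + e^Z) with Z = x . beta."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical product: intercept term, price point, number of features.
x = [1.0, 2.5, 4.0]
beta = [-1.0, -0.4, 0.6]   # hypothetical estimated coefficients
p = logit_score(x, beta)   # Z = -1.0 - 1.0 + 2.4 = 0.4
print(round(p, 3))
```

A score near 1 suggests the product idea resembles past successes; a score near 0 suggests it resembles past failures.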

Prior to estimating this model, the data table should be divided into two parts: a training data table and a testing data table. The former is used for estimation while the latter is used for testing the predictive ability of the estimated model. Typical proportions are 3/4 and 2/3 training data. Whichever proportion is used, it is important that the training data table have more cases than the testing data table. It is also important that the testing data never be used for estimation. See Paczkowski [2016] for some discussion about training and testing data sets. Also see James et al. [2013] and Hastie et al. [2001] for advanced discussions. Also see Chapter 6 for a discussion of splitting time series and Chapter 7 for other comments.
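A minimal sketch of the split, in plain Python with an invented data table; a fixed random seed keeps the split reproducible:

```python
import random

# Hypothetical product data table: 20 rows, each a (features, outcome) pair.
rows = [([i, i % 3], i % 2) for i in range(20)]

random.seed(42)          # fixed seed so the split is reproducible
shuffled = rows[:]
random.shuffle(shuffled)

split = int(0.75 * len(shuffled))   # 3/4 training, 1/4 testing
train_rows = shuffled[:split]
test_rows = shuffled[split:]
print(len(train_rows), len(test_rows))
```

The testing rows are set aside and never touched during estimation, exactly as the text prescribes.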

This model is reviewed in some detail in Chapter 5.

There are two issues with this approach: the construction of the product data table and the classification of a product as successful or unsuccessful. First, a business must not only have a sufficient number of past products that it could use in modeling, but it must also have the history on each one. A list of features, characteristics, and technical specifications must be available. Such a historical list might be difficult to compile, let alone maintain. Second, someone must decide on the criteria for success and failure of a product. Clearly, if nothing was sold, then the product failed, although zero sales are most unlikely for any product. A low number of units sold might constitute a product failure, but what is “low”? The most likely candidate for a criterion is the marketing objective originally established for the product. Was that objective met?

Davenport and Spanyi [2016] note that Netflix is an example of a company that uses this approach. See the various articles at area/analytics regarding the use of predictive analytics at Netflix.


There are many software products available for text analysis. I will classify them into free and commercial. The free software includes Python and R while the commercial includes SAS and JMP.

Both Python and R are powerful, well established, and definitely known for their text processing capabilities. R is known for its steep learning curve, which tends to make it less appealing to those without heavy training in statistics and who do their data analysis in spreadsheets. Not only is the learning curve steep, but everything that has to be done requires some degree of programming. This is a drawback because programming is a specialty skill that requires time to master. Typical new product development managers, regardless of their level, would probably not venture near R if they have no intention of doing other, more sophisticated forms of data analysis. The time cost of coming up to speed with R would not be worth the effort.

Python in many regards is not much better than R. The strong point in favor of Python is its almost cult-like focus on easily written and interpretable syntax. There is a Pythonic way of writing programming code that almost all who write in Python adhere to. Nonetheless, some programming is still required and this factor may be a hindrance to those who only want to get an answer with minimal hassle. The Python package Pandas has a simpler syntax, making it easier to do data analysis, including data visualization, but without any loss of power or capabilities. The book by McKinney [2018] provides a great introduction to Python and Pandas.20 An excellent book on using Python for text analysis is Sarkar [2016].

SAS is the granddaddy of all statistical software products. It is probably safe to say that if some capability is not in SAS, then it is not worth using. SAS, in keeping with being at the forefront of statistical software, has the SAS Text Miner. This strives to make text analysis easier for all users. The problem is that SAS Text Miner, like the SAS software itself, is expensive.

JMP is a product of the JMP division of the SAS corporation. It too has a high price tag, but it also has a more intuitive interface targeted at allowing technical and nontechnical users to more easily interact with their data, whether text or numeric. This interface is graphical, meaning that it is dynamic and simple to use and manipulate. The text analysis component of JMP was used in this chapter. See Paczkowski [2016] for using JMP for analyzing market data.


In this chapter, I focused on a critical part of new product development - the development of the idea for the product. Any discussion of new product development processes is meaningless unless there is a new product. This should be obvious. Several approaches to product ideation were presented including text analysis using the vast amount of text data now collected through various means. The next chapter will assume that a product idea has been identified and now a design for it must be developed.


This Appendix outlines some mathematical concepts that are useful to know for text analysis methods such as Latent Semantic Analysis. It is not critical that you know or master this Appendix. It is completely optional. If you do want to pursue more, useful references are Lay [2012], Strang [2006], and Jobson [1992].

Matrix decomposition

An important result in matrix algebra is the decomposition of a matrix into three parts. In particular, if A is an n × n symmetric matrix, then it can be written or decomposed into

A = VΛV^T

where V is an orthogonal matrix and Λ is a diagonal matrix with diagonal elements λ1, λ2, ..., λn. These diagonal elements are the eigenvalues of A and the columns of V are the corresponding eigenvectors. Since V is orthogonal, VV^T = V^T V = I. This is the spectral decomposition of A where the “spectrum” is the set of eigenvalues. See Jobson [1992].

If X is n × p, then X^T X is (p × n)(n × p) = p × p and is symmetric. Similarly, XX^T is n × n and is also symmetric. Since both are symmetric, each can be decomposed as above: X^T X = VΛV^T, where V holds the eigenvectors of X^T X and Λ is the diagonal matrix of its eigenvalues; XX^T has an analogous decomposition with its own eigenvectors and the same nonzero eigenvalues.
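A quick numerical check of these claims, in plain Python with a small illustrative X:

```python
# A small check that X^T X and X X^T are both symmetric.
X = [[1.0, 2.0],
     [0.0, 1.0],
     [3.0, 1.0]]          # n = 3, p = 2

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

XtX = matmul(transpose(X), X)   # p x p
XXt = matmul(X, transpose(X))   # n x n

def is_symmetric(M):
    return all(abs(M[i][j] - M[j][i]) < 1e-12
               for i in range(len(M)) for j in range(len(M)))

print(is_symmetric(XtX), is_symmetric(XXt))
```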

Singular value decomposition (SVD)

An important mathematical operation that plays a pivotal role in a number of statistical methodologies is the Singular Value Decomposition (SVD). It is important for:

  1. correspondence analysis;
  2. principal components analysis;
  3. regression analysis; and
  4. text analysis

to mention a few. SVD relies on three concepts:

  1. eigenvalues;
  2. eigenvectors; and
  3. similarity of two matrices,

although the three are inseparably related. I assume that basic matrix operations and concepts are known. If a background is needed or has to be refreshed, see Lay [2012] and Strang [2006] for good introductions and developments. Some of the material in this section draws heavily from Lay [2012]. Eigenvalues and eigenvectors, however, may be less familiar, so the following discussion of these two related concepts will help lead into the SVD discussion.

If Ax = λx, λ a scalar, then A, an n × n square matrix, transforms the vector x into a multiple of itself, the multiplier being λ. Basically, x is stretched, shrunk, or reversed by A, but its direction is not changed. As a typical example21, let


so the vector x is expanded by a factor of 4, but in the negative direction. The stretching/shrinking factor, λ, is called an eigenvalue.22 Other names are characteristic value and characteristic root. There may be multiple eigenvalues, duplicated eigenvalues (sometimes referred to as multiplicities), or none at all. The vector x that is stretched or shrunk is called an eigenvector. There may be multiple eigenvectors, but each eigenvector corresponds to one eigenvalue. The eigenvector must be nonzero, but the corresponding eigenvalue may be zero.
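As a concrete check of Ax = λx, the matrix below (an illustrative example, not necessarily the one in the original text) has eigenvalues 7 and −4, and x is an eigenvector for −4:

```python
# A has characteristic polynomial t^2 - 3t - 28 = (t - 7)(t + 4),
# so its eigenvalues are 7 and -4.
A = [[1, 6],
     [5, 2]]
x = [6, -5]          # an eigenvector of A for the eigenvalue -4

Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
print(Ax)            # [-24, 20] = -4 * [6, -5]
```

So A stretches x by a factor of 4 and reverses its direction, i.e. λ = −4.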

There are many ways to decompose or translate a matrix into constituent parts. The most popular from an analytical perspective is the Singular Value Decomposition (SVD). This decomposition method can be used with large matrices and it is not restricted to square matrices.

The SVD method decomposes a matrix A of size n × p into three parts which, when multiplied together, return the original matrix A:

A = UΣV^T

where:
  • U is an n × n orthogonal matrix such that U^T U = UU^T = I where I is the identity matrix and is n × n;
  • Σ is an n × r diagonal matrix such that Σ = diagonal(σ1, σ2, ..., σr); and
  • V^T is an r × p orthogonal matrix such that V^T V = VV^T = I where I is p × p.

Two vectors are orthogonal if their inner or dot product is zero. The inner product of two vectors, a and b, is a · b = Σ_{i=1}^{n} a_i × b_i, and the two vectors are orthogonal if a · b = 0. This is the same as saying they are perpendicular to each other, or that the cosine of the angle between the two vectors is zero.
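These definitions are easy to verify numerically; the two vectors below are illustrative:

```python
import math

def dot(a, b):
    """Inner (dot) product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

a = [1.0, 2.0]
b = [-2.0, 1.0]

# Orthogonal: the inner product is zero, and so is the cosine of the angle.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(dot(a, b), cosine)
```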

The main points to know about SVD are:

  1. The diagonal elements of Σ are non-negative. They can be arranged in any order (as long as the corresponding columns of U and V are in the appropriate order) so by convention they are arranged in descending order.
  2. The diagonal values of Σ are called the singular values and are the square roots of the eigenvalues of AA^T (equivalently, of A^T A).
  3. The columns of U are the eigenvectors associated with the eigenvalues of AA^T. Similarly, the columns of V are the eigenvectors of A^T A.
  4. The columns of U are called the left singular vectors and the columns of V are the right singular vectors.

As an example of the method, consider a 3 × 2 matrix A and its SVD.

Notice that the left and right singular matrices are each orthogonal and that the product of the three matrices is the original matrix, A (within rounding).
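That multiplication property can be checked numerically. The 3 × 2 matrix below is illustrative (not the one from the text), and the SVD is built from first principles using the eigenvalues of A^T A, which for a 2 × 2 symmetric matrix come from the quadratic formula:

```python
import math

# A small illustrative 3 x 2 matrix.
A = [[1.0, 1.0],
     [0.0, 1.0],
     [1.0, 0.0]]

# A^T A is 2 x 2 and symmetric; its eigenvalues come from the quadratic
# formula applied to its characteristic polynomial.
a = sum(row[0] * row[0] for row in A)   # (A^T A)[0][0]
b = sum(row[0] * row[1] for row in A)   # (A^T A)[0][1]
d = sum(row[1] * row[1] for row in A)   # (A^T A)[1][1]

disc = math.sqrt((a - d) ** 2 + 4 * b * b)
eigs = [(a + d + disc) / 2, (a + d - disc) / 2]   # descending order
sigmas = [math.sqrt(e) for e in eigs]             # singular values

# Right singular vectors: normalized eigenvectors of A^T A.
def right_vec(lam):
    v = [b, lam - a]
    norm = math.hypot(*v)
    return [vi / norm for vi in v]

V = [right_vec(lam) for lam in eigs]

# Left singular vectors: u_k = A v_k / sigma_k.
U = [[sum(A[i][j] * V[k][j] for j in range(2)) / sigmas[k]
      for i in range(3)] for k in range(2)]

# Reconstruct A as sum_k sigma_k u_k v_k^T and compare entry by entry.
recon = [[sum(sigmas[k] * U[k][i] * V[k][j] for k in range(2))
          for j in range(2)] for i in range(3)]
ok = all(abs(recon[i][j] - A[i][j]) < 1e-9 for i in range(3) for j in range(2))
print(sigmas, ok)
```

In practice a library routine such as NumPy's `numpy.linalg.svd` would be used; the hand computation here just makes the definitions concrete.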

Not all the singular values in the diagonal matrix Σ are large. Since they are non-negative, those that are small or zero are often set to 0 and/or dropped. If dropped, then the corresponding singular vectors in U and V must be dropped as well. A truncated SVD results in an approximation.

If k < r is the number of positive singular values retained, then the truncated SVD is

A ≈ U_k Σ_k V_k^T

where:

  • U_k is n × k;
  • Σ_k is k × k; and
  • V_k^T is k × p.

Most statistical analyses use a truncated or restricted SVD.

There is one final observation about the SVD. Let X be an n × p matrix. Since X = UΣV^T, you can write

X^T X = (UΣV^T)^T (UΣV^T) = VΣ^T U^T UΣV^T = VΣ^T ΣV^T
TABLE 2.4 Comparison of Spectral Decomposition and SVD.

  Spectral decomposition: A = VΛV^T. Λ has the eigenvalues of A on its diagonal. V has the eigenvectors of A and is orthogonal.
  SVD: A = UΣV^T. Σ has the singular values of A on its diagonal. U and V have the singular vectors of A and are each orthogonal.

so that the right singular vectors of X are the eigenvectors of X^T X from the spectral decomposition. Similarly,

XX^T = (UΣV^T)(UΣV^T)^T = UΣΣ^T U^T

so that the left singular vectors of X are the eigenvectors of XX^T from the spectral decomposition. Also, the eigenvalues of X^T X are the squares of the singular values of X based on the spectral decomposition. This is the basis for the correspondence analysis report I will describe in Chapter 3 that shows the singular values and the eigenvalues as their squares.

Spectral and singular value decompositions

The spectral decomposition has a diagonal matrix as its middle component with the eigenvalues on the diagonal. The SVD has a middle diagonal matrix with the singular values on the diagonal. For a symmetric matrix A, the spectral decomposition and the SVD are equivalent, so that

A = VΛV^T = UΣV^T with U = V and Σ = Λ

so the singular vectors are the eigenvectors and the singular values are the eigenvalues. See Jobson [1992]. Table 2.4 compares both methods of matrix decomposition.


  • 1 See Couger [1995, p. 419].
  • 2 See Isaksen [1998].
  • 3 Obviously, both floors and wings can be added at the same time. This complication is an unnecessary one for this example. See sql-and-nosql/ for a comment about buildings.
  • 4 “NoSQL” is interpreted as “No SQL” or “Not SQL” to distinguish it from SQL.
  • 5 The Wikipedia entry for stop-words (, last accessed April 27, 2018) notes that the idea for stop-words was introduced in 1959.
  • 6 See for a list of stop-words. Last accessed April 27, 2018.
  • 7 The dot notation indicates summation over all values of j, or all documents in C in this case. That is, f_{i·} = Σ_j f_{ij}.
  • 8 Another reference is the Term Document Matrix, which is the transpose of the DTM. I prefer the DTM.
  • 9 A “1” is added to the argument of the log function to avoid taking the log of 0.
  • 10 See the blog article by Jim Cox at SAS for a discussion: sascom/2015/10/22/topical-advice-about-topics-comparing-two-topic-generation- methods/.
  • 11 See the Appendix to this chapter for a discussion of the SVD. Sometimes the transpose of the DTM is used so the SVD is applied to a term document matrix or TDM.
  • 12 See https://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Clustering_property. Last accessed July 2, 2019.
  • 13 If a TDM is used, then the left set is used.
  • 14 In an actual problem, there would be far more than 22 reviews.
  • 15 See sentiment-analysis for some discussion. Last accessed June 5, 2019.

  • 16 This list was compiled from Last accessed on October 9, 2018.
  • 17 See Paczkowski [2018] for some comments.
  • 18 There is nothing special about 0.05. It is more a convention than anything else.
  • 19 See the Wikipedia article on ethnography at Ethnography. Last accessed July 31, 2019.
  • 20 Wes McKinney is the creator of Pandas.
  • 21 Source: Lay [2012].
  • 22 The prefix “eigen” is from the German word “eigen” meaning “proper” or “characteristic.”