I am trying to fit a dataset using the standard NNLS (non-negative least squares) approach. Formally:\[\min_x \left\Vert Ax-b\right\Vert^2_2\quad\mathrm{s.t.}\quad x\ge 0\] This is a quadratic program and can be solved optimally. The solution I find fits the data reasonably well and is relatively sparse (which is good for me) but has an undesirable property: I find that it may assign high weights to features (columns of A) that are highly correlated. I would like my solution to be such that the correlation between support features (features that have high weights) will be minimal. Note that in my setting all entries of A and b are non-negative, and I can normalize the columns of A so that the norm of each column is 1.

I tried approaching the problem by directly minimizing possible formalizations of the quantity I want. Note that A^TA is the covariance matrix, so it could potentially be a good thing to minimize. But minimizing

\[||Ax-b||^2_2+\lambda x^TA^TAx\]

does not make any sense, because:

\[||Ax-b||^2+\lambda x^TA^TAx = x^TA^TAx-2b^TAx+b^Tb+\lambda x^TA^TAx = (1+\lambda)x^TA^TAx-2b^TAx+b^Tb\]

This means you can absorb lambda into b which is the same as solving the original problem. In other words, this is mathematically equivalent to multiplying b by some constant and solving for that, which will yield the original solution multiplied by some number.


\[||Ax-b||^2_2+\lambda x^T(A^TA-I)x\]

makes the optimization non-convex (because you get negative eigenvalues). With this approach, I can see that intuitively this is a bit like doing the opposite of ridge-regression/Tikhonov regularization.

I also tried L1 regularization

\[||Ax-b||^2_2+\lambda ||x||_1\] just for the heck of it, but it doesn't solve this problem.

Formally, I am looking to modify the objective function such that it penalizes solutions in which features that are "similar" to each other both get high weights. There are several ways to formalize this notion, for example one being that I would like the support columns to be as orthogonal as possible.

Does anyone have ideas on how else to approach this?

asked 10 Jan '13, 12:24

Bitwise's gravatar image

accept rate: 0%

edited 11 Jan '13, 13:03

Ok here's another quick thought: how about trying some form of PCR (principal component regression) with a nonnegativity constraint? The internal PCA will get rid of the collinearity. Then all you have to do is solve a NLLS problem on the latent variables.

(12 Jan '13, 09:45) Gilead ♦

You could minimize \[\left\Vert Ax-b\right\Vert ^{2}+\lambda x^{T}A^{T}Ax\] where \(\lambda>0\) is a penalty weight. This penalizes covariance; if you know that \(\sum_i x_i\) is some predefined constant, you can scale the second term to get correlation rather than covariance. You'll have to pick \(\lambda\) by trial-and-error. Happily, the objective remains convex.


answered 10 Jan '13, 17:10

Paul%20Rubin's gravatar image

Paul Rubin ♦♦
accept rate: 19%

edited 11 Jan '13, 08:46

How did you get the math tex to work? putting it between $ signs doesn't work for me... (e.g. $x$)

(11 Jan '13, 08:30) Bitwise

You have to use backslash-parenthesis for inline math and backslash-bracket for display math, and you have to double up all backslashes ("\\").

(11 Jan '13, 08:43) Paul Rubin ♦♦

I can't get it to work... do you mind editing one of the equations in my answer so I can see how it is done? Thanks!

(11 Jan '13, 09:00) Bitwise

I replaced your first equation.

(11 Jan '13, 11:17) Paul Rubin ♦♦

Thanks! I extended my question to answer what you proposed. Briefly, it is not helpful as it is equivalent (up to a constant) to solving the original problem without regularization.

(11 Jan '13, 13:05) Bitwise

"Equivalent" in what sense? It will yield a different value of \(x\) as the optimum. The new optimum (call it \(\tilde{x}\)) will be the solution produced by your original model for some \(\tilde{b} \neq b\), but is that a problem?

(11 Jan '13, 16:05) Paul Rubin ♦♦

@Paul equivalent in the sense that the solution to the new problem will be the solution to the old problem multiplied by a constant.

(16 Jan '13, 09:29) Bitwise

@Bitwise: Right; which means that the covariance changes but the correlation does not. That's why I brought up a normalization constraint. (You don't have one listed in the original question, but I wasn't sure if your original model was complete.) If \(x\) is normalized in the model (\(\left\Vert x \right\Vert = K\) or \(\sum_i x_i = K\) for some constant \(K\)), then the correlation changes.

(16 Jan '13, 14:45) Paul Rubin ♦♦
showing 5 of 8 show 3 more comments

Ok here's a random idea that crossed my mind, based on your comments and Paul's. Not sure if it's the most efficient way of approaching this problem, but just throwing it out there...

Step 1: Solve the original problem: $$ \min_x \Phi = \left\Vert Ax-b\right\Vert^2_2\quad\text{s.t. }\quad x\ge 0 $$ Record the optimal objective value, \(\Phi^{*}\).

Step 2: Solve the problem again, but with a different objective: $$ \begin{align} &\min_x x^T A^T A x\\ \text{s.t. } & x \geq 0\\ & \left\Vert Ax-b\right\Vert^2_2 \leq \Phi^{*} + s \end{align} $$ where \(s > 0\) is a slack.

The idea here to exploit the possible non-uniqueness of the original problem (do you know anything about its SSOCs?). In Step 2, we search within the set of optimal solutions for the original problem for one that minimizes the covariance.

It looks like this can be rewritten into the canonical form of a QCQP, which under some restrictions on the positive-definiteness of the resultant matrices, can be shown to be convex. I haven't worked with QCQPs before, and word on the street has it that they're not that easy to solve. Perhaps someone else with more expertise can comment.

Edit: wait a minute -- a convex QP (with a symmetric-PD weight matrix), which I assume yours is, will always give a unique minimizer. Does this mean you're willing to trade-off some goodness-of-fit to minimize covariance?


answered 11 Jan '13, 14:40

Gilead's gravatar image

Gilead ♦
accept rate: 15%

edited 11 Jan '13, 15:13

Thanks for your ideas. In my case, the optimal x is often unique, so the space that you propose searching in the second step is not useful in my case. However, I can modify it by allowing some slack. Still, this seems to me like it is a thresholded version of the minimization that Paul proposed, which turns out not to be useful. I think the problem might be that the values that should be minimized are really only the off-diagonal values of A^TA. This is why I thought of minimizing A^TA-I, since it zeros the diagonal. But this makes the problem non-convex.

(11 Jan '13, 15:00) Bitwise

I've added a slack to the 2nd problem, which essentially changes the original problem. The \(s\) slack variable, as you say, is a thresholding parameter that systematically trades off goodness-of-fit for covariance, but I don't believe it will yield the original solution multiplied by a number (it's a relaxation, not a penalty). Intuitively, it seems to me that since minimizing goodness of fit and covariance are competing objectives, you won't be able to achieve both at the same time, hence the need for backing-off on one of them to achieve the other.

(11 Jan '13, 15:22) Gilead ♦

Also, even if one were to succeed in minimizing the off-diagonals, that would mean paying a price in terms of the goodnesss-of-fit, given that there is only one configuration of \(x\) that will give you the optimal fit. So it seems to me that the key is to define some sort of balance between the two. Does this reasoning sound right to you?

(11 Jan '13, 15:25) Gilead ♦

@Gilead: you are correct about the tradeoff (which is what I was after with my penalty approach). Having solved your step2 model (with the slack term), I think @Bitwise could also play "connect the dots" and look at convex combinations of the two solutions to explore different levels of the tradeoff.

(11 Jan '13, 16:09) Paul Rubin ♦♦

@Paul That is an excellent idea. In fact, perhaps some sort of trade-off curve could be constructed using the penalty approach: \( (1- \lambda)(\left\Vert Ax-b\right\Vert_2^{2}+\lambda x^{T}A^{T}Ax\) by manually varying \(\lambda \in [0,1]\).

(11 Jan '13, 16:38) Gilead ♦

@Gilead this sort of tradeoff parameter is widely used in these types of regularizations such as those mentioned above. The standard way of choosing the "best" value is with cross-validation and in some cases you can actually solve the problem for all values of lambda in one shot (e.g. LARS/LASSO). The idea is that a sparse solution with a reasonable fit, for example, might generalize better (overfit less) than a non-sparse solution that fits better. The challenge for me is to define an objective function that will achieve the property I desire, i.e. not to choose highly correlated variables.

(11 Jan '13, 23:46) Bitwise
showing 5 of 6 show 1 more comments

Another possibility would be to stick to the original objective function and add constraints of the form \[|x_i - x_j| \ge h(\rho_{ij})\]where \(\rho_{ij}\) is the correlation between features \(i\) and \(j\) and \(h()\) is some positive, nondecreasing function (could just be multiplication by a constant). The absolute value constraints would require the introduction of binary variables to linearize them, or a reformulation using SOS1 constraints, but that shouldn't be too onerous unless the feature space is large.


answered 16 Jan '13, 14:52

Paul%20Rubin's gravatar image

Paul Rubin ♦♦
accept rate: 19%

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported



Asked: 10 Jan '13, 12:24

Seen: 2,432 times

Last updated: 16 Jan '13, 14:52

OR-Exchange! Your site for questions, answers, and announcements about operations research.