Folks, I have worked on building and deploying optimization models on a single machine, but have never worked on or taken coursework in parallel optimization. It seems quite interesting and I would like to learn more about it. May I request that the OR folks here suggest resources on where to start? I would like to know how to set up a parallel computing platform (given two or more readily available machines) and how to program a simple LP on an existing parallel computing setup. Also, is it possible to run parallel optimization models on Hadoop (either MapReduce or YARN)? asked 08 Jan '14, 12:21 Pavan 
Hi Pavan, If you're familiar with Python, its IPython interactive shell, and the other extras, you can set up a parallel computing infrastructure very easily. IPython's introduction to parallel computing will get you started in an hour or so (that's what it took for me). There is also Sage, which lets you easily define LPs and MILPs and also offers facilities for parallel computing. I've used Python for scientific computing for a few years now, and it feels to me like the perfect balance between fast execution time and quick development. Furthermore, it's multiplatform (in case you need to bring together machines with different OSes). The Sage approach is to use Python as a "glue" language to call highly efficient numerical libraries (linear algebra, LP, etc.), and it offers libraries for very high-level mathematics if that's handy for your needs. Enjoy! Eric answered 08 Jan '14, 15:45 EricP 2
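A minimal sketch of the pattern IPython's parallel tools give you, using the standard library's `concurrent.futures` as a stand-in for a cluster of IPython engines (the toy LP data and scenario values below are invented for illustration): evaluate many independent problem instances in parallel and combine the results.

```python
# Toy illustration of a parallel map, the core pattern behind
# IPython's parallel computing tools. concurrent.futures stands in
# for a set of IPython engines; with ipyparallel you would replace
# the executor with a view's map over remote engines.
from concurrent.futures import ThreadPoolExecutor

# A tiny "LP" solved by brute-force enumeration (integers, for
# simplicity): maximize 3x + 2y subject to x + y <= capacity,
# 0 <= x, y <= 10.
def solve_scenario(capacity):
    best = (0.0, (0, 0))
    for x in range(11):
        for y in range(11):
            if x + y <= capacity:
                obj = 3 * x + 2 * y
                if obj > best[0]:
                    best = (obj, (x, y))
    return capacity, best

# Evaluate several capacity scenarios in parallel, then keep the
# best objective value across all scenarios.
scenarios = [5, 8, 12, 15]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(solve_scenario, scenarios))

best_capacity, (best_obj, best_xy) = max(results, key=lambda r: r[1][0])
```

Each scenario is completely independent, which is exactly the "embarrassingly parallel" case that a beginner setup handles well; harder cases need communication between workers.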
An alternative: Julia http://julialang.org/, which also has IJulia, similar to IPython (it runs on the same infrastructure). It also has great support for optimization; see http://juliaopt.org for a summary.
(08 Jan '14, 23:07)
Iain Dunning

I'd agree it's very interesting, but I'm going to be a bit negative and caution that the types of optimization that OR people usually consider are not easily made parallel.
So, to summarize, "here be dragons". Exploiting structure in your problems is critical; there are no easy wins here for most OR problems. On a more positive note, see my colleague Miles Lubin's blog post "Distributed Numerical Optimization" to see what is possible if that structure is present, and how it can be fairly easy with the right tools (here, Julia). answered 08 Jan '14, 23:04 Iain Dunning Great answer. In particular, many ordinary developers fail to understand that MapReduce doesn't work on optimization problems. I've been thinking of writing an article on that for some time.
(09 Jan '14, 04:32)
Geoffrey De ... ♦
I believe you mean to say that optimization problems cannot be solved in MapReduce. MapReduce is a layer on top of a given distributed file system, be it HDFS or the MapR filesystem. We don't configure/design MapReduce to solve optimization problems (hence, "MapReduce doesn't work on optimization problems"). The idea is to design algorithms that solve optimization problems within MapReduce; whether that is feasible comes later.
(09 Jan '14, 06:35)
Pavan
1
Many optimization problems can be described as a sequence of MapReduce operations, because most optimization algorithms take multiple iterations. For example, the multiplicative weights algorithm for solving LPs is very parallel within each iteration and is easily mapped to a MapReduce paradigm, but you need multiple such iterations to converge to a solution. I would also say that optimization rarely needs "big data" in raw form; usually you would use MapReduce to extract the relevant data/summary from your raw big data, then run the optimization on that. I have done this in practice.
(09 Jan '14, 07:44)
Iain Dunning
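To make the "sequence of MapReduce operations" point concrete, here is a sketch using Jacobi iteration on a tiny linear system as a simpler stand-in for multiplicative weights (the system data is invented): each iteration is one map over the rows, whose results are reduced into the next iterate, and many such passes are needed before convergence.

```python
# Each Jacobi sweep is embarrassingly parallel: every row update
# depends only on the previous iterate, so one iteration maps over
# the rows and reduces the results into a new vector. Convergence
# still needs many such map/reduce passes: the point made above.
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [9.0, 7.0]

def jacobi_row(i, x):
    # Map step for row i: update x_i using the previous iterate x.
    s = sum(A[i][j] * x[j] for j in range(len(x)) if j != i)
    return (b[i] - s) / A[i][i]

x = [0.0, 0.0]
for _ in range(60):                                      # 60 passes
    mapped = [jacobi_row(i, x) for i in range(len(x))]   # map
    x = mapped                                           # reduce: new iterate
```

The exact solution of this system is x = (20/11, 19/11), and the iterates approach it geometrically; the per-iteration parallelism does nothing to reduce the number of sequential passes required.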
1
Indeed, there's theoretical support for the claim that LP is hard to parallelize: LP is P-complete.
(09 Jan '14, 20:30)
Austin Buchanan

Are you a CPLEX user? The latest version (12.6) has the ability to ship problems to remote processors (which must have CPLEX installed). Could be useful for, among other things, decomposition methods with multiple subproblems. I don't have docs handy, but it supports multiple communication protocols, including (I'm pretty sure) SSH. answered 08 Jan '14, 18:59 Paul Rubin ♦♦ 
In addition to what Paul says, CPLEX 12.6 includes a distributed parallel MIP solver. This solver builds on the capability Paul describes. Our distributed MIP runs out of the box; you don't need to do much setup beyond declaring the machines you'll be using. I concur with the various comments on use cases. LP seems hard to distribute; I've seen basic linear algebra being distributed, but it is far from providing a useful LP algorithm. Decomposition methods are a good way to start. Lastly, distributing the branch-and-bound tree, as we do in our distributed MIP solver, is also a good one. answered 09 Jan '14, 05:14 jfpuget 
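A hedged sketch of why decomposition is the natural entry point: when a problem separates into independent blocks (the block data below is invented), each block's subproblem can be solved on its own worker and the results simply combined.

```python
# Toy block-separable problem: minimize sum_b f_b(x_b), where each
# block b has its own small feasible set and the blocks share no
# variables. Each subproblem can therefore go to a different worker
# (here a thread pool; distributed solvers play this role across
# machines).
from concurrent.futures import ThreadPoolExecutor

# Each block is a list of (choice, cost) pairs; its subproblem is
# simply to pick the cheapest feasible choice.
blocks = [
    [("a", 4.0), ("b", 2.5), ("c", 3.0)],
    [("d", 1.0), ("e", 5.0)],
    [("f", 7.0), ("g", 6.5), ("h", 9.0)],
]

def solve_block(block):
    # Subproblem: cheapest choice within one block.
    return min(block, key=lambda pair: pair[1])

with ThreadPoolExecutor() as pool:
    picks = list(pool.map(solve_block, blocks))   # solve blocks in parallel

total_cost = sum(cost for _, cost in picks)        # combine step
```

Real decomposition methods (Benders, Dantzig-Wolfe, Lagrangian relaxation) iterate between a master problem and subproblems like these; the sketch only shows the fully separable, one-shot case.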
I have used the stochastic gradient descent method to solve a logistic regression problem (sigmoid function) using MapReduce (in Hadoop using Mahout, on a multi-node cluster with 72 nodes). Also, I understand that a MapReduce version of conjugate gradient has been implemented. Hence, convex optimization problems can be solved using MapReduce in Hadoop; my theory is that any algorithm that uses the gradient (and not the Hessian) could be implemented in the MapReduce paradigm. But even the Newton-Raphson method seems to have been implemented in a MapReduce framework; look up "MapReduce for Machine Learning on Multicore". Not to forget breadth-first search and shortest-path algorithms in MapReduce.
I have not personally solved the TSP using MapReduce, but the Apache Mahout developers have developed an evolutionary algorithm for solving the TSP, and yes, the procedure uses the MapReduce paradigm. You can also look at the HAMA package (csl.skku.edu/papers/CSTR2010330.pdf), which provides a matrix and graph computation framework for Hadoop. True, it uses BSP, not the MapReduce framework, and my question was about MapReduce/YARN. Also look up the pseudo-distributed mode of Hadoop; it does not use the MapReduce layer. And I was actually hoping someone would comment/answer in the context of YARN. answered 09 Jan '14, 06:29 Pavan You're MapReducing the computation (branches etc.). I was talking about MapReducing the problem data :) I'll clearly mark the distinction in the article to avoid confusion.
(09 Jan '14, 06:37)
Geoffrey De ... ♦
Not sure what you mean. I merely exported CSV and TXT files into HDFS, and after I execute my code, MapReduce does the rest. The beauty here is that the background daemons, like the JobTracker and TaskTracker, divide the data into chunks, process them on each slave node (map phase), and aggregate them (reduce phase) to produce the desired results. It's actually map; shuffle and sort/combine; reduce. Besides, can you give me an example where you "MapReduce the data"?
(09 Jan '14, 07:09)
Pavan
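Pavan's point about gradient-based methods fitting MapReduce can be sketched in miniature: a full-batch gradient step for logistic regression is a map over data partitions (each computing a partial gradient) followed by a reduce that sums them, repeated once per iteration. The tiny two-chunk dataset below is invented for illustration.

```python
import math

# Training data, already partitioned as it would be across HDFS
# blocks: (feature, label) pairs in two chunks.
chunks = [
    [(-2.0, 0), (-1.0, 0)],
    [(1.0, 1), (2.0, 1)],
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_gradient(chunk, w, b):
    # Map step: each chunk computes its own contribution to the
    # gradient of the logistic loss.
    gw = gb = 0.0
    for x, y in chunk:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb

w = b = 0.0
for _ in range(200):                                          # one pass per step
    parts = [partial_gradient(c, w, b) for c in chunks]       # map
    gw = sum(p[0] for p in parts)                             # reduce: sum
    gb = sum(p[1] for p in parts)
    w -= 0.5 * gw                                             # driver update
    b -= 0.5 * gb

accuracy = sum((sigmoid(w * x + b) > 0.5) == (y == 1)
               for c in chunks for x, y in c) / 4.0
```

Note the caveat: each iteration launches a fresh map/reduce pass, so the per-job overhead of Hadoop makes this pattern expensive for algorithms that need many iterations.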
MapReduce the problem data. For example, take a TSP dataset with 100k cities and divide it into two datasets of 50k cities each. In parallel, solve each one optimally. Then "reduce" those two solved datasets back into one dataset. Is it optimal? No. HTH :)
(10 Jan '14, 02:43)
Geoffrey De ... ♦
1
@Geoffrey: In all fairness, one could certainly build a fairly good heuristic solution this way; not sure about the "framework overhead", though [there are probably far more efficient ways to implement parallelized split/combine operators].
(10 Jan '14, 05:00)
fbahr ♦
@fbahr Agreed. But its quality goes down as the number of splits increases, so it would scale out badly. It would be interesting to see a graph of split count vs. solution quality on different use cases to get some metrics (instead of the educated guesses we're making now).
(10 Jan '14, 05:28)
Geoffrey De ... ♦
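Geoffrey's caveat is easy to demonstrate on a toy 1-D "tour" problem (the coordinates below are invented): visiting points on a line in sorted order is optimal, but splitting the data, solving each half optimally, and concatenating the sub-tours gives a strictly worse tour.

```python
def tour_length(points):
    # Length of a path that visits the points in the given order.
    return sum(abs(points[i + 1] - points[i]) for i in range(len(points) - 1))

cities = [0, 10, 1, 11, 2, 12]            # interleaved on purpose

# Optimal 1-D tour: just visit the points in sorted order.
optimal = tour_length(sorted(cities))

# Naive "MapReduce the data": split, solve each half optimally
# (sorted order is optimal within a half, too), then concatenate.
half1, half2 = cities[:3], cities[3:]
merged = sorted(half1) + sorted(half2)
merged_len = tour_length(merged)
```

Here the merged tour is more than twice as long as the optimal one, and the gap generally widens as the number of splits grows, which is the scaling problem discussed above.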
2
I finally got around to writing that article on MapReducing a planning problem. Enjoy
(03 Mar '14, 11:21)
Geoffrey De ... ♦
1
Looks good to me. I agree with you that splitting a problem will not yield optimal solutions.
(05 Mar '14, 09:54)
Pavan
