I have a problem, and it seems like it should be something that someone has studied before. I have a sorted list of N elements, and I want to divide them into K groups by choosing K-1 split points between them. There may be elements with the same value, and we want items with the same value to end up in the same group. Find K groups as close in size to round(N/K) as possible. For example, divide these 32 elements into 4 groups of size 8:
One solution would be these 3 break points:
total error = 4

Does this look familiar to anyone? I'd like an approximation algorithm if possible. Thanks, Craig Schmidt
I think a dynamic program would work (and it runs in polynomial time). For each break position x, define the value v(x,k) = best objective if the k-th break is at position x. Then you calculate v(x,k) by running through all x' < x and taking v(x',k-1) + the cost of having (x',x] as a group. The final objective is the best v(x'',K) for the final position x'' (I think I need some notation).

In your example, if we denote 1,2,3,...,10 to be the break positions after the corresponding number, we get

    v(1,1) = |8-4| = 4
    v(2,1) = |8-6| = 2
    v(2,2) = 4 + |8-2| = 10
    v(3,1) = |14-8| = 6
    v(3,2) = min( v(2,1) + |8-8|, v(1,1) + |8-10| ) = min(2, 6) = 2
    v(3,3) = 10 + |8-8| = 10

etc. v(10,4) will have your value. Complexity is something like O(N^2 K), though I am sure you can bring it down.

answered 07 Jan '12, 02:50 Michael Trick ♦♦
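A minimal Python sketch of this DP (my own code, not from the answer): breaks are only allowed between two distinct values, so ties stay together, and treating the end of the list as the K-th break closes the last group.

```python
def partition_dp(a, K):
    """Minimum total |group size - round(N/K)| over all ways to cut the
    sorted list `a` into K groups without separating equal values."""
    n = len(a)
    target = round(n / K)
    # A break "at position x" means x elements lie to its left; only
    # positions between two distinct values are valid, plus position n,
    # which acts as the final break that closes the K-th group.
    positions = [i for i in range(1, n) if a[i - 1] != a[i]] + [n]

    best = {}  # best[(x, k)] = min error with the k-th break at position x
    for x in positions:
        best[(x, 1)] = abs(x - target)
        for k in range(2, K + 1):
            cands = [best[(xp, k - 1)] + abs((x - xp) - target)
                     for xp in positions if xp < x and (xp, k - 1) in best]
            if cands:
                best[(x, k)] = min(cands)
    return best.get((n, K))  # None if K groups are infeasible
```

With only valid break positions in the state space the running time is O(B^2 K), where B is the number of distinct-value boundaries, matching the O(N^2 K) bound above in the worst case.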
Nice. So the graph problem I was looking for is a shortest path problem in a state network...
(07 Jan '12, 03:52)
Marco Luebbecke ♦

For what it's worth, I implemented a simple MiniZinc (CP) model for this problem: equal sized groups.mzn. The main constraints are the following:

    % Set the break points (x) and sum the number of elements in each bucket (s)
    constraint
       s[1] = x[1] /\
       forall(i in 2..k-1) ( s[i] = x[i] - x[i-1] ) /\
       s[k] = n - x[k-1];

The following constraint ensures that we don't have a break point between elements with the same value:

    forall(j in 2..n) (
       a[j-1] = a[j] -> not(exists(p in 1..k-1) ( x[p] = j-1 ))
    )
    % alt. version
    % a[j-1] = a[j] -> forall(p in 1..k-1) ( x[p] != j-1 )

The objective is to minimize

    int: gsize = round(int2float(n) / int2float(k)); % group size
    var 0..n: z = sum(i in 1..k) ( abs(s[i] - gsize) );

Using some random data, it's quite fast for arrays of size < 100, and some solvers (such as Chuffed, a lazy-clause SAT solver) have no problems with arrays of size 1000, at least when k is not too large (say < 20 or so); however, Chuffed has to work a lot for arrays of size 10000. (I don't know if these sizes are in the range of the application.)

I have assumed that the "grouping" constraint is hard in the requirements, i.e. that elements with the same value must be in the same bucket. Also, I have not been able to identify any tighter upper bound on the decision variables.

Update: I have now changed the model with two improvements:

1) Tighter domain for x:

    set of int: valid_breaks = {j-1 | j in 2..n where a[j] != a[j-1]};
    % ...
    array[1..k-1] of var valid_breaks: x;

Many solvers are faster with this variant, though the Chuffed solver doesn't like it.

2) Replaced the implication constraint shown above with the following, eliminating the invalid break points, i.e. the positions between elements of the same value; this is useful if we can't use the valid break points as a domain for x:

    set of int: non_valid_breaks = {j-1 | j in 2..n where a[j] = a[j-1]};
    % ...
    forall(j in non_valid_breaks, p in 1..k-1) ( x[p] != j )

Now Chuffed solves the 10000 problem instance in about 0.1 seconds (though it takes 30 seconds to generate the FlatZinc .fzn file), and both fzn2smt and G12's LazyFD are much faster for the 1000-size instance (just a second or two).

answered 08 Jan '12, 04:58 Hakan Kjelle...
Thanks to everyone who answered my question. It is really interesting to see some different approaches. I came up with a recursive, heuristic approach to my question.

I considered two forms of the problem. In the first, as I stated it above, we want exactly K groups. In the second, we want at most K groups. This relaxation can give better solutions in terms of error.

We want K groups out of the N data points. Let L = int(round(float(N)/float(K))) be the desired size of each group. We first try to build a chunk of size L. There may be a group of identical points that crosses the ideal chunk boundary. We don't know if it is better to include or exclude this group, so we try both and choose the solution with the best error. For each case, we call our function again with the remaining data and K-1 groups. If we can build a chunk of exactly size L, then we just do so and recursively call the algorithm on K-1 groups. In the exact case you have an additional restriction on the size of the group: you need to make sure that the remaining data has at least K-1 distinct values, or you won't be able to produce K-1 groups.

Splitting our N points is an O(N) operation. The recursion will have to explore 2^K solutions in the worst case, although in practice you won't have a block overlapping the desired boundary each time. Thus the overall time is O(N·2^K). If K is small, as it is for my cases, this will be fast enough.

If anyone is interested, here is a python implementation, which solves my example problem for all values of K.
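As a rough sketch (my own code, not the original implementation), the relaxed "at most K groups" variant of this recursion might look like the following; when the ideal cut lands inside a run of equal values, both the excluding and the including cut are tried:

```python
def split_heuristic(a, k, target=None):
    """Split sorted list `a` into at most k groups, keeping equal values
    together, minimizing sum of |group size - target| per group.
    Returns (groups, total_error); target defaults to round(len(a)/k)."""
    if target is None:
        target = round(len(a) / k)
    if k == 1 or len(a) <= target:
        return [a], abs(len(a) - target)

    cut = target
    # If the ideal cut would split a run of equal values, consider both
    # excluding the run (move cut left) and including it (move cut right).
    lo = hi = cut
    while lo > 0 and a[lo - 1] == a[lo]:
        lo -= 1
    while hi < len(a) and a[hi - 1] == a[hi]:
        hi += 1

    best = None
    for c in sorted({lo, hi}):
        if not 0 < c < len(a):
            continue  # would create an empty group
        rest_groups, rest_err = split_heuristic(a[c:], k - 1, target)
        err = abs(c - target) + rest_err
        if best is None or err < best[1]:
            best = ([a[:c]] + rest_groups, err)
    if best is None:  # no legal cut: keep everything as one group
        best = [a], abs(len(a) - target)
    return best
```

Each level of recursion branches at most twice (the two candidate cuts), giving the 2^K worst case described above; the exact-K variant would additionally have to check that the remainder still has at least k-1 distinct values.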
This gives the following output. The first set of output is the non-exact case with <= K chunks, and the second is the exact case with K chunks.
answered 09 Jan '12, 14:41 Craig 
Can't say it looks familiar, but it's a variant of a set partitioning problem, so it could be modeled easily as a mixed integer linear program. (Of course, "modeled easily" and "solved easily" are not synonymous, at least for large instances.) It could also be modeled as a constraint programming problem. The CP might be easier to solve if the objective were to minimize the maximum deviation from the target size, rather than the sum of deviations. (Once the CP achieves a certain maximum deviation, it will try to achieve a smaller one, and reducing the maximum allowed deviation eliminates certain combinations from consideration.) That might be true for the MIP formulation as well.

answered 06 Jan '12, 18:03 Paul Rubin ♦
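To make the MIP option concrete, one possible formulation (a sketch in my own notation, not from the answer) places K-1 ordered break positions p_j, restricted to the set V of valid breaks (positions between distinct values), and linearizes each group's absolute deviation from the target t = round(N/K) with a pair of nonnegative variables:

```latex
% p_j: position of the j-th break, with p_0 = 0 and p_K = N fixed;
% u_j, v_j: positive/negative parts of group j's deviation from t.
\begin{align*}
\min \quad & \sum_{j=1}^{K} (u_j + v_j) \\
\text{s.t.} \quad & p_0 = 0, \qquad p_K = N, \qquad p_{j-1} < p_j, \\
& p_j \in V \quad (1 \le j \le K-1), \\
& (p_j - p_{j-1}) - t = u_j - v_j, \qquad u_j,\, v_j \ge 0 \quad (1 \le j \le K).
\end{align*}
```

The membership constraint p_j ∈ V can be enforced with binary indicator variables in the usual way, which is what makes this a MIP rather than an LP.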
This is a combinatorial problem; the numbers are irrelevant. This problem has an "interval property", and actually I cannot believe that this problem is NP-hard. It is certainly polytime solvable when K is fixed, because there are only (N-1 choose K-1) possible solutions, which we can exhaustively inspect in polytime. But when K is part of the input, hell, there is some graph problem behind this...

answered 06 Jan '12, 20:14 Marco Luebbecke ♦
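For fixed K that exhaustive inspection is easy to write down; a throwaway Python sketch (my own, using the same error measure as the question, and restricting candidate breaks to positions between distinct values):

```python
from itertools import combinations

def brute_force(a, K):
    """Enumerate all choices of K-1 valid breaks in sorted list `a` and
    return the minimum sum of |group size - round(N/K)| (None if infeasible)."""
    n = len(a)
    target = round(n / K)
    # Valid breaks: positions between two distinct values, so ties stay together.
    valid = [i for i in range(1, n) if a[i - 1] != a[i]]
    best = None
    for cuts in combinations(valid, K - 1):
        bounds = (0,) + cuts + (n,)
        err = sum(abs((bounds[j + 1] - bounds[j]) - target) for j in range(K))
        if best is None or err < best:
            best = err
    return best
```

With B valid break positions this inspects (B choose K-1) candidates, which is polynomial for fixed K, exactly as the comment argues.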
If I recall correctly [disclaimer: that's not a very strong statement], @Mark once posed a similar question. Unfortunately, I can't find it from his user profile (there's a possibility that this question got lost in a database corruption). To the best of my knowledge, I proposed hierarchical clustering as a solution technique (for which performance guarantees are studied here). The top-rated answer, though, was a two-pass merge routine, afaicr proposed by either @Paul Rubin or (1 of the 5) @Matthew Saltzman(s).

answered 07 Jan '12, 06:50 fbahr ♦

@Florian: I vaguely recall that, and I think you're right that Matt proposed a two-pass merge. I can't find it either (tried both Matt's answers and general searches), so it does look as though the site has lost some data. :(
(07 Jan '12, 10:24)
Paul Rubin ♦
