Function to make a list as unsorted as possible
I am looking for a function to make the list as unsorted as possible. Preferably in Python.
Backstory:
I want to check URL statuses and see whether each URL returns a 404 or not. I just use the asyncio and requests modules. Nothing fancy.
I don't want to overload servers, so I want to avoid checking URLs on the same domain at the same time. My idea is to order the URLs so that items with the same sort key (the domain name) are placed as far apart from each other in the list as possible.
An example with numbers:
a=[1,1,2,3,3] # <== sorted list, sortness score = 2
   0,1,2,3,4  # <== positions
could be unsorted as:
b=[1,3,2,1,3] # <== unsorted list, sortness score = 6
   0,1,2,3,4  # <== positions
I would say that we can compute a sortness score by summing up the distances between equal items (items which have the same key = domain name). A higher sortness score means the list is better unsorted. Maybe there is a better way to test unsortness.
The sortness score for list a is 2. The sum of distances for 1 is (1-0)=1, for 2 is 0, for 3 is (4-3)=1.
The sortness score for list b is 6. The sum of distances for 1 is (3-0)=3, for 2 is 0, for 3 is (4-1)=3.
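The score as described in the prose above (plain distances, no square root) could be sketched like this. The name `sortness_score` is only an illustration and is not part of the prototype further down.

```python
from collections import defaultdict

def sortness_score(items):
    """Sum of distances between consecutive occurrences of equal items."""
    positions = defaultdict(list)
    for index, item in enumerate(items):
        positions[item].append(index)
    return sum(
        later - earlier
        for indexes in positions.values()
        for earlier, later in zip(indexes, indexes[1:])
    )

print(sortness_score([1, 1, 2, 3, 3]))  # 2
print(sortness_score([1, 3, 2, 1, 3]))  # 6
```

Note that with plain distances the per-key sum telescopes to (last position - first position), so this score does not reward even spacing in between. Taking the square root of each gap, as test_unsortness does, favours evenly spread occurrences.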
The URL list would look something like this, a list of (domain, URL) tuples:
[
    ('example.com', 'http://example.com/404'),
    ('test.com', 'http://test.com/404'),
    ('test.com', 'http://test.com/405'),
    ('example.com', 'http://example.com/405'),
    ...
]
I am working on a prototype which works OK-ish but is not optimal: I can find variants by hand which are better unsorted.
Does anyone want to give it a go?
This is my code, but it's not great :):
from collections import Counter
from collections import defaultdict
import math

def test_unsortness(lst:list) -> float:
    pos = defaultdict(list)
    score = 0
    # Store positions for each key
    # input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
    for c,l in enumerate(lst):
        pos[l].append(c)
    # Sum the square roots of the gaps between consecutive occurrences
    for k,poslst in pos.items():
        for i in range(len(poslst)-1):
            score += math.sqrt(poslst[i+1] - poslst[i])
    return score

def unsort(lst:list) -> list:
    free_positions = list(range(0,len(lst)))
    output_list = [None] * len(free_positions)
    for val, count in Counter(lst).most_common():
        pos = 0
        step = len(free_positions) / count
        for i in range(count):
            output_list[free_positions[int(pos)]] = val
            free_positions[int(pos)] = None # Remove position later
            pos = pos + step
        # Compare against None explicitly so that position 0 is never dropped by accident
        free_positions = [p for p in free_positions if p is not None]
    return output_list

lsts = list()
lsts.append( [1,1,2,3,3] )
lsts.append( [1,3,2,3,1] ) # This has the worst score after unsort()
lsts.append( [1,2,3,0,1,2,3] ) # This has the worst score after unsort()
lsts.append( [3,2,1,0,1,2,3] ) # This has the worst score after unsort()
lsts.append( [3,2,1,3,1,2,3] ) # This has the worst score after unsort()
lsts.append( [1,2,3,4,5] )
for lst in lsts:
    ulst = unsort(lst)
    print( ( lst, '%.2f'%test_unsortness(lst), '====>', ulst, '%.2f'%test_unsortness(ulst), ) )
# Original score Unsorted score
# ------- ----- -------- -----
# ([1, 1, 2, 3, 3], '2.00', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 3, 2, 3, 1], '3.41', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5], '0.00', '====>', [1, 2, 3, 4, 5], '0.00')
PS. I am not looking just for a randomize function and I know there are crawlers which can manage domain loads, but this is for the sake of exercise.
Solution 1:
Instead of unsorting your list of URLs, why not group them by domain, each in its own queue, then process them asynchronously with a delay (randomised?) in between?
It looks less complex to me than what you're trying to do and achieves the same thing. If you have a lot of domains, you can always throttle the number processed concurrently at that point.
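A minimal sketch of that idea with asyncio, under some assumptions: `check_url` is a hypothetical placeholder for the real status check, and `drain_domain`, `check_all`, `max_domains` and `delay` are names invented here for illustration. Each domain gets its own sequential worker with a random pause between requests, and a semaphore caps how many domains run at once.

```python
import asyncio
import random
from collections import defaultdict

async def check_url(url):
    """Hypothetical placeholder: swap in a real HTTP status check here."""
    await asyncio.sleep(0)
    return url, None

async def drain_domain(urls, limiter, delay, results):
    """Check one domain's URLs one at a time, pausing between requests."""
    async with limiter:  # cap how many domains are processed concurrently
        for position, url in enumerate(urls):
            if position:  # no pause before the first request to a domain
                await asyncio.sleep(random.uniform(0, delay))
            results.append(await check_url(url))

async def check_all(pairs, max_domains=10, delay=1.0):
    """Group (domain, url) pairs by domain and run one worker per domain."""
    by_domain = defaultdict(list)
    for domain, url in pairs:
        by_domain[domain].append(url)
    limiter = asyncio.Semaphore(max_domains)
    results = []
    await asyncio.gather(
        *(drain_domain(urls, limiter, delay, results) for urls in by_domain.values())
    )
    return results

pairs = [
    ('example.com', 'http://example.com/404'),
    ('test.com', 'http://test.com/404'),
    ('test.com', 'http://test.com/405'),
    ('example.com', 'http://example.com/405'),
]
# delay=0.0 keeps this demo instant; use a real delay against real servers
results = asyncio.run(check_all(pairs, max_domains=2, delay=0.0))
print(results)
```

With this layout, two requests to the same domain never overlap, so no ordering trick on the list itself is needed.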
Solution 2:
I used Google OR-Tools to solve this problem. I framed it as a constraint optimization problem and modeled it that way.
from collections import defaultdict
from itertools import chain, combinations
from ortools.sat.python import cp_model

model = cp_model.CpModel()
data = [
    ('example.com', 'http://example.com/404'),
    ('test.com', 'http://test.com/404'),
    ('test.com', 'http://test.com/405'),
    ('example.com', 'http://example.com/405'),
    ('google.com', 'http://google.com/404'),
    ('example.com', 'http://example.com/406'),
    ('stackoverflow.com', 'http://stackoverflow.com/404'),
    ('test.com', 'http://test.com/406'),
    ('example.com', 'http://example.com/407')
]

tmp = defaultdict(list)
for (domain, url) in sorted(data):
    var = model.NewIntVar(0, len(data) - 1, url)
    tmp[domain].append(var) # store URLs as model variables where the key is the domain

vals = list(chain.from_iterable(tmp.values())) # create a single list of all variables
model.AddAllDifferent(vals) # all variables must occupy a unique spot in the output

constraint = []
for urls in tmp.values():
    if len(urls) == 1: # a domain with a single URL has no pairs; add its position as-is
        constraint.append(urls[0])
        continue
    combos = combinations(urls, 2)
    for (x, y) in combos: # create combinations between each URL of a specific domain
        constraint.append((x - y))

# maximize the summed (signed) position differences between URLs of the same domain
model.Maximize(sum(constraint))

solver = cp_model.CpSolver()
status = solver.Solve(model)
output = [None for _ in range(len(data))]
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
    for val in vals:
        idx = solver.Value(val)
        output[idx] = val.Name()

print(output)
Output:
['http://example.com/407',
 'http://test.com/406',
 'http://example.com/406',
 'http://test.com/405',
 'http://example.com/405',
 'http://stackoverflow.com/404',
 'http://google.com/404',
 'http://test.com/404',
 'http://example.com/404']
Solution 3:
There is no obvious definition of unsortedness that would work best for you, but here's something that at least works well:
- Sort the list
- If the length of the list is not a power of two, then spread the items out evenly in a list with the next power of two size
- Find a new index for each item by reversing the bits in its old index.
- Remove the gaps to bring the list back to its original size.
In sorted order, the indexes of items that are close together usually differ only in the smallest bits. By reversing the bit order, you make the new indexes for items that are close together differ in the largest bits, so they will end up far apart.
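To see the effect on a small case, here is a sketch for a list of 8 items (3 index bits). `reverse_bits` is a simple loop-based helper written for this illustration, not the mask-based function used in the full code.

```python
def reverse_bits(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# New position for each old index 0..7
order = [reverse_bits(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]

# Place a sorted list at its bit-reversed positions
spread = [None] * 8
for i, item in enumerate("aabbccdd"):
    spread[reverse_bits(i, 3)] = item
print("".join(spread))  # acbdacbd
```

Neighbours in the sorted list (old indexes 0 and 1) land at positions 0 and 4, half the list apart.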
def bitreverse(x, bits):
    # reverse the lower 32 bits
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    # take only the appropriate length
    return (x>>(32-bits)) & ((1<<bits)-1)

def antisort(inlist):
    if len(inlist) < 3:
        return inlist
    inlist = sorted(inlist)
    # get the next power of 2 list length
    p2len = 2
    bits = 1
    while p2len < len(inlist):
        p2len *= 2
        bits += 1
    templist = [None] * p2len
    for i in range(len(inlist)):
        # spread the items evenly over the power-of-2 list, then bit-reverse the index
        newi = i * p2len // len(inlist)
        newi = bitreverse(newi, bits)
        templist[newi] = inlist[i]
    # remove the gaps
    return [item for item in templist if item is not None]

print(antisort(["a","b","c","d","e","f","g",
    "h","i","j","k","l","m","n","o","p","q","r",
    "s","t","u","v","w","x","y","z"]))
Output:
['a', 'n', 'h', 'u', 'e', 'r', 'k', 'x', 'c', 'p', 'f', 's',
'm', 'z', 'b', 'o', 'i', 'v', 'l', 'y', 'd', 'q', 'j', 'w', 'g', 't']