php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Doc Bug #80346 Function `array_intersect_ukey`requires consistent comparison function
Submitted: 2020-11-10 01:21 UTC Modified: 2020-11-17 14:11 UTC
Votes:2
Avg. Score:4.0 ± 1.0
Reproduced:2 of 2 (100.0%)
Same Version:1 (50.0%)
Same OS:1 (50.0%)
From: gilperon at gmail dot com Assigned:
Status: Open Package: Arrays related
PHP Version: 7.4.12 OS: ALL
Private report: No CVE-ID: None
 [2020-11-10 01:21 UTC] gilperon at gmail dot com
Description:
------------
Run the test script. It has 2 arrays in such a way that only the key `4` appears on both. Using `array_intersect_ukey` does not return `4`, instead it returns `array(0) {}`. Using `array_intersect_key` works perfectly. I also realized there is a bug in the implementation of this function, because if you run the test script it will show you this:

|4|6|DIFFERENT
|8|6|DIFFERENT
|4|8|DIFFERENT
|3|4|DIFFERENT
|5|4|DIFFERENT
|3|5|DIFFERENT
|6|4|DIFFERENT
|6|5|DIFFERENT
|6|3|DIFFERENT

It shouldnt be comparing 8 with 6 because these keys only appear on the 1st array and these keys shouldnt be compared at all. You can clearly see that this function, instead of comparing each key of the first array with each key of the second array, it's comparing the keys from the first array WITH ITSELF and then comparing the keys of the second array ITSELF (line |3|5|) and only then comparing the keys of both arrays together (but for some reason it misses the |4|4|).

I know that in my simple use case I should be using `array_intersect_key` but I really need `array_intersect_ukey` to work properly because I have a much more complex use case for it.

 

Test script:
---------------
<?php

$array1 = array(4 => true, 6 => true, 8 => true);
$array2 = array(3 => true, 4 => true, 5 => true);

var_dump(array_intersect_ukey(

	$array1,
	$array2,
	function ($key1, $key2) {
		
		echo "|" . $key1 . "|" . $key2 . "|";
		
		if ($key1 === $key2) {

			echo "EQUAL\n";
			return 0;
			
		}
		else{

			echo "DIFFERENT\n";
			return 1;
		
		}
		
	}

));

?>

Expected result:
----------------
It should only show the element 4, which is present in both arrays.

Actual result:
--------------
Nothing is returned from the intersection.

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2020-11-10 01:32 UTC] danack@php.net
Changing the code:
  
   echo "DIFFERENT\n";
   return 1;

to:

    echo "DIFFERENT\n";
    return $key2 - $key1;

Makes your code work.

It looks like this code may be sorting the arrays first, and sorting algorithms need to be stable: https://www.geeksforgeeks.org/stability-in-sorting-algorithms/

That's an educated guess. If it is that, it's a docs problem as stability probably should be mentioned.
 [2020-11-10 01:57 UTC] requinix@php.net
-Summary: Function `array_intersect_ukey` not working (really!) +Summary: Function `array_intersect_ukey`requires consistent comparison function -Type: Bug +Type: Documentation Problem -Package: *General Issues +Package: Arrays related
 [2020-11-10 01:57 UTC] requinix@php.net
Indeed, this is user error: as with every other array sorting function (and the intersect/diff functions *do* perform sorting - for efficiency), the callback needs to be consistent in its comparisons. It must return less than/equal to/greater than, and doing something else will likely result in "garbage in, garbage out" behavior.

This matter is going to be common to all sorting, intersection, diff, etc. functions. If the docs are amended then I suggest some &copypasta; for each affected page.


PS: Returning $key2 - $key1 is technically backwards, but the final sort order doesn't really matter since this function only cares about equality between elements.
 [2020-11-10 02:08 UTC] gilperon at gmail dot com
Thank you guys, but I think that most users will assume that the callback function will only be called when the comparson between the keys are effectively being done. I think most people will never assume that the callback will be called during the "sort" phase, which by the way, there is no single line saying that array_intersect (or similars) will sort the arrays previously.

Also, I am running a very CPU intensive algorithm and I am trying to optimize every bit and because of the way my algorithm works, it already sorts the data before calling array_intersect. So is there a way to tell PHP not to sort the arrays and leave them untouched and just do the intersect, as it should?

I think that it's not obvious to users to know that `array_intersect` (or similars) will sort the arrays before actually doing what it's meant to do. If so, it should be called `array_sort_intersect` or at least allow a FLAG to be set if the user already sorted the arrays by himself.

Anyway, is there a way to me to check, inside the callback function, if the keys being compared are in the "sort" phase or in the "intersect" phase?
 [2020-11-10 02:34 UTC] requinix@php.net
> Thank you guys, but I think that most users will assume that the callback
> function will only be called when the comparson between the keys are effectively
> being done.
The docs already say that the function has to return negative/zero/positive - not simply same/different - and following those instructions will make the process work regardless of whether it did a O(n log n) sort-and-check or the naive O(n^2) intersection.

> So is there a way to tell PHP not to sort the arrays and leave them untouched and just
> do the intersect, as it should?
No, you'll get the best case of O(n log n) for the sort then O(n) for the intersection.

At least I assume that's the best case. I don't remember what algorithm PHP uses now - used to be quicksort but I think it changed.

If you need to sort the arrays for other reasons, write out a userland implementation of an intersection and profile the result. It could be that the "slow" user code beats out the "fast" sorts.

(The implementation: two concurrent iterators on both arrays, running beginning to end, where each one advances or not based on a comparison of their respective current items.)

> is there a way to me to check, inside the callback function, if the keys being
> compared are in the "sort" phase or in the "intersect" phase?
Not really. Does it matter? I can't imagine why...
 [2020-11-10 12:39 UTC] gilperon at gmail dot com
requinix@php.net

Thank you for your explanations. I just think that for some strange reason this `array_intersect_ukey` function is much much much slower than `array_intersect_key` (without callback). It should be a little bit slower, but not by a lot. I will create a new test script and post so you can check what I am saying.

And I still firmly believe that makes no sense calling the callback function when the algorithm of `array_intersect_ukey` is in the `sort` phase, also, if this cant be changed, it would be nice to the user to know if it is in the `sort` phase and not in the `intersect` phase so I could use a much more complex function in the `intersect` phase (which is my case, I have a very complex intersection code which makes no sense being run in the sort phase because just makes everything much slower than it already is).

Regarding writing a function on my own to do what I want, no matter what I come up, it is always at least 10x slower than native PHP functions. PHP is doing some really good magic because even using Golang to benchmark same script, PHP still is faster when working with arrays!
 [2020-11-10 13:21 UTC] cmb@php.net
> The docs already say that the function has to return
> negative/zero/positive […]

Indeed.  I consider the sorting an implementation detail, and as
such don't think we should document that; or should we?

What else would need to be documented in this regard?

FWIW, PHP 7 uses a hybrid quicksort (fewer than 16 elements
trigger insertion sort), and as such sorting has to be considered
unstable.  PHP 8 still uses basically the same algorithm, but
guarantees stable order[1].

[1] <https://wiki.php.net/rfc/stable_sorting>
 [2020-11-10 14:14 UTC] gilperon at gmail dot com
cmb@php.net in my opinion as a PHP user/programmer for the last 10 years which uses a lot the documentation, I think that we can have at least 2 alternatives:

1) make it clear in the documentation that the `callback` function will not only be called when the intersection is being done/compared, but also in the `sort` phase (which, to me, makes completely no sense, since the intent of the callback is exactly to make a custom comparator between keys, not to make a custom "sorter");

2) allow a parameter to be added to the function `array_intersect_ukey` named DONT_CALL_MY_CALLBACK_ON_SORT_PHASE which the user can set to true/false (again, to me makes completely no sense calling the custom callback function in the sort phase, because it's not intuitive and not expected by the user/programmer).

If you guys still decide to keep calling the callback function during the sort phase, I think there should be a third parameter in the callback function `function ($key1, $key2, $phase)` which tells the user whether it's in the sort or intersect phase, and the user can do something like:

function ($key1, $key2, $phase) {

    if ($phase === 'sort') {

         return $key2 - $key1;

    }
    else if ($phase === 'intersect') {

        //here comes the actual intersect callback function, where the user can do lots of complicated math, stuff... and return TRUE or FALSE if the intersection was succesfull or not.

    }

}
 [2020-11-17 14:11 UTC] cmb@php.net
> make it clear in the documentation that the `callback` function
> will not only be called when the intersection is being
> done/compared, but also in the `sort` phase

Like I said, I consider this sorting phase to be an implementation
detail.  It is just done this way, to be more efficient in the
general case.

> allow a parameter to be added to the function
> `array_intersect_ukey` named DONT_CALL_MY_CALLBACK_ON_SORT_PHASE

That could easily give erroneous results.  The array has to be
sorted according to the compare function.

> here comes the actual intersect callback function, where the
> user can do lots of complicated math, stuff...

If you have really time consuming complex calculations, consider
to precompute respective values, and to just look these up in the
comparison function.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Mon Dec 30 14:01:28 2024 UTC