php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Request #81075 Expose curl_multi_fdset function to PHP userspace
Submitted: 2021-05-23 13:04 UTC Modified: 2021-05-31 08:27 UTC
From: asmqb7 at gmail dot com Assigned:
Status: Wont fix Package: cURL related
PHP Version: 8.0.6 OS: All
Private report: No CVE-ID: None
Have you experienced this issue?
Rate the importance of this bug to you:

 [2021-05-23 13:04 UTC] asmqb7 at gmail dot com
Description:
------------
It's currently not possible to efficiently run a curl_multi request while simultaneously selecting on unrelated PHP streams. The only solution is to tick-tock between curl_multi_select() with an extremely small (or 0) timeout, and stream_select() with a small timeout (eg 2500us), in an infinite loop.

This can be made to work for the most part with acceptable (but always nonzero) CPU usage, but magic number tuning is necessary to balance constant CPU wastage against network responsiveness. It's a generally broken situation that requires pulling in and learning how to coordinate a library like Swoole or Ev, or cracking one's knuckles and building upon stream_socket_open("ssl://...") to solve; you don't have any options if you want to stick with the baseline of what you can get just about everywhere with `apt install php-curl` or equivalent (if curl support isn't already bundled).

In bug #80956 I suggested an unnecessary, braindead approach to solving this problem, which involved cramming PHP stream objects into a curl_multi_wait() function and hoping for the best. Considering cURL's provenance, it's easy to see in retrospect that it was kind of silly to assume that a better way did not exist (but I indeed did not initially exhaustively search the documentation).

While digging into some specifics of how cURL works internally earlier today, I discovered the very interesting curl_multi_fdset() function: https://curl.se/libcurl/c/curl_multi_fdset.html

This takes caller-allocated pointers to fd_set structures, which it fills (FD_SET()s) with the socket descriptors cURL is currently using internally.

The whole point of this function is to support the calling application taking over the top-level select() loop.

While poking around searching for references suggesting when curl_multi_fdset() was introduced (A Long Time Ago, I think), I stumbled by chance on a reference to an outdated copy of curl/ext/multi.c from PHP 7 **actually using curl_multi_fdset() as part of curl_multi_exec()**: https://github.com/php/php-src/blob/b4140bf64811b97af153a5d49a1d71677993a075/ext/curl/multi.c#L250 - it literally grabs the fds from cURL and selects on them from the PHP side.

I'm not sure why PHP 8 no longer does this (https://github.com/php/php-src/blob/master/ext/curl/multi.c#L186) - perhaps because CURLM_MULTI_CALL_PERFORM was deprecated (https://curl.se/mail/lib-2012-08/0042.html). Assuming the old code had no issues, it's practically perfect; we just need... the select()... moved... into PHP userspace... :)

To answer the discussion in bug #80956 regarding PHP not having an fd_set type, I think the most compatible, accessible, sane and "just make it work at all" way to PHP userspace would be to simply instantiate discrete PHP stream objects for each low-level FD_ISSET()-ed socket descriptor.

The overhead incurred would not actually be the end of the world IMO. I would guess most realistic use-cases using a "lot" of connections are probably just managing maybe 20 open sockets. That's 60 stream objects (read+write+except). 50 open connections is 150 stream objects. 100 connections is 300 streams. It's certainly arguable this is not ideal, but also IMO effectively counter-arguable that things could be much much worse. Frameworks do terribler things millions of times a second :D (sic)

HTTP/2 multiplexing may also translate some parallel downloading use-cases from the same or a small set of remotes to much lower numbers of connections.

In the case where someone's doing something that translates to many hundreds of real sockets, and it's argued that that person's N+1st problem is that they're using PHP, and that PHP should generally not be expected to reasonably scale that far, the "doesn't reasonably scale" argument would translate in this case to "because it'll produce lots of stream object thrashing"... but that's not going to for example risk data corruption, it'll just make things slow down a bit because PHP would be creating lots of stream objects every loop.

There is a very good question as to whether this would become noticeably quadratic, or only noticeable in (micro)benchmarks, buuuut, that discussion will likely only become concrete at the many-hundreds-of-connections point in the real world. Many use{rs, use cases} won't be affected by the stream (re)creation overhead.

If things need further optimization badly enough under real-world load, an obvious way will either present itself or be hammered into existence :)


Test script:
---------------
An example of what a possible (currently non-existing as of May 2021) curl_multi_fdset() function might look like:

Prototype:

curl_multi_fdset(CurlMultiHandle $multi_handle, array &$read, array &$write, array &$except);


Semantics:

To be consistent with the majority of PHP functions that deal with multiple return values, this function would use argument pass-by-reference.

Inserted streams would be appended to the supplied arrays by adding integer indices, and skipping over existing indices. So if array indices 1, 2 and 4 were already set, indices 0, 3, 5, 6, ... would be used. The types or values of the preexisting contents of the arrays would be ignored.

If relevant, a small micro-optimization could be implemented, possibly in a later revision of the implementation, that used PHP internal knowledge of the passed arrays to determine whether the passed array only used integer indices, and was numerically consistent (eg [0, 1, 2, 3], without skipping any elements), and then short-circuit the index-skipping logic to pre-determine the first array index to begin blindly inserting at. I have absolutely no idea if this would ultimately speed anything up, or if I'm just naively bikeshedding the implementation here :)


Example:

This shows one (potentially non-optimal) example of how this would be used, distinguishing cURL sockets from application-specific streams.

$cm = curl_multi_init();

for (;;) {

  $read = $this->read;     // app-specific stream pools
  $write = $this->write;   // (here represented as part of a class instance)
  $except = $this->except;

  curl_multi_fdset($cm, $read, $write, $except); // this does not exist yet as of May 2021

  stream_select($read, $write, $except, NULL);

  foreach ($read as $stream) {

    if (!in_array($this->read, $fd, true)) { // if $stream is not in $this->read,
      // (curl socket)                       // it was added by curl_multi_fdset()
    } else {
      // app-specific stream
    }

  }

  // (repeat for $cwrite and $cexcept)

}

As I said, this is potentially not the best approach (given that stream_select() works with arrays of streams, is there a better approach than in_array() here?); suggestions and/or boilerplate code for how to best use the function would probably be worth headscratching over then putting into the documentation.


Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2021-05-23 15:53 UTC] danack@php.net
I hope someone will more detailed knowledge about sockets will pop by, but just one thing:

"this function would use argument pass-by-reference."

That's unlikely to win favour. My understanding of the current consensus is that pass-by-reference is....not a great thing to work with, and it's long term goal to remove it, after appropriate other techniques are available. So adding another function that uses it seems unlikely.
 [2021-05-25 12:49 UTC] asmqb7 at gmail dot com
Replying to comment 2,

A common pattern used by for example SQLite3Result::FetchArray() and stat() is to return an array containing both numeric and associative keys. This function could work similarly. So on the one hand I could grab the associative values like

$fdset = curl_multi_fdset();

... = $fdset['read'];
... = $fdset['write'];

but I could also *really* fancy and do _very_ nice things like

[$read, $write, $except] = curl_multi_fdset();

which I actually like a lot (it would be so nice if stream_select() could be made to work the same way).

The existence of SQLite3Result::FetchArray()'s optional `$mode` parameter, which lets you ask for numeric, associative or both types, does make me wonder about the efficiency overhead of this approach however. (Can PHP connect the same set of value arrays by reference - :P - to both the numeric and associative entries?)

Fundamentally, my use of pass-by-reference was kind of piggybacking off of the precedent and momentum of stream_select()'s existing interface language, which I expect this function would predominantly be used in close proximity to.

But while thinking a bit further, I've also realized that using pass-by-reference allows the function to manipulate the passed arrays from the C side of the fence - and combined with the fact that the array(s) wouldn't strictly have to already be empty, users could simply make $read/$write/$except copies from wherever (in much the way I demonstrated), have the function add in cURL's socket fds to those copies, and skip an array_merge() or similar step before stream_select() can be done.

At the end of the day I think both approaches have equal merit. They're both fundamentally aesthetic arguments, so I don't mind how the function gets done at the end of the day.

If I was going to vote for my personal preference, I currently see a winner in pass-by-reference, because of a) its consistency with stream_select()'s behavior, b) the fact that it can change along with stream_select() whenever that happens, and c) the efficiency of the array manipulation (typically appending tends to hundreds of entries, on every single userspace loop iteration) being done from inside C - which I realize is still going through PHP's type system, but without the rest of the VM overhead. The counterargument would perhaps be pointing out how irrelevantly miniscule the real-world gains would be from this micro-optimization.

---

NB. Apparently I can't proofread (even after previewing multiple times...); for completeness, s/stream_socket_open/stream_socket_client/ in the 2nd paragraph, and s/c//g in the comment at the end of the test script.
 [2021-05-31 08:27 UTC] krakjoe@php.net
-Status: Open +Status: Wont fix
 [2021-05-31 08:27 UTC] krakjoe@php.net
I had a bit of a discussion with another developer about this, and we came to the conclusion that this would not be an interface we want to expose - it simply can't scale with curl.

A more suitable interface that we might want to expose is: https://everything.curl.dev/libcurl/drive/multi-socket#multi_socket-callbacks

However, this would not be a simple function addition, would require somebody familiar with that interface to implement it, and would require a whole bunch of testing which nobody is volunteering to do any of.

I'm going to close this as won't fix, because that seems to be the answer to your particular question.
 [2021-06-01 15:26 UTC] asmqb7 at gmail dot com
Oh. *long sigh* :(

Well, thanks for taking the time to take this seriously and think and properly discuss it. For what it's worth (and if you see this), I'm very curious to understand what you mean that "it can't scale with curl".

I presume you mean that the overhead produced by multiple hundreds of connections would cause noticeable thrashing and slowdown? I'm kinda curious what the real-world impact would be, to be honest...

I've noticed SOCKETFUNCTION mentioned here and there, TIL what it actually does. I can definitely agree that wiring something like that deeply into PHP's streams system would add arguably almost equivalent complexity as simply implementing HTTP(S) inside PHP itself, which many existing libraries already do.

Hmm. Now I'm thinking about this somewhat freshly I've just thought of another approach; take three. How about this compromise?

  curl_multi_stream_select(array &$read, array &$write, array &$except, int &$curl_activity, int $tv_sec[, int $tv_usec = 0]) : int

- This would function identically to stream_select(), but inject cURL's active fds behind the scenes.

- If something happens on a PHP-userspace stream, it gets returned as per current conventions, including with return value identical to stream_select()'s.

- If something happens on a cURL stream, this function returns with $curl_activity set to 1, which you should interpret to mean that you must call curl_multi_exec ASAP.

At the end of the day, it is irrelevant which sockets map to which cURL connections; indeed, one socket will indeed map to multiple connections with HTTP/2. It's also irrelevant what type of stream has been created, which will become increasingly relevant in the future with mainstream connections being established over UDP et al. So, a $something_opaque_happened_with_curl flag value seems like a reasonable and forward-looking abstraction.

I will acknowledge that &$curl_activity being pass-by-ref could almost have circus theme music behind it :) (in terms of the emphasis I've put on the approach), but I'm not sure how else to jam this functionality in sideways while preserving stream_select()'s semantics.

I'm very interested to hear your opinions regarding this different approach. I'll let this sit around for a month-ish or so, and open a new bug in case the WONTFIX status on this impedes visibility of comments or whatever.

Lastly, the function name does kinda look a bit weird to me when I look at it from certain angles. It's very function-over-form, merging the well-known(ish) "curl_multi_*" namespace with the also-well-known "stream_select" namespace. I have no particular preference/investment/bias on the above name.

Thanks very much.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Mar 28 16:01:29 2024 UTC