Bug #69575 Mega data - mega problem
Submitted: 2015-05-05 16:13 UTC Modified: 2015-06-15 14:29 UTC
From: mark at briley dot com Assigned:
Status: Not a bug Package: *General Issues
PHP Version: 5.5.24 OS: Windows 7 Pro
Private report: No CVE-ID: None
 [2015-05-05 16:13 UTC] mark at briley dot com
Description:
------------
---
From manual page: http://www.php.net/function.array-unique
---

This problem seems to affect several functions: array_unique(), sort(), and others that work with arrays.

Problem: I had 50MB of INSERT commands which were computer generated.  There were several duplicates because updates had come in to the data set before I ran the program to generate the INSERT commands.  All INSERT commands are made the same way, so the first entry in each INSERT command is the ID of the record.  Duplicates are not allowed, so these extra INSERT commands had to be removed.  Using PHP's built-in sort() did not sort the records.  At first I thought I had done something wrong and tried it on a smaller data set.  sort() worked on the smaller data set but silently fails on the larger (50MB) data set.  I tried array_unique() as well as several other array functions.  It seems that when the data set gets over a certain size these functions simply return the original array.  In memory, the program consumed over 2GB.  My machine has 8GB of memory and over 100GB of disk space.

Test script:
---------------
I cannot provide a test script since you would need over 50MB of data with records reaching a length of over 65K characters.  By the way, the Windows SORT command cannot handle strings over 65K characters long.  So if PHP calls the Windows SORT routine, that might be the problem.

In my program, I had a small section that simply said:

# $ary = the 50MB list of INSERT commands.
   $b = array_unique( $ary );
# $b should now have an array that is unique.  But when written out to
# items.sql - there were duplicate INSERT commands.  When I compared
# these INSERT commands they were identical.  All lines were sent
# through TRIM() to ensure no invisible characters were in a line.

With the SORT routine, I did

# $ary = the 50MB list of INSERT commands.
   sort( $ary );
   for( $i=0; $i<count($ary)-1; $i++ ){
      if( $ary[$i] === $ary[$i+1] ){ unset( $ary[$i] ); }
      }

   $ary = array_reverse( array_reverse( $ary ) );

$ary still contained duplicates.  The weird thing was - the duplicates were hundreds of records apart.  So maybe at record #567 and record #1245.  Or even #34629.  The sort function should have put them all next to each other.  After all, if the records are:

INSERT INTO <table> (id,atr1,atr2,atr3...) VALUES (1,"hair","face","feet",...)

and that first value (i.e. 1) is on two different records - then they should sort and be together.  This is true even if the first value was something like 11256, because "1," would sort either before or after "11256," for ALL occurrences of "1,".
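To make the ordering concrete, here is a tiny made-up example (shortened INSERT lines, not my real data):

    $lines = array(
       'INSERT INTO t (id,a) VALUES (11256,"x")',
       'INSERT INTO t (id,a) VALUES (1,"hair")',
       'INSERT INTO t (id,a) VALUES (1,"hair")',
       );
    sort( $lines );     # plain string sort
    print_r( $lines );  # the two identical "(1," lines end up next to each other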

Expected result:
----------------
If any of the routines cannot handle what is given to it, I would have expected an error to be generated - anything to let the programmer know there is a problem.  This took over a week to track down because no errors were generated; when I tried to import the file into MySQL I received a "Duplicate entry..." message, which is the only error I ever got letting me know there was a problem.

Actual result:
--------------
The items.sql file was rife with duplicate entries.  When compared by hand the records were exactly the same.  No duplicates should have been possible.

History

 [2015-05-05 17:32 UTC] cmb@php.net
-Status: Open +Status: Feedback
 [2015-05-05 17:32 UTC] cmb@php.net
The following test script gives the expected results (PHP 5.5.24,
Win7):

    $strings = array();
    for ($i = 0; $i < 1000; $i++) {
        $strings[] = str_repeat($i % 10, 66 * 1024);
    }
    var_dump(strlen($strings[0]));   // 67584
    var_dump(count($strings));       // 1000
    $unique = array_unique($strings);
    var_dump(count($unique));        // 10

To be able to reproduce the issue, we would need respective sample
data. Can you make these available for download somewhere?
 [2015-05-06 15:35 UTC] mark at briley dot com
-Status: Feedback +Status: Open
 [2015-05-06 15:35 UTC] mark at briley dot com
As stated in the problem, small data sets work just fine.  Try changing your data set size by doing the following:

#
#   We are going to need a LOT of memory!
#   We want 6GB of memory to work on this problem.
#
    ini_set( 'memory_limit', '6000M' );

    $strings = array();
#
#   Make 25,000,000 entries.
#
    for ($i = 0; $i < 25000000; $i++) {
        $strings[] = uniqid( $i, true );
    }
#
#   Duplicate all of them so we can see if the duplicates
#   sort next to each other.
#
    foreach( $strings as $k=>$v ){
       $strings[] = $strings[$v];
       }
#
#   Save it so we can edit it via VIM or some other editor.
#   All duplicate records should appear right next to each other.
#
    file_put_contents( "./out1.dat", $strings );

    var_dump(strlen($strings[0]));   // 67584
    var_dump(count($strings));       // 1000
    $unique = array_unique($strings);
#
#   If the array_unique worked then all of the duplicates
#   should be gone.  Save it again so we can see if it
#   really DID remove all of the non-unique values.
#
    file_put_contents( "./out2.dat", $strings );

    var_dump(count($unique));        // 10

NOW see if it works.

I found this same error years ago in Perl.  The problem was that Perl was using shorts instead of longs in its sorting routines.  This meant that it worked for up to 65,536 records, but if you exceeded that many records it borked and returned only SOME of the array sorted.  I suspect this is that same problem showing up again.  Your next question to me should be "Why the heck are you sorting so many records in memory?"

The answer is - we are taking thousands of XML files, putting them together, converting them to SQL, ensuring they are going into the database in the right order (sorting), and then splitting the SQL file up into separate files.  We are having to do this because we are not going for our own server and instead are using one of the cheaper setups.  Unfortunately, the cheaper setup won't let us upload all of the one large file at one time.  No file can be larger than 10MB and due to problems when we were really close to the 10MB limit - I'm making sure the files are no larger than 5MB each.  What can I say?  We are in the initial stages of setting everything up and until it is shown that the whole thing will work - no more money is going to be used for this project.  (Other than my salary.)  Once everything is up and running we will probably get a dedicated server.
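The splitting step itself is the easy part.  Roughly something like this (a simplified sketch, not the real program; $lines, the file names, and the hard-coded 5MB limit are just placeholders):

#
#   Write whole INSERT lines into chunk files of at most ~5MB each.
#
    $chunk = 1;
    $size  = 0;
    $out   = fopen( "./chunk-" . $chunk . ".sql", "w" );
    foreach( $lines as $line ){            # $lines = the sorted INSERT commands
       $len = strlen( $line ) + 1;         # +1 for the newline
       if( $size + $len > 5 * 1024 * 1024 ){
          fclose( $out );
          $chunk++;
          $size = 0;
          $out  = fopen( "./chunk-" . $chunk . ".sql", "w" );
          }
       fwrite( $out, $line . "\n" );
       $size += $len;
       }
    fclose( $out );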

Ok - next question - "Why don't we just upload the XML files?"  The company we are going through is using an older version of phpMyAdmin and only CSV and SQL files can be uploaded.  Thus, we have to convert the XML files over to SQL files.  We cannot use CSV because of the size of some of these files.  CSV will croak if the size of a record is greater than....drum roll please.....65,536 characters.  Thank you 8-bit computers for making companies NOT switch to 32-bit technologies.

It is like that old joke.  Why are roads the width they are?  The joke drags on until you get to the real reason, which is: "The Romans built the original roads and these roads HAD to be the same width as a chariot which was pulled by two horses."  Thus proving two horses' rears take precedence over what something actually should be, and also proving that once a standard is created it stays around forever.  :-) (i.e. shorts over longs because that's how it has ALWAYS been done)
 [2015-05-06 17:06 UTC] cmb@php.net
Hmm, I can't run the given test script, because there is a 2 GB
memory limit in 5.5.24 (x64 as well as experimental x64 build) on
Windows (besides a typo in the script: $strings[$v] should be
$strings[$k]). If I'm not mistaken this memory limit applies to
all current PHP versions.

This might have actually been the cause of your script misbehaving.

| we are taking thousands of XML files, putting them together,
| converting them to SQL, ensuring they are going into the
| database in the right order (sorting), and then splitting the
| SQL file up into separate files.

It might be a workaround to split the big SQL first, then sorting
each file separately, and then doing a merge sort[1] on the files.

[1] <http://en.wikipedia.org/wiki/Merge_sort>
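A rough sketch of the merge step, assuming each part file is
already sorted and holds one record per line (file names are
placeholders):

    $in  = array(fopen('part1.sql', 'r'), fopen('part2.sql', 'r'));
    $out = fopen('merged.sql', 'w');
    $cur = array();
    foreach ($in as $i => $fh) {           // prime one line per file
        $cur[$i] = fgets($fh);
    }
    $last = null;
    while (true) {
        $min = null;
        foreach ($cur as $i => $line) {    // pick the smallest pending line
            if ($line !== false && ($min === null || strcmp($line, $cur[$min]) < 0)) {
                $min = $i;
            }
        }
        if ($min === null) {               // all files exhausted
            break;
        }
        if ($cur[$min] !== $last) {        // skip duplicates while merging
            fwrite($out, $cur[$min]);
            $last = $cur[$min];
        }
        $cur[$min] = fgets($in[$min]);
    }
    foreach ($in as $fh) {
        fclose($fh);
    }
    fclose($out);

That way only one line per file has to be held in memory at a time.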
 [2015-05-06 17:15 UTC] cmb@php.net
-Status: Open +Status: Feedback
 [2015-05-06 17:15 UTC] cmb@php.net
Gee! Setting the memory_limit to '6000000000' instead of '6000M'
did work (so disregard my former comment about a general 2G memory
limit), and the test script gave the expected results.
 [2015-05-06 17:24 UTC] cmb@php.net
Oops! Disregard my last comment; that was nonsense (I had
decreased the number of strings).
 [2015-05-06 18:30 UTC] mark at briley dot com
-Status: Feedback +Status: Open
 [2015-05-06 18:30 UTC] mark at briley dot com
Hah!  :-)  Yeah.  The program never got over 3GB but my PHP script never complained about the 6000M memory_limit.  Strange that.  :-?
 [2015-05-06 21:33 UTC] cmb@php.net
-Status: Open +Status: Feedback
 [2015-05-06 21:33 UTC] cmb@php.net
> The program never got over 3GB but my PHP script never
> complained about the 6000M memory_limit.

That might be regarded as a bug, but that's a different issue than
array_unique() failing for large arrays, and as such should be
reported separately.

Anyhow, I wonder how you have been able to run the supplied test
script. All available official Windows builds (x86 as well as x64)
of PHP 5.5.24 don't seem to be able to allocate more than 2G of
memory. Even with a somewhat recent snapshot of an x64 build of
PHP 7.0.0 I have not been able to run the test script (a memory
limit of 6,000,000,000 bytes doesn't suffice, and it seems that
it's not possible to raise the memory limit beyond 8,000,000,000 –
yet another issue).

Can you please give a test script that makes it possible to
reproduce the issue?
 [2015-05-07 18:19 UTC] mark at briley dot com
-Status: Feedback +Status: Open
 [2015-05-07 18:19 UTC] mark at briley dot com
Give me a day.  I'll post something as soon as I have it.
 [2015-05-07 18:46 UTC] mark at briley dot com
Ok!  I just finished modifying and running the same program.  For unknown reasons - now PHP won't go past 2GB.  Because of this I changed the program to the following:

<?php
#
#   We are going to need a LOT of memory!
#   We want 6GB of memory to work on this problem.
#
    ini_set( 'memory_limit', '3000M' );

    $strings = array();
#
#   Make 25,000,000 entries.
#
    for ($i = 0; $i < 3000000; $i++) {
        $strings[] = uniqid( $i, true );
    }
#
#   Duplicate all of them so we can see if the duplicates
#   sort next to each other.
#
    foreach( $strings as $k=>$v ){
       $strings[] = $v;
       }
#
#   Save it so we can edit it via VIM or some other editor.
#   All duplicate records should appear right next to each other.
#
    $a = implode( "\n", $strings );
    file_put_contents( "./out1.dat", $a );

    var_dump(strlen($strings[0]));   // 67584
    var_dump(count($strings));       // 1000
    $unique = array_unique($strings);
#
#   If the array_unique worked then all of the duplicates
#   should be gone.  Save it again so we can see if it
#   really DID remove all of the non-unique values.
#
    $a = implode( "\n", $strings );
    file_put_contents( "./out2.dat", $a );

    var_dump(count($unique));        // 10

#   int(24)
#   int(6000000)
#   int(3000000)

?>

Note the "3000M" and a reduced set to 3000000 in the FOR loop.  When I ran this the above three numbers were returned. This worked.  (So maybe I am hallucinating!)  I am going to go back to the original data and see if it still has problems.  (This is just too weird.)
 [2015-05-07 19:10 UTC] mark at briley dot com
-Status: Open +Status: Closed
 [2015-05-07 19:10 UTC] mark at briley dot com
Ugh.  I forgot I modified the program to use fopen, fgets, fwrite, and fclose because of the problems I was having.  I did not keep the original code.  :-(  Darn it.  I am sorting everything as I read it in now and removing duplicates as each line is read in.  :-(

I am going to try to recreate what I had, but as it now stands I cannot reproduce the problem because I completely changed the program and got rid of array_unique, sort, and every other PHP-specific function that handled arrays. :-(
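In rough terms the rewritten program now does something like this (a simplified sketch from memory, not the actual code; the output file name is made up):

#
#   Duplicates are dropped as each line is read by using the
#   trimmed line itself as an array key.
#
    $seen = array();
    $in   = fopen( "./items.sql", "r" );
    while( ($line = fgets( $in )) !== false ){
       $line = trim( $line );
       if( $line !== "" ){ $seen[$line] = 1; }
       }
    fclose( $in );
    ksort( $seen );                        # sort by the line text
    file_put_contents( "./items-clean.sql", implode( "\n", array_keys( $seen ) ) . "\n" );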

I'm setting the status to closed because I've gotten past this problem but as I said - I will see if I can recreate the code and if so - I'll post it.
 [2015-05-08 05:55 UTC] ab@php.net
-Status: Closed +Status: Re-Opened
 [2015-05-08 05:55 UTC] ab@php.net
For the huge memory_limit, please use some master x64 snapshot.  PHP 5 on Windows can't physically take this.

Thanks.
 [2015-06-02 17:59 UTC] ab@php.net
-Status: Re-Opened +Status: Feedback
 [2015-06-02 17:59 UTC] ab@php.net
@mark, if you're willing to test this, please do. Otherwise the ticket will be auto closed if there's no feedback.

Thanks.
 [2015-06-14 04:22 UTC] php-bugs at lists dot php dot net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Re-Opened". Thank you.
 [2015-06-15 13:48 UTC] mark at briley dot com
-Status: No Feedback +Status: Closed
 [2015-06-15 13:48 UTC] mark at briley dot com
Yes.  Close this.  As I said - as mysteriously as it came, it is now gone.  I can no longer re-create this problem, and I don't know why.  :-/  I hate reporting things that fix themselves for no apparent reason.  :-(
 [2015-06-15 14:29 UTC] cmb@php.net
-Status: Closed +Status: Not a bug
 