trustedsamurai commented on a Page, CSV to mySQL - Create Table and Insert Data  -  Oct 21, 2015

Hi Hawkee, I want to grab a csv from a geo server once a week and load it into a mySQL instance on AWS.

Would your script do this? I've tried to achieve this directly from mySQL with a LOAD DATA INFILE and always get security errors, even though I can enter the location of the CSV file into the browser URL and download it fine.

Hawkee  -  Oct 21, 2015

This will create the table the first time it runs, using the first row as the field names. After that it should continue to work as long as the file format doesn't change. If you run it more than once on the same file it'll create duplicates. Another thing to note is that the fields will all be created as varchar(250), so you may want to go back and change those to more meaningful types.

trustedsamurai  -  Oct 21, 2015

Thanks for writing back so promptly. I'm wondering if this is the best way to do this or whether I should try to use LOAD DATA INFILE from mySQL to bring the data in directly, without PHP.

Hawkee  -  Oct 21, 2015

Ideally you'd want to create a custom script that handles your particular format and generates insert statements, but if that's beyond your scope you could look for mySQL import software solutions.

trustedsamurai  -  Oct 21, 2015

Using PHP isn't beyond my abilities at all, but I'm keen to use the minimum number of steps to move the data around. I've got data generated by Server A that I want to get to the AWS mySQL instance on Server C. I can't run scripts on Server A, although I do have a web server (Server B) that can run PHP. It seems to me a better solution (fewer links, less PHP, more secure) if I get Server C to load the CSV directly from Server A and then tidy it up.

Besides, I've already developed some PHP script on Server B to do this and it keeps timing out at the two-minute mark. I realise I could probably increase the timeout limit, but what if the CSV file I'm working with one day is so big that increasing the timeout value no longer works?

Cheers,

John.

trustedsamurai  -  Oct 22, 2015

I'll just add to this as I am enjoying this conversation and others may benefit.
It turns out that I can't use the LOAD DATA INFILE statement in mySQL where the CSV/data file is addressed with a URL. Also, to make matters more painful, I can't put a LOAD statement inside a mySQL stored procedure, which is what I would want to do to automate the process.

So it looks like I DO need to reconsider PHP after all.

It's the time-outs on PHP that concern me the most. It would be good if I could chunk the file, but that really does look complicated. Besides, the first file I'm testing with is not that big: around 650k.

Hawkee  -  Oct 22, 2015

Are you running the script on Server B through your web browser? This sort of script should be run from the command line as web browsers aren't really designed to handle processing tasks like this. Ideally you'll upload the CSV with a cron script running every so often looking for a new file. If one exists then the import can proceed.
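The cron approach Hawkee describes can be sketched as a single crontab entry; the paths below are placeholders, not anything from this thread:

```
# Hypothetical crontab entry: every hour, run the import with the CLI PHP
# binary so web-request timeouts don't apply. Paths are placeholders.
0 * * * * /usr/bin/php /home/user/scripts/import_csv.php >> /home/user/logs/import.log 2>&1
```

The script itself can then check whether a new CSV exists before importing.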

trustedsamurai  -  Oct 22, 2015

I create the php file which resides on the web server (in a separate folder) then execute it from the URL.

This has been a great way to expand my understanding of PHP (and its interaction with mySQL) as the host environment supports all of these.

Are you saying the job might perform better if I were to schedule it with CRON?

Hawkee  -  Oct 22, 2015

You definitely want to run imports from the command line, so yes, cron is ideal, but you can also do it manually via SSH every time you want to import.

trustedsamurai  -  Oct 22, 2015

I had better go and work out what I need to do to connect to my shared host with SSH.

Thanks for your help.

Hawkee  -  Oct 22, 2015

I always choose my shared hosts based on whether or not they offer SSH. Some don't and those that do will sometimes require a copy of your driver's license to make it available.

trustedsamurai  -  Oct 23, 2015

OK. I've used PuTTY to access my shared host via SSH. I have two filesets to upload. The first contains a list of fish species and is approx 5,000 rows. This completed reasonably quickly. That's good news! It was timing out before when run from the URL in the browser. (Thanks Mr Hawkee!)

The next file has 8 columns and 250,000 rows. It's progressing at roughly 5,000 rows every 10 minutes, so I figure it will take over 8 hours to complete. When I used a LOAD DATA INFILE command from mySQL Workbench this same file transfer completed in 10 minutes, so I know it can go quicker.

Any thoughts on how to speed this part up?

Thanks again,

John.

Hawkee  -  Oct 23, 2015

It depends on what your script is doing. Are you querying for existing rows before inserting? If so, are you querying against indexed fields?

trustedsamurai  -  Oct 23, 2015

I have a staging table on the database that is truncated before I start the transfer. The staging table has an id column that auto-generates a unique number, and this is the primary key. But there's no other indexing on this table.

My script inserts row by row into the table.

Hawkee  -  Oct 23, 2015

Are there any select queries? That's the only thing I can imagine that would cause it to take so long to insert. Unless you're experiencing latency between the database server and your shared server.

trustedsamurai  -  Oct 23, 2015

Not a SELECT statement as such, but the process does work line by line, which means it's writing line by line to the database. I understand that batch processing would be faster.

Here is a subset of the code. You can see the top section grabs the current row of the CSV, then below it INSERTs.

if (($handle = fopen($csv_file, "r")) !== FALSE) {
    fgetcsv($handle); // skip the header row
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $num = count($data);
        for ($c = 0; $c < $num; $c++) {
            $col[$c] = $data[$c];
        }

        $col1 = $col[1];                       // SpeciesID
        $col2 = $col[2];                       // SurveyID
        $col3 = str_replace("'", "", $col[3]); // Longitude
        $col4 = str_replace("'", "", $col[4]); // Latitude
        $col5 = str_replace("'", "", $col[5]); // Realm
        $col6 = str_replace("'", "", $col[6]); // Country
        $col7 = str_replace("'", "", $col[7]); // State
        $col8 = str_replace("'", "", $col[8]); // Geometry

        /*
        $col9  = str_replace("'", "", $col[9]);  // frequency
        $col10 = str_replace("'", "", $col[10]); // abundance
        $col11 = str_replace("'", "", $col[11]); // commonfamilyname
        $col12 = str_replace("'", "", $col[12]); // commonclassname
        $col13 = str_replace("'", "", $col[13]); // commonphylumname
        $col14 = $col[14];                       // lmax
        */

        // SQL query to insert the current row into the database
        $sql = "INSERT INTO locations(speciesid,locationid,longitude,latitude,realm,country,state,geometry) VALUES(".$col1.",'".$col2."','".$col3."','".$col4."','".$col5."','".$col6."','".$col7."','".$col8."')";
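Since batch processing came up above, here's one way the row-by-row loop could be batched into multi-row INSERT statements, which usually cuts database round trips dramatically. This is only a sketch: it assumes an existing mysqli connection in `$conn` and reuses the column layout from the snippet above, neither of which is confirmed by the original code.

```php
<?php
// Sketch (assumption: $conn is an open mysqli connection, $csv_file exists).
// Rows are accumulated and flushed as one multi-row INSERT per $batchSize rows.
$batch = [];
$batchSize = 500;

if (($handle = fopen($csv_file, "r")) !== FALSE) {
    fgetcsv($handle); // skip the header row
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        // Escape every value before it goes into the SQL string.
        $vals = array_map(function ($v) use ($conn) {
            return "'" . $conn->real_escape_string($v) . "'";
        }, array_slice($data, 1, 8)); // columns 1..8, as in the loop above

        $batch[] = "(" . implode(",", $vals) . ")";

        if (count($batch) >= $batchSize) {
            $conn->query("INSERT INTO locations (speciesid,locationid,longitude,latitude,realm,country,state,geometry) VALUES " . implode(",", $batch));
            $batch = [];
        }
    }
    if ($batch) { // flush the final partial batch
        $conn->query("INSERT INTO locations (speciesid,locationid,longitude,latitude,realm,country,state,geometry) VALUES " . implode(",", $batch));
    }
    fclose($handle);
}
```

Prepared statements with bound parameters would be safer still, but even simple batching like this tends to dominate the cost of per-row queries over a network link.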

Hawkee  -  Oct 23, 2015

Seems simple enough. Could it be a latency issue between servers causing the delay? You might just want to generate an import file and use the LOAD commands to bring it in. That'll be much faster.

trustedsamurai  -  Oct 23, 2015

Would you explain the context of the 'import file' and the 'LOAD' commands? I have tried a LOAD DATA INFILE command from mySQL Workbench and this only works with local files, not with URL addresses.

Hawkee  -  Oct 23, 2015

If you can connect to the remote mySQL server from your PHP server then you can use "LOAD DATA LOCAL INFILE" when the file is on your PHP server.
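For reference, run from the PHP server that holds the file, the statement would look something like this; the file path, delimiters, and line endings are placeholders that would need to match the actual CSV:

```sql
-- LOCAL makes the *client* read the file, so it works when the CSV
-- sits on the PHP server rather than the database server.
LOAD DATA LOCAL INFILE '/home/user/data/species.csv'
INTO TABLE locations
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

Note that both the server and the client must have the LOCAL capability enabled for this to work.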

trustedsamurai  -  Oct 23, 2015

Oh wow. I couldn't get the LOAD DATA LOCAL INFILE to work when running in the mySQL Workbench environment because the file being imported was addressed in a URL. But it never occurred to me to run this LOAD from the host server where the CSV file is residing.

Results: the 250,000 row CSV estimated to take 8 hours to load took 15 seconds!

Nice one.... Now I just need to get the CSV files from the geo-server via a URL to my host server where this new script can pick them up. Then CRON it and I'm away! :)

Thanks for the tip.

trustedsamurai  -  Oct 23, 2015

OK. While North America sleeps the real work gets done! {joke}.

I was able to use file_get_contents to save a local CSV file from a geo-server-generated file.
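For anyone following along, the download step can be done in a few lines; the URL and destination path below are placeholders, not the actual geo-server addresses from this thread:

```php
<?php
// Sketch: fetch the remote CSV and save it where LOAD DATA LOCAL INFILE
// (or the import script) can pick it up. URL and path are placeholders.
$url  = 'https://example.com/export/species.csv';
$dest = '/home/user/data/species.csv';

$csv = file_get_contents($url);
if ($csv === false) {
    exit("Download failed\n"); // bail out so cron doesn't import a stale file
}
file_put_contents($dest, $csv);
```

Checking the return value matters here: file_get_contents returns false on failure, and silently loading yesterday's file would be worse than a visible error in the cron log.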

Everything is now running in a series of php scripts that are included in a single "master" script and all I have to do is schedule this "master" from cron.

The only thing that concerns me is error handling. I've got some try/catches, but admit I probably should build more in.

Anyway, it's looking very good. All thanks to you!

cheers.

John.

Hawkee  -  Oct 24, 2015

Awesome! I'm happy to have helped.
