Friday, July 1, 2011

sample data in postgres

This article takes some sample data and imports it into Postgresql (i) using CSV import and (ii) using INSERTs from an export of MySQL

Now to get the data into postgres we could go back to the spreadsheet (csv loading approach) or maybe take a compatible export from mysql and load that (compatible loading approach)

The data we will be using can be found here and I am about to start working with this file.

( If you are coming from a MySQL background and want to just load the sample 'employees' database into Postgresql, then project at  this link will help with that )

CSV loading approach:

In postgres use COPY FROM to get data from the filesystem into the database.

But wait; surely you need to already created a table so as to have a table to load into?
Yep, otherwise you would see postgres complain about a missing relation like so...

postgres=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV QUOTE AS E'\042';
ERROR: relation "amd_bang_per_watt" does not exist


The table does not exist so you need to 'create table'.

I cover this in more detail in the next section 'Compatible loading approach' so either consult there or have a go yourself first perhaps.

I now assume you have the table amd_bang_per_watt created.

My table needed emptying first but you can likely ignore the next command:

: #root@156ns1(~) ;echo 'delete from amd_bang_per_watt' | psql amd_power_dissipation postgres
Password for user postgres:
DELETE 50

...and pick things up again here where I will try the COPY FROM:

amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV QUOTE AS E'\042';
ERROR: value too long for type character varying(10)
CONTEXT: COPY amd_bang_per_watt, line 1, column model_family: "Model Family"
amd_power_dissipation=#

and taking care of the header we execute

amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV HEADER QUOTE AS E'\042';
ERROR: null value in column "speed_power_ratio" violates not-null constraint
CONTEXT: COPY amd_bang_per_watt, line 4: ""X2 ","6??? x¹ ",2.3,,,"3.4 HT3 ",,,45,"x¹ 2009 ","
amd_power_dissipation=# select count(*) from amd_bang_per_watt;
count
-------
0
(1 row)

which is still not completing as we would wish so attempt 3:

amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV HEADER QUOTE AS E'\042' FORCE NOT NULL speed_power_ratio;
ERROR: invalid input syntax for type numeric: ""
CONTEXT: COPY amd_bang_per_watt, line 4, column speed_power_ratio: ""

...attempt 4...

amd_power_dissipation=# ;COPY amd_bang_per_watt (1,2,3,4,5,6,7,8,9,10) FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV HEADER QUOTE AS E'\042';
ERROR: syntax error at or near "1"
LINE 1: COPY amd_bang_per_watt (1,2,3,4,5,6,7,8,9,10) FROM '/tmp/amd...

...attempt 5...

amd_power_dissipation=# COPY amd_bang_per_watt (model_family,model,clock_speed,l2cache,l3cache,ht_bus_ghz,voltage,socket,tdp_watts,process_comments) FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV HEADER QUOTE AS E'\042';
ERROR: extra data after last expected column
CONTEXT: COPY amd_bang_per_watt, line 2: ""X2 II ","550 Black Edition ",3.1,"2x512k ","6MB ",2,"1.15-1.425 ","AM3 ",80,"45nm Callisto Q3-2009 ..."

...attempt 6...

amd_power_dissipation=# COPY amd_bang_per_watt (model_family,model,clock_speed,l2cache,l3cache,ht_bus_ghz,voltage,socket,tdp_watts,process_comments,process_comments) FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv' WITH CSV HEADER QUOTE AS E'\042';
ERROR: column "process_comments" specified more than once

...and still no success so back to the drawing board (or rather sed in fact):

sed 's/\,$/\,0/' <> amdAM3clockSpeedsAndWattage__200907.csv-truncated
: #root@156ns1(tmp) ;sed 's/\,$/\,0/' <> amdAM3clockSpeedsAndWattage__200907.csv-edited
: #root@156ns1(tmp) ;sed 's/\,\,\,/\,0\,\,/' <> amdAM3clockSpeedsAndWattage__200907.csv-edited2

Here is some dialogue to explain how I came to have three sed commands of which the first is now redundant:

amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv-truncated' WITH CSV HEADER QUOTE AS E'\042';
ERROR: missing data for column "speed_power_ratio"
CONTEXT: COPY amd_bang_per_watt, line 4: ""X2 ","6??? x¹ ",2.3,,,"3.4 HT3 ",,,45,"x¹ 2009 ""
amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv-edited' WITH CSV HEADER QUOTE AS E'\042';
ERROR: null value in column "clock_speed" violates not-null constraint
CONTEXT: COPY amd_bang_per_watt, line 10: ""X2 ","6??? x¹ ",,,"2MB ","DDR2 ",,,,"x¹ 65nm Kuma Q2/2008 ",0"
amd_power_dissipation=# COPY amd_bang_per_watt FROM '/tmp/amdAM3clockSpeedsAndWattage__200907.csv-edited2' WITH CSV HEADER QUOTE AS E'\042';
COPY 50

and now a few selects to check things look okay:

amd_power_dissipation=# select count(*) from amd_bang_per_watt;
count
-------
50
(1 row)

amd_power_dissipation=# select count(*),substr(process_comments,1,4) as nm from amd_bang_per_watt group by nm;
count | nm
-------+------
2 | x¹
3 | x¹ Q
1 | x¹ 4
6 | x¹ 2
19 | 65nm
15 | 45nm
4 | x¹ 6
(7 rows)

All looks well here.

The CSV files (original, and edited using sed) are listed below:
If recommend saving the above files to your filesystem and viewing them from there, however if you do instead open them directly in your browser and you see x¹ rather than x¹ then try using Konqueror instead.

Konqueror is a clickable install on most Linux desktop distributions and is available to Windows users via KDE on Windows project.


Compatible loading approach:

Here is an extract of some ruby code which I looked up, so as I have an idea of what mysql types map to what postgres types:

def convert_type(type)
case type
when "tinyint(1)"
"boolean"
when /tinyint/
"tinyint"
when /int/
"integer"
when /varchar/
"varchar"
when /decimal/
"decimal"
else
type
end
end
You might find a more accessible read in the form of the Mark Slade migration guide for mysql to postgres. That guide includes some discussion about datatype mappings.

What I wanted to check was that decimal in mysql would mean decimal in postgres, and, yes it seems that it does.

One recommendation for using decimal fields in any database is to always specify both parameters rather than relying on database defaults.
That way if you do end up exporting/importing as part of some migration, then you will not be tripped up by differing defaults for precision.

Now in part 3 of the 'sample data and mysql' postings I used --compatible=postgresql flag to produce these files:
Having following the mysqldump documentation link through to reading about server modes it seems that a reasonable expectation is as follows:
  • Mysql is going to do what it can to help you when you ask for --compatible=somerdbms
  • Mysql will try and avoid giving you sql output it knows for sure will cause somerdbms a problem.
  • Mysql does is not promising 100% compatibility but just goes some way to making your task less onerous.
In particular you should not expect --compatible=postgresql to spit out sql which postgres can feed on right away - there is still some work to do.

To give a quick illustration I took the .mysqldump4postgresSansOpt file and removed the row inserts and tried to get psql to execute it:

cat /tmp/amd_bang_per_watt.mysqldump4postgresCreateOnly | psql amd_power_dissipation postgres
Password for user postgres:
ERROR: syntax error at or near "@"
LINE 1: SET @saved_cs_client = @@character_set_client;
^
ERROR: unrecognized configuration parameter "character_set_client"
ERROR: syntax error at or near "COMMENT"
LINE 12: ...power_ratio" decimal(5,2) NOT NULL default '0.00' COMMENT 'b...
^
ERROR: syntax error at or near "@"
LINE 1: SET character_set_client = @saved_cs_client;

As you can see postgres is not happy.

So I clean it up a bit and retry:

: #root@156ns1(~) ;cat /tmp/amd_bang_per_watt.mysqldump4postgresCreateOnlyCleanedUp | psql amd_power_dissipation postgres
Password for user postgres:
CREATE TABLE

...better. Now what I had to do was get rid of the column comment for speed_power_ratio and get rid of the set type stuff mysql had placed before and after the create table block.

Here are the files as I worked on them:
So we have the table - lets get on run the inserts as follows (abbreviated):

: #root@156ns1(~) ;cat /tmp/amd_bang_per_watt.mysqldump4postgresInsertsOnly | psql amd_power_dissipation postgres
Password for user postgres:
INSERT 0 1
...
INSERT 0 1

The 'INSERT 0 1' feedback above is what we should expect given our setup here.
Further reading of the postgres documentation for INSERT and scanning down for oid should make things clear.

A few quick selects to see if things look okay:

postgres@ns1:~$ psql
Password:
Welcome to psql 8.3.7, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

postgres=# \l
List of databases
Name | Owner | Encoding
-----------------------+----------+----------
amd_power_dissipation | postgres | UTF8
postgres | postgres | UTF8
template0 | postgres | UTF8
template1 | postgres | UTF8
(4 rows)

postgres=# \dt
No relations found.
postgres=# \c amd_power_dissipation
You are now connected to database "amd_power_dissipation".
amd_power_dissipation=# select count(*) from amd_bang_per_watt;
count
-------
50
(1 row)

amd_power_dissipation=# select count(*),substr(process_comments,1,4) as nm from amd_bang_per_watt group by nm;
count | nm
-------+------
4 | x¹ 6
19 | 65nm
2 | x¹
3 | x¹ Q
1 | x¹ 4
6 | x¹ 2
15 | 45nm
(7 rows)

amd_power_dissipation=# select model,clock_speed,l3cache,tdp_watts,speed_power_ratio from amd_bang_per_watt where speed_power_ratio > 27;
model | clock_speed | l3cache | tdp_watts | speed_power_ratio
--------------------+-------------+---------+-----------+-------------------
550 Black Edition | 3.1 | 6MB | 80 | 38.75
545 | 3 | 6MB | 80 | 37.50
720 Black Edition | 2.8 | 6MB | 95 | 29.47
710 | 2.6 | 6MB | 95 | 27.37
705e | 2.5 | 6MB | 65 | 38.46
810 | 2.6 | 4MB | 95 | 27.37
900e | 2.4 | 6MB | 65 | 36.92
905e | 2.5 | 6MB | 65 | 38.46
910 | 2.6 | 6MB | 95 | 27.37
945 | 3 | 6MB | 95 | 31.58
9100e | 1.8 | 2MB | 65 | 27.69
9150e | 1.8 | 2MB | 65 | 27.69
(12 rows)

Sidenote: Having tried the substr() funtion, it seems that the sql statement shown below works unaltered in both mysql and postgres:

select count(*),substr(process_comments,1,4) as nm from amd_bang_per_watt group by nm;


Conclusion:

The --compatible option of mysql goes some way to getting your data into postgres and this article hopefully gives a flavour of what manual steps you might have to take.

Having worked through both options for getting the data in (csv and --compatible), I have to say that I am much in favour of using --compatible=postgresql and going that way.

No comments: