Tuesday, August 4, 2009

postgres - xml out (part 3 of 3) table_to_xml

Earlier in the 'postgres xml out' articles we used query_to_xml() to get data out of postgres in xml format.

Postgres also has the function table_to_xml() which is perfect for exporting the whole table in xml format.

Here is the function signature from the postgres documentation page:
table_to_xml(tbl regclass, nulls boolean, tableforest boolean, targetns text)

The first parameter will be 'amd_bang_per_watt' for all the examples in this article.
The fourth parameter I will not be making use of and so '' will be supplied.

The parameters that I will discuss (and vary for illustration) are the boolean second and third parameters, nulls and tableforest respectively.

Four examples of this output are listed below marked with the values I used for the booleans null and tableforest:
  1. amd_bang_per_watt-table_to_xmlFFheaderless.xml (tbl,False,False,'')
  2. amd_bang_per_watt-table_to_xmlFTheaderless.xml (tbl,False,True,'')
  3. amd_bang_per_watt-table_to_xmlTFheaderless.xml (tbl,True,False,'') ***
  4. amd_bang_per_watt-table_to_xmlTTheaderless.xml (tbl,True,True,'')
Option 3 has been marked with *** to indicate this is my preferred output form and I will be using the xml this form produces in the next paragraphs and in some later articles.

Using True for the parameter
nulls boolean
...gets you...
<voltage xsi:nil="true"/>
rather than omission (the voltage node being omitted altogether).

Using False for the parameter
tableforest boolean
means that each row will be wrapped in a
<row> and </row>
pairing which is nice and convenient.

Executing suitable queries in psql and chopping off the head and tail will produce files similar to the four files listed above. An example of *** (Option 3) is shown here:

psql amd_power_dissipation -c "select table_to_xml('amd_bang_per_watt',true,false,'');" | \
tail -n +3 | head -n -3 > amd_bang_per_watt-table_to_xmlTF.xml;

...and the commands in full are in this text file.

Extracting the 1st row, 47th row, and 50th (final) row of sample data from the file amd_bang_per_watt-table_to_xmlTFheaderless.xml, I create row1headerless.xml containing:
<model_family>X2 II </model_family>
<model>550 Black Edition </model>
<l2cache>2x512k </l2cache>
<l3cache>6MB </l3cache>
<voltage>1.15-1.425 </voltage>
<socket>AM3 </socket>
<process_comments>45nm Callisto Q3-2009 </process_comments>
and row47headerless.xml containing:
<model_family>X4 II </model_family>
<model>965 Black Edition </model>
<l2cache>4x512K </l2cache>
<l3cache>6MB </l3cache>
<voltage>1.15-1.425 </voltage>
<socket>AM3 </socket>
<process_comments>x¹ 45nm Deneb Q3-2009 </process_comments>
and row50headerless.xml containing:
<model_family>FX </model_family>
<model>FX-82 x¹ </model>
<l2cache>4x512KB </l2cache>
<l3cache>2MB </l3cache>
<voltage xsi:nil="true"/>
<socket>AM2+ </socket>

You might recall from previous postings regarding this sample data, that the data contains a superscript numeral character 'superscript one' (¹) having utf8 code 00B9 (sometimes appearing in code analysis output as \xc2\xb9).

In particular just glancing at the rows we just output I can see that row47 and row50 both contain superscript one in some of their fields.

The next post will attempt to deal with getting xml data into postgres from a file.

Monday, August 3, 2009

norge press in your native lang - rss in reader

The inspiration for this posting came from a short messsage (tweet) in which I pointed out a useful facility of google reader and this led to an enquiry.

Rather than answer the enquiry via a series of short messages, I thought a blog posting including some screenshots might be a more useful way.

I needed a practical example and that is where the 'Norge' in the title comes in. I take an interest in Norwegian news and thought this would be a good illustration of rss and the reader product from google.

From now on I will just refer to the software from reader.google.com as 'Reader' for brevity.

Here is a screenshot of what you see (on the left) when Reader is set up and includes a few feeds:

That screenshot has been cropped and the full width shot shows the actual feed entries in a list.

For the right hand pane, selected 'Expanded' rather than 'List' gives you:

*** At this point if you are already familiar with rss and feeds and the symbols for them and/or use reader software already, then you probably want to skip read the rest of this article. I am deliberately avoiding the point and click shortcuts available for adding rss and aiming this article at being instructive without being the quickest way to get to the goal. ***

The most important button for our use is at the top of the first screenshot and is labelled 'Add a subscription'.

Clicking 'Add a subscription' will bring up a small dialogue box where you should paste in the url of the rss feed.

The highlighted phrases in the previous sentence have links to Wikipedia should you require some explanation of the new terminology I just introduced.

Here are some example urls that are valid feed urls (as of time of writing):
  1. http://dictionary.reference.com/wordoftheday/wotd.rss
  2. http://gnubyexample.blogspot.com/feeds/posts/default
  3. http://googleblog.blogspot.com/atom.xml
  4. http://www.timesonline.co.uk/rssfeed/rssIndexFeed/0,,2,00.xml
Here are some unlikely urls for feeds:
  • http://www.google.com/index.html
  • http://www.w3.org/WAI/
  • http://www.gridstatus.com/
What is different about the first bunch of urls and the second bunch of urls? Well in an ideal world all rss feeds would end with .rss
However that is not the situation, yet there are some clues.
Dealing with the numbered entries (1) ends with .rss which is nice (2) has the word feeds in the path (3) ends with .xml (4) ends with .xml and mentions rss and feeds in the path.

You understand that none of this is an absolute guarantee, however it may be some help to anyone who has not looked at urls in close detail before.

The bunch of 3 urls I marked as unlikely do not mention rss or feed in the path and in particular the first ends index.html.
( The default page for a site is normally default.html or index.html and it would be a bit silly of the website owner to place an rss feed at such a url. )

Now back to the Norwegian news part of this story. I have visited Norway and enjoyed my stay in Oslo and Bergen. Since then I occasionally read up on what is going on in the news.

One of the newspapers with an online presence is Aftenposten and they used to provide a section 'News in English' which I headed for recently. It seems that this editorial is no longer funded so, disappointed, I had a hunt around for an alternative paper and came across The Norway Post which looked to fit the bill.

Now, I do not live in Norway or read the printed copies of Norwegian newspapers. Furthermore I do not really wish to compare the editorial content of these two sites, nor endorse either, as that is not the purpose of this article.

Speaking purely hypothetically, it may be that having digested all the articles at norwaypost.no, I am hungry for more news from that country and want Aftenposten.

Referring back to the title of this article, I now explain a useful feature of Reader which is 'automatic translation'. Google has provided such a thing for web pages for a while. I use it occassionally but it requires some cutting and pasting perhaps. Not so in reader.google.com.

Essentially once the feed url has been entered, there is a nice option 'Translate into my language'.

A feed marked Nyheter (News) is what I am looking for and in firefox 'Tools/Page Info/Feeds' shows me a promising url which I repeat below:
Pasting this into 'Add a subscription' in Reader brought up a list of news items (not in my language):

...and quickly flicking 'Translate into my language' option of Feed Settings gives the following:

Now I do not claim that google translator is fantastic, in fact in some cases the translation can be way off. However it being free, and a very convenient option of Reader, I will use it without putting too much store by the accuracy of it's results.

Based on the above sample of 3 news items I would have quickly read the first article summary and discarded the other two as suspect translations.

If this posting has got you interested in reader and you want to explore some of it's more advanced facilities then join the group dedicated to reader to swap hints and tips.

Saturday, August 1, 2009

postgres - xml out (part 2 of 3) - query_to_xml

Having shown previously several ways of getting data out of postgres in xml form, I thought there might be some benefit in looking at xmlforest, xmlattributes, and xmlelement.

The compact form for row data produced by 'quick report' of pgAdmin is a good candidate for a manual reconstruction, so i begin there.

First row of the sample data in compact form:
<row id="r1" number="1" c1="X2 II " c2="550 Black Edition " c3="3.1"
c4="2x512k " c5="6MB " c6="2" c7="1.15-1.425 " c8="AM3 " c9="80"
c10="45nm Callisto Q3-2009 " c11="38.75" />
could be generated by xmlelement and xmlattributes as follows:

drop sequence row_nbr;create temporary sequence row_nbr;
select xmlelement (name row, xmlattributes (
'r' || nextval('row_nbr') as id,
currval('row_nbr') as number,
model_family as c1, model as c2,
clock_speed as c3, l2cache as c4,
l3cache as c5, ht_bus_ghz as c6,
voltage as c7, socket as c8,
tdp_watts as c9, process_comments as c10,
speed_power_ratio as c11
)) from amd_bang_per_watt limit 1;

Now it seems silly to write a query where we know all the column names then replace them with c1,c2,c3. However this is just an exercise in reconstruction, so as to illustrate xmlelement and xmlattributes.

Now I will illustrate how to obtain the same row, but this time, using less of the attributes and treating every field as a node to obtain:
<model_family>X2 II </model_family>
<model>550 Black Edition </model>
<l2cache>2x512k </l2cache>
<l3cache>6MB </l3cache>
<voltage>1.15-1.425 </voltage>
<socket>AM3 </socket>
<process_comments>45nm Callisto Q3-2009 </process_comments>

...we simply run this query which uses xmlforest:
select xmlelement(name row, xmlforest(
model_family, model, clock_speed, l2cache, l3cache,
ht_bus_ghz, voltage, socket, tdp_watts,
)) from amd_bang_per_watt limit 1;
with a bit of sed postprocessing:
sed 's/></>\n</g' <> amd_bang_per_watt-xmlforested.txt
or if you want to avoid the postprocessing just use the handy postgres query_to_xml() function show here executing in psql:

amd_power_dissipation=# select query_to_xml('select * from amd_bang_per_watt
limit 1',true,true,'');
<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<model_family>X2 II </model_family>
<model>550 Black Edition </model>
<l2cache>2x512k </l2cache>
<l3cache>6MB </l3cache>
<voltage>1.15-1.425 </voltage>
<socket>AM3 </socket>
<process_comments>45nm Callisto Q3-2009 </process_comments>

Mysql --xml option produces row data in a 'node per field' form discussed previously. However in that output, the field name is an attribute, making it a more tricky task than our examples above.

If you are an XSL or XPath expert then you may well leap straight to more elegant solutions than what I have given here.
I have aimed this posting at postgres users who are not xml experts, however comments showing xsl or xpath ways are very welcome as additional insight.

The 'postgres xml out' postings continue with the next post dealing with more ways to get xml out of postgres.

In this post (sed command) I made use of the bash continuation character pair \newline as described at gnu.org