Consuming an RSS Feed with ColdFusion
About a year ago I wrote a feed aggregator for a content management product. In doing so, I ran into a number of issues - some big, some small - that had me banging my head against a wall until I finally tracked them down. This morning, Bob Imperial asked a question on the CF-Talk mailing list related to one of those topics. I answered his question and went to work tracking down another, similar question that I knew I had responded to in the past. I thought it was the same question, so my intent was to blog the solution in case anyone else was having the same problem.
Turns out the questions were different, but were both related to XML handling in ColdFusion and both were issues that I encountered while using ColdFusion to consume and display RSS feeds. Some of these issues caused me many hours of frustration so maybe I can help someone else avoid the same hassle by blogging the process and all of the problems I can remember encountering.
I can't post all of the code since it's a commercial product and I don't own the code, but I can talk through some of the issues I ran into, speak to how I handled them and post isolated snippets where appropriate.
Issue I : Minimizing Bandwidth Usage
The first issue of consuming a feed programmatically is that you'll probably want the feed updated as a background process - likely as part of a scheduled process. This can burn a lot of bandwidth if you don't pull selectively. Nick Bradbury, developer of FeedDemon (a great aggregator for Windows) and, incidentally, of HomeSite, wrote about how he handled these issues in FeedDemon and his suggestions make perfect sense. I followed almost all of them. Here are the measures I took and how I implemented them:
- Use the If-None-Match (ETag) and If-Modified-Since request headers, sending back the ETag and Last-Modified values the server returned on the previous fetch. If the feed content hasn't changed, the server will return a 304 Not Modified status code and minimal bandwidth is used. Just use cfhttpparam to pass along each header value and then check the status code that is returned.
- Abide by the value set in the feed's ttl element. If set, the feed author is telling you roughly how often (in minutes) to expect the feed to be updated. If this value existed, I didn't allow users to update the feed at shorter intervals.
- Honor values set in the skipDays and skipHours elements, as well. Again, the author is telling you which hours or days the feed is not likely to be changed.
- Set the default update interval high. I set mine to 2 hours. Users could then lower that interval, if necessary (but only as low as the ttl value, of course). I found that very few did so.
- In the cfhttp tag, I set the useragent value so that the product name was specified. By doing so, I was giving feed authors the ability to block the user agent if the number of requests was deemed excessive or abusive. Because I followed the other rules, I never had a problem (that I knew about).
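The measures above can be sketched roughly as follows. This is a minimal illustration, not the product's code: the feed URL, the user agent string, and every variable name are placeholders, and it assumes the ttl and skipHours values and the validators from the previous fetch have already been read from wherever they are stored.

```cfml
<!--- Placeholders: in practice these come from the feed's stored settings
      and from the response headers of the last successful fetch. --->
<cfset feedUrl = "http://example.com/feed.xml">
<cfset savedETag = "">
<cfset savedLastModified = "">
<cfset ttlMinutes = 60>            <!--- feed's ttl element, 0 if absent --->
<cfset skipHours = "0,1,2,3">      <!--- GMT hours (0-23) from skipHours, may be empty --->
<cfset userIntervalMinutes = 120>  <!--- interval chosen by the user --->

<!--- Never poll more often than the ttl allows. The scheduler (not shown)
      would use this value to decide when the next run is due. --->
<cfset intervalMinutes = max(userIntervalMinutes, ttlMinutes)>

<!--- Skip this run if the current GMT hour is listed in skipHours. --->
<cfset currentGmtHour = hour(dateConvert("local2utc", now()))>
<cfif not listFind(skipHours, currentGmtHour)>

    <cfhttp url="#feedUrl#" method="get" result="feedResponse"
            useragent="MyFeedAggregator/1.0" timeout="30">
        <!--- Only send the validators we actually have. --->
        <cfif len(savedETag)>
            <cfhttpparam type="header" name="If-None-Match" value="#savedETag#">
        </cfif>
        <cfif len(savedLastModified)>
            <cfhttpparam type="header" name="If-Modified-Since" value="#savedLastModified#">
        </cfif>
    </cfhttp>

    <cfif left(feedResponse.statusCode, 3) eq "304">
        <!--- Not modified: keep using the stored copy of the feed. --->
    <cfelseif left(feedResponse.statusCode, 3) eq "200">
        <!--- Changed: keep the new content and remember the new validators. --->
        <cfset feedContent = feedResponse.fileContent>
        <cfif structKeyExists(feedResponse.responseHeader, "ETag")>
            <cfset savedETag = feedResponse.responseHeader["ETag"]>
        </cfif>
        <cfif structKeyExists(feedResponse.responseHeader, "Last-Modified")>
            <cfset savedLastModified = feedResponse.responseHeader["Last-Modified"]>
        </cfif>
    </cfif>

</cfif>
```

A skipDays check would follow the same pattern as the skipHours test, comparing the current GMT day name against the listed days.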
Issue II : "Content is not Allowed in Prolog" Error
After retrieving the feed, I wanted to ensure that it was valid. I didn't need to parse the feed because I was simply storing the downloaded content for later display via XSL, but I did want to ensure its validity so I could let the user know of any problems. To do so, I used isXML(). More than a few feeds reported a "Content is not allowed in prolog" error. To the naked eye, the feed looked fine, so I added some code to strip out any content that existed in front of the XML prolog:
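A minimal sketch of that kind of cleanup, not the product's code. The function name stripBeforeProlog is made up for illustration, and the sample string assumes the stray content is a byte order mark, which is a common culprit but not the only one:

```cfml
<!--- Hypothetical helper: drops everything before the first "<" so the
      document starts at the XML prolog (or at the root element). --->
<cffunction name="stripBeforeProlog" returntype="string" output="false">
    <cfargument name="content" type="string" required="true">
    <!--- ^[^<]* matches the leading run of characters that are not "<". --->
    <cfreturn reReplace(arguments.content, "^[^<]*", "")>
</cffunction>

<!--- Sample input: a byte order mark (U+FEFF) in front of the prolog. --->
<cfset rawFeed = chr(65279) & '<?xml version="1.0"?><rss version="2.0"><channel></channel></rss>'>
<cfset cleanFeed = stripBeforeProlog(rawFeed)>

<cfif isXML(cleanFeed)>
    <!--- Safe to store the feed for later display. --->
<cfelse>
    <!--- Tell the user that the feed is not well-formed. --->
</cfif>
```

Content that already begins with "<" passes through unchanged, so the helper can be applied to every feed.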
That did the trick.
Issue III : Character Sets/Encoding
Ultimately, the point of consuming a feed was to display it (or parts of it). When retrieving the feed, I was setting the character set to UTF-8 using cfhttp's charset attribute under the assumption that UTF-8 would handle most characters and display fine except in a very few edge cases. This didn't work at all (as you can see, I'm hardly an expert with character sets and encoding). I found that I needed to do everything I could to ask the feed which character set to use when retrieving it, so I did just that.
First, I made a HEAD request to the feed URI. Most feeds will return character set data in the response header. If cfhttp.charset contained a value, I used that. If no value was returned, I had to violate my bandwidth conservation policy and perform a GET request so that I could parse the XML prolog for the value of its encoding attribute:
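A minimal sketch of pulling the encoding out of the prolog, not the product's code. The function name getPrologEncoding is made up for illustration, and the regular expression is one reasonable way to do it rather than the only one:

```cfml
<!--- Hypothetical helper: returns the encoding declared in the XML prolog,
      or an empty string if no encoding is declared. --->
<cffunction name="getPrologEncoding" returntype="string" output="false">
    <cfargument name="content" type="string" required="true">
    <!--- Capture the quoted value of encoding="..." inside the <?xml ... ?> declaration. --->
    <cfset var match = reFindNoCase('<\?xml[^>]*encoding\s*=\s*["'']([^"'']+)["'']', arguments.content, 1, true)>
    <cfif arrayLen(match.pos) gte 2 and match.pos[2] gt 0>
        <cfreturn mid(arguments.content, match.pos[2], match.len[2])>
    </cfif>
    <cfreturn "">
</cffunction>

<!--- Usage with a sample prolog. --->
<cfset sampleFeed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss version="2.0"></rss>'>
<cfset feedCharset = getPrologEncoding(sampleFeed)>
```

For the sample above, feedCharset ends up as ISO-8859-1. A feed with no encoding attribute yields an empty string, which leaves the caller free to choose a default.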
If I couldn't find a value any other way, then UTF-8 would just have to do. Once I had a character set, I could make the "real" GET request to the feed URI using the proper charset value. I found that if I didn't retrieve the feed using the correct character set, I couldn't convert it properly later for display.
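Put together, the whole lookup might look something like the sketch below. It is an illustration under assumptions, not the product's code: the URL and variable names are placeholders, and it assumes cfhttp returns fileContent as text for the feed's MIME type.

```cfml
<cfset feedUrl = "http://example.com/feed.xml">
<cfset feedCharset = "">

<!--- 1. Cheap HEAD request: see whether the server reports a charset. --->
<cfhttp url="#feedUrl#" method="head" result="headResponse" timeout="30"></cfhttp>
<cfif structKeyExists(headResponse, "charset") and len(trim(headResponse.charset))>
    <cfset feedCharset = trim(headResponse.charset)>
</cfif>

<!--- 2. No luck: fetch the feed once and look for encoding="..." near the top. --->
<cfif not len(feedCharset)>
    <cfhttp url="#feedUrl#" method="get" result="probeResponse" timeout="30"></cfhttp>
    <cfset prologStart = left(probeResponse.fileContent, 200)>
    <cfset match = reFindNoCase('encoding\s*=\s*["'']([^"'']+)["'']', prologStart, 1, true)>
    <cfif arrayLen(match.pos) gte 2 and match.pos[2] gt 0>
        <cfset feedCharset = mid(prologStart, match.pos[2], match.len[2])>
    </cfif>
</cfif>

<!--- 3. Still nothing: UTF-8 will have to do. --->
<cfif not len(feedCharset)>
    <cfset feedCharset = "UTF-8">
</cfif>

<!--- 4. The "real" request, decoded with the charset found above. --->
<cfhttp url="#feedUrl#" method="get" charset="#feedCharset#"
        result="feedResponse" timeout="30"></cfhttp>
```

Storing the detected charset alongside the feed would avoid repeating the extra requests on every update.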
I'll blog some of the issues I had when displaying a feed in my next post.




