16 May 2006

How NetNewsWire 2.1 Reads Feeds

People have been asking me how NetNewsWire 2.1 reads feeds. Does it read feeds from NewsGator or from the sites themselves?

I’ll explain how it works, and also explain why it works the way it does.

How it works

If you have NewsGator syncing turned on, then NetNewsWire reads feeds from NewsGator Online.

If you do not have NewsGator syncing turned on, then NetNewsWire reads feeds from the original sources, just like it always has.

In other words, you can choose. It’s up to you.

(Also: if you have NewsGator syncing turned on, and there’s an outage or network error connecting to NewsGator Online, NetNewsWire will temporarily read feeds from the original sources. This way you still get your news.)
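The rule above is simple enough to sketch in a few lines of Python. This is only an illustration of the decision, not NetNewsWire’s actual code: `read_feed` and the two fetch callables are hypothetical names, and treating an outage as an `OSError` is an assumption of the sketch.

```python
from typing import Callable


def read_feed(
    feed_url: str,
    syncing_enabled: bool,
    fetch_from_newsgator: Callable[[str], str],
    fetch_from_origin: Callable[[str], str],
) -> str:
    """Return feed data, choosing the source as described in the post.

    Both fetch callables are hypothetical placeholders supplied by the caller.
    """
    if not syncing_enabled:
        # Syncing off: read from the original source, as always.
        return fetch_from_origin(feed_url)
    try:
        # Syncing on: read from NewsGator Online.
        return fetch_from_newsgator(feed_url)
    except OSError:
        # Outage or network error: temporarily fall back to the original source.
        return fetch_from_origin(feed_url)
```

The fallback branch is the part that keeps your news flowing during an outage; everything else is just the on/off choice.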

I’ll get to why it works this way—but, first, let’s start at the top.

The Big Picture

Server-based systems (like NewsGator Online, Bloglines, FeedLounge, and so on) have advantages—but desktop readers like NetNewsWire have advantages too.

The big advantage of server-based readers is that if you can get to a browser you can read your feeds. You don’t have to be in front of your normal computer.

I was determined not to let NetNewsWire users miss out on this just because they chose to use a desktop reader.

But I wanted to go even further: to make it so you could sync not just multiple copies of NetNewsWire and an online reader, but also other aggregators running on other platforms.

It’s important, in this increasingly cross-platform and multiple-gadget world, to be able to get your feeds anywhere, anytime.

As I’ve written before, it’s not just about syncing, it’s about ubiquity, having your feeds and read/unread status available wherever you are.

I’m aware that not everybody needs this yet. But tons of people already do, and more and more people will. (I think, ultimately, that pretty much everybody will. The number of people who will use just one aggregator on just one machine will be small.)

Server-based readers

Let’s set desktop apps aside for the moment and look at server-based readers.

Server-based readers each have a big list of feeds that their users subscribe to. When a system reads a given feed, it reads that feed one time on behalf of 10 or 100 or 100,000 subscribers.

Makes sense, right? If a feed has 100 subscribers, the system wouldn’t read it 100 times an hour, it would read it once an hour. Same if it has 1 subscriber or 100,000 subscribers—it gets read once an hour. (Or whatever the update period is.)
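To put rough numbers on it, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that every reader polls once an hour; `requests_per_day` is a made-up helper, not anything from NewsGator.

```python
def requests_per_day(subscribers: int, polls_per_day: int = 24, shared: bool = False) -> int:
    """Requests a publisher sees per day for one feed.

    shared=True models a server-based reader that polls once for everybody;
    shared=False models every subscriber polling on their own.
    """
    readers = 1 if shared else subscribers
    return readers * polls_per_day


direct = requests_per_day(100_000)
via_server = requests_per_day(100_000, shared=True)
print(direct, via_server)
```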

Server-based readers and subscriber counts

If you’re a feed publisher, you’re probably thinking that this sounds pretty good. It can save you a ton in bandwidth costs—you’re not penalized for popularity.

But you might also be wondering, “Well, how can I know how many subscribers I have?”

The answer is simple: NewsGator and many other systems report the number of subscribers in the user-agent string, so the count shows up in your server logs. You’ll see things like this:

NewsGatorOnline/2.0 (http://www.newsgator.com; 4796 subscribers)

It just comes right out and tells you how many subscribers there are—no need to count unique IP addresses (which wouldn’t give an accurate count anyway).
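If you want to pull that number out of your logs, something like the following Python sketch would do it. The regular expression is my assumption about the general shape of these strings; only the NewsGator example above is taken from a real user-agent, and other systems may format theirs differently.

```python
import re
from typing import Optional

# Matches a count followed by "subscriber" or "subscribers" in a user-agent string.
SUBSCRIBERS_RE = re.compile(r"(\d+)\s+subscribers?\b", re.IGNORECASE)


def subscriber_count(user_agent: str) -> Optional[int]:
    """Return the reported subscriber count, or None if the string has none."""
    match = SUBSCRIBERS_RE.search(user_agent)
    return int(match.group(1)) if match else None


ua = "NewsGatorOnline/2.0 (http://www.newsgator.com; 4796 subscribers)"
print(subscriber_count(ua))
```

For a total, you would take the latest count per server-based reader and add those up, rather than summing every log line.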

Syncing

Now imagine that you want different RSS clients—that run on various operating systems and even on phones and PDAs and various gadgets—to sync.

Your first thought might be to do something like IMAP. Well, it turns out that email and RSS are quite different, no matter how much they appear similar on the surface. But still, the basic idea of IMAP syncing—the idea of having a smart server that different clients talk to—is sound.

You could use the under-the-hood part of a server-based system as a sync server, as a kind of IMAP for RSS. (It’s not that much like IMAP, as I said, but it’s a convenient way to think about it.)

Using a server solves some tough problems and provides some nice efficiencies.

The big tough problem: the lack of unique IDs

For any two different aggregators to sync, they need a way to refer to news items in a feed. Many feeds have unique IDs—but, importantly, many feeds do not have unique IDs.

(Think of a unique ID as something like a social security number. There may be 100 brown-haired, green-eyed guys named John Smith who live in Kentucky, but they each have a different social security number, a different unique ID.)

So what NewsGator’s system does is assign unique IDs to each news item. This way all the different clients have an agreed-on way to refer to news items, which makes syncing possible. These assigned unique IDs appear in the feeds coming from NewsGator, which NetNewsWire and other clients read.

These unique IDs have to appear inside the feeds, inside the news items they refer to—or else you’d have the problem of not knowing what ID applies to what item.

(What if all feeds everywhere contained unique IDs? They don’t. I can only dream about a parallel universe.)

Other info in the feed

The feeds also contain other information—such as whether or not the item is read. Since we already put a unique ID in the feed, it makes sense to also put the status in the feed, so a separate call to get the status is not always necessary.

Here, for example, is a snippet from a feed:

<ng:postId xmlns:ng="http://newsgator.com/schema/extensions">883945119</ng:postId>
<ng:read xmlns:ng="http://newsgator.com/schema/extensions">True</ng:read>
<ng:avgRating xmlns:ng="http://newsgator.com/schema/extensions">5.000000</ng:avgRating>

The ng:postId is the unique ID and ng:read is the read/unread status. (Also notice ng:avgRating. NetNewsWire doesn’t use that yet, but it’s there. You could imagine that we might put even more info in the feeds later.)
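For anyone writing a client, here is a hedged sketch of reading those two elements with Python’s standard library. The namespace URI and element names come from the snippet above; the surrounding `<item>` wrapper, the title, and the `sync_info` helper are assumptions made so the example is self-contained.

```python
import xml.etree.ElementTree as ET

NG = "http://newsgator.com/schema/extensions"

# The three ng: elements are copied from the post; the <item> wrapper is assumed.
ITEM_XML = """
<item>
  <title>Example item</title>
  <ng:postId xmlns:ng="http://newsgator.com/schema/extensions">883945119</ng:postId>
  <ng:read xmlns:ng="http://newsgator.com/schema/extensions">True</ng:read>
  <ng:avgRating xmlns:ng="http://newsgator.com/schema/extensions">5.000000</ng:avgRating>
</item>
"""


def sync_info(item: ET.Element) -> dict:
    """Extract the assigned unique ID and read status from one item."""
    post_id = item.findtext(f"{{{NG}}}postId")
    read_text = item.findtext(f"{{{NG}}}read", default="False")
    return {"post_id": post_id, "read": read_text.strip().lower() == "true"}


item = ET.fromstring(ITEM_XML.strip())
info = sync_info(item)
print(info)
```

The ID is kept as a string here on purpose: a client only needs to compare it, not do arithmetic with it.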

Efficient updates

A big efficiency gain is how updates happen: NetNewsWire asks NewsGator which feeds have changes, and then reads just those feeds. To be clear, this is possible because NewsGator has downloaded the feeds and knows what’s in them.
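In outline, the update cycle looks something like this Python sketch. The `changed_feeds` and `fetch` callables are hypothetical stand-ins for whatever calls a client actually makes; this shows the shape of the idea, not the real NewsGator API.

```python
from typing import Callable, Dict, Iterable, List


def refresh(
    subscriptions: Iterable[str],
    changed_feeds: Callable[[List[str]], List[str]],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Fetch only the feeds the server reports as changed.

    changed_feeds and fetch are hypothetical placeholders supplied by the caller.
    """
    subscribed = list(subscriptions)
    # One question to the server replaces a request per feed.
    changed = set(changed_feeds(subscribed))
    # Keep subscription order and ignore anything we are not subscribed to.
    return {url: fetch(url) for url in subscribed if url in changed}
```

With 200 subscriptions and three changed feeds, that is one question plus three downloads instead of 200 polls.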

As an engineer, I love this kind of thing—the old-fashioned system where every client polls every feed all the time is kind of wasteful: there’s way more traffic than is necessary.

Summing Up

The way this works, syncing and downloading feeds are one and the same operation. That makes it more efficient (in terms of bandwidth and CPU usage), and it solves the big problem of unique IDs.

It’s a good thing for feed publishers, too: publishers get an easier, more accurate way to count subscribers and they get lower bandwidth costs.

But, finally, there’s choice—if you don’t care about syncing, you don’t have to use it, and NetNewsWire will download feeds from the original sources just like it always has.

P.S. There’s an API

If you want to know more about how all this works, the SOAP and REST APIs are documented.

P.P.S. No, you can’t rely on links or titles or anything else

Whenever the issue of unique IDs comes up, people say, “Well, why not just use the links as unique IDs? Or the titles?”

The answer is that they’re not unique. I’ve seen feeds where every single item has the same title and the same link to the home page. And remember, too, that news items are frequently edited, and the title, link, or both may change.
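A tiny Python sketch shows what goes wrong. The items below are invented for illustration, modeled on the kind of feed where every item shares a title and a home-page link.

```python
# Invented example items: same title, same link, different content.
items = [
    {"title": "News", "link": "http://example.com/", "body": "First story"},
    {"title": "News", "link": "http://example.com/", "body": "Second story"},
    {"title": "News", "link": "http://example.com/", "body": "Third story"},
]

# Keying items by (title, link) collapses all three into a single entry.
by_title_and_link = {(item["title"], item["link"]): item for item in items}
print(len(items), len(by_title_and_link))
```

Mark one of those as read and a client keyed this way has no idea which one you meant.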

Unique IDs are needed, and if they don’t reliably come from the original feeds, they have to come from somewhere.