Premature Optimization and Servers

More than one person has suggested I’m guilty of violating the law of premature optimization when it comes to my server work.

Here’s the thing, though: when it comes to database schema, I really, really want to get it right before shipping.

Making code changes in a client app is normal. Making database schema changes in a client app is a pain, but not the worst thing.

Making code changes on the server is normal too, though a little hairy. But the hairiest of all is database schema changes on the server. I’m designing so that I don’t ever need to do that. (I may not reach that goal. Time will tell.)

Even though Brian Reischl wrote up how to do data migration, and so I have a good plan if I ever need to go there, I just don’t ever want to go there.

In other words: getting the server-side database schema right right now isn’t premature — it’s exactly the right time.

More on UUIDs and Clustered Indexes

My SQL Server genius pointed me to this article by Kimberly Tripp about the problem with GUIDs as primary and/or clustering key. (GUIDs and UUIDs are the same thing in this context. Microsoft folks often call them GUIDs.)

Another article suggests this isn’t a big deal with Azure SQL because a network write is slower than a page split anyway.

But the advice still seems to be that UUIDs don’t make the best clustering key. You want something narrow to keep down index size.

So I’ve slept on it. Do I still like this layout for the notes table?

id - auto-incrementing integer clustering primary key
noteID - UUID, unique
userID - int

Yes, I like it.

It’s the API?

Cesare Rocchi argues that the Heartbleed problem isn’t C — it’s the API.

My counter-argument: people make mistakes. People make dumb APIs. With C, combine a dumb API and a mistake and you get Heartbleed.

That’s far less likely with another language.

Here’s the thing: we will always have dumb ideas and mistakes. We can and should do our best to eliminate them, but we’ll never succeed entirely. Because we know that, we’re negligent if we don’t do our best to minimize their consequences.

Book Idea

Graham Lee notes that computing turns 100 probably within your lifetime and proposes a book about programming that I want.

Vesper Sync Diary #13 part 3 - Thinking Too Much

There are two pieces of advice I’ve been getting:

One is that I’m thinking too much about this. It’ll be fine if I have a properly normalized schema and I use the appropriate indexes. After that, don’t worry.

(I admit that I’m prone to going down every performance rabbit-hole I can find.)

The other — partly related — is that the right way to deal with the notes table is to do create a clustered primary key as userID + noteID. This way all notes by a given user will be together.

And this is the default behavior. It’s good. Smarter people than I am have thought about this.

And I can drop the integer identity column.

Update a few hours later: No. Wait. The best database guy I know tells me to do it the way I was thinking: surrogate key integer identity column as clustering key.

He also suggests I don’t need to add a unique constraint for noteID + userID, since noteID is a UUID. A unique constraint on noteID is all that’s needed.

SettleApp

I helped Ondřej with some of the wording on the home page for SettleApp.

Some advice for developers in general, off the top of my head:

  • Use words that humans use. (No “algorithms.”)

  • Use verbs. Verbs are vivid. Verbs trigger people’s imaginations — they picture themselves using the app.

  • Use curly quotes.

  • Don’t use comma splices.

  • Don’t talk about how you rewrote any part of your app. Nobody cares.

  • Don’t talk about how it’s native or about how it’s ready for the latest version of the OS. These things are assumed.

  • Don’t be repetitive.

  • Simplify.

  • Cut, cut, cut, cut, cut, cut.

  • Cut.

Vesper Sync Diary #13 part 2 - Maybe It’ll Be UUIDs After All

I slept on the thing about using 53-bit integers as note IDs. I woke up and it felt weird — slightly but detectably. I listen to those feelings.

As previously mentioned, I’m still learning my new environment. Any new environment takes time. The thing to do, when things seem weird, is more research.

Though the initial problem was a JavaScript problem, the thing I’m trying to optimize is a SQL Server database. So I’m doing some more research on the database.

UUID data type

SQL Server has a UNIQUEIDENTIFIER type. (Sorry for the shouting, but database people are shout-y.)

I assume this saves space over just using a 36-character UUID string — I figure it strips the dashes, at least.

That’s wider than a 64-bit integer, and I don’t love it, but it’s not the worst thing ever to happen. The great thing about UUIDs is that they can be created on the client and guaranteed (for practical, real-world values of “guaranteed”) not to collide.

(Remember that, in Vesper’s case, a note ID need be unique only for a given user.)

I really wanted 64-bit integers. But I’m willing to accept UUIDs because it’s not weird and is a standard practice.

Clustered indexes

Azure SQL Server insists that each table has a clustered index — that is, an index which tells the database how to physically lay out the data.

The primary key is by default the clustered index. The primary key for the notes table is noteID + userID.

Recall that UUIDs look like this:

4C958F48-A332-44D2-A0AD-3E91EF172C6D
0ACA46EE-86F8-4443-BF51-A52A4506FD6C

There’s no order to them. So, to insert a new note could mean inserting it anywhere in the table rather than at the end.

That sucks.

(Yes, I know about NEWSEQUENTIALID(), but I’m generating IDs on the client.)

But here’s the thing: I had this exact same problem even before using UUIDs, because I was using random integers.

What I want instead is a way to add a row at the end, like adding a page to a book. For that I’d need to create a separate identity column with a monotically-increasing integer — like adding a page number — and then make that the clustering key. (The primary key and clustering key do not have to be the same.)

Alternate approach

Having primary key as noteID + userID, and a separate identity clustering key, is just about the same thing as this layout:

Primary key: identity integer (page number style)
Constraint: noteID + userID is unique

This alternate approach feels more right.

My goal is to make noteID + userID unique, and to enable fast lookups. Adding a constraint (which adds an index) makes that work. I don’t also have to make it the primary key.

Do I still have the problem of JavaScript and big integers?

Does my code ever need to see the identity integer? The clients don’t — it’s a server-side detail only.

But if my server-side JavaScript code needs to reference it, then once it passes 9007199254740992 notes then we’re screwed.

You know what? I’m fine with that. To hit that number, every human on Earth would have to create over a million notes. I won’t worry about it. I’d be a billionaire long before then, and my blog posts will be me figuring out which island I should buy.

(With that much money this problem would be solvable, in other words. Solvable by other people who I pay very, very well.)

Summary of changes

I think it’s a plan.

Client will have to make these changes:

  • Generate note IDs as UUIDs.
  • Store note IDs as UUIDs. (Change database schema and data model objects.)
  • Fix code that expects note.uniqueID to be a 64-bit integer.

Server will have to make these changes:

  • Update notes and deletednotes tables to have an integer identity clustered primary key.
  • Update notes and deletednotes tables so that noteID is a UUID.
  • Updates notes and deletednotes tables with a unique constraint on noteID + userID.

The good news: it’s possible that none of the server-side JavaScript will have to change. If there are any changes, they’re few and small.

The bad news: there’s more client-side code to change than I’d like. Normally a schema change in a development version isn’t a rough thing, but it does mean going into the database migration code again, which I don’t like doing.

The other good news: but the database migration will actually be simpler, since I was using UUIDs for note IDs in Vesper 1.0. I can preserve those UUIDs across the migration, where until just now I was changing note ID to 64-bit integers. So that’s cool. Simpler data migration is better.

PS The only thing that gives me pause — and something always gives me pause — is the idea that I could make userID + noteID the primary clustered key. (userID first.) This would mean that notes don’t get added at the end of the table — but all notes for a given user would be stored together. I think. I don’t have an easy way to test this, but my gut says that adding notes at the end is the better way to do this, and I should just rely on that unique constraint index to make lookups fast.

This is where the value of knowing your environment comes in. If I had five years, say, of SQL Server experience, I’d probably just know the best answer.

Well, the only way to learn an environment is to work in it, and that’s what I’m doing. Later I may look back and laugh. (Wouldn’t be the first time. Not even close.)

PPS If it seems like I’m working things out by writing them up here, it’s because I am.

Also, if it seems like I sometimes have to try everything before I get to an answer I like — well, that’s often also true.

PPPS Why did 53-bit integers seem weird enough to bug me? I have to plan for all options. There’s always the chance that another developer helps with this code — web app, for instance. Or we could make the API open. (Doubtful, but I have to plan for craziness.) Would the 53-bit limit be observed? It’s a data corruption bug waiting to happen. Even though it’s one little thing — an upper limit on note IDs — it’s too much complexity. Sync needs to be as simple as possible.

Drew’s Core Data Sync

Ensembles 1.0 ships:

Today, Ensembles is a framework worthy of consideration if you want to add sync to a Core Data app. As you would expect from a 1.0 release, it is robust and stable, and is already shipping in a handful of apps.

[Sponsor] Briefs

If you’re reading this, you’re likely a developer of impeccable taste who cares and sweats the details. Don’t you want to work with designers who care as much about those same details? If so, ask them to check out Briefs.

It will let them create professional quality prototypes from their mockups and wireframes using a powerful Mac application. A brief is easily shared and can be deployed to iOS and Android devices. They’re great for quick demos. Briefs lets you worry about the final product, while your designer is busy playing with ideas that might or might not work out.

Recommend Briefs to every designer you work with and make your life easier. Send them to http://giveabrief.com and tell them they can try out Briefs for free.

Vesper Sync Diary #13: Unlucky Numbers

I’ve been writing Cocoa apps for so long that I’m rarely surprised. I don’t know everything — surprises happen, and of course there are things that I know I don’t know — but, in the main, I know the intricacies well enough that I don’t even have to think about them.

But writing syncing means learning a new environment. For me it’s JavaScript, Node.js, and Azure. And I just ran across a big surprise in JavaScript.

If you have Node.js installed on your system, you can open Terminal and do this.

Launch Node. (Just type node and hit return.)

Copy-and-paste the following number and hit return: 9223372036854775807.

Node evaluates the expression. What do you think you get?

> 9223372036854775807
9223372036854776000

That’s right. It’s not the same number.

Some of you will note that the number I chose is the maximum value of a 64-bit integer. The problem occurs with sufficiently large integers. Here’s the deal: JavaScript numbers are 64-bit floating point numbers, which means they don’t have enough precision for very large integers.

I had no idea.

I should have known. Well, it’s a thing I needed to learn, and I just learned it.

Why Care?

When does a note-taking app need to deal with very large numbers? After all, if someone types a big number into a note, it doesn’t matter, because the entire note is treated as a string.

Here’s the thing: unique IDs for notes and users are 64-bit integers in my system.

So when the iOS app sends a note to the server, it sends a 64-bit integer along with note text and other properties of the note.

JavaScript code takes that data and then inserts the data in the database.

The “takes that data” part is where JavaScript turns the note’s unique ID from 9223372036854775807 (for instance) to 9223372036854776000 (for instance). We can’t have that.

Options

I’m writing this up as soon as I learned about it. I have to figure out what to do.

The first thing: user IDs might as well be 32-bit integers. (JavaScript can handle 32-bit integers accurately.)

Before we get anywhere near 2147483647 (max for 32-bit integers), we will have hired a team of the brightest server-side programmers in the world, and they’ll figure out what to do.

But that still leaves note IDs.

Option 1: stringify

On the server side, treat the unique ID as a string. JavaScript won’t mess with it if it’s a string. Strings are tough and opaque.

Problem: database bloat. Storing strings takes up way more space than storing 64-bit integers.

Plus it’s inelegant.

Option 2: 32-bit integers

Collisions are more likely on assigning a random note ID when using 32-bit integers instead of 64-bit integers.

Now, the client app does check for collisions by looking at the note IDs it knows about.

But there may be a few note IDs it doesn’t know about (that haven’t been synced from another client, for instance).

That’s a very, very small risk. Most of the time there won’t be unknown note IDs, and when there are it will tend to be very few.

But still: ugh.

(Remember that global note ID collisions aren’t an issue. A given note ID has to be unique for a given user only.)

Option 3: 53-bit integers

The largest integer JavaScript can handle is 2-to-the-53rd power.

> Math.pow(2, 53)
9007199254740992

The downside is still that there’s a chance of collisions — but, really, that’s a sufficiently large number size that I’m not at all concerned. (Remember that the client app checks, so the chance of a collision is really with a new note ID and any note IDs the client doesn’t know about yet, which will usually be zero anyway.)

And — it just occurred to me that I can add a second collision check: if a note comes from the server that matches the unique ID of an existing note, and the creation dates of the notes are not identical, then I can deal with that. (By giving one of the two notes a new ID and syncing.)

This is what I’ll do.

Twitter Feed for This Blog

I just set up @inessential_com as a Twitter feed for this blog.

I don’t post everything (just longer things) to my @brentsimmons account. The new @inessential_com account will get everything.

NSDictionary Tackled

Bartosz Ciechanowski does a deep dive into NSDictionary. I love this and stuff like this.

Mark on What’s Really Dangerous

Mark Bernstein: It’s Not C.

Any computer language capable of doing real work is also capable of being confusing, capable of being misused, capable of being subverted.

Mark’s right. Free computing is free speech.

I certainly don’t recommend banning C, as if anyone could. I don’t use it much, but when performance is an issue, I do. (Note that I need to update that code: needs more braces.)

However, it’s worth thinking about less-dangerous alternatives. Programmers will always make mistakes, and that’s why we do automated testing, static analysis, code reviews, and so on. That same reason — the inevitability of mistakes — is a reason to use higher-level languages when possible.

Unfortunately, “when possible” is just not very often when it comes to portable libraries such as OpenSSL. What would you write it in, if not C? I don’t know, but I think it makes sense to have an alternative language for this where you don’t have to be so damn careful with memory, where things like buffer over-runs are impossible.

I suppose that static analysis tools could get good enough to catch these errors. Maybe. C is so flexible that I just don’t know if they’ll ever be good enough. But if they get that good, then the culture and the tools have to make it so that no developer would ever fail to run and pay attention to static analysis results. That seems like a longshot to me, but I’d be pleased to be wrong.

An even bigger longshot is counting on developers to get better at testing. That’s like counting on people not to drink and drive. The culture has done a good job at stigmatizing this particular bad idea — but it still happens. That’s not to say we shouldn’t try — we should, definitely, for sure — but alone it’s not enough to fix the problem.

Though the OpenSSL bug affected servers, I note that most people don’t write their web services and sites in C. It’s do-able. You could, and surely some people are, somewhere. But to most of us it sounds crazy, and for good reason.

As computing evolves, the domains where C “sounds crazy” will continue to expand. Consider that people are already treating JavaScript as assembler (see CoffeeScript and TypeScript). As an industry we continue to move toward higher-level languages, and that’s a good thing.

Big Nerd Dependency Injection

Graham Lee: Dependency Injection, iOS and You:

The reason brittle object graphs are bad is that you cannot easily replace parts of the application. If an object expects to ask its environment for a load of other objects around it, then you cannot simply tell it that it should be using another object. Dependency injection fixes that. It tells the object, “Hey, these are the objects you should work with,” so that if we want to change the collaborators we just inject different things.

I’d link to everything published by Big Nerd Ranch. You should subscribe to its RSS feed.

NSEgoItems

Uli Kusterer suggests that app makers reconsider before adding an NSStatusItem.

Feature Request for Text Editors

(Maybe there’s a text editor that does this that I don’t know about. But I haven’t found one.)

I have my Node.js project open in my text editor. Folders and files are in the left-hand sidebar. Click on one to show that file. You know the picture.

(Wouldn’t have to be Node.js. Could be any programming project.)

Now I’m looking at a file that references a function that I want to look up. As an Xcode user, I expect to be able to cmd-click on the name and go right there, even if it’s in another file. And then I expect to be able to go right back to where I was with a keyboard shortcut. (Or with a swipe on my mouse.)

But it doesn’t work. The text editor is smart enough to be able to give me a list of the symbols in each file, but it’s not putting this all together.

I’ve made this work before — at least the navigate-to-symbol part — by maintaining a Ctags file. But this is a pain. It needs to be updated when things change.

What I want is the obvious thing: the text editor — which knows what language each file is, which knows where the root folder is — should handle creating and updating the Ctags file behind the scenes. It should do this for me.

And the second part of this — going back easily — is just as critical.

My point: a significant chunk of programming is just navigating, and no text editors (that I know of) get this completely.

It should work like a browser. Forward and back.

(For bonus points: it’s not file-based but based on position in file. Say I’m at line 300, and I go to the top of the file to include another file. I want to jump back to line 300 the same way I’d go back to a separate file.)

If there is a text editor that does this — all of this — let me know what it is because I want it right now.

That Pretty Much Wraps it Up for C

I love C. So damn much. I enjoy writing straight C.

I love having some memory and some abstractions and the flexibility to do whatever I want. I love that it imposes a discipline that prevents me from using that flexibility in stupid ways.

But between goto fail and the Heartbleed bug I have to wonder if my beloved language should be retired — at least for everything that could be a security issue. (Which is potentially everything, I suppose.)

There are two things that make other languages better for these things: 1) in many languages these particular bugs are impossible, and 2) there are often better static analysis tools that can prove that those particular flaws don’t exist in a chunk of code.

The Heartbleed bug is a major hassle. What worries me is that the next time something else might happen — maybe the power grid goes down or ATMs stop working. (Or worse.)

If we’re serious about protecting ourselves from the NSA and other malevolent entities, maybe we have to move away from C.

(I suggest this with great reluctance. And a realization that the cost of this would be huge.)

Backend

BaasBox is another open source backend for iOS, Android, and JavaScript apps.

“Security”

Were the National Security Agency an agency charged with the security of our nation, it would have reported the Heartbleed bug immediately instead of exploiting it.

Archive