Should users edit data in serialized text files?

Tuesday, December 8th, 2009 | Software Development

Let me begin with an allegory.


❝We work at a great software shop. We are tasked with writing a large, complex piece of software. The domain for this app includes a rich, hierarchical data structure, with a ton of cross-references between elements, and our user will need to manage all these data elements. Furthermore, we will have a lot of users on the system at once, and they will frequently work on overlapping parts of the domain model concurrently, as well as browsing the history of actions taken and how they affect the model. We expect our users to be mostly “advanced”, so we can show them a complex view of this complex domain, and present them with some power features. Our basic architecture will be a speedy user interface, which the user will spend a great deal of time in, and a server which stores the domain model elements.

This looks like a pretty hefty design challenge. Let’s say one guy on our team (Sloth) proposes the following:

We can store our complex data model by picking one level in the hierarchy, let’s say the parents of most of the leaves, and writing it to a file. The rest of the hierarchy will be roughly mapped to directories containing these files. Inside the file, we’ll start with a bit of metadata, listing all the cross-references to the parts of the domain outside the file. We’ll come up with a nifty grammar so that we can serialize into each file in a roughly human-readable format. [ed. Sloth has been reading about domain-specific languages lately and has been waiting for a chance to engineer one]. To provide long-term persistence over server restarts, and also to handle the versioning and history requirement, we’ll use a Version Control System like Mercurial, and save each new revision of the file there using an account owned by the user. So that’s how we’ll store the data.

The rest of the team is mostly nodding to this point; they especially like his idea to use a VCS to handle persistence and versioning. One engineer points out that this means the versioning and history will only be available per-file, so they can’t version the leaves of the domain model. Everyone nods uncertainly, since that requirement wasn’t clear. Another engineer says he’s worried about the concurrent editing: if there are conflicts between users, they’ll have to merge that serialized file, and that will happen more often if it contains a larger chunk of the model. He asks Sloth,

“Why do you want to store a subtree of depth 2 in the serialized file? We don’t want the user to have to edit those files by hand in case of a conflict, right?”

Sloth gets a little defensive, and replies, “Oh, I thought they would edit them by hand all the time. The UI can just show a master-detail view with the directory structure and the contents of the selected file. Then we won’t even need a server!”

Everyone gets very uncomfortable. The tech lead feels it’s his job to lead Sloth a little.

“In other programs I’ve worked on of this complexity, we don’t show the user the data model in such a raw form. We want to separate the metadata out, especially all those cross-links at the top. And what will we do if we want to change the presentation a little? We’ll be stuck with whatever is in our serialized data format.”

Another engineer pipes up. “Yeah, this data model is looking really complicated, I don’t see why we’d make the user hand-edit the data. Maybe if we were just a couple guys and didn’t have time to write a server, we’d have to do that. What about avoiding conflicts? There’s no way the users will be able to keep those files consistent when they make changes.”

The database engineer is feeling a little left out by the proposed file format, since it puts him out of work. He joins in on the criticism.

“All those cross-links between the data model elements are supposed to be one-to-many. If I link from element A to element B, I want to have some unique key in B that I can reference. That way, when B changes, we can find the incoming links and keep the data consistent. How are you going to have users maintain those keys?”

Sloth is sheepish now, but still defensive, so he sticks to his proposal. “They can just use whatever they want as keys. Any string would work. They just give a name to element B and then refer to it by name in A.”

“I dunno,” says the DB guy, “it seems to me that we’re going to make it really hard to maintain the consistency of the data model when users are mucking with the keys. If they want to give a name to a model element, that’s fine, but I don’t want to use their name as our internal key. What if they use the same name for several elements? Would you allow that? How would we figure out which is which?”

Sloth shoots himself in the foot: “I guess the users will just have to resolve those problems themselves. They can pick non-colliding names, maybe based on the directory the data file is in? And I see your point, there’s no good way for us to ensure the integrity of the data anymore, but that doesn’t seem to be a requirement. Maybe when we export the data to the upstream system, we can flag all the problems and make the user fix them then.”

“Oh god, you can’t be serious,” says the rest of the team, nearly in unison. “We’re not doing it that way,” says the tech lead.



So in our allegory, Sloth gets stuck on his idea of having a cool DSL for encoding the data model, and by the end of the meeting, he’s killed the referential integrity of the system. What would you have done differently in this design? I know one thing I would do, even if we want to use the VCS to store some serialized files like Sloth suggests, is to show the users a more conceptual view of the data. It’s crazy in a system of this complexity to have no distinction between the form of the long-term persisted data and the presentation we show the user. Right?

Well, in my cynical way, my allegory is really about source code and IDEs. The poor users editing serialized data files in a DSL are you and me. If you look at it in a new light, like in this story, it sounds pretty crazy.

This whole notion came to me in a conversation with Jesse Wilson. The idea of ditching the serialized text file as the encoding of a program is entirely his. At first, I was really skeptical: it didn’t seem to buy us much, and it came at a great cost. We would have to ditch so many tools that rely on text files and directories. We would also have to use an IDE that understands whatever alternative, better way we can come up with to encode a program.

But the more I think about this, and the more I work on the grammar for Noop, and the more I watch myself and my co-workers massage the ASCII text in a source file to align parameters, keep the imports tidy, and so on, the more it makes sense. And we could eliminate a lot of complexity from all our tools. Refactorings become easier to write, and our IDE doesn’t need to do incremental parsing of our text to keep its AST in sync. Even better, our IDE can become a lot smarter about how the program is presented to us, the users. I’m going to mock up some pages to show what I mean, sometime soon.

The biggest drawback is that you can’t edit this code on the command line anymore. You need to use an IDE. If you want to grep or sed your way through the files, you’re probably out of luck, unless you want to do it on the persisted form of the data, which might not be very human-readable. Looking towards the future, though, I see a lot of the compilation and testing steps moving to cloud-hosted farms, letting us increase the size of our software and its transitive dependencies. I see a need for tooling to get more sophisticated around helping us visualize and manage the interactions in our code. And I think we could spend less time thinking about how to format this text file full of code, organize our imports, remember that each file needs its own copyright, worry about whether the number of methods in this class makes the file too long, whether we want to make this new thing a public static inner class just to avoid making extra little files on the filesystem, and so on.

Another big win: we don’t have to have compile errors. If we manipulate the AST directly as we code, then the referential integrity of our program can be maintained with each transformation we apply to the source. We would never write software that forced the user to reconcile a bunch of data integrity problems at the end of a long series of operations, like we do with the compiler today.
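To make that concrete, here is a minimal sketch in Python of an editor model that refuses any edit that would break a reference, instead of reporting the breakage later. Everything in it (the `Program` class, `define`, `refer_to`, `delete`, and `IntegrityError`) is a hypothetical name I made up for illustration; it is not how Noop or any real IDE works, just the shape of the idea.

```python
import itertools


class IntegrityError(Exception):
    """Raised when an edit would leave a dangling reference."""


class Program:
    """A toy program model: definitions, and the references that point at them."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.definitions = {}  # definition id -> display name
        self.references = {}   # reference id -> definition id

    def define(self, name):
        def_id = next(self._ids)
        self.definitions[def_id] = name
        return def_id

    def refer_to(self, def_id):
        # A reference can only be created against an existing definition,
        # so an "undefined symbol" never gets into the model.
        if def_id not in self.definitions:
            raise IntegrityError(f"no definition with id {def_id}")
        ref_id = next(self._ids)
        self.references[ref_id] = def_id
        return ref_id

    def delete(self, def_id):
        # Refuse the edit now rather than flag a broken program later.
        users = [r for r, d in self.references.items() if d == def_id]
        if users:
            raise IntegrityError(
                f"{len(users)} reference(s) still point at definition {def_id}"
            )
        del self.definitions[def_id]


program = Program()
total = program.define("total")
program.refer_to(total)
try:
    program.delete(total)  # would orphan the reference above
except IntegrityError as err:
    print("edit rejected:", err)
```

A real tool would probably offer to delete or retarget the references as part of the same operation rather than just saying no, but the point stands: the model is consistent after every step.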

In my last post, “help, my IDE is full of compile errors”, I wrote about the problem of keeping our IDEs correctly configured for our project, and of sharing their understanding of the project metadata with the build system. Well, if we are going to change the IDE in a major way by using a different data representation for our code, then we may as well have a server between the IDE and the VCS persistence layer. Now the server can maintain the correct state of the metadata for us, and keep it consistent across the team, and across my several machines.

I’m having trouble keeping this urge from taking root in Noop, my programming language. To be fair to Noop, this isn’t a language problem per se. Java could be made to work in this system – mostly. We want a unique key for an identifier other than a user-defined symbol, so we’d have to do some translation of the AST between parsing the Java files and showing it to the user. Also, to get finer-grained versioning, we’d have to break up methods into their own files, so the actual .java source file would be produced from a number of files. It just seems like a new language gives us a new start, and would save me time and effort trying to find a pretty way to serialize all the metadata like documentation into the source files with a complex grammar to maintain.
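Here is a rough Python sketch of the “unique key other than a user-defined symbol” idea. The names (`Symbol`, `SymbolTable`, `render_call`) and the choice of random UUIDs as keys are my own assumptions for the example, not a design for Noop. The stored form of a call holds keys; the names only appear when we render it for the user.

```python
from dataclasses import dataclass
import uuid


@dataclass
class Symbol:
    key: str   # internal key, never typed by the user
    name: str  # user-chosen label, free to change


class SymbolTable:
    def __init__(self):
        self._symbols = {}

    def declare(self, name):
        key = uuid.uuid4().hex  # assumption: any collision-free key would do
        self._symbols[key] = Symbol(key, name)
        return key

    def rename(self, key, new_name):
        # One field changes; nothing that refers to the symbol is touched.
        self._symbols[key].name = new_name

    def name_of(self, key):
        return self._symbols[key].name


def render_call(table, callee_key, arg_keys):
    """Produce the text a user would see for a call that is stored by key."""
    args = ", ".join(table.name_of(k) for k in arg_keys)
    return f"{table.name_of(callee_key)}({args})"


table = SymbolTable()
fn = table.declare("compute")
x = table.declare("x")
call = (fn, [x])  # the persisted form holds keys, not names

before = render_call(table, *call)
table.rename(fn, "computeTotal")
after = render_call(table, *call)
print(before, "->", after)  # compute(x) -> computeTotal(x)
```

With this shape, a rename is an update to one record rather than a search-and-replace across files, and two symbols that happen to share a name still have different keys.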

We could also re-write the VCS so we don’t need to serialize anything to files. But that seems pretty crazy, right?

Comments on “Should users edit data in serialized text files?”

Jeff Miller
December 8, 2009

Visual Age for Java’s ENVY repository went exactly down the lines you describe, including method-level version history. It worked great except when it got corrupted, and then you were in a world of hurt, wondering how much history you’d lost.

So I would expect there’s wisdom (or at least experience) there that might inform Noop’s attitude toward storing code as a consistent, fully attributed binary form.

Andy Andrews
March 9, 2010

I feel that what coders like most about an IDE is that it makes their compiled language closer to an interpreted one. I like how Python united syntax and formatting.

So users could edit any data, provided the editor monitored them in real time for consistency and any other consequences of interest. If you assume an interpreted versus a compiled environment, a lot of these problems disappear.

About Me

I'm Alex Eagle. I live in Sunnyvale, CA and I'm a code monkey.

eag...@captcha.me
LinkedIn.com/in/AlexEagle
Twitter.com/jakeherringbone
