Should users edit data in serialized text files?
Let me begin with an allegory.
❝We work at a great software shop. We are tasked with writing a large, complex piece of software. The domain for this app includes a rich, hierarchical data structure, with a ton of cross-references between elements, and our user will need to manage all these data elements. Furthermore, we will have a lot of users on the system at once, and they will frequently work on overlapping parts of the domain model concurrently, as well as browsing the history of actions taken and how they affect the model. We expect our users to be mostly “advanced”, so we can show them a complex view of this complex domain, and present them with some power features. Our basic architecture will be a speedy user interface, which the user will spend a great deal of time in, and a server which stores the domain model elements.
This looks like a pretty hefty design challenge. Let’s say one guy on our team (Sloth) proposes the following:
We can store our complex data model by picking one level in the hierarchy, let’s say the parents of most of the leaves, and writing it to a file. The rest of the hierarchy will be roughly mapped to directories containing these files. Inside the file, we’ll start with a bit of metadata, listing all the cross-references to the parts of the domain outside the file. We’ll come up with a nifty grammar so that we can serialize into each file in a roughly human-readable format. [ed. Sloth has been reading about domain-specific languages lately and has been waiting for a chance to engineer one]. To provide long-term persistence over server restarts, and also to handle the versioning and history requirement, we’ll use a Version Control System like Mercurial, and save each new revision of the file there using an account owned by the user. So that’s how we’ll store the data.
The rest of the team is mostly nodding up to this point; they especially like his idea to use a VCS to handle persistence and versioning. One engineer points out that this means the versioning and history will only be available per-file, so they can’t version the leaves of the domain model. Everyone nods uncertainly, since that requirement wasn’t clear. Another engineer says he’s worried about concurrent editing: if there are conflicts between users, they’ll have to merge that serialized file, and that will happen more often if it contains a larger chunk of the model. He asks Sloth,
“Why do you want to store a subtree of depth 2 in the serialized file? We don’t want the user to have to edit those files by hand in case of a conflict, right?”
Sloth gets a little defensive, and replies, “Oh, I thought they would edit them by hand all the time. The UI can just show a master-detail view with the directory structure and the contents of the selected file. Then we won’t even need a server!”
Everyone gets very uncomfortable. The tech lead feels it’s his job to lead Sloth a little.
“In other programs I’ve worked on of this complexity, we don’t show the user the data model in such a raw form. We want to separate the metadata out, especially all those cross-links at the top. And what will we do if we want to change the presentation a little? We’ll be stuck with whatever is in our serialized data format.”
Another engineer pipes up. “Yeah, this data model is looking really complicated, I don’t see why we’d make the user hand-edit the data. Maybe if we were just a couple guys and didn’t have time to write a server, we’d have to do that. What about avoiding conflicts? There’s no way the users will be able to keep those files consistent when they make changes.”
The database engineer is feeling a little left out by the proposed file format, since it puts him out of work. He joins in on the criticism.
“All those cross-links between the data model elements are supposed to be one-to-many. If I link from element A to element B, I want to have some unique key in B that I can reference. That way, when B changes, we can find the incoming links and keep the data consistent. How are you going to have users maintain those keys?”
Sloth is sheepish now, but still defensive, so he sticks to his proposal. “They can just use whatever they want as keys. Any string would work. They just give a name to element B and then refer to it by name in A.”
“I dunno,” says the DB guy, “it seems to me that we’re going to make it really hard to maintain the consistency of the data model when users are mucking with the keys. If they want to give a name to a model element, that’s fine, but I don’t want to use their name as our internal key. What if they use the same name for several elements? Would you allow that? How would we figure out which is which?”
Sloth shoots himself in the foot: “I guess the users will just have to resolve those problems themselves. They can pick non-colliding names, maybe based on the directory the data file is in? And I see your point, there’s no good way for us to ensure the integrity of the data anymore, but that doesn’t seem to be a requirement. Maybe when we export the data to the upstream system, we can flag all the problems and make the user fix them then.”
“Oh god, that’s a terrible idea,” says the rest of the team, nearly in unison. “We’re not doing it that way,” says the tech lead.
❞
So in our allegory, Sloth gets stuck on his idea of having a cool DSL for encoding the data model, and by the end of the meeting, he’s killed the referential integrity of the system. What would you have done differently in this design? I know one thing I would do, even if we want to use the VCS to store some serialized files like Sloth suggests, is to show the users a more conceptual view of the data. It’s crazy in a system of this complexity to have no distinction between the form of the long-term persisted data and the presentation we show the user. Right?
Well, in my cynical way, my allegory is really about source code and IDEs. The poor users editing serialized data files in a DSL are you and me. If you look at it in a new light, like in this story, it sounds pretty crazy.
This whole notion came to me in a conversation with Jesse Wilson. The idea of ditching the serialized text file as the encoding of a program is entirely his. At first, I was really skeptical. It doesn’t seem to buy us much, and comes at a great cost. We have to ditch so many tools that rely on text files and directories. We also have to use an IDE that understands whatever alternative, better way we can come up with to encode a program.
But the more I think about this, and the more I work on the grammar for Noop, and the more I watch myself and my co-workers massage the ASCII text in a source file to align parameters, keep the imports tidy, and so on, the more it makes sense. And we could eliminate a lot of complexity from all our tools. Refactorings become easier to write, and our IDE doesn’t need to do incremental parsing of our text to keep its AST in sync. Even better, our IDE can become a lot smarter about how the program is presented to us, the users. I’m going to mock up some pages to show what I mean, sometime soon.
The biggest drawback is that you can’t edit this code on the command line anymore. You need to use an IDE. If you want to grep or sed your way through the files, you’re probably out of luck, unless you want to do it on the persisted form of the data, which might not be very human-readable. Looking towards the future, though, I see a lot of the compilation and testing steps moving to cloud-hosted farms, letting us increase the size of our software and its transitive dependencies. I see a need for tooling to get more sophisticated around helping us visualize and manage the interactions in our code. And I think we could spend less time thinking about how to format this text file full of code, organize our imports, remember that each file needs its own copyright, worry about whether the number of methods in this class makes the file too long, whether we want to make this new thing a public static inner class just to avoid making extra little files on the filesystem, and so on.
Another big win: we don’t have to have compile errors. If we manipulate the AST directly as we code, then the referential integrity of our program can be maintained with each transformation we apply to the source. We would never write software that forced the user to reconcile a bunch of data integrity problems at the end of a long series of operations, like we do with the compiler today.
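To make that concrete, here is a minimal sketch in Python of a program stored as nodes keyed by internal IDs rather than by user-chosen names. Everything in it (the `Model` class, its method names, the ID scheme) is hypothetical and invented for illustration; it is not how Noop or any existing IDE works. The point is only that when a reference holds a key instead of a name, a rename can’t break it, and an edit that would leave a dangling reference can be refused at the moment it is attempted.

```python
import itertools


class Model:
    """Toy program model: declarations keyed by internal ID, not by name."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.names = {}       # declaration ID -> display name chosen by the user
        self.references = {}  # reference ID -> ID of the declaration it points at

    def declare(self, name):
        decl_id = next(self._ids)
        self.names[decl_id] = name
        return decl_id

    def refer_to(self, decl_id):
        if decl_id not in self.names:
            raise KeyError(f"no declaration with ID {decl_id}")
        ref_id = next(self._ids)
        self.references[ref_id] = decl_id
        return ref_id

    def rename(self, decl_id, new_name):
        # Only the display name changes; every reference still holds the ID.
        self.names[decl_id] = new_name

    def delete(self, decl_id):
        # Refuse the edit instead of leaving dangling references behind.
        users = [r for r, target in self.references.items() if target == decl_id]
        if users:
            raise ValueError(f"declaration {decl_id} is still referenced by {users}")
        del self.names[decl_id]

    def render(self, ref_id):
        """Text shown to the user for a reference: the current display name."""
        return self.names[self.references[ref_id]]


model = Model()
total = model.declare("total")
use = model.refer_to(total)
model.rename(total, "grand_total")
print(model.render(use))  # grand_total
```

Note that two declarations could share the display name "total" here without any ambiguity, which is exactly the question the DB guy put to Sloth.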
In my last post, “Help! My IDE is full of compile errors!”, I wrote about the problem of keeping our IDEs fully and correctly configured for our project, and sharing their understanding of the project metadata with the build system. Well, if we are going to change the IDE in a major way by using a different data representation for our code, then we may as well have a server between the IDE and the VCS persistence layer. Now the server can maintain the correct state of the metadata for us, and keep it consistent across the team, and across my several machines.
I’m having trouble keeping this urge from taking root in Noop, my programming language. To be fair to Noop, this isn’t a language problem per se. Java could be made to work in this system – mostly. We want a unique key for an identifier other than a user-defined symbol, so we’d have to do some translation of the AST between parsing the Java files and showing it to the user. Also, to get finer-grained versioning, we’d have to break up methods into their own files, so the actual .java source file would be produced from a number of files. It just seems like a new language gives us a new start, and would save me time and effort trying to find a pretty way to serialize all the metadata like documentation into the source files with a complex grammar to maintain.
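As a rough sketch of that last point, the Python below assembles the text of one class from per-method fragments. The `assemble_class` function, the fragment keys like `m001`, and the use of a dict in place of separately versioned files are all assumptions made for illustration. A real system would also have to deal with imports, fields, ordering, and comments, none of which this attempts.

```python
def assemble_class(class_name, fragments):
    """Join per-member fragments into the text of one class.

    `fragments` maps a stable member key to that member's source text.
    Members are emitted in key order so the output is deterministic.
    """
    lines = [f"class {class_name} {{"]
    for key in sorted(fragments):
        for line in fragments[key].splitlines():
            lines.append("    " + line)
        lines.append("")
    if lines[-1] == "":
        lines.pop()  # drop the blank line after the last member
    lines.append("}")
    return "\n".join(lines) + "\n"


# Each value stands in for the contents of one small, separately versioned file.
fragments = {
    "m002": "int twice(int x) {\n    return 2 * x;\n}",
    "m001": "int zero() {\n    return 0;\n}",
}
print(assemble_class("Example", fragments))
```

Because each method lives under its own key, history and merges could be tracked per method, while the build still sees an ordinary source file.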
We could also re-write the VCS so we don’t need to serialize anything to files. But that seems pretty crazy, right?
Help! My IDE is full of compile errors!
I was working on some code just now, in a project I don’t normally work on. It turns out I need to make some changes there, and once I’m done, I won’t need to work on that project anymore. This is a common enough thing for engineers, I’m guessing.
So I loaded up the code in my IDE so I could start making sense of it. I followed some instructions provided by the project team, as far as what to check out from version control, and then I used the plugin in my IDE that deals with the build system, hoping to set up my project correctly. And by correctly, I mean there are a bunch of things I need in order to be happy and productive:
- All the right code is displayed, including all the source roots and modules
- All the needed dependencies are added as libraries, so the code compiles
- If any special development kit is needed, like the right JVM, it’s added
- Whatever plugins I’ve used when I worked on this project in the past should be installed and set up
- If there are multiple modules/sub-projects, they are also added correctly, and depend on each other as source, not via their compiled jars
- Source attached to all the binary dependencies, even third-party ones
- All the transitive dependencies are downloaded to my machine
- The right preferences are available, such as the Copyright block that should be pasted into new files, code style, import order, etc.
- If there are any special arguments needed to launch the program or the tests, those are added in my Run Configurations.
- Whenever I change any of these things, other people on my team have their IDE setup updated, and when they make changes, my IDE setup is updated
When you really get serious about your IDE being set up correctly, the state of your IDE becomes valuable. You really don’t want your setup to get borked, or to start over on a new machine, because it takes you a while to get through all these items at the various points they show up. Some of them never get set up right, like all the code formatting preferences, and we just slog through it with a partially-configured IDE.
This valuable state isn’t committed to the version control system, at least typically, because much of the information is duplicated from the files used by the build system to compile and test the program. That part of the IDE setup should be created or synced by some tool associated with our build system. But the rest of the settings are up to us; the build system doesn’t care about them, since it has a read-only view of the sources.
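A sync tool like that would need, at minimum, to notice when the two views of the project disagree. Here is a small Python sketch of that check. The `find_drift` function and the `name:version` strings are made up for illustration, and reading the real build files and IDE project files is left out entirely.

```python
def find_drift(build_deps, ide_libs):
    """Compare dependencies declared by the build with libraries the IDE knows."""
    build, ide = set(build_deps), set(ide_libs)
    return {
        "missing_in_ide": sorted(build - ide),  # the build needs it, the IDE lacks it
        "stale_in_ide": sorted(ide - build),    # the IDE has it, the build dropped it
    }


# Hypothetical dependency lists, as each tool might report them.
build_deps = ["libA:1.0", "libB:2.1"]
ide_libs = ["libB:2.0", "libA:1.0"]
print(find_drift(build_deps, ide_libs))
# {'missing_in_ide': ['libB:2.1'], 'stale_in_ide': ['libB:2.0']}
```

This only covers the part of the IDE state that duplicates the build files. The preferences the build system doesn’t care about have no second source to compare against.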
Why is this a big deal? When you’re working on a very large, modular, corporate codebase, you only have a nice, happy IDE configuration for the part of the project you work on regularly. You’re willing to put in that ramp-up effort on one machine, for that one module or subset of the code, and only for one version of the IDE, etc. You hope it’s only once. But what if you pair with other members of the team, whose setup might suck? Or you’re trying out a new version of the IDE, or have the project set up on a home computer and a laptop as well as your work machine? Or (and this is the real stretch) you’re doing a code review and want to browse the proposed changes in your IDE, maybe experimenting with some suggested improvements and running tests? These are all times when you have to work in an IDE-not-happy state.
As much as I appreciate that we all have different preferences for our IDE, and that we hold our positions very strongly, I think that’s the major contributor to this problem. If we had a true standard for project metadata, which allowed for the entire spectrum of metadata I bulleted earlier, and worked with both the build system and the IDE, then all the tools would work fine. But not all the tools cover the whole feature set – if the copyright on our project changes, IDEA 8+ users want their metadata to reflect that, but no one else has a box in the IDE to update. And we don’t want to go monkey with that project metadata by hand, since in most IDEs and build tools, it’s not very friendly (unless you’ve got a nice XPath/XQuery library in your head).
What’s the answer to this problem? Sadly, we have two impossible tasks: pick a build system and an IDE that we’ll make everyone use, and get that build system and that IDE to share their metadata, either by using the same location for it, or at least syncing everything correctly. I personally believe that if we really solved this, any coder who knows how to use an IDE well would appreciate the convenience of complete and correct metadata so much that it would be a ‘killer feature’, and they’d drop their current IDE for it. Crazy talk, yes. But I think there’s more we can fix once we go down this path, which I’ll write about next time.