Should users edit data in serialized text files?
Let me begin with an allegory.
❝We work at a great software shop. We are tasked with writing a large, complex piece of software. The domain for this app includes a rich, hierarchical data structure, with a ton of cross-references between elements, and our user will need to manage all these data elements. Furthermore, we will have a lot of users on the system at once, and they will frequently work on overlapping parts of the domain model concurrently, as well as browsing the history of actions taken and how they affect the model. We expect our users to be mostly “advanced”, so we can show them a complex view of this complex domain, and present them with some power features. Our basic architecture will be a speedy user interface, which the user will spend a great deal of time in, and a server which stores the domain model elements.
This looks like a pretty hefty design challenge. Let’s say one guy on your team (Sloth) proposes the following:
We can store our complex data model by picking one level in the hierarchy, let’s say the parents of most of the leaves, and writing it to a file. The rest of the hierarchy will be roughly mapped to directories containing these files. Inside the file, we’ll start with a bit of metadata, listing all the cross-references to the parts of the domain outside the file. We’ll come up with a nifty grammar so that we can serialize into each file in a roughly human-readable format. [ed. Sloth has been reading about domain-specific languages lately and has been waiting for a chance to engineer one]. To provide long-term persistence over server restarts, and also to handle the versioning and history requirement, we’ll use a Version Control System like Mercurial, and save each new revision of the file there using an account owned by the user. So that’s how we’ll store the data.
The rest of the team is mostly nodding up to this point; they especially like his idea to use a VCS to handle persistence and versioning. One engineer points out that this means the versioning and history will only be available per-file, so they can’t version the leaves of the domain model. Everyone nods uncertainly, since that requirement wasn’t clear. Another engineer says he’s worried about concurrent editing: if there are conflicts between users, they’ll have to merge that serialized file, and that will happen more often if the file contains a larger chunk of the model. He asks Sloth,
“Why do you want to store a subtree of depth 2 in the serialized file? We don’t want the user to have to edit those files by hand in case of a conflict, right?”
Sloth gets a little defensive, and replies, “Oh, I thought they would edit them by hand all the time. The UI can just show a master-detail view with the directory structure and the contents of the selected file. Then we won’t even need a server!”
Everyone gets very uncomfortable. The tech lead feels it’s his job to lead Sloth a little.
“In other programs I’ve worked on of this complexity, we don’t show the user the data model in such a raw form. We want to separate the metadata out, especially all those cross-links at the top. And what will we do if we want to change the presentation a little? We’ll be stuck with whatever is in our serialized data format.”
Another engineer pipes up. “Yeah, this data model is looking really complicated, I don’t see why we’d make the user hand-edit the data. Maybe if we were just a couple guys and didn’t have time to write a server, we’d have to do that. What about avoiding conflicts? There’s no way the users will be able to keep those files consistent when they make changes.”
The database engineer is feeling a little left out by the proposed file format, since it puts him out of work. He joins in on the criticism.
“All those cross-links between the data model elements are supposed to be one-to-many. If I link from element A to element B, I want to have some unique key in B that I can reference. That way, when B changes, we can find the incoming links and keep the data consistent. How are you going to have users maintain those keys?”
Sloth is sheepish now, but still defensive, so he sticks to his proposal. “They can just use whatever they want as keys. Any string would work. They just give a name to element B and then refer to it by name in A.”
“I dunno,” says the DB guy, “it seems to me that we’re going to make it really hard to maintain the consistency of the data model when users are mucking with the keys. If they want to give a name to a model element, that’s fine, but I don’t want to use their name as our internal key. What if they use the same name for several elements? Would you allow that? How would we figure out which is which?”
Sloth shoots himself in the foot: “I guess the users will just have to resolve those problems themselves. They can pick non-colliding names, maybe based on the directory the data file is in? And I see your point, there’s no good way for us to ensure the integrity of the data anymore, but that doesn’t seem to be a requirement. Maybe when we export the data to the upstream system, we can flag all the problems and make the user fix them then.”
“Oh god, you’re retarded.” says the rest of the team, nearly in unison. “We’re not doing it that way,” says the tech lead.
❞
So in our allegory, Sloth gets stuck on his idea of having a cool DSL for encoding the data model, and by the end of the meeting, he’s killed the referential integrity of the system. What would you have done differently in this design? I know one thing I would do, even if we want to use the VCS to store some serialized files like Sloth suggests, is to show the users a more conceptual view of the data. It’s crazy in a system of this complexity to have no distinction between the form of the long-term persisted data and the presentation we show the user. Right?
Well, in my cynical way, my allegory is really about source code and IDEs. The poor users editing serialized data files in a DSL are you and me. If you look at it in a new light, like in this story, it sounds pretty crazy.
This whole notion came to me in a conversation with Jesse Wilson. The idea of ditching the serialized text file as the encoding of a program is entirely his. At first, I was really skeptical. It doesn’t seem to buy us much, and comes at a great cost. We have to ditch so many tools that rely on text files and directories. We also have to use an IDE that understands whatever alternative, better way we can come up with to encode a program.
But, the more I think about this, and the more I work on the grammar for Noop, and the more I watch myself and my co-workers massage the ascii text in a source file to align parameters, keep the imports tidy, and so on, the more it makes sense. And we could eliminate a lot of complexity from all our tools. Refactorings become easier to write, and our IDE doesn’t need to do incremental parsing of our text to keep its AST in sync. Even better, our IDE can become a lot smarter about how the program is presented to us, the users. I’m going to mock up some pages to show what I mean, sometime soon.
The biggest drawback is that you can’t edit this code on the command line anymore. You need to use an IDE. If you want to grep or sed your way through the files, you’re probably out of luck, unless you want to do it on the persisted form of the data, which might not be very human-readable. Looking towards the future, though, I see a lot of the compilation and testing steps moving to cloud-hosted farms, letting us increase the size of our software and its transitive dependencies. I see a need for tooling to get more sophisticated around helping us visualize and manage the interactions in our code. And I think we could spend less time thinking about how to format this text file full of code, organize our imports, remember that each file needs its own copyright, worry about whether the number of methods in this class makes the file too long, whether we want to make this new thing a public static inner class just to avoid making extra little files on the filesystem, and so on.
Another big win: we don’t have to have compile errors. If we manipulate the AST directly as we code, then the referential integrity of our program can be maintained with each transformation we apply to the source. We would never write software that forced the user to reconcile a bunch of data integrity problems at the end of a long series of operations, like we do with the compiler today.
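To make that concrete, here is a minimal sketch in Java of what "referential integrity maintained with each transformation" could look like. Everything in it (Program, Reference, declare, rename, referTo) is hypothetical and invented for illustration; it is not how Noop or any existing IDE works. The one assumption it relies on is that a use site stores a stable internal key instead of the user-visible name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: declarations carry a stable internal key, and
// references point at that key rather than at the user-visible name.
class Program {
    // Maps the internal key of each declaration to its current display name.
    private final Map<UUID, String> names = new HashMap<>();

    /** Declares a new symbol and returns its stable key. */
    UUID declare(String displayName) {
        UUID key = UUID.randomUUID();
        names.put(key, displayName);
        return key;
    }

    /** Renames a declaration; every reference sees the new name immediately. */
    void rename(UUID key, String newName) {
        if (!names.containsKey(key)) {
            throw new IllegalArgumentException("Unknown declaration: " + key);
        }
        names.put(key, newName);
    }

    /** Creates a reference; fails right away if the target does not exist. */
    Reference referTo(UUID key) {
        if (!names.containsKey(key)) {
            throw new IllegalArgumentException("Dangling reference: " + key);
        }
        return new Reference(this, key);
    }

    String nameOf(UUID key) {
        return names.get(key);
    }
}

// A use site: it stores only the key, so a rename cannot leave it stale.
class Reference {
    private final Program program;
    private final UUID target;

    Reference(Program program, UUID target) {
        this.program = program;
        this.target = target;
    }

    /** The name to display is looked up at presentation time. */
    String render() {
        return program.nameOf(target);
    }
}
```

In this toy model a rename is a single edit to the declaration, and a reference to something that doesn't exist is rejected at the moment you try to create it, instead of showing up later as a compile error.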
In my last post, “Help! My IDE is full of compile errors!”, I wrote about the problem of keeping our IDEs correctly configured for our project, and sharing their understanding of the project metadata with the build system. Well, if we are going to change the IDE in a major way by using a different data representation for our code, then we may as well have a server between the IDE and the VCS persistence layer. Now the server can maintain the correct state of the metadata for us, and keep it consistent across the team, and across my several machines.
I’m having trouble keeping this urge from taking root in Noop, my programming language. To be fair to Noop, this isn’t a language problem per se. Java could be made to work in this system - mostly. We want a unique key for an identifier other than a user-defined symbol, so we’d have to do some translation of the AST between parsing the Java files and showing it to the user. Also, to get finer-grained versioning, we’d have to break up methods into their own files, so the actual .java source file would be produced from a number of files. It just seems like a new language gives us a new start, and would save me time and effort trying to find a pretty way to serialize all the metadata like documentation into the source files with a complex grammar to maintain.
We could also re-write the VCS so we don’t need to serialize anything to files. But that seems pretty crazy, right?
Help! My IDE is full of compile errors!
I was working on some code just now, in a project I don’t normally work on. It turns out I need to make some changes there, and once I’m done, I won’t need to work on that project anymore. This is a common enough thing for engineers, I’m guessing.
So I loaded up the code in my IDE so I could start making sense of it. I followed some instructions provided by the project team about what to check out from version control, and then I used the plugin in my IDE that deals with the build system, hoping to set up my project correctly. And to count as correct, there are a bunch of things I need in order to be happy and productive:
- All the right code is displayed, including all the source roots and modules
- All the needed dependencies are added as libraries, so the code compiles
- If any special development kit is needed, like the right JVM, it’s added
- Whatever plugins I’ve used when I worked on this project in the past should be installed and set up
- If there are multiple modules/sub-projects, they are also added correctly, and depend on each other as source, not via their compiled jars
- Source attached to all the binary dependencies, even third-party ones
- All the transitive dependencies are downloaded to my machine
- The right preferences are available, such as the Copyright block that should be pasted into new files, code style, import order, etc.
- If there are any special arguments needed to launch the program or the tests, those are added in my Run Configurations.
- Whenever I change any of these things, other people on my team have their IDE setup updated, and when they make changes, my IDE setup is updated
When you really get serious about your IDE being set up correctly, the state of your IDE becomes valuable. You really don’t want your setup to get borked, or to start over on a new machine, because it takes you a while to get through all these items at the various points they show up. Some of them never get set up right, like all the code formatting preferences, and we just slog through it with a partially-configured IDE.
This valuable state isn’t committed to the version control system, at least typically, because much of the information is duplicated from the files used by the build system to compile and test the program. That part of the IDE setup should be created or synced by some tool associated with our build system. But the rest of the settings are up to us; the build system doesn’t care about them, since it has a read-only view of the sources.
Why is this a big deal? When you’re working on a very large, modular, corporate codebase, you only have a nice, happy IDE configuration for the part of the project you work on regularly. You’re willing to put in that ramp-up effort on one machine, for that one module or subset of the code, and only for one version of the IDE, etc. You hope it’s only once. But what if you pair with other members of the team, whose setup might suck? What if you’re trying out a new version of the IDE, or have the project set up on a home computer and a laptop as well as your work machine? Or (and this is the real stretch) what if you’re doing a code review and want to browse the proposed changes in your IDE, maybe experimenting with some suggested improvements and running tests? These are all times when you have to work in an IDE-not-happy state.
As much as I appreciate that we all have different preferences for our IDE, and that we hold our positions very strongly, I think that’s the major contributor to this problem. If we had a true standard for project metadata, which allowed for the entire spectrum of metadata I bulleted earlier, and worked with both the build system and the IDE, then all the tools would work fine. But not all the tools cover the whole feature set - if the copyright on our project changes, IDEA 8+ users want their metadata to reflect that, but no one else has a box in the IDE to update. And we don’t want to go monkey with that project metadata by hand, since in most IDEs and build tools, it’s not very friendly. (Unless you’ve got a nice XPath/XQuery library in your head.)
What’s the answer to this problem? Sadly, we have two impossible tasks: pick a build system and an IDE that we’ll make everyone use, and get that build system and that IDE to share their metadata, either by using the same location for it, or at least syncing everything correctly. I personally believe that if we really solved this, any coder who knows how to use an IDE well would appreciate the convenience of complete and correct metadata so much that it would be a ‘killer feature’, and they’d drop their current IDE for it. Crazy talk, yes. But I think there’s more we can fix once we go down this path, which I’ll write about next time.
Making the case for unit tests to live in the code they test
This is perhaps the most controversial idea in Noop. I had an interesting discussion that included Cédric Beust, in which I defended this idea from a lot of doubts, mostly around why it would be needed, and the challenges with tools that expect a conventional code layout. So, I went to spec out exactly how it could work. And I’m happy to report that I think it totally works.
First, what does writing unit tests the conventional way look like?
- Create a separate src-test or src/test source root, which is to contain just tests, so that we can compile the production code independently and enforce that prod code doesn’t have deps on tests.
- Create a new source file in that source root, in a package/namespace mirroring the location of the prod file you want to test.
- Name the new class with a convention like “*Test” so our team understands where to find the unit tests for a file.
- In the test fixture, create an instance of the class-under-test, supplying fakes or mocks for some of the constructor/setter dependencies as needed.
- Start calling methods in the class and sensing whether they did the right thing. Modify the code under test to make some fields or methods package-protected or friendly to the test class so it can white-box test where you need to.
- Optionally mark that change with a comment or annotation like @VisibleForTesting so your team can understand why this field or method isn’t private as they’d expect.
- Get the tests green, and send off your code review. The reviewer needs to flip between the test file and the production file, so they can use the test-as-documentation to understand what you intend the code to do, and help you find corner cases you missed in your test.
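As a rough illustration of those steps, here is what the two files tend to look like in Java. The Counter class, its field, and the paths in the comments are all made up for this sketch, and I’ve used a plain AssertionError instead of a test framework so that the example stands on its own.

```java
// src/main/com/example/Counter.java (hypothetical production file)
class Counter {
    // Would naturally be private, but is package-private so the test can
    // inspect it. Teams often flag this with an annotation such as
    // @VisibleForTesting.
    int count = 0;

    void increment() {
        count++;
    }
}

// src/test/com/example/CounterTest.java (hypothetical; mirrors the prod package)
class CounterTest {
    // The intent is "escaped" into a camel-case method name, on a class that
    // only a test runner is ever expected to instantiate.
    void testIncrementAddsOne() {
        Counter counter = new Counter();   // build the fixture by hand
        counter.increment();
        if (counter.count != 1) {          // white-box check on internal state
            throw new AssertionError("expected 1 but was " + counter.count);
        }
    }
}
```

Note that the only thing tying CounterTest to Counter is the naming convention and the mirrored package.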
Here’s what’s wrong.
- You want to change code in the prod class with the corresponding unit test visible. Maybe you have an IDE plugin like TestDox that can navigate between the two classes based on the naming convention. If not, you have to navigate manually, and maybe change your window layout to show the two side-by-side.
- Even when you can see both prod and test code, you have to scroll around the two files to see the tests that cover the method you’re working on.
- You refactor the prod class, changing the name or package, and the test class now is misnamed or in the wrong place. Maybe the package-protected access breaks so the compiler tells you, or maybe your code reviewer notices. But your refactoring tool probably doesn’t help you, and it’s easy to check in this mistake.
- You changed the production code just to expose some fields for testing. You really want the test to have special access into the class internals that wouldn’t be allowed at production runtime, but the language doesn’t distinguish these two runtimes.
- As the code grows, you refactor the class by extracting some methods and pulling out some responsibilities into their own classes. You have to read through all the tests to understand what behavior should be extracted into a new test. Some tests are no longer unit tests, since they test behavior that’s now an interaction between the original class and the new class, so you might move those tests to a third location.
- It’s really annoying in the code review to correlate the test and prod changes, especially if your code review tool has a high latency when navigating between files.
- The test needs to be written as a class with methods, even though you should never create an instance of that class yourself, nor call any of the methods yourself.
- The test provides fantastic documentation, because it’s executable. The test always tells you the truth, even when comments get outdated. Sadly, that documentation is not found in your generated docs, because your doc tool doesn’t know where to find the tests. Some BDD frameworks have a separate tool to create the “spec” for the class, but it still doesn’t appear in places where the class API appears.
Sucky! Ok, so what are we going to do about it? Here’s what I propose in Noop:
Unit tests are a special entity, declared with a unittest keyword. The keyword is followed by a string literal, which allows you to write your intent in a normal sentence rather than “escaped” as a camel-case method name. They may appear either in a file dedicated to testing, or better, in the class being tested.
class TestThis() {
  String printHello(String name) {
    return "Hello %s!" % name;
  }

  unittest "It should print hello" {
    printHello("Fred") should equal("Hello Fred!");
  }
}
The unittest is a member of the class, just like a field or a method, so it naturally has intimate access to any fields or methods. And the test fixture is naturally provided as the “this” reference in the test, allowing you to simply call methods in the class.
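For comparison, the closest approximation I can sketch in today’s Java is a nested class, since nested classes can see the private members of the class that encloses them. This is only an analogy: Java has no unittest keyword, the Tests class and its method name are my own invention, and the fixture still has to be created by hand rather than being supplied as “this”.

```java
class TestThis {
    // Stays private: no visibility was relaxed for the sake of the test.
    private String printHello(String name) {
        return String.format("Hello %s!", name);
    }

    // Stands in for: unittest "It should print hello" { ... }
    static class Tests {
        static void itShouldPrintHello() {
            TestThis fixture = new TestThis();   // Noop would provide this as 'this'
            String actual = fixture.printHello("Fred");
            if (!actual.equals("Hello Fred!")) {
                throw new AssertionError("unexpected greeting: " + actual);
            }
        }
    }
}
```

The test lives next to the method it covers and moves with the class when it is renamed, which is most of the benefit; what Java can’t give us is the sentence-style name or the automatic fixture.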
Let’s deal with the objections now.
“Um, where does the ‘this’ instance come from?” -> I’m assuming here that the language has built-in dependency injection, so it’s normal to expect instances to be created for you. In the example above, it’s obvious how to make an instance of TestThis, so why should you have to make one yourself?
“Ok, but what about a less trivial constructor? What if the constructor needs some service that we want to mock?” -> Again, dependency injection to the rescue. You need a way to add “bindings” to the DI runtime, so that it understands how to provide instances of whatever objects you need in the constructor. These bindings need to be declared in the unittest declaration, before the block that contains the logic and assertions, so that the ‘this’ reference is available in the first line of the block.
“I like having the tests in their own source root, like it’s been since the old days!” -> I think this is just an artifact of language tools that don’t understand tests, so the easiest way to compile the right set of sources, search for tests, and deal with dependencies was to create a separate source root. Starting with a fresh slate today, I don’t see why it’s necessary to continue to appease language tools at the cost of maintaining a separate package hierarchy and all the rest.
“But the production code shouldn’t depend on the tests, and the tests have extra dependencies that the production code doesn’t!” -> The first half is solved because the unittests aren’t methods, so there is no way to reference them in production code. The second half might require a way to write an import statement that only imports something for use by the tests. Maybe “test import junit.Asserts”? And the language tools need to provide a different set of dependency libraries in the classpath when running the tests vs. running production code.
“But you don’t want to ship those tests to production!” -> well, I’m not sure why that’s a big deal, as there’s no way to call into that code without a test runner. But sure, the compiler or other language tools can support a “test” mode, just as many compilers have a “debug” mode to dictate whether the full symbol table is emitted.
“But this mixes testing and development!” -> You should really start doing Test-Driven Development. Testing is part of developing. They should be mixed.
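To back up the answer about non-trivial constructors, here is a toy sketch in Java of bindings being declared before the test body runs. None of this is Noop syntax or a real DI library: Bindings, Clock, Stamper, and StamperTests are all hypothetical names, and the “DI runtime” is just a map from a type to a provider.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical service that the class under test depends on.
interface Clock {
    long now();
}

// Class under test: it receives its dependency through the constructor.
class Stamper {
    private final Clock clock;

    Stamper(Clock clock) {
        this.clock = clock;
    }

    String stamp(String message) {
        return clock.now() + ": " + message;
    }
}

// Toy stand-in for a DI runtime: a map from a type to its provider.
class Bindings {
    private final Map<Class<?>, Supplier<?>> providers = new HashMap<>();

    <T> Bindings bind(Class<T> type, Supplier<? extends T> provider) {
        providers.put(type, provider);
        return this;
    }

    <T> T get(Class<T> type) {
        Supplier<?> provider = providers.get(type);
        if (provider == null) {
            throw new IllegalStateException("No binding for " + type.getName());
        }
        return type.cast(provider.get());
    }
}

class StamperTests {
    // Stands in for a unittest whose bindings come before the test block.
    static void itShouldPrefixTheTime() {
        // The bindings: a fake clock that always reports the same time.
        Clock fakeClock = () -> 42L;
        Bindings bindings = new Bindings().bind(Clock.class, () -> fakeClock);

        // What the runtime would do for us: build the fixture from the bindings.
        Stamper fixture = new Stamper(bindings.get(Clock.class));

        // The test block proper starts here, with the fixture already available.
        String actual = fixture.stamp("hello");
        if (!actual.equals("42: hello")) {
            throw new AssertionError("unexpected stamp: " + actual);
        }
    }
}
```

The point is only the ordering: bindings first, then fixture construction, then the assertions, so the test body never has to call a constructor itself.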
I’m very keen on your comments on this one. Don’t leave me hanging!
Code style: too many choices
Q. Should a method parameter list continue on the next line aligned with the open paren, or 4 spaces from the left?
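In case the question is too abstract, here are the two layouts side by side in Java. The class and method names are invented for the example, and I’m reading “4 spaces from the left” as four spaces relative to the start of the declaration. Both methods compile to exactly the same thing, which is rather the point.

```java
class StyleExample {
    // Style A: continuation lines aligned with the open paren.
    static String describeAligned(String firstName,
                                  String lastName,
                                  int age) {
        return firstName + " " + lastName + " (" + age + ")";
    }

    // Style B: continuation lines indented a fixed 4 spaces.
    static String describeIndented(
        String firstName, String lastName, int age) {
        return firstName + " " + lastName + " (" + age + ")";
    }
}
```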
Every time I deal with code style, it annoys me. That’s because I have to do some annoying busy work to make my pretty text file serialization of the program conform to the group consensus in my team.
Don’t get me wrong, I think having a common code style is important and I’m glad all my teams do it. I just want to have it taken care of for me, I’d rather think in terms of editing the AST rather than pushing around ascii art in a bunch of files.
So why don’t we have the perfect code formatter? There are too many options. Look at this dialog for my Scala plugin in IDEA:

And that’s just the “Spaces” tab. If I have the tool in my IDE, along with a command line tool to tidy up pre-submit, there’s no way I’ll get all of those settings right for the N projects I work on using M machines.
So if I’m right, the only way to stop spending this annoying time is either to stop editing text files (Jesse Wilson is starting to convince me of that), or just pick one global style so there’s no need to configure my tools. Sun proposed a style for Java, but somehow every Java shop I work in has a slightly different guideline, so that didn’t work.
The funny thing is, unless you’re a zealot about your personal preferences, you agree that it’s an arbitrary choice. The method parameters can continue onto the next line in any reasonable way. We just want to make a consensus so we can stop talking about style. In Noop, I’m going to push that we provide a non-configurable lint tool with the language, and bake it into the tooling in some way so it’s the natural choice.
What code reviews ought to be
Google is the first place I’ve worked where code reviews are a formal part of every change. And I’m really sold on them. I will always make code reviews a part of committing code from now on, because it guarantees that someone else understands my code. Sure, it’s also handy if the reviewer notices a bug, or suggests a nicer API I could have used, or has some good refactoring ideas. But the most important result for me is “shared code stewardship” - code I write isn’t “my” code, it’s part of the team’s product and it’s going to need maintenance by other people. The code review is often the first chance for someone else on the team to really read through the code and understand my changes.
But, I think it sucks that no one read the code earlier. Code reviews are like a less effective form of pair programming.
The more you pair program, the more often you find that the code review is already done. Your pair can be your reviewer, they have already signed off on the implementation, and they’re a great co-steward of the code for the future. Or, as one team I’ve worked with does, you can send the review to a third person, just to have some extra familiarity with the changes that the pair has made.
By pairing on the change as you make it, the pair may have some input near the beginning of the coding session. As soon as you start to create a new class, and you give it a name, they might say, “hey, I think there’s this other pattern we might use for this,” and you change course immediately. Contrast that with a typical code review - you’ve got to have some pretty serious objections to someone’s code before you tell them to go back to the beginning and use a different design. You’ll never do it if the improvement is minor - it’s not worth changing it at that point.
So code reviews end up being more about style, minor nits, and reminders to fill in the documentation. That’s great, but it’s a poor substitute for having that person onboard sooner. Here’s another reason: if you want to suggest a design change in a code review, you’ve got to explain it in your comments. If you want to make a small nit like changing the order of import/require/include statements, you add a comment and make work for the author to make your change. In both cases, it would be better to just grab the keyboard and make the change, explaining your reasoning as you go. There’s no separate prose you need to write in a comment explaining the change.
So, here’s my proposal for an awesome code review tool:
- Pick the reviewer before you start making changes. They’re like your pair, but don’t have to be in your timezone.
- You commit to a throw-away branch as you work on your change. The branch is automatically set up for your review, and allows commits by the reviewer(s).
- When you commit on your branch, the reviewer gets notified, and can browse your code, in a read/write mode. Ideally it should be easy for the reviewer to load an IDE project from your review branch.
- The reviewer can add inline comments in the code, which don’t appear in the source file, but in some extra metadata file. They can also change the code directly. They commit their changes, with a commit comment, and the author is notified.
- The author considers the suggested changes, and can accept them or roll them back, with further comments if desired.
- Bonus: the tool facilitates easy screen sharing, so if you IM with the reviewer and don’t agree or understand, you just remotely pair on the spot.
- When there is some consensus and the change is done, the author does a merge back to the dev branch or trunk, and the review tool archives the review branch.
My top 8 reasons to use IDEA over Eclipse
I use IntelliJ IDEA as my primary IDE, although I’ve used Eclipse on and off in the past. The last time I switched back to Eclipse was to maintain the Eclipse plugin for Testability Explorer, and I was reminded of some things that I really missed. Here are my killer features:
Diff editor is full-featured
You’ve finished making some changes to a codebase you care about, so before you commit, you take care to look through the delta you’re about to commit and make sure it includes the right stuff, and to double-check your work. Often, I notice little things while I’m doing this - like missing documentation, some statements that could be refactored to a shorter form, or such. So, I begin editing the right side of the diff, and I need to have my real code editor there, so I can get completion in my {@link} javadocs, code completion, and so on.
I’ve seen people using Eclipse who get to this step and have to go back to the normal IDE window and then go find the file they were reading in the diff view. Too much context switching for me.
Completion shows compatible types
You start coding:
Widget w = new |
At the cursor, you’d like to get a completion list that shows the Widget class and any subtypes. This is especially important for some APIs that you don’t know well or don’t use often.
java.io.Writer w = new |
Man, I love getting suggestions like FileWriter there. I think Eclipse can only attempt to sort the list of completions in order of some heuristic of how likely you are to pick it.
Apple-arrow mapped to line home/end
It’s a minor thing on the face of it: on Mac, the home and end keys take you to the beginning and end of the document. Since Eclipse is on SWT, it’s not surprising that you get the native Mac behavior. In IDEA, maybe because it’s Swing, I get beginning and end of line instead. Whenever I use Eclipse, this majorly screws me up, because I forget to re-map my brain and lose my place in my editor.
I’ve tried to fix this in the OS, but it was too painful.
Understands prod/test code
I often feel like IDEA is made for people who write code, and Eclipse is made for people to write Eclipse plugins. This is an example of IDEA understanding how coding works - you always have a prod folder and a test folder. Instead of labelling these folders as just Source folders as in Eclipse, you tell IDEA whether they contain prod or test resources (or get it from an external model like a Maven POM). Now, when you do a search, you can search just prod code or just test code. When you make a new Test with the testdox plugin, it knows which source folder the new class belongs in.
Save on blur
We always work from a version control system. So, in my mind, the “buffer” I’m editing is my working copy. If I want to revert my changes, I use the VCS to do it. So, I hate the notion of “saving.” I don’t see how it’s ever useful to have different contents in the in-memory editor than you have on the disk. When I make changes to a file, I want the changes to get flushed out to disk. IDEA can do it when the IDE loses focus, which is a nice time. That way, my shell always sees what I typed. When I use Eclipse, I run something and it seems like my changes didn’t take effect. Then I realize there’s an editor tab with a star on it. The star makes me really frustrated, because it’s just the IDE telling me that it *knows* my disk has stale content, and asking me to do the busy work for it.
Diff annotations in gutter
Since we’re always coding against version control, our job is often to craft incremental changes to the codebase. I want to see that delta at all times, especially in the editor. That delta is my actual work product and it’s annoying to have to invoke another view to see what I’ve done so far, and revert changes or copy the previous content.
I’ve noticed that I’m more exploratory in my coding with this feature - if I don’t like the path I’m going down, it’s trivial to remove the last few changes and start a different change. I think there’s an Eclipse plugin for this, I guess that would work if it’s truly universal on all editor views and consistent with the VCS plugin, but I really don’t install many plugins.
SVN support built-in
It kills me to watch Eclipse users install a Subversion client plugin. It’s not easy to do, since you have to choose between JNI and pure Java, and the JNI version seems to have problems locating the native libraries in my experience. Maybe the install has gotten better, but still, this is another example where Eclipse should really support what I do out of the box. I think Subversion is popular enough that I should be able to check out a project with a freshly installed IDE.
Understands multi-changelists
I have more than one change happening in the same working copy a lot of the time. I like moving files into different lists, so when it’s time to commit some work, I have the right set of changes already grouped together. In Eclipse, I would usually toggle the checkboxes on the commit dialog to choose the files to include in the commit, and it was really hard to remember which classes belonged in the change. And I made a lot of mistakes.
Java - final variable to constructor parameter
The quickest way to type code is not to type at all: let the IDE fix things as you go. Setting up the dependencies of a class is a common place where there’s a lot of typing, at least in a language without properties, like Java. I like to create my fields first, and if I make them final, IDEA will offer to add them as constructor parameters. Eclipse doesn’t. You must go the other way around, creating the constructor and adding parameters, then using the auto-fixer to create the fields.
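As a small illustration of that workflow, here is the shape of class it produces. The class and field names (ReportMailer, smtpHost, port) are made up for this sketch. Only the two final fields are typed by hand, and the constructor is roughly what the quick-fix generates.

```java
// Hypothetical class for illustration; the names are not from any real project.
class ReportMailer {
    // Step 1: typed by hand. Until they are initialized, these final fields
    // are a compile error, which is what makes the IDE offer the quick-fix.
    private final String smtpHost;
    private final int port;

    // Step 2: generated by the IDE from the uninitialized final fields.
    ReportMailer(String smtpHost, int port) {
        this.smtpHost = smtpHost;
        this.port = port;
    }

    String describe() {
        return smtpHost + ":" + port;
    }
}
```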
$SwitchMap$
I recently became the maintainer of the Testability Explorer project. It’s a library that inspects Java bytecode, assessing how difficult the code will be to unit test. Check out the author’s blog: Misko Hevery. For more info about the project, see http://testability-explorer.googlecode.com.
Testability explorer looks for a few things in your code. One of them is some mutable object that’s globally accessible. For example, the Evil Singleton makes it hard to unit test anything that uses it, but a public static final String is fine, since String is immutable.
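To make the distinction concrete, here is a minimal sketch of the two cases. Both classes are hypothetical and are not taken from Testability Explorer or its test suite.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only.
class EvilSingleton {
    // Globally reachable AND mutable: a test can observe whatever an
    // earlier test left behind in the log.
    private static final EvilSingleton INSTANCE = new EvilSingleton();
    private final List<String> log = new ArrayList<>();

    private EvilSingleton() {}

    static EvilSingleton getInstance() { return INSTANCE; }
    void record(String event) { log.add(event); }
    int eventCount() { return log.size(); }
}

class Constants {
    // Globally reachable but immutable: nothing a test does can change it.
    public static final String GREETING = "hello";
}
```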
So I was investigating a strange issue where a large mutable global cost was assigned to some classes, due to an anonymous inner class. It seemed that there was some globally mutable state hidden in SomethingOrOther$1, whenever SomethingOrOther has a switch statement with a variable of type Enum as the argument. That doesn’t make your code hard to test, so what’s going on?
Try it at home! Compile this with javac:
    public class HasStaticCost {
      public int compare() {
        Fruit fruit = Fruit.APPLE;
        switch(fruit) {
          case APPLE:
            return 1;
          case ORANGE:
            return -1;
          default:
            return 0;
        }
      }
      public enum Fruit {
        APPLE, ORANGE
      }
    }
And run javac HasStaticCost.java. The outputs are:
    HasStaticCost$1.class
    HasStaticCost$Fruit.class
    HasStaticCost.class
Three classes? What’s in that first one?? We’ll have to look at the bytecode:
    class com.google.test.metric.collection.HasStaticCost$1 extends java.lang.Object {
    static final int[] $SwitchMap$com$google$test$metric$collection$HasStaticCost$Fruit;
    static {};
      Code:
       0: invokestatic #1; //Method com/google/test/metric/collection/HasStaticCost$Fruit.values:()[Lcom/google/test/metric/collection/HasStaticCost$Fruit;
       3: arraylength
       4: newarray int
       6: putstatic #2; //Field $SwitchMap$com$google$test$metric$collection$HasStaticCost$Fruit:[I
       9: getstatic #2; //Field $SwitchMap$com$google$test$metric$collection$HasStaticCost$Fruit:[I
       12: getstatic #3; //Field com/google/test/metric/collection/HasStaticCost$Fruit.APPLE:Lcom/google/test/metric/collection/HasStaticCost$Fruit;
       15: invokevirtual #4; //Method com/google/test/metric/collection/HasStaticCost$Fruit.ordinal:()I
       18: iconst_1
       19: iastore
       20: goto 24
       23: astore_0
       // same for Fruit.ORANGE
       39: return
    }
This is pretty crazy. A static array, indexed by the ordinal values of the Fruit enum, holds the case label for each constant. It is named $SwitchMap$<enumname-encoded-with-dollars>, stored in this synthesized inner class, and accessible to other classes in the same package. And that synthesized class isn’t stored with the enum. Instead, each class that switches on the enum gets blessed with some ugly global state. Why? Java 1.5 shipped a new release of the JVM along with the new language features, so you’d think the switch statement would understand an enum type. But this bytecode looks like a complete hack to make the switch statement think it’s operating on an int instead. Here’s how the switch is implemented:
    0: getstatic #2; //Field com/google/test/metric/collection/HasStaticCost$Fruit.APPLE:Lcom/google/test/metric/collection/HasStaticCost$Fruit;
    3: astore_1
    4: getstatic #3; //Field com/google/test/metric/collection/HasStaticCost$1.$SwitchMap$com$google$test$metric$collection$HasStaticCost$Fruit:[I
    7: aload_1
    8: invokevirtual #4; //Method com/google/test/metric/collection/HasStaticCost$Fruit.ordinal:()I
    11: iaload
    12: lookupswitch{ //2
        1: 40;
        2: 42;
        default: 44 }
At instruction 12, I would think the value 1 would naturally refer to the first ordinal value of the enum without needing to load the constant from the static array as it does.
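Read back into Java source, the two bytecode listings correspond roughly to the sketch below. This is a hand-written approximation, not decompiler output. The names HasStaticCostDesugared, SwitchMapHolder and SWITCH_MAP are stand-ins for the synthetic HasStaticCost$1 class and its $SwitchMap$ field, and compare() takes the enum as a parameter here to make it easier to try out. The astore_0 at offset 23 appears to be the handler of a try/catch around each array store.

```java
// Hand-written approximation of what javac generates for a switch on an enum.
class HasStaticCostDesugared {
    enum Fruit { APPLE, ORANGE }

    // Stand-in for the synthetic HasStaticCost$1 class.
    static class SwitchMapHolder {
        // One slot per enum constant, indexed by ordinal; 0 means "no case".
        static final int[] SWITCH_MAP = new int[Fruit.values().length];
        static {
            // Each store is guarded in case the constant is missing from the
            // enum class that is actually loaded at run time.
            try { SWITCH_MAP[Fruit.APPLE.ordinal()] = 1; } catch (NoSuchFieldError e) { }
            try { SWITCH_MAP[Fruit.ORANGE.ordinal()] = 2; } catch (NoSuchFieldError e) { }
        }
    }

    static int compare(Fruit fruit) {
        // The switch runs over the int looked up in the array, not the enum.
        switch (SwitchMapHolder.SWITCH_MAP[fruit.ordinal()]) {
            case 1:  return 1;   // APPLE
            case 2:  return -1;  // ORANGE
            default: return 0;
        }
    }
}
```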
It turns out this was just bad decision making by the Java 1.5 committee. They intended to implement Java 1.5 language features in a way that would execute on the 1.4 JVM, I bet because they knew that BigCorp wasn’t going to upgrade their production environments and the language features would sit on the shelf for a few years unless they could be deployed with little risk. So we got stuck with big things like generic type erasure, and small things like this hack for the switch statement… but hey, at least adoption would be easy.
I fixed testability explorer by whitelisting field names matching /\$SwitchMap\$.*/. Gross. Although that’s straightened out, it still leaves that bad taste in my mouth. Because it turned out that there was some feature added later on in Java 1.5 that did require a change to the VM, and so all these hacks are in there for no good reason. And, of course, BigCorp stayed on 1.4 for years.
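For the curious, the whitelist check amounts to something like the following. This is a sketch of the idea only. SyntheticFieldFilter and isWhitelisted are hypothetical names and not the actual Testability Explorer code.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not the actual Testability Explorer implementation.
class SyntheticFieldFilter {
    // Same pattern as in the text: a literal "$SwitchMap$" prefix, then anything.
    private static final Pattern SWITCH_MAP = Pattern.compile("\\$SwitchMap\\$.*");

    // True if the field should be skipped when counting global mutable state.
    static boolean isWhitelisted(String fieldName) {
        return SWITCH_MAP.matcher(fieldName).matches();
    }
}
```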
Assert, Validate, Precondition
Several recent posts have been about what I’d like to see in a programming language, and those ideas prompted me to code. What I’ve been creating is a new programming language, very similar to Java and targeted to the JVM, but with these testability and best practice ideas included.
- Immutability - final should be the default
- Testing - unit tests should be in the core language, not a library
- Dependency Injection: class parameters
The language is called Nil, and the project is hosted at http://nil.googlecode.com.
Some ideas come from work, where I contribute to Testability Explorer (http://testability-explorer.googlecode.com). Some come from blogs I read, like the post “A Wish for the next Mainstream Programming Language” by Michael Feathers (author of Working Effectively with Legacy Code). Some are inspired by other languages, like Objective-C, ActionScript, and Scala.
I should afford some introspection here: I realize that, as Steve Yegge says, “your language is doomed to fail, with probability 1 minus epsilon.” So even if contributors and I produce something useful, it’s not going to be used. That’s fine, there are lots of little-known languages with a small, über-geek following.
So, today’s idea is about assertions. In testing, there are various libraries modelled after JUnit, which is nice, and as I’ve said, the language could make life a little easier by allowing the test to live in the class it tests.
Another thing the language could do is provide a context-sensitive assertThat() method. I’ve seen various APIs, like Spring’s Validator class and Preconditions in Google Collections, which are asserts in your production code. I like these, just because you fail faster. It’s also a nice executable way to document the preconditions of your method. You use them because you don’t want to depend on JUnit. But that’s silly. Why not have the same powerful API to express your expectations in your test that you have in production code? If assertThat() were a method automatically mixed into all classes, it would just fail the current test if called in a test, and throw an IllegalStateException with the same lovely message when called in production code.
Another idea: what if a method could define an internal DSL? Then the equals() and isA() and so on don’t have to be imported into the namespace of the test. When you call assertThat("a", isA(String.class)), you can’t resolve the isA() method from the surrounding code, but when assertThat() evaluates its arguments, the isA() method is defined in that scope. IDEs would need an easy way to find methods that are legal in such a scope, but that seems doable. The advantage is that the isA() method doesn’t clutter the completions when you’re not inside assertThat()! The API appears to have only the assertThat() method.
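In plain Java, the context-sensitive behaviour could be approximated like this. It is only a sketch under a big assumption: the inTest flag stands in for knowledge that the language runtime would have on its own, and the Asserts class and its three-argument assertThat() are names invented here, not an existing API.

```java
import java.util.function.Predicate;

// Hypothetical sketch; in the proposed language this would be mixed into every class.
class Asserts {
    // Assumption: stand-in for "am I running inside a test?".
    static boolean inTest = false;

    static <T> void assertThat(T actual, Predicate<T> expectation, String message) {
        if (expectation.test(actual)) {
            return;
        }
        String detail = message + " (was: " + actual + ")";
        if (inTest) {
            // Inside a test: fail the current test.
            throw new AssertionError(detail);
        }
        // In production code: same message, reported as a violated precondition.
        throw new IllegalStateException(detail);
    }
}
```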
I’ll go code some of this now!
USB keyboard scancode conversion adapter
The world of Dvorak today kind of sucks. The major suck points for me are:
- Bugs in OS support for switching layouts. In Mac at least, option-’ gives me æ and option-[ gives me ” because it’s modifying the QWERTY position character. Locking my screen with LockTight, I tell it my shortcut is shift-option-command-P, but I really have to hit L.
- Software that reads scancodes in some situations where it shouldn’t. Especially as a developer, I run across tools that don’t use the OS-provided character mapping.
- Pair programming. I would like to have a Dvorak keyboard, plug it in to a QWERTY co-worker’s machine, and have both keyboards work at the same time.
- Sitting down at a QWERTY keyboard. People make fun of us when we can’t use their keyboard, but honestly, it’s really annoying. I’d like to have dvorak with me all the time.
- Working through remote X sessions, VMWare, over Synergy, or in the BIOS or bootloader - you can never be sure if you have us->dvorak mapped, or if it’s mapped twice. Especially when using more than one of those remote means.
- Really low-tech people are put off from trying dvorak by the prospect of changing a setting in their OS/windowing toolkit
There is a good solution, and dvorak and other alternative layouts would really benefit if we could find someone who has the ability and desire to do this:
Make a keychain-form-factor adapter, with a male and female USB-A connector, that converts and propagates USB-keyboard scancodes.
I think this just requires a USB keyboard controller and USB keyboard host controller chip (with accompanying capacitors and what-not and what-have-you) and an EEPROM or similar with a mapping programmed in. And a hacked-up enclosure.
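The mapping itself is tiny. Here is a sketch of the table for the US layouts, written against key labels rather than raw scancodes to keep it readable. A real adapter would apply the same table to USB HID usage codes in firmware, and DvorakRemap and translate are names made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: works on key labels, not raw USB HID codes.
class DvorakRemap {
    // Keys in QWERTY order, and what the same physical positions type on Dvorak.
    private static final String QWERTY = "-=qwertyuiop[]asdfghjkl;'zxcvbnm,./";
    private static final String DVORAK = "[]',.pyfgcrl/=aoeuidhtns-;qjkxbmwvz";

    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        for (int i = 0; i < QWERTY.length(); i++) {
            TABLE.put(QWERTY.charAt(i), DVORAK.charAt(i));
        }
    }

    // Given the QWERTY label of the key that was pressed, return the QWERTY
    // label of the key the adapter should report, so that a host left on
    // the QWERTY layout ends up typing Dvorak.
    static char translate(char pressed) {
        Character mapped = TABLE.get(pressed);
        return mapped != null ? mapped : pressed; // digits, space, etc. pass through
    }
}
```

Because the translation happens before the host sees anything, the OS layout setting stays on QWERTY and nothing gets mapped twice.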
A test-driven modern language
Here’s another in my recent series of posts about ways I think the Java language could support modern coding practices.
An obvious practice that we use extensively today is unit testing and test-driven development, and again, Java and other languages don’t provide built-in support. Instead, we have a few libraries to create test code, execute it, and provide mock dependencies, and then some standards for naming and directory layout that help us organize the test code and correlate it with production code.
What sorts of facilities would our dream language have for testing?
In Java, tests have a lot of repetitive code, because they are just classes. All test methods have the same form: they are non-static, take no arguments, are marked as tests via an annotation or naming convention, throw all checked exceptions, and return void. We should probably have a test keyword that starts a block, and this should act like a test.
Just like the keywords extends and implements, we should have a way to associate two objects with a new relationship: tests. This way, we can easily see for a given class, what classes test it, and IDEs can consistently navigate between a class and its test(s). It also would allow some conveniences. Instead of having to mark methods package-private so they are visible to the unit test, we could make private methods visible to tests, like an implicit C++ “friend” relationship.
So, with our two new keywords we have an example like this:
    class Foo {
      private int helperMethod() {
        //stuff
      }
    }

    class FooTest tests Foo {
      test helper {
        int i = new Foo().helperMethod();
        assertThat(i, equals(10));
      }
    }
Now we have created a test class and some tests. Notice the assertion in that test - where do the assertThat() and equals() come from? Instead of importing some utility classes, maybe the tests keyword could also cause my FooTest class to extend from a base Test class rather than from Object by default, so these could be implemented there. Or if the language had mixins, the test methods could come from a mixin and not require polymorphism (we don’t care that our test class may be cast to a Test type).
The example also uses a Hamcrest-style expectation, to make the assertion more fluent.
The same thing could also be done if tests could be written directly in the class they test, and the compiler will have to ignore it when not in testing mode. In that case, the mixin would be automatically added to classes that contain tests, so that the assertions are available:
    class Foo { // implicit mixin Asserts
      int calculateDay(Date date) {
        ...
      }

      test calculateDay {
        // using the enclosing instance, requires that it have a default constructor
        int day = calculateDay(new Date());
        assertThat(day, equals(31));

        // probably more realistic
        int otherDay = new Foo().calculateDay(new Date());
        assertThat(otherDay, equals(31));
      }
    }
What more can we do to make writing tests the most convenient and fastest way to code? For one, we could have a built-in test runner. Having marked our tests with a new syntax, we can find all the tests and execute them using an equivalent program to the compiler. In that test running mode, we can relax the security model of the runtime, avoiding some common problems with testing sensitive code.
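To get a feel for how little a runner needs once tests are marked, here is a minimal sketch in today’s Java. It uses a home-made annotation in place of the proposed test keyword, and reflection in place of compiler support. MiniTest, MiniRunner and CalendarTest are all invented for this example and have nothing to do with JUnit.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical marker, standing in for the proposed `test` keyword.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface MiniTest {}

class MiniRunner {
    // Runs every @MiniTest method on a fresh instance; returns the failure count.
    static int run(Class<?> testClass) {
        int failures = 0;
        for (Method m : testClass.getDeclaredMethods()) {
            if (!m.isAnnotationPresent(MiniTest.class)) {
                continue; // unmarked helpers are ignored
            }
            try {
                Object instance = testClass.getDeclaredConstructor().newInstance();
                m.setAccessible(true);
                m.invoke(instance);
            } catch (InvocationTargetException e) {
                // The test body (or constructor) threw: count it as a failure.
                failures++;
                System.out.println("FAILED " + m.getName() + ": " + e.getCause());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot run " + m.getName(), e);
            }
        }
        return failures;
    }
}

// Example test class with one passing test, one failing test, and a helper.
class CalendarTest {
    @MiniTest void passes() { if (1 + 1 != 2) throw new AssertionError("math is broken"); }
    @MiniTest void fails() { throw new AssertionError("expected failure"); }
    void helper() {}
}
```

A language-level runner could do the same discovery at compile time instead of through reflection, which is also the natural place to hook in coverage.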
Once we have our test runner, we can lower the bar to getting good testing infrastructure in a project. One major pain in the butt is instrumenting the code to get the line-level test coverage. Because libraries like JUnit execute code that’s been compiled by the normal compiler, we have to do something funny like twiddle the bytecode after compilation, or use a custom classloader to do it on the fly. If the language understood testing, then it could also always provide coverage data when the tests are executed. It would be easier for IDE’s to show how much of your code is executed in the tests, as well.
Finally, there is the ever-annoying issue of setting up dependencies. The language needs access to different libraries at test-time, and we also want to enforce that production code may not have dependencies on test code. With language-level support for testing, there can be compile-time checks that test code is not used from production code, and we can have a second path of libraries passed only to the test runner.
Does this seem like a good idea, or does the language step over the line here and take over the job of a framework? Should we encourage testing private methods this way, or does it violate encapsulation principles?