I've been digging through Typing Haskell in Haskell and wanted to jot down some quick notes on the Kind system, so I've got some idea of what's a Haskell wart and what's necessary for type system to function. I never really understood the point of the Kind system when I was learning Haskell - I figured out how it worked, but didn't understand why I would need it, and I think I've gotten a grand total of about 2 or 3 kind errors when programming in Haskell.
The best analogy I can think of is to generics in Java. [a] translates to List<a>, Tree a is TreeMap<a>, monads would be Monad<a> if Java had them, etc. The Kind system basically keeps track of how many type parameters a type should have, so you don't write something like List<Int, Bool> or other nonsensical types. In the presence of type variables, this is pretty important, because Haskell will let you write things like `m a` where m is any type that takes exactly one argument.
Actually, they give a bit more information than that. Each type parameter also has its own kind, so that there's a difference between List<Int> and List<List<Int>>. This isn't really an issue when writing out types explicitly, but it can give you early warning for type inference errors, telling you that you have a primitive type in one place and a parameterized one in another.
It looks like Kinds are mostly used when unifying applications of type variables together. The example above is no problem: it gets down to unifying Int and List<Int>, sees that the former is a TCon and the latter is a TAp, and fails. But say it had been unifying `a` and Int, where `a` had been bound to List<Int> by a previous substitution. Without Kinds, this would record a constraint `a -> Int`, and then the unification error would be caught as the substitutions were merged. With them, you can catch the error immediately, before the variables are bound, and show more information about what went wrong.
Part of my confusion with Kinds is that they're represented in curried form, like nearly everything else in Haskell. So a two-arg type has kind `* -> * -> *`, instead of, say, `<<>, <>>` if you were just recording the arity of each type parameter as a list. I'm going to experiment with the latter representation in Eve, since it lets me use standard list functions (fold, map, etc.) instead of manually programming the recursion. [Edit: this turns out not to work very well. The type of Kind would then be [Kind], which is an invalid circular type. I could do a `data Kind = Star | KCon [Kind]`, but I'm not sure this gets me anything over the straight curried representation, and it has two representations (Star and KCon []) for a primitive kind.] And ideally, I could avoid exposing the concept of Kinds at all to the user, just using them to generate more friendly error messages from the typechecker.
Monday, January 19, 2009
Wednesday, November 26, 2008
Sum types vs. typeclasses
I'm trying to figure out how to get rid of sum types.
Perhaps I should back up a bit. For readers that aren't familiar with Haskell or Ocaml (are there any?), sum types are when a type can take on several alternate values. For example, data List a = Cons a (List a) | Nil or data Color = White | Grey | Black. Each alternative value (or family of values) gets labeled by a data constructor.
Typeclasses are sets of types. When you a function with the signature Eq a => a -> a -> Bool, it works on any type, as long as that type is an instance of Eq. Types provide "evidence" that they're really a member of the appropriate typeclass in the form of a dictionary of methods.
I want to get rid of one of one or the other because their usage overlaps. I'd like to follow the Python philosophy of "There should be one - and only one - obvious way to do things", yet it's not entirely obvious when you should use multiple constructors and when you should use multiple instances of a typeclass. Whenever your language forces you to make decisions about program organization, that's time that's not spent making decisions about the problem domain.
For example, in Typing Haskell in Haskell, there are separate types for Tyvars, Tycons, etc, each with a single constructor. There is also a separate Type type with a Tyvar constructor that takes a Tyvar, a Tycon constructor that takes a Tycon, etc. There is a Qual type (representing qualified somethings), which is parameterized so that you can either have a qualified predicate (like in an instance declaration) or a qualified type (like in a function signature). And there is a Types typeclass that has Type, Qual, and Pred as its members, so that you can apply a substitution to anything that may contain types.
It's done this way to eliminate the possibility of type errors while still keeping the code as general as possible. For example, you couldn't have a typesafe varBind function (Tyvar -> Type -> Maybe Subst) without having Tyvar as a separate class. If you passed in a Type as the first argument (as I did in early versions of Eve's typechecker), you run the risk of a type error if it's something other than a TVar.
However, it's not obvious to do this until you actually write a function that uses a subset of the type. Typecheckers are well-understood, and you can build off the implementations of researchers before you. But most of the time when you write software, you don't know what functions you'll need or what types they'll take until a customer request comes in. And in Haskell, that often requires rewriting large amounts of software - changing the constructor from "TVar String" to "TVar Tyvar" means that the type of every TVar alternative needs to be changed, the addition of a new constructor means that every function of that type needs to be changed, and changing from a type to a typeclass means that each alternative needs to be rewritten as a method of the appropriate instance.
So I'm thinking about what it would be like to get rid of sum types and alternatives entirely, and replace them with type classes. Instead of having "data Tree a = Branch (Tree a) a (Tree a) | Leaf", you'd represent branches with a "Branch { left :: Tree, val :: a, right :: Tree}" record, leaves with None (like in Python), and then there'd be a typeclass with Branch and None as instances and the appropriate methods, pulled from their pattern-matching alternatives.
Couple problems I can think of with this:
- In Haskell at least, data structures can't be parameterized over typeclasses, only over types. So there's no way to express { left :: Tree, right :: Tree } where Tree is the typeclass containing both Branch and None. It would be nice if we could - among other things, this would make typeclasses much more like Java interfaces and would eliminate the need for existential types in heterogenous lists. But I highly suspect it would make type inference undecidable, given the known difficulties with type inference and subtying.
- Typeclass signatures expect the same type, not just the same typeclass. For example, consider insert :: Tree collection => collection -> elem -> collection (and ignore that you need an MPTC w/fundep or associated type to define the element). When you call insert on a leaf, you want it to return a new branch containing that element, with null children. But that gives you None as the type of the first collection and Branch as the type of the second collection, which won't unify, even though they're both members of the Tree typeclass.
Similar issues would come up in a lot of circumstances - for example, (==) :: Eq a => a -> a-> Bool means that two types in the same typeclass cannot be compared for equality. You have to have that sum type, otherwise you break other typeclasses. This also would make straight enums virtually useless, if you couldn't compare the different alternatives for equality.
I'm thinking that it's best to give up on this, and just put up with having both sum types and typeclasses. Maybe there could be an automated refactoring between the two of them. I really don't like the added complexity that having both introduces.
I'm running into similar issues when coming up with types for Python's overloaded * and % operators. For numbers, it's easy: (*) :: Num a => a -> a -> a. But * is also used for sequence repetition, and in that context, it should be (*) :: Sequence a => a -> Int -> a. And % is used for string interpolation, which can't be typed at all without dependent types. Probably the solution there is to avoid such gratuitous overloading and find new names for the offending functions. I'd go with renaming sequence * as 'repeat', since it's not used all that frequently, and renaming numeric % as 'mod'.
Sunday, November 9, 2008
OO as an organizing principle
I've been toying with the idea of completely revamping the prototype-based object system in favor of an associated-types based typeclass system, similar to the one that was just implemented in GHC. Something feels "off" about the change though, and it's similar to the feeling I had with multimethods. I'm writing this entry to try and pinpoint my reservations, and then see if there's any compromise that resolves them.
I guess I could sum up my problem like this: Object-orientation is an excellent principle for organizing programs, but a sucky language feature.
The nice thing about all OO languages is that given a parameter or other value, you know exactly what you can do with it. For example, if you have a Java method with the signature:
In a structurally-typed language, you have this same ability, except that the method body should be viewed as a series of assertions about the type. For example, in Python:
This method asserts that names has a method __iter__ that obeys the iterator protocol, and that each element returned by the iterable can be added to a string, and that the return value of that is a legal parameter for call_something_else. The code is more general than the Java above, but it reveals less information about the values involved. This is why it's so difficult to build a decent autocompletion engine for Python, and perhaps why dynamic languages have a reputation for being difficult to use on large projects.
In both cases, if the objects you have in mind don't support the operation you want to perform, it's immediately obvious what to do. You write a method that performs that operation. This is problematic if you don't control that class, which is why Ruby and C# have moved to open classes. (In Java or Python, the idiomatic solution for that is to subclass it or write an adapter that you control that delegates to the base class.)
And it's usually obvious where this method should go: on the class that you need to manipulate.
In a language with multimethods or pattern-matching, it's still obvious that you ought to write a method. But now, it's no longer obvious where the method should go: do you define it with the first argument type? The second argument type? The generic function declaration, if your language requires that you declare generic functions instead of assuming all functions are generic? A group of related functions? It's own module?
And then every time someone uses the method, they need to remember where it was defined and add the appropriate import statement. If they forget, they might get behavior they don't expect, as the appropriate method for the given generic function won't even have been loaded.
I tried to come up with some patterns for module arrangements when I was first starting this blog, based on my limited experience. But this is inherently problem-domain-specific: there's no one right organizing principle that applies to everyone's programs.
So again, it's a flexibility vs. productivity tradeoff. A multimethod or typeclass system gives you a lot more flexibility in how you arrange your code. But this requires that you make a lot more decisions on how you arrange your code, and each one of those decisions takes time. And if you get it wrong, it's usually not all that easy to change, requiring that you rework a rats-nest of imports and dependencies.
I guess I could sum up my problem like this: Object-orientation is an excellent principle for organizing programs, but a sucky language feature.
The nice thing about all OO languages is that given a parameter or other value, you know exactly what you can do with it. For example, if you have a Java method with the signature:
public void doSomething(Collection String> names);you know exactly what you can do with names. You can add objects to it; you can iterate over it; you can empty it out. And for each object inside it, you can take a substring, concatenate it to another string, etc. If you want to figure out what you can do with each expression, just look in the JavaDocs or use your IDE's autocompletion.
In a structurally-typed language, you have this same ability, except that the method body should be viewed as a series of assertions about the type. For example, in Python:
def do_something(names):
for name in names:
call_something_else(name + 'foobar');
This method asserts that names has a method __iter__ that obeys the iterator protocol, and that each element returned by the iterable can be added to a string, and that the return value of that is a legal parameter for call_something_else. The code is more general than the Java above, but it reveals less information about the values involved. This is why it's so difficult to build a decent autocompletion engine for Python, and perhaps why dynamic languages have a reputation for being difficult to use on large projects.
In both cases, if the objects you have in mind don't support the operation you want to perform, it's immediately obvious what to do. You write a method that performs that operation. This is problematic if you don't control that class, which is why Ruby and C# have moved to open classes. (In Java or Python, the idiomatic solution for that is to subclass it or write an adapter that you control that delegates to the base class.)
And it's usually obvious where this method should go: on the class that you need to manipulate.
In a language with multimethods or pattern-matching, it's still obvious that you ought to write a method. But now, it's no longer obvious where the method should go: do you define it with the first argument type? The second argument type? The generic function declaration, if your language requires that you declare generic functions instead of assuming all functions are generic? A group of related functions? It's own module?
And then every time someone uses the method, they need to remember where it was defined and add the appropriate import statement. If they forget, they might get behavior they don't expect, as the appropriate method for the given generic function won't even have been loaded.
I tried to come up with some patterns for module arrangements when I was first starting this blog, based on my limited experience. But this is inherently problem-domain-specific: there's no one right organizing principle that applies to everyone's programs.
So again, it's a flexibility vs. productivity tradeoff. A multimethod or typeclass system gives you a lot more flexibility in how you arrange your code. But this requires that you make a lot more decisions on how you arrange your code, and each one of those decisions takes time. And if you get it wrong, it's usually not all that easy to change, requiring that you rework a rats-nest of imports and dependencies.
Sunday, September 14, 2008
Method binding
So, method binding. I'll start by enumerating some of the properties that I want in an object system:
A to-do list of things to do to clean this up:
- Doesn't repeat the receiver argument, i.e. "obj.method(arg1, arg2)" instead of "obj.method(obj, arg1, arg2)"
- Retrieving a method binds it to the receiver, i.e. "f = obj.method" can be called as "f(arg1, arg2)", not "f(obj, arg1, arg2)"
- Can store bound methods in another data structure without accidentally rebinding the receiver. "{ foo: obj.method }.foo()" binds the receiver to obj, not { foo: ... }.
- Can store groups of methods (probably unbound, since there's nothing to bind to at this point) on a separate object and add them as a mixin
- Methods are available in unbound form on the constructor: "Str.split('foo bar', ' ')" == 'foo bar'.split(' ')".
- Constructors are called like ordinary function, i.e. "MyClass()" instead of "new MyClass()". This is really handy when implementing methods that are really classes (like any sort of specialized iterator) or factory functions.
- "is_instance(Str, 'foo')" or "add_superclass(Iterator, { ... })" need to work
- "super(obj).method()" needs to invoke the method on the superclass (obj.proto.proto in a prototype-based system, obj.cls.superclass in a class-based), with the receiver bound to the original object.
- #1 & #2 => a separate concept of bound methods. Must store the intended receiver, either in a closure or as a function attribute. The bound method must be created at attribute access time; otherwise, it has no idea what the receiver will be
- #2 & #3 => an escape hatch to ignore the method-binding machinery if the method is already bound. This could be as simple as looking for a field on the function object
- #4 => functions have a separate existence from their class/prototype. This has been an unwritten assumption in everything I've done, but it isn't the case in eg. Java.
- #5 => Either the constructor is not the class/prototype object, or there has to be a field on the object that indicates whether one should or should not bind methods on that object.
- #4 => The default for record literals should be that methods are not bound when accessed
- #7 => Either the constructor is the class/prototype object, or the class/prototype object needs to be a field on the constructor
- #6 => If the constructor is the class/prototype object, then all objects will end up inheriting function fields & methods by accident. This probably isn't a huge problem, but might be an issue if we make function-manipulation functions methods instead of built-ins.
A to-do list of things to do to clean this up:
- Record objects should never have method_receiver set. The methods that currently apply to records should be moved to standalone functions or eliminated entirely. I'm thinking Record.iter => vars(record), and then eliminating the other ones entirely.
- If the field access result has a method_self field, don't perform the autobinding, fixing #3.
- Change the make_class library function so it creates the constructor as an internal define and then references the constructor itself when setting the prototype, fixing #7
Saturday, September 13, 2008
Direction Change - To Prototypes
First off, I'd like to thank everyone who commented on the last couple of posts on multimethods. I spent a while reading through the Cecil Spec, Dylan Reference Manual, and associated papers by Chambers, Dujardin, Moon, et al.
And in the end, I decided to punt on the feature entirely.
Eve's goal is "to be a practical purely-functional programming language". To that end, there're several constraints that I'm placing upon myself:
Instead, I've moved towards a single-dispatch prototype-based language that's similar to JavaScript or Io, without JavaScript's tangled constructor pattern. A layer of syntactic sugar sits on top of this, providing a class-based OO system that's virtually identical to Python's. The code on GitHub's already been updated to implement this, though it's a bit buggy, and the rest of the blog post is devoted to explaining some of the rationale behind these language features.
Records
Like JavaScript and Python, every value in Eve is a record, capable of holding arbitrary fields. This includes functions and primitives. I've found this very useful in Python - it's nice to be able to include name/doc/type attributes directly on functions, or encoding/escaping info on strings.
Fields are accessed through normal dot notation: "obj.value" accesses field "value" on "obj". There are also three binary operators on fields:
Another unresolved issue is whether two primitives with equal values but different fields should be considered ==. For example, does "2 == 2 | { unit: :inches }"? Does "1000000 == 1000000 | { proto: Timestamp }"? When I was doing ArcLite in JavaScript, it would've been really nice if the answer was yes, because then I could pass my Arc data (which had been augmented with methods needed by the interpreter) directly into JavaScript code and have them just work. OTOH, most of the other use-cases would suggest no, because you don't want your dates to be automatically coerced to ints, and you don't want your inches to be confused with feet.
A "no" answer is also friendlier towards a potential compiler. I plan to use a tagged immediate representation for common objects (fixedints, maybe bools and small strings). However, primitives with special fields will always need to be boxed, because the fields alone take up more than one word of space. If object equality includes field equality, then I can always be sure that an immediate != a boxed value, while otherwise I might have to actually examine the boxed value.
Prototypes
All objects have two special fields:
* "proto", the object's prototype.
* "im_receiver", the value of 'self' for bound methods. I'm not sure on the name; it's intended to be reminiscent of the im_self and im_func attributes on Python bound methods. (I've since changed it to method_receiver in Git; I think "im_" is a bit of an onion.)
Proto is pretty familiar: when field lookup fails on an object, it tries the object's prototype, and so on down the chain. (If it fails on everything, then the object's "attr" method is called if it exists, mimicking Python's getattr or Ruby's method_missing.)
im_receiver is there to mimick a certain Python behavior: if you call "obj.method" in Python, you don't just get back method, you get back method bound to obj, so that you don't have to re-specify obj when you call it. Similarly, if an im_receiver attribute exists on an object and has a value of None, "obj.method" will return a new method with "obj" bound as the first argument. If im_receiver is bound to another object, it will use that object instead - this is how superclass lookup is implemented, and may be useful for other metaprogramming. If im_receiver is missing, you get the raw function, so you can do pure collections of functions.
The method binding behavior is very much broken at this point, enough that I frequently can't remember whether a method will be bound or unbound when retrieved. It'll be overhauled in the near future, and I'll probably do a blog entry to clarify some of the issues.
Classes
Eve layers a class-based syntax on top of the prototypes, since nearly every language that lacks a class-based OO system tends to reinvent one. A class definition like this one:
translates into a function call like this:
(The internal defines themselves translate into lambdas in a letrec.)
make_class is a function that invokes the init method and then adjusts the prototype:
In general, this seems to be working out fine, modulo a couple problems with letrec and the above confusion about method binding. It's also really cool to have class creation as a library function, because it makes it trivial to put together classes dynamically.
And in the end, I decided to punt on the feature entirely.
Eve's goal is "to be a practical purely-functional programming language". To that end, there're several constraints that I'm placing upon myself:
- Must be learnable within a weekend by a motivated and reasonably intelligent programmer.
- Must "look" familiar to programmers used to working with Python, Ruby, or JavaScript
- Must not have assignment or mutable data structures in the core language (I'm willing to relax this restriction in the future, like how Haskell added in mutable cells with IORefs and STRefs, but the defaults should always steer programmers towards purely-functional alternatives)
- Must be convenient to use for the types of programming that programmers like to do in their spare time: scripting, string manipulation, calculation & data analysis (often from screen scraping), and webapps.
Instead, I've moved towards a single-dispatch prototype-based language that's similar to JavaScript or Io, without JavaScript's tangled constructor pattern. A layer of syntactic sugar sits on top of this, providing a class-based OO system that's virtually identical to Python's. The code on GitHub's already been updated to implement this, though it's a bit buggy, and the rest of the blog post is devoted to explaining some of the rationale behind these language features.
Records
Like JavaScript and Python, every value in Eve is a record, capable of holding arbitrary fields. This includes functions and primitives. I've found this very useful in Python - it's nice to be able to include name/doc/type attributes directly on functions, or encoding/escaping info on strings.
Fields are accessed through normal dot notation: "obj.value" accesses field "value" on "obj". There are also three binary operators on fields:
- Extension: "obj | { foo: 2, bar: 3 }" creates a new value that's the same as obj, but contains the additional fields "foo" and "bar", overriding existing fields if present. This is the same as $.extend({}, obj, {foo: 2, bar: 3}) in many JavaScript libraries.
- Restriction: "obj & ['foo', 'bar']" creates a new value from obj that *only* contains fields "foo" and "bar". I'm unsure whether to use strings or symbols ("[:foo, :bar]") for the field labels, or even whether symbols will stay in the language as a separate concept at all (I'm leaning towards yes, but comments on the symbol blog post suggested that I just use strings).
- Exclusion: "obj ~ ['foo', 'bar']" creates a new value from obj with foo and bar deleted.
Another unresolved issue is whether two primitives with equal values but different fields should be considered ==. For example, does "2 == 2 | { unit: :inches }"? Does "1000000 == 1000000 | { proto: Timestamp }"? When I was doing ArcLite in JavaScript, it would've been really nice if the answer was yes, because then I could pass my Arc data (which had been augmented with methods needed by the interpreter) directly into JavaScript code and have them just work. OTOH, most of the other use-cases would suggest no, because you don't want your dates to be automatically coerced to ints, and you don't want your inches to be confused with feet.
A "no" answer is also friendlier towards a potential compiler. I plan to use a tagged immediate representation for common objects (fixedints, maybe bools and small strings). However, primitives with special fields will always need to be boxed, because the fields alone take up more than one word of space. If object equality includes field equality, then I can always be sure that an immediate != a boxed value, while otherwise I might have to actually examine the boxed value.
Prototypes
All objects have two special fields:
* "proto", the object's prototype.
* "im_receiver", the value of 'self' for bound methods. I'm not sure on the name; it's intended to be reminiscent of the im_self and im_func attributes on Python bound methods. (I've since changed it to method_receiver in Git; I think "im_" is a bit of an onion.)
Proto is pretty familiar: when field lookup fails on an object, it tries the object's prototype, and so on down the chain. (If it fails on everything, then the object's "attr" method is called if it exists, mimicking Python's getattr or Ruby's method_missing.)
im_receiver is there to mimick a certain Python behavior: if you call "obj.method" in Python, you don't just get back method, you get back method bound to obj, so that you don't have to re-specify obj when you call it. Similarly, if an im_receiver attribute exists on an object and has a value of None, "obj.method" will return a new method with "obj" bound as the first argument. If im_receiver is bound to another object, it will use that object instead - this is how superclass lookup is implemented, and may be useful for other metaprogramming. If im_receiver is missing, you get the raw function, so you can do pure collections of functions.
The method binding behavior is very much broken at this point, enough that I frequently can't remember whether a method will be bound or unbound when retrieved. It'll be overhauled in the near future, and I'll probably do a blog entry to clarify some of the issues.
Classes
Eve layers a class-based syntax on top of the prototypes, since nearly every language that lacks a class-based OO system tends to reinvent one. A class definition like this one:
class Range(Iterator):
def init(start=0, stop, step=1): { start: start, stop: stop, step: step }
def iter(self): self
def get(self): self.start
def next(self): self | { start: self.start + self.step }
def is_valid(self): self.start <= self.stop def len(self): (self.stop - self.start) / self.step
translates into a function call like this:
Range = (def ()
def init(start=0, stop, step=1): { start: start, stop: stop, step: step }
def iter(self): self
def get(self): self.start
def next(self): self | { start: self.start + self.step }
def is_valid(self): self.start <= self.stop
def len(self): (self.stop - self.start) / self.step
make_class(Iterator, locals())
)()
(The internal defines themselves translate into lambdas in a letrec.)
make_class is a function that invokes the init method and then adjusts the prototype:
def make_class(superclass, methods):
{| *args | methods.init(*args) | {
proto: methods | { proto: superclass },
method_receiver: None
}} | methods | { proto: superclass }
In general, this seems to be working out fine, modulo a couple problems with letrec and the above confusion about method binding. It's also really cool to have class creation as a library function, because it makes it trivial to put together classes dynamically.
Monday, August 4, 2008
More on multimethods
Some more thoughts to synthesize some of yesterday's post, in preparation for implementation:
Capabilities
I remembered another important consideration after writing yesterday's blog post: security. Say that you have a feature that lets you dynamically import a module yet restrict the imports that it can access:
If the only methods that are accessible are those that are explicitly imported, then you know that no code in these modules may access other, unsafe functionality. But if multimethods silently import other implementations, there's no guarantee that one of the multimethods in eve.data.sequence doesn't have an implementation in unsafe.file_access that it can use to stomp on the local hard disk. Security auditing just got much more difficult: instead of simply looking at the data you're passing in to the plugin and ensuring that safe_modules really are safe, you've got to examine every module that subtypes a type in safe_modules.
Method Dispatch
One of the comments in the other post pointed out that Dylan solved many of these multiple dispatch in an intended-for-production-use language. Dylan's method dispatch is similar to Cecil's: it compares each argument for specificity, and if two argument positions conflict, then the methods are ambiguous and don't participate in method dispatch.
Dylan's approach also handles keyword arguments: since it makes no reference to argument position, the arguments don't have to be positional.
If I stick with multimethods, I'll probably go with something like this. Dylan's approach is actually more complicated than I need, because they allow multiple inheritance and so need to worry about monotonic linearizations and such. I should be able to use simple subtyping rules on the declared method types, straight out of TAPL.
Dropping multimethods entirely?
I'm thinking that it may be best to drop multimethods entirely and rely on a very Pythonic single-dispatch object system. Obviously it'll be a bit different since objects are immutable, but languages like Python, Java, Smalltalk, and Ruby have basically figured this out and worked out most of the kinks. I'm reminded of GvR's Language Design is Not Just Solving Puzzles - there's a point where each additional feature you add increases complexity rather than reducing it.
So, we can treat methods as simple fields of records where the first argument is the record itself. Same as in OO C, or Python. Maybe have some syntactic sugar to make it easy. This approach has already been well-tested, it's "good enough", and it lets me get to the more interesting work of defining libraries.
There's one issue with this: in my Python programming, I frequently start prototyping using built-in data structures (dicts, tuples) and standalone functions, and then later refactor it to use classes. The prototyping stage is necessary, because it's usually not clear what fields are necessary, what functions are necessary, and how those functions should be organized. But the process of refactoring introduces a big discontinuity that usually requires that the program be rewritten entirely, because it changes the calling syntax to access everything that becomes a class.
I'd like to be able to transparently "harden" a program, making the classes and data structures more well-defined and organized without having to touch every single place that I use them.
The existing equivalence between function calls and record field access is one attempt at this, but it's limited to being able to compute fields from the existing data in the record. A good start - no need for the @property built-in with this - but it falls down horribly when you need additional arguments.
Some code will help. What I want is for this:
To be equivalent to this:
With equivalent meaning that the method calls look exactly the same, so you could first convert the method definition to the class form, have your program still work, and then convert each individual call to the obj.method syntax at your leisure.
Bonus points if the resulting syntax looks like the one for function pipelining; I really dislike the idea of introducing two different function call syntaxes like I suggested yesterday.
The obvious solution is to overload the function call syntax so that if the function isn't found, it looks for a matching field in the first argument and then executes that. Then the method invocation syntax could just be syntactic sugar.
Another neat (though potentially confusing) effect of this is that local function definitions shadow method calls, so you can locally specialize particular method calls. I'm a bit worried about name conflicts with this, though.
There're two problems with this: it leaves no way to get the actual bound method, and the precedence rules are wrong if it needs to coincide with function pipelining. The first problem can be worked around easily enough using partial-application: if obj->foo(arg1, arg2) is syntactic sugar for foo(obj, arg1, arg2), then obj->foo(?, ?) is syntactic sugar for {| x, y | foo(obj, x, y)}, which is exactly what you'd use a bound method for. This is also more flexible, since you can bind any one of the arguments instead of just the message receiver.
The second one is potentially a problem. Normally, you'd expect:
to invoke method1 on object, then invoke method2 on the result. But with the existing -> definition, it parses to:
Oh wait, that's correct. So the confusing part is if were expecting to use partial application to setup a pipeline, like the example at the beginning:
This would parse to map(filter(list_dir('plugins'), ext(?, 'eve'), ?) ...), which is nonsense. Perhaps just allowing people people to parenthesize expressions? Combining this with sane methods (like map & filter being methods of sequence types), the above example would look like this:
Not so bad. The big test is what happens if call them in an order that's not bound to the receiver:
I think that'd work, and it doesn't look so bad, but it's probably best just to try it out and see rather than work it all out in my head.
Capabilities
I remembered another important consideration after writing yesterday's blog post: security. Say that you have a feature that lets you dynamically import a module yet restrict the imports that it can access:
safe_modules = ['eve.data.*', 'eve.math', 'eve.regexp']
plugins = list_dir('plugins') -> filter(ext(?, 'eve'), ?)
-> map(dynamic_import(?, safe_modules), ?)
If the only methods that are accessible are those that are explicitly imported, then you know that no code in these modules may access other, unsafe functionality. But if multimethods silently import other implementations, there's no guarantee that one of the multimethods in eve.data.sequence doesn't have an implementation in unsafe.file_access that it can use to stomp on the local hard disk. Security auditing just got much more difficult: instead of simply looking at the data you're passing in to the plugin and ensuring that safe_modules really are safe, you've got to examine every module that subtypes a type in safe_modules.
Method Dispatch
One of the comments in the other post pointed out that Dylan solved many of these multiple dispatch in an intended-for-production-use language. Dylan's method dispatch is similar to Cecil's: it compares each argument for specificity, and if two argument positions conflict, then the methods are ambiguous and don't participate in method dispatch.
Dylan's approach also handles keyword arguments: since it makes no reference to argument position, the arguments don't have to be positional.
If I stick with multimethods, I'll probably go with something like this. Dylan's approach is actually more complicated than I need, because they allow multiple inheritance and so need to worry about monotonic linearizations and such. I should be able to use simple subtyping rules on the declared method types, straight out of TAPL.
Dropping multimethods entirely?
I'm thinking that it may be best to drop multimethods entirely and rely on a very Pythonic single-dispatch object system. Obviously it'll be a bit different since objects are immutable, but languages like Python, Java, Smalltalk, and Ruby have basically figured this out and worked out most of the kinks. I'm reminded of GvR's Language Design is Not Just Solving Puzzles - there's a point where each additional feature you add increases complexity rather than reducing it.
So, we can treat methods as simple fields of records where the first argument is the record itself. Same as in OO C, or Python. Maybe have some syntactic sugar to make it easy. This approach has already been well-tested, it's "good enough", and it lets me get to the more interesting work of defining libraries.
There's one issue with this: in my Python programming, I frequently start prototyping using built-in data structures (dicts, tuples) and standalone functions, and then later refactor it to use classes. The prototyping stage is necessary, because it's usually not clear what fields are necessary, what functions are necessary, and how those functions should be organized. But the process of refactoring introduces a big discontinuity that usually requires that the program be rewritten entirely, because it changes the calling syntax to access everything that becomes a class.
I'd like to be able to transparently "harden" a program, making the classes and data structures more well-defined and organized without having to touch every single place that I use them.
The existing equivalence between function calls and record field access is one attempt at this, but it's limited to being able to compute fields from the existing data in the record. A good start - no need for the @property built-in with this - but it falls down horribly when you need additional arguments.
Some code will help. What I want is for this:
def foo(self, arg1, arg2): ...
foo(obj, x, y)
To be equivalent to this:
class Foo:
def foo(self, arg1, arg2): ...
obj.foo(x, y)
With equivalent meaning that the method calls look exactly the same, so you could first convert the method definition to the class form, have your program still work, and then convert each individual call to the obj.method syntax at your leisure.
Bonus points if the resulting syntax looks like the one for function pipelining; I really dislike the idea of introducing two different function call syntaxes like I suggested yesterday.
The obvious solution is to overload the function call syntax so that if the function isn't found, it looks for a matching field in the first argument and then executes that. Then the method invocation syntax could just be syntactic sugar.
Another neat (though potentially confusing) effect of this is that local function definitions shadow method calls, so you can locally specialize particular method calls. I'm a bit worried about name conflicts with this, though.
There're two problems with this: it leaves no way to get the actual bound method, and the precedence rules are wrong if it needs to coincide with function pipelining. The first problem can be worked around easily enough using partial-application: if obj->foo(arg1, arg2) is syntactic sugar for foo(obj, arg1, arg2), then obj->foo(?, ?) is syntactic sugar for {| x, y | foo(obj, x, y)}, which is exactly what you'd use a bound method for. This is also more flexible, since you can bind any one of the arguments instead of just the message receiver.
The second one is potentially a problem. Normally, you'd expect:
obj->method1(a, b)->method2(c, d)
to invoke method1 on object, then invoke method2 on the result. But with the existing -> definition, it parses to:
method2(method1(obj, a, b), c, d)
Oh wait, that's correct. So the confusing part is if were expecting to use partial application to setup a pipeline, like the example at the beginning:
safe_modules = ['eve.data.*', 'eve.math', 'eve.regexp']
plugins = list_dir('plugins') -> filter(ext(?, 'eve'), ?)
-> map(dynamic_import(?, safe_modules), ?)
This would parse to map(filter(list_dir('plugins'), ext(?, 'eve'), ?) ...), which is nonsense. Perhaps just allowing people people to parenthesize expressions? Combining this with sane methods (like map & filter being methods of sequence types), the above example would look like this:
safe_modules = ['eve.data.*', 'eve.math', 'eve.regexp']
plugins = list_dir('plugins') -> filter(? -> ext('eve'))
-> map(dynamic_import(?, safe_modules))
Not so bad. The big test is what happens if call them in an order that's not bound to the receiver:
FILTERS = [ext(?, 'svn'), starts_with(?, '.'), ends_with(?, '~'), ext(?, 'swp')]
def all_files(dir):
list_dir(dir) -> (fold(FILTERS, map -> flip, ?))
I think that'd work, and it doesn't look so bad, but it's probably best just to try it out and see rather than work it all out in my head.
Saturday, August 2, 2008
The pitfalls of multimethods
I'm usually reluctant to write a blog post without checking in any code, since it's a bad habit to get into. But the last entry resulted in a lot of good feedback, much of which indicated the feature wasn't necessary at all. That's about the best outcome that could be hoped for - the fewer the features, the better the language. But it's better if I decide to cut the features before I write them. ;-)
The existing interpreter already has multimethods - sorta. But the method lookup algorithm is basically a mess. Here're some of the pitfalls I've run into. If I decide to nix the feature entirely, at least I'll have a paper trail of why it didn't work so I don't make the same mistakes later.
Basics of Multimethods
Some basic background for those who aren't familiar with multimethods. In "normal" OO languages, methods are bound to instances - you get polymorphism by changing the type of the instance. With multimethods, methods are instead bound to generic functions, which then dispatch to a particular implementation based on the type of all arguments. So instead of just specializing on the first argument (the message receiver), you can specialize on any combination of arguments.
An example:
Eve's subscript notation is syntactic sugar for a "get" method call. Instead of having separate concepts of slicing, indexing, and iterations, the following examples all invoke the same method:
Since sequence can itself be any object that conforms to the sequence type, this dispatches on the type of both arguments. Oftentimes, there's some specific linearization algorithm that determines the order in which methods are tried. Eve's current interpreter uses simple textual order - more recently defined variables are tried first, then others if they raise a TypeError. This is a complete mess, though - among other things, it means that the textual order of the file matters, and that the order of imports can create subtle differences in which method is called. One of the sections in this blog entry will about method linearization order.
And now the pitfalls:
Closures
Ideally, generic functions should be first-class values just like anything else. But what happens in the presence of lexical scoping? Say that you have a generic function 'add' that's defined over numbers and their subclasses, so that you can add two ints or two floats or two vectors or two matrices. Then you try to define this function:
Should the new definition of 'add' define a new method on the existing generic function, letting you pass normal numbers into 'foo' in addition to FunkySubtypes? Should the new method be visible outside the definition of 'foo', even though it was declared as lexically scoped? Or should the definition shadow the generic function, so that 'add' is only a regular function that can't be used with normal numbers?
To my knowledge, the only language that's dealt with this issue is CLOS. AFAIK (correct me if I'm wrong), it adds a new method to the existing generic function, which is then visible globally. But this is a gross violation of encapsulation - you wouldn't expect that defining a local closure could change the behavior of the program at a faraway call site.
The ideal semantics, to me, would have it locally add a new method that applies only within the body of 'foo' - this keeps changes local, yet lets you redefine special cases as necessary. But then what happens if you bind 'add' to a record field, return it, or pass it to a function? It seems like it'd be most useful to have it pass the new generic function, including the newly-defined method. But this means that you have to compute new dispatch rules whenever a multimethod "escapes" from a function, so certain operations that look computationally simple could become quite heavyweight.
Now what if instead of being defined locally, 'add' was the name of a parameter? Should the parameter shadow the existing multimethod, or should the definitions be combined together? What if 'add' doesn't hold a function at all, but instead is just an ordinary value? What if it does hold a function, but the name collision is accidental?
I'm inclined to make defs the only way to add methods to a generic function, for this reason. Too easy to accidentally create terrible messes. Also, usually if you bind a variable, you want that variable to mean *only* what you've bound it to, and not that plus whatever else happens to be in the environment.
Module system
Since implementations end up bound to the generic function, there's a problem when integrating multimethods with module systems. How do you tell which implementations should be visible at the call site?
For example, what if some client code looked like this:
The intent is for first_two to work with any sort of sequence, but this requires that the all implementations of 'get', everywhere within the program, be visible. Okay, we can do that - as long as we're dealing with a known variable. Now what if, instead of 'get', we were calling a function (possibly a multimethod) that had been passed in from elsewhere? I guess it would be up to the caller to package up whatever dispatching was necessary to determine which method is most applicable.
Now what if you didn't want to call any matching method in any module? After all, the point of a module system is to hide unrelated names from each other - if you break that down, you risk some very odd name-collision bugs, akin to the mess that comes from using Prototype.js or Mootools or monkey-patching things in Ruby.
Ultimately, the problem here seems to be that you want to import associated methods, when they match your intended use, and yet you don't want to import unrelated methods that have the same name but completely different semantics. In other words, you want the computer to be a mind-reader.
To my knowledge, the only language that's thought seriously about these issues is Cecil. They use the concept of Extension Modules: whenever a module defines a class and methods that subclass an existing method, the module is defined as an "extension module" to the module which contains the original signature. When the original interface is imported, all extension modules are implicitly imported as well.
This has the following problems when transferred to Eve:
Method Dispatch
Then there are problems in resolving method priority. With single-dispatch & single-inheritance, the subtype relationship has a partial order, and so you can just use that order for method dispatch. With single dispatch & multiple inheritance, you don't necessarily have a partial order, but the language usually forbids any class hierarchy that leads to cycles in the inheritance graph. But with multiple dispatch, you may have a different ordering for each parameter, and the orderings may conflict in ways that are not detectable to the compiler. Worse, two modules that are correct in isolation may result in conflicts when combined.
For example, say you have the following type relationship:
If you call foo with a B and a Y, which method should be called? The existing Eve interpreter tries methods in textual order (actually, reverse textual order owing to an interpreter quirk), much like how pattern-matching works in Haskell/Erlang/Ocaml. So the first variant gets called.
Usually, this is what you want - the point of subtyping is to provide a specialization, and so specialized methods should take precedence over textual order. But there're two different possible specializations here. You could take the second definition, which is more specific in the second argument, or the third definition, which is more specific in the first argument.
CLOS and Dylan, I believe, do the latter, preferencing argument positions in left-to-right order. This keeps behavior well-defined, but can sometimes lead to confusing mistakes.
Cecil raises an error in situations like this, where the specificity is ambiguous. But this can a problem when importing hierarchies from two different modules. The parameters to 'foo' may have come from anywhere; by adding a new module with new subtypes, you risk making the call to 'foo' ambiguous (and hence incorrect) even though each individual subtype is correct in isolation.
Keyword and varargs
The left-to-right ordering creates an additional problem with keyword arguments. Python has a nice feature (not yet implemented in Eve, but it'd be cool) where you can supply any positional argument as a keyword argument and take it out of the order. But what should the specificity rules be for keyword arguments then? Two different calls might have different precedence rules based on the order in which they supplied keyword arguments, yet the whole point of keywords is that order doesn't matter.
One possible resolution would be to just not implement keyword arguments. For most usages, passing in a literal record or dict is just as effective, and doesn't raise problems like this (plus, keyword arguments can have similar issues with first-class functions, where you need the internal parameter names to match up in addition to their positions). But it's yet another complicating factor for multimethods.
Conclusion:
I'm leaning towards just dropping multimethods from the design. The vast majority of practical applications do not need multimethods - after all, people do just fine with Python and Java. For the opening example with subscripting, I could just have separate methods for get/slice/restrict. That'd just sidestep the problem, and probably get most of the functionality I need.
Another question this opens up is what to use for regular single-dispatch in the absence of multiple dispatch, since I was depending quite a bit on it. One possibility is to bring back the . as syntax, and use it like in Python, to bind methods to objects:
Then perhaps have some syntactic sugar for defining classes, which are basically records of functions.
This is simple and well-tested: the one thing I'm not sure about is how it interacts with the equivalency between record fields and local function calls. I really like this equivalency; among other things, it lets you prototype with explicit data fields, and then switch to a computed representation later. It looks like it works okay here too, except that we should provide some sort of 'bind' operator so that given a record { field1, field2 } and a method method1, we have bind(record, 'method1') = { field1, field2, method1 }. I could also see this being very useful for eg. template variables or other quick prototyping.
The existing interpreter already has multimethods - sorta. But the method lookup algorithm is basically a mess. Here're some of the pitfalls I've run into. If I decide to nix the feature entirely, at least I'll have a paper trail of why it didn't work so I don't make the same mistakes later.
Basics of Multimethods
Some basic background for those who aren't familiar with multimethods. In "normal" OO languages, methods are bound to instances - you get polymorphism by changing the type of the instance. With multimethods, methods are instead bound to generic functions, which then dispatch to a particular implementation based on the type of all arguments. So instead of just specializing on the first argument (the message receiver), you can specialize on any combination of arguments.
An example:
Eve's subscript notation is syntactic sugar for a "get" method call. Instead of having separate concepts of slicing, indexing, and iterations, the following examples all invoke the same method:
sequence[0] => get(sequence, 0) # Indexing
sequence[1..5] => get(sequence, Range(1, 5)) # Slicing
sequence[[1, 4, 5, 9]] => get(sequence, [1, 4, 5, 9]) # Retrieve specific indices
sequence[] => get(sequence) # Iterator query
Since sequence can itself be any object that conforms to the sequence type, this dispatches on the type of both arguments. Oftentimes, there's some specific linearization algorithm that determines the order in which methods are tried. Eve's current interpreter uses simple textual order - more recently defined variables are tried first, then others if they raise a TypeError. This is a complete mess, though - among other things, it means that the textual order of the file matters, and that the order of imports can create subtle differences in which method is called. One of the sections in this blog entry will about method linearization order.
And now the pitfalls:
Closures
Ideally, generic functions should be first-class values just like anything else. But what happens in the presence of lexical scoping? Say that you have a generic function 'add' that's defined over numbers and their subclasses, so that you can add two ints or two floats or two vectors or two matrices. Then you try to define this function:
def foo(num1 as Int, number_like_but_some_funky_subtype):
def add(num1 as Int, num2 as FunkySubtype):
# whatever
add(num1, number_like_but_some_funky_subtype)
Should the new definition of 'add' define a new method on the existing generic function, letting you pass normal numbers into 'foo' in addition to FunkySubtypes? Should the new method be visible outside the definition of 'foo', even though it was declared as lexically scoped? Or should the definition shadow the generic function, so that 'add' is only a regular function that can't be used with normal numbers?
To my knowledge, the only language that's dealt with this issue is CLOS. AFAIK (correct me if I'm wrong), it adds a new method to the existing generic function, which is then visible globally. But this is a gross violation of encapsulation - you wouldn't expect that defining a local closure could change the behavior of the program at a faraway call site.
The ideal semantics, to me, would have it locally add a new method that applies only within the body of 'foo' - this keeps changes local, yet lets you redefine special cases as necessary. But then what happens if you bind 'add' to a record field, return it, or pass it to a function? It seems like it'd be most useful to have it pass the new generic function, including the newly-defined method. But this means that you have to compute new dispatch rules whenever a multimethod "escapes" from a function, so certain operations that look computationally simple could become quite heavyweight.
Now what if instead of being defined locally, 'add' was the name of a parameter? Should the parameter shadow the existing multimethod, or should the definitions be combined together? What if 'add' doesn't hold a function at all, but instead is just an ordinary value? What if it does hold a function, but the name collision is accidental?
I'm inclined to make defs the only way to add methods to a generic function, for this reason. Too easy to accidentally create terrible messes. Also, usually if you bind a variable, you want that variable to mean *only* what you've bound it to, and not that plus whatever else happens to be in the environment.
Module system
Since implementations end up bound to the generic function, there's a problem when integrating multimethods with module systems. How do you tell which implementations should be visible at the call site?
For example, what if some client code looked like this:
def first_two(sequence): [sequence[0], sequence[1]]
The intent is for first_two to work with any sort of sequence, but this requires that the all implementations of 'get', everywhere within the program, be visible. Okay, we can do that - as long as we're dealing with a known variable. Now what if, instead of 'get', we were calling a function (possibly a multimethod) that had been passed in from elsewhere? I guess it would be up to the caller to package up whatever dispatching was necessary to determine which method is most applicable.
Now what if you didn't want to call any matching method in any module? After all, the point of a module system is to hide unrelated names from each other - if you break that down, you risk some very odd name-collision bugs, akin to the mess that comes from using Prototype.js or Mootools or monkey-patching things in Ruby.
Ultimately, the problem here seems to be that you want to import associated methods, when they match your intended use, and yet you don't want to import unrelated methods that have the same name but completely different semantics. In other words, you want the computer to be a mind-reader.
To my knowledge, the only language that's thought seriously about these issues is Cecil. They use the concept of Extension Modules: whenever a module defines a class and methods that subclass an existing method, the module is defined as an "extension module" to the module which contains the original signature. When the original interface is imported, all extension modules are implicitly imported as well.
This has the following problems when transferred to Eve:
- It relies on nominal subtyping; Eve has structural subtyping. For example, a Range in Eve is a subtype of a sequence, and has all the normal sequence methods defined. However, anything with 'start' and 'stop' integer fields is a subtype of Range, and should also get the sequence methods "for free". How would a module elsewhere that defines { start as Int, stop as Int } types know that it should be an extension module of Range? Does it have to search every installed module for ones with compatible signatures?
- It doesn't interact well with first-class generic functions. If GFs are first-class, you might pass them into modules that don't import the original function. How do you handle the extension modules then? The receiver isn't even importing the signature module, let alone the extension modules.
Method Dispatch
Then there are problems in resolving method priority. With single-dispatch & single-inheritance, the subtype relationship has a partial order, and so you can just use that order for method dispatch. With single dispatch & multiple inheritance, you don't necessarily have a partial order, but the language usually forbids any class hierarchy that leads to cycles in the inheritance graph. But with multiple dispatch, you may have a different ordering for each parameter, and the orderings may conflict in ways that are not detectable to the compiler. Worse, two modules that are correct in isolation may result in conflicts when combined.
For example, say you have the following type relationship:
type A
type B subtypes A
type X
type Y subtypes X
def foo(a as A, x as X): ...
def foo(a as A, x as Y): ...
def foo(a as B, x as X): ...
If you call foo with a B and a Y, which method should be called? The existing Eve interpreter tries methods in textual order (actually, reverse textual order owing to an interpreter quirk), much like how pattern-matching works in Haskell/Erlang/Ocaml. So the first variant gets called.
Usually, this is what you want - the point of subtyping is to provide a specialization, and so specialized methods should take precedence over textual order. But there're two different possible specializations here. You could take the second definition, which is more specific in the second argument, or the third definition, which is more specific in the first argument.
CLOS and Dylan, I believe, do the latter, preferencing argument positions in left-to-right order. This keeps behavior well-defined, but can sometimes lead to confusing mistakes.
Cecil raises an error in situations like this, where the specificity is ambiguous. But this can a problem when importing hierarchies from two different modules. The parameters to 'foo' may have come from anywhere; by adding a new module with new subtypes, you risk making the call to 'foo' ambiguous (and hence incorrect) even though each individual subtype is correct in isolation.
Keyword and varargs
The left-to-right ordering creates an additional problem with keyword arguments. Python has a nice feature (not yet implemented in Eve, but it'd be cool) where you can supply any positional argument as a keyword argument and take it out of the order. But what should the specificity rules be for keyword arguments then? Two different calls might have different precedence rules based on the order in which they supplied keyword arguments, yet the whole point of keywords is that order doesn't matter.
One possible resolution would be to just not implement keyword arguments. For most usages, passing in a literal record or dict is just as effective, and doesn't raise problems like this (plus, keyword arguments can have similar issues with first-class functions, where you need the internal parameter names to match up in addition to their positions). But it's yet another complicating factor for multimethods.
Conclusion:
I'm leaning towards just dropping multimethods from the design. The vast majority of practical applications do not need multimethods - after all, people do just fine with Python and Java. For the opening example with subscripting, I could just have separate methods for get/slice/restrict. That'd just sidestep the problem, and probably get most of the functionality I need.
Another question this opens up is what to use for regular single-dispatch in the absence of multiple dispatch, since I was depending quite a bit on it. One possibility is to bring back the . as syntax, and use it like in Python, to bind methods to objects:
# Existing pipelining calls
obj -> method # record access, same as method(obj)
obj -> method1 -> method2 # method2(method1(obj))
obj -> method(?, arg) # partial app, same as {| x | method(x, arg)}(obj)
# New bound method calls
obj.method # bound method, same as {| *args | method(obj, *args) }
obj.method(arg) # method invocation, same as (obj->method)(obj, arg)
Then perhaps have some syntactic sugar for defining classes, which are basically records of functions.
This is simple and well-tested: the one thing I'm not sure about is how it interacts with the equivalency between record fields and local function calls. I really like this equivalency; among other things, it lets you prototype with explicit data fields, and then switch to a computed representation later. It looks like it works okay here too, except that we should provide some sort of 'bind' operator so that given a record { field1, field2 } and a method method1, we have bind(record, 'method1') = { field1, field2, method1 }. I could also see this being very useful for eg. template variables or other quick prototyping.
Subscribe to:
Posts (Atom)