
Rethinking filter


Rethinking filter

Martin Odersky
We have a number of open tickets which all want that

  for (x <- xs if guard) body

should expand to

  for (x <- xs) if (guard) body

rather than the current expansion

  for (x <- xs filter (x => guard)) body

The motivations are

(1) you might want to write a guard that depends on side effects in body.
(2) you might want to avoid constructing a secondary list or other
data structure containing the filtered elements.

One problem is how to integrate this with for-yield expressions. It
would be unacceptable for guards to behave differently depending
on whether we are in a for-loop or a for-yield.
The other problem is how to do this for all kinds of generators,
collections and others, in a uniform way.

Unlike in Haskell, there's no unit element that we can use in a
for-expression. So the only alternative seems to be to make use of
filterForeach, filterMap, filterFlatMap operations in the generator
classes. That is, for-expression expansion would still work as
described by Jorge:
>>>>

  for (x <- c) f(x)

simply translates to:

  c.foreach(x => f(x))

Likewise:

  for (x <- c) yield f(x)

simply translates to:

  c.map(x => f(x))

Something with a guard:

  for (x <- c if p(x)) yield f(x)

translates to:

  c.filter(x => p(x)).map(x => f(x))

And a nested for:

  for (x1 <- c1; x2 <- c2) yield f(x1, x2)

translates to:

  c1.flatMap(x1 => c2.map(x2 => f(x1, x2)))

<<<
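
For reference, these translation rules can be checked against a running
compiler. A small sketch (modern Scala; with a pure guard the filter-based
expansion shown above and what the compiler actually emits coincide):

```scala
object DesugarDemo extends App {
  val c = List(1, 2, 3, 4, 5)
  def p(x: Int) = x % 2 == 1   // guard
  def f(x: Int) = x * 10       // body

  // sugared form vs. the hand-written expansion described above
  val sugared  = for (x <- c if p(x)) yield f(x)
  val expanded = c.filter(x => p(x)).map(x => f(x))

  assert(sugared == expanded)  // both give List(10, 30, 50)
}
```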

But, there would be a post-processing step, where

  c1.filter(p).map(f)   is rewritten to c1.filterMap(p, f)
  c1.filter(p).flatMap(f)   is rewritten to c1.filterFlatMap(p, f)
  c1.filter(p).foreach(f)   is rewritten to c1.filterForeach(p, f)

These rewritings would only happen for filters/maps/flatMaps generated
in for-expressions and only if the c1 type had defined the filterXXX
operations. We don't want to demand that for every generator type,
because that would make it too complicated to define generators. But
collections would uniformly have those operations, and implement them
with lazy filters. For instance, in class TraversableLike:

  def filterForeach[U](p: A => Boolean, f: A => U): Unit =
    this foreach (x => if (p(x)) f(x))

  def filterMap[B, That](p: A => Boolean, f: A => B)(implicit bf: BuilderFactory[B, That, Repr]): That = {
    val b = bf(repr)
    this foreach (x => if (p(x)) b += f(x))
    b.result
  }
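
To see the intent without the BuilderFactory machinery, here is a minimal
List-only sketch of the same two operations (the List-specific signatures
are my simplification for illustration, not part of the proposal):

```scala
object FusedOps extends App {
  // filterForeach: one traversal, no intermediate filtered collection
  def filterForeach[A, U](xs: List[A])(p: A => Boolean, f: A => U): Unit =
    xs.foreach(x => if (p(x)) f(x))

  // filterMap: build the result directly, skipping elements that fail p
  def filterMap[A, B](xs: List[A])(p: A => Boolean, f: A => B): List[B] = {
    val b = List.newBuilder[B]
    xs.foreach(x => if (p(x)) b += f(x))
    b.result()
  }

  val xs = List(1, 2, 3, 4)
  // fused single pass agrees with the two-pass filter-then-map
  assert(filterMap(xs)(_ > 2, _ * 2) == xs.filter(_ > 2).map(_ * 2))
}
```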

To deal with multiple filters, we should also extend the rewriting rules to

  c1.filter(p).filterMap(q, f)   is rewritten to c1.filterMap(p && q, f)
  c1.filter(p).filterFlatMap(q, f)   is rewritten to c1.filterFlatMap(p && q, f)
  c1.filter(p).filterForeach(q, f)   is rewritten to c1.filterForeach(p && q, f)

and apply rewritings as long as possible.
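
Here `p && q` is shorthand for the combined predicate `x => p(x) && q(x)`.
A quick check that fusing two filters into one combined-predicate pass
preserves the result:

```scala
object FusionDemo extends App {
  val xs = List(1, 2, 3, 4, 5, 6)
  val p: Int => Boolean = _ % 2 == 0
  val q: Int => Boolean = _ > 2

  // two separate filter passes vs. one pass with the fused predicate
  val twoPasses = xs.filter(p).filter(q).map(_ * 10)
  val onePass   = xs.filter(x => p(x) && q(x)).map(_ * 10)

  assert(twoPasses == onePass)  // both give List(40, 60)
}
```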

Question: Should we do this? Advantages I can see: (1) More efficient
operations through some amount of deforestation. (2) Fewer surprises
for users expecting imperative behavior.
Disadvantages: (1) It complicates the rules, and possibly the behavior,
because now filterMap, etc. would decide whether filter was strict or
lazy. (2) It would break some code.

Btw, one could also consider going further with deforestation, for
instance with rules like this:

  c1.flatMap(x => f(x).map(g))  rewrites to    c1.mapFlatMap(g, f)

Cheers

 -- Martin

Re: Rethinking filter

Eastsun
+1
I've been looking forward to this for a long time.

It would make for-expressions more powerful and useful, and encourage us to use them rather than map, filter, etc.


Martin Odersky wrote:
> [snip]

Re: Rethinking filter

Jesper Nordenberg
In reply to this post by Martin Odersky
martin odersky wrote:
> We have a number of open tickets which all want that
>
>   for (x <- xs if guard) body
>
> should expand to
>
>   for (x <- xs if (guard) body

(I assume this should read "for (x <- xs) if (guard) body")

>
> rather than the current expansion
>
>   for (x <- xs filter (x => guard)) body
>
> The motivations are
>
> (1) you might want to write a guard that depends on side effects in body.
> (2) you might want to avoid constructing a secondary list or other
> data structure containing the filtered elements.

I'm against this suggestion if these are the only reasons for change.
It's not hard to manually move the guard inside the for-body.

Regarding the yield-case, the only motivation I can see is performance,
but it's not worth complicating the rewrite rules for that. It's better
to put more work into the optimizer in that case.

/Jesper Nordenberg


Re: Rethinking filter

Christos KK Loverdos
In reply to this post by Martin Odersky
Hi Martin,

On Thu, Oct 15, 2009 at 13:31, martin odersky <[hidden email]> wrote:
> [snip]

It's funny: I thought about (some of) these things quite a long time ago (though not very deeply, I should say), and my conclusion was that Scala would in the future be able to optimize some common cases and eliminate intermediate structures. Not sure if this sheds any light, though...

--
 __~O
-\ <,       Christos KK Loverdos
(*)/ (*)      http://ckkloverdos.com

Re: Re: Rethinking filter

Iulian Dragos-2
In reply to this post by Jesper Nordenberg


On Thu, Oct 15, 2009 at 1:47 PM, Jesper Nordenberg <[hidden email]> wrote:

>> The motivations are
>>
>> (1) you might want to write a guard that depends on side effects in body.
>> (2) you might want to avoid constructing a secondary list or other
>> data structure containing the filtered elements.
>
> I'm against this suggestion if these are the only reasons for change. It's not hard to manually move the guard inside the for-body.
>
> Regarding the yield-case, the only motivation I can see is performance, but it's not worth complicating the rewrite rules for that. It's better to put more work into the optimizer in that case.


The optimizer needs to find out that 'filter' is side-effect free, that map builds the exact same collection type, that the predicate and mapping function are again side-effect free and can be reordered... Maybe when an effect type system is in place this could be done, but I wouldn't hold my breath.

If it were only for efficiency, I'd agree with you, but having guards that rely on side effects in the mapping function is more intuitive for most people. The current behavior is a source of subtle bugs. And no matter who does the rewrite, the optimizer or the 'rewrite' rules, some complexity (and arguably a lot more if we rely on the optimizer) will appear somewhere. Speccing it allows the programmer to rely on it, and actually simplifies the implementation. The added complexity in the specification doesn't seem that great to me, so I'm in favor of this change.

cheers,
iulian
 





--
« Je déteste la montagne, ça cache le paysage »
Alphonse Allais

Re: Rethinking filter

Jesper Nordenberg
Iulian Dragos wrote:
> The optimizer needs to finds out that 'filter' is side-effect free, that
> map builds the exact same collection type, that the predicate and
> mapping function are again side-effect free and can be reordered...
> Maybe when an effect type-system is in place this should be done, but I
> wouldn't hold my breath.

Yes, an effect tracking system is required to make this "stream fusion"
work properly (although it doesn't have to be integrated with the type
system).

> If it was only for efficiency, I'd agree with you, but having guards
> that rely on side-effects in the mapping function is more intuitive for
> most people. The current behavior is a source of subtle bugs. And no
> matter who does the rewrite, the optimizer or the 'rewrite' rules, some
> complexity (and arguably, a lot more if we rely on the optimizer) will
> appear somewhere. Speccing it allows the programmer to rely on it, and
> actually simplifies the implementation. The added complexity in the
> specification doesn't seem that great to me, so I'm in favor of this change.

Well, I think the key question is: are functions passed to filter, map
and flatMap allowed to have side effects? And, if so, what restrictions
apply and what properties can be relied upon? If we start by specifying
this, I think the implementations details will fall out naturally.

/Jesper Nordenberg


Re: Re: Rethinking filter

Seth Tisue
In reply to this post by Jesper Nordenberg
>>>>> "Jesper" == Jesper Nordenberg <[hidden email]> writes:

 >> The motivations are
 >> (1) you might want to write a guard that depends on side effects in
 >> body.  (2) you might want to avoid constructing a secondary list or
 >> other data structure containing the filtered elements.

 Jesper> I'm against this suggestion if these are the only reasons for
 Jesper> change.  It's not hard to manually move the guard inside the
 Jesper> for-body.

Agree.

(1) doesn't seem important enough to do for 2.8, given everything else
that's going on, especially given that only imperative code (and rather
tricky imperative code, too) benefits from the additional complication.

As for (2), I'd point out that it's not hard to add ".view" to avoid
constructing intermediates.
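
For illustration, a sketch of the `.view` approach (modern Scala
collections; the counter exists only to show that nothing runs until the
view is forced):

```scala
object ViewDemo extends App {
  var evaluated = 0
  val xs = List(1, 2, 3, 4, 5)

  // view makes filter and map lazy: no intermediate list is built
  val pipeline = xs.view.filter { x => evaluated += 1; x % 2 == 1 }.map(_ * 10)
  assert(evaluated == 0)                       // nothing has run yet

  assert(pipeline.toList == List(10, 30, 50))  // forcing runs one fused pass
}
```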

--
Seth Tisue @ Northwestern University / http://tisue.net
lead developer, NetLogo: http://ccl.northwestern.edu/netlogo/

Re: Re: Rethinking filter

Jorge Ortiz
I agree with Seth and Jesper.

I rather like the simplicity of the current transformation. The "combine if it looks like this AND this method is defined" rule seems overly complicated and a can of worms. Would it apply the filterXXX methods based on the static type or the run-time type? How "deep" does the deforestation go?

That said, I haven't seen the open tickets regarding this issue. Which tickets are they? Perhaps they contain compelling arguments for change.

--j

On Thu, Oct 15, 2009 at 6:47 AM, Seth Tisue <[hidden email]> wrote:
> [snip]


Re: Rethinking filter

Eric Willigers
In reply to this post by Eastsun
Eastsun wrote:
> +1

> It would make for expression more powerful and useful, and encourage us to
> use for expression rather than map,filter,... etc.

+1

filterMap is an important performance win.


Re: Re: Rethinking filter

Martin Odersky
I have become convinced that filter in for comprehensions needs to be lazy. Why?
Consider this program fragment:

  var limit = 2
  for (i <- List(0, 1, 2, 3) if i <= limit; j <- 0 until limit)
    yield { limit = 4; (i, j) }

What does it return?

  List((0,0), (0,1), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2), (2,3))

So you see that the first index is computed with the previously given
limit. It is not affected by the assignment in the yield. No surprise
because we know filter is strict. However, the second index does go up
to 3 after the first loop iteration, so it does take the changed limit
into account. If you look at the translation it's clear why it has to
be so:

  List(0, 1, 2, 3).filter(_ <= limit).flatMap(i => (0 until limit).map(j => {
    limit = 4;
    (i, j)
  }))

You see that the second map is in the closure passed to the first
flatMap, so its evaluation gets interleaved with the flatMap
iteration. But the filter is in the generator of that flatMap, so gets
computed before the flatMap is even started.

But for someone who has not 100% internalized that translation (and
who has?), it's very weird indeed that _some_ parts of the for
comprehension are evaluated before the body is started, but other
parts are interleaved.
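
The two behaviors can be compared by hand-expanding both variants (using
withFilter as it was eventually adopted; the expansions below are my own
illustration of the point, runnable on modern Scala):

```scala
object GuardDemo extends App {
  def strictVersion: List[(Int, Int)] = {
    var limit = 2
    // filter runs to completion first, with the original limit = 2
    List(0, 1, 2, 3).filter(_ <= limit)
      .flatMap(i => (0 until limit).map(j => { limit = 4; (i, j) }))
  }

  def lazyVersion: List[(Int, Int)] = {
    var limit = 2
    // withFilter re-evaluates the guard per element, so it sees limit = 4
    List(0, 1, 2, 3).withFilter(_ <= limit)
      .flatMap(i => (0 until limit).map(j => { limit = 4; (i, j) }))
  }

  // strict filter snapshots the state: i ranges over 0, 1, 2 only
  assert(strictVersion == List((0,0), (0,1), (1,0), (1,1), (1,2), (1,3),
                               (2,0), (2,1), (2,2), (2,3)))
  // the lazy guard also admits i = 3 after limit was bumped to 4
  assert(lazyVersion.map(_._1).distinct == List(0, 1, 2, 3))
}
```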

So I think we need a better translation of `for'. I have found a
scheme which is different from filterMap, filterFlatMap, filterForeach
and which looks simpler. We postulate a new method withFilter which
like filter takes a predicate as argument. The `for' translation is
exactly as it is now, but using withFilter instead of filter. This
means that one has the guarantee that the result of withFilter is
further taken apart using one of map, flatMap, foreach, or another
withFilter. For instance, the for expression above would translate to:

  List(0, 1, 2, 3).withFilter(_ <= limit).flatMap(i => (0 until limit).map(j => {
    limit = 4;
    (i, j)
  }))

It is recommended (but not enforced) that withFilter is lazy. That is,
it should return a proxy object with methods for map, flatMap,
foreach, and withFilter. Each of these would combine a guard with the
original operation. I append an implementation of withFilter for class
TraversableLike which would be inherited by all collection classes.

To ensure a smooth transition, I propose that if the `for' translation
scheme cannot find the `withFilter' method, it uses `filter' instead,
but emits a deprecation warning.

I believe this proposal has a lot going for it: It's not significantly
more complex than the previous translation, it leads to more intuitive
behavior, and it can be a win in performance because it reduces the
number of intermediate structures that are built.

The only possible downside is that it might break some code. But I
doubt that would be very likely, as code that relies on a filter
seeing a snapshot of the state before the loop body is executed seems
rather weird.

I think this also settles the case for Range. Range should be strict,
and we do not need an exception for its filter anymore.

Cheers

 -- Martin

Here's the proposed implementation of withFilter in class TraversableLike:

  def withFilter(p: A => Boolean) = new WithFilter(p)

  class WithFilter(p: A => Boolean) {

    def map[B, That](f: A => B)(implicit bf: BuilderFactory[B, That, Repr]): That = {
      val b = bf(repr)
      for (x <- self)
        if (p(x)) b += f(x)
      b.result
    }

    def flatMap[B, That](f: A => Traversable[B])(implicit bf: BuilderFactory[B, That, Repr]): That = {
      val b = bf(repr)
      for (x <- self)
        if (p(x)) b ++= f(x)
      b.result
    }

    def foreach[U](f: A => U): Unit =
      for (x <- self)
        if (p(x)) f(x)

    def withFilter(q: A => Boolean): WithFilter =
      new WithFilter(x => p(x) && q(x))
  }
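
A self-contained variant of this proxy, specialized to List and with the
BuilderFactory/repr/self plumbing stripped out (LazyFilter is a
hypothetical name; my simplification for illustration):

```scala
// LazyFilter plays the role of the proposed WithFilter proxy
class LazyFilter[A](self: List[A], p: A => Boolean) {
  def map[B](f: A => B): List[B] = {
    val b = List.newBuilder[B]
    for (x <- self) if (p(x)) b += f(x)   // guard checked per element
    b.result()
  }
  def flatMap[B](f: A => List[B]): List[B] = {
    val b = List.newBuilder[B]
    for (x <- self) if (p(x)) b ++= f(x)
    b.result()
  }
  def foreach[U](f: A => U): Unit =
    for (x <- self) if (p(x)) f(x)
  def withFilter(q: A => Boolean): LazyFilter[A] =
    new LazyFilter(self, x => p(x) && q(x))  // combine the guards
}

object LazyFilterDemo extends App {
  var limit = 2
  val lf = new LazyFilter(List(0, 1, 2, 3), (i: Int) => i <= limit)
  // the guard runs per element during map, so it observes the update
  val out = lf.map { i => limit = 4; i }
  assert(out == List(0, 1, 2, 3))
}
```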

Re: Re: Rethinking filter

Daniel Sobral
I like this much better than the previous proposals. It doesn't change "filter" in unexpected ways, makes guards more intuitive, it is rather simple and has performance benefits.
 
And I have no queasy feelings about it, which I had about the other proposals, in which I found myself grudgingly agreeing that there was a problem, and yet disliked the proposed alternative.
 
So, +1.

On Fri, Oct 16, 2009 at 7:24 AM, martin odersky <[hidden email]> wrote:
> [snip]



--
Daniel C. Sobral

Something I learned in academia: there are three kinds of academic reviews: review by name, review by reference and review by value.