Views, Copies, and the SettingWithCopyWarning Issue #10954

Closed
nickeubank opened this issue Aug 31, 2015 · 45 comments
Labels
API Design Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@nickeubank
Contributor

As pandas approaches its 1.0 release, I would like to raise a concern about one aspect of the pandas architecture that I think is a threat to its widespread adoption: how pandas works with copies and views when setting values (what I will refer to here as the SettingWithCopyWarning issue).

The summary of my concern is the following:

1. SettingWithCopyWarning is a threat to data integrity
2. It is unreasonable to expect the average user to avoid a `SettingWithCopyWarning` issue, as doing
    so requires keeping track of the plethora of factors that determine what generates a copy and what
    generates a view.
    2a. Views made sense in `numpy`, but not in `pandas`
    2b. Chain-indexing is a much more subtle problem than suggested in the `pandas` docs. 
3. Given (1) and (2), data integrity in `pandas` relies on users noticing a non-exception warning in the
    flow of their output.
4. Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate
    and alienate lots of potential users of `pandas`.
5. I think solutions can be found that would have only limited effects on performance for the majority of  
    users

Taking each of these in turn:

(1) SettingWithCopyWarning is a threat to data integrity

The fact that assignment operations do different things depending on whether the target is a view or a copy has already been recognized as a threat to the predictability of pandas. Indeed, the reason a warning was added is that users were consistently asking why pandas was doing unanticipated things when SettingWithCopyWarning came into play.

(2) It is unreasonable to expect the average user to avoid a SettingWithCopyWarning issue, as doing so requires keeping track of the plethora of factors that determine what generates a copy and what generates a view.

Figuring out when a function will return a copy and when it will return a view in pandas is not simple. Indeed, the pandas documentation doesn't even try to explain when each will occur (link http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=views#indexing-view-versus-copy):

"The reason for having the SettingWithCopy warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array."

(2a) Views made sense in numpy, but not in pandas
Views entered the pandas lexicon via numpy. But the reason they were so useful in numpy is that they were predictable because numpy arrays are always single-typed. In pandas, no such consistent, predictable behavior exists.

(2b) Chain-indexing is a much more subtle problem than suggested in the pandas docs
At first glance, the pandas docs suggest that the SettingWithCopyWarning is easily avoided by avoiding chain-indexing or using .loc. This, I fear, is misleading for two reasons. First, the canonical example of chain indexing in the docs (dfmi['one']['second'] = value) seems to suggest that one can avoid chain indexing by just not falling into the trap of this kind of double slicing. The problem, however, is that these slices need not appear near one another. I know I've had trouble with code of the form:

df2 = dfmi['one']

# Lots of intermediate code that doesn't change dfmi or df2

df2['second'] = 5

Moreover, using .loc only solves this problem if one notices the chained indexing and attempts to fix it in one place. Just consistently using .loc[] (for example, in both the first and second problematic slicings above) would not solve the problem.
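One way to sidestep the two-step pattern above, following the advice in the pandas indexing docs, is to make the assignment a single `.loc` call on the original frame, so there is no intermediate object whose view-or-copy status matters. A minimal sketch; the contents of `dfmi` are made up for illustration:

```python
import pandas as pd

# Hypothetical two-level column index, mirroring the `dfmi` example in the docs.
columns = pd.MultiIndex.from_product([["one", "two"], ["first", "second"]])
dfmi = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)

# One indexing call on the original frame: the assignment is a single
# __setitem__ on dfmi.loc, so it lands in dfmi itself.
dfmi.loc[:, ("one", "second")] = 5

print(dfmi)
```

This only helps when the author notices that `df2` came from `dfmi` in the first place, which is exactly the difficulty described above.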

(3) Given (1) and (2), data integrity in pandas relies on users noticing a non-exception warning in the flow of their output.

This seems really problematic. If a user is printing values as they go along (which CS developers may not do, but interactive casual users often do to monitor the progress of their code), these warnings are easy to miss. And that seems very dangerous.

(4) Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate and alienate lots of potential users of pandas

I suspect I come to pandas from a different perspective than many developers. I am an economist and political scientist who has gotten deeper and deeper into computer science over the past several years for instrumental purposes. As a result, I think I have a pretty good sense of how applied users approach something like pandas, and I can just see this peculiarity of pandas driving this class of users batty. I've taken a year of computer science course work and am one of the most technically trained social scientists I know, and it drives me batty.

It's also unpythonic -- the behavior of basic operators (like =) should not depend on the type of columns in a DataFrame. Python 3 changed the behavior of the / operator because it was felt the behavior of / should not depend on whether you were working with floats or ints. Since whether functions return a view or copy is in large part (but not exclusively) a function of whether a DataFrame is single or multi-typed (which occurs when some columns are floats and some are ints), we have the same problem -- the behavior of a basic operator (=) depends on data types.
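For reference, the Python 3 behavior alluded to here: `/` is true division whether the operands are ints or floats, and floor division has its own operator, `//`. A short illustration:

```python
# In Python 3, `/` means true division for ints and floats alike.
int_result = 1 / 2        # 0.5
float_result = 1.0 / 2.0  # 0.5

# Floor division has to be requested explicitly with `//`.
floor_result = 1 // 2     # 0

print(int_result, float_result, floor_result)
```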

In other words, if one of the aims of pandas is to essentially supplant R among applied data scientists, then I think this is a major threat to achieving that goal.

(5) I think solutions can be found that would have only limited effects on performance for the majority of users
pandas uses views because they're so damn fast, so I understand the reluctance to drop them, but I think there are ways to minimize the performance hit. Obviously more talented developers with a better understanding of the innards of pandas may have better suggestions, but hopefully this can get the ball rolling.

* Solution 1: Move `views` to the background.
   When a user tries to look at an object and it's possible to return a view, do so. But just never let a 
   user assign values to a view -- any time an attempt is made to set on a view, convert it to a copy
   before executing the assignment. Views will still operate in the background providing high speed 
   data access in read-only environments, but users don't have to worry about what they're dealing 
   with. Users who *really* need access to views can work with `numpy` arrays. 

  (I would also note that given the unpredictability of when one will get a view or copy, it's not clear to 
  me how anyone can write code that takes advantage of the behavior of views, which makes me 
  doubt there are many people for whom this would seriously impact performance or written code, but 
  I'd be happy to hear if anyone has workarounds!)

* Solution 2: Create an indexer that always returns copies (like .take(), but for axis labels). 
   This would at least give users who want to avoid views altogether a way to do so without littering
   their code with `.copy()`s. 

* Solution 3: Change the `SettingWithCopyWarning` to an exception by default. 
  This is currently a setting, but the global default is for it to be a warning. Personally, I still don't like 
  this solution since, as a result of (2) this means `pandas` will now raise exceptions unpredictably, but 
  at least data integrity will be preserved.     

pandas is a brilliant tool, and a huge improvement on everything else out there. I am eager to see it become the standard not only among Python users, but among data analysts more broadly. Hopefully, by addressing this issue, we can help make this happen.

With that in mind, I would like to suggest the need for two things:

  1. A discussion about the desirability of the various solutions proposed above
  2. Volunteers to help implement this change. Unfortunately, I don't have the programming sophistication or knowledge of pandas internals to take this on alone, and this is likely too big an undertaking for any one individual, so a team is likely to be necessary.
@jreback
Contributor

jreback commented Aug 31, 2015

@nickeubank you realize that it is simple enough to just change the default to raise on the error

In [7]: pd.set_option('chained_assignment','raise')

@nickeubank
Contributor Author

@jreback Yes, and I've done that in my own code. But I think the issue of inconsistency of behavior remains, and not all users (especially newbies and non-programmers) will be aware of that / know they should be aware of that.

@jreback
Contributor

jreback commented Aug 31, 2015

@nickeubank I think you are missing a lot of the point here.

  • This has ALWAYS been the case, IOW, setting on a view. Until we had SettingWithCopy an assignment would silently just fail. We got so many questions about why this does not work when one does chain indexing.
  • It is simply not possible from a language perspective to detect chain indexing directly; it has to be inferred
  • not using views is NOT an option, you defeat the purpose of using numpy in its entirety.
  • we are not going to add another indexer, there are already too many, this will just cause even more issues
  • These are really a small set of edge cases

The simplest and best answer is just to change the default to raise. The warning is pretty good and has very, very few (if any) false positives (the thought was that we don't want to show exceptions when it might not be a problem, and that is why it was a warning to begin with).

Writing better docs helps a little, but most people simply don't read them.

The reason you need views is to avoid copying everything each time. One of the main purposes of numpy is to essentially share memory when you can, to avoid the copies.

df = DataFrame(.....)
slice_of_df = df.loc[0:100000]

So you want to copy this every time I want to use it? If it's small it doesn't matter. But if this is a non-trivial size you will eat up memory amazingly fast, and the point of pandas will be gone. You might as well just read in your data anew from a csv file for each operation.

@nickeubank
Contributor Author

@jreback I've floated that as Solution 3, and I think that's better than nothing, but I think my concern is that it's going to result in lots of people submitting requests for explanations for why code sometimes fails.

Re: performance: my preferred solution is Solution 1, which I don't think eliminates views, it just moves them away from the user. pandas could still return views whenever possible, but convert views to copies when those types would lead to different outcomes. Basically, this approach would be analogous to how R handles passing objects in functions via pass by promise, in which objects are passed to functions as references unless they are modified, then on-the-fly makes a new copy. From the user perspective, one can think of the program always doing pass by value, but the performance hit only occurs when necessary.

But that's my view, and why I want to solicit input from others!

[Edit: Modified to respond to some of jreback's additional comments 11:26 pst]

@jankatins
Contributor

+1 on setting the default to raise... better clean errors (and easy solutions) than silently wrong analysis results...

Also, I think the docs should encourage the usage of .copy() when assigning a slice (e.g. df2 = df1.ix[<what, ever>].copy()). In all but some Big Data™ situations, the copy will not matter performance-wise, and if I remember some Twitter comments right, it even makes the following code faster due to the missing checks.
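A sketch of the explicit-copy idiom suggested here, written with `.loc` because `.ix` has since been removed from pandas. The frame and its values are invented for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 20, 30], "b": [5, 6, 7]})

# Take the subset and copy it explicitly, so df2 owns its own data.
df2 = df1.loc[df1["a"] > 10].copy()

# Writing to df2 cannot reach df1, whatever the subset would otherwise have been.
df2["b"] = 0

print(df1)
print(df2)
```

The cost is one extra copy of the subset, which is the trade-off the comment argues is negligible outside very large datasets.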

@shoyer
Member

shoyer commented Aug 31, 2015

Thanks for writing this up! I agree with most of the concerns.

I wonder if solution 1 (copy on write for views) is technologically feasible. That seems like the best of the alternatives to me, and it's also what R and MATLAB do. It's not always desirable, but using a tool like NumPy directly is already necessary for high performance code.

I'm also not a fan of a new indexer, and I think making the SettingWithCopy warning an error would be a mistake, because chained indexing (with indexing appearing on different lines) can sometimes be the clearest way to write code.

@jreback
Contributor

jreback commented Aug 31, 2015

@shoyer well, pandas already DOES copy-on-write. That's exactly what .loc does when it sees a view. That is not the problem at all. The issue is this.

df[...][...] = value is 2 different python actions that are completely independent.

The setting has no idea that it is in fact part of a chain. So if you make a copy at this point, you are setting on the copy; if you don't, you STILL may be setting on a copy from the previous operation. This is the entire problem. IF pandas had lazy evaluation then this would be a no-brainer, but it is eager, so it is not possible to resolve this. copy-on-write will not fix the issue at all.
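The "two independent actions" point can be seen with a toy class that records the special methods Python calls. This is only an illustration of Python's evaluation order, not pandas code; `Traced` and its fields are hypothetical names:

```python
class Traced:
    """Toy container that records which special methods Python calls."""

    log = []  # shared across instances so the whole chain is visible

    def __init__(self, name, data):
        self.name = name
        self.data = data

    def __getitem__(self, key):
        Traced.log.append(f"{self.name}.__getitem__({key!r})")
        # The result is an ordinary object; it has no idea an assignment follows.
        return Traced(f"{self.name}[{key!r}]", self.data[key])

    def __setitem__(self, key, value):
        Traced.log.append(f"{self.name}.__setitem__({key!r}, {value!r})")
        self.data[key] = value


df = Traced("df", {"one": {"second": 0}})
df["one"]["second"] = 5

for call in Traced.log:
    print(call)
```

The `__setitem__` call is made on the temporary object returned by `__getitem__`, never on `df`. Whether the write reaches `df` depends entirely on whether that temporary shares data with it, which in this toy it does.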

Always returning a copy would fix this, but is way too expensive to be a soln. The soln is either or both of:

  1. education. More documentation. expand the warning messages / docs
  2. raise on a chained setting operation. I don't think there are any documented false positives, so I don't see a downside here.

@shoyer
Member

shoyer commented Aug 31, 2015

@jreback I think we have some misunderstanding about what "copy-on-write" means? I am referring to the array being indexed (e.g., df), not the new value (e.g., value). I don't think .loc currently does that.

If we had copy-on-write, df[...][...] = value would always fail, thus removing the need for SettingWithCopyWarning entirely (in a different way). As it is, this sometimes succeeds, if the first indexing operation returns a view.

I do understand the issues with Python syntax and lazy evaluation. I don't think we need to always make copies with indexing -- that would indeed be way too expensive in general. But instead of how we currently mark objects so we can later issue the SettingWithCopy warning, we could simply make a copy of the object being indexed at that point and then proceed with the indexing operation. The only difference is that we would only need to mark objects created with a view rather than all indexing results.
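A toy sketch of the idea in this comment: indexing hands back a cheap view, and the view takes a private copy the first time it is written to. This is a pure-Python illustration under simplifying assumptions (a 1-D list buffer); `CowColumn` and its methods are hypothetical and do not reflect pandas internals:

```python
class CowColumn:
    """Toy 1-D container whose views copy themselves on first write."""

    def __init__(self, buffer, start=0, stop=None, is_view=False):
        self._buffer = buffer  # a list, possibly shared with a parent
        self._start = start
        self._stop = len(buffer) if stop is None else stop
        self._is_view = is_view  # marked only when created as a view

    def view(self, start, stop):
        """Cheap slice: no data is copied, only the bounds are recorded."""
        return CowColumn(
            self._buffer, self._start + start, self._start + stop, is_view=True
        )

    def __setitem__(self, index, value):
        if self._is_view:
            # First write to a view: take a private copy, then write to that.
            self._buffer = self._buffer[self._start:self._stop]
            self._start, self._stop = 0, len(self._buffer)
            self._is_view = False
        self._buffer[self._start + index] = value

    def to_list(self):
        return self._buffer[self._start:self._stop]


parent = CowColumn([1, 2, 3, 4])
child = parent.view(1, 3)  # shares parent's buffer, no copy yet
child[0] = -99             # triggers the copy; parent is untouched

print(parent.to_list())
print(child.to_list())
```

Note that this sketch only protects the parent from writes to the child. A write to the parent would still show through a view that has not yet been written to.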

@jreback
Contributor

jreback commented Aug 31, 2015

> But instead of how we currently mark objects so we can later issue the SettingWithCopy warning, we could simply make a copy of the object being indexed at that point and then proceed with the indexing operation.

How do you actually know when to do this? This is the problem. You don't know the future operation.

@shoyer
Member

shoyer commented Aug 31, 2015

> How do you actually know when to do this? This is the problem. You don't know the future operation.

Perhaps I'm missing something? I'm pretty sure the only future operations where we would need to make copies are:

  • inside __setitem__
  • when inplace=True (if the method doesn't already copy all the data first)

There's also the issue of setting with .values (e.g., df.values[...] = ...), but that already only works inconsistently and we can't even issue warnings because it's done on the NumPy side.

@nickeubank
Contributor Author

I think we might be talking past each other a little bit here – I think I can clarify matters in a couple hours when I finish a set of meetings... Sorry for the delay.

@shoyer
Member

shoyer commented Aug 31, 2015

@jreback and I were discussing this on gitter.

One interesting case is supporting chained indexing via a series. This currently works (via views) and doesn't even raise a SettingWithCopy warning:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})

In [3]: series = df['x']

In [4]: series[0] = -10

In [5]: df
Out[5]:
    x  y
0 -10  2
1   1  3

This is convenient, and it's actually possible to guarantee that it works 100% of the time with views. I expect it's also widely relied upon.

So a possible rule is:

  • Indexing out a single column of dataframe to produce series always returns a view
  • Indexing a dataframe to produce a dataframe always returns a copy (which may be a copy-on-write if it's faster to use views internally)

@jreback
Contributor

jreback commented Aug 31, 2015

However, we cannot guarantee that a single column is actually a view if it's, say, object dtype. It may be possible, but I am not sure. IF I would allow non-consolidation (e.g., columns map directly to an individual block), then I think this is possible.

@nickeubank
Contributor Author

Thanks for wrestling with this @jreback and @shoyer .

Just to be clear on what's going on (and to make sure I follow correctly), do the following code snippets correctly characterize what we're discussing?

Code:

df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
intermediate = df.loc[1:1,]   # slice 1
intermediate['col1'] = -99  # slice 2

Current Behavior

# The following always happens
In[]: intermediate
Out[]:
       col1  col2
    1   -99     4

# This happens If slice 1 generated a view:
In[]: df
Out[]:
          col1  col2
         0     1     3
         1   -99     4

# This happens If slice 1 generated a copy:
In[]: df
Out[]:
          col1  col2
         0     1     3
         1     2     4

New (suggested) behavior

My solution 1, and what I think @shoyer has been suggesting with all slices behaving "as-if" they are copies:

In[]: intermediate
Out[]:
       col1  col2
    1   -99     4

In[]: df
Out[]:
          col1  col2
         0     1     3
         1     2     4

However, noting that in the "new" behavior, slice 1 may have generated a view behind the scenes.

Chained-Slicing on a Single Line

Under this new regime, of course, we would still see a failure (but now always see a failure) from:

In[]:
   df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
   df.loc[1:1,]['col1'] = -99
   df

Out[]:
          col1  col2
         0     1     3
         1     2     4

Since this would result in the creation of a new DataFrame that isn't actually referenced by a user variable. Is there any chance that this case could have its own exception that wouldn't apply in the case of chained assignment across multiple lines of code?

I'm ok if not -- as long as it always fails, is an easy idiom to tell users to avoid, and fails due to a simple principle: slicing always returns something that behaves like a copy, which means this fails for the same reason that

    df['col1'].replace(1,-99) # Executed without assignment or ``inplace=True``

fails.

"as-if" Copies

And also just to be clear, @jreback , you asked previously if I was suggesting that

df = DataFrame(.....)
slice_of_df = df.loc[0:100000]

would always generate a copy. Hopefully this is a little clearer now, but my suggestion would be "No", slice_of_df would (if possible) be a view. The view would only be coerced into a copy IF a situation emerged in which views and copies would give rise to different behaviors.

@ellisonbg
Contributor

In my experience, my students and I often run into this issue and it is one of the few (very) rough edges in pandas. In most usage cases of pandas it "just does the right thing", but with the issue being raised here that is not the case. If I understand the situation, I think the most problematic part is that the behavior depends on state that is hidden from the user. I think that hidden state needs to be removed or shown to the user before they make these API calls.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design Needs Discussion Requires discussion from core team before further action labels Sep 1, 2015
@jreback
Contributor

jreback commented Sep 1, 2015

@nickeubank so I implemented the above here:

jreback@8b684b9

In [1]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [2]: intermediate = df.loc[1:1,]

In [3]: intermediate['col1'] = -99

In [4]: intermediate
Out[4]: 
   col1  col2
1   -99     4

In [5]: df
Out[5]: 
   col1  col2
0     1     3
1     2     4

In [6]: df.loc[1:1,]['col1'] = -99
ValueError: chained indexing detected, you can fix this ......

Basically the setting on copy machinery already tracked this, so just a matter of actually doing something.
This is prob fragile (I know it won't work on py3, but that's a small change). But implements this.

Chained indexing I could actually make work, but there might be some more hoops / complexity, so an exception might just be better (as I am doing now).

@nickeubank
Contributor Author

@jreback Thanks so much! I really think this is a great improvement!

@shoyer
Member

shoyer commented Sep 1, 2015

@jreback Using garbage collection to check for chained indexing is a nice trick! It does seem fragile, though. In particular, I wonder if there are strange cases (e.g., unit tests?) that could run into this inadvertently...

@jreback
Contributor

jreback commented Sep 1, 2015

If you have a better trick that can differentiate these cases, I'm all ears.

@shoyer
Member

shoyer commented Sep 1, 2015

@jreback The alternative would be to not issue a warning or error for chained indexing at all (similar to NumPy), now that it is entirely predictable whether DataFrame indexing returns a copy or a view.

@nickeubank
Contributor Author

I'm agnostic. A warning might be nice, but the behavior no longer feels pathological or unexpected, so I'm less worried.

@jreback @shoyer You two had a conversation about still using views when someone slices a full column -- I'm open to that (since it will be consistent), but I would prefer we didn't. If even a slice of a column behaves "as-if" it were a copy, then users will never be required to think about or even understand views. I think that has the potential to really improve the accessibility of pandas to a lot of non-programmers. Column views are useful, but it's a small enough case that I'm not sure it's worth introducing an entire concept to the pandas ecosystem that users have to wrestle with.

@shoyer
Member

shoyer commented Sep 1, 2015

@nickeubank The reason why I think we should use views when selecting a single column is that we encourage users to think of DataFrames as "a dict of Series objects". Python always uses views for dictionary elements, so it's surprising if modifying one of these series does not change the original dataframe. I agree that copies are more intuitive (especially to R and MATLAB users), but views are an essential part of how Python works as a programming language.
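The plain-Python behavior behind this analogy, with a dict of lists standing in for a DataFrame (the names are illustrative only): looking up a dict element returns a reference to the stored object, not a copy.

```python
# A dict of lists standing in for "a dict of Series objects".
frame_like = {"x": [0, 1], "y": [2, 3]}

column = frame_like["x"]  # no copy: the very same list object the dict holds
column[0] = -10           # so the change is visible through the dict

print(frame_like)
```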

@nickeubank
Contributor Author

@shoyer I can respect that if that's how we go. Just want to put out that alternative for discussion.

@CarstVaartjes

Just to add my two cents: the warning or exception does not always hold true either. I have plenty of situations where something like this generates a warning when starting with a hypothetical DataFrame org_df (just an example piece of code):

x_df = org_df[org_df['b'] < 100]
del org_df
mask = x_df['a'] > 10
x_df['b'] = 0
x_df.loc[mask, 'b'] = 1

but it still delivers a correct result even with the warning (we actually implemented tests to continue checking for this). From a memory management perspective we also usually like the smaller views...

@shoyer
Member

shoyer commented Sep 1, 2015

@CarstVaartjes Yes, that's a pretty common situation and one of the big reasons why I dislike SettingWithCopy warning. With the proposal to use copy-on-write, this would now work exactly the same, except without warnings. Note that indexing like org_df[org_df['b'] < 100] does actually create a copy (usually?).

@jreback
Contributor

jreback commented Sep 1, 2015

@CarstVaartjes this is what you are expecting?

In [17]: np.random.seed(1234)

In [18]: org_df = DataFrame({'a' : np.random.randint(0,1000,size=100), 'b' : np.random.randint(0,1000,size=100) })

In [19]: org_df
Out[19]: 
      a    b
0   815  901
1   723  750
2   294  559
3    53  244
4   204  374
5   372  687
..  ...  ...
94  805  332
95  365  965
96  806  117
97  135  593
98  996  208
99  707  520

[100 rows x 2 columns]

In [20]: x_df = org_df[org_df['b'] < 100]

In [21]: del org_df

In [22]: mask = x_df['a'] > 10

In [23]: x_df['b'] = 0

In [24]: x_df.loc[mask, 'b'] = 1

In [25]: x_df
Out[25]: 
      a  b
8   689  1
10  233  1
11  154  1
47  275  1
55  243  1
57  828  1
79  840  1

@CarstVaartjes

@jreback yes exactly!
for me that sets off a huge number of warnings, while I'm aiming to manipulate a subset. Which in itself works fine, but the warnings are scary :)
And something like "x_df = org_df[org_df['b'] < 100].copy()" looks strange too for me...

@nickeubank
Contributor Author

@shoyer @jreback If we're agreed on the goal of having slices behave "as-if" copies, then I think there's a second situation we need to cover. @jreback's commit ensures changes to a slice won't affect the original dataset, but we also need to ensure that changes to the original dataset won't propagate forward to slices. In other words, we need to ensure the following:

In [1]:
original = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
subset = original.loc[1:1,]

original.loc[1,'col1'] = -99
subset

Out[1]: 
   col1  col2
1     2     4

And not:

In [1]:
original = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
subset = original.loc[1:1,]

original.loc[1,'col1'] = -99
subset

Out[1]: 
   col1  col2
1   -99     4

(Note that this behavior is actually currently inconsistent for the same reasons SettingWithCopy is inconsistent...)

I don't know a lot about internals, but in this example does original "know" that it has spawned subset? If so, we can add something to the setting function to ensure that, before making changes, frames always tell their "children" to convert themselves to copies.

@jreback
Contributor

jreback commented Sep 2, 2015

@nickeubank your last is not possible, we don't refcount things. Just getting the original copy-on-write to work is quite non-trivial.

@nickeubank
Contributor Author

@jreback This is really pretty analogous to the SettingWithCopy situation in that behavior is unpredictable, except in this case we don't even have a warning. I recognize it may be non-trivial, but this seems similarly important. Sorry I don't know enough about internals to offer concrete ways of addressing this, but maybe someone else can offer some suggestions? @shoyer ?

@shoyer
Member

shoyer commented Sep 2, 2015

I agree handling this case is important for copy-on-write to work as expected, but indeed I'm not sure how we could make that work without doing our own reference counting. This may be why copy-on-write is often a built in language feature. Hmm...


@nickeubank
Contributor Author

Perhaps this is a dumb question, but is there a reason that frames can't just keep a list of their "offspring" as an attribute and have subsetting functions (.loc[], or the slice operator []) add information on "offspring" views to this list when they subset a frame? Then when setting on a frame, one just first converts "offspring" in this list to copies?

@shoyer
Member

shoyer commented Sep 2, 2015

Yes, this sort of thing could be done with weak references, though that
might have unfortunate performance implications.
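The "offspring" bookkeeping discussed here could be sketched with the standard library's `weakref.WeakSet`, so that tracking a view does not keep it alive. This is a pure-Python toy under simplifying assumptions (1-D list buffer, one level of views); `Frame`, `view`, and `_detach` are hypothetical names, not pandas internals:

```python
import weakref


class Frame:
    """Toy frame that remembers its live views and detaches them before a write."""

    def __init__(self, buffer, start=0, stop=None):
        self._buffer = buffer  # a list, possibly shared with a parent
        self._start = start
        self._stop = len(buffer) if stop is None else stop
        # Weak references only: a view that is garbage collected drops out.
        self._offspring = weakref.WeakSet()

    def view(self, start, stop):
        """Return a view sharing this frame's buffer and register it."""
        child = Frame(self._buffer, self._start + start, self._start + stop)
        self._offspring.add(child)
        return child

    def _detach(self):
        """Swap the shared buffer for a private copy of this frame's span."""
        self._buffer = self._buffer[self._start:self._stop]
        self._start, self._stop = 0, len(self._buffer)

    def __setitem__(self, index, value):
        # Before writing, convert every live view of this frame into a copy.
        for child in list(self._offspring):
            child._detach()
        self._offspring.clear()
        self._buffer[self._start + index] = value

    def to_list(self):
        return self._buffer[self._start:self._stop]


original = Frame([1, 2, 3, 4])
subset = original.view(1, 3)
original[1] = -99  # subset is detached first, so it keeps the old values

print(original.to_list())
print(subset.to_list())
```

The sketch covers only the parent-to-child direction and ignores views of views. Every write also pays for a pass over the registered views, which hints at the performance concern raised above.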


@jreback
Contributor

jreback commented Sep 2, 2015

@shoyer well, I am already using weakrefs to track this at the moment, BUT what @nickeubank is proposing is significantly more complicated.

@jorisvandenbossche
Member

Really cool! I also agree with most of the concerns raised in the original issue. This is really one of the difficult parts of pandas (that I sometimes even don't try to explain, together with the details of the __getitem__ semantics ..).
But reading up on the discussion, it seems we can possibly find a solution. Although it will still be a question of how to put this in a release. In any case it will break people's code. For this last one, would it be possible to somehow detect when the actual behaviour would have been changed (previously value in original df changed, now not anymore, eg the case where the intermediate was obtained by slicing of dataframe with one dtype), so we could warn for such cases for first release?

@shoyer
Member

shoyer commented Sep 2, 2015

@jorisvandenbossche

> For this last one, would it be possible to somehow detect when the actual behaviour would have been changed (previously value in original df changed, now not anymore, eg the case where the intermediate was obtained by slicing of dataframe with one dtype), so we could warn for such cases for first release?

I believe these are exactly the cases where users are currently seeing a SettingWithCopy warning. So in some sense, we are already warning about it...

@jorisvandenbossche
Member

@shoyer True, but in such a case, this warning was actually a kind of false positive (as the original frame was adapted since it was a view and not a copy (like the warning says), and the warning tries to warn for when that does not happen, no?). So people could have learnt to ignore it ...

@nickeubank
Contributor Author

@shoyer @jorisvandenbossche Maybe I'm wrong, but I don't think that we currently warn when someone modifies the ORIGINAL (parent) dataframe about the fact that effects on slices (which may or may not be views) are unpredictable, only when the changes are made to the SLICE (child).

i.e. we currently warn against inconsistent backwards propagation of changes, but not inconsistent forward propagation.

@Patrick-DS

Hi everyone,

has there been any progress on this issue? I feel like it's enough of a pain for me to start reading the source code and try to find a fix, but this Issue thread is more than 4 years old now, so I don't know if this discussion moved somewhere else or if the situation is gonna remain as it is.

@nickeubank
Contributor Author

Sadly, I failed in my effort to fix it. I just couldn't reliably track all the relevant refs when objects were passed to constructors.

My sense is this is gonna live on until the massive refactoring that's been on the board for years to move to arrow based data structures.

@AkshaySapra

Is this at all relevant to the accepted answer here?

It seems the answer is old and something has changed. But seeing your comment above it appears it has not changed?

@vyasr
Contributor

vyasr commented Jan 5, 2023

@jorisvandenbossche is this closeable now that #46958 is merged? Copy-on-write is the planned default behavior in the future and will make these concerns moot, correct?

@nickeubank
Contributor Author

It will be! what's the release timeline for this? Not seeing in linked issue...

@vyasr
Contributor

vyasr commented Jan 21, 2023

I believe copy-on-write is intended to become the default in pandas 2.0 but I could be mistaken there. I am not aware of a target date for the 2.0 release but I think it will be the next non-patch release.

@jorisvandenbossche
Member

Indeed, with the changes in #46958 / #48998 (and formally accepted as PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html), the SettingWithCopyWarning will finally disappear.

This is set to become the default (and only) behaviour in pandas 3.0, but you can already enable this future behaviour right now with:

pd.options.mode.copy_on_write = True
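A hedged sketch of what enabling the option changes, reusing the two-column frame from earlier in the thread. It assumes a pandas version that has the `mode.copy_on_write` option; the `warnings` guard is there only in case a newer release warns that the option is deprecated:

```python
import warnings

import pandas as pd

with warnings.catch_warnings():
    # Assumption: newer pandas releases may warn when this option is set.
    warnings.simplefilter("ignore")
    pd.options.mode.copy_on_write = True

original = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
subset = original.loc[1:1]

# A write to the subset does not propagate back to the original...
subset["col1"] = -99

# ...and a write to the original does not propagate forward to the subset.
original.loc[1, "col2"] = 40

print(original)
print(subset)
```

Both directions raised in this thread behave "as-if" the subset were a copy, while the actual copying is deferred until one side is modified.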

@jorisvandenbossche jorisvandenbossche removed the Needs Discussion Requires discussion from core team before further action label Dec 5, 2023
10 participants