open versus io.open


Benny Malengier
John,

I see in your commits.

-            key_file = open(filename, "w")
+            if sys.version_info[0] < 3:
+                key_file = open(filename, "w")
+            else:
+                key_file = open(filename, "w", encoding="utf-8")


For another project, I discovered that io.open in Python 2.7 also accepts the encoding keyword. I don't know whether there are subtle differences between the two, but for that project we could replace those lines with

  key_file = io.open(filename, "w", encoding="utf-8")

An advantage of this was also that in Python 2 you then get nice UnicodeDecodeErrors, while just using open can give strange errors if you e.g. open a Windows UTF-16 file (which in that project happens if a Windows user saves CSV files).
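
A minimal sketch of the point above, assuming a Python in which io.open is available (2.6 or later, including 3.x). The file names and sample text are made up for illustration and do not come from the Gramps sources.

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# One call that behaves the same on Python 2.7 and 3.x: io.open takes
# an explicit encoding and expects text (unicode) to be written.
key_path = os.path.join(tmpdir, "keys.ini")  # hypothetical file name
with io.open(key_path, "w", encoding="utf-8") as key_file:
    key_file.write(u"name = Malengier\n")

# Simulate a CSV file that was saved on Windows as UTF-16.
csv_path = os.path.join(tmpdir, "export.csv")  # hypothetical file name
with io.open(csv_path, "w", encoding="utf-16") as csv_file:
    csv_file.write(u"id,name\n1,Ralls\n")

# Reading it back with the wrong encoding fails loudly with a
# UnicodeDecodeError instead of quietly returning garbage.
try:
    with io.open(csv_path, "r", encoding="utf-8") as csv_file:
        csv_file.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The UTF-16 file starts with a byte order mark, which is never valid UTF-8, so the mismatch is detected on the first byte.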

Benny


------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
MVPs and experts. ON SALE this month only -- learn more at:
http://p.sf.net/sfu/learnmore_122712
_______________________________________________
Gramps-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gramps-devel

Re: open versus io.open

John Ralls-2

On Jan 17, 2013, at 12:02 PM, Benny Malengier <[hidden email]> wrote:

> John,
>
> I see in your commits.
>
> -            key_file = open(filename, "w")
> +            if sys.version_info[0] < 3:
> +                key_file = open(filename, "w")
> +            else:
> +                key_file = open(filename, "w", encoding="utf-8")
>
>
> For another project, I discovered that io.open in Python 2.7 also accepts the encoding keyword. I don't know whether there are subtle differences between the two, but for that project we could replace those lines with
>
>   key_file = io.open(filename, "w", encoding="utf-8")
>
> An advantage of this was also that in Python 2 you then get nice UnicodeDecodeErrors, while just using open can give strange errors if you e.g. open a Windows UTF-16 file (which in that project happens if a Windows user saves CSV files).
>

Benny,

The change above is from r21147. You'll see in r21148 that I already did that.

Regards,
John Ralls




Re: open versus io.open

Benny Malengier
Is this because you use git?

If so, it would be interesting if you used a devel branch, used git merge --squash to combine the commits in master (http://stackoverflow.com/questions/5308816/how-to-use-git-merge-squash ), and then pushed the single commit to subversion. That would make it easier for the subversion users as well as for code review.

Benny




Re: open versus io.open

John Ralls-2

On Jan 17, 2013, at 11:53 PM, Benny Malengier <[hidden email]> wrote:

> Is this because you use git?
>
> If so, it would be interesting if you used a devel branch, used git merge --squash to combine the commits in master (http://stackoverflow.com/questions/5308816/how-to-use-git-merge-squash ), and then pushed the single commit to subversion. That would make it easier for the subversion users as well as for code review.

Yes.

Yes, that's what I did (though with git rebase -i, not git merge -- git merge and git-svn don't get along at all). I cut about 50 commits down to 8 or 9 (the others are fixes to pre-existing bugs that I made along the way) to present a progression of related changes. This way the commit messages can explain what's going on in each changeset (rename files, replace module x, etc.) rather than having a single massive changeset that does a dozen things at once, and no way to separate which bits of the changeset implement what changes. It might complicate code review if you do it one changeset at a time, because (as in the instant case) you think of an improvement in one only to find that I did exactly that two changesets later, but if you're examining a file down the road and trying to understand what it does and why it's implemented the way it is, having an understandable history makes the task much easier.

Regards,
John Ralls



Re: open versus io.open

Tim Lyons
Administrator
In reply to this post by Benny Malengier
Benny Malengier wrote:
> I see in your commits.
>
> -            key_file = open(filename, "w")
> +            if sys.version_info[0] < 3:
> +                key_file = open(filename, "w")
> +            else:
> +                key_file = open(filename, "w", encoding="utf-8")
>
> For another project, I discovered that io.open in python 2.7 also has the
> encoding keyword. I don't know whether there are subtle differences between
> the two, but for that project we could replace those lines with
>
>   key_file = io.open(filename, "w", encoding="utf-8")
>
> An advantage of this was also that in python 2 you then get nice
> UnicodeDecodeErrors, while just using open can give strange errors if you
> e.g. open a Windows UTF-16 file (which in that project happens if a Windows
> user saves CSV files).

I am in my standard state of confusion.

In NarWeb, the code that opens the HTML file for writing was changed to:
of = codecs.EncodedFile(string_io, 'utf-8', self.encoding, 'xmlcharrefreplace')

I found that didn't work at all, so I changed it to the four lines marked with + above, which did work.

It is now
of = io.open(fname, "w", encoding = self.encoding, errors = 'xmlcharrefreplace')
which doesn't work because io.open expects to be given unicode to write, and all the output at present is in str (as it is for all the reports, AIUI).
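
The failure described above can be reproduced in a few lines. This is only a sketch, not the NarWeb code: the file name is made up, and a bytes literal stands in for a Python 2 str.

```python
import io
import os
import tempfile

fname = os.path.join(tempfile.mkdtemp(), "index.html")  # hypothetical name

with io.open(fname, "w", encoding="utf-8",
             errors="xmlcharrefreplace") as of:
    of.write(u"<html>\n")          # text (unicode) is accepted
    try:
        of.write(b"<body>\n")      # bytes (a Python 2 str) are rejected
        bytes_rejected = False
    except TypeError:
        bytes_rejected = True
```

A file opened in text mode through io.open only accepts text, so every caller that still passes byte strings has to be converted first.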


I am not criticising the people who changed it (and I don't want to single out individuals, because I know they are all doing a great job in trying to sort out all the problems in converting to Python3/GTK3 etc. etc., and are trying to fix a great many problems at once).

I am just asking what you think it should actually be, and giving a warning to anyone who is going to make the suggested changes. It looks to me as though it has to be the four lines marked with + above (although that doesn't seem to allow the encoding of the output file to be specified in Python 2).

Re: open versus io.open

John Ralls-2

On Jan 23, 2013, at 4:36 PM, Tim Lyons <[hidden email]> wrote:

> In NarWeb, it was changed to write to the HTML file to:
> of = codecs.EncodedFile(string_io, 'utf-8', self.encoding,
> 'xmlcharrefreplace')
>
> I found that didn't work at all, so changed it to the four lines with the +
> above, which did work.
>
> It is now
> of = io.open(fname, "w", encoding = self.encoding, errors =
> 'xmlcharrefreplace')
> which doesn't work because io.open expects to write unicode, and all the
> output at present is in str (as it is for all the reports AIUI)
>
> I am just asking what you think it should actually be? And as a warning to
> anyone who is going to make the suggested changes. It looks to me like it
> has to be the four lines with the + above. (Although that doesn't seem to
> allow specification of the encoding of the output file in Python2).

If I left the codecs.open() in Narweb that was a mistake: I used codecs.open() first and then discovered io.open() and intended to change all instances. That said, they're essentially the same.

The problem I was trying to solve (and thought that I had) is UnicodeEncodeErrors when using translated strings with IO. On Linux the default encoding was coming up as 'ascii', and that doesn't play well with languages other than English. This turned up in some surprising places, including ini files, the recent file list, and plugins.

I settled upon io.open() with encoding and errors specified as the solution. The documented behavior of setting an encoding when opening a file is that the IO is transcoded on-the-fly between the stated encoding of the file and the internal representation. In Py3, the internal representation is either UTF-16 or UTF-32, depending upon the OS, but we needn't worry about that, as IO is always transcoded into something else. Py2 is more complicated, as strings can be represented as either str or unicode, with str being a collection of bytes which are assumed to be intelligible to the host OS.
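
As a rough illustration of the on-the-fly transcoding described above: the same text is written under three encodings, the raw bytes on disk differ, and the text read back is identical. The sample string and file names are made up.

```python
import io
import os
import tempfile

text = u"caf\u00e9"  # sample text with one non-ASCII character
tmpdir = tempfile.mkdtemp()
results = {}

for encoding in ("utf-8", "iso-8859-1", "utf-16"):
    path = os.path.join(tmpdir, "sample-" + encoding + ".txt")
    # Writing: text is encoded into the file's stated encoding.
    with io.open(path, "w", encoding=encoding) as fh:
        fh.write(text)
    # The bytes that actually landed on disk.
    with io.open(path, "rb") as fh:
        raw = fh.read()
    # Reading: bytes are decoded back into text.
    with io.open(path, "r", encoding=encoding) as fh:
        decoded = fh.read()
    results[encoding] = (raw, decoded)
```

The calling code never sees the on-disk representation, which is why the interpreter's internal string layout does not matter here.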

IMO all Gramps files should be encoded in utf-8. It's simply the only portable representation with broad language support. (Note that this is separate from sys.getfilesystemencoding(), which is supposed to be the encoding used by the OS for file *names*. File *content* is always up to the application.) Since we use Gtk, all GUI text IO *must* be in utf-8 so that Gtk widgets will display it properly. Reports can be in whatever encoding the user wants (though we should bias towards utf-8), but in that case transcoding needs to gracefully handle cases where the codepoint can't be represented by the chosen encoding. There are a couple of exceptions: xml, html, and xhtml should always be utf-8. Some formats are specified in ASCII, and in theory GEDCOM should support ANSEL even though AFAIK no one actually emits it.
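
A small sketch of the two distinctions drawn above: the encoding for file names versus file content, and strict versus graceful handling of characters the target encoding cannot represent. The values printed by the first two calls depend on the platform, so none are shown.

```python
import locale
import sys

# Encoding the OS uses for file *names*.
filename_encoding = sys.getfilesystemencoding()
# Locale-derived default that Python 3's builtin open() normally falls
# back to for file *content* when no encoding is given.
default_content_encoding = locale.getpreferredencoding(False)

text = u"\u03a7"  # GREEK CAPITAL LETTER CHI

# errors="strict" is the default: unrepresentable characters raise.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A more forgiving handler substitutes a placeholder instead.
replaced = text.encode("ascii", errors="replace")
```

Passing an explicit encoding and errors handler at every open call avoids depending on either of the platform defaults.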

So having finished my diatribe, let's get back to the problem:

What is actually failing with Narweb? What did you see when the file was io.opened, and what do you see now that you've changed it to use the builtin open function (bearing in mind that in Py3 that's supposed to be an alias of io.open)?

Are you able to test in both Py3 and Py2?

Regards,
John Ralls




Re: open versus io.open

Benny Malengier
In reply to this post by Tim Lyons

2013/1/24 Tim Lyons <[hidden email]>
> It is now
> of = io.open(fname, "w", encoding = self.encoding, errors = 'xmlcharrefreplace')
> which doesn't work because io.open expects to write unicode, and all the
> output at present is in str (as it is for all the reports AIUI)


I agree with what John wrote.
About this piece: there is a difference between Python 2 and 3, which is why we test on the Python version so often when handling files.
As you note, output in Python 2 is str, which should work fine, even if you write unicode to it. In Python 3, the old str is now called bytes, and the str of Python 3 is the unicode of Python 2.
So in Python 3 all internal things are unicode, and if you push bytes into a text file you get an error. This is a _good_ change in Python. Once you agree that Gramps should follow this way of working, it follows that all internal APIs should be unicode (str in Python 3), and that conversion happens only at the writing level.
In Python 3 you can still write bytes, but you need to use BytesIO. It is best _not_ to use this if we have unicode and want UTF-8 out. It is only useful if we want to do the encoding ourselves (e.g. ANSEL). At the moment our codebase still uses BytesIO in several places, because when porting that was often the fastest way to support Python 2 and Python 3 at the same time.

So, I guess with the above I just want to indicate that the nicest would be all unicode internally. As we support both Python 2 and Python 3, this is not always practical (only Python 3.3 and later understand u'string' as unicode; in Python 3.2 this gives a syntax error, so for Python 2.7 all strings would have to be u'blabla' to have unicode internally, but we can't do that as we want to support Python 3.2!).
So if, for simplicity, we keep bytes internally, we need of = io.open(fname, "wb" ...
and must make sure we always write bytes instead of unicode.
When we deprecate Python 2.7 support in the far future for some reason, we can change this.
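
To make the "conversion only at the writing level" idea concrete, here is a minimal sketch. The function names render_page and write_page are hypothetical and are not Gramps APIs; an io.BytesIO object stands in for a file opened with "wb".

```python
import io


def render_page(surname):
    """Build the page as text; no encoding decisions are made here."""
    return u"<td>" + surname + u"</td>\n"


def write_page(stream, page, encoding="utf-8"):
    """Encode at the last moment and hand bytes to a binary stream."""
    stream.write(page.encode(encoding))


buffer = io.BytesIO()   # stands in for io.open(fname, "wb")
write_page(buffer, render_page(u"Malengier"))
written = buffer.getvalue()
```

With this split, only write_page needs to know the output encoding, so supporting something like ANSEL would mean changing one function.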

Benny



Re: open versus io.open

Tim Lyons
Administrator
In reply to this post by John Ralls-2

On 24 Jan 2013, at 03:38, John Ralls wrote:

> IMO all Gramps files should be encoded in utf-8. It's simply the
> only portable representation with broad language support.
> (Note that this is separate from sys.getfilesystemencoding(),
> which is supposed to be the encoding used by the OS for file
> *names*. File *content* is always up to the application.) Since we
> use Gtk, all GUI text IO *must* be in utf-8 so that Gtk widgets will
> display it properly. Reports can be in whatever encoding the user
> wants (though we should bias towards utf-8), but in that case
> transcoding needs to gracefully handle cases where the codepoint
> can't be represented by the chosen encoding. There are a couple of
> exceptions: xml, html, and xhtml should always be utf-8.

So it looks as though the option in NarWeb to choose the encoding  
should be removed?

If I go back to earlier versions of Gramps, for example Gramps32, NarWeb has:
of = codecs.EncodedFile(open(fname, "w"), 'utf-8', self.encoding, 'xmlcharrefreplace')

I have tested my Greek test tree, and with UTF-8 encoding this gives
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en-GB">
<td class="ColumnSurname"><a name="Χ" title="Surname with letter Χ">Χατζησάββας</a></td>

With "ISO-8859-1" encoding I get
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en-GB">
<td class="ColumnSurname"><a name="&#935;" title="Surname with letter &#935;">&#935;&#945;&#964;&#950;&#951;&#963;&#940;&#946;&#946;&#945;&#962;</a></td>

So it seems that NarWeb used to support different output encodings.  
Shouldn't it continue to support them?
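
The two outputs above are what the 'xmlcharrefreplace' error handler produces. A short sketch using only str.encode, independent of the NarWeb code:

```python
# The Greek surname from the test tree above, written with escapes.
surname = u"\u03a7\u03b1\u03c4\u03b6\u03b7\u03c3\u03ac\u03b2\u03b2\u03b1\u03c2"

# UTF-8 can represent every character, so the handler is never used.
utf8_bytes = surname.encode("utf-8", "xmlcharrefreplace")

# ISO-8859-1 has no Greek letters, so each one becomes a numeric
# character reference that a browser renders as the original letter.
latin1_bytes = surname.encode("iso-8859-1", "xmlcharrefreplace")
```

Because the references are plain ASCII, the ISO-8859-1 page still displays the Greek name correctly, which is why the handler suits HTML output.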

> Some formats are specified in ASCII,  and in theory GEDCOM should  
> support ANSEL even though AFAIK no one actually emits it.
>
> So having finished my diatribe, let's get back to the problem:
>
> What is actually failing with Narweb? What did you see when the file  
> was io.opened, and what do you see now that you've changed it to use  
> the builtin open function (bearing in mind that in Py3 that's  
> supposed to be an alias of io.open)?

OK. So at present I am testing trunk with Python2.x (On my Mac, using  
the library from John's Mac gramps.app alpha)

When I run a typical text report (Kinship) with HTML output, it works OK.

When I run NarWeb, I get:

   File "/Users/tim/gramps-devel/trunk/Contents/Resources/lib/python2.7/site-packages/gramps/plugins/lib/libhtml.py", line 429, in write
     method('%s%s' % (tabs, item))         # else write the line
TypeError: must be unicode, not str

For text reports, libhtml->libhtmlbackend->docbackend DocBackend.open (I think that is the right path of inheritance) does
             self.__file = open(self.filename, "w")

The copy of NarWeb that I am using does:
of = io.open(fname, "w", encoding = self.encoding, errors = 'xmlcharrefreplace')

Now, I understand that internally everything 'should' be in Unicode. However, ALL the literals (for example in htmldoc, NarWeb etc.) are strings, not Unicode. It seems that libhtml has some clever coding to deal with both strings and Unicode.

There are LOTS of string literals. It does not seem reasonable to change them all. And it would not make sense just to do the conversion to Unicode in one of the library routines if internally everything should be in Unicode.
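
For reference, "dealing with both strings and Unicode" at a single write point could look roughly like the sketch below. The helper to_text is hypothetical and is not the actual libhtml code; it assumes byte strings are UTF-8 encoded.

```python
import io

text_type = type(u"")   # unicode on Python 2, str on Python 3


def to_text(item, encoding="utf-8"):
    """Return item as text, decoding it first if it is a byte string."""
    if isinstance(item, text_type):
        return item
    return item.decode(encoding)


out = io.StringIO()     # stands in for a file opened with io.open(..., "w")
for item in (b"<p>", u"\u03a7", b"</p>"):
    out.write(to_text(item))
result = out.getvalue()
```

This keeps the existing literals untouched, at the cost of hiding the mixed types rather than removing them.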

> Are you able to test in both Py3 and Py2?

My main machine, and the machine I do my development and testing on is  
a Mac. However, I do have access to a cheap laptop with Ubuntu, which  
I occasionally fire up to check something both on Py2 and Py3.

I haven't tried the codecs.EncodedFile code again recently, either for Py2 or Py3. I can't remember exactly what error I got (I think it was to do with encode or decode), but I also see that there may have been a simple coding error as well, in that string_io may not have been assigned.


I will carry on trying to get it to work!

Regards,
Tim.



Re: open versus io.open

Tim Lyons
Administrator
In reply to this post by Benny Malengier
Thanks Benny and John for your help and advice.

As ever, it turned out to be more complicated than I thought, with the need to support:
- Python2 and 3,
- tar and ordinary output and
- various encodings.

I have committed revision 21757 http://sourceforge.net/p/gramps/code/21258/ which I hope fixes all the above variations.

It turns out (judging by a print statement I inserted in libhtml.write()) that Python 2 mostly writes 8-bit strings (type=str), but sometimes also unicode (type=unicode), while Python 3 always writes text strings (type=str, i.e. unicode). (I am not sure where libhtml does the conversion.) So it is not always bytes internally.

Also, although you said: "There are a couple of exceptions: xml, html, and xhtml should always be utf-8", I think that html can have a variety of encodings [1], though it may be true that it _should_ be UTF-8. Anyway, NarWeb is setup to support different output encodings for the HTML files, so I have maintained the support for this.

Unfortunately, because of the different data types in python2 and 3, and the different ways conversion is done, I have had to provide different code for Python 2 and 3.


[1] http://www.w3.org/International/questions/qa-html-encoding-declarations#quicklookup