GEDCOM import character encoding


GEDCOM import character encoding

Tim Lyons
Administrator
Peter (and anyone else interested in MS Windows),

Could you have a look at 0004439: [Info]: characters ignored on a  
Gedcom encoded ANSI (cp1252 West Europe, USA)?

The encoding could be changed to support Windows conventions, if that  
is thought to be worthwhile.

If I don't hear that a change is thought worthwhile, I will just close  
this bug (won't fix) on 20th March 2012.

Thanks.

_______________________________________________
Gramps-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gramps-devel

Re: GEDCOM import character encoding

jerome
Yes, this can be closed.

I reported this more as information than for a bug fix!

As I said, maybe replace 'latin-1' with 'latin-9' as a minor improvement (it adds some more recent characters)? In any case, UTF-8 is a much better solution for non-ASCII characters in the GEDCOM file format.

Gramps will not be able to fix this shortcoming when exchanging data encoded by some Windows programs, which write 'ANSI' but are thinking 'windows-1251/2' ...

Thank you.

Jérôme

--- On Tue 13.3.12, Tim Lyons <[hidden email]> wrote:

> From: Tim Lyons <[hidden email]>
> Subject: GEDCOM import character encoding
> To: "[hidden email] List" <[hidden email]>
> Cc: "Peter Landgren" <[hidden email]>, "Jérôme Rapinat" <[hidden email]>
> Date: Tuesday 13 March 2012, 19:02
> Peter (and anyone else interested in
> MS Windows),
>
> Could you have a look at 0004439: [Info]: characters ignored
> on a Gedcom encoded ANSI (cp1252 West Europe, USA)?
>
> The encoding could be changed to support Windows
> conventions, if that is thought to be worthwhile.
>
> If I don't hear that a change is thought worthwhile, I will
> just close this bug (won't fix) on 20th March 2012.
>
> Thanks.
>


Re: Re: GEDCOM import character encoding

Tim Lyons
Administrator

On 18 Mar 2012, at 14:55, jerome wrote:

> Yes, this can be closed.
>
> I reported this more as information than for a bug fix!
>
> As I said, maybe replace 'latin-1' with 'latin-9' as a minor
> improvement (it adds some more recent characters)? In any case,
> UTF-8 is a much better solution for non-ASCII characters in the
> GEDCOM file format.
>
> Gramps will not be able to fix this shortcoming when exchanging
> data encoded by some Windows programs, which write 'ANSI' but are
> thinking 'windows-1251/2' ...

I don't think so. That is not what I was saying. It would be quite
simple (as far as I can see) for Gramps to fix the exchange of data
encoded by Windows programs that think 'cp1252' ('windows-1252').

If we changed Gramps so that when the input file said 'ANSI' we read
the file as though it were 'cp1252' ('windows-1252'), then this might
help Windows users. It would not affect anyone who was using GEDCOM
correctly, because GEDCOM does not allow ANSI, so anyone using it
correctly would not say 'ANSI'.

(I don't think there would be any point in changing to latin-9, as
that is probably not what the user really meant - he probably really
meant 'cp1252' ('windows-1252'))
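
The proposal above (treat a declared 'ANSI' as cp1252) could be sketched roughly as follows. This is only an illustration and not Gramps code: the table and the function names are hypothetical, and only the character sets with a direct Python codec are listed (ANSEL and UNICODE are left out on purpose).

```python
# Hypothetical mapping from a GEDCOM "1 CHAR" value to a Python codec name.
# "ANSI" is not a valid GEDCOM value; mapping it to cp1252 is the assumption
# discussed in this thread, not established Gramps behaviour.
GEDCOM_CHARSET_TO_CODEC = {
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "ANSI": "cp1252",
}


def codec_for_declared_charset(declared):
    """Return the codec name for a declared character set, or None if unknown."""
    return GEDCOM_CHARSET_TO_CODEC.get(declared.strip().upper())


def decode_line(raw, declared):
    """Decode one raw GEDCOM line according to the declared character set."""
    codec = codec_for_declared_charset(declared)
    if codec is None:
        raise ValueError("no codec known for character set %r" % declared)
    # cp1252 leaves a few byte values undefined; replace them instead of failing.
    return raw.decode(codec, errors="replace")
```

A file that declares a valid character set never reaches the 'ANSI' entry, which is why such a change should not affect anyone using GEDCOM correctly.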

Tim.


Re: Re: GEDCOM import character encoding

jerome
> If we changed Gramps so that when the input file said 'ANSI'
> we read the file as though it were 'cp1252'
> ('windows-1252'), then this might help windows users.

Like what is done for ANSEL, but for ANSI: cp1252 (and maybe cp1251)?

> (I don't think there would be any point in changing to
> latin-9, as that is probably not what the user really meant -
> he probably really meant 'cp1252' ('windows-1252'))

Well, then French users should avoid writing 'sister' in a GEDCOM encoded as cp1252!
{sister: sœur}... or '€' in a note?
http://en.wikipedia.org/wiki/ISO/IEC_8859-15


Jérôme

--- On Sun 18.3.12, Tim Lyons <[hidden email]> wrote:

> From: Tim Lyons <[hidden email]>
> Subject: Re: Re: GEDCOM import character encoding
> To: "[hidden email] List" <[hidden email]>
> Cc: "Jérôme Rapinat" <[hidden email]>
> Date: Sunday 18 March 2012, 17:19

Re: Re: GEDCOM import character encoding

Tim Lyons
Administrator

On 18 Mar 2012, at 16:59, jerome wrote:

>> If we changed Gramps so that when the input file said 'ANSI'
>> we read the file as though it were 'cp1252'
>> ('windows-1252'), then this might help windows users.
>
> Like what is done for ANSEL, but for ANSI: cp1252 (and maybe cp1251)?
>
>> (I don't think there would be any point in changing to
>> latin-9, as that is probably not what the user really meant -
>> he probably really meant 'cp1252' ('windows-1252'))
>
> Well, then French users should avoid writing 'sister' in a GEDCOM
> encoded as cp1252!
> {sister: sœur}... or '€' in a note?
> http://en.wikipedia.org/wiki/ISO/IEC_8859-15


The more I think about it, the more convinced I am that GEDCOM  
advertised as ANSI should actually be parsed with the Windows-1252  
encoding, especially given what Wikipedia says about web browsers:
"Most modern web browsers and e-mail clients treat the MIME charset  
ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling.  
This is now standard behavior in the draft HTML 5 specification, which  
requires that documents advertised as ISO-8859-1 actually be parsed  
with the Windows-1252 encoding."

Jérôme or someone else who has software that outputs GEDCOM as ANSI,  
could you please send me a file (preferably zipped so it doesn't get  
re-coded in-flight) that contains characters like € (euro sign), upper  
and lower case oe {sister: sœur}, upper and lower case S with caron,  
capital Y with umlaut, upper and lower case Y with caron.

All these characters differ between latin-1, latin-9 and Windows-1252  
(i.e. there are three DIFFERENT encodings for these characters).
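
A small experiment can make the difference concrete. The sketch below uses only Python's standard codecs ("iso8859-15" is Python's name for what this thread calls latin-9); the helper names are made up for illustration. It suggests a slightly narrower reading of the statement above: latin-1 has no encoding at all for these characters, while latin-9 and Windows-1252 put them at different byte values.

```python
# Python codec names for the three encodings discussed in the thread.
CODECS = ("latin-1", "iso8859-15", "cp1252")


def encode_with_each(char):
    """Encode one character under each codec; None where it has no encoding."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = char.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result


def decode_with_each(byte_value):
    """Decode one byte under each codec; None where the byte is undefined."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = bytes([byte_value]).decode(codec)
        except UnicodeDecodeError:
            result[codec] = None
    return result


if __name__ == "__main__":
    for char in "€ŒœŠšŸ":
        print(ascii(char), encode_with_each(char))
    for byte_value in (0x80, 0x9C, 0xA4, 0xBD):
        print(hex(byte_value), ascii(decode_with_each(byte_value)))
```

For example, the euro sign is byte 0x80 in Windows-1252 and byte 0xA4 in latin-9, so a file written in one and read as the other shows the wrong character, which is why a test file containing these characters helps to tell the encodings apart.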

Tim.

Re: Re: GEDCOM import character encoding

jerome
Tim,


About the characters: 'cp1252-utf8.ged' from 'cp1252.zip' should provide most of them (encoded as UTF-8). We can add further characters after importing it into Gramps (UTF-8), then generate a GEDCOM export and maybe run a command like:
$ iconv -f utf-8 -t WINDOWS-1252 <gramps.ged >cp1252.ged
(or with -t CP1252), and then change the GEDCOM header to ANSI!
OK, not a good testing environment ... :(
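
The same conversion, including the header change, could be done in one step with a few lines of Python. This is a rough sketch under stated assumptions: the function name is hypothetical, and it assumes the export declares its character set on a line reading `1 CHAR UTF-8`.

```python
def utf8_gedcom_to_ansi(utf8_bytes):
    """Re-encode a UTF-8 GEDCOM as cp1252 and relabel its CHAR line as ANSI.

    Raises UnicodeEncodeError if the text contains a character that
    cp1252 cannot represent.
    """
    text = utf8_bytes.decode("utf-8-sig")  # tolerate a leading byte order mark
    out_lines = []
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.upper() in ("1 CHAR UTF-8", "1 CHAR UTF8"):
            # Swap only the visible part so the original line ending is kept.
            line = line.replace(stripped, "1 CHAR ANSI")
        out_lines.append(line)
    return "".join(out_lines).encode("cp1252")
```

Reading the export with `open(path, "rb").read()` and writing the result back in binary mode avoids any further re-coding by the platform. As with the iconv approach, the result only imitates what a Windows program would write.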

I suppose you want a GEDCOM generated under Windows?
Sorry, apart from the above command, I do not have any way of using this encoding. Maybe ask Josip, but it seems to me that he uses CP1251 instead!

My sample GEDCOM files that used ANSI were very old (€ did not exist in the last century...). Also, we do not often use signs like € or $ (except maybe in the occasional note). As for the correct character in the word for sister, most genealogists under Windows do not enter the correct one; because of encoding issues, "sœur" becomes "soeur" ... :(
The meaning is understood, but reading this alternative pair of characters (o + e) is not correct.
http://en.wikipedia.org/wiki/%C5%92

Note that it seems to be used in (British) English too!
http://en.wikipedia.org/wiki/%C5%92#English


Jérôme

--- On Fri 23.3.12, Tim Lyons <[hidden email]> wrote:

> From: Tim Lyons <[hidden email]>
> Subject: Re: Re: GEDCOM import character encoding
> To: "[hidden email] List" <[hidden email]>
> Cc: "Jérôme Rapinat" <[hidden email]>
> Date: Friday 23 March 2012, 19:34

Re: Re: GEDCOM import character encoding

Josip-3
On 23.03.2012 21:19, jerome wrote:
>
>
> I suppose you want to have a gedcom generated under Windows OS?
> Sorry, except the above command, I do not have any solution for using this encoding. Maybe ask Josip but it seems to me that he rather uses CP1251!
>

I (will) use cp1250
http://en.wikipedia.org/wiki/Windows-1250

But I don't know anything about GEDCOM, nor do I use it.


--
Josip


Re: Re: GEDCOM import character encoding

Tim Lyons
Administrator
Josip-3 wrote
I (will) use cp1250
What do you mean? How and where will/do you use cp1250? How do you configure things so that you use this encoding?

Do you have access to any Windows applications that claim to output GEDCOM files in ANSI encoding (maybe PAF? which I think may be a free download) and if so, would you be able to send me a GEDCOM file produced from such an application (preferably zipped so it doesn't get re-coded in-flight) that contains characters like € (euro sign), upper and lower case oe {sister: sœur}, upper and lower case S with caron, capital Y with umlaut, upper and lower case Y with caron?

Tim.

Re: Re: GEDCOM import character encoding

Josip-3
On 25.03.2012 19:59, Tim Lyons wrote:

>
> Josip-3 wrote
>>
>> I (will) use cp1250
>>
>
> What do you mean? How and where will/do you use cp1250? How do you configure
> things so that you use this encoding?
>
> Do you have access to any Windows applications that claim to output GEDCOM
> files in ANSI encoding (maybe PAF? which I think may be a free download) and
> if so, would you be able to send me a GEDCOM file produced from such an
> application (preferably zipped so it doesn't get  re-coded in-flight) that
> contains characters like € (euro sign), upper  and lower case oe {sister:
> sœur}, upper and lower case S with caron,  capital Y with umlaut, upper and
> lower case Y with caron?
>
I mean that the Windows code page for my language is cp1250.
As I said, I do not use GEDCOM and do not know anything about it.

I installed PAF 5.2.18 and entered a few people whose names contain
the characters you are interested in.
I don't use some of those characters, so I hope they are the correct ones.
Is umlaut the same as diaeresis?
There is no Y with caron, only Y with circumflex (caron flipped horizontally)?

I exported the file as GEDCOM 5.5 ANSI.
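
One way to examine such an export is to read its declared character set and list which byte values in the range 0x80-0x9F it contains. In latin-1 and latin-9 that range holds only control codes, whereas the Windows code pages put printable characters there, so any hits hint at a Windows encoding. The sketch below is illustrative only: the function names are hypothetical, and the check cannot tell cp1252 from cp1250 or cp1251.

```python
def declared_charset(raw):
    """Return the value of the first '1 CHAR' line in the data, or None."""
    for line in raw.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0] == b"1" and parts[1] == b"CHAR":
            return parts[2].decode("ascii", errors="replace")
    return None


def bytes_in_c1_range(raw):
    """Return the sorted distinct byte values 0x80-0x9F found in the data."""
    return sorted({value for value in raw if 0x80 <= value <= 0x9F})
```

Both functions take the raw bytes of the unzipped file, read in binary mode so that nothing is re-coded on the way in.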

--
Josip


Attachment: testansi.zip (1K)