What's the best method for GEDCOM merge?

Douglas S. Blank
I'm wondering how to best merge GEDCOM data from on-line genealogy sites.
For example, say I find some children that I didn't know about, belonging
to parents already in my database, perhaps on a page like:

http://www.gencircles.com/users/marshallc83/1/data/550

If I use the GEDCOM family download and import it, I end up with duplicates
of the individual, his parents, and some of the children, and have to merge
them in GRAMPS (which very conservatively keeps most (all?) of the
information, including duplicate birth events, and is quite time-consuming
to fix manually).

Is there some GEDCOM editing I could do beforehand that would make this
easier? I guess I could just remove the replicated people, and then
connect the newly imported people to the pre-existing ones after the
import. Is that currently the best approach?
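
To make the question concrete, the kind of pre-editing I have in mind is
roughly the sketch below (the xref IDs are made up, and the FAM records
would still carry dangling CHIL/HUSB/WIFE pointers that need connecting
to existing people, or cleaning up, after the import):

# Drop whole level-0 GEDCOM records for people I already have,
# before importing the rest of the downloaded family file.
ALREADY_HAVE = {"@I1@", "@I2@"}   # hypothetical xref IDs of duplicated people

def strip_records(lines, drop_ids):
    """Yield GEDCOM lines, skipping any level-0 record whose xref is in drop_ids."""
    skipping = False
    for line in lines:
        parts = line.split(None, 2)
        if parts and parts[0] == "0":
            # A new top-level record starts here; decide whether to keep it.
            skipping = len(parts) > 1 and parts[1] in drop_ids
        if not skipping:
            yield line

with open("family_download.ged") as src:
    with open("family_download_trimmed.ged", "w") as dst:
        dst.writelines(strip_records(src, ALREADY_HAVE))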

In an ideal merge, if an imported person matched someone already in the
database, would GRAMPS only add new info to the matched person, and then
add the unmatched people (as it does currently)? Could the user provide
hints, either in the GEDCOM file or interactively, as to who is a match?

I use the CSV Import a lot for this. But I'd rather get the complete data
through GEDCOM without having to re-key it.

[I'm asking this on the developers list as it may involve some technical
GEDCOM editing, and maybe a discussion of how to do this better.]

-Doug





Re: What's the best method for GEDCOM merge?

Julio Sánchez
2007/9/1, Douglas S. Blank <[hidden email]>:

> I'm wondering how to best merge GEDCOM data from on-line genealogy sites.
> For example, say I find some children that I didn't know about, belonging
> to parents already in my database, perhaps on a page like:
>
> http://www.gencircles.com/users/marshallc83/1/data/550
>
> If I use the GEDCOM family download and import it, I end up with duplicates
> of the individual, his parents, and some of the children, and have to merge
> them in GRAMPS (which very conservatively keeps most (all?) of the
> information, including duplicate birth events, and is quite time-consuming
> to fix manually).

My working copy of GRAMPS 2.0.12 (which is what I use for real work)
does much better merging, only duplicating things that are really
different.  Deciding what counts as different requires some thought, I
think.  Case differences in names, or differences only in how a name is
split into parts (prefix, suffix, etc.), are not worth treating as
different.  Other name differences, on the other hand, are crucial to
keep.

Similarly, when should an event that gives the date but omits the
place be merged into an event that gives both?  The result might imply
that the sources for the first event now also back the place for the
event, which could be misleading.  I think that some preference
options should be available to tailor how aggressive merging should
be.

A related question: how should two pieces of information that differ
only in their privacy settings be merged, if at all?  I'd probably
merge them, keeping the primary record's setting.  Others might prefer
always marking the result private.  And so on.

I plan on forward porting this to later versions, but I have been
procrastinating.

> Is there some GEDCOM editing I could do beforehand that would make this
> easier? I guess I could just remove the replicated people, and then
> connect the imported to pre-existing ones once they are imported. Is that
> currently the best approach?

I think that GRAMPS should have a method to do merging automatically
while importing.  This requires a number of things, but it would be
immensely useful.  To me at least.

The key is having a method that detects, with little room for error,
whether two people are the same or not.  Many possibilities exist, but
the only method widely used in practice is matching on _UID, a
nonstandard extension to GEDCOM that many programs, including PAF, use.

A _UID, or Unique Identifier, is a string assigned to an entry by its
creator.  It may be purely random (the collision rate can be made
arbitrarily low by choosing the length) or produced by some other
generating method.  Unique identifiers are opaque.  One unique
identifier designates exactly one person, but it must be possible for
a person to carry more than one.

Many programs generate a _UID and use it for merging.  GRAMPS should
generate a _UID on export (possibly based on handles) *iff* no _UID
already exists for the entry.
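
In sketch form, the export rule would be something like the following
(the person object's attribute access here is made up for illustration;
it is not the real GRAMPS API):

import hashlib

def uid_gedcom_lines(person, level=1):
    """Return the GEDCOM _UID line(s) for a person-like object."""
    # Keep any _UID the entry already carries.
    uids = [value for (name, value) in person.attributes if name == "_UID"]
    if not uids:
        # No _UID yet: derive one from the GRAMPS handle.  Hashing the
        # handle is deterministic, so repeated exports emit the same
        # identifier; a random uuid4 would also work but would change
        # on every export.
        uids = [hashlib.md5(person.handle.encode("utf-8")).hexdigest().upper()]
    return ["%d _UID %s" % (level, uid) for uid in uids]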

My code also has a modified merge-candidate search routine that
matches on _UID, so that duplicate records are easier to find after
import; such matches are given weight 1000.  I also have filters for
finding candidate duplicate parents and candidate duplicate spouses; I
think I needed to extend the generic filters for one of those to work.
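
Roughly, the _UID shortcut in the candidate search amounts to this
(again only a sketch with placeholder names; heuristic_compare stands
for whatever scoring the duplicate finder already does):

MATCH_WEIGHT = 1000   # treated as a certain match by the duplicate finder

def uid_set(person):
    # Same illustrative attribute access as in the previous sketch.
    return {value for (name, value) in person.attributes if name == "_UID"}

def person_match_weight(p1, p2, heuristic_compare):
    if uid_set(p1) & uid_set(p2):
        return MATCH_WEIGHT           # shared _UID: the same person
    return heuristic_compare(p1, p2)  # otherwise fall back to the usual score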

Still, one big remaining problem is source and place duplicates.
After importing a GEDCOM, I merge sources and places before anything
else; otherwise duplicate events and source references are not
detected and are left unmerged.

For me, the definition of ideal merging should include three concepts:

   - Preservation: no information should be lost
   - Idempotence: it must be possible to merge the same information
again and again without duplicating it every time
   - Aggregation: combining information that, while not identical
byte for byte, is semantically equivalent (this is where preference
settings or even human guidance play a role)

Ideal support for this should make it easy to preserve these
properties, both while importing and afterwards.  My code achieves a
lot of this in the second case, but I will eventually do something
about the first.

If I forward-port the code to current versions, I need a design
decision.  I need a routine for each information type (Person, Event,
Source Reference, Name, etc.) that determines not whether two items
are equal, but whether they are mergeable, i.e. semantically
equivalent, possibly taking user preferences into account as well.
Generally speaking, this is not identical to is_equal(), though in
some cases it is.

For ease of coding, it would be simplest if this were part of the
object model, so that one can call is_mergeable() on an object of any
kind.  It would also help hide object internals and keep them
localized.  But this would mean putting into the data model a function
that is used only for merging.  Is this acceptable?

Is there any good alternative?  I thought of subclassing each class
in the merging routines.  This would help with polymorphism, but would
not hide the internals.
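
To make the dilemma concrete, the two alternatives would look roughly
like this (class names, attributes and the preference key are invented
for illustration; none of this is existing GRAMPS code):

# (a) In the object model: each primary object answers for itself.
class Event:
    def __init__(self, etype, date=None, place=None):
        self.etype = etype
        self.date = date
        self.place = place

    def is_mergeable(self, other, prefs):
        """Semantic equivalence, possibly looser than is_equal()."""
        if self.etype != other.etype:
            return False
        if self.date and other.date and self.date != other.date:
            return False        # two different dates: not the same event
        if self.place != other.place:
            if self.place and other.place:
                return False    # two different places: never mergeable
            # One side omits the place: a user preference decides
            # (the date-without-place question from earlier).
            return prefs.get("merge_event_missing_place", False)
        return True

# (b) Outside the object model: subclass (or wrap) each class in the
# merging code.  Polymorphism still works, but Event's internals leak
# into the merging routines.
class MergeableEvent(Event):
    def is_mergeable(self, other, prefs):
        return False  # placeholder: same logic, written against Event's internals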

Julio


Re: What's the best method for GEDCOM merge?

Douglas S. Blank
Julio,

I think this should be forward-ported early in the next GRAMPS revision,
so that any methods needed (like the proposed is_mergeable()) can be
defined and integrated sooner rather than later.

Perhaps you could write a sample is_mergeable() method for one of the
object types, and show how this would work with personal choices for
directing the merge? Also, it might help to see what your code must do in
order to do a good merge with the 2.0.12 version. Can you make that
available?

It does seem that the details of whether each type of object can be merged
will be pretty specific to that object type, so the detailed code will
have to exist somewhere. It should be fairly easy to refactor later, once
one method of implementation or another is decided on.
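
(Purely to illustrate the refactoring point, not a proposal for the
final shape: the per-type checks could start out as plain functions in
the merge code, dispatched by type, and later be moved onto the classes
as methods with little churn.  All names below are hypothetical.)

def is_mergeable(obj1, obj2, prefs):
    """Dispatch to a type-specific check; easy to fold into the classes later."""
    if type(obj1) is not type(obj2):
        return False
    handler = _HANDLERS.get(type(obj1).__name__)
    return handler is not None and handler(obj1, obj2, prefs)

def _person_mergeable(p1, p2, prefs):
    return False  # placeholder: the Person-specific details would live here

def _event_mergeable(e1, e2, prefs):
    return False  # placeholder: likewise for Event

_HANDLERS = {"Person": _person_mergeable, "Event": _event_mergeable}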

Let me know if I can help in any way... I'm bringing my system up to date
this weekend so I can begin running (and working on) 3.0.

-Doug



--
Douglas S. Blank
Associate Professor, Bryn Mawr College
http://cs.brynmawr.edu/~dblank/
Office: 610 526 6501
