I'm wondering how to best merge GEDCOM data from on-line genealogy sites.
For example, say I find some children that I didn't know about, whose parents are people in my database, perhaps on a page like:

http://www.gencircles.com/users/marshallc83/1/data/550

If I use the GEDCOM family download and import, I then have replicated the individual, his parents, and some of the children, and have to merge them in GRAMPS (which very conservatively keeps most (all?) information, including duplicate birth events, and is quite time-consuming to manually fix).

Is there some GEDCOM editing I could do beforehand that would make this easier? I guess I could just remove the replicated people, and then connect the imported ones to the pre-existing ones once they are imported. Is that currently the best approach?

In an ideal merge, if a matched person is found already existing, would GRAMPS only add new info to those matched people, and then add unmatched people (as it does currently)? Would there be hints that the user could provide, either in the GEDCOM file or interactively, as to who is a match?

I use the CSV Import a lot for this. But I'd rather get the complete data through GEDCOM without having to re-key it.

[Asking this question to developers as this may involve some technical GEDCOM editing, and maybe discussion on how to do this better.]

-Doug

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Gramps-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gramps-devel
|
2007/9/1, Douglas S. Blank <[hidden email]>:
> I'm wondering how to best merge GEDCOM data from on-line genealogy sites.
> For example, say I find some children that I didn't know about, whose
> parents are people in my database, perhaps on a page like:
>
> http://www.gencircles.com/users/marshallc83/1/data/550
>
> If I use the GEDCOM family download and import, I then have replicated the
> individual, his parents, and some of the children, and have to merge them
> in GRAMPS (which very conservatively keeps most (all?) information,
> including duplicate birth events, and is quite time-consuming to manually
> fix).

My working copy of GRAMPS 2.0.12 (that is what I use for real work) does much better merging, only duplicating things that are really different. What counts as different, I think, requires some thought. Case differences in names and differences in how a name is split into parts (prefix, suffix, etc.) I don't find useful to consider different. On the other hand, other name differences are crucial to keep.

Similarly, when should an event that gives the date but omits the place be merged into an event that gives both? The result might imply that the sources for the first event now also back the place of the event, which may mislead. I think that some preference options should be available to tailor how aggressive merging should be.

Another similar question: how should two pieces of information that differ only in privacy options be merged, if at all? I'd probably merge them, keeping the primary record's option. Others might prefer always marking the result private. And so on.

I plan on forward porting this to later versions, but I have been procrastinating.

> Is there some GEDCOM editing I could do beforehand that would make this
> easier? I guess I could just remove the replicated people, and then
> connect the imported to pre-existing ones once they are imported. Is that
> currently the best approach?

I think that GRAMPS should have a method to do merging automatically while importing.
This requires a number of things, but would be immensely useful. To me at least.

The key is having a method to detect, with little room for error, whether two people are the same or not. Many possibilities exist, but the only method widely used in practice is matching on _UID, a nonstandard extension to GEDCOM that many programs, including PAF, use.

A _UID (Unique Identifier) is a string assigned to an entry by its creator. It may be purely random (the collision rate can be made arbitrarily low by choosing the length) or it may use some other generation method. Unique identifiers are opaque. One unique identifier designates one precise person, but it is absolutely necessary to allow a person to have more than one.

Many programs generate _UID and use it for merging. GRAMPS should generate (possibly based on handles) a _UID on export *iff* no _UID exists for the entry.

My code also has a modified merging candidate search routine that matches on _UID, so that after import it is easier to find duplicate records; such matches are given weight 1000. I also have filters for finding candidate duplicate parents and candidate duplicate spouses. I think I needed to extend the generic filters for one of those to work.

Still, one big remaining problem is source and place duplicates. After importing a GEDCOM, I merge sources and places before anything else; otherwise duplicate events and source references are not detected and are left unmerged.

For me, the definition of ideal merging should include three concepts:

- Preservation: No information should be lost
- Idempotence: It must be possible to merge the same information again and again without duplication every time
- Aggregation: Combining pieces of information that, while not identical byte-for-byte, are semantically equivalent (here is where preference settings or even human guidance play a role)

Ideal support for this should make it easy to preserve these properties, both while importing and as an afterthought.
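As a rough illustration of the _UID matching, preservation and idempotence ideas above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Person` record, `find_match`, `merge_into`, `import_people` and `ensure_uid` are hypothetical names, not GRAMPS classes or functions, and the random UUID is only one possible way to generate a _UID. Aggregation of semantically equivalent data is deliberately left out here, since plain set union only recognises byte-identical items.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical, simplified record; not the GRAMPS Person class."""
    names: set
    uids: set = field(default_factory=set)    # a person may carry several _UIDs
    events: set = field(default_factory=set)  # e.g. {("BIRT", "1850", "Boston")}


def find_match(incoming, database):
    """Return the first existing person sharing at least one _UID, else None."""
    for person in database:
        if person.uids & incoming.uids:
            return person
    return None


def merge_into(existing, incoming):
    """Set union keeps everything (preservation) and never stores the same
    item twice, so repeating the merge changes nothing (idempotence)."""
    existing.names |= incoming.names
    existing.uids |= incoming.uids
    existing.events |= incoming.events


def import_people(database, incoming_people):
    """Merge matched people in place; add unmatched people as new copies."""
    for incoming in incoming_people:
        match = find_match(incoming, database)
        if match is None:
            database.append(Person(set(incoming.names),
                                   set(incoming.uids),
                                   set(incoming.events)))
        else:
            merge_into(match, incoming)


def ensure_uid(person):
    """On export, assign a _UID only if the entry has none (assumed scheme:
    a random 128-bit value written as 32 hex digits)."""
    if not person.uids:
        person.uids.add(uuid.uuid4().hex.upper())
```

Importing the same file twice with `import_people` leaves the database unchanged the second time, which is the idempotence property; a real implementation would still need the weighting and the source and place handling described above.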
My code achieves a lot of this in the second case (merging as an afterthought, after import), but I will eventually do something about the first (merging while importing).

If I forward port the code to current versions, I need a design decision. I need a routine for each information type (Person, Event, Source Reference, Name, etc.) that determines not whether two information items are equal, but whether they are mergeable, i.e. semantically equivalent, with the possibility of taking user preferences into account as well. Generally speaking, is_mergeable() is not identical to is_equal(), though in some cases it is.

For ease of coding, it would be easier if it were part of the object model, so that one can call is_mergeable() on an object of any kind. Additionally, it would help hide object internals and keep them localized. But this would involve putting into the data model a function that is used only for merging. Is this acceptable?

Is there any good alternative for this? I thought of subclassing each class in the merging routines. This would help with polymorphism, but would not hide the internals.

Julio
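To make the design question concrete, the following is a minimal Python sketch of what an is_mergeable() method on the objects themselves might look like. It is not Julio's code and not the GRAMPS object model: the `MergePrefs`, `Name` and `Event` classes, their fields, and the two preference flags are all hypothetical, chosen only to mirror the examples discussed in the thread (name case, name splitting, and a date-only event versus one with a place).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergePrefs:
    """Hypothetical user preferences controlling how aggressive merging is."""
    ignore_name_case: bool = True
    merge_missing_place: bool = False  # merge a date-only event into one with a place?


@dataclass
class Name:
    first: str
    surname: str

    def _key(self, prefs):
        # Compare the whole name, so a different split into parts is ignored.
        full = " ".join(f"{self.first} {self.surname}".split())
        return full.casefold() if prefs.ignore_name_case else full

    def is_mergeable(self, other, prefs):
        return self._key(prefs) == other._key(prefs)


@dataclass
class Event:
    kind: str
    date: str
    place: str = ""

    def is_mergeable(self, other, prefs):
        if (self.kind, self.date) != (other.kind, other.date):
            return False
        if self.place == other.place:
            return True
        # Places differ: mergeable only if one is missing and the user opted in.
        return prefs.merge_missing_place and "" in (self.place, other.place)
```

With this shape, merging code can call `a.is_mergeable(b, prefs)` on any pair of like objects without knowing their fields, which is the localisation benefit mentioned above. The cost is the one raised in the message: the data model gains a method that only the merge code uses.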
|
Julio,
I think this should be forward-ported early in this next GRAMPS revision, so that any methods needed (like the proposed is_mergeable()) can be defined and integrated sooner rather than later.

Perhaps you could write a sample is_mergeable() method for one of the object types, and show how this would work with personal choices for directing the merge?

Also, it might help to see what your code must do in order to do a good merge with the 2.0.12 version. Can you make that available?

It does seem that the details of whether each type of object can be merged will be pretty specific to that object type, so that detailed code will have to exist somewhere. It should be fairly easy to refactor once one method of implementation or another is decided on.

Let me know if I can help in any way... I am bringing my system up-to-date this weekend so I can begin running (and working on) 3.0.

-Doug

On Sat, September 1, 2007 10:40 am, Julio Sánchez wrote:
> [...]
> For ease of coding, it would be easier if it is part of the object
> model, so that one can call is_mergeable() on an object of any kind.
> Additionally, it would help hiding object internals and keeping them
> localized. But this would involve putting into the data model a
> function that is used only for merging. Is this acceptable?
> [...]

--
Douglas S. Blank
Associate Professor, Bryn Mawr College
http://cs.brynmawr.edu/~dblank/
Office: 610 526 6501
