Storing data from large sources

Re: Storing data from large sources

Tim Lyons
Administrator
Thanks for your thoughtful reply.

I was well aware of the approach you take, and was careful to ensure that nothing that was done would prevent you using Gramps just as you want to.

Frederico Muñoz wrote
> I have a book that details, on page 7:
>
> “In the 1870s B moved to the town of BT. It was here that I's father K was born in 1860. By the time he was 30 he had married.
> His first child M was born there. Shortly afterwards his wife died and two years later he married G. M was 12 before her brother
> I appeared.”
>

I would create a source reference for page 7 of the book, copy it to
the clipboard and use it in each event/assertion. I would further add
a specific *citation* (TEXT_FROM_SOURCE) to each different source
reference that deals only with the specific event. Example: in K's
birth event I would add "... It was here that I's father K was born in
1860...", etc. So, each Source Reference contains a different
citation, making it unique.
Yes, that is fine, and if you have the time to extract the specific words for each citation, it is a good approach. However, my point was that the family history part of the page was only those two sentences, and I am going to use those sentences to support many different facts. Sometimes, life is just too short to extract specific words. I think it is much better to help users to provide *some* source citation information, even if they do not take the time and effort to input some perfect data. With the GEPS, the user can input a citation, and then use it repeatedly if they want.

Frederico Muñoz wrote
>The objections listed are:

>  * The Source Reference does not allow the Media scan to be stored.
I agree. This was one of my original problems - although the scan
should be present in the Source, a way to link it to a source
reference would be nice.
You would be able to continue to store the scan in the source. However, if you wanted the scan to be related to the citation, then you would be able to store it there (either as well as in the source, or instead of in the source).

Frederico Muñoz wrote
>    * The Source Reference is not shared, there is a separate instance for each place where it occurs (e.g. each event).

I depend on this behaviour, and the GEPS makes explicit mention of the
way I do things.  I would like to note that what I do is what is
already present in GEDCOM
Nothing in the GEPS *forces* the citation to be shared. It would continue to be possible to have separate citation instances. In fact this would be the normal situation, and you would have to explicitly select an existing citation if you wanted it shared. The proposal is entirely consistent with GEDCOM and the way source citations are used there with text (notes) and multimedia links.
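
The distinction between separate and explicitly shared citations can be illustrated with a minimal sketch. This is not the Gramps data model or the design in the GEPS; the class names (`Source`, `Citation`, `Event`) and their fields are hypothetical, chosen only to show that sharing is something the user opts into.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical classes for illustration only; not the Gramps API.
@dataclass
class Source:
    title: str


@dataclass
class Citation:
    source: Source
    page: str                      # volume/page location within the source
    text: Optional[str] = None     # text-from-source note, if any


@dataclass
class Event:
    description: str
    citations: List[Citation] = field(default_factory=list)


book = Source("Family history book")

# Normal situation: each event gets its own citation instance.
birth_k = Event("Birth of K")
birth_k.citations.append(
    Citation(book, "p. 7", "It was here that I's father K was born in 1860")
)
birth_m = Event("Birth of M")
birth_m.citations.append(Citation(book, "p. 7", "His first child M was born there"))

# Opt-in sharing: the user explicitly selects an existing citation.
shared = Citation(book, "p. 7")
marriage_1 = Event("First marriage of K", [shared])
marriage_2 = Event("Marriage of K and G", [shared])

# Editing the shared citation is visible from both events...
shared.text = "Whole family-history paragraph on page 7"
print(marriage_1.citations[0].text == marriage_2.citations[0].text)  # True
# ...while the separate citations remain independent objects.
print(birth_k.citations[0] is birth_m.citations[0])  # False
```

Under these assumptions, a user who wants event-specific text simply never reuses a citation, and nothing about the shared case gets in the way.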

Frederico Muñoz wrote
> Note that there is an argument that separate source references for the different events is preferable, because the exact text
> that relates to that particular event can be attached. For example, for the birth event for person K, one could attach: “…the
> town of BT. It was here that I's father K was born in 1860…”. There are two objections to this:
>
>    * It is difficult to identify exactly which parts of the text are relevant to each event. Should I’s father be included in the source for K’s birth?

While there will always be the need for some personal criteria (this
is far from an exact science), this is no different from other
decisions concerning where source references should be added. I do not
understand the objection very well though (my fault), but yes: if the
supporting information concerning K's birth is derived from that
sentence I would use "...It was here that I's father K was born in
1860...". If I knew the place that "here" is supposed to mean I would
put it inside brackets. This way I will know exactly why I have K's
birth in 1860. Without this sort of event-specific citations (read,
TEXT_FROM_SOURCE, a source note added to a source reference) I would
have to go find out by reading the entire source.
Sorry this wasn't very clear. I agree that it is no different from any other decisions about how source information should be input. The point I was trying to make is that it is rather tedious to work out exactly which parts of the source text relate to each event that you are trying to provide the source for. Also, think about what happens when you (or someone else) later goes back to the source to check the conclusion. In practice you are likely to want to re-read the whole paragraph, because the doubt in your mind arises from some thought that was not present when you wrote the original source reference. Therefore it is quite likely that you want to check everything, not just the words that you had selected.

Having said all this though, the GEPS would not prevent you carefully citing separate exact words for each source.

Frederico Muñoz wrote
Well, sharing the Source Text note would be an option... the problem
here is that while sharing a supporting citation works for one-liners, not
all (and certainly not most, in my experience) sources are like that.
And by making Source References something "shared" it stops being
possible to provide adequate citations that support the specific
event.

So, for me it is important that any improvement maintains the ability
to keep source reference specific content. Since I use citations for
everything (this is why they exist, and in PAF for example citations
have a first-order UI element that helps a lot, I have made a feature
request about this) sharing Source References would not work since
changing a citation would mean changing all of them.
Just to emphasise the point again, the GEPS does not *force* you to share citations. You could continue to keep them unique to each event or other object. The GEPS *does* make citations first class UI objects, so that they can be examined as you wish.

Frederico Muñoz wrote
A different matter is the way to "split" sources. Again using Church
Records as an example I have often felt the need for a hierarchical
classification, similar to what is used in the repositories I use:
looking at http://pesquisa.adporto.pt/cravfrontoffice/default.aspx?page=regShow&ID=488904&searchMode=as
in the right side one can see a tree. The organisation is
hierarchical, with a top category for the Parish, which contains
different "series" (Baptism, Marriages, etc), each containing an
"installation unit" (a specific book, for a specific time period).
Since sources in Gramps are "flat" this is not entirely different from
what was done with Places.
I haven't suggested a hierarchical arrangement (except you can regard citations and sources as a two level hierarchy). I think that a hierarchy would be much too complicated, especially for Aunt Martha. The user would have to decide exactly which things were going to be at each level. He would have to decide how the 'location' attribute was going to be used at each level (e.g. at one level, the location refers to volume number, at the next to page, and at the next to line number). This provides more opportunity for confusion, and for inconsistency between different sources within one family tree. Finally, a hierarchical approach is not consistent with GEDCOM. With the approach in the GEPS, the Volume/Page location in the citation is consistent with GEDCOM, and one can take advice from GEDCOM documents as to how to structure your sources.

Re: Storing data from large sources

Frederico Munoz
Hi Tim,

First of all a Happy New Year.

I think I've focused too much on the background information in my
comment, and too little on the GEPS itself, apologies. I didn't want to
convey the idea that I was opposed to the proposed solution, nor that
it didn't address my needs (after all it's quite clear that you took
explicit care in documenting the approach I use).

I'm sending a new batch of comments; I am being a bit of a "devil's
advocate" on some points, but only because I think that the more
comments the GEPS gets the better. There is also a bigger difference
here: I almost never use Census information, so I'm less sensitive to
some of the problems that are being addressed, which is something that
no doubt colours my perspective.

2011/1/1 Tim Lyons <[hidden email]>:
>
> Thanks for your thoughtful reply.
>
> I was well aware of the approach you take, and was careful to ensure that
> nothing that was done would prevent you using Gramps just as you want to.

I'm not opposed to changes, even if they mean that the way I use it
would need to be changed. I'm not married to "my way" of doing things,
so I can easily adapt to any model that satisfies my needs. In any
event it's good to have as little impact as possible on existing
practices (but that shouldn't be something that gets in a way of a
better design)

> With the GEPS, the user
> can input a citation, and then use it repeatedly if they want.

One interesting thing about the proposal is that there will be
(correct me if I'm wrong) another layer added ("Source Citation
Information") that will *not* be shared and will be specific to each
citation. It's not entirely impossible to use the same arguments to
advocate that this one should also be shared, ad infinitum...

> You would be able to continue to store the scan in the source. However, if
> you wanted the scan to be related to the citation, then you would be able to
> store it there (either as well as in the source, or instead of in the
> source).

This is something useful. I sometimes use the event gallery, but this
would be better.

> Nothing in the GEPS *forces* the citation to be shared. It would continue to
> be possible to have separate citation instances. In fact this would be the
> normal situation, and you would have to explicitly select an existing
> citation if you wanted it shared. The proposal is entirely consistent with
> GEDCOM and the way source citations are used there with text (notes) and
> multimedia links.

Yes, I can use different citations. However once the change is made it
would be a bit strange to have different citations that are identified
by exactly the same citation data and only differ in the notes,
gallery, etc. What I mean by this is that I'm more than willing to
adapt to any model instead of maintaining my own, and these decisions
will influence how people will use Gramps by default (e.g. I do not
use a "one source per page" approach because I feel it doesn't fit in
the existing model, even if it is possible).

> Also, think about what happens when you (or someone else) later
> goes back to the source to check the conclusion. In practice you are likely
> to want to re-read the whole paragraph, because the doubt in your mind
> arises from some thought that was not present when you wrote the original
> source reference. Therefore it is quite likely that you want to check
> everything, not just the words that you had selected.

Do note that it is already possible to share SourceRef notes, so even
if the SourceRefs themselves are not shareable the notes are. I
actually use this for the one-line examples: I use the same
TEXT_FROM_SOURCE note in more than one SourceRef, and changing it once
will be reflected in all of them. I'm not saying that this addresses
your needs (or even the needs of most users), just that it is
possible.
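
As a rough sketch of the note-sharing behaviour described above: the classes below (`Note`, `SourceRef`) are hypothetical stand-ins, not the real Gramps objects, and only show how two separate source references can point at one note so that a single correction reaches both.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical classes for illustration only; not the Gramps API.
@dataclass
class Note:
    text: str


@dataclass
class SourceRef:
    source_title: str
    page: str
    notes: List[Note] = field(default_factory=list)


# One TEXT_FROM_SOURCE-style note, entered with an incomplete transcription.
text_from_source = Note("K was born in 1860.")

# Two distinct source references, each attached to the same note object.
ref_birth = SourceRef("Family history book", "p. 7", [text_from_source])
ref_other = SourceRef("Family history book", "p. 7", [text_from_source])

# Correct the transcription once...
text_from_source.text = "It was here that I's father K was born in 1860."

# ...and both references show the corrected text, although the references
# themselves remain separate objects.
print(ref_birth.notes[0].text)
print(ref_other.notes[0].text)
print(ref_birth is ref_other)  # False
```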

> Having said all this though, the GEPS would not prevent you carefully citing
> separate exact words for each source.

That is good, not least because I want to make a GEPS/request/debate about
adding first-class UI support for those citations, even if via popup
or something. One of the problems here (and you have been very careful
in mentioning it, your GEPS is overall very detailed and thorough) is
that these kinds of changes also have an impact on other GEPS and feature
requests, or even sometimes depend on them.

> Just to emphasise the point again, the GEPS does not *force* you to share
> citations. You could continue to keep them unique to each event or other
> object. The GEPS *does* make citations first class UI objects, so that they
> can be examined as you wish.


Sorry, my mistake here. I'm using "citations" wrongly: when I use
citations I mean the TEXT_FROM_SOURCE note that is part of a
SourceRef, while you mean SourceRefs (and rightly so since they are
SOURCE_CITATIONS in GEDCOM). What I mean by "first class UI citations"
is a way to quickly see what is the text in the source that a
particular  source reference contains and that supports that
particular event.

> I haven't suggested a hierarchical arrangement (except you can regard
> citations and sources as a two level hierarchy). I think that a hierarchy
> would be much too complicated, especially for Aunt Martha. The user would
> have to decide exactly which things were going to be at each level. He would
> have to decide how the 'location' attribute was going to be used at each
> level (e.g. at one level, the location refers to volume number, at the next
> to page, and at the next to line number). This provides more opportunity for
> confusion, and for inconsistency between different sources within one family
> tree. Finally, a hierarchical approach is not consistent with GEDCOM. With
> the approach in the GEPS, the Volume/Page location in the citation is
> consistent with GEDCOM, and one can take advice from GEDCOM documents as to
> how to structure your sources.

This is sufficiently different from the main aspects of your GEPS that
it can be considered a different issue altogether, I only mentioned it
because it could be useful to think about it a bit more (perhaps in a
different GEPS). I tend to think that it is possible to have a
non-flat approach that is GEDCOM compatible, not unlike the Locations
Tree View. However I haven't really thought about it much and it would
surely be something not trivial (and, again, that would be entangled
with other GEPS, like the "Evidences-type citations", etc).

Best regards,

Frederico

------------------------------------------------------------------------------
Learn how Oracle Real Application Clusters (RAC) One Node allows customers
to consolidate database storage, standardize their database environment, and,
should the need arise, upgrade to a full multi-node Oracle RAC database
without downtime or disruption
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Gramps-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gramps-devel

Re: Storing data from large sources

Frederico Munoz
In reply to this post by Gerald Britton-2
Hello,

2011/1/1 Gerald Britton <[hidden email]>:
> Whew!  Your idea of a quick answer and mine are clearly not in
> alignment!

Ehe, this one wasn't so long ago in hindsight, true.

> Anyway, I appreciate your insight and for many things your
> approach mirrors my own.  As I originally stated the problem however,
> I was referring to bulk sources such as censuses or BMD registers that
> contain data on thousands or even millions of people.  Plus, these
> sources frequently contain interesting information that do not have a
> natural home in GEDCOM or gramps, other than as textual data.

Indeed, my apologies, I almost entirely disregarded your initial
message to focus on the GEPS.

> For example, Canadian censuses often record the construction material
> of the house where a family lived when it was polled.

Damn, you guys are thorough!

>  Now, I may
> eventually want a Residence event; then, the construction material
> might be a good event attribute.  However, I wish to capture the data
> from the census *all at once* and *in one place*, including an image
> of the page where the data is found.  Later, I can build other event
> types from the data.  Also, I would like to have key/value pairs for
> each data point, for ease of comparison with other censuses.  This
> idea forms the foundation of the GEPS, I believe.

Yes, as I added in a later message my perspective is coloured by my
almost complete reliance on Church Records, not Census (much to my
sadness).

> So I would have an Event (Census), with a Source (1901 Census of
> Canada), with a Source Reference (RG31, Alberta, Calgary, District 35,
> Sub District 1, page 2, line 3) and a matching Source Contents
> containing all the data points -- as key, value pairs -- on that line
> in the census (for at least one Canadian census, there are over 100
> data points!) plus a Media Object reference to the image of the page
> itself -- either stored locally or on the Library and Archives Canada
> site (which is also the Repository for my source).

I see, it makes sense given the tabular nature of Census... the
problem being the implementation of key/value pairs I suppose...
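
For what it's worth, one way the key/value idea could look is sketched below. Everything here is an assumption made for illustration: the `CensusLine` class, the `compare` helper, the key names, the sample values and the second census entry are all invented, and none of it reflects the GEPS design or the Census gramplet.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


# Hypothetical structure: one census line captured "all at once, in one place".
@dataclass
class CensusLine:
    source: str                     # e.g. "1901 Census of Canada"
    reference: str                  # district, sub-district, page, line
    data: Dict[str, str] = field(default_factory=dict)
    image: Optional[str] = None     # path or URL of the page image, if any


def compare(a: CensusLine, b: CensusLine) -> Dict[str, Tuple[str, str]]:
    """Return the keys present in both lines whose values differ."""
    common = a.data.keys() & b.data.keys()
    return {k: (a.data[k], b.data[k]) for k in sorted(common) if a.data[k] != b.data[k]}


# Sample values are invented for the example.
line_1901 = CensusLine(
    source="1901 Census of Canada",
    reference="RG31, Alberta, Calgary, District 35, Sub District 1, page 2, line 3",
    data={"name": "John Smith", "age": "34", "house_material": "wood"},
)
line_1911 = CensusLine(
    source="1911 Census of Canada",
    reference="(hypothetical reference)",
    data={"name": "John Smith", "age": "44", "house_material": "brick"},
)

print(compare(line_1901, line_1911))
# {'age': ('34', '44'), 'house_material': ('wood', 'brick')}
```

Keeping the data points as plain string pairs makes comparison between censuses trivial, at the cost of leaving any typing or validation of the values to a later step.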

> The Census gramplet does much of what I'm talking about except that it
> stores the attributes as Event Reference attributes.  That has
> limitations (especially sharability) and I would argue that the number
> of sheep my g.grandfather had is not an attribute of the Census but
> rather of my grandfather or perhaps the farm he had at the time.

For me it is first of all an "attribute" of the source, which can then
be used as something that backs up attributes of a person or event.
The number of sheep is no different, I think, from e.g. the number of
children a person is said to have on a particular source (say, a death
certificate or something like that).

Having said that I do understand the problem here: when considering
that a Census of a particular year is a Source, and each entry a
SourceReference, all the information present in a single line (let
alone a single page, I imagine...) is used to back up numerous events,
hence the need to share these SourceRefs (since copying would mean
that correcting the census entry would have to be done multiple
times)... the only thing that comes to mind is that this information
should not be stored in the SourceRef but in the Source, and
SourceRefs would then reference them. This works for all kinds of
notes, but not for Images. This is something that could be
impractical, I'm not sure...

Cheers,

Frederico

Re: Storing data from large sources

Tim Lyons
Administrator
In reply to this post by Frederico Munoz
Frederico Muñoz wrote
One interesting thing about the proposal is that there will be
(correct me if I'm wrong) another layer added ("Source Citation
Information") that will *not* be shared and will be specific to each
citation. It's not entirely impossible to use the same arguments to
advocate that this one should also be shared, ad infinitum...
Actually, the original proposed solution (in section 2.3 of the GEPS) does not have any information in the "CitationRef", so there would not be any information that is specific, and hence there would not be an argument for an infinite regress. The design adds information to the CitationRef to support deduction content, but this addition is not an essential part of the GEPS.

Re: Storing data from large sources

Tim Lyons
Administrator
In reply to this post by Benny Malengier
Benny Malengier wrote
2010/12/19 Nick Hall <nick__hall@hotmail.com>

> Benny,
>
> Tim has done a lot of work on the GEPS, and it is now at a stage where I
> think that it would be helpful if you could review it.
>
>
> http://gramps-project.org/wiki/index.php?title=GEPS_023:_Storing_data_from_large_sources
>
> My main concern is that the Citation Reference editor may be rather
> complicated and large. What do you think?
>

I'll try to find the time to read it this week. A long wiki page that!
Benny, have you been able to have a look at the GEPS? I am still very keen to work on this change, and I believe that Nick would take the lead on it.