gramps 5.0, sqlite3 and narrativeweb performances

gramps 5.0, sqlite3 and narrativeweb performances

Serge Noiraud-2
Hi,

With BSDDB, creating my narrativeweb takes 2 hours.
With sqlite3, after 30 hours, it's not finished.

I'm on Ubuntu 14.04.
Could it be the following problem: sqlite3 and Python 2.6?

gramps -v gives the following:
...
Databases:
-------------------------
  bsddb     :
      version     : 6.0.1
      db version  : 5.3.28
      location    : /usr/lib/python3/dist-packages/bsddb3/__init__.py
  sqlite3   :
      version     : 3.8.2
      py version  : 2.6.0
      location    : /usr/lib/python3.4/sqlite3/__init__.py

I'll try to find out where the problem is.
Pass 0 is OK (1 second max).
Pass 1 is OK (8 minutes).
Pass 2 is not: still running after 30 hours (55% of the individual pages).

Is there a way to tune sqlite?

Serge


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 7:20 AM, Serge Noiraud <[hidden email]> wrote:
> With BSDDB, creating my narrativeweb takes 2 hours.
> With sqlite3, after 30 hours, it's not finished.
> [...]
> Is there a way to tune sqlite?

There are some ways to make it work better/faster. For example, perhaps by adding an index.

Can you share your database? Also, if you can give the command-line arguments for running the report, that would make it easy to replicate.

-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

enno
In reply to this post by Serge Noiraud-2
Hi Serge,
> With BSDDB, creating my narrativeweb takes 2 hours.
> With sqlite3, after 30 hours, it's not finished.
>
> I'm on Ubuntu 14.04.
> Could it be the following problem: sqlite3 and Python 2.6?
It's not really Python 2.6; you're on 3.4. What you see is the version
number of the Python sqlite3 module, for which I don't know whether
updates exist, but I think you can check that on the Python site.
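
You can check both numbers from Python itself:

    import sqlite3
    print(sqlite3.version)          # version of the sqlite3/pysqlite module: 2.6.0 here
    print(sqlite3.sqlite_version)   # version of the SQLite library itself: 3.8.2 here

so the "py version : 2.6.0" line describes the module, not your interpreter.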
> Is there a way to tune sqlite?
Yes. I started hacking a simple Python GEDCOM parser yesterday, to make
it export to sqlite, and found that it was really slow, both in Python 2
(which it was written for) and in 3 (a quick hack), so I did a Google
search for sqlite python performance.

Results 1 and 2 are quite interesting:

http://codereview.stackexchange.com/questions/26822/myth-busting-sqlite3-performance-w-pysqlite

http://stackoverflow.com/questions/17867605/performance-of-insert-with-python-and-sqlite3

The first link in the first result points to:

http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite

The primary advice I read there is to make sure that sqlite3 doesn't
write data to disk for every insert. In standard mode, it writes data to
disk on every commit, so if you commit after each insert it is indeed
very slow, as I saw in my primitive parser hack.

One possible remedy is to delay commits in code, e.g. only sending a
commit every 1000 inserts or so, or even waiting until all inserts are
done. Another is to change sqlite3 settings so that it doesn't write
every commit to disk right away. These and other ideas are all mentioned
in that improve-insert-per-second question, which has the most
interesting data in the question itself rather than in the answers,
where you would expect it on StackExchange.
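
Both remedies are only a few lines with Python's sqlite3 module; a
minimal sketch (the table and the generated rows are made up for
illustration):

    import sqlite3

    con = sqlite3.connect("tree.db")
    con.execute("PRAGMA synchronous = OFF")      # don't force an fsync for every commit
    con.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    con.execute("CREATE TABLE IF NOT EXISTS person (handle TEXT PRIMARY KEY, data BLOB)")

    rows = (("H%06d" % i, b"...") for i in range(100000))  # stand-in data
    for i, row in enumerate(rows):
        con.execute("INSERT INTO person VALUES (?, ?)", row)
        if i % 1000 == 999:
            con.commit()  # one commit per 1000 inserts, not one per insert
    con.commit()
    con.close()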

I haven't tried everything mentioned yet, but I may do so later this
week. I'm hoping that what works for C will work in Python too.

regards,

Enno



Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 7:50 AM, Enno Borgsteede <[hidden email]> wrote:
> Yes. I started hacking a simple Python GEDCOM parser yesterday, to make
> it export to sqlite, and found that it was really slow [...] so I did a
> Google search for sqlite python performance.

Enno, this is interesting, but it looks to be relevant only to writing a
sqlite database. I'm finding imports to be faster on sqlite than BSDDB,
but I think that is due to indexes speeding up data lookups rather than
to raw write speed (which could well be slower).

But there is something very slow about the narrative report. Serge, it
looks like this will be easier to track down... I don't think I need your
database or settings, as I am seeing the issue myself.

I'm trying this:

python3 -m cProfile -o data.profile ./Gramps.py -O "data.gramps DB-API" --action report --options=name=navwebpage

now and trying to find the bottleneck(s)... if anyone has hints on using pstats, please let me know.
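
(For reference, the basic standard-library recipe for reading that
profile is:

    import pstats
    p = pstats.Stats("data.profile")
    p.strip_dirs().sort_stats("cumulative").print_stats(20)  # top 20 by cumulative time

but pointers beyond that are welcome.)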

-Doug

 




Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by enno
On Wed, May 25, 2016 at 7:50 AM, Enno Borgsteede <[hidden email]> wrote:
> And the primary advice I read there is to make sure that sqlite3
> doesn't write data to disk for every insert. In standard mode, it writes
> data to disk on every commit, so if you commit after each insert it is
> indeed very slow, as I saw in my primitive parser hack.

Note that Gramps does imports as a single Gramps transaction (which is
also a single SQL transaction), so that looks like the single best piece
of advice for making writes fast.

-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by DS Blank
On Wed, May 25, 2016 at 8:06 AM, Doug Blank <[hidden email]> wrote:
> But there is something very slow about the narrative report. [...]
> I'm trying this:
>
> python3 -m cProfile -o data.profile ./Gramps.py -O "data.gramps DB-API" --action report --options=name=navwebpage
>
> now and trying to find the bottleneck(s)...

One thing that I am seeing immediately, running with the --debug flag:

./Gramps.py --debug="" -O "data.gramps DB-API" --action report --options=name=navwebpage

is that navwebpage looks up the same information in the database over and
over. I think it is true that BSDDB has lower overhead for individual
database lookups, and some of Gramps is written as if database lookups
were negligible. If that cost is increased even a little, repetitive db
reads cause huge slowdowns. If we instead store that information in a
variable, it will be faster for both backends, but especially DB-API.
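
For example, a hypothetical sketch of the fix (the caching helper is
invented; get_event_from_handle is the real lookup point):

    # before: a full database round-trip every time a handle repeats
    # event = db.get_event_from_handle(handle)

    # after: fetch each handle once and keep the object in a local dict
    _events = {}
    def get_event(db, handle):
        if handle not in _events:
            _events[handle] = db.get_event_from_handle(handle)
        return _events[handle]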

-Doug

 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 8:27 AM, Doug Blank <[hidden email]> wrote:
> ... navwebpage is looking up the same information in the database over
> and over. [...] If we instead store that information in a variable, it
> will be faster for both backends, but especially DB-API.

Yes, this is the problem. I was able to reduce the time (it looked like
it was going to take hours and hours... it was still running) to 2
minutes and 17 seconds for my family tree (2,500 people). My quick fix
(14 lines of code) is to put a cache on the database for this report.
DB-API and BSDDB now run in about the same amount of time.

The long-term solution is to refine navwebpage to be smarter about
database access. However, adding a cache layer on the database seems like
a fine approach. My quick cache is just a dictionary in which all fetched
data is stored. A Least Recently Used (LRU) cache would be a bit smarter
and use less memory. Probably not a bad idea for the database backend in
general.
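
The whole quick fix is essentially this shape (a sketch, not the actual
patch; the other object types follow the same pattern):

    class CacheProxyDb:
        """Read-through cache in front of a Gramps db (sketch only)."""
        def __init__(self, db):
            self.db = db
            self.cache = {}  # handle -> object; unbounded dictionary

        def get_person_from_handle(self, handle):
            if handle not in self.cache:
                self.cache[handle] = self.db.get_person_from_handle(handle)
            return self.cache[handle]

        # ... ditto for events, families, places, citations, etc.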

I'll work up a PR for further discussion.

-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 9:11 AM, Doug Blank <[hidden email]> wrote:
> Yes, this is the problem. I was able to reduce the time [...] to 2
> minutes and 17 seconds for my family tree (2,500 people). My quick fix
> (14 lines of code) is to put a cache on the database for this report.
> [...] I'll work up a PR for further discussion.

What makes this report really, crazily expensive is combining it with the
Living proxy. The Living proxy may need many, many database lookups to
determine whether someone is alive. If you don't cache a lookup, those
repeated costs can become exponential, regardless of backend. So caching
the database helps linearly, but caching the Living proxy helps
exponentially.

If we cache all proxies, then every lookup in nested proxies will be cached (lots of memory). If we make a special CachedProxy that we use selectively, we'll have to add it to each place that we want to speed up.
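
The wiring for the selective version would look roughly like this
(hypothetical sketch; proxy arguments elided, and CacheProxyDb is the
proposed wrapper, not an existing API):

    # cache below the proxy: only raw db reads are cached (linear win)
    db = LivingProxyDb(CacheProxyDb(db), ...)

    # cache above the proxy: the expensive probably-alive results are cached too
    db = CacheProxyDb(LivingProxyDb(db, ...))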

Maybe there is a compromise...

-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by DS Blank
CCing the whole list, which I assume you probably meant to do, and your reply has some great info (below).

On Wed, May 25, 2016 at 9:44 AM, paul womack <[hidden email]> wrote:
> Doug Blank wrote:
>> Yes, this is the problem. [...] My quick fix (14 lines of code) is to
>> put a cache on the database for this report. [...] I'll work up a PR
>> for further discussion.

> Interesting;
>
> I've been trying to familiarise myself with the Gramps model and codebase.
>
> It's a tough hill to climb (I've not done Python or GTK before), but I
> have been programming for 35 years, including SQL and SQL/free-text
> hybrids. I have developed the following approach to caching in my own
> work.
>
> 1) Code super-clean, with all reads and writes (presumably called with
> some kind of primary-key data) going via a single entry module. Any
> "clever" code that bypasses this API will render all subsequent
> analysis false.

I think we are in pretty good shape to cache this. All items are always looked up with db.get_ITEM_from_handle(HANDLE). That is also the point that proxies override.
 

> 2) This allows a read cache to be layered in trivially, with the cache
> accessed via the same primary-key data as the read. Even a small
> (10-20 slot) cache can have a dramatic effect.

I think that would be true in this case as I see lots of repeated queries, one after another.
 

> 3) Freeing slots for reuse when the cache is full. Unless I have
> overwhelming reasons to the contrary, I use a random number generator
> for this; LRU is better in general, but has a fatal flaw.
>
> Consider an LRU cache used for repeated sequential access to a DB. If
> the cache is 99% of the size of the total data, the hit rate will be
> exactly 0%: each item is evicted just before the scan wraps around to
> it again.
>
> This is shocking. (I remember being almost physically shaken when this
> happened to me.)
>
> It would take a truly extraordinary access sequence for a randomly
> purged cache to fail so badly. I'll take "never being appalling" over
> "sometimes being excellent", given the choice.
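>
> A random-purge cache is also tiny to write; a sketch (slot count and
> names are arbitrary):
>
>     import random
>
>     class RandomEvictCache:
>         def __init__(self, maxsize=20):
>             self.maxsize = maxsize
>             self.data = {}
>
>         def get(self, key, loader):
>             if key not in self.data:
>                 if len(self.data) >= self.maxsize:
>                     # purge a random slot, not the least-recently-used one
>                     self.data.pop(random.choice(list(self.data)))
>                 self.data[key] = loader(key)
>             return self.data[key]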

I saw that you mentioned that earlier, and I was shocked. But it is true.
 

> 4) Keeping the cache accurate under writes:
>
>     (a) Just empty the whole cache. If the ratio of reads to writes is
>     high enough this IS acceptable, and it is easy to code reliably.
>
>     (b) Empty the slot being written to. It'll re-load if needed. In my
>     current, commercial codebase, this is used on only 2 of 31 caches;
>     all the rest use option (a).
>
>     (c) Transfer the written data into the cache. This feels like the
>     high-performance thing to do, but if the read:write ratio is high it
>     doesn't make much difference, and it is harder to code. It might
>     even hurt performance, since (usually) it will purge some read data
>     out of the cache to get its slot.

Yes, this is where it can introduce problems. But we have db.commit_ITEM() where we can void the cache.
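
That is option (b) above; in a cache wrapper it is one line per commit
method (a sketch with hypothetical names):

    def commit_person(self, person, trans):
        self.cache.pop(person.handle, None)  # drop just the slot being written
        self.db.commit_person(person, trans)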
 

> In many cases it is possible to get high-level calling code to use its
> contextual knowledge to optimise accesses, but a low-level cache will
> often provide similar performance gains without mucking up the calling
> code's logic.
>
> Caches are GREAT; IMHO Gramps should have many (instances) of them.
>
> (I noted, having seen references to Gramps using pickled data, that
> there are 72 separate calls to pickle.loads() in the Gramps source,
> which would make it difficult to cache away that work.)

I think putting a cache above those calls will work, but I should look at those 72 calls...

-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

Paul Franklin-5
> I think we are in pretty good shape to cache this. All items are always
> looked up with db.get_ITEM_from_handle(HANDLE). That is also the point that
> proxies override.

I don't claim to be an expert on the DB layer, nor on the proxies, but
I've been looking at the living proxy recently, and I think it overrides
the get-from-gramps-id methods too. I don't think it's just handles.


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 12:01 PM, Paul Franklin <[hidden email]> wrote:
>> I think we are in pretty good shape to cache this. All items are
>> always looked up with db.get_ITEM_from_handle(HANDLE). That is also
>> the point that proxies override.
>
> I don't claim to be an expert on the DB layer, nor on the proxies, but
> I've been looking at the living proxy recently, and I think it overrides
> the get-from-gramps-id methods too. I don't think it's just handles.

Paul, no, it doesn't override just handles. But the handle lookups are
the bulk of the bottleneck.

-Doug

 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by Serge Noiraud-2
On Wed, May 25, 2016 at 7:20 AM, Serge Noiraud <[hidden email]> wrote:
> With BSDDB, creating my narrativeweb takes 2 hours.
> With sqlite3, after 30 hours, it's not finished.

I tried some more experiments with smaller files and different levels of caching.

If anyone wants to try these, here are the databases I'm creating:

./Gramps.py --config=behavior.database-backend:bsddb -C "example.gramps BSDDB" -i example/gramps/example.gramps 
./Gramps.py --config=behavior.database-backend:dbapi -C "example.gramps DBAPI" -i example/gramps/example.gramps 

Here are the results for me on an SSD hard drive (Core i7) with Ubuntu
14.04, generating narrative web pages from the databases above. Options
for navwebpage, from ~/.gramps/report_options.xml:

<module name="navwebpage">
  <option name="ancestortree" value="True"/>
  <option name="archive" value="False"/>
  <option name="birthorder" value="False"/>
  <option name="caluri" value="/WEBCAL"/>
  <option name="citationreferents" value="Outline"/>
  <option name="cmsuri" value="/NAVWEB"/>
  <option name="contactimg" value=""/>
  <option name="contactnote" value=""/>
  <option name="create_thumbs_only" value="False"/>
  <option name="cright" value="1"/>
  <option name="css" value="Mainz"/>
  <option name="dl_descr1" value="Smith Family Tree"/>
  <option name="dl_descr2" value="Johnson Family Tree"/>
  <option name="down_fname1" value="/home/dblank/"/>
  <option name="down_fname2" value="/home/dblank/"/>
  <option name="encoding" value="UTF-8"/>
  <option name="ext" value=".html"/>
  <option name="familymappages" value="False"/>
  <option name="filter" value="0"/>
  <option name="footernote" value=""/>
  <option name="gallery" value="True"/>
  <option name="googleopts" value="FamilyLinks"/>
  <option name="graphgens" value="4"/>
  <option name="headernote" value=""/>
  <option name="homeimg" value=""/>
  <option name="homenote" value=""/>
  <option name="inc_addressbook" value="False"/>
  <option name="inc_events" value="False"/>
  <option name="inc_families" value="False"/>
  <option name="inc_gendex" value="True"/>
  <option name="inc_repository" value="False"/>
  <option name="incdownload" value="True"/>
  <option name="incl_private" value="False"/>
  <option name="introimg" value=""/>
  <option name="intronote" value=""/>
  <option name="linkhome" value="False"/>
  <option name="living_people" value="99"/>
  <option name="mapservice" value="Google"/>
  <option name="maxinitialimageheight" value="600"/>
  <option name="maxinitialimagewidth" value="800"/>
  <option name="name_format" value="1"/>
  <option name="navigation" value="Horizontal"/>
  <option name="nogid" value="False"/>
  <option name="pid" value="I0363"/>
  <option name="placemappages" value="False"/>
  <option name="showbirth" value="True"/>
  <option name="showdeath" value="False"/>
  <option name="showhalfsiblings" value="False"/>
  <option name="showparents" value="False"/>
  <option name="showpartner" value="False"/>
  <option name="target" value="/home/dblank/NAVWEB"/>
  <option name="title" value="My Family Tree"/>
  <option name="unused" value="True"/>
  <option name="usecal" value="False"/>
  <option name="usecms" value="False"/>
  <option name="years_past_death" value="0"/>
  <format name="html"/>
  <style name="default"/>
</module>

BSDDB, no cache: 568 seconds
BSDDB, with proxy LRU(10000) cache: 24 seconds
BSDDB, with proxy LRU(100000) cache: 19 seconds
BSDDB, with proxy full cache: 19 seconds

DBAPI, no cache: 771 seconds
DBAPI, with DB full cache: 143 seconds
DBAPI, with proxy LRU(10000) cache: 27 seconds
DBAPI, with proxy LRU(100000) cache: 22 seconds
DBAPI, with proxy full cache: 21 seconds
DBAPI, with proxy full + DB cache: 17 seconds

This database has:

   Number of citations: 2854
   Number of events: 3416
   Number of families: 738
   Number of media: 7
   Number of notes: 19
   Number of people: 2102
   Number of places: 1288
   Number of repositories: 3
   Number of sources: 4
   Number of tags: 2

I'll make a PR with the CacheProxyDb soon.

I think it makes sense to introduce this LRU CacheProxyDb only in certain
places (e.g., report generation) for now. It can introduce subtle issues:
for example, if you alter the cached version of an object, it will remain
altered.

For Gramps 5.1, we may want to push the caching down to the database
level, which would be beneficial everywhere. But then we'll have to check
for altered caches and verify cache removals, and I think it is too close
to the 5.0 release to begin such a project.

But this shows that even for BSDDB the speed-up is significant, and even more so for DBAPI.

-Doug

 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
On Wed, May 25, 2016 at 12:23 PM, Doug Blank <[hidden email]> wrote:
> I tried some more experiments with smaller files and different levels
> of caching. [...]
>
> I'll make a PR with the CacheProxyDb soon.

Here it is:


-Doug
 


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
BTW,

I found additional problems with navwebpage, and fixed those in this PR
as well. Briefly, the proxies were being applied twice. Both backends
(BSDDB and DBAPI) now run in about 16 seconds with the PR. navwebpage
still has some non-standard code (e.g., calling a constructor in odd
ways), but that will require some refactoring.

Note for anyone using proxies/filters: the order in which filters and
proxies are applied matters. That is, applying private then living is
different from applying living then private.

That is why the exporter has a filter step that allows you to change the
order of the filters and proxies, and to see the matches at each step.

For proper use, we eventually need the order to be specifiable for each
of these reports as well.

-Doug



Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by Serge Noiraud-2
On Wed, May 25, 2016 at 7:20 AM, Serge Noiraud <[hidden email]> wrote:
> With BSDDB, creating my narrativeweb takes 2 hours.
> With sqlite3, after 30 hours, it's not finished.

Thanks, Serge, for testing and reporting this issue. It serves as a nice case study of how different combinations of code and database can produce very different execution times.

To summarize, a PR was written to cache lookups based on handles (and fix some other issue in narrativeweb). Currently, the cache is a narrow fix, just designed to be used on top of proxies and filters in reports.

After the PR was applied, Serge found that the time for running the narrativeweb report on BSDDB went from 2 hours to about 8 minutes, and on DB-API from 30+ hours to about 7 minutes [1]. The root problem was that the LivingProxyDb is extremely database intensive and repetitive; in such a scenario, BSDDB beats DB-API. With the lookups cached, however, DB-API does slightly better.

(I've made a second PR to apply the cache to all reports that use the LivingProxyDb [2]).

Currently the database code is complicated by the fact that BSDDB and DB-API differ in some key places, and those differences make writing some code (such as importing) complex and error prone. If DB-API works well in Gramps 5.0, I think we can simplify, and speed up, the import code. It would be great not to see any operation take hours, days, weeks, or even months.

Regarding caching: we should study this closely for Gramps 5.1. We want to make sure that we don't prevent multi-user use, while still making sure that our codebase works well with the db layer. We might consider no longer computing is_probably_alive and instead making it an attribute of Person. I tried putting a cache directly on DB-API, but it didn't really help and was not multi-user aware [3].

BTW, I'll be largely off the grid until next week. Hopefully some progress can be made towards an alpha release.

-Doug


 


Re: gramps 5.0, sqlite3 and narrativeweb performances

Paul Franklin-5
Thanks for all your hard work, Doug!

However, I noticed (I think) that the new (gen/proxy)
cache imports from the (gui) LRU module, and I think
this goes against our principle of keeping gen
code clear of gui imports.

Maybe the LRU code can be moved from gui to gen,
or somewhere similar?

But I repeat my thanks for your efforts so far!


Re: gramps 5.0, sqlite3 and narrativeweb performances

enno
In reply to this post by DS Blank
On 25-05-16 at 14:12, Doug Blank wrote:
On Wed, May 25, 2016 at 7:50 AM, Enno Borgsteede <[hidden email]> wrote:

http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite

And the primary advice that I read there is to make sure that sqlite3
doesn't write data to disk for every insert. In standard mode, it writes
data to disk for every commit, so if you have a commit for each insert
it is indeed very slow, as I saw in my primitive parser hack.

Note that Gramps does imports as a single Gramps transaction (which is also a single SQL transaction). So that looks like the single best advice for making fast writes.
That looks like it indeed.

I did some more tests today, with a hacked Python GEDCOM parser, published on RootsDev, just to see how fast SQLite can write.

This parser was originally built to read a GEDCOM file line by line, split each line into fields for level, tag, pointer (to ID), and value, and store these tuples in linked hierarchical lists that mirror the file structure. Dictionaries are added for the top-level objects, so that all INDI, FAM, and other top-level tuples can be found by ID, with the lower-level (higher level number) tuples attached in the lists mentioned above.

For testing, I used GEDCOM files generated by Tamura Jones' GedFan program, ranging from 16 to 20 generations, i.e. 64 k to 1 M persons, and half of that for families. The parser can read those very fast, except that it runs out of memory on the 20-generation file, which has more than 7.8 million lines. When I import that, my PC starts swapping by the time the parser has read a little less than 5 million lines, so it doesn't work well for that file. This is on an i7 with 8 GB RAM, in Linux Mint.

To test SQLite's writing speed, I first hacked this parser to write line number and line contents to the database, without any further parsing. With that hack, it's more like a copy program for text files, so it's not very realistic for GEDCOM files. It's just a test for simple inserts.

With this version, running under Python 2 (because the parser was written for that, and I'm lazy), I can 'parse' the whole file in 26 seconds of user time. That's 300,000 lines read and inserted per second.

To make the test a bit more realistic, I reactivated a part of the parsing, so that all lines are split into level, tag, and value, with an optional column for the ID, meaning that instead of writing 2 columns, it now writes 5: the line number as primary key (integer), and level, tag, ID, and value as text. I still left out the building of the linked lists and dictionaries, to keep memory consumption from influencing this speed test. Input lines are still written to a single table, without foreign keys.

In this test, parsing 7.8 million lines took 65 seconds, which still gives a speed of 120,000 lines read and inserted per second. This speed is achieved by using only one commit, issued when the parsing is done.

Because of that single commit, this really is an all or nothing operation. When I press ^C during import, I get a DB file with 1 empty table, and otherwise I get 7,864,343 rows, in 323,849,216 bytes.

Importing the same file into the latest version of RootsMagic, running under Wine, takes about 5 minutes on this hardware, and creates a 491,223,040 byte normalized database.

It would be nice to figure out what speed we can achieve when we import a GEDCOM directly into the database, i.e. without storing more objects than needed in memory.

regards,

Enno



Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by Paul Franklin-5
On Thu, May 26, 2016 at 12:00 PM, Paul Franklin <[hidden email]> wrote:
[...]

You're welcome!

And you are right. Feel free to move LRU to a better, more appropriate place, or I can do it next week when I return.

-Doug


Re: gramps 5.0, sqlite3 and narrativeweb performances

DS Blank
In reply to this post by enno


On Thu, May 26, 2016 at 12:01 PM, Enno Borgsteede <[hidden email]> wrote:
[...]

Enno, very useful baseline info. 

I think our import code can be sped up significantly in two places: the actual importer and parsers (XML and GEDCOM), and the commit_ITEM() methods in the db code. But because of the differences between BSDDB and DB-API, I think this should wait. Gramps 5.1 will be committed to one backend or the other, and, if it is DB-API, then we can do some major refactoring.

-Doug
 





Re: gramps 5.0, sqlite3 and narrativeweb performances

enno
Doug,

I think our import code can be sped up significantly in two places: the actual importer and parsers (XML and GEDCOM), and the commit_ITEM() methods in the db code. But because of the differences between BSDDB and DB-API, I think this should wait. Gramps 5.1 will be committed to one backend or the other, and, if it is DB-API, then we can do some major refactoring.

That's right. I did a profile for master a few days ago, and repeated that for the 3.4.9 that I normally use. For the latter, the top 5 items, ordered by tottime, are:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   196180   33.996    0.000   51.086    0.000 /home/enno/gramps/src/plugins/lib/libgedcom.py:1697(__getitem__)
   579373   33.363    0.000   39.255    0.000 {method 'put' of 'DB' objects}
        2   19.172    9.586   19.254    9.627 {method 'run' of 'gtk.Dialog' objects}
        1   18.322   18.322  332.646  332.646 {gtk._gtk.main}
    67101   15.421    0.000   15.421    0.000 {method 'values' of 'dict' objects}

Numbers 3 and 4, the gtk entries, are irrelevant for the import test, but the other ones are not, and in master on SQLite, __getitem__ gets an even bigger portion of the total time. This is __getitem__ in the IdMapper class, which does lots of dictionary scans. The values() calls of number 5 can also largely be counted here, because 3 out of 4 values() call sites are in that same __getitem__. That's 3 out of 4 in the code, so it doesn't necessarily say much about the percentage of calls actually made from __getitem__.

By adding print statements, I noticed that the IdMapper class builds a huge dictionary of IDs. It looks like the vast majority, or even all, of the GEDCOM IDs end up here during import, for the purpose of adding leading zeros, or of mapping them to unique values in case GEDCOM IDs overlap with the ones in the database. Anyway, if this dictionary works like I think it does, it will eat a lot of RAM when you import a million-person GEDCOM file, and I think it's one of the reasons why the import time seems to grow exponentially with the GEDCOM size.

When I look at the code, I see a couple of while statements and in operators, and I guess that they do a linear search over the dictionary, which, next to the memory consumption itself, contributes to the import time being exponential. IMO, this is likely the place where significant time can be gained, for instance if we use the dictionary in a smarter way and avoid those whiles and ins.

One possible way to avoid these, I hope, is by mapping input IDs to handles instead. Output IDs are not relevant to the linking of objects in the GEDCOM file, so I really don't see why they need to be stored in a dictionary. Lookups are always done by input ID, and should result in handles, not output IDs, which the parser really doesn't need to know, as far as I'm concerned. IDs must be formatted and made unique before new objects are written to the database, but the parser should never need to compare a GEDCOM ID to a Gramps ID in the database, nor access an ID-to-ID mapping.

If this really works (I have to draw this on paper to get full insight), there will still be a dictionary left, registering handles for all input IDs, but I see no reason to use whiles or ins on that, because the only check that needs to be done is whether an input ID already exists in the dictionary or not, as in the sketch below.

Other gains may indeed be found in the commit code, but I agree that those can wait. And my profile of master suggests that their influence may be smaller anyway.

regards,

Enno

