wesley tanaka

Drupal with millions of nodes

Drupal 6 has a few problems with large numbers of nodes.

Slow query problems in search index update (/cron.php)

There are three slow queries in the search index update process that gets run during cron.  Slow query #1, in node_update_index(), picks out the node IDs to (re)index.  Slow query #2, in node_update_index(), recalculates the value of node_cron_comments_scale.  Slow query #3, in search_update_totals(), figures out which rows have been deleted from {search_index} and need to be deleted from {search_total}.

Bug 312393: Slow query in node_update_index()

The Fix (Drupal 6.4):

  1. run this MySQL command (replacing d_ with your own Drupal table prefix): ALTER TABLE d_search_dataset ADD KEY type_reindex (type, reindex);
  2. Download and install the search_manynodes module (.info and .module files) at the bottom of this page
  3. Run this MySQL (replacing d_ with your own Drupal table prefix): INSERT INTO d_search_dataset (sid, type, reindex) SELECT nid AS sid, 'node' AS type, 2147483647 AS reindex FROM d_node;
  4. Download and apply the search_manynodes.patch file at the bottom of this page (also applies to Drupal 6.5)
Description of bug 312393:

The query in node_update_index() that picks the list of nodes to (re)index is slow. It appears to have been introduced in http://drupal.org/node/146466 and takes ~54 seconds to run with ~2.4 million nodes.  It is slow for two reasons:

  1. the ORDER BY d.reindex is not serviced by an index, forcing a filesort
  2. the LEFT JOIN forces a full table scan of the {node} table
mysql> explain SELECT n.nid
FROM d_node n
LEFT JOIN d_search_dataset d ON d.type = 'node'
AND d.sid = n.nid
WHERE d.sid IS NULL OR d.reindex <> 0
ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100 \G

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: index
possible_keys: NULL
key: PRIMARY
key_len: 4
ref: NULL
rows: 2435443
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: d
type: ref
possible_keys: sid_type
key: sid_type
key_len: 55
ref: n.nid,const
rows: 10
Extra: Using where
To get rid of the filesort, the query can be split into two separate queries:
explain SELECT n.nid FROM d_node n
LEFT JOIN d_search_dataset d
ON d.type = 'node' AND d.sid = n.nid
WHERE d.sid IS NULL ORDER BY n.nid ASC
LIMIT 0, 100 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: index
possible_keys: NULL
key: PRIMARY
key_len: 4
ref: NULL
rows: 2435443
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: d
type: ref
possible_keys: sid_type
key: sid_type
key_len: 55
ref: n.nid,const
rows: 10
Extra: Using where; Using index; Not exists

and this one, which requires an additional index:

ALTER TABLE d_search_dataset
ADD KEY type_reindex (type, reindex);
explain SELECT d.sid as nid FROM d_search_dataset d
WHERE d.type = 'node' AND d.reindex <> 0
ORDER BY d.reindex ASC LIMIT 0, 100 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d
type: range
possible_keys: type_reindex
key: type_reindex
key_len: 55
ref: NULL
rows: 841285
Extra: Using where

These two can be combined into the same query (at least in MySQL):

(SELECT n.nid FROM {node} n LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL ORDER BY n.nid ASC LIMIT 0, 100) UNION (SELECT d.sid as nid FROM {search_dataset} d WHERE d.type = 'node' AND d.reindex <> 0 ORDER BY d.reindex ASC LIMIT 0, 100)

In my one test, this reduced the query time from ~54sec to ~33sec

The bulk of the 33 sec is used by the first LEFT JOIN query.  We can remove that too by ensuring that there will never be a case where a node ID exists in {node} and not in {search_dataset}.  See the search_manynodes module at the end of this page for an example of how to do that.  We can then get rid of the LEFT JOIN query entirely, which reduces query time from ~33sec to ~0.04sec.

Bug 336483: Slow query in node_update_index()

The Fix (Drupal 6.6):

  1. ALTER TABLE node_comment_statistics ADD INDEX (comment_count);

Description of Bug 336483:

This is simply a query which does not have an index:

explain MAX(comment_count)
FROM d_node_comment_statistics \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d_node_comment_statistics
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2435440
Extra:
Adding the index as described above causes the query to be optimized away:
explain MAX(comment_count)
FROM d_node_comment_statistics \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away

Which reduces the query time from ~4.77 sec to 0.00 sec

Bug 312390: Slow query in search_update_totals()

The Fix (Drupal 6.4):

  1. Apply 312390-6.4.patch (also applies to Drupal 6.5)
Description of bug:

search_update_totals() keeps the materialized view {search_total} up to date.  It does so in two steps.  Step 1 inserts or updates any total word scores that have changed during the current cron run.  Step 2 looks for rows that should be deleted from {search_total} with this query:

explain SELECT t.word AS realword, i.word
FROM d_search_total t
LEFT JOIN d_search_index i ON t.word = i.word
WHERE i.word IS NULL \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
type: index
possible_keys: NULL
key: PRIMARY
key_len: 152
ref: NULL
rows: 1249382
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: word_sid_type,word
key: word
key_len: 152
ref: t.word
rows: 3
Extra: Using where; Using index; Not exists

which gets expensive as the number of rows in {search_total} grows large.

Bug 312395: Problem in search settings (admin/settings/search)

The fix (Drupal 6.4):

  1. Apply the fix for bug 312393 described above
  2. Apply 312395-6.4.patch from the bottom of this page (also applies to Drupal 6.5)
Description of problem:

EXPLAIN SELECT COUNT(*) FROM d_node n
LEFT JOIN d_search_dataset d
ON d.type = 'node' AND d.sid = n.nid
WHERE n.status = 1
AND (d.sid IS NULL OR d.reindex <> 0) \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: index
possible_keys: node_status_type
key: node_status_type
key_len: 106
ref: NULL
rows: 1826583
Extra: Using where; Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: d
type: ref
possible_keys: sid_type
key: sid_type
key_len: 55
ref: n.nid,const
rows: 10
Extra: Using where

Changing the query to only use the {search_dataset}:

EXPLAIN SELECT COUNT(*) FROM d_search_dataset d
WHERE d.type = 'node' AND d.reindex <> 0 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d
type: range
possible_keys: type_reindex
key: type_reindex
key_len: 55
ref: NULL
rows: 619703
Extra: Using where; Using index

increased the speed from ~45 sec to ~0.91 sec.

Bug 336358: Slow query in /cron.php ping.module ping_cron()

Note: this patch has been applied to Drupal 6 as of November 6, 2009 and should be included in the next (Drupal 6.15) Drupal 6 release.

The fix (Drupal 6.6):

  1. Apply 336358-6.6.patch from the bottom of this page

Description of problem:

The query run in the ping_cron() function does a full table scan on the {node} table:

EXPLAIN SELECT COUNT(*) FROM d_node
WHERE status = 1
AND (created > '1227090316'
OR changed > '1227090316') \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d_node
type: ALL
possible_keys: node_created,node_changed,node_status_type
key: NULL
key_len: NULL
ref: NULL
rows: 1826583
Extra: Using where

Simply splitting the query into two parts lets both individual parts use indexes which already exist:

EXPLAIN SELECT COUNT(*) FROM d_node
WHERE status = 1
AND created > '1227090316' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d_node
type: range
possible_keys: node_created,node_status_type
key: node_created
key_len: 4
ref: NULL
rows: 55
Extra: Using where

EXPLAIN SELECT COUNT(*) FROM d_node
WHERE status = 1
AND changed > '1227090316' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: d_node
type: range
possible_keys: node_changed,node_status_type
key: node_changed
key_len: 4
ref: NULL
rows: 26
Extra: Using where

increased the speed from ~8.99 sec to ~(0.00 + 0.00) sec.

AttachmentSize
search_manynodes.info124 bytes
search_manynodes.module317 bytes
312395-6.4.patch655 bytes
312390-6.4.patch2 KB
search_manynodes.patch840 bytes
336358-6.6.patch796 bytes
312390-5.12.patch2.2 KB

Suggested Links

I am kind of surprised you

I am kind of surprised you run Drupal's standard search with Millions of nodes. On drupal.org we've had to drop this with only 300k nodes. Hod does it scale for you?

we too

We ( NowPublic ) long ago also went over to Sphinx. I would not even try running Drupal core search on our DB, also millions of nodes.

I'm surprised that the

I'm surprised that the reactions from Drupal (core?) developers are "Drupal is not designed to work with many nodes" instead of "great, maybe there's something here which might benefit many users of Drupal".

Like Damien Tournoud that made a patch "won't fix" because "the search module is not designed to scale to millions of nodes" without (I assume) testing the patch at all. A real world case later confirmed that the patch actually helped a Drupal site with only 547 nodes!

millions of users? terms?

Anyone try millions of users? Millions of terms?

I think Wesley is using the core indexing hooks but using Sphinx to actually service search queries.

Millions of nodes

Wesley,

I agree with you that we should be testing Drupal at these levels of objects. One further terrifyingly horrible query is pager_query where we do anything like SELECT count(*) FROM {node} WHERE status = 1 AND promoted = 1; Many will recognize this as a query that appears on the front page of Drupal. I was experiencing 20 second query times with this on a pretty hefty machine. There are definitely issues with millions of nodes, and definitely cases where Drupal must be prepared to handle this type of load. I therefore built a test database with 10,000,000 nodes so that I could investigate these further. Good work! Keep it up!

Millions of nodes = thousands of files

Hi

I am reading this posts with great interests because I think the effects of having millions of nodes is similar to indexing thousands of files.

I am developing a portal for knowledge management in which it is rather common that every post has some files attached to it. I am using core search module and "search files" module in order to index both the content of the post and the file.

I need to search in many files, right now I have uploaded mode that 10.000.

My system was working fine but, since a few weeks ago, the system has started to fail when indexing.

These are some of the details about my database:

• search_dataset: 4610 rows (49,4 Mb table size)
• search_files_files: 10.419 rows (2,2 Mb)
• search_index: 1.652.632 rows (133,5 Mb)
• search_total: 151.193 rows (6,2 Mb)

• search_directories: 1 row
• search_directories_files: 0 row
• search_files_helpers: 4 rows
• search_node_links: 12 rows

When I try to access "search settings" in the admin panel, the system gets very slow ant it returns the following errors:

- Fatal error: Maximum execution time of 60 seconds exceeded in D:\xampp\htdocs\observatorios\includes\database.mysqli.inc on line 108
- Fatal error: Maximum execution time of 60 seconds exceeded in D:\xampp\htdocs\observatorios\includes\session.inc on line 56
- Fatal error: Maximum execution time of 60 seconds exceeded in D:\xampp\htdocs\observatorios\includes\session.inc on line 14

I have tried to increase the maximum execution time but I have not been able to complete the query.

I run the cron only at night to avoid delays for the user. I call the cron.php every half an hour from 0:00 to 7:00 in order to make sure new contents are indexed. I get the following log message: Cron has been running for more than an hour and is most likely stuck.

I understand that the problems (and maybe the solutions) of having millions of nodes or thousands of files are the same and are realted to the search queries you mention in your article. Is this correct?

I read in your blog some solutions for fixing the problems but they are valid for drupal 6.4 or 6.5, but my drupal version is 6.10. Should I try it?

Do you know of better alternatives than core search + files search?

Regards

Are these patches valid for drupal version 6.12?

Hi

I am planning to try these patches in my system to check if they solve the problem?

Does anyone know if these patches are valid for drupal version 6.12?

Best regards

Error when trying to fix the problem (drupal 6.11)

Hi

When I am at step 3 of Bug 312939 I get the following error:

INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node

MySQL ha dicho:

#1062 - Duplicate entry '1-node' for key 1

Any idea on what's wrong

Regards

The {search_dataset} table

The {search_dataset} table is a queue of things to be indexed.  If you're willing to reset the indexing process back to the beginning, you can probably do a DELETE FROM drupal_search_dataset, followed by the above INSERT query.

It's been a while since I looked at the code. Although you should always back up everything before trying instructions from the internet, you should especially do it before trying this.

Excellent tips

Thanks a lot, Wesley. We had problems with the search indexing on our site - and we only have about 20,000 nodes, currently! We've now implemented the fixes. It was great to see them listed out so neatly, with explanations. It really helped a lot (well - so far, so good!). Cheers again, Mani

Syndicate content