There are at least three issues with the catalog URL rewrite indexer and URL rewrite management in Magento:
- A reindex will add a new rows for products with duplicate url_key's - every single time you reindex
- Disabled products generate new rows
- Products that are not-visible individually generate new rows
Core bug
I’m pretty late to the party, but decided to investigate it anyway since I have to deal with many stores whose indexers get stuck because of various SQL errors related to the URL rewrite index and whose tables are way too big and out of proportions for the amount of categories and products they have.
This bug will make your table grow into the millions of rows and hundreds of Megabytes. Sure, you can eliminate all duplicate keys, but what about apparel or fashion stores that rotate products quickly? They often have duplicate URL keys, however some of the products have been been disabled a long time ago. The fact that duplicate URL keys exist isn't bad per se.
There are many suggestions on the internet saying to truncate the core_url_rewrite
table, but that's only a temporary fix - so I propose a couple more long-term approaches.
I guess the best way to fix it is to never have duplicate URL-keys. You can create a module that will prevent you from saving the product if another product exists with the same URL key.
I assume it’s pretty easy to do - change the backend model for the url key attribute: Mage_Catalog_Model_Product_Attribute_Backend_Urlkey
and add a check in the beforeSave method, before calling return parent::beforeSave()
;
Second option is to patch the issue, but it'll require a class rewrite of the indexer model (if you want to keep your store upgradable): https://gist.github.com/edannenberg/5310008
There is also this module which fixes all three issues, but I haven't tried it: http://www.magentocommerce.com/magento-connect/dn-d-patch-index-url-1.html
If the table is already too big
If your table is already too big, see if if it’s actually due to duplicate keys.
You can find out how many duplicate url key's you have with the following query:
SELECT COUNT(DISTINCT entity_id) AS amount, `value`, entity_id
FROM catalog_product_entity_varchar v
WHERE EXISTS (
SELECT *
FROM eav_attribute a
WHERE attribute_code = "url_key"
AND v.attribute_id = a.attribute_id
AND EXISTS (
SELECT *
FROM eav_entity_type e
WHERE entity_type_code = "catalog_product"
AND a.entity_type_id = e.entity_type_id
)
)
GROUP BY v.VALUE
ORDER BY `amount` DESC;
Also, instead of truncating the whole table, you can use the following query to clear out only the unnecessary rewrites (and make sure to create a backup first):
DELETE
FROM core_url_rewrite
WHERE is_system <> 1
AND id_path REGEXP "^[0-9]+_[0-9]+$";
And before doing this, make sure you have a grip on your SEO situation since it might destroy existing links on the web.