Difference between revisions of "Fix duplicate artists"
From DHVLab
Wiki admin (talk | contribs) |
Wiki admin (talk | contribs) |
||
Line 20: | Line 20: | ||
HAVING c > 1 | HAVING c > 1 | ||
ORDER BY c desc;</syntaxhighlight> | ORDER BY c desc;</syntaxhighlight> | ||
− | We get 393 results that need to be fixed. We now | + | We get 393 results that need to be fixed. We now try if ignoring slightly different spellings improve the precision. |
So we use MySQLs [http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex ''SOUNDEX''] function to generate | So we use MySQLs [http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex ''SOUNDEX''] function to generate | ||
the [https://de.wikipedia.org/wiki/Soundex SOUNDEX representations] of the full names and store them in a different column. | the [https://de.wikipedia.org/wiki/Soundex SOUNDEX representations] of the full names and store them in a different column. | ||
Line 54: | Line 54: | ||
....... | ....... | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | + | If you look at the result in detail, you will see that this approach is way too fuzzy however. | |
+ | <syntaxhighlight lang="sql"> | ||
+ | +------------+----------+---------+---------+ | ||
+ | | id | forename | surname | soundex | | ||
+ | +------------+----------+---------+---------+ | ||
+ | | 3114 | George | Inness | G6252 | | ||
+ | | 4483 | Georg | Kunz | G6252 | | ||
+ | | 2000575733 | Georg | Hinz | G6252 | | ||
+ | | 2002703524 | Georg | Muche | G6252 | | ||
+ | | 2006898663 | Georg | Hainz | G6252 | | ||
+ | +------------+----------+---------+---------+ | ||
+ | </syntaxhighlight> | ||
+ | So we stay with exact full name matches. We remove the duplicates and copy the result to a temporary table. | ||
+ | <syntaxhighlight lang="sql"> | ||
+ | CREATE TABLE artist_tmp AS ( | ||
+ | SELECT * | ||
+ | FROM artist | ||
+ | GROUP BY `fullname` | ||
+ | ORDER BY `id` | ||
+ | ); | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="sql"> | ||
+ | UPDATE `artresource` as A SET `artist_id_new`=( | ||
+ | select id | ||
+ | from artist_tmp | ||
+ | where fullname = ( | ||
+ | SELECT fullname | ||
+ | FROM artist | ||
+ | WHERE id = A.artist_id | ||
+ | ) | ||
+ | ) | ||
+ | </syntaxhighlight> |
Revision as of 23:09, 20 May 2016
The ARTigo database was build ignoring duplicate artists,
which is very bad for statistic results.
To get an overview about the state, you can create a new column containing the full name of the artists first.
ALTER TABLE artists
ADD fullname VARCHAR(250)
AFTER surname;
Now update the column as a concatenation of both forename and surname. As CONCAT returns an empty string if any field is empty, we replace empty values by using IFNULL.
UPDATE artists
SET `fullname` = CONCAT(IFNULL(`firstname`,""), " ", IFNULL(`surname`,""))
Let us check how many artists have the exact same name.
SELECT forename, surname, COUNT(*) c
FROM artist
GROUP BY fullname
HAVING c > 1
ORDER BY c desc;
We get 393 results that need to be fixed. We now try if ignoring slightly different spellings improve the precision. So we use MySQLs SOUNDEX function to generate the SOUNDEX representations of the full names and store them in a different column.
ALTER TABLE artists
ADD soundex VARCHAR(35)
AFTER fullname;
UPDATE artists
SET `soundex` = SOUNDEX(fullname);
To get an overview we use
SELECT forename, surname, COUNT(*) AS count
FROM artist
GROUP BY soundex
HAVING count > 1
ORDER BY count desc;
Which results in 1973 entries with duplicate artists with the same name, but different amounts of individual duplicate occurrences. Altogether around 25615 wrong, unnecessary entries. To clarify that further, there is only one entry needed without any name, not 17360 ones.
+--------------------------------+------------------------------------------------+-------+
| forename | surname | c |
+--------------------------------+------------------------------------------------+-------+
| | | 17360 |
| NULL | Rembrandt Harmensz. van Rijn | 1379 |
| NULL | Meissener Porzellan Manufaktur | 1004 |
| NULL | Manufactuur Oud-Loosdrecht | 241 |
| | Unknown | 236 |
| NULL | Woodbury & Page | 200 |
| NULL | Monogrammist CK (1590) | 142 |
| NULL | Desguerrois & Co. | 108 |
| NULL | Meester van Delft | 99 |
.......
If you look at the result in detail, you will see that this approach is way too fuzzy however.
+------------+----------+---------+---------+
| id | forename | surname | soundex |
+------------+----------+---------+---------+
| 3114 | George | Inness | G6252 |
| 4483 | Georg | Kunz | G6252 |
| 2000575733 | Georg | Hinz | G6252 |
| 2002703524 | Georg | Muche | G6252 |
| 2006898663 | Georg | Hainz | G6252 |
+------------+----------+---------+---------+
So we stay with exact full name matches. We remove the duplicates and copy the result to a temporary table.
CREATE TABLE artist_tmp AS (
SELECT *
FROM artist
GROUP BY `fullname`
ORDER BY `id`
);
UPDATE `artresource` as A SET `artist_id_new`=(
select id
from artist_tmp
where fullname = (
SELECT fullname
FROM artist
WHERE id = A.artist_id
)
)