Difference between revisions of "Fix duplicate artists"

From DHVLab

(Created page with "The [http://www.artigo.org ''ARTigo''] database was build ignoring duplicate artists, which is very bad for statistic results.<br/> To get an overview about the state, you can...")
 
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
The [http://www.artigo.org ''ARTigo''] database was build ignoring duplicate artists,
 
The [http://www.artigo.org ''ARTigo''] database was build ignoring duplicate artists,
 
which is very bad for statistic results.<br/>
 
which is very bad for statistic results.<br/>
 +
'''In the following we are going to reduce the ''artist'' database by 60% (38011->15365) and correct 23104 inaccurately attributed artworks.'''
 +
 
To get an overview about the state, you can create a new column containing the full name of the artists first.
 
To get an overview about the state, you can create a new column containing the full name of the artists first.
<syntaxhighlight lang="sql">ALTER TABLE artists ADD fullname VARCHAR(250) AFTER surname;</syntaxhighlight>
+
<syntaxhighlight lang="sql">
 +
ALTER TABLE artists  
 +
ADD fullname VARCHAR(250)  
 +
AFTER surname;
 +
</syntaxhighlight>
 
Now update the column as a concatenation of both forename and surname. As CONCAT returns an empty string if
 
Now update the column as a concatenation of both forename and surname. As CONCAT returns an empty string if
 
any field is empty, we replace empty values by '' using IFNULL.
 
any field is empty, we replace empty values by '' using IFNULL.
<syntaxhighlight lang="sql">UPDATE artists SET `fullname` = CONCAT(IFNULL(`firstname`,""), " ", IFNULL(`surname`,""))</syntaxhighlight>
+
<syntaxhighlight lang="sql">
Now we use MySQLs [http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex ''SOUNDEX''] function to generate
+
UPDATE artists  
 +
SET `fullname` = CONCAT(IFNULL(`firstname`,""), " ", IFNULL(`surname`,""))
 +
</syntaxhighlight>
 +
Let us check how many artists have the exact same name.
 +
<syntaxhighlight lang="sql">
 +
SELECT forename, surname, COUNT(*) c
 +
FROM artist
 +
GROUP BY fullname
 +
HAVING c > 1
 +
ORDER BY c desc;</syntaxhighlight>
 +
We get 393 results that need to be fixed. We now try if ignoring slightly different spellings improve the precision.
 +
So we use MySQLs [http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex ''SOUNDEX''] function to generate
 
the [https://de.wikipedia.org/wiki/Soundex SOUNDEX representations] of the full names and store them in a different column.
 
the [https://de.wikipedia.org/wiki/Soundex SOUNDEX representations] of the full names and store them in a different column.
 
<syntaxhighlight lang="sql">
 
<syntaxhighlight lang="sql">
ALTER TABLE artists ADD soundex VARCHAR(35) AFTER fullname;
+
ALTER TABLE artists  
UPDATE artists SET `soundex` = SOUNDEX(fullname);
+
ADD soundex VARCHAR(35)  
 +
AFTER fullname;
 +
 
 +
UPDATE artists  
 +
SET `soundex` = SOUNDEX(fullname);
 
</syntaxhighlight>
 
</syntaxhighlight>
 
To get an overview we use
 
To get an overview we use
<syntaxhighlight lang="sql">SELECT forename, surname, COUNT(*) c FROM artist GROUP BY soundex HAVING c > 1 ORDER BY c desc;</syntaxhighlight>
+
<syntaxhighlight lang="sql">
 
+
SELECT forename, surname, COUNT(*) AS count
SELECT *, COUNT(*) c FROM artist GROUP BY soundex HAVING c > 1;
+
FROM artist  
 +
GROUP BY soundex  
 +
HAVING count > 1  
 +
ORDER BY count desc;</syntaxhighlight>
 +
Which results in 1973 entries with duplicate artists with the same name, but different amounts of individual duplicate occurrences. Altogether around 25615 wrong, unnecessary entries. To clarify that further, there is only one entry needed without any name, not 17360 ones.
 +
<syntaxhighlight lang="sql">
 +
+--------------------------------+------------------------------------------------+-------+
 +
| forename                      | surname                                        | c    |
 +
+--------------------------------+------------------------------------------------+-------+
 +
|                                |                                                | 17360 |
 +
| NULL                          | Rembrandt Harmensz. van Rijn                  |  1379 |
 +
| NULL                          | Meissener Porzellan Manufaktur                |  1004 |
 +
| NULL                          | Manufactuur Oud-Loosdrecht                    |  241 |
 +
|                                | Unknown                                        |  236 |
 +
| NULL                          | Woodbury & Page                                |  200 |
 +
| NULL                          | Monogrammist CK (1590)                        |  142 |
 +
| NULL                          | Desguerrois & Co.                              |  108 |
 +
| NULL                          | Meester van Delft                              |    99 |
 +
.......
 +
</syntaxhighlight>
 +
If you look at the result in detail, you will see that this approach is way too fuzzy however.
 +
<syntaxhighlight lang="sql">
 +
+------------+----------+---------+---------+
 +
| id        | forename | surname | soundex |
 +
+------------+----------+---------+---------+
 +
|      3114 | George  | Inness  | G6252  |
 +
|      4483 | Georg    | Kunz    | G6252  |
 +
| 2000575733 | Georg    | Hinz    | G6252  |
 +
| 2002703524 | Georg    | Muche  | G6252  |
 +
| 2006898663 | Georg    | Hainz  | G6252  |
 +
+------------+----------+---------+---------+
 +
</syntaxhighlight>
 +
So we stay with exact full name matches. We remove the duplicates and copy the result to a temporary table.
 +
<syntaxhighlight lang="sql">
 +
CREATE TABLE artist_tmp AS (
 +
    SELECT *
 +
    FROM artist
 +
    GROUP BY `fullname`
 +
    ORDER BY `id`
 +
);
 +
</syntaxhighlight>
 +
Now we update the references in the ''artresource'' table so that the artist_id points at the correct, unified artist entry
 +
<syntaxhighlight lang="sql">
 +
UPDATE `artresource` AS ar
 +
SET `artist_id`=(
 +
    SELECT a_t.id
 +
    FROM `artist` AS a
 +
    LEFT JOIN `artist_tmp` AS a_t
 +
    ON a.fullname = a_t.fullname
 +
    WHERE a.id = ar.artist_id
 +
);
 +
</syntaxhighlight>
 +
and replace the artist table.
 +
<syntaxhighlight lang="sql">
 +
DROP TABLE artist;
 +
RENAME TABLE artist_tmp TO artist;
 +
</syntaxhighlight>

Latest revision as of 01:57, 21 May 2016

The ARTigo database was build ignoring duplicate artists, which is very bad for statistic results.
In the following we are going to reduce the artist database by 60% (38011->15365) and correct 23104 inaccurately attributed artworks.

To get an overview about the state, you can create a new column containing the full name of the artists first.

ALTER TABLE artists 
ADD fullname VARCHAR(250) 
AFTER surname;

Now update the column as a concatenation of both forename and surname. As CONCAT returns an empty string if any field is empty, we replace empty values by using IFNULL.

UPDATE artists 
SET `fullname` = CONCAT(IFNULL(`firstname`,""), " ", IFNULL(`surname`,""))

Let us check how many artists have the exact same name.

SELECT forename, surname, COUNT(*) c 
FROM artist 
GROUP BY fullname 
HAVING c > 1 
ORDER BY c desc;

We get 393 results that need to be fixed. We now try if ignoring slightly different spellings improve the precision. So we use MySQLs SOUNDEX function to generate the SOUNDEX representations of the full names and store them in a different column.

ALTER TABLE artists 
ADD soundex VARCHAR(35) 
AFTER fullname;

UPDATE artists 
SET `soundex` = SOUNDEX(fullname);

To get an overview we use

SELECT forename, surname, COUNT(*) AS count 
FROM artist 
GROUP BY soundex 
HAVING count > 1 
ORDER BY count desc;

Which results in 1973 entries with duplicate artists with the same name, but different amounts of individual duplicate occurrences. Altogether around 25615 wrong, unnecessary entries. To clarify that further, there is only one entry needed without any name, not 17360 ones.

+--------------------------------+------------------------------------------------+-------+
| forename                       | surname                                        | c     |
+--------------------------------+------------------------------------------------+-------+
|                                |                                                | 17360 |
| NULL                           | Rembrandt Harmensz. van Rijn                   |  1379 |
| NULL                           | Meissener Porzellan Manufaktur                 |  1004 |
| NULL                           | Manufactuur Oud-Loosdrecht                     |   241 |
|                                | Unknown                                        |   236 |
| NULL                           | Woodbury & Page                                |   200 |
| NULL                           | Monogrammist CK (1590)                         |   142 |
| NULL                           | Desguerrois & Co.                              |   108 |
| NULL                           | Meester van Delft                              |    99 |
.......

If you look at the result in detail, you will see that this approach is way too fuzzy however.

+------------+----------+---------+---------+
| id         | forename | surname | soundex |
+------------+----------+---------+---------+
|       3114 | George   | Inness  | G6252   |
|       4483 | Georg    | Kunz    | G6252   |
| 2000575733 | Georg    | Hinz    | G6252   |
| 2002703524 | Georg    | Muche   | G6252   |
| 2006898663 | Georg    | Hainz   | G6252   |
+------------+----------+---------+---------+

So we stay with exact full name matches. We remove the duplicates and copy the result to a temporary table.

CREATE TABLE artist_tmp AS (
    SELECT *
    FROM artist
    GROUP BY `fullname`
    ORDER BY `id`
);

Now we update the references in the artresource table so that the artist_id points at the correct, unified artist entry

UPDATE `artresource` AS ar 
SET `artist_id`=(
    SELECT a_t.id 
    FROM `artist` AS a 
    LEFT JOIN `artist_tmp` AS a_t
    ON a.fullname = a_t.fullname
    WHERE a.id = ar.artist_id
);

and replace the artist table.

DROP TABLE artist;
RENAME TABLE artist_tmp TO artist;