cross-posted from: https://lemmy.cafe/post/1403198
> #### Overview
>
> This is a quick write-up of what I spent a few weeks trying to work out.
>
> The adventure happened at the beginning of October, so don't blindly
> copy-paste queries without making absolutely sure you're deleting the right
> stuff. Use `select` generously.
>
> When connected to the DB, run `\timing`. It prints the time taken to execute
> every query - a really nice way to get a sense of which operations take
> longer.
>
> I've had duplicates in `instance`, `person`, `site`, `community`, `post` and
> `received_activity`.
>
> The quick gist of this is the following:
>
> - Clean up
> - Reindex
> - Full vacuum
>
> I am now certain vacuuming is not, strictly speaking, necessary, but it makes
> me feel better to have all the steps I took written down.
>
> `\d` - list tables (look at it as `describe database`);
>
> `\d tablename` - describe table.
>
> `\o filename` - save all output to a file on the filesystem. `/tmp/query.sql` was my choice.
>
> ___
>
> #### `instance`
>
> You need to turn `indexscan` and `bitmapscan` off to actually see the
> duplicates - otherwise the query gets answered from the very index that's
> broken, which hides them:
> ``` sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
> ```
>
> The following selects the dupes:
> ``` sql
> SELECT
> id,
> domain,
> published,
> updated
> FROM instance
> WHERE
> domain IN (
> SELECT
> domain
> FROM
> instance
> GROUP BY domain
> HAVING COUNT(*) > 1
> )
> ORDER BY domain;
> ```
>
> Deleting without using the index is incredibly slow - turn it back on:
> ``` sql
> SET enable_indexscan = on;
> SET enable_bitmapscan = on;
> ```
>
> ``` sql
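> -- fill in the id of the duplicate row you want to remove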
> DELETE FROM instance WHERE id = ;
> ```
>
> Yes, you can build a fancier query to delete all the older/newer IDs at
> once. No, I do not recommend it. Delete one, confirm, repeat.
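>
> For reference, the fancier variant would look roughly like the sketch below -
> it keeps the lowest `id` per `domain` and removes the rest in one statement.
> I didn't run it this way, so treat it as a sketch and `select` first:
>
> ``` sql
> -- Hypothetical one-shot variant: removes every row whose domain also
> -- exists under a lower id, i.e. keeps the oldest duplicate.
> DELETE FROM instance a
> USING instance b
> WHERE a.domain = b.domain
> AND a.id > b.id;
> ```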
>
> At first I was deleting the newer IDs; then, after noticing the same instances
> were still getting new IDs, I swapped to targeting the old ones. After noticing
> *the same god damn instances* **still** getting new duplicate IDs, I had to dig
> deeper and, by some sheer luck discovered that I need to `reindex` the database
> to bring it back to sanity.
>
> `Reindexing` the database takes a *very* long time - don't do that. Instead
> target the table - that should not take more than a few minutes. This, of
> course, all depends on the size of the table, but `instance` is naturally going
> to be small.
>
> ``` sql
> REINDEX TABLE instance;
> ```
>
> If `reindexing` succeeds - you have cleaned up the table. If not - it will
> yell at you with the first duplicate it fails on. Rinse and repeat until it's
> happy.
>
> Side note - it is *probably* enough to only `reindex` the index that's failing,
> but at this point I wanted to ensure *at least* the whole table is in a good
> state.
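>
> If you'd rather try the narrower option first, `reindex` just the failing
> index - the name below is made up, use whatever the error message gives you:
>
> ``` sql
> REINDEX INDEX instance_domain_key;
> ```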
>
> ___
>
> Looking back - if I could redo it - I would delete the new IDs only, keeping the
> old ones. I have no evidence, but I think getting rid of the old IDs introduced
> more duplicates in other related tables down the line. At the time, of course,
> it was hard to tell WTF was going on and making a *wrong* decision was better
> than making *no* decision.
> ___
>
> #### `person`
>
> The idea is the same for all the tables with duplicates; however, I had to
> modify the queries a bit due to small differences.
>
> What I did at first, and you **shouldn't** do:
>
> ```sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> DELETE FROM person
> WHERE
> id IN (
> SELECT id
> FROM (
> SELECT id, ROW_NUMBER() OVER (PARTITION BY actor_id ORDER BY id)
> AS row_num
> FROM person) t
> WHERE t.row_num > 1 LIMIT 1);
> ```
>
> The issue with the above is that it, again, runs a `delete` without using the
> index. It is horrible, it is sad, it takes forever. Don't do this. Instead,
> split it into a `select` without the index and a `delete` with the index:
>
> ```sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> SELECT
> a.id, a.actor_id, a.name
> FROM person a
> JOIN person b ON a.actor_id = b.actor_id
> WHERE
> a.id > b.id;
> ```
>
> ``` sql
> SET enable_indexscan = on;
> SET enable_bitmapscan = on;
>
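> -- again, fill in one duplicate id at a time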
> DELETE FROM person WHERE id = ;
> ```
>
> `person` had dupes *into the thousands* - I just didn't have enough time at that
> moment and started deleting them in batches:
>
> ``` sql
> DELETE FROM person WHERE id IN (1, 2, 3, ... 99);
> ```
>
> Again - yes, it can probably all be done in one go. I didn't do it that way,
> so I'm not writing it down that way. This is where I used `\o` to dump the IDs
> to a file and then split them into batches using coreutils. You can do that,
> or you can make the database do it for you (a sketch of the latter is below) -
> I'm a better shell user than an SQL user.
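>
> For the curious, a sketch of the "make the database do it" route follows. It
> builds ready-to-run `DELETE` statements in batches of 100 ids (the batch size
> is arbitrary): run it with the index scans still off, `\o` the output to a
> file (`\t` helps to keep psql headers out of it), sanity-check it, then feed
> it back in.
>
> ``` sql
> -- Sketch only: generates one DELETE statement per 100 duplicate ids,
> -- keeping the lowest id per actor_id.
> SELECT
> 'DELETE FROM person WHERE id IN (' || string_agg(id::text, ', ') || ');'
> FROM (
> SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
> FROM (
> SELECT DISTINCT a.id
> FROM person a
> JOIN person b ON a.actor_id = b.actor_id
> WHERE a.id > b.id
> ) dupes
> ) t
> GROUP BY (rn - 1) / 100;
> ```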
>
> `Reindex` the table and we're good to go!
>
> ``` sql
> REINDEX TABLE person;
> ```
>
> ___
>
> #### `site`, `community` and `post`
>
> Rinse and repeat, really. `\d tablename`, figure out which column is the one to
> use when looking for duplicates and `delete-reindex-move on`.
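>
> As an illustration, the duplicate `select` for `community` might look like
> this - assuming `actor_id` is the column that should be unique there, as it is
> for `person` (check with `\d community` first):
>
> ``` sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> SELECT a.id, a.actor_id, a.name
> FROM community a
> JOIN community b ON a.actor_id = b.actor_id
> WHERE a.id > b.id;
> ```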
>
> ___
>
> #### `received_activity`
>
> This one deserves a special mention, as it had *64 million rows* in the database
> when I was looking at it. Scanning such a table takes *forever* and, upon closer
> inspection, I realised there's nothing useful in it. It is, essentially, a log
> file. I don't like useless shit in my database, so instead of trying to find the
> duplicates, I decided to simply wipe most of it in hopes the dupes would go with
> it. I did it in 1 million increments, which took ~30 seconds per run on the
> single-threaded 2GB RAM VM the database runs on. The reason for batching was
> to keep the site running - the `lemmy` backend starts timing out otherwise,
> and that's not great.
>
> Before deleting anything, though, have a look at how much storage your tables
> are taking up:
>
> ``` sql
> SELECT
> nspname AS "schema",
> pg_class.relname AS "table",
> pg_size_pretty(pg_total_relation_size(pg_class.oid)) AS "total_size",
> pg_size_pretty(pg_relation_size(pg_class.oid)) AS "data_size",
> pg_size_pretty(pg_indexes_size(pg_class.oid)) AS "index_size",
> pg_stat_user_tables.n_live_tup AS "rows",
> pg_size_pretty(
> pg_total_relation_size(pg_class.oid) /
> (pg_stat_user_tables.n_live_tup + 1)
> ) AS "total_row_size",
> pg_size_pretty(
> pg_relation_size(pg_class.oid) /
> (pg_stat_user_tables.n_live_tup + 1)
> ) AS "row_size"
> FROM
> pg_stat_user_tables
> JOIN
> pg_class
> ON
> pg_stat_user_tables.relid = pg_class.oid
> JOIN
> pg_catalog.pg_namespace AS ns
> ON
> pg_class.relnamespace = ns.oid
> ORDER BY
> pg_total_relation_size(pg_class.oid) DESC;
> ```
>
> Get the number of rows:
>
> ```sql
> SELECT COUNT(*) FROM received_activity;
> ```
>
> Delete the rows at your own pace. You can start with a small number to get the
> idea of how long it takes (remember `\timing`? ;) ).
>
> ``` sql
> DELETE FROM received_activity WHERE id < 1000000;
> ```
>
> **Attention!** Do let the `autovacuum` finish after every delete query.
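>
> One way to see whether an autovacuum worker is still chewing on the table -
> autovacuum shows up in `pg_stat_activity`:
>
> ``` sql
> SELECT pid, state, query
> FROM pg_stat_activity
> WHERE query LIKE 'autovacuum:%';
> ```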
>
> I ended up leaving ~3 million rows, which at the time represented ~ 3 days of
> federation. I chose 3 days as that is the timeout before an instance is marked
> as dead if no activity comes from it.
>
> Now it's time to `reindex` the table:
>
> ``` sql
> REINDEX TABLE received_activity;
> ```
>
> Remember the reported size of the table? If you check your system, nothing
> will have changed - that is because postgres *does not release* freed-up
> storage back to the operating system. It makes sense under normal
> circumstances, but this situation is anything but.
>
> Clean all the things!
>
> ```sql
> VACUUM FULL received_activity;
> ```
>
> *Now* you have reclaimed all that wasted storage to be put to better use.
>
> In my case, the *database* (not the table) shrunk by ~52%!
>
> ___
>
> I am now running a cronjob that deletes rows from `received_activity` that are
> older than 3 days:
>
> ``` sql
> DELETE FROM
> received_activity
> WHERE
> published < NOW() - INTERVAL '3 days';
> ```
>
> In case you're wondering if it's safe deleting such logs from the database -
> Lemmy developers seem to agree
> [here](https://github.com/LemmyNet/lemmy/pull/4130) and
> [here](https://github.com/LemmyNet/lemmy/pull/4131).
If the whole table is truncated - then potentially yes, if said activity gets retransmitted. But retransmission is only done to active nodes, and they get marked inactive if unresponsive for 3 days.
So realistically - no.