cross-posted from: https://lemmy.cafe/post/1403198
> #### Overview
>
> This is a quick write-up of what I spent a few weeks trying to work out.
>
> The adventure happened at the beginning of October, so don't blindly
> copy-paste queries without making absolutely sure you're deleting the right
> stuff. Use `select` generously.
>
> When connected to the DB, run `\timing`. It prints the time taken to execute
> every query - a really nice way to get a sense of which operations take
> longer.
>
> I've had duplicates in `instance`, `person`, `site`, `community`, `post` and
> `received_activity`.
>
> The quick gist of this is the following:
>
> - Clean up
> - Reindex
> - Full vacuum
>
> I am now certain vacuuming is not, strictly speaking, necessary, but it makes
> me feel better to have all the steps I took written down.
>
> `\d` - list tables (look at it as `describe database`);
>
> `\d tablename` - describe table.
>
> `\o filename` - save all output to a file on the filesystem. `/tmp/query.sql` was my choice.
>
> ___
>
> #### `instance`
>
> You need to turn `indexscan` and `bitmapscan` off to actually see the
> duplicates - otherwise the query gets answered from the very index that's
> broken, which hides them:
> ``` sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
> ```
>
> The following selects the dupes:
> ``` sql
> SELECT
> id,
> domain,
> published,
> updated
> FROM instance
> WHERE
> domain IN (
> SELECT
> domain
> FROM
> instance
> GROUP BY domain
> HAVING COUNT(*) > 1
> )
> ORDER BY domain;
> ```
>
> Deleting without using the index is incredibly slow - turn it back on:
> ``` sql
> SET enable_indexscan = on;
> SET enable_bitmapscan = on;
> ```
>
> ``` sql
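> -- fill in the id of the duplicate row you want to remove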
> DELETE FROM instance WHERE id = ;
> ```
>
> Yes, you can build a fancier query to delete all the older/newer IDs at
> once. No, I do not recommend it. Delete one, confirm, repeat.
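>
> For reference, the fancier variant would look roughly like the sketch below -
> it keeps the lowest `id` per `domain` and removes the rest in one statement.
> I didn't run it this way, so treat it as a sketch and `select` first:
>
> ``` sql
> -- Hypothetical one-shot variant: removes every row whose domain also
> -- exists under a lower id, i.e. keeps the oldest duplicate.
> DELETE FROM instance a
> USING instance b
> WHERE a.domain = b.domain
> AND a.id > b.id;
> ```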
>
> At first I was deleting the newer IDs; then, after noticing the same instances
> were still getting new IDs, I swapped to targeting the old ones. After noticing
> *the same god damn instances* **still** getting new duplicate IDs, I had to dig
> deeper and, by some sheer luck discovered that I need to `reindex` the database
> to bring it back to sanity.
>
> `Reindexing` the database takes a *very* long time - don't do that. Instead
> target the table - that should not take more than a few minutes. This, of
> course, all depends on the size of the table, but `instance` is naturally going
> to be small.
>
> ``` sql
> REINDEX TABLE instance;
> ```
>
> If `reindexing` succeeds - you have cleaned up the table. If not - it will
> yell at you with the first duplicate it fails on. Rinse and repeat until it's
> happy.
>
> Side note - it is *probably* enough to only `reindex` the index that's failing,
> but at this point I wanted to ensure *at least* the whole table is in a good
> state.
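>
> If you'd rather try the narrower option first, `reindex` just the failing
> index - the name below is made up, use whatever the error message gives you:
>
> ``` sql
> REINDEX INDEX instance_domain_key;
> ```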
>
> ___
>
> Looking back - if I could redo it - I would delete the new IDs only, keeping the
> old ones. I have no evidence, but I think getting rid of the old IDs introduced
> more duplicates in other related tables down the line. At the time, of course,
> it was hard to tell WTF was going on and making a *wrong* decision was better
> than making *no* decision.
> ___
>
> #### `person`
>
> The idea is the same for all the tables with duplicates; however, I had to
> modify the queries a bit due to small differences.
>
> What I did at first, and you **shouldn't** do:
>
> ```sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> DELETE FROM person
> WHERE
> id IN (
> SELECT id
> FROM (
> SELECT id, ROW_NUMBER() OVER (PARTITION BY actor_id ORDER BY id)
> AS row_num
> FROM person) t
> WHERE t.row_num > 1 LIMIT 1);
> ```
>
> The issue with the above is that it, again, runs a `delete` without using the
> index. It is horrible, it is sad, it takes forever. Don't do this. Instead,
> split it into a `select` without the index and a `delete` with the index:
>
> ```sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> SELECT
> a.id, a.actor_id, a.name
> FROM person a
> JOIN person b ON a.actor_id = b.actor_id
> WHERE
> a.id > b.id;
> ```
>
> ``` sql
> SET enable_indexscan = on;
> SET enable_bitmapscan = on;
>
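> -- again, fill in one duplicate id at a time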
> DELETE FROM person WHERE id = ;
> ```
>
> `person` had dupes *into the thousands* - I just didn't have enough time at that
> moment and started deleting them in batches:
>
> ``` sql
> DELETE FROM person WHERE id IN (1, 2, 3, ... 99);
> ```
>
> Again - yes, it can probably all be done in one go. I didn't do it that way,
> so I'm not writing it down that way. This is where I used `\o` to dump the IDs
> to a file and then split them into batches using coreutils. You can do that,
> or you can make the database do it for you (a sketch of the latter is below) -
> I'm a better shell user than an SQL user.
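>
> For the curious, a sketch of the "make the database do it" route follows. It
> builds ready-to-run `DELETE` statements in batches of 100 ids (the batch size
> is arbitrary): run it with the index scans still off, `\o` the output to a
> file (`\t` helps to keep psql headers out of it), sanity-check it, then feed
> it back in.
>
> ``` sql
> -- Sketch only: generates one DELETE statement per 100 duplicate ids,
> -- keeping the lowest id per actor_id.
> SELECT
> 'DELETE FROM person WHERE id IN (' || string_agg(id::text, ', ') || ');'
> FROM (
> SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
> FROM (
> SELECT DISTINCT a.id
> FROM person a
> JOIN person b ON a.actor_id = b.actor_id
> WHERE a.id > b.id
> ) dupes
> ) t
> GROUP BY (rn - 1) / 100;
> ```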
>
> `Reindex` the table and we're good to go!
>
> ``` sql
> REINDEX TABLE person;
> ```
>
> ___
>
> #### `site`, `community` and `post`
>
> Rinse and repeat, really. `\d tablename`, figure out which column is the one to
> use when looking for duplicates and `delete-reindex-move on`.
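>
> As an illustration, the duplicate `select` for `community` might look like
> this - assuming `actor_id` is the column that should be unique there, as it is
> for `person` (check with `\d community` first):
>
> ``` sql
> SET enable_indexscan = off;
> SET enable_bitmapscan = off;
>
> SELECT a.id, a.actor_id, a.name
> FROM community a
> JOIN community b ON a.actor_id = b.actor_id
> WHERE a.id > b.id;
> ```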
>
> ___
>
> #### `received_activity`
>
> This one deserves a special mention, as it had *64 million rows* in the database
> when I was looking at it. Scanning such a table takes *forever* and, upon closer
> inspection, I realised there's nothing useful in it. It is, essentially, a log
> file. I don't like useless shit in my database, so instead of trying to find the
> duplicates, I decided to simply wipe most of it in hopes the dupes would go with
> it. I did it in 1 million increments, which took ~30 seconds per run on the
> single-threaded 2GB RAM VM the database runs on. The reason for batching was
> to keep the site running - the `lemmy` backend starts timing out otherwise,
> and that's not great.
>
> Before deleting anything, though, have a look at how much storage your tables
> are taking up:
>
> ``` sql
> SELECT
> nspname AS "schema",
> pg_class.relname AS "table",
> pg_size_pretty(pg_total_relation_size(pg_class.oid)) AS "total_size",
> pg_size_pretty(pg_relation_size(pg_class.oid)) AS "data_size",
> pg_size_pretty(pg_indexes_size(pg_class.oid)) AS "index_size",
> pg_stat_user_tables.n_live_tup AS "rows",
> pg_size_pretty(
> pg_total_relation_size(pg_class.oid) /
> (pg_stat_user_tables.n_live_tup + 1)
> ) AS "total_row_size",
> pg_size_pretty(
> pg_relation_size(pg_class.oid) /
> (pg_stat_user_tables.n_live_tup + 1)
> ) AS "row_size"
> FROM
> pg_stat_user_tables
> JOIN
> pg_class
> ON
> pg_stat_user_tables.relid = pg_class.oid
> JOIN
> pg_catalog.pg_namespace AS ns
> ON
> pg_class.relnamespace = ns.oid
> ORDER BY
> pg_total_relation_size(pg_class.oid) DESC;
> ```
>
> Get the number of rows:
>
> ```sql
> SELECT COUNT(*) FROM received_activity;
> ```
>
> Delete the rows at your own pace. You can start with a small number to get the
> idea of how long it takes (remember `\timing`? ;) ).
>
> ``` sql
> DELETE FROM received_activity WHERE id < 1000000;
> ```
>
> **Attention!** Do let the `autovacuum` finish after every delete query.
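>
> One way to see whether an autovacuum worker is still chewing on the table -
> autovacuum shows up in `pg_stat_activity`:
>
> ``` sql
> SELECT pid, state, query
> FROM pg_stat_activity
> WHERE query LIKE 'autovacuum:%';
> ```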
>
> I ended up leaving ~3 million rows, which at the time represented ~ 3 days of
> federation. I chose 3 days as that is the timeout before an instance is marked
> as dead if no activity comes from it.
>
> Now it's time to `reindex` the table:
>
> ``` sql
> REINDEX TABLE received_activity;
> ```
>
> Remember the reported size of the table? If you check your system, nothing
> will have changed - that is because postgres *does not release* freed-up
> storage back to the operating system. It makes sense under normal
> circumstances, but this situation is anything but.
>
> Clean all the things!
>
> ```sql
> VACUUM FULL received_activity;
> ```
>
> *Now* you have reclaimed all that wasted storage to be put to better use.
>
> In my case, the *database* (not the table) shrunk by ~52%!
>
> ___
>
> I am now running a cronjob that deletes rows from `received_activity` that are
> older than 3 days:
>
> ``` sql
> DELETE FROM
> received_activity
> WHERE
> published < NOW() - INTERVAL '3 days';
> ```
>
> In case you're wondering if it's safe deleting such logs from the database -
> Lemmy developers seem to agree
> [here](https://github.com/LemmyNet/lemmy/pull/4130) and
> [here](https://github.com/LemmyNet/lemmy/pull/4131).
If the whole table is truncated - then potentially yes, if said activity gets retransmitted. But retransmission is only done to active nodes, and they get marked inactive if unresponsive for 3 days.
So realistically - no.