cross-posted from: https://lemmy.ml/post/34374544

Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

  • Bjarne@feddit.org
    8 days ago

    Do they actually respect that? Did you see the requests going away or getting stuck in redirects? I always expected them to switch to a generic user agent if that happens. I mean, they are arguably already disregarding copyright. Why should they adhere to a standard?

    • Gravitywell@sh.itjust.works
      8 days ago

      They mainly self-identify; it was super obvious when they started showing up in the logs. Even without the user agents to ID them, the volume of requests makes it clear that it's clanker behavior.

      I’ve been meaning to set up a tar pit, but for now I just have nginx set up to redirect them, and if they still keep trying, fail2ban kicks in and blocks them by IP.

      It doesn’t matter if they respect it or not, iptables doesn’t give a fuck.
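
For anyone who wants to check their own logs first, here is a rough sketch of the two signals described above: a self-identifying user agent, and sheer request volume. It is not the commenter's actual setup. It assumes nginx's default "combined" access log format; the function name `suspicious_ips`, the `max_requests` threshold, and the user-agent substrings are illustrative choices, so swap in whatever actually shows up in your logs. It also counts over the whole input rather than per time window, which a real fail2ban rule would do.

```python
import re
from collections import Counter

# Example substrings only (an assumption); adjust to the agents seen in your own logs.
AI_AGENT_HINTS = ("meta-externalagent", "facebookbot", "gptbot", "claudebot")

# Matches nginx's default "combined" log format and captures client IP and user agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def suspicious_ips(lines, max_requests=100):
    """Return IPs that self-identify as a crawler or exceed max_requests hits."""
    hits = Counter()
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip = match.group("ip")
        hits[ip] += 1
        agent = match.group("agent").lower()
        if any(hint in agent for hint in AI_AGENT_HINTS):
            flagged.add(ip)
    # Volume check: catches clients that hide behind a generic user agent.
    flagged.update(ip for ip, count in hits.items() if count > max_requests)
    return flagged
```

Feed it the lines of an access log (for example `suspicious_ips(open(path))`, where `path` is wherever your log lives) and pass the result to whatever does the blocking.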