Tell NYT, Atlantic, USA Today to keep Wayback Machine

(savethearchive.com)

172 points | by doener 3 hours ago

14 comments

  • ctippett 1 hour ago
    Am I correct that this has come about because archive.org respects robots.txt and these sites have blocked their crawler from indexing their sites?

    I'm not sure how to articulate my thoughts on this exactly, other than to say it's disappointing that doing the right thing (i.e. respecting robots.txt) is rewarded with the burden of soliciting responses to a petition while at the same time others are rewarded with profit for ignoring those same directives.

    • joecool1029 12 minutes ago
      No, archive.org does NOT respect robots.txt. You need to reach out to them directly and ask that your site not be included: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
    • Paracompact 1 hour ago
      Don't know if it helps your musings at all, but there's a good chance that if a high-profile crawler like archive.org disrespected their robots.txt, archive.org would be faced with lawsuits (or some other form of pressure). This is not merely the most moral move; rather it is the only sensible move.

      The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating.

      • GolfPopper 21 minutes ago
        >pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating

        I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flout the rules.

        "Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...

    • cmeacham98 1 hour ago
      Correct. Example snippet from the nytimes.com robots.txt:

          User-agent: archive.org_bot
          Disallow: /
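
For anyone who wants to see how a compliant crawler would read that snippet, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `SomeOtherBot` user agent is a hypothetical name added for contrast, and the rules are only the two lines quoted above, not the full nytimes.com file. It shows what the directive asks for, not whether any particular crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Only the two lines quoted in the comment above, not the full robots.txt.
rules = """\
User-agent: archive.org_bot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(rules.splitlines())

# The named crawler is asked to stay away from every path.
archive_allowed = parser.can_fetch("archive.org_bot", "https://www.nytimes.com/")

# "SomeOtherBot" is a hypothetical agent with no matching rule in this snippet,
# so the parser falls back to allowing it.
other_allowed = parser.can_fetch("SomeOtherBot", "https://www.nytimes.com/")

print("archive.org_bot allowed:", archive_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

The parser only reports what the file requests; compliance is voluntary on the crawler's side.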
      • joecool1029 10 minutes ago
        Which they don’t respect. I’ve had it on my blog for years and they still added it to the Wayback Machine. See my last comment for their official announcement of the ignore-robots.txt policy; it is not new.
    • Gigachad 1 hour ago
      It's because they want to restrict AI companies from stealing content, but they can't do it if internet archive proxies it all for them.

      All of the LLMs would be massively less useful if it wasn't for scraping the latest news.

      • stephen_g 53 minutes ago
        LLMs have other ways of accessing the content, they don’t need the Web Archive.

        Every LLM company can afford to spin up a new subscriber account every day, proxying to appear as different IPs from all sorts of ASNs, do some crawling until the account gets banned, and then do it again, and again, and again.

        • overfeed 19 minutes ago
          > LLMs have other ways of accessing the content, they don’t need the Web Archive.

          What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.

          Locking a door (or robots.txt) is how one can establish mens rea for those who bypass them.

  • JustinGoldberg9 1 minute ago
    Need a cryptographically verifiable internet archive. This is probably not possible without something like web 3 or nostr or gpg pgp. Idk.
  • ajaimk 48 minutes ago
    Idea: allow scraping but can’t publish for 1 year?
  • someperson 3 hours ago
    Maybe they should have an escrow period, the way the Financial Times is available on the NewsBank service with a 30-day escrow.
  • Cider9986 1 hour ago
    I am looking forward to this (https://news.ycombinator.com/item?id=48070516)
  • WarmWash 50 minutes ago
    A bunch of people who haven't ever loaded an ad or paid a subscription to those organizations are going to make a stand to demand they leave their backdoor open?
  • eranation 13 minutes ago
    I signed, but let’s be honest.

    A pie chart showing the times I used the wayback machine to read an old NYT article vs the times I visited it due to a highly upvoted top HN comment linking to a relatively new article so we all can bypass the paywall is a solid circle.

  • JumpCrisscross 1 hour ago
    I know a little about this debate on the Times and Atlantic sides. I’ll get some grief for this, but I asked a senior person at the former what they thought about the paywall workarounds that are frequent on HN—I was genuinely shocked to learn they hadn’t heard about it.

    In the end, we settled on agreeing that making such stuff available after 30 days, and possibly with access restrictions (can’t be pulled more than N times a day, in case it becomes relevant in the future) struck the right balance.

    To my knowledge, the Internet Archive hasn’t done any outreach on this issue. In addition to pressuring the publications, I’d put some pressure on them to negotiate.

    • themafia 42 minutes ago
      > can’t be pulled more than N times a day, in case it becomes relevant in the future

      In case it "becomes relevant." Wouldn't that benefit you either way? It makes you wonder if they have a dashboard of unfortunate digital statistics on display somewhere and worship of these numbers has replaced the underlying spirit of journalism.

  • WarmWash 55 minutes ago
    Can we just go back to ads and normalize blocking people who ad-block?

    I'm grown up now, I understand how things work, and I'd rather see Tide and Coke ads than pay $20/mo to 8 different orgs, while maintaining that ad free option for those who want it.

    The children of the internet probably won't sign a truce, so let's just cut them out and let intellectually honest people have a decent internet.

    • elashri 1 minute ago
      > Can we just go back to ads and normalize blocking people who ad-block?

      Nope, two problems

      1- Ads are a privacy issue, not only a convenience issue. Targeted ads should not be normalized.

      2- Companies figured out that even paying doesn't mean you don't get ads. You are probably a bigger target, with more disposable income than average, in that case.

    • shimman 44 minutes ago
      How about we go back to the era of humanity where modern marketing didn't exist?

      How much faster would consumer software be if adware was made illegal? How much faster would our devices be if we didn't have half the code base supporting malware?

      Acting like an ad enabled internet was the only option is extremely foolish, especially when the ad enabled internet was fully chosen and pushed onto the public by very specific people (thanks Newt Gingrich!).

      • kmoser 18 minutes ago
        > How about we go back to the era of humanity where modern marketing didn't exist?

        That era vastly predates the Internet, let alone the (relatively) ad-free pre-1980s Internet, neither of which we can return to in any meaningful fashion.

    • GolfPopper 16 minutes ago
      >cut them out and let intellectually honest people have a decent internet.

      Ah, so, take the money out of it completely? No subscriptions, and no ads? Sounds like a good idea to me.

    • goosejuice 37 minutes ago
      I've been a paying NYT subscriber for years. NYT has a ton of ads, even for subscribers. They don't offer an ad-free version despite it being totally viable at a few more bucks a month based on their finances. Their ads are super disruptive to reading and their privacy policy appears to indicate they buy and sell your data.

      I dunno. That seems like a pretty big fuck you to a paying customer already when all they have to do is provide a sub for a few more bucks a month. But I guess I'm a child of the Internet.

    • chadgpt3 9 minutes ago
      We can't - LLMs don't proxy ads.
  • kr108sdh 1 hour ago
    The petition should be to ban the AI theft. If it is on wayback, the bots could as well scrape the NYT directly.

    The NYT is of course guilty itself. It did not investigate the possible murder of its star witness Suchir Balaji and is too reserved in examining the consequences of AI in general.

    If it doesn't fulfill its journalistic and societal obligations, soon its own journalists will be replaced by AI bullet point slop like Axios.

  • sublinear 1 hour ago
    After many years of these media outlets circling the drain, this is likely the clearest signal of their irrelevance. It's not like anyone is committing these rags to microfiche anymore.
    • giwook 1 hour ago
      And by what standards have you determined that these outlets are circling the drain?

      The work of independent journalists is more important than ever before.

      • beej71 1 hour ago
        More important than ever before and less market value than ever before. :(
      • awakeasleep 55 minutes ago
        It’s kind of shocking to read what you wrote, and realize those big media brands used to be independent journalism.
      • monkaiju 59 minutes ago
        Are we considering the NYT and USA Today "independent journalism" still? Seems dubious...
    • ks2048 34 minutes ago
      > the clearest signal of their irrelevance

      NYT had $2.82B in revenue in 2025.

    • themafia 45 minutes ago
      > It's not like anyone is committing these rags to microfiche anymore.

      I recommend you actually go and read those fiches. The press was not historically high quality. Mass media has had the same problems for decades.

      What it used to have was genuine independent competition.

  • righthand 1 hour ago
    Wouldn’t it be better to let these legacy news orgs (which aren’t really anything beyond advertising and data harvesting firms) block archive.org, so that no one reads their articles and they can go under? I’m struggling to think of a reason I need the NY Times. I’ve never had a subscription and never seen writing that I thought benefited me as a citizen (they’re very pro-war of any kind).
    • JumpCrisscross 1 hour ago
      > block archive.org and thus no one will read their articles and they can go under?

      …why would they go under if the people who don’t pay for news stop reading them?

      • sublinear 1 hour ago
        Media influence and authority have historically depended on getting cited by writing that is more directly relevant to the reader's concern (i.e. the topic of research).

        The paywalls were one thing, but disallowing archival is practically suicide.

    • ks2048 31 minutes ago
      Plenty to criticize about the NYT, but I see that many people complaining about "legacy media" are often the same people getting "news" from random Twitter/X accounts.
    • b00ty4breakfast 1 hour ago
      If people are reading the articles through Wayback, then the publishers aren't making any money, because no data is harvested and no click-throughs or impressions or whatever the metric is are registered.
  • xyzzy_plugh 1 hour ago
    The title freaked me out. I thought this was about the Wayback Machine going away but no, it's just news publications blocking being archived.

    I guess I don't really care. As soon as it becomes unworkable to view these publications through archivers I'll just stop viewing them altogether. I don't see this helping their bottom line though.

    • ameliaquining 1 hour ago
      As long as other people are reading them, they're important for understanding what's happening in the world and what information the public is getting, which is why we need an accessible archive of their content.
      • redwall_hp 1 hour ago
        Exactly. Libraries have kept microfiche archives of newspapers forever, and they're an essential part of historical research.

        They also preserved old books. But now I guess they're becoming middlemen for access to limited ebook platforms that ensure books disappear when publishers lose interest.

        The "Information Age" is proving to be the setup for a dark age, when nonprofitable things are just thrown out and efforts to preserve them are actively fought.

        • layman51 1 hour ago
          I think part of this is important too because online news articles might have corrections, or certain paragraphs might get deleted in some rare situations. It's good to have a way of tracking those. Sometimes, the edits made to an article are irrelevant to the actual message. I'm thinking of stuff like typos, or even embarrassing gaffes like the recent time that a headline implied that the NATO acronym had the word "American" in it.
    • Barbing 57 minutes ago
      Archives protect truth