
Bluesky's Bold Move: Users React to New Data Scraping Proposal for AI Training
2025-03-15
Author: William
Overview of Bluesky's Proposal
The social network Bluesky has unveiled a proposal that would let users indicate whether their posts and data may be harvested for purposes such as generative AI training and public archiving. The proposal, published on the company's GitHub page and discussed by CEO Jay Graber at the South by Southwest festival, has ignited a heated debate within the platform's community.
User Concerns
Users expressed widespread concern following Graber's social media post, interpreting the initiative as a departure from Bluesky's prior commitment not to sell user data to advertisers or utilize user content for AI training. One user, Sketchette, reacted vehemently, saying, "Oh, hell no! The beauty of this platform was the NOT sharing of information. Especially gen AI. Don’t you cave now."
Graber's Defense
In response, Graber noted that companies involved in AI development are already scraping data from across the internet, including public posts from Bluesky. "Everything on Bluesky is public, just like websites," she explained. The platform's goal is therefore to establish a 'new standard' governing this scraping, modeled on the robots.txt file that websites use to tell web crawlers what they may access.
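For readers unfamiliar with robots.txt, the following is a minimal sketch of how a well-behaved crawler consults it, using Python's standard urllib.robotparser module. The crawler name "ExampleAIBot", the URL, and the file contents are made up for illustration and do not describe any real site or any part of Bluesky's proposal.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt contents. "ExampleAIBot" is a hypothetical crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text instead of fetching them
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The hypothetical AI crawler is asked to stay away; other crawlers are not.
    print(is_allowed("ExampleAIBot", "https://example.com/post/1"))
    print(is_allowed("SomeOtherBot", "https://example.com/post/1"))
```

Note that the check runs inside the crawler itself. Nothing in the file stops a crawler that simply never asks.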
Legal and Ethical Considerations
The comparison sharpens the ongoing debate over AI training and copyright, because the robots.txt file itself is not legally binding. Bluesky intends to develop a similar model that gives users a 'machine-readable format' for communicating their data preferences. The model is meant to encourage ethical behavior among data scrapers, but it would not be legally enforceable.
Proposed User Control Settings
The proposal suggests that users of Bluesky and of other apps built on the underlying AT Protocol could adjust their settings to manage how their data is used across four categories: generative AI, protocol bridging (connecting different social ecosystems), bulk datasets, and web archiving (such as with the Internet Archive's Wayback Machine).
Expectations from Companies
If a user opts out of having their data used for generative AI training, the proposal asserts that companies and researchers creating AI training datasets are "expected to respect this intent" during data scraping and bulk transfers.
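To make the idea concrete, here is a rough sketch of how a dataset builder might honor such preferences. It is not Bluesky's actual schema or API: the Category names, the UserIntent class, and the filter_for_dataset helper are all hypothetical, and treating an undeclared preference as "not allowed" is an assumption of this sketch rather than something the proposal is reported to specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Category(Enum):
    """The four categories named in the proposal (identifiers are hypothetical)."""
    GENERATIVE_AI = "generative_ai"
    PROTOCOL_BRIDGING = "protocol_bridging"
    BULK_DATASETS = "bulk_datasets"
    WEB_ARCHIVING = "web_archiving"


@dataclass
class UserIntent:
    """A user's declared preferences; categories left out are undeclared."""
    handle: str
    allowed: Dict[Category, bool] = field(default_factory=dict)

    def permits(self, category: Category) -> bool:
        # Assumption: an undeclared preference is treated as "not allowed".
        return self.allowed.get(category, False)


def filter_for_dataset(
    posts: List[Tuple[str, str]],
    intents: Dict[str, UserIntent],
    category: Category,
) -> List[str]:
    """Keep only the text of posts whose author permits the given use."""
    kept = []
    for handle, text in posts:
        intent = intents.get(handle)
        if intent is not None and intent.permits(category):
            kept.append(text)
    return kept


if __name__ == "__main__":
    intents = {
        "alice.example": UserIntent("alice.example", {Category.GENERATIVE_AI: False,
                                                      Category.WEB_ARCHIVING: True}),
        "bob.example": UserIntent("bob.example", {Category.GENERATIVE_AI: True}),
    }
    posts = [("alice.example", "hello"), ("bob.example", "hi"), ("carol.example", "hey")]
    print(filter_for_dataset(posts, intents, Category.GENERATIVE_AI))
```

As with robots.txt, the filtering happens on the scraper's side, so it only works if the scraper chooses to run it.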
Community Reactions
Molly White, known for her insights on tech and blockchain, called it "a good proposal" and said she found it perplexing that users were disparaging Bluesky for it. She argued that rather than promoting AI scraping, the platform is attempting to introduce a consent mechanism for scraping that already takes place.
However, White also cautioned about the challenges posed by the reliance on scrapers to honor these preferences. "We've seen some of these companies ignore clear signals like robots.txt," she pointed out, highlighting the potential vulnerabilities of any non-legally binding system.
Conclusion
As Bluesky works through the proposal, user reactions reflect genuine concern about privacy and ethical data use as AI development accelerates. The debate captures a larger conversation about data ownership and user consent online.