DeepSeek’s “Sputnik Moment”: Data and the Global AI Race
Silicon Valley and Washington are missing the main point on DeepSeek. DeepSeek is a data play, not a hardware play. Data is the new infrastructure, a private right, and a national strategic asset. The U.S. needs a national data policy to aggregate and protect data.
DeepSeek, a Chinese startup, shocked the artificial intelligence (AI) industry and ignited angst in both Silicon Valley and Washington following the release of DeepSeek-V3 and DeepSeek-R1 in January. DeepSeek surged to the top of the Apple App Store and Google Play Store, with over 16 million downloads in the first three weeks. Marc Andreessen called DeepSeek ‘AI’s Sputnik moment’.
DeepSeek should not have surprised Silicon Valley and Washington. Two years ago, while over 2,000 AI luminaries called for an immediate moratorium on training strong AI systems, I countered with America’s AI Ultimatum: Forge Ahead or Fall Behind, proposing a call to arms rather than a cease-fire: “As a matter of national interest, America must accelerate AI development to secure its lead in artificial intelligence and develop trustworthy AI systems.”
The Trump administration has responded quickly, vowing the U.S. will dominate AI. On his first day in office, Trump rescinded the previous administration’s executive order limiting AI development. A day later, he announced the Stargate Project, which seeks to raise up to $500 billion to build AI infrastructure in the United States. The Office of Science & Technology Policy concurrently called for proposals on an AI Action Plan due March 15.
Data as Infrastructure
Silicon Valley and Washington are missing the main point on DeepSeek. DeepSeek is a data play, not a hardware play. Data, not compute power, is the long-term bottleneck to foundation model performance. OpenAI observes that AI model intelligence roughly equals the log of the compute and data resources used to train and run it. A survey conducted by the GenAI Collective of its 25,000+ members indicates that data is the primary limitation in AI innovation.
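To make that scaling claim concrete, here is a minimal numeric sketch. It assumes a simple logarithmic relationship; the constant and the exact functional form are illustrative, not OpenAI’s published formula. The point it shows: each tenfold increase in compute and data buys only a fixed increment of capability, which is why richer data, rather than ever more hardware, becomes the binding constraint.

```python
# Illustrative only: assumes capability ~ k * log10(resources), a stand-in
# for the rough scaling relationship cited above, with k chosen arbitrarily.
import math

def capability(resources: float, k: float = 1.0) -> float:
    """Hypothetical capability score under a logarithmic scaling assumption."""
    return k * math.log10(resources)

for r in (1e3, 1e4, 1e5, 1e6):  # each step uses 10x the resources of the last
    print(f"resources = {r:.0e}  capability = {capability(r):.1f}")
# Capability climbs 3.0 -> 4.0 -> 5.0 -> 6.0: a 1,000x increase in spend adds
# only three points, so returns on raw hardware diminish quickly.
```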
Yet Wall Street has focused primarily on hardware. DeepSeek, which claims it trained its latest AI model for just $5.6 million, raises the specter of commoditizing hardware. Wall Street battered Nvidia, which lost over $600 billion, or roughly 20% of its market value, following the release of DeepSeek-R1. More efficient compute strategies are welcome: OpenAI regularly releases “mini” versions of its frontier models to reduce compute costs, including o3-mini, released in response to DeepSeek-R1.
Data is the new infrastructure, a private right, and a national strategic asset. U.S. policy should treat data as a utility and make it a central part of any infrastructure plan. Data in the U.S. is currently fragmented across big tech data centers, corporate data warehouses, and government data silos. Akin to the Library of Congress, which was founded in 1800 to collect books and other publications, the U.S. should develop a federated data model that creates a central repository of data.
China announced its intent in 2017 to lead the world in AI by 2030. In the race for AI supremacy, the greater challenge for the U.S. is data, not infrastructure. In AI Superpowers, Kai-Fu Lee describes China as the ‘Saudi Arabia of Data’ and predicts China will win the AI race through its superior ability to amass, centralize, and leverage data resources.
China currently holds a clear data advantage. Its National Intelligence Law of 2017 requires companies operating in China to support its intelligence-gathering operations. Foreign companies such as Apple comply with data localization requirements to operate in China. The Data Security Law asserts China’s right to gather “important data” relating to “critical information infrastructure” when deemed a matter of national security.
China’s data strategy extends well beyond its borders. China hoovers up data from abroad with popular apps and technologies such as TikTok and DeepSeek while limiting foreign access through its Great Firewall. China has also expanded into the vacuum left by U.S. retrenchment, stepping in where the U.S. steps back. Since 2000, China has displaced the U.S. as the major trading partner for most countries in South America, Africa, and Asia. Through its Belt and Road Initiative, China is building data centers and intelligence-gathering capacity in these countries. By 2020, China’s share of global cross-border data flows was roughly double that of the U.S.
A National Data Store: The New Public Good
Like water and air, data in the Information Age is a public good. AI requires clean data as humans require clean water and air.
A National Data Store would accelerate AI development while enhancing data transparency, privacy, and cybersecurity. It would augment the data available through Common Crawl with government and company data contributed in return for access to the repository, with metadata made available to authorized firms for AI research and other use cases. It would bolster security by limiting access points, partitioning data, and applying best-in-class cybersecurity technology across the data pool, and it would improve data privacy by giving citizens visibility into their data and a say in how it is used. It would also promote data standards and protocols while ensuring data is clean and synthesized for ready use. The National Data Store would require only a modest initial investment, as it could use existing data center infrastructure from Amazon AWS, Microsoft Azure, Google GCP, and others. To reinforce trust, an independent National Data Store Board, akin to the Federal Reserve Board, would govern the store, setting interoperability standards and enforcing privacy protections.
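To make the federated model concrete, below is a minimal sketch, with hypothetical class and method names, of how contributed datasets might stay in their existing data centers while a single gateway exposes only schema-level metadata to authorized firms. It illustrates the access pattern described above; it is not a proposed implementation.

```python
# Hypothetical sketch of the federated access pattern described above.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    owner: str           # contributing agency or company
    location: str        # existing cloud or on-premise data center holding the raw data
    schema: list[str]    # field names exposed as shareable metadata
    contains_pii: bool   # personal data stays partitioned behind consent rules

@dataclass
class NationalDataStore:
    catalog: dict[str, DatasetRecord] = field(default_factory=dict)
    authorized: set[str] = field(default_factory=set)

    def contribute(self, name: str, record: DatasetRecord) -> None:
        """Register a dataset; contributing earns access to the shared pool."""
        self.catalog[name] = record
        self.authorized.add(record.owner)

    def metadata(self, requester: str) -> dict[str, list[str]]:
        """Authorized firms see schemas only; raw records never leave the owner's data center."""
        if requester not in self.authorized:
            raise PermissionError("contribution or authorization required")
        return {name: rec.schema for name, rec in self.catalog.items()}

store = NationalDataStore()
store.contribute("nih_trials", DatasetRecord("NIH", "gov-cloud-east", ["trial_id", "outcome"], True))
print(store.metadata("NIH"))  # {'nih_trials': ['trial_id', 'outcome']}
```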
To catalyze support for the National Data Store, the U.S. government should launch a national AI challenge to solve a major health or energy problem using a shared pool of government and corporate data. Many significant AI advances have been achieved through competitions built around a large data pool and a widely shared problem, such as the ImageNet Large Scale Visual Recognition Challenge in 2012. The health and energy sectors are good starting points: these industries have a history of successful collaboration, and the U.S. has large government data repositories that could be made available to authorized organizations participating in the challenge. The challenge would offer an incentive for government to coalesce its many data silos and for participating companies to contribute data to the National Data Store.
A National Data Store would create the global gold standard for data through best-in-class information architecture, cybersecurity, and clean data while promoting data protocols and standards around which the AI community would coalesce. Following are just a few of the immediate potential benefits of a National Data Store:
1. Data as a Strategic Resource: Data is a strategic resource that should be gathered, cleansed, secured, maintained, and selectively shared as a matter of national interest. China has pursued a consistent data strategy for more than three decades: as early as 1988, it launched a strategy to establish ‘information sovereignty’ and expand its ‘information territory.’ By contrast, the U.S. has outsourced its data strategy to the private sector. The most lucrative U.S. technology companies are data platforms; Amazon, Apple, Facebook, Google, and Microsoft together generate over $1 trillion in annual revenue, much of it by gathering and monetizing private data. Private firms serve shareholder interests, not national interests. In the age of artificial intelligence, the U.S. must unify its currently fragmented and siloed approach to data and infrastructure.
2. Accelerate AI: A National Data Store addresses two deficits inherent in our current siloed approach to data: a Discovery Deficit and a Diffusion Deficit. A common data repository improves transparency and unlocks data currently held in silos. Broader data access accelerates innovation, as illustrated by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, a breakthrough for the use of deep neural nets in image recognition. Drug discovery has also accelerated as artificial intelligence is applied to publicly available data.
A National Data Store levels the playing field, enabling AI startups to access larger data pools and compete with tech incumbents. Barriers to entry for disruptive AI innovation are currently high and rising. While Common Crawl makes data available to startups, DeepSeek has shown this is only an entry point, one benefiting innovators of all stripes. A National Data Store would create the gold standard for data, aggregating existing silos of public and private data while being tasked to uncover and incorporate new sources of information.
3. Clean Metadata: About 80% of data science involves collecting, cleaning, and organizing data, while only 20% is spent on building models and making discoveries. Siloed data creates redundant efforts to collect, clean and maintain information. The inefficiencies and inadequacy of these redundant efforts increase as the firehose of new data accelerates. Much as the biotech industry collaborates to offset high research costs, a shared data resource would defray the rising costs of cleaning and maintaining the data while providing access to a larger pool of data. The U.S. created the Federal Reserve system when the financial system outgrew the ability of private capital alone to safeguard the economy. Data is a strategic asset that is growing beyond the point where private firms can safeguard national data interests.
4. Cybersecurity: Our current fragmented data approach provides many attack vectors, heightening cybersecurity costs and risks. Theft of intellectual property costs Americans up to $600 billion annually as of 2018. AI and the near-term prospect of quantum computing significantly heighten cybersecurity risk. A National Data Store would limit entry points and enable best-in-class cybersecurity software to be applied across all sensitive U.S. data. Data partitioning technology could segment data so that malicious actors who gain entry would have no more access than they do under our current siloed approach.
5. Data Privacy: Americans leave data exhaust that gives private companies and foreign actors insight into our daily activities, beliefs, predilections, and vulnerabilities. Data platforms, which trade services for insight and gather data from myriad sources, know more about us than we realize. Companies assemble digital profiles of our health, wealth, assets and liabilities, spending habits, location, travel habits, social behaviors, social networks, beliefs, and views. Armed with data analytics and AI tools, data platforms may know us better than we know ourselves. In the age of the quantified self, our digital twins are malleable figures for monetization and manipulation. As U.S. citizens become pawns on a geopolitical chessboard, individual and national sovereignty hang in the balance.
This must change. Data privacy is a fundamental democratic human right. Citizens should have knowledge of and control over their data. A National Data Store would be a first step toward reestablishing and reinforcing data privacy rights. Unless disclosure is required by law, personal information in the repository should be available only as anonymized metadata; broader use would require the person’s authorization, granted either through a rules-based standard protocol or, as needed, on an ad hoc basis (a minimal sketch of such a consent rule follows this list).
6. Augmented, not Artificial Intelligence: Augmented Intelligence is pro-worker; Artificial Intelligence displaces workers. Knowledge is a quasi-public good: we all benefit from the spread of knowledge and the improved productivity of our neighbors. A National Data Store with clean metadata is a national resource that can be used to enhance information products and raise the productivity and capabilities of American citizens.
7. Energy Efficiency: Public backlash against expanding data center and energy requirements may threaten continued AI development. Our current fragmented data approach creates highly inefficient redundancies. Data centers already consume 2% of U.S. energy; with escalating AI data and compute requirements, this figure may rise to 9% by 2030. A National Data Store would introduce efficiencies that could alleviate escalating energy costs and data capacity constraints.
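Below is the privacy sketch referenced in item 5: a minimal, hypothetical illustration of how personal identifiers could be released only as anonymized metadata unless a citizen’s stored consent rule covers the requested use. The field names, consent rules, and pseudonymization scheme are all assumptions for this example, not a proposed design.

```python
# Hypothetical illustration of the consent rule in item 5; not a proposed design.
import hashlib

# Example rules a citizen might register with the store (illustrative values).
CONSENT_RULES = {"health_research": True, "advertising": False}

def pseudonymize(value: str, salt: str = "nds-demo-salt") -> str:
    """Replace an identifier with a one-way hash (a simple stand-in for stronger anonymization)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def release(record: dict, purpose: str, identifiers: tuple = ("name", "ssn")) -> dict:
    """Return the full record only for consented purposes; otherwise anonymize identifiers."""
    if CONSENT_RULES.get(purpose, False):
        return record
    return {k: pseudonymize(v) if k in identifiers else v for k, v in record.items()}

sample = {"name": "Jane Doe", "ssn": "000-00-0000", "zip": "94105"}
print(release(sample, "advertising"))      # identifiers hashed; zip retained as metadata
print(release(sample, "health_research"))  # consented purpose: full record returned
```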
National Data Store Implementation: U.S. as the Data Gold Standard
By outsourcing its data policy to the private sector, the U.S. has forfeited global data leadership to Europe. Multinational U.S. companies must comply with Europe’s General Data Protection Regulation (GDPR). In the absence of a single federal data protection law, U.S. companies must navigate a patchwork of statutes at both the federal and state levels, including the California Consumer Privacy Act (CCPA), which was modeled on the GDPR.
The data realm has changed dramatically in recent years. The world generated over 120 zettabytes of data in 2023, more than sixty times the data exhaust produced in 2010. The Citadel Campus in Nevada, the world’s largest commercial data center, has capacity for at least 150 exabytes of data, roughly 7,000 times the digital holdings of the Library of Congress and its nearly one billion files. Yet many of the laws and policies governing data predate the Internet. White & Case observes that “there is no single data protection legislation in the United States. Rather, a jumble of hundreds of laws at both the federal and state levels serve to protect the personal data of U.S. residents.”
A well-implemented National Data Store program would bring our data policy into the 21st century, aggregating data across public and private data warehouses into a single repository while reducing cybersecurity risks and associated costs. It would establish data privacy rights, uphold those standards, and empower citizens with insight into how their personal data is used. And a federated data model would address a strategic national AI threat, spurring innovation and accelerating artificial intelligence initiatives by offering companies large and small access to a vast pool of anonymized, cleansed data, larger than any single repository held today.