The upcoming IndiaAI Datasets platform will give companies and developers access to sensitive and crucial data on a case-by-case basis, depending on the conditions and purpose for which the data will be used.
The datasets platform, which is in the works, will house anonymous and non-personal data for developers and companies looking to train their AI models. Companies, however, will have to register and submit a request to access certain datasets on the platform.
Most datasets, however, will be open for use, officials said.
“The way the datasets platform has been designed is that every government department, which is publishing the data, will have the right to decide what they want to keep in an open domain, what they want to keep in a licensing domain or restricted domain, and what they want to keep in prohibited domain,” Abhishek Singh, additional secretary at the ministry of electronics and IT (MeitY), said at the Digital News Publishers Association (DNPA) Conclave on Thursday.
“In some cases the departments may think of making it available to everyone. But if there is some sensitive data, that might be made available to only a few people with certain conditions. We are not saying either open or closed, but we are making it pertinent with the data owners to decide what kind of access they will give and to whom,” Singh said, adding that the first version of the platform will be launched within the next 10 days.
The dataset platform, which is being built at a cost of around Rs 200 crore, is one of the seven pillars of the Rs 10,000 crore IndiaAI mission.
MeitY is currently consulting different ministries to source datasets that can be made available on the platform. This is crucial given that the government has also begun the process of creating India-origin foundational models, for which it has already received 67 proposals.
“We are trying to get datasets from all sources – whether it’s the department of commerce or DPIIT, ministry of finance, trade data, export-import data, productivity, yield. Very granular data is available, but right now it is all in silos,” Singh said.
Through the datasets platform, developers can access sector-specific data or India-specific general data crucial for training large language models.
“We are getting very deep into sectoral data,” Singh said, adding that data generated on e-sanjeevani, the government’s telemedicine portal, can be used to build a health model. The government plans to draw on voice samples of doctor-patient consultations, which firms building health models could use.
MeitY is also in talks with Prasar Bharati, which has content in different languages that could be valuable for voice-enabled services.
Financial Express