The WHO Early AI-supported Response with Social Listening Platform shows real-time information about how people are talking about COVID-19 online, so we can better manage the response as the infodemic and pandemic evolve. This project is a pilot covering 20 countries, with plans to expand in the future.
Here you can read more about how the data is collected, how the data is processed, what to consider when using the data, and further definition of terms used.
The platform is powered by Citibeats, a text analytics platform specialized in social understanding. More information can be found at citibeats.com.
Data is collected daily from online conversations in publicly available sources, including Twitter, online forums, news comments, and blogs – in English, French, Spanish and Portuguese, for 20 pilot countries.
Other languages and countries will be considered in the next phases of the initiative.
We are continuously adding data sources. If you have a suggestion of a data source to add to the initiative, please let us know here.
Opinion data from publicly available sources requires normalizing, sampling and cleaning to make it usable, and even then, we must be aware of its limitations.
Since countries differ in population size, in levels of internet access and in how much people share opinions online, we need to make them comparable. We normalize the data by ensuring that whenever countries are compared, it is always by relative proportion of the captured conversation per country.
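As a minimal sketch of this normalization (the function name, category labels and counts below are hypothetical and are not taken from the platform), each country's category counts can be converted to shares of that country's own captured conversation before any comparison:

```python
from collections import Counter

def category_proportions(categories):
    """Return each category's share (%) of one country's captured conversation."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}

# Hypothetical data: Country A has ten times as many opinions as Country B.
country_a = ["Vaccines"] * 600 + ["Testing"] * 400
country_b = ["Vaccines"] * 30 + ["Testing"] * 70

shares_a = category_proportions(country_a)
shares_b = category_proportions(country_b)
print(shares_a)  # {'Vaccines': 60.0, 'Testing': 40.0}
print(shares_b)  # {'Vaccines': 30.0, 'Testing': 70.0}
```

Although Country A has far more opinions about testing in absolute terms, testing takes up a larger share of the conversation in Country B, and it is this share that is compared.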
Sampling is primarily used to control the amount of data we process (rather than for comparability purposes, which is covered by normalization). Sampling is determined by the data ‘query’ we use to define which opinions are collected from the data sources. In this case, the query contains broad COVID-19 keywords, adapted for every language. This means that if someone shares an opinion that is implicitly related to COVID, but does not explicitly mention COVID (or a closely related keyword), it will not be included in our sample. Any significant changes to sampling will be recorded in the change-log, and users of the API will be automatically notified.
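The effect of a keyword query can be illustrated with a short sketch. The keyword lists and the function name are assumptions made for illustration only. The platform's actual query is not reproduced here.

```python
import re

# Hypothetical keyword lists; the platform's real query is not shown in this document.
QUERY_KEYWORDS = {
    "en": ["covid", "coronavirus", "pandemic"],
    "es": ["covid", "coronavirus", "pandemia"],
    "fr": ["covid", "coronavirus", "pandémie"],
    "pt": ["covid", "coronavírus", "pandemia"],
}

def matches_query(text, language):
    """True if the text explicitly mentions one of the language's keywords."""
    keywords = QUERY_KEYWORDS.get(language, [])
    if not keywords:
        return False
    # Match keywords at the start of a word, ignoring case.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(matches_query("Is the COVID-19 vaccine safe?", "en"))   # True
print(matches_query("I miss hugging my grandparents", "en"))  # False
```

The second opinion may well be about the pandemic, but because it contains no explicit keyword it would fall outside a sample built this way.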
The data is representative of populations that use online sources to share opinions about COVID-19. It is difficult to give precise representativity information, since opinions do not contain demographic information (which can only be inferred) and utilization of data sources per demographic varies significantly by country. It should be noted that, as a global summary of our data, women are under-represented (this information is disaggregated and differences made visible in Gender Gap), as are elderly populations and low-income populations.
This social listening platform is intended as a sensing tool and early warning system, but this representativity limitation must be kept in mind.
Country level statistics about internet penetration, platform use and demographics can be found at: www.datareportal.com.
All data is presented anonymously and aggregated. Anonymous means no names of the authors of opinions are shared. Aggregated means we show summary statistics, rather than individual opinions, so no raw text is shared publicly. Moreover, it should be noted that this opinion data comes from publicly available sources and not from sensitive private data.
Country attribution of the data depends on the data source.
In some cases (such as Twitter), this is self-reported in the profile information of users. In others, this is inferred from country level top-level domains (e.g. ‘.co.uk’ for the UK), or from local references mentioned in the text. This should be noted as a limitation of the precision of the data.
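As a rough sketch of the domain-based inference only (the mapping, the function name and the example URLs are hypothetical, and the platform's own attribution logic is not described in more detail here), a country code can be read from the end of a hostname:

```python
from urllib.parse import urlparse

# Illustrative subset of country-code suffixes; a real mapping would be much larger.
SUFFIX_TO_COUNTRY = {
    ".co.uk": "GB",
    ".uk": "GB",
    ".fr": "FR",
    ".br": "BR",
    ".mx": "MX",
}

def country_from_url(url):
    """Infer a country code from the hostname suffix, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    # Check longer suffixes first so '.co.uk' is tried before '.uk'.
    for suffix in sorted(SUFFIX_TO_COUNTRY, key=len, reverse=True):
        if host.endswith(suffix):
            return SUFFIX_TO_COUNTRY[suffix]
    return None

print(country_from_url("https://forum.example.co.uk/thread/1"))  # GB
print(country_from_url("https://blog.example.com/post"))         # None
```

Generic domains such as ‘.com’ carry no country signal, which is one reason attribution of this kind is imprecise.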
Once the data is collected, it is categorized, or classified, into one of the defined categories.
The categories have been defined as topics of interest by health information experts, as well as through a bottom-up analysis of the data. Categories may be adjusted, added or removed during the initiative; any such changes will be announced via the API portal and GitHub.
Data is categorized automatically, with human quality controls. This is achieved through semi-supervised machine learning. This means that from initial human-inputted examples defining a category, the system learns and infers which opinions belong to that category. Regular human review ensures quality control.
The categorization system learns and infers in each local context (in this case, in each country), to adapt to terminology and references made in each country, accounting for differences in language use and social context.
Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). More can be read about Citibeats estimated gender disaggregation here.
It is possible to filter results by ‘intent’. Intent refers to opinions shared with a particular purpose - in this case, we are monitoring ‘questions’ and ‘complaints’. Intents are automatically detected by the Citibeats system, based on machine-learning models and fine-tuned to the context of COVID-19.
The WHO Early AI-supported Response with Social Listening Platform has been designed with health information professionals in mind, who need regular (typically weekly) snapshots of the public conversation.
Please be aware that depending on how the conversation evolves, category definitions may be changed, or new categories added (thereby changing relative proportions of the conversation), for the analysis to stay relevant.
Please note that such changes break the consistency of the analysis over time. For example, if we started with 40 categories in month 1 and added 2 new categories in month 3, then, since we are working with proportions of the conversation, it would not be entirely consistent to compare month 3 proportions with month 1 proportions, even for a category that has been present in all months. Any such changes will be documented in the API portal (with registered users notified) and on GitHub.
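A small worked example, using made-up counts and category names, shows why. It assumes that the new category captures opinions that were previously outside every category. If opinions were instead moved out of existing categories, the shift would look different, but the comparison would still be affected.

```python
def proportions(counts):
    """Convert category counts to shares (%) of the categorized conversation."""
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

# Hypothetical counts: existing categories are unchanged between months.
month_1 = {"Vaccines": 500, "Testing": 500}
# In month 3 a new category ("Travel") is added to the scheme.
month_3 = {"Vaccines": 500, "Testing": 500, "Travel": 250}

share_month_1 = proportions(month_1)["Vaccines"]
share_month_3 = proportions(month_3)["Vaccines"]
print(share_month_1)  # 50.0
print(share_month_3)  # 40.0
```

The number of opinions about vaccines is identical in both months, yet the category's share drops by 10 percentage points purely because the denominator grew.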
Data will be updated daily, at 6:00 am UTC.
Since the WHO Early AI-supported Response with Social Listening Platform is intended for use by health information professionals, it is important to reflect on the differences of this type of data compared to typical data types analyzed by the community.
Most importantly, opinions data are just that - subjective opinions. If the top category for a given country in the platform is ‘Category X’, it does not necessarily mean that ‘Category X’ is the topic that health professionals should consider the top priority. Category X may be the most mentioned by people, but not necessarily be the most important to them; furthermore, if Category X is the most important in the minds of the general public, it may not necessarily be the most important to the public health community. These are signals that information professionals should use within the context and their knowledge of the current situation.
Whereas in a survey the questions asked and answers collected are generally structured, that is not the case when analyzing people’s opinions from public big data, which are unstructured. The benefits of analyzing social big data are that it is real-time and has large geographic and topic coverage. This should be kept in mind - this approach is suited as a ‘sensing’ or ‘early warning’ system, rather than a precise measurement tool.
The WHO Early AI-supported Response with Social Listening Platform is intended as a straightforward resource for health information professionals. For deeper analysis of public big data for your country, you may consider setting up your own social listening platform.
How did the virus emerge and how is it spreading?
Where do people think it comes from: lab derived, wild animal markets, animals, place where the virus comes from, imported food, etc.
Stigma towards people who are thought to be spreading the virus: racist expressions, attribution to poor people or immigrants.
Stigma expressed about or by people who are or have been infected.
What are the symptoms and how is it transmitted?
Confirmed symptoms as defined by WHO, including long-term symptoms.
Other discussed symptoms that have not yet been confirmed by WHO.
Comments on transmission from asymptomatic people, about asymptomatic people, or personal experience of asymptomatic people.
Comments on transmission from pre-symptomatic people, about pre-symptomatic people, or personal experience of pre-symptomatic people.
Modes of transmission confirmed and unconfirmed by WHO.
Actions that individuals take to protect themselves – discussion of recommended actions as well as other types of actions that individuals take or should take.
Narratives about settings where transmission can be amplified: closed and semi-closed settings.
Vulnerable and risk groups:
- individuals with health conditions like lung or heart disease, diabetes or conditions that affect their immune system
- pregnant women
Anxiety, depression and other conditions arising from the pandemic situation
How can it be treated or cured?
Medical treatment as per WHO treatment recommendations
Narratives about the vaccine itself (side effects, safety, etc)
Narratives by and about health care workers and vaccine
Narratives about vaccines in general, including discussion about others or communities that have different opinions about vaccines; can include any vaccine concerns, not just COVID-19
Comments on new treatment and vaccines from research and development and evidence and scientific processes
Discussion about treatments that are not proven to be effective (examples: sunlight, nutrition, herbal remedies, etc)
Specific myths that WHO and partners have reacted to or taken steps to debunk
What is being done by government and health authorities and societal institutions?
Any discussion about tests – everything from reliability, to access to tests, types of tests, requirement to have tests, etc.
Any discussion about the process, requirements and steps involved in contact tracing, use of technology
Care given to patients in hospitals by medical personnel
Narratives about distribution, equity, access to COVID-19 vaccine
Individual protection measures recommended by governments/WHO such as wearing masks, handwashing, social distance, isolation when ill...
Measures implemented by governments in public settings: schools, workplaces, public transport...
Measures implemented or suggested by governments/WHO/population/private companies on travel: immunity passports, negative PCR or negative rapid test to enter a country, mandatory quarantine
Measures implemented by governments related to movement reduction: lock-down at home, territory lock-down, etc.
Equipment for health workers
Health technology used to treat patients: medicines, medical devices, vaccines, procedures and systems
Discussions about digital technology used to respond to the pandemic: electronic data exchange, electronic notices of passenger lists to health authorities, biometric data coming from wearables, proximity apps (App Covid). Includes people’s attitudes to data privacy and to the use of data for modelling and predictive analytics.
Fatigue from interventions (lock-down, movement restrictions, masks...)
Narratives about faith and religion and COVID-19 (these narratives are recurring, usually around the time of religious holidays and outbreaks in faith based settings)
Narratives about industry, unions and COVID-19
Narratives about the environment and COVID-19 – some examples: shedding in the environment, waste water, air pollution changes as a secondary byproduct of lockdowns
Narratives about social inequalities and relation to COVID-19
Narratives about civil unrest and COVID-19
Narratives about youth, effects of pandemic on them, or actions youth is taking
What types of information are most engaging?
Conversations about facts, official statistics and data
Conversations about mis- and disinformation
Conversations about where people look for information
An ‘opinion’ is considered to be a unique contribution. We are not including social interactions (e.g. retweets, likes, shares) in our analysis.
‘Top category’ shows which category contains the most opinions, compared to other categories in that country. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.
Shows in which country the selected category is a ‘rising priority’. ‘Priority’ refers to the proportion of the conversation for that category, compared to the other categories in that country. ‘Rising’ means the change in priority, comparing the last 7 days with the 7 days prior to that.
It is important to note that ‘rising’ here is relative to the other categories in that country. For example, if Country A doubled the number of opinions in each and every category from Week 1 to Week 2, ‘rising’ would not show any increase.
This definition of ‘rising’ is used to enable comparability between countries. If you are interested in the absolute (rather than relative) rising, this is viewable on the Country Report page under ‘Trends’.
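The relative nature of ‘rising’ can be sketched as follows. The sketch assumes that the change in priority is measured as a difference in percentage points between the two 7-day windows. The exact formula used by the platform is not stated here, and the function names and counts are hypothetical.

```python
def shares(counts):
    """Convert category counts to shares (%) of the conversation."""
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

def rising(prior_7_days, last_7_days):
    """Change in share (percentage points) per category between the two windows."""
    prev, last = shares(prior_7_days), shares(last_7_days)
    return {cat: last.get(cat, 0.0) - prev.get(cat, 0.0)
            for cat in sorted(set(prev) | set(last))}

week_1 = {"Vaccines": 100, "Testing": 300}

# Every category doubles: absolute volume rises, relative priority does not.
doubled = {"Vaccines": 200, "Testing": 600}
print(rising(week_1, doubled))  # {'Testing': 0.0, 'Vaccines': 0.0}

# Only one category grows: its share of the conversation rises.
shifted = {"Vaccines": 200, "Testing": 200}
print(rising(week_1, shifted))  # {'Testing': -25.0, 'Vaccines': 25.0}
```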
Shows which categories are talked about more by women than men (brown), and more by men than women (blue), as a proportion of the conversation of that gender.
Values are the difference between female and male proportions (%) of the conversation per category. So all female category %s sum to 100%, all male category %s sum to 100%, and we show the difference between these numbers.
Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). Learn more.
Citibeats recognizes that female and male are not the only genders.
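The gender gap calculation described above can be sketched with made-up counts. The function names and the sign convention (positive values mean a larger share among women) are assumptions for illustration.

```python
def shares(counts):
    """Convert category counts to shares (%) of one gender's conversation."""
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

def gender_gap(female_counts, male_counts):
    """Female share minus male share (percentage points) per category."""
    female, male = shares(female_counts), shares(male_counts)
    return {cat: female.get(cat, 0.0) - male.get(cat, 0.0)
            for cat in sorted(set(female) | set(male))}

# Hypothetical counts: men contribute three times as many opinions overall.
female_counts = {"Vaccines": 40, "Testing": 60}
male_counts = {"Vaccines": 150, "Testing": 150}

gap = gender_gap(female_counts, male_counts)
print(gap)  # {'Testing': 10.0, 'Vaccines': -10.0}
```

Because each gender's shares are computed against that gender's own total, the under-representation of women in the overall data does not by itself produce a gap.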
Filters only the opinions which are questions, and highlights where the outlier countries are, according to proportion of the conversation per category. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.
‘Questions’ are defined here as phrases expressed to elicit information, including expressing one's doubts about something or checking its validity or accuracy.
‘Complaints’ are defined here as statements that something is unsatisfactory or unacceptable, and which have some potential to be actionable.