10
Dec

Insights on Public Data & Visualisation – In conversation with co-founders of “How India Lives”

In this blog post we share an interview with Avinash Celestine & Avinash Singh, co-founders of How India Lives. Covering an array of topics, they share their experiences and thoughts on handling public data and the nuances of data visualisation that have to be kept in mind when dealing with complex datasets.

Tell us about how “How India Lives” came into being and the gaps that the platform is trying to address.

How India Lives was founded by five senior journalists after we felt that reliable and relevant data, especially demographic data, was missing in India. Often, we used to spend a tremendous amount of time collecting and cleaning the data. We started the venture three years ago and it was incubated at the Tow-Knight Center at the City University of New York. We received a grant from the Tow-Knight Foundation to kick-start the operations. The vision is not to duplicate existing data products, but to create products where reliable data can be easily accessed.

How India Lives is a platform to distribute all kinds of public data in a searchable form, aiming to help people make informed business and policy decisions. Currently, data-driven decision making by a range of users – marketers, researchers, retailers, governments, etc. – is driven by customer profiles created on the basis of sample surveys conducted by a number of companies. While these surveys have their uses, we feel that public data, organised under the Census, the National Sample Survey and other sources, can dramatically improve the way companies understand their customers. Their coverage is comprehensive and highly granular (the Census, for example, provides data on hundreds of variables down to every village and ward in India) in a way that many private consumer-oriented surveys are not. Rich possibilities also reside in several datasets beyond the well-known ones. Some of these datasets can be used as proxies to answer questions related to understanding the consumer landscape or making public policy interventions.

However, organising public data so that it is useful to users requires a major effort. Much public data is scattered across different databases (in many cases, it is not even organised into a database) which don’t ‘talk’ to each other, or is in unusable formats (e.g. PDF). Our aim is to reorganise public data into a common, machine-readable format, and in a way that lets users search for data and compare data from disparate sources for a single geography. We also aim for this platform to be visual in nature, capturing data via maps and other appropriate visualisations.

Howindialives.com is presently in beta and is scheduled to become a paid service in early 2017. At that time, we will have 250 million data points, more than 5,500 metrics, and at least 600 data points for each of the 715,000 geographical locations in India.

In addition, we also offer data and technology consultancy services to companies, media outfits and non-profits. Some of our clients: Mint, HT Media, Confederation of Indian Industry (CII), Daksh India, CBGA India, Centre for Policy Research (CPR), ABP News, TARU Leading Edge, Arghyam, and Founding Fuel.

A lot of our common understanding of government programs and other public initiatives is primarily driven by the data points that media dailies choose to write on, especially in the context of large datasets. Given your experience as a journalist across media entities, what is your take on this and is there room for improvement?

When the media covers the release of new data, the coverage is often superficial and unable to take into account the complexity of a dataset. The classic example here is the census. The census has been releasing data since 2012 or so. Often we have found that when these data releases are covered, it is only to the extent of state-level or national-level data.

We feel that this is insufficient to take into account the vast geographical complexities of a country such as India. Often we have found that drilling down to a greater depth (e.g. down to the district level) gives us greater insights, since disparities within states can often be as dramatic as those between states (see this link for an example of how, on one measure, disparities within states are often equally important). This is just one example. Another is to look at relationships between variables, at how one variable can ‘cut’ another, and to explore the interrelationships between them. Our exploration of differing education levels among dalit castes is an example.

Another weakness is simply a lack of awareness of what is out there, and a failure to think creatively about which datasets can be used to address a particular question.

Then there is the problem that many datasets, while publicly available, are difficult to access – e.g. they may be in PDF format and/or scanned images. For instance, we explored the relationship between real estate expansion in Gurgaon and the political regime, using a dataset that was in the public domain but in PDF form, which made it unlikely to be used by journalists.
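
As a rough illustration of what unlocking such a release can involve, here is a minimal Python sketch (using the pdfplumber and pandas libraries) that pulls a table out of a machine-readable PDF and re-saves it as CSV. The file name is hypothetical, and a scanned image would additionally need OCR before any of this would work.

```python
# Hypothetical sketch: extracting a tabular dataset from a PDF release.
import pdfplumber
import pandas as pd

rows = []
with pdfplumber.open("district_data.pdf") as pdf:   # file name is illustrative
    for page in pdf.pages:
        table = page.extract_table()                 # list of rows, or None
        if table:
            rows.extend(table if not rows else table[1:])  # keep the header once

df = pd.DataFrame(rows[1:], columns=rows[0])         # first row as column names
df.to_csv("district_data.csv", index=False)          # re-publish in a usable form
```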

It’s important to stress that these weaknesses are not necessarily due to the incapacity of journalists themselves – indeed if that was the main problem it could be more easily addressed through training etc. The problem is deeper and is related to the way in which many media organisations work. Because of the pressure of deadlines, the need to publish on a regular basis, and the extremely short cycle of news, most journalists simply don’t have the time to be able to spend time with a dataset and understand its complexity (this is true of all news reporting – not just ‘data journalism’). The role of senior editors in giving their journalists the time they need to report and explore a piece of data, and insulating them to an extent from the daily news cycle, is crucial.

How would you characterize the demand for Data Visualization driven Journalism as opposed to the more traditional forms such as anecdotal evidence and story-telling?

The demand for data-visualisation-driven journalism is certainly high, and it’s driven by the increasing availability of large datasets, and the tools to explore and visualise them (e.g. R for data analysis and d3 for JavaScript-driven visualisations). These are supply-side factors. The proliferation of online news outlets, social media etc. has provided the demand-side push.

Given the vast volumes of data that are now both available and accessible, data visualisation is increasingly becoming an ideal mode to digest and understand such data. In your experience of having worked on large datasets and visualised them, what according to you are the fundamentals of a good visualisation effort? Are there any nuances that someone has to keep in mind while balancing breaking down complex data and its visual presentation?

The best data journalism is one that combines all forms of story-telling and does not restrict itself to any one. Good data journalism is one that explores a dataset, covers the views of people who understand the field or area relevant to that data, and makes it clear what the data can and cannot tell us. (An excellent example of reporting that does this is the series done in The Hindu on rape cases.)

Note: Also see the answer to the following question.

When visualising a large dataset, the decision not to focus on certain data points is as important as choosing what to focus on. Is there a method or an optimal way to make these either/or choices? How much of this decision is based on the target audience and the end impact you want to have, say, on public policy?

Any good data visualisation is driven by a clear viewpoint, developed from exploring the data and an iterative process between posing broader questions and seeing what the data throws up. Simply putting the data out there and assuming that it ‘speaks for itself’, as many claim, will almost guarantee that people’s engagement with it will be low.

The argument is often made that ‘imposing’ your viewpoint on the data is a no-no since it introduces bias. This ignores the fact that the very act of selecting how the data is to be shown and what to display and not to display, introduces bias anyway (otherwise we could just dump a giant excel file on users and ask them to figure it out for themselves). It’s better to take a clear line on what your data shows, and make your assumptions and line of argument clear. Readers are often smart enough to reach their own conclusions on whether your arguments pass muster.

Once you have a clear line on what you want to say, you would necessarily organise your data in a way that makes the point. It’s also often helpful to have a few paras of introductory text, talking about the visualisation, and the argument, since this sets the context in which users ‘read’ the viz.

If the target audience is a layperson, who has less domain knowledge of the relevant field, it’s more important to take them through the viz and the arguments you are making, especially if the field itself is complex. If the audience does have domain knowledge, you can certainly assume some familiarity with the subject.

How India Lives has taken an interest in disseminating data relating to socio-economic issues. What have been some of your personal experiences in working on public datasets with respect to the Indian context? Can you give some examples?

  1. Data is in forms and formats which make it difficult to parse (e.g. PDF). Indeed, we have seen a case where the data for download – from a government site – was an Excel file which, when opened, contained only a scanned JPEG image of a data table. The site admin obviously had to fulfil a requirement that he disseminate the data in Excel format, but had his own creative interpretation of what that entailed.
  2. Data is geographically incompatible. For instance, census data is based on districts as they stood at Census 2011. Since 2011, however, new districts have been carved out, which means that adding to that data is difficult without knowing how the new districts map onto the old ones (a minimal sketch of this mapping problem follows this list). Further, the very concept of a ‘district’ differs depending on the public authority. For instance, police forces can have their own definitions of what constitutes a district, usually the jurisdiction of an SP or DCP, and this is different from what the civil administration regards as a district. Thus, mapping crime data to other socio-economic data becomes a challenging exercise.
  3. Lack of clear GIS data. In India, there is no official, publicly available and easily accessible source of GIS maps for the country that is kept updated to reflect the latest geographical boundaries, both internal and external (e.g. has the government changed its maps to reflect the recent treaty with Bangladesh? If so, has this been released?).
  4. Data is in silos. Data released by one government department doesn’t necessarily map onto data released by another in geographic terms. (See point 2 above)
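
As a minimal illustration of the district-mapping issue in point 2, the Python sketch below (with invented district names and figures) rolls newer data back to Census 2011 boundaries via a simple crosswalk before any comparison with census data.

```python
# Hypothetical crosswalk: post-2011 district -> parent district as of Census 2011.
import pandas as pd

crosswalk = {
    "New District A": "Old District X",
    "New District B": "Old District X",
    "Old District Y": "Old District Y",   # unchanged districts map to themselves
}

recent = pd.DataFrame({
    "district": ["New District A", "New District B", "Old District Y"],
    "value_2016": [120, 80, 300],          # illustrative figures
})

recent["census_2011_district"] = recent["district"].map(crosswalk)

# Aggregate back to 2011 boundaries before joining with census data
comparable = recent.groupby("census_2011_district", as_index=False)["value_2016"].sum()
print(comparable)
```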

Despite this, our experience of working with public data has been hugely rewarding. The data can be complex, often confusing and maddeningly so. But taking the time to understand its complexity yields rich rewards in terms of understanding diverse socio-economic phenomena.

What is your take on visualising some of the new and non-traditional data inputs that are currently available? We are also witnessing a movement towards a more “open data” architecture driven by the government, for instance through data.gov.in, which provides vast volumes of public data. What is your take on this?

Any tool, new or not, is only useful when it provides a clear perspective on the questions that the user or client is concerned with. Such tools include dashboards, which allow the user to ‘cut’ the data in various ways and can be a very useful way of exploring complex datasets. They also include statistical techniques which, if used with knowledge of the underlying assumptions, can throw light on patterns within the data that are not immediately apparent.

As for the movement towards open data, this is a great move and data.gov.in certainly stands out among the range of government sites, in terms of ease of use. But individual departments and ministries should have a clear policy on releasing data at periodic intervals. Until this happens, the open data policy of the government will be implemented only partially.

With the increasing digitisation of public services, citizen level data trails are now being created and captured in government-created databases. What, if any, of these kinds of data do you think should be in the public sphere and what are the measures to be taken for data protection and privacy?

Data at the level of individual citizens, such as names and mobile numbers, is obviously highly sensitive and should not be released, except under restricted circumstances (e.g. to researchers, with the stipulation that they release data only in aggregated form). If released to the public, the data must be anonymised in a way that makes it difficult to trace the original identity of the citizen. Such data can also be released in more highly aggregated ways – e.g. at the level of a tehsil or district.
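
A rough Python sketch of the two safeguards mentioned here – replacing direct identifiers and releasing only aggregates – is below. The fields, salt and thresholds are hypothetical, and hashing identifiers is only pseudonymisation, not a complete anonymisation scheme.

```python
# Hypothetical sketch: pseudonymise identifiers, then release only district aggregates.
import hashlib
import pandas as pd

records = pd.DataFrame({
    "name":     ["A. Kumar", "B. Devi", "C. Singh"],
    "mobile":   ["9800000001", "9800000002", "9800000003"],
    "district": ["District X", "District X", "District Y"],
    "amount":   [1200, 800, 950],
})

# Replace direct identifiers with salted hashes (pseudonymisation, not full anonymisation)
SALT = "replace-with-a-secret-salt"
records["person_id"] = records["mobile"].apply(
    lambda m: hashlib.sha256((SALT + m).encode()).hexdigest()[:12]
)
records = records.drop(columns=["name", "mobile"])

# Public release: only district-level aggregates, suppressing very small cells
release = records.groupby("district").agg(
    persons=("person_id", "nunique"),
    total_amount=("amount", "sum"),
).reset_index()
release = release[release["persons"] >= 3]   # crude small-cell suppression
print(release)
```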

5
Aug

In Conversation with Kalpana Pandey, CEO, CRIF High Mark

In this blog post we feature a conversation between Bama Balakrishnan, CRO, IFMR Capital and Kalpana Pandey, CEO & Managing Director, CRIF High Mark. CRIF High Mark is one of the four credit bureaus that operate in the country.

What are your thoughts on the evolution of Credit Information Companies in India and your experience in this space?

Kalpana Pandey

India had one Credit Information Company (CIC) from 2004 till 2010. CIC(R)A 2005, the legislation governing CICs, came into force in 2006-07. Pursuant to that, CRIF High Mark (then High Mark) along with two other companies received licences to operate as CICs in 2010.

CRIF High Mark was founded with a vision to create a comprehensive and all-inclusive credit bureau. The Andhra Pradesh microfinance crisis of 2010 underscored the need for bureau coverage for the sector, so we took the opportunity to partner with various participants of the microfinance industry and launched India’s first credit bureau for microfinance lending in March 2011. We now operate the world’s largest microfinance database, and over the past five years we have supported this sector with reliable information on over 10 crore credit decisions.

CRIF High Mark now is a full-service credit bureau providing coverage for all borrower segments – group lending, individual lending and MSME/Commercial lending. We now work with not only MFIs but also Banks, NBFCs, Housing Finance Companies, RRBs, Coop Banks etc.

Since last year, RBI requires all financial institutions to report to all credit bureaus. This means CIBIL must be getting microfinance data, and High Mark must be getting data from banks, HFCs, etc. Have you seen good progress in this process?

In addition to microfinance data, as mentioned earlier, CRIF High Mark was receiving data from major Banks, NBFCs, HFCs, Coop Banks, and RRBs on retail, agri, rural, MSME and corporate lending even prior to 2015. We were missing data from a few players. Since the RBI Jan 2015 notification, these gaps have been filled for us. Similarly, CIBIL must be getting microfinance data.

All systemically important institutions are now compliant with this regulation and are sharing data with all credit bureaus. The smaller cooperative banks and NBFCs are gradually signing up as members with CICs. Data sharing will be the next step for such smaller institutions. Each bureau deploys filter criteria while uploading data from the files shared by member institutions into its database. Our technology helps us differentiate ourselves in absorbing the maximum data out of those files and making sense of weaker data.

Bama Balakrishnan

Our understanding is that compliance in this regard is still catching up – institutions are going through the process of registering with all the bureaus and would in due course commence sharing data regularly.

The other part which I wanted to understand is, what is your sense of the readiness of credit bureaus in being able to process this kind of information – such as the JLG data – which is different from traditional retail lending?

Where do you think the gaps might be even as institutions start reporting – do you think more needs to be done in terms of capacity building, technology and infrastructure by the bureaus to make sure that they can actually use this information, so that irrespective of which bureau the lending institution pulls the report from, it would still get the full picture of indebtedness?

All large players are members of all credit bureaus and have shared data with all CICs. The smaller players are registering with the bureaus; however, these players do not have the IT wherewithal and have a higher dependence on their IT vendors – they have challenges with sharing data, and will share data once their vendors are able to help them. Our Data Ops team hand-holds such smaller institutions through this process, supporting them with mapping of data, file structure and best practices.

As regards data sharing, RBI has standardized the data sharing formats, and all institutions are expected to share data in these formats. CRIF High Mark’s data format was chosen as the format for group lending. Minor additions have been made to this format to cover reporting of SHG data. All these changes are now made in consultation with a Technical Working Group formed by RBI for this purpose.

One has to realise that the input data is becoming similar for the CICs, but each bureau brings its own differentiator through its underlying technology and products. Now CICs give Credit Score in the same range (300-900) to make it easier for people to understand.

In addition, all of us expect an overlap between SHG and JLG — we found 40-45% overlap between JLG and SHG customers in tests we did on data from a few banks. Banks, NBFCs and MFIs are entering into each other’s territories. So as a user, one should be able to get a comprehensive view of the customer in a single report. Though the group lending format is different from the retail lending format, we are able to process these independently but bring them together to give one picture of the customer. We already have products that provide such a full picture of indebtedness of a customer across SHG data, JLG data and individual lending.

We have seen some KYC issues with microfinance customers where we see customers using multiple KYCs with different institutions. We know that bureaus may use relationship information to triangulate data but this may be approximate. With UID coming in and MFIN’s push to ensure authentication through UID, do you see the situation improving in terms of better triangulation of the identity of a customer?

When we launched our bureau for the microfinance segment, objective identifiers such as Voters ID etc were not consistently captured. Even names and addresses were not captured completely. Our technology worked on whatever was available (including relationship names) to bring out most relevant results.

Over the past five years, data quality has crossed many levels – IT systems are better, processes have improved, field staff is sensitized — objective KYC IDs come in now but are largely limited to Voters ID, Aadhaar and Ration Card. MFIN’s push to seed Aadhaar in every new loan being disbursed is seeing fantastic results.

Fragmentation of a customer’s data across multiple KYC IDs is not unique to the microfinance sector. It is observed in the retail lending space as well, since PAN, Passport and Driving Licence are all valid KYC IDs. Our bureau systems are tuned to handle these situations to provide comprehensive and accurate credit reports.

The technology is not biased towards only objective KYC IDs; it makes sense of the data available across all fields – even single names, partial Voter IDs, partial addresses etc. – to bring out the best possible matches. Having said that, it doesn’t mean we would not use KYC IDs – a better KYC regime will certainly help. Today, we provide millions of credit reports every month with very minimal errors. We constantly invest in fine-tuning our technology, learning from the newer data that we see and the reported errors. Much better availability of a consistent KYC ID (such as Aadhaar) will minimise these errors further.
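
As a toy illustration (this is not CRIF High Mark's actual matching logic), the Python sketch below shows the kind of decision being described: treat two records as the same person when a strong ID matches exactly, otherwise fall back to a weighted similarity over name, address and relationship name. The fields, weights and threshold are invented.

```python
# Hypothetical record-matching sketch for partially captured KYC data.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def likely_same_person(rec1: dict, rec2: dict) -> bool:
    # An exact match on a strong ID (Aadhaar, Voter ID) decides immediately
    for key in ("aadhaar", "voter_id"):
        if rec1.get(key) and rec1.get(key) == rec2.get(key):
            return True
    # Otherwise fall back to a weighted score over softer fields
    score = (0.5 * similar(rec1.get("name", ""), rec2.get("name", "")) +
             0.3 * similar(rec1.get("address", ""), rec2.get("address", "")) +
             0.2 * similar(rec1.get("relation_name", ""), rec2.get("relation_name", "")))
    return score > 0.8   # illustrative threshold

a = {"name": "Sita Devi", "address": "Ward 4, Village X", "relation_name": "Ram Kumar"}
b = {"name": "Seeta Devi", "address": "Ward 4, Vilage X", "relation_name": "Ram Kumar"}
print(likely_same_person(a, b))   # True despite spelling differences
```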

One question on the recent trend of people trying to build platforms for lending using traditional and non-traditional data and algorithmic underwriting. Do you think the Indian market is ready for these business models and, secondly, do you see this transforming the microfinance lending business? We know that there is not much data, but do you think there is a possibility that, with credit bureau data building up and more of it becoming available, people could think about such a business model where there is not much data available other than the payment track record? Is that a possibility you have seen people starting to explore?

Our country has so far seen the use of only traditional data (formal lending data). There is definitely an opportunity in exploring the use of non-traditional data such as telephone or mobile data and utility bill payments. Many countries have seen the benefits of expanding the coverage of the population within the bureau through the use of such non-traditional data. There have been studies highlighting linkages of such non-traditional data, especially payment data, with an individual’s credit behaviour.

There was an RBI Committee to evaluate the possibility of getting utility data and telecom data into Credit Bureaus. The committee concluded that Telephone use data is meaningful, would help in the cause of inclusion and it should be brought into the credit bureau system. Some legal and regulatory amendments are required to enable telecom service providers to share data with CICs.

Many FinTech start-ups are bringing in novel ideas around providing alternative means of scoring customers and also around algorithmic lending. The success of these models is yet to be seen, but such new ideas are certainly driving more innovation in lending-related technologies. Most of these businesses are digital in nature, so they may not really be targeting the same customers as Microfinance institutions.

We are closely following these developments. RBI is also keenly watching the space, and has already taken initial steps in understanding the various models.

Looking forward, in the backdrop of small finance banks and payments banks that are about to enter the system and the increasing thrust towards financial access driven by various initiatives both by the regulator and the government, as a credit bureau, how do you see the road ahead and some of the key challenges that bureaus in general have to contend with going forward?

Small Finance Banks are none other than existing microfinance players who will now be diversifying as banks. They are not new to the credit bureau environment, but will now expand beyond microfinance lending and also beyond lending. They will now require credit scores for individuals as well as for MSMEs.

For the microfinance sector, the code of conduct guidelines may require a revisit. They currently apply only to NBFC-MFIs, but since Banks and other players are also competing in the same space, we suggest making these guidelines applicable to all players in direct JLG lending. This assumes more importance given that many MFIs will become SFBs over the next few months.

Payments Banks are not into lending, so right now they cannot work with a credit bureau. But the payment data generated by Payments Banks is also a strong source of alternative data. Payment data from Payments Banks can expand information coverage across a very large expanse of the population, and could certainly supplement credit underwriting.

Credit bureaus will have to constantly evolve to keep up with these changes and to remain relevant. Newer businesses will have newer needs. Alternative data will need to be merged with traditional data to make better sense of the customer. The road ahead seems very exciting.

9
Jun

“Open data improves the situation from a data privacy perspective” – Interview with S. Anand – Part 2

By Nishanth K, IFMR Finance Foundation

This post is a continuation of our earlier post about a conversation with S. Anand, Chief Data Scientist, Gramener. The earlier post covered the fundamentals of a good data visualisation and the nuances one has to keep in mind while undertaking such an effort. This post covers the challenges of public vs. private datasets, data privacy and the open data movement.

In addition to working with individual organisations on data specific to them, Gramener has taken an interest in disseminating data relating to socio-economic issues such as Parliament elections, socio-economic census etc. What have been some of your personal experiences in working on public datasets? What are the challenges that you face when working with public datasets as opposed to private/organisation datasets?

Let’s talk about challenges: Now the good part about open and public data these days is that they are reasonably well structured. When comparing private and public datasets, there are three commonly discussed issues:

Data cleanliness/quality: There are always data collection issues, but I don’t see this as a private/corporate versus public data source issue; I see it as a manual versus automatic data collection issue. If I had to go to a bank in which account balances and transactions were entered manually, it would be just as messy in a private organisation as in a public organisation. Whereas in institutions where data is collected through a system, it is obviously much better, irrespective of whether the institution is private or public. So it’s merely a question of: to what extent has automation entered a domain? Therefore, data cleanliness is not an issue that distinguishes public data from other kinds of data.

Availability of data: This is commonly brought up too. There are a lot of people who say that public data is harder to get one’s hands on. In my experience, private data is no easier. When we are called in to do a project for a private organisation, we ask if particular pieces of data are readily available. The answers you get are not particularly different from the answers you get from a government, which are:

  • We don’t know if data is available
  • If it is available, we don’t know where it is or who has it
  • If it is available, then we don’t know what format it is in

Oftentimes, we have been asked by both government and private organisations to scrape their own websites. So, availability of data is an issue that doesn’t distinguish private and public data either.

Data linkages: Something that does distinguish the two is the linkages. A lot of public data is not standardised by the entities that use or provide it. For example, is there a standard district code? The answer is ‘yes’ – not one, but several hundred standards. Is there a standard school code? Is there a standard village code?

Every organisational unit in the government tends to have a say in what standards it picks, and very often they pick differing standards. These differing standards can be seen even within organisations. For example, if I go to NCERT to collect information about the marks of students and information about the infrastructure in a school, these two pieces of data cannot be merged because there is no common set of IDs. It’s only now that this need for standardisation is coming in, because there have been several grassroots initiatives around standardisation. So, the single largest problem in working with public data is that it is often difficult to link across datasets.
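
To make the linkage problem concrete, here is a small, hypothetical Python sketch of the workaround analysts are often forced into: joining two datasets about the same schools on a normalised (district, school name) key because no shared ID exists. The data and column names are invented.

```python
# Hypothetical sketch: joining two datasets that lack a common ID.
import re
import pandas as pd

def normalise(s: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace so names can act as keys."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", s.lower())).strip()

marks = pd.DataFrame({
    "district": ["District X", "District X"],
    "school":   ["Govt. High School, Ward 4", "St Mary's School"],
    "avg_marks": [61.2, 72.5],
})
infra = pd.DataFrame({
    "district": ["district x", "DISTRICT X"],
    "school":   ["Govt High School Ward 4", "St Marys School"],
    "has_library": [True, False],
})

for df in (marks, infra):
    df["key"] = df["district"].map(normalise) + "|" + df["school"].map(normalise)

merged = marks.merge(infra[["key", "has_library"]], on="key", how="left")
print(merged[["school", "avg_marks", "has_library"]])
```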

Many governments are moving towards an “open data” culture by making datasets publicly available in order to increase transparency (For example: data.gov.in). What are your thoughts on the impact of these movements and how crucial can visualisations be in making sense of such large volumes of publicly available government data?

Open data movements are clearly good in the sense that you now have access to more data and more can be done with it, barring privacy concerns. It also allows the government to outsource, or rather crowd-source, some of its analysis. Why should I have to create a team that does analysis when I can get the public to do the same work? That certainly helps.

How does visualisation help? It can help us understand better some things that are not obvious. To take an example: we were recently working with the Ministry of Finance on a project to help them understand the budget in a more intuitive way, in terms of under-spend and over-spend.

So we put together a standard tree-map kind of visualisation, where each box represents the size of a department’s budget and the colour represents the degree of under-spend or over-spend.

[Visualisation: tree-map of budget heads – box size shows the allocation, colour shows under- or over-spend]

It is easier to see that one department is spending considerably more than others and some departments spend considerably less. You can now break it down into various sub-departments to see where exactly the problem is coming from, move back and so on. These kinds of explorations make it easier to argue and debate and we are no longer stuck in a situation where you have to understand raw numbers. The task is now simplified to looking at something and focusing on the conclusions. It becomes a lot easier to see what was otherwise a much more complex or intractable item. These visuals also help you explore the dataset in a much more intuitive way.
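
A minimal sketch of this kind of tree-map, on invented figures, is below (using the squarify library with matplotlib): box area encodes the allocation and colour encodes the ratio of actual spend to budget.

```python
# Hypothetical budget tree-map: area = allocation, colour = spend ratio.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors
import squarify  # pip install squarify

departments = ["Dept A", "Dept B", "Dept C", "Dept D", "Dept E"]
allocation  = [500, 300, 150, 120, 80]        # budget size -> box area (illustrative)
spend_ratio = [1.20, 0.95, 0.60, 1.05, 0.85]  # actual / budgeted -> box colour

norm = mcolors.Normalize(vmin=0.5, vmax=1.5)
colours = [cm.coolwarm(norm(r)) for r in spend_ratio]

squarify.plot(sizes=allocation, label=departments, color=colours, pad=True)
plt.axis("off")
plt.title("Budget allocation (area) vs spend ratio (colour)")
plt.savefig("budget_treemap.png", dpi=150)
```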

Another example: We were working on a semi-public dataset along with NCERT on the national achievement survey. The question here was: what influences the marks of students? Can we identify the social and behavioural characteristics that have an impact on the child’s marks? This was done on a reasonably large sample (100080 children) across the country studying in class 8.

[Visualisation: influence of various factors on students’ marks] The complete analysis is available here.

Look at the table that shows a variety of factors – for example gender, age, mode of education, reading books etc. – and the influence each has on the total marks as well as on marks by subject. Let’s take the number of siblings as an example. The table says the number of siblings has a 2.4% impact on the total marks. How does that break up? What it shows us is that children with no siblings score more marks than children with one sibling, who in turn score more than those with two siblings, and then three. This does not necessarily indicate that having siblings hurts a child. It’s probably just a correlation with various other economic factors. But we do know that the extent of influence that the number of siblings has is very real. We can start looking at the overall influence of each of these parameters to see which has a larger influence, which enables us to explore these relationships in more detail.

For instance, one of the things that we now know is that watching TV is not such a bad idea. In fact, if I look at the overall impact of watching TV across the various subjects, it shows us that reading ability actually improves if children watch TV every day. On the other hand, mathematics ability drops dramatically if they watch TV every day. It tells us that watching TV roughly once a week is the sweet spot for scoring well in most subjects. On the other hand, if we look at how much playing games makes a difference, it turns out to be almost the exact opposite. Playing games improves mathematical ability considerably but actually hurts reading ability a bit. Of course, never playing is a bad idea. The extent to which you play has a different impact on different subjects.
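
A small sketch of this kind of ‘cut’, on made-up numbers rather than the actual NAS figures, is below: mean marks by TV-watching frequency, plus a crude spread-of-group-means measure of how much the factor matters.

```python
# Hypothetical exploration: how a behavioural factor 'cuts' marks by subject.
import pandas as pd

students = pd.DataFrame({
    "tv_frequency": ["never", "once a week", "every day", "once a week",
                     "every day", "never", "once a week", "every day"],
    "reading":      [48, 55, 58, 57, 60, 45, 56, 59],
    "maths":        [52, 58, 41, 60, 39, 50, 61, 40],
})

# Mean marks per group, i.e. how the factor 'cuts' each subject
by_tv = students.groupby("tv_frequency")[["reading", "maths"]].mean()
print(by_tv)

# A crude measure of the factor's overall influence: spread of the group means
influence = by_tv.max() - by_tv.min()
print(influence)
```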

This sort of analysis would not be possible if the data didn’t come out in the open. Even if this kind of data is available in the open, it requires a good visualisation for it to reach a wider audience.

With the increase in the amount of data being collected and shared by various organisations, what are your thoughts on data privacy?

My thought on this is that any data collection and capturing mechanism makes data privacy a serious issue. Open data, on the other hand, improves the situation from a data privacy perspective. I will give you an example. Take land records: who owns a particular piece of land is very useful information. Also consider data from the voter registry: who is eligible to vote is very useful information – at least for certain sets of people. Considering that both these datasets are available to the government and not available to anyone outside the government, it means that the government has more influence and power than the public. This effectively means that, to a certain extent, the incumbent party – or anyone who manages to get access to the data within the framework of the law – has more power than someone who doesn’t have access to the data. The data has existed ever since people began writing down voter rolls with paper and pen.

Technology is what is raising the privacy question. Today we are living in a world where there is incredible power and technology: we have the ability to track where a person’s mobile is at any point in time, who they are calling and so on. This information is certainly available to the ISPs and to any party that has the ability to subpoena it. Open data is merely making it available to a wider audience. So I see open data more as a leveller that at least makes the lack of privacy more uniformly distributed.

When data is open, we have to enforce controls on it, and enforcing them in a reasonably uniform way means that the discussion is brought out into the open. Earlier, what was a privately debated and privately enforced policy is now going to be a publicly debated and publicly enforced policy. The fact that open data is bringing that discussion out into the open and also making data access more uniform is good. The privacy issues stem not from the data but from the existence of the technology itself. The case of the NSA and Edward Snowden has shown us that there exist authorities who have the ability to extract the data. The question now is how one governs these authorities. This discussion becomes easier if you say that the same data is potentially, in some form or the other, available to anyone.

Aside from data visualisation as a method of disseminating information, you have also been recently talking about the emergence of Data Art. Can you tell us more about this?

It is certainly at a nascent stage. Data art, if I have to term it, I would say it is something that uses data to create art without purpose or any specific objective. This is relatively new and people are dabbling around with it in the same way that people have been dabbling with new art forms in the past.

People who were purely focusing on aesthetics are now paying more attention to data and how they could utilise it. For example, design schools such as NID and Shrishti are now talking about data visuals. We are also seeing that clients who were earlier focused on aesthetics are now interested in moving towards infographics. Media is a classic example: they were earlier focused on stories and narratives but are now moving towards infographics and, in a few cases, towards data-driven visual representations.

On the other hand, people who were focusing a lot on the hard content and numbers are gently moving towards more visual representations as well. Financial analysts are now saying that they would like to see data in a visual representation, and companies are wondering if they can make their annual reports more pleasing purely from an aesthetic perspective. So right across the board there is clearly a growing appreciation of this intersection between aesthetics and data.

To give an example of data art, consider this visual:

[Data art: one arc per song, with strips showing each song’s spectrogram]

This is drawn directly from data, where each arc represents one song. The arc tells you the length of the song, and a complete arc represents a total of six minutes. Within that, the strips represent the spectrogram of the song, effectively the frequencies and the beats. So Queen’s “We Will Rock You”, which has distinct beats in between, has a very different structure from Eric Clapton’s “Wonderful Tonight”, which is remarkably uniform and homogenous. One could argue that this could be useful to understand the structure of a song, but in reality it has no purpose. It was created because it could be done. In some sense, art is done because you can do it and because you feel good while you are doing it, and not because there’s an audience in mind whose objective you want to satisfy.

6
Jun

“Data can tell different stories, important is, which is the one you want to tell?” – Interview with S. Anand – Part 1

By Nishanth K, IFMR Finance Foundation

The role of data in shaping public perception and informing policy decisions cannot be emphasised enough. Given the shift towards increasing digitisation and the influx of data points that are now available in the public domain, it is fundamental that one has an improved understanding of the nuances of handling large volumes of data and a better appreciation of the different facets that accompany it. Beginning with this post, we intend to interview prominent experts in India and abroad who, through this series, will help provide an insight into the process of collecting, collating and interpreting data and the ways in which these could be effectively disseminated. The series will also cover the technical, legal and design aspects that make data, and the accompanying challenges, a fascinating subject with both universal appeal and relevance.

In the first post of the series we feature an interview with S. Anand. Anand is the Chief Data Scientist at Gramener and an accomplished expert in handling large datasets, having worked extensively across public, private and government entities. He uses insights from data, communicates these as visual stories, and is a well-known figure in the Indian data science community. In this two-part interview he shares his thoughts on data visualisation and the different aspects that one has to keep in mind while undertaking such an effort.

Note: Credit to Kaliamoorthy A, IFMR Finance Foundation, for the transcription.

What are some of the key trends that are fundamentally driving data visualisation as a field globally? Is there something unique about how it is shaping up in the Indian context?

A strong trend that I see is that a lot of automation is happening, which can be captured in three aspects:

  1. Visualisations are getting automated: It is no longer the case that people are manually creating data visualisations one at a time. We are steadily moving away from infographics, which are custom drawn and used to convey data, to more automated, data-driven visualisations where the design itself comes from the data.
  2. Analysis is getting automated: There are enough patterns of analysis that work across datasets, irrespective of the domain and in many cases quite independent of structure as well, that lead to patterns of insight. For example, questions such as “what influences what?”, “what do the outliers look like?” and “what items/elements of the data are similar?” seem to be applicable across almost any dataset. Once we have the data in a moderately standard structure, irrespective of the domain, people are able to apply automated techniques to analyse the data.
  3. Story-telling is also getting automated: There are enough ways in which one can communicate insights around a certain pattern. To give you an example: the work below is a simple narrative visualisation that at the same time has elements of automation. In this example the reaction of the stock market to the budget announcements is captured – for instance, how does the stock market move on the day after the budget in various years? If I pick, say, the year 2008, you can see that banking and financial services didn’t respond too well, whereas oil and gas was neutral and telecom reacted well.

[Visualisation: sector-wise stock market reaction on the day after the budget, with automatically generated annotations]

There are some annotations on the left of the visual. All of these are generated automatically using what are fairly obvious templates that one can see working behind the scenes. Once you have enough of these templates, and if you decide to show some of them only when they make an interesting statement, then you’re approaching what is effectively borderline artificial intelligence, in the sense that that’s what a human would do. One would look for a series of patterns and tell a story based on that. So narration of the story is also getting automated in some way. This is probably the strongest trend that I see in the data visualisation space.
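
To illustrate what such a template might look like (a hypothetical sketch, not the logic behind the visual above), here are a few lines of Python that emit an annotation only when a sector’s post-budget move crosses an ‘interestingness’ threshold.

```python
# Hypothetical template-driven annotation: only "interesting" moves get a sentence.
from typing import Optional

moves = {"Banking & financial services": -2.8, "Oil & gas": 0.3, "Telecom": 4.1}
THRESHOLD = 2.0  # percent; smaller moves are not worth a sentence

def annotate(sector: str, pct: float) -> Optional[str]:
    """Return an annotation sentence only if the move crosses the threshold."""
    if abs(pct) < THRESHOLD:
        return None
    direction = "rallied" if pct > 0 else "fell"
    return f"{sector} {direction} {abs(pct):.1f}% on the day after the budget."

annotations = [a for a in (annotate(s, p) for s, p in moves.items()) if a]
print("\n".join(annotations))
```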

In the Indian context, the market for data visualisations is growing as more people and organisations are becoming aware of data visualisation as a tool for effective communication. There are many people from India who are contributing very strongly in this sphere. For example, Vega is a famous data visualisation library and one of the chief contributors is Arvind Satyanarayan. So, there is considerable work that’s happening in India which will hopefully bear fruit in the future.

What are the fundamentals of a good visualisation effort? Are there any nuances that someone has to keep in mind while balancing breaking down complex data and its visual presentation?

Every data visualisation is a function which takes in data and transforms it into certain visual attributes. Therefore, all the principles of standard design apply. The reason I mention standard principles of design is that these are often forgotten when one starts looking at programmatic design. We spent a lot of time putting together work on what constitutes “not bad” design, in the form of a poster – a simple one-page PDF – that explains how one should go about creating better design.

Beyond that, in terms of how one goes about getting a good data visualisation, a good resource is Noah Iliinsky’s book Designing Data Visualizations. In particular, one table from the book shows the various encodings that can be used to transform data into a set of visual attributes.

[Table: visual encodings, from Designing Data Visualizations]

This list by and large covers most of the possible visual attributes one can map data to. Given some data, I can represent it in terms of a position on the X axis and Y axis. I can represent it as a measurement of length, size, angle, colour or a variety of other parameters. Different types of data work well when they are assigned to certain encodings. For example, for representing ordered data, colour encodings aren’t always best, as it’s not clear whether yellow is larger or smaller than blue; whereas colour encodings work really well for categorical data.

These are also ordered roughly in decreasing order of effectiveness. We are much better at discerning position than we are at discerning colour. So if I have to take a column of data and use it to represent a visual in an accurate way, then position is our best option, i.e. a scatter plot is usually an excellent choice. On top of that, if I wanted to add another parameter, I could, for example, use area as one dimension and position as another. By adding different parameters one can create new types of charts.
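
A minimal sketch of this layering, on invented data: two columns mapped to position and a third mapped to area, turning a scatter plot into a bubble chart.

```python
# Hypothetical bubble chart: position encodes two variables, area encodes a third.
import matplotlib.pyplot as plt

literacy   = [62, 71, 78, 85, 90]        # x position (illustrative values)
urban_pct  = [20, 28, 35, 48, 60]        # y position
population = [5, 12, 8, 20, 3]           # in millions -> bubble area

plt.scatter(literacy, urban_pct,
            s=[p * 30 for p in population],  # 's' is marker area in points^2
            alpha=0.6)
plt.xlabel("Literacy rate (%)")
plt.ylabel("Urban population share (%)")
plt.title("Position encodes two variables, area encodes a third")
plt.savefig("bubble_chart.png", dpi=150)
```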

Broadly there are two ways in which we can evaluate a visualisation: aesthetics and functionality. This table also gives us a way of evaluating whether a particular visualisation is better than another functionally speaking. We can make sure that we are using an optimal encoding for the data and at the same time not violate any of the basic design principles.

It doesn’t talk about aesthetics though, which is a different topic altogether. While we can easily argue that a pie chart is functionally worse than a bar chart, that is not as easy an argument to make from an aesthetics standpoint. In terms of functionality, we know what the rules are, as explained above, whereas for aesthetics we don’t know what the rules are – at least I don’t!

How would you characterise the demand for Data Visualisation driven Journalism as opposed to the more traditional forms such as anecdotal evidence and story-telling?

A happy development in recent times is that the role of data itself in journalism is growing. Visualisation in journalism has always existed in many forms. Infographics, as a specific aspect of data visualisation in journalism, have always been there, especially in scientific journalism. The Scientific American data visualisation team, for instance, is spectacular. The kind of visuals they come up with are amazing. So, it has always been there; however, it is gaining more prominence now.

I think that there is a huge market for stories. Ultimately people consume stories, and visuals are useful. In fact, it’s quite interesting if you look at any newspaper or magazine, remove all the text and look at just the pictures, and try to see if you can guess what the story is about. More than half the time you will find that you actually can’t figure out what the story is about from the picture. The picture is not there at all to explain the story or to elaborate on it. It exists, by and large, to make things look prettier.

At a recent conference, I pulled out a random story from a newspaper which had a picture that was an aerial view of the city. Just looking at that picture, the story could have been about anything. It could have been about traffic, about riots in the city or about retail growth in the city. It happened to be about how the construction industry is slumping. The point is that the story is what is selling from a journalism perspective, not the visual. Therefore, the market, while growing, will probably saturate at something smaller than what the data visualisation enthusiast hopes for. On the other hand, data-driven journalism, irrespective of the visualisation, has the potential to scale a lot. That’s because data is a source for stories.

Let me show you an example of a considerable data visualisation story that is scalable. Just go to google.co.in and type in the words “how to”. Now you will get four suggestions:

How to kiss, how to download YouTube videos, how to lose weight and how to reduce weight.

In some sense, this represents India’s priorities today, at least as far as Google search is concerned. This is a source of insight in itself. What if we could capture all of this, put it into a list, and ask: if you start searching for “how to”, what are the recommendations in different countries? Can it provide startling insights into what different countries are thinking about against various kinds of questions? This is a point-in-time analysis that keeps changing every day, and such a tool gives you an insight into the ever-changing minds of geographies over time.

[Visualisation: “how to” search suggestions across countries]

This is not a story but a source for several stories, which is what makes data journalism powerful. So, I believe that data visualisation in journalism has been there, is growing and will grow a little more. But data itself in journalism has much more potential.

In the context of trying to visualise and disseminate data which would impact, let’s say, public policy, how important is it to keep in mind the target audience for a visual?

The context is at least as important as the data itself. A piece of data can explain very different things to different people. Similarly, the same inference can also come out from very different pieces of data. The context and data together is what drives good data visualisation.

An example of how the context can completely change the structure of a visual is our work on the 2014 election results page. When we came up with the initial sketch, we thought we had arrived at a reasonably sophisticated visual that covered various aspects in a dashboard most people would like. However, there were several problems with this. With subsequent feedback and multiple iterations, the final version was completely different from the visual we had in the beginning.

Even though we thought we had a pretty well-designed original visual, the contextual feedback helped create a visual that addressed the questions that the audience would like answered. The same data can tell different stories; what is important is which one you want to tell. And that depends on the audience and what they find interesting.

Part 2 of the interview will cover aspects of challenges in public vs. private datasets, data privacy and open data movement.

16
Apr

Interview with Justice Srikrishna, Chairman, FSLRC

As part of our blog series on the FSLRC report, we will be conducting a series of interviews with key experts to get their perspective on the report and its implications. Below we are posting an email interview that we did with Justice (Retd.) B. N. Srikrishna, Chairman of the Financial Sector Legislative Reforms Commission (FSLRC). Justice Srikrishna is a former Judge of the Supreme Court of India (2002-2006), Chief Justice of the Kerala High Court and Judge of the Bombay High Court. Previously he has been Chairman, Sixth Pay Commission of the Government of India and Chairman, Committee for Separate Telangana before taking over as Chairman, FSLRC.

Q.The future of financial market regulation in India will be determined by two needs – the need for greater financial innovation and the need to protect consumers from the hazards of excessive financialisation of markets. How has the Commission thought of this while recommending an overall regulatory architecture for India? How will the recommended structure foster greater innovation while protecting the rights of the consumer?

The Commission was of the view that all financial laws and regulators are intended to protect the interest of consumers and that this should be the primary focal point of our thinking. Hence, a dedicated forum for relief to consumers and detailed provisions for the protection of unwary customers against mis-selling, overselling, underselling, wrong advice, defrauding by small print etc. have been recommended. There is a marked shift from the traditional thinking of buyer beware, as consumers in our country, even if otherwise educated, are not financially savvy. The overall regulatory architecture is designed around this philosophy. Both regulators have to make regulations to carry forward and implement this philosophy at the ground level. Apart from this, each regulator must be clear and upfront about why it is regulating and what it intends to achieve, get inputs from the regulated entities and the public at large, and put out a cost-benefit analysis in the public domain. The ‘all is well’ and status quo attitude, and the resistance to all innovation through an overkill of excessive caution, have also been sought to be addressed.

Q.What changes do you expect the recommendations to bring to the mission of deepening financial inclusion in the country? What measures have been suggested to ensure that there is a more equitable development of financial markets and enhanced participation of marginalised groups in the country?

The burden of financial inclusion cannot always be thrown on the regulated entities. If financial inclusion means development of the regulated entities to expand the scope of their activities so as to benefit larger numbers, that is legitimate scope of the regulator and can be done by the regulators. If financial inclusion is intended to achieve a social purpose that is not the legitimate scope of the regulator, then it becomes a part of the government’s policy and the government is obligated to compensate the regulated entities which have to carry the additional cost of such financial inclusion. If the government requires the regulators to regulate to this end, then it has to bear the additional cost. That way, there is financial inclusion, but that cost is not solely thrown upon the regulated entities. That comes out clearly in the recommendations made.

Q.FSLRC’s approach to consumer protection represents a paradigm shift from the approach traditionally followed in India (caveat emptor). What is the rationale behind this shift in approach and placing consumer protection at the heart of regulation?

The rationale is that the consumer in India is still unaware of his own rights, because he is ignorant, illiterate or uneducated. Even if literate and educated, he is trammelled by the traditional caveat emptor approach. To some extent, there is consumer awareness in the general sectors and the consumer fora are alleviating the situation. But financial products are much more complex than other consumer goods, and therefore it was felt that a consumer forum dedicated to dealing with the problems and grievances of consumers of financial products was necessary at the trial and appellate stages. That certainly marks a new step forward.

Q.What transitional issues do you foresee if the recommendations of the Commission are accepted? What safeguards need to be kept in place to ensure that the smooth functioning of the market is not disrupted during this period?

Transitional issues are many. First, the status quo mindset has to change. Second, the notion that there is no need to fix the system as it is not broken has to be abandoned, because what is envisaged by the Commission is not routine repair work, but the creation of an ethos necessary to be frontrunners in the world economy at some time in the not too distant future. Third, the existing regulators will oppose the recommendations as they are bound to see them as affecting their turfs. Fourth, setting up the UFA (Unified Financial Agency) with requisite qualified personnel will pose a formidable challenge. Finally, recruiting suitable persons with the requisite knowledge to man the consumer fora and the FSAT (Financial Sector Appellate Tribunal) will also be a difficult task. All these are no doubt challenges, but they can be overcome with determination and persistence. Else, we as a country must give up the ambition to be leaders in the world economy some day and continue with our ‘all is well’ attitude. As a high functionary in the finance ministry told me in Australia, “It is better to make changes in peacetime and not when war has broken out”.

Q.The Indian Financial Code written by the Commission itself represents a shift in the way the drafting of laws has been done in the country – it is written in simple English and is immensely readable. What factors motivated the Commission to break away from the traditional over-use of legalese, and how difficult was it to ensure a high quality of drafting?

We adopted our existing legislative language from the British and continue to use it even now, although they have abandoned it. Reading their latest financial Acts, we were struck by the fact that they are written in simple English. While they have advanced in this regard, we continue to use unnecessarily complicated language and make life difficult for everybody. We modelled the language on present British law.

Q.The Commission has proposed a Unified Financial Code for India – what ramifications does it have for the other financial sector laws in India? Will they become null and void or will they need to be suitably amended? Can you throw more light on the law-making process that will ensue if the recommendations are accepted?

The Commission has recommended some amendments to existing laws, some wholesale repeals and some new legislation. These changes will have to be carefully brought about accordingly. The remit of the Commission was to suggest a financial architecture that can withstand the pace of development that is envisioned. It is envisaged that by the 2050s our economy will grow into a 35 trillion USD economy. The big issue is: can our creaking financial apparatus keep pace with it? In the view of most of the experts who interacted with the Commission, the answer was in the negative; hence the bold and – perhaps to unaccustomed thinking – brash attempt to design a new financial architecture, taking wisdom from several experts in this country and abroad.