Interview with Text mining/NLP expert – Asad Ahmad


How is text mining different from traditional data mining?

There are some differences between the two. Text mining is the process of extracting useful patterns from textual data, which is unstructured or semi-structured, whereas data mining mostly works with structured data. Data mining techniques can also be applied to text once it has been transformed into a structured form. In text mining, a major share of the effort goes into structuring the data and understanding its semantics and grammar. Text preprocessing steps also differ from the cleaning and processing steps used for structured data.
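As a rough illustration of what "transforming text into a structured form" can mean, the sketch below builds a small document-term matrix using only the Python standard library. The sample sentences and the function names (`tokenize`, `document_term_matrix`) are invented for this example; real projects would more likely rely on a package such as tm in R or NLTK in Python.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, rows); each row holds one document's term counts."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical survey comments, used only for illustration.
docs = [
    "The service was quick and the staff was friendly.",
    "Slow service, but friendly staff.",
]
vocabulary, rows = document_term_matrix(docs)
for doc, row in zip(docs, rows):
    print(dict(zip(vocabulary, row)), "<-", doc)
```

Once the text is in this tabular shape, standard data mining techniques such as clustering or classification can be applied to the rows.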

What are different text mining tools you use in your job?

There are multiple commercial and open source tools available in the market, each with its own level of flexibility and rigidity when it comes to usage. In my text mining projects I generally use SPSS Text Analytics, R packages related to text mining (tm and its plugins, openNLP, qdap, wordcloud, wordnet, etc.), and Python (NLTK). SPSS Text Analytics has an easy-to-use interface and comes with built-in dictionaries for customer satisfaction, banking, insurance, genomics and security, which can be customized for different requirements. The Natural Language Toolkit (NLTK) provides an easy-to-use interface for Python users across different text mining tasks. Statisticians and researchers also prefer R for text mining.

Which tools should college graduates learn to get into text mining?

For users with a programming background, I recommend learning R or Python for text mining tasks. Both have good communities and support on the web. Others can use IBM SPSS Text Analytics, Clarabridge or Attensity, which are paid products. I would also recommend that college graduates keep an eye on the web to follow the latest innovations. There are a number of tools available for text mining and NLP.

What does your usual day look like when working on text mining projects?

A typical day starts with reading a few text files and understanding their structure and the metadata associated with the text data. Next comes writing regular expressions to extract the metadata and other structured information from the text, followed by extracting named entities, topics, events, relationships, sentiments and semantics. A major share of the time also goes into feature generation and selection, applying data mining or machine learning models, and creating visualizations.
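The regular expression step might look something like the following sketch. The note layout, the field names and the patterns are all assumptions made up for illustration; real data would need patterns fitted to its own format.

```python
import re

# Hypothetical contact-center note; the layout is invented for this example.
note = (
    "Ticket #48213 opened 2015-03-17 by jane.doe@example.com\n"
    "Customer reports the mobile app crashes on login."
)

# One compiled pattern per metadata field, each with a single capture group.
PATTERNS = {
    "ticket_id": re.compile(r"#(\d+)"),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)\b"),
}

def extract_metadata(text):
    """Return the first match for each field, or None when it is absent."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

metadata = extract_metadata(note)
print(metadata)
```

Keeping the patterns in one dictionary makes it easy to add or adjust fields as the structure of the files becomes clearer.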

What are the major commercial and open source tools for performing text mining?

Commercial Tools:

– IBM SPSS Text Analytics

– Clarabridge

– Attensity

– SAS Text Miner

Open Source:

– R

– Python

– GATE (General Architecture for Text Engineering)

– MeTA (Modern Text Analysis)

– OpenNLP

– Stanford NLP Libraries

I have mainly used the tools mentioned above in my career so far; however, more detailed lists are available.


How do text mining technologies map to big data?

The three Vs of big data (volume, variety and velocity) fit text data perfectly. By common estimates, more than 80% of the data in organizations is unstructured. Emails, social media, log files, contact center chats and notes, open-ended survey questions and sensors keep generating more unstructured data over time. Big data technologies give an edge in analyzing and extracting insights from all this data. Text mining capabilities can be found across big data technologies such as Hadoop, Pig, Hive, Mahout and Spark. Real-time analysis of unstructured data can be achieved using streaming applications.
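To give a feel for how text processing maps onto these platforms, here is a minimal map/reduce-style word count in plain Python. It does not use the Hadoop or Spark APIs; the `map_phase` and `reduce_phase` names and the sample log messages are assumptions for illustration, and only mimic the pattern those frameworks distribute across machines.

```python
from collections import Counter
from functools import reduce

def map_phase(message):
    """Map step: count the words within a single message."""
    return Counter(message.lower().split())

def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# Hypothetical log messages standing in for a much larger stream.
messages = [
    "login failed for user",
    "login succeeded for user",
    "payment failed",
]
totals = reduce(reduce_phase, map(map_phase, messages), Counter())
print(totals.most_common(3))
```

Because each message is counted independently and the partial counts can be merged in any order, the same logic can be spread over many workers or applied incrementally to a stream.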

How difficult is it to define ROI for text mining projects?

It is certainly not as easy as for data mining tasks, and it depends on the application. If we are working on a voice of customer project that impacts customer satisfaction, ROI can be measured in terms of the increase in customer satisfaction and how it affects company profits. If we are working on a customer churn project, ROI can be calculated from the number of customers retained using the insights and recommendations generated through text mining. However, these gains can only be quantified over a longer timescale, so patience is key when measuring results.
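For the churn case, one simple way to put a number on it is the usual ROI ratio, (gain - cost) / cost, with the gain taken as retained customers times the value of each. The sketch below assumes this formula and uses invented figures; deciding how many retained customers are really attributable to the text mining insights is the hard part and is not modeled here.

```python
def churn_project_roi(customers_retained, value_per_customer, project_cost):
    """Return ROI as a fraction: (gain - cost) / cost."""
    if project_cost <= 0:
        raise ValueError("project_cost must be positive")
    gain = customers_retained * value_per_customer
    return (gain - project_cost) / project_cost

# Hypothetical figures: 120 customers retained, each worth 500 per year,
# against a project cost of 40,000.
roi = churn_project_roi(
    customers_retained=120,
    value_per_customer=500.0,
    project_cost=40_000.0,
)
print(f"ROI: {roi:.0%}")
```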

What are major applications of text mining?

Text mining applications are widely used in marketing, CRM, product management, life sciences, insurance, media and publishing, etc. Major applications include search engines, contextual ad placement, fraud identification, voice of customer, enterprise search, sentiment/mood analysis, log analysis and human resource analytics. It is getting more popular day by day, and analytics practitioners are always looking for new avenues where text mining can be useful.

What is the typical process of a text mining project?

The typical process of a text mining project includes data collection (for example web scraping, web crawling, data pulls using OAuth APIs, databases, etc.), text preprocessing (tokenization, stemming, POS tagging, language models, topic models, tagging), annotation, semantic analysis and information extraction (entity, concept, event, relationship, etc.), followed by visualization and machine learning. Major effort also goes into deploying the application and updating it based on end users' feedback.
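A stripped-down version of the preprocessing stage could be sketched as below, chaining tokenization, stop-word removal and stemming before counting terms. The stop-word list, the sample feedback and the `crude_stem` function are invented for this example; in particular, `crude_stem` is a toy suffix stripper and not the Porter stemmer or any other established algorithm, which a library such as NLTK would provide.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "but", "is", "was", "were", "to", "of"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the three preprocessing steps in order."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

# Hypothetical customer feedback.
feedback = [
    "Delivery was delayed and the package arrived damaged.",
    "Fast delivery, but the packages were damaged.",
]
term_counts = Counter(term for doc in feedback for term in preprocess(doc))
print(term_counts.most_common())
```

Stemming lets "package" and "packages" count as the same term, which is the kind of normalization that makes the later extraction and modeling steps more reliable.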

How is text mining different from Natural language processing?

NLP focuses more on extracting meaning from text, whereas text mining is about extracting useful patterns from textual data. NLP includes techniques such as stemming, lemmatization, word sense disambiguation, POS tagging, anaphora resolution, etc. Text mining includes techniques from information retrieval, statistics and machine learning. Text mining generally uses NLP for low-level tasks such as feature creation and information extraction.

What is the future of text mining?

The future of text mining is bright, as we are generating more unstructured data now than ever before. In the future, I expect text mining to be used more in artificial intelligence based applications. Technologies like Watson, which can understand, search, hypothesize and reply, will be common. NLP will get more advanced, and we will have systems that can correct even the meaning of sentences in text. Text mining use cases will be discovered in new areas.

How can we learn more about the text mining field?

To start, read a textbook on text mining and its applications. Join communities and forums related to text mining and NLP. Participate in online competitions and, finally, to get into the nitty-gritty of unstructured data, program text mining applications yourself. Local meetups would also help. A good list of resources for NLP can be found here