I’m talking about search indexing. And more specifically I’m talking about this piece of software accessing a file in the background to read it when it could in fact be a virus posing as a legit PDF.
Let’s break this down…
EM Client can search for text within PDF attachments. So if I do a search for “July Proposal” it will return all emails that have that text in the subject, the body, or even inside PDF attachments.
However, like virtually everything with search functionality these days, it does not start scanning every email for that content at the moment I hit enter, because that would take too long to produce results. Instead it uses indexing and caching. In essence, this means going out in the background (before any search is made) to scan every email and store the information in an organized relational database for easy keyword lookup later.
So rather than going to each email and checking whether it contains the words “July Proposal”, it references the cached index that it created previously in the background. It looks up “July Proposal” in that database and gets back a list of IDs for the specific emails/attachments that contain those words.
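To make the index idea concrete, here is a minimal sketch of that kind of lookup in Python. It is not how EM Client actually does it (I have no idea what their code looks like); the function names and sample emails are made up for illustration, and it matches emails containing all the query words rather than the exact phrase.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    # Background step: map each word to the set of IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    # Search step: intersect the ID sets for every word in the query.
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical mailbox: IDs mapped to already-extracted text.
documents = {
    "email-1": "Please find the July proposal attached",
    "email-2": "Lunch on Friday?",
    "email-3/attachment.pdf": "Proposal draft for July budget",
}
index = build_index(documents)
matches = search(index, "July Proposal")
```

The expensive part is `build_index`, which has to read every email and attachment up front. `search` never touches the original files at all.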
Now, in order to create that index it needs to read all the emails and attachments ahead of time. So it scans through the emails. If it sees an attachment, it then decides how to process it. And this is where the security risk is, depending on how it’s coded and handled.
It’s probably going to look at the attachment file type to see if it’s a “supported” attachment that can be searched. (Do they only support searching PDF, or others?) If it has the right extension, the program opens the file. And when I say it opens the file, I don’t mean it opens it in Adobe Reader or something. I mean it accesses the file and reads its contents as a binary file in the background. PDFs don’t store their content as plain text that can be read like a traditional text file. So the application reads that binary and processes what it’s reading. A PDF can contain a LOT of stuff: text, images, forms, macros, security, signatures, etc. So the software needs to work out what it’s reading in that file, and decide what is text and should be indexed and what is not and should be ignored.
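As a rough illustration of why the extension alone tells you very little, here is a sketch that compares it against the first bytes of the file. A real PDF starts with the marker `%PDF-`, while the file name is whatever the sender chose. The function names and sample bytes are hypothetical, and this is only a cheap first filter: a file that passes can still be malicious, and some PDF readers tolerate junk before the marker.

```python
import os

PDF_MAGIC = b"%PDF-"  # marker at the start of a well-formed PDF file

def has_pdf_extension(filename: str) -> bool:
    # The extension comes from the sender, so it proves nothing by itself.
    return os.path.splitext(filename)[1].lower() == ".pdf"

def looks_like_pdf(data: bytes) -> bool:
    # Check the actual leading bytes instead of trusting the name.
    return data.startswith(PDF_MAGIC)

def should_attempt_indexing(filename: str, data: bytes) -> bool:
    # Only hand the file to a PDF text extractor if both checks agree.
    return has_pdf_extension(filename) and looks_like_pdf(data)

# Made-up sample contents for illustration.
real_pdf = b"%PDF-1.7\n...rest of file..."
fake_pdf = b"MZ\x90\x00...executable bytes renamed to .pdf..."

ok_real = should_attempt_indexing("proposal.pdf", real_pdf)
ok_fake = should_attempt_indexing("proposal.pdf", fake_pdf)
```

Even with a check like this in place, everything after the header is still untrusted input that the indexer has to parse.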
All well and good, HOWEVER. It poses a big security risk if the PDF file it’s scanning for text isn’t a PDF at all and has some code in it designed to exploit vulnerabilities in applications reading or indexing the file. For example, maybe EM Client is reading the binary of that PDF and thinks it’s reading text, but there is a buffer overflow or an escape character that allows other code in the fake PDF to be executed. Or maybe the text is actually some JavaScript that seems harmless to index, but there is a flaw in EM Client where JavaScript from a search result gets rendered when displayed. Who knows? There are tons of exploits like this, and people who spend a lot of time trying to find and use them.
All of which could be avoided if it were possible to control which attachments are being read, i.e. to explicitly tell the program NOT to index attachments unless they are on emails in a certain folder. Even a “whitelist” based on the sender isn’t safe, because anyone can make an email look like it’s coming from anywhere, with no coding or skill required.
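The kind of control I mean could be as simple as the sketch below. This is a hypothetical setting, not something EM Client offers as far as I know; the `Email` class, the folder names and the function name are all invented for illustration. The sender is deliberately ignored because it can be spoofed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Email:
    folder: str
    sender: str
    attachments: tuple = ()

# Hypothetical user setting: only these folders get attachment indexing.
INDEXABLE_FOLDERS = frozenset({"Clients", "Projects"})

def should_index_attachments(email: Email,
                             allowed_folders=INDEXABLE_FOLDERS) -> bool:
    # Decide purely on the folder the user filed the email into.
    # The sender field is never consulted, since it is trivial to forge.
    return bool(email.attachments) and email.folder in allowed_folders

junk = Email("Junk", "boss@example.com", ("invoice.pdf",))
filed = Email("Clients", "client@example.com", ("proposal.pdf",))
```

With a rule like this, nothing in Junk or the Inbox would be opened by the indexer until I had moved the email somewhere I trust.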
Anyways, I just tested it. And yes, it does open and cache attachments in the background when you view an email, regardless of what is set in the search filter. So if you click to preview an email in the junk folder before deleting it and it’s got an attachment, EM Client goes ahead and reads the attachment for its cache. So… pretty insecure.