real AI (artificial intelligence) for everyday business is regexp, but not C++ or LISP especially (updated)

Submitted by olecom
on June 16, 2010 - 10:53am

regexp with `sed` are an easiest way to simplify manual analysis in bookkeeping, when plain text can be available from the ugly so called "software", and PCs are a form of typewriters.

After studying and process of making Ph.D. (undone), I already two years am sitting in my native village in The Republic of Belarus, I care after my grandma, she is immobilized after strokes and final brain infarction (food feeding & sh1t clearing every day, ya know). Small family (me and mum), small income and our hospices for such patients are hell.

Facing real-life tasks that people (using their brain) do on PC, I see how employees cannot do simple analysis of a data in a table form. This is because programmers do and sell dumb FoxPro crap, which forces people input data in inconvenient ways (database forms, no even idea of external import) and do primitive preprogrammed things. The only output is happened to be a printer and paper (so called reports).

I managed to optimize analysis by very quick and simple (for me) way with some tools [1]. New software application, which is reversed to that, what is used, was done using same tools and a bit of ms office visual basic (mainly to convert excel sheets to plain text).

[1] cygwin tools:

cygiconv-2.dll cygintl-3.dll cygintl-8.dll cygncurses-8.dll cygreadline6.dll cygwin1.dll msys-1.0.dll
mkdir.exe rm.exe sed.exe sh.exe

The first and easiest way is to teach employees to analyze data in ms excel (opening of plain files, AutoFilter, PivotTable) and force programmers to implement printing reports in plain text files (export), once data have gotten in some kind of database software. Once imported, data can be analyzed in many different ways without programmers and support.

Other way, and reversed way, is to have employees doing their reports in usual (the most natural) way: ms word or ms excel. Then to have a kind of backend script, which will collect, convert, pre-analyze and export data to ms excel for final stages of analysis.

Why this is the most natural way? Because daily report (document file) can be copied (as a template) and changed for new dates or sets of items by the way people usually do it on PC- typewrites. There is no need in "forms" and databases. Just a template, directory structure filled with files, external tools and final xls to finish the job.

Example #1: making FoxPro to "export" data in plain text (programmers have no idea of that)

--- 1.fxp	2010-06-16 17:57:53.921875000 +0300
+++ 2.fxp	2010-06-16 17:58:03.750000000 +0300
@@ -1,5 +1,7 @@
+   _ASCIICOLS=1024
+   _ASCIIROWS=9999
    FOR i=1 TO setprn(3)
        SELECT curnakl
        GO TOP
-       Report Form repprichod NOCONSOLE TO PRINTER RANGE page_beg,page_end
+       Report Form repprichod TO FILE 'd:\report.txt' ASCII RANGE page_beg,page_end
    ENDFOR

http://pastebin.org/336342
Result: 'd:\report.txt' is opened in ms excel and analyzed (table edit, f(x), AutoFilter, PivotTable)) by any clerk (see picture and next examples below)


Example #2: making chart reports from telephone company plain text information
Result: instead of nothing, boss has an easy from of information, about how cable telephones are used and who is calling to mobile phones (expensive calls, btw it’s useless, because there are mobile phones to call from). Otherwise all this info was printed on kilometers of paper sheets, sent to us and was just left in bookkeeper’s archives for "better times". Once made, same xls sheets with charts can be copied with small change in data source
Here’s April (Апрель) report:


Here’s summary on overall calls and calls to mobiles from November to March:

Here’s info from the telephone company, which we (finally) get by the e-mail. Bold lines are those with overall sum per phone number and mobile calls, which must be summed.

 2010-04-апрель.txt
Почтовый курьер информационной системы PhoneStat (C), Версия 2.0
Copyright (c) 1995-2004, Ruslan Shevelinski, BELPAK

================================================================================
ПЛАТЕЛЬЩИК: Л/С Э2223416, УО УЧ.ЦЕНТР КОМИТЕТА ПО С/Х И ПРОДОВОЛЬСТВИЮ БР.ОБЛИСП
 ПЕРИОД ОКАЗАНИЯ УСЛУГ: С 01/04/2010 ПО 30/04/2010, ВИД НАЧИСЛЕНИЙ: БЕЗНАЛИЧНЫЙ
================================================================================
    Дата Услуга                                               Ед.          Сумма
ТЕЛЕФОН 236917:
01/04/10 Абонплата                                             30,1      3600,00
01/04/10 Дополнительные услуги                                 30,2      1858,00
01/04/10 Дополнительные устройства                             30,1       476,00
01/04/10 Повременная плата                                     1362     19476,60
ИТОГО ПО ТЕЛЕФОНУ 236917:                                    4;         25410,60
ТЕЛЕФОН 237046:
01/04/10 Абонплата                                             30,1      3600,00
01/04/10 Дополнительные услуги                                   30       929,00
17/04/10 междугород, Каменец, Брестская обл.    67322, 14:30      3        82,22
08/04/10 междугород, Жабинка, Брестская обл.    27427, 09:09      2        54,81
08/04/10 междугород, Береза, Брестская обл., РБ 23042, 09:21      1        27,41
08/04/10 междугород, Береза, Брестская обл., РБ 23042, 14:53      3        82,22
08/04/10 междугород, Дрогичин, Брестская обл.   59432, 09:51      1        27,41
09/04/10 междугород, г. Минск, РБ             2789438, 09:35      3       163,80
01/04/10 Повременная плата                                      298      4261,40
08/04/10 расшифровка (электронная почта)                          5         5,80
17/04/10 расшифровка (электронная почта)                          1         1,16
ИТОГО ПО ТЕЛЕФОНУ 237046:                                   11;          9235,23
ТЕЛЕФОН 945381:
01/04/10 Абонплата                                             30,1      3600,00
17/04/10 СИС Брест, 109 Брест, Брестская обл., РБ 109, 09:06      1       152,12
09/04/10 междугород, Барановичи                435234, 09:34      1        27,41
22/04/10 междугород, Барановичи                446397, 08:28      1        27,41
16/04/10 междугород, Барановичи                490412, 12:10      3        82,22
07/04/10 междугород, Жабинка, Брестская обл.    28666, 10:19      3        82,22
20/04/10 междугород, Жабинка, Брестская обл.    36132, 15:24      1        27,41
27/04/10 междугород, Береза, Брестская обл., РБ 26207, 11:06      7       191,84
16/04/10 междугород, Ивацевичи, Брестская обл.  31362, 10:01      4       109,62
14/04/10 междугород, Иваново, Брестская обл.    25659, 12:07      1        27,41
15/04/10 междугород, Пинск, Брестская обл., РБ 324412, 16:25      1        27,41
05/04/10 междугород, Пинск, Брестская обл., РБ 342520, 10:58      1        27,41
05/04/10 междугород, Пинск, Брестская обл., РБ 354803, 10:56      1        27,41
15/04/10 междугород, Столин, Брестская обл., РБ 29064, 16:28      4       109,62
30/04/10 междугород, г. Минск, РБ             2751702, 08:32      4       218,40
15/04/10 междугород, г. Могилев, РБ            396120, 09:53      1        54,60
08/04/10 сотовая св, МЦС, РБ                  1345061, 10:27      2       480,90
26/04/10 сотовая св, МЦС, РБ                  3038834, 12:48      2       480,90
14/04/10 сотовая св, МЦС, РБ                  3408117, 15:29      1       240,45
15/04/10 сотовая св, МЦС, РБ                  3408117, 12:41      2       480,90
17/04/10 сотовая св, МЦС, РБ                  3501233, 11:34      7      1683,15
14/04/10 сотовая св, МЦС, РБ                  6675023, 15:30      2       480,90
01/04/10 сотовая св, МЦС, РБ                  6980861, 08:48      2       480,90
01/04/10 сотовая св, МЦС, РБ                  9276668, 10:08      1       240,45
09/04/10 сотовая св, Мобильные ТелеСистемы    2882013, 15:39      1       223,65
01/04/10 Повременная плата                                      242      3460,60
01/04/10 расшифровка (электронная почта)                         15        17,40
16/04/10 расшифровка (электронная почта)                          9        10,44
ИТОГО ПО ТЕЛЕФОНУ 945381:                                   28;         13073,15

Script, which does the job is using tools [1]:

analyse_.sh
#!/bin/sh
for i in 20*.txt
do
echo "Анализируем $i"
bin/sed '/ИТОГО ПО Т/p;d' <"$i">"number-$i"
exec 3>"number-mobile-$i"
bin/sed '/^ТЕЛЕФОН/,/^ИТОГО ПО Т/{
/^ТЕЛЕФОН/p;
/сотовая св/p;
/^ИТОГО ПО Т/p;
};d' <"$i" | while read L
do case "$L" in
ТЕЛЕФОН*) S=0;N=${L#* };;
ИТОГО*)   printf "$N =$S\r\n";;
*)S="$S+${L##* }";;
esac
done >&3
exec 3>&-
done
echo "Файлы готовы для импорта в Excel [прочитать и нажать Enter]"
read i

http://pastebin.org/338550
In its turn, to make ms windows users happy, it is started by:

_start.bat
bin\sed 's \r  g' anatxt_.sh >anatxt.sh
bin\sh anatxt.sh

http://pastebin.org/336418
[2] Line with `sed` is needed, when changes in script are done with a windoze text editor, which saves file with CR LF line endings, and this will not work in UNIX-stylish shell.

Here we get output files like following, which are opened in ms excel opening wizard. To strip columns I could use `sed`. But I simply use ms excel and mouse for this and creating of charts.

number-2010-04-апрель.txt
ИТОГО ПО ТЕЛЕФОНУ 236917:                                    4;         25410,60
ИТОГО ПО ТЕЛЕФОНУ 237046:                                   11;          9235,23
ИТОГО ПО ТЕЛЕФОНУ 237117:                                   22;         10936,05
ИТОГО ПО ТЕЛЕФОНУ 238062:                                   16;         13163,06
ИТОГО ПО ТЕЛЕФОНУ 945239:                                    4;          8243,48
ИТОГО ПО ТЕЛЕФОНУ 945248:                                   44;         19655,05
ИТОГО ПО ТЕЛЕФОНУ 945370:                                    1;          3600,00
ИТОГО ПО ТЕЛЕФОНУ 945381:                                   28;         13073,15
ИТОГО ПО ТЕЛЕФОНУ 945405:                                   64;         32433,93
ИТОГО ПО ТЕЛЕФОНУ 945420:                                   49;         21177,40
ИТОГО ПО ТЕЛЕФОНУ 945464:                                   13;          5074,59
ИТОГО ПО ТЕЛЕФОНУ 945545:                                   68;         21395,76
ИТОГО ПО ТЕЛЕФОНУ 945797:                                    3;         25122,40
ИТОГО ПО ТЕЛЕФОНУ 945995:                                   11;          4683,97
ИТОГО ПО ТЕЛЕФОНУ 975657:                                   17;         11847,85
ИТОГО ПО ТЕЛЕФОНУ 975661:                                    8;          6389,03
ИТОГО ПО ТЕЛЕФОНУ 975665:                                    2;          4529,00
ИТОГО ПО ТЕЛЕФОНУ 975786:                                    1;          3600,00
number-mobile-2010-04-апрель.txt
236917: =0
237046: =0
237117: =0
238062: =0
945239: =0
945248: =0
945370: =0
945381: =0+480,90+480,90+240,45+480,90+1683,15+480,90+480,90+240,45+223,65
945405: =0+240,45+273,00+223,65+624,75+223,65+223,65+447,30+670,95+124,95+124,95+124,95+223,65+223,65+223,65
945420: =0+240,45+240,45+480,90+447,30+447,30+223,65+447,30+223,65+1118,25+223,65
945464: =0
945545: =0
945797: =0
945995: =0
975657: =0
975661: =0
975665: =0
975786: =0

Example #3: dining room bookkeeper has daily menu food items (weight and price) in directories, which format is [date-#of_food_helpings]. These papers are signed by boss. Files are used as helpers to print stuff on paper, i.e. PC is a typewriter.
Result: bookkeeper is taught what information source is, how to make a task for a programmer, bits of ms excel’s AutoFilter and PivotTable features. After this, all s/he is doing is mouse moving, instead of manual calculation by a calculator of all numbers from the printed papers (month report). This stupid work with a lot of potential errors, recalculations and so on, would take days of really hard brain work.

$ ls
01.05.2010-43/ 03.05.2010-42/ 04.05.2010-50/
05.05.2010-41/ 06.05.2010-37/ 07.05.2010-48/
08.05.2010-45/ 10.05.2010-46/ 11.05.2010-41/ ...

OK, we have 43 portions of some meat, chicken, rice, sauce, etc.

$ ls 01.05.2010-43/
КК мясо отварное.xls
КК мясо отварное.xls.txt
Кисельxls.xls
Кисельxls.xls.txt
Птица , тушеная в соусе с овощами.xls
Птица , тушеная в соусе с овощами.xls.txt
Рис отварной.xls
Рис отварной.xls.txt
Соус красный основной .xls
Соус красный основной .xls.txt
Суп карт. с мак.изделиями.xls
Суп карт. с мак.изделиями.xls.txt

Picture of a boiled rice card file:


Problem is to count all this stuff in the end of the month and calculate remaining. TXT files with plain text information were made by these scripts in ms excel (as it was got by "File\Export File…" menu):

Attribute VB_Name = "Module1"
Sub Конвертированить_xls_в_txt()
Attribute Конвертированить_xls_в_txt.VB_Description = "Макрос записан 01.05.2010 (Пользователь)"
Attribute Конвертированить_xls_в_txt.VB_ProcData.VB_Invoke_Func = " \n14"
Set fs = CreateObject("Scripting.FileSystemObject")

Cells.Select
Selection.Delete Shift:=xlUp
Range("A1").Select

ChDrive (Left(ActiveWorkbook.Path, 1))
ChDir (ActiveWorkbook.Path)
If ActiveWorkbook.Path <> CurDir Then
    MsgBox "Не работает смена директорий в макросе. Конец."
    Exit Sub
End If
ChDir ("..")
Cells(1, 1) = "Рабочая директория:"
Cells(1, 2) = CurDir & "\"

i = 1
e = Dir("", vbDirectory)
Do While e <> ""
    If e <> "__Анализ" And e <> "." And e <> ".." And (GetAttr(e) And vbDirectory) = vbDirectory Then
        i = i + 1
        Cells(i, 1) = e
    End If
    e = Dir
Loop
Do While i <> 1
    ChDir (Cells(1, 2) & Cells(i, 1))
    e = Dir("*.xls")
    j = 1
    Do While e <> ""
        j = j + 1
        Cells(i, j) = e
        Workbooks.Open Filename:=e
        e = e + ".txt"
        On Error Resume Next
        fs.DeleteFile e, True
        ActiveWorkbook.SaveAs Filename:=e, FileFormat:=xlText, CreateBackup:=False
        ActiveWorkbook.Close (True)
        e = Dir
    Loop
    i = i - 1
Loop
ChDir (Cells(1, 2))
End Sub

http://pastebin.org/336396
This one is called by the following script with simple yes/no dialog on xls open. If source xls card files were not modified, there is no need in processing, which takes some time, because of its global convert-all form (yes, I could use `stat` or `test -nt` to find updates :).

VERSION 1.0 CLASS
BEGIN
  MultiUse = -1  'True
END
Attribute VB_Name = "ЭтаКнига"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = True
Attribute VB_Exposed = True
Private Sub Workbook_Open()
r = MsgBox("Конвертировать/обновлять данные из XLS в TXT для дальнейшего анализа?", vbYesNo + vbDefaultButton2)
If r = vbYes Then
    Конвертированить_xls_в_txt
End If
End Sub

http://pastebin.org/336397
Here’s picture of plain text rice file:


Looks not that nice as in ms excel, but this is our information. To couple the look, here is script, which actually does AI. IMNSHO, this is very good coding example of `sh` (error handling) and `sed`(regexp) programming. This script also starts ms excel to convert xls to txt.

anatxt_.sh
#!/bin/sh

trap 'echo "
Неожиданная ошибка. (Прочитать сообщения и нажать ВВОД)" >&2;read i;' 0
_exit(){
echo "$1
[Всё. Нажми ВВОД]" >&2
read i
trap '' 0
exit 
}

PATH="$PWD/bin/:$PATH"
set -e
case `echo *xls` in
*.xls);;
*) echo "
Где в текущей директории конвертер XLS в TXT файлы?"
exit 1;;
esac

echo "ВНИМАНИЕ!
1) После нажатия ENTER будет запущен EXCEL, макросы НЕ отключать.
   На предложение отвечать [ДА], если данные менялись.
   Проследить за тем, какие файлы были обработаны.
   Закрыть EXCEL"
read i
cmd.exe /c "convXLS_to_TXT.xls"
echo "2) Нажмате ENTER и дождитесь появления файла _BCE.xls"
read i
cd ..

{
printf "дата\tназвание+цена\tкол-во гр.\tсумма\tпорции\tвсего гр.\tвсего руб.\r\n"
for d in [!_]*[!t][!x][!t]
do cd "$d" ; c=${d%% *} ; c=${c#??.??.????} ; c=${c#-}
	SUBs=''
	items(){
	sed -n '
/^[[:blank:]]*Наимен/{
:_b
n
/^[[:blank:]]*Итого/q
/ до /{
h
x
s/до .*//
x
}
/^[[:blank:]]\{4,\}/{
s/^[[:blank:]]*//
x
G
h
s/\n.*$//
x
s/\n//g
}
# преобразуем точки в запятые из-за и для тупого Excel
s/\(\t[[:digit:]]\{1,\}\)[.]\([[:digit:]]\{1,\}\)/\1,\2/g
#                 наимен   ед.изм    кол-во      цена      сумма        дата       назв+цена  кол-во сумма порции
s/^[[:blank:]]*\([^\t]*\)\t[^\t]*\t\([^\t]*\)\t\([^\t]*\)\t\([^\t]*\).*$/'"${d%%[ -]*}"'\t\1+\3\t\2\t\4\t'"$c"'/p
b_b
}
'	
}
	set -- '' "Мясо отварное" "Соус красный основной" "Фрикадельки мясные"
	for f in *"мясо отварное"*txt *"Соус красный основной"*txt *"Фрикадельки мясные"*txt
	do f=${f%%[*]*} ; shift 1 ; i=$1
	[ "$f" ] && SUBs=$SUBs"

/$i/"`items <"$f" | sed '1s|^|c\\\\\\n|;s/[\\r]*$/\\t'"$i"'\\\\/'`
	done
	
	for f in *txt
	do case $f in *"мясо отварное"* | *"Соус красный основной"* | *"Фрикадельки мясные"*)
		echo "Пропускаем '$f'" >&2
		continue;;
		esac
		echo "Обработка файла: $f" >&2
		items <"$f" | sed "$SUBs" | sed 's|\\$||'
		sed -n '/Наценка/{s|\t\{1,\}|\t|;s|^|'"${d%%[ -]*}\t$c"'\t|;s|$|\t'"${f%????}"'|;p;q}' <"$f" >&7
	done
	# printf "${d%%[ -]*}\tВсего:\tВБИТЬ СУММУ\tВБИТЬ СУММУ\t${c:-??}\tпроций\n"
	cd ..
done 
} >"_BCE.xls" 7>"_HaceHKA.txt"
echo  "Файл $PWD/_BCE.xls готов...(ждите)"
read -t 7 c || :
cmd.exe /c '_BCE.xls' || echo "Откройте $PWD/_BCE.xls сами."
_exit 'Конец.'

http://pastebin.org/336415
Same tools [1] and same starting (see [2]):

_start.bat
bin\sed 's \r  g' anatxt_.sh >anatxt.sh
bin\sh anatxt.sh

http://pastebin.org/336418

OK, results. B, which is "food_name+its_price", is our key data -- i.e. it is received, consumed in cooking and finally remains. Note F: it shows that complex food item (sauce in this case) was split on ingredients. This was done in script, see items().

Now, mouse moving:
1) AutoFilter and removing of zero D (not used) food items. This could be done in script, but I also have to teach bookkeeper where information is, where noise is, and how to deal with all this.


2) pivot table (using wizard): automatic daily and overall report of quantity and price of food consumption. Ta-daa!!!

FoxPro dudes said, this will take a half a year to do this logic&anal&stuff. I don’t know about price of this final happiness, though. It took me and bookkeeper two days partially.

For my village in the Republic of Belarus, this is a Work.

Plus for me -- now I have dinners for free :), because cook must not to create reports of what was consumed. This info is getting for free from bookkeeper’s cards. And I’m glad, that she just prepares appetizing meal.

Another thing. Note bold code in the anatxt_.sh. This small and really quick hack gets extra charge in price, i.e. what cook with assistant earned for their month of work. Another bookkeeper could easily make this table shown below from the "_HaceHKA.txt" file and to calculate the sum (no paper magic again):

Next step will be using OCR (optical character recognition) + manual checking (until some time of the electronic documents in future) we get all food info from the transportation documents (original names, numbers). Mobile telephones with Bluetooth, where electronic docs can be brought to us, are good start IMHO (flash-stick drives are usually with various kinds of autorun/EXE viruses, e-mail with all that SPAM and SPAM fighting crap is too high technology).