Groph is my new personal project to study big data analysis.
Data set:
Tencent QQ group leak data from around 2012.
Tech Stack:
- SQL Server 2008 R2, to restore the original data set.
- Neo4j, a graph database to store the data and relationships.
- Python 3, with these libraries:
  - pipenv: to set up the virtualenv
  - pymssql: Python library to connect to SQL Server
Day 0
- install SQL Server
- WIP: import data into SQL Server
- import query:
sp_attach_single_file_db @dbname='GroupData5_Data', @physname='[path to your data set folder]\GroupData5_Data.MDF'
Day 1, 2017-03-30
- init git repo on GitHub
- set up a Python virtualenv using pipenv
- connect to SQL Server from a Python script
- need to configure SQL Server to enable TCP/IP connections first (see the docs). Run
C:\Windows\SysWOW64\SQLServerManager10.msc
to open the Configuration Manager if you can't find it.
- in the Python script:
conn = pymssql.connect(server='SX-DEV', database="GroupData1_Data")
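Once connected, a quick sanity-check query looks roughly like this (a sketch only; 'SomeTable' is a placeholder, not a real table from the dump):
import pymssql

conn = pymssql.connect(server='SX-DEV', database='GroupData1_Data')
cursor = conn.cursor()
cursor.execute('SELECT TOP 10 * FROM SomeTable')  # peek at a few rows
for row in cursor.fetchall():
    print(row)
conn.close()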
Day 2, 2017-04-05
- init Django in the project
- set up local Neo4j using Docker
- install Python 3.6
- use Python 3.6 with pipenv; on Windows:
pipenv install --python=E:\python36\python.exe
- well, pymssql doesn't support Python 3.6 yet, so I'll still need to use Python 3.5
- create a Django command to port the data.
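Roughly, the skeleton of such a Django management command (the app layout, command name, and body here are placeholders, not the actual code):
# groph/management/commands/port_data.py  (hypothetical app and command name)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Port rows from SQL Server into Neo4j'

    def handle(self, *args, **options):
        # the real porting logic goes here: read rows via pymssql, write nodes via py2neo
        self.stdout.write('porting data...')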
Day 3, 2017-04-07
- set up dotenv
Day 4, 2017-04-08
- set up the Neo4j connector, py2neo
- add methods to create group nodes
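A minimal py2neo sketch of such a method (the connection details, the Group label, and the property names are assumptions, and it presumes py2neo v4+):
from py2neo import Graph, Node

graph = Graph('bolt://localhost:7687', auth=('neo4j', 'secret'))

def add_group(group_num, title=None):
    # 'Group' label and 'num'/'title' properties are placeholders for the real schema
    node = Node('Group', num=group_num, title=title)
    graph.merge(node, 'Group', 'num')  # merge so re-running doesn't duplicate groups
    return node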
Day 5, 2017-04-12, I got engaged today!
- optimize the port command to handle exceptions; node creation should resume from where it stopped.
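One simple way to do the resume (a sketch only, assuming rows come in ascending ID order and a checkpoint file tracks progress):
import os

CHECKPOINT = 'port_checkpoint.txt'  # hypothetical checkpoint file name

def load_checkpoint():
    if not os.path.exists(CHECKPOINT):
        return 0
    with open(CHECKPOINT) as f:
        return int(f.read().strip() or 0)

def port_rows(rows, create_node):
    # rows: iterable of (row_id, row); create_node: callable that writes one node to Neo4j
    last_done = load_checkpoint()
    for row_id, row in rows:
        if row_id <= last_done:
            continue  # already imported in an earlier run
        create_node(row)
        with open(CHECKPOINT, 'w') as f:
            f.write(str(row_id))  # if an exception kills the run, the next run resumes from here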
Day 6, 2019-04-06. Yes, two years later I'm back on this project again.
- discarded the old Django code base
- created a new repo using NestJS
- migrated the database importer code
- installed SQL Server 2008 again!
Day 7, 2019-04-07
- create a Neo4j VM on Google Cloud
- refactored the Python importer code. Tried to run the importer, but it's way too slow, about 7 nodes/second.
Day 8, 2019-04-08
- Rewrote the importer in Node.js so I can easily import all the databases at the same time. It was super fast at the beginning, but slowed down after a few minutes. The reason turned out to be CPU usage above 100% on the Neo4j server. I tried to increase the server's resources, but it didn't help.
Day 9, 2019-04-09
- Running the import programmatically doesn't seem like it will work. I gotta find another way. Then I discovered the neo4j-admin import tool, which might help.
- However, that import tool only works with CSV files, which means I have to export the original data sitting inside SQL Server to CSV files. After some research, I finally got the export working programmatically in bash with the bcp command from SQL Server:
bcp ${table} out ${file}.csv -S AXE-PC -w -T -d ${db} -o ${file}.log
- Wrote a bash script to run neo4j-admin import with those generated CSV files.
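The actual script was bash; as a rough illustration of the same idea in Python (assuming Neo4j 3.x's neo4j-admin import options and the usual :ID / :LABEL / :START_ID / :END_ID / :TYPE CSV header convention; the file names are made up):
import subprocess

node_csvs = ['people.csv', 'groups.csv']  # node files: header needs an :ID and a :LABEL column
rel_csvs = ['in_group.csv']               # relationship files: header needs :START_ID, :END_ID, :TYPE

cmd = ['neo4j-admin', 'import', '--database=graph.db']
cmd += ['--nodes=' + path for path in node_csvs]
cmd += ['--relationships=' + path for path in rel_csvs]
subprocess.run(cmd, check=True)  # run on the Neo4j host, against a fresh (offline) database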
Day 10, 2019-04-10
- Created a new disk in GCP, attached it to the VM, then uploaded all the CSV files to that disk.
- Ran the import script with all the uploaded CSV files. It's still running while I'm writing this; I don't know how long it will take.
Day 11, 2019-04-15
The import has taken 2 days already and doesn't look like it will finish any time soon. Something must be wrong with the neo4j-admin import command or my data.
After some research into the issue, I found the reason: the import command does not work well with duplicate IDs. I have to de-dup all my data, and that's not easy, since the group and people data are spread across different databases and different tables.
My solution is to write a bash script that generates a SQL query to union all the tables, then create a database view of it. Luckily the bcp command supports exporting from a query; here is the actual command I use:
bcp "select * from people" queryout people.csv -S AXE-PC -w -T
Note that in this command, instead of out, I gotta use queryout.
After generating and uploading all the CSV files, I re-ran the import command; this time it only took about 2 hours to import all the data.
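The generator itself was a bash script; the idea, sketched in Python with invented database/table/column names, is roughly:
# Build one query that unions the people table from every GroupDataN_Data database.
# UNION (without ALL) drops fully duplicate rows across databases.
databases = ['GroupData{}_Data'.format(i) for i in range(1, 12)]  # however many databases the dump has

selects = [
    "SELECT QQNum, Nickname FROM [{}].[dbo].[People]".format(db)  # hypothetical table and columns
    for db in databases
]
print('\nUNION\n'.join(selects))  # wrap this in a view, then export it with bcp ... queryout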
Day 12, 2019-04-16
Due to a stupid mistake I made when writing the query generator bash script, all the people IDs are wrong. I have to repeat the import process all over again.
Day 13, 2019-04-17
The data is finally in there, but when I checked it in the Neo4j UI with some sample data, I found that some in_group relationships are missing. Then I realized that the group info data is not complete, yet there are lots of relationships trying to connect to those missing group nodes. So I need to extract all the unknown groups from the relationships. However, there is a problem: most of the groups found in the relationships duplicate ones already in the groups table, which means I can't simply combine the two and throw them at the import command. I also can't use a union clause here, since union de-duplicates entire rows, not a single column.
One way to solve this is to cat the two files into one, then sort it with the unique option. However, this isn't doable on Windows, since the Windows sort command doesn't recognize UTF-16LE encoding. So I gotta zip the files, upload them to the server, and do the sorting there. Hopefully this will work.
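For reference, the same extraction could also be a small Python script instead of cat + sort (file names, column positions, and the tab-delimited UTF-16LE format from bcp -w are assumptions):
import csv

# Assumed layout: groups.csv has the group ID in column 0, in_group.csv has it in column 1.
with open('groups.csv', encoding='utf-16-le', newline='') as f:
    known = {row[0] for row in csv.reader(f, delimiter='\t') if row}

with open('in_group.csv', encoding='utf-16-le', newline='') as f:
    referenced = {row[1] for row in csv.reader(f, delimiter='\t') if len(row) > 1}

with open('missing_groups.csv', 'w', encoding='utf-8', newline='') as f:
    out = csv.writer(f)
    for gid in sorted(referenced - known):
        out.writerow([gid])  # bare node rows for the groups we only know from relationships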
Day 14, 2019-04-20
Sadly, that did not work. After some searching, I found that when exporting with the bcp command, I could just use the -c option instead of -w to make the output file UTF-8 encoded. That would solve everything, since I could then finally use the sort command to de-dup all the groups. The only problem is that I gotta start everything over. I guess that's not a problem at all, since this project has already been going on for 2 years, lol.
After 2 days of data processing, uploading and importing, all the data is finally in there. Now I'm creating indexes on those nodes so I can query them much faster. Creating an index on nodes is easy:
CREATE INDEX ON :Person(QQNum)
However, creating an index is not blocking; it runs as a background task. To see the progress of the index you can run the query:
CALL db.indexes
After the indexing is done, I will make the VM as small as possible. It has already burnt 120 dollars these past few days!
Day 15, 2019-04-22
Indexing the Person nodes took way too long due to the memory limit. I had to increase the machine's memory to 30 GB. The indexing finally finished after one day. After that I downgraded the machine to a basic config (1 CPU, 3.75 GB memory). Queries seem fine and quick.
It's time to actually write some code to expose this service and do some interesting analysis.